Short reads aligner compartment

Introduction

BWA

Burrows-Wheeler Aligner

According to the Documentation

BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. It consists of three algorithms: BWA-backtrack, BWA-SW and BWA-MEM.

  • BWA-backtrack: It is designed for Illumina sequence reads up to 100bp,
  • BWA-MEM: BWA-MEM and BWA-SW share similar features such as long-read support (70bp to 1Mbp) and split alignment. It is generally recommended for high-quality queries as it is faster and more accurate. It also has better performance than BWA-backtrack for 70-100bp Illumina reads.

QS

  • There are three algorithms, which one should I choose?
    For 70bp or longer Illumina, 454, Ion Torrent and Sanger reads, assembly contigs and BAC sequences, BWA-MEM is usually the preferred algorithm. For short sequences, BWA-backtrack may be better. BWA-SW may have better sensitivity when alignment gaps are frequent.
  • What is the tolerance of sequencing errors?
    • Bwa-back is mainly designed for sequencing error rates below 2%.
    • WA-SW and BWA-MEM both tolerate more errors given longer alignment. Simulation suggests that they may work well given 2% error for an 100bp alignment, 3% error for a 200bp, 5% for 500bp and 10% for 1000bp or longer alignment.
  • Does BWA find chimeric reads?
    Yes, both BWA-SW and BWA-MEM are able to find chimera. BWA usually reports one alignment for each read but may output two or more alignments if the read/contig is a chimera.
  • Does BWA work on reference sequences longer than 4GB in total?
    Yes. Since 0.6.x, all BWA algorithms work with a genome with total length over 4GB. However, individual chromosome should not be longer than 2GB.

PS

chimeric reads: Chimeric reads occur when one sequencing read aligns to two distinct portions of the genome with little or no overlap. Chimeric reads are indicative of structural variation. Chimeric reads are also called split reads. After aligning with bwa mem, chimeric reads will have an SA tag as described on page 7 of the SAM format specification. To find them all you have to do is extract them using grep. donfreed, 2014. In the documentation, it mentioned that bwa aln is for find the SA reads.

Bowtie2

Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. It is particularly good at aligning reads of about 50 up to 100s or 1,000s of characters, and particularly good at aligning to relatively long (e.g. mammalian) genomes. Bowtie 2 indexes the genome with an FM Index to keep its memory footprint small: for the human genome, its memory footprint is typically around 3.2 GB. Bowtie 2 supports gapped, local, and paired-end alignment modes.
What isn’t Bowtie 2?
Bowtie 2 is geared toward aligning relatively short sequencing reads to long genomes. That said, it handles arbitrarily small reference sequences (e.g. amplicons) and very long reads (i.e. upwards of 10s or 100s of kilobases), though it is slower in those settings. It is optimized for the read lengths and error modes yielded by typical Illumina sequencers.
Bowtie 2 does not support alignment of colorspace reads. (Bowtie 1 does.) Documentation

Bowtie2 could awareness splicing because it cut reads into even shorter before doing alignment.

PS

What is colorspace reads:

  • Short answer: color space refers to the native format of ABI SoLID technology. Color space is translated to nucleotide, or base space (same thing) so that it can be understood.
  • That technology is not growing in market share, so in the next few years it will become less common. ABI is putting most of their effort behind the Ion Torrent now. swbarnes2, 2012

Test Data

>1_Perfect_match
TAATTCCCAAGATGAAGTTCCTGATCATCCTTGCCCTGGCTGTGGCCGCC
>2_Intron_insertion
TAATTCCCAAGATGAAGTTCCTGATTGCTGATCGATCGTAGCTAGCTAGCTAGCTAGCTAGCTACGTGCATCAGTCGATCAGTACGTCAGATGCTGTCGATCGTAGTCGATCGATGCTAGCTAGCTAGCTGCATAGTAGCTGCATGCTAGCTGCTAGCTCAGTAGCTCGTGCATGCATGCATCATCCTTGCCCTGGCTGTGGCCGCC
>3_Intron_insertion
TAATTCCCATGCTGATCGTGACTGCTGATCGATCGTGCTAGTCGATGCTCGTGCATGCTGCATGCTAGCTAGCTAGCTGACTGATCGTACGTCAGTGCATGCATGCTAGCTAGTAGCTAGCTAGCTAGCTCAGTCAGTCAGTCGATCGATGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGTCGATGCTAGCTAGCTAGCTCAGTCAGTAGCTCAGTCGAGCTGTGTGTGCTAGCACTACGTGTCGATGTGTCAGTAGATGAAGTTCCTGATCATCCTTGCCCTGGCTGTGGCCGCC
>4_Cross_Chrm_L
TAATTCCCAAGATGAAGTTCCTGATTGCTGATCGATCGTAGCTAGCTAGCTAGCTAGCTAGCTACGTGCATCAGTCGATCAGTACGTCAG
>5_Cross_Chrm_R
ATGCTGTCGATCGTAGTCGATCGATGCTAGCTAGCTAGCTGCATAGTAGCTGCATGCTAGCTGCTAGCTCAGTAGCTCGTGCATGCATGCATCATCCTTGCCCTGGCTGTGGCCGCC
>6_Small_deletion_3
TAATTCCCAAGATGAAGTTCCTATCCTTGCCCTGGCTGTGGCCGCC
>6_deletion_10
TAATTCCCAAGATGAAGTTCTTGCCCTGGCTGTGGCCGCC
>6_deletion_20
TAATTCCCAAGATGAAGTTCTGTGGCCGCC
>7_Samll_insertion_3
TAATTCCCAAGATGAAGTTCCTGATGATCATCCTTGCCCTGGCTGTGGCCGCC
>8_SNP
TAATTCCCAAGATGAAGTTCCTGATTATCCTTGCCCTGGCTGTGGCCGCC

Build reference DB (index)

module load trinity/2.8.5
module load bwa
module load bowtie2
module load rsem

for i in $(grep ">" Ref.fa|sed 's/>//'); do grep -A 1 $i Ref.fa > $i.fa; done

for i in $(ls *.fa);
do bwa index $i
bowtie2-build $i $i
bowtie-build $i $i
done

Aligning

for i in $(ls DB/*.fa)
do SAMPLE=$(echo $i| sed 's=DB/==');
bwa mem $i test.fq | grep -v ^@ > $SAMPLE.bwa.sam
bowtie2 -x $i -U test.fq|grep -v ^@ > $SAMPLE.bowtie2.sam
bowtie $i test.fq|grep -v ^@ > $SAMPLE.bowtie.sam
done

grep . *.sam| sed 's/.sam:/\t/'|awk -F"\t" '{print $1"\t"$4}' | awk -F"." '{print $1"\t"$NF}'> Result

Results

Format of sam file:

A00327:224:HW2JVDRXY:1:1101:1253:1000	0	1_Perfect_match	1	60	50M	*	0	0	TAATTCCCAAGATGAAGTTCCTGATCATCCTTGCCCTGGCTGTGGCCGCC	#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF	NM:i:0	MD:Z:50	AS:i:50	XS:i:0
Columns Abbr. Exp Describe
1 QNAME A003… Query (pair) NAME
2 FLAG 0 bitwise FLAG
3 RNAME 1_Perfect_match Reference sequence NAME
4 POS 1 1-based leftmost POSition/coordinate of clipped sequence
5 MAPQ 60 MAPping Quality (Phred-scaled)
6 CIAGR 50M extended CIGAR string
7 MRNM * Mate Reference sequence NaMe (‘=’ if same as RNAME)
8 MPOS 0 1-based Mate POSistion
9 ISIZE 0 Inferred insert SIZE
10 SEQ TAAT… query SEQuence on the same strand as the reference
11 QUAL FFFF… query QUALity (ASCII-33 gives the Phred base quality)
12 OPT NM:i:0 MD:Z:50 AS:i:50 XS:i:0 variable OPTional fields in the format TAG:VTYPE:VALUE
Reference bowtie2 bowtie bwa
1_Perfect_match 1_Perfect_match 1_Perfect_match 1_Perfect_match
2_Intron_insertion * NA *
3_Intron_insertion * NA 3_Intron_insertion
4_Cross_Chrm_L * NA *
5_Cross_Chrm_R * NA *
6_deletion_10 * NA *
6_deletion_20 * NA *
6_Small_deletion_3 6_Small_deletion_3 NA 6_Small_deletion_3
7_Samll_insertion_3 7_Samll_insertion_3 NA 7_Samll_insertion_3
8_SNP 8_SNP 8_SNP 8_SNP
Ref 1_Perfect_match 1_Perfect_match 1_Perfect_match

Aligning Reads with Multiple Hits Using Bowtie2

By default, Bowtie2 reports only one alignment (the best one) for each read. However, you can configure it to report multiple alignments using the -k or -a options:

  • k : Report up to alignments per read (useful for getting multiple alignments if they exist).
  • a: Report all valid alignments (this can produce a large output if there are many alignments).
    Example command to report up to 5 alignments per read:
bowtie2 -x index -U reads.fq -k 5 -S output.sam

TopHat

Unlike BWA and bowtie2, TopHat is a intron awareness aligner. It not only automatically split mapped and unmapped bam, but also given variation results like ‘junction’, ‘deletion’, and ‘insertion’ bed files. But the align performance of TopHat2 is not optimistic from some comparison.

column name of 'junction.bed':

[seqname] [start] [end] [id] [score] [strand] [thickStart] [thickEnd] [r,g,b] [block_count] [block_sizes] [block_locations]

  1. "seqname": chromosome
  2. "start": is the start position of the leftmost read that contains the junction.
  3. "end": is the end position of the rightmost read that contains the junction.
  4. "id": is the junctions id, e.g. JUNC0001
  5. "score": is the number of reads that contain the junction.
  6. "strand": is either + or -.
  7. "thickStart" and
  8. "thickEnd": don't seem to have any effect on display for a junctions track. TopHat sets them as equal to start and end respectively.
  9. "r.g.b": "r","g" and "b" are the red, green, and blue values. They affect the colour of the display.
  10. "block_count": The block_count will always be 2. The two blocks specify the regions on either side of the junction.
  11. "block_sizes": "block_sizes" tells you how large each region is.
  12. "block_locations": it tells you, relative to the "start" being 0, where the two blocks occur. Therefore, the first block_location will always be zero. from: Alex124, 2012

basic use:

tophat -G sample.gtf -p 40 -o output_dir Genome.bowtie2.index Sample.fq.gz

More infor for bed fromat: UCSC

Hisat2

hisat2 -x indexed_genome -p 40 -U sample.fq.gz | samtools view -S -b - > result.bam

Others

Ryan Musich, et al using 48 fubgi’s RNA-seq samples, Erysiphe necator, to compare the results of aligners bowtie2, bwa, hisat2, Mummer4, star, and tophat2. They found that bowtie2 and bwa have less unmapped reads. TopHat2 is the worst which has arround 80% unmapped reads. Bowtie2 and bwa have a similar positive result but bwa consume much less time.

© Ryan Musich, et al., 2021

Grzegorz M. Boratyn, et al. developed a new tool Magic-BLAST for improving the accuracy of RNA-seq reads aligning. They compared the

© Grzegorz M. Boratyn, et al.
Author

Karobben

Posted on

2022-09-25

Updated on

2024-05-19

Licensed under

Comments