Question

BWA gives wrong number of input sequences

0

Entering edit mode

7.0 years ago

agata88 ▴ 870

Hi all!

I was trying to map sequences to genome contigs. To do that I've performed commands:

bwa index ref.fasta
bwa mem ref.fasta input.fasta > aln.sam
samtools view -Sb aln.sam > aln.bam
samtools sort aln.bam > aln_sorted.bam
samtools flagstat aln_sorted.bam > aln_stats.txt

input. fasta include 3158 sequences.

Here is aln_stats.txt output:

 5284 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 secondary
    2126 + 0 supplementary
    0 + 0 duplicates
    4890 + 0 mapped (92.54% : N/A)
    0 + 0 paired in sequencing
    0 + 0 read1
    0 + 0 read2
    0 + 0 properly paired (N/A : N/A)
    0 + 0 with itself and mate mapped
    0 + 0 singletons (N/A : N/A)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

Why output includs 5284 sequences in total?

I see that 2126 is supplementary, 5284 - 2126 = 3158, which is the number of sequences in input.

Thanks in advance, Best, Agata

bwa • 2.5k views

ADD COMMENT • link updated 7.0 years ago by swbarnes2 15k • written 7.0 years ago by agata88 ▴ 870

2

Entering edit mode

bwa mem ref.fasta input.fasta > aln.sam
samtools view -Sb aln.sam > aln.bam
samtools sort aln.bam > aln_sorted.bam

Note that you could simplify this, speed things up and avoid intermediate files:

bwa mem ref.fasta input.fasta | samtools sort -o aln_sorted.bam -

(requiring that your samtools is not too old, but you should use up-to-date software anyway)

ADD REPLY • link 7.0 years ago by WouterDeCoster 48k

1

Entering edit mode

Hi Agata, these are most likely sequences mapping to more than one location. The BAM doesn't tell you the number of input sequences but the number of alignments against the reference

ADD REPLY • link 7.0 years ago by Carambakaracho ★ 3.3k

0

Entering edit mode

Hey, thanks! Do you know how can I find which sequences mapped to different contigs? Best, Agata

ADD REPLY • link 7.0 years ago by agata88 ▴ 870

0

Entering edit mode

Search Biostars for "identify multi-mapped reads". Use an external google search.

ADD REPLY • link 7.0 years ago by GenoMax 152k

score 0 · Answer 1 · 2018-08-13

samtools flagstat is not a very smart program. It's merely counting lines in your bam. If a read is in there twice, because it aligned to multiple places and all are included, samtools flagstat will count it twice. If you omit reads with the 256 binary flag, that will get rid of the secondary alignments.