Hi all!
I was trying to map sequences to genome contigs. To do that I've performed commands:
bwa index ref.fasta
bwa mem ref.fasta input.fasta > aln.sam
samtools view -Sb aln.sam > aln.bam
samtools sort aln.bam > aln_sorted.bam
samtools flagstat aln_sorted.bam > aln_stats.txt
input. fasta include 3158 sequences.
Here is aln_stats.txt output:
5284 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
2126 + 0 supplementary
0 + 0 duplicates
4890 + 0 mapped (92.54% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
Why output includs 5284 sequences in total?
I see that 2126 is supplementary, 5284 - 2126 = 3158, which is the number of sequences in input.
Thanks in advance, Best, Agata
Note that you could simplify this, speed things up and avoid intermediate files:
(requiring that your samtools is not too old, but you should use up-to-date software anyway)
Hi Agata, these are most likely sequences mapping to more than one location. The BAM doesn't tell you the number of input sequences but the number of alignments against the reference
Hey, thanks! Do you know how can I find which sequences mapped to different contigs? Best, Agata
Search Biostars for "identify multi-mapped reads". Use an external google search.