Why number of #read1 and #read2 is different in samtools flagstat output?
1
1
Entering edit mode
7.8 years ago

Dear All,

I am using BWA mem algorithm to map my Illumina reads (paired-ends) data against to my assembled genome (diatom species). Afterwards, I performed some statistics using the existing tool samtools flagstat, and here is the output:

38550041 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
1267361 + 0 supplementary
0 + 0 duplicates
38550041 + 0 mapped (100.00% : N/A)
37282680 + 0 paired in sequencing
**18642729** + 0 read1
**18639951** + 0 read2
31686162 + 0 properly paired (84.99% : N/A)
37217518 + 0 with itself and mate mapped
65162 + 0 singletons (0.17% : N/A)
5793736 + 0 with mate mapped to a different chr
3253212 + 0 with mate mapped to a different chr (mapQ>=5)

So my first thought was that maybe this disagreement between the number of #read1 and #read2 (I am pretty sure I had the same number in input data) could be caused by non-uniq mapping reads, so I decided to do a further filtering (mapQ>0, reads which map in more than one part of the genome are reported by mapQ=0) using samtools view -F 4 -q 1, here next samtools flagstat output:

34663143 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
640572 + 0 supplementary
0 + 0 duplicates
34663143 + 0 mapped (100.00% : N/A)
34022571 + 0 paired in sequencing
**17037701** + 0 read1
**16984870** + 0 read2
30710384 + 0 properly paired (90.26% : N/A)
33961505 + 0 with itself and mate mapped
61066 + 0 singletons (0.18% : N/A)
3501452 + 0 with mate mapped to a different chr
3253212 + 0 with mate mapped to a different chr (mapQ>=5)

As the disagreement remains, my second thought was that the reason why this is happening is due to chimeric alignments, so I decided to perform an extra filtering on this issue using cat sorted.sam | grep -v 'SA:' > new.sorted.sam (chimeric reads will have an SA tag as described on page 7 of the SAM format specification). Here the last samtools output:

33209606 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
33209606 + 0 mapped (100.00% : N/A)
33209606 + 0 paired in sequencing
**16610299** + 0 read1
**16599307** + 0 read2
30330354 + 0 properly paired (91.33% : N/A)
33152199 + 0 with itself and mate mapped
57407 + 0 singletons (0.17% : N/A)
3085532 + 0 with mate mapped to a different chr
2894912 + 0 with mate mapped to a different chr (mapQ>=5)

As you can see, although I got better results in pairs that mapped concordant (properly paired), I still have not the same number of #read1 and #read2... I run out off all my hypothesis on this issue so I would be really grateful if someone can guide me a little bit to understand completely what is going on here. Thank you so much in advance!

alignment genome • 5.1k views
ADD COMMENT
5
Entering edit mode
7.8 years ago

It's useful to just look at the samtools code. Secondary (multimappers) and supplementary (chimeric) alignments aren't included in the read1 and read2 metrics. What you're seeing are singletons (i.e., only one read in the pair aligned). Since bwa mem doesn't output unmapped alignments, you'll get mismatches in the read1 and read2 numbers (this isn't a problem).

ADD COMMENT
0
Entering edit mode

Hi Devon Ryan. Thank you for answering. I knew that bwa mem doesn't output unmapped alignments, that is why I also used samtools view -F4 for filtering (-F means Do not output alignments with any bits set in INT present in the FLAG field and 4 means read unmapped), that cannot be the reason why I am getting mismatches in the read1 and read2 numbers right? or Am I missing something here? Many thanks!!

ADD REPLY
0
Entering edit mode

I already wrote that the reason is singletons. Please read my entire reply.

ADD REPLY
0
Entering edit mode

Hi, sorry for misunderstanding. I was confusing the concepts unmapped read and singletons, but indeed you are right. Although I am filtering unmapped read, it makes sense that it may remain some reads that have mapped while its pair don't, being reflecting by the singletons you mentioned. Thank you so much again.

ADD REPLY

Login before adding your answer.

Traffic: 1835 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6