getting the unmapped reads from BAM file
2
1
Entering edit mode
7.8 years ago
mia ▴ 70

Hi,

I have SAM file after alignment and the samtools flagstat produces the following results:

22806328 + 4861610 in total (QC-passed reads + QC-failed reads)
154024 + 1050636 secondary
0 + 0 supplementary
0 + 0 duplicates
22634029 + 4846318 mapped (99.24% : 99.69%)
22652304 + 3810974 paired in sequencing
11326152 + 1905487 read1
11326152 + 1905487 read2
22435338 + 0 properly paired (99.04% : 0.00%)
22467696 + 3780390 with itself and mate mapped
12309 + 15292 singletons (0.05% : 0.40%)
30982 + 232010 with mate mapped to a different chr
13721 + 50746 with mate mapped to a different chr (mapQ>=5)

I then used bam2fastq to extract the unmapped reads:

bam2fastq -o D10_S1_%#_unmapped_reads.fastq -f --no-aligned --unaligned --no-filter D10_S1.bam

However the ouput reveals that only 172299 sequences were exported

27667938 sequences in the BAM file 172299 sequences exported WARNING: 12309 reads could not be matched to a mate and were not exported

This does not correlate with the flagstat results.

Thanks is advance for the help.

genome alignment bam2fastq samtools • 4.6k views
ADD COMMENT
0
Entering edit mode

Was this the output of BWA-meth? This is a very scary looking BAM file, that's for sure. If you could provide more information about the experiment/sequencing type, that could help a lot in diagnosing the best solution.

ADD REPLY
0
Entering edit mode

Hi John,

Yes it is from bwa-meth. I converted the sam file to bam. This sample is extracted from rust infected wheat. This means that it should contain both the rust and wheat genome. The above flagstat is after the sample was aligned to the rust genome. I am assuming that the 4861610 unmapped reads would aligned to wheat. I wanted to extract these unmapped reads and then align them to the wheat genome to see if it this is true.

Thanks

ADD REPLY
0
Entering edit mode

Ah, i see. I didn't know BWA-meth gave secondary alignments like that. hm.

Well, just so you are aware, the best option is probably to also map to the wheat genome and then for each read check if it aligned to rust, wheat, both, or neither, to get a proper statistic for this sort of thing. Or perhaps there are proper metagenomics tools i don't know about that do this.

Anyway, just for the hell of it, it might be a good idea to generate the same statistics as above from the original SAM file (not the BAM file converted back to SAM), to make sure the SAM to BAM conversion didn't somehow go wrong. Also the flagstats for D10_S1.bam would be interesting.

ADD REPLY
1
Entering edit mode
7.8 years ago
mia ▴ 70

John,

The above stats are from the original SAM file BWA-meth generated.

Below are the stats on the converted BAM file:

22806328 + 4861610 in total (QC-passed reads + QC-failed reads)
154024 + 1050636 secondary
0 + 0 supplementary
0 + 0 duplicates
22634029 + 4846318 mapped (99.24% : 99.69%)
22652304 + 3810974 paired in sequencing
11326152 + 1905487 read1
11326152 + 1905487 read2
22435338 + 0 properly paired (99.04% : 0.00%)
22467696 + 3780390 with itself and mate mapped
12309 + 15292 singletons (0.05% : 0.40%)
30982 + 232010 with mate mapped to a different chr
13721 + 50746 with mate mapped to a different chr (mapQ>=5)

Also, weird is that the total number of sequences for the paired-end sequences is 26463278 (13231639 FR and 13231639 REV). But when you add the passed QC and QC-failed reads it is 27480347, greater than the total number of reads.

ADD COMMENT
0
Entering edit mode

The way these numbers are laid out, is that the alignments on the left of the + have passed QC, while the alignments on the right have failed. That is also true for the "total", which is just the total for those passed/failed QC.

So it's (11326152 + 1905487) x2 == 26463278, but total is 22806328 + 4861610 == 27667938

This also means the numbers bam2fastq are giving you are OK.

22806328 - 22634029 == 172299

So it looks good to me.

ADD REPLY
0
Entering edit mode

John,

Thanks again for the explanation. However, I am still not getting the 22806328 + 4861610. Does this mean there a few of the passed QC reads that fail QC as well?

ADD REPLY
0
Entering edit mode

No no, sorry I know this is confusing, it's confusing for everybody. It means that the total number of reads you have is 22806328 (passed QC), plus 4861610 (failed QC), which gives 27667938 alignments in total.

11326152 + 1905487 is the number of paired-end reads (for forward, but also for reverse), which totals 26463278 paired-end reads that both passed and failed QC. As 27667938 - 26463278 is 172299, that means you have 172299 alignments that are not included in the paired-end counts. I don't know what they are, but they're certainly "weird".

Thus, when you run that bam2fastq thing, it tells you 172299 alignments gave a WARNING (the weird 172299), and 12309 alignments out of the 172299 weird alignments were singletons (not matched to a mate).

Bottom line, I think your fastq2bam is working, although the BAM file generated by BWA-meth does look a little weird to me. This is probably just because i don't work with the rust genome, or perhaps the rust genome itself is weird. I wish I could point you towards some sort of documentation about what all this numbers mean, but i'm afraid such a resource simply doesn't exist. The SAM/BAM specification contains how the format should be, and the BWA code contains what is actually going on, and I don't really recommend you read either as it won't make the situation any more clear :P Perhaps reading the SAM spec and understanding that these files store alignments (not reads), and what it means to be a secondary alignment, singleton, etc etc, will help clarify things. But again, it's easier to just label something "weird" and move on, else this will bog you down for weeks if not months.

ADD REPLY
0
Entering edit mode

Awww!!! Thanks John for clearing that up. I really appreciate that.

ADD REPLY

Login before adding your answer.

Traffic: 3619 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6