Question

Select a set of reads from BAM file

1

Entering edit mode

10.8 years ago

pankaj.agarwal.duke ▴ 30

Hi,

I am analyzing an rna-seq data set, which consists of matched tumor/normal samples from two patients. Alignment of the trimmed fastq files were done using bowtie2 with default parameters and iGenomes bowtie2 hg19 index. The alignement BAM files were sorted by name and coordiantes using samtools. Now I am trying to get the counts for further downstream analysis. I have tried several approaches, including htseq-count, countOverlaps, and summarizeOverlaps. Each of them gives slightly different results - i.e., the counts for the same gene and sample are different. My question is that is there a way to extract a subset of alignments from the BAM files and create a "subsetted" BAM file that can be analyzed in depth with different methods so as to get an idea why there are differences. All the methods for getting the counts are really "black boxes" for me, and I am not sure what is going on under the covers to produce different counts. I thought it would be a good idea to just try out different methods with a set of alignments that have mapped to a particular gene (or even exon) and try different methods with that subset so it is more tractable and faster to try out and compare various methods.

Thanks,
-Pankaj

alignment RNA-Seq next-gen • 5.0k views

ADD COMMENT • link updated 3.3 years ago by Ram 44k • written 10.8 years ago by pankaj.agarwal.duke ▴ 30

Ram · Answer 1 · 2014-04-21

Ah yes, getting the same counts with different methods ... the path to the dark side are they. Consume you they will.

Seriously speaking it is unexpectedly difficult to get the "exact" same counts in all circumstances. Tools have a tendency to implement tacit assumptions of what they accept as a valid read/alignment/overlap.

What you can do is slice the bam file into a subset with:

samtools -b input.bam chrom:100-200 > output.bam

Then investigate this smaller file at will.

I will share one startling observation, the command above may also output reads that are unmapped! We once spend a half a day debugging that. Yes really, depending on the aligner there may be reads that do not align yet are listed when querying for region 100-200. That's because the samtool query only looks at the POS column and ignores the FLAG ... The More You Know **

So even this above may give you different results when filtering for mapped reads only.

Ram · Answer 2 · 2014-04-21

You can actually find the reasons for most of the differences in the documentation if you know where to look. I would expect summarizeOverlaps to give counts more similar to htseq-count than countOverlaps does. countOverlaps will double count reads that overlap multiple features, whereas summarizeOverlaps will not. htseq-count will not double count reads and it will also not count secondary alignments or reads with NH tag set and greater than 1. htseq-count is probably as close to a gold standard as you'll get in counting RNAseq reads. Having said that, give featureCounts a try, since it's faster. The differences between it and htseq-count are described in the paper.

BTW, the easiest way to see the differences in how the counting is done is to (1) look at a few multimappers and (2) look at a few reads overlapping multiple genes. If you find a read or two whose 5' coordinate overlaps the 3'-most coordinate of a gene, then you can see differences between htseq-count and featureCounts too.