Question

How do I get the read counts for a specific exon

0

Entering edit mode

6.7 years ago

b10hazard ▴ 30

I have a bam file and a gtf file that contains a single line....

chr5    UCSC    exon      13744354       13744380       .       +     
.       gene_id "UDHG"; transcript_id "NM_033847";  gene_name "UDHG";

The bam file contains paired end reads. How do I get the number of reads that align to this exon?

bam exon • 3.1k views

ADD COMMENT • link updated 6.7 years ago by Ram 44k • written 6.7 years ago by b10hazard ▴ 30

4

Entering edit mode

This question has been asked a gazillion of times. Did you search for it ? What did you find ?

ADD REPLY • link 6.7 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Of course I did. The gazillion answers I found all sucked. I tried bedtools coverage via...

bedtools coverage -abam mybam.bam -b mygtf.bed > results_bedtools.txt

But just got a 8.6 GB file containing a ton of useless info an none of it was grouped by exon. Then I tried...

htseq-count -f bam -m union -t exon mybam.bam mygtf.gtf

After 2 hours I got a lovely 4.9GB file that was pretty much useless since it also was not broken up by exon.

Both these methods suck and result in a ton of data that is not targeted to my exon of interest. Why on earth these tools parse the entire bam instead of the region of interest I specified in my gtf/bed file is beyond me. I actually don't think any tool exists that can parse only the specified region of my bam and give me an single line output containing the count of my exon of interest.

ADD REPLY • link 6.7 years ago by b10hazard ▴ 30

4

Entering edit mode

how about ?

samtools view -c mybam.bam "chr5:13744354-13744380"

ADD REPLY • link 6.7 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

When you test a method, it's wise to create a model subset of your data and test on it. Even subsetting 10% of your BAM data would yield a better testing method than running all your data through a tool.

ADD REPLY • link 6.7 years ago by Ram 44k

0

Entering edit mode

Already tried that. The output of htseq-count and bedtools coverage is smaller but it doesn't solve the fundamental problem of the output file being exon specific. I'm trying dexseq right now. Hopefully that tool won't produce gibberish.

ADD REPLY • link 6.7 years ago by b10hazard ▴ 30

1

Entering edit mode

Are you trying each new tool on the 10% subset or the whole dataset?

And everything is gibberish? ¯\_(ツ)_/¯

ADD REPLY • link 6.7 years ago by Ram 44k

0

Entering edit mode

Yes, initially I was trying the tool on the entire bam thinking that the tools would use the provided GTF or BED files to target the specified exon instead of parsing the entire file. Why the heck these tools parse the entire file when the user specifies a targeted region is beyond me. Then I made the bam subset and reran. The bedtools output keeps breaking up my exon into subsets (no idea why). Htseq-count outputs... nothing. I've found a solution though. samtools depth does give me depths at each base which is close enough for me. I should have used that from the beginning.

ADD REPLY • link 6.7 years ago by b10hazard ▴ 30

0

Entering edit mode

How can you get a 4.9Gb file with htseq-count and a gtf with only one line? What is the content of this file?

ADD REPLY • link 6.7 years ago by h.mon 35k

1

Entering edit mode

one line

that marks a region of 26 bp. Coverage has to be in the ... tens of thousands (or a magnitude of scale larger) to get this big a file.

ADD REPLY • link 6.7 years ago by Ram 44k

1

Entering edit mode

Doesn't htseq outputs only count numbers? Output should be:

__no_feature
__ambiguous
__not_aligned
__alignment_not_unique
UDHG

And counts for each one of those. How can such a file be 4.9Gb? What I am missing here?

ADD REPLY • link 6.7 years ago by h.mon 35k

0

Entering edit mode

Yup, the file is populated with repeats of what you posted above. I've given up on htseq-count. I used samtools depth to get my answer. I have no idea what is up with htseq-count. Could be some config/package dependency bug on our cluster.

ADD REPLY • link 6.7 years ago by b10hazard ▴ 30

0

Entering edit mode

Maybe the output is not gibberish and just needs summarizing? Did you try Google-ing how to interpret htseq output?

ADD REPLY • link 6.7 years ago by Ram 44k

0

Entering edit mode

doesn't the GTF file for input to htseq-count needs to have this kind of format in the last column?:

gene_id="UDHG";transcript_id="NM_033847"

so with '=' sign and no spaces ?

ADD REPLY • link 6.7 years ago by lieven.sterck 15k

0

Entering edit mode

Hi! @RamRS

Were you able to find a way out to count number of reads aligning to a specific exon? Please help. I am facing the same problem.

ADD REPLY • link 4.5 years ago by rohitsatyam102 ▴ 920

0

Entering edit mode

I wasn't really looking for one. This thread is 2 years old though, so I don't recall what I was doing at that time.

ADD REPLY • link 4.5 years ago by Ram 44k