Question

exon level quantification using htseq count

0

Entering edit mode

7.6 years ago

komal.rathi ★ 4.1k

Hi everyone,

This is my htseq-count command for gene level counts:

htseq-count -f bam -r name -s yes -a 10 -t exon -i gene_id -m union sample_Aligned.out.bam gencode.v23.annotation_subset.gtf > sample_Aligned.out.table.txt

ENSG00000075624.13  171006
ENSG00000107554.15  379
ENSG00000107796.12  1508
ENSG00000111640.14  61287
ENSG00000112237.12  1299
ENSG00000112592.12  812
ENSG00000131203.12  27
ENSG00000143632.14  27
ENSG00000144713.12  12691
ENSG00000156508.17  58724
ENSG00000159251.6   5
ENSG00000163017.13  142
ENSG00000163631.16  14
ENSG00000184009.9   131168
ENSG00000188676.13  0
__no_feature    84834542
__ambiguous 0
__too_low_aQual 0
__not_aligned   2970335
__alignment_not_unique  12964119

ENSG00000075624.13 is ACTB an actin gene which has 171006 reads mapped to the gene. And this is my command for quantifying exon level counts where I change the -i parameter to exon_id:

htseq-count -f bam -r name -s yes -a 10 -t exon -i exon_id -m union sample_Aligned.out.bam gencode.v23.annotation_subset.gtf > sample_Aligned.out.table.txt

For ACTB, I am getting the following counts for its 39 exons sorted by the counts:

ENSE00001902558.1   ACTB    327
ENSE00001731256.1   ACTB    38
ENSE00001911024.1   ACTB    9
ENSE00001775033.1   ACTB    1
ENSE00001958169.1   ACTB    0
ENSE00003669840.1   ACTB    0
ENSE00003664639.1   ACTB    0
...

So here the total exon counts are 375 which is not even close to the gene level count 171006.

Similarly, at the exon level for ENSG00000131203.12 I am getting 0 counts but at gene level I have 27 counts.

What am I doing wrong? Any help would be much appreciated.

Thanks!

htseq-count RNA-Seq exon quantifications • 4.9k views

ADD COMMENT • link updated 7.6 years ago by Devon Ryan 105k • written 7.6 years ago by komal.rathi ★ 4.1k

score 2 · Accepted Answer · 2017-09-25

2

Entering edit mode

7.6 years ago

Devon Ryan 105k

Have a look at that file in IGV, you'll note that each exon that's shared between transcripts exists multiple times in the file. Consequently, very few exons will have uniquely mappable regions. If you want exon level quantification, then you the scripts that come with DEXseq to prepare the GTF file and then run htseq-count with that.

ADD COMMENT • link 7.6 years ago by Devon Ryan 105k

0

Entering edit mode

Hi Devon, thanks for the suggestion. I used

python dexseq_prepare_annotation.py gencode.v23.annotation_subset.gtf gencode.v23.annotation_subset.gff

and this is how the gff looks:

chr1    dexseq_prepare_annotation.py    aggregate_gene  229431245   229434098   .   -   .   gene_id "ENSG00000143632.14"
chr1    dexseq_prepare_annotation.py    exonic_part 229431245   229431248   .   -   .   transcripts "ENST00000366684.7"; exonic_part_number "001"; gene_id "ENSG00000143632.14"
chr1    dexseq_prepare_annotation.py    exonic_part 229431249   229431642   .   -   .   transcripts "ENST00000366683.3+ENST00000366684.7"; exonic_part_number "002"; gene_id "ENSG00000143632.14"
chr1    dexseq_prepare_annotation.py    exonic_part 229431721   229431862   .   -   .   transcripts "ENST00000366683.3+ENST00000366684.7"; exonic_part_number "003"; gene_id "ENSG00000143632.14"
chr1    dexseq_prepare_annotation.py    exonic_part 229431863   229431902   .   -   .   transcripts "ENST00000366684.7"; exonic_part_number "004"; gene_id "ENSG00000143632.14"
chr1    dexseq_prepare_annotation.py    exonic_part 229431994   229432185   .   -   .   transcripts "ENST00000366684.7"; exonic_part_number "005"; gene_id "ENSG00000143632.14"
chr1    dexseq_prepare_annotation.py    exonic_part 229432270   229432406   .   -   .   transcripts "ENST00000366684.7"; exonic_part_number "006"; gene_id "ENSG00000143632.14"
chr1    dexseq_prepare_annotation.py    exonic_part 229432407   229432431   .   -   .   transcripts "ENST00000366683.3+ENST00000366684.7"; exonic_part_number "007"; gene_id "ENSG00000143632.14"
chr1    dexseq_prepare_annotation.py    exonic_part 229432556   229432880   .   -   .   transcripts "ENST00000366683.3+ENST00000366684.7"; exonic_part_number "008"; gene_id "ENSG00000143632.14"
chr1    dexseq_prepare_annotation.py    exonic_part 229432987   229433127   .   -   .   transcripts "ENST00000366683.3+ENST00000366684.7"; exonic_part_number "009"; gene_id "ENSG00000143632.14"

When I used this in the htseq-count using the command:

htseq-count -f bam -r name -s yes -a 10 -t exon -i exon_id -m union sample_Aligned.out.bam gencode.v23.annotation_subset.gff > sample_Aligned.out.table.txt

I only got the following lines in the output:

__no_feature    85273631
__ambiguous 0
__too_low_aQual 0
__not_aligned   2970335
__alignment_not_unique  12964119

No mapping to any exons. What else am I missing?

ADD REPLY • link 7.6 years ago by komal.rathi ★ 4.1k

0

Entering edit mode

Oh I just realized I don't have the exon_id columns in the gff while I am passing exon_id to -i parameter in htseq-count. How do I get the exon_ids in the gff? According to this thread it is not that straightforward to get exon level counts using htseq-count.

ADD REPLY • link 7.6 years ago by komal.rathi ★ 4.1k

0

Entering edit mode

DEXseq also comes with a wrapper for getting the counts using htseq-count.

ADD REPLY • link 7.6 years ago by Devon Ryan 105k

0

Entering edit mode

Yes - I am giving it a try. I thought I could just use htseq-count - but after reading up on the docs and from Simon himself - it is not possible to get exon level counts using htseq-count.

ADD REPLY • link 7.6 years ago by komal.rathi ★ 4.1k