Hi, Is it possible to detect isoforms for every exact gene in raw counts (or FPKM units) after htseq-count with flag -i gene_id?
Thanks in advance!
Hi, Is it possible to detect isoforms for every exact gene in raw counts (or FPKM units) after htseq-count with flag -i gene_id?
Thanks in advance!
Unless you are doing differential Expression analysis, you should not count the alternate isoform transcripts via htseq-count. The main reason is that htseq-count removes all the ambiguous reads from counting, and any read which are common to two or more transcript will be ambiguous in this case. See how htseq counts the reads in default (union) mode
http://www-huber.embl.de/users/anders/HTSeq/doc/count.html#count
In your case, gene_A and gene_B become transcript_A and transcript_B of the same gene. That said, of course, you can do it and there is a small note on the same page regarding that:
Can I use htseq-count to count reads mapping to transcripts rather than genes?
In principle, you could instruct htseq-count to count for each of a gene’s transcript individually, by specifying --idattr transcript_id. However, all reads mapping to exons shared by several transcripts will then be considered ambiguous. (See second question.) Counting them for each transcript that contains the exons would be possible but makes little sense for typical use cases. (See first question.) If you want to perform differential expression analysis on the level of individual transcripts, maybe ahve a look at our paper on DEXSeq for a discussion on why we prefer performing such analyses on the level of exons instead.
If you are not doing differential Expression, then you may try featureCounts which is more flexible and extremely fast. http://bioinf.wehi.edu.au/featureCounts/
yes, -i transcript_id will do counting on transcript level. But it will grossly underestimate the counts as there will be many common exons among transcripts, and all the reads to common exons will be discarded. You have been warned!
-intersection-nonempty mode will not help. The situation with multiple transcript of same gene is like shown in the last picture (the penultimate figure's situation is very less possible in case of transcripts). And once again, htseq developers warns against counting on transcript level. Your same situation is written in this post; and see the reply by Simon Anders (developer) http://seqanswers.com/forums/showthread.php?t=18068
Why don't you try some of the progs which are specialized to do that (Cufflinks, RSEM...?) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4712774/
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Every row of the file is an isoform. Just open it up in Notepad and take a look. Now you've detected them!
Thanks for replay. But how can I find any information about each isoform? There are many isoforms with adding ".some_number". E.g. "ENSG00000001167.13" So, I could find information on UniProt about "ENSG00000001167" (without .some_number) That's what I found:
And how I should identify the only .13 isoform? How htseq-count appropriate numbers (.some_number) for each isoform?
Thanks in advance!
You can get isoforms expression from Cufflinks in the isoforms.fpkm_tracking file.