Hi,
I aligned a few samples using STAR to the genome provided in the Illumina iGenomes UCSC hg19 bundle (here) -- I used the provided gene feature (gtf2) file as is. Now, my motive is to calculate the gene and isoform expression levels using bedtools multicov (at the same time).
Use of the gtf2 file produces a file containing read counts per exon. I wish to compute gene and isoform read counts too, so I converted the gtf2 file to a gff3 file using using gtf2gff3 script from SO/GAL (here).
My first question is: Is it OK if the alignment is performed with gtf2 file but counted for reads using the gff3 file, keeping in mind that the gff3 file was converted from the gtf2 file?
My second question follows
I have read both these resources (here and here) but do not understand the differences between:
- exon vs CDS
- transcript vs mRNA
I know that with the process I described, it is possible to retrieve gene read count by selecting only the lines where feature=gene from the bedtools multicov output. What must I do for isoforms? I am confused by the semantics.
Thanks ahead of time and let me know if my post was not clear enough.
Is there any particular reason you're trying to do this with bedtools? Using htseq-count or featureCounts would make your life easier for gene-level analyses. Also, you generally don't want to count reads for looking at transcript expression level, but rather use an expectation-maximization alogrithm (normally in the form of cufflinks, rsem, eXpress, etc.) to assign full and fractional reads, since otherwise you'll get almost 0 unique reads (or you end up double/triple/etc. counting half your reads).
I was hoping to obtain gene and isoform levels in a single step. In my experience htseq-count has proven to be quite slow so I decided to do things differently, that is using bedtools, and getting the most out of it. Why isn't counting reads sufficient? I am planning to analyze all of this using programs like DESeq. I'd have at least two files per sample : gene-only read counts and isoform-only read counts.
The slowness of htseq-count can be remedied by simply using featureCounts instead (it's MUCH faster). I'm sure you can use bedtools for looking at gene-level read counts if you muck around enough, but that's not exactly the path of least resistance. While you can attempt to look at isoform usage in DESeq (or DESeq2) the output would be completely meaningless and you would have just wasted large amounts of your time. Count-based methods are simply useless in looking at isoform usage (though you CAN look at exon usage, see the DEXseq package).
Thanks for your words of wisdom. I will look at featureCounts and DEXSeq. Sorry for being a pain, but can you explain (or point me to resources explaining) why count-based methods won't work for isoform usage? Why won't bedtools do the trick? I currently have a the read counts of every gene as per the gff file -- is it related to the ambiguity or multimapping or because htseq-count is more stringent on its counting?
It's related to the multiple assignment issue (you might mean this by "multiple mapping", though that has a slightly different meaning in this context). I vaguely remember some of this being discussed in the cufflinks2 paper from Lior Pachter. The general issue is that isoforms overlap almost completely, so by using simple count-based method you will always drastically over/under-estimate actual levels and thereby tank your power. The image below (from the aforementioned paper and Lior's blog, which is a must-read) gives an example of a few count-based methods and how they lend themselves to producing wrong answers.