Question

comparing the result of featurecounts and cuffdiff

0

Entering edit mode

9.2 years ago

zizigolu ★ 4.3k

Hello,

I calculated the my reads counts by featurecounts and cuffdiff but for the same gene result is different then among them which I should trust on?

An example of featurecounts results

Geneid    Chr    Start    End    Strand    Length    accepted018347_hits.bam
AT1G01010    1;1;1;1;1;1    3631;3996;4486;4706;5174;5439    3913;4276;4605;5095;5326;5899    +;+;+;+;+;+    1688    31
AT1G01020    1;1;1;1;1;1;1;1;1    5928;6437;7157;7564;7762;7942;8236;8417;8571    6263;7069;7450;7649;7835;7987;8325;8464;8737    -;-;-;-;-;-;-;-;-    1774    153
AT1G01030    1;1    11649;13335    13173;13714    -;-    1905    10

Example of cuffdiff

tracking_id    47_count    47_count_variance    47_count_uncertainty_var    47_count_dispersion_var    47_status  
XLOC_000001    33.4838    44    0    43.6465    OK    
XLOC_000002    92.8896    413.175    0    412.407    OK    
XLOC_000003    0    1    0    0    OK   
XLOC_000004    0    0    0    0    OK

RNA-Seq cuffdiff featurecounts • 4.4k views

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 9.2 years ago by zizigolu ★ 4.3k

Ram · Accepted Answer · 2015-11-10

I can't really answer your question based on the information you've given -- it depends on what you're planning to do with the results. A few observations, though...

First and foremost, featurecounts and cuffdiff are not reporting the same measure and so you will never be able to directly compare the results. Featurecounts does exactly what the name implies: counts reads per feature in the transcript annotations file and reports the raw reads. In order to compare gene expression levels between different experiments (or even between different genes in the same dataset), you will need to apply some normalization to correct for sequencing biases between genes of different lengths, GC contents, etc. -- add in lane and batch effects if you are using data from multiple experiments. All this can get quite complicated and is an active area of research. Cuffdiff, by contrast, reports FPKM (fragments per kilobase per million reads), which are basically read counts normalized to the length of the gene.

Second, cuffdiff will split reads among different isoforms of the same gene. There is a file among the several that cuffdiff produces containing the total reads over all overlapping transcripts from a gene -- I don't recall what it is named off the top of my head, but it is in the documentation: http://cole-trapnell-lab.github.io/cufflinks/cuffdiff/. Sometimes this file will still contain multiple lines for one gene -- this happens when certain transcripts have no overlap in exon usage. It is an open discussion on whether these should be added to the total or considered distinct. It's up to you to decide whether, for your purposes, it makes sense add these to the total.

Finally, if you want to get comparable results, you need to use the same transcript annotations for both cuffdiff and featurecounts. From your results, I can't tell if you are already doing this or not. It should be pretty straightforward that, if you're using different sets of transcripts between the two programs, you may get very different results from each. There is an option to cuffdiff that will make it use the names from the transcript annotations rather than the XLOC_* format.

Summing things up, if I were planning on comparing several datasets from different sources and/or sequencing runs, I would definitely not trust the off-the-shelf FPKM values from cuffdiff. Rather, I would use featurecounts to get the raw reads and apply a uniform normalization across all the datasets. If I were just looking for a quick way to compare gene expression levels within the same sample or between lanes of the same sequencing run (that said, lane effects can be a concern too, but are easier to control for), I would probably go with cuffdiff because it is much less complicated.

Good luck!