Cufflinks / Cuffdiff Output - How Are Tests Different?
4
19
13.1 years ago
Stephen 2.8k

I've got two different cellular fractions and I'm looking for genes that are alternatively spliced, alternatively polyadenylated, differentially expressed, etc. I'm running cufflinks/cuffdiff in galaxy and I'm trying to grok what the different tests are doing.

Cuffdiff outputs 11 files (4 FPKM tracking files and 7 results files). Omitting the four FPKM tracking files, here are the 7 results files, each with a snippet from the cuffdiff documentation:

  1. Differential expression testing for transcripts: FPKM of one group vs FPKM of the other.
  2. Differential expression testing for genes: This sums the FPKM for transcripts sharing the same gene_id.
  3. Differential expression testing for coding sequence (CDS): This sums the FPKM of transcripts sharing a common p_id, which is the id of the coding sequence that this transcript contains.
  4. Differential expression testing for primary transcripts: This sums FPKM of transcripts sharing a common tss_id (transcription start site).
  5. Differential splicing tests: For each primary transcript, this tests the amount of overloading detected among isoforms, i.e. how much differential splicing exists between isoforms processed from a single primary transcript.
  6. Differential coding output: For each gene, this tests the amount of overloading detected among its coding sequences, i.e. how much differential CDS output exists between samples.
  7. Differential promoter use: For each gene, the amount of overloading detected among its primary transcripts, i.e. how much differential promoter use exists between samples.

My questions are:

  1. How are tests for differential splicing (#5) different from tests for differential coding output (#6)?
  2. How are the tests for differential gene expression summing over gene ids (#2) different from the tests for gene expression summing over CDS ids (#3)?
  3. Tests #5-7 above are testing something fundamentally different than the tests for differential gene expression (tests #1-4). I'd like a good explanation of how these groups of tests differ. E.g. how does #3 (differential expression over CDS) differ from #6 (differential coding output).

Thanks very much in advance.

cufflinks cuffdiff galaxy gene • 17k views
ADD COMMENT
7
12.9 years ago
Stephen 2.8k

To answer a part of my own question, I drew out a schematic of what tests 1-4 are doing. Each is grouping transcripts at a different level.

  1. Doesn't group any - each is a separate transcript and tested independently.
  2. All are grouped at the gene level.
  3. Transcripts B and C are grouped because they share a common protein coding sequence.
  4. Transcripts A and C are grouped because they share a common primary transcript.
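The four grouping levels can be sketched in a few lines of Python. This is only an illustration of the grouping idea, not Cuffdiff's implementation: the transcript records, the FPKM values, and the `sum_fpkm_by` helper are all hypothetical, chosen so that B and C share a `p_id` and A and C share a `tss_id` as in the schematic.

```python
from collections import defaultdict

# Hypothetical transcripts mirroring the schematic: B and C share a coding
# sequence (p_id), A and C share a primary transcript (tss_id).
# The FPKM values are made up for illustration.
transcripts = [
    {"id": "A", "gene_id": "G1", "tss_id": "TSS1", "p_id": "P1", "fpkm": 10.0},
    {"id": "B", "gene_id": "G1", "tss_id": "TSS2", "p_id": "P2", "fpkm": 5.0},
    {"id": "C", "gene_id": "G1", "tss_id": "TSS1", "p_id": "P2", "fpkm": 20.0},
]


def sum_fpkm_by(records, key):
    """Sum FPKM over all transcripts that share the same value of `key`."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[key]] += rec["fpkm"]
    return dict(totals)


# One entry per test level (tests 1-4 in the question).
levels = {
    "transcript": sum_fpkm_by(transcripts, "id"),
    "gene": sum_fpkm_by(transcripts, "gene_id"),
    "cds": sum_fpkm_by(transcripts, "p_id"),
    "primary_transcript": sum_fpkm_by(transcripts, "tss_id"),
}

for name, totals in levels.items():
    print(name, totals)
```

Each differential expression test then compares these summed values between the two samples, at whichever level the file corresponds to.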

Image: http://i43.tinypic.com/35am6j7.jpg


ADD COMMENT
5
12.9 years ago

Hello, I think I got most of this figured out:

How are tests for differential splicing (#5) different from tests for differential coding output (#6)?

  • Differential splicing is tested at the primary transcript level: you look at each group of transcripts that share the same TSS (more precisely, that derive from the same pre-mRNA, so you are clustering its splicing isoforms) and test whether the mix of splicing isoforms differs between samples. The test statistic is based on the Jensen-Shannon divergence, a measure of the difference between two distributions, so it is sensitive when one or more splicing isoforms make up a larger share of that primary transcript's output in one sample than in the other. However, the test is not sensitive to differences in the total abundance of the primary transcript (you will have to use the differential expression tests for that).

  • Differential CDS output looks at the different coding sequences you produce after splicing, i.e. the different combinations of coding exons you can produce; it's a proxy for protein output, but of course it does not take into account anything that happens after mRNA processing. The test is at the gene level, not at the primary transcript level, so it also factors in alternative TSS and promoter usage. Conversely, if you have differential splicing for one primary transcript, but that primary transcript does not account for the lion's share of the gene's transcription output, it will scarcely affect the CDS output difference. Also, if you have transcripts that do not differ in their coding sequence but differ in their UTRs, that difference will not be factored in (as there is no difference in coding sequence). The test statistic is again based on the Jensen-Shannon divergence, so it won't be sensitive to differences in total gene transcription (you will have to use the differential expression tests for that).

I think this also sheds light on the other questions.

In summary: the differential splicing and CDS output tests look at differences in the distribution over the possible isoforms (spliced transcripts or coding sequences), whereas the differential expression tests look at differences in total level.
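To make the distinction between distribution and total level concrete, here is a minimal sketch of the square root of the Jensen-Shannon divergence computed on relative isoform abundances. It only computes the distance, not the significance test that Cuffdiff builds around it; the function names and the FPKM values are hypothetical, and base-2 logarithms are an assumption made so that the divergence lies between 0 and 1.

```python
import math


def normalize(abundances):
    """Turn raw abundances (e.g. FPKM) into fractions that sum to 1."""
    total = sum(abundances)
    return [a / total for a in abundances]


def entropy(p):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(m) - (entropy(p) + entropy(q)) / 2


def sqrt_js(fpkm_x, fpkm_y):
    """Square root of the JS divergence between two isoform mixes."""
    js = js_divergence(normalize(fpkm_x), normalize(fpkm_y))
    return math.sqrt(max(0.0, js))  # guard against tiny negative rounding


# Hypothetical isoform FPKMs for one primary transcript in two samples.
baseline = [10.0, 10.0]
doubled = [20.0, 20.0]   # total level doubles, mix unchanged
shifted = [18.0, 2.0]    # same total as baseline, mix changes

print(sqrt_js(baseline, doubled))
print(sqrt_js(baseline, shifted))
```

Doubling every isoform leaves the fractions unchanged, so the distance is zero even though the total expression differs; shifting the mix at constant total gives a positive distance. That is the sense in which these tests ignore total level.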

ADD COMMENT
0

Thanks Daniele. Great answer.

ADD REPLY
0

I'm wondering if cufflinks supplies the percent representation of transcripts or CDSs within a gene (equivalent to the PSI field from MISO, or the IsoPct field from RSEM)? I understand that splicing.diff and cds.diff supply a differential test based on the differences in relative abundance within the gene, but what about the actual values?

In the same vein, what does the √JS(x,y) actually mean in terms of the change in that transcript's role in the mix? The manual gives 0.22115 as an example of a significant value for the test statistic, but doesn't explain whether in this case the tested transcript has increased or decreased its portion of the total gene expression, or by how much.
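Whether Cuffdiff reports such fractions directly is not settled in this thread. As a sketch only, and assuming per-isoform FPKM values are available for each condition, the fractions and the direction of change can be computed by hand. The dictionaries, the isoform names, and the `isoform_fractions` helper below are hypothetical. Note that a Jensen-Shannon distance is symmetric and summarizes the whole distribution, so by itself it cannot say which isoform went up or down.

```python
def isoform_fractions(fpkm_by_isoform):
    """Fraction of the group's total FPKM contributed by each isoform."""
    total = sum(fpkm_by_isoform.values())
    if total == 0:
        # No expression at all: report zero for every isoform.
        return {iso: 0.0 for iso in fpkm_by_isoform}
    return {iso: value / total for iso, value in fpkm_by_isoform.items()}


# Hypothetical FPKMs for the isoforms of one gene in two conditions.
cond_x = {"iso1": 30.0, "iso2": 10.0}
cond_y = {"iso1": 10.0, "iso2": 30.0}

frac_x = isoform_fractions(cond_x)
frac_y = isoform_fractions(cond_y)

# Signed change in each isoform's share going from condition x to y.
shift = {iso: frac_y[iso] - frac_x[iso] for iso in frac_x}

for iso in frac_x:
    print(iso, frac_x[iso], frac_y[iso], shift[iso])
```

The signed `shift` values carry the direction and size of the change for each isoform, which the single distance value does not.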

ADD REPLY
0
13.1 years ago
Flashton • 0

Hi Stephen,

I'm afraid I can't help you with your question (other than to suggest there might be two streams of analysis, one for ORFs and another for CDSs).

However, I was hoping you can shed some light on why you used Cuffdiff for your analysis rather than DESeq, EdgeR or BaySeq. I'm about to embark on an RNA-seq analysis project and any input you might have on the relative merits of these programs would be greatly appreciated.

Many thanks,

Phil

ADD COMMENT
0

Cuffdiff was just the first thing I tried. I was helping someone with an analysis where all the data was already in Galaxy, and cuffdiff was easy to run. I'm looking at DESeq now as integrated into the ExpressionPlot suite (expressionplot.com), which has some nice functionality.

ADD REPLY
0
13.0 years ago
Josh • 0

I can't help with your analysis but I have been using Expressionplot on our local server for several months and really like it. Just for what it's worth.

ADD COMMENT
