Forum: Benchmarking paper for determining isoforms from RNAseq data

This is a really interesting benchmarking paper that compared a bunch of RNAseq methods.

Overview of methods:

They used TopHat, RUM, and STAR for alignment. Genome-guided analyses were run with and without annotations (where supported) using Cufflinks, Scripture, CEM, IsoLasso, Casper, and iReckon. De novo analyses were run with Trinity, Oases, SOAPdenovo-Trans, and EBARDenovo. They created "truth" datasets in which the isoforms present were predefined, using simulated idealized reads, simulated realistic reads, and a real ("spike-in") dataset of actual RNA sequencing reads for 1,062 in vitro expressed human cDNAs (from the Mammalian Gene Collection). For the simulated data they also defined true-negative isoforms by removing exons from known isoforms while not including any simulated reads for those isoforms. They focus on the ability of these algorithms to correctly recapitulate the known isoforms in these datasets. For a true positive they require the joining of exons into a final, complete isoform with a structure identical to that of the known positive, but they do not require accurate determination of transcription start or stop sites (which is, in some ways, an even harder problem).
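To make that true-positive criterion concrete, here is a minimal sketch (my own illustration, not code from the paper) of how predictions can be scored against a truth set: each isoform is reduced to its ordered intron chain, so two isoforms match if they splice identically, regardless of where transcription starts or ends.

```python
# Minimal sketch (not from the paper): score predicted isoforms against a
# truth set by intron-chain identity, ignoring transcription start/stop.
# An isoform is represented as a list of (exon_start, exon_end) tuples.

def intron_chain(exons):
    """Reduce an isoform to its ordered chain of introns.

    The first exon's start and the last exon's end (the TSS/TES) never
    appear in the chain, so isoforms differing only at their ends match.
    """
    exons = sorted(exons)
    return tuple((exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1))

def score(predicted, truth):
    """Return (FDR, FNR) for predicted isoforms versus the truth set."""
    pred_chains = {intron_chain(iso) for iso in predicted}
    true_chains = {intron_chain(iso) for iso in truth}
    tp = len(pred_chains & true_chains)
    fdr = 1 - tp / len(pred_chains) if pred_chains else 0.0
    fnr = 1 - tp / len(true_chains) if true_chains else 0.0
    return fdr, fnr

# Toy example: two true isoforms of one gene; one prediction matches the
# first isoform despite different ends, the other uses a wrong splice site.
truth = [
    [(100, 200), (300, 400), (500, 600)],
    [(100, 200), (500, 600)],
]
predicted = [
    [(90, 200), (300, 400), (500, 650)],   # true positive (ends differ)
    [(100, 200), (300, 350), (500, 600)],  # false positive (wrong donor site)
]
print(score(predicted, truth))  # -> (0.5, 0.5)
```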

Key results:

Most algorithms performed well with perfect data and a single splice form, but they tend to falter when predicting multiple splice forms. Once there are two splice forms, all algorithms have a > 10% FDR. Running Cufflinks on a TopHat alignment (Cufflinks + TopHat, presumably in de novo mode?), which is common in practice, results in a 40% FDR. The de novo methods incur substantially higher error rates, with Trinity having a nearly 90% FDR on two-splice-form genes, coupled with an approximately 50% false negative rate. Curiously, Cufflinks performs better with a TopHat alignment than with a STAR alignment, even though STAR produced a more accurate alignment. On the more realistic data, Cufflinks + TopHat used for de novo identification incurs an FDR of around 30% with an FNR of around 25%. This only seems better than the idealized data because the more realistic data (in terms of reads with errors, intronic noise, etc.) was also much simpler in terms of isoform complexity.
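For concreteness, the "Cufflinks + TopHat" configurations being compared look roughly like the sketch below. The file names are hypothetical and this is not a command line from the paper, but -G (TopHat) and -g (Cufflinks) are the standard annotation options; omitting -g puts Cufflinks in its annotation-free (ab initio / "de novo") assembly mode.

```python
import subprocess

# Hypothetical inputs: hg19_index (Bowtie index), genes.gtf, paired FASTQs.
# Align reads with TopHat, supplying known gene models to aid spliced alignment.
subprocess.run(
    ["tophat", "-o", "tophat_out", "-G", "genes.gtf",
     "hg19_index", "reads_1.fastq", "reads_2.fastq"],
    check=True,
)

# Annotation-guided assembly: known transcripts seed the assembly (RABT mode).
subprocess.run(
    ["cufflinks", "-o", "cufflinks_guided", "-g", "genes.gtf",
     "tophat_out/accepted_hits.bam"],
    check=True,
)

# Annotation-free ("de novo") assembly from the same alignments.
subprocess.run(
    ["cufflinks", "-o", "cufflinks_denovo",
     "tophat_out/accepted_hits.bam"],
    check=True,
)
```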

With the real in vitro data, both the FDR and FNR were generally much higher. Cufflinks + TopHat + annotation performed the best, with an FDR of 20% and an FNR of 26%. This was an order of magnitude worse than for the ideal data, which speaks to the complications introduced by alignment errors, polymorphisms, etc. Remember that this spike-in data was almost exclusively one isoform per gene and only ~1,000 genes in total. For the genes with more than one splice form, the results were worse still. Even with known annotations and a very simple sample (only ~1,000 genes and very few isoforms), 20% of the isoforms Cufflinks predicts are incorrect and it misses 26% of the real isoforms. And Cufflinks was best in class!

In terms of expression estimates, with realistic simulated data, perfect alignments, and annotations available, Cufflinks reports 16.7% of isoforms with an FPKM more than one order of magnitude off from the truth. Without annotations, this number increases to 38.62%. With real (non-perfect) alignments and no annotation available it is even worse, at 54.12%. Unfortunately, they don't mention the performance with real alignments but annotations available (which is the common real-world situation), but we can assume that more than ~17% of isoforms are going to be at least an order of magnitude off in their estimates. The other algorithms all did as badly as or worse than Cufflinks.
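As a reference for what "more than one order of magnitude off" means here, this is a minimal sketch of the metric (my own, not from the paper), assuming paired true and estimated FPKM vectors and a small pseudocount to avoid taking the log of zero:

```python
import math

def frac_off_by_order_of_magnitude(true_fpkm, est_fpkm, pseudo=1e-3):
    """Fraction of isoforms whose estimated FPKM is more than one order
    of magnitude away from the true FPKM, i.e. |log10(est/true)| > 1.

    The pseudocount guards against log(0) for unexpressed isoforms.
    """
    off = sum(
        1
        for t, e in zip(true_fpkm, est_fpkm)
        if abs(math.log10((e + pseudo) / (t + pseudo))) > 1
    )
    return off / len(true_fpkm)

# Toy example: the third estimate is ~20x too low and the fourth ~100x
# too high, so half of the four isoforms are off by more than one log.
print(frac_off_by_order_of_magnitude(
    [10.0, 5.0, 20.0, 1.0],
    [12.0, 4.0, 1.0, 100.0],
))  # -> 0.5
```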

Some choice quotes from the authors:

  • "with a reference genome available with some degree of community annotation, it is hard to imagine any benefit of using a de novo approach"
  • "The extreme overcalling of forms makes it unclear how to utilize the output of Scripture in a practical way"
  • "These results are not encouraging"

The take-home message for me:

TopHat/Cufflinks with a reference genome and annotations is the best current option, but it is far from perfect. Given the reality of noisy data, imperfect alignments, and multiple isoforms expressed per gene (expect most genes to express at least two isoforms), we can expect at least ~20% of real isoforms to be missed, ~20% of predicted isoforms to be wrong, and ~20% of expression estimates to be off by at least an order of magnitude. And this is when you have a reference genome and good annotations available to guide Cufflinks. De novo mode with Cufflinks (and especially with some other methods) should only be considered experimental/exploratory given the very high FDRs/FNRs. We should expect to miss real, biologically important isoforms, and any novel isoforms predicted should be validated. This paper strongly emphasizes the importance of continued improvement to transcript isoform discovery and quantification methods and/or improved data quality (e.g., longer reads).

In the authors' words, "short reads fundamentally lack the information necessary to build local information into globally accurate transcripts ... Most likely a satisfactory solution will involve an evolution in the nature of the data. Or perhaps some keen insight into how to identify and effectively utilize signals in the genome that inform cellular machinery on what splice forms to generate."


Lior Pachter's comment when tweeting about this was great:


Thanks for the link; it is very interesting.

Personally, I think the quantitative step forward will have to come from a new type of instrumentation that produces much, much longer read lengths; short reads, even when paired, can only resolve so much of the variability.

And that is because there is no underlying mathematical theory governing splicing, so one can't really reconstruct the transcripts unless they are actually measured properly.


Yes, I would have to agree with this. You can get excellent gene-level expression estimates from RNAseq and discover many interesting features of a sample's transcriptome, especially when correlated with exome and/or genome data. For example, you can observe fusions and correlate them to underlying genomic structural variants, identify expressed variants, correlate splice-site variants with transcript changes, and more. So I don't want to be too harsh on the value of 2x100bp RNAseq data. But we are certainly not at the point of highly accurate and comprehensive profiling of transcript isoforms and their abundances. I am not strong enough on the algorithm side to say definitively that we could not get there with improvements to the algorithms alone, but it seems clear to me that longer reads would help immensely. That is, of course, assuming you can get abundant and intact (not degraded) RNA samples.


Have you seen http://genomebiology.com/2014/15/6/R86 by the same authors? I'm actually amazed that they got such "good" isoform results with their IVT data in the paper you discuss, given how non-uniformly transcripts (i.e., isoforms) seem to be covered in the Genome Biology paper...
