I'm looking to benchmark some known splicing algorithms (e.g., MISO, MATS) against each other to examine how they compare on biological data in terms of performance, efficiency, accuracy, etc. For example, does splicing algorithm #1 pick up a known splicing variant that splicing algorithm #2 totally misses? Or does splicing algorithm #1 get the job done three hours faster than splicing algorithm #2? To that end, I'd like to find and use some "good" previously published alternative splicing (RNA-seq) datasets. Surprisingly, few such datasets exist (and trust me, I've looked), so I wanted to ask the community for your thoughts... let me define what I mean by "good":
- Multiple samples (obviously)
- Clear splicing events (by clear, I mean plainly evident from FPKM values or some other quantitative measure)
- I do not have a preference for how these datasets were generated (i.e., what algorithm was used in the original paper)
- They need to be biologically validated (i.e., just a good FPKM measure doesn't cut it)
- Publicly available (obvious, until you realize how often FASTQ files that are supposedly in SRA turn out, upon follow-up with the corresponding author, to have been "not requested by the reviewers")
Can anyone help me out here?
I like to test using synthetic data with known answers, but it depends on what you're testing. For example, when generating synthetic reads from the transcriptome and mapping them to the raw, unannotated genome, you not only get a realistic distribution of splice sites, but you also know the correct answer, so you can evaluate the results in an objective and unbiased fashion - which, in my opinion, is the most crucial part of a benchmark.
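To make that evaluation step concrete, here's a minimal sketch of the scoring using only standard Unix tools. It assumes you've already reduced the truth set and each tool's output to plain-text junction lists in a shared format such as "chrom:start-end", one per line (how you extract those depends on your simulator and the tool being tested):

    sort -u truth.junctions > truth.sorted
    sort -u predicted.junctions > pred.sorted
    comm -12 truth.sorted pred.sorted | wc -l   # true positives (junctions found)
    comm -23 truth.sorted pred.sorted | wc -l   # false negatives (junctions missed)
    comm -13 truth.sorted pred.sorted | wc -l   # false positives (spurious calls)

From those three counts you can compute sensitivity and precision per tool, which gives you the objective comparison.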
But if you are more interested in seeing how programs behave in the presence of error modes that are specific to different library prep and sequencing procedures, or for some reason you need realistic ranges of expression for different isoforms, then real data would be better.
Hi Brian, yes, both would actually work. If you have a synthetic dataset you personally use and recommend, or a way to generate one, that would work; it would be a good performance measure, as you said. I would also be interested in capturing realistic ranges of expression and, ideally, in being able to say at the end of the day: "this splicing caller catches an important splicing event that has since been biologically validated, while this other one falls short." In other words, adding a "biological accuracy" metric to a "relative performance" metric. Thanks for your input.
I always use RandomReads [randomreads.sh] in the BBTools package for generating synthetic data. It annotates each read with its genomic origin, so you can automatically validate mappings from SAM files; that's what I used in developing BBMap to optimize its sensitivity and specificity for RNA-seq data. It also lets you generate an arbitrary number of arbitrarily long deletions in reads, to simulate unexpected novel introns.
It was not designed specifically to simulate RNA-seq data with variable expression levels, since that's irrelevant to determining mapping accuracy. However, it does have an MDA mode, activated with the flag amp=2000, that is designed for simulating MDA-amplified single cells. Basically, it makes the coverage highly nonuniform, so it should also work well for generating reads from a transcriptome to simulate RNA-seq data. I know there are also simulators designed expressly for simulating RNA-seq experiments, though I've never used any and don't know their names.
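Putting that together, a sketch of the whole workflow might look like this. The amp=2000 flag is as described above; the other flags (ref=, out=, reads=, length=, paired=, maxindel=) are from memory, so check each script's built-in usage before relying on them:

    # Generate nonuniform-coverage reads from the transcriptome; the true
    # origin of each read is encoded in its header.
    randomreads.sh ref=transcriptome.fa out=sim_reads.fq.gz reads=1m length=100 paired=t amp=2000

    # Map back to the raw, unannotated genome so spliced alignment is
    # genuinely exercised; maxindel lets long deletions be called as introns.
    bbmap.sh ref=genome.fa in=sim_reads.fq.gz out=mapped.sam maxindel=100k

One caveat: since the reads were generated from the transcriptome, the encoded origins are in transcript coordinates, so you'd need to translate them to genome coordinates before grading the mappings automatically.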
I'd probably take a look at the SUPPA manuscript and the datasets they used for testing in it.