I'm evaluating some options for novel isoform detection and I'm looking for some advice/experience on settings. I finally produced the "spiked" results I expected, but I'm concerned about an explosion of memory, etc. when I run it on a full-sized experiment.
TL;DR: what are typical settings for Trinity on RNA-seq of PE75 or PE100? I had to use --min_contig_length=40
to get Trinity to find a contig that I know exists for simulated PE100 reads.
My setup was--
I randomly selected a handful of gene name/IDs. I then get the sequence of those genes and some intergenic regions to place between them, creating a mock genome. For each of those genes, I create a set of transcripts by splicing the various exons in the appropriate order (e.g. no circRNA). I also create TWO GTF files. One is the full GTF, including all transcripts. The other removes a transcript corresponding to a skipped exon (in my example the gene has 3 exons, and the "spiked" mature transcript would be exon 1 spliced to exon 3, skipping exon 2).
Using BioC's Polyester, I simulated some stranded, paired-end reads. To do the simulation, I use the full GTF and transcript fasta from step 1. This creates a pair of fasta files. I can see that there are ~300 reads that are split across the junction (based on CIGAR strings from a STAR align AND an IGV sashimi plot).
I first tried by TransAbyss since it has the companion PAVFinder tool which will align the assembled contigs to the genome, etc. and report potentially novel splicings. As part of TransAbyss/PAVFinder, you supply a GTF of known annotations along with the sequencing reads. Unfortunately, this produced no results. This was after merging assemblies produced for kmers of 22,24,26,28,30,32.
When I look at the assembled contigs produced by TransAbyss, it produces a perfect match for the full transcript (exon1:exon2:exon3), but does NOT generate any contigs for the skipped exon1:exon3 form. If I manually add a sequence spanning the exon1:exon3 junction to the FASTA, PAVFinder correctly identifies the "novel" isoform.
At this point I figured the problem lies in the assembly itself. I then decided to use Trinity since it has the --min_contig_length
parameter (which naively seemed to be the way to go). Ultimately, I had to reduce the --min_contig_length
to 40 (default is 200) to finally get it to produce a contig corresponding to my exon1:exon3 splice form. Is this typical? I don't have a lot of experience with Trinity, so I'm a bit concerned that setting it this low will explode my memory footprint on a real-sized data set. The simulated reads are quite clean (and real samples will be from cancer samples), so I imagine the real graph will be far more complex.
Thanks for any tips!
The exons in question were 342 bp and 534bp (taken from WBP1 gene). Perhaps I need to dig into the details in the graph algorithms a bit more. Perhaps start default and then do a bit of exploration in parameter space, depending on how the run times are.
I intentionally used rather high coverage to be sure I was not missing anything due to low coverage depth. With several hundred split reads spanning the junction, I will say it surprised me a bit that it couldn't find it under default settings (especially with rather clean simulated reads). If I reduce the
--length
parameter for TransAbyss, it eventually sldo finds a junction-spanning contig; unfortunately, that comes with a bit of an explosion of very similar, shorter contigs which feature the simulated basecall errors. But I suppose that's just the reality of cranking up the sensitivity...