Simulation: Oases Transcriptome Assembly Infers Too Many Isoforms

I'm embarking on an RNA-seq project and am trying to understand the output of a de novo assembly program before mapping read counts. Briefly, I used rlsim to simulate Illumina reads for a set of 100 known mRNA transcripts. I then ran a standard pipeline: trimming and filtering the simulated reads, followed by assembly with Velvet-Oases. I also compared the performance of using all reads against digitally normalized reads (using khmer). To evaluate the assembly, I BLASTed the Oases-assembled transcripts against the known transcripts.
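
For concreteness, here's a rough shell sketch of that pipeline; all file names, k-mer sizes, and read counts below are placeholders rather than the exact values I used:

    # 1. Simulate fragments from the known transcripts with rlsim, then
    #    generate reads from the fragments (e.g. with a read simulator
    #    such as simNGS).
    rlsim -n 1000000 known_transcripts.fa > fragments.fa

    # 2. Quality-trim the simulated reads (Trimmomatic shown as one option).
    trimmomatic SE reads.fq reads.trimmed.fq SLIDINGWINDOW:4:20 MINLEN:50

    # 3. Optionally apply digital normalization with khmer.
    normalize-by-median.py -k 20 -C 20 -o reads.diginorm.fq reads.trimmed.fq

    # 4. Assemble with Velvet + Oases (k=31 is an arbitrary example).
    velveth assembly_dir 31 -short -fastq reads.diginorm.fq
    velvetg assembly_dir -read_trkg yes
    oases assembly_dir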

Of the 100 known transcripts, I get 58 reconstructed from the assembly using all reads, and 70 from the assembly using diginorm reads. In both cases, about 90% of the length of the original transcript is covered. Nice!
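
The evaluation step looks roughly like this (file names are placeholders, and qcovs is assumed to be available in your BLAST+ build):

    # Build a BLAST database from the assembly and query it with the known
    # transcripts, so qcovs reports the fraction of each known transcript
    # covered by assembled contigs.
    makeblastdb -in oases_transcripts.fa -dbtype nucl
    blastn -query known_transcripts.fa -db oases_transcripts.fa \
        -outfmt "6 qseqid sseqid pident length qcovs" > known_vs_assembly.tsv

    # Count known transcripts recovered at >= 90% coverage.
    awk '$5 >= 90 {print $1}' known_vs_assembly.tsv | sort -u | wc -l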

Here's the problem: for both approaches, I get many more transcripts (234 and 270 for the 'all' and 'diginorm' approaches, respectively) than the 100 I started with. Needless to say, this causes problems when mapping reads, due to multiple alignments.

My solution is to use cd-hit-est, which nicely collapses these to 78 and 102 transcripts, respectively.
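
For anyone curious, the clustering step is a one-liner; the identity threshold below is illustrative rather than the exact value I used:

    # Collapse near-identical assembled transcripts. -c is the identity
    # threshold and -n the word size (-n 10 suits thresholds around 0.95).
    cd-hit-est -i oases_transcripts.fa -o oases_transcripts.nr.fa -c 0.95 -n 10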

But shouldn't this not be happening at all? How do I get Oases to avoid producing these spurious isoforms in the first place?

What do I do with real data where I likely do have some true isoforms?

FYI - my scripts for this simulation are available at https://github.com/johnstantongeddes/sim-transcriptome though as always, it's a work in progress.

Nice pipeline, thanks for putting it all together.

Thanks - though hats off really go to the developers of all the programs.

For future reference, there's a related question here discussing the merits and problems of CD-HIT: Validation of de novo RNA-seq assembly

lzwright ▴ 150

This is the $64,000 question in my book. Unfortunately I don't really have an answer, but it is very common for the number of transcripts from a de novo assembly to far exceed what is expected. For example, I have done a de novo assembly of venom duct tissue and have ~200K contigs from various assemblers, which is clearly far in excess of any number of genes that could be expected. So yes, there is redundancy, there are chimeras, there are truncated transcripts, there are flat-out misassemblies... it's probably a pretty long list.

CD-HIT or CAP3 can resolve some of the redundancy issues, and you can also apply a minimum contig length cutoff (I've seen Velvet-Oases return contigs of 2 bp), though always with the fear that you might be losing something valuable; a sketch of such a filter follows below. My perspective is that we have to wade through these assemblies as best we can, knowing they are far from perfect. But if anyone around here with more knowledge than myself can shed light on why exactly the process is so imperfect, I think it is a very interesting question.

Also, I think mapping reads back to the assembly can help you get a better idea of what might be a true isoform and what might be garbage; this is my strategy. I also read an interesting paper recently on detecting trans-self chimeras as well as other types; you can probably google "chimeras" and "de novo assembly" to find it.
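
As an illustration of that length cutoff (the 200 bp threshold is arbitrary, and seqtk is just one tool that can do this):

    # Drop contigs shorter than 200 bp; Velvet-Oases can emit tiny fragments.
    seqtk seq -L 200 oases_transcripts.fa > oases_transcripts.min200.fa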

Glad I'm not alone here.

As for mapping reads - could you expand on your strategy?

My experience: with TopHat-Cufflinks, of the ~200 assembled transcripts, I was only getting reads aligned to about 30 genes/isoforms. That is, the program correctly did not report multiple isoforms, but it only identified 30 genes. Not sure why this is, but maybe due to multi-mapping of reads? With BWA, something like 56% of reads had multiple alignments (not surprisingly), and this seemed to cause problems getting counts with multiBamCov. Not sure how I'd use this information to infer the 'true' isoform.
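
One thing I might try, on the assumption that BWA gives MAPQ 0 to reads that map equally well in several places, is filtering on mapping quality before counting (the -q threshold here is a guess, not a recommendation):

    # Keep only (near-)uniquely mapped reads before counting; with BWA,
    # multi-mapping reads get MAPQ 0, so a modest -q cutoff removes them.
    samtools view -b -q 10 aln.sorted.bam > aln.unique.bam
    samtools index aln.unique.bam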

I have used Bowtie, Bowtie2, BWA, and BWA-MEM for mapping. I have gotten the best mapping percentages for my assembly with BWA-MEM, which I believe has a higher tolerance for indels and allows gapped alignments. I then visualize the mapping in a tool like Tablet or IGV. I also calculate FPKMs to see which of the purported isoforms have the highest coverage, as an indicator of which ones may be real, and I look for gaps in coverage versus clean coverage. None of this is foolproof, but it's the best I have to go on. Multiple mapped reads are a real bête noire here... Cufflinks has one way of assigning them, other tools have other ways, and no one really seems to believe in any of the algorithms. Welcome to bioinformatics!
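
For what it's worth, my mapping step looks roughly like this (file names are placeholders):

    # Map reads back to the assembly and sort/index for viewing in IGV
    # or Tablet.
    bwa index assembly.fa
    bwa mem assembly.fa reads_1.fq reads_2.fq | samtools sort -o aln.sorted.bam -
    samtools index aln.sorted.bam

    # Per-base depth per contig, useful for spotting coverage gaps on
    # suspect "isoforms".
    samtools depth -a aln.sorted.bam | head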

Botond Sipos ★ 1.7k

I've looked at the pipeline, and here are my tips for investigating this issue from the simulation side:

  • First, biases (both in simulated and real data) could confuse de novo assemblers and lead to erroneous transcripts.
  • It seems that you are using rlsim's default fragmentation method and GC-dependent efficiency function when simulating fragments. However, the default GC-dependent efficiency profile is quite simplistic and might introduce biases that are too strong for your purposes. I suggest you run a simulation with the priming biases turned off and fixed efficiencies, just to get a sense of how much the biases affect your results: -f after_noprim_double -e 1.0 (see the sketch after this list).
  • In order to simulate more realistic GC biases, you could use one of the parameter files estimated from real datasets at https://github.com/sbotond/rlsim-params. You might have to modify the parameter files if you want to use your own fragment size distribution.
  • You could also play with the priming bias parameter -p to tune the strength of priming biases with the default fragmentation method.
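
Concretely, the bias-free run could look something like this (the fragment count and file names are placeholders):

    # Re-run the fragment simulation with priming biases off and a fixed
    # fragmentation efficiency, as suggested above.
    rlsim -n 1000000 -f after_noprim_double -e 1.0 known_transcripts.fa > frags_nobias.fa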

On a general note: as far as I know, getting many "false positive" transcripts from de novo assembly is not that unexpected, so I think it is more likely that your problem is on the inference side. Experimenting with the Oases parameters that affect the sensitivity vs. specificity trade-off would be a good idea; a possible starting point is sketched below.
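
For example (the flags below exist in Oases, but the values are illustrative rather than tuned recommendations):

    # Tighten Oases: raise the coverage cutoff, prune weak graph edges,
    # and drop very short transcripts.
    oases assembly_dir -cov_cutoff 5 -edgeFractionCutoff 0.1 -min_trans_lgth 200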

Thanks for the input! I've been thinking about the 'simulation side' but have operated on the assumption that real data is certainly(?) messier, so that the issues are further downstream. As noted by @lzwright, assemblies of real data often yield far more transcripts than expected, so this doesn't seem unreasonable.

I'll test the parameters you mention and see what I get...
