What to do when transcriptome assembly quality metrics do not concur. How can you decide what assembly is best to annotate and be used as reference?
0
1
Entering edit mode
3.8 years ago
MBarcelo ▴ 40

Hey Biostars!

I am attempting to assembly and annotate a de novo transcriptome from a vertebrate species with no references (transcriptome or genome). I managed to build several transcriptomes using sequencing data from 6 individuals (PE illumina reads). I read all the manuals and online help I was able to find but I need someone with experience to look at my numbers and give an informed opinion on what assembly is the best so I can proceed with annotation.

A very short summary of the pipelines for my “best” 3 assemblies:

Ass1 -> Raw reads – TrimmomaticTrinity (using all 6 individuals together)

Ass2 -> Raw reads – TrimmomaticrcorrectorTrinity (using all 6 individuals together)

Ass3 -> Raw reads – Oyster River Protocol (assembly of separate samples with three different assemblers Trinity, Spades and Shannon, obtaining 6 assemblies, one per individual) – Orthofuse (after assembly, I merged all 6 individual assemblies into one)

All of them have very good (>90%) alignment rates for the reads (bowtie2) and the fragments (Transrate).

Table of quality metrics for each assembly (in blue what is considered the best quality, in red the worst): enter image description here

My questions are:

Q1 – What to do when the quality metrics point to different directions? Are there any settings (in the assembler options) that can help improve some of the metrics, so I can try to directly improve a specific assembly? (E.g. improve Transrate score or busco for Ass1).

Q2 – From all assemblies, I obtained lots of transcripts, more than I would expect a vertebrate to have (>300,000). This means I have a fragmented transcriptome. However, the Transrate output says I have lots of potential bridges (>100,000). Is there a way I can use these potential bridges to enhance the quality of my transcriptome?

Any help of any sort would be greatly appreciated! I am new at this, I learned a lot from this site and the Biostars Handbook but I am in the need of specific advice on my data :)

Thank you so much for your help.

P.S. I cannot thank enough the BioStar team and all people contributing to this site and the Biostars Handbook for all their help they have already given me. As a suggestion, a chapter (or a book) within the Biostars Handbook about de novo transcriptome assembly and annotation would be awesome :P

RNA-Seq de novo assembly quality • 2.3k views
ADD COMMENT
1
Entering edit mode

Not an expert, but a fellow journey-person down this path. Here's what I can share based on my experiences:

From the Transrate website: "Transrate calculates whether the read mappings contain any evidence that different contigs originate from the same transcript. These theoretical links are called bridges, and the number of bridges is shown in the supported bridges metric." Are you perhaps seeing all these bridges because of sequence isoforms (genuine or produced by the assembler(s))? I'm asking because I don't know if Transrate actually calculates these bridges by looking at contig flank overlaps as much as it does by looking at how well the contigs align with one another. I don't know how the ORP collapses redundant (or near-redundant) sequences, but this is also probably why you're seeing the high bridge estimates in that assembly as well.

Also, I believe the Transrate fragments metric is (or should be) equivalent to the read support you calculated from Bowtie2 (I guess it's just the number of read pairs that map concordantly?).

I do also remember seeing it being stated somewhere that Transrate is no longer being developed.

I don't know if > 300,000 assembled contigs is necessarily indicative of anything "bad". For instance I have ca. 1,800,000 contigs from a somewhat similar pooled de novo assembly like yours (same situation, no reference data whatsoever). It's just heavily redundant, and needs to be clustered down (probably to like 90% sequence identity, retaining the longest contigs). Also 300,000 contigs does not equal 300,000 protein coding sequences.

The BUSCO scores look good though (not so much for the third assembly; but it's also a good assembly). IMHO, the first assembly there is probably the "best" once you cluster away some contigs (greater N50 need not necessarily mean more complete ORFs or whatever as far as I know however). I suppose the second assembly would work just as well.

Some questions:

  • What tools are you planning on using for annotation, if I may ask?
  • Did you run any pre-processing on the reads? I ended up with raw reads -> fastp -> Kraken2 -> SortMeRNA -> Trinity -> ???? (annotation) at the moment.

Looking forward to reading your thoughts and inputs!!

ADD REPLY
0
Entering edit mode

Hi Dunois!

It feels good knowing I am not alone in the struggle. However, I feel you are way ahead of me, not sure my thoughts and input will provide much to our discussion. I will try my best!

Regarding the bridges I am not sure how they are calculated or how useful they can be. When learning about the Transrate output in the manual, I came across the concept of bridges and thought it could be a good asset to improve the quality of the assembly by combining different fragmented transcripts into “complete” ones. But you are right that isoforms can mess with this or you can fall into the dangers of chimeras. I just asked in case there was a software well known to deal with this that I missed by not googling the right words.

I am not greatly concerned about having many transcripts, with the 150bp illumina reads it might be the best I can get. I am considering running extra samples on PacBio to solve this “issue”. Regarding redundancy, I ran cd-hit-est on my biggest assembly and reduced the number of contigs by around 100,000 but it also reduced the BUSCO score so I am not sure it is worth it. I am considering trying the EvidentialGene tr2aacds pipeline that apparently identifies and merges high-quality non-redundant transcripts from different assemblies. But I want to look more into it since it appears to merge the assemblies using transfuse and I read somewhere (do not remember where sorry) that transfuse might not be working properly. Again, I am still looking into this, do not take my word for it.

I agree with you that the first assembly looks like the best one. It is just shocking to me since it is the first one I did, using my uninformed judgement to set up the Trimmomatic options and going straight into Trinity with default settings. I was expecting the one from ORP to be the best one but it appears it is not the case, however there is always the possibility that I did not use it properly.

I go step by step and have not look that much into annotation yet. I was going to start with Trinotate and Dammit. Still in this stage of prerunning things where I make myself believe annotation will be fast and easy.

As for read pre-processing I was advised to use Trimmomatic, BBmerge and rcorr. BBmerge did not help, the assembly metrics were clearly worse than the other ones, not by much but enough to dismiss it. Never came across fastp but it looks worth to look into it! I read a bit on SortMeRNA and Kraken2 at the beginning and thought they were specific for metagenomes/metatranscriptomes but just had a look now and saw they might be useful to remove contamination before assembly. I will definitively look into all of these!

Thanks a lot for taking the time to reply!

ADD REPLY
0
Entering edit mode

The feeling is mutual!! That said, I don't think I'm any further ahead in this than you are.

  • Let's see if anybody else provides some input regarding the bridges. Would be cool to find out whether this hypothesis is correct.
  • Regarding the redundancy: does cd-hit-est retain the longest contig as the representative of the cluster? It could be that you're losing the longer contigs and/or you're clustering at too low a threshold and you're losing paralogs (or even entirely unrelated sequences) because of that. I think I saw some papers mention clustering at 96-98% identity (so I guess just enough to account for sequencing errors and whatnot, but no more). And I suppose contig extension-by-merging would be nice, but I haven't heard of a tool that can actually do it (perhaps something from the bbmap suite maybe?).
  • I had a subject expert look at EvidentialGene, and they weren't too impressed with the tool (no offense to the maintainer, who I think is doing a great job all the same). But probably worth giving a shot? And as far as I know, Transfuse--like Transrate--is no longer in development (I couldn't even install it properly when I tried, and I didn't bother with it after that). What I am doing now w.r.t. redundancy is to cluster both at the nucleotide and amino acid levels.
  • I feel (and this is just my opinion) that there isn't all that much more you can squeeze out of a de novo assembly without additional external information (e.g., a genome, identified orthologs, etc.). So I'm not too surprised the ORP didn't do much to improve the assembly--the multi-assembler approach seems to deliver marginal gains at best. I'd be interested in knowing how you ran the ORP, I never got that one going either.
  • I think reliable annotation will be the hardest part (I heard as much in a reply in one of my older posts on this topic). Trinotate, Dammit, and ENTAP are what I am considering at the moment (besides a few simpler ideas of my own). Like some of the tools we've discussed, development on Dammit also appears to be frozen. It was very buggy when I tried installing it, and GitHub issues went unanswered. But fingers crossed for a new release!!
  • Hmm BBmerge wouldn't help much here I guess? (I think Trinity and the other assemblers do run FLASH or something else internally to merge overlapping reads anyway?) What is rcorr? I've only been able to find references to the eponymous R function. Do you mean rcorrector perhaps? I ran SortMeRNA because I am mostly interested in protein coding sequences (and I only have poly-A data anyway). But I suppose for proper annotation, I'd have to assemble the rRNA data and annotate those alongside.

Since we're on the same boat, should we touch bases via email/Skype? It would be nice to collaborate on the annotation work if possible (lab restrictions and whatnot all considered), and potentially help each other out along the way.

ADD REPLY
0
Entering edit mode

Hi! Apparently, cd-hit-est reduces by filtering by longest-transcript. In the EvidentialGene website they advise against it and say that their pipeline avoids this issue. As I said, I haven’t look that much into it but I might give it a try.

As for the ORP it was hell to install. I had to use Docker (and learn how to) and I ran into many problems with system compatibility, but with help from the IT team from my institution I managed to run it. I can give you more specifics if you want.

You are right I meant rcorrector, I changed it on the original post :)

I am interested in generating a transcriptome as complete as my data allows for, but down the line I will be working mainly with protein coding as well.

Sure we can get in touch via mail/skype. However, I am working non-stop in the lab right now, I have the bioinformatics on hold. As I said, I am not sure I can provide much help but it is always nice to know you can contact with someone running into the same issues.

ADD REPLY
0
Entering edit mode

Hi,

I see. Yeah, it seems like clustering away to the longest contigs isn't always the best solution.

I see. I have my own pipeline for now, and I don't plan on using the ORP, so let's see!!

Hmm if you're working with protein coding data in the end, then ENTAP might be a good solution for annotation.

My project will also progress a bit slowly, so we'll have time to go over stuff I think. It'll be nice to be in touch all the same!! You can find me on Skype using this link <snip>. I'll delete the link once you join.

ADD REPLY
0
Entering edit mode

Hi! The link does not work :) It might have expired

ADD REPLY
0
Entering edit mode

That's really strange. Here's a new link (<snip>).

ADD REPLY

Login before adding your answer.

Traffic: 1913 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6