Hey Biostars!
I am attempting to assembly and annotate a de novo transcriptome from a vertebrate species with no references (transcriptome or genome). I managed to build several transcriptomes using sequencing data from 6 individuals (PE illumina reads). I read all the manuals and online help I was able to find but I need someone with experience to look at my numbers and give an informed opinion on what assembly is the best so I can proceed with annotation.
A very short summary of the pipelines for my “best” 3 assemblies:
Ass1 -> Raw reads – Trimmomatic
– Trinity
(using all 6 individuals together)
Ass2 -> Raw reads – Trimmomatic
– rcorrector
– Trinity
(using all 6 individuals together)
Ass3 -> Raw reads – Oyster River Protocol
(assembly of separate samples with three different assemblers Trinity
, Spades
and Shannon
, obtaining 6 assemblies, one per individual) – Orthofuse
(after assembly, I merged all 6 individual assemblies into one)
All of them have very good (>90%) alignment rates for the reads (bowtie2
) and the fragments (Transrate
).
Table of quality metrics for each assembly (in blue what is considered the best quality, in red the worst):
My questions are:
Q1 – What to do when the quality metrics point to different directions? Are there any settings (in the assembler options) that can help improve some of the metrics, so I can try to directly improve a specific assembly? (E.g. improve Transrate
score or busco
for Ass1).
Q2 – From all assemblies, I obtained lots of transcripts, more than I would expect a vertebrate to have (>300,000). This means I have a fragmented transcriptome. However, the Transrate
output says I have lots of potential bridges (>100,000). Is there a way I can use these potential bridges to enhance the quality of my transcriptome?
Any help of any sort would be greatly appreciated! I am new at this, I learned a lot from this site and the Biostars Handbook but I am in the need of specific advice on my data :)
Thank you so much for your help.
P.S. I cannot thank enough the BioStar team and all people contributing to this site and the Biostars Handbook for all their help they have already given me. As a suggestion, a chapter (or a book) within the Biostars Handbook about de novo transcriptome assembly and annotation would be awesome :P
Not an expert, but a fellow journey-person down this path. Here's what I can share based on my experiences:
From the
Transrate
website: "Transrate
calculates whether the read mappings contain any evidence that different contigs originate from the same transcript. These theoretical links are called bridges, and the number of bridges is shown in the supported bridges metric." Are you perhaps seeing all these bridges because of sequence isoforms (genuine or produced by the assembler(s))? I'm asking because I don't know ifTransrate
actually calculates these bridges by looking at contig flank overlaps as much as it does by looking at how well the contigs align with one another. I don't know how theORP
collapses redundant (or near-redundant) sequences, but this is also probably why you're seeing the high bridge estimates in that assembly as well.Also, I believe the
Transrate
fragments metric is (or should be) equivalent to the read support you calculated fromBowtie2
(I guess it's just the number of read pairs that map concordantly?).I do also remember seeing it being stated somewhere that
Transrate
is no longer being developed.I don't know if > 300,000 assembled contigs is necessarily indicative of anything "bad". For instance I have ca. 1,800,000 contigs from a somewhat similar pooled de novo assembly like yours (same situation, no reference data whatsoever). It's just heavily redundant, and needs to be clustered down (probably to like 90% sequence identity, retaining the longest contigs). Also 300,000 contigs does not equal 300,000 protein coding sequences.
The BUSCO scores look good though (not so much for the third assembly; but it's also a good assembly). IMHO, the first assembly there is probably the "best" once you cluster away some contigs (greater N50 need not necessarily mean more complete ORFs or whatever as far as I know however). I suppose the second assembly would work just as well.
Some questions:
fastp
->Kraken2
->SortMeRNA
->Trinity
-> ???? (annotation) at the moment.Looking forward to reading your thoughts and inputs!!
Hi Dunois!
It feels good knowing I am not alone in the struggle. However, I feel you are way ahead of me, not sure my thoughts and input will provide much to our discussion. I will try my best!
Regarding the bridges I am not sure how they are calculated or how useful they can be. When learning about the
Transrate
output in the manual, I came across the concept of bridges and thought it could be a good asset to improve the quality of the assembly by combining different fragmented transcripts into “complete” ones. But you are right that isoforms can mess with this or you can fall into the dangers of chimeras. I just asked in case there was a software well known to deal with this that I missed by not googling the right words.I am not greatly concerned about having many transcripts, with the 150bp illumina reads it might be the best I can get. I am considering running extra samples on PacBio to solve this “issue”. Regarding redundancy, I ran
cd-hit-est
on my biggest assembly and reduced the number of contigs by around 100,000 but it also reduced the BUSCO score so I am not sure it is worth it. I am considering trying the EvidentialGene tr2aacds pipeline that apparently identifies and merges high-quality non-redundant transcripts from different assemblies. But I want to look more into it since it appears to merge the assemblies using transfuse and I read somewhere (do not remember where sorry) that transfuse might not be working properly. Again, I am still looking into this, do not take my word for it.I agree with you that the first assembly looks like the best one. It is just shocking to me since it is the first one I did, using my uninformed judgement to set up the
Trimmomatic
options and going straight intoTrinity
with default settings. I was expecting the one fromORP
to be the best one but it appears it is not the case, however there is always the possibility that I did not use it properly.I go step by step and have not look that much into annotation yet. I was going to start with
Trinotate
andDammit
. Still in this stage of prerunning things where I make myself believe annotation will be fast and easy.As for read pre-processing I was advised to use
Trimmomatic
,BBmerge
andrcorr
.BBmerge
did not help, the assembly metrics were clearly worse than the other ones, not by much but enough to dismiss it. Never came acrossfastp
but it looks worth to look into it! I read a bit onSortMeRNA
andKraken2
at the beginning and thought they were specific for metagenomes/metatranscriptomes but just had a look now and saw they might be useful to remove contamination before assembly. I will definitively look into all of these!Thanks a lot for taking the time to reply!
The feeling is mutual!! That said, I don't think I'm any further ahead in this than you are.
cd-hit-est
retain the longest contig as the representative of the cluster? It could be that you're losing the longer contigs and/or you're clustering at too low a threshold and you're losing paralogs (or even entirely unrelated sequences) because of that. I think I saw some papers mention clustering at 96-98% identity (so I guess just enough to account for sequencing errors and whatnot, but no more). And I suppose contig extension-by-merging would be nice, but I haven't heard of a tool that can actually do it (perhaps something from thebbmap
suite maybe?).ORP
didn't do much to improve the assembly--the multi-assembler approach seems to deliver marginal gains at best. I'd be interested in knowing how you ran theORP
, I never got that one going either.Trinotate
,Dammit
, andENTAP
are what I am considering at the moment (besides a few simpler ideas of my own). Like some of the tools we've discussed, development onDammit
also appears to be frozen. It was very buggy when I tried installing it, and GitHub issues went unanswered. But fingers crossed for a new release!!BBmerge
wouldn't help much here I guess? (I thinkTrinity
and the other assemblers do runFLASH
or something else internally to merge overlapping reads anyway?) What isrcorr
? I've only been able to find references to the eponymousR
function. Do you meanrcorrector
perhaps? I ranSortMeRNA
because I am mostly interested in protein coding sequences (and I only have poly-A data anyway). But I suppose for proper annotation, I'd have to assemble the rRNA data and annotate those alongside.Since we're on the same boat, should we touch bases via email/Skype? It would be nice to collaborate on the annotation work if possible (lab restrictions and whatnot all considered), and potentially help each other out along the way.
Hi! Apparently,
cd-hit-est
reduces by filtering by longest-transcript. In the EvidentialGene website they advise against it and say that their pipeline avoids this issue. As I said, I haven’t look that much into it but I might give it a try.As for the
ORP
it was hell to install. I had to use Docker (and learn how to) and I ran into many problems with system compatibility, but with help from the IT team from my institution I managed to run it. I can give you more specifics if you want.You are right I meant
rcorrector
, I changed it on the original post :)I am interested in generating a transcriptome as complete as my data allows for, but down the line I will be working mainly with protein coding as well.
Sure we can get in touch via mail/skype. However, I am working non-stop in the lab right now, I have the bioinformatics on hold. As I said, I am not sure I can provide much help but it is always nice to know you can contact with someone running into the same issues.
Hi,
I see. Yeah, it seems like clustering away to the longest contigs isn't always the best solution.
I see. I have my own pipeline for now, and I don't plan on using the
ORP
, so let's see!!Hmm if you're working with protein coding data in the end, then
ENTAP
might be a good solution for annotation.My project will also progress a bit slowly, so we'll have time to go over stuff I think. It'll be nice to be in touch all the same!! You can find me on Skype using this link <snip>. I'll delete the link once you join.
Hi! The link does not work :) It might have expired
That's really strange. Here's a new link (<snip>).