Since our Trinity output assembly ('original') had a lot of duplicated matches with BLASTx, we decided to try reducing redundancy with the tr2aacds script from the EvidentialGene project. tr2aacds filters and merges contigs according to their coding potential and percent identity, which sounds more legit than the blast2cap3 approach or simple duplicate removal.
To compare the original and filtered assemblies, we ran some checks with BUSCO and BLASTx. The results: yes, duplicates decreased (BUSCO), but the number of missing and fragmented contigs increased. The filtered (non-redundant) assemblies also still give some duplicated BLAST results, albeit far fewer. We're afraid of losing biologically meaningful data, but redundancy also causes problems in further analysis.
Does anybody use tr2aacds to reduce redundancy in de novo assemblies?
This "..increased number of missing and fragmented contigs.." could be due to various things, but two that I know or surmise from experience are part of your results:
1. Trinity, and other assemblers, produce joined genes (fusions, chimeras) that can be measured as existing/full genes by BLASTx or whatever BUSCO software you use, because of the way those measures work. However, a transcript made up of two or more gene loci isn't what Evigene considers accurate. You can instead make protein translations of your transcripts (as Evigene does), then measure with BLASTp against reference proteins to count valid proteins. Or else check your BLASTx results for cases of joined genes (before Evigene reduction).
1b. Using several gene assemblers, such as Velvet/Oases, idba_tran, Soap_Trans, with multi-kmer options, will produce a more complete gene set from your RNA than using Trinity alone (those others resolve gene joins and fragments better for loci where Trinity fails). That is what Evigene was designed for: reducing many gene assemblies to the best coding gene subset.
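The check for joined genes in BLASTx results suggested above could be sketched roughly as follows. This is only a heuristic illustration, not part of Evigene: it assumes standard 12-column tabular BLAST output (`-outfmt 6`, or `-outfmt 7` with comment lines), and flags a transcript when its hits to two different reference proteins cover essentially non-overlapping parts of the query. The function name and the `min_bitscore` and `max_overlap` thresholds are made up for the example. Hits to paralogs or multi-domain proteins can trigger false positives, so flagged transcripts still need a manual look.

```python
from collections import defaultdict


def find_joined_candidates(blast_lines, min_bitscore=50.0, max_overlap=10):
    """Flag transcripts whose hits to different reference proteins
    cover non-overlapping regions of the query (possible joined genes).

    blast_lines: iterable of tab-separated BLAST lines (12 standard columns).
    Returns {transcript_id: [(ref_a, ref_b), ...]}.
    """
    # transcript -> {reference: (query_lo, query_hi, bitscore)} for the best hit per reference
    hits = defaultdict(dict)
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and outfmt 7 comment lines
        fields = line.rstrip("\n").split("\t")
        qid, sid = fields[0], fields[1]
        qstart, qend = int(fields[6]), int(fields[7])
        bits = float(fields[11])
        if bits < min_bitscore:
            continue
        # BLASTx reports reverse-strand hits with qstart > qend
        lo, hi = min(qstart, qend), max(qstart, qend)
        best = hits[qid].get(sid)
        if best is None or bits > best[2]:
            hits[qid][sid] = (lo, hi, bits)

    candidates = {}
    for qid, by_ref in hits.items():
        refs = sorted(by_ref)
        pairs = []
        for i, ref_a in enumerate(refs):
            for ref_b in refs[i + 1:]:
                a_lo, a_hi, _ = by_ref[ref_a]
                b_lo, b_hi, _ = by_ref[ref_b]
                # negative or small overlap means the two hits sit side by side
                overlap = min(a_hi, b_hi) - max(a_lo, b_lo) + 1
                if overlap <= max_overlap:
                    pairs.append((ref_a, ref_b))
        if pairs:
            candidates[qid] = pairs
    return candidates
```

Typical use would be `find_joined_candidates(open("blastx.tsv"))`, where `blastx.tsv` is a placeholder for your own BLASTx table of the original (pre-reduction) assembly.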
2. Some tr2aacds settings can be changed to return more of the smaller proteins, if those are what is now missing from your reduced transcript set. You can check which genes are missing, and if they are small ones (e.g. 30 to 60 amino acids, or smaller), resetting some of the tr2aacds minimum protein size settings will recover them. Alternatively, you can add reference BLAST scores to tr2aacds for each input transcript, to retain those with good reference alignments. As it sounds like you already have BLAST scores for your input transcripts, make an aablast table of those (trid <tab> refid <tab> blast_bitscore <newline>), then run tr2aacds with that option (tr2aacds.pl -ablastab aablast.tab) [I think it will also read standard blastp/blastx -outfmt 7 tables].
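A minimal sketch of building such a three-column table from existing BLAST output, assuming the standard 12-column tabular format (`-outfmt 6`, or `-outfmt 7` with comment lines) where the bit score is the last column. Keeping only the best-scoring reference per transcript is an assumption made for this example; the post above does not say whether tr2aacds wants one row per transcript or all hits. The function name is made up.

```python
def blast_to_aablast(blast_lines):
    """Reduce tabular BLAST hits to 'trid<TAB>refid<TAB>bitscore' rows,
    keeping the highest-scoring reference for each transcript."""
    best = {}  # trid -> (refid, bitscore as float, bitscore as written)
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and outfmt 7 comment lines
        fields = line.rstrip("\n").split("\t")
        trid, refid, raw_bits = fields[0], fields[1], fields[11].strip()
        bits = float(raw_bits)
        if trid not in best or bits > best[trid][1]:
            best[trid] = (refid, bits, raw_bits)
    # dicts preserve insertion order, so rows follow first appearance in the input
    return [f"{trid}\t{refid}\t{raw}" for trid, (refid, _, raw) in best.items()]
```

The rows can then be written out with something like `open("aablast.tab", "w").write("\n".join(rows) + "\n")` and the file passed to tr2aacds with the `-ablastab` option mentioned above.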
I've got your points about the missing proteins, but could you comment on the excessive annotation that we're trying to reduce? Is tr2aacds really suited to this job?
We set the Evigene script's options according to our task, but we didn't know that it's possible to feed BLAST results to it. Thanks!
Hello crimsontabaq,
I have some questions about how you use that tool. I looked for a way to send you a private message, but I think that is not possible in this forum, so I have to put my question here (sorry). How did you apply the EG approach? Did you modify any of the configuration files or not?
Thank you for your time.
Hello Pablo! We just used one of the Evigene scripts supplied with the project. We looked through the configs and didn't find anything related to our job, so we just passed the needed options to the script itself at run time.
Thank you for the clarification, I have done the same. In our case we really need to reduce the redundancy of our transcriptome, because we obtained more than 1,000,000 transcripts and CD-HIT-EST didn't help (it reduced the dataset, but we still had 900k transcripts). For that reason we didn't check for the effects that you found. Maybe we had the same issue, maybe not; I'll try to check, but as we used the same assembler I expect the same "problems".
Wow, one million. Just a wild guess: have you changed the min contig size in Trinity? We set this value to the minimum protein size of a related species multiplied by 3. Maybe not quite the right approach, but the resulting assembly is quite OK except for the issues I described earlier.
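The rule of thumb above (minimum protein length times 3, since each amino acid is encoded by one 3-nt codon) works out as in this small sketch. The function name and the optional stop-codon allowance are illustrative additions, not something stated in the thread; the result would be the value given to Trinity's `--min_contig_length` option.

```python
def min_contig_length(min_protein_aa, include_stop=False):
    """Nucleotide length needed to encode a protein of min_protein_aa residues.

    include_stop adds one extra codon for the stop codon (an optional
    refinement assumed here, not part of the rule described in the thread).
    """
    codons = min_protein_aa + (1 if include_stop else 0)
    return codons * 3


# Example: if the shortest protein of interest in a related species is 100 aa,
# the cutoff is 300 nt, or 303 nt when the stop codon is counted.
print(min_contig_length(100))
print(min_contig_length(100, include_stop=True))
```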
No, we left the default config; I think it is something like 200 nt minimum size. In my opinion your way of doing it is fine, but my supervisor is paranoid about losing biologically relevant information, even when we end up with an assembly containing a huge amount of highly redundant data (and, for sure, also a lot of artifacts).
Hi, I am new to the bioinformatics field... I am trying to remove redundancies and came across your post. I used tr2aacds.pl from EvidentialGene and got a problematic FASTA file in which transcripts had an additional line after the ">" line, as follows:
I took the file from okayset. Did I do something wrong while running it? Thanks!!! Reut
Please do not add comments via the answer field. Use "Add comment/reply" instead. Also, please use the code option to highlight code.