Hi everybody,
I'm working on some transcriptomes from non-model organisms sequenced on Illumina, and I'm facing a problem I haven't encountered before. In short, my data come from 3 samples, each consisting of a pool of 5 whole specimens, sequenced on an Illumina sequencer. I checked the quality with FastQC and trimmed with Trimmomatic accordingly. After that, I concatenated the resulting files to make a single assembly from the ~100M reads. Then I ran a standard Trinity assembly (without in silico normalization). Here is where the strange part starts:
The assembly resulted in 526860 transcripts (isoform level) with an N50 of 858 and a median contig length of 377. In addition (and this is what really worries me), I ran BUSCO to assess completeness and got the following result: C:98.6%[S:18.0%,D:80.6%],F:1.3%,M:0.1%,n:978.
This duplication level is ridiculously high, and I don't really know what is causing it. I've checked the BUSCO documentation and both Biostars and SEQanswers, but I haven't found duplication levels like this reported for a transcriptome. Have you had any similar experience? Do you have any suggestions for bringing these numbers down?
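In case it helps anyone read the numbers, here is a minimal Python sketch of how I break down the BUSCO summary line. The function name `parse_busco_summary` is my own and not part of BUSCO; the regular expression simply assumes the one-line format shown above.

```python
import re

def parse_busco_summary(line):
    """Parse a one-line BUSCO summary into percentages (floats) and n (int).

    Hypothetical helper, not part of BUSCO; assumes the format
    C:..%[S:..%,D:..%],F:..%,M:..%,n:...
    """
    pattern = (r"C:([\d.]+)%\[S:([\d.]+)%,D:([\d.]+)%\],"
               r"F:([\d.]+)%,M:([\d.]+)%,n:(\d+)")
    match = re.search(pattern, line)
    if match is None:
        raise ValueError("not a BUSCO summary line")
    c, s, d, f, m = (float(x) for x in match.groups()[:5])
    return {"C": c, "S": s, "D": d, "F": f, "M": m, "n": int(match.group(6))}

summary = parse_busco_summary("C:98.6%[S:18.0%,D:80.6%],F:1.3%,M:0.1%,n:978")

# Approximate number of BUSCOs reported as complete and duplicated.
n_duplicated = round(summary["D"] / 100 * summary["n"])

# Share of the complete BUSCOs that are duplicated.
dup_share_of_complete = summary["D"] / summary["C"]
```

For the line above this works out to roughly 788 of the 978 BUSCOs being found in more than one transcript, that is, about 82% of the complete ones.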
I'm stuck with this and would really appreciate any help.
Thanks!
Hi!
Thank you very much for your suggestions. I don't know how I could have missed this in the Trinity FAQ...
Now I'm trying 2 approaches: 1. Follow the Trinity FAQ advice and keep all transcripts for downstream analysis. 2. Use CD-HIT (which has reduced the BUSCO D value to 46% without any decrease in the C value) and then select the most expressed isoform per Trinity 'gene', as you suggested.
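For the second approach, this is a rough Python sketch of the "most expressed isoform per gene" step, not the exact procedure I ran. It assumes Trinity-style transcript IDs, where the gene ID is the transcript ID without its trailing `_i<N>` suffix, and a per-transcript expression table (for example TPM from whichever quantifier you use). The function names and the toy IDs and TPM values are made up for illustration.

```python
import re

def gene_id(transcript_id):
    """Strip the trailing isoform suffix (_i<N>) from a Trinity transcript ID."""
    return re.sub(r"_i\d+$", "", transcript_id)

def top_isoform_per_gene(tpm_by_transcript):
    """Return {gene ID: ID of its most expressed isoform}.

    Hypothetical helper; ties keep the isoform seen first.
    """
    best = {}
    for transcript, tpm in tpm_by_transcript.items():
        gene = gene_id(transcript)
        if gene not in best or tpm > best[gene][1]:
            best[gene] = (transcript, tpm)
    return {gene: transcript for gene, (transcript, _) in best.items()}

# Toy values for illustration only, not from my data.
toy_tpm = {
    "TRINITY_DN10_c0_g1_i1": 3.2,
    "TRINITY_DN10_c0_g1_i2": 15.7,
    "TRINITY_DN10_c0_g2_i1": 0.9,
}
selected = top_isoform_per_gene(toy_tpm)
```

The IDs in `selected.values()` can then be used to subset the assembly FASTA before rerunning BUSCO.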
I will run both analyses for comparison and post an update here on what happens, in case someone finds it useful.