Hi friends. I ran a comparative test between two pipelines: (Trimmomatic + Trinity + BUSCO) versus (Trinity + BUSCO). I found that the pipeline run on the raw sequences gave a substantial gain in BUSCO groups (RNA-seq, transcriptome, Illumina HiSeq, paired-end). Now I am undecided: should I proceed with the analyses without trimming? What do you suggest?
pipeline (Trimmomatic + Trinity + busco)
Trinity
Total assembled = 63677017
Number of contigs = 78868
Number of Trinity unigenes = 60510
Contigs longer than 1000 = 18429
Contigs longer than 2000 = 7059
Contigs longer than 10000 = 79
Longest contig = 28033
Median = 403
Average = 807.39
N50 = 1451
Busco
C:88.8%[S:58.6%,D:30.2%],F:3.6%,M:7.6%,n:2510
2229 Complete BUSCOs (C)
1471 Complete and single-copy BUSCOs (S)
758 Complete and duplicated BUSCOs (D)
90 Fragmented BUSCOs (F)
191 Missing BUSCOs (M)
2510 Total BUSCO groups searched
pipeline (Trinity + Busco)
Trinity
Total assembled = 69407639
Number of contigs = 84709
Number of Trinity unigenes = 64655
Contigs longer than 1000 = 19883
Contigs longer than 2000 = 7917
Contigs longer than 10000 = 93
Longest contig = 33336
Median = 400
Average = 819.37
N50 = 1506
Busco
C:89.8%[S:58.0%,D:31.8%],F:2.8%,M:7.4%,n:2510
2256 Complete BUSCOs (C)
1457 Complete and single-copy BUSCOs (S)
799 Complete and duplicated BUSCOs (D)
70 Fragmented BUSCOs (F)
184 Missing BUSCOs (M)
2510 Total BUSCO groups searched
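To make the size of the gain concrete, the two BUSCO summaries above can be compared category by category. This is only a minimal sketch: the counts are copied from this post, and the function name `busco_delta` is made up for illustration.

```python
# BUSCO counts copied from the two runs reported above.
TOTAL = 2510
trimmed = {"C": 2229, "S": 1471, "D": 758, "F": 90, "M": 191}
untrimmed = {"C": 2256, "S": 1457, "D": 799, "F": 70, "M": 184}

def busco_delta(before, after):
    """Return the per-category change in BUSCO counts (after - before)."""
    return {key: after[key] - before[key] for key in before}

delta = busco_delta(trimmed, untrimmed)
for key, diff in delta.items():
    # Show the change as a count and as percentage points of all groups searched.
    print(f"{key}: {diff:+d} ({100 * diff / TOTAL:+.2f} percentage points)")
```

With these numbers the untrimmed run has 27 more complete BUSCOs (about 1.1 percentage points), but single-copy drops by 14 while duplicated rises by 41, so most of the gain sits in the duplicated category.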
If there is any extraneous sequence (that does not belong to the genome you are working with) going into the assembly then that assembly is not correct. No matter what the stats say.
I used the ortholog database that matches my species (order level). Do you believe that, even so, there may be redundancies?
Note: my raw data has a very good quality profile, which is why I decided to test an assembly from the raw data.
Thanks.
Trimmomatic is being used to remove adapter sequences, correct? Those have no place in your de novo assembly. Since you get different results when you do not trim, there must be some extraneous sequence in your reads. That should not be included in the assembly.
The difference in total assembled bases is below. To be fair, this is only what got assembled; we don't know whether more sequence than that went in. Have you checked stats on the actual input?
69407639 (no trim) - 63677017 (trimmed) = 5730622
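One way to check the stats on the actual input is to count reads and bases in the raw and trimmed FASTQ files. The sketch below is an assumption-laden example, not part of any tool mentioned here: it assumes plain 4-line FASTQ records (no wrapped sequence lines), and `fastq_stats` is a hypothetical helper name.

```python
import gzip

def fastq_stats(path):
    """Return (read_count, total_bases) for a FASTQ file, plain or gzipped.

    Assumes each record is exactly four lines: header, sequence, '+', quality.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    reads = 0
    bases = 0
    with opener(path, "rt") as handle:
        for index, line in enumerate(handle):
            if index % 4 == 1:  # second line of each record is the sequence
                reads += 1
                bases += len(line.strip())
    return reads, bases
```

Running it on the raw and the trimmed files for each mate (the file names are yours to fill in) shows how many reads and bases the trimming step actually removed, which can then be set against the 5730622-base difference in the assemblies.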
I trimmed the read ends, indexes, and adapters, so I ended up with fewer reads. Isn't it possible that, when I trimmed, I removed not only these "leftovers" but also genuine transcript data, which BUSCO recognized?
Thanks for discussion.
If you did extra trimming beyond what normal scanning/trimming for adapters does, then you could potentially have lost data. But that does not mean you can go back and do NO trimming. You still need to trim so that no extraneous sequence (sorry to harp on it) gets into your assembly.
Would it be possible for me to assess whether the gain is redundancy or real transcript data? All quality analyses (FastQC) indicated that there was no contamination from another organism, and the metrics were very good, with Phred scores higher than 35.
Unless you specifically scan and trim you are not going to remove adapter contamination, if any. FastQC does not look at your entire dataset for all metrics. It sub-samples data for many of the tests it does. That is generally a good approximation for gross quality.
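A rough way to look at every read rather than a sub-sample is to count how many reads contain an adapter sequence. The sketch below is only illustrative: it assumes 4-line FASTQ records, and it assumes the 13-base Illumina TruSeq adapter prefix `AGATCGGAAGAGC`, which you should replace with whatever adapters your library prep used. `adapter_fraction` is a hypothetical helper name.

```python
import gzip

# Assumption: common 13-base prefix of Illumina TruSeq adapters.
# Replace it with the adapter used in your own library prep.
ADAPTER = "AGATCGGAAGAGC"

def adapter_fraction(path, adapter=ADAPTER):
    """Return the fraction of reads containing an exact copy of `adapter`.

    Scans every record of a plain or gzipped 4-line FASTQ file.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    total = 0
    hits = 0
    with opener(path, "rt") as handle:
        for index, line in enumerate(handle):
            if index % 4 == 1:  # sequence line of each record
                total += 1
                if adapter in line:
                    hits += 1
    return hits / total if total else 0.0
```

Because this only finds exact, full-length matches, it misses partial adapters at read ends and adapters with sequencing errors, so treat the result as a lower bound. A dedicated trimmer handles those cases.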
I suggest you take a look at bbduk.sh, which is an efficient scan/trim program, if you are inclined. A guide is available.
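As a starting point, an adapter-only trimming call along the lines of the BBDuk guide can be assembled like this. Treat it as a sketch under assumptions: the file names are placeholders, `adapters.fa` is assumed to be the adapter reference shipped with BBTools (point `ref=` at wherever it lives on your system), and `bbduk_adapter_trim_cmd` is a hypothetical helper name. Check the parameter values against the guide before using them.

```python
import shlex

def bbduk_adapter_trim_cmd(r1, r2, out1, out2, adapters="adapters.fa"):
    """Build a bbduk.sh argument list for paired-end adapter trimming."""
    return [
        "bbduk.sh",
        f"in1={r1}",
        f"in2={r2}",
        f"out1={out1}",
        f"out2={out2}",
        f"ref={adapters}",  # assumed path to the adapter FASTA
        "ktrim=r",          # trim the adapter match and everything to its right
        "k=23",             # k-mer length used for matching
        "mink=11",          # allow shorter k-mers at the read ends
        "hdist=1",          # allow one mismatch per k-mer
        "tpe",              # trim both mates to the same length
        "tbo",              # also trim adapters detected by pair overlap
    ]

# Placeholder file names; substitute your own.
cmd = bbduk_adapter_trim_cmd(
    "raw_R1.fastq.gz", "raw_R2.fastq.gz",
    "clean_R1.fastq.gz", "clean_R2.fastq.gz",
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run(cmd, check=True)`. Note that this does adapter trimming only, with no quality or fixed-length trimming, so it avoids the extra trimming that may have cost you data.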