Hi,
I'm assembling transcriptomes for three genotypes of a plant (a non-model organism) under different conditions, without a reference genome. First, I ran Trinity on all samples and got many transcripts shorter than 500 bp (more than 70000 transcripts), but that is not a problem in itself. Then I refined the assembly by clustering with cd-hit, which improved the BUSCO score by reducing the number of duplicated BUSCOs. However, when it came to functional annotation, I could not reach 40% annotated sequences with a homology search (BLAST against SwissProt, TrEMBL, and NCBI_nr).
I need to find differentially expressed genes within each sample and between conditions.
My raw data are single-end reads with a length of 50 bp. For each genotype, we have three biological replicates of each condition (treated_Nacl, untreated).
To increase the number of annotated transcripts, I ran Trinity again to assemble each genotype separately de novo, specifying a minimum contig length of 500 bp (--min_contig_length 500). Then I used cd-hit to obtain a non-redundant assembly that can be used for differential expression analysis across all samples. Unfortunately, the newly generated "clustered" FASTA file (39124 transcripts) has a high number of BUSCO duplicates, even though a high proportion of its transcripts are functionally annotated (66.95%, 45%, and 82% against SwissProt, TrEMBL, and NCBI_nr, respectively).
These are the commands I used for each run:
### First de novo assembly, all samples
Trinity --seqType fq --samples_file $sample_file --CPU 64 --max_memory 500G --full_cleanup
### Genotype-specific assembly
Trinity --seqType fq --samples_file $sample_file --CPU 64 --max_memory 500G --full_cleanup --min_contig_length 500
### cd-hit
cd-hit-est -i $inputfile -T 0 -o $outputfile -M 0
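To put a number on how fragmented each assembly is (for example the share of transcripts under 500 bp), a small length summary can be run on any of the FASTA files above. This is only a minimal sketch: the function names and the 500 bp default cutoff are my own choices for illustration, not part of Trinity or cd-hit.

```python
from typing import Dict, Iterable, List


def read_fasta_lengths(lines: Iterable[str]) -> List[int]:
    """Return the length of every sequence found in FASTA-formatted lines."""
    lengths: List[int] = []
    current = None  # length of the record being read; None before the first header
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths: List[int]) -> int:
    """Length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def summarize(lengths: List[int], cutoff: int = 500) -> Dict[str, float]:
    """Count transcripts, how many are shorter than `cutoff`, and the N50."""
    short = sum(1 for n in lengths if n < cutoff)
    return {
        "transcripts": len(lengths),
        "shorter_than_cutoff": short,
        "fraction_short": short / len(lengths) if lengths else 0.0,
        "n50": n50(lengths),
    }


# Toy in-memory FASTA (made-up sequences), just to show the call pattern.
toy = [">t1", "ACGT", "ACGT", ">t2", "AC"]
print(summarize(read_fasta_lengths(toy), cutoff=5))
```

For a real assembly, pass an open file handle instead of the toy list, e.g. `summarize(read_fasta_lengths(open(path)))` with `path` pointing at the Trinity or cd-hit output.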
My questions are:
- Is the higher number of BUSCO duplications in the de novo transcripts of different genotypes of the same species typical?
- How can we get a consensus sequence from the three assembled genotypes? Is there an alternative to cd-hit?
- Is it okay to use the predicted ORFs from TransDecoder for the differential gene expression (DGE) analysis, since they represent the protein-coding genes with a higher level of confidence?
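For the duplication question, it may help to compare the duplicated fraction side by side across assemblies. The sketch below parses the one-line BUSCO summary; it assumes the usual `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` layout, and the function name and the example lines are hypothetical (the numbers are made up, not results from my runs).

```python
import re
from typing import Dict

# Assumed one-line BUSCO summary layout, e.g.
# "C:90.0%[S:50.0%,D:40.0%],F:4.0%,M:6.0%,n:1614"
_BUSCO_LINE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(text: str) -> Dict[str, float]:
    """Extract C/S/D/F/M percentages and the BUSCO count n from a summary."""
    match = _BUSCO_LINE.search(text)
    if match is None:
        raise ValueError("no BUSCO summary line found")
    scores = {key: float(value) for key, value in match.groupdict().items()}
    scores["n"] = int(match.group("n"))  # n is a count, not a percentage
    return scores


# Made-up summary lines for illustration only; NOT results from this thread.
summaries = {
    "all_samples": "C:80.0%[S:70.0%,D:10.0%],F:8.0%,M:12.0%,n:1614",
    "per_genotype_clustered": "C:90.0%[S:50.0%,D:40.0%],F:4.0%,M:6.0%,n:1614",
}
for name, line in summaries.items():
    scores = parse_busco_summary(line)
    print(name, "duplicated:", scores["D"], "% of", scores["n"], "BUSCOs")
```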
Any help or suggestions regarding the above would be much appreciated.
Thank you
Perhaps I am missing it in the text above but did you do a single assembly of all the data from this experiment?
If by "single assembly" you mean giving all the data to the Trinity assembler, then yes, I did that in my first run. I then refined it with cd-hit, but the result was too fragmented and gave low-quality functional annotation. Below is the sample file input
That is surprising. It may indicate some issue with the way you pre-processed the files and did the assembly. I see "filtered" in the file names, so something seems to have been done. In theory, using the entire dataset should give the most comprehensive view of the transcriptome.
To improve the quality, I processed the raw data with fastp using the following parameters:
fastp -i "${input_file}" -o "${output_file}" -q 35 -D -l 45 -p -w 68 -V --fix_mgi_id --cut_tail --cut_mean_quality 30
I suspect the read length is too short, which is why I ended up with short assembled transcripts. I will try tuning the Trinity assembly with all samples, changing only -l, and I will come back to report the results.
Hi GenoMax, first of all, thank you for your comments on this. There is still this question: "Is it okay to use the predicted ORF from TransDecoder for the Differential gene expression (DGE) analysis since they represent the representative proteins-coding gene with a higher level of confidence?" I would like your opinion on this question in particular.
Unless you are working with something totally exotic, at this point there should be some sequence information (genome and/or transcriptome) available in public databases. Have you tried to compare your transcriptome with what is available for a closely related species?
As a start, can you trim your dataset only for adapter sequences and then use that to do a trinity run? Don't filter for quality (unless you have really bad data with Q10 or less scores, which you should take out).
You probably do not want to hear this, but if your libraries are of sub-optimal quality then no amount of informatics is going to address that issue.
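A minimal sketch of what an adapter-only fastp call could look like, built as an argument list so it can be inspected before running. It assumes fastp's `--disable_quality_filtering` option and relies on adapter trimming being on by default; the helper name, the file names, and the thread count are hypothetical placeholders.

```python
import shlex
from typing import List


def fastp_adapter_only_cmd(reads_in: str, reads_out: str, threads: int = 8) -> List[str]:
    """Build a fastp command that trims adapters but skips quality filtering.

    Adapter trimming is left at fastp's default (enabled); none of the
    quality options from the earlier command (-q, --cut_tail,
    --cut_mean_quality) are passed.
    """
    return [
        "fastp",
        "-i", reads_in,
        "-o", reads_out,
        "--disable_quality_filtering",  # assumption: keep reads regardless of base quality
        "-w", str(threads),
    ]


# Hypothetical file names, for illustration only.
cmd = fastp_adapter_only_cmd("sample1.fastq.gz", "sample1.adapter_trimmed.fastq.gz")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be handed to `subprocess.run(cmd, check=True)`. The trimmed files would then go into the Trinity samples file in place of the quality-filtered ones.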