Question

Expert Guidance Required for De Novo Transcriptome Analysis

0

8 months ago

ben@f ▴ 20

Hi,

I'm assembling three genotypes of a plant (a non-model organism) under different conditions, without a reference genome. First, I used Trinity and got many transcripts shorter than 500 bp (more than 70,000 transcripts), but that is not a problem. Then I improved the assembly by clustering with cd-hit, which improved the BUSCO score by reducing the number of duplicated BUSCOs. However, when it came to functional annotation, I could not get 40% of the sequences annotated by homology search (BLAST against SwissProt, TrEMBL, and NCBI nr).
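For anyone who wants to reproduce the short-transcript count on their own assembly, here is a minimal sketch using awk. The function name `count_short_transcripts` is hypothetical, and it assumes a plain (uncompressed) FASTA stream such as the Trinity output.

```shell
# Count sequences in a FASTA stream that are shorter than a cutoff.
# Usage: count_short_transcripts CUTOFF < Trinity.fasta
# Prints "<total> <short>": all sequences, and those shorter than CUTOFF.
# (count_short_transcripts is a hypothetical helper name.)
count_short_transcripts() {
  awk -v cutoff="$1" '
    /^>/ {
      # A new header closes the previous record, if there was one.
      if (seen) { total++; if (len < cutoff) short++ }
      seen = 1
      len = 0
      next
    }
    {
      gsub(/[ \t\r]/, "")      # ignore stray whitespace in sequence lines
      len += length($0)
    }
    END {
      if (seen) { total++; if (len < cutoff) short++ }
      print total + 0, short + 0
    }
  '
}
```

Calling `count_short_transcripts 500 < Trinity.fasta` would then report how many transcripts fall below 500 bp out of the total.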

I need to find differentially expressed genes within each sample and between conditions.

My raw data are single-end reads with a length of 50 bp. For each genotype, we have three biological replicates per condition (NaCl-treated, untreated).

To increase the number of annotated transcripts, I ran Trinity again to assemble each genotype separately de novo, specifying a minimum contig length of 500 (--min_contig_length 500). Then I used cd-hit to obtain a non-redundant assembly that can be used for differential gene expression analysis across all samples. Unfortunately, the newly generated "clustered" FASTA file (39124 transcripts) has a high number of duplicated BUSCOs, even though a high proportion of transcripts are functionally annotated (66.95%, 45%, and 82% against SwissProt, TrEMBL, and NCBI nr, respectively).

These are the commands I used for each run:

### First de novo assembly, all samples
Trinity --seqType fq --samples_file "$sample_file" --CPU 64 --max_memory 500G --full_cleanup
### Genotype-specific assembly
Trinity --seqType fq --samples_file "$sample_file" --CPU 64 --max_memory 500G --full_cleanup --min_contig_length 500
### cd-hit
cd-hit-est -i "$inputfile" -T 0 -o "$outputfile" -M 0

My questions are:

Is the higher number of BUSCO duplications in the de novo transcripts of different genotypes of the same species typical?
How can we get a consensus sequence from the three assembled genotypes? Is there an alternative to cd-hit?

Is it okay to use the predicted ORFs from TransDecoder for the differential gene expression (DGE) analysis, since they represent protein-coding genes with a higher level of confidence?

Any help or suggestions regarding the above would be greatly appreciated.

Thank you

Transcriptome De-novo • 792 views

ADD COMMENT • link updated 8 months ago by GenoMax 147k • written 8 months ago by ben@f ▴ 20

0

Perhaps I am missing it in the text above but did you do a single assembly of all the data from this experiment?

ADD REPLY • link 8 months ago by GenoMax 147k

0

If by "single assembly" you mean giving all the data to the Trinity assembler, I did that in my first run. I then improved it with cd-hit, but the assembly was too fragmented and resulted in low-quality functional annotation. Below is the sample file input:

Control A1  A1_143_filtred.fastq.gz
Control A5  A5_144_filtred.fastq.gz
Control A6  A6_8_filtred.fastq.gz   
Treated B8  B8_9_filtred.fastq.gz
Treated B9  B9_11_filtred.fastq.gz
Treated B14 B14_12_filtred.fastq.gz
Control C16 C16_13_filtred.fastq.gz
Control C17 C17_14_filtred.fastq.gz
Control C18 C18_15_filtred.fastq.gz
Control E32 E32_19-filtred.fastq.gz
Control E33 E33_20_filtred.fastq.gz
Control E34 E34_21_filtred.fastq.gz 
Treated F38 F38_23_filtred.fastq.gz
Treated F39 F39_24_filtred.fastq.gz
Treated F40 F40_26_filtred.fastq.gz 
Treated D22 D22_16_filtred.fastq.gz
Treated D24 D24_17_filtred.fastq.gz
Treated D25 D25_18_filtred.fastq.gz
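Since Trinity's --samples_file is documented as tab-delimited, it can be worth checking the file before a long run. Below is a minimal sketch, assuming the single-end layout of three tab-separated columns (condition, replicate name, FASTQ file); the function name `check_samples_file` is hypothetical.

```shell
# Summarise a single-end Trinity samples file.
# Assumed layout: condition <TAB> replicate name <TAB> FASTQ file.
# Usage: check_samples_file samples.txt
# Prints "<condition> <replicate count>" per condition, in order of first
# appearance, and returns non-zero if any non-blank line does not have
# exactly three tab-separated columns.
# (check_samples_file is a hypothetical helper name.)
check_samples_file() {
  awk -F'\t' '
    { sub(/[ \t\r]+$/, "") }          # drop trailing whitespace
    $0 == "" { next }                 # skip blank lines
    NF != 3 {
      printf("line %d: expected 3 tab-separated columns, found %d\n", NR, NF) > "/dev/stderr"
      bad++
      next
    }
    {
      if (!($1 in reps)) order[++n] = $1
      reps[$1]++
    }
    END {
      for (i = 1; i <= n; i++) print order[i], reps[order[i]]
      exit (bad > 0)
    }
  ' "$1"
}
```

If the columns were typed with spaces instead of tabs, every line is reported as malformed, which is a cheap way to catch that mistake early.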

ADD REPLY • link 8 months ago by ben@f ▴ 20

1

That is surprising. It may indicate an issue with how the files were pre-processed or how the assembly was done. I see "filtered" in the file names, so something seems to have been done to the reads. In theory, using the entire dataset should give the most comprehensive view of the transcriptome.

ADD REPLY • link 8 months ago by GenoMax 147k

0

I processed the raw data with fastp using the following parameters in order to improve the quality: fastp -i "${input_file}" -o "${output_file}" -q 35 -D -l 45 -p -w 68 -V --fix_mgi_id --cut_tail --cut_mean_quality 30. I suspect the read length is too short, so I ended up with short assembled transcripts. I will try tuning the Trinity assembly with all samples using only the minimum contig length option (--min_contig_length), and I will come back to report the results.

ADD REPLY • link 8 months ago by ben@f ▴ 20

0

Hi GenoMax, first of all, thank you for your comment. However, one question remains: "Is it okay to use the predicted ORFs from TransDecoder for the differential gene expression (DGE) analysis, since they represent protein-coding genes with a higher level of confidence?" I would particularly like your opinion on this one.

ADD REPLY • link 8 months ago by ben@f ▴ 20

0

Unless you are working with something totally exotic, at this point there should be some sequence information (genome and/or transcriptome) available in public databases. Have you tried to compare your transcriptome with what is available for a closely related species?

As a start, can you trim your dataset for adapter sequences only and then use that for a Trinity run? Don't filter for quality (unless you have really bad data with scores of Q10 or less, which you should take out).
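A rough sketch of what adapter-only trimming could look like with fastp, for illustration. It relies on fastp's `-Q` option, which disables quality filtering while adapter trimming stays on by default. The helper names, the `_adaptertrim` output suffix, and the thread count are assumptions, and fastp's other defaults (such as its length filter) are left untouched. By default the command is only printed, so it can be inspected before anything runs.

```shell
# Print (or run) a command. DRY_RUN defaults to 1, so commands are only
# printed; set DRY_RUN=0 to actually execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Adapter-only trimming of one single-end FASTQ file with fastp.
# Usage: trim_adapters_only SAMPLE.fastq.gz
# The output name (SAMPLE_adaptertrim.fastq.gz) is a hypothetical convention.
trim_adapters_only() {
  in_fq="$1"
  out_fq="${in_fq%.fastq.gz}_adaptertrim.fastq.gz"
  # -Q turns off quality filtering; no -q, --cut_tail or -l 45 as in the
  # original command, so reads are not discarded for quality or length 45.
  run fastp -i "$in_fq" -o "$out_fq" -Q -w 8
}
```

For example, `trim_adapters_only raw_A1.fastq.gz` prints the fastp call for a (hypothetical) raw file, and `DRY_RUN=0 trim_adapters_only raw_A1.fastq.gz` runs it.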

You probably do not want to hear this but if your libraries are of sub-optimal quality then no amount of informatics is going to address that issue.

ADD REPLY • link 8 months ago by GenoMax 147k

score 1 · Answer 1 · 2024-03-18

1

8 months ago

Doomhammer ▴ 10

Do you have enough clean data to assemble each genotype independently? I think we should first find the differences in transcripts among the genotypes, then focus on Control vs. Treated.

ADD COMMENT • link 8 months ago by Doomhammer ▴ 10

0

Thank you, Doomhammer, for your response. Could you please guide me to a tutorial or method that helps in identifying "the differences in transcripts among the genotypes"?

Here's how I envision it: I would map each genotype to the assembled transcripts (using the cd-hit output), then use salmon for quantification. After obtaining the Transcripts Per Million (TPM) values for each genotype, I plan to visualize the data using boxplots. However, I'm uncertain whether this approach is correct.
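The quantification step described above could be sketched roughly as follows, assuming salmon's mapping-based mode with single-end reads (`-r`) and automatic library type detection (`-l A`). The helper names, the `salmon_index` directory, and the `quant/<replicate>` output layout are assumptions for illustration. Commands are only printed unless DRY_RUN=0 is set.

```shell
# Print (or run) a command. DRY_RUN defaults to 1, so commands are only
# printed; set DRY_RUN=0 to actually execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Quantify every single-end library against one clustered transcript FASTA.
# Usage: quantify_all TRANSCRIPTS.fasta SAMPLES_FILE
# SAMPLES_FILE is assumed to be tab-separated: condition, replicate, FASTQ.
# Index and output paths (salmon_index, quant/<replicate>) are hypothetical.
quantify_all() {
  transcripts="$1"
  samples="$2"
  run salmon index -t "$transcripts" -i salmon_index
  while IFS="$(printf '\t')" read -r condition replicate fastq; do
    [ -n "$fastq" ] || continue       # skip blank or incomplete lines
    run salmon quant -i salmon_index -l A -r "$fastq" -o "quant/$replicate"
  done < "$samples"
}
```

Each output directory then holds a `quant.sf` table with both `TPM` and `NumReads` columns, so the TPM values for the boxplots can be taken from there.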

ADD REPLY • link 8 months ago by ben@f ▴ 20