Question

What are the rules for the basic statistics of a de novo assembly in RNAseq?

1

3.3 years ago

Riku ▴ 80

Hi all,

I have RNAseq data from non-model organisms, which I plan to analyze by de novo assembly with Trinity and then write up as a paper.

RNAseq papers usually report basic statistics (for example, total raw reads, number of genes, and assembly N50). Are there any rules for these parameters? I'm a beginner in bioinformatics, so I don't know which parameters are required. Could you please advise me?

Thank you very much for your much appreciated help on this!

Assembly Trinity RNAseq • 1.5k views

score 2 · Accepted Answer · 2021-09-08

2

3.3 years ago

Michael 55k

I think there are no fixed rules for these numbers that would determine whether you can or cannot publish. Transcriptomes are easier to assemble than genomes because they contain fewer repetitive regions, so N50 is limited mainly by the actual lengths of the transcripts. If your organism has different developmental stages and tissues, you should devise a good sampling strategy that distributes your funds so that you detect tissue- and stage-specific transcripts, or make a good pool of everything. I would say this is more important than how many paired-end reads you can afford. I think you can already get something decent from >50M fragments per library, which is at the lower end of what you get today.

Regarding the number of transcripts after assembly: how many transcripts, including isoforms, do you expect based on related species? If you get an absurd number of unigenes (100k+ genes), then you might have to sequence deeper.

Here is an example of such a paper in Parasites & Vectors: https://parasitesandvectors.biomedcentral.com/articles/10.1186/s13071-020-04442-2#Sec9 (I noticed the authors report rather low % of database hits)

They have ~460M reads in total, which would mean a cost/effect ratio of 1 Illumina 4000 lane per publication. I think one can go with less in some journals, and of course you need to analyze the data properly and write something interesting after that.

0

Thank you very much for your quick answer.

There are no rules! I see.

According to a paper on a related species, the number of genes is 48,541, and the number of isoforms is 147,621:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7266049/

The number of genes is 40081 in my data set. But I haven't checked the number of isoforms and unigenes, because I don't know how to count these. If it's OK with you, can you tell me how to check these?

I appreciate your cooperation.

3.3 years ago by Riku ▴ 80

1

From the paper you are citing: "The longest isoform per gene was extracted using a utility script bundled with Trinity v2.6.6 (get_longest_isoform_seq_per_trinity_gene.pl)."

That is what I meant by "unigene". You then simply count the number of sequences in the longest_isoform file and in Trinity.fasta. Trinity also includes a utility called TrinityStats.pl that can calculate N50 and other statistics.
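To make the counting and the N50 idea concrete, here is a minimal shell sketch. It is not TrinityStats.pl and will not reproduce all of its output; the function names count_fasta_records and fasta_n50 are made up for illustration, and the sketch assumes an ordinary FASTA file in which every header line starts with ">". N50 is taken here as the length of the contig at which the running sum of lengths, sorted from longest to shortest, first reaches half of the total assembled bases.

```shell
# Count FASTA records: every record has exactly one header line starting with ">".
count_fasta_records() {
    grep -c '^>' "$1"
}

# Compute N50 from a FASTA file (hypothetical helper, not part of Trinity).
fasta_n50() {
    # Step 1: print one sequence length per record (sequences may span several lines).
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    # Step 2: sort the lengths from longest to shortest.
    sort -rn |
    # Step 3: walk down the sorted lengths until half of the total bases is reached.
    awk '{ lens[NR] = $1; total += $1 }
         END {
             half = total / 2
             for (i = 1; i <= NR; i++) {
                 run += lens[i]
                 if (run >= half) { print lens[i]; exit }
             }
         }'
}
```

After defining the functions, something like `count_fasta_records Trinity.fasta` and `fasta_n50 Trinity.fasta` should give numbers you can compare against what TrinityStats.pl reports for all transcript contigs.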

The authors of the nematode paper had 325M reads by the way.

3.3 years ago by Michael 55k

0

Very sorry for the late reply.

It seems that I had misunderstood. Following your advice, I checked my dataset.

The number of isoforms:

$ grep -c ">" Trinity.fasta
76200

The number of unigenes:

$ grep -c ">" Trinity.longest_isoform.fasta
40081

Did I run this the way you meant? And how can I count the number of reads?

I apologize for asking so many questions. Thank you.

3.3 years ago by Riku ▴ 80

0

This looks OK to me. Just keep asking; if the topic develops further, you might want to open a new question for a specific problem. You might also benefit from finding local bioinformatics support.

I think you should always run FastQC on your fastq files (check the FAQs on Biostars for how to interpret the output), then make a combined report using MultiQC on the output. seqkit stats also does a good job of calculating statistics on Fast* files, including the number of sequences. All can be installed via Bioconda.
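If you just want a quick read count without any extra tools, it can be sketched in plain shell. This is only a rough sketch that assumes standard four-line FASTQ records (no wrapped sequence lines); the function name count_fastq_reads is made up for illustration.

```shell
# Count reads in a FASTQ file, plain or gzip-compressed.
# Assumption: every read occupies exactly four lines (header, sequence, "+", qualities).
count_fastq_reads() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;   # decompress to stdout, leave the file untouched
        *)    cat "$1" ;;
    esac | awk 'END { print int(NR / 4) }'
}
```

For paired-end data, running this on the R1 file alone gives the number of fragments, since R1 and R2 should contain the same number of records. The result should agree with the sequence count that seqkit stats reports for the same file.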

You will have to do more analyses to publish a paper; at a minimum, annotate the transcripts. For that you could attempt to replicate the Methods from Fu et al. step by step as far as they apply to your setting. You don't have to invent a completely new pipeline here. Once you have a draft manuscript, it makes sense to upload your raw data to SRA, DDBJ, or ENA. Also, once you submit a paper, please provide your assembly file as well.

Also, even though some groups still get by without it, consider replication, especially for differential expression analysis.

3.3 years ago by Michael 55k

0

I will try to do the analysis based on your advice.

You're right. I apologize for asking so many questions in a row.

I am very grateful for your help. Thank you very much!

3.3 years ago by Riku ▴ 80