Hi all,
I have RNAseq data of non-model organisms, which I'm planning to analyze by de novo assembly with Trinity and write a paper on.
RNAseq papers usually contain basic statistics (for example, total raw reads, number of genes, assembly N50, etc.). Are there any rules for which of these to report? I'm a beginner in bioinformatics, so I don't know which parameters are required. Can you please advise me?
Thank you very much for your much appreciated help on this!
Thank you very much for your quick answer.
There are no rules! I see.
According to a paper on a related species, the number of genes is 48,541 and the number of isoforms is 147,621:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7266049/
The number of genes in my data set is 40,081, but I haven't checked the number of isoforms and unigenes because I don't know how to count them. If it's OK with you, can you tell me how to check these?
I appreciate your cooperation.
From the paper you are citing: "The longest isoform per gene was extracted using a utility script bundled with Trinity v2.6.6 (get_longest_isoform_seq_per_trinity_gene.pl)."
That is what I meant by "unigene". You then simply count the number of sequences in the longest_isoform file and in Trinity.fasta. There is also a utility in Trinity called TrinityStats.pl that can calculate N50 and other statistics. The authors of the nematode paper had 325M reads, by the way.
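The counting step described above can be sketched in Python if you want to cross-check the numbers. This is only a minimal sketch, not what TrinityStats.pl does internally. It assumes Trinity-style sequence IDs in which the isoform is marked by a trailing _i<number> (for example TRINITY_DN1_c0_g1_i1), so the gene ID is the header with that suffix removed. The function names (read_fasta_lengths, gene_id, n50, summarize) are made up for illustration.

```python
import re
from collections import defaultdict


def read_fasta_lengths(path):
    """Return {sequence_id: sequence_length} for a (possibly wrapped) FASTA file."""
    lengths = {}
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # The ID is the first whitespace-separated token of the header.
                current = line[1:].split()[0]
                lengths[current] = 0
            elif current is not None:
                lengths[current] += len(line)
    return lengths


def gene_id(transcript_id):
    """Assumption: Trinity-style IDs end in _i<number>; strip that suffix."""
    return re.sub(r"_i\d+$", "", transcript_id)


def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def summarize(path):
    """Count isoforms and genes, and compute N50 on all and on longest isoforms."""
    lengths = read_fasta_lengths(path)
    longest = defaultdict(int)
    for tid, length in lengths.items():
        gid = gene_id(tid)
        longest[gid] = max(longest[gid], length)
    return {
        "transcripts": len(lengths),
        "genes": len(longest),
        "n50_all": n50(list(lengths.values())),
        "n50_longest_isoform": n50(list(longest.values())),
    }
```

Calling summarize("Trinity.fasta") would then give the isoform count ("transcripts") and the gene/unigene count ("genes") in one pass. If your headers do not follow the _i<number> convention, the gene count will simply equal the transcript count, which is a quick sign that the assumption does not hold.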
Very sorry for the late reply.
It seems that I had misunderstood. Following your advice, I checked my dataset.
The number of isoforms:
The number of unigenes:
In this case, can I proceed as you suggested? And how can I count the number of reads?
I apologize for asking so many questions. Thank you.
This looks OK to me. Feel free to keep asking; if the topic develops further, you might want to open a new question for a specific problem. You might also benefit from finding local bioinformatics support.
I think you should always run FastQC on your fastq files (check the FAQs on Biostars for interpretation), then build a combined report with MultiQC on the output. Also, seqkit stats does a good job of calculating statistics on Fast* files. All can be installed via Bioconda.
You will have to do more analyses to publish a paper, at minimum annotate the transcripts. For that, you could attempt to replicate the Methods from Fu et al. step by step as far as they apply to your setting; you don't have to invent a completely new pipeline here. Once you have a draft manuscript, it makes sense to upload your raw data to SRA, DDBJ, or ENA. Also, once you submit a paper, please provide your assembly file as well.
Also, even though some groups still get by without it, consider replication, especially for differential expression analysis.
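On the read-count question: seqkit stats reports the number of sequences per file directly. If you want a cross-check without extra tools, here is a minimal Python sketch. It assumes standard four-line FASTQ records (no wrapped sequence lines) and handles plain or gzip-compressed files by file extension. The function name count_fastq_reads is made up for illustration.

```python
import gzip


def count_fastq_reads(path):
    """Count reads in a FASTQ file, assuming exactly four lines per record."""
    # Pick the opener by extension; ".gz" files are read as text via gzip.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        # Truncated file, trailing blank lines, or wrapped records.
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4
```

For paired-end data, run it on each file of a pair; the two counts should be equal, and it is worth stating in the paper whether you report reads or read pairs.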
I will try to do the analysis based on your advice.
You're right. I apologize for asking so many questions in a row.
I am very grateful for your help. Thank you very much!