9.0 years ago
kanika.151
################################
## Counts of transcripts, etc.
################################
Total trinity 'genes': 35868
Total trinity transcripts: 54969
Percent GC: 51.52
########################################
Stats based on ALL transcript contigs:
########################################
Contig N10: 9567
Contig N20: 7769
Contig N30: 6524
Contig N40: 5393
Contig N50: 4511
Median contig length: 1780
Average contig: 2555.95
Total assembled bases: 140497949
#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################
Contig N10: 7964
Contig N20: 6149
Contig N30: 5018
Contig N40: 4126
Contig N50: 3411
Median contig length: 1077
Average contig: 1843.53
Total assembled bases: 66123622
This is what TrinityStats.pl gives me after the assembly.
I was expecting about 12,510 genes in total, but Trinity is giving me 35,868. Even when I remove the isoforms, it still gives me 25,747 genes. Why is it giving me an extra 13k genes?
Has anyone else stumbled on this Trinity problem?
I got the same thing after running Trinity. I isolated the longest sequence for each gene from the output, then used those for downstream analysis.
How did you isolate the longest sequence?
Trinity author Brian Haas has provided a Perl script to extract the longest isoforms from Trinity assemblies, along with this comment:
Initially, I thought I was getting such an estimate because I had not used the "--trimmomatic" or "--normalize_reads" parameters, but when I ran it again with them I got even more Trinity transcripts. I think I will run the analysis both on the longest transcripts only and on all of them. Thank you.
We used a custom Perl script. As mentioned in the Trinity Frequently Asked Questions, you can also use all transcripts for your downstream analysis. That is reasonable too.
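For anyone who wants to see what "isolating the longest isoform per gene" amounts to, here is a minimal Python sketch. It is not Brian Haas's Perl script or the custom script mentioned above. It assumes that each transcript ID is the first word of the FASTA header and ends in `_i<number>`, so that stripping that suffix gives the Trinity 'gene' ID. The function names and the demo records are made up for illustration.

```python
import io
import re

# Assumption: Trinity transcript IDs end in "_i<number>"; removing that
# suffix leaves the 'gene' ID (e.g. TRINITY_DN1_c0_g1_i2 -> TRINITY_DN1_c0_g1).
ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_isoforms(handle):
    """Return {gene_id: (header, sequence)} keeping the longest isoform per gene."""
    best = {}
    for header, seq in read_fasta(handle):
        transcript_id = (header.split() or [""])[0]
        gene_id = ISOFORM_SUFFIX.sub("", transcript_id)
        # Keep the first isoform seen unless a strictly longer one turns up.
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (header, seq)
    return best


# Hypothetical demo records, not taken from any real assembly.
demo = io.StringIO(
    ">TRINITY_DN1_c0_g1_i1 len=4\nACGT\n"
    ">TRINITY_DN1_c0_g1_i2 len=8\nACGT\nACGT\n"
    ">TRINITY_DN2_c0_g1_i1 len=2\nAC\n"
)
for header, seq in longest_isoforms(demo).values():
    print(f">{header}\n{seq}")
```

To run it on a real assembly, pass an open file handle for the Trinity FASTA output instead of `demo`. Note that the longest isoform is not necessarily the most highly expressed one, which is one reason for also keeping the full transcript set, as suggested above.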
Hello kanika.151!
Questions similar to yours can already be found at:
We have closed your question to allow us to keep similar content in the same thread.
If you disagree with this please tell us why in a reply below. We'll be happy to talk about it.
Cheers!
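I think that is quite normal. Most or all transcriptome assemblies will largely overestimate the number of transcripts because of gaps: a gene whose coverage has gaps ends up as several separate contigs, and each one is counted as its own 'gene'. A factor of 2-3 is quite good, I think (here 35,868 / 12,510 is about 2.9). Why don't you map the reads to the genome instead and check for novel transcripts that way?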
If I do a genome-based Trinity assembly, how would it give me novel transcripts?
Why does it overpredict? How can I explain it?
For the assembly, I used 3 biological replicates (so 3 times the reads), and I got about 3 times the number of known genes. That made me wonder: was it really assembling the reads?