Trinity predicting more genes than expected?
9.0 years ago
kanika.151 ▴ 160
################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':    35868
Total trinity transcripts:    54969
Percent GC: 51.52

########################################
Stats based on ALL transcript contigs:
########################################

    Contig N10: 9567
    Contig N20: 7769
    Contig N30: 6524
    Contig N40: 5393
    Contig N50: 4511

    Median contig length: 1780
    Average contig: 2555.95
    Total assembled bases: 140497949


#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

    Contig N10: 7964
    Contig N20: 6149
    Contig N30: 5018
    Contig N40: 4126
    Contig N50: 3411

    Median contig length: 1077
    Average contig: 1843.53
    Total assembled bases: 66123622

This is what TrinityStats.pl gives me after the assembly.

I was expecting about 12,510 genes, but Trinity reports 35,868. Even after removing isoforms I still get 25,747 genes. Why am I getting roughly 13k extra genes?

Has anyone else run into this problem with Trinity?
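
For readers unfamiliar with the "Contig N50" figures in the stats output, here is a minimal Python sketch of the usual Nxx definition: sort the contig lengths from longest to shortest and report the length at which the running total first reaches the chosen fraction of all assembled bases. The function name `contig_nxx` and the toy lengths are illustrative assumptions, not part of Trinity, and the script's own tie-handling may differ.

```python
def contig_nxx(lengths, fraction=0.5):
    """Return the Nxx contig length (N50 when fraction is 0.5)."""
    if not lengths:
        raise ValueError("no contig lengths given")
    threshold = sum(lengths) * fraction
    running = 0
    # Walk from the longest contig down until the running total
    # covers the requested fraction of all assembled bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length


# Toy data, not taken from the assembly above.
toy_lengths = [100, 200, 300, 400, 500]
print("N50:", contig_nxx(toy_lengths))
print("N10:", contig_nxx(toy_lengths, fraction=0.1))
```

A high N50 therefore says something about contig length, not about whether the number of 'genes' is right.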

RNA-Seq genes trinity • 3.7k views
I got the same result with Trinity. I isolated the longest sequence from the output and used those sequences for downstream analysis.

How did you isolate the longest sequence?

Trinity author Brian Haas has provided a Perl script to extract the longest isoforms from Trinity assemblies, along with this comment:

The longest transcript isn't always the 'best' transcript.... but this has been asked for so many times, I'll just write the script and post it shortly.
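
As a rough illustration of the idea (this is not Brian Haas's Perl script), the selection can be sketched in Python. The sketch assumes transcript IDs of the form `TRINITY_DN1_c0_g1_i2`, where stripping the trailing `_i<N>` gives the 'gene' ID. Older Trinity releases used a different naming scheme, so the regular expression would need adjusting. The function names and the example records are hypothetical.

```python
import re
from collections import OrderedDict

# Assumed header style: ">TRINITY_DN1_c0_g1_i2 len=12 ..."
ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_isoform_per_gene(records):
    """Keep the longest sequence seen for each 'gene' identifier."""
    best = OrderedDict()
    for header, seq in records:
        transcript_id = header.split()[0]
        gene_id = ISOFORM_SUFFIX.sub("", transcript_id)
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (header, seq)
    return best


# Made-up records for demonstration only.
example = """>TRINITY_DN1_c0_g1_i1 len=8
ACGTACGT
>TRINITY_DN1_c0_g1_i2 len=12
ACGTACGTACGT
>TRINITY_DN2_c0_g1_i1 len=4
ACGT
""".splitlines()

result = longest_isoform_per_gene(read_fasta(example))
for gene_id, (header, seq) in result.items():
    print(gene_id, header.split()[0], len(seq))
```

To use it on a real assembly, pass an open file handle (for example `open("Trinity.fasta")`, a placeholder name) to `read_fasta` instead of the example lines.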

Initially, I thought I was getting such an estimate because I had not used the "--trimmomatic" or "--normalize_reads" parameters, but when I ran it again I got even more Trinity transcripts. I think I will run the analysis on both the longest transcripts and all of them. Thank you.

We used a custom Perl script. As mentioned in the Trinity Frequently Asked Questions, you can also use all transcripts for your downstream analysis. That is reasonable too.

Hello kanika.151!

Questions similar to yours can already be found at:

We have closed your question to allow us to keep similar content in the same thread.

If you disagree with this please tell us why in a reply below. We'll be happy to talk about it.

Cheers!

Re-opened because it wasn't exactly identical.

I think that is quite normal. Most or all transcriptome assemblies will largely overestimate the number of transcripts because of gaps: when coverage is missing over part of a gene, that gene can end up split across several contigs. A factor of 2-3 is quite good, I think. Why don't you map the reads to the genome instead and check for novel transcripts that way?
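
To put numbers on that, here is a quick calculation of the fold difference using the counts quoted in the question. The variable and function names are just for illustration.

```python
expected_genes = 12510          # number of genes the poster expected
trinity_genes = 35868           # Trinity 'genes' from the stats output
after_isoform_removal = 25747   # count quoted after removing isoforms


def fold_over(observed, expected):
    """How many times larger the observed count is than the expected one."""
    return observed / expected


print(round(fold_over(trinity_genes, expected_genes), 2))
print(round(fold_over(after_isoform_removal, expected_genes), 2))
```

Both ratios come out between 2 and 3, so the assembly in the question sits inside the range described here.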

If I run a genome-guided Trinity assembly, how would it give me novel transcripts?

Why does it overpredict? How can I explain it?

For the assembly, I used 3 biological replicates, so 3 times the reads, and I got about 3 times the known number of genes. That made me wonder: was it really assembling the reads?

9.0 years ago
kanika.151 ▴ 160

Okay, I figured out why it is overestimating and how I can remove similar clusters.

While assembling, I combined the control and inoculated samples, which should have been assembled separately. Also, there is a tool called CD-HIT that clusters similar sequences and removes the redundant ones to give the needed assembly.
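
For anyone who wants to try this, below is a minimal Python sketch that builds a `cd-hit-est` command line (the nucleotide version of CD-HIT). The file names, the 0.95 identity threshold, and the thread and memory settings are assumptions for illustration only; pick values that suit your data. The helper names are hypothetical.

```python
import shlex
import subprocess


def build_cd_hit_est_command(in_fasta, out_fasta, identity=0.95,
                             word_size=10, threads=4, memory_mb=4000):
    """Build an argument list for cd-hit-est.

    -c is the sequence identity threshold and -n the word size
    (the CD-HIT guide suggests 10 or 11 for thresholds of 0.95-1.0).
    -T sets the number of threads and -M the memory limit in MB.
    """
    return [
        "cd-hit-est",
        "-i", in_fasta,
        "-o", out_fasta,
        "-c", str(identity),
        "-n", str(word_size),
        "-T", str(threads),
        "-M", str(memory_mb),
    ]


def run_cd_hit_est(cmd):
    """Run the command; requires cd-hit-est to be installed and on PATH."""
    subprocess.run(cmd, check=True)


# Placeholder file names, not from the original post.
cmd = build_cd_hit_est_command("Trinity.fasta", "Trinity.cdhit95.fasta")
print(shlex.join(cmd))
# run_cd_hit_est(cmd)  # uncomment once cd-hit-est is available
```

The representative sequences end up in the output FASTA, and CD-HIT also writes a `.clstr` file listing which sequences were grouped together.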
