Question

De novo transcriptome assembly produces too many transcripts

1

5.9 years ago

21afiq ▴ 10

Hi, I just finished my transcriptome assembly using Trinity. However, Trinity produced too many transcripts (~300k), which is not normal for my sample. I believe most of these transcripts are redundant. How can I remove the redundant transcripts?

1) I already tried CD-HIT-EST. Unfortunately, the output still contains many redundant transcripts.

2) I also tried Corset, following the tutorial here (https://github.com/Oshlack/Corset/wiki/Example). However, I am currently stuck on how to recover the unigene sequences from the Corset output.

3) I plan to try TGICL to further remove redundant sequences from the CD-HIT output, as some studies have done. However, I am not familiar with TGICL and don't know which parameters to use.

I would be grateful if somebody could help with my problem. Thanks.
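On point 2, one common way to get a single sequence per Corset cluster is to keep the longest transcript in each cluster. The following is only a minimal sketch of that idea. It assumes the cluster file is a two-column, tab-separated table (transcript ID, then cluster ID) and that the FASTA IDs match the first column. The function names and the file names in the comment are hypothetical.

```python
def read_fasta(lines):
    """Yield (id, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the transcript ID.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def longest_per_cluster(fasta_lines, cluster_lines):
    """Return {cluster_id: (transcript_id, sequence)}, keeping the longest transcript."""
    # Assumed format: transcript_id <TAB> cluster_id, one pair per line.
    cluster_of = {}
    for line in cluster_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            cluster_of[fields[0]] = fields[1]

    best = {}
    for tid, seq in read_fasta(fasta_lines):
        cid = cluster_of.get(tid)
        if cid is None:
            continue  # Transcript was not assigned to any cluster.
        if cid not in best or len(seq) > len(best[cid][1]):
            best[cid] = (tid, seq)
    return best


def write_unigenes(fasta_path, clusters_path, out_path):
    """Write one representative sequence per cluster to out_path."""
    with open(fasta_path) as fa, open(clusters_path) as cl:
        best = longest_per_cluster(fa, cl)
    with open(out_path, "w") as out:
        for cid, (tid, seq) in sorted(best.items()):
            out.write(f">{cid} {tid}\n{seq}\n")


# Hypothetical usage; adjust the paths to your own files:
# write_unigenes("Trinity.fasta", "clusters.txt", "unigenes.fasta")
```

Taking the longest transcript is a simplification: it is not necessarily the most expressed or the most complete isoform, so it is worth checking a few clusters by hand.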

assembly transcriptomics RNA-Seq trinity • 3.5k views

updated 5.9 years ago by Corentin ▴ 610 • written 5.9 years ago by 21afiq ▴ 10

0

Which organism are you working with?

5.9 years ago by Kristoffer Vitting-Seerup ★ 4.1k

0

I always find it helpful to map the transcripts and view them in a genome browser. I find gmap to be the best mapper. Example command (might be out of date):

gmap -f gff3_gene -D /lager2/rcug/seqres/HS/gmap/hg19_gmap -d hg19_gmap -B 5 -t 16 --intronlength=150000 --totallength=1000000 --npaths 1 -p 3 in.fa > in.fa.gff3

5.9 years ago by colindaven 7.0k

score 3 · Answer 1 · 2019-01-23

The Trinity FAQ states that having a lot of transcripts is expected (I would advise you to read it if you have not already):

Lots of transcripts is the rule rather than the exception.

https://github.com/trinityrnaseq/trinityrnaseq/wiki/Trinity-FAQ#ques_why_so_many_transcripts

If you are still concerned by the number of transcripts, you can filter them based on their abundance. I usually filter out transcripts that have a very low expression level in all samples. They sometimes correspond to artifacts, but you also risk filtering out important transcripts that are simply expressed at low levels. From the FAQ again:

Biological relevance of the lowly expressed transcripts could be questionable - some are bound to be very relevant.
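As a rough illustration of the filter described above (dropping transcripts that are lowly expressed in every sample), here is a minimal sketch. It assumes you already have per-sample expression values (for example TPM) loaded into a dictionary. The function name, the toy data and the default cutoff of 1.0 are hypothetical choices, not recommendations.

```python
def filter_low_expression(expression, min_expr=1.0):
    """Keep transcripts whose expression reaches min_expr in at least one sample.

    expression: dict mapping transcript ID -> list of per-sample values (e.g. TPM).
    """
    return {
        tid: values
        for tid, values in expression.items()
        if values and max(values) >= min_expr
    }


# Toy data (hypothetical): two samples per transcript.
toy_expression = {
    "t1": [0.0, 0.2],    # low everywhere, dropped
    "t2": [5.0, 0.1],    # high in one sample, kept
    "t3": [0.9, 1.5],    # kept
    "t4": [12.0, 30.0],  # kept
}
kept = filter_low_expression(toy_expression, min_expr=1.0)
print(sorted(kept))  # ['t2', 't3', 't4']
```

Using the maximum across samples means a transcript survives if it is expressed in any single condition, which reduces the risk of discarding condition-specific transcripts.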

I wrote a Python script that prints the number of transcripts against the expression level (only works on Linux): https://github.com/MCorentin/plot_transcripts_filtering.py. This can help you find the best threshold.
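To show the idea of choosing a threshold by looking at how many transcripts survive each cutoff, here is a small sketch. It is not the script linked above, only an illustration under the assumption that expression values are available as a dictionary of per-sample values. The function name, toy data and thresholds are hypothetical.

```python
def count_retained(expression, thresholds):
    """Return a list of (threshold, number of transcripts retained).

    A transcript is retained at a threshold if its highest per-sample
    value is at least that threshold.
    """
    peaks = [max(values) for values in expression.values() if values]
    return [(t, sum(1 for p in peaks if p >= t)) for t in thresholds]


# Toy data (hypothetical): two samples per transcript.
toy = {
    "t1": [0.0, 0.2],
    "t2": [5.0, 0.1],
    "t3": [0.9, 1.5],
    "t4": [12.0, 30.0],
}
for threshold, n in count_retained(toy, [0, 0.5, 1, 2, 10]):
    print(f"{threshold}\t{n}")
```

A sharp drop in the retained count at very low thresholds often points to a large population of barely expressed transcripts. Where to cut remains a judgment call.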