Question

Merging multiple sets of transcripts (generated by Trinity)

6.9 years ago

liorglic ★ 1.5k

Hi everyone. I need your advice on an issue I'm having.
I have used Trinity to create transcriptome assemblies from ~30 RNA-Seq data sets downloaded from SRA. All data sets come from the same species (tomato), but from different variants, conditions and tissues. My aim is to produce some type of 'pan transcriptome', so I'm trying to get very diverse data. As a result, I can't just throw all the reads at Trinity: this is a huge amount of data, and the data sets are quite different from each other in terms of read length, coverage, library type etc.
Now I have ~30 fasta files, each derived from a single data set. Next, I'd like to merge them to get a single, unified transcript set. I intend to use it to annotate multiple other genomes of tomatoes and wild relatives that I have de-novo assembled. Here are my questions:

1) Do you know of a tool that can do the merging I'm looking for? When I say merge, I mean doing it in a smart way: for example, using overlaps to elongate transcripts, collapsing partial transcripts into full ones if those exist in the data, and so on. So far, I've looked into StringTie's merge function, but it requires gff files rather than fasta (as produced by Trinity), and into DRAP's runMeta module, but that one requires DRAP assembly outputs, which I do not have.

2) I've read an old post suggesting that an OLC assembler might be useful in this case. Do you think it could work? Can you recommend a good and modern assembler aimed at transcriptomic data?

3) Do you think that the merge step is even necessary? I am planning to use the MAKER annotation pipeline, and I'm not sure how it would perform with a messy transcript set containing duplicates and partial transcripts. But maybe this is not a problem and I'm wasting time on an irrelevant procedure?

Of course, any other advice would be appreciated.
Thanks a lot!

Assembly RNA-Seq • 4.6k views

ADD COMMENT • link 6.9 years ago by liorglic ★ 1.5k

Hello liorglic. I think what you describe above is merging or "clustering" of fasta files. Have a look at the following thread: CD-HIT.

Hope this serves your purpose.

ADD REPLY • link 6.9 years ago by mks002 ▴ 220

Thanks @mks002. I am familiar with CD-HIT, and have seen it mentioned in this context before, but I still don't understand how it can solve my problem. CD-HIT clusters transcripts based on a similarity threshold. How can such output be used to actually merge transcripts together? Can you explain in more detail or refer me to such an explanation?

ADD REPLY • link 6.9 years ago by liorglic ★ 1.5k

Yes, correct: CD-HIT clusters sequences based on a similarity threshold and some other thresholds.

>Cluster 7
0       30057aa, >Oar_XP_011987898.1... *
1       30057aa, >Oar_XP_011987907.1... at 1:30057:1:30057/99.95%
>Cluster 8
0       30008aa, >Oar_XP_011987916.1... *
>Cluster 9
0       13288aa, >Ssc_NP_001106757.1... *
>Cluster 10
0       9532aa, >Bot_XP_015327298.1... *
1       9531aa, >Bot_XP_015327299.1... at 1:9531:1:9532/100.00%
2       9531aa, >Bot_XP_015327300.1... at 1:9531:1:9532/100.00%
3       9441aa, >Bot_XP_015327309.1... at 1:9441:101:9532/93.28%
4       9440aa, >Bot_XP_015327310.1... at 1:9440:101:9532/93.25%
5       9440aa, >Bot_XP_015327311.1... at 1:9440:101:9532/92.89%
6       9440aa, >Bot_XP_015327312.1... at 1:9440:101:9532/92.82%

Above is an example of CD-HIT output. Below is one of the representative sequences:

0       9532aa, >Bot_XP_015327298.1... *

The representative sequence of each cluster is marked with a "*" tag, and an output file containing all the representative sequences is generated. So you can use these sequences.

Another approach is to try CAP3. This might help you elongate the transcripts.

All the best.
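To make the cluster listing above easier to work with, here is a minimal Python sketch that reads lines in that format and reports the representative ("*") and the other members of each cluster. The function name `parse_clstr` and the embedded sample are only illustrative, and the pattern assumes member lines look like the ones shown above (index, length with an `aa` or `nt` suffix, `>ID...`, then either `*` or an `at ...` alignment note).

```python
import re

# Matches a member line such as:
#   "1       9531aa, >Bot_XP_015327299.1... at 1:9531:1:9532/100.00%"
# Assumption: lengths carry an "aa" or "nt" suffix, as in the listing above.
MEMBER_RE = re.compile(r"^\d+\s+(\d+)(?:aa|nt),\s+>(.+?)\.\.\.\s+(.*)$")


def parse_clstr(lines):
    """Return {cluster_name: {"representative": id, "members": [ids]}}."""
    clusters = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            current = line[1:]  # e.g. "Cluster 10"
            clusters[current] = {"representative": None, "members": []}
            continue
        match = MEMBER_RE.match(line)
        if match is None or current is None:
            continue  # skip anything that does not look like a member line
        _length, seq_id, status = match.groups()
        if status == "*":
            clusters[current]["representative"] = seq_id
        else:
            clusters[current]["members"].append(seq_id)
    return clusters


# Sample copied from the listing above (hypothetical use, not a real file).
EXAMPLE = """\
>Cluster 9
0       13288aa, >Ssc_NP_001106757.1... *
>Cluster 10
0       9532aa, >Bot_XP_015327298.1... *
1       9531aa, >Bot_XP_015327299.1... at 1:9531:1:9532/100.00%
3       9441aa, >Bot_XP_015327309.1... at 1:9441:101:9532/93.28%
"""

if __name__ == "__main__":
    for name, info in parse_clstr(EXAMPLE.splitlines()).items():
        print(name, info["representative"], len(info["members"]))
```

Note that this only tells you which sequences were collapsed into which representative; it does not elongate or merge anything.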

ADD REPLY • link 6.9 years ago by mks002 ▴ 220

Thanks, I understand now. I decided to try a wrapper tool called DRAP (https://peerj.com/articles/2988/), which has a module called runMeta. It uses CD-HIT-EST to collapse transcripts and the assembler TGICL to elongate transcripts. I'll see how this works for me.

ADD REPLY • link 6.9 years ago by liorglic ★ 1.5k

It's better not to use CAP3 or methods like that. You have a high chance of generating chimeric transcripts.

See this: http://arthropods.eugenes.org/EvidentialGene/evigene/docs/cdhiterr-arabidopsis-example.txt

A better way would be to concatenate the multiple assemblies and run the EvidentialGene (tr2aacds) package on the result.
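For the concatenation step, one thing to watch is that separate Trinity runs can produce identical transcript IDs, so a plain `cat` may leave duplicate headers. A rough Python sketch that prefixes each ID with its source file name while concatenating is below. The function names, the `_` separator and the file names in the usage note are my own assumptions, not something EvidentialGene requires.

```python
from pathlib import Path


def prefix_fasta_ids(lines, prefix):
    """Yield FASTA lines (without newlines), prefixing every record ID."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # ">TRINITY_DN0_c0_g1_i1 len=300" -> ">leaf_TRINITY_DN0_c0_g1_i1 len=300"
            yield f">{prefix}_{line[1:]}"
        elif line:
            yield line  # sequence line, passed through unchanged


def concatenate_assemblies(fasta_paths, out_path):
    """Concatenate assemblies into out_path; return the number of records."""
    n_records = 0
    with open(out_path, "w") as out:
        for path in map(Path, fasta_paths):
            # Assumption: the file stem (name without extension) is a
            # unique, meaningful label for the data set.
            with open(path) as handle:
                for line in prefix_fasta_ids(handle, path.stem):
                    out.write(line + "\n")
                    n_records += line.startswith(">")
    return n_records
```

Hypothetical usage: `concatenate_assemblies(["leaf.fasta", "root.fasta"], "pooled.fasta")`, after which `pooled.fasta` would be the input for the next step.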

After this, I tend to map the normalized, pooled read set used for the assembly back to the transcripts and keep those with > 0.1 TPM. I'll have to write a blog post about it one of these days.

I have found that this significantly reduces chimeric transcripts and other assembly issues.
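The expression filter could be sketched roughly as follows in Python. This is not the exact procedure used above: it assumes the quantification step produced a tab-separated table with `Name` and `TPM` columns (as in a Salmon `quant.sf` file), and the function names are made up for illustration.

```python
import csv


def ids_above_tpm(table_lines, min_tpm=0.1, id_col="Name", tpm_col="TPM"):
    """Return the set of transcript IDs whose TPM is strictly above min_tpm.

    Assumption: a tab-separated table with a header row containing
    id_col and tpm_col (Salmon-style column names by default).
    """
    reader = csv.DictReader(table_lines, delimiter="\t")
    return {row[id_col] for row in reader if float(row[tpm_col]) > min_tpm}


def filter_fasta(fasta_lines, keep_ids):
    """Yield only the FASTA records whose ID (first header word) is kept."""
    keep = False
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in keep_ids
        if keep and line:
            yield line


if __name__ == "__main__":
    # Tiny in-memory example; real input would be read from files.
    table = [
        "Name\tLength\tEffectiveLength\tTPM\tNumReads",
        "t1\t100\t80\t5.0\t10",
        "t2\t100\t80\t0.05\t1",
    ]
    fasta = [">t1 some description", "ACGT", ">t2", "GGCC"]
    for out_line in filter_fasta(fasta, ids_above_tpm(table)):
        print(out_line)
```

The 0.1 TPM cutoff is the one quoted above; whether it suits a given data set is a judgment call.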

ADD REPLY • link 6.8 years ago by harish ▴ 470