How do you identify the contigs from a Trinity assembly?
6.2 years ago
MAPK ★ 2.1k

I am trying to get read counts for a DESeq2 analysis from metagenomic data. I have assembled contigs with Trinity across all organisms, and I would like to map the reads from each sample to these contigs to obtain the counts. Normally for RNA-seq we would use a GFF file to assign reads to annotated loci, but for metagenomic data I can't use one specific genome, so I want to use the Trinity-assembled contigs as the mapping reference. However, before proceeding with the read mapping, I would like to annotate each Trinity contig. I wonder if I can do a BLAST search against nr. What would be the easiest way to do this? Thanks for your help!

Trinity Assembly blast • 2.8k views

To get counts for each, you don't strictly need to identify them up-front. You could identify the DE ones first and only ID those :-)
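
A minimal sketch of the "only ID the DE ones" idea: once you have a list of differentially expressed contig IDs, pull just those sequences out of the assembly FASTA and annotate that much smaller set. The function names and the demo records are made up for illustration, and it assumes the contig ID is the first whitespace-delimited token of each FASTA header.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def subset_fasta(lines: Iterable[str], wanted_ids: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Keep only records whose ID (first token of the header) is in wanted_ids."""
    wanted = set(wanted_ids)
    for header, seq in read_fasta(lines):
        if header.split()[0] in wanted:
            yield header, seq


# Made-up example records; in practice, iterate over the open assembly FASTA file.
demo = [
    ">TRINITY_DN0_c0_g1_i1 len=12",
    "ATGAAACCCGGG",
    ">TRINITY_DN1_c0_g1_i1 len=8",
    "ATGC",
    "ATGC",
]
de_ids = ["TRINITY_DN1_c0_g1_i1"]  # hypothetical DE contig list from DESeq2

selected = list(subset_fasta(demo, de_ids))
for header, seq in selected:
    print(f">{header}\n{seq}")
```

The resulting subset can then go into whichever search tool you end up using, instead of the whole assembly.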

You could follow these directions from Trinity for identification.

Edit: Since this is a metagenomic dataset these directions are not useful.

That is right, I was planning to do it the way you suggested, but identifying the DE ones later would be a somewhat elaborate process. I thought identifying them at the beginning would reduce the work later.

So rather than identification per se you are looking to reduce redundancy so you don't have the same sequence represented multiple times?

Did you use TriMetAss (http://microbiology.se/software/trimetass/) instead of Trinity? That appears to be for metagenomic data.

No, these are not overlapping sequences so I wanted to map them to the assembled reference. I haven't used TriMetAss, but will give it a try. Thanks!

Additionally, I just wanted to get the loci identified (which gene, CDS, etc.) for each cluster of reads after mapping.

Since this is bacterial data you would expect the entire sequence to be coding. It may not be full length or start at the ATG depending on how well the assembly worked.

As suggested, it should be OK to search with DIAMOND against nr (or the RefSeq bacterial database) to identify the contigs. It works well, but you would need ~80-100 GB of RAM for this search. You could also try Magic-BLAST from NCBI.
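
A rough sketch of what to do with the search results, assuming the contigs were searched in a translated mode (e.g. `diamond blastx`) and written in the default 12-column tabular format (`--outfmt 6`): keep the highest-bitscore hit per contig as its tentative annotation. The function name, the e-value cutoff, and the demo rows are illustrative assumptions, not something from this thread.

```python
import csv
from typing import Dict, Iterable

# Default column order of BLAST/DIAMOND tabular output (outfmt 6).
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(lines: Iterable[str], max_evalue: float = 1e-5) -> Dict[str, Dict[str, str]]:
    """Return the highest-bitscore hit per query, ignoring hits above max_evalue."""
    best: Dict[str, Dict[str, str]] = {}
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < len(COLUMNS):
            continue  # skip blank or malformed lines
        hit = dict(zip(COLUMNS, row))
        if float(hit["evalue"]) > max_evalue:
            continue
        current = best.get(hit["qseqid"])
        if current is None or float(hit["bitscore"]) > float(current["bitscore"]):
            best[hit["qseqid"]] = hit
    return best


# Made-up rows with placeholder subject names; real input would be the DIAMOND output file.
demo_lines = [
    "contig1\tsubject_A\t80.0\t100\t20\t0\t1\t300\t1\t100\t1e-40\t150.0",
    "contig1\tsubject_B\t95.0\t100\t5\t0\t1\t300\t1\t100\t1e-60\t210.0",
    "contig2\tsubject_C\t30.0\t40\t28\t0\t1\t120\t1\t40\t0.5\t25.0",
]

best = best_hits(demo_lines)
for contig, hit in best.items():
    print(contig, hit["sseqid"], hit["pident"], hit["evalue"], sep="\t")
```

A single best hit is a crude annotation for metagenomic contigs; contigs with no hit below the cutoff are simply left unannotated here.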

Thanks! I have used DIAMOND before, so yes, it makes sense.

Out of sheer curiosity: what was your rationale for using Trinity? My apologies in case this question is merely based on my inexperience with Trinity: why would you BLAST contigs against nr? Or do you get proteins? Is Trinity able to define gene boundaries in prokaryotic RNA-seq data? Also, I think your GFF approach should work: you can handle contigs in a metagenome just like any other genome.

For contig annotation, Kraken is an excellent tool (though it lacks a good taxonomic binning algorithm, afaik), and as a faster blastp alternative I recommend DIAMOND.

I just wanted to annotate the contigs, and I also don't think BLAST would be the best solution, which is why I asked this question here. Since this is metatranscriptome data, I am not sure whether I would be able to use GFF file(s). I am using the Trinity-assembled contigs as a reference genome to get read counts from my metatranscriptome data.
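
For the counting step itself, here is a minimal sketch that assumes each sample's reads were aligned to the contigs and summarised with `samtools idxstats`, whose output is tab-separated: reference name, length, mapped count, unmapped count. It merges the per-sample tables into one contigs-by-samples matrix of the kind DESeq2 takes as input. The function names, sample names, and numbers are made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def parse_idxstats(lines: Iterable[str]) -> Dict[str, int]:
    """Map contig name -> mapped read count from idxstats-style lines."""
    counts: Dict[str, int] = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0] == "*":
            continue  # "*" is the summary row for reads with no reference
        counts[fields[0]] = int(fields[2])
    return counts


def build_matrix(per_sample: Dict[str, Dict[str, int]]) -> Tuple[List[str], List[str], List[List[int]]]:
    """Return (samples, contigs, rows); rows[i][j] is contig i's count in sample j."""
    samples = sorted(per_sample)
    contigs = sorted({c for counts in per_sample.values() for c in counts})
    rows = [[per_sample[s].get(c, 0) for s in samples] for c in contigs]
    return samples, contigs, rows


# Made-up idxstats-style lines for two hypothetical samples.
sample_a = ["contig1\t500\t10\t0", "contig2\t300\t0\t0", "*\t0\t0\t7"]
sample_b = ["contig1\t500\t4\t0", "contig2\t300\t25\t0", "*\t0\t0\t3"]

per_sample = {"sampleA": parse_idxstats(sample_a), "sampleB": parse_idxstats(sample_b)}
samples, contigs, rows = build_matrix(per_sample)

# Print a tab-separated matrix with contig IDs in the first column.
print("\t".join(["contig"] + samples))
for contig, row in zip(contigs, rows):
    print("\t".join([contig] + [str(n) for n in row]))
```

Contig IDs are enough as row names at this stage, so no GFF is needed for the counting; annotation can be joined on afterwards by contig ID.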

Hi, I was just wondering if you ended up finding a way to annotate the contigs from Trinity?
