Plant Annotation Workflow
10.7 years ago
User000 ▴ 710

Hello, I have ~200,000 de novo assembled plant transcripts and need to annotate the proteome. I ran BLASTX with my transcripts against protein databases, but I got very poor hits. My final goal is to functionally annotate my transcripts. Any suggestions? A detailed explanation would be much appreciated, since I have never worked in this field. Any other suggestions on this topic are also welcome.

10.7 years ago
Mary 11k

Have you looked around at the iPlant resources? They may have some useful guidance for you: http://www.iplantcollaborative.org/

10.7 years ago
jackuser1979 ▴ 890

Try running blastx with a higher (more permissive) E-value cutoff (maybe 1e-10) to get hits. If you still find no hits after that, you can go for domain annotation with Pfam, then Gene Ontology annotation, and then KEGG annotation. You can use the Blast2GO tool, which does all of the above annotations.
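
As a rough illustration of the blastx step, here is a minimal Python sketch that only assembles the command line and does not run it. The file and database names (`transcripts.fasta`, `plant_proteins`, `hits.tsv`) are placeholders, and the cap of 5 target sequences and the thread count are assumptions, not recommendations from this thread. The flags themselves are standard BLAST+ options.

```python
import shlex

def build_blastx_command(query_fasta, protein_db, out_tsv, evalue=1e-10, threads=4):
    """Return the argv list for a tabular blastx run (nothing is executed here)."""
    return [
        "blastx",
        "-query", query_fasta,      # nucleotide transcripts (hypothetical file name)
        "-db", protein_db,          # protein database built with makeblastdb
        "-evalue", str(evalue),     # E-value cutoff suggested above
        "-outfmt", "6",             # tab-separated output, one hit per line
        "-max_target_seqs", "5",    # assumed cap on hits kept per query
        "-num_threads", str(threads),
        "-out", out_tsv,
    ]

cmd = build_blastx_command("transcripts.fasta", "plant_proteins", "hits.tsv")
print(shlex.join(cmd))
```

The printed command can be pasted into a shell, or the list can be passed to `subprocess.run` once BLAST+ is installed.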

10.7 years ago
SES 8.6k

If you are working with grass genomes then you are in luck, because there are several well-annotated grass genomes (e.g., rice, maize, sorghum). A good place to start would be the Gramene website, which is a resource for working with plant genomes. It is not possible to fully explain what path you should take without knowing your end goal. For example, it's not clear if you are just trying to annotate a genome or if there is some underlying biological question that you are actually interested in. Whatever your goal is, it may be helpful to know that there is an archive of data on the Gramene site that allows you to bulk download genes, ontologies, pathway information, etc. That may give you faster access to the data you need than trying to construct these resources yourself.

10.7 years ago
Ann ★ 2.4k

It seems weird that you are not getting very good hits. What's the size distribution of your transcripts? Maybe they are mostly very short? Also, many of your transcripts might be non-coding. I would expect about half the sequences to be non-coding, based on my experience with blueberry and working with a draft genome, which is of course a very different project from yours. Others may have a much better idea; I've only done this type of thing for one plant.

Regarding which plant databases to use: I would recommend getting all the RefSeq protein databases for fully sequenced, annotated plants, and maybe supplementing those with proteomes from Phytozome. However, there's a catch: some plant genomes have many more functional annotations than others. Arabidopsis is probably the most extensively annotated, followed by rice. Tomato also seems pretty well annotated. Another database you will definitely want to use for annotation is the PlantCyc enzyme database. Once you sort out the informatics, you can use it to assign plant pathway accessions, which can be incredibly useful if you're going after medicinal compounds or other metabolic pathways.
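
To check the size distribution asked about above, something like the following minimal sketch could be used. It reads FASTA records and reports a few summary numbers. The 300 nt default for "short" is an arbitrary assumption, and the toy input stands in for a real assembly file.

```python
import io
from statistics import median

def read_fasta_lengths(handle):
    """Yield (sequence_id, length) for each record in a FASTA file handle."""
    seq_id, length = None, 0
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, length
            header = line[1:].split()
            seq_id, length = (header[0] if header else ""), 0
        elif line and seq_id is not None:
            length += len(line)
    if seq_id is not None:
        yield seq_id, length

def summarize_lengths(lengths, short_cutoff=300):
    """Summarize transcript lengths; short_cutoff is an assumed threshold in nt."""
    lengths = sorted(lengths)
    return {
        "n": len(lengths),
        "min": lengths[0],
        "median": median(lengths),
        "max": lengths[-1],
        "n_short": sum(1 for x in lengths if x < short_cutoff),
    }

# Toy input; for real data use open("transcripts.fasta") (hypothetical file name).
demo = io.StringIO(">c1 toy\nATGC\nATG\n>c2\nATGCATGCAT\n")
lengths = [n for _, n in read_fasta_lengths(demo)]
summary = summarize_lengths(lengths, short_cutoff=8)
print(summary)
```

If most contigs fall under a few hundred nucleotides, poor blastx hit rates would not be surprising, since short fragments often cover too little coding sequence to align well.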

They are short, yes. My wheat is also tetraploid, so there is the problem of homeologs. I ran BLAST against Phytozome as well and got hits for fewer than half of the contigs. In the end, the only database that gave me more hits was Ensembl. Why would half of the sequences be non-coding? May I ask which workflow you used to annotate blueberry? Thank you for the post.

10.7 years ago
nbvasani ▴ 240

Download the Arabidopsis thaliana protein database from NCBI or UniProt, then run blastx with your transcripts against it. As Arabidopsis thaliana resembles many plant species, you will get lots of hits.

"As Arabidopsis thaliana resembles many plant species, you will get lots of hits."

That is not a reasonable statement to make. Arabidopsis thaliana is a species that has a per base pair substitution rate several times higher than the average angiosperm. So, it is not a good choice for finding distant homologies.

I agree with "Arabidopsis thaliana is a species that has a per base pair substitution rate several times higher than the average angiosperm." As I understand it, the best way to begin annotation is to start with a plant species which is well studied and resembles your plant species.

Okay, you agree but then you repeat the same thing by saying, "start with plant species which...resemble your plant species." My point was that we don't know what species is being annotated and using Arabidopsis alone is not the best choice.

You are right, Arabidopsis alone is not the best choice. But he has to start from some plant database in order to annotate his assembly. If you have any suggestions, let User000 know instead of making unnecessary statements.

There's no need to be argumentative. My statement is very relevant. By having a discussion about better ways of doing things, we are helping the OP find a solution. We should focus on that point and not take comments about genome annotation personally.

In fact, nbvasani said plant specieS, meaning several plants, I guess. Anyway, as I have mentioned, I am using 9 plant species. I am also going to create a database of contaminants, which will include plant pathogens, human, and mouse. Other ideas are welcome.
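
One possible way to use such a contaminant database, sketched in Python under the assumption that each contig's best bit score has already been collected from two separate BLAST runs (one against the plant proteins, one against the contaminants). The rule "contaminant score beats plant score" is an illustrative assumption, not something prescribed in this thread.

```python
def flag_contaminants(plant_scores, contaminant_scores):
    """Return contig IDs whose best contaminant bit score beats the best plant bit score.

    Both arguments map contig ID -> best bit score from one BLAST run.
    Contigs with no plant hit at all are treated as having a plant score of 0.
    """
    flagged = set()
    for contig, contam_score in contaminant_scores.items():
        if contam_score > plant_scores.get(contig, 0.0):
            flagged.add(contig)
    return flagged

# Toy scores with made-up contig names.
plant = {"contig_1": 250.0, "contig_2": 80.0}
contam = {"contig_2": 310.0, "contig_3": 95.0}
flagged = flag_contaminants(plant, contam)
print(sorted(flagged))
```

Flagged contigs would still deserve a manual look, since conserved proteins can score well against both databases.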

I have created a plant protein database that includes Arabidopsis thaliana, rice, barley and others (9 species in total), but I still get very poor hits. OK, let's assume I want to at least annotate the transcripts that have >90% identity. What would you suggest I do next?
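
For the >90% identity idea, a minimal Python sketch of filtering BLAST tabular output (`-outfmt 6`) and keeping the best-scoring hit per contig might look like this. The column order is the default for that format. The minimum alignment length of 50 (amino acids, for blastx) is an added assumption to avoid trusting very short matches, and the toy rows are invented.

```python
import csv
import io

# Default column order of BLAST tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]

def best_hits(handle, min_identity=90.0, min_aln_len=50):
    """Return {query_id: hit} keeping the highest bit score hit that passes the filters."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(FIELDS, row))
        if float(hit["pident"]) < min_identity or int(hit["length"]) < min_aln_len:
            continue
        current = best.get(hit["qseqid"])
        if current is None or float(hit["bitscore"]) > float(current["bitscore"]):
            best[hit["qseqid"]] = hit
    return best

# Toy rows; for real data use open("hits.tsv") (hypothetical file name).
demo = io.StringIO(
    "c1\tP1\t95.0\t120\t6\t0\t1\t360\t1\t120\t1e-60\t230\n"
    "c1\tP2\t99.0\t80\t1\t0\t1\t240\t1\t80\t1e-40\t160\n"
    "c2\tP3\t72.5\t200\t55\t0\t1\t600\t1\t200\t1e-30\t120\n"
    "c3\tP4\t98.0\t20\t0\t0\t1\t60\t1\t20\t1e-5\t45\n"
)
best = best_hits(demo)
for qid, hit in best.items():
    print(qid, hit["sseqid"], hit["pident"], hit["bitscore"])
```

The surviving subject IDs could then be looked up in the source database to transfer descriptions or GO terms, with the usual caution that high identity over a short region does not guarantee the same function.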

You can try running blastx against the nr database from NCBI. Are you interested in differentially expressed transcripts?

I tried to run blastx against TrEMBL, but BLAST was extremely slow, so I could not go on. See my related post C: Speeding up the BLAST job

Yep, it generally takes a week or so. Instead of concentrating on the whole de novo assembly, try generating a DGE list, then run blastx against all the databases you have. A DGE list will be easier to handle than the full de novo assembly, and you will get your results faster. If you still find few hits, you can try blasting your transcript sequences manually with blastn at NCBI, one by one, as the number of transcripts in the DGE list will be far smaller than in the de novo assembly.
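
Once a DGE list exists, pulling only those transcripts out of the assembly before running blastx could be done with a small Python sketch like the one below. The contig names and the in-memory FASTA are made up for illustration, and the ID is assumed to be the first whitespace-delimited token of each header.

```python
import io

def subset_fasta(fasta_handle, wanted_ids):
    """Yield FASTA lines for records whose ID (first header token) is in wanted_ids."""
    keep = False
    for line in fasta_handle:
        if line.startswith(">"):
            header = line[1:].split()
            keep = bool(header) and header[0] in wanted_ids
        if keep:
            yield line

# Hypothetical IDs taken from a differential expression result table.
de_ids = {"contig_2", "contig_3"}
fasta = io.StringIO(
    ">contig_1\nATGC\n>contig_2 len=8\nATGCATGC\n>contig_3\nTTGA\n"
)
subset = "".join(subset_fasta(fasta, de_ids))
print(subset, end="")
```

Writing `subset` to a new file gives a much smaller query set for blastx than the full assembly.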

How do I generate a DGE list? If it normally takes a week, it is going to take me 2 months.

You can generate a differential gene expression (DGE) list using R packages such as edgeR and DESeq.

I don't see the point of blasting sequences manually if I can download the nr database from NCBI. And it really will be faster with a DGE list, no?

It all depends on what you want from your data. With a DGE list it will be faster.

Hi User000,

Has any NGS article related to your plant species been published? You might get some clues for your annotation from it. If it's OK with you, can you tell me your plant species?

I am working with Triticum durum, which is tetraploid. The contigs were de novo assembled earlier using CLC (I did not do that part). There are 2-3 articles related to Triticum; however, I need more information, and perhaps in more detail.

Great! Contact the authors of those articles; they can tell you how they annotated their assembly.

10.7 years ago
rtliu ★ 2.2k

MAKER-P has recently been used to annotate the loblolly pine genome - link

Quoted from the MAKER-P overview:

"Sequencing diverse plant species of evolutionary, agricultural, and medicinal interest is becoming routine for even small groups - genome annotation and analysis is much less so. The MAKER-P pipeline is designed to make the annotation of novel plant genomes tractable for small groups with limited bioinformatics experience and resources, and faster and more transparent for large groups with more experience and resources. The MAKER-P pipeline generates species-specific repeat libraries, as well as structural annotations of protein coding genes, non-coding RNAs, and pseudogen"
