Question

What is an easy-to-use `conda` installable method for annotating proteins for eukaryotes?

4.3 years ago

O.rka ▴ 740

I have a bunch of fragmented eukaryotic proteins that I would like to annotate. I don't have the genome, so I can't use MAKER or FUNANNOTATE to build good gene models. I ran rnaSPAdes for a de novo transcriptome assembly and then Prodigal to call fragmented ORFs (yes, I know this is for prokaryotes).

How can I get annotations for this?

Interproscan is basically impossible to install. I've tried for weeks to get this to work but their package is so poorly documented and so confusing that I'm giving up. I'm not sure how to use docker and would rather avoid it if there is another option available.

MAKER is extremely confusing to run from a specific point, in particular for annotating the proteins without the genome.

Is there ANYTHING I can use that will give me decent annotations from these fragmented ORFs?

Something along the lines of the following:

conda install -c bioconda [somepackage]
[somepackage] -i proteins.faa -o output_directory

Is this a fool's wish?

annotation genome proteins eukaryotes • 1.9k views

updated 4.3 years ago by Mensur Dlakic ★ 28k • written 4.3 years ago by O.rka ▴ 740

score 0 · Answer 1 · 2020-08-18

I often tell students, whenever they are unsure whether to develop a new approach, to ask themselves how unique their problem truly is. If it isn't unique, chances are someone has solved it. And if it is a very common problem, it is likely that there are many solutions.

Annotating proteins is a very common problem, so there are many solutions. I will give you two, and others will probably pitch in with more. By the way, these solutions apply to proteins from any organism, not just eukaryotes.

Solution 1:

From the Pfam FTP site, download Pfam-A.hmm.gz and decompress it (gunzip Pfam-A.hmm.gz). Next you will need the HMMER suite of programs: either compile it from source or get the binaries (both types of files are available here). Finally, type these two commands:

hmmpress Pfam-A.hmm
hmmscan -E 0.001 -o output Pfam-A.hmm proteins.faa

I timed it: the whole thing took 80 seconds, including downloading and pressing, though I already had HMMER installed. I strongly suggest that you read up on what the various parts of HMMER can do.
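If you want something closer to the conda one-liner from the question, here is a rough sketch. It assumes HMMER comes from the bioconda `hmmer` package and that hmmscan is rerun with `--tblout`, which writes a space-delimited per-sequence table that is easier to post-process than the default report. The function name `best_pfam_hit` and the file name `pfam.tbl` are made up for illustration, and the column positions follow the `--tblout` layout as I understand it, so check them against your own output.

```shell
# Assumed setup (not run here):
#   conda install -c bioconda hmmer
#   hmmpress Pfam-A.hmm
#   hmmscan -E 0.001 --tblout pfam.tbl -o output Pfam-A.hmm proteins.faa

# Print the best-scoring Pfam hit for each protein in a --tblout table.
# Output columns (tab-separated): protein, Pfam name, Pfam accession,
# full-sequence E-value, full-sequence score, Pfam description.
best_pfam_hit() {
    awk '
        /^#/ { next }                       # skip header and footer comments
        {
            query = $3                      # for hmmscan, column 3 is the protein
            score = $6 + 0                  # full-sequence bit score
            if (!(query in best) || score > best[query]) {
                best[query] = score
                desc = $19                  # description starts at column 19
                for (i = 20; i <= NF; i++) desc = desc " " $i
                line[query] = query "\t" $1 "\t" $2 "\t" $5 "\t" $6 "\t" desc
            }
        }
        END { for (q in line) print line[q] }
    ' "$1" | sort
}
```

Calling `best_pfam_hit pfam.tbl > best_hits.tsv` would then give one line per annotated protein. Picking the top score is only one reasonable choice; for multi-domain proteins you may prefer to keep every hit below the E-value cutoff.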

Solution 2 (this one takes ~5 minutes because the database is larger):

Download the CDD database and unpack it (tar -zxof Cdd_LE.tar.gz). Install the BLAST+ suite of programs from here if you don't already have it. Then type:

rpsblast -query proteins.faa -db /path/to/Cdd -evalue 0.001 -out output
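Since the input is fragmented ORFs, it may help to see which proteins got no hit at all. The sketch below assumes rpsblast is rerun with `-outfmt 6` (tabular output, query ID in the first column) and written to a file I am calling `cdd_hits.tsv`; the function name `unannotated_proteins` is also made up for illustration.

```shell
# Assumed input (not run here):
#   rpsblast -query proteins.faa -db /path/to/Cdd -evalue 0.001 \
#            -outfmt 6 -out cdd_hits.tsv

# List FASTA IDs that never appear as a query in a tabular hits file.
#   $1 = protein FASTA, $2 = tabular hits with the query ID in column 1
unannotated_proteins() {
    awk '
        FILENAME == ARGV[1] { hit[$1] = 1; next }   # hits file: remember queries
        /^>/ {
            id = substr($1, 2)                      # header up to first whitespace
            if (!(id in hit)) print id
        }
    ' "$2" "$1"
}
```

Running `unannotated_proteins proteins.faa cdd_hits.tsv` prints one ID per line. This relies on the query IDs in the table matching the first word of each FASTA header, which is my understanding of how BLAST+ reports them.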

Here is a bonus solution if none of the above works, or if you prefer to copy-paste:

https://www.ncbi.nlm.nih.gov/Structure/bwrpsb/bwrpsb.cgi

PS: For the sake of completeness, I should say that the first solution uses only Pfam for annotation, while the second solution annotates against a larger group of databases. The bonus solution gives the same results as the second.