Question

Using blast or similar tool to annotate ORFs using Bacmet database

0

Entering edit mode

5.2 years ago

Longshotx ▴ 70

Hello all - I am interested in annotating some ORFs I assembled using the 753 experimentally confirmed genes in Bacmet (http://bacmet.biomedicine.gu.se/) as a reference. Basically I want to do a blast search against this database and annotate similarly matched ORFs if they have an evalue less than or equal to 0.00001.

What would be the best tool or strategy for this?

alignment annotation prodigal metagenome • 1.8k views

ADD COMMENT • link updated 5.2 years ago by Mensur Dlakic ★ 28k • written 5.2 years ago by Longshotx ▴ 70

score 1 · Answer 1 · 2019-08-31

1

Entering edit mode

5.2 years ago

Mensur Dlakic ★ 28k

I can think of at least two strategies.

First, split your 753 proteins into single files, do BLASTp searches for each of them against your ORF database using the E-value threshold of 0.00001, and save the output of each search into individual files. BLASTp search files with no matches have the size of about 2500 characters - you can find the exact range of numbers easily by finding out the smallest 5-10 files after the search. After that there will be a jump in file sizes of at least 500, which are the files that contain matches. NOTE: if your ORF database is smaller than 753 proteins, you may want to do a reverse strategy: split your database into individual proteins, and BLASTp against the Bacmet ORFs.

Run jackhmmer, HHpred or some other automated search strategy on your Bacmet proteins and build HMM files for each of them. Concatenate all HMMs into a single file, and annotate your ORFs against this database using Prokka.

ADD COMMENT • link 5.2 years ago by Mensur Dlakic ★ 28k

0

Entering edit mode

I tried to do something similar to the hmm profile strategy but got confused. I originally aligned all the the proteins in Bacmet and build a hmm profile, but when I performed a search the query said "Bacmet alignment" for all of them which is not helpful at all. Are you suggesting I build a hmmprofile for each of the 753 individual genes? So hmmbuild Bacmetprotein1.fa Bacmetprotein1.hmm....same for all other proteins. Concatenate the HMMs. Hmmpress them and use in Prokka?

Lastly I was not aware Prokka could annotate existing ORFs . I thought you had to feed it the assemblies, and it would use a specified HMM profile and Prodigal (since Prokka is somewhat of a wrapper for other tools) to create the ORFs and annotate them. Note: (I just looked at the Github...it cannot annotate ORFs from Prodigal...you must rerun it on the contigs).

Thanks!

ADD REPLY • link 5.2 years ago by Longshotx ▴ 70

0

Entering edit mode

Are you suggesting I build a hmmprofile for each of the 753 individual genes?

That is what I am suggesting. From a quick glance at Bacmet database it seems that it contains 753 different proteins. I don't think making an alignment out of them would be productive.

By the same token, making HMMs out of individual proteins is not very productive either. I suggested you do a search of 753 individual proteins using jackhmmer or HHpred, which will return alignments of your starting proteins with their homologs. Those alignments should be used to build HMMs.

HMMs can be named during the building process using the -n switch, or you can edit the NAME field manually since HMMs are simple text files. If they are not named explicitly, you may get a generic name such as Bacmet alignment creep into all of them.

Lastly I was not aware Prokka could annotate existing ORFs

It is meant to annotate (meta)genome assemblies. However, one can skip the Prodigal step and feed it ORFs instead. Commenting out Prodigal section (starts around line 700 in main prokka script) and changing a few lines should do the trick.

ADD REPLY • link 5.2 years ago by Mensur Dlakic ★ 28k

0

Entering edit mode

I have never used either of those tools, but it appears I can use the jackhmmer script on the cluster I am working on. Could you suggest a db to use? I am assuming nr? Or I could use the BacMet predicted genes as a database...that has 150,000 genes.

ADD REPLY • link 5.2 years ago by Longshotx ▴ 70

0

Entering edit mode

It will probably be good enough if you use UniProt90 database (available as gzipped file here). That database is considerably smaller than nr.

If you are confident that Bacmet database of predicted genes is comprehensive enough, that will save you quite a bit of time compared to either UniProt90 or nr.

ADD REPLY • link 5.2 years ago by Mensur Dlakic ★ 28k

0

Entering edit mode

Excellent thank you. I'm going to try the BacMet 735 experimentally confirmed genes against the against BacMet predicted database of 155k gene homologs for now.

ADD REPLY • link 5.2 years ago by Longshotx ▴ 70

0

Entering edit mode

Hi again Mensur

So I hope I am setting this up correctly because I just ran it using parallel and it finished without any output file.

So I have a directory with a .fasta for everyone of the 735 genes and the database containing the 155k genes is in a directory above.

I ran this command:

ls *.fa | parallel --gnu "jackhmmer {} Bacmet_predicted.fa"

I didn't realize I had to specify an output file (very dumb of me). I'm not sure what I need it to output though. Am I looking to get a Stockholm (.sto) alignment file for each query gene by passing the -A option?

Then I build a hmm profile for each...

I'm not familiar with combining the hmm profiles.

Please let me know if this sounds correct. Thanks

Just in case someone wants to run this on therir own I edited my code:

ls *.fa | parallel --gnu "jackhmmer --cpu 16 -o {.}.out -A {.}.sto --tblout {.}.tab {} Bacmet_predicted.fa"

ADD REPLY • link 5.2 years ago by Longshotx ▴ 70