Question

Annotation of massive or large scaffolds with Pfam or other hmm based databases

0

Entering edit mode

3.1 years ago

edward.messick ▴ 60

I've been attempting to annotate large scaffolds (nearly 1 million nucleotides in some cases) with hmmer / hmmscan or hhsuite's tools and I found that there is a limitation on how large a scaffold/sequence is before they break. In order to annotate large scaffolds (500k-1M) with Pfam or a similar protein family database, is there a well suited program to do this?

I'd rather not split my scaffolds up into smaller sequences, which is an approach I think would work with caveats - you'd also be missing segments where you made your cuts and then I'd be dealing with thousands of seq files being split out of these multiple sequence scaffold fasta files I already have which need to be split up for this analysis (most have 20-50 scaffolds above 500kb, so they'd produce 50 files + if I split them into small enough scaffolds we'd be talking hundreds of files per file, and thousands of files to start with... it'd be out of hand fast).

My goal is to screen these scaffolds for specific gene families (like pathogenicity) but generally to be able to do this for any functionality we want to look for... I can't get it to run through no matter how large the compute is (I've tried up to 1.5 TB of RAM, but it caps off at something like 32kb, which is much lower than I'd need). I have no limitations on compute and can procure a cloud environment with up to 256 cores if need be...

Thankful for any advice you guys may have on this topic. I've never dealt with such large scaffolds before

protein Pfam scaffolds annotation hmm • 1.9k views

ADD COMMENT • link updated 3.1 years ago by Mensur Dlakic ★ 29k • written 3.1 years ago by edward.messick ▴ 60

score 1 · Answer 1 · 2022-05-10

1

Entering edit mode

3.1 years ago

Mensur Dlakic ★ 29k

Pfam database can be downloaded here - you will need a file Pfam-A.hmm.gz. The search is done using hidden Markov models (HMMs), and specifically HMMs created with HMMer. This is a mature set of programs with detailed instructions about their use.

You download Pfam database and install HMMer locally, and then you can search with whatever scaffold size is needed. I assume you know that these programs work only with proteins, so you have to run them first through a gene predictor such as prodigal.

ADD COMMENT • link 3.1 years ago by Mensur Dlakic ★ 29k

0

Entering edit mode

I am consistently getting a bug where hmmer doesn't allow scaffolds above 100Kb to be run through Pfam, but was able to annotate pure sequences with hmmscan when they were below this limit. I was utilizing hmmer's package to do the gene prediction... so you'd suggest first doing gene prediction with something like prodigal, taking the MSA and feeding it into hmmer + Pfam? Thanks for the information.

ADD REPLY • link 3.1 years ago by edward.messick ▴ 60

1

Entering edit mode

HMMer doesn't predict genes. There may be a program in it that does DNA -> protein translation (translate), but that's not the same as predicting genes. To do it properly, that part has to be done by a dedicated program. And you feed individual proteins into search, not MSA.

HMMer doesn't have a size limit as it reads sequences one by one. It will work whether you have 1 or 1 million proteins, though the latter search will take longer to do.

Let's say you have a scaffold.fna file with your DNA sequences. To predict genes and convert ORFs to proteins:

prodigal -i scaffold.fna -a scaffold_proteins.faa -d scaffold_ORFs.fna -o /dev/null

To search against Pfam:

hmmscan -E 0.01 -o scaffold_results.txt --cpu 10 Pfam.hmm scaffold_proteins.faa

There are some customizations to be done such as what E-value cutoff you prefer and the number of CPUs, and maybe -p meta in prodigal for metagenomic sequences, but the gist of the procedure is in those two commands.

ADD REPLY • link 3.1 years ago by Mensur Dlakic ★ 29k

0

Entering edit mode

ok, I appreciate it. Before reading this, I was running hhblits directly on a query fasta/seq and it was producing pfam annotations (I had to increase the maxres limit to run it). It did end up running through and giving some predicted genes but it capped it off at 10. I'll have to do a search against NCBI nt or something because generally I'm looking to annotate all of these scaffolds for any genes available

ADD REPLY • link 3.1 years ago by edward.messick ▴ 60

2

Entering edit mode

You and I seem to be talking about different things.

HHblits also is not intended to run with DNA scaffold sequences. It is meant to search one sequence at a time, but with PROTEIN sequences. Now, it can do a search with multiple protein sequences, but they need to be aligned rather than just random and unrelated sequences one would get typically get from a DNA (meta)genomic scaffold.

If you were feeding DNA sequences to HHblits, no wonder it died at 32 Kb because I think its protein size limit is 32768. Also, it was considering your sequences to be proteins made completely out of ATCG, so any results you got from that are likely to be useless.

Before doing anything else, I suggest you read up about pipelines for genome annotation. It also may be useful to consider prokka.

ADD REPLY • link 3.1 years ago by Mensur Dlakic ★ 29k

0

Entering edit mode

great, thanks a lot for the information Mensur. I think my confusion was in that I thought HHblits could actually do the conversion to protein sequences being fed in DNA seq.

ADD REPLY • link 3.1 years ago by edward.messick ▴ 60