Is there a way to allow hmmsearch to search a .gbff file for protein hits? I'm trying HMMs that should have multiple hits in a particular genome, but the search keeps coming back with zero hits.
My guess would be that hmmsearch is searching just the genomic nucleotide sequence, whereas what I would want is for it to search all of the CDS feature sequences. Thoughts? I obtained the gbff from NCBI, and if I download the corresponding Protein FASTA, I'm able to find sequences. However, there's some information that I want from the gbff file that doesn't exist in the protein FASTA, so I'd rather not have to download both, if that makes sense?
Thanks!
It almost makes sense, but it would help if you explained what you are ultimately trying to do.
My guess is that hmmsearch doesn't take GenBank files as protein input, so you should extract the proteins from the gbff, parse the information you want to keep, and add it to the FASTA headers.
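A rough sketch of that extraction step, using only the Python standard library. It assumes the usual GenBank flat-file layout (feature keys at column 6, qualifiers indented by 21 spaces) and that each CDS carries a `/translation` qualifier; CDS features without one (pseudogenes, for example) are skipped. The function names and the header layout are made up for illustration, and a real parser such as Biopython's `SeqIO.parse(handle, "genbank")` would be more robust than this.

```python
from itertools import chain

QUAL_INDENT = " " * 21  # qualifier lines are indented by 21 spaces in GenBank flat files


def parse_qualifiers(lines):
    """Turn the stripped qualifier lines of one feature into a dict (first value wins)."""
    entries = []        # [name, value] pairs in file order
    open_quote = False  # True while inside a quoted value that spans several lines
    for text in lines:
        if text.startswith("/") and not open_quote:
            name, _, value = text[1:].partition("=")
            entries.append([name, value])
        elif entries:
            # Continuation line: sequences are joined directly, free text with a space.
            sep = "" if entries[-1][0] == "translation" else " "
            entries[-1][1] += sep + text
        else:
            continue  # continuation of a multi-line location, before any qualifier
        value = entries[-1][1]
        open_quote = value.startswith('"') and not (len(value) > 1 and value.endswith('"'))
    return {name: value.strip('"') for name, value in reversed(entries)}


def iter_cds(lines):
    """Yield the qualifier dict of every CDS feature that has a /translation."""
    in_features, key, body = False, None, []
    for raw in chain(lines, ["//"]):  # sentinel flushes the last feature of a truncated file
        line = raw.rstrip("\n")
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if not in_features:
            continue
        if line.startswith(QUAL_INDENT):
            body.append(line.strip())
            continue
        # Any other line closes the feature collected so far.
        if key == "CDS":
            quals = parse_qualifiers(body)
            if "translation" in quals:
                yield quals
        if line.startswith("     ") and line[5:6].strip():
            key, body = line[5:21].strip(), []          # a new feature key
        else:
            in_features, key, body = False, None, []    # ORIGIN, CONTIG, // ...


def cds_to_fasta(lines):
    """Build protein FASTA text with the locus tag as the sequence name."""
    out = []
    for q in iter_cds(lines):
        header = (f">{q.get('locus_tag', 'unknown')} "
                  f"protein_id={q.get('protein_id', 'NA')} "
                  f"product={q.get('product', 'NA')}")
        out.append(f"{header}\n{q['translation']}\n")
    return "".join(out)
```

Something like `open("proteins.faa", "w").write(cds_to_fasta(open("genome.gbff")))` (both file names are placeholders) would then give a file that hmmsearch can read. Putting the locus tag first in the header means it shows up as the target name in the hmmsearch output.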
To clarify: My end goal is to compile a database for which bacteria and archaea have/do not have proteins with particular domains in their proteomes, and the specific locus and protein sequence for each protein.
The workflow I'm imagining for Python: download all prokaryote genomes from GenBank, then for each genome:
1. Obtain the organism name, accession #, and taxonomy.
2. For each HMM I'm interested in, hmmsearch the genome and return all the hit locus #s and AA sequences.
3. Concatenate all the results into a dataframe.
I know how to do 1 and 3 efficiently, especially because gbffs make obtaining that information easy and one organism can be associated with one file. 2 is the hard part (with my novice-level coding skills) to automate... and since I want to do this for (tens of) thousands of genomes... I would definitely prefer being able to automate it!
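Step 2 could be automated roughly as follows. This is only a sketch: it assumes `hmmsearch` (HMMER 3) is installed and on the PATH, that a protein FASTA already exists for each genome, and that its headers start with the locus tag. The function names are made up for illustration. `--tblout` writes one line per hit sequence, with `#` comment lines and whitespace-separated columns (target name, target accession, query name, query accession, full-sequence E-value, full-sequence score, ...).

```python
import os
import subprocess


def run_hmmsearch(hmm_path, fasta_path, tblout_path):
    """Run hmmsearch on one protein FASTA and write the per-sequence hit table."""
    subprocess.run(
        ["hmmsearch", "--noali", "--tblout", tblout_path,
         "-o", os.devnull,          # discard the human-readable report
         hmm_path, fasta_path],
        check=True,                 # raise if hmmsearch exits with an error
    )


def parse_tblout(lines):
    """Return one dict per hit from the lines of a --tblout file."""
    hits = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header/footer comments and blank lines
        fields = line.split(maxsplit=18)  # field 19 is the free-text description
        hits.append({
            "target": fields[0],          # first word of the FASTA header
            "query": fields[2],           # name of the HMM
            "evalue": float(fields[4]),   # full-sequence E-value
            "score": float(fields[5]),    # full-sequence bit score
        })
    return hits
```

Looping over genomes and HMMs, calling `run_hmmsearch` and then `parse_tblout(open(tblout_path))`, gives a list of dicts per genome. Those can be tagged with the organism name and accession and passed to `pandas.DataFrame(...)` for step 3. The protein sequences for the hits can be looked up by target name in the FASTA file that was searched.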
Most (all?) bacterial genomes at NCBI have been annotated, and the proteins are available in a separate file, one file per sequenced genome. You can then easily run hmmsearch on each genome separately.
If, for example, you go to the Assembly resources page and perform a search, you can then download several files, including the proteins in FASTA format.
Or you can download the protein FASTA files from the command line, for example by modifying 5h3ikki's answer from "How to download COMPLETE bacterial genomes from NCBI based on list of names?".
Do you need current data, or would any available data do?
gbk files do not exist any more, but before 2014 there were huge collections of bacterial proteins in GenBank.
I can send you a link to the old NCBI dataset if you need it.
Here it is: see my answer in the post "Where can I get environmental bacteria genome in fasta format (as many as possible)?"
By the way, there are only several hundred archaeal genomes so far, about 500 or 600.