I've done a bit of searching but haven't seen this specific problem raised (or answered) before. Apologies if it has.
I'm interested in a particular (bacterial) protein family. I have reason to believe the genomic context of these proteins will be interesting as well. So what I'd like to do is obtain every unique genomic context it's found in. That is, whether or not the protein is identical, if the genomic region surrounding the gene isn't the same, consider it unique.
So far, I've obtained a set of UniProt identifiers that match my HMM, and from those extracted protein IDs. Many of these are WP_ (RefSeq non-redundant) sequences, which means the identical protein is found in multiple genomes. I think I've found a way to use Entrez to link from these redundant sequences to genome accessions, but it's rather slow.
Is there a better way to do this? I feel like there must be, but I haven't been able to put it together.
Roughly, is this how you are doing it? (Note: I am downloading the CDS sequences for just the RefSeq genomic accessions here.)
If that's the case, then I am afraid you cannot get it any faster using Entrez Direct. How many protein accessions are you starting with?
Here's an alternate way:
1. Make an IPG report of all of the protein accessions you have. This report has the GCF assembly accession.
2. Download the entire genomes in FASTA format corresponding to those GCF assembly accessions from the NCBI FTP site.
3. From the IPG report, make a BED file with seq-id, start, stop and strand information.
4. Use bedtools getfasta with the BED file from #3 and the genome sequences from #2.
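Step 3 can be sketched in Python roughly as below. This is only a minimal sketch, not a tested pipeline. It assumes the IPG report is the tab-separated table with a header row containing the columns "Source", "Nucleotide Accession", "Start", "Stop", "Strand" and "Protein", and that the coordinates are 1-based and inclusive, so they are shifted to BED's 0-based, half-open convention. The function name ipg_to_bed and the flank parameter are my own additions for illustration; flank widens each interval so the surrounding genomic context is extracted too, not just the gene.

```python
import csv
import io


def ipg_to_bed(ipg_text, flank=0, source="RefSeq"):
    """Convert an IPG report (tab-separated text with a header row) to BED6 lines.

    Assumed column names: Source, Nucleotide Accession, Start, Stop,
    Strand, Protein. `flank` (hypothetical helper parameter) extends the
    interval on both sides to capture neighbouring genes.
    """
    reader = csv.DictReader(io.StringIO(ipg_text), delimiter="\t")
    seen = set()
    bed_lines = []
    for row in reader:
        # Keep only rows from the requested source (e.g. RefSeq); pass
        # source=None to keep everything.
        if source and row["Source"] != source:
            continue
        # Skip rows without usable coordinates.
        if not (row["Start"].isdigit() and row["Stop"].isdigit()):
            continue
        start, stop = sorted((int(row["Start"]), int(row["Stop"])))
        # 1-based inclusive -> 0-based half-open, then widen by `flank`.
        bed_start = max(0, start - 1 - flank)
        bed_end = stop + flank  # not clamped to the contig length here
        key = (row["Nucleotide Accession"], bed_start, bed_end, row["Strand"])
        if key in seen:  # drop exact duplicate locations
            continue
        seen.add(key)
        bed_lines.append("\t".join([
            row["Nucleotide Accession"],
            str(bed_start),
            str(bed_end),
            row["Protein"],   # BED name field
            "0",              # BED score field (unused placeholder)
            row["Strand"],
        ]))
    return bed_lines
```

Writing the returned lines to a file (say regions.bed, a placeholder name) gives the input for step 4, e.g. bedtools getfasta -fi genomes.fa -bed regions.bed -s -name -fo regions.fa, where -s respects the strand column. Note that the end coordinate is not clamped, so with a large flank some intervals may run past the end of a contig and need trimming against the contig lengths first.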