I have a local copy of the NR protein database to run blastp. I'm interested in analyzing a specific subset of prokaryotic genes. I've already extracted prokaryotic proteins from the NR database using the following steps:
1. Extracting proteins from bacteria and archaea taxids:
blastdbcmd -db nr -taxids 2,2157 -dbtype prot -out prokaryote_sequences.fasta
2. Creating a BLAST database from the extracted sequences:
makeblastdb -in prokaryote_sequences.fasta -dbtype prot -out nr_prok
Now, I need to further subset this database to include only specific genes of interest, such as rpoB. However, I suspect that a simple grep on the FASTA headers won't be sufficient because not all rpoB sequences might have "rpoB" in their headers.
My Question:
What is the best way to filter my custom BLAST database to include only the proteins of specific genes like rpoB?
You could BLAST the gene of interest against this custom DB and then extract the IDs of the hits as FASTA to create another subset database.
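A rough sketch of that workflow, under a few assumptions that are not in the original post: `rpoB_query.fasta` is a hypothetical file holding one or more trusted rpoB protein sequences, `nr_prok_rpoB` is a hypothetical output name, and the coverage, identity, and e-value cutoffs are placeholders to tune. It also assumes the IDs can be looked up with `blastdbcmd`, which generally requires the database to have been built with `-parse_seqids`; if `nr_prok` was not, the entries could be pulled from the original `nr` instead.

```shell
#!/usr/bin/env bash
# Sketch only: file names and thresholds below are assumptions, not tested values.

# Read tabular BLAST hits on stdin (columns: sseqid pident qcovs evalue) and
# print the unique subject IDs with query coverage >= $1 and identity >= $2.
hit_ids() {
    awk -F'\t' -v cov="$1" -v id="$2" '$3 >= cov && $2 >= id { print $1 }' | sort -u
}

# Search the query against a database, pull the matching sequences back out,
# and build a new, smaller database from them.
subset_db() {
    local query=$1 db=$2 out=$3

    # 1. Search; a large -max_target_seqs keeps the hit list from being capped early.
    blastp -query "$query" -db "$db" -evalue 1e-10 -max_target_seqs 100000 \
        -outfmt "6 sseqid pident qcovs evalue" -out "$out.hits.tsv"

    # 2. Keep confident hits only (placeholder cutoffs: 70% coverage, 30% identity).
    hit_ids 70 30 < "$out.hits.tsv" > "$out.ids.txt"

    # 3. Extract those entries as FASTA (assumes IDs are indexed, see note above).
    blastdbcmd -db "$db" -entry_batch "$out.ids.txt" -out "$out.fasta"

    # 4. Build the subset database.
    makeblastdb -in "$out.fasta" -dbtype prot -out "$out"
}

# Example call with the hypothetical names:
# subset_db rpoB_query.fasta nr_prok nr_prok_rpoB
```

Using several diverse rpoB queries (for example one bacterial and one archaeal sequence) should help recover divergent homologs that a single query would miss, at the cost of having to check the weaker hits more carefully.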