5.9 years ago
mschmidt
I need to download all (or many) sequences of a specific bacterial gene from the GenBank nuccore database, restricted to entries that are complete genome sequences. I would prefer to use R. Querying 'Bacteria[ORGN] AND gyrB[GENE] AND complete genome[TI]' in the web interface returns >10k hits. I do not want to download whole genome sequences, only the extracted gyrB sequences, to build a local database. I tried:
library(rentrez)
db = "nuccore"
query = "Bacteria[ORGN] AND gyrB[GENE] AND complete[TI]"
found = entrez_search(db, query, config = NULL, retmode = "xml", use_history = FALSE, retmax = 90000)
but this returns the IDs of the whole genome records. Is it possible to get FASTA sequences for the gyrB genes only, or at least the gyrB coordinates? I would rather not download the whole genome sequences of thousands of genomes.
You can get this data from Ensembl Bacteria using the Ensembl Genomes Perl API, or maybe using the R package biomartr.
That would be a great option, but I found that BioMart is not currently available for Ensembl Bacteria: https://support.bioconductor.org/p/82585/
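One possible route that stays within rentrez, offered as a rough sketch rather than a tested pipeline: EFetch on nuccore accepts rettype = "fasta_cds_na", which returns the annotated coding sequences of a record as nucleotide FASTA, with the gene name in a [gene=...] tag in each header. The helper names below (filter_cds_by_gene, fetch_gene_cds) are made up for illustration. The sketch assumes each genome is annotated with a gene qualifier spelled exactly gyrB. Note that every request still transfers all CDS of that genome, so only the storage is reduced, not the download. For >10k records you would also want to batch the IDs and respect NCBI's request rate limits.

```r
# Keep only the records of a multi-FASTA string whose header carries
# a [gene=<gene>] tag. Pure string handling, no network access needed.
filter_cds_by_gene <- function(fasta_text, gene = "gyrB") {
  lines <- strsplit(fasta_text, "\n", fixed = TRUE)[[1]]
  lines <- lines[nzchar(lines)]

  # Number the records: every header line starts a new record
  is_header <- startsWith(lines, ">")
  record_id <- cumsum(is_header)

  # Records whose header contains the exact tag, e.g. "[gene=gyrB]"
  tag <- paste0("[gene=", gene, "]")
  keep_ids <- record_id[is_header & grepl(tag, lines, fixed = TRUE)]

  paste(lines[record_id %in% keep_ids], collapse = "\n")
}

# Hypothetical wrapper: fetch the CDS FASTA of each nuccore ID and keep
# only the matching gene. Requires the rentrez package and network access.
fetch_gene_cds <- function(ids, gene = "gyrB") {
  pieces <- vapply(ids, function(id) {
    cds <- rentrez::entrez_fetch(db = "nuccore", id = id,
                                 rettype = "fasta_cds_na")
    filter_cds_by_gene(cds, gene)
  }, character(1))
  paste(pieces[nzchar(pieces)], collapse = "\n")
}

# Example usage (not run here; start with a small retmax to try it out):
# query <- "Bacteria[ORGN] AND gyrB[GENE] AND complete[TI]"
# found <- rentrez::entrez_search(db = "nuccore", term = query, retmax = 20)
# writeLines(fetch_gene_cds(found$ids), "gyrB.fasta")
```

The [location=...] tag in the retained headers also gives the gyrB coordinates on the genome record, in case those are needed later.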