Hi,
I want to get a nucleotide FASTA file of all genes matching a search query. I have tried a couple of things using esearch, such as:
esearch -db nuccore -query "(mecA) GENE AND "bacteria"[porgn:__txid2]" | efetch -format fasta
esearch -db gene -query "(mecA) GENE AND "bacteria"[porgn:__txid2]" | elink -target nuccore | efetch -format fasta
but they both output very long nucleotide sequences, which makes me think I'm getting the whole genomes in which the gene occurs.
Funnily enough, it works fine when using -db protein, apart from obviously giving me protein FASTAs.
So, what am I doing wrong?
Hi, thanks for the input.
mecA was just an example; the eventual purpose is to extract the sequences of any gene for a given name. I see the same with gyrA or lig and so on. But I think a major issue is that I'm getting whole genomes and/or cassettes, as you say.
But the thing is: I can search the gene database on the NCBI homepage for e.g. mecA (or whatever) and then get the FASTA from each entry, so why can't that be automated? The sequences clearly exist and are correctly linked to the names.
With efetch, you always get the whole sequence. There is no way to download only part of a sequence. You have to download the whole sequence and cut out the region of interest locally. It may be better to download in GenBank format in order to get the positions of all the annotated genes along with the sequence.
With the E-utilities web API it is also possible to specify start, end, and strand. It's a shame that this functionality is still not implemented in Entrez Direct.
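As a rough illustration of the start/end/strand option, here is a minimal Python sketch that builds an efetch request for a sub-region of a nuccore record using the E-utilities parameters seq_start, seq_stop and strand. The function name efetch_region_url and the accession and coordinates in the usage note are made up for illustration; in practice the coordinates would have to come from the annotation of the record.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_region_url(accession, start, stop, strand=1):
    """Build an efetch URL for part of a nuccore record (hypothetical helper).

    start and stop are 1-based, inclusive coordinates.
    strand is 1 for the plus strand and 2 for the minus strand.
    """
    if strand not in (1, 2):
        raise ValueError("strand must be 1 (plus) or 2 (minus)")
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": "fasta",
        "retmode": "text",
        "seq_start": start,
        "seq_stop": stop,
        "strand": strand,
    }
    return EFETCH_URL + "?" + urlencode(params)
```

With a placeholder accession, efetch_region_url("NC_000000.1", 100, 200, strand=2) gives a URL that can be fetched with urllib.request.urlopen or curl. When looping over many records, it is worth pausing between requests to stay within NCBI's rate limits.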
Alright, so I guess I have to go through GenBank and fetch the positions along with the entire sequence. Is there any clever way of doing that apart from writing a filter manually?
Any other suggestions?
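For the GenBank route discussed above, one possible filter looks like the sketch below. It assumes the record was saved as a GenBank flat file (for example with efetch -format gb) and that the gene of interest is annotated as a CDS with a /gene qualifier. The function name extract_gene is made up for illustration. It only handles simple locations such as 4..12 or complement(16..24) and skips join(...) locations, so treat it as a starting point rather than a complete parser; a dedicated GenBank parser would be more robust.

```python
import re

# Matches simple locations only: "12..345" or "complement(12..345)".
LOC_RE = re.compile(r"^(complement\()?<?(\d+)\.\.>?(\d+)\)?$")
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def extract_gene(genbank_text, gene_name):
    """Return a list of (start, stop, strand, sequence) for CDS features
    whose /gene qualifier equals gene_name, from one GenBank record."""
    _, _, rest = genbank_text.partition("\nFEATURES")
    features, _, origin = rest.partition("\nORIGIN")
    # Drop the remainder of the ORIGIN line, then keep only sequence letters.
    body = origin.partition("\n")[2].split("//")[0]
    genome = re.sub(r"[^a-zA-Z]", "", body)

    hits = []
    location = None  # location of the CDS currently being read, if any
    for line in features.splitlines():
        if line[:5] == "     " and line[5:6].strip():
            # A feature key starts in column 6: a new feature begins here.
            parts = line.split()
            is_cds = parts[0] == "CDS" and len(parts) > 1
            location = parts[1] if is_cds else None
        elif location and line.strip() == f'/gene="{gene_name}"':
            match = LOC_RE.match(location)
            if match:  # compound locations such as join(...) are skipped
                start, stop = int(match.group(2)), int(match.group(3))
                seq = genome[start - 1:stop]
                strand = "+"
                if match.group(1):
                    # Minus strand: reverse-complement the slice.
                    seq = seq.translate(COMPLEMENT)[::-1]
                    strand = "-"
                hits.append((start, stop, strand, seq))
            location = None
    return hits
```

Each hit can then be written out as a FASTA entry, for instance with a header built from the accession, gene name and coordinates. A file containing several records can be split on the "//" terminator lines and passed to the function one record at a time.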