Does anyone know a way to use efetch to get only fully sequenced microbial genomes? It is a convenient way to filter by taxonomy and submission date, but when I use 'complete[prop]' I also get expression vectors, e.g. 'Expression vector mce1' for M. tuberculosis. Alternatively, is there a way to filter out engineered sequences?
I'm just asking for the efetch query, not what to do with the resulting id list.
For the record, according to the NCBI Taxonomy Browser:
both 28384 and 81077 are tax_ids for 'artificial sequences'
12908 is the tax_id for 'unclassified sequences', which contains metagenomic and environmental samples
Please be aware that the current taxonomy is not congruent with the DNA sequence data for 'Mycobacterium tuberculosis'. The core genomes of M. tuberculosis and M. bovis are nearly 100% identical. Thus M. tuberculosis and M. bovis are clonal groups within a common species, which is called 'Mycobacterium tuberculosis complex', or MTC for short.
Taxon IDs may change over time, so it is more robust to use taxon names in your query. Use double quotes if a name consists of more than one word.
"Mycobacterium tuberculosis complex"[Organism] AND complete[Properties]
Accession DQ823231.1 (Expression vector mce2, complete sequence) has two source records:
If I am not wrong, you can filter out taxonomy IDs 28384, 81077 and 12908.
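In case it helps, here is a rough sketch of how that exclusion could be written into the search term. It assumes a few things not stated in this thread: that the `txidNNNN[Organism:exp]` form is used to address a taxonomy ID, that the search goes through the E-utilities `esearch` endpoint against the `nuccore` database, and that `retmax=100` is a sensible page size. The helper names `build_term` and `build_esearch_url` are made up for illustration. It only builds the query string and URL; it does not contact NCBI.

```python
from urllib.parse import urlencode

# E-utilities search endpoint (the search step returns the ID list).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Taxonomy IDs mentioned in this thread: two for 'artificial sequences',
# one for 'unclassified sequences'.
EXCLUDED_TAXIDS = (28384, 81077, 12908)


def build_term(organism, excluded_taxids=EXCLUDED_TAXIDS):
    """Build an Entrez term: complete records for `organism`, minus the given taxa."""
    # Double quotes keep a multi-word taxon name together.
    term = f'"{organism}"[Organism] AND complete[Properties]'
    for taxid in excluded_taxids:
        # Assumption: txid<N>[Organism:exp] selects the taxon and its descendants.
        term += f" NOT txid{taxid}[Organism:exp]"
    return term


def build_esearch_url(term, db="nuccore", retmax=100):
    """Return a URL-encoded esearch request for the term (db and retmax are assumptions)."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


term = build_term("Mycobacterium tuberculosis complex")
url = build_esearch_url(term)
print(term)
print(url)
```

The printed term can also be pasted straight into the web search box to check what it returns before scripting anything.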
Yep, that works in this case at least. If you make it an answer I'll happily accept it.