Question

Efetch For Fully Sequenced Microbial Genomes?

1


12.8 years ago

Richard Llewellyn ▴ 180

Does anyone know a way to use efetch to get only fully sequenced microbial genomes? It is a convenient way to filter by taxonomy and submission date, but when I use 'complete[prop]' I also get expression vectors, e.g. 'Expression vector mce1' for M. tuberculosis. Alternatively, is there a way to filter out engineered sequences?

I'm just asking for the efetch query, not what to do with the resulting id list.

entrez taxonomy • 3.3k views

ADD COMMENT • link updated 3.0 years ago by Ram 44k • written 12.8 years ago by Richard Llewellyn ▴ 180

1


If I am not wrong, filter out taxonomy IDs 28384, 81077, and 12908.

ADD REPLY • link 12.8 years ago by Rm 8.3k

0


Yep, that works in this case at least. If you make it an answer I'll happily accept it.

ADD REPLY • link 12.8 years ago by Richard Llewellyn ▴ 180

Ram · Answer 1 · 2012-02-22

3


12.8 years ago

Rm 8.3k

If I am not wrong, filter out taxonomy IDs 28384, 81077, and 12908.

ADD COMMENT • link 12.8 years ago by Rm 8.3k

0


For the record, according to the NCBI taxonomy browser, both 28384 and 81077 are tax_ids for 'artificial sequences', and 12908 is for 'unclassified sequences', which contains metagenomic and environmental samples.

ADD REPLY • link 12.8 years ago by Richard Llewellyn ▴ 180
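As an illustration of the exclusion suggested above, here is a minimal Python sketch that appends NOT clauses for those three taxonomy IDs to a search term. It assumes the `txid<N>[Organism:exp]` form, which selects a taxon together with everything beneath it; the function name `exclude_taxids` and the base term are made up for this example, and the resulting string is only built, not sent to NCBI.

```python
def exclude_taxids(base_term, taxids):
    """Wrap an Entrez term so that records under the given taxonomy IDs are excluded."""
    # One txid clause per ID; [Organism:exp] also covers the taxon's descendants.
    exclusions = " OR ".join(f"txid{taxid}[Organism:exp]" for taxid in taxids)
    # Boolean operators are written in uppercase.
    return f"({base_term}) NOT ({exclusions})"


# Base term taken from the question; the three IDs are the ones named in this thread.
base = '"Mycobacterium tuberculosis"[Organism] AND complete[prop]'
term = exclude_taxids(base, [28384, 81077, 12908])
print(term)
```

The printed string can be pasted into the nucleotide search box to check what it returns before using it in a script.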

0


Also beware of chimeras, see http://blastedbio.blogspot.co.uk/2013/11/entrez-trouble-with-chimeras.html

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 10.5 years ago by Peter 6.0k

Ram · Answer 2 · 2014-07-20

Please be aware that the current taxonomy is not congruent with the DNA sequence data for 'Mycobacterium tuberculosis'. The core genomes of M. tuberculosis and M. bovis are nearly 100% identical, so the two are clonal groups within a common species, which is called the 'Mycobacterium tuberculosis complex', or MTC for short.

Taxonomy IDs may change over time, so it is more robust to use taxon names in your query. Use double quotes if a name consists of more than one word.

"Mycobacterium tuberculosis complex"[Organism] AND complete[Properties]

Accession DQ823231.1 (Expression vector mce2, complete sequence) has two source records:

 source          1..24799
                 /organism="Expression vector mce2"
                 /mol_type="other DNA"
                 /db_xref="taxon:393135"
                 /focus
 source          4443..19156
                 /organism="Mycobacterium tuberculosis H37Rv"
                 /mol_type="other DNA"
                 /strain="H37Rv"
                 /db_xref="taxon:83332"
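To spot records like this one programmatically, a rough Python sketch such as the following could list the organism of every source feature. It is an assumption-laden illustration: `source_organisms` is a hypothetical helper, it expects text that contains only `source` features (as in the excerpt above), and for real records a proper GenBank parser would be the safer choice.

```python
import re

# Feature-table excerpt copied from the record shown above.
FEATURES = '''\
 source          1..24799
                 /organism="Expression vector mce2"
                 /mol_type="other DNA"
                 /db_xref="taxon:393135"
                 /focus
 source          4443..19156
                 /organism="Mycobacterium tuberculosis H37Rv"
                 /mol_type="other DNA"
                 /strain="H37Rv"
                 /db_xref="taxon:83332"
'''


def source_organisms(feature_text):
    """Return one dict (location, organism, taxon) per source feature."""
    records = []
    current = None
    for line in feature_text.splitlines():
        stripped = line.strip()
        start = re.match(r"source\s+(\S+)$", stripped)
        if start:
            # A new source feature begins; collect its qualifiers below.
            current = {"location": start.group(1), "organism": None, "taxon": None}
            records.append(current)
            continue
        if current is None:
            continue
        organism = re.match(r'/organism="(.+)"$', stripped)
        taxon = re.match(r'/db_xref="taxon:(\d+)"$', stripped)
        if organism:
            current["organism"] = organism.group(1)
        elif taxon:
            current["taxon"] = int(taxon.group(1))
    return records


records = source_organisms(FEATURES)
for record in records:
    print(record["location"], record["organism"], record["taxon"])
```

A record with more than one entry in `records` mixes sequence from several organisms and is worth a closer look.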

DQ823231.1 is included in the result set due to its second source record. The following query will exclude artificial sequences:

("Mycobacterium tuberculosis complex"[Organism] NOT "artificial sequences"[Organism]) AND complete[Properties]
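For scripted use, a term like this is normally passed to the E-utilities esearch endpoint, which returns the ID list that efetch then consumes. The sketch below is one possible way to do that with the Python standard library, not a definitive implementation: `build_esearch_url` and `fetch_ids` are names invented for this example, the `nuccore` database and `retmax` value are assumptions, and `fetch_ids` is defined but deliberately not called, so nothing is sent over the network.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Query from the answer above, with the boolean operators in uppercase.
TERM = ('("Mycobacterium tuberculosis complex"[Organism] '
        'NOT "artificial sequences"[Organism]) AND complete[Properties]')


def build_esearch_url(term, db="nuccore", retmax=20):
    """Build an esearch URL; urlencode takes care of quotes, brackets and spaces."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"


def fetch_ids(url):
    """Run the search and return the list of matching IDs (needs network access)."""
    with urlopen(url) as response:
        return json.load(response)["esearchresult"]["idlist"]


url = build_esearch_url(TERM)
print(url)
# ids = fetch_ids(url)  # uncomment to query NCBI
```

Building the URL separately from fetching makes it easy to inspect the encoded term first, which helps when a query unexpectedly returns vectors or other engineered records.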