Hey guys, I do phylogenetics of viruses and I'm currently working on an outbreak analysis, so I'm doing some phylogeography too. If a sequence has no country of origin or collection date, I have to take it out of my dataset. I have >200 sequences per dataset and I really don't want to go through GenBank manually. I haven't had any success writing or editing Biopython scripts to extract the country from the GenBank file. Any help would be appreciated! Thanks!
Can you give a couple of examples of GenBank entries (accession numbers) and the field that contains the country annotation? Also, are you looking for a Python-only solution?
Here are a couple of accession numbers: KT279761, KC692509, KC692496
The country annotation is under FEATURES, in the source feature, for example:
FEATURES             Location/Qualifiers
     source          1..10735
                     /organism="Dengue virus 1"
                     /mol_type="genomic RNA"
                     /serotype="1"
                     /isolate="HNRG14635"
                     /isolation_source="serum"
                     /host="Homo sapiens"
                     /db_xref="taxon:11053"
                     /country="Argentina: Buenos Aires"
                     /collection_date="05-May-2009"
I'm looking for any solution, but I thought Python was my best bet, with Biopython and all.
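One possible approach, as a rough sketch rather than a tested pipeline: since the country and collection date sit in the source feature as shown above, you can pull them straight out of the GenBank flat-file text with the standard library. The function name `source_metadata` and the regular expressions are my own for illustration, and I'm assuming you already have the records saved as text in the usual flat-file layout (for example downloaded from NCBI, or fetched with Biopython's `Entrez.efetch(db="nucleotide", id=..., rettype="gb", retmode="text")`). I also fall back to `/geo_loc_name`, on the assumption that newer records may carry that qualifier instead of `/country`.

```python
import re

# Matches qualifiers such as /country="Argentina: Buenos Aires"
QUALIFIER = re.compile(r'/(\w+)="([^"]*)"')

# The source feature: from its key line up to the next feature key,
# the next top-level section (e.g. ORIGIN), or the end of the record.
SOURCE_BLOCK = re.compile(
    r"^ {5}source\s.*?(?=^ {5}\S|^\S|\Z)", re.MULTILINE | re.DOTALL
)


def source_metadata(genbank_text):
    """Yield (accession, country, collection_date) for each record.

    Missing values are returned as None so they are easy to filter on.
    """
    for record in genbank_text.split("\n//"):
        accession = re.search(r"^ACCESSION\s+(\S+)", record, re.MULTILINE)
        if accession is None:
            continue  # whitespace left over after the last record
        block = SOURCE_BLOCK.search(record)
        qualifiers = {}
        if block is not None:
            for key, value in QUALIFIER.findall(block.group(0)):
                # Collapse line wraps inside long qualifier values.
                qualifiers.setdefault(key, " ".join(value.split()))
        # Assumption: some records use /geo_loc_name rather than /country.
        country = qualifiers.get("country") or qualifiers.get("geo_loc_name")
        yield accession.group(1), country, qualifiers.get("collection_date")
```

To drop the incomplete sequences, keep only the tuples where both values are present, e.g. `[row for row in source_metadata(text) if row[1] and row[2]]`. If you would rather stay in Biopython, `SeqIO.parse(handle, "genbank")` gives you records whose source feature has a `qualifiers` dict holding the same keys, with each value as a list of strings.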