Extracting the country from a GenBank file with >200 sequences
8.0 years ago
tpaisie ▴ 80

Hey guys, I do phylogenetics of viruses and I'm currently working on an outbreak analysis, so I'm doing some phylogeography too. Obviously, if a sequence has no country of origin or collection date, I have to take it out of my dataset. I have >200 sequences per dataset and I really don't want to waste my precious time going through GenBank manually. I haven't been successful in making or editing any Biopython scripts to extract the country from the GenBank file. Any help would be appreciated! Thanks!

phylogenetics python genbank extract biopython

Can you give a couple of examples of GenBank entries (accession numbers) and the field that contains the country annotation? Also, are you looking for a Python-only solution?


Here are a couple of accession numbers: KT279761, KC692509, KC692496

The country annotation is in FEATURES, under the source feature, for example:

FEATURES             Location/Qualifiers
     source          1..10735
                     /organism="Dengue virus 1"
                     /mol_type="genomic RNA"
                     /serotype="1"
                     /isolate="HNRG14635"
                     /isolation_source="serum"
                     /host="Homo sapiens"
                     /db_xref="taxon:11053"
                     /country="Argentina: Buenos Aires"
                     /collection_date="05-May-2009"

I'm looking for any solution, but I thought Python was my best bet, with Biopython and all.
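If you already have the records downloaded as a local GenBank file, you don't need Entrez at all: a minimal sketch with Biopython's `Bio.SeqIO` (the file name `sequences.gb` is a placeholder for your own file) could look like this:

```python
def get_source_qualifier(record, name):
    """Return the first value of a qualifier (e.g. 'country') from the
    source feature of a parsed record, or None if it is absent."""
    for feature in record.features:
        if feature.type == "source":
            values = feature.qualifiers.get(name, [])
            if values:
                return values[0]
    return None

if __name__ == "__main__":
    from Bio import SeqIO  # requires Biopython

    # 'sequences.gb' is a placeholder for your multi-record GenBank file
    for record in SeqIO.parse("sequences.gb", "genbank"):
        print(record.id, get_source_qualifier(record, "country"), sep=",")
```

Records lacking a `/country` qualifier will print `None`, which makes it easy to spot the sequences you need to drop from the dataset.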

8.0 years ago

I'm adding a Python solution that you may later modify to include more data. You just need to specify the location of a file with the accessions (one per line), where it says accessions.txt:
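A minimal sketch of such a script, assuming Biopython's Entrez interface (the `efetch` call and GBSeq field names match the modified version posted in the comments below; set `Entrez.email` to your own address):

```python
def extract_country(entry):
    """Yield every /country qualifier found in the source features of a
    parsed GBSeq entry (a dict as returned by Entrez.read)."""
    sources = [feature for feature in entry['GBSeq_feature-table']
               if feature['GBFeature_key'] == 'source']
    for source in sources:
        for qual in source['GBFeature_quals']:
            if qual['GBQualifier_name'] == 'country':
                yield qual['GBQualifier_value']

if __name__ == '__main__':
    from Bio import Entrez  # requires Biopython

    # Read the accessions from a file, one per line
    with open('accessions.txt') as f:
        ids = [line.strip() for line in f if line.strip()]

    Entrez.email = 'you@example.com'  # insert your email here
    handle = Entrez.efetch('nuccore', id=ids, retmode='xml')
    response = Entrez.read(handle)

    for entry in response:
        accession = entry['GBSeq_primary-accession']
        for country in extract_country(entry):
            print(accession, country, sep=',')
```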

The output is separated by commas, so you can later read it as a CSV. Using your IDs, I got:

KT279761,Haiti
KC692509,Argentina: Buenos Aires
KC692496,Argentina: Buenos Aires

What can I add to get the date as well?


Hi,

Like tpaisie, I also want to extract specific information ("collection_date" in my case) from a GenBank file. I tried to slightly modify your code to get the "collection_date", but it doesn't seem to work. Any ideas? Here is my code:

from Bio import Entrez

# Read the accessions from a file
accessions_file = 'accession_nuber_list.txt'
with open(accessions_file) as f:
    ids = f.read().split('\n')

# Fetch the entries from Entrez
Entrez.email = 'emailadress'  # Insert your email here
handle = Entrez.efetch('nuccore', id=ids, retmode='xml')
response = Entrez.read(handle)

# Parse the entries to get the collection date
def extract_date(entry):
    sources = [feature for feature in entry['GBSeq_feature-table']
               if feature['GBFeature_key'] == 'source']

    for source in sources:
        qualifiers = [qual for qual in source['GBFeature_quals']
                      if qual['GBQualifier_name'] == 'collection']

        for qualifier in qualifiers:
            yield qualifier['GBQualifier_value']

for entry in response:
    accession = entry['GBSeq_primary-accession']
    for date in extract_date(entry):
        print(accession, date, sep=',')

It seems the key for the date is collection_date, not collection. Try changing that.
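For reference, here is the same generator with only the qualifier name corrected to `collection_date` (everything else is unchanged from the code above):

```python
def extract_date(entry):
    """Yield every /collection_date qualifier from the source features of a
    parsed GBSeq entry (a dict as returned by Entrez.read)."""
    sources = [feature for feature in entry['GBSeq_feature-table']
               if feature['GBFeature_key'] == 'source']

    for source in sources:
        qualifiers = [qual for qual in source['GBFeature_quals']
                      if qual['GBQualifier_name'] == 'collection_date']

        for qualifier in qualifiers:
            yield qualifier['GBQualifier_value']
```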

8.0 years ago

Using this simple XSLT stylesheet, transform.xsl:

Run:

$ curl -s  "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=KT279761,KC692509,KC692496&retmode=xml" |\
xsltproc --novalid transform.xsl -

KT279761 Haiti
KC692509 Argentina: Buenos Aires
KC692496 Argentina: Buenos Aires

Thank you! Instead of a URL, can I use the XML file I already have?

Also I am getting this error when I try to run it:

"transform.xsl:1: namespace error : xmlns:xsl: '


Yes, I've replaced my text with a gist on GitHub.


Thank you for your help!
