4.3 years ago
flogin
Hey,
I'm studying Bio.Entrez to retrieve information from NCBI...
I've already written basic scripts to retrieve sequences based on protein or nucleotide IDs, but I'm wondering if I can retrieve all proteins based on a specific taxonomy ID...
So I have a 3 column csv file, like this:
Reoviridae,Cardoreovirus,Eriocheir sinensis reovirus
Reoviridae,Mimoreovirus,Micromonas pusilla reovirus
Reoviridae,Orbivirus,African horse sickness virus
Reoviridae,Orbivirus,Bluetongue virus
And I wrote, at the moment, this:
#!/usr/bin/python3
# -*- coding: utf-8 -*-
from Bio import Entrez
import argparse, csv
import xml.etree.ElementTree as ET

# NCBI asks for a contact address; replace this placeholder with your own.
Entrez.email = "you@example.com"

parser = argparse.ArgumentParser(description='This script reads a CSV file and returns protein information by viral family.')
parser.add_argument("-in", "--input", help="CSV file with 3 columns", required=True)
args = parser.parse_args()
input_file = args.input

# Collect the third column (the virus name) from every row.
with open(input_file, 'r') as in_file:
    reader_in_file = csv.reader(in_file, delimiter=',')
    viral_family_lst = []
    for line in reader_in_file:
        viral_family = line[2].rstrip('\n')
        viral_family_lst.append(viral_family)

# Look up each name in the Taxonomy database and print the matching IDs.
for viral_family in viral_family_lst:
    handle_id_var = Entrez.esearch(db="Taxonomy", term=viral_family, retmode='xml')
    tree = ET.parse(handle_id_var)
    root = tree.getroot()
    for app in root.findall('IdList'):
        for l in app.findall('Id'):
            tax_id = l.text  # renamed from "id" to avoid shadowing the built-in
            print(tax_id)
So, at the moment, this script returns the taxonomy ID for each viral species, and I don't know how I can use these IDs to retrieve all proteins for each virus...
Using EntrezDirect. Translate into Python as needed:
Thanks genomax, I'll test a Python version from this line!
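For anyone following the same route, here is a minimal Python sketch of one way to go from a taxonomy ID to protein records with Bio.Entrez. It is not a translation of the command above. It assumes the standard NCBI search term `txid<ID>[Organism:exp]`, which matches the taxon and everything below it, and it uses the history server so large result sets can be fetched in batches. The function names `taxid_query` and `fetch_proteins_for_taxid`, the `batch_size` default and the e-mail argument are made up for illustration.

```python
def taxid_query(taxid):
    """Build a protein search term for a taxon and all of its descendants."""
    return f"txid{int(taxid)}[Organism:exp]"


def fetch_proteins_for_taxid(taxid, email, batch_size=200):
    """Yield FASTA text, one batch at a time, for all proteins under taxid."""
    # Imported here so taxid_query can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address

    # usehistory="y" keeps the result set on NCBI's history server,
    # so the IDs do not have to be passed back in every efetch call.
    handle = Entrez.esearch(db="protein", term=taxid_query(taxid), usehistory="y")
    result = Entrez.read(handle)
    handle.close()

    count = int(result["Count"])
    for start in range(0, count, batch_size):
        fetch = Entrez.efetch(
            db="protein",
            rettype="fasta",
            retmode="text",
            retstart=start,
            retmax=batch_size,
            webenv=result["WebEnv"],
            query_key=result["QueryKey"],
        )
        yield fetch.read()
        fetch.close()
```

In the script from the question, each printed taxonomy ID could then be passed along, for example `for fasta in fetch_proteins_for_taxid(tax_id, "you@example.com"): print(fasta)`. Swapping `rettype="fasta"` for `rettype="gb"` should return GenBank-style records instead, if annotations are needed.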