2.4 years ago
pramirez
I annotated a list of protein sequences using NCBI, so I now have the proteins and their corresponding NCBI accession numbers. I want to use Biopython to look up the taxonomy of each sequence and print only the phylum. I wrote a script that successfully fetches the entries from the protein database, but it returns all the information for each record. Does anyone know how I can obtain the phylum only? Thanks!
import pandas as pd
from Bio import Entrez

df = pd.read_csv('final.csv', sep='\t', decimal='.')
Entrez.email = '#####'
species_list = ['OGI11933.1']

def get_tax_data(taxid):
    # Fetch the record for one protein accession and parse the XML
    search = Entrez.efetch(id=taxid, db="protein", retmode="xml")
    return Entrez.read(search)

# Lists that collect the accessions and their fetched records
taxid_list = []
data_list = []

for species in species_list:
    taxid = species  # Apply your functions
    data = get_tax_data(taxid)
    #lineage = {d['Rank']:d['ScientificName'] for d in data[0]['GBSeq_taxonomy'] if d['Rank'] in ['phylum']}
    taxid_list.append(taxid)  # Append the data to lists already initiated
    data_list.append(data)
    print(data)
This returns all the information on the entry:
[{'GBSeq_locus': 'OGI11933', 'GBSeq_length': '230', 'GBSeq_moltype': 'AA', 'GBSeq_topology': 'linear', 'GBSeq_division': 'ENV', 'GBSeq_update-date': '19-OCT-2016', 'GBSeq_create-date': '19-OCT-2016', 'GBSeq_definition': 'MAG: 30S ribosomal protein S3 [Candidatus Micrarchaeota archaeon RBG_16_36_9]', 'GBSeq_primary-accession': 'OGI11933', 'GBSeq_accession-version': 'OGI11933.1', 'GBSeq_other-seqids': ['gb|OGI11933.1|', 'gnl|WGS:MFRR|A3K64_00470', 'gi|1083728961'], 'GBSeq_project': 'PRJNA288027', 'GBSeq_keywords': ['ENV', 'Metagenome Assembled Genome', 'MAG'], 'GBSeq_source': 'Candidatus Micrarchaeota archaeon RBG_16_36_9 (subsurface metagenome)', 'GBSeq_organism': 'Candidatus Micrarchaeota archaeon RBG_16_36_9', 'GBSeq_taxonomy': 'Archaea; Candidatus Micrarchaeota', 'GBSeq_references':}]
Thanks!
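For what it's worth, the output above shows 'GBSeq_taxonomy' as a plain string ('Archaea; Candidatus Micrarchaeota'), so it has no 'Rank' or 'ScientificName' keys for the commented-out comprehension to filter on. Those keys match the shape of the 'LineageEx' list that records from the NCBI taxonomy database carry. Below is a minimal sketch under that assumption. The helper names `split_lineage` and `phylum_from_lineage_ex` are made up for illustration, and the sample list is hand-written rather than fetched from NCBI.

```python
def split_lineage(gbseq_taxonomy):
    """Split a 'GBSeq_taxonomy' string into lineage names.

    The string carries names only, so the rank of each name is unknown here.
    """
    return [name.strip() for name in gbseq_taxonomy.split(";") if name.strip()]


def phylum_from_lineage_ex(lineage_ex):
    """Return the ScientificName of the entry ranked 'phylum', or None."""
    for node in lineage_ex:
        if node.get("Rank") == "phylum":
            return node.get("ScientificName")
    return None


# Hand-written sample shaped like a taxonomy-database 'LineageEx' list
# (hypothetical values for illustration, not fetched from NCBI).
sample_lineage_ex = [
    {"Rank": "superkingdom", "ScientificName": "Archaea"},
    {"Rank": "phylum", "ScientificName": "Candidatus Micrarchaeota"},
]

names = split_lineage("Archaea; Candidatus Micrarchaeota")
phylum = phylum_from_lineage_ex(sample_lineage_ex)
print(names)
print(phylum)
```

With real data, the list would presumably come from something like Entrez.read(Entrez.efetch(db="taxonomy", id=..., retmode="xml"))[0]["LineageEx"], where the id is the organism's taxon ID. How to get that ID from the protein record is not visible in the truncated output above, so that step is left open here.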