Hey guys,
I'm in the following situation:
I have a tab-delimited file like this:
aag2_ctg_1000 YP_009342179.1 neg 104655 105278
aag2_ctg_1000 YP_009666655.1 pos 257933 258349
aag2_ctg_1000 YP_073558.1 neg 15802 16581
aag2_ctg_1000 YP_073558.1 neg 61637 62266
aag2_ctg_1000 YP_073558.1 pos 43922 44227
aag2_ctg_1000 YP_073558.1 pos 334345 334659
aag2_ctg_1000 YP_073558.1 pos 481376 482059
aag2_ctg_1001 NP_043933.1 pos 123492 123662
aag2_ctg_1001 NP_077596.1 pos 422453 422671
aag2_ctg_1001 YP_001426837.1 pos 43143 43544
The 1st column has a contig name and the second column a protein ID from NCBI. All of these proteins are viral proteins, and to retrieve the taxonomy of the virus behind each protein I wrote the following script.
Code modified after genomax's suggestions:
#!/usr/bin/python3
import csv
import re
import time

import numpy as np
import pandas as pd
from Bio import Entrez

Entrez.email = ***@***
Entrez.api_key = *****************************************************

# input_file (the path prefix of the tab-delimited file shown above) is defined
# earlier in the full script
error_tax = []
prot_id = []
count = 0

with open(input_file+'.fmt.blastx.filtred.annot.bed.merged.tax.tmp', 'r') as tax_out_tmp, \
     open(input_file+'.fmt.blastx.filtred.annot.bed.merged.tax.tmp2', 'w+', newline='') as tax_out_tmp2:
    tax_tmp_reader = csv.reader(tax_out_tmp, delimiter='\t')
    tax_tmp2_writer = csv.writer(tax_out_tmp2)

    # collect the protein IDs (2nd column) and drop duplicates, preserving order
    for line in tax_tmp_reader:
        prot_id.append(line[1])
    prot_id = list(dict.fromkeys(prot_id))

    # step 1: protein accession -> taxon ID, scraped from the GenBank record
    count = 1
    tax_reference_list = []
    tax_levels_list = []
    for var_prot_id in prot_id:
        handle_code = Entrez.efetch(db='protein', id=var_prot_id, rettype='gb')
        handle_code_var = str(re.findall(r'\/db_xref="taxon:.*', handle_code.read()))
        handle_code_var = re.sub(r"\[.*:", '', handle_code_var)
        handle_code_var = re.sub(r'".*\]', '', handle_code_var)
        tax_reference_list.append([var_prot_id, handle_code_var])
        print(f'{count}/{len(prot_id)}')
        count += 1
        time.sleep(0.15)

    # step 2: taxon ID -> lineage (superkingdom/order/family/genus/species)
    count = 1
    for i in tax_reference_list:
        sk = ''
        od = ''
        fm = ''
        gn = ''
        sp = ''
        prot_id = i[0].rstrip('\n')
        tax_id = i[1].rstrip('\n')
        try:
            handle_tax_data = Entrez.efetch(db="Taxonomy", id=tax_id, retmode="xml")
            record_tax_data = Entrez.read(handle_tax_data)
            for x in record_tax_data[0]["LineageEx"]:
                if 'superkingdom' in x.values():
                    sk = x['ScientificName']
                if 'order' in x.values():
                    od = x['ScientificName']
                if 'family' in x.values():
                    fm = x['ScientificName']
                if 'genus' in x.values():
                    gn = x['ScientificName']
                if 'species' in x.values():
                    sp = x['ScientificName']
        except Exception:
            print(f'error with: {tax_id}')
            continue
        tax_levels_list.append([prot_id, tax_id, sk, od, fm, gn, sp])
        print(f'{count}/{len(tax_reference_list)}')
        count += 1
        time.sleep(0.15)

    tax_tmp2_writer.writerows(tax_levels_list)
# the with-statement closes both temporary files, so no explicit close() is needed

# step 3: join the taxonomy table back onto the original tab-delimited file
tax_complete_list = []
with open(input_file+'.fmt.blastx.filtred.annot.bed.merged.tax.tmp', 'r') as tax_out_tmp:
    tax_tmp_reader = csv.reader(tax_out_tmp, delimiter='\t')
    for row_prot in tax_tmp_reader:
        with open(input_file+'.fmt.blastx.filtred.annot.bed.merged.tax.tmp2', 'r') as tax_out_tmp2:
            tax_tmp2_reader = csv.reader(tax_out_tmp2, delimiter=',')
            for row_tax in tax_tmp2_reader:
                if row_prot[1] == row_tax[0]:
                    tax_complete_list.append([row_prot[0]+':'+row_prot[3]+'-'+row_prot[4],
                                              row_prot[1], row_tax[1], row_tax[2], row_tax[3],
                                              row_tax[4], row_tax[5], row_tax[6],
                                              row_prot[2], row_prot[3], row_prot[4]])

df_tax = pd.DataFrame(tax_complete_list,
                      columns=["Element-ID", "Protein-ID", "Virus-ID", "Super-Kingdom", "Order",
                               "Family", "Genus", "Species", "Sense", "Start", "End"])
df_tax = df_tax.replace(r'^\s*$', np.nan, regex=True)
df_tax.to_csv(input_file+'.fmt.blastx.filtred.annot.bed.merged.tax.csv', sep=',', index=False, na_rep='NULL')
So, for small datasets (< 10k lines) this script works perfectly, but for some datasets (40k or 80k lines) the script just hangs at the step that sends a protein ID to the NCBI protein database. I notice this because a protein ID is printed but the script doesn't move on, e.g.:
Protein ID: YP_009465730.1
DONE:2079134
Protein ID: YP_009507247.1
DONE:2083181
Protein ID: YP_009507248.1
DONE:2083181
Protein ID: YP_009507248.1
No errors are reported; it just sits for hours on the same protein. I think it may be a problem with the connection to NCBI, but I have no idea how to solve it. Can anyone help?
Thanks !
So if I understand correctly, your code works fine for a small subset of IDs (lines) but not for a larger chunk, so your code should be technically okay. In your code you are requesting information from servers (here NCBI). I do not know how Biopython does it internally, but these should be more or less like API requests (just wrapped by Biopython). For such requests these servers usually set a limit on how many requests can be made (per minute/per hour) from a given IP (your machine); I am not sure what NCBI sets its limit to. My guess is that you do not cross this threshold for 10k lines, but you do for 40k lines and above.
What you could try, in my opinion, is the following (like a dirty hack): since you know 10k lines are not a problem, send requests to the server in batches of 10k and after every batch put your code to sleep; you will have to determine the sleep time by trial and error. What sleep does is pause your code for a set time, so you do not make requests to NCBI during that time and hopefully do not run over the hourly(?) threshold. I do not know exactly how sleep works in Python, but maybe this is helpful. I had the same problem making requests to Ensembl and the sleep trick works nicely for me.
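For example, a minimal sketch of that batching idea (the batch size, the sleep time, and the fetch_one() helper and prot_ids list are placeholders you would replace with your own values and query code):

import time

def batched(items, size):
    # yield the IDs in consecutive chunks of the given size
    for i in range(0, len(items), size):
        yield items[i:i + size]

for batch in batched(prot_ids, 10000):   # prot_ids: your list of protein IDs
    for acc in batch:
        fetch_one(acc)                   # your existing single-ID query
    time.sleep(600)                      # pause between batches; tune by trial and error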
Nice, I'm gonna try that.
Have you signed up for an NCBI_API_KEY and are you using it? If you have not, you should do that first. When you are doing massive queries like this you need to build in some delay to prevent overloading a public resource.
I use Entrez.email and put a sleep time of 0.4 seconds after each query. Should the time be greater than this?
You should sign up for an NCBI API KEY. See the Frequency/Timing section of the EntrezDirect page for the query limits. I am sure there is a taxdump file that has an association of accession numbers with taxIDs. Will look for it later. It would be ideal to parse that file rather than use Eutils in this case.
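For reference, a minimal sketch of staying under those documented limits (3 requests/second without an API key, 10 with one) when using Biopython; the e-mail, key, and accessions list are placeholders:

import time
from Bio import Entrez

Entrez.email = "you@example.com"        # always set your e-mail
Entrez.api_key = "YOUR_NCBI_API_KEY"    # raises the limit from 3 to 10 requests/second

delay = 1.0 / 10 if Entrez.api_key else 1.0 / 3

for acc in accessions:                  # accessions: your de-duplicated protein IDs
    handle = Entrez.efetch(db='protein', id=acc, rettype='gb')
    record_text = handle.read()
    handle.close()
    time.sleep(delay)                   # stay under the per-second request limit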
Right, thanks genomax. I created an API key and put it inside my script with Entrez.api_key. I'm going to test now whether everything runs normally.
While this answer does not address the problem described in the original post (that has been dealt with in the comments above), it is a preferred way to achieve the same end, especially with a large number of proteins involved.
Parsing this protein-accession-to-taxonomy file (note: 5.6 GB download) locally would be much faster than doing EntrezDirect queries.
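As a rough sketch of that local approach (the file name and column layout assume the standard prot.accession2taxid.gz dump from NCBI; the prot_ids list and paths are placeholders):

import gzip

needed = set(prot_ids)                          # your de-duplicated protein accessions (with version)
acc2taxid = {}
with gzip.open('prot.accession2taxid.gz', 'rt') as fh:
    next(fh)                                    # skip header: accession, accession.version, taxid, gi
    for line in fh:
        fields = line.rstrip('\n').split('\t')
        if fields[1] in needed:                 # accession.version column
            acc2taxid[fields[1]] = fields[2]    # taxid column
print(f'found taxids for {len(acc2taxid)} of {len(needed)} accessions')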
Thanks for your suggestions genomax. In fact, doing the search locally could be a good choice; I'm going to test this. Moreover, your suggestion to run this locally made me think about reducing the number of online queries: I was sending all protein IDs from a BLAST output (around 72k protein IDs), but after removing redundant protein IDs only 1644 remained, which greatly reduces the online search. I now build a table of the taxonomy associated with each protein ID and then cross the BLAST output with that taxonomy table. This quick fix avoids the huge online search.
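A minimal sketch of that de-duplicate-then-join idea with pandas (the file names, column names, and the lookup_taxonomy() helper are placeholders for however you build the per-ID taxonomy table, online or locally):

import pandas as pd

blast = pd.read_csv('blast_hits.tsv', sep='\t',
                    names=['Contig', 'Protein-ID', 'Sense', 'Start', 'End'])
unique_ids = blast['Protein-ID'].drop_duplicates()       # ~72k rows -> ~1644 unique IDs
tax = lookup_taxonomy(unique_ids)                        # one taxonomy row per Protein-ID
merged = blast.merge(tax, on='Protein-ID', how='left')   # cross BLAST hits with taxonomy
merged.to_csv('blast_hits.tax.csv', index=False)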
Excellent.
I got this search to complete, but sometimes there is still a bug: the script just hangs when sending one protein ID or taxonomy ID. I set the NCBI e-mail and API key, and put time.sleep(0.15) after each search (because with an API key we can send 10 queries per second). I'll update the code in the original post. Maybe there is some Python library that can be used to avoid this problem?
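One way to keep a single stuck request from blocking the whole run is to set a global socket timeout (so a stalled connection raises an error instead of hanging, assuming Biopython does not override it with its own timeout) and retry the call a few times. A minimal sketch; the timeout, retry count, wait time, and the example accession are choices to tune, not anything prescribed by Biopython:

import socket
import time
from Bio import Entrez

socket.setdefaulttimeout(60)            # a network read stalled for > 60 s raises an error

def efetch_with_retries(retries=3, wait=5, **kwargs):
    # retry a failed or stalled Entrez.efetch call a few times before giving up
    for attempt in range(retries):
        try:
            return Entrez.efetch(**kwargs)
        except Exception as err:
            print(f'attempt {attempt + 1} failed: {err}')
            time.sleep(wait)
    return None

handle = efetch_with_retries(db='protein', id='YP_009465730.1', rettype='gb')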
I still think you should do the search locally with the file I linked above.