Hello all i am currently using the latest blastn: 2.16.0+ i have a protein sequence and i want to tblastn it aganist refseq reference genome using the remote function in the offline Blast+ that i have I am building a tool and i need to see from what reference genome(basically the start and end postion) this protein sequence is mapped onto here is the following command: tblastn -query query.fasta -db refseq_representative_genomes -remote -out results.txt -max_target_seqs 5 -entrez_query "txid4932[Organism]"
please tell me what exactly should i use in db?
I have also tried using the API service but it doesn't work, it only works for nt
here's the sample code:
import requests
import time
accession="AJR77291.1"
taxid='4932'
subj_seq_url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&rettype=fasta&retmode=text&id={accession}"
response = requests.get(subj_seq_url)
if response.status_code == 200 and response.text:
sequence = "".join(line.strip() for line in response.text.splitlines() if not line.startswith(">"))
print(f"Retrieved sequence for accession {accession}:\n{sequence[:60]}...") # Show a snippet
else:
print(f"Failed to fetch sequence for accession {accession}. Skipping to the next accession...")
tblastn_url = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"
tblastn_params = {
"CMD": "Put",
"PROGRAM": "tblastn",
"DATABASE": "ref_euk_rep_genomes", #prok_complete_genomes--> for prokaryotes
"QUERY": sequence,
"ENTREZ_QUERY": "txid4932 [organism]", # Replace with actual taxid
"HITLIST_SIZE": 5,
"FORMAT_TYPE": "Text"
}
tblastn_response = requests.post(tblastn_url, data=tblastn_params)
with open("tblastn_1.txt", "w", encoding='UTF-8') as file:
file.write(tblastn_response.text)
if "RID" in tblastn_response.text and "RTOE" in tblastn_response.text:
rid_start = tblastn_response.text.find("RID = ") + len("RID = ")
rid_end = tblastn_response.text.find("\n", rid_start)
rid2 = tblastn_response.text[rid_start:rid_end].strip()
rtoe_start = tblastn_response.text.find("RTOE = ") + len("RTOE = ")
rtoe_end = tblastn_response.text.find("\n", rtoe_start)
rtoe2 = int(tblastn_response.text[rtoe_start:rtoe_end].strip())
print(f"tBLASTn job submitted successfully for {accession}. RID: {rid2}, Estimated time: {rtoe2} seconds.")
else:
print(f"Failed to submit tBLASTn for {accession} using organism and taxid. Trying with genus...")
print(f"Waiting for tBLASTn for {accession} to complete...")
time.sleep(rtoe2)
while True:
result_params = {"CMD": "Get", "FORMAT_OBJECT": "SearchInfo", "RID": rid2}
result_response2 = requests.get(tblastn_url, params=result_params)
if "Status=WAITING" in result_response2.text:
print("Searching...")
time.sleep(60)
continue
if "Status=FAILED" in result_response2.text:
print(f"tBLASTn search failed for {accession}. Trying next accession...")
break
if "Status=READY" in result_response2.text:
print(f"tBLASTn search complete for {accession}. Retrieving results...")
break
result_params = {"CMD": "Get", "FORMAT_TYPE": "Text", "RID": rid2}
result_response2 = requests.get(tblastn_url, params=result_params)
output_file = f"tblastn_results1_{accession}.txt"
with open(output_file, "w") as file:
file.write(result_response2.text)
print(f"tBLASTn results for {accession} saved to {output_file}.")
small comment already : you might be better of using the tabular output format , it's much easier to parse afterwards
Thanks,i'll keep that in mind. but i want it to work first, when i try the remote blast offline command i'm getting a:
NCBI remote blast is meant as a convenience for use with a small number of sequences. It should not be built into a tool or anything that you need to reliably work.
Web blast also has the following limits in place that you should keep in mind ( New web BLAST default parameters and search limits are now in place at NCBI ) It would be best to download the refseq_pro/eu-karyotic genomes locally and run your search against the database, if you want this to reliably work. NCBI makes preformatted blast databases available for these.