Question

Blast+ remote search for refseq reference genome

0

Entering edit mode

5 weeks ago

Chiranth • 0

Hello all i am currently using the latest blastn: 2.16.0+ i have a protein sequence and i want to tblastn it aganist refseq reference genome using the remote function in the offline Blast+ that i have I am building a tool and i need to see from what reference genome(basically the start and end postion) this protein sequence is mapped onto here is the following command: tblastn -query query.fasta -db refseq_representative_genomes -remote -out results.txt -max_target_seqs 5 -entrez_query "txid4932[Organism]"

please tell me what exactly should i use in db?

Here is the database that i actually wanna tblastn it aganist

I have also tried using the API service but it doesn't work, it only works for nt

here's the sample code:

import requests
import time
accession="AJR77291.1"
taxid='4932'

subj_seq_url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&rettype=fasta&retmode=text&id={accession}"
response = requests.get(subj_seq_url)
if response.status_code == 200 and response.text:
            sequence = "".join(line.strip() for line in response.text.splitlines() if not line.startswith(">"))
            print(f"Retrieved sequence for accession {accession}:\n{sequence[:60]}...")  # Show a snippet
else:
            print(f"Failed to fetch sequence for accession {accession}. Skipping to the next accession...")


tblastn_url = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"
tblastn_params = {
    "CMD": "Put",
    "PROGRAM": "tblastn",
    "DATABASE": "ref_euk_rep_genomes",  #prok_complete_genomes--> for prokaryotes
    "QUERY": sequence,
    "ENTREZ_QUERY": "txid4932 [organism]",  # Replace with actual taxid
    "HITLIST_SIZE": 5,
    "FORMAT_TYPE": "Text"
}


tblastn_response = requests.post(tblastn_url, data=tblastn_params)
with open("tblastn_1.txt", "w", encoding='UTF-8') as file:
        file.write(tblastn_response.text)
if "RID" in tblastn_response.text and "RTOE" in tblastn_response.text:
            rid_start = tblastn_response.text.find("RID = ") + len("RID = ")
            rid_end = tblastn_response.text.find("\n", rid_start)
            rid2 = tblastn_response.text[rid_start:rid_end].strip()

            rtoe_start = tblastn_response.text.find("RTOE = ") + len("RTOE = ")
            rtoe_end = tblastn_response.text.find("\n", rtoe_start)
            rtoe2 = int(tblastn_response.text[rtoe_start:rtoe_end].strip())
            print(f"tBLASTn job submitted successfully for {accession}. RID: {rid2}, Estimated time: {rtoe2} seconds.")

else:
            print(f"Failed to submit tBLASTn for {accession} using organism and taxid. Trying with genus...")  

print(f"Waiting for tBLASTn for {accession} to complete...")
time.sleep(rtoe2)



while True:
        result_params = {"CMD": "Get", "FORMAT_OBJECT": "SearchInfo", "RID": rid2}

        result_response2 = requests.get(tblastn_url, params=result_params)

        if "Status=WAITING" in result_response2.text:
            print("Searching...")
            time.sleep(60)
            continue

        if "Status=FAILED" in result_response2.text:
            print(f"tBLASTn search failed for {accession}. Trying next accession...")
            break

        if "Status=READY" in result_response2.text:
            print(f"tBLASTn search complete for {accession}. Retrieving results...")
            break


result_params = {"CMD": "Get", "FORMAT_TYPE": "Text", "RID": rid2}
result_response2 = requests.get(tblastn_url, params=result_params)

output_file = f"tblastn_results1_{accession}.txt"
with open(output_file, "w") as file:
      file.write(result_response2.text)
print(f"tBLASTn results for {accession} saved to {output_file}.")

API refseq • 322 views

ADD COMMENT • link updated 4 weeks ago by GenoMax 150k • written 5 weeks ago by Chiranth • 0

0

Entering edit mode

small comment already : you might be better of using the tabular output format , it's much easier to parse afterwards

ADD REPLY • link 5 weeks ago by lieven.sterck 15k

0

Entering edit mode

Thanks,i'll keep that in mind. but i want it to work first, when i try the remote blast offline command i'm getting a:

Error: NCBI C++ Exception:
    T0 "E:\RemoteApp\tmp64_19\2_1719320056\c++\src\connect\ncbi_conn_streambuf.cpp", line 763: Error: (CIO_Exception::eIO_NotSupported) ncbi::CConn_Streambuf::showmanyc() - I/O error
    T0 "E:\RemoteApp\tmp64_19\2_1719320056\c++\src\serial\objistrasnb.cpp", line 1231: Error: (CIO_Exception::eIO_NotSupported) ncbi::CObjectIStreamAsnBinary::ReadClassSequential() -  Frame type= eFrameClassMember, Member name= description
    T0 "E:\RemoteApp\tmp64_19\2_1719320056\c++\src\serial\objistrasnb.cpp", line 1237: Error: (CIO_Exception::eIO_NotSupported) ncbi::CObjectIStreamAsnBinary::ReadClassSequential() -  Frame type= eFrameClass, Object type= Blast4-database-info

ADD REPLY • link updated 4 weeks ago by GenoMax 150k • written 4 weeks ago by Chiranth • 0

0

Entering edit mode

NCBI remote blast is meant as a convenience for use with a small number of sequences. It should not be built into a tool or anything that you need to reliably work.

Web blast also has the following limits in place that you should keep in mind ( New web BLAST default parameters and search limits are now in place at NCBI ) It would be best to download the refseq_pro/eu-karyotic genomes locally and run your search against the database, if you want this to reliably work. NCBI makes preformatted blast databases available for these.

ADD REPLY • link 4 weeks ago by GenoMax 150k