How to get the genome list output table programmatically from the PATRIC database?
8.8 years ago

Hi!

This is my first post.

I'm a French bioinformatician doing an internship for the second year of my Master's degree.

I would like to get the output table (a tab-delimited text file) that you can download manually when you display a list of genomes after a query in the PATRIC database. I would like to get this table programmatically, but it does not appear anywhere in the page's HTML source (as shown by your browser, for example).

I know I can get the list by giving an NCBI taxID in the request, which shows all the genomes under that taxon (for example, PATRIC reports "57624 genomes found" when I give the taxID "2" (Bacteria)):

https://www.patricbrc.org/portal/portal/patric/GenomeList?cType=taxon&cId=2&dataSource=&displayMode=&pk=&kw=
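
The query string of this URL already carries the request: `cType=taxon` and `cId=2` select the taxon, and the remaining parameters are empty. As a small sketch using only the Python standard library (the helper name `genome_list_url` is made up for illustration), the URL can be taken apart and rebuilt for another taxID:

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse

URL = ("https://www.patricbrc.org/portal/portal/patric/GenomeList"
       "?cType=taxon&cId=2&dataSource=&displayMode=&pk=&kw=")

def genome_list_url(tax_id, template=URL):
    """Return `template` with its cId parameter replaced by `tax_id`."""
    parts = urlparse(template)
    # keep_blank_values keeps the empty parameters (dataSource, pk, ...)
    params = parse_qs(parts.query, keep_blank_values=True)
    params["cId"] = [str(tax_id)]
    return urlunparse(parts._replace(query=urlencode(params, doseq=True)))

print(parse_qs(urlparse(URL).query, keep_blank_values=True)["cId"])
print(genome_list_url(1224))
```

This only changes which page is requested; the table itself is still not in the returned HTML.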

I have been coding in Python for 2 years and I know many ways to retrieve and parse data, but maybe I am missing something.

Maybe I can modify this URL to get my result as XML or plain text, with the whole table (in this example, all 57624 genomes). I only need two fields from the list: the taxID of the genome's species and the PATRIC genome ID (the taxID followed by a dot and the assembly version).

PATRIC doesn't have a single file listing all the genomes, as NCBI does.
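
The genome ID format described here (a taxID, a dot, then a version number) can be split with plain string handling. A minimal sketch, assuming the format is exactly as described; the function name and the sample ID `83332.12` are only illustrative:

```python
def split_genome_id(genome_id):
    """Split an ID such as '83332.12' into (taxID, version) integers."""
    tax_id, sep, version = str(genome_id).partition(".")
    if not sep or not tax_id.isdigit() or not version.isdigit():
        raise ValueError("unexpected genome ID format: %r" % (genome_id,))
    return int(tax_id), int(version)

print(split_genome_id("83332.12"))
```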

Thanks for your help!

Sorry for my bad English and the mistakes.

Regards,
Yoan

PATRIC python genome database • 3.3k views

Hi Yoan,

I am trying to get whole-genome data from the PATRIC database using your script.

After running it, I got this problem:

Give the NCBI Tax ID of your rank or taxon : 396598
Make POST Request to PATRIC Server to have the number of genome
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
Traceback (most recent call last):
  File "DownloadgenomesPATRICT.py", line 42, in <module>
    genomeNumber = r.json()[u'total']
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/models.py", line 892, in json
    return complexjson.loads(self.text, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Do I need to change something now to make it work?

I know it has been quite a long time since you posted, but I hope to hear from you soon.

Thank you very much

Thanh
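
A `JSONDecodeError` at `line 1 column 1 (char 0)` means the response body does not start with JSON at all (it may be empty or an HTML page), so `r.json()` fails before it ever reaches `'total'`. A hedged sketch of a check to run on the response text first; `total_from_body` is a hypothetical helper and the sample bodies are made up:

```python
import json

def total_from_body(body):
    """Return the 'total' field of a JSON body, or raise a readable error."""
    try:
        data = json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        # Show the start of the body to see what the server really sent
        raise RuntimeError("response is not JSON, it starts with: %r" % body[:80])
    return data["total"]

print(total_from_body('{"total": 3, "results": []}'))
try:
    total_from_body("<html><body>Not Found</body></html>")
except RuntimeError as err:
    print(err)
```

In the script this could be called as `total_from_body(r.text)`, after looking at `r.status_code`, to see what the server actually returned.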

8.8 years ago

Looking at the HTTP requests sent by the browser (Firefox is good for this), I found a POST query that could answer your needs. The answer is JSON.

for PAGE in 1 2 3 4 5 6 7 8 9 10
do
    START=`echo "((${PAGE} -1 )* 20) + 1" | bc`
    curl -s --data \
        "pk=-1181405400096836539&need=0&taxonId=2&genomeId=&keyword=*:*&facet={\"facet\":\"public,genome_status,reference_genome,antimicrobial_resistance,antimicrobial_resistance_evidence,isolation_country,host_name,disease,collection_date,completion_date\",\"facet_text\":\"Public,Genome Status,Reference Genome,Antimicrobial Resistance,Antimicrobial Resistance Evidence,Isolation Country,Host Name,Disease,Collection Date,Completion Date\",\"field_facets\":\"genome_status,reference_genome,antimicrobial_resistance,antimicrobial_resistance_evidence,isolation_country,host_name,disease,collection_date,public\",\"date_range_facets\":\"completion_date\"}&page=${PAGE}&start=${START}&limit=20&sort=[{\"property\":\"genome_name\",\"direction\":\"ASC\"}]" \
    "https://www.patricbrc.org/portal/portal/patric/GenomeFinder/GenomeFinderWindow?action=b&cacheability=PAGE"

done
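
The loop above asks for 20 rows per page and computes `start` as `(PAGE - 1) * 20 + 1`. The same arithmetic in Python, as a sketch (the function name is made up), also gives the number of pages a given total needs:

```python
import math

def page_offsets(total, limit=20):
    """Return (page, start) pairs covering `total` rows, `limit` per page."""
    pages = math.ceil(total / limit)
    return [(page, (page - 1) * limit + 1) for page in range(1, pages + 1)]

print(page_offsets(45))           # one (page, start) pair per request
print(len(page_offsets(57624)))   # pages needed for the 57624 genomes
```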
8.8 years ago

Hello ! :)

Thank you very much !

Sorry for the late answer: I was writing and testing a Python script to download the genomes from PATRIC, and I needed time to understand your answer and learn more about it :).

import math
import os
import subprocess
import sys
import time

import requests

# functions

class ProcessException(Exception):
    """Raised when a shell command exits with a non-zero status."""


def execute(command, path):
    """
    Run a shell command in `path`, print its output in the console and return it
    """
    process = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE,
                               stderr=subprocess.STDOUT, cwd=path,
                               universal_newlines=True)

    # Echo the output line by line until the process closes its stdout
    lines = []
    for nextline in process.stdout:
        sys.stdout.write(nextline)
        sys.stdout.flush()
        lines.append(nextline)

    exitCode = process.wait()
    output = ''.join(lines)

    if exitCode == 0:
        return output
    raise ProcessException(command, exitCode, output)

# commands

# This request is presumably only needed on the author's network; remove it if it fails for you
requests.get('https://guest.ulg.ac.be/welcome', verify=True)

taxonID = input('Give the NCBI Tax ID of your rank or taxon : ').strip()
limit = 10000

# Parts of the query string shared by every request
FACET = '{"facet":"public,genome_status,reference_genome,antimicrobial_resistance,antimicrobial_resistance_evidence,isolation_country,host_name,disease,collection_date,completion_date","facet_text":"Public,Genome Status,Reference Genome,Antimicrobial Resistance,Antimicrobial Resistance Evidence,Isolation Country,Host Name,Disease,Collection Date,Completion Date","field_facets":"genome_status,reference_genome,antimicrobial_resistance,antimicrobial_resistance_evidence,isolation_country,host_name,disease,collection_date,public","date_range_facets":"completion_date"}'
SORT = '[{"property":"genome_name","direction":"ASC"}]'
BASE_URL = ('https://www.patricbrc.org/portal/portal/patric/GenomeFinder/GenomeFinderWindow'
            '?action=b&cacheability=PAGE&need=0&taxonId=' + taxonID +
            '&keyword=*:*&facet=' + FACET)

print("Make POST Request to PATRIC Server to have the number of genome")
# verify=False disables certificate verification and triggers an InsecureRequestWarning
r = requests.post(BASE_URL + '&sort=' + SORT, verify=False)
genomeNumber = r.json()['total']
print("Number of genome found : ", genomeNumber)
# True division, so that a partly filled last page is counted
pageNumber = max(1, math.ceil(genomeNumber / limit))

dicoFTP = dict()
with open("patricGenome_" + taxonID + ".txt", "w") as f:
    pageList = list()
    for page in range(1, pageNumber + 1):
        start = (page - 1) * limit + 1
        print("page ", page, "/", pageNumber)
        t0 = time.time()
        req = requests.post(BASE_URL + '&page=' + str(page) + '&start=' + str(start) +
                            '&limit=' + str(limit) + '&sort=' + SORT, verify=False)
        pageList.append(req)
        print(time.time() - t0, "seconds")
    for cpt, req in enumerate(pageList, 1):
        t0 = time.time()
        # Decode each stored page once (not only the last response fetched)
        for dico in req.json()['results']:
            t1 = time.time()
            taxID, genomeName, genomeID = dico['taxon_id'], dico['genome_name'], dico['genome_id']
            dicoFTP[genomeID] = "ftp://ftp.patricbrc.org/patric2/genomes/" + str(genomeID) + "/" + str(genomeID) + ".PATRIC.faa"
            f.write(str(taxID) + "\t" + str(genomeName) + "\t" + str(genomeID) + "\n")
            print("parsed genome", genomeName, "in", time.time() - t1, "seconds")
        print("parsed page", cpt, "in", time.time() - t0, "seconds")

# download the genomes found
patricDir = os.path.join(os.getcwd(), "patricGenome")
if not os.path.isdir(patricDir):
    os.mkdir(patricDir)
# log file listing the URLs that could not be downloaded
with open("logFilePatric.txt", "w") as log:
    # download genome
    for url in dicoFTP.values():
        print(url)
        name = os.path.split(url)[1]
        try:
            execute('wget ' + url, patricDir)
        except ProcessException:
            pass  # a failed download is written to the log just below
        # wget runs inside patricDir, so look for the file there
        if not os.path.isfile(os.path.join(patricDir, name)):
            print(name, " not found")
            log.write(url + "\n")

Maybe this code can help someone :)

Thank you !
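
The script writes one line per genome to `patricGenome_<taxID>.txt`, with three tab-separated columns: taxID, genome name and genome ID. A small sketch for reading that file back with the standard `csv` module; the function name is made up and the demo row is invented:

```python
import csv
import os
import tempfile

def read_genome_table(path):
    """Yield (taxID, genome name, genome ID) tuples from the tab-delimited file."""
    with open(path, newline="") as handle:
        # QUOTE_NONE: treat quote characters in genome names as plain text
        for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
            if len(row) == 3:  # skip blank or malformed lines
                yield row[0], row[1], row[2]

# Demo on a temporary file holding one invented row
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("83332\tMycobacterium tuberculosis H37Rv\t83332.12\n")
rows = list(read_genome_table(tmp.name))
os.remove(tmp.name)
print(rows)
```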
