Question

How to get the genome list output table programmatically from the PATRIC database?

8.8 years ago

yoan.bouzin • 0

Hi!

This is my first post.

I'm a French bioinformatician, doing an internship for the second year of my Master's degree.

I would like to get the output table (a tab-delimited text file) that you can download manually when a list of genomes is displayed after a query in the PATRIC database. I would like to get this table programmatically, but the table does not appear when you view the HTML source of the page (in your browser, for example).

I know I can get the list by giving an NCBI taxID in the query, which shows all the genomes under that taxon (for example, PATRIC reports "57624 genomes found" when I enter taxID "2" (Bacteria)):

https://www.patricbrc.org/portal/portal/patric/GenomeList?cType=taxon&cId=2&dataSource=&displayMode=&pk=&kw=

I have been coding in Python for 2 years and know many ways to retrieve and parse data, but maybe I'm missing something.

Maybe I can modify this URL to return the result as XML or plain text and get the whole table (in this example, all 57624 genomes). I only need two fields from the list: the taxID of the genome's species and the PATRIC genome ID (the taxID followed by a dot and the assembly version).

PATRIC doesn't provide a single file listing all the genomes, as NCBI does.
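As the question describes it, a PATRIC genome ID is the taxID, a dot, and an assembly version. A minimal sketch of splitting such an ID in Python follows; the function name `split_genome_id` and the sample ID are hypothetical and only illustrate the format.

```python
def split_genome_id(genome_id):
    """Split a 'taxID.version' genome ID into (tax_id, version) integers."""
    # rpartition splits on the last dot; sep is '' when there is no dot at all.
    tax_id, sep, version = genome_id.rpartition(".")
    if not sep or not tax_id.isdigit() or not version.isdigit():
        raise ValueError("not a 'taxID.version' genome ID: %r" % genome_id)
    return int(tax_id), int(version)


# Hypothetical ID, for illustration only.
print(split_genome_id("1234.5"))
```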

Thanks for your help!

Sorry for my bad English and the mistakes.

Regards,
Yoan

PATRIC python genome database • 3.3k views

updated 2.4 years ago by Ram 44k • written 8.8 years ago by yoan.bouzin • 0

Hi Yoan,

I am trying to download whole-genome data from the PATRIC database using your script.

When I ran it, I got this error:

Give the NCBI Tax ID of your rank or taxon : 396598
Make POST Request to PATRIC Server to have the number of genome
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
Traceback (most recent call last):
  File "DownloadgenomesPATRICT.py", line 42, in <module>
    genomeNumber = r.json()[u'total']
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/models.py", line 892, in json
    return complexjson.loads(self.text, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Do I need to change something to make it work now?

I know it has been a long time since you posted, but I hope to hear from you soon.

Thank you very much

Thanh
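The message "Expecting value: line 1 column 1 (char 0)" means the response body did not start with a JSON value, for example because it was empty or an HTML page. A small sketch of checking for that before reading the `total` field is shown here; `parse_total` is a hypothetical helper, and the sample bodies are made up for illustration.

```python
import json


def parse_total(body):
    """Return the 'total' field of a JSON body, or None if the body is not JSON."""
    try:
        data = json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    if not isinstance(data, dict):
        return None
    return data.get("total")


# Made-up response bodies, for illustration only.
print(parse_total('{"total": 42, "results": []}'))
print(parse_total("<html>not JSON</html>"))
print(parse_total(""))
```

With `requests`, the text to pass in would be `r.text`; printing `r.status_code` and the first characters of `r.text` when the helper returns `None` shows what the server actually sent.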

updated 4.9 years ago by Ram 44k • written 7.3 years ago by trantrungthanh3t • 0

8.8 years ago

yoan.bouzin • 0

Hello! :)

Thank you very much!

Sorry for the late answer: I was writing and testing a Python script to download genomes from PATRIC, and it took me some time to understand your answer and learn more about it :).

import math
import os
import subprocess
import sys
import time

import requests

# functions

def execute(command, path):
    """
    Run a shell command in the directory `path` and print its output to the console as it arrives.
    """
    process = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE,
                               stderr=subprocess.STDOUT, cwd=path,
                               universal_newlines=True)

    # Poll process for new output until finished
    lines = []
    while True:
        nextline = process.stdout.readline()
        if nextline == '' and process.poll() is not None:
            break
        lines.append(nextline)
        sys.stdout.write(nextline)
        sys.stdout.flush()

    output = ''.join(lines)
    exitCode = process.returncode

    if exitCode == 0:
        return output
    else:
        raise subprocess.CalledProcessError(exitCode, command, output)

# commands

taxonID = input('Give the NCBI Tax ID of your rank or taxon : ').strip()
limit = 10000
# Query string shared by both requests (everything except paging and sorting)
baseURL = 'https://www.patricbrc.org/portal/portal/patric/GenomeFinder/GenomeFinderWindow?action=b&cacheability=PAGE&need=0&taxonId='+taxonID+'&keyword=*:*&facet={"facet":"public,genome_status,reference_genome,antimicrobial_resistance,antimicrobial_resistance_evidence,isolation_country,host_name,disease,collection_date,completion_date","facet_text":"Public,Genome Status,Reference Genome,Antimicrobial Resistance,Antimicrobial Resistance Evidence,Isolation Country,Host Name,Disease,Collection Date,Completion Date","field_facets":"genome_status,reference_genome,antimicrobial_resistance,antimicrobial_resistance_evidence,isolation_country,host_name,disease,collection_date,public","date_range_facets":"completion_date"}'
sortParam = '&sort=[{"property":"genome_name","direction":"ASC"}]'

print("Make POST Request to PATRIC Server to have the number of genome")
# verify=False disables TLS certificate verification (requests then emits InsecureRequestWarning)
r = requests.post(baseURL+sortParam, verify=False)
r.raise_for_status()
genomeNumber = r.json()['total']
print("Number of genome found : ", genomeNumber)
if genomeNumber < limit:
    pageNumber = 1
else:
    # float division so that a partial last page is counted
    pageNumber = int(math.ceil(genomeNumber/float(limit)))

dicoFTP = dict()
with open("patricGenome_"+taxonID+".txt", "w") as f:
    pageList = list()
    for page in range(pageNumber):
        start = page*limit+1
        page += 1
        print("page ", page, "/", pageNumber)
        t0 = time.time()
        req = requests.post(baseURL+'&page='+str(page)+'&start='+str(start)+'&limit='+str(limit)+sortParam, verify=False)
        pageList.append(req)
        t1 = time.time()
        t = t1-t0
        print(t, "seconds")
    cpt = 0
    for page in pageList:
        cpt += 1
        t0 = time.time()
        # parse each stored response once (not only the last one requested)
        results = page.json()['results']
        for dico in results:
            t1 = time.time()
            taxID, genomeName, genomeID = dico['taxon_id'], dico['genome_name'], dico['genome_id']
            dicoFTP[genomeID] = "ftp://ftp.patricbrc.org/patric2/genomes/"+str(genomeID)+"/"+str(genomeID)+".PATRIC.faa"
            f.write(str(taxID)+"\t"+str(genomeName)+"\t"+str(genomeID)+"\n")
            t2 = time.time()
            print("parsed genome", genomeName, "in", t2-t1, "seconds")
        t3 = time.time()
        print("parsed page", cpt, "in", t3-t0, "seconds")

# download the genomes found
patricDir = os.path.join(os.getcwd(), "patricGenome")
if not os.path.isdir(patricDir):
    os.mkdir(patricDir)
# log file
with open("logFilePatric.txt", "w") as log:
    # download each genome
    for key in dicoFTP.keys():
        print(dicoFTP[key])
        name = os.path.split(dicoFTP[key])[1]
        call = 'wget '+dicoFTP[key]
        try:
            execute(call, patricDir)
        except subprocess.CalledProcessError:
            pass  # a failed download is recorded in the log file below
        # wget runs inside patricDir, so look for the file there
        if not os.path.isfile(os.path.join(patricDir, name)):
            print(name, " not found")
            log.write(dicoFTP[key]+"\n")

Maybe this code can help someone :)

Thank you!
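The paging arithmetic the script relies on (a page count derived from the total, and a 1-based `start` offset for each page) can be checked in isolation. This is a minimal sketch; `page_plan` is a hypothetical helper, and the 1-based `start` convention is taken from the requests in this thread rather than from any PATRIC documentation.

```python
import math


def page_plan(total, limit):
    """Return (page, start) pairs covering `total` records, `limit` per page."""
    # Float division keeps the fractional part, so a partial last page is counted.
    n_pages = max(1, int(math.ceil(total / float(limit))))
    return [(page, (page - 1) * limit + 1) for page in range(1, n_pages + 1)]


print(page_plan(25, 10))
print(len(page_plan(57624, 10000)))
```

For the 57624 genomes mentioned in the question and a limit of 10000, this plan has six pages; rounding the division down instead would leave out the last 7624 records.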

updated 4.9 years ago by Ram 44k • written 8.8 years ago by yoan.bouzin • 0

Ram · Accepted Answer · 2016-02-06

Looking at the HTTP requests sent by the browser (Firefox is good for this), I found a POST query that could meet your needs. The response is JSON.

for PAGE in 1 2 3 4 5 6 7 8 9 10
do
    START=$(( (PAGE - 1) * 20 + 1 ))
    curl -s --data \
        "pk=-1181405400096836539&need=0&taxonId=2&genomeId=&keyword=*:*&facet={\"facet\":\"public,genome_status,reference_genome,antimicrobial_resistance,antimicrobial_resistance_evidence,isolation_country,host_name,disease,collection_date,completion_date\",\"facet_text\":\"Public,Genome Status,Reference Genome,Antimicrobial Resistance,Antimicrobial Resistance Evidence,Isolation Country,Host Name,Disease,Collection Date,Completion Date\",\"field_facets\":\"genome_status,reference_genome,antimicrobial_resistance,antimicrobial_resistance_evidence,isolation_country,host_name,disease,collection_date,public\",\"date_range_facets\":\"completion_date\"}&page=${PAGE}&start=${START}&limit=20&sort=[{\"property\":\"genome_name\",\"direction\":\"ASC\"}]" \
    "https://www.patricbrc.org/portal/portal/patric/GenomeFinder/GenomeFinderWindow?action=b&cacheability=PAGE"

done
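For readers working in Python, the form fields of the curl command can be assembled as a dictionary. This is only a sketch: `build_payload` is a hypothetical helper, the field names and values are copied from the command above, and leaving out the session-specific `pk` value is an assumption (the Python script in this thread omits it as well).

```python
import json


def build_payload(taxon_id, page, limit=20):
    """Build the form fields for one page of results, mirroring the curl command."""
    facet = {
        "facet": "public,genome_status,reference_genome,antimicrobial_resistance,"
                 "antimicrobial_resistance_evidence,isolation_country,host_name,"
                 "disease,collection_date,completion_date",
        "facet_text": "Public,Genome Status,Reference Genome,Antimicrobial Resistance,"
                      "Antimicrobial Resistance Evidence,Isolation Country,Host Name,"
                      "Disease,Collection Date,Completion Date",
        "field_facets": "genome_status,reference_genome,antimicrobial_resistance,"
                        "antimicrobial_resistance_evidence,isolation_country,host_name,"
                        "disease,collection_date,public",
        "date_range_facets": "completion_date",
    }
    sort = [{"property": "genome_name", "direction": "ASC"}]
    compact = (",", ":")  # no spaces, matching the strings in the curl command
    return {
        "need": 0,
        "taxonId": taxon_id,
        "genomeId": "",
        "keyword": "*:*",
        "facet": json.dumps(facet, separators=compact),
        "page": page,
        "start": (page - 1) * limit + 1,  # same offset rule as START in the loop
        "limit": limit,
        "sort": json.dumps(sort, separators=compact),
    }


payload = build_payload(taxon_id=2, page=3)
print(payload["start"], payload["sort"])
```

With the `requests` library, such a dictionary could be sent with `requests.post(url, data=payload)`, where `url` is the GenomeFinderWindow address from the command above. Whether the server still answers with JSON is not guaranteed, as the `JSONDecodeError` reported elsewhere in this thread suggests.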