7.2 years ago
ypriverol
Hi all: Does anybody know a way to list all BioProject IDs from NCBI?
Regards Yasset
With NCBI eUtils:
esearch -query "P*" -db bioproject | efetch -format docsum | xtract -pattern DocumentSummary -element Project_Acc Project_Title
produces
PRJNA403305 Penicillium aculeatus Gene Expression Profiling - P-Pe223 Fe 3 transcriptome
PRJNA403304 Penicillium aculeatus Gene Expression Profiling - P-Pe223 Fe 2 transcriptome
PRJNA403303 Penicillium aculeatus Gene Expression Profiling - P-Pe223 Fe 1 transcriptome
PRJNA403302 Penicillium aculeatus Gene Expression Profiling - P-Pe223 Al 3 transcriptome
PRJNA403301 Penicillium aculeatus Gene Expression Profiling - P-Pe223 Al 2 transcriptome
Everything (ID, organism, date...) is available in ftp://ftp.ncbi.nlm.nih.gov/bioproject/summary.txt, which currently has this many lines:
228784 summary.txt
Seems to match the number obtained from browser.
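For anyone who wants the accessions out of a downloaded summary.txt, here is a minimal Python sketch. It assumes accessions look like "PRJ" followed by two capital letters and digits (as in PRJNA403305), and it does not rely on any particular column layout, since I have not verified the file's exact columns. The function name extract_accessions is just for illustration.

```python
import re

# Assumption: BioProject accessions look like PRJ + two capital letters + digits,
# e.g. PRJNA403305. Adjust the pattern if the file contains other shapes.
ACCESSION_RE = re.compile(r"\bPRJ[A-Z]{2}\d+\b")


def extract_accessions(lines):
    """Return the unique accessions found in the given lines, in first-seen order."""
    seen = {}
    for line in lines:
        for accession in ACCESSION_RE.findall(line):
            seen.setdefault(accession, None)  # dict preserves insertion order
    return list(seen)


# Typical use on a local copy of the file:
#     with open("summary.txt", encoding="utf-8", errors="replace") as handle:
#         accessions = extract_accessions(handle)
#     print(len(accessions))
```

Counting unique accessions this way rather than counting lines avoids being off by a header line or by duplicated rows.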
But the database information that eUtils reports for bioproject gives a different count:
<DbInfo>
<DbName>bioproject</DbName>
<MenuName>BioProject</MenuName>
<Description>BioProject Database</Description>
<DbBuild>Build170911-0610.1</DbBuild>
<Count>246934</Count>
<LastUpdate>2017/09/11 07:02</LastUpdate>
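A small sketch of reading that count programmatically, assuming the XML comes from the eUtils einfo endpoint (einfo.fcgi?db=bioproject) and is complete and well-formed, unlike the truncated excerpt above. The helper names parse_einfo and fetch_einfo are made up for illustration.

```python
import urllib.request
import xml.etree.ElementTree as ET

EINFO_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=bioproject"


def parse_einfo(xml_text):
    """Pull the database name, record count and last update out of einfo XML."""
    root = ET.fromstring(xml_text)
    # DbInfo is normally nested inside eInfoResult, but accept it as the root too.
    info = root if root.tag == "DbInfo" else root.find(".//DbInfo")
    return {
        "db": info.findtext("DbName"),
        "count": int(info.findtext("Count")),
        "last_update": info.findtext("LastUpdate"),
    }


def fetch_einfo(url=EINFO_URL):
    """Download and parse the einfo record (needs network access)."""
    with urllib.request.urlopen(url) as response:
        return parse_einfo(response.read().decode("utf-8"))
```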
Hi @genomax, thanks for your quick answer. Do you know a way to do it programmatically? I found this one, https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=bioproject&term=all%5Bfilter%5D&retmax=999999, but I don't know if it is the best one.
Regards Yasset
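One way to script that URL is to page through the results with retstart/retmax instead of asking for everything in a single request. This is only a sketch: the page size, the pause between requests, and whether NCBI lets you page this deep are assumptions to check against the eUtils documentation, and the function names are made up. Note that esearch returns numeric Entrez UIDs, not PRJ accessions.

```python
import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_url(retstart, retmax, term="all[filter]", db="bioproject"):
    """Build an esearch URL for one page of results."""
    query = urllib.parse.urlencode(
        {"db": db, "term": term, "retstart": retstart, "retmax": retmax}
    )
    return f"{ESEARCH}?{query}"


def parse_esearch(xml_text):
    """Return (total count, list of UIDs on this page) from esearch XML."""
    root = ET.fromstring(xml_text)
    count = int(root.findtext("Count"))
    uids = [node.text for node in root.findall("./IdList/Id")]
    return count, uids


def fetch_all_uids(page_size=10000, pause=0.4):
    """Collect every UID page by page (needs network access).

    page_size and pause are assumed values; NCBI asks for a modest request rate.
    """
    uids, start = [], 0
    while True:
        with urllib.request.urlopen(build_url(start, page_size)) as response:
            count, page = parse_esearch(response.read().decode("utf-8"))
        uids.extend(page)
        start += page_size
        if not page or start >= count:
            return uids
        time.sleep(pause)
```

Comparing len(fetch_all_uids()) with the Count field is a quick check that nothing was dropped along the way.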
I wondered just how many bioprojects there are in total. Running the search on its own prints a count of 10454, so there are 10454 bioprojects at NCBI.
Amusingly, after doing some investigation, I came to believe that a wildcard search at NCBI does not do what you and I and most people think a wildcard search should do.
What it does instead is create an expanded search query that includes all terms that match the wildcard. So
P*[Project Accession]
will be expanded into a search over the matching terms, one after another, until a predefined string size limit is reached. That's why it returns only a subset of results.
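To make the truncation concrete, here is a toy Python model of that behaviour. It is not NCBI's implementation: the OR-joined query shape, the character limit, and the example index terms are all assumptions chosen only to illustrate how expanding a wildcard under a size limit silently drops matches.

```python
def expand_wildcard(prefix, index_terms, field="Project Accession", max_chars=60):
    """Expand prefix* into an OR query, stopping once max_chars would be exceeded."""
    clauses = []
    length = 0
    for term in sorted(index_terms):
        if not term.startswith(prefix):
            continue
        clause = f"{term}[{field}]"
        extra = len(clause) + (len(" OR ") if clauses else 0)
        if length + extra > max_chars:
            break  # limit reached: every remaining match is silently left out
        clauses.append(clause)
        length += extra
    return " OR ".join(clauses)


# Hypothetical index terms; four of them match P*, but only two fit in 60 characters.
index = ["PRJNA1", "PRJNA2", "PRJNA3", "PRJEB9", "XYZ"]
print(expand_wildcard("P", index))
```

With a generous limit all four matching terms survive, which is what most of us expect a wildcard to do in the first place.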
The more we know ...
According to this page there are 228784 entries (as of today). So perhaps there are some that are not being captured by this query. Every project ID does appear to start with PR*. Mysteries of eUtils.
Interesting, the perils of matching on names. Good to know.
Does not make complete sense. Every project name starts with P, but there are different answers depending on where/how we look. See my comment below @Pierre's answer.
This is strange:
Your results are: 10454
My results with the url are: 246934 (https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=bioproject&term=all%5Bfilter%5D&retmax=999999)
The results in their browser are: 228784 (https://www.ncbi.nlm.nih.gov/bioproject/browse/)