Question

SRA and Bioproject IDs

15 months ago

mrashad ▴ 80

Dear all, I have a group of BioProject IDs and need to retrieve their corresponding SRA IDs. I tried to retrieve all of the data from SRA using:

library(rentrez)
kywrds <- entrez_search(db = "sra", retmax = 20000,
                        term = "Homo sapiens[ORGN] AND Homo sapiens[orgn:__txid9606]")

However, a search for all of Homo sapiens returns more than 4 million records, so I would need to page through the results using the "retstart" and "retmax" arguments together with "web_history". Unfortunately, I couldn't get that to work.

The result I want is a data frame of SRA IDs with their corresponding BioProject IDs.

Could you help me to do that?

Thanks

Bioproject GEO SRA • 1.1k views

updated 15 months ago by GenoMax 147k • written 15 months ago by mrashad ▴ 80

GenoMax · Answer 1 · 2023-08-28

15 months ago

vkkodali_ncbi ★ 3.8k

You can search SRA directly using a BioProject ID. Shown below are Entrez Direct commands; you should be able to adapt the syntax to match that of BioPython.

esearch -db sra -query 'PRJEB4337[bioproject]'

You can then pass those results along to esummary and extract relevant information from the output XML. For example,

esearch -db sra -query 'PRJEB4337[bioproject]' | esummary | xtract -pattern DocumentSummary -element Bioproject Biosample Run@acc

will give you a 3-column, tab-delimited table with BioProject, BioSample and SRA Run accessions.
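Since the question involves a group of BioProject IDs, the same pipeline can be wrapped in a loop. The following is a minimal sketch, assuming EDirect (esearch, esummary, xtract) is installed and on the PATH; `fetch_project_runs` is a hypothetical helper name, not part of EDirect, and the header labels are chosen only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: one combined table of BioProject, BioSample and Run accessions
# for several BioProject IDs. Assumes EDirect is installed.

# fetch_project_runs is a hypothetical helper name, not an EDirect command.
fetch_project_runs() {
    local project
    # Header row so the file can be read directly as a data frame.
    printf 'BioProject\tBioSample\tRun\n'
    for project in "$@"; do
        # Redirect stdin so esearch does not consume input meant for the loop.
        esearch -db sra -query "${project}[bioproject]" < /dev/null \
            | esummary \
            | xtract -pattern DocumentSummary -element Bioproject Biosample Run@acc
    done
}
```

For example, `fetch_project_runs PRJEB4337 > sra_runs.tsv` writes the table for a single project; further IDs can be passed as additional arguments. The resulting tab-delimited file (the name `sra_runs.tsv` is arbitrary) should load in R with `read.delim("sra_runs.tsv")`.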


Thank you for your informative answer. I got the XML result from esummary, but I need to access the study accession (acc) inside ExpXml, as shown in the attached image. I tried the following:

esearch -db sra -query 'PRJEB4337[bioproject]' | esummary | xtract -pattern DocumentSummary -element Bioproject ExpXml@Study acc Run@acc

But it doesn't work.

[Attached image: esummary XML output]

updated 15 months ago by GenoMax 147k • written 15 months ago by mrashad ▴ 80

No, the result should be a simple table, not the XML that you show. Do not change the command posted by vkkodali_ncbi.

The three columns produced are the BioProject ID, the BioSample ID and the SRA run accession.

$ esearch -db sra -query 'PRJEB4337[bioproject]' | esummary | xtract -pattern DocumentSummary -element Bioproject Biosample Run@acc
PRJEB4337       SAMEA2145774    ERR315468
PRJEB4337       SAMEA2154125    ERR315343
PRJEB4337       SAMEA2145893    ERR315339
PRJEB4337       SAMEA2156266    ERR315348
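As a quick sanity check on a table like the one above, the runs can be counted per BioProject. This is only a sketch: `count_runs_per_project` is a hypothetical helper name, and it assumes the input is the tab-delimited, three-column output of the command above.

```shell
# count_runs_per_project is a hypothetical helper name.
# Reads a tab-delimited table (BioProject, BioSample, Run) on stdin and
# prints each BioProject with its number of distinct run accessions.
count_runs_per_project() {
    awk -F '\t' '
        # Count each (BioProject, Run) pair only once.
        NF >= 3 && !seen[$1 FS $3]++ { count[$1]++ }
        END { for (p in count) printf "%s\t%d\n", p, count[p] }
    ' | sort
}
```

It can be appended to the pipeline, e.g. `... | xtract -pattern DocumentSummary -element Bioproject Biosample Run@acc | count_runs_per_project`.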

15 months ago by GenoMax 147k

I got what I wanted with the following command:

esearch -db sra -query 'PRJEB4337[bioproject]' | esummary | xtract -pattern DocumentSummary -element Study@acc Bioproject Biosample Run@acc

which produced:

ERP003613      PRJEB4337       SAMEA2145774    ERR315468

Thank you for your help :)

updated 15 months ago by GenoMax 147k • written 15 months ago by mrashad ▴ 80

Please go ahead and accept the original answer (green check mark) to provide closure to this thread.

15 months ago by GenoMax 147k