How to download raw data in batch from NCBI based on Series Accession number or Platform ID
7.5 years ago
biolab ★ 1.4k

Dear all,

I have a list of NCBI GEO Series accession numbers and Platform IDs, and I want to download the corresponding raw data in batch. A previous Biostars post (How to download raw sequence data from GEO/SRA) gives a good example of batch download, but that solution starts from a project ID rather than a GEO Series accession number. Does anyone know how to do this? Thank you very much!

sra • 4.4k views

Can you post an example of the Accession number you are interested in? @Istvan's solution with eUtils should be able to accommodate your needs.


Thank you for your comment, genomax! The GEO Series accession number looks like GSE65022, and the Platform ID looks like GPL19657. I want to get the SRA run numbers, which look like SRR4024915.


This may be helpful: Batch Entrez


Hi, Buffo, thanks for your comment! However, after uploading a list of Platform IDs (e.g., GPL19657), I could not get the SRA run numbers, which look like SRR4024915.


Hi, in case you are only interested in SRR IDs, the SRA Run Selector is a very good option. You can enter GSE65022 in the Run Selector and it should pull all the relevant metadata for you, for example with this URL: https://www.ncbi.nlm.nih.gov/Traces/study/?acc=GSE65022&go=go
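
For a whole list of accessions, the same URL pattern could be generated in one go. This is only a sketch: run_selector_urls is a hypothetical helper name, and it simply substitutes each accession into the URL shown above without checking that the accession exists.

```shell
# Print one Run Selector URL per accession read from stdin (one accession per line).
# The URL pattern is copied from the example above; nothing is fetched.
run_selector_urls() {
    local acc
    while IFS= read -r acc; do
        [ -z "$acc" ] && continue   # skip blank lines
        printf 'https://www.ncbi.nlm.nih.gov/Traces/study/?acc=%s&go=go\n' "$acc"
    done
}
```

For example, `printf 'GSE65022\n' | run_selector_urls` prints the URL given above.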

7.4 years ago

You can connect GEO to the SRA run info like so:

esearch -query GSE65022 -db gds | elink -target sra | efetch -format runinfo

From that you can build a command to automate the download (this one only gets the first 10 spots of each run to allow easy testing):

esearch -query GSE65022 -db gds | elink -target sra | efetch -format runinfo | cut -d ',' -f 1 | grep SRR | xargs fastq-dump -X 10 --split-files

Remove the -X 10 limit to get all the data.
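
Since the question starts from a list of accessions, the same pipeline could be wrapped in a loop. The following is a minimal sketch, assuming EDirect is installed and the GSE accessions sit one per line in a file. The function names and the file name gse_list.txt are hypothetical, and the approach has not been checked against Platform IDs. Redirecting esearch's input from /dev/null is a precaution so that it does not consume the lines the loop is reading.

```shell
# Keep only run accessions (first CSV column) from runinfo text on stdin.
runs_from_runinfo() {
    cut -d ',' -f 1 | grep -E '^[SED]RR[0-9]+$' | sort -u
}

# Print the run accessions for every GSE accession listed in a file.
# Needs EDirect (esearch, elink, efetch) on PATH and network access.
runs_for_series_list() {
    local list_file=$1 gse
    while IFS= read -r gse; do
        [ -z "$gse" ] && continue   # skip blank lines
        esearch -db gds -query "$gse" </dev/null |
            elink -target sra |
            efetch -format runinfo |
            runs_from_runinfo
    done < "$list_file"
}

# Example (not run here):
#   runs_for_series_list gse_list.txt | xargs fastq-dump -X 10 --split-files
```

Collecting the run accessions first and downloading afterwards makes it easy to inspect the list before committing to a large transfer.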


Thank you very much, Istvan. The command you provided is really helpful!


I wonder if this search can also retrieve strain information like that shown in the runInfo table (https://www.ncbi.nlm.nih.gov/Traces/study/?acc=SRR1761531&go=go) or in the SRX summary information (https://www.ncbi.nlm.nih.gov/gds/?term=SRX844624).


There is an XML file that contains all the information that is displayed, though getting the data out can be somewhat convoluted. For example:

esearch -db sra -query SRR1761531 | efetch > summary.xml
cat summary.xml | xtract -Pattern SAMPLE_ATTRIBUTE -element TAG,VALUE

would produce:

source_name          Leaf tissue
cultivar             Nipponbare
tissue               leaf
treatment            control
developmental stage  Vegetative stage
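
To pull a single attribute out of that output, the tab-separated TAG/VALUE lines that xtract emits can be filtered with awk. This is a minimal sketch: attr_value and run_attr are hypothetical helper names, and run_attr simply reuses the esearch, efetch and xtract commands from above, so it needs EDirect and network access.

```shell
# Print the value(s) for one tag from TAG<TAB>VALUE lines on stdin.
attr_value() {
    awk -F '\t' -v tag="$1" '$1 == tag { print $2 }'
}

# Hypothetical wrapper: look up one sample attribute for one SRA run.
run_attr() {
    esearch -db sra -query "$1" | efetch |
        xtract -Pattern SAMPLE_ATTRIBUTE -element TAG,VALUE |
        attr_value "$2"
}

# Example (not run here):
#   run_attr SRR1761531 cultivar
```

Matching on the whole tag field rather than using grep keeps tags such as "tissue" from also matching values like "Leaf tissue".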