Question

Downloading Multi-Experiment .sra Files From the NCBI Archive Automatically

5

12.4 years ago

narges ▴ 210

Hi all, in order to do some comparisons, I need to download 161 raw dataset files from NCBI, at the link below: http://www.ncbi.nlm.nih.gov/sites/entrez?db=sra&LinkName=pubmed_sra&from_uid=20220758 I need to save them onto a cluster first, but I do not know the best way to download these files. Many thanks in advance for your help.

sra ncbi next-gen • 11k views

ADD COMMENT • link updated 12.4 years ago by Sean Davis 27k • written 12.4 years ago by narges ▴ 210

score 7 · Answer 1 · 2012-08-20

12.4 years ago

matted 7.8k

Here's a similar answer, but maybe useful if (like me) you don't like working with .SRA files.

Find the run(s) in the EBI ENA (http://www.ebi.ac.uk/ena/data/view/SRP001540 for yours).

Then click the "View: Text" link which will download this file. The 17th column has the full FTP link for each file.

Like Sukhdeep's example, you could then run something like cut -f 17 SRP001540 | tail -n +2 | xargs wget.

The advantage of the ENA is that you can download FASTQ files directly, and skip the slow step of converting the .SRA files back to FASTQ.
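The column position in the text report may not always be 17, so one possible variation is to look the column up by its header name instead of by number. The sketch below runs on a tiny made-up tab-separated sample: the header name fastq_ftp, the run accessions, and the example.org URLs are all placeholders for illustration, not values taken from the real report. It prints the links rather than downloading them.

```shell
#!/bin/sh
# Build a small stand-in for the ENA text report (tab-separated, header in row 1).
# The column names and URLs here are hypothetical placeholders.
report=$(mktemp)
printf 'run_accession\tfastq_ftp\n' > "$report"
printf 'RUN0001\tftp://example.org/RUN0001.fastq.gz\n' >> "$report"
printf 'RUN0002\tftp://example.org/RUN0002.fastq.gz\n' >> "$report"

# Find the index of the column whose header is "fastq_ftp", then print
# that field for every data row. Nothing is printed if the header is absent.
urls=$(awk -F '\t' -v col="fastq_ftp" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }
    idx     { print $idx }
' "$report")

printf '%s\n' "$urls"
# To actually download, the list could be piped to wget instead:
#   printf '%s\n' "$urls" | xargs wget
rm -f "$report"
```

With a real report, replace the sample file with the downloaded one and check the header row first to confirm what the link column is called.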

ADD COMMENT • link 12.4 years ago by matted 7.8k

+1 good one for just fastq's

ADD REPLY • link 12.4 years ago by Sukhi Singh 11k

score 6 · Answer 2 · 2012-08-20

12.4 years ago

Sukhi Singh 11k

On the page you have given, select all the experiments you want, and then click on Send to:->File->Summary. A CSV file will be downloaded listing the experiments you selected, with an FTP link for each.

Now, in your directory on the cluster, make a folder and move the file there. I assume the 14th column of that file holds the FTP links; change the column number in the following command as required.

sed 1d file | cut -f14 | wget -i -

This will download all the experiment archives. Add -b to the wget command to run it in the background.
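One caveat worth checking: cut splits on tabs by default, so if the downloaded summary really is comma-separated, the delimiter has to be given with -d. Here is a minimal sketch on a made-up two-column CSV. The file contents and example.org URLs are placeholders, and column 2 stands in for whichever column holds the links in the real file. It prints the links instead of fetching them.

```shell
#!/bin/sh
# Hypothetical stand-in for the summary file: comma-separated, header first.
summary=$(mktemp)
printf 'Experiment,Download\n' > "$summary"
printf 'EXP1,ftp://example.org/EXP1.sra\n' >> "$summary"
printf 'EXP2,ftp://example.org/EXP2.sra\n' >> "$summary"

# Drop the header row, then take field 2 using a comma as the delimiter.
links=$(sed 1d "$summary" | cut -d ',' -f 2)

printf '%s\n' "$links"
# For a real run, feed the list to wget (-i - reads URLs from stdin):
#   printf '%s\n' "$links" | wget -i -
rm -f "$summary"
```

A plain cut like this will miscount columns if any field contains a quoted comma, so it is worth inspecting the first few extracted lines before starting 161 downloads.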

Cheers

ADD COMMENT • link 12.4 years ago by Sukhi Singh 11k

Millions of thanks. Just one more thing: the information text file for this dataset is at this link: http://eqtl.uchicago.edu/RNA_Seq_data/list_lanes_pickrell_2010_nature If one searches this file for, say, NA19200, there are 4 results, 2 for each center (Yale and Argonne). I assumed that this individual sample (NA19200) has two technical replicates, one at the Argonne center and one at the Yale center, but I cannot understand what the "2" after the individual's name means for the second sample in the same center. Why do they have both NA19200 and NA192002 for this individual?

ADD REPLY • link 12.4 years ago by narges ▴ 210

These are library replicates. From the supplement to their paper: "In the course of examining variability between libraries, multiple libraries were prepared and sequenced for a subset of cell lines."

ADD REPLY • link 12.4 years ago by matted 7.8k

score 3 · Answer 3 · 2012-08-20

12.4 years ago

Sean Davis 27k

If your workflow includes R, you might take a look at the Bioconductor SRAdb package. It has functions for searching both SRA and ENA locally, finding and generating URLs, and downloading from SRA, including some simple functionality for scripting with Aspera.

ADD COMMENT • link 12.4 years ago by Sean Davis 27k