Question

Download all SRA of one species, organized

2

Entering edit mode

9.9 years ago

Lee Katz ★ 3.2k

Hi, I want to download all SRA entries from NCBI, pertaining to one species. I also want to organize it by biosample. For example, I hope to have a directory like this:

Vibrio cholerae
├──biosample1
   ├──file1.sra
   ├──file2.sra
├──biosample2
   ├──file1.sra
...

What do you think? I have already starting thinking of this method with the NCBI edirect CLI utilities but it doesn't quite work. My idea was 1) to find all biosamples labeled with the correct species; 2) find all links to SRA; 3) get the SRA identifiers and SAMN identifiers

esearch -db biosample -query '"Vibrio cholerae"[organism]' |\
 elink -target sra -db sra |\
 ...

Biosample edirect NCBI SRA • 5.6k views

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Lee Katz ★ 3.2k

0

Entering edit mode

I think that this is a good initial list which I am going to start editing by hand. And then the next step--getting the reads and organizing them properly?

esearch -db biosample -query '"Vibrio cholerae"[organism]' | efetch -format xml | xtract -pattern Ids -element Id > Vc.biosample.tsv

ADD REPLY • link updated 2.7 years ago by Ram 44k • written 9.9 years ago by Lee Katz ★ 3.2k

Ram · Answer 1 · 2015-02-07

2

Entering edit mode

9.8 years ago

Sean Davis 27k

The Bioconductor SRAdb package is built around a SQLite database of SRA metadata. It can be fairly useful for tasks that involve mining SRA for specific data.

ADD COMMENT • link updated 2.7 years ago by Ram 44k • written 9.8 years ago by Sean Davis 27k

Ram · Answer 2 · 2014-12-21

A while ago I have tried something similar. The goal was to determine the multi locus sequence type directly from the sequencing reads for all the datasets available from the short read archive for Staphylococcus aureus. By now, the big sequencing centers (namely Sanger) have deposited more than 10000 such datasets in the short read archive. I expect that most of them will never be assembled and put into the WGS database.

The determination of the sequence type worked reasonably well for my test datasets. But when I applied it to larger number of datasets I run into all kinds of odd behaviour:

I soon run out of disk space
mis-assigned species (not S.aureus but another staphylococcal species)
ill defined datasets

For example I found FASTQ files with all reads of length 200 where the first 100 nucleotides represent the first read of a paired end run, and the next 100 nucleotides represent the second read. Or I found FASTQ files from paired end runs where some of the sequences of the second read where reverse-complemented but others were not. I also found FASTQ files where sequences were reverse-complemented but not the quality values. Thus I came to the conclusion that I have to inspect each downloaded FASTQ file individually before applying the automated MLST search, and I never started any automated download of a large number of SRR datasets.

You will need a substancial ammount of disk space for your project, my guess is about 10 TB. There are currently about 2800 SRR datasets available for Vibrio cholerae. In average a dataset may be 700 Mb in size. If you convert SRA to FASTQ, you will need another 700 Mb. If you map the reads to a reference genome, the resulting BAM file will also be at least 700 Mb. This is because you copy the data all the time in typical work flows involving FASTQ and BAM files. Thus you will need about 2.5 Gb disk space for every SRA dataset you process.

With my (maybe limited) resources the download of a single dataset from the european short read archive took about 5 minutes. This means that downloading 3000 datasets will take about 10 days to finish. All this is feasible but it is definitely not a project for a rainy sunday afternoon.

Ram · Answer 3 · 2014-12-22

It's ridiculous (see: piet's answer in the form of difficulties) but this was my solution. I am sure it is imperfect.

# create the xml
esearch -db biosample -query '"Vibrio cholerae O1"[organism]' | efetch -format xml > Vc.biosample.xml

# turn the xml into some nice spreadsheet for biosample
cat Vc.biosample.xml | xtract -pattern BioSample -element Id -block Attributes -absent "Sample name" -present strain -element Attribute > Vc.biosample.tsv

# figure out a spreadsheet with name/assembly/sra
cat Vc.biosample.tsv | perl -F'\t' -lane 'chomp; chomp(@F); my($SAMN,$name); for(@F){if(/^SAM/ && !$SAMN){$SAMN=$_;} elsif(!$name && !/^SAM/ && !/^ERS/ && !/^AE\d+/ && !/^A[A-Z]\w+/ && !/^SRS/ && !/^GSM/ && !/^CP/ && !/^\d+$/){$name=$_; $origName=$name; $name=~s/^\s+|\s+$//g; $name=~s/(\s|\/|\.)+/_/g;} } $origName=~s/.*(strain|str\.)\s*//i; $cmd="esearch -db biosample -query \"${origName}[all] Vibrio cholerae O1[organism]\""; $asm=`$cmd | elink -target assembly |efetch -format docsum| xtract -pattern DocumentSummarySet -element DocumentSummary/Id`; $sraxml=`$cmd | elink -target sra |efetch -format docsum| xtract -pattern DocumentSummary -element Runs`; $asm=~s/^\s+|\s+$//g; my @SRR; while($sraxml=~/([ES]RR\d+)/g){push(@SRR,$1);} print STDERR $cmd; print join("\t",$origName,$asm,join(",",@SRR));'  | tee Vc.asmSra.tsv

From there I will use other tools like fastq-dump to download the data into the appropriate directories.

Ram · Answer 4 · 2015-02-06

I have tinkered around with the web interface of NCBI and found that it is possible to download the results of a search in the SRA database as a nice and easy to parse CSV file. This option is called 'RunInfo' (select 'Send to File' / File / Format / 'RunInfo' in your webbrowser). This CSV file even contains the links for downloading the SRA files.

Moreover, data in 'runinfo' CSV format can be downloaded also programmatically. Here is a short python program to accomplish this task:

#!/usr/bin/python2.7

import json, urllib

query = '"Vibrio cholerae O1"[Organism]"'

params = {'db':'sra', 'term':query, 'usehistory':'y', 'retmode':'json',}    
url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
stream = urllib.urlopen(url, urllib.urlencode(params))
answer = json.loads(stream.read())
webenv = answer["esearchresult"]["webenv"]
query_key = answer["esearchresult"]["querykey"]

params = {'db':'sra', 'save':'efetch', 'rettype':'runinfo', 'WebEnv':webenv, 'query_key':query_key,}
url = 'http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi'
outfile = 'runinfo.csv'
urllib.urlretrieve(url, data=urllib.urlencode(params), filename=outfile)