Download all SRA of one species, organized
4
2
Entering edit mode
9.9 years ago
Lee Katz ★ 3.2k

Hi, I want to download all SRA entries from NCBI, pertaining to one species. I also want to organize it by biosample. For example, I hope to have a directory like this:

Vibrio cholerae
├──biosample1
   ├──file1.sra
   ├──file2.sra
├──biosample2
   ├──file1.sra
...

What do you think? I have already starting thinking of this method with the NCBI edirect CLI utilities but it doesn't quite work. My idea was 1) to find all biosamples labeled with the correct species; 2) find all links to SRA; 3) get the SRA identifiers and SAMN identifiers

esearch -db biosample -query '"Vibrio cholerae"[organism]' |\
 elink -target sra -db sra |\
 ...
Biosample edirect NCBI SRA • 5.6k views
ADD COMMENT
0
Entering edit mode

I think that this is a good initial list which I am going to start editing by hand. And then the next step--getting the reads and organizing them properly?

esearch -db biosample -query '"Vibrio cholerae"[organism]' | efetch -format xml | xtract -pattern Ids -element Id > Vc.biosample.tsv
ADD REPLY
2
Entering edit mode
9.8 years ago

The Bioconductor SRAdb package is built around a SQLite database of SRA metadata. It can be fairly useful for tasks that involve mining SRA for specific data.

ADD COMMENT
1
Entering edit mode
9.9 years ago
piet ★ 1.9k

A while ago I have tried something similar. The goal was to determine the multi locus sequence type directly from the sequencing reads for all the datasets available from the short read archive for Staphylococcus aureus. By now, the big sequencing centers (namely Sanger) have deposited more than 10000 such datasets in the short read archive. I expect that most of them will never be assembled and put into the WGS database.

The determination of the sequence type worked reasonably well for my test datasets. But when I applied it to larger number of datasets I run into all kinds of odd behaviour:

  1. I soon run out of disk space
  2. mis-assigned species (not S.aureus but another staphylococcal species)
  3. ill defined datasets

For example I found FASTQ files with all reads of length 200 where the first 100 nucleotides represent the first read of a paired end run, and the next 100 nucleotides represent the second read. Or I found FASTQ files from paired end runs where some of the sequences of the second read where reverse-complemented but others were not. I also found FASTQ files where sequences were reverse-complemented but not the quality values. Thus I came to the conclusion that I have to inspect each downloaded FASTQ file individually before applying the automated MLST search, and I never started any automated download of a large number of SRR datasets.

You will need a substancial ammount of disk space for your project, my guess is about 10 TB. There are currently about 2800 SRR datasets available for Vibrio cholerae. In average a dataset may be 700 Mb in size. If you convert SRA to FASTQ, you will need another 700 Mb. If you map the reads to a reference genome, the resulting BAM file will also be at least 700 Mb. This is because you copy the data all the time in typical work flows involving FASTQ and BAM files. Thus you will need about 2.5 Gb disk space for every SRA dataset you process.

With my (maybe limited) resources the download of a single dataset from the european short read archive took about 5 minutes. This means that downloading 3000 datasets will take about 10 days to finish. All this is feasible but it is definitely not a project for a rainy sunday afternoon.

ADD COMMENT
0
Entering edit mode

I agree this will take a ton of disk space. Fortunately I am not limited right now on disk space, and I am actually going to filter out a good portion of the samples. However I am still at a loss on how to do it still.

ADD REPLY
0
Entering edit mode
9.9 years ago
Lee Katz ★ 3.2k

It's ridiculous (see: piet's answer in the form of difficulties) but this was my solution. I am sure it is imperfect.

# create the xml
esearch -db biosample -query '"Vibrio cholerae O1"[organism]' | efetch -format xml > Vc.biosample.xml

# turn the xml into some nice spreadsheet for biosample
cat Vc.biosample.xml | xtract -pattern BioSample -element Id -block Attributes -absent "Sample name" -present strain -element Attribute > Vc.biosample.tsv

# figure out a spreadsheet with name/assembly/sra
cat Vc.biosample.tsv | perl -F'\t' -lane 'chomp; chomp(@F); my($SAMN,$name); for(@F){if(/^SAM/ && !$SAMN){$SAMN=$_;} elsif(!$name && !/^SAM/ && !/^ERS/ && !/^AE\d+/ && !/^A[A-Z]\w+/ && !/^SRS/ && !/^GSM/ && !/^CP/ && !/^\d+$/){$name=$_; $origName=$name; $name=~s/^\s+|\s+$//g; $name=~s/(\s|\/|\.)+/_/g;} } $origName=~s/.*(strain|str\.)\s*//i; $cmd="esearch -db biosample -query \"${origName}[all] Vibrio cholerae O1[organism]\""; $asm=`$cmd | elink -target assembly |efetch -format docsum| xtract -pattern DocumentSummarySet -element DocumentSummary/Id`; $sraxml=`$cmd | elink -target sra |efetch -format docsum| xtract -pattern DocumentSummary -element Runs`; $asm=~s/^\s+|\s+$//g; my @SRR; while($sraxml=~/([ES]RR\d+)/g){push(@SRR,$1);} print STDERR $cmd; print join("\t",$origName,$asm,join(",",@SRR));'  | tee Vc.asmSra.tsv

From there I will use other tools like fastq-dump to download the data into the appropriate directories.

ADD COMMENT
0
Entering edit mode

Your are querying Entrez for "Vibrio cholerae O1" which means serogroup O1 only. Submitters are allowed to assign arbitrary taxonid to their sequences or biosamples. In my experience "species" is usually assigned correctly, but subspecies taxons are used rather inconsistently.

Here are just two examples of serogroup O1 isolates which are direct subtaxons of "Vibrio cholerae" instead of being subtaxons of "Vibrio cholerae O1" and which will be missed by your query:

  1. biosample SAMN02744003: ATCC1403 (Ogawa serovar O:1)
  2. biosample SAMN02603897: Vibrio cholerae M66-2 (/serotype="O1", see CP001233)
ADD REPLY
0
Entering edit mode
9.8 years ago
piet ★ 1.9k

I have tinkered around with the web interface of NCBI and found that it is possible to download the results of a search in the SRA database as a nice and easy to parse CSV file. This option is called 'RunInfo' (select 'Send to File' / File / Format / 'RunInfo' in your webbrowser). This CSV file even contains the links for downloading the SRA files.

Moreover, data in 'runinfo' CSV format can be downloaded also programmatically. Here is a short python program to accomplish this task:

#!/usr/bin/python2.7

import json, urllib

query = '"Vibrio cholerae O1"[Organism]"'

params = {'db':'sra', 'term':query, 'usehistory':'y', 'retmode':'json',}    
url = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
stream = urllib.urlopen(url, urllib.urlencode(params))
answer = json.loads(stream.read())
webenv = answer["esearchresult"]["webenv"]
query_key = answer["esearchresult"]["querykey"]

params = {'db':'sra', 'save':'efetch', 'rettype':'runinfo', 'WebEnv':webenv, 'query_key':query_key,}
url = 'http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi'
outfile = 'runinfo.csv'
urllib.urlretrieve(url, data=urllib.urlencode(params), filename=outfile)
ADD COMMENT

Login before adding your answer.

Traffic: 2480 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6