Question

How to pull metadata from SRA based on BioSample ID

4.4 years ago

millere • 0

I have a list of over 8,000 BioSample IDs in a text file and I want to use this list to pull specific information on associated sequencing runs from the SRA database. The code I'm using does work, but it's very, very slow. Is there a faster way to do this using entrez-direct? Am I just missing something very obvious? Any help would be very much appreciated!

# an example biosample list
echo "SAMN14390563
SAMN14390566
SAMN14390576
SAMN14390578
SAMN14453547
SAMN14453553" > biosamples.txt

# pull SRA info based on example biosample list
cat biosamples.txt | xargs -n 1 sh -c 'esearch -db sra -query "$0 [BSPL]" | \
join-into-groups-of 20 | \
efetch -db sra -format runinfo -mode xml | \
xtract -pattern Row -def "NA" -element Run spots bases spots_with_mates avgLength \
size_MB download_path Experiment LibraryStrategy LibrarySelection LibrarySource \
LibraryLayout InsertSize InsertDev Platform Model SRAStudy BioProject ProjectID \
Sample BioSample SampleType TaxID ScientificName SampleName CenterName \
Submission Consent' > metadata_sra.txt

ncbi entrez-direct • 3.3k views

updated 4.4 years ago by vkkodali_ncbi • written 4.4 years ago by millere

score 4 · Accepted Answer · 2020-07-11

You may want to try epost as follows:

$ cat samples.txt 
SAMN14390563
SAMN14390566
SAMN14390576
SAMN14390578
SAMN14453547
SAMN14453553
$ epost -db biosample -input samples.txt -format acc | \
elink -target sra | \
efetch -db sra -format runinfo -mode xml | \
xtract -pattern Row -def "NA" -element Run spots bases spots_with_mates avgLength \
size_MB download_path Experiment LibraryStrategy LibrarySelection LibrarySource \
LibraryLayout InsertSize InsertDev Platform Model SRAStudy BioProject ProjectID \
Sample BioSample SampleType TaxID ScientificName SampleName CenterName \
Submission Consent > metadata_sra.txt

This avoids running esearch once for every single accession: epost uploads the whole list in one request, and elink then maps the BioSample records to their linked SRA records. Note that epost limits the number of accessions you can provide at a time. To work around this, Entrez Direct comes with a built-in function called join-into-groups-of that can be used here. You can read more about it in the Entrez Direct documentation; look for it under the group "Processing in Groups".
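For a list of 8,000 IDs, one way to stay under the epost limit is to split the file into chunks and run the same pipeline on each chunk. The sketch below uses plain `split` rather than join-into-groups-of. The names `fetch_runinfo`, `run_in_batches`, and `BATCH_SIZE` are made up for illustration, and the default of 500 is an arbitrary guess, not a documented epost limit. The xtract field list is shortened for readability.

```shell
# Sketch only: batch a long BioSample list through the epost pipeline.
# BATCH_SIZE is an assumed chunk size, not an official EDirect limit.
BATCH_SIZE="${BATCH_SIZE:-500}"

# Fetch run info for one file of BioSample accessions (one per line).
fetch_runinfo() {
  epost -db biosample -input "$1" -format acc |
    elink -target sra |
    efetch -db sra -format runinfo -mode xml |
    xtract -pattern Row -def "NA" -element Run BioSample LibraryStrategy Platform
}

# Split the list into chunks of BATCH_SIZE lines and run a command on each.
# The optional second argument names that command (default: fetch_runinfo).
run_in_batches() {
  local list="$1" fetch="${2:-fetch_runinfo}" tmpdir batch
  tmpdir="$(mktemp -d)"
  split -l "$BATCH_SIZE" "$list" "$tmpdir/batch_"
  for batch in "$tmpdir"/batch_*; do
    "$fetch" "$batch"
  done
  rm -rf "$tmpdir"
}

# Example call (requires EDirect and network access):
#   run_in_batches biosamples.txt > metadata_sra.txt
```

With this layout the redirect sits outside the loop, so the output of every chunk ends up in the same file. If a chunk fails, lowering `BATCH_SIZE` is the first thing to try.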