I have a list of over 8,000 BioSample IDs in a text file and I want to use this list to pull specific information on associated sequencing runs from the SRA database. The code I'm using does work, but it's very, very slow. Is there a faster way to do this using entrez-direct? Am I just missing something very obvious? Any help would be very much appreciated!
# an example biosample list
echo "SAMN14390563
SAMN14390566
SAMN14390576
SAMN14390578
SAMN14453547
SAMN14453553" > biosamples.txt
# pull SRA info based on example biosample list
cat biosamples.txt | xargs -n 1 sh -c 'esearch -db sra -query "$0 [BSPL]" | \
join-into-groups-of 20 | \
efetch -db sra -format runinfo -mode xml | \
xtract -pattern Row -def "NA" -element Run spots bases spots_with_mates avgLength \
size_MB download_path Experiment LibraryStrategy LibrarySelection LibrarySource \
LibraryLayout InsertSize InsertDev Platform Model SRAStudy BioProject ProjectID \
Sample BioSample SampleType TaxID ScientificName SampleName CenterName \
Submission Consent' > metadata_sra.txt
This is SO much faster! Thank you! Making use of the join-into-groups-of function, I ended up with the following command:
Whereas my original command was taking 4+ hours to run, this does the same thing in just a few minutes!
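For anyone who finds this later: the speed-up comes from sending the IDs in batches rather than making one round-trip to NCBI per BioSample. Here is a rough sketch of the same batching idea using only standard tools. The helper names build_queries and fetch_runinfo are my own (hypothetical, not part of entrez-direct), and I am assuming that several "ID [BSPL]" terms can be combined with OR in a single SRA query, so please check a small batch against your per-ID results before trusting it on all 8,000 IDs.

```shell
#!/bin/sh
# Hypothetical helper: turn a file of BioSample IDs (one per line) into
# one Entrez query string per batch of N IDs, e.g.
#   SAMN1 [BSPL] OR SAMN2 [BSPL] OR SAMN3 [BSPL]
build_queries() {
  # $1 = ID file, $2 = batch size
  # xargs -n N echoes N IDs per line, separated by single spaces;
  # sed turns each space into " [BSPL] OR " and tags the last ID too.
  xargs -n "$2" < "$1" | sed 's/ / [BSPL] OR /g; s/$/ [BSPL]/'
}

# Hypothetical wrapper around the pipeline from the question: one
# esearch/efetch round-trip per batch instead of one per BioSample.
# Assumes entrez-direct is installed; it is only defined here, not called.
fetch_runinfo() {
  # $1 = ID file, $2 = batch size (defaults to 20)
  build_queries "$1" "${2:-20}" | while IFS= read -r query; do
    # < /dev/null keeps esearch from reading the remaining queries
    # that the while loop is still waiting to consume from stdin.
    esearch -db sra -query "$query" < /dev/null |
      efetch -db sra -format runinfo -mode xml |
      xtract -pattern Row -def "NA" -element Run BioSample LibraryStrategy
  done
}
```

With the example biosamples.txt from the question, `build_queries biosamples.txt 3` prints two query lines of three IDs each, and `fetch_runinfo biosamples.txt 20 > metadata_sra.txt` would run the batched lookup (add the other xtract fields back in as needed).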