Connect SRA Bio Sample to Run
3.0 years ago
Ivan ▴ 60

I have a list of BioSample accession numbers for raw reads in the Sequence Read Archive (SRA). The list looks like this:

SAMN03421314  
SAMN03421315 
SAMN03421316 
SAMN03421317 
SAMN03421318 
SAMN03421319 
SAMN03421320 
SAMN03421321 
...

This list is stored in a text file. I downloaded every SRA file using sratoolkit's prefetch command. What I got is a set of folders, each containing an .sra file, but the folders are named by Run (e.g. SRR1927228) rather than by BioSample (e.g. SAMN03421321). I want to connect each BioSample to its Run (e.g. SAMN03421321 : SRR1927228) without doing it manually, as I have a lot of folders.

Is there a fast tool that does just that: looks up the two IDs without re-downloading the genomes?

SRA sratoolkit • 801 views
3.0 years ago
vkkodali_ncbi ★ 3.8k

You can use Entrez Direct for this as follows:

$ cat samples.txt 
SAMN03421314  
SAMN03421315 
SAMN03421316 
SAMN03421317 
SAMN03421318 
SAMN03421319 
SAMN03421320 
SAMN03421321 
$ epost -db biosample -input samples.txt | elink -target sra | efetch -format runinfo > runinfo.csv

The output CSV file has 47 fields, where field 1 is the SRA run accession and field 26 is the BioSample accession. You can parse the CSV using awk:

$ awk 'BEGIN{FS=",";OFS="\t"}{print $1,$26}' runinfo.csv
Run         BioSample
SRR1927184  SAMN03421314
SRR1927214  SAMN03421316
SRR1927218  SAMN03421318
SRR1927224  SAMN03421320
SRR1927212  SAMN03421315
SRR1927215  SAMN03421317
SRR1927221  SAMN03421319
SRR1927228  SAMN03421321
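
If you would rather not rely on the field positions (1 and 26) staying fixed, one option is to look the columns up by their header names instead. The following is only a sketch: it assumes the header row uses the names Run and BioSample, as in the output above, and it runs on a small stand-in file (runinfo_demo.csv, with a made-up middle column called OtherColumn) so that it works without a network connection. The file names runinfo_demo.csv and biosample_to_run.tsv are placeholders of my own.

```shell
# Abbreviated stand-in for runinfo.csv: two real rows from the output above,
# plus a hypothetical "OtherColumn" so that BioSample is not in a fixed position.
cat > runinfo_demo.csv <<'EOF'
Run,OtherColumn,BioSample
SRR1927184,x,SAMN03421314
SRR1927228,x,SAMN03421321
EOF

# Locate the Run and BioSample columns from the header row,
# then print "BioSample<TAB>Run" for every data row.
awk 'BEGIN { FS = ","; OFS = "\t" }
NR == 1 {
    for (i = 1; i <= NF; i++) {
        if ($i == "Run") run = i
        if ($i == "BioSample") sample = i
    }
    next
}
run && sample && $run != "" { print $sample, $run }' runinfo_demo.csv > biosample_to_run.tsv

cat biosample_to_run.tsv
```

To use it on the real data, point the awk command at runinfo.csv instead of the stand-in file. The resulting two-column table has the BioSample first, so it can be used directly to look up which Run folder belongs to which sample.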