Question

wonderdump.sh and FTP site URL conventions for SRR identifiers

0


7.3 years ago

mbelmadani ★ 1.4k

I've been using wonderdump.sh from the Biostars handbook for some time. I'm now curious about the part that builds the FTP site URL:

PATH1=${SRR:0:6}
PATH2=${SRR:0:10}
URL="ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/${PATH1}/${PATH2}/${SRR}.sra"

The SRR IDs I've seen are either 9 or 10 characters long, so in both cases PATH2 is effectively the full SRR ID.

Examples:
ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR000/SRR000001/SRR000001.sra
ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR213/SRR2138040/SRR2138040.sra

I'm curious to know why this is done explicitly like this. Is it established that the "PATH2" portion of the ftp URLs will always be at most 10 characters long, in anticipation of length-11 SRR ids? As in, if an SRR id longer than 10 characters ever comes into use, then the PATH2 part should be the SRR id truncated at 10 characters? If that's the case, could someone point me to a reference where this convention is described?

If that's not the case or part of any known specification, then wouldn't an 11-character long identifier break wonderdump.sh?
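To make the concern concrete, here is a minimal sketch of what the `${SRR:0:10}` expansion does to identifiers of different lengths. The helper name `path2_of` and the 11-character ID `SRR12345678` are made up for illustration; I don't know whether an accession of that length exists.

```shell
#!/usr/bin/env bash
# Hypothetical helper that mirrors the PATH2 line from wonderdump.sh.
path2_of() {
    local srr="$1"
    # Bash substring expansion: keep at most the first 10 characters.
    printf '%s\n' "${srr:0:10}"
}

# 9-character, 10-character, and a made-up 11-character identifier.
for id in SRR000001 SRR2138040 SRR12345678; do
    p2=$(path2_of "$id")
    if [ "$p2" = "$id" ]; then
        echo "$id -> $p2 (unchanged)"
    else
        echo "$id -> $p2 (truncated)"
    fi
done
```

The first two IDs come back unchanged, while the 11-character one loses its last digit, so the directory component of the URL would no longer match the file name.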

Much appreciated!

ftp SRA wonderdump ncbi convention • 2.1k views

ADD COMMENT • link 7.3 years ago by mbelmadani ★ 1.4k

0


What NCBI may or may not do in the future is speculative. As of now, there is a finite set of directories at the PATH1 level, covering SRR000 to SRR999 (some of them are still empty, so there is room for growth). If NCBI does start using longer IDs in PATH2, it would be a simple change to account for that.

As I recall, wonderdump.sh was specifically written to allow SRA downloads to work on the Linux subsystem on Windows 10.

In general, EBI-ENA should be your first stop to download fastq format sequence data. This avoids having to deal with SRA and its related inconveniences.

ADD REPLY • link 7.3 years ago by GenoMax 153k

0


Thanks for the reply!

I agree that what NCBI chooses to do is speculative. However, wonderdump.sh explicitly truncates the SRR ID at 10 characters to create PATH2, even though there is currently no difference between doing this and using the full SRR ID. I'm assuming there's a good reason for this, possibly based on some convention; otherwise the script could just use the SRR ID as-is.

You're right that the script used to mention something about being a workaround for Windows Bash, but it's been updated to:
"Wonderdump is a workaround to download SRA files directly when fastq-dump's internet connection does not work. Which can happen surprisingly frequently."
That is true, at least in my experience, and it is the reason why we're getting .sra files with this method.

I agree EBI-ENA is more straightforward, but my requirement is to download datasets from SRA, so it's not up to me at this point. I'm maintaining a pipeline that uses wonderdump.sh, so I would like to future-proof it as much as possible.

ADD REPLY • link 7.3 years ago by mbelmadani ★ 1.4k

1


I don't think there is a published convention. Here are some examples of 9-character SRR IDs. Try to see what happens with these. Istvan may have used examples of 10-character SRR IDs in the handbook and thus in wonderdump.sh.

ADD REPLY • link 7.3 years ago by GenoMax 153k

score 1 · Accepted Answer · 2018-06-13

I got a reply from an SRA curator:

The PATH2 is intended to be the full SRA Run accession, and is not restricted to a character limit.

So it appears that doing PATH2=${SRR:0:10} is not necessary or requested by any convention, and would probably indeed break if 11-character long identifiers ever come into use. The rest of the e-mail pasted below also suggests that the FTP site may not be available in the future:

However, as the SRA database grows to a very large, this avenue for getting SRA Run files becomes more difficult to maintain. The SRA will provide support for the ByRun and ByStudy FTP paths to accessions for now, but our systems group predicts that it may not be able to support it at some point in the future and suggests using the SRA toolkit to access Runs (https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc).

For my purposes, I'll update wonderdump.sh to make sure it doesn't silently truncate PATH2 at 10 characters, or at least explicitly raise an error if it runs into an identifier longer than 10 characters.
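A minimal sketch of what that update could look like, assuming the URL layout quoted in the question. The function name `build_sra_url` and the "SRR plus at least six digits" sanity check are my own assumptions, not part of wonderdump.sh or any NCBI specification, and this says nothing about whether the FTP path is still being served.

```shell
#!/usr/bin/env bash
# Hypothetical replacement for the URL-building lines in wonderdump.sh.
build_sra_url() {
    local srr="$1"
    # Assumed sanity check: "SRR" followed by six or more digits.
    if [[ ! "$srr" =~ ^SRR[0-9]{6,}$ ]]; then
        echo "error: '$srr' does not look like an SRR accession" >&2
        return 1
    fi
    local path1="${srr:0:6}"   # first six characters, e.g. SRR213
    local path2="$srr"         # full accession, no 10-character cut-off
    printf 'ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/%s/%s/%s.sra\n' \
        "$path1" "$path2" "$srr"
}

# Example call with one of the IDs from the question.
build_sra_url SRR2138040
```

Using the full accession for PATH2 follows the curator's reply. If failing loudly is preferred instead, the same function could return an error whenever `${#srr}` is greater than 10 rather than building the URL.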