I've been using wonderdump.sh from the Biostars handbook for some time. I'm now curious about the part that builds the ftp site url:
PATH1=${SRR:0:6}
PATH2=${SRR:0:10}
URL="ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/${PATH1}/${PATH2}/${SRR}.sra"
I've seen SRR ids be either of length 10 or length 9, so PATH2 is effectively the full SRR id in both these conditions.
Examples:
ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR000/SRR000001/SRR000001.sra
ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR213/SRR2138040/SRR2138040.sra
I'm curious to know why this is done explicitly like this. Is it established that the "PATH2" portion of the ftp URLs will always be at most 10 characters long, in anticipation of length-11 SRR ids? As in, if an SRR id longer than 10 characters ever comes into use, then the PATH2 part should be the SRR id truncated at 10 characters? If that's the case, could someone point me to a reference where this convention is described?
If that's not the case or part of any known specification, then wouldn't an 11-character long identifier break wonderdump.sh?
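To make the concern concrete, here is a minimal sketch of the same substring logic wrapped in a helper. The function name `build_url` is my own and not part of wonderdump.sh, and the 11-character id is made up purely for illustration; I'm not claiming such a run exists.

```shell
#!/usr/bin/env bash
# Sketch of the URL construction from wonderdump.sh.
# build_url is a hypothetical helper name, not taken from the script.
build_url() {
    local SRR=$1
    local PATH1=${SRR:0:6}    # first 6 characters, e.g. SRR213
    local PATH2=${SRR:0:10}   # first 10 characters; the whole id when it is 9 or 10 long
    echo "ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/${PATH1}/${PATH2}/${SRR}.sra"
}

# The two real ids from the examples above, plus a made-up 11-character id.
for id in SRR000001 SRR2138040 SRR21380401; do
    build_url "$id"
done
```

For the first two ids this reproduces the example URLs. For the made-up id, the directory component is cut to SRR2138040 while the file name stays SRR21380401.sra. Whether that matches how NCBI would actually lay out longer ids is exactly what I can't find documented.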
Much appreciated!
What NCBI may or may not do in the future is speculative. But as of now there is a finite set of directories at the PATH1 level, running from SRR000 to SRR999 (some of the directories are still empty, so there is room for growth). If NCBI does start using longer IDs in PATH2, it would be a simple change to account for that.
As I recall, wonderdump.sh was specifically written to allow SRA downloads to work on the Linux subsystem on Windows 10.
In general, EBI-ENA should be your first stop to download fastq format sequence data. This avoids having to deal with SRA and its related inconveniences.
Thanks for the reply!
I agree that what NCBI chooses to do is speculative. However, wonderdump.sh explicitly truncates the SRR id at 10 characters to create PATH2, when currently there's no difference between doing this and using the full SRR id. I'm assuming there's a good reason for this, possibly based on some convention, or else the script would/could just use the SRR id.
You're right that the script used to mention something about being a workaround for Windows Bash, but it's been updated to:
Wonderdump is a workaround to download SRA files directly when fastq-dump's internet connection does not work. Which can happen surprisingly frequently.
Which is true at least in my experience and the reason why we're getting .sra files with this method.
I agree EBI-ENA is more straightforward, however my requirement is to download datasets from SRA, so it's not up to me at this point. I'm maintaining a pipeline that uses wonderdump.sh, so I would like to future-proof this as much as possible.
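One defensive option I'm considering, given that I can't find a documented convention, is to have the pipeline fail loudly on ids longer than 10 characters instead of silently building a URL that may be wrong. A rough sketch follows; `check_srr_length` is my own hypothetical helper, not something in wonderdump.sh, and the 11-character id is again made up.

```shell
#!/usr/bin/env bash
# Hypothetical guard (check_srr_length is not part of wonderdump.sh):
# reject ids longer than 10 characters, since it is unknown how NCBI
# would name the PATH2 directory for them.
check_srr_length() {
    local SRR=$1
    if [ "${#SRR}" -gt 10 ]; then
        echo "Unexpected SRR id length (${#SRR}): ${SRR}" >&2
        return 1
    fi
    return 0
}

# A real 10-character id passes; a made-up 11-character id is rejected.
check_srr_length SRR2138040 && echo "ok: SRR2138040"
check_srr_length SRR21380401 || echo "rejected: SRR21380401"
```

This doesn't answer the convention question, but it would at least turn a silent mismatch into an explicit error.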
I don't think there is a published convention. Here are some examples of 9-character SRR IDs. Try to see what happens with these. Istvan may have used examples of 10-character SRR IDs in the handbook and thus in wonderdump.sh.