Hello, I am trying to download fastq files (both single-cell and bulk RNA-seq) remotely on a cluster. I first create a screen session and then use SLURM to start a job. Next I take the bash script generated by SRA Explorer, save it, and run it to fetch the fastq files. It seems that some files are corrupt. I can see this from downstream processing with other tools, or sometimes simply by checking the file size. Each time I rerun the same SRA Explorer script, a DIFFERENT file is corrupted. When I individually run the curl command (from the bash script) for a corrupted fastq file, the file downloads properly.
What is going on? And what steps should I take to prevent this from happening?
Thank you
EDIT: The walltime used was >200 hours, which is sufficient for the job. It's not that the samples at the end were corrupt or incomplete; it's some random sample each time.
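For anyone hitting the same problem: a quick way to catch truncated .fq.gz files without md5sums is gzip's built-in integrity test, since a file cut off mid-download will almost always fail it. A minimal sketch (run in the download directory; the glob patterns are just examples):

```shell
#!/usr/bin/env bash
# Test every fastq.gz in the current directory for gzip integrity.
# A file truncated mid-download fails `gzip -t` with an unexpected-EOF/CRC error.
for f in *.fastq.gz *.fq.gz; do
  [ -e "$f" ] || continue          # skip unmatched globs
  if gzip -t "$f" 2>/dev/null; then
    echo "OK       $f"
  else
    echo "CORRUPT  $f"
  fi
done
```

This won't catch a download that happened to end on a valid gzip boundary, so comparing against the md5sums published by ENA is still the gold standard when available.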
Seconding this. A downside of tools like curl and wget is that they write directly to the final output file. That means, if the output is
foo.fq.gz
then they write to that file and the file size just grows during the download until it finishes. Hence, if a download fails prematurely, the file is present but incomplete. Without an md5sum this is hard to diagnose ad hoc. So, what you can do is:
1) Check your SLURM logs. There should be a log output file per job, and if you got timed out then there will be a message for that, so you know which job went incomplete.
2) Instead of curl or wget, use the Aspera download links from sra-explorer.info. Aspera only creates the visible final output file if the download finished successfully. See this guide for setup: Setting up Aspera Connect (ascp) on Linux and macOS
3) Not sure I should recommend this, but I usually do downloads (if they finish within minutes, not downloads that take hours) via the head node. It doesn't consume notable memory or CPU, and I don't get automatically killed on our cluster when doing it. It's probably bad advice because the head node should be taboo, but as long as it works... so what. Anyway, better to try the other options first.
I think running pure I/O jobs on login nodes is OK on most clusters, and there isn't any advantage in moving the download to a compute node. On some clusters they explicitly state that the login nodes are for compiling, moving data, etc. So I think it's fine to use nohup to download raw data.
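For the nohup route, a minimal invocation on the login node might look like this (the script name is just an example for whatever SRA Explorer generated):

```shell
# Run the download script detached from the terminal; stdout/stderr go to a
# log file so you can log out and check progress later with `tail -f`.
nohup bash sra_explorer_download.sh > download.log 2>&1 &
echo "started PID $!"   # keep the PID so the job can be checked or killed later
```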
GNU screen over nohup all day :)
I guess it's all a matter of taste; running plain nohup is more "conservative". It wouldn't be the first time I canceled a long-running process by accidentally pressing Ctrl-C to "get out" of the "log viewer", or forgot to turn on logging; but then again there are always pros and cons: https://stackoverflow.com/questions/20766300/nohup-vs-screen-which-is-better-for-long-running-process
The only case where one must use screen, in my understanding, is if you have to interact with the process later on, like entering passwords and so on.
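For reference, the screen equivalent can be started already detached with -dm, so there is no "log viewer" to accidentally Ctrl-C out of (session and script names are examples):

```shell
# Start the download in a detached screen session named "sra".
# -L enables logging to screenlog.0 in the working directory.
screen -L -dmS sra bash sra_explorer_download.sh

# Later: reattach with `screen -r sra`; detach again with Ctrl-A d.
```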