I have a few thousand FastQ files from a single-cell RNA-seq experiment that I want to align and quantify with Salmon (paired-end, Illumina Smart-Seq2, human).
Because the file names vary, I thought using GNU parallel to feed the files to salmon would be simpler than writing a for-loop script (I have limited experience with bash and regex).
This was my code (slurm stuff excluded):
module load parallel
module load salmon

parallel salmon quant \
  -i /work/InternalMedicine/s184335/genome_folder/alias/hg38/salmon_sa_index/default/. \
  -l A --validateMappings --gcBias --seqBias --threads 48 \
  -o MDS_salmon_pseudoalignmentandquant \
  -1 {} -2 {=s/_R1_001_val_1/_R2_001_val_2/=} \
  ::: *_R1_001_val_1.fq.gz
Now, when I take a look at the log output, for most of the samples I get something like this:
where the program reloads salmon and moves on before it finishes analyzing the previous sample. In some cases I also get an error message saying salmon quant was invoked incorrectly.
Strangely though, towards the end of the log it shows several samples that were (I think) successfully mapped and quantified by salmon. However, if I go to the results folder, there is only one quant.sf file, for a single sample.
One last thing of note: when I submit the batch script (I'm on an HPC), the job gets interrupted every so often because of a node failure.
I have tried running salmon quant with only one sample and that seems to work okay.
Could GNU parallel somehow be causing these problems? Perhaps my script is problematic and makes Salmon prematurely move on to the next sample, without mapping/writing the results?
I'm quite lost as I can't see an obvious error message and I'm rather new to bioinformatics. Would appreciate any help.
Thanks for the answer. I'll try to figure out how to set up a new folder for each pair of files processed using parallel. But I'm wondering how this will also solve the problem of salmon moving on to the next pair of files before mapping and quantifying the previous pair.
As noted and pictured above:
Specifying a variable output folder will probably fix there being only one quant.sf file, but I'm not sure how it explains salmon not mapping and quantifying most of the samples. It doesn't even seem to read the pair of files; it stops right after loading the reference transcriptome.
Also, the files are not biological replicates. They are just paired reads from each single cell, so I'm passing them to salmon two by two, with each sample being a single cell.
My hunch is that as soon as a new job starts, it immediately overwrites the output folder, since every job uses the same folder name. Prior results are lost, and at the end you are left with the quant results of whichever sample was processed last. If instead each run had its own folder, e.g. Run1_out for Run1 and Run2_out for Run2, there would be no contention over directory names and no concurrent writes to the same directory.
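To make that concrete, here is a minimal dry-run sketch; the index path, thread count, and the `_R1_001_val_1.fq.gz` / `_R2_001_val_2.fq.gz` suffixes are assumptions carried over from the question, and the `echo` means it only prints the commands (remove it to actually run salmon):

```shell
#!/usr/bin/env bash
# Dry-run sketch: give every read pair its own salmon output directory,
# derived from the R1 filename, so concurrent runs never share a folder.
quant_all() {
  local r1 sample
  for r1 in *_R1_001_val_1.fq.gz; do
    sample="${r1%_R1_001_val_1.fq.gz}"   # CellA_R1_001_val_1.fq.gz -> CellA
    echo salmon quant -i /path/to/salmon_index -l A \
      --validateMappings --gcBias --seqBias --threads 8 \
      -1 "$r1" -2 "${sample}_R2_001_val_2.fq.gz" \
      -o "${sample}_out"                 # unique directory per sample
  done
}
```

With GNU parallel, the same effect comes from making `-o` a replacement string as well, e.g. `-o {=s/_R1_001_val_1\.fq\.gz$/_out/=}`, instead of a fixed directory name.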
salmon is fast enough that you could simply submit independent SLURM jobs and not use parallel at all. Something like this will get you a separate output directory (named after the sample) for every sample.
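A sketch of the per-sample submission idea (the sbatch options, index path, and file suffixes here are placeholders, not the exact original code); the `echo` keeps it a dry run so you can inspect the generated commands before removing it:

```shell
#!/usr/bin/env bash
# Dry-run sketch: one small SLURM job per cell, each writing to its own
# output directory, so a node failure only costs that one sample.
submit_all() {
  local r1 sample
  for r1 in *_R1_001_val_1.fq.gz; do
    sample="${r1%_R1_001_val_1.fq.gz}"
    echo sbatch --job-name "salmon_${sample}" --cpus-per-task 8 --wrap \
      "salmon quant -i /path/to/salmon_index -l A \
       --validateMappings --gcBias --seqBias --threads 8 \
       -1 ${r1} -2 ${sample}_R2_001_val_2.fq.gz -o ${sample}_out"
  done
}
```

An added benefit over one big job: when a node fails, only the affected per-sample jobs need to be resubmitted.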
Thanks mate, it looks like that was indeed the problem.
Also, thank you for the example code! I made it a single job since I didn't want to run thousands of individual jobs.
On a separate note, I really need to properly learn bash and regex.
Below is the final job script that I ran:
I will try to do the same with parallel and update for posterity.