I have a pipeline script called pipeline.sh. I usually execute this for a single sample like so:
$ pipeline.sh sample1 sample1.R1.fastq.gz sample1.R2.fastq.gz
Here $1 is the sample ID, and $2 and $3 are the read files.
I use GNU parallel with a parameters file that specifies the paths to each file.
$ nohup parallel -j 4 -a params.pipeline.txt --colsep '\s+' ./pipeline.sh
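For reference, the params file has one whitespace-separated line per sample, with columns mapping to $1 (sample ID), $2 (R1), and $3 (R2) of pipeline.sh. A minimal sketch — the paths and sample names here are hypothetical:

```shell
# Hypothetical params.pipeline.txt; each column becomes one argument
# to pipeline.sh via parallel's --colsep splitting.
cat > params.pipeline.txt <<'EOF'
sample1 /data/run1/sample1/sample1.R1.fastq.gz /data/run1/sample1/sample1.R2.fastq.gz
sample2 /data/run1/sample2/sample2.R1.fastq.gz /data/run1/sample2/sample2.R2.fastq.gz
EOF
```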
I want to automate my pipeline so that I don't need a parameters file, and instead have it search the folders for the reads (the same file structure as they come off the sequencer).
I have:
parallel -j 4 ./pipeline.sh {1/.} {1} {2} ::: *.R1.fastq.gz :::+ *.R2.fastq.gz
However, this assumes the FASTQs are all in the same folder. How can I change my parallel command so that it searches through the folder structure for the files?
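One approach that doesn't depend on everything sitting in one folder: let find locate the R1 files anywhere under the run directory, then derive the sample ID and the R2 mate from each R1 path. A minimal sketch, assuming the usual `<sample>.R1.fastq.gz` / `<sample>.R2.fastq.gz` naming — the demo/ tree and sample names are made up for illustration:

```shell
#!/bin/bash
# Build a mock sequencer-style folder tree (hypothetical layout) to test against.
mkdir -p demo/run1/sample1 demo/run1/sample2
touch demo/run1/sample1/sample1.R1.fastq.gz demo/run1/sample1/sample1.R2.fastq.gz
touch demo/run1/sample2/sample2.R1.fastq.gz demo/run1/sample2/sample2.R2.fastq.gz

# Emit one "sampleID R1 R2" line per pair. The same stream can be fed
# straight into parallel, deriving the ID and R2 with perl expressions
# inside {= =}, e.g.:
#   find demo -name '*.R1.fastq.gz' |
#     parallel -j 4 ./pipeline.sh '{= s:.*/::; s:\.R1\.fastq\.gz$:: =}' {} '{= s:\.R1\.:.R2.: =}'
find demo -name '*.R1.fastq.gz' | sort | while read -r r1; do
  sample=$(basename "$r1" .R1.fastq.gz)   # strip directory and suffix -> sample ID
  r2=${r1/.R1./.R2.}                      # swap R1 for R2 in the path
  echo "$sample $r1 $r2"
done
```

The key point is that the R2 path and sample ID are both computable from the R1 path, so find only needs to match one read of each pair.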
This sounds like something Nextflow would handle fairly easily. It has a fairly steep learning curve, but it's worthwhile, and the documentation is very thorough. I've implemented read mapping, SNP calling, SV calling, and methylation calling pipelines that only need the input path, run ID, and reference genome path as parameters for most runs; Nextflow handles the rest, including discovering the read files.
What's your folder structure?