Automating pipeline with Parallel, read files in separate folders
2.5 years ago
SaltedPork ▴ 170

I have a pipeline script called pipeline.sh . I usually execute this for a single sample like so:

$ pipeline.sh sample1 sample1.R1.fastq.gz sample1.R2.fastq.gz

Where $1 is sample ID, $2 and $3 are the read files.
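For reference, a minimal skeleton with that argument order might look like this (the body is hypothetical; only the argument order comes from the question, and the defaults exist only so the sketch runs standalone):

```shell
#!/usr/bin/env bash
# Hypothetical pipeline.sh skeleton; the real pipeline body is not shown
# in the question. Defaults are only so the sketch runs without arguments.
sample=${1:-sample1}                 # $1: sample ID
r1=${2:-sample1.R1.fastq.gz}         # $2: forward reads
r2=${3:-sample1.R2.fastq.gz}         # $3: reverse reads
echo "Running pipeline for ${sample}: ${r1} + ${r2}"
```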

I use GNU parallel with a parameters file that specifies the paths to each file.

$ nohup parallel -j 4 -a params.pipeline.txt --colsep '\s+' ./pipeline.sh
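For context, params.pipeline.txt holds one whitespace-separated line per sample with the same three columns (the paths below are made up):

```shell
# Write an illustrative params.pipeline.txt: one sample per line,
# columns = sample ID, R1 path, R2 path (all paths invented).
printf '%s\n' \
  'sample1 /data/run1/sample1.R1.fastq.gz /data/run1/sample1.R2.fastq.gz' \
  'sample2 /data/run1/sample2.R1.fastq.gz /data/run1/sample2.R2.fastq.gz' \
  > params.pipeline.txt
```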

I want to automate my pipeline so that I don't need a parameters file and it looks through the folders for the reads (same file structure as they come out of the sequencer).

I have:

parallel -j 4 ./pipeline.sh {1/.} ::: *.R1.fastq.gz :::+ *.R2.fastq.gz

However, this assumes the fastqs are all in the same folder. How can I change my parallel command so that it searches through the folder structure for the files?
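A sketch of what I mean, in plain bash (untested; the function name is invented):

```shell
# Untested sketch of what I'm after: find the R1 files anywhere under a
# run directory and derive the sample ID and R2 path from each hit.
run_all() {
  find "${1:-.}" -type f -name '*.R1.fastq.gz' |
  while IFS= read -r r1; do
    base=${r1##*/}                        # filename without the directories
    sample=${base%.R1.fastq.gz}           # sample ID
    r2=${r1%.R1.fastq.gz}.R2.fastq.gz     # sibling R2 file
    echo "./pipeline.sh $sample $r1 $r2"  # drop the echo to really run it
  done
}
```

Piping the echoed commands into `parallel -j 4` would restore the parallelism.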

fastq automation bash parallel

This sounds like something Nextflow could do fairly easily. It has a fairly steep learning curve, but it's worthwhile, and the documentation is very thorough. I've implemented read mapping, SNP calling, SV calling, and methylation calling pipelines that for the most part only need an input path, run ID, and reference genome path as parameters; Nextflow handles the rest.


What's your folder structure?

2.5 years ago

Using Nextflow (not tested, but it should look like this):

nextflow.enable.dsl = 1
params.directories=""

process scanDirectories {
output:
    path("paths.txt") into paths
script:
"""
find ${params.directories} -type f -name "*.R1.fq.gz" |\
    awk -F '/' '{S=\$NF;gsub("\\.R1\\.fq\\.gz\$","",S);F2=\$0;gsub("\\.R1\\.fq\\.gz\$",".R2.fq.gz",F2);printf("%s,%s,%s\\n",S,\$0,F2);}' > paths.txt

"""
}


paths.splitCsv(header: false,sep:',',strip:true).set{pipe_in}

process runPipeline {
tag "${sample}"
input:
    tuple val(sample),val(R1),val(R2) from pipe_in
output:
    path("result.txt") into result_ch
script:
"""
echo "DO Something ${sample} ${R1} ${R2}" > result.txt
"""
}

and then something like

nextflow run -resume script.nf --directories "/path/to/dir1 /path/to/dir2"
2.5 years ago
ole.tange ★ 4.5k

Let us assume the files are called:

a/b/c/d/sample1.R1.fastq.gz
a/b/c/d/sample1.R2.fastq.gz
a/b/e/f/sample2.R1.fastq.gz
a/b/e/f/sample2.R2.fastq.gz

Then you may try:

parallel --plus -j 4 ./pipeline.sh {/...} {} {/R1/R2} ::: */*/*/*/*.R1.fastq.gz
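For reference, with `--plus`, `{/...}` is the basename minus up to three extensions and `{/R1/R2}` substitutes R2 for the first R1; in plain bash the same derivation for one file looks like:

```shell
# Plain-bash equivalent of the --plus replacement strings, for one file
# (illustration only; GNU parallel does this for each input path).
f=a/b/c/d/sample1.R1.fastq.gz
base=${f##*/}                # basename
sample=${base%.R1.fastq.gz}  # like {/...}: basename minus three extensions
r2=${f/R1/R2}                # like {/R1/R2}: first "R1" replaced by "R2"
echo "$sample $f $r2"
```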