Question

Help with parallelizing and looping cutadapt

0

4.4 years ago

russell.stewart.j ▴ 30

Hi all!

I'm painfully inexperienced when it comes to coding. I know it's possible to run cutadapt for trimming without writing a separate line of code for each sample, but I'm not sure how. I have 24 paired-end samples, all with variations on the following names:

A1_S12_R1_001.fastq
A1_S12_R2_001.fastq
A3_S13_R1_001.fastq
A3_S13_R2_001.fastq
B1_S14_R1_001.fastq
B1_S14_R2_001.fastq
B3_S15_R1_001.fastq
B3_S15_R2_001.fastq
...

So I've got separate cutadapt lines to trim each:

cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o A1_S12_R1_001_trimmed.fastq -p A1_S12_R2_001_trimmed.fastq A1_S12_R1_001.fastq A1_S12_R2_001.fastq > A1_S12_cutadapt.txt

cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o A3_S13_R1_001_trimmed.fastq -p A3_S13_R2_001_trimmed.fastq A3_S13_R1_001.fastq A3_S13_R2_001.fastq > A3_S13_cutadapt.txt

I know there is a way to list my fastqs and drop the root of the file name into a loop command, something like this:

for i in $(ls *_R[12]_001.fastq | sed 's/_R[12]_001\.fastq$//' | sort -u); do cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o "${i}_R1_001_trimmed.fastq" -p "${i}_R2_001_trimmed.fastq" "${i}_R1_001.fastq" "${i}_R2_001.fastq" > "${i}_cutadapt.txt"; done

Ideally, I'd run it using GNU Parallel, but I know the syntax is slightly different. I've used something like this for single-end samples before, but I don't know how to adapt it for paired-end reads:

ls | time parallel -j+0 --eta 'fastx_clipper -a TGGAATTCTCGGG -c -v -i {} -o ../processing/{.}.clip'

Any suggestions or further reading would be appreciated. I'd love to understand these variables better.
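
As a way to see what the loop variable is doing, here is a minimal sketch that prints one cutadapt command per sample instead of running it. It assumes every pair is named <sample>_R1_001.fastq and <sample>_R2_001.fastq, as in the listing above; print_trim_cmd is a hypothetical helper name, not part of cutadapt.

```shell
#!/usr/bin/env bash
# Sketch only: prints the cutadapt command for each sample rather than running it.
# Assumes files are named <sample>_R1_001.fastq and <sample>_R2_001.fastq.

# print_trim_cmd is a hypothetical helper, not something cutadapt provides.
print_trim_cmd() {
    local r1="$1"
    # ${r1%_R1_001.fastq} strips that suffix, leaving the sample prefix (e.g. A1_S12).
    local sample="${r1%_R1_001.fastq}"
    local r2="${sample}_R2_001.fastq"
    echo "cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG" \
         "-o ${sample}_R1_001_trimmed.fastq -p ${sample}_R2_001_trimmed.fastq" \
         "${r1} ${r2} > ${sample}_cutadapt.txt"
}

# Loop over the R1 files only; each R1 name determines its R2 partner.
for r1 in *_R1_001.fastq; do
    [ -e "$r1" ] || continue   # skip the literal pattern when nothing matches
    print_trim_cmd "$r1"
done
```

Once the printed commands look right, one option is to pipe the output to bash, or to replace the echo with the command itself.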

bash cutadapt loop • 2.1k views

ADD COMMENT • link updated 4.4 years ago by Dave Carlson ★ 2.1k • written 4.4 years ago by russell.stewart.j ▴ 30

score 2 · Answer 1 · 2020-12-21

2

4.4 years ago

Dave Carlson ★ 2.1k

Hi Russell, I believe something like the following will work for your cutadapt command:

parallel --verbose --link 'cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o {1.}.trimmed.fastq -p {2.}.trimmed.fastq {1} {2}  > {1.}_cutadapt.txt' ::: *R1_001.fastq ::: *R2_001.fastq

For the filenames you provided, the first pair should produce the following output files:

A1_S12_R1_001.trimmed.fastq
A1_S12_R2_001.trimmed.fastq
A1_S12_R1_001_cutadapt.txt

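The above command assumes that the fastq files to be trimmed are in your current working directory. The --link flag pairs the first file matched by *R1_001.fastq with the first file matched by *R2_001.fastq, the second with the second, and so on, so each R1 stays with its R2; without it, parallel would run every combination of R1 and R2 files. {1} and {2} are the arguments taken from the first and second input sources, and {1.} and {2.} are those same file names with the final extension (.fastq) removed, which lets you add the "trimmed" part to the name.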
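
To make the replacement strings more concrete, here is a rough shell illustration of what {1.} and --link do. It is only an analogy written in plain shell, not how GNU Parallel is implemented, and strip_ext and link_pairs are hypothetical helper names. It assumes file names without spaces and two lists of equal length.

```shell
#!/usr/bin/env bash
# Illustration only: shell analogues of two GNU Parallel features.

# strip_ext mimics {.}: remove the last extension from a file name.
strip_ext() {
    local f="$1"
    echo "${f%.*}"
}

# link_pairs mimics --link: pair the nth R1 name with the nth R2 name.
# $1 and $2 are space-separated lists of equal length (an assumption of this sketch).
link_pairs() {
    local r1_list="$1" r1
    set -- $2                  # positional parameters now hold the R2 names
    for r1 in $r1_list; do
        echo "$r1 $1"          # current R1 next to the R2 in the same position
        shift                  # advance to the next R2 name
    done
}

strip_ext A1_S12_R1_001.fastq
link_pairs "A1_S12_R1_001.fastq A3_S13_R1_001.fastq" \
           "A1_S12_R2_001.fastq A3_S13_R2_001.fastq"
```

The pairing is purely positional, so it is only as good as the sort order of the two file lists.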

The --verbose flag prints each command as it is run. You could also replace it with --dry-run, which prints the commands without running them, to check that each one looks right.

ADD COMMENT • link 4.4 years ago by Dave Carlson ★ 2.1k

1

This worked perfectly! Thanks Dave!

ADD REPLY • link 4.4 years ago by russell.stewart.j ▴ 30

0

This command can cause problems if the numbers of R1 and R2 files match but the names do not, because --link pairs files by position, not by name. A simple example:

input files:

$ ls *.fastq
test2_R2_001.fastq  test_R1_001.fastq

with the command from the answer:

$ parallel --dry-run --link 'cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o {1.}.trimmed.fastq -p {2.}.trimmed.fastq {1} {2}  > {1.}_cutadapt.txt' ::: *R1_001.fastq ::: *R2_001.fastq

cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o test_R1_001.trimmed.fastq -p test2_R2_001.trimmed.fastq test_R1_001.fastq test2_R2_001.fastq  > test_R1_001_cutadapt.txt

test2_R2_001.fastq is not the mate of test_R1_001.fastq, yet it is paired with it. The command would run, but the result would be incorrect.

The following command is safer IMHO. It derives the R2 name from each R1 name, so only matching R1 and R2 files are paired.

$ parallel --dry-run --link 'cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o {.}.trimmed.fastq -p {=s/R1/R2/;s/\.fastq//=}.trimmed.fastq {} {=s/R1/R2/=}  > {.}_cutadapt.txt' ::: *R1_001.fastq

cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o test_R1_001.trimmed.fastq -p test_R2_001.trimmed.fastq test_R1_001.fastq test_R2_001.fastq  > test_R1_001_cutadapt.txt

Now the test2 sample is not picked up, and the command would fail because there is no test_R2_001.fastq.
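
If you would rather catch a missing mate before launching anything, a small pre-flight check along these lines may help. This is a sketch under the same naming assumption (<sample>_R1_001.fastq / <sample>_R2_001.fastq); check_pairs is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: report every R1 file whose R2 mate is missing.
# check_pairs is a hypothetical helper; it returns 1 if any mate is absent.
check_pairs() {
    local dir="${1:-.}"   # directory to scan, defaults to the current one
    local missing=0 r1 r2
    for r1 in "$dir"/*_R1_001.fastq; do
        [ -e "$r1" ] || continue               # nothing matched the pattern
        # Build the expected R2 name by swapping only the trailing suffix.
        r2="${r1%_R1_001.fastq}_R2_001.fastq"
        if [ ! -e "$r2" ]; then
            echo "missing mate for $r1: expected $r2" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example use: check_pairs . && echo "all R1 files have an R2 mate"
```

Swapping only the trailing _R1_001.fastq suffix, rather than the first "R1" anywhere in the name, should also keep sample names that happen to contain "R1" intact.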

ADD REPLY • link 4.4 years ago by cpad0112 21k