Help with parallelization and loop cutadapt
4.1 years ago

Hi all!

I'm painfully inexperienced when it comes to coding. I know it's possible to use cutadapt for trimming without writing a separate line of code for each sample, but I'm not sure how. I have 24 paired-end samples, all with variations on the following names:

A1_S12_R1_001.fastq
A1_S12_R2_001.fastq
A3_S13_R1_001.fastq
A3_S13_R2_001.fastq
B1_S14_R1_001.fastq
B1_S14_R2_001.fastq
B3_S15_R1_001.fastq
B3_S15_R2_001.fastq
...

So I've got separate cutadapt lines to trim each:

cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o A1_S12_R1_001_trimmed.fastq -p A1_S12_R2_001_trimmed.fastq A1_S12_R1_001.fastq A1_S12_R2_001.fastq > A1_S12_cutadapt.txt

cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o A3_S13_R1_001_trimmed.fastq -p A3_S13_R2_001_trimmed.fastq A3_S13_R1_001.fastq A3_S13_R2_001.fastq > A3_S13_cutadapt.txt

I know there is a way to list my fastqs and drop the root of the file name into a loop command, something like this:

for i in $(ls *_001.fastq | sed 's/_R[12]_001\.fastq//' | sort -u); do cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o ${i}_R1_001_trimmed.fastq -p ${i}_R2_001_trimmed.fastq ${i}_R1_001.fastq ${i}_R2_001.fastq > ${i}_cutadapt.txt; done
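
A minimal sketch of such a loop, assuming every forward-read file ends in _R1_001.fastq and its mate differs only in R1 versus R2. The helper name print_trim_commands is hypothetical; it only prints the cutadapt commands so they can be inspected before anything is run. The expression ${r1%_R1_001.fastq} removes that suffix from the file name, leaving the sample root (for example A1_S12).

```bash
# Hypothetical helper: print one cutadapt command per sample found in the
# current directory. Nothing is executed here.
print_trim_commands() {
  for r1 in *_R1_001.fastq; do
    # If no file matches, the glob stays literal; skip it.
    [ -e "$r1" ] || continue
    # Strip the suffix to get the sample root, e.g. A1_S12.
    root=${r1%_R1_001.fastq}
    printf '%s\n' "cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o ${root}_R1_001_trimmed.fastq -p ${root}_R2_001_trimmed.fastq ${root}_R1_001.fastq ${root}_R2_001.fastq > ${root}_cutadapt.txt"
  done
}
```

Calling print_trim_commands in the directory with the fastq files lists the commands; once they look right, the output could be piped to sh to run them one after another. This assumes file names without spaces.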

Actually, I'd ideally run it using GNU Parallel but I know the syntax is slightly different. In fact, I've used something like this for non-paired end samples before, but don't know how to adapt it for paired end reads:

ls | time parallel -j+0 --eta 'fastx_clipper -a TGGAATTCTCGGG -c -v -i {} -o ../processing/{.}.clip'

Any suggestions or further reading would be appreciated. I'd love to understand these variables better.

4.1 years ago
Dave Carlson ★ 2.1k

Hi Russel, I believe something like the following will work for your cutadapt command:

parallel --verbose --link 'cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o {1.}.trimmed.fastq -p {2.}.trimmed.fastq {1} {2}  > {1.}_cutadapt.txt' ::: *R1_001.fastq ::: *R2_001.fastq

In the case of the filenames you provided, you should see the following output:

A1_S12_R1_001.trimmed.fastq
A1_S12_R2_001.trimmed.fastq
A1_S12_R1_001_cutadapt.txt

The above command assumes that the fastq files to be trimmed are in your current working directory. The --link flag pairs the first R1 file with the first R2 file, the second with the second, and so on, so each R1 stays with its R2 as long as both lists sort in the same order. The {1.} and {2.} replacement strings take the first and second input file names and strip the final extension (.fastq), allowing you to add the "trimmed" part to the name.

The --verbose flag will print out each command that is run. You could also try replacing this with --dry-run to make sure that each command looks appropriate.
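
To see which files end up paired by position, the pairing can be mimicked in plain shell without GNU Parallel. This is only a rough sketch: the helper name preview_linked_pairs is hypothetical, and it assumes file names without spaces or newlines. It prints the nth file matching *R1_001.fastq next to the nth file matching *R2_001.fastq, separated by a tab.

```bash
# Hypothetical helper: show which R1 and R2 files are matched when the two
# glob lists are paired purely by position.
preview_linked_pairs() {
  # One R2 file name per line, in glob (sorted) order.
  r2_list=$(printf '%s\n' *R2_001.fastq)
  n=0
  for r1 in *R1_001.fastq; do
    n=$((n + 1))
    # Pick the nth R2 name to go with the nth R1 name.
    r2=$(printf '%s\n' "$r2_list" | sed -n "${n}p")
    printf '%s\t%s\n' "$r1" "$r2"
  done
}
```

If any printed line shows two different sample names side by side, the two lists are out of step and the pairing should not be trusted.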

This worked perfectly! Thanks Dave!

This command can cause problems when the number of R1 and R2 files matches but the sample names do not. A simple example is below:

input files:

$ ls *.fastq
test2_R2_001.fastq  test_R1_001.fastq

with the command from the answer above:

$ parallel --dry-run --link 'cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o {1.}.trimmed.fastq -p {2.}.trimmed.fastq {1} {2}  > {1.}_cutadapt.txt' ::: *R1_001.fastq ::: *R2_001.fastq

cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o test_R1_001.trimmed.fastq -p test2_R2_001.trimmed.fastq test_R1_001.fastq test2_R2_001.fastq  > test_R1_001_cutadapt.txt

test2_R2_001.fastq is not the mate of test_R1_001.fastq, yet the command pairs them. It would run, but the result would be incorrect.

The following command is safer IMHO. It derives the R2 name from each R1 name, so only matching R1 and R2 files are paired.

$ parallel --dry-run --link 'cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o {.}.trimmed.fastq -p {=s/R1/R2/;s/\.fastq//=}.trimmed.fastq {} {=s/R1/R2/=}  > {.}_cutadapt.txt' ::: *R1_001.fastq

cutadapt -a AGATCGGAAGAG -A AGATCGGAAGAG -o test_R1_001.trimmed.fastq -p test_R2_001.trimmed.fastq test_R1_001.fastq test_R2_001.fastq  > test_R1_001_cutadapt.txt

Now the test2 sample is not picked up, and the command would fail because there is no test_R2_001.fastq.
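
Rather than waiting for cutadapt to fail, missing mates can be listed up front. A minimal sketch, assuming the _R1_001.fastq / _R2_001.fastq naming used in this thread; the helper name missing_mates is hypothetical.

```bash
# Hypothetical helper: print every R1 file whose R2 mate is absent from the
# current directory. No output means every R1 has a mate.
missing_mates() {
  for r1 in *_R1_001.fastq; do
    # If no file matches, the glob stays literal; skip it.
    [ -e "$r1" ] || continue
    # Build the expected mate name from the R1 name.
    r2="${r1%_R1_001.fastq}_R2_001.fastq"
    [ -e "$r2" ] || printf '%s\n' "$r1"
  done
}
```

With the two example files above, missing_mates prints test_R1_001.fastq. It does not report an R2 file that lacks an R1, such as test2_R2_001.fastq; that would need the same check run in the other direction.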
