Hi all,I wrote this script to loop over two sets of fastq files I have (sample1_R1 and sample1_R2 sample2_R1 sample2_R2) I want cutadapt to do its thing on sample1 as paired end and sample2 as paired end.
Originally I had issues with this loop iterating over the trimmed output but I changed "${name}_trimmed.R2.fastq.gz
" for -p, and -o, to "${name}_trimmed.fastq.gz
" and this took care of that issue. My for loop can be found in the code below under COMMAND EXECUTED.
Now, the problem is this:
- This change I made to the output file names, writes the correct names for the output files in my folder. so I have
sample1_R1_trimmed.fastq.gz
and the samesample1_R2_trimmed.fastq.gz
. Good so far, although there is something I noticed in the output summary file( when the script is executed I have job_ID.o and job_ID.e files, the summary report I am referring to is the Job_ID.o file ) , which if not redirected to a text file would spit out as "JobID.o" file.
Below is the first part of my Job_ID.o report:
Notice the command line parameters. It is taking in all R1 at a time and similarly taking all R2s at a time( not posted the R2 part). So basically the script is scmehowtelling Cutadapt to treat the files as single end, even though I have the -o and -p option which is issue #1.
Second issue is, despite this, the output files (sample1_R1_trimmed.fastq.gz
and sample1_R2_trimmed.fastq.gz
) that are written in my folder have the correct R1 and R2 names! Although I don't believe the the read trimmed is complteley good.
Why is my script treating this as SE? It could be I gave ${name}.fastq.gz twice and hence treating it as SE? I need to give it two files and I can't take delete one of these. So is there a way to indicate to use R1 and R2 at this point in code? I tried putting ${name}_R1.fastq.gz and that did not process any files at all! Am I losing my mind here? or not getting this Cutadapt thing right? After all this might be a Cutadapt thing and not a scripting thing?
So can I do to see if it is indeed Cutadapt problem or my code! you guessed it right! It is my code :) If you scroll down below my code here, you can see I ran Cutadapt directly on command line without a for loop and compared the summary of trimming to the output from the cold below on both sample 1 and sample2.
---- COMMAND EXECUTED: ---------------------------------------------------------
( for i in *.fastq.gz; do name=(`basename "$i" .fastq.gz`); cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -o ${name}_trimmed.fastq.gz -p ${name}_trimmed.fastq.gz ${name}.fastq.gz ${name}.fastq.gz; done )
--------------------------------------------------------------------------------
This is cutadapt 2.9 with Python 3.7.7
Command line parameters: -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -o sample1_R1_trimmed.fastq.gz -p sample1_R1_trimmed.fastq.gz sample1_R1.fastq.gz sample1_R1.fastq.gz
Processing reads on 1 core in paired-end mode ...
Finished in 82.19 s (115 us/read; 0.52 M reads/minute).
Here is the Cutadapt directly on commadline for both sample1 and sample2 as PE, to see if the numbers made sense .
for sample1
total read pairs processed: 716,612
Read 1 with adapter: **33,407** (4.7%)
Read 2 with adapter: **33,450** (4.7%)
for sample 1: Below is summary for Cutadapt launched through my script ( command in the script given above).
Total read pairs processed: 716,612
Read 1 with adapter: **33,407** (4.7%)
Read 2 with adapter: 29,236 (4.1%)
for sample 1: For R2:
Total read pairs processed: 716,612
Read 1 with adapter: 29,435 (4.1%)
Read 2 with adapter: **33,450** (4.7%)
Similar pattern for sample2. (not attached here) See? Too many funky things going on! I am not writing this script correctly and can't figure out what the problem is? I tried many modification to this code later to see if I can fix something. But honestly it is all a blind hunt! I want to understand my problem in my code here. Any help from anyone to correct this is greatly appreciated!
ADD REPLY • link • edit • moderate written 1 hour ago by geneart$$ • 10
The loop processes one sample_RX.fastq.gz a time, so it is obvious that it will process R1 first, then R2
Why don't you just loop over the sample names, like
before doing
cat sample.list
, one should specifyIFS=$'\n'
Yes! I forgot this, it's better to store old IFS, then get it back after things done
Thanks yztxwd! Will try and update.