I was reading the following post about how to run trim_galore on multiple paired-end fastq.gz files: "TrimGalore! on multiple paired fastq files".
I installed GNU parallel and intended to use the same command suggested by eldronzhou:
find path_to_fastq -name "*_R1_merged.fastq.gz" | cut -d "_" -f1 | parallel -j 1 trim_galore --illumina --paired --fastqc -o trim_galore/ {}\_R1_merged.fastq.gz {}\_R2_merged.fastq.gz
However, my files are named slightly differently: XXX_XX_L008_R1_001.fastq.gz and XXX_XX_L008_R2_001.fastq.gz. Therefore I changed the command above to the following:
find path_to_fastq -name "*_R1_001.fastq.gz" | cut -d "R" -f1 | parallel -j 1 trim_galore --paired --fastqc -o trim_galore/ {}R1_001.fastq.gz {}R2_001.fastq.gz
However, I get the following error (showing up once for each pair of *fastq.gz files):
Please provide an even number of input files for paired-end FastQ trimming! Aborting ...
I'm guessing something is wrong in my syntax and somehow the order of the files I provide is wrong - maybe R1 and R2 are not given in the right pairs somehow? My *fastq.gz files are in a separate folder and I have 20 files (so 10 pairs).
I cannot work out what is wrong; any help would be deeply appreciated.
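One way to rule out the prefix-building step is to run it on a single file name outside of parallel. The sketch below uses one path from my data plus a made-up path (RNAseq/..., purely hypothetical) to show that cutting on "R" only works while no earlier "R" appears in the path, and that stripping the known suffix with shell parameter expansion avoids that assumption.

```shell
# One real-looking path from this thread and one hypothetical path with an early "R".
path="../../fastq/E970085_CTCAAGC_L008_R1_001.fastq.gz"
risky="RNAseq/fastq/S1_L008_R1_001.fastq.gz"   # hypothetical directory name

# Same approach as in the command above: keep everything before the first "R".
prefix=$(printf '%s\n' "$path" | cut -d "R" -f1)
echo "${prefix}R1_001.fastq.gz"
echo "${prefix}R2_001.fastq.gz"

# With an "R" earlier in the path, the first field is empty and the prefix is lost.
bad_prefix=$(printf '%s\n' "$risky" | cut -d "R" -f1)

# Stripping the known suffix does not depend on where the first "R" is.
good_prefix=${risky%R1_001.fastq.gz}
echo "cut prefix: '${bad_prefix}'  suffix-stripped prefix: '${good_prefix}'"
```

For my file names the cut version reconstructs both mates correctly, so this step alone would not explain the error.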
UPDATE
After running the command with --dry-run, I get the following output:
trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970085_CTCAAGC_L008_R1_001.fastq.gz ../../fastq/E970085_CTCAAGC_L008_R2_001.fastq.gz
trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970096_CGAAGGT_L008_R1_001.fastq.gz ../../fastq/E970096_CGAAGGT_L008_R2_001.fastq.gz
trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970084_CCTTGTC_L008_R1_001.fastq.gz ../../fastq/E970084_CCTTGTC_L008_R2_001.fastq.gz
trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970090_TCAGAAG_L008_R1_001.fastq.gz ../../fastq/E970090_TCAGAAG_L008_R2_001.fastq.gz
trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970092_ACAGTAC_L008_R1_001.fastq.gz ../../fastq/E970092_ACAGTAC_L008_R2_001.fastq.gz
trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970038_CTGGTTG_L008_R1_001.fastq.gz ../../fastq/E970038_CTGGTTG_L008_R2_001.fastq.gz
trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970094_AGGACTG_L008_R1_001.fastq.gz ../../fastq/E970094_AGGACTG_L008_R2_001.fastq.gz
trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970073_CTAGGTC_L008_R1_001.fastq.gz ../../fastq/E970073_CTAGGTC_L008_R2_001.fastq.gz
trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970088_GGATCAT_L008_R1_001.fastq.gz ../../fastq/E970088_GGATCAT_L008_R2_001.fastq.gz
trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970095_CAGTCAT_L008_R1_001.fastq.gz ../../fastq/E970095_CAGTCAT_L008_R2_001.fastq.gz
I still can't see an obvious mistake...
UPDATE2
I think I may have found the mistake. When using GNU parallel, I end up with commands (see just above) lacking the quotes around "--outdir ...". So I think the answer is to escape the quotes:
find ../../fastq/ -name "*R1_001.fastq.gz" | cut -d "R" -f1 | parallel --dry-run -j 1 trim_galore --paired --fastqc_args \"--outdir /home/user/my_projects/project1/data/qc/trim-galore\" {}R1_001.fastq.gz {}R2_001.fastq.gz
# Output command I want (one example only)
trim_galore --paired --fastqc_args "--outdir /home/user/my_projects/project1/data/qc/trim-galore" ../../fastq/E970085_CTCAAGC_L008_R1_001.fastq.gz ../../fastq/E970085_CTCAAGC_L008_R2_001.fastq.gz
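The effect can be reproduced without parallel or trim_galore. The sketch below rests on my understanding that GNU parallel hands the assembled command line to a shell, so quotes are interpreted a second time. Here eval stands in for that second pass, and count_args and reparse are hypothetical helpers, not part of either tool. If the quotes are lost, the output path becomes an extra argument, which would be consistent with the "even number of input files" error.

```shell
# count_args is a hypothetical helper that reports how many arguments it received.
count_args() { echo "$#"; }

# reparse joins its arguments into one string and evaluates it again,
# as a stand-in for a command line that passes through a second shell.
reparse() { eval "count_args $*"; }

# Direct call: the quoted string stays a single argument.
direct=$(count_args --fastqc_args "--outdir /tmp/qc" a_R1.fastq.gz a_R2.fastq.gz)

# Second pass with plain quotes: the quotes are consumed by the first shell,
# so "--outdir" and "/tmp/qc" arrive as two separate arguments.
unescaped=$(reparse --fastqc_args "--outdir /tmp/qc" a_R1.fastq.gz a_R2.fastq.gz)

# Second pass with escaped quotes: the quotes survive and regroup the argument.
escaped=$(reparse --fastqc_args \"--outdir /tmp/qc\" a_R1.fastq.gz a_R2.fastq.gz)

echo "direct=$direct unescaped=$unescaped escaped=$escaped"
```

This prints direct=4 unescaped=5 escaped=4, which matches what I see in the --dry-run output before and after escaping the quotes.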
Did you count the number of files in the path_to_fastq directory? Does the folder contain matching R1 and R2 files?
Yes, there are 20, so they are in 10 pairs. The folder also contains a .txt and a .sha1 file, but surely, given my command above, that should not be a problem?
The syntax seems fine to me after checking a few dummy files. Add --dry-run immediately after the parallel command and check the dummy run.
This is so bizarre. The --dry-run shows that my files are in the right pairs!
You would have completed the trim runs by now if you had run them serially :-)
Well true haha but I intend to run this on over 100 samples eventually so I need to know how to do this. I seriously cannot understand what I'm doing wrong
@ole.tange is the developer of parallel, so the answer below should work.
Why do you need find path_to_fastq -name "*_R1_001.fastq.gz"? A simple ls -1 *_R1_001.fastq.gz should do. Make sure that ls -1 *_R1_001.fastq.gz | wc -l returns the same number as ls -1 *_R2_001.fastq.gz | wc -l.
There are definitely the right number of files when I do those checks. I think it's something to do with my find command, which is not listing the R1 files in the same order as in the folder. I tried replacing find with ls -1 but I have the same problem. This is so confusing.
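Equal counts do not prove that every R1 file has its own R2 mate, so a per-file check is a little stricter than comparing wc -l. A minimal sketch, using a throwaway directory with dummy file names (S1, S2 are hypothetical) so it can be run anywhere:

```shell
# Build a temporary directory with dummy files; S2 deliberately lacks its R2 mate.
dir=$(mktemp -d)
touch "$dir/S1_L008_R1_001.fastq.gz" "$dir/S1_L008_R2_001.fastq.gz" \
      "$dir/S2_L008_R1_001.fastq.gz"

missing=0
for r1 in "$dir"/*_R1_001.fastq.gz; do
    # Derive the expected mate name by swapping the R1 suffix for the R2 suffix.
    r2=${r1%_R1_001.fastq.gz}_R2_001.fastq.gz
    if [ ! -e "$r2" ]; then
        echo "no R2 mate for $(basename "$r1")"
        missing=$((missing + 1))
    fi
done
echo "R1 files without an R2 mate: $missing"

# Clean up the dummy files.
rm -r "$dir"
```

Pointing the loop at the real fastq folder instead of the temporary one should report zero missing mates if the pairs are complete.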
In general, for most tools, output directories and output files are never quoted. I am not sure about trimgalore's requirements. I think you should first run the program with a barebones command, e.g. remove --fastqc_args from the command above.