Trim_galore on multiple fastq.gz files: syntax problem
6.5 years ago
m98 ▴ 420

I was reading the post "TrimGalore! on multiple paired fastq files", which explains how to run trim_galore on multiple paired-end fastq.gz files.

I installed GNU parallel and intended to use the same command suggested by eldronzhou:

find  path_to_fastq  -name "*_R1_merged.fastq.gz" | cut -d "_" -f1 | parallel -j 1 trim_galore --illumina --paired --fastqc -o trim_galore/ {}\_R1_merged.fastq.gz {}\_R2_merged.fastq.gz

However, my files are named slightly differently: XXX_XX_L008_R1_001.fastq.gz and XXX_XX_L008_R2_001.fastq.gz. Therefore I changed the command above to the following:

find  path_to_fastq  -name "*_R1_001.fastq.gz" | cut -d "R" -f1 | parallel -j 1 trim_galore --paired --fastqc -o trim_galore/ {}R1_001.fastq.gz {}R2_001.fastq.gz

However, I get the following error (showing up once for each pair of *fastq.gz files):

Please provide an even number of input files for paired-end FastQ trimming! Aborting ...

I'm guessing something is wrong in my syntax and somehow the order of the files I provide is wrong - maybe R1 and R2 are not given in the right pairs somehow? My *fastq.gz files are in a separate folder and I have 20 files (so 10 pairs).

I cannot work out what is wrong, any help would be deeply appreciated.

UPDATE

After running the following --dry-run command, I get the following output:

trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970085_CTCAAGC_L008_R1_001.fastq.gz ../../fastq/E970085_CTCAAGC_L008_R2_001.fastq.gz

trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970096_CGAAGGT_L008_R1_001.fastq.gz ../../fastq/E970096_CGAAGGT_L008_R2_001.fastq.gz

trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970084_CCTTGTC_L008_R1_001.fastq.gz ../../fastq/E970084_CCTTGTC_L008_R2_001.fastq.gz

trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970090_TCAGAAG_L008_R1_001.fastq.gz ../../fastq/E970090_TCAGAAG_L008_R2_001.fastq.gz

trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970092_ACAGTAC_L008_R1_001.fastq.gz ../../fastq/E970092_ACAGTAC_L008_R2_001.fastq.gz

trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970038_CTGGTTG_L008_R1_001.fastq.gz ../../fastq/E970038_CTGGTTG_L008_R2_001.fastq.gz

trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970094_AGGACTG_L008_R1_001.fastq.gz ../../fastq/E970094_AGGACTG_L008_R2_001.fastq.gz

trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970073_CTAGGTC_L008_R1_001.fastq.gz ../../fastq/E970073_CTAGGTC_L008_R2_001.fastq.gz

trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970088_GGATCAT_L008_R1_001.fastq.gz ../../fastq/E970088_GGATCAT_L008_R2_001.fastq.gz

trim_galore --paired --fastqc_args --outdir /home/user/my_projects/project1/data/qc/trimgalore ../../fastq/E970095_CAGTCAT_L008_R1_001.fastq.gz ../../fastq/E970095_CAGTCAT_L008_R2_001.fastq.gz

I still can't see an obvious mistake...

UPDATE2

I think I may have found the mistake. When the command goes through GNU parallel, the generated commands (see just above) lack the quotes around "--outdir ...", so the value of --fastqc_args is split apart. I think the answer is to escape the quotes:

find ../../fastq/ -name "*R1_001.fastq.gz" | cut -d "R" -f1 | parallel --dry-run -j 1 trim_galore --paired --fastqc_args \"--outdir /home/user/my_projects/project1/data/qc/trim-galore\" {}R1_001.fastq.gz {}R2_001.fastq.gz

# Output command I want (one example only)
trim_galore --paired --fastqc_args "--outdir /home/user/my_projects/project1/data/qc/trim-galore" ../../fastq/E970085_CTCAAGC_L008_R1_001.fastq.gz ../../fastq/E970085_CTCAAGC_L008_R2_001.fastq.gz
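
For anyone wondering why the quotes disappear: as far as I understand it, the calling shell strips them before GNU parallel ever sees the arguments, and parallel then hands the flattened command to a second shell. Here is a minimal sketch of that double parsing in plain sh, without parallel or trim_galore. count_args is a hypothetical stand-in that only reports how many arguments it received, the file names are made up, and eval plays the role of the second shell.

```shell
# count_args is a hypothetical stand-in for trim_galore: it only reports
# how many arguments it was given.
count_args() { echo "$#"; }

# Direct call: the quotes keep "--outdir /tmp/qc" together as one argument.
direct=$(count_args --fastqc_args "--outdir /tmp/qc" s_R1.fastq.gz s_R2.fastq.gz)

# Second round of parsing after the quotes were already removed:
# the flattened string now splits on every space.
flat='count_args --fastqc_args --outdir /tmp/qc s_R1.fastq.gz s_R2.fastq.gz'
reparsed_plain=$(eval "$flat")

# Quotes that survive the first pass (e.g. written as \" on the command
# line) are still there for the second pass and group the value again.
kept='count_args --fastqc_args "--outdir /tmp/qc" s_R1.fastq.gz s_R2.fastq.gz'
reparsed_quoted=$(eval "$kept")

echo "direct=$direct plain=$reparsed_plain quoted=$reparsed_quoted"
```

In the unquoted case there is one extra argument. One plausible reading of the error is that --fastqc_args takes "--outdir" as its value and the output path is then counted as a third input file, which would explain the "even number of input files" message.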
nsg trim_galore gnu parallel • 8.8k views

Did you count the number of files in the path_to_fastq directory? Does the folder contain matching R1 and R2 files?


Yes, there are 20, so they are in 10 pairs. The folder contains a .txt and a .sha1 file but surely, given my command above, that should not be a problem?


The syntax seems fine to me after checking it with a few dummy files. Add --dry-run immediately after the parallel command and check the dry-run output.


This is so bizarre.. The --dry-run returns that my files are in the right pairs!


You would have completed the trim runs by now if you had run them serially :-)


Well true haha but I intend to run this on over 100 samples eventually so I need to know how to do this. I seriously cannot understand what I'm doing wrong


@ole.tange is the developer of GNU parallel, so the answer below should work.


Why do you need find path_to_fastq -name "*_R1_001.fastq.gz"? A simple ls -1 *_R1_001.fastq.gz should do. Make sure ls -1 *_R1_001.fastq.gz | wc -l gets an equal number as ls -1 *_R2_001.fastq.gz | wc -l.
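
Equal counts do not prove that every R1 has its own R2 (one sample could be missing a mate while another has an extra file). A slightly stricter check, sketched here in plain sh, derives the R2 name from each R1 name and tests that it exists. check_pairs and the sample names are made up for illustration, and the function assumes the *_R1_001.fastq.gz / *_R2_001.fastq.gz naming from the question.

```shell
# check_pairs DIR: print every R1 file in DIR that has no matching R2 file
# and return non-zero if any mate is missing.
check_pairs() {
    dir=$1
    missing=0
    for r1 in "$dir"/*_R1_001.fastq.gz; do
        [ -e "$r1" ] || continue            # the glob matched nothing
        # Strip the R1 suffix and append the R2 suffix.
        r2="${r1%_R1_001.fastq.gz}_R2_001.fastq.gz"
        if [ ! -e "$r2" ]; then
            echo "no mate for: $r1"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# Demo on throwaway files: sample A is complete, sample B lacks its R2.
demo=$(mktemp -d)
touch "$demo/A_L008_R1_001.fastq.gz" "$demo/A_L008_R2_001.fastq.gz" \
      "$demo/B_L008_R1_001.fastq.gz"
report=$(check_pairs "$demo") && status=0 || status=$?
echo "$report"
rm -rf "$demo"
```

On real data you would just call check_pairs path_to_fastq and look at what it prints.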


There are definitely the right number of files when I do those checks. I think it's something to do with my find command, which is not listing the R1 files in the same order as in the folder. I tried replacing find with ls -1 but I have the same problem. This is so confusing.


In general, for most tools, output directories and output paths are not quoted. I am not sure what trim_galore requires. I think you should first run the program with a bare-bones command, e.g. remove --fastqc_args from the command above.

6.5 years ago
ole.tange ★ 4.5k

This:

find  path_to_fastq  -name "*_R1_001.fastq.gz" |
  parallel -j 1 trim_galore --paired --fastqc -o trim_galore/ {} {= s/_R1_/_R2_/ =}

or:

find  path_to_fastq  -name "*_R1_001.fastq.gz" |
  parallel --plus -j 1 trim_galore --paired --fastqc -o trim_galore/ {} {/_R1_/_R2_}

will run:

trim_galore --paired --fastqc -o trim_galore/ path_to_fastq/abaci_R1_001.fastq.gz path_to_fastq/abaci_R2_001.fastq.gz
trim_galore --paired --fastqc -o trim_galore/ path_to_fastq/aardvarks_R1_001.fastq.gz path_to_fastq/aardvarks_R2_001.fastq.gz
trim_galore --paired --fastqc -o trim_galore/ path_to_fastq/a_R1_001.fastq.gz path_to_fastq/a_R2_001.fastq.gz
trim_galore --paired --fastqc -o trim_galore/ path_to_fastq/aardvark_R1_001.fastq.gz path_to_fastq/aardvark_R2_001.fastq.gz
trim_galore --paired --fastqc -o trim_galore/ path_to_fastq/abalones_R1_001.fastq.gz path_to_fastq/abalones_R2_001.fastq.gz
trim_galore --paired --fastqc -o trim_galore/ path_to_fastq/abaft_R1_001.fastq.gz path_to_fastq/abaft_R2_001.fastq.gz
trim_galore --paired --fastqc -o trim_galore/ path_to_fastq/abacus_R1_001.fastq.gz path_to_fastq/abacus_R2_001.fastq.gz
trim_galore --paired --fastqc -o trim_galore/ path_to_fastq/abacuses_R1_001.fastq.gz path_to_fastq/abacuses_R2_001.fastq.gz
trim_galore --paired --fastqc -o trim_galore/ path_to_fastq/aback_R1_001.fastq.gz path_to_fastq/aback_R2_001.fastq.gz
trim_galore --paired --fastqc -o trim_galore/ path_to_fastq/abalone_R1_001.fastq.gz path_to_fastq/abalone_R2_001.fastq.gz

If that does not work out of the box, try running each command by hand one at a time.
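
If you want to see the individual commands without GNU parallel in the loop, the same pairing can be sketched as a plain sh filter. pair_commands is a hypothetical helper name; the sed expression mirrors the s/_R1_/_R2_/ replacement used above, and the function only prints the commands, it does not run trim_galore.

```shell
# pair_commands: read R1 paths on stdin and print one trim_galore command
# per pair. Nothing is executed; the output is meant to be inspected.
pair_commands() {
    while IFS= read -r r1; do
        # Same substitution as {= s/_R1_/_R2_/ =}: first _R1_ becomes _R2_.
        r2=$(printf '%s\n' "$r1" | sed 's/_R1_/_R2_/')
        printf 'trim_galore --paired --fastqc -o trim_galore/ %s %s\n' "$r1" "$r2"
    done
}

# Demo with one of the paths from the listing above.
cmds=$(printf '%s\n' 'path_to_fastq/abaci_R1_001.fastq.gz' | pair_commands)
echo "$cmds"
```

With real data the input would come from find path_to_fastq -name "*_R1_001.fastq.gz" | pair_commands. Note that the substitution applies to the whole path, so a directory name containing _R1_ would be rewritten too, and the printed paths are not quoted, so names with spaces would need extra care before pasting.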


Is it necessary to have the find line here? Can we not use:

parallel --plus -j 1 trim_galore --paired --fastqc -o trim_galore/ {} {/_R1_/_R2_} ::: path_to_fastq/*_R1_001.fastq.gz


find is used because OP used find.

Your solution will often give the same result, but will fail if the files are in subdirs inside path_to_fastq or if there are so many that they do not fit on a single command line.
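
The subdirectory case is easy to reproduce with throwaway files. This is only a sketch in plain sh; the directory and file names are made up.

```shell
# Build a temporary tree: one R1 file at the top level, one in a subdirectory.
root=$(mktemp -d)
mkdir "$root/lane2"
touch "$root/a_R1_001.fastq.gz" "$root/lane2/b_R1_001.fastq.gz"

# The glob only expands to files directly inside $root.
set -- "$root"/*_R1_001.fastq.gz
glob_count=$#

# find descends into subdirectories as well.
find_count=$(find "$root" -name '*_R1_001.fastq.gz' | wc -l | tr -d ' ')

echo "glob=$glob_count find=$find_count"
rm -rf "$root"
```

The glob sees one file and find sees both. The other failure mode, too many files for one command line, depends on the system limit that getconf ARG_MAX reports.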
