I am completely new to bioinformatics so I'm looking to learn how to do this.
I have multiple directories of fastq files: for example, 10 directories, one per time point in the series, each with treatment and control directories, and each of those with rep1, rep2 and rep3.
For example: T9/Infected/Rep1/*.fastq.gz.
I'm looking to create a loop to run fastQC on each fastq file instead of having to submit a separate job for each directory.
I would then like to output the FastQC data either to a single directory or, if possible, to a directory corresponding to each rep, e.g. rep1 results go into a folder called rep1, and so on.
With GNU Parallel, try this (for fastq files):
Example dry-run output is (for fastq files):
This would search for all .fastq files and create output in each corresponding directory. Remove the --dry-run option once you have validated the dummy run.

Thanks for the response. Can you just help me clarify what you said, because I'm a rookie at this: does the '.' after find dictate the directory that the find command will look in? So I could put, for example, T9 there and it would look for the fastq files in all the subdirectories of that directory? Then it will pass these files into the fastqc job?
. represents the current directory: in the current directory, look for files with the fastq extension.
parallel is the command provided by the GNU Parallel program. --dry-run tells parallel not to execute anything but to do a dummy run, i.e. print the commands that would be executed. -o is the fastqc option for the output directory. {} denotes the input (it could be anything, but in this case it is the output of the find command: the fastq files with their paths). {//} is a GNU Parallel replacement string that expands to only the directory part of the input path, without the file name or its extension. The / is simply a /, with no special meaning.
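To make the {} and {//} explanation concrete, here is a minimal sketch that previews the commands without needing fastqc or parallel installed. It builds a throwaway directory tree with made-up file names (sample_R1.fastq.gz is hypothetical) and uses dirname, which returns the same directory part that {//} gives inside GNU Parallel. It is an illustration under those assumptions, not the exact command posted earlier in the thread.

```shell
# Build a tiny throwaway tree that mimics the layout in the question.
# The file names are made up for illustration.
demo=$(mktemp -d)
mkdir -p "$demo/T9/Infected/Rep1" "$demo/T9/Control/Rep1"
touch "$demo/T9/Infected/Rep1/sample_R1.fastq.gz" \
      "$demo/T9/Control/Rep1/sample_R1.fastq.gz"
cd "$demo"

# For every file that find reports, print the fastqc command that would run.
# "$fq" plays the role of {} and "$(dirname "$fq")" plays the role of {//}.
preview=$(find . -name "*.fastq.gz" | sort | while IFS= read -r fq; do
    echo "fastqc -o $(dirname "$fq") $fq"
done)
echo "$preview"

# With GNU Parallel installed, the equivalent preview would be:
#   find . -name "*.fastq.gz" | parallel --dry-run fastqc -o {//} {}
```

Each printed line pairs a file with its own directory as the output location, which is why the reports end up next to the reads they came from.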
The first argument after a find command is the directory to start looking in. The . is shorthand for ‘my current working directory’. It could just as easily read:
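The original example is not preserved here, so as an illustration only, this sketch uses T9 (the directory name from the question) as the starting point. The tree and file names are made up so that the command can be tried safely.

```shell
# Throwaway tree with two time points; file names are hypothetical.
demo=$(mktemp -d)
mkdir -p "$demo/T9/Infected/Rep1" "$demo/T10/Infected/Rep1"
touch "$demo/T9/Infected/Rep1/a.fastq.gz" "$demo/T10/Infected/Rep1/b.fastq.gz"
cd "$demo"

# Start the search in T9 instead of the current directory (.).
# Only files under T9 are reported, and the paths begin with T9/ rather than ./
found=$(find T9 -name "*.fastq.gz")
echo "$found"
```

Nothing under T10 is listed, because find only descends from the starting directory it was given.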
Thanks that makes sense.
Could you explain what the '{ //}' means on the path to output? What do those brackets mean?
My bash script so far:
module load fastqc
cd /path/to/directory/lettuce_bot_timeseries/data/reads/
find . -name "*.fastq.gz" | parallel fastqc -o ../../fastqanalysis
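For the other option raised in the question (one results folder per rep), one possible sketch is below. It assumes the replicate name is the directory that directly contains each file, uses a hypothetical out_root location, and creates each output directory first, since fastqc expects the -o directory to already exist. The echo keeps it a preview; the tree and file names are made up.

```shell
# Throwaway tree that mimics reads/T9/Infected/RepN; names are hypothetical.
demo=$(mktemp -d)
mkdir -p "$demo/reads/T9/Infected/Rep1" "$demo/reads/T9/Infected/Rep2"
touch "$demo/reads/T9/Infected/Rep1/a.fastq.gz" \
      "$demo/reads/T9/Infected/Rep2/b.fastq.gz"
cd "$demo/reads"

out_root=../fastqanalysis   # hypothetical output location

find . -name "*.fastq.gz" | sort | while IFS= read -r fq; do
    # The replicate is taken to be the directory holding the file, e.g. Rep1.
    rep=$(basename "$(dirname "$fq")")
    # Create the output directory before fastqc would need it.
    mkdir -p "$out_root/$rep"
    # Remove "echo" to actually run fastqc instead of printing the command.
    echo fastqc -o "$out_root/$rep" "$fq"
done
```

One caveat with this layout: files with the same name from different time points or treatments would write to the same rep folder, so it only suits data where file names are unique.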
The parallel program has quite unconventional syntax. It would be worth googling some beginners' tutorials and examples to really understand it, rather than just have us explain specifics (we will be happy to clarify things, of course). parallel is an invaluable tool to have in your toolkit, so it is well worth investing an hour or so now to learn the basics, and save yourself dozens of hours in future.

What have you tried so far?
Hint: find and its -exec option will be your friend here. Alternatively, ls or find piped to parallel will also work nicely.

You needn’t loop the directories, there are better ways :)
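As a rough sketch of the find -exec hint, the example below previews one fastqc call per file. The directory tree and file name are made up, and echo is placed in front of fastqc so that the command is only printed, not run.

```shell
# Throwaway tree; the file name is hypothetical.
demo=$(mktemp -d)
mkdir -p "$demo/T9/Infected/Rep1"
touch "$demo/T9/Infected/Rep1/sample_R1.fastq.gz"
cd "$demo"

# -exec runs the given command once per match; {} is replaced by the file path
# and \; marks the end of the command.
preview=$(find . -name "*.fastq.gz" -exec echo fastqc {} \;)
echo "$preview"

# Dropping echo would run FastQC once per file, one after another:
#   find . -name "*.fastq.gz" -exec fastqc {} \;
```

Unlike the parallel approach, -exec processes the files one at a time, so it is simpler but slower on a large time series.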