Here's my way.
Directory raw contains all the paired-end read files, which are carefully named with the same suffixes: _R1.fastq.gz for read 1 and _R2.fastq.gz for read 2.
$ ls raw | head -n 4
delta_G_9-30_R1.fastq.gz
delta_G_9-30_R2.fastq.gz
WT_9-30_R1.fastq.gz
WT_9-30_R2.fastq.gz
...
I use a Python script, cluster_files, to cluster the paired-end reads by sample:
$ cluster_files -p '(.+?)_R[12]\.fastq\.gz$' raw/ -o raw.cluster
$ tree
.
├── raw
│   ├── delta_G_9-30_R1.fastq.gz
│   ├── delta_G_9-30_R2.fastq.gz
│   ├── WT_9-30_R1.fastq.gz
│   └── WT_9-30_R2.fastq.gz
└── raw.cluster
    ├── delta_G_9-30
    │   ├── delta_G_9-30_R1.fastq.gz -> ../../raw/delta_G_9-30_R1.fastq.gz
    │   └── delta_G_9-30_R2.fastq.gz -> ../../raw/delta_G_9-30_R2.fastq.gz
    └── WT_9-30
        ├── WT_9-30_R1.fastq.gz -> ../../raw/WT_9-30_R1.fastq.gz
        └── WT_9-30_R2.fastq.gz -> ../../raw/WT_9-30_R2.fastq.gz
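For readers without the script, here is a minimal sketch of what such a clustering step might do. It is an illustrative re-implementation, not the actual cluster_files code: the function name, its signature, and the choice of relative symlink targets are assumptions made to reproduce the tree shown above.

```python
import os
import re
from collections import defaultdict


def cluster_files(pattern, indir, outdir):
    """Hypothetical re-implementation: group files under indir by the first
    capture group of pattern and symlink each one into outdir/<group>/."""
    regex = re.compile(pattern)
    groups = defaultdict(list)
    # os.walk lists symlinks to files as ordinary files, so the same
    # function also works on a directory that only contains links.
    for root, _dirs, files in os.walk(indir):
        for name in sorted(files):
            match = regex.search(name)
            if match:
                groups[match.group(1)].append(os.path.join(root, name))

    for key, paths in groups.items():
        group_dir = os.path.join(outdir, key)
        os.makedirs(group_dir, exist_ok=True)
        for path in paths:
            link = os.path.join(group_dir, os.path.basename(path))
            # Relative targets (../../raw/...) match the layout shown above.
            target = os.path.relpath(path, start=group_dir)
            if not os.path.lexists(link):
                os.symlink(target, link)

    # Return the grouping so the caller can inspect what was linked.
    return {key: sorted(os.path.basename(p) for p in paths)
            for key, paths in groups.items()}
```

Calling cluster_files(r'(.+?)_R[12]\.fastq\.gz$', 'raw', 'raw.cluster') from the project root would then correspond to the command above.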
For every analysis step, I create a new directory with the same directory structure, in which the read files are soft links. This has several benefits:
- Safety. Previously produced files are kept apart from the current working space, so they cannot be deleted by accident.
- Clear file organization. Every step has its own working space.
- Easy parallelization with tools like GNU parallel and rush.
Now let's clean the reads. (BTW, it should be ${TRIMMOMATIC}, not $(TRIMMOMATIC): in the shell, $(...) is command substitution, not variable expansion.)
The replacement string {/} represents the basename of the input file path; here it's the sample name.
$ ls -d raw.cluster.clean/* | parallel -j 6 --dryrun \
java -jar ${TRIMMOMATIC} PE -threads 4 -phred33 \
{}/{/}_R1.fastq.gz {}/{/}_R2.fastq.gz \
{}/{/}_R1.fq.gz {}/{/}_U_R1.fq.gz {}/{/}_R2.fq.gz {}/{/}_U_R2.fq.gz \
ILLUMINACLIP:trimSeq.fasta:2:7:7 MINLEN:20
java -jar PE -threads 4 -phred33 raw.cluster.clean/delta_G_9-30/delta_G_9-30_R1.fastq.gz raw.cluster.clean/delta_G_9-30/delta_G_9-30_R2.fastq.gz raw.cluster.clean/delta_G_9-30/delta_G_9-30_R1.fq.gz raw.cluster.clean/delta_G_9-30/delta_G_9-30_U_R1.fq.gz raw.cluster.clean/delta_G_9-30/delta_G_9-30_R2.fq.gz raw.cluster.clean/delta_G_9-30/delta_G_9-30_U_R2.fq.gz ILLUMINACLIP:trimSeq.fasta:2:7:7 MINLEN:20
java -jar PE -threads 4 -phred33 raw.cluster.clean/WT_9-30/WT_9-30_R1.fastq.gz raw.cluster.clean/WT_9-30/WT_9-30_R2.fastq.gz raw.cluster.clean/WT_9-30/WT_9-30_R1.fq.gz raw.cluster.clean/WT_9-30/WT_9-30_U_R1.fq.gz raw.cluster.clean/WT_9-30/WT_9-30_R2.fq.gz raw.cluster.clean/WT_9-30/WT_9-30_U_R2.fq.gz ILLUMINACLIP:trimSeq.fasta:2:7:7 MINLEN:20
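To make the {} and {/} substitution explicit, here is a small sketch that builds the same argument list for one sample directory. The function name and the jar argument are hypothetical; the remaining arguments are copied from the command above. Note that in the dry-run output the jar path is missing after -jar, which suggests ${TRIMMOMATIC} was not set in that shell.

```python
import os


def trimmomatic_command(sample_dir, jar, threads=4):
    """Build the Trimmomatic PE argument list for one sample directory.

    sample_dir plays the role of GNU parallel's {} and its basename the
    role of {/}. `jar` is a placeholder for the Trimmomatic jar path.
    """
    sample_dir = sample_dir.rstrip("/")
    sample = os.path.basename(sample_dir)      # equivalent of {/}
    prefix = os.path.join(sample_dir, sample)  # equivalent of {}/{/}
    return [
        "java", "-jar", jar, "PE", "-threads", str(threads), "-phred33",
        # inputs: the symlinked raw reads
        f"{prefix}_R1.fastq.gz", f"{prefix}_R2.fastq.gz",
        # outputs: paired and unpaired (_U_) reads for each mate
        f"{prefix}_R1.fq.gz", f"{prefix}_U_R1.fq.gz",
        f"{prefix}_R2.fq.gz", f"{prefix}_U_R2.fq.gz",
        "ILLUMINACLIP:trimSeq.fasta:2:7:7", "MINLEN:20",
    ]
```

Joining the list with spaces gives one line of the dry-run output, with the jar path filled in.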
Next, the Trimmomatic results are clustered for further analysis, e.g., assembly.
$ cluster_files -p '(.+?)_\d.*\.fq\.gz$' raw.cluster.clean -o raw.cluster.clean.assembly
$ tree raw.cluster.clean.assembly
raw.cluster.clean.assembly
├── delta_G
│   ├── delta_G_9-30_R1.fq.gz -> ../../raw.cluster.clean/delta_G_9-30/delta_G_9-30_R1.fq.gz
│   ├── delta_G_9-30_R2.fq.gz -> ../../raw.cluster.clean/delta_G_9-30/delta_G_9-30_R2.fq.gz
│   ├── delta_G_9-30_U_R1.fq.gz -> ../../raw.cluster.clean/delta_G_9-30/delta_G_9-30_U_R1.fq.gz
│   └── delta_G_9-30_U_R2.fq.gz -> ../../raw.cluster.clean/delta_G_9-30/delta_G_9-30_U_R2.fq.gz
└── WT
    ├── WT_9-30_R1.fq.gz -> ../../raw.cluster.clean/WT_9-30/WT_9-30_R1.fq.gz
    ├── WT_9-30_R2.fq.gz -> ../../raw.cluster.clean/WT_9-30/WT_9-30_R2.fq.gz
    ├── WT_9-30_U_R1.fq.gz -> ../../raw.cluster.clean/WT_9-30/WT_9-30_U_R1.fq.gz
    └── WT_9-30_U_R2.fq.gz -> ../../raw.cluster.clean/WT_9-30/WT_9-30_U_R2.fq.gz
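The group directories are now delta_G and WT rather than delta_G_9-30 and WT_9-30 because the second pattern's non-greedy capture stops at the first underscore followed by a digit. A short check of both patterns (the constant and function names are only illustrative):

```python
import re

# The two patterns passed to cluster_files with -p in the commands above.
RAW_PATTERN = re.compile(r"(.+?)_R[12]\.fastq\.gz$")
ASSEMBLY_PATTERN = re.compile(r"(.+?)_\d.*\.fq\.gz$")


def group_key(pattern, filename):
    """Return the first capture group (the cluster name), or None."""
    match = pattern.search(filename)
    return match.group(1) if match else None


# The raw pattern keeps everything before _R1/_R2:
group_key(RAW_PATTERN, "delta_G_9-30_R1.fastq.gz")     # -> "delta_G_9-30"
# The assembly pattern cuts at the first "_<digit>":
group_key(ASSEMBLY_PATTERN, "delta_G_9-30_U_R1.fq.gz")  # -> "delta_G"
group_key(ASSEMBLY_PATTERN, "WT_9-30_R2.fq.gz")         # -> "WT"
```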
# assemble with spades:
ls -d raw.cluster.clean.assembly/* | parallel spades.py -k 21,33,47,55,63,77 -t 4 -m 10 --careful -o {}/spades -1 {}/{/}_1.fq.gz -2 {}/{/}_2.fq.gz -s {}/{/}_1.unpaired.fq.gz -s {}/{/}_2.unpaired.fq.gz
Sometimes I want to assemble with another tool, so I create another working space with:
$ cluster_files -p '(.+?)_\d.*\.fq\.gz$' raw.cluster.clean -o raw.cluster.clean.assembly_with_xxx
Here's one project of mine:
raw
raw.fastqc
raw.kaiju
raw.cluster
raw.cluster.clean
raw.cluster.clean.mapping
raw.cluster.clean.mapping.pilon
raw.cluster.clean.mapping.breseq
raw.cluster.clean.spades.plasmid
raw.cluster.clean.spades.result
Should R1 and R2 be specified in samplelist.txt?
No need to specify R1 and R2.
I need help with this question.