How samplelist.txt may look like for parallel
3
1
Entering edit mode
7.8 years ago
biotech ▴ 570

I received a script from a colleague and I'm figuring out how samplelist.txt looks like.

cat samplelist.txt |parallel -j 6 --verbose --progress java -jar $(TRIMMOMATIC) PE -threads 4 -phred33 {}_R1.fastq.gz {}_R2.fastq.gz {}_R1.fq.gz {}_U_R1.fq.gz {}_R2.fq.gz {}_U_R2.fq.gz ILLUMINACLIP:trimSeq.fasta:2:7:7 MINLEN:20

These are the fastq files the script is intended to trim:

WT_11-3_R1.fastq.gz
WT_11-3_R2.fastq.gz
WT_12-8_R1.fastq.gz
WT_12-8_R2.fastq.gz
WT_9-30_R1.fastq.gz
WT_9-30_R2.fastq.gz
delta_G_11-3_R1.fastq.gz
delta_G_11-3_R2.fastq.gz
delta_G_12-8_R1.fastq.gz
delta_G_12-8_R2.fastq.gz
delta_G_9-30_R1.fastq.gz
delta_G_9-30_R2.fastq.gz
TRIMMOMATIC • 2.1k views
ADD COMMENT
2
Entering edit mode
7.8 years ago
EagleEye 7.6k

In ideal case it should be One sample name per line.

WT_11-3
WT_12-8
WT_9-30
delta_G_11-3
delta_G_12-8
delta_G_9-30
ADD COMMENT
0
Entering edit mode

Does R1, R2 should be specified in samplelist.txt?

ADD REPLY
0
Entering edit mode

No need to specify R1 and R2.

ADD REPLY
0
Entering edit mode

need help for this question

ADD REPLY
2
Entering edit mode
7.8 years ago
ole.tange ★ 4.5k

Since 20140722 you can let GNU Parallel do the trimming of the file name:

s/_R..fastq.gz//; will remove _R1.fastq.gz or _R2.fastq.gz

$seen{$_}++ && $job->skip(); will skip the job if the trimmed name is already seen

--rpl  '{_} [...]' will replace {_} with [...].

So in total:

cat samplelist.txt |parallel --rpl  '{_} s/_R..fastq.gz//; $seen{$_}++ && $job->skip()' -j 6  --verbose --progress echo java -jar PE -threads 4 -phred33 {_}_R1.fastq.gz {_}_R2.fastq.gz {_}_R1.fq.gz {_}_U_R1.fq.gz {_}_R2.fq.gz {_}_U_R2.fq.gz ILLUMINACLIP:trimSeq.fasta:2:7:7 MINLEN:20
ADD COMMENT
0
Entering edit mode
7.8 years ago

Here's my way.

Directory raw contains all paired-ends reads files, which are carefully named with same suffixes: _R1.fastq.gz for read 1 and _R2.fastq.gz for read 2.

$ ls raw | head -n 4
delta_G_9-30_R1.fastq.gz
delta_G_9-30_R2.fastq.gz
WT_9-30_R1.fastq.gz
WT_9-30_R2.fastq.gz
...

I use a python script cluster_files to cluster paired-end reads by samples:

$ cluster_files -p '(.+?)_R[12]\.fastq\.gz$' raw/ -o raw.cluster

$ tree
.
├── raw
│   ├── delta_G_9-30_R1.fastq.gz
│   ├── delta_G_9-30_R2.fastq.gz
│   ├── WT_9-30_R1.fastq.gz
│   └── WT_9-30_R2.fastq.gz
└── raw.cluster
    ├── delta_G_9-30
    │   ├── delta_G_9-30_R1.fastq.gz -> ../../raw/delta_G_9-30_R1.fastq.gz
    │   └── delta_G_9-30_R2.fastq.gz -> ../../raw/delta_G_9-30_R2.fastq.gz
    └── WT_9-30
        ├── WT_9-30_R1.fastq.gz -> ../../raw/WT_9-30_R1.fastq.gz
        └── WT_9-30_R2.fastq.gz -> ../../raw/WT_9-30_R2.fastq.gz

For every analysis step, I creat a new directory with same directory structure in which reads file are soft links. There are many benefits for this:

  1. Safety. Previously produced files are independent from current working space, which would not be deleted by accident.
  2. Clear file organization. Every process has its own working space.
  3. Easy for parallelization using tools like GNU parallel and rush

Now let's clean the reads. (BTW, it should be ${TRIMMOMATIC} not $(TRIMMOMATIC)). Replacement string {/} represents the basename of input file path, here it's the sample name.

$ ls -d raw.cluster.clean/* | parallel -j 6 --dryrun \
    java -jar ${TRIMMOMATIC} PE -threads 4 -phred33 \
    {}/{/}_R1.fastq.gz {}/{/}_R2.fastq.gz \
    {}/{/}_R1.fq.gz {}/{/}_U_R1.fq.gz {}/{/}_R2.fq.gz {}/{/}_U_R2.fq.gz \
    ILLUMINACLIP:trimSeq.fasta:2:7:7 MINLEN:20

java -jar PE -threads 4 -phred33 raw.cluster.clean/delta_G_9-30/delta_G_9-30_R1.fastq.gz raw.cluster.clean/delta_G_9-30/delta_G_9-30_R2.fastq.gz raw.cluster.clean/delta_G_9-30/delta_G_9-30_R1.fq.gz raw.cluster.clean/delta_G_9-30/delta_G_9-30_U_R1.fq.gz raw.cluster.clean/delta_G_9-30/delta_G_9-30_R2.fq.gz raw.cluster.clean/delta_G_9-30/delta_G_9-30_U_R2.fq.gz ILLUMINACLIP:trimSeq.fasta:2:7:7 MINLEN:20

java -jar PE -threads 4 -phred33 raw.cluster.clean/WT_9-30/WT_9-30_R1.fastq.gz raw.cluster.clean/WT_9-30/WT_9-30_R2.fastq.gz raw.cluster.clean/WT_9-30/WT_9-30_R1.fq.gz raw.cluster.clean/WT_9-30/WT_9-30_U_R1.fq.gz raw.cluster.clean/WT_9-30/WT_9-30_R2.fq.gz raw.cluster.clean/WT_9-30/WT_9-30_U_R2.fq.gz ILLUMINACLIP:trimSeq.fasta:2:7:7 MINLEN:20

Next step, results of trimmomatic are clustered for further analyis, e.g., assembly.

$ cluster_files -p '(.+?)_\d.*\.fq\.gz$' raw.cluster.clean -o raw.cluster.clean.assembly

$ tree raw.cluster.clean.assembly
raw.cluster.clean.assembly
├── delta_G
│   ├── delta_G_9-30_R1.fq.gz -> ../../raw.cluster.clean/delta_G_9-30/delta_G_9-30_R1.fq.gz
│   ├── delta_G_9-30_R2.fq.gz -> ../../raw.cluster.clean/delta_G_9-30/delta_G_9-30_R2.fq.gz
│   ├── delta_G_9-30_U_R1.fq.gz -> ../../raw.cluster.clean/delta_G_9-30/delta_G_9-30_U_R1.fq.gz
│   └── delta_G_9-30_U_R2.fq.gz -> ../../raw.cluster.clean/delta_G_9-30/delta_G_9-30_U_R2.fq.gz
└── WT
    ├── WT_9-30_R1.fq.gz -> ../../raw.cluster.clean/WT_9-30/WT_9-30_R1.fq.gz
    ├── WT_9-30_R2.fq.gz -> ../../raw.cluster.clean/WT_9-30/WT_9-30_R2.fq.gz
    ├── WT_9-30_U_R1.fq.gz -> ../../raw.cluster.clean/WT_9-30/WT_9-30_U_R1.fq.gz
    └── WT_9-30_U_R2.fq.gz -> ../../raw.cluster.clean/WT_9-30/WT_9-30_U_R2.fq.gz

# assemble with spades:
ls -d raw.cluster.clean.assembly/* | parallel spades.py -k 21,33,47,55,63,77 -t 4 -m 10 --careful -o {}/spades -1 {}/{/}_1.fq.gz -2 {}/{/}_2.fq.gz -s {}/{/}_1.unpaired.fq.gz -s {}/{/}_2.unpaired.fq.gz'

Some times I want to assembly using another tool, so I can create another working space by:

$ cluster_files -p '(.+?)_\d.*\.fq\.gz$' raw.cluster.clean -o raw.cluster.clean.assembly_with_xxx

Here's one project of mine:

raw
raw.fastqc
raw.kaiju
raw.cluster
raw.cluster.clean
raw.cluster.clean.mapping
raw.cluster.clean.mapping.pilon
raw.cluster.clean.mapping.breseq
raw.cluster.clean.spades.plasmid
raw.cluster.clean.spades.result
ADD COMMENT

Login before adding your answer.

Traffic: 1967 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6