Trimmomatic job script to run on multiple pair end read file
2
0
Entering edit mode
6.8 years ago
Bioinfonext ▴ 470

I run below Trimmomatic job script on single pair end read file at HPC SERVER, How I can run it as loop if there are multiple pair end read file like this:

41_R1.fastq                                  41_R2.fastq

42_R1.fastq                                  42_R2.fastq

43_R2.fastq                                  43_R2.fastq



 #!/bin/bash

#$ -N trimmomtic
**strong text**
#$ -o /mnt/scratch/users/3052771/testdata/trimread.out

#$ -pe smp-verbose 20

module add apps/trimmomatic 

#$ -wd /mnt/scratch/users/3052771/testdata

trimmomatic PE -phred33 41_R1.fastq 41_R2.fastq 41_R1_paired.fq.gz 41_R1_unpaired.fq.gz 41_R2_paired.fq.gz 41_R2_unpaired.fq.gz ILLUMINACLIP:contams_forward_rev.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
RNA-Seq • 30k views
ADD COMMENT
5
Entering edit mode
6.8 years ago
GenoMax 147k

One way to do this.

for i in `ls -1 *R1*.fastq | sed 's/\_R1.fastq//'`; do echo trimmomatic PE -phred33 $i\_R1.fastq $i\_R2.fastq $i\_R1_paired.fq.gz $i\_R1_unpaired.fq.gz $i\_R2_paired.fq.gz $i\_R2_unpaired.fq.gz ILLUMINACLIP:contams_forward_rev.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 >> cmd_file; done

This will save all command lines in file cmd_file. Which will look like this.

$ cat cmd_file
trimmomatic PE -phred33 41_R1.fastq 41_R2.fastq 41_R1_paired.fq.gz 41_R1_unpaired.fq.gz 41_R2_paired.fq.gz 41_R2_unpaired.fq.gz ILLUMINACLIP:contams_forward_rev.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
trimmomatic PE -phred33 42_R1.fastq 42_R2.fastq 42_R1_paired.fq.gz 42_R1_unpaired.fq.gz 42_R2_paired.fq.gz 42_R2_unpaired.fq.gz ILLUMINACLIP:contams_forward_rev.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

Use the lines to replace in the scheduler script above. Adjust the log output files as needed.

ADD COMMENT
0
Entering edit mode

Thanks a lot.

what if my input files is in fq.gz formats, can I change fastq to fq.gz?

for i in `ls -1 *R1*.fa.gz | sed 's/\_R1.fa.gz//'`; do echo trimmomatic PE -phred33 $i\_R1.fa.gz $i\_R2.fa.gz $i\_R1_paired.fq.gz $i\_R1_unpaired.fq.gz $i\_R2_paired.fq.gz $i\_R2_unpaired.fq.gz ILLUMINACLIP:contams_forward_rev.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 >> cmd_file; done
ADD REPLY
0
Entering edit mode

That is correct. Replace the part that is common to all files. I am going to assume that the fa.gz above is an error. Trimmomatic will expect `fastq format files as input.

ADD REPLY
0
Entering edit mode

This script is working perfectly on one pair end reads:

#!/bin/bash
#$ -N trimmomtic
#$ -o /mnt/scratch/users/3052771/testdata/test-job/trimread.out

#$ -pe smp-verbose 20

module add apps/trimmomatic

#$ -wd /mnt/scratch/users/3052771/testdata/test-job

trimmomatic PE -phred33 SRR4104638_R1.fastq SRR4104638_R2.fastq SRR4104638_R1_paired.fq.gz SRR4104638_R1_unpaired.fq.gz SRR4104638_R2_paired.fq.gz SRR4104638_R2_unpaired.fq.gz ILLUMINACLIP:contams_forward_rev.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

But this loop is not working, but above script is giving output: Could you please suggest where I am making mistake:

Input file names are like this:

SRR4104639_R1.fastq                  SRR4104639_R2.fastq                               

SRR4104640_R1.fastq                  SRR4104640_R2.fastq                               

SRR4104637_R1.fastq                  SRR4104637_R2.fastq 




#!/bin/bash
#$ -N trimmomtic
#$ -o /mnt/scratch/users/3052771/testdata/test-jbo/trimread.out
#$ -pe smp-verbose 20
module add apps/trimmomatic
#$ -wd /mnt/scratch/users/3052771/testdata/test-job

for i in `ls -1 *R1*.fastq | sed 's/\_R1.fastq//'`; do echo trimmomatic PE -phred33 $i\_R1.fastq $i\_R2.fastq $i\_R1_paired.fq.gz $i\_R1_unpaired.fq.gz $i\_R2_paired.fq.gz $i\_R2_unpaired.fq.gz ILLUMINACLIP:contams_forward_rev.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50 >> cmd_file; done
ADD REPLY
0
Entering edit mode

It is not a good idea to spawn multiple jobs from inside a main job. You may need to look into job arrays if you want to do that. What job scheduler are you using?

ADD REPLY
0
Entering edit mode

I am not sure how to use task array:

HPC cluster admin suggest this:

Used to run an array of serial jobs
–Input and output files can be separated
–Individual tasks are independent; may run anywhere
–Use the “-t <start>-<end>” option

#!/bin/bash
#$ -t 1-100 –cwd–o ~/results/taskjob.$JOB_ID
module load apps/fastapp
Fastapp–i~/inputdata/intput.$SGE_TASK_ID–o \~/results/out.$SGE_TASK_ID
ADD REPLY
0
Entering edit mode

I am using Grid Engine Scheduler.

ADD REPLY
0
Entering edit mode

can you please help me with, that which file is to be taken for the further analysis?

ADD REPLY
0
Entering edit mode

I run above loop and all cmd command put in the above script but it is not working:

#!/bin/bash

#$ -N trimmomtic

#$ -o /mnt/scratch/users/3052771/testdata/test-job

#$ -pe smp-verbose 20

module load apps/trimmomatic

#$ -wd /mnt/scratch/users/3052771/testdata/test-job

trimmomatic PE -phred33 SRR4104638_R1.fastq SRR4104638_R2.fastq SRR4104638_R1_paired.fq.gz SRR4104638_R1_unpaired.fq.gz SRR4104638_R2_paired.fq.gz SRR4104638_R2_unpaired.fq.gz ILLUMINACLIP:all_adapter.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

trimmomatic PE -phred33 SRR4104639_R1.fastq SRR4104639_R2.fastq SRR4104639_R1_paired.fq.gz SRR4104639_R1_unpaired.fq.gz SRR4104639_R2_paired.fq.gz SRR4104639_R2_unpaired.fq.gz ILLUMINACLIP:all_adapter.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

trimmomatic PE -phred33 SRR4104640_R1.fastq SRR4104640_R2.fastq SRR4104640_R1_paired.fq.gz SRR4104640_R1_unpaired.fq.gz SRR4104640_R2_paired.fq.gz SRR4104640_R2_unpaired.fq.gz ILLUMINACLIP:all_adapter.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

trimmomatic PE -phred33 SRR4104641_R1.fastq SRR4104641_R2.fastq SRR4104641_R1_paired.fq.gz SRR4104641_R1_unpaired.fq.gz SRR4104641_R2_paired.fq.gz SRR4104641_R2_unpaired.fq.gz ILLUMINACLIP:all_adapter.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

trimmomatic PE -phred33 SRR4104642_R1.fastq SRR4104642_R2.fastq SRR4104642_R1_paired.fq.gz SRR4104642_R1_unpaired.fq.gz SRR4104642_R2_paired.fq.gz SRR4104642_R2_unpaired.fq.gz ILLUMINACLIP:all_adapter.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

trimmomatic PE -phred33 SRR4104643_R1.fastq SRR4104643_R2.fastq SRR4104643_R1_paired.fq.gz SRR4104643_R1_unpaired.fq.gz SRR4104643_R2_paired.fq.gz SRR4104643_R2_unpaired.fq.gz ILLUMINACLIP:all_adapter.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

trimmomatic PE -phred33 SRR4104644_R1.fastq SRR4104644_R2.fastq SRR4104644_R1_paired.fq.gz SRR4104644_R1_unpaired.fq.gz SRR4104644_R2_paired.fq.gz SRR4104644_R2_unpaired.fq.gz ILLUMINACLIP:all_adapter.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50

trimmomatic PE -phred33 SRR4104645_R1.fastq SRR4104645_R2.fastq SRR4104645_R1_paired.fq.gz SRR4104645_R1_unpaired.fq.gz SRR4104645_R2_paired.fq.gz SRR4104645_R2_unpaired.fq.gz ILLUMINACLIP:all_adapter.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:50
ADD REPLY
0
Entering edit mode

Create a separate file for each job with just one trimmomatic command in it.

ADD REPLY
0
Entering edit mode

Hi Genomax,

Above script is also working, I just got confused because job was in queue, now it is generated all result files. Now could you also suggest loop to save hisat command line in cmd file for the file so that I can run together like trimmomatic.

files names is like this:

SRR4104637_R1_paired.fq.gz
SRR4104637_R2_paired.fq.gz


SRR4104638_R1_paired.fq.gz
SRR4104638_R2_paired.fq.gz

Hisat2 command for single pair end files:

hisat2 -x genome_index -1 SRR4104637_R1_paired.fq.gz -2 SRR4104637_R2_paired.fq.gz -S SRR4104637.sam
ADD REPLY
1
Entering edit mode

This time the common part in the file names is _paired.fq.gz so you can chop that off with sed and then create a new loop. Command will be saved in cmd_file as before.

for i in `ls -1 *R1_paired*| sed 's/\_R1_paired.fq.gz//'`; do echo hisat2 -x genome_index -1 $i\_R1_paired.fq.gz -2 $i\_R2_paired.fq.gz -S $i.sam >> cmd_file
ADD REPLY
0
Entering edit mode

Hi genomax:

Loop for hisat2 generating files like this:

hisat2 -x rice_index -1 SRR4104637_R1_R1_paired.fq.gz -2 SRR4104637_R1_R2_paired.fq.gz -S SRR4104637_R1.sam
hisat2 -x rice_index -1 SRR4104638_R1_R1_paired.fq.gz -2 SRR4104638_R1_R2_paired.fq.gz -S SRR4104638_R1.sam

Can you correct it so that it can generate files like this:

hisat2 -x genome_index -1 SRR4104637_R1_paired.fq.gz -2 SRR4104637_R2_paired.fq.gz -S SRR4104637.sam
ADD REPLY
0
Entering edit mode

Try the modified code above.

ADD REPLY
0
Entering edit mode

Thank you very much. Now its working perfectly.

ADD REPLY
0
Entering edit mode

Hi Genomax,

I am trying to run the hisat2 loop but not getting it correct: How I can modify command to remove extra R1

I am getting output like this:

hisat2 -x chrX_tran -1 Leaf_T1_F_R10_S1_L001_R1_R1_paired.fq.gz -2 Leaf_T1_F_R10_S1_L001_R1_R2_paired.fq.gz -S Leaf_T1_F_R10_S1_L001_R1.sam

command:

for i in `ls -1 *R1_paired*| sed 's/\_R1_paired.fq.gz//'`; do echo hisat2 -x chrX_tran -1 $i\_R1_paired.fq.gz -2 $i\_R2_paired.fq.gz -S $i.sam >> cmd2_file;done

input files are like this:

Root_T3_S_R2_S60_L001_R2_paired.fq.gz
Root_T3_S_R3_S59_L001_R1_paired.fq.gz
Root_T3_S_R3_S59_L001_R2_paired.fq.gz
Root_T3_S_R5_S58_L001_R1_paired.fq.gz
Root_T3_S_R5_S58_L001_R2_paired.fq.gz

I will be thankful for your time and advise.

Thanks Bioinfonext

ADD REPLY
0
Entering edit mode

I am not sure what you mean by extra R1? That loop should work with the example file names posted. Here I am just printing the name prefixes.

$ for i in `ls -1 *R1_paired*| sed 's/\_R1_paired.fq.gz//'`; do echo ${i};done
Root_T3_S_R3_S59_L001
Root_T3_S_R5_S58_L001

You could also do :

$ for i in *R1*.gz; do name=$(basename ${i} _R1_paired.fq.gz); echo ${name}; done
Root_T3_S_R3_S59_L001
Root_T3_S_R5_S58_L001
ADD REPLY
0
Entering edit mode

Dear genomax,

thanks for your all help. Can I run same loop for HTseq, I got alignment files and now now to run the loop to generate the read count for each alignment sam file saperetly. alignment file are like this:

 Root_T3_F_R9_S27_L001.sam

 Root_T3_S_R2_S60_L001.sam

 Root_T3_S_R3_S59_L001.sam

 Leaf_T1_F_R10_S1_L001.sam

 Leaf_T1_F_R2_S5_L001.sam

 Leaf_T1_F_R3_S4_L001.sam

for single file HTseq command:

htseq-count  Leaf_T1_F_R3_S4_L001.sam rice.gff  > Leaf_T1_F_R3_S4.count.txt

How can use above loop to generate HTseq command for all sam files.

I am thinking to use this, but how I can add count.txt to save output files.

for i in `ls -1 *L001.sam*| sed 's/\_L001.sam//'`; do echo htseq-count rice.gff  $i\_L001.sam >> cmd2_file;done

above command generated output like this:

htseq-count  rice.gff  Leaf_T3_S_R6_S42_L001.sam
htseq-count  rice.gff   Leaf_T3_S_R7_S41_L001.sam

But if it can write script like this will be a great help:

htseq-count  rice.gff  Leaf_T3_S_R6_S42_L001.sam >  Leaf_T3_S_R6_S42_L001.count.txt

Thanks bioinfonext

ADD REPLY
0
Entering edit mode

I got out put file like txt by using below command:

for i in `ls -1 *L001.sam*| sed 's/\_L001.sam//'`; do echo htseq-count rice.gff  $i\_L001.sam >$i\_L001.COUNT.txt>> cmd2_file;done

Please let me know if it is ok Thanks Bioinfonext

ADD REPLY
1
Entering edit mode

That looks ok. Instead of using using bare $i, it is a better practice to use ${i} to reference a variable.

ADD REPLY
0
Entering edit mode

Hi Genomax,

HTSeq loop command:

for i in `ls -1 *L001.sam*| sed 's/\_L001.sam//'`; do echo htseq-count rice.gff  $i\_L001.sam >$i\_L001.COUNT.txt>> cmd2_file;done

above command generate cmd file like this, it do not direct output to txt file in cmd file; instead of adding output txt file to cmd command, it generate empty txt file according to sample name.

htseq-count Leaf_T1_F_R10_S1_L001.sam Oryza_indica.ASM465v1.44.chr.gtf

htseq-count Leaf_T1_F_R2_S5_L001.sam Oryza_indica.ASM465v1.44.chr.gtf

htseq-count Leaf_T1_F_R3_S4_L001.sam Oryza_indica.ASM465v1.44.chr.gtf

It will be great if it can also add output count txt file according to sample name in the command.

ADD REPLY
0
Entering edit mode

it will be great if can generate cmd file like this:

htseq-count Leaf_T1_F_R10_S1_L001.sam Oryza_indica.ASM465v1.44.chr.gtf >Leaf_T1_F_R10_S1_L001.count.txt
ADD REPLY
0
Entering edit mode

Use this:

$ for i in `ls -1 *L001.sam*| sed 's/\_L001.sam//'`; do echo "htseq-count ${i}_L001.sam  Oryza_indica.ASM465v1.44.chr.gtf  >${i}_L001.COUNT.txt" >> cmd2_file;done
ADD REPLY
1
Entering edit mode
5.6 years ago

ADD COMMENT
0
Entering edit mode

Dear Vijay,

Thanks for the above script. what about this adapter fasta file: ILLUMINACLIP:/data/results/NEB_adapter.fasta

do we need to change the path for this in the script and from where we can get this adapter file?

thanks bioinfonext

ADD REPLY
0
Entering edit mode

Use the adapter sequence file for your own data.

ADD REPLY
0
Entering edit mode

Thank you very much genomax!

ADD REPLY

Login before adding your answer.

Traffic: 1930 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6