Question

How to perform quality control on multiple folders using trimmomatic?

0

Entering edit mode

2.8 years ago

pavelasquezv ▴ 50

Hi all,

I would like to perform quality control on multiple fastq folders using trimmomatic. I am going to work with thousands of SRA files that I will download from NCBI. However, each experiment may have used specific adapters according to the equipment. Therefore, I need to specify the adapter type for each experiment in the ILLUMINACLIP parameter, which doesn't let me loop to trim all the SRA files in one loop. Do you have any suggestions to run the trimmomatic with all SRAs checking any adapter?

This is the code:

cat  listaSE | while read l; do prefetch -p $l; \
fasterq-dump $l -p -e $NSLOTS;  \
java -jar /Storage/progs/Trimmomatic-0.38/trimmomatic-0.38.jar SE -threads $NSLOTS \
    $l*fastq \
    $l\_SR.fastq  \
    HEADCROP:12 ILLUMINACLIP:TruSeq3-SE:2:30:10 \
    SLIDINGWINDOW:5:20 LEADING:3 TRAILING:3 MINLEN:40;
done

trimmomatic • 2.4k views

ADD COMMENT • link 2.8 years ago by pavelasquezv ▴ 50

1

Entering edit mode

I need to specify the adapter type for each experiment in the ILLUMINACLIP parameter

You could use an adapter file that contains multiple commercial adapters. While there is some chance that there may be a bit of over-trimming it should be negligible.

bbduk.sh includes such a file in resources directory that you could use here.

ADD REPLY • link 2.8 years ago by GenoMax 151k

0

Entering edit mode

Thanks for your answer GenoMax. Is there no way to do this with the trimmomatic program?

ADD REPLY • link 2.8 years ago by pavelasquezv ▴ 50

1

Entering edit mode

Not unless you can find a way to include which adapter file you want to use for each sample in your listaSE file since that is the only input for your script. If you only have SRR* accessions then they alone won't tell you about which adapter to scan for.

ADD REPLY • link 2.8 years ago by GenoMax 151k

0

Entering edit mode

Hi GenoMax, I hope you are well. I am trying to trim with bbduk.sh but I think I have a problem with the adapters. Do you have a suggestion to remove the adapters or trim the first and last 12 base pairs?
Many thanks!

This is the code:

bbduk.sh in1=SRR000001.fastq out=SRR000001_unpaired_20Poly.fastq ref=/home/pavelasquezv/RNAseq/bbmap/resources/adapters.fa qtrim=lr trimq=20 overwrite=true ktrim=r  qskip=4 ways=10 ftm=5 maq=15 minlen=40 trimpolya=10 trimpolyg=10 trimpolyc=10 barcodefilter=t forcetrimleft=1

out fastqc

ADD REPLY • link 2.8 years ago by pavelasquezv ▴ 50

1

Entering edit mode

You do not need to trim the initial 12 bp. That pattern is commonly seen in RNAseq libraries due to random primers not being so random. There is a blog post from authors of FastQC on this topic: https://sequencing.qcfail.com/articles/positional-sequence-bias-in-random-primed-libraries/ (NOTE: SSL cert on the site has expired once again but that site is safe, I pinged author of FastQC to fix the SSL cert, this should be done by tomorrow, if you want to wait until then to visit the link above).

ADD REPLY • link 2.8 years ago by GenoMax 151k

0

Entering edit mode

Thanks a lot my friend! So I can continue safer. You are the best GenoMax! Kind regards, Alex

ADD REPLY • link 2.8 years ago by pavelasquezv ▴ 50

1

Entering edit mode

As a remark, if you are going to process such large amounts of samples do yourself a favor and spend time learning a workflow manager such as Nextflow or Snakemake first that allows caching and resuming pipeline operations. Likely the pipeline you build does not only trimming but also much more, and for large sample size you need some parallelization, e.g. on a cluster and for these the managers are tailored plus it makes pipelines more robust and feasable.

ADD REPLY • link 2.8 years ago by ATpoint 88k

0

Entering edit mode

Many thanks for your suggestion ATpoint!

ADD REPLY • link 2.8 years ago by pavelasquezv ▴ 50

0

Entering edit mode

Hi ATpoint I have very interested in using a Nextflow or Snakemake but i don´t have any idea how can I do that. I am a new to bioinformatics. Please, do you have any idea how to build a pipeline in Nextflow with the following script?

#!/bin/bash

#$ -q all.q
#$ -V
#$ -cwd
#$ -pe smp 10
#$ -l hostname=figsrv

module load STAR/2.7.10a
module load FastQC/0.11.8
module load sratoolkit/3.0.0
module load Trimmomatic/0.38

grep -Eo '\S*' df_SE_H_arm.csv  > listaSE

cat  listaSE | while read l; do prefetch -p $l; \
fasterq-dump $l -p -e $NSLOTS;  \
java -jar /Storage/progs/Trimmomatic-0.38/trimmomatic-0.38.jar SE -threads $NSLOTS \
    $l.fastq \
    $l\_SR.fastq  \
    HEADCROP:12 ILLUMINACLIP:/Storage/progs/Trimmomatic-0.38/adapters/TruSeq3-SE:2:30:10 \
    SLIDINGWINDOW:4:15 LEADING:3 TRAILING:3 MINLEN:35; \
rm -r $l.fastq $l; \
fastqc -t $NSLOTS --noextract $l*SR.fastq  -O /Storage/data1/angelica.vargas/fastqc_2_SE/; \
rm -r /Storage/data1/angelica.vargas/fastqc_2/*zip; \
STAR \
 --genomeDir /Storage/data1/angelica.vargas/index \
 --runThreadN $NSLOTS \
 --readFilesIn $l*SR.fastq  \
 --outFileNamePrefix /Storage/data1/angelica.vargas/STAR_SE/$l\_ \
 --quantMode GeneCounts; \
cd /STAR_SE; \
paste $l\_ReadsPerGene.out.tab | grep -v "_" | \
     awk '{printf "%s\t", $1}{for (i=2;i<=NF;i+=4) printf "%s\t", $i; printf "\n" }' \
     > tmp; \
sed -e "1igene_name\t$(ls $l\_ReadsPerGene.out.tab | \
     tr '\n' '\t' | sed 's/_ReadsPerGene.out.tab//g')" tmp | \
     sed "s/gene-//g" | cut -f1-7 \
     > $l\_raw_counts_matrix.txt; \
rm tmp; \
cd ..; \
mv /Storage/data1/angelica.vargas/STAR_SE/*Log.final.out /Storage/data1/angelica.vargas/qresultsSE/; \
mv /Storage/data1/angelica.vargas/STAR_SE/*matrix.txt /Storage/data1/angelica.vargas/resultsSE/; \
rm -r $l* /Storage/data1/angelica.vargas/STAR_SE/$l* fasterq.tmp*; \
done

ADD REPLY • link 2.8 years ago by pavelasquezv ▴ 50

1

Entering edit mode

Sorry, but I am not going to do your coding for you. Both Nextflow and Snakemake have lots of tutorial material at their websites that you can use to learn from.

ADD REPLY • link 2.8 years ago by ATpoint 88k

0

Entering edit mode

Sorry, my friend, I am trying but is very difficult for me. But I will get it. Many thanks again for your suggestion! All the best!

ADD REPLY • link 2.8 years ago by pavelasquezv ▴ 50

score 0 · Answer 1 · 2022-07-12

read two columns in the initial file ?

cat  listaSE | while read l CLIP ; do prefetch -p $l; \
fasterq-dump $l -p -e $NSLOTS;  \
java -jar /Storage/progs/Trimmomatic-0.38/trimmomatic-0.38.jar SE -threads $NSLOTS \
    $l*fastq \
    $l\_SR.fastq  \
    HEADCROP:12 "ILLUMINACLIP:${CLIP}" \
    SLIDINGWINDOW:5:20 LEADING:3 TRAILING:3 MINLEN:40;
done