bbduk command line options for EM-seq (NEB)
2
0
Entering edit mode
20 months ago
tamu.anand ▴ 30

Hi

I would like to use bbduk to replicate the command line flags used by trim_galore for EM-Seq (NEB).

The bismark user guide recommends this for trim_galore

--clip_R1 10 --clip_R2 10 --three_prime_clip_R1 10 --three_prime_clip_R2 10

Explanation of the trim_galore parameters

--clip_R1 <int>   

Instructs Trim Galore to remove <int> bp from the 5' end 
of read 1 (or single-end reads). This may be useful if 
the qualities were very poor, or if there is some
sort of unwanted bias at the 5' end. 
Default: OFF.

--clip_R2 <int>  

Instructs Trim Galore to remove <int> bp from the 5' end
of read 2 (paired-end reads  only). This may be useful if 
the qualities were very poor, or if there is some sort
of unwanted bias at the 5' end. For paired-end BS-Seq, 
it is recommended to remove the first few bp because  
the end-repair reaction may introduce a bias towards low
methylation. Please refer to the M-bias plot section in 
the Bismark User Guide for  some examples. 
Default: OFF.

--three_prime_clip_R1 <int>     

Instructs Trim Galore to remove <int> bp from 
the 3' end of read 1 (or single-end reads) AFTER 
adapter/quality trimming has been performed. This 
may remove some  unwanted bias from the 3' end 
that is not directly related to adapter sequence or 
basecall quality.
Default: OFF.

--three_prime_clip_R2 <int>     

Instructs Trim Galore to remove <int> bp from 
the 3' end of read 2 AFTER adapter/quality trimming 
has been performed. This may remove some 
unwanted bias from  the 3' end that is not 
directly related to adapter sequence or basecall quality.
Default: OFF.

I was wondering how I could get the same with bbduk - the relevant bbduk flags I could find were

forcetrimleft=0     (ftl) If positive, trim bases to the left of this position
forcetrimright=0    (ftr) If positive, trim bases to the right of this position
forcetrimright2=0   (ftr2) If positive, trim this many bases on the right end.
forcetrimmod=0      (ftm) If positive, right-trim length to be equal to zero,

Based on what trim_galore provides, I am guessing the equivalent bbduk command line option would be

bbduk.sh in1=read1.fq in2=read2.fq out1=clean1.fq out2=clean2.fq ktrim=l ktrim=r forcetrimleft=10
forcetrimright2=10

Is the above correct?

Thanks in advance.

bbduk trim-galore • 2.5k views
ADD COMMENT
1
Entering edit mode
20 months ago
GenoMax 148k

I suggest doing this in a two step process. In step one scan and trim for adapters and then pass the result for hard trim of 10 bp to step 2 (replace NN with a value that is less than 1/2 length of your reads).

bbduk.sh -Xmx4g  in1=read1.fq in2=read2.fq out=stdout.fq ref=adapters.fa ktrim=r k=NN | bbduk.sh -Xmx4g in=stdin.fq out1=clean1.fq out2=clean2.fq forcetrimleft=10 forcetrimright2=10 int=t

Your example does not have a way to specify adapter sequences to trim. If you add that they perhaps this would work as well

bbduk.sh in1=read1.fq in2=read2.fq out1=clean1.fq out2=clean2.fq ref=adapters.fa ktrim=r k=NN forcetrimleft=10 forcetrimright2=10
ADD COMMENT
0
Entering edit mode

Hi GenoMax

Thanks a lot. I have some follow up questions.

Your answer has 2 code snippets:

  1. Chained/piped command with k=NN and use of ref=adapters.fa in the first part of the pipe.
  2. without pipes and without use k=NN, with ref=adapters.fa

    Are you suggesting either of the two code snippets will work in my case?

You also mention this below between the 2 snippets and I am a bit confused here as both the snippets have ref=adapters.fa. Were you referring to add it to the second part of the pipe in Code snippet 1?

*Your example does not have a way to specify adapter 
sequences to trim. If you add that they perhaps this 
would work as well*

Also, I think code snippet 1 needs int=t - this based on Brian's note in #43 here

When reading an interleaved file from stdin, 
you must always specify "int=t" because it 
is only autodetected if reading from a file

Thanks once again.

ADD REPLY
0
Entering edit mode

Code snippet 1 will need int=t thanks for linking the thread (edited my example).

I don't think Brian has explicitly put down the order of operations in bbduk in help docs but I think the two command lines above should produce similar output once you add adapters.fa files to your example (since without that bbduk has no way to know about which adapter sequences to scan/trim). You should also add k=NN in your command line and only use ktrim=r since with Illumina the adapter should only be present on 3'-end of the sequence.

ADD REPLY
0
Entering edit mode

Hi GenoMax

Is this the order of operations you were referring - from the same link I shared before (Brian's note in #43 here)

It's not possible to change the order of processing operations in BBDuk, and that would not be easy to do. The order is currently:

1) filter by minAvgQuality
2) force-trim
3) kmer-analyze (trim, filter, or mask)
4) trim by overlap
5) quality-trim

Also, in your original answer (copy/pasted below), should it actually be 1/2 the length of adapters - asking so as I happened to see this comment of yours

replace NN with a value that is less than 1/2 length of your reads
ADD REPLY
0
Entering edit mode

I think both ways still may produce very similar results but you can try and let us know. Going by the list above my way may be operationally logical since force trim takes precedence.

You want to select a k to make sure that you get good initial matches. What is the length of your reads?

ADD REPLY
0
Entering edit mode

I am trying this with some example data provided by NEBiolabs in their paper - reads are 99 bp

ADD REPLY
0
Entering edit mode

Try k=25. I generally stay somewhere in 20-25 range.

ADD REPLY
0
Entering edit mode

Hi

When I try code snippet 1 (piped command), I am getting this weird error about can't read file stdout.fastq

bbduk.sh  in1=200ng_em-seq.ds.1.fastq in2=200ng_em-seq.ds.2.fastq out=stdout.fastq ref=adapters.fa ktrim=r k=25 | bbduk.sh in=stdout.fastq out1=bbduk_run_1_200ng_em-seq_1.fastq out2=bbduk_run_1_200ng_em-seq_2.fastq forcetrimleft=10 forcetrimright2=10 int=t

Version 39.01

Set INTERLEAVED to true
Exception in thread "main" java.lang.RuntimeException: Can't read file 'stdout.fastq'
        at shared.Tools.testInputFiles(Tools.java:185)
        at jgi.BBDuk.<init>(BBDuk.java:922)
        at jgi.BBDuk.main(BBDuk.java:78)
0.046 seconds.

I think the 2nd part of the pipe should be in=stdin.fq instead of in=stdout.fq and when I change that it works.

ADD REPLY
0
Entering edit mode

Correct. I will change that in my answer.

ADD REPLY
0
Entering edit mode
20 months ago
tamu.anand ▴ 30

For the sake of completeness, reproducibility and code snippet clarity, I will add here what worked for me based on @Genomax answer. Thanks GenoMax

The fastq files are in the zip file from this link from the New England Biolabs Genome Research Paper

  • please see this issue tracker for discussions on trim_galore params for EM-Seq

trim_galore command - rename output files to trim_galore_200ng_em_1.fastq and trim_galore_200ng_em_2.fastq

trim_galore --cores 8 \
  --clip_R1 10 --clip_R2 10 \
   --three_prime_clip_R1 10 \
    --three_prime_clip_R2 10 \
    --paired 200ng_em-seq.ds.1.fastq \
      200ng_em-seq.ds.2.fastq

I ran 4 bbduk commands - bbduk.sh ref=adapters.fa in1=200ng_em-seq.ds.1.fastq in2=200ng_em-seq.ds.2.fastq followed by one of these

#command 1
out=stdout.fq ktrim=r k=25 | bbduk.sh in=stdin.fastq \
out1=bbduk_run1_200ng_em-seq_1.fastq \
out2=bbduk_run1_200ng_em-seq_2.fastq \
forcetrimleft=10 forcetrimright2=10 int=t

#command 2
ktrim=r k=25 \
out1=bbduk_run2_200ng_em-seq_1.fastq \
out2=bbduk_run2_200ng_em-seq_2.fastq \
forcetrimleft=10 forcetrimright2=10

#command 3
out=stdout.fq ktrim=r k=25 trimq=10 qtrim=rl tpe tbo | bbduk.sh in=stdin.fastq \
out1=bbduk_run3_200ng_em-seq_1.fastq \
out2=bbduk_run3_200ng_em-seq_2.fastq \
forcetrimleft=10 forcetrimright2=10 int=t

#command 4
ktrim=r k=25 tpe tbo trimq=10 qtrim=rl \
out1=bbduk_run4_200ng_em-seq_1.fastq \
out2=bbduk_run4_200ng_em-seq_2.fastq \
forcetrimleft=10 forcetrimright2=10

I then used seqfu stats on the original fastq, trim_galore and bbduk outputs seqfu stats as an image

Based on the above, I feel command3 or command4 with bbduk would be the closest to what trim_galore produces.

Hope this helps.

Thanks GenoMax for your valuable inputs as always.

ADD COMMENT
0
Entering edit mode

Hi GenoMax

A quick follow-up question: If my input files are gzipped, do I need to use out=sdout.fq.gz and in=stdin.fq.gz in the piped command?

ADD REPLY
0
Entering edit mode

Yes. BBTools respect/follow file extensions and keep files in the same format.

ADD REPLY
0
Entering edit mode

I am trying it with the same files gzipped and using out=stdout.fq.gz and in=stdin.fq.gz - it does not work.

  • it works only without the *.gz extensions for stdin/stdout and it produces gzipped output files if starting with gzipped inputs
ADD REPLY
0
Entering edit mode

Since you are simply passing the intermediate file from one operation to other keeping it uncompressed should be fine.

ADD REPLY

Login before adding your answer.

Traffic: 3739 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6