Question

How to extract reads with exact match with BBDuk?

4.8 years ago

camelest ▴ 50

Hi, I'm pretty new to this area and am wondering if anyone could help. I have paired-end RNA-seq data. What I'd like to do is extract reads that have a certain sequence, for example TTTTT, at the 5' end of read 2 from a FASTQ file. I tried BBDuk as below.

bbduk.sh -Xmx1g in1=read1.fastq in2=read2.fastq outm1=read1.filtered.fastq outm2=read2.filtered.fastq literal=TTTTT k=5 rcomp=f mm=f restrictleft=5

This gives me a read2.filtered.fastq in which most reads start with TTTTT. However, when I run

awk 'NR % 4 == 2' read2.filtered.fastq | grep -v "^TTTTT"

I see many reads that do not start with TTTTT. For example,

ACTTACTGAGTTTTAAATGGTATAAATTTCTGGCATCTGGCAGGTG
CTGCCACACACCCCAGTGGGCTTGGGGCCTGGCTGGAACTATTCAC
GTCCCAAATATTCACTATGCCTTCTTTGGTGCCGGAAACTAACAGT
TCCAAAACATTATCATTTCAATATGTAATCAACATAAAAAAATAAG

Why are these reads extracted? What did I do wrong? Any suggestions would be really appreciated.

RNA-Seq fastq BBDuk • 1.7k views

Maybe read 1 starts with TTTTT?

4.8 years ago by h.mon 35k
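
One way to test this guess is to look at both mates of each filtered pair and count which of them starts with the motif. The sketch below is plain POSIX shell and awk, not a BBTools feature; the function name classify_pairs is made up for illustration. It assumes strict 4-line FASTQ records and that both files list the pairs in the same order.

```shell
# Hypothetical helper: count pairs by which mate starts with the motif.
# Usage: classify_pairs read1.filtered.fastq read2.filtered.fastq TTTTT
classify_pairs() {
  awk -v motif="$3" -v r2file="$2" '
    {
      # Read the same-numbered line from the read 2 file.
      if ((getline r2 < r2file) <= 0) exit
      # Only the second line of each 4-line record holds the sequence.
      if (NR % 4 != 2) next
      a = (index($0, motif) == 1)   # read 1 starts with the motif
      b = (index(r2, motif) == 1)   # read 2 starts with the motif
      if (a && b) both++
      else if (b) only2++
      else if (a) only1++
      else neither++
    }
    END {
      printf "both=%d r2_only=%d r1_only=%d neither=%d\n", both, only2, only1, neither
    }
  ' "$1"
}
```

If the guess is right, the read 2 sequences that do not start with TTTTT should all be counted under r1_only, and neither should stay at 0.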

Sorry for the late reply. Thank you so much for your input; you are totally correct. Is there a good way to extract pairs in which read 2 starts with TTTTT, without read 1 being tested? I was thinking of using only the read 2 file as input, running the same filter, and then re-pairing the result with read 1 using repair.sh:

repair.sh in1=read1.fastq in2=read2.filtered.fastq out1=fixed1.fastq out2=fixed2.fastq outs=singletons.fq repair

4.8 years ago by camelest ▴ 50

That would be the way to go.

4.8 years ago by GenoMax 152k

Thank you for your comment. This was really helpful.

4.8 years ago by camelest ▴ 50

You should be able to use a pipe to do this in one step. Give it a try.

bbduk.sh -Xmx1g in=read2.fastq outm=stdout.fastq literal=TTTTT k=5 rcomp=f mm=f restrictleft=5 | repair.sh in1=read1.fastq in2=stdin.fastq out1=fixed1.fastq out2=fixed2.fastq outs=singletons.fq repair

4.8 years ago by GenoMax 152k
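
After running either the two-step or the piped version, it may be worth checking that the re-paired output is what was intended. The sketch below is plain POSIX shell and awk, with a made-up function name, check_pairs. It assumes strict 4-line FASTQ records and uses the output file names from the commands above. It reports the record count of each file and the number of read 2 sequences that do not start with the motif, and returns a non-zero status if the counts differ or any read 2 lacks the motif.

```shell
# Hypothetical helper: sanity-check re-paired output.
# Usage: check_pairs fixed1.fastq fixed2.fastq TTTTT
check_pairs() {
  # Record counts, assuming 4 lines per FASTQ record.
  n1=$(awk 'END { print int(NR / 4) }' "$1")
  n2=$(awk 'END { print int(NR / 4) }' "$2")
  # Count read 2 sequences that do not begin with the motif.
  bad=$(awk -v motif="$3" '
    NR % 4 == 2 && index($0, motif) != 1 { n++ }
    END { print n + 0 }
  ' "$2")
  echo "read1_records=$n1 read2_records=$n2 read2_without_motif=$bad"
  # Succeed only when both files pair up and every read 2 has the motif.
  [ "$n1" -eq "$n2" ] && [ "$bad" -eq 0 ]
}
```

This only compares counts and prefixes; it does not confirm that the read names in the two files actually match.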