Hi, I'm pretty new to this area and is wondering anyone could help. I have a pair-ended RNA-seq data. What I'd like to do is to extract reads with a certain sequence in the 5' end of read2 from a fastq file, for example TTTTT. I've tried with BBDuk as below.
bbduk.sh -Xmx1g in1=read1.fastq in2=read2.fastq outm1=read1.filtered.fastq outm2=read2.filtered.fastq literal=TTTTT k=5 rcomp=f mm=f restrictleft=5
then I get read2.filtered.fastq mostly starting from TTTTT. However, when I do
cat read2.filtered.fastq | awk '(NR%4==2){print}' | grep -v "^TTTTT"
then there are many reads that do not start with TTTTT. For example,
ACTTACTGAGTTTTAAATGGTATAAATTTCTGGCATCTGGCAGGTG CTGCCACACACCCCAGTGGGCTTGGGGCCTGGCTGGAACTATTCAC GTCCCAAATATTCACTATGCCTTCTTTGGTGCCGGAAACTAACAGT TCCAAAACATTATCATTTCAATATGTAATCAACATAAAAAAATAAG
Why are these reads extracted? Where did I do wrong? Any suggestion would be really appreciated.
Maybe read 1 starts with TTTTT?
Sorry for the late reply. Thank you so much for your input. I found that you are totally correct. Is there any good ways in order to extract paired-ends with only read2 starting with TTTTT but not read1? I was thinking only use the read2 file as the input, do the same thing and use
repair.sh in1=read1.fastq in2=read2.filtered.fastq out1=fixed1.fastq out2=fixed2.fastq outs=singletons.fq repair
.That would be the way to go.
Thank you for your comment. This was really helpful.
You should be able to use a pipe to do this in one step. Give it a try.