Filter fastq file by first two base pairs
9.5 years ago
Jautis ▴ 580

Hi, I have the fastq file example.fq and I would like to keep only the reads that start with TG. Does anybody have advice on how to do this? I've been searching, and all the filtering tools I've seen look at read length or at elements of the read ID, but not at the sequence itself.

Thank you!

9.5 years ago
gunzip -c file.fastq.gz |\
    paste  - - - -  | awk -F '\t' '($2 ~ /^[Tt][Gg]/)' | tr "\t" "\n"

Works like a charm, thank you! Do you think you could explain how it works?
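
A step-by-step reading of that one-liner, as a minimal sketch. The file demo.fq and its two reads are made up for illustration, and the sketch assumes strict four-line FASTQ records (no wrapped sequences). The original command additionally starts with gunzip -c to decompress a gzipped input.

```shell
# Build a tiny two-record FASTQ file to experiment with (hypothetical reads).
printf '%s\n' \
  '@read1' 'TGACCA' '+' 'IIIIII' \
  '@read2' 'ACTGAC' '+' 'IIIIII' > demo.fq

# 1. paste - - - -   joins every 4 input lines into one tab-separated row,
#                    so each record (ID, sequence, "+", quality) sits on one line.
# 2. awk -F '\t'     splits each row on tabs; $2 is the sequence. A pattern with
#                    no action prints the row when it matches: here, a sequence
#                    that starts with T or t followed by G or g.
# 3. tr "\t" "\n"    turns the tabs back into newlines, restoring 4-line records.
paste - - - - < demo.fq \
  | awk -F '\t' '($2 ~ /^[Tt][Gg]/)' \
  | tr "\t" "\n" > demo.TG.fq

cat demo.TG.fq
```

Only the @read1 record should come through, since it is the only one whose sequence begins with TG.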

9.5 years ago
iraun 6.2k

Another possibility, just having fun:

grep --no-group-separator -A2 -B1 '^TG' test.fq | grep --no-group-separator -A3 '^@HISEQ'

The read IDs in my fq file start with @HISEQ. Replace @HISEQ with whatever your read IDs start with.
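
One caveat with matching '^TG' on every line: T and G are also valid Phred+33 quality characters, so a quality line can begin with TG and make the first grep pull in lines from a neighbouring record. The second grep filters some of that out, but a check based on line position avoids the question. A minimal sketch, assuming strict four-line records; demo2.fq and its reads are made up for illustration:

```shell
# Demo input (hypothetical reads): the quality line of read 2 starts with "TG"
# on purpose, while its sequence does not.
printf '%s\n' \
  '@HISEQ:1' 'TGACCA' '+' 'IIIIII' \
  '@HISEQ:2' 'ACTGAC' '+' 'TGIIII' \
  '@HISEQ:3' 'TGTTTT' '+' 'IIIIII' > demo2.fq

# Remember the header (line 1 of each record), test only the sequence (line 2),
# and print the whole record when the sequence starts with TG.
awk '
  NR % 4 == 1 { header = $0 }                                  # read ID line
  NR % 4 == 2 { keep = ($0 ~ /^TG/); if (keep) print header }  # sequence line decides
  NR % 4 != 1 && keep { print }                                # sequence, "+", quality
' demo2.fq > demo2.TG.fq

cat demo2.TG.fq
```

Because only every second line of a record is tested, the quality string of read 2 is never mistaken for a sequence.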

9.5 years ago

With BBMap:

bbduk.sh in=example.fq outm=filtered.fq k=2 literal=TG rcomp=f mm=f restrictleft=2

This will also work on fasta.
