Question

Extract specific reads from FASTQ files based on subsequence

2

Entering edit mode

9.6 years ago

Paul ★ 1.5k

Dear all,

I have FASTQ files and on start of my read I have 7 nucleotides tag - I would like to extract reads with this specific tag.

I would like to search in first 15 nucleotides of my reads, if match - extract this read to new fastq files.

Thank you for any ideas or help.

fastq bash awk • 12k views

ADD COMMENT • link updated 22 months ago by Ram 44k • written 9.6 years ago by Paul ★ 1.5k

0

Entering edit mode

any other tools for sam question?

ADD REPLY • link 7.8 years ago by springpython • 0

1

Entering edit mode

Why? Are all of these tools inadequate?

ADD REPLY • link 7.9 years ago by harold.smith.tarheel ★ 5.0k

1

Entering edit mode

There's a wonderful quote in Jurassic Park: "...were so preoccupied with whether they could, they didn't stop to think if they should."

ADD REPLY • link 7.9 years ago by Ram 44k

1

Entering edit mode

Nice :) Jurassic Park has a lot of apropos quotables for genetic research...

Personally, I like "Life will find a way"

ADD REPLY • link 7.9 years ago by Brian Bushnell 20k

1

Entering edit mode

"Life ... uhh ... finds a way" :)

ADD REPLY • link 7.9 years ago by Ram 44k

0

Entering edit mode

My meme was more eloquent ;)

ADD REPLY • link 7.9 years ago by Brian Bushnell 20k

0

Entering edit mode

We need an /r/biostars

ADD REPLY • link 7.9 years ago by Ram 44k

0

Entering edit mode

I am actually stuck in python code for same question. hence asked for suggestions.

ADD REPLY • link 7.9 years ago by springpython • 0

0

Entering edit mode

Don't add answers unless you're answering the original question. Use Add Reply or Add Comment instead. Please read https://www.biostars.org/t/how-to/

I'm moving this to a comment on the top-level post now.

ADD REPLY • link 7.9 years ago by Ram 44k

Ram · Accepted Answer · 2015-05-07

Probably not the fastest option:

grep -B1 -A2 "^AGATCGG" file.fq | grep -v "^--$" > out.fq

Find lines that begin with "AGATCGG", grab one line before each hit and two lines after each hit, remove lines that are "--".

Note that it's possible (but extremely unlikely) that some quality value line begins like "AGATCGG", in which case the above command would mess up the output file. The likelihood of this is probably very close to zero but if you're processing a file with googolplex lines, maybe it could happen.

score 3 · Accepted Answer · 2015-05-07

3

Entering edit mode

9.6 years ago

Ram 44k

You can use Heng Li's bioawk:

To check if the tag is part of the first 15 bases

bioawk -c fastx 'substr($seq,0,15) ~ /$TAG/ { print }' reads.fq.gz

To match the first 7 bases to your tag,

bioawk -c fastx 'substr($seq,0,7) == $TAG { print }' reads.fq.gz

ADD COMMENT • link 5.2 years ago by Ram 44k

Ram · Accepted Answer · 2015-05-07

In order to stay old-school, you can use the FastX toolkit's barcode splitter.

You need a text file with your tag (let it be myTag.txt) and then you can run (with N mismatches):

cat sequence.fastq | /usr/local/bin/fastx_barcode_splitter.pl -Q33 --bcfile myTag.txt --bol --mismatches $N --prefix out --suffix .txt

In order to remove the tag-sequence, you can use the fastx_trimmer from the same toolkit

Using the Fastx-tools, do not forget to add the -Q33 option.

Ram · Accepted Answer · 2015-05-07

2

Entering edit mode

9.6 years ago

Brian Bushnell 20k

You can also do this with BBDuk:

bbduk.sh -Xmx1g in=reads.fq out=filtered.fq k=7 mm=f rcomp=f restrictleft=15 literal=ACGTACG

If you want multiple tags, you can list them separated by commas; and if you want to allow 1bp mismatch, set hdist=1.

ADD COMMENT • link updated 7.9 years ago by Ram 44k • written 9.6 years ago by Brian Bushnell 20k