Extract specific reads from FASTQ files based on subsequence
4
2
Entering edit mode
9.7 years ago
Paul ★ 1.5k

Dear all,

I have FASTQ files and on start of my read I have 7 nucleotides tag - I would like to extract reads with this specific tag.

I would like to search in first 15 nucleotides of my reads, if match - extract this read to new fastq files.

Thank you for any ideas or help.

fastq bash awk • 12k views
ADD COMMENT
0
Entering edit mode

any other tools for sam question?

ADD REPLY
1
Entering edit mode

Why? Are all of these tools inadequate?

ADD REPLY
1
Entering edit mode

There's a wonderful quote in Jurassic Park: "...were so preoccupied with whether they could, they didn't stop to think if they should."

ADD REPLY
1
Entering edit mode

Nice :) Jurassic Park has a lot of apropos quotables for genetic research...

Personally, I like "Life will find a way"

ADD REPLY
1
Entering edit mode

"Life ... uhh ... finds a way" :)

ADD REPLY
0
Entering edit mode

My meme was more eloquent ;)

ADD REPLY
0
Entering edit mode

We need an /r/biostars

ADD REPLY
0
Entering edit mode

I am actually stuck in python code for same question. hence asked for suggestions.

ADD REPLY
0
Entering edit mode

Don't add answers unless you're answering the original question. Use Add Reply or Add Comment instead. Please read https://www.biostars.org/t/how-to/

I'm moving this to a comment on the top-level post now.

ADD REPLY
5
Entering edit mode
9.7 years ago
5heikki 11k

Probably not the fastest option:

grep -B1 -A2 "^AGATCGG" file.fq | grep -v "^--$" > out.fq

Find lines that begin with "AGATCGG", grab one line before each hit and two lines after each hit, remove lines that are "--".

Note that it's possible (but extremely unlikely) that some quality value line begins like "AGATCGG", in which case the above command would mess up the output file. The likelihood of this is probably very close to zero but if you're processing a file with googolplex lines, maybe it could happen.

ADD COMMENT
3
Entering edit mode
9.7 years ago
Ram 44k

You can use Heng Li's bioawk:

To check if the tag is part of the first 15 bases

bioawk -c fastx 'substr($seq,0,15) ~ /$TAG/ { print }' reads.fq.gz

To match the first 7 bases to your tag,

bioawk -c fastx 'substr($seq,0,7) == $TAG { print }' reads.fq.gz
ADD COMMENT
2
Entering edit mode
9.7 years ago
michael.ante ★ 3.9k

In order to stay old-school, you can use the FastX toolkit's barcode splitter.

You need a text file with your tag (let it be myTag.txt) and then you can run (with N mismatches):

cat sequence.fastq | /usr/local/bin/fastx_barcode_splitter.pl -Q33 --bcfile myTag.txt --bol --mismatches $N --prefix out --suffix .txt

In order to remove the tag-sequence, you can use the fastx_trimmer from the same toolkit

Using the Fastx-tools, do not forget to add the -Q33 option.

ADD COMMENT
2
Entering edit mode
9.7 years ago

You can also do this with BBDuk:

bbduk.sh -Xmx1g in=reads.fq out=filtered.fq k=7 mm=f rcomp=f restrictleft=15 literal=ACGTACG

If you want multiple tags, you can list them separated by commas; and if you want to allow 1bp mismatch, set hdist=1.

ADD COMMENT

Login before adding your answer.

Traffic: 1816 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6