Question

Extracting fastq files, based on their fasta counterparts

8.1 years ago

roblogan6 ▴ 50

I have two files. One is a multifasta file, the other is a multifastq. The same sequences are found in both files; only the formats differ. I have subsets of the multifasta file and would like to find all of those sequences in the multifastq file. The subsets are small multifasta files (~100 sequences) taken from the original (~125K sequences).
I feel like grep should be able to do this nicely, but I don't know much about grep. I do know, though, that it has finite memory, so it might not be the best choice for large files such as two 125K-sequence multifasta/q files. I need both the sequence and the Phred quality scores. A sequence in one file looks like this:

>m160505_031746_42156_c100980652550000001823221307061611_s1_p0/30/0_59
AAGAGAGAGATCCTCTTAAGACTCCCAACACGAATTCTCTATTACGCACA
TTATGTATAA

The same sequence in the other file looks like:

@m160505_031746_42156_c100980652550000001823221307061611_s1_p0/30/0_59 RQ=0.771
AAGAGAGAGATCCTCTTAAGACTCCCAACACGAATTCTCTATTACGCACATTATGTATA
+
&%,--.-)..)&$.),.*&"*'.$&(('(-'))*)-#&$(,+-($&$#%%%,*+$*++'

As you can see, the header IDs are very similar, but not identical. Thanks for the help! -Rob

fastq fasta grep perl database • 2.2k views

Updated 8.1 years ago by Brian Bushnell 20k • written 8.1 years ago by roblogan6 ▴ 50

Two supplementary questions.

Are the IDs identical in the fasta and fastq files?
Do you need the full fastq records or just the sequence?

Reply • 8.1 years ago by GenoMax 148k

score 1 · Answer 1 · 2016-11-16

8.1 years ago

Brian Bushnell 20k

With the BBMap package:

filterbyname.sh in=x.fastq out=y.fastq names=z.fasta include
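
One way to sanity-check the output of a filter like this is to compare record counts: the filtered fastq should contain as many records as the fasta subset has headers. A minimal sketch, assuming every fastq record is exactly four lines; the function names are illustrative and not part of BBMap:

```shell
# Hypothetical helpers for a quick record-count comparison.

# Count FASTA records by counting header lines (those starting with ">").
count_fasta_records() {
    grep -c '^>' "$1"
}

# Count FASTQ records, assuming exactly 4 lines per record.
# A non-integer result means the file is not strictly 4-line FASTQ.
count_fastq_records() {
    awk 'END { print NR / 4 }' "$1"
}
```

For example, `count_fasta_records z.fasta` and `count_fastq_records y.fastq` should print the same number if every name in the subset was found.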

score 0 · Answer 2 · 2016-11-16

8.1 years ago

venu 7.1k

You can do something like the following (note: I have not tested it):

sed '/^>/d' fasta_file.fa | while read -r fasta; do grep -A2 -B1 "$fasta" fastq.fq >> new_fastq.fq; done
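
Note that the loop above greps for each fasta sequence line separately. When the fasta sequences are wrapped over several lines, as in the question's example, a short trailing line such as TTATGTATAA can match many unrelated reads. Matching on the header IDs avoids this. A minimal sketch, assuming 4-line fastq records and that the ID is the first whitespace-delimited token of each header (so the trailing " RQ=0.771" is ignored); the function name is illustrative:

```shell
# Hypothetical helper: print the FASTQ records whose ID appears as a
# header in the FASTA subset.
# Assumptions: each FASTQ record is exactly 4 lines, the ID is the first
# whitespace-delimited token of the header, and the FASTA file is not empty.
extract_fastq_by_fasta_ids() {
    awk '
        NR == FNR {                            # first file: the FASTA subset
            if (/^>/) ids[substr($1, 2)] = 1   # store the ID without ">"
            next
        }
        FNR % 4 == 1 {                         # header line of each FASTQ record
            keep = (substr($1, 2) in ids)      # compare the ID without "@"
        }
        keep                                   # print every line of a kept record
    ' "$1" "$2"
}
```

Usage would look like `extract_fastq_by_fasta_ids fasta_file.fa fastq.fq > new_fastq.fq`. The IDs are compared exactly, so an ID ending in `0_59` does not also pull in one ending in `0_590`, and only the ~100 subset IDs are held in memory while the fastq is read once.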
