Extracting fastq records based on their fasta counterparts
8.1 years ago
roblogan6 ▴ 50

I have two files. One is a multifasta file, the other is a multifastq. The same sequences are found in both files; the files are just in different formats. I have subsets of the multifasta file and would like to find all of those sequences in the multifastq file. The subsets are small multifasta files (~100 sequences each) taken from the original (~125K sequences).
I feel like grep should be able to do this nicely, but I don't know much about grep. I do know, though, that it has limited memory, so it might not be the best choice for large files such as two 125K-sequence multifasta/q files. I need both the sequence and the Phred quality scores. A sequence in the fasta file looks like this:

>m160505_031746_42156_c100980652550000001823221307061611_s1_p0/30/0_59
AAGAGAGAGATCCTCTTAAGACTCCCAACACGAATTCTCTATTACGCACA
TTATGTATAA

The same sequence in the other file looks like:

@m160505_031746_42156_c100980652550000001823221307061611_s1_p0/30/0_59 RQ=0.771
AAGAGAGAGATCCTCTTAAGACTCCCAACACGAATTCTCTATTACGCACATTATGTATA
+
&%,--.-)..)&$.),.*&"*'.$&(('(-'))*)-#&$(,+-($&$#%%%,*+$*++'

As you can see, the header IDs are very similar, but not identical. Thanks for the help! -Rob

fastq fasta grep perl database • 2.2k views

Two supplementary questions.

  1. Are the IDs identical in the fasta and fastq files?
  2. Do you need the full fastq records or just the sequence?
8.1 years ago

With the BBMap package:

filterbyname.sh in=x.fastq out=y.fastq names=z.fasta include
8.1 years ago
venu 7.1k

You can do something like the following (note: I've not tested it, and it only works if each sequence sits on a single line in the fasta file):

sed '/^>/d' fasta_file.fa | while read -r fasta; do grep -A2 -B1 "$fasta" fastq.fq >> new_fastq.fq; done
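
The fasta example in the question is wrapped across two lines, so matching on sequence lines may miss records or pick up the wrong ones. Matching on the header IDs is probably more robust. Here is a minimal sketch with awk, assuming that every fastq record is exactly four lines (no wrapped sequence or quality lines) and that the ID is everything between the leading ">" or "@" and the first whitespace, which drops the trailing "RQ=..." field. The function name extract_fastq_by_fasta_ids is hypothetical, made up for this example.

```shell
# Print the FASTQ records whose ID matches a header in a FASTA subset.
# Assumption: every FASTQ record is exactly 4 lines (nothing is wrapped).
extract_fastq_by_fasta_ids() {
    # $1 = subset FASTA, $2 = full FASTQ; matching records go to stdout
    awk '
        # First file (FASTA): store each header ID without ">" and
        # without anything after the first space or tab.
        NR == FNR {
            if (substr($0, 1, 1) == ">") {
                split(substr($0, 2), parts, /[ \t]/)
                wanted[parts[1]] = 1
            }
            next
        }
        # Second file (FASTQ): line 1 of each 4-line record is the header.
        # Deciding by line position means a quality line that happens to
        # start with "@" is never mistaken for a header.
        FNR % 4 == 1 {
            split(substr($0, 2), parts, /[ \t]/)
            keep = (parts[1] in wanted)
        }
        # Print every line of a record whose header matched.
        keep
    ' "$1" "$2"
}
```

Usage would be something like `extract_fastq_by_fasta_ids subset.fasta all.fastq > subset.fastq` (file names are placeholders). The fastq is read once and only the subset IDs are held in memory, so a 125K-record file should not be a problem.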