I have two files. One is a multifasta file, the other is a multifastq. Both files contain the same sequences, just in different formats. I have subsets of the multifasta file and would like to find all of those sequences in the multifastq file. Each subset is a small multifasta file (~100 sequences) taken from the original (~125K sequences).
I feel like grep should be able to do this nicely, but I don't know much about grep. I am also not sure it is the best tool for files this large (two 125K-sequence multifasta/q files), since I understand it can run into memory limits.
For each match, I need both the sequence and the Phred quality scores.
A sequence in one file looks like this:
>m160505_031746_42156_c100980652550000001823221307061611_s1_p0/30/0_59
AAGAGAGAGATCCTCTTAAGACTCCCAACACGAATTCTCTATTACGCACA
TTATGTATAA
The same sequence in the other file looks like:
@m160505_031746_42156_c100980652550000001823221307061611_s1_p0/30/0_59 RQ=0.771
AAGAGAGAGATCCTCTTAAGACTCCCAACACGAATTCTCTATTACGCACATTATGTATA
+
&%,--.-)..)&$.),.*&"*'.$&(('(-'))*)-#&$(,+-($&$#%%%,*+$*++'
As you can see, the header IDs are very similar but not identical: the multifastq header starts with @ instead of > and carries an extra RQ=0.771 field after the ID. Thanks for the help! -Rob
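To make the goal concrete, here is a rough Python sketch of the matching I have in mind. It is only an illustration under a few assumptions: the ID is the first whitespace-delimited token of each header (so the trailing RQ=... field is ignored), every multifastq record is exactly four lines as in the example above, and the function names and file paths are made up for this post.

```python
def read_fasta_ids(fasta_path):
    """Collect the IDs (first token after '>') from a FASTA file."""
    ids = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    ids.add(fields[0])
    return ids


def iter_fastq_records(fastq_path):
    """Yield (header, sequence, quality) tuples from a FASTQ file.

    Assumption: each record is exactly four lines (no wrapped sequences).
    """
    with open(fastq_path) as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file (or a trailing blank line)
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # skip the '+' separator line
            quality = handle.readline().rstrip("\n")
            yield header, sequence, quality


def subset_fastq(fasta_path, fastq_path, out_path):
    """Write the FASTQ records whose ID appears in the FASTA subset.

    Returns the number of records written.
    """
    wanted = read_fasta_ids(fasta_path)
    written = 0
    with open(out_path, "w") as out:
        for header, sequence, quality in iter_fastq_records(fastq_path):
            fields = header[1:].split()
            # Compare only the ID, ignoring extras such as 'RQ=0.771'.
            if fields and fields[0] in wanted:
                out.write(f"{header}\n{sequence}\n+\n{quality}\n")
                written += 1
    return written
```

Calling subset_fastq("subset.fasta", "all.fastq", "subset.fastq") (hypothetical file names) would stream the large file once and keep only the ~100 IDs in memory. Reading records four lines at a time also avoids mistaking a quality line that happens to start with @ for a header, which is one reason I am unsure a plain grep is safe here.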
Two supplementary questions.