Hello everyone,
I have zero knowledge of bioinformatics, and I am sorry if this question seems obvious, but I've done a lot of research and couldn't find an answer.
Let's say I have a FASTQ file and I need to find the 25 most frequent k-mers (k=30) in it. What should the algorithmic approach be?
Let's say the file is something like this:
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
@SEQ_ID
GATTTGGGAGTAAATCCATTTGTTCAACTCACAGTTTGTTCAAAGCAGTATCGATCAAAT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
What I do is read the sequence line after the first @SEQ_ID and find all possible 30-mers (as substrings) in it, then move on to the sequence line after the second @SEQ_ID and find the 30-mers there as well.
However, I couldn't find any information on whether I should concatenate these two strings and look for k-mers in the result.
In other words, should I count (for example) the last 10 characters of the first sequence followed by the first 20 characters of the second sequence as a k-mer?
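To make the approach concrete, here is a minimal Python sketch of the per-read counting described above. It is only an illustration under some assumptions: every record is exactly four lines (as in the example), the whole table of counts fits in memory, and reverse complements are not merged. The function names (read_sequences, count_kmers, top_kmers) are made up for this post. Each read is handled separately, so no k-mer spans two reads.

```python
from collections import Counter


def read_sequences(handle):
    """Yield the sequence line of each record, assuming strict 4-line FASTQ records."""
    for line_number, line in enumerate(handle):
        if line_number % 4 == 1:  # 2nd line of every 4-line record is the sequence
            yield line.strip()


def count_kmers(sequences, k=30):
    """Count every k-length substring of each sequence; reads are never joined."""
    counts = Counter()
    for seq in sequences:
        # A read shorter than k gives an empty range and contributes nothing.
        for start in range(len(seq) - k + 1):
            counts[seq[start:start + k]] += 1
    return counts


def top_kmers(path, k=30, n=25):
    """Return the n most frequent k-mers in the FASTQ file at `path`."""
    with open(path) as handle:
        return count_kmers(read_sequences(handle), k).most_common(n)
```

Calling top_kmers("reads.fastq") (the file name is a placeholder) returns a list of (k-mer, count) pairs. For a large file an in-memory Counter will likely be too big, which is where a dedicated k-mer counting tool comes in.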
Thank you
Thanks for the answer, Matt. Any text editor on Mac just freezes when trying to open the output khmer produces. Any ideas?
man less?
It produces a binary file. Makes no sense.
If you provide information about which script you ran and with what parameters, I might be able to help. If you're using the load-into-counting.py script, then you need to process the output mer graph (probably your binary file) using the abundance-dist.py script.
After reading the docs and this answer, it's clearer. Thank you very much for the help, Matt!