Question

Finding most frequent k-mers in Fastq file

0

8.8 years ago

murat • 0

Hello everyone,

I have no background in bioinformatics, and I apologize if this question seems obvious, but I've done a lot of research and couldn't find an answer.

Let's say I have a FASTQ file and I need to find the 25 most frequent k-mers (k=30) in it. What should the algorithmic approach be?

Let's say the file looks something like this:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

@SEQ_ID
GATTTGGGAGTAAATCCATTTGTTCAACTCACAGTTTGTTCAAAGCAGTATCGATCAAAT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

What I do is this: I read the sequence line after the first @SEQ_ID and extract all possible 30-mers (as substrings) from it, then I move on to the sequence after the second @SEQ_ID and extract the 30-mers from there as well.

However, I couldn't find any information on whether I should concatenate these two strings and look for k-mers in the combined string.

In other words, should I count (for example) the last 10 characters of the first sequence plus the first 20 characters of the second sequence as a k-mer?

Thank you

kmer sequencing • 4.5k views

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 8.8 years ago by murat • 0

Ram · Answer 1 · 2016-02-01

2

8.8 years ago

Matt Shirley 10k

You should likely not be concatenating the sequences together. Each read is a substring of a larger DNA template that is either being read from one or both ends during sequencing. If you concatenate the reads together you'll be creating a junction that does not exist in the original DNA sequence.

Regarding k-mer counting, I would suggest looking at khmer.
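For a small file, the per-read approach described in the question can be sketched in plain Python. This is only an illustration, not how khmer works internally. The function names `iter_fastq_sequences` and `count_kmers` are made up for this example, and the sketch assumes every record is exactly four lines (no wrapped sequences). It also treats a k-mer and its reverse complement as different strings, and it keeps every distinct k-mer in memory, so it will not scale to large datasets.

```python
from collections import Counter
import io


def iter_fastq_sequences(handle):
    """Yield the sequence line of each FASTQ record.

    Assumes strict 4-line records (header, sequence, '+', quality);
    blank lines between records are ignored.
    """
    nonblank = (line.strip() for line in handle if line.strip())
    # Pulling from the same iterator four times groups lines into records.
    for _header, seq, _plus, _qual in zip(nonblank, nonblank, nonblank, nonblank):
        yield seq


def count_kmers(handle, k):
    """Count k-mers within each read separately (reads are never concatenated)."""
    counts = Counter()
    for seq in iter_fastq_sequences(handle):
        # A read of length n contributes n - k + 1 k-mers (none if n < k).
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


# The two example reads from the question, held in memory for the demo.
EXAMPLE_FASTQ = """@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

@SEQ_ID
GATTTGGGAGTAAATCCATTTGTTCAACTCACAGTTTGTTCAAAGCAGTATCGATCAAAT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
"""

counts = count_kmers(io.StringIO(EXAMPLE_FASTQ), k=30)
# Print the 25 most frequent 30-mers with their counts.
for kmer, n in counts.most_common(25):
    print(kmer, n)
```

For a real file, pass an open file handle instead of the `io.StringIO` object, e.g. `with open(path) as fh: count_kmers(fh, 30)`, where `path` is your FASTQ file.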

ADD COMMENT • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by Matt Shirley 10k

0

Thanks for the answer, Matt. Every text editor on my Mac freezes when I try to open the output that khmer produces. Any ideas?

ADD REPLY • link 8.8 years ago by murat • 0

0

man less?............

ADD REPLY • link 8.8 years ago by Matt Shirley 10k

0

It produces a binary file. Makes no sense.

ADD REPLY • link 8.8 years ago by murat • 0

0

If you tell me which script you ran and with what parameters, I might be able to help. If you're using the load-into-counting.py script, then you need to process the output k-mer graph (probably your binary file) using the abundance-dist.py script.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by Matt Shirley 10k

0

After reading the docs and this answer, it's clearer now. Thank you very much for the help, Matt!

ADD REPLY • link 8.8 years ago by murat • 0