Question

Find unique sequences among a set of FASTA entries

0

3.7 years ago

j.matt.franklin • 0

What is the best way to determine arbitrary-length uniqueness for sequences?

Let's say I have 100 DNA sequences, ranging from 300 bp to 100 kb. I want to know all the regions of each sequence that are unique among this set. The individual sequences contain a significant amount of repetitive DNA, so I want to know which regions are NOT part of repeats and are not found in the other sequences. I am also interested in finding the unique regions within this set that are not present anywhere in the human genome.

My first thought was to just BLAST each pair of sequences and keep track of the unaligned regions, but this seemed inefficient.

Any help thinking about this problem would be super helpful. Thanks.

alignment BLAST genome repeat • 1.5k views

updated 3.7 years ago by Mensur Dlakic ★ 28k • written 3.7 years ago by j.matt.franklin • 0

1

The approach you described is the way to do it, even if it seems inefficient. Because your sequences vary so widely in length, other redundancy-removal methods will likely not work.
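To make the "keep track of unaligned regions" step concrete, here is a minimal sketch. It assumes you have already collected, for one query sequence, the query start and end coordinates of every hit against the other sequences (for example, the qstart and qend columns of BLAST tabular output, 1-based and inclusive, with self-hits removed). The function name `unaligned_regions` is made up for illustration.

```python
def unaligned_regions(hits, query_len):
    """Return 1-based inclusive (start, end) intervals of the query
    that are not covered by any hit.

    hits: iterable of (qstart, qend) pairs, 1-based inclusive.
    query_len: length of the query sequence.
    """
    gaps = []
    next_free = 1  # first query position not yet known to be covered
    # Normalize each pair so start <= end, then sweep left to right.
    for start, end in sorted((min(s, e), max(s, e)) for s, e in hits):
        if start > next_free:
            gaps.append((next_free, start - 1))
        next_free = max(next_free, end + 1)
    if next_free <= query_len:
        gaps.append((next_free, query_len))
    return gaps


# Toy example: three hits on a 60 bp query.
print(unaligned_regions([(10, 20), (15, 30), (40, 50)], 60))
```

Running this once per query, over the hits from a single all-vs-all search, gives the candidate unique regions without having to BLAST each pair separately.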

ADD REPLY • link 3.7 years ago by GenoMax 147k

score 0 · Answer 1 · 2021-03-01

0

3.7 years ago

Mensur Dlakic ★ 28k

I suggest counting k-mers. Pick a reasonable size, say k = 31, and find all k-mers of that size that are unique. Then map them back to your sequences and look for clusters of unique k-mers, which will mark unique regions. Alternatively, start with longer k-mers and skip the clustering step.
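As a rough illustration of the counting step, the sketch below counts k-mers across the whole set and keeps those seen exactly once. It is plain Python for small inputs, not a replacement for a dedicated k-mer counter. Treating a k-mer and its reverse complement as the same k-mer, and skipping windows that contain N, are assumptions on my part; the names `canonical` and `unique_kmers` are invented for this example.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_DNA = frozenset("ACGT")


def canonical(kmer):
    """Return the smaller of a k-mer and its reverse complement
    (assumption: both strands count as the same k-mer)."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def unique_kmers(seqs, k=31):
    """Return the set of canonical k-mers that occur exactly once
    across all sequences in `seqs` (a dict of name -> sequence)."""
    counts = Counter()
    for seq in seqs.values():
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if _DNA.issuperset(kmer):  # skip windows with N or other codes
                counts[canonical(kmer)] += 1
    return {kmer for kmer, n in counts.items() if n == 1}


# Toy example with k = 4; AAAA is shared, so it is not unique.
toy = {"a": "AAAAC", "b": "AAAAG"}
print(sorted(unique_kmers(toy, k=4)))
```

A k-mer repeated within a single sequence is also counted more than once here, so it is excluded, which matches the goal of avoiding repeat DNA.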


0

This may help:

https://github.com/OpenGene/UniqueKMER

3.7 years ago by Mensur Dlakic ★ 28k

0

kmercountexact.sh from the BBMap suite can also be used for this.

What I am not sure about is how one maps the k-mers back to the sequences, since that is going to be a significant bookkeeping task. Unless I am missing something simple.
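One possible way to handle the bookkeeping, sketched under assumptions: given a set of unique k-mers from any counter, slide a window along each sequence, test membership in the set, and merge overlapping or adjacent hits into intervals as you go. The function name `unique_regions` is hypothetical, and the set is assumed to hold k-mers in the same form as the windows being looked up (if the counter reported canonical k-mers, canonicalize each window before the lookup).

```python
def unique_regions(seq, unique, k):
    """Return 0-based half-open (start, end) intervals of `seq`
    covered by runs of k-mers found in the set `unique`."""
    regions = []
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] not in unique:
            continue
        start, end = i, i + k
        if regions and start <= regions[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            regions[-1][1] = end
        else:
            regions.append([start, end])
    return [tuple(r) for r in regions]


# Toy example with k = 4 and a hand-picked set of "unique" k-mers.
seq = "AAAACCCCGGGG"
print(unique_regions(seq, {"AACC", "ACCC", "GGGG"}, 4))
```

Only the current interval per sequence has to be kept in memory, so the bookkeeping is a single linear pass per sequence. Note that the bases at the edges of an interval are only unique in the context of their k-mer, so short intervals are worth filtering out.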