find unique sequences among a set of fasta entries
3.7 years ago

What is the best way to determine arbitrary-length uniqueness for sequences?

Let's say I have 100 DNA sequences, ranging from 300 bp to 100 kb. I want to know all the regions of each sequence that are unique within this set. The individual sequences contain a significant amount of repeat DNA, so I want to know which regions are NOT part of repeat DNA and are not found in the other sequences. I am also interested in finding the unique regions within this set that are not present anywhere in the human genome.

My first thought was to just BLAST each pair of sequences and keep track of the unaligned regions, but this seemed inefficient.

Any help thinking about this problem would be super helpful. Thanks.

alignment BLAST genome repeat • 1.5k views

The way you described would be the way to do it, even if it seems inefficient. Since your sequences range so widely in size, other redundancy methods will likely not work.
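
For the "keep track of unaligned regions" part, here is a minimal sketch of the bookkeeping. It assumes you have already parsed the query start and end coordinates of every hit for one query (for example the `qstart` and `qend` columns of BLAST tabular output) into a list of 1-based, inclusive pairs. The function name `unaligned_regions` is just an illustration, not part of any BLAST tooling.

```python
def unaligned_regions(seq_len, hits):
    """Return 1-based inclusive (start, end) regions of a query of length
    seq_len that are not covered by any hit interval in `hits`."""
    gaps = []
    next_free = 1  # first position not yet covered by a hit
    # Normalise each interval so start <= end, then sweep left to right.
    for start, end in sorted((min(s, e), max(s, e)) for s, e in hits):
        if start > next_free:
            gaps.append((next_free, start - 1))
        next_free = max(next_free, end + 1)
    if next_free <= seq_len:
        gaps.append((next_free, seq_len))
    return gaps


# Hypothetical hits against a 100 bp query
print(unaligned_regions(100, [(10, 20), (15, 30), (50, 60)]))
```

You would pool the hits from all pairwise comparisons (including self-hits other than the trivial full-length one, to catch internal repeats) before calling this once per query.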

3.7 years ago
Mensur Dlakic ★ 28k

I suggest you try counting k-mers. Pick a decent size, say 31, and find all unique k-mers of that size. Once you do, map them back to your sequences and look for clusters of unique k-mers, which will signify unique sequences. Or you can start with longer k-mers and not look for clusters at all.
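
To make the counting step concrete, here is a rough pure-Python sketch. It assumes the sequences are already loaded into a dict of name to sequence string, and it treats a k-mer and its reverse complement as the same k-mer (that is my assumption; drop `canonical` if strand matters to you). For 100 sequences of up to 100 kb this fits in memory easily; for the whole human genome you would want a dedicated k-mer counter instead.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, kmer.translate(_COMPLEMENT)[::-1])


def count_kmers(sequences, k=31):
    """Count canonical k-mers across all sequences (dict of name -> sequence)."""
    counts = Counter()
    for seq in sequences.values():
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip k-mers containing N or other ambiguity codes.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def unique_kmers(counts):
    """K-mers seen exactly once in the whole set, so neither repeated
    within a sequence nor shared between sequences."""
    return {kmer for kmer, n in counts.items() if n == 1}


# Toy example with k=4
toy = {"a": "ACGTAC", "b": "ACGTTT"}
print(sorted(unique_kmers(count_kmers(toy, k=4))))
```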

kmercountexact.sh from the BBMap suite can also be used for this.

What I am not sure of is how one maps the k-mers back to the sequences, since that is going to be a significant bookkeeping task. Unless I am missing something simple.
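
One way to keep the bookkeeping manageable, sketched below under my own assumptions, is to not map k-mers at all: count first, then scan each sequence a second time and look up each k-mer's count at the position where it occurs. Runs of consecutive start positions whose k-mers occur exactly once are merged into regions. The function name `unique_regions` and the 0-based, half-open coordinates are choices made for this sketch, and ambiguity codes such as N are not treated specially here.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def _canonical(kmer):
    """Smaller of a k-mer and its reverse complement (strand-independent key)."""
    return min(kmer, kmer.translate(_COMPLEMENT)[::-1])


def unique_regions(sequences, k=31):
    """Return {name: [(start, end), ...]} with 0-based, half-open regions
    spanned by runs of consecutive k-mers that occur once in the whole set."""
    seqs = {name: s.upper() for name, s in sequences.items()}
    # Pass 1: count every canonical k-mer across the whole set.
    counts = Counter(
        _canonical(s[i:i + k]) for s in seqs.values() for i in range(len(s) - k + 1)
    )
    # Pass 2: rescan each sequence; the position is known from the loop index.
    regions = {}
    for name, s in seqs.items():
        merged = []
        for i in range(len(s) - k + 1):
            if counts[_canonical(s[i:i + k])] != 1:
                continue
            if merged and i == merged[-1][1] - k + 1:
                merged[-1][1] = i + k  # extends the current run by one base
            else:
                merged.append([i, i + k])  # starts a new run
        regions[name] = [tuple(r) for r in merged]
    return regions


# Toy example with k=4
print(unique_regions({"a": "ACGTAC", "b": "ACGTTT"}, k=4))
```

Short isolated regions (a single k-mer wide) are probably noise, so filtering on a minimum region length afterwards would play the role of "looking for clusters".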

Thanks! This is helpful.
