What is the best way to determine arbitrary-length uniqueness for sequences?
Let's say I have 100 DNA sequences, ranging from 300 bp to 100 kb. I want to know all the regions of each sequence that are unique among this set. The individual sequences contain significant repeat DNA, so I want to know which regions are NOT part of repeat DNA and are not found in the other sequences. I am also interested in finding the unique regions within this set that are not present anywhere in the human genome.
My first thought was to just BLAST each pair of sequences and keep track of the unaligned regions. But this seemed inefficient.
Any help thinking about this problem would be super helpful. Thanks.
The approach you described is the way to do it, even if it seems inefficient. Because your sequences vary so widely in length, other redundancy-removal methods will likely not work.