Question

Non-Random Clusters Of Markers In Genomic Data

3

13.1 years ago

didymos ▴ 210

I have count data describing how many markers are connected with each chromosome position:

[0,0,0,1,0,0,0,2,0,0,0,1,1,....]

However, I have three to four orders of magnitude fewer markers than available positions, so the data are mostly zeros.

My question is how to find clusters of markers with a non-random distribution, e.g. regions that are denser than expected under random positioning.

I have calculated the distribution of pairwise distances between markers and compared it with distances simulated from a random distribution, and the two are different.
I assume that markers are localized in both a random and a non-random fashion, but I am only interested in the non-random clusters.
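
For readers who want to reproduce this kind of comparison, here is a minimal sketch in Python (standard library only). It is not the code used for the question: it assumes the null model is uniform placement of the same number of markers, it uses gaps between consecutive markers rather than all pairwise distances, and the function names (`counts_to_positions`, `adjacent_gaps`, `gap_median_pvalue`) are made up for illustration.

```python
import random

def counts_to_positions(counts):
    """Expand a per-position count vector into a sorted list of marker positions."""
    return [i for i, c in enumerate(counts) for _ in range(c)]

def adjacent_gaps(positions):
    """Distances between consecutive markers (0 when two markers share a position)."""
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]

def median(values):
    s = sorted(values)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

def gap_median_pvalue(counts, n_sim=1000, seed=0):
    """Empirical p-value: how often uniform placement gives a median gap
    at least as small as the observed one (assumed null: uniform positions)."""
    rng = random.Random(seed)
    length = len(counts)
    observed = counts_to_positions(counts)
    obs_median = median(adjacent_gaps(observed))
    hits = 0
    for _ in range(n_sim):
        # Same number of markers, dropped uniformly at random.
        sim = [rng.randrange(length) for _ in observed]
        if median(adjacent_gaps(sim)) <= obs_median:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_sim + 1)
```

A small p-value only says that the markers as a whole are more tightly spaced than uniform placement would give; it does not say where the clusters are.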

I am also looking at how similar my problem is to other bioinformatics approaches in sequence analysis (SNPs, HMMs in CpG island discovery, ...) for some ideas...

sequence hmm random genomics r • 2.3k views

updated 13.1 years ago by brentp 24k • written 13.1 years ago by didymos ▴ 210

score 1 · Answer 1 · 2011-10-07

This is an interesting problem. I don't have a great solution, but here's what I've tried in the past. Hopefully others have a more rigorous approach...

The distribution of "stuff" in the genome is already clustered so finding other stuff that's clustered in a different fashion is not trivial (or easy, depending on how you look at it).

You could do a moving average of the count data and look for peaks. Then it's a matter of determining a good window size. You could also use bins (overlapping or otherwise) and find those with a high sum. You could then compare that to randomly generated data binned in the same way.
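
A rough sketch of the binning idea, in Python with only the standard library. The assumptions are mine: the random data are the same number of markers placed uniformly, and a window counts as "dense" if its sum exceeds the (1 - alpha) quantile of the largest window sum seen in the simulations, which is one simple way to account for testing many windows at once. The names `window_sums`, `null_max_sums` and `dense_windows` are hypothetical.

```python
import random

def window_sums(counts, width, step=1):
    """(start, sum) for every window of `width` positions, moved by `step`."""
    prefix = [0]
    for c in counts:
        prefix.append(prefix[-1] + c)
    return [(start, prefix[start + width] - prefix[start])
            for start in range(0, len(counts) - width + 1, step)]

def null_max_sums(n_markers, length, width, step, n_sim, rng):
    """Sorted largest window sums from uniformly placed markers (assumed null)."""
    maxima = []
    for _ in range(n_sim):
        sim = [0] * length
        for _ in range(n_markers):
            sim[rng.randrange(length)] += 1
        maxima.append(max(total for _, total in window_sums(sim, width, step)))
    return sorted(maxima)

def dense_windows(counts, width, step=1, n_sim=200, alpha=0.05, seed=0):
    """Windows whose sum exceeds the (1 - alpha) quantile of the null maximum."""
    rng = random.Random(seed)
    maxima = null_max_sums(sum(counts), len(counts), width, step, n_sim, rng)
    threshold = maxima[min(len(maxima) - 1, int((1 - alpha) * len(maxima)))]
    return [(start, total)
            for start, total in window_sums(counts, width, step)
            if total > threshold]
```

The window width still has to be chosen by hand; running this over a few widths and merging overlapping hits is one way to see how sensitive the result is to that choice.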

For more realism, you'll want the randomly generated data to have the same auto-correlation that you expect to see in the genome--whatever that might be. I suppose you could report significance with respect to each level of auto-correlation that you use in generating your random data.
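
One possible way to sketch this, again in standard-library Python and with heavy assumptions: the null track is generated by a two-state Markov chain, which gives a binary (marker or no marker) sequence with a chosen stationary density and lag-1 autocorrelation `rho`. Real genomic autocorrelation is certainly more complicated, positions with more than one marker are not modeled, and the function names are invented for this example. The test statistic is the largest window sum, and a p-value is reported for each `rho` tried.

```python
import random

def simulate_autocorrelated(length, density, rho, rng):
    """Binary track from a two-state Markov chain with stationary mean
    `density` and lag-1 autocorrelation `rho` (0 <= rho < 1)."""
    p_after_marker = density + rho * (1 - density)  # P(marker | previous marker)
    p_after_empty = density * (1 - rho)             # P(marker | previous empty)
    track = []
    state = 1 if rng.random() < density else 0
    for _ in range(length):
        track.append(state)
        p = p_after_marker if state else p_after_empty
        state = 1 if rng.random() < p else 0
    return track

def max_window_sum(track, width):
    """Largest sum over all windows of `width` consecutive positions."""
    current = sum(track[:width])
    best = current
    for i in range(width, len(track)):
        current += track[i] - track[i - width]
        best = max(best, current)
    return best

def pvalue_by_rho(counts, width, rhos=(0.0, 0.2, 0.5), n_sim=200, seed=0):
    """Empirical p-value of the observed densest window, per autocorrelation level."""
    rng = random.Random(seed)
    density = min(1.0, sum(counts) / len(counts))
    observed = max_window_sum(counts, width)
    result = {}
    for rho in rhos:
        hits = sum(
            max_window_sum(
                simulate_autocorrelated(len(counts), density, rho, rng), width
            ) >= observed
            for _ in range(n_sim)
        )
        result[rho] = (hits + 1) / (n_sim + 1)
    return result
```

With `rho = 0` this reduces to independent positions; as `rho` grows, the simulated tracks contain longer runs of markers on their own, so a cluster has to be denser to stand out.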