Clustering of CNV genomic coordinates, take 2
6.8 years ago
Sakti ▴ 530

Dear Biostars,

After searching the internet for quite a while, I have yet to find an easy solution for clustering of human genomic coordinates. This post asked the same question a couple of years ago, but there was no answer as to how one could simply cluster a bed file and be able to graph it in IGV (or any of your favorite genome graphers), and make it look like this figure.

Here's the breakdown of the problem at hand:

Data type: Human CNV data detected by both array and sequencing. The output from these analyses is a .bed file with the CNV positions, similar to this:

chr    start    end    cnv_id    sample_name    sample_category

Clustering type: anything goes, from k-means to any other unsupervised method.

Question: Are there samples that preferentially cluster together because they share very similar CNV positions? Is this clustering of CNVs meaningful given the sample category (i.e. sick vs normal)?

I have read about CNVTools, which to my understanding needs probe intensities; I could never get iCluster to work; IGVTools doesn't have a clustering function; I'm unsure seqMINER or any other TSS/ChIP clustering algorithm will work with longer stretches of DNA sequence; and everything I have read about clustering methods in R revolves around single genes/values and not genomic coordinates.

That is why I appeal to the Biostars wisdom once more. I'd be grateful if someone could recommend a solution to this problem.

Thanks!

Sakti

cluster analysis genomic coordinates cnv bed • 2.3k views

What data are you trying to cluster? What is the assay and what is the question you want to answer? Are you dealing with copy number data, or something else? Sequence-based, or array?


Hi Sean, thanks for commenting. I have updated the post with the answers to your questions.

6.8 years ago

There is no general approach to dealing with these types of data that I know of, and you seem to be asking multiple questions of your data. That said, one approach you might find useful is to define a set of genomic "bins" across the genome and then build a matrix of SAMPLE x BIN. Each cell of the matrix holds TRUE (or 1) if the sample has a CNV that overlaps that genomic bin, and FALSE (or 0) otherwise. Tools like bedtools or GenomicRanges might help with that task.
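To make the binning idea concrete, here is a minimal sketch in plain Python rather than bedtools or GenomicRanges. It assumes the six-column, whitespace-delimited layout from the question (chr, start, end, cnv_id, sample_name, sample_category) with 0-based, half-open BED coordinates. The 1 Mb bin width and the function names `bins_for_interval` and `build_matrix` are made up for illustration, and only bins hit by at least one CNV become columns.

```python
from collections import defaultdict

BIN_SIZE = 1_000_000  # assumed bin width (1 Mb); tune to the size range of your CNVs


def bins_for_interval(start, end, bin_size=BIN_SIZE):
    """Indices of the fixed-width bins overlapped by a BED interval [start, end)."""
    if end <= start:
        return range(0)
    return range(start // bin_size, (end - 1) // bin_size + 1)


def build_matrix(bed_lines, bin_size=BIN_SIZE):
    """Return (samples, bins, matrix); matrix[i][j] is 1 if sample i has a CNV in bin j."""
    hits = defaultdict(set)  # sample_name -> set of (chrom, bin_index)
    for line in bed_lines:
        fields = line.split()
        # Skip blank lines, headers and track lines: anything without a numeric start.
        if len(fields) < 5 or not fields[1].isdigit():
            continue
        chrom, start, end, _cnv_id, sample = fields[:5]
        for b in bins_for_interval(int(start), int(end), bin_size):
            hits[sample].add((chrom, b))
    samples = sorted(hits)
    # Only bins touched by at least one CNV become columns.
    bins = sorted(set().union(*hits.values())) if hits else []
    matrix = [[1 if b in hits[s] else 0 for b in bins] for s in samples]
    return samples, bins, matrix


# Toy input, invented for illustration only.
example = [
    "chr\tstart\tend\tcnv_id\tsample_name\tsample_category",
    "chr1\t500000\t1500000\tcnv1\tS1\tsick",
    "chr1\t1200000\t1300000\tcnv2\tS2\tnormal",
    "chr2\t0\t1000000\tcnv3\tS1\tsick",
]
samples, bins, matrix = build_matrix(example)
for name, row in zip(samples, matrix):
    print(name, row)
```

For real data you would read the lines from the .bed file instead of the toy list. The bin width is a trade-off: smaller bins resolve CNV boundaries better but give a sparser matrix.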

From there, more standard matrix-based approaches are available for clustering and statistical testing.
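As one example of a "standard matrix-based approach", the sketch below clusters the rows of a 0/1 sample-by-bin matrix using Jaccard distance and a naive average-linkage agglomeration. Both choices are my assumptions, not something prescribed above, and the toy matrix is invented. In practice you would more likely reach for `hclust` in R or `scipy.cluster.hierarchy.linkage` in Python; this pure-Python version only shows the mechanics.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |intersection| / |union| for two equal-length 0/1 rows."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if either == 0 else 1.0 - both / either


def average_linkage(matrix, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain.

    Returns a list of clusters, each a sorted list of row indices.
    """
    dist = {
        (i, j): jaccard_distance(matrix[i], matrix[j])
        for i, j in combinations(range(len(matrix)), 2)
    }
    clusters = [[i] for i in range(len(matrix))]

    def cluster_distance(c1, c2):
        # Average of all pairwise sample distances between the two clusters.
        pairs = [(min(i, j), max(i, j)) for i in c1 for j in c2]
        return sum(dist[p] for p in pairs) / len(pairs)

    while len(clusters) > max(n_clusters, 1):
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: cluster_distance(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy sample-by-bin matrix: rows are samples, columns are genomic bins.
toy = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
print(average_linkage(toy, 2))
```

Once each sample has a cluster label, you can cross-tabulate cluster membership against sample_category (sick vs normal) and run a contingency-table test such as Fisher's exact test to see whether the clusters track the phenotype.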


Thanks a lot Sean! I was pondering the genomic bins solution, which seems to be what will work in the end for my data. Thanks!!
