Question

How to do unsupervised clustering using copy number variation data?

2

8.6 years ago

dr.chenway ▴ 20

Hi all, I want to do unsupervised clustering on segmented copy number variation data (such as that derived from SNP arrays) and then visualize the result. The output should look like the following figure (Figure 1A), in which samples are clustered based on their CNV profiles.

Clustering of copy number (Figure 1A)

I know how to draw a clustered heatmap from a data matrix in R. However, segmented copy number data has a quite different structure. The only tool I know that can visualize this kind of data is IGV, but IGV doesn't provide any clustering options. Can anybody give me some instructions on how to do this? Any help will be greatly appreciated.

SNP CNV R Clustering IGV • 5.6k views

ADD COMMENT • link updated 8.0 years ago by manali.rupji ▴ 30 • written 8.6 years ago by dr.chenway ▴ 20

0

Isn't that described in the methods section of the paper? (If you gave the link to the paper, we could read it.) The key is to get a vector representation of the samples that captures the relevant information. From the figure, each sample appears to be represented by a vector in which each element corresponds to a section of a chromosome and each value is the copy gain/loss of that section.

ADD REPLY • link 8.6 years ago by Jean-Karim Heriche 27k

0

Thanks for your answer. This is the original paper: Comprehensive Molecular Characterization of Papillary Renal-Cell Carcinoma. The authors do mention how they performed the analysis in the supplementary data (page 11 of the supplementary material). However, the description is very brief and does not explain clearly how to do the clustering using copy number data. Thanks again.

ADD REPLY • link 8.6 years ago by dr.chenway ▴ 20

1

As I read it, they represented each tumor as a vector over the regions that the GISTIC2.0 software identified as having copy number variations; each value in the vector is the log2 of the copy number of the corresponding region. They then clustered with:

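# 'data': numeric matrix with one row per tumor and one column per region
d <- dist(data, method = "euclidean")    # pairwise Euclidean distances between tumors
tree <- hclust(d, method = "ward.D2")    # Ward's hierarchical clustering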
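
For anyone who wants to try the two calls above end to end, here is a minimal sketch on made-up data. The toy matrix, the sample and region names, and the choice of two clusters are all assumptions for illustration and are not taken from the paper.

```r
# Toy stand-in for the real data: rows = samples, columns = genomic regions,
# values = log2 copy-number values (all numbers here are made up).
set.seed(1)
data <- rbind(
  matrix(rnorm(5 * 6, mean =  0.8, sd = 0.1), nrow = 5),  # samples with gains
  matrix(rnorm(5 * 6, mean = -0.8, sd = 0.1), nrow = 5)   # samples with losses
)
rownames(data) <- paste0("sample", 1:10)
colnames(data) <- paste0("region", 1:6)

d      <- dist(data, method = "euclidean")  # pairwise distances between samples
tree   <- hclust(d, method = "ward.D2")     # Ward hierarchical clustering
groups <- cutree(tree, k = 2)               # cut the tree into 2 clusters (k is arbitrary here)

print(table(groups))

# Heatmap with rows ordered by the sample dendrogram; columns are kept in
# their given order (Colv = NA) and values are not rescaled.
if (interactive()) {
  heatmap(data, Rowv = as.dendrogram(tree), Colv = NA, scale = "none")
}
```

Passing the `hclust` tree to `heatmap()` via `Rowv` keeps the plotted sample order consistent with the clustering, instead of letting `heatmap()` recluster with its own defaults.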

ADD REPLY • link 8.6 years ago by Jean-Karim Heriche 27k

0

Could you please elaborate a little on "The key is to get a vector representation of the samples" and "they represented each tumor with a vector"? Thanks.

ADD REPLY • link 7.5 years ago by apuhegde ▴ 20

0

"Vector representation of the samples": each sample is represented by a series of numbers, each of which is considered to describe or capture some feature/property of the sample. This set of numbers is called a feature vector in machine learning and related fields. Note that for data mining purposes, all samples have to be described using the same set of features/properties.
"They represented each tumor with a vector": in the case discussed here, each sample is represented by the number of copies it has of specific genomic regions.
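
To make the idea of a shared feature set concrete, here is a rough sketch that turns a per-sample segment table into a sample-by-region matrix. It uses fixed-width bins rather than the GISTIC2.0 regions used in the paper, which is a simplification on my part; the column names, the bin size, the chromosome length, and all values are hypothetical.

```r
# Hypothetical segment table in a .seg-like layout (values are made up).
seg <- data.frame(
  sample   = c("A", "A", "B", "B", "B"),
  chrom    = c("1", "1", "1", "1", "1"),
  start    = c(0, 50, 0, 30, 80),
  end      = c(50, 100, 30, 80, 100),
  seg.mean = c(0.0, 0.6, -0.5, 0.0, 0.4)
)

# Shared features: fixed-width bins on chromosome 1 (assumed length 100).
bin_size  <- 25
bin_start <- seq(0, 75, by = bin_size)
bin_end   <- bin_start + bin_size

# Length-weighted mean seg.mean of one sample's segments in each bin;
# bins with no segment coverage are left as NA.
sample_vector <- function(s) {
  sapply(seq_along(bin_start), function(i) {
    overlap <- pmin(s$end, bin_end[i]) - pmax(s$start, bin_start[i])
    overlap <- pmax(overlap, 0)
    if (sum(overlap) == 0) NA else sum(overlap * s$seg.mean) / sum(overlap)
  })
}

# One row per sample, one column per bin: every sample now has the same features.
mat <- t(sapply(split(seg, seg$sample), sample_vector))
colnames(mat) <- paste0("chr1:", bin_start, "-", bin_end)
print(mat)
```

Each row of `mat` is one sample's feature vector, so a matrix of this shape is the kind of input that `dist()` expects. With real data the bins would have to span all chromosomes, and any `NA` bins would need to be handled before computing distances.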

ADD REPLY • link 7.5 years ago by Jean-Karim Heriche 27k

0

I wish to perform a clustering analysis on the long-insert whole genome sequencing CNV data from the Multiple Myeloma database. The download only includes the .seg file, and I believe the GISTIC2.0 software also requires a markers file.

1) Is GISTIC2.0 appropriate for CNV analysis of whole genome sequencing data? If not, what tools could I use?
2) How should I account for samples that have no copy gain or copy loss, or that are copy neutral?

ADD REPLY • link 8.0 years ago by manali.rupji ▴ 30
