Question

classifying samples by TCGA signature

5.1 years ago

garrettbullivant ▴ 10

Hi all,

I have some RNA-seq samples from multiple glioblastoma tumours that I'm now trying to classify according to a specific gene signature (from Verhaak et al., 2010) using R. The gene signature is reported as a gene list with specific centroids for each of the 4 clusters (https://api.gdc.cancer.gov/data/941f81a1-05d7-4f84-80ec-534b8dc1ebac). I'm wondering how I can use this signature to classify my samples in R? Would it involve some sort of k-nearest neighbours method?

Additionally, the signature was identified using microarray data, but I am classifying RNA-seq data. Is there any sort of adjustment I should make to the signature to account for this?

Thanks in advance!

R RNA-Seq • 1.1k views

updated 5.1 years ago by Kevin Blighe 89k • written 5.1 years ago by garrettbullivant ▴ 10

Hi garrettbullivant,

You may already have seen this paper, but if not, it might help with the last part of your question. The authors took the RSEM RNA-seq values and the microarray values, standardized them, and centered them around the mean. (I have not done this myself yet; I am still trying to recreate some parts of this paper as an exercise, but when I saw this post I thought it might help you.) A description of this can be found in the BRS section of the supplement.

DOI: 10.1158/1078-0432.CCR-18-2953
https://clincancerres.aacrjournals.org/content/25/10/3141

Comprehensive Genetic Characterization of Human Thyroid Cancer Cell Lines: A Validated Panel for Preclinical Studies

Iñigo Landa, Nikita Pozdeyev, Christopher Korch, Laura A. Marlow, Robert C. Smallridge, John A. Copland, Ying C. Henderson, Stephen Y. Lai, Gary L. Clayman, Naoyoshi Onoda, Aik Choon Tan, Maria E.R. Garcia-Rendueles, Jeffrey A. Knauf, Bryan R. Haugen, James A. Fagin and Rebecca E. Schweppe
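
To make the standardization idea concrete, here is a minimal sketch of per-gene z-scoring applied separately to each platform. It is written in plain Python only so that it is self-contained; in R the same step is usually done with `scale()` on a samples-by-genes matrix. The gene names and values are hypothetical toy data, and this is not a reproduction of the exact procedure in the paper's supplement.

```python
from statistics import mean, pstdev

def zscore_genes(expr):
    """Standardize each gene across samples: subtract the mean, divide by the SD."""
    scaled = {}
    for gene, values in expr.items():
        mu = mean(values)
        sd = pstdev(values)
        # A gene with zero variance carries no information; map it to zeros.
        scaled[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return scaled

# Hypothetical toy values (assumed already log-transformed):
# one dict per platform, gene -> expression across that platform's samples.
rnaseq = {"GENE_A": [2.0, 4.0, 6.0], "GENE_B": [10.0, 10.5, 11.0]}
array = {"GENE_A": [7.0, 8.0, 9.0], "GENE_B": [12.0, 12.1, 12.2]}

# Each platform is scaled on its own, so platform-specific offsets and
# dynamic ranges are removed before the two data sets are compared.
rnaseq_z = zscore_genes(rnaseq)
array_z = zscore_genes(array)

print(rnaseq_z["GENE_A"])
print(array_z["GENE_A"])
```

After scaling, every gene has mean 0 and unit SD within its platform, so values from the two technologies sit on a comparable scale. This only makes sense if each cohort contains a reasonable mix of the subtypes, since the scaling is relative to the cohort itself.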

4.6 years ago by geneart$$ ▴ 50

score 1 · Answer 1 · 2020-07-01

If they have reported centroids, then I imagine that they have used PAM (partitioning around medoids) clustering, and not k-means or k-NN, but you can check the citation. So, you could, in effect, simply subset the TCGA GBM samples for these genes and then try to identify ideal clusters via various metrics, like:

Jaccard Index
silhouette method
consensus clustering
elbow method
gap statistic

Once you identify the ideal number of clusters, k, you would then re-perform PAM on the TCGA GBM data with the identified value of k. The idea would be that the original groups identified by the authors will be 'un-earthed' in this way.
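
As a rough illustration of choosing k with the silhouette method and then clustering around medoids, here is a small self-contained sketch in plain Python. Note the assumptions: the `k_medoids` function below is a simplified alternating k-medoids, not the full PAM build/swap search (in R you would normally use `cluster::pam()`), the data are hypothetical toy points standing in for samples in signature-gene space, and the points are assumed to be distinct.

```python
import math

def k_medoids(D, k, max_iter=100):
    """Simplified alternating k-medoids on a distance matrix D (not full PAM)."""
    n = len(D)
    # Deterministic start: most central point, then farthest-first additions.
    medoids = [min(range(n), key=lambda i: sum(D[i]))]
    while len(medoids) < k:
        medoids.append(max(range(n), key=lambda i: min(D[i][m] for m in medoids)))
    for _ in range(max_iter):
        # Assign every sample to its nearest medoid.
        labels = [min(range(k), key=lambda c: D[i][medoids[c]]) for i in range(n)]
        # Move each medoid to the member with the smallest total distance.
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            new_medoids.append(min(members, key=lambda i: sum(D[i][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels

def silhouette(D, labels):
    """Mean silhouette width; singleton clusters contribute 0."""
    n = len(D)
    total = 0.0
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            continue
        a = sum(D[i][j] for j in own) / len(own)
        b = min(
            sum(D[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in set(labels) if c != labels[i]
        )
        total += (b - a) / max(a, b)
    return total / n

# Hypothetical toy data: 12 samples forming three well-separated groups.
samples = [(x + dx, y + dy)
           for (x, y) in [(0, 0), (10, 10), (20, 0)]
           for dx in (0, 1) for dy in (0, 1)]
D = [[math.dist(p, q) for q in samples] for p in samples]

scores = {k: silhouette(D, k_medoids(D, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
final_labels = k_medoids(D, best_k)
print(best_k, final_labels)
```

On real data the silhouette curve is rarely this clean, which is why it is worth comparing it against the other metrics in the list before settling on k.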

You could also simply do hierarchical clustering with the subset GBM data and define a tree-cut height to identify the original groups.
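
For the tree-cut idea, here is a minimal sketch of average-linkage hierarchical clustering that stops merging once the next merge would exceed a chosen height. It is plain Python on hypothetical toy data; in R the usual route is `hclust()` followed by `cutree()`. The choice of average linkage, Euclidean distance, and the cut height of 5.0 are assumptions made for the example only.

```python
import math

def average_linkage_cut(points, cut_height):
    """Agglomerative clustering (average linkage), cut at a given height."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Average pairwise Euclidean distance between two clusters.
        return sum(math.dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        d, x, y = min(
            (linkage(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if d > cut_height:
            break  # every remaining merge lies above the cut
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    labels = [0] * len(points)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels

# Hypothetical toy data: 12 samples forming three well-separated groups.
samples = [(x + dx, y + dy)
           for (x, y) in [(0, 0), (10, 10), (20, 0)]
           for dx in (0, 1) for dy in (0, 1)]

labels = average_linkage_cut(samples, cut_height=5.0)
print(labels)
```

Because average linkage produces merge heights that never decrease, stopping at the first merge above the threshold gives the same groups as cutting the full dendrogram at that height.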

Many different ways to do it - some more elaborate ways likely exist.

"Additionally, the signature was identified using microarray data, but I am classifying RNA-seq data. Is there any sort of adjustment I should make to the signature to account for this?"

Then this will be a good test of the signature.
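
One possible way to run such a test, separate from the re-clustering approach above, is a plain nearest-centroid assignment: correlate each sample's profile over the signature genes with each published centroid and take the best match. The sketch below is plain Python; the subtype names are those from Verhaak et al., but the centroid values, the four-gene signature, and the sample profiles are hypothetical toy numbers, and the input is assumed to be already z-scored per gene across your RNA-seq cohort. In R, `cor()` does the same job.

```python
import math
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den if den > 0 else 0.0

def classify(profile, centroids):
    """Return the subtype whose centroid correlates best with the sample."""
    return max(centroids, key=lambda subtype: pearson(profile, centroids[subtype]))

# Hypothetical centroids over four signature genes (same gene order throughout).
centroids = {
    "Proneural":   [1.0, -0.5, -0.5, 0.0],
    "Neural":      [0.0, 0.0, -1.0, 1.0],
    "Classical":   [-0.5, 1.0, -0.5, 0.0],
    "Mesenchymal": [-0.5, -0.5, 1.0, 0.0],
}

# Hypothetical RNA-seq samples, already z-scored per gene across the cohort.
profiles = {
    "sample_1": [0.9, -0.4, -0.6, 0.1],
    "sample_2": [-0.6, -0.3, 1.1, -0.2],
}

calls = {name: classify(p, centroids) for name, p in profiles.items()}
print(calls)
```

Correlation is used here rather than Euclidean distance because it is insensitive to overall scale, which helps when the centroids come from microarrays and the samples from RNA-seq. It is also worth keeping the correlation values themselves, since samples with a weak best match are the ones where the signature transfers least well.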

Kevin