This is the post/tutorial where Dr Kevin Blighe showed how to perform survival analysis with gene expression. In it, he used a single gene as the predictor, created groups based on its expression, and classified the samples as low, mid or high.
How can I do the same with a combination of genes? In my case, I filtered the genes with a WGCNA analysis and would like to use them on a patient cohort. This is the literature I would like to refer to, where they tested LSC17 and LSC3, a 17-gene and a 3-gene signature; LSC17 is the final signature, which is now in use.
This is another paper, where they used LSC17 to test against their new signature in pediatric AML samples.
Here they show different classifications. For LSC17 or LSC47, I would like to know how the grouping is achieved when more than one gene is involved.
I would like to create a dummy data-frame
set.seed(123)
nr1 = 4; nr2 = 8; nr3 = 6; nr = nr1 + nr2 + nr3
nc1 = 6; nc2 = 8; nc3 = 10; nc = nc1 + nc2 + nc3
mat = cbind(
  rbind(matrix(rnorm(nr1*nc1, mean = 1, sd = 0.5), nrow = nr1),
        matrix(rnorm(nr2*nc1, mean = 0, sd = 0.5), nrow = nr2),
        matrix(rnorm(nr3*nc1, mean = 0, sd = 0.5), nrow = nr3)),
  rbind(matrix(rnorm(nr1*nc2, mean = 0, sd = 0.5), nrow = nr1),
        matrix(rnorm(nr2*nc2, mean = 1, sd = 0.5), nrow = nr2),
        matrix(rnorm(nr3*nc2, mean = 0, sd = 0.5), nrow = nr3)),
  rbind(matrix(rnorm(nr1*nc3, mean = 0.5, sd = 0.5), nrow = nr1),
        matrix(rnorm(nr2*nc3, mean = 0.5, sd = 0.5), nrow = nr2),
        matrix(rnorm(nr3*nc3, mean = 1, sd = 0.5), nrow = nr3))
)
mat = mat[sample(nr, nr), sample(nc, nc)] # random shuffle rows and columns
rownames(mat) = paste0("gene", seq_len(nr))
colnames(mat) = paste0("LSC", seq_len(nc))
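Creating a new data frame by transposing it, so that each row is a patient:
library(dplyr); library(tibble) # for %>% and rownames_to_column()
mat2 <- mat %>% as.data.frame() %>% t() %>% as.data.frame() %>% rownames_to_column("Patient")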
Now, for example, suppose I want to use a few genes such as gene1, gene3, gene5, gene8 and gene10.
How can I use this combination of genes to categorise my patients into Low/Mid/High groups?
In that post, the thresholds were set like this:
highExpr <- 1.0
lowExpr <- -1.0
survplotdata$CXCL12 <- ifelse(survplotdata$CXCL12 >= highExpr, 'High',
  ifelse(survplotdata$CXCL12 <= lowExpr, 'Low', 'Mid'))
survplotdata$MMP10 <- ifelse(survplotdata$MMP10 >= highExpr, 'High',
  ifelse(survplotdata$MMP10 <= lowExpr, 'Low', 'Mid'))
What should the threshold be, and how should it be chosen, when I want to use more than one gene together?
Any suggestion or help would be really appreciated.
UPDATE
In the paper they have used this LSC17 score = (DNMT3B × 0.0874) + (ZBTB46 × −0.0347) + (NYNRIN × 0.00865) + (ARHGAP22 × −0.0138) + (LAPTM4B × 0.00582) + (MMRN1 × 0.0258) + (DPYSL3 × 0.0284) + (KIAA0125 × 0.0196) + (CDK6 × −0.0704) + (CPXM1 × −0.0258) + (SOCS2 × 0.0271) + (SMIM24 × −0.0226) + (EMP1 × 0.0146) + (NGFRAP1 × 0.0465) + (CD34 × 0.0338) + (AKR1C3 × −0.0402) + (GPR56 × 0.0501).
They then discretized the scores into high and low groups using a median threshold, because above- and below-median scores in the training cohort were associated with adverse and favourable cytogenetic risk, respectively.
To understand this, I made up a fake LASSO regression score for 5 genes; let's call it LSC5:
LSC5 score = (gene1 × 0.0874) + (gene6 × −0.0347) + (gene10 × 0.00865) + (gene22 × −0.0138) + (gene28 × 0.00582)
How do I compute this score for each sample, take the median, and categorise the samples into low/mid/high?
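One possible way to do this is sketched below in base R. It is only an illustration under a few assumptions: the matrix is a simplified stand-in for the dummy data above (18 genes, 24 patients), and because gene22 and gene28 do not exist in an 18-gene matrix, gene12 and gene18 are substituted for them. The idea is to take the weighted sum of the signature genes for each patient, then split the patients at the median score (two groups, as in the LSC17 description) or at tertiles (three groups).

```r
# Simplified stand-in for the dummy matrix (assumption): 18 genes x 24 patients
set.seed(123)
mat <- matrix(rnorm(18 * 24, mean = 0.5, sd = 0.5), nrow = 18,
              dimnames = list(paste0("gene", 1:18), paste0("LSC", 1:24)))

# Made-up weights from the question. gene22 and gene28 are not in this
# matrix, so gene12 and gene18 are used instead (assumption).
weights <- c(gene1 = 0.0874, gene6 = -0.0347, gene10 = 0.00865,
             gene12 = -0.0138, gene18 = 0.00582)

# One score per patient: each signature gene's row is multiplied by its
# weight (recycled down the rows), then summed within each column.
score <- colSums(mat[names(weights), ] * weights)

# Two groups: split at the median score
cutoff <- median(score)
group <- ifelse(score > cutoff, "High", "Low")

# Three groups: split at the tertiles of the score
tert <- quantile(score, probs = c(1/3, 2/3))
group3 <- cut(score, breaks = c(-Inf, tert, Inf),
              labels = c("Low", "Mid", "High"))

result <- data.frame(Patient = names(score), score = score,
                     group = group, group3 = group3, row.names = NULL)
head(result)
```

The group column can then be used as the predictor in the survival model in the same way as the single-gene High/Mid/Low labels. If you have separate training and validation cohorts, the cutoff would normally be derived from the training cohort and then applied unchanged to the validation cohort.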
It's not clear whether you would like to combine the low/high information coming from each gene into a single label for each patient. Regarding the threshold for each gene, it's hard to pick a single value that would stratify samples into high and low expression. Why not use quantiles? This gives you control over the expression range and lets you decide how wide your high/low groups should be.
Sorry for the
you would like to combine the low/high information coming from each gene into a single label to use for a patient
This information was missing from my post; let me update it. "Why not use quantiles? This gives you control over expression range and the ability to determine how wide your high/low label may be." That can be one way, but the other approach is my query above. My approach is very similar: I narrowed down the genes and used LASSO to generate the scores, which I want to test against LSC17 to see whether they perform better or worse.
Hello Shred, I have updated the post with the information relating to your suggestion. "Why not use quantiles? This gives you control over expression range and the ability to determine how wide your high/low label may be." I would definitely like to know how to do that, if you can show me.
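A minimal sketch of the quantile idea, for illustration only. It assumes a simplified stand-in for the dummy matrix (18 genes, 24 patients). The helper name label_by_quantile and the 25%/75% cutoffs are arbitrary choices made for this example, not something taken from the posts above. Each gene is labelled separately: patients at or below the lower quantile are "Low", those above the upper quantile are "High", and the rest are "Mid".

```r
# Simplified stand-in for the dummy matrix (assumption): 18 genes x 24 patients
set.seed(123)
mat <- matrix(rnorm(18 * 24, mean = 0.5, sd = 0.5), nrow = 18,
              dimnames = list(paste0("gene", 1:18), paste0("LSC", 1:24)))

genes <- c("gene1", "gene3", "gene5", "gene8", "gene10")

# Hypothetical helper: label one gene's expression values by quantile.
# 'lower' and 'upper' set how wide the Low and High groups are.
label_by_quantile <- function(x, lower = 0.25, upper = 0.75) {
  q <- quantile(x, probs = c(lower, upper))
  cut(x, breaks = c(-Inf, q, Inf), labels = c("Low", "Mid", "High"))
}

# Patients in rows, selected genes in columns
expr <- as.data.frame(t(mat[genes, ]))

# Apply the helper to every gene, keeping row and column names
labels <- expr
labels[] <- lapply(expr, label_by_quantile)

head(labels)
table(labels$gene1)
```

Moving lower and upper closer to 0 and 1 makes the Low and High groups narrower, and moving them closer to 0.5 makes them wider. The same function can be applied to a combined score instead of a single gene if you want one label per patient.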