Creating a new Df to transpose it

Question

Creating sample groups from a combination of genes for survival analysis


2.1 years ago

1769mkc ★ 1.2k

In this post/tutorial, Dr Kevin Blighe showed how to perform survival analysis with gene expression data. He used a single gene as the predictor and classified samples into Low, Mid, or High groups based on its expression.

Now, how do I do the same with a combination of genes? In my case, I have filtered a set of genes from a WGCNA analysis, and I would like to use them on the patient cohort. This is the literature I would like to refer to, where they tested LSC17 and LSC3, a 17-gene and a 3-gene signature respectively; LSC17 is the final signature that is now used.

This is another paper, where they used that signature to test against their new signature in pediatric AML samples.

Here they show different classifications. For LSC17 or LSC47, I would like to know how the grouping is achieved when more than one gene is involved.

I would like to create a dummy data frame:

set.seed(123)
# Block sizes: nr = 18 rows (genes), nc = 24 columns (samples)
nr1 = 4; nr2 = 8; nr3 = 6; nr = nr1 + nr2 + nr3
nc1 = 6; nc2 = 8; nc3 = 10; nc = nc1 + nc2 + nc3
mat = cbind(rbind(matrix(rnorm(nr1*nc1, mean = 1,   sd = 0.5), nrow = nr1),
                  matrix(rnorm(nr2*nc1, mean = 0,   sd = 0.5), nrow = nr2),
                  matrix(rnorm(nr3*nc1, mean = 0,   sd = 0.5), nrow = nr3)),
            rbind(matrix(rnorm(nr1*nc2, mean = 0,   sd = 0.5), nrow = nr1),
                  matrix(rnorm(nr2*nc2, mean = 1,   sd = 0.5), nrow = nr2),
                  matrix(rnorm(nr3*nc2, mean = 0,   sd = 0.5), nrow = nr3)),
            rbind(matrix(rnorm(nr1*nc3, mean = 0.5, sd = 0.5), nrow = nr1),
                  matrix(rnorm(nr2*nc3, mean = 0.5, sd = 0.5), nrow = nr2),
                  matrix(rnorm(nr3*nc3, mean = 1,   sd = 0.5), nrow = nr3))
)
mat = mat[sample(nr, nr), sample(nc, nc)] # randomly shuffle rows and columns
rownames(mat) = paste0("gene", seq_len(nr))
colnames(mat) = paste0("LSC", seq_len(nc))

Transposing it into a new data frame, with one row per patient:

library(tidyverse) # provides %>% and rownames_to_column()
mat2 <- mat %>% t() %>% as.data.frame() %>% rownames_to_column("Patient")

Now, for example, if I want to use some genes such as gene1, gene3, gene5, gene8 and gene10, how do I use this combination of genes to categorise my patients into Low/Mid/High groups?

In the post, the thresholds were set like this:

highExpr <- 1.0
lowExpr <- -1.0
survplotdata$CXCL12 <- ifelse(survplotdata$CXCL12 >= highExpr, 'High',
  ifelse(survplotdata$CXCL12 <= lowExpr, 'Low', 'Mid'))
survplotdata$MMP10 <- ifelse(survplotdata$MMP10 >= highExpr, 'High',
  ifelse(survplotdata$MMP10 <= lowExpr, 'Low', 'Mid'))

What should the threshold be, and how should it be chosen, when I would like to use more than one gene together?

Any suggestion or help would be really appreciated.
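As a starting point, the single-gene thresholding above can be applied to several genes in one pass. This is only a minimal sketch: `expr` is a hypothetical stand-in for the transposed dummy data (patients as rows), and scaling each gene to Z-scores before applying the ±1 cut-offs is an assumption made here so that one pair of cut-offs is comparable across genes. It labels each gene separately and does not yet combine the genes into a single label per patient.

```r
set.seed(123)

# Hypothetical stand-in for the transposed dummy data: 24 patients x 18 genes
expr <- matrix(rnorm(24 * 18), nrow = 24,
               dimnames = list(paste0("LSC", 1:24), paste0("gene", 1:18)))

# Genes of interest (taken from the question)
genes <- c("gene1", "gene3", "gene5", "gene8", "gene10")

# Assumption: scale each gene to Z-scores across patients first
z <- scale(expr[, genes])

highExpr <- 1.0
lowExpr <- -1.0

# Label every selected gene at once; ifelse() keeps the matrix shape
labels <- ifelse(z >= highExpr, "High",
                 ifelse(z <= lowExpr, "Low", "Mid"))
groups <- as.data.frame(labels, stringsAsFactors = FALSE)

head(groups)
```

Each column of `groups` can then be used the same way as the single-gene columns in the tutorial.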

UPDATE: In the paper they used the following score: LSC17 score = (DNMT3B × 0.0874) + (ZBTB46 × −0.0347) + (NYNRIN × 0.00865) + (ARHGAP22 × −0.0138) + (LAPTM4B × 0.00582) + (MMRN1 × 0.0258) + (DPYSL3 × 0.0284) + (KIAA0125 × 0.0196) + (CDK6 × −0.0704) + (CPXM1 × −0.0258) + (SOCS2 × 0.0271) + (SMIM24 × −0.0226) + (EMP1 × 0.0146) + (NGFRAP1 × 0.0465) + (CD34 × 0.0338) + (AKR1C3 × −0.0402) + (GPR56 × 0.0501).

They then categorise the scores: since above- and below-median scores in the training cohort were associated with adverse and favourable cytogenetic risk, respectively, a median threshold was used to discretize scores into high and low groups.

Now, for my own understanding, I have made up a fake lasso regression score for 5 genes; let's call it LSC5:

LSC5 score = (gene1 × 0.0874) + (gene6 × −0.0347) + (gene10 × 0.00865) + (gene22 × −0.0138) + (gene28 × 0.00582)

Now, how do I use the above score to get the median across samples and categorise each sample as low/mid/high?
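One way to read the procedure quoted above is: compute one weighted score per patient, take the median of those scores across the cohort, and split on it. The sketch below does that under stated assumptions: `expr` is a hypothetical patients-by-genes matrix, and because the dummy data only has gene1 to gene18, gene12 and gene18 stand in for gene22 and gene28 from the made-up LSC5 formula.

```r
set.seed(123)

# Hypothetical stand-in for the transposed dummy data: 24 patients x 18 genes
expr <- matrix(rnorm(24 * 18), nrow = 24,
               dimnames = list(paste0("LSC", 1:24), paste0("gene", 1:18)))

# Made-up LSC5 weights; gene12 and gene18 replace gene22 and gene28,
# which do not exist in an 18-gene matrix (assumption for illustration)
w <- c(gene1 = 0.0874, gene6 = -0.0347, gene10 = 0.00865,
       gene12 = -0.0138, gene18 = 0.00582)

# One score per patient: weighted sum of the signature genes
score <- drop(expr[, names(w)] %*% w)

# Median of the scores across the cohort, used as the cut-off
cutoff <- median(score)
group <- ifelse(score > cutoff, "High", "Low")

res <- data.frame(Patient = names(score), LSC5 = score, Group = group)
table(res$Group)
```

A median is a single cut-off, so it gives only two groups (High/Low). A Low/Mid/High split needs two cut-offs, for example tertiles of the score.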

Survival • 1.1k views


It's not clear whether you would like to combine the low/high information coming from each gene into a single label to use for a patient. Regarding the threshold for each gene, it's hard to determine which single value would stratify samples into high and low expression levels. Why not use quantiles? This gives you control over the expression range and lets you determine how wide your high/low labels are.
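A minimal sketch of the quantile idea, assuming a hypothetical numeric vector `score` with one value per patient (it could be a single gene's expression or a combined score). The tertile cut points used here are just one choice; changing `probs` widens or narrows the Low and High groups.

```r
set.seed(123)

# Hypothetical per-patient values (single-gene expression or a combined score)
score <- rnorm(24)
names(score) <- paste0("LSC", 1:24)

# Cut points at the 1/3 and 2/3 quantiles; e.g. c(0, 0.25, 0.75, 1)
# would instead label the bottom and top quarters as Low and High
probs <- c(0, 1/3, 2/3, 1)
breaks <- quantile(score, probs = probs)

# include.lowest = TRUE keeps the minimum value inside the "Low" bin
group <- cut(score, breaks = breaks,
             labels = c("Low", "Mid", "High"),
             include.lowest = TRUE)

table(group)
```

The resulting factor can be used directly as the grouping variable in a survival model.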

ADD REPLY • link 2.1 years ago by Shred ★ 1.6k

Sorry. Regarding "you would like to combine the low/high information coming from each gene into a single label to use for a patient": this information was missing from my post; let me update it...

ADD REPLY • link 2.1 years ago by 1769mkc ★ 1.2k

"Why not use quantiles? This gives you control over the expression range and lets you determine how wide your high/low labels are." This can be one way, but the other approach is my query above. My approach is very similar: I have narrowed down the genes and used lasso to generate the scores, and I want to test whether they perform better or worse than LSC17.

ADD REPLY • link 2.1 years ago by 1769mkc ★ 1.2k

Hello Shred, I have updated the information. Regarding your suggestion, "Why not use quantiles? This gives you control over the expression range and lets you determine how wide your high/low labels are.": I would definitely like to know how to do that, if you can show me.

ADD REPLY • link 2.1 years ago by 1769mkc ★ 1.2k