Data setup

Question

How to exclude some of breast cancer subtypes just by looking at gene expression?

4

Entering edit mode

7.2 years ago

jack ▴ 980

Hi, Probably you all know about PAM50 gene signature for breast cancer sub typing. What I'm looking for is the some of small sets of the genes that their expression levels(expressed or not) helps me to exclude some of the breast cancer subtypes (For example, by looking at gene A and gene B expression levels I can tell something like: this can not be Her2) Does anyone have idea which genes are informative in this regard ?

RNA-Seq genome cancer gene • 8.7k views

ADD COMMENT • link updated 20 months ago by Safu • 0 • written 7.2 years ago by jack ▴ 980

score 15 · Answer 1 · 2017-12-04

15

Entering edit mode

7.2 years ago

Kevin Blighe 89k

Edit March 5th, 2020: there is no definitive answer to the OP's question. My original answer (below) is to merely foment ideas on how to address the question.

Hi Jack,

In order to achieve this, you will have to model your expression data using the PAM50 classifier genes, and then allow the modelling to decide which genes can be best used to predict the breast cancer sub-types.

Keep in mind that PAM50 is already a relatively small classifier that can segregate the different sub-types.

You could aim to do multinomial logistic regression because your aim is to use gene expression to predict multiple categorical variables (sub-types). I would also recommend trying out lasso-penalised regression, which is what I go over in this answer (below).

You'll require:

A gene expression dataset with PAM50 genes filtered in (if you don't know the genes, look here: Where To Download Pam50 Gene Set? )
Breast cancer subtypes for samples

Data setup

require(glmnet)

# Get your data like this:
modellingLasso
           SubType   MYC  EGFR KIF2C CDC20
Sample 1  LuminalA 11.39 10.62  9.75 10.34
Sample 2  LuminalB 10.16  8.63  8.68  9.08
Sample 3  LuminalA  9.29 10.24  9.89 10.11
Sample 4      Her2 11.53  9.22  9.35  9.13
Sample 5      Her2  8.35 10.62 10.25 10.01
Sample 6      Her2 11.71 10.43  8.87  9.44
...

The SubType variable can be a factor:

modellingLasso$SubType
...
Levels: Her2 LuminalA LuminalB ...

Now we can build the models:

build a single pass (single fold) lasso-penalised model

lassoModel <- glmnet(
  x=data.matrix(modellingLasso[,-1]),
  y=modellingLasso$SubType,
  standardize=TRUE,
  alpha=1.0,
  family="multinomial")
plot(lassoModel, xvar="lambda")

perform 10-fold cross validation

cv.lassoModel <- cv.glmnet(
  x=data.matrix(modellingLasso[,-1]),
  y=modellingLasso$SubType,
  standardize=TRUE,
  alpha=1.0,
  nfolds=10,
  family="multinomial",
  parallel=TRUE)

# plot variable deviances vs. shrinkage parameter, λ (lambda)
plot(cv.lassoModel)

identify best predictors

Now we can identify best predictors via 2 metrics: min λ (lamba); 1 standard error of λ

idealLambda <- cv.lassoModel$lambda.min
idealLambda1se <- cv.lassoModel$lambda.1se
print(idealLambda); print(idealLambda1se)

# derive coefficients for each gene
co <- coef(cv.lassoModel, s=idealLambda, exact=TRUE)
co

co.se <- coef(cv.lassoModel, s=idealLambda1se, exact=TRUE)
co.se

# identify predictors for each sub-type
rownames(co.se$LuminalA)[which(co.se$LuminalA != 0)]
rownames(co.se$LuminalB)[which(co.se$LuminalB != 0)]
rownames(co.se$Her2)[which(co.se$Her2 != 0)]

Downstream

For downstream applications, we could build a new model with our best predictors:

finalLasso <- glm(modellingLasso$SubType ~ MYC + ERBB2 + CCND1,
  data=modellingLasso,
  family=binomial(link="logit"))
summary(finalLasso)

We can also perform ROC analysis:

# ROC analysis
require(pROC)

roc <- roc(modellingLasso$SubType, fitted(finalLasso), smooth=FALSE)
plot.roc(
  roc,
  grid=TRUE,
  grid.lwd=2,
  col="royalblue",
  main="Lasso model")
text(0.3, 0.45,
  col="royalblue",
  paste("AUC (L95, AUC, U95):",
    paste(round(ci.auc(roc)[1:3], 3), sep=" ", collapse=","), sep=" "),
  cex=0.7)

Further info:

If you run into difficulty with the lasso-penalised regression, you can modify the stringency by relaxing α (alpha):

# elastic-net regression (here α=0.5)
cv.ElasticModel <- cv.glmnet(..., alpha=0.5, ...)

# ridge regression (α=0)
cv.RidgeModel <- cv.glmnet(..., alpha=0.00, ...)

Also see Professor Hastie's own in-depth tutorial: Glmnet Vignette

ADD COMMENT • link 3.7 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi Kevin, I have read your comment on this page and I also have a batch of RNA-seq count to predict survival When I see your data 'modellingLasso' I have two questions: 1.how to define the endpoint: if there was two patients, one have survived for 3 years and then died, the other survived for 1 year and is still alive and under follow-up. Then the first one should be died and the other should be alive?but how can we know the later one could survived longer than the first one? Or should we define a survival duration,namely 2 year. and then the first one is 2 year positive and the later one is 2 year negative?

2.does the count matric been normalized and log transformation? which method do you adapted? You use the PAM50 genes and could I use my own datasets with 500+ DEGs?

ADD REPLY • link 6.9 years ago by 1561775911 • 0

0

Entering edit mode

Hello, if you only have 2 patients, then it is not a great situation to have. However, if you have multiple biopsies of the same patients, then it may work - do you have multiple biopsies at different time-points?; or, do you literally just have 2 samples, both taken at the same time-point?

Yes, you can use any number of genes with the lasso regression. The method has previously been used with genome-wide genetic variant data in breast cancer.

ADD REPLY • link 6.9 years ago by Kevin Blighe 89k

0

Entering edit mode

Hello,kevin, thank you for your reply. but for the question 1,I just gave an example ,actually ,I have 300+patients ,and each one do not have biopsy replication. I am just wondering how could I define the endpoints?
For question 2,I'd like to ask that if the expression matrix of my own DEGs should be normalized and log transformation? There is another question,if the DEGs I choose from have high correlations between themselves ,could I choose this method of LASSO? Ridge or plasticnet is better? And I have 150 patients ,there is 1000-6000(varing with my chosen standard log2fc,padj or pvalue) DEGs as input,does this situation work?

ADD REPLY • link 6.9 years ago by 1561775911 • 0

0

Entering edit mode

Hello!

In that case, I would recommend multiple end-points (and, therefore, multiple analyses). Your end-points can be something like:

survived up to end of year 1
survived up to end of year 3

These will obviously be encoded as 0 and 1

--------------------------------

The data that you use should already be normalised. If you have microarray data, then it will already be on Log2 scale.

-------------------------------------

The cross-validation part of LASSO regression is partly aimed to mitigate (reduce) the effects of inter-correlated variables. So, be sure that you do cross-validation.

LASSO will fail if it cannot determine differences between your end-points, in which case a less severe regression penalty (like ridge or elastic-net) can be done.

ADD REPLY • link 6.9 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi,

Thanks a lot, this has been truly useful.

Notwithstanding, I have a theoretical question.

Why is SubType the response variable? Is not suppose that variable is a factor (explanatory variable) and the expression values the response variable?

Thanks for your collaboration.

Best regards,

Juan

ADD REPLY • link 6.0 years ago by judhenaosa ▴ 50

0

Entering edit mode

It depends on your point of view and the question that you are asking. I think that both ways are valid.

ADD REPLY • link 6.0 years ago by Kevin Blighe 89k

0

Entering edit mode

Thanks for your help.

I have an expression matrix as I show you n the next table.

Name        y    GAPDH          PGK1            BRACA1          PPIA
Sample_1    1   -3.862188587    4.0051135284    5.3343408926    -1.5971560514
Sample_2    1   -4.3324333866   4.7910247638    6.9940989042    1.3791251185
Sample_3    1   -5.4226451187   3.4368084632    3.4502391931    -5.2178824077
Sample_4    1   -6.1021958608   4.2138262027    5.1661516867    -1.146564008
Sample_5    1   -4.4832777421   4.9770826641    5.0067170526    0.7268057385
Sample_6    0   -7.7504208928   3.7658388974    2.513504636     1.1494149229
Sample_7    0   -7.9455082982   4.848660554     2.3662448291    -1.4117052604
Sample_8    0   -7.8904193457   4.6539708906    7.7405673621    -1.8672276649
Sample_9    0   -5.7225768492   2.4094299737    2.3454939119    -1.3644475688
Sample_10   0   -3.4185800705   4.3094231329    5.9014253872    -1.862157346

I have no information related to this matrix, but I have to create a predictive model. As I know, probably the matrix was created using the "voom" function, which gives negative measures. Then, it is possible to do the model just like the before example.

Note: The Y variable indicates whether a patient was diagnosed with the disease (y=1) or not (y=0).

I truly appreciate your comments.

Best regards,

Juan

ADD REPLY • link 6.0 years ago by judhenaosa ▴ 50

0

Entering edit mode

Yes, you can still use the same code for this type of data. What is the distribution of your data, though? You can check with the hist() function (be sure to not include the y variable when you generate the histogram via hist())

ADD REPLY • link 6.0 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi.

I shared the histogram and the overlap normal curve:

data.table$y <- NULL
row.names(data.table) <- data.table$Name
data.table$Name <- NULL

head(data.table)

all.data <- unlist(data.table)

hist(all.data,probability = T,ylim = c(0,0.12))
curve(dnorm(x,mean = mean(all.data),sd = sqrt(var(all.data))),add = T)

enter image description here

Thanks for your help.

Best regards,

Juan

ADD REPLY • link 6.0 years ago by judhenaosa ▴ 50

0

Entering edit mode

Hola de nuevo. No puedo ver la figura. Se puede encontrar información para cargar una figura aquí: How to add images to a Biostars post

ADD REPLY • link 6.0 years ago by Kevin Blighe 89k

0

Entering edit mode

Sorry.

Here a new attempt.

enter image description here

ADD REPLY • link 6.0 years ago by judhenaosa ▴ 50

0

Entering edit mode

Looks like great data! It seems suitable for the default functionality of cv.glmnet, which expects data to be normally distributed.

ADD REPLY • link 6.0 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi Kevin, Thanks for your Help, It's such a great resource. I have built model LASSO for my data set, I am grateful for your tutorials. Thanks again! Secondly, when I got the best predictors I am not able to print it out it's saying (dgCMatrix) Sparse matrix error. I did use as.matrix() to print, but it's of no use. Please help me to save the predictors in a .csv file. Thanks again, Sincerely, Dave

ADD REPLY • link 6.0 years ago by David_emir ▴ 500

0

Entering edit mode

Hey Dave, it is possible that some of the functionality of these commands has changed since I posted this answer (above). Which command, specifically, are you trying to run?

ADD REPLY • link 6.0 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi Kevin, Sorry for the delayed reply, my Lab is in remotest part and to use the Internet I have to travel back to the city. Sorry again! I am using the following command

lassoModel <- glmnet(x=data.matrix(step1[,-1]), y=step1$SubType, standardize=TRUE, alpha=2.0, family="multinomial")
plot(lassoModel, xvar="lambda" , label = TRUE)
plot(lassoModel, xvar="dev" , label = TRUE)
print(lassoModel)

Thanks again! Dave

ADD REPLY • link 6.0 years ago by David_emir ▴ 500

0

Entering edit mode

Hey, why do you set alpha = 2.0? Usually it is between 0.0 and 1.0

Okay, so, I will hear from you again when you are next in the city. Do you not have a mobile / cellular network?

ADD REPLY • link 6.0 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi Kevin, I have expression profile of the list of genes (say ~100 genes) and i want to select best features out of total list. I have came across this link and tried executing the code with glmnet. However, i am getting error at the step where i want identify best predictors for each class (binomial categories; pCR and RD) the below code giving i am using

modellingLasso$Response

lassoModel <- glmnet( x=data.matrix(modellingLasso[,-1]), y=modellingLasso$Response, standardize=TRUE, alpha=0.5, family="binomial") plot(lassoModel, xvar="lambda") print(lassoModel)

cv.lassoModel <- cv.glmnet( x=data.matrix(modellingLasso[,-1]), y=modellingLasso$Response, nfolds=10, family="binomial")

plot variable deviances vs. shrinkage parameter, λ (lambda)

plot(cv.lassoModel) print(cv.lassoModel)

idealLambda <- cv.lassoModel$lambda.min

idealLambda1se <- cv.lassoModel$lambda.1se

print(idealLambda); print(idealLambda1se)

derive coefficients for each gene

co <- coef(cv.lassoModel, s=idealLambda, exact=TRUE) co

co.se <- coef(cv.lassoModel, s=idealLambda1se, exact=TRUE) co.se

identify predictors for each sub-type

(rownamesco.se$pCR)[whichco.se$pCR != 0]

Error: object 'rownamesco.se' not found

ADD REPLY • link 3.7 years ago by inayat45shaikh ▴ 40

0

Entering edit mode

Hi, you probably need

rownames(co.se$pCR)[whichco.se$pCR != 0]

ADD REPLY • link 3.7 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi Kevin,

Thanks for your prompt response, I tried running that earlier also, but getting error

**rownames(co.se$pCR)[whichco.se$pCR != 0]

Error in co.se$pCR : $ operator not defined for this S4 class**

Then i searched S4 class error and found that i could not access the slots of an S4 object using $ operator. Instead it was suggested to use @ operator. I tried with @ operotor as well, but getting new error

rownames(co.se@pCR)[whichco.se@pCR != 0]

Error in rownames(co.se@pCR) : no slot of name "pCR" for this object of class "dgCMatrix"

I am not how to fix this.

Best

Inayat

ADD REPLY • link 3.7 years ago by inayat45shaikh ▴ 40

0

Entering edit mode

Hi Inayat. I think that there is an issue with rendering of my code in my post (above). The commands shoudl be:

rownames(co.se$LuminalA)[which(co.se$LuminalA != 0)]
rownames(co.se$LuminalB)[which(co.se$LuminalB != 0)]
rownames(co.se$Her2)[which(co.se$Her2 != 0)]

Please note the locations of the parentheses, (

ADD REPLY • link 3.7 years ago by Kevin Blighe 89k

0

Entering edit mode

Thanks Kevin, I will try this

Best Inayat

ADD REPLY • link 3.7 years ago by inayat45shaikh ▴ 40

0

Entering edit mode

Hi Dr. Kevin, I selected 40 prognostic genes having significant survival difference (survival R package) using univariate cox regression that had 25,000 genes. I want to narrow down further genes, using Lasso regression. By using the above method, shall I use the survival time (alive and dead) in y variable as a factor and keep x as expression data. Is it fine? I am a bit confused how to add survival information in y column.

And do I need to use the entire RNA-seq data or only use 40 genes data for lasso regression. Will an expression dataset of 40 genes would be fine. My goal is to build a prognostic model. Will really appreciate your comment.

ADD REPLY • link 20 months ago by Safu • 0