I'm sorry, the description may be too long.
I wanted to find gene whose expresison values positively correlate to gleason grade in prostate cancer using TCGA data, the gleason grade is divided into five group according to the gleason scores(Grade I: gleason scores <= 6; Grade II: gleason scores = 3+4; Grade III: gleason scores = 4+3; Grade IV: gleason scores = 4+4|3+5|5+3; Grade V: gleason scores = 9|10), the variable gleason grade should be a orderd factors. the code for above opetation is below:
## download data from TCGA througn TCGAbiolinks
mrnaquery <- GDCquery(project = "TCGA-PRAD",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
experimental.strategy = "RNA-Seq",
workflow.type = "HTSeq - Counts")
GDCdownload(mrnaquery, files.per.chunk = 8)
mrna_se <- GDCprepare(mrnaquery)
## filter low tumor purity samples and normal samples
mrnatp <- TCGAquery_SampleTypes(getResults(mrnaquery)$cases,
typesample = "TP")
mrna_se <- mrna_se[,mrna_se$barcode %in%
c(TCGAtumor_purity(mrnatp,
estimate = 0,
absolute = 0,
ihc = 0,
lump = 0,
cpe = 0.7)$pure_barcodes)]
rm(mrnatp, mrnant)
## filter FFPE sampels
mrna_se <- mrna_se[,!mrna_se$is_ffpe]
## check if there are some duplicated samples
mrna_se$bcr_patient_barcode[mrna_se$shortLetterCode == "TP"] %>%
duplicated() %>%
sum()
## omit samples without gleason scores
mrna_se <- mrna_se[,!is.na(mrna_se$subtype_Clinical_Gleason)]
# construct a group factor according to gleason score
mrna_se$gleason_grade5g <- ifelse(mrna_se$subtype_Clinical_Gleason == "3+3"|
mrna_se$subtype_Clinical_Gleason == "2+4",
"I",
ifelse(mrna_se$subtype_Clinical_Gleason == "3+4",
"II",
ifelse(mrna_se$subtype_Clinical_Gleason == "4+3",
"III",
ifelse(mrna_se$subtype_Clinical_Gleason == "4+4"|
mrna_se$subtype_Clinical_Gleason == "3+5" | mrna_se$subtype_Clinical_Gleason == "5+3",
"IV",
"V"))))
mrna_se$gleason_grade5g <- factor(mrna_se$gleason_grade5g,
levels = c("I", "II", "III", "IV", "V"))
I tried two different methods to find my desied gene.
the first I fitted a ordinal logistic regression across all gene using ordinal package in R where the gleason grade variable is a dependent variable and gene expression values which has been transformed using the function vst() in DESeq2 package is a independent variable. and I could obtain the coefficient of each gene in the regression and it's corresponding p values, after obtaining all gene's coefficients and pvalues I calculated the adjusted P using "BH" method, I thought the gene whose coefficient is above zero and adjusted P is below 0.05 is what I need.
here is my code for my first methods:
mrna_dds5g <- DESeqDataSet(mrna_se, design = ~gleason_grade5g)
## filter low expression gene
mrna_dds5g <- mrna_dds5g[rowSums(assay(mrna_dds5g)) >= 10,]
## Data transformation
mrna_vst5g <- vst(mrna_dds5g, blind = F)
tcga_gg_ord <- as.ordered(mrna_vst5g$gleason_grade5g)
## construct data for ordinal logistic regression
tcga_gg_ord <- data.frame(samples = colnames(mrna_vst5g),
gleason_grade = tcga_gg_ord,
stringsAsFactors = F)
### construct a data.frame with the first and second column respectively
### representing the sample barcode and gleason grade and the other columns
### standing for the gene expression values
tcga_logit_data <- assay(mrna_vst5g) %>%
as.data.frame() %>%
rownames_to_column(var = "gene_ensembl") %>%
### add a column corresponding to the gene symbol
mutate(Symbol = mapIds(org.Hs.eg.db,
keys = gene_ensembl,
column = "SYMBOL",
keytype = "ENSEMBL"),) %>%
### filter gene without gene symbol and omit gene_emsembl column
dplyr::filter(!is.na(Symbol)) %>%
dplyr::select(-gene_ensembl) %>%
dplyr::select(Symbol, everything()) %>%
### there are some gene ensembl mapping to the same gene symbol
### retain the gene symbol rows whose mean expression values is the most
mutate(sum = rowSums(.[-1])) %>%
arrange(desc(sum)) %>%
distinct(Symbol, .keep_all = T) %>%
### transform data to a format where the first and second column respectively
### representing the sample barcode and gleason grade and the other columns
### standing for the gene expression values
column_to_rownames(var = "Symbol") %>%
t() %>% as.data.frame() %>% rownames_to_column(var = "samples") %>%
inner_join(tcga_gg_ord, by = "samples") %>%
dplyr::select(samples, gleason_grade,everything())
### apply clm() to each gene
tcga_clm_logit <- apply(tcga_logit_data[-(1:2)], 2,
function(gene){
logit <- clm(tcga_logit_data$gleason_grade ~ gene)
logit_sum <- summary(logit)
ord_logit <- list(clm = logit, summary = logit_sum)
return(ord_logit)
})
### take out coefficients, standard error, z value, and p value for every gene
tcga_clm_res <- sapply(tcga_clm_logit, function(gene){
temp <- gene$summary$coefficients["gene", ]
})
### fetch the results data.frame which have six columns
### gene symbol, standard error, z value, and p value, adjusted pvalues
tcga_clm_res <- t(tcga_clm_res) %>%
as.data.frame() %>%
`names<-`(c("Estimate", "Std.error", "z value", "pvalue")) %>%
rownames_to_column(var = "gene") %>%
mutate(padj = p.adjust(pvalue, method = "BH"))
tcga_clm_upgene <- filter(tcga_clm_res, Estimate > 0, padj < 0.01) %>%
arrange(padj)
tcga_clm_downgene <- filter(tcga_clm_res, Estimate < 0, padj < 0.01) %>%
arrange(padj)
of course, I also test the proportional assumption using nominal_test() function, but weird thing is that all p values is NA. and I don't know if the logistic regression can be a good method in such circumstance?
the second method is clumsy but I think it can be proper though the results were not so well. I made a differential expression among all gleason grade group. V vs IV, IV vs III, III vs II, II vs I. and I take out all results from above compare and intersect them but there is no one existing in all group compare.
the code for the second method is blew:
## differential expression between 5 groups
mrna_gleason5g_de <- DESeq2::DESeq(mrna_dds5g)
### up
res_IIvsI <- results(mrna_gleason5g_de,
contrast = c("gleason_grade5g", "II", "I"),
alpha = 0.01)
summary(res_IIvsI)
IIvsI_gene <- tidy.DESeqResults(res_IIvsI) %>%
filter(estimate > 1, p.adjusted < 0.01) %>%
.$gene
res_IIIvsII <- results(mrna_gleason5g_de,
contrast = c("gleason_grade5g", "III", "II"),
alpha = 0.01)
summary(res_IIIvsII)
IIIvsII_gene <- tidy.DESeqResults(res_IIIvsII) %>%
filter(estimate > 1, p.adjusted < 0.01) %>%
.$gene
res_IVvsIII <- results(mrna_gleason5g_de,
contrast = c("gleason_grade5g", "IV", "III"),
alpha = 0.01)
summary(res_IVvsIII)
IVvsIII_gene <- tidy.DESeqResults(res_IVvsIII) %>%
filter(estimate > 1, p.adjusted < 0.01) %>%
.$gene
res_VvsIV <- results(mrna_gleason5g_de,
contrast = c("gleason_grade5g", "V", "IV"),
alpha = 0.01)
summary(res_VvsIV)
intersect(IIvsI_gene, IIIvsII_gene)
intersect(IIIvsII_gene, IVvsIII_gene)
intersect(IVvsIII_gene, VvsIV_gene)
for some question summarization:
1. Can the ordinal regression be a good methods for correlation between a ordered variable and a continuous variable?
2. what does the NA p value returned by nominal_test() function from R ordinal package mean? I Googled it but it seems no one come to such situation.
Thank you all!
Thank you, Kevin. Sorry for my unproper description and thanks for your correcting me.
I have just tried your code, but it returns me a error- Error in eval(predvars, data, env) : can't find object 'NKX3', It seems the gene symbol name cause such problems.
and my code
tcga_clm_logit <- apply(tcga_logit_data[-(1:2)], 2, function(gene){ logit <- clm(tcga_logit_data$gleason_grade ~ gene) logit_sum <- summary(logit) ord_logit <- list(clm = logit, summary = logit_sum) return(ord_logit) })
can give a result.for my differential expression operation. I just thought the gene having a up differential expression in all the group compare(for this, they were II vs I, III vs II, IV vs III, V vs IV) imply their positive correlation with gleason grade.
Kevin, the ordinal logistic regression assume the effects of any explanatory variables are consistent or proportional across the different thresholds I test the proportional assumption using nominal_test() function, but weird thing is that all p values is NA. do you know what it mean? and if this can influence my results?
Hey again, so, you are saying that the
apply(...
function produces results? Do they make sense?I am not quite sure about the NA p-values. The issue has been reported previously: https://github.com/runehaubo/ordinal/issues/15
Hey, Kevin, I solved it.
the colnames in my data are just gene symbol containing non-syntactic character, and R allow non-syntactic column names with backticks surrounding them. so I just add backticks to each gene symbol and it works. I changed the code to below:
and it works. Thanks for your help.
Yes, only the apply function give me a result, I tried the code you gave me again and I can understand the code, but it return a error(Error in eval(predvars, data, env) : object 'NKX3' not found). Maybe there are some conflicts with the package ordinal. I'll try another ordinal logistic regression package. Thanks for your patient instruction.
I have checked tcga_logit_data$gleason_grade and it's a ordered factor. and the number for each group respectively 23, 95, 52, 38, 65 for gleason grade I, II, III, IV, V. the number seems unbalanced.
Sorry, I also wanted to confirm my uncertain question, I thinks I should stablize the variance to do other analysis with expression values. in my ordinal regression workflow, the gene expression values I use were transformed by vst() function in DESeq2 package. Can this be a acceptable method?
and I have read a post in biostar concerning survival analysis where using voom().
I think the both can be used, Am I right? or there are other methods best to me? Thanks Kevin.
The transformed expression levels produced by
vst()
should be fine; however, you can also try those produced byrlog()
.Thanks for your help, Kevin.