Hi all,
I've been working with Pancancer's TOIL RSEM expected_count data downloaded from the Xena browser to compare two driver mutations (Mutation X vs. Mutation Y) using cBioPortal. After performing differential expression (DE) analysis using DEseq2, I was surprised to find more than 20k significant features out of a total of 60k, despite the combined sample size being less than 650 mutations.
A similar issue occurred when I compared a specific tissue type to all others using the same TOIL RSEM expected_count dataset, though this time with a larger sample size.
I'm concerned there might be a batch effect at play here. In my Pancancer analysis, DE results yielded "impossibly significant" p-values (e.g., <10^-99).
Does anyone have insights into what might be causing these unexpectedly significant results or advice on how to address potential batch effects in this context?
Loaded TOIL RSEM expected_count data, removed samples with therapy, and filtered low-variance genes. Processed mutation data to focus on two driver mutations. Set up DESeq2 analysis using the filtered mutation groups. Observed the high number of significant features.
Code:
Providing the first part (Pancancer):
library(DESeq2)
library(dplyr)
library(survival)
library(survminer)
library(sva)
library(ggplot2) library(limma)
library(readxl)
library(randomForest)
library(pROC)
library(caret)
library(biomaRt)
library(org.Hs.eg.db)
library(xCell)
library(NMF)
library(reshape2)
library(doParallel)
# Directory
clinical_path <- "C:/Users/lyan/Downloads/TCGA-multiple/combined_study_clinical_data.tsv"
mutation_path <- "C:/Users/lyan/Downloads/TCGA-multiple/mutation_data.csv"
file_path <- "C:/Users/lyan/Downloads/TCGA-multiple/tcga_pan_cancer_raw_counts" #TOIL RSEM expected_count data
# Load clinical data
clinical <- read.csv(clinical_path, sep = '\t', row.names = 3)
rownames(clinical) <- gsub("-", ".", rownames(clinical))
# HTseq_counts (it's expected counts but I am used to call it HTseq)
HTseq_counts <- read.table(file_path, header = TRUE, row.names = 1, stringsAsFactors = FALSE)
# Removed after therapy and known
samples_to_remove <- rownames(clinical)[
(clinical$Radiation.Therapy == "Yes" |
clinical$Neoadjuvant.Therapy.Type.Administered.Prior.To.Resection.Text == "Yes")]
HTseq_counts <- HTseq_counts[, !(colnames(HTseq_counts) %in% samples_to_remove)]
HTseq_counts <- HTseq_counts[, grep(colnames(HTseq_counts), pattern = ".01$", invert = FALSE, perl = TRUE)] # primary
HTseq.counts.raw <- 2^(HTseq_counts) - 1 # cancel log normalization (x+1)
rm(HTseq_counts)
HTseq.counts.raw <- HTseq.counts.raw[rowSums(HTseq.counts.raw) > 10, ] # remove low variable genes
# Load mutation data
mutations <- read.csv(mutation_path, stringsAsFactors = FALSE, row.names= 2 )
# SORRY for the mess, at the end I kept only 2 driver mutations here
mutations$Mut <- ifelse(mutations$EGFR!= "WT" & grepl("driver", mutations$EGFR) & mutations$KRAS == "WT",
"EGFR",
ifelse(mutations$KRAS!= "WT" & grepl("driver", mutations$KRAS ) & mutations$EGFR== "WT",
"KRAS",
ifelse(mutations$EGFR!= "WT" & grepl("driver", mutations$EGFR) & mutations$KRAS!= "WT" & !grepl("driver", mutations$KRAS),
NA, # Remove row
ifelse(mutations$KRAS!= "WT" & grepl("driver", mutations$KRAS) & mutations$EGFR!= "WT" & !grepl("driver", mutations$EGFR),
NA, # Remove row
ifelse(mutations$EGFR!= "WT" & grepl("driver", mutations$EGFR) & mutations$KRAS!= "WT" & grepl("driver", mutations$KRAS),
NA, # Remove row
ifelse(mutations$EGFR== "WT" & mutations$KRAS== "WT",
NA, # Remove row
NA # Default to NA for any other cases
))))))
rownames(mutations) <- gsub("-", ".", rownames(mutations))
matching_indices <- match(colnames(HTseq.counts.raw), rownames(mutations))
mutation_row <- rep(NA, ncol(HTseq.counts.raw))
mutation_row[!is.na(matching_indices)] <- mutations$Mut[matching_indices[!is.na(matching_indices)]] # remove NA or non intersected
# Add mutation_row as a new row in raw data
HTseq.counts_raw_with_mutations <- rbind(HTseq.counts.raw, mutation_type = mutation_row)
HTseq.counts_raw_with_mutations <- HTseq.counts_raw_with_mutations[, !is.na(HTseq.counts_raw_with_mutations["mutation_type", ])] # remove last row
# Dataframe with mutated
col_data <- data.frame(
SampleID = colnames(HTseq.counts_raw_with_mutations),
Group = NA # Initialize Group column
)
col_data$Group[HTseq.counts_raw_with_mutations["mutation_type", ] == "KRAS"] <- "KRAS"
col_data$Group[HTseq.counts_raw_with_mutations["mutation_type", ] == "EGFR"] <- "EGFR"
HTseq.counts_raw_with_mutations <- HTseq.counts_raw_with_mutations[!(rownames(HTseq.counts_raw_with_mutations) == "mutation_type"), , drop = FALSE]
rm(HTseq.counts.raw)
preserve_row_names <- rownames(HTseq.counts_raw_with_mutations)
count_data <- apply(HTseq.counts_raw_with_mutations, 2, as.numeric)
rownames(count_data) <- preserve_row_names
count_data <- round(count_data)
rownames(count_data) <- gsub("\\.\\d+", "", rownames(count_data))
# Create DESeq2 dataset
dds <- DESeqDataSetFromMatrix(countData = count_data, colData = col_data, design = ~ Group)
# Run DESeq
dds <- DESeq(dds)
res <- results(dds)
num_tests <- sum(!is.na(res$pvalue)) #valid results
res$bonferroni <- pmin(res$pvalue * num_tests, 1)
ensembl <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl")
# Get gene names for Ensembl IDs
gene_info <- getBM(attributes = c("ensembl_gene_id", "external_gene_name"),
filters = "ensembl_gene_id",
values = rownames(res),
mart = ensembl)
# Add gene names to the results DataFrame
res$Gene_name <- gene_info$external_gene_name[match(rownames(res), gene_info$ensembl_gene_id)]
res
#####################################################################################
Part 2 GTEX:
file_path <- "C:/Users/lyan/Downloads/TCGA-multiple/gtex_raw_counts" TOIL RSEM expected_count dataset
tissue_info <- "C:/Users/lyan/Downloads/TCGA-multiple/GTEX_tissue_info"
# Load GTEX data
GTEX_counts <- read.table(file_path, header = TRUE, row.names = 1, stringsAsFactors = FALSE)
tissue = read.table(tissue_info, header = TRUE, row.names=1, sep = "\t", stringsAsFactors = FALSE)
rownames(tissue) <- gsub("-", ".", rownames(tissue))
matching_samples <- colnames(GTEX_counts) %in% rownames(tissue)
GTEX_counts <- 2^(GTEX_counts) - 1 # cancel log normalization (x+1)
GTEX_counts <- GTEX_counts[rowSums(GTEX_counts) > 10, ]
if (all(matching_samples)) {
# Create a new row for tissue names
tissue_names <- tissue[colnames(GTEX_counts), "X_primary_site"]
GTEX_counts <- rbind(GTEX_counts, tissue_name = tissue_names)
}
filtered_GTEX_counts <- GTEX_counts[, GTEX_counts["tissue_name", ] %in%
c("Colon", "Small Intestine","Adrenal Gland", "Liver")] # Tissues of interest
tissue_names_vector <- setNames(tissue[colnames(filtered_GTEX_counts), "X_primary_site"], colnames(filtered_GTEX_counts))
col_data <- data.frame(SampleID = colnames(filtered_GTEX_counts), stringsAsFactors = FALSE)
# Assign tissue names to col_data
col_data$tissue_name <- tissue_names_vector[col_data$SampleID]
# Dataframe represtning tissues of interest
col_data$Group <- ifelse(col_data$tissue_name %in% c("Colon", "Small Intestine"),
"KRAS_fan", # Mut
ifelse(col_data$tissue_name %in% c("Adrenal Gland", "Liver"),
"EGFR_fan", # WT
NA))
filtered_GTEX_counts = filtered_GTEX_counts[-nrow(filtered_GTEX_counts), ] # remove group last row
preserve_row_names <- rownames(filtered_GTEX_counts)
counts_data <- apply(filtered_GTEX_counts, 2, as.numeric)
rownames(counts_data) <- preserve_row_names
counts_data <- round(counts_data)
rownames(counts_data) <- gsub("\\.\\d+", "", rownames(counts_data))
dds <- DESeqDataSetFromMatrix(
countData = counts_data,
colData = col_data,
design = ~ Group
)
# DEseq
dds <- DESeq(dds)
res <- results(dds)
num_tests <- sum(!is.na(res$pvalue)) #valid results
res$bonferroni <- pmin(res$pvalue * num_tests, 1)
# Get gene names for Ensembl IDs
gene_info <- getBM(attributes = c("ensembl_gene_id", "external_gene_name"),
filters = "ensembl_gene_id",
values = rownames(res),
mart = ensembl)
# Add gene names to the results DataFrame
res$Gene_name <- gene_info$external_gene_name[match(rownames(res), gene_info$ensembl_gene_id)]
write.csv(res, 'C:/Users/lyan/Downloads/TCGA-multiple/GTEX_colon_intenstine-_vs_Liver_Uterus_Pituitary_AdrenalGland+.csv')
I uploaded the UMAP of mutated only
- Small update, I used Combat on the pancancer data and the number of significant features decreased to 8k with the top 10 padj between 10^-50 - 10^-82
Please show code and plots. The result is almost certainly flawed.
I am reluctant to go through all of this in detail, but you are basically downloading the entire TCGA, and stratify by mutations? Not correcting for any batch effects, tissue of origin, and then conduct a simple univariate analysis?
Yes exactly.
Any idea what could go wrong? I noticed that papers are using ready data from Xena
Please edit your post and add contents there. You could add a comment but it's better to add to the main post.
Why are you applying Bonferroni rather than just using the adjusted p-values provided by DESeq2 directly?
I usually use padj indeed but due to the larger number of features and highly significant results, I also added bonferroni.
I'm just commenting on this line:
Note that the small p-value indicates that the observed data is incompatible with the null hypothesis of the true gene expression being exactly the same in the two groups of samples being compared. With enough samples one should be able to convincingly reject that hypothesis even if the difference between groups is not huge. That could explain also why you get 20k genes passing some p-value cutoff.
For genes with tiny p-values, did you check whether the expression difference is unrealistically large and whether these genes are biologically reasonable? Sometimes these next-to-zero p-values are due to one group having virtually no detectable expression for that gene. Is this plausible in your case?
Thank you for the feedback. I understand that small p-values can indicate significant differences with large sample sizes. However, comparing around 400 to 200 samples, it seems implausible to get over 20,000 features with an adjusted p-value < 0.05, especially since this is nearly half of the data. I also filtered out features with low variability (fewer than 10 counts). Any advice on further checks?
Why do you feel this way? You're comparing different tissues. I'd expect broad differences in this manner. Regardless, you might consider using an effect size threshold during testing if you want more robustly different genes.
I agree that comparing different tissues can reveal broad differences, but I still find it unusual to see over 20,000 features with an adjusted p-value < 0.05, especially since this represents nearly half the data. Additionally, the extreme p-values (e.g., p < 10^-99) and the very high log fold change raise (321 features are either lower than -5 or higher than 5) concerns me, as I haven't encountered anything like this before.
Beyond the GTEX issue, although I didn't mention it before, my suspicions began when I compared mutation x in specific cancer to mutation x in other cancers (250 vs 60 samples). Here again, I found around 20,000 significant mutations, with the top 50 showing adjusted p-values < 10^-99. Although this resembles the tissue-specific issue, I still wouldn't expect to see so many significant results with such high significance levels in patients holding the same mutation. Although I would expect to see differences due to other passenger/driver mutations in addition to mutation x, the number of significant results still seems unusually high.
OP is not comparing tissues but the entire TCGA stratified only by mutations, not checking whether number of tissues, batches, anything is balenced between groups. It's an entirely meaningless analysis in the current state. You should compare at least within TCGA datasets, and if you really feel that this mutation has tissue-independent effects that can be generalized then aggregate individual results by meta-analysis.
I am looking for tissue-dependent effects. That's why I compared specific tissues. I agree about the balance between the groups, and thanks for that comment, I will fix this issue.