Question

DEseq2 on expected counts

0

Entering edit mode

4 weeks ago

lyan125 • 0

Hi all,

I've been working with Pancancer's TOIL RSEM expected_count data downloaded from the Xena browser to compare two driver mutations (Mutation X vs. Mutation Y) using cBioPortal. After performing differential expression (DE) analysis using DEseq2, I was surprised to find more than 20k significant features out of a total of 60k, despite the combined sample size being less than 650 mutations.

A similar issue occurred when I compared a specific tissue type to all others using the same TOIL RSEM expected_count dataset, though this time with a larger sample size.

I'm concerned there might be a batch effect at play here. In my Pancancer analysis, DE results yielded "impossibly significant" p-values (e.g., <10^-99).

Does anyone have insights into what might be causing these unexpectedly significant results or advice on how to address potential batch effects in this context?

Loaded TOIL RSEM expected_count data, removed samples with therapy, and filtered low-variance genes. Processed mutation data to focus on two driver mutations. Set up DESeq2 analysis using the filtered mutation groups. Observed the high number of significant features.

Code:

Providing the first part (Pancancer):

library(DESeq2)
library(dplyr)
library(survival)
library(survminer) 
library(sva)
library(ggplot2) library(limma) 
library(readxl)
library(randomForest)
library(pROC)
library(caret) 
library(biomaRt)
library(org.Hs.eg.db) 
library(xCell) 
library(NMF) 
library(reshape2)
library(doParallel)
# Directory
clinical_path <- "C:/Users/lyan/Downloads/TCGA-multiple/combined_study_clinical_data.tsv"
mutation_path <- "C:/Users/lyan/Downloads/TCGA-multiple/mutation_data.csv"
file_path <- "C:/Users/lyan/Downloads/TCGA-multiple/tcga_pan_cancer_raw_counts" #TOIL RSEM expected_count data

# Load clinical data
clinical <- read.csv(clinical_path, sep = '\t', row.names = 3)
rownames(clinical) <- gsub("-", ".", rownames(clinical))
# HTseq_counts (it's expected counts but I am used to call it HTseq)
HTseq_counts <- read.table(file_path, header = TRUE, row.names = 1, stringsAsFactors = FALSE)
# Removed after therapy and known
samples_to_remove <- rownames(clinical)[
  (clinical$Radiation.Therapy == "Yes" |
     clinical$Neoadjuvant.Therapy.Type.Administered.Prior.To.Resection.Text == "Yes")]
HTseq_counts <- HTseq_counts[, !(colnames(HTseq_counts) %in% samples_to_remove)]

HTseq_counts <- HTseq_counts[, grep(colnames(HTseq_counts), pattern = ".01$", invert = FALSE, perl = TRUE)] # primary

HTseq.counts.raw <- 2^(HTseq_counts) - 1 # cancel log normalization (x+1)
rm(HTseq_counts)


HTseq.counts.raw <- HTseq.counts.raw[rowSums(HTseq.counts.raw) > 10, ] # remove low  variable genes
# Load mutation data
mutations <- read.csv(mutation_path, stringsAsFactors = FALSE, row.names= 2 )

# SORRY for the mess, at the end I kept only 2 driver mutations here
mutations$Mut <- ifelse(mutations$EGFR!= "WT" & grepl("driver", mutations$EGFR) & mutations$KRAS == "WT",
                        "EGFR",
                        ifelse(mutations$KRAS!= "WT" & grepl("driver", mutations$KRAS ) & mutations$EGFR== "WT",
                               "KRAS",
                               ifelse(mutations$EGFR!= "WT" & grepl("driver", mutations$EGFR) & mutations$KRAS!= "WT" & !grepl("driver", mutations$KRAS),
                                      NA,  # Remove row
                                      ifelse(mutations$KRAS!= "WT" & grepl("driver", mutations$KRAS) & mutations$EGFR!= "WT" & !grepl("driver", mutations$EGFR),
                                             NA,  # Remove row
                                             ifelse(mutations$EGFR!= "WT" & grepl("driver", mutations$EGFR) & mutations$KRAS!= "WT" & grepl("driver", mutations$KRAS),
                                                    NA,  # Remove row
                                                    ifelse(mutations$EGFR== "WT" & mutations$KRAS== "WT",
                                                           NA,  # Remove row
                                                           NA  # Default to NA for any other cases
                                                    ))))))
rownames(mutations) <- gsub("-", ".", rownames(mutations))

matching_indices <- match(colnames(HTseq.counts.raw), rownames(mutations)) 
mutation_row <- rep(NA, ncol(HTseq.counts.raw))
mutation_row[!is.na(matching_indices)] <- mutations$Mut[matching_indices[!is.na(matching_indices)]] # remove NA or non intersected

# Add mutation_row as a new row in raw data
HTseq.counts_raw_with_mutations <- rbind(HTseq.counts.raw, mutation_type = mutation_row)
HTseq.counts_raw_with_mutations <- HTseq.counts_raw_with_mutations[, !is.na(HTseq.counts_raw_with_mutations["mutation_type", ])] # remove last row
# Dataframe with mutated
col_data <- data.frame(
  SampleID = colnames(HTseq.counts_raw_with_mutations),
  Group = NA  # Initialize Group column
)

col_data$Group[HTseq.counts_raw_with_mutations["mutation_type", ] == "KRAS"] <- "KRAS"
col_data$Group[HTseq.counts_raw_with_mutations["mutation_type", ] == "EGFR"] <- "EGFR"



HTseq.counts_raw_with_mutations <- HTseq.counts_raw_with_mutations[!(rownames(HTseq.counts_raw_with_mutations) == "mutation_type"), , drop = FALSE] 
rm(HTseq.counts.raw)
preserve_row_names <- rownames(HTseq.counts_raw_with_mutations)
count_data <- apply(HTseq.counts_raw_with_mutations, 2, as.numeric)
rownames(count_data) <- preserve_row_names
count_data <- round(count_data)
rownames(count_data) <- gsub("\\.\\d+", "", rownames(count_data))
# Create DESeq2 dataset
dds <- DESeqDataSetFromMatrix(countData = count_data, colData = col_data, design = ~ Group)

# Run DESeq
dds <- DESeq(dds)
res <- results(dds)

num_tests <- sum(!is.na(res$pvalue)) #valid results
res$bonferroni <- pmin(res$pvalue * num_tests, 1)

ensembl <- useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl")

# Get gene names for Ensembl IDs
gene_info <- getBM(attributes = c("ensembl_gene_id", "external_gene_name"),
                   filters = "ensembl_gene_id",
                   values = rownames(res),
                   mart = ensembl)

# Add gene names to the results DataFrame
res$Gene_name <- gene_info$external_gene_name[match(rownames(res), gene_info$ensembl_gene_id)]

res

#####################################################################################
Part 2 GTEX:
file_path <- "C:/Users/lyan/Downloads/TCGA-multiple/gtex_raw_counts" TOIL RSEM expected_count dataset
tissue_info <- "C:/Users/lyan/Downloads/TCGA-multiple/GTEX_tissue_info"

# Load GTEX data
GTEX_counts <- read.table(file_path, header = TRUE, row.names = 1, stringsAsFactors = FALSE)
tissue = read.table(tissue_info, header = TRUE, row.names=1, sep = "\t", stringsAsFactors = FALSE)
rownames(tissue) <- gsub("-", ".", rownames(tissue))
matching_samples <- colnames(GTEX_counts) %in% rownames(tissue)
GTEX_counts <- 2^(GTEX_counts) - 1  # cancel log normalization (x+1)
GTEX_counts <- GTEX_counts[rowSums(GTEX_counts) > 10, ]

if (all(matching_samples)) {
  # Create a new row for tissue names
  tissue_names <- tissue[colnames(GTEX_counts), "X_primary_site"]

  GTEX_counts <- rbind(GTEX_counts, tissue_name = tissue_names)

}
filtered_GTEX_counts <- GTEX_counts[, GTEX_counts["tissue_name", ] %in%
                                      c("Colon", "Small Intestine","Adrenal Gland", "Liver")]  # Tissues of interest
tissue_names_vector <- setNames(tissue[colnames(filtered_GTEX_counts), "X_primary_site"], colnames(filtered_GTEX_counts))

col_data <- data.frame(SampleID = colnames(filtered_GTEX_counts), stringsAsFactors = FALSE)

# Assign tissue names to col_data
col_data$tissue_name <- tissue_names_vector[col_data$SampleID]
# Dataframe represtning tissues of interest 
col_data$Group <- ifelse(col_data$tissue_name %in% c("Colon", "Small Intestine"),
                         "KRAS_fan", # Mut
                         ifelse(col_data$tissue_name %in% c("Adrenal Gland", "Liver"),
                                "EGFR_fan", #  WT
                                NA))

filtered_GTEX_counts = filtered_GTEX_counts[-nrow(filtered_GTEX_counts), ] # remove group last row
preserve_row_names <- rownames(filtered_GTEX_counts)
counts_data <- apply(filtered_GTEX_counts, 2, as.numeric)
rownames(counts_data) <- preserve_row_names
counts_data <- round(counts_data)
rownames(counts_data) <- gsub("\\.\\d+", "", rownames(counts_data))
dds <- DESeqDataSetFromMatrix(
  countData = counts_data,
  colData = col_data,                
  design = ~ Group                    
)
# DEseq
dds <- DESeq(dds)
res <- results(dds)

num_tests <- sum(!is.na(res$pvalue)) #valid results
res$bonferroni <- pmin(res$pvalue * num_tests, 1)

# Get gene names for Ensembl IDs
gene_info <- getBM(attributes = c("ensembl_gene_id", "external_gene_name"),
                   filters = "ensembl_gene_id",
                   values = rownames(res),
                   mart = ensembl)

# Add gene names to the results DataFrame
res$Gene_name <- gene_info$external_gene_name[match(rownames(res), gene_info$ensembl_gene_id)]

write.csv(res, 'C:/Users/lyan/Downloads/TCGA-multiple/GTEX_colon_intenstine-_vs_Liver_Uterus_Pituitary_AdrenalGland+.csv')

enter image description here

I uploaded the UMAP of mutated only

Small update, I used Combat on the pancancer data and the number of significant features decreased to 8k with the top 10 padj between 10^-50 - 10^-82

DEseq2 Normalize raw-counts • 1.1k views

ADD COMMENT • link updated 18 days ago by Kevin Blighe 88k • written 4 weeks ago by lyan125 • 0

0

Entering edit mode

Please show code and plots. The result is almost certainly flawed.

ADD REPLY • link 4 weeks ago by ATpoint 84k

0

Entering edit mode

I am reluctant to go through all of this in detail, but you are basically downloading the entire TCGA, and stratify by mutations? Not correcting for any batch effects, tissue of origin, and then conduct a simple univariate analysis?

ADD REPLY • link 4 weeks ago by ATpoint 84k

0

Entering edit mode

Yes exactly.

ADD REPLY • link 4 weeks ago by lyan125 • 0

0

Entering edit mode

Any idea what could go wrong? I noticed that papers are using ready data from Xena

ADD REPLY • link 29 days ago by lyan125 • 0

0

Entering edit mode

Please edit your post and add contents there. You could add a comment but it's better to add to the main post.

ADD REPLY • link 4 weeks ago by Ram 44k

0

Entering edit mode

Why are you applying Bonferroni rather than just using the adjusted p-values provided by DESeq2 directly?

ADD REPLY • link 4 weeks ago by jared.andrews07 ★ 17k

0

Entering edit mode

I usually use padj indeed but due to the larger number of features and highly significant results, I also added bonferroni.

ADD REPLY • link 4 weeks ago by lyan125 • 0

0

Entering edit mode

I'm just commenting on this line:

DE results yielded "impossibly significant" p-values (e.g., <10^-99)

Note that the small p-value indicates that the observed data is incompatible with the null hypothesis of the true gene expression being exactly the same in the two groups of samples being compared. With enough samples one should be able to convincingly reject that hypothesis even if the difference between groups is not huge. That could explain also why you get 20k genes passing some p-value cutoff.

For genes with tiny p-values, did you check whether the expression difference is unrealistically large and whether these genes are biologically reasonable? Sometimes these next-to-zero p-values are due to one group having virtually no detectable expression for that gene. Is this plausible in your case?

ADD REPLY • link 29 days ago by dariober 14k

0

Entering edit mode

Thank you for the feedback. I understand that small p-values can indicate significant differences with large sample sizes. However, comparing around 400 to 200 samples, it seems implausible to get over 20,000 features with an adjusted p-value < 0.05, especially since this is nearly half of the data. I also filtered out features with low variability (fewer than 10 counts). Any advice on further checks?

ADD REPLY • link 29 days ago by lyan125 • 0

0

Entering edit mode

However, comparing around 400 to 200 samples, it seems implausible to get over 20,000 features with an adjusted p-value < 0.05, especially since this is nearly half of the data.

Why do you feel this way? You're comparing different tissues. I'd expect broad differences in this manner. Regardless, you might consider using an effect size threshold during testing if you want more robustly different genes.

ADD REPLY • link 29 days ago by jared.andrews07 ★ 17k

0

Entering edit mode

I agree that comparing different tissues can reveal broad differences, but I still find it unusual to see over 20,000 features with an adjusted p-value < 0.05, especially since this represents nearly half the data. Additionally, the extreme p-values (e.g., p < 10^-99) and the very high log fold change raise (321 features are either lower than -5 or higher than 5) concerns me, as I haven't encountered anything like this before.

Beyond the GTEX issue, although I didn't mention it before, my suspicions began when I compared mutation x in specific cancer to mutation x in other cancers (250 vs 60 samples). Here again, I found around 20,000 significant mutations, with the top 50 showing adjusted p-values < 10^-99. Although this resembles the tissue-specific issue, I still wouldn't expect to see so many significant results with such high significance levels in patients holding the same mutation. Although I would expect to see differences due to other passenger/driver mutations in addition to mutation x, the number of significant results still seems unusually high.

ADD REPLY • link 28 days ago by lyan125 • 0

0

Entering edit mode

OP is not comparing tissues but the entire TCGA stratified only by mutations, not checking whether number of tissues, batches, anything is balenced between groups. It's an entirely meaningless analysis in the current state. You should compare at least within TCGA datasets, and if you really feel that this mutation has tissue-independent effects that can be generalized then aggregate individual results by meta-analysis.

ADD REPLY • link 28 days ago by ATpoint 84k

0

Entering edit mode

I am looking for tissue-dependent effects. That's why I compared specific tissues. I agree about the balance between the groups, and thanks for that comment, I will fix this issue.

ADD REPLY • link 27 days ago by lyan125 • 0

score 0 · Answer 1 · 2024-08-26

I've been working with Pancancer's TOIL RSEM expected_count data downloaded from the Xena browser to compare two driver mutations (Mutation X vs. Mutation Y) using cBioPortal.

I would perform the same analysis but stratified by cancer type. After this, one would have multiple differential expression results tables, on which one could perform a meta-analysis, thus avoiding the need to perform an actual batch-effect correction.

Kevin