Question

Question about edgeR p-values/FDR values

0

Entering edit mode

12 months ago

Ronin ▴ 10

I'm running edgeR for differencial accessibility of ATAC-seq peaks, and with my samples I noticed something with the output that I was hoping to get some explanation for. So my results from edgeR with 18 total samples looks like this:

    logFC   logCPM  LR  PValue  FDR
1   4.393527353 5.905829078 572.9930277 1.25E-126   9.64E-122
2   4.380895656 5.608644923 549.0927083 1.98E-121   7.62E-117
3   5.594629126 7.869247982 544.8831178 1.63E-120   4.18E-116
4   4.397254263 5.692977435 539.7046714 2.19E-119   4.20E-115
5   4.571165539 5.632301129 530.9367562 1.77E-117   2.72E-113

I would have expected p-values/FDR values for each of my 18 samples at this particular chromosome region (1-5 on the left side, I changed the chromosome number to be less ugly for the question), yet I only get one. And so my question is: Is this an issue with my code, as one of my advisors suggests, or is it that it gives a sort of mean value across all samples, thus giving only 1 value for P-value/FDR?

Thanks in advance for any insights you might be able to provide.

edgeR • 3.0k views

ADD COMMENT • link 12 months ago by Ronin ▴ 10

0

Entering edit mode

I guess more details on your experimental setup, and the code you used, would help. But here it seems you are just showing the DE test for 1 sample. Depending on how you ran the test, this could represent your 18 samples (but likely incorrectly), unless the 18 samples are supposed to be replicates, then this analysis may be fine.

If you share your code, and what the setup is supposed to be, then we might be able to help more specifically.

ADD REPLY • link 12 months ago by rfran010 ★ 1.7k

0

Entering edit mode

Sure, I am happy to share the code. I'll add a few comments here and there within the code:

library("tximport")
library("readr")
library("tximportData")
library("pasilla")
library("writexl")
library("DESeq2")
library("factoextra")

# Load the count matrix
# Assuming the first column of counts contains gene names, and the rest are sample counts
counts <- read.table("ATAC_all_merged.readCountInPeaks.txt", header = TRUE, row.names = 1)

# Extract only the count columns (starting from the 7th column)
count_data <- counts[, 6:ncol(counts)]

# Check the data to ensure it's loaded correctly
head(count_data)

# Set up the sample information (metadata)
sample_names <- colnames(count_data)  # Use the column names from count_data

# Load sample information (metadata)
groups <- read.table("sample-info.txt", header = TRUE, sep = "\t", stringsAsFactors = TRUE)

countDF = as.matrix(read.table("ATAC_all_merged.readCountInPeaks.txt", header = TRUE, sep = "\t", stringsAsFactors = TRUE))
groups = read.table("sample-info.txt", header = TRUE, sep = "\t", stringsAsFactors = TRUE)

individuals = c("ATAC_01_mapped.sorted.rem_dup.bam", "ATAC_03_mapped.sorted.rem_dup.bam", "ATAC_04_mapped.sorted.rem_dup.bam", "ATAC_05_mapped.sorted.rem_dup.bam","ATAC_07_mapped.sorted.rem_dup.bam", "ATAC_08_mapped.sorted.rem_dup.bam","ATAC_13_mapped.sorted.rem_dup.bam", "ATAC_15_mapped.sorted.rem_dup.bam","ATAC_16_mapped.sorted.rem_dup.bam", "ATAC_17_mapped.sorted.rem_dup.bam","ATAC_19_mapped.sorted.rem_dup.bam", "ATAC_20_mapped.sorted.rem_dup.bam","ATAC_25_mapped.sorted.rem_dup.bam", "ATAC_27_mapped.sorted.rem_dup.bam","ATAC_28_mapped.sorted.rem_dup.bam", "ATAC_29_mapped.sorted.rem_dup.bam","ATAC_31_mapped.sorted.rem_dup.bam","ATAC_32_mapped.sorted.rem_dup.bam")


#DESeq

rnaseqMatrix = count_data
rnaseqMatrix = round(rnaseqMatrix)
rnaseqMatrix = rnaseqMatrix[rowSums(cpm(rnaseqMatrix) > 1) >= 2,]
dim(rnaseqMatrix)

#DESeq ALL
rownames(groups) = colnames(rnaseqMatrix)
ddsFullCountTable.all <- DESeqDataSetFromMatrix(
  countData = rnaseqMatrix,
  colData = groups,
  design = ~ environment + tissue)
dds = DESeq(ddsFullCountTable.all)
resultsNames(dds)

So first I work with DESEq2. I'm working with 18 samples like I said, with 3 tissues (eye/liver/spleen). Maybe at the "DESEq ALL" step there is something I need to modify here, but it works just fine. So now we get to edgeR (see comment #2 due to character length being too long):

ADD REPLY • link 12 months ago by Ronin ▴ 10

0

Entering edit mode

(Part 2)

library(edgeR)

dge <- DGEList(counts = count_data, group = groups$tissue)
print(dge)
table(groups$tissue)
# Calculate normalization factors using TMM normalization
dge <- calcNormFactors(dge)

# Check the normalization factors (optional)
dge$samples

# Estimate common, trended, and tagwise dispersions
dge <- estimateDisp(dge)

# Fit a generalized linear model (GLM) for differential accessibility
fit <- glmFit(dge)

lrt <- glmLRT(fit)

# Filter results for statistically significant genes with FDR < 0.01
significant_results <- topTags(lrt, n = Inf)$table
significant_results <- significant_results[significant_results$FDR < 0.01, ]
significant_results$Category <- ifelse(significant_results$logFC > 0, "yes", "no")

# Save the filtered results to a CSV file
write.csv(significant_results, file = "edgeR_results-redo.csv")

# View the top differentially accessible peaks/regions (optional)
top_results <- head(significant_results, 20)
print(top_results)

# Create a new DGEList with the matched samples
dge <- DGEList(counts = count_data[, colnames(count_data) %in% matched_samples$sample], 
               group = matched_samples$environment)

# Check the new DGEList
print(dge)
# Ensure that the groups are aligned with the DGEList

I'm sure there's something wrong here to explain why I don't get 18 values and instead get one. I am wondering if maybe:

# Filter results for statistically significant genes with FDR < 0.01
significant_results <- topTags(lrt, n = Inf)$table
significant_results <- significant_results[significant_results$FDR < 0.01, ]
significant_results$Category <- ifelse(significant_results$logFC > 0, "yes", "no")

Might be part of it? When I look at 'lrt' I do see my 18 samples are still inside of it, under samples (a factor with 3 levels -- eye, liver, spleen). Perhaps if you want to try running this code, you can use some snippets from my original files. Here's the count matrix:

Geneid  Chr Start   End Strand  Length  ATAC_01_mapped.sorted.rem_dup.bam   ATAC_03_mapped.sorted.rem_dup.bam   ATAC_04_mapped.sorted.rem_dup.bam   ATAC_05_mapped.sorted.rem_dup.bam   ATAC_07_mapped.sorted.rem_dup.bam   ATAC_08_mapped.sorted.rem_dup.bam   ATAC_13_mapped.sorted.rem_dup.bam   ATAC_15_mapped.sorted.rem_dup.bam   ATAC_16_mapped.sorted.rem_dup.bam   ATAC_17_mapped.sorted.rem_dup.bam   ATAC_19_mapped.sorted.rem_dup.bam   ATAC_20_mapped.sorted.rem_dup.bam   ATAC_25_mapped.sorted.rem_dup.bam   ATAC_27_mapped.sorted.rem_dup.bam   ATAC_28_mapped.sorted.rem_dup.bam   ATAC_29_mapped.sorted.rem_dup.bam   ATAC_31_mapped.sorted.rem_dup.bam   ATAC_32_mapped.sorted.rem_dup.bam
Merged-CM020911.1-46586163-4    CM020911.1  46585970    46586361    +   392 111 71  18  75  81  36  91  81  64  77  71  68  51  117 95  58  38  89
Merged-CM020910.1-8392360-4 CM020910.1  8392227 8392504 +   278 55  64  17  52  103 28  124 73  74  45  89  15  45  78  72  119 59  26
Merged-VHII01000025.1-381685-4  VHII01000025.1  381572  381794  +   223 76  94  35  61  68  56  82  126 78  57  45  33  68  67  105 130 127 73
Merged-CM020915.1-24302766-6    CM020915.1  24302599    24302924    +   326 104 53  99  106 152 103 156 80  176 91  132 81  88  90  163 221 73  159
Merged-CM020910.1-30991952-9    CM020910.1  30991609    30992493    +   885 91  207 740 83  188 614 92  238 471 77  236 379 103 164 404 107 180 390
Merged-CM020911.1-24672943-9    CM020911.1  24672790    24673131    +   342 72  186 365 47  228 453 90  287 521 82  222 296 128 144 451 130 200 475
Merged-CM020910.1-22057543-9    CM020910.1  22057303    22057728    +   426 58  183 233 59  227 259 70  212 239 73  129 265 51  157 301 99  201 203
Merged-CM020928.1-33737934-12   CM020928.1  33737726    33738098    +   373 115 166 54  151 232 60  123 194 95  131 221 71  146 162 97  92  108 110

And here's the sample-info.txt:

sample  environment tissue
ATAC_01_mapped.sorted.rem_dup.bam   clear   eye
ATAC_03_mapped.sorted.rem_dup.bam   clear   liver
ATAC_04_mapped.sorted.rem_dup.bam   clear   spleen
ATAC_05_mapped.sorted.rem_dup.bam   clear   eye
ATAC_07_mapped.sorted.rem_dup.bam   clear   liver
ATAC_08_mapped.sorted.rem_dup.bam   clear   spleen
ATAC_13_mapped.sorted.rem_dup.bam   clear   eye
ATAC_15_mapped.sorted.rem_dup.bam   clear   liver
ATAC_16_mapped.sorted.rem_dup.bam   clear   spleen
ATAC_17_mapped.sorted.rem_dup.bam   dark    eye
ATAC_19_mapped.sorted.rem_dup.bam   dark    liver
ATAC_20_mapped.sorted.rem_dup.bam   dark    spleen
ATAC_25_mapped.sorted.rem_dup.bam   dark    eye
ATAC_27_mapped.sorted.rem_dup.bam   dark    liver
ATAC_28_mapped.sorted.rem_dup.bam   dark    spleen
ATAC_29_mapped.sorted.rem_dup.bam   dark    eye
ATAC_31_mapped.sorted.rem_dup.bam   dark    liver
ATAC_32_mapped.sorted.rem_dup.bam   dark    spleen

So maybe you could copy+paste the code and make some input files based on what I pasted, assuming you want to give it a go and assuming a count matrix file this small is allowed. Thanks for the help.

ADD REPLY • link 12 months ago by Ronin ▴ 10

0

Entering edit mode

You want people to help you, make it as easy as possible. Don't just dump the whole script. Pick out just the parts relevant to the issue. Your PCA and heat maps and ggplot calls are not necessary here.

ADD REPLY • link 12 months ago by swbarnes2 15k

0

Entering edit mode

That's true, my apologies for that. I'll edit this down a bit.

ADD REPLY • link 12 months ago by Ronin ▴ 10

score 1 · Answer 1 · 2024-10-29

1

Entering edit mode

12 months ago

i.sudbery 22k

edgeR will only ever give one P-value/FDR per gene - the P-value/FDR for the contrast you asked it about. In your LRT, you set the group to be "tissue". By default edgeR uses the last coefficient in the matrix, which I believe in this case will be "spleen", so you will be testing the difference between spleen and eye (the alphabetically first term). Effectively this means that the p-value/FDR is the p-value on the hypothesis that openness differs between eye and spleen tissue (ignoring environment).

You can tell edgeR to give you different contrasts - between different groups of samples, or even more complex designs. You could for example ask for differences between liver and eye (exactTest(dge, pairs=1:2) or glmLRT(fit, coef=2)), or even, regions whose difference between clear and dark was different in different tissues (mm <- model.matrix(~ environment + tissue + environment:tissue); glmFit(dge, mm)) but you will never get 18 seperate p-values out of this experimental design.

Intrinsically, what edgeR does is compare read counts in a region for differences. The p-value/FDR gives you a measure of how likely two different groups of counts are if they were drawn from the same distribution. To do that you need two groups of counts: you need at least 3 total samples to do a comparison in edgeR (two from one condition, one from the other), but the results are only really properly valid if you have multiple samples in each group.

The fact that you want a p-value for each sample, suggests to me that the question you are trying to answer is "Is there genuinely a peak in this sample?". EdgeR is not the right tool for that job. You want to be using a tool designed to do that. The tool i've used in the past is MACS.

ADD COMMENT • link 12 months ago by i.sudbery 22k

0

Entering edit mode

+1, in particular for this:

The fact that you want a p-value for each sample, suggests to me that the question you are trying to answer is "Is there genuinely a peak in this sample?"

In my experience that is kind of a mess to answer and perhaps not even that relevant. It depends a lot on what statistical model you use, the quality of the library and how you align peaks in different samples. But ultimately: do you really need to know whether a peak is completely absent in one sample as opposed to being just very weak? There is also the complication (I think) that while in RNAseq you know before hand where the genes are. In ChIP/ATACseq you use the same data to detect peaks and to quantify them. I can't really say why but I suspect this makes the underlying statistics deviate from the expected behaviour.

ADD REPLY • link 12 months ago by dariober 15k

0

Entering edit mode

Thank you for the suggestions and the reply. I'd like to do this command for fun:

mm <- model.matrix(~ environment + tissue + environment:tissue); glmFit(dge, mm)

So I'll need to modify my original dge to not involve just tissue, but also environment.

I'm confident that these peaks are true peaks from my previous steps, so this is more about looking for differences in a) tissue type, and b) environmental condition.

ADD REPLY • link 12 months ago by Ronin ▴ 10

0

Entering edit mode

Then you will get one p-value for tissue, and one p-value per environment.

If you want to test the interaction between tissue and environment, then you will also need to be careful about specifying the contrast in glmLRT. This is not a simply business (personally I think the syntax in DESeq is easier to understand here). I recommend reading Chapter 3 of the edgeR user manual.

ADD REPLY • link 12 months ago by i.sudbery 22k

0

Entering edit mode

Thanks for the recommended reading, I will do that today. My fingers are crossed that Chapter 3 can help explain why I can't get the model.matrix to work haha!

ADD REPLY • link 12 months ago by Ronin ▴ 10

score 0 · Answer 2 · 2024-10-29

0

Entering edit mode

12 months ago

dariober 15k

Note: Posted this answer just when OP posted additional info. So this answer may be outdated.

I would have expected p-values/FDR values for each of my 18 samples at this particular chromosome region

tl;dr: In my guess your output looks fine, at least in format. Can't comment on the biology of course.

Maybe there is a bit of confusion about the word sample here (in biology it usually means a specimen, in statistics it means a collection of data points). If you have, say, 9 controls and 9 treated and you do differential analysis you get one estimate of difference per gene or, in your case, per ATAC region. You don't get 18 results per ATAC region. So your output table will have one row per ATAC region.

Each row answers the questions (roughly, roughly!): At this ATAC region what difference do I observe between the mean of the 9 controls and the mean of the 9 treated? How surprising would it be to see such difference in world where the treatment has no effect compared to control? The answers are in the logFC and FDR columns, respectively. Because you compare one mean vs another mean, you get one row per region, not 18.

ADD COMMENT • link 12 months ago by dariober 15k

0

Entering edit mode

Hi, thanks for the comment. My suspicion is that what you said here:

Maybe there is a bit of confusion about the word sample here (in biology it usually means a specimen, in statistics it means a collection of data points). If you have, say, 9 controls and 9 treated and you do differential analysis you get one estimate of difference per gene or, in your case, per ATAC region. You don't get 18 results per ATAC region. So your output table will have one row per ATAC region.

Is correct. But then again, what do I know. I'm trying to make sure I don't screw up my analysis based off some silly mistake, however! We consider clear/dark (my two environmental conditions), but we also have data regarding the tissue sample type.

ADD REPLY • link 12 months ago by Ronin ▴ 10

0

Entering edit mode

I don't understand where you expect to see the 18 samples? In the ouptut of topTags or in the PCA plot? The topTags will give one row per region for the reason I explain in my answer. topTags doesn't give you the expression (or accessibility in case of ATAC) in each sample. If that is what you want, you need additional steps to merge the output of toptags with the matrix of accessibility (a matrix where each row is a gene and each column is a sample, so a matrix with 18 columns). Maybe post a made-up table with the output you expect/want.

ADD REPLY • link 12 months ago by dariober 15k

0

Entering edit mode

I guess my advisor is expecting to see something like this in the edgeR_results.csv file:

    logFC   logCPM  LR  sample1PValue   sample1FDR  sample2PValue   sample2FDR  sample...18PValue   sample...18FDR
Merged-CM020919.1-24169541-6    4.72198508  5.632719224 273.1022299 2.39E-61    1.84E-56    2.39E-61    1.84E-56    2.39E-61    1.84E-56
Merged-CM020909.1-35020143-6    4.248238473 5.119366262 262.1709025 5.77E-59    2.22E-54    5.77E-59    2.22E-54    5.77E-59    2.22E-54
Merged-CM020917.1-28953145-6    4.245528146 5.906171719 258.5623559 3.53E-58    9.05E-54    3.53E-58    9.05E-54    3.53E-58    9.05E-54
Merged-CM020910.1-37768898-8    5.340979945 7.869341544 253.6236375 4.21E-57    8.09E-53    4.21E-57    8.09E-53    4.21E-57    8.09E-53
Merged-CM020928.1-22321886-6    4.048628622 5.554370254 248.0491441 6.91E-56    1.06E-51    6.91E-56    1.06E-51    6.91E-56    1.06E-51
Merged-CM020927.1-28495144-6    4.328959623 5.280633949 247.6908509 8.28E-56    1.06E-51    8.28E-56    1.06E-51    8.28E-56    1.06E-51
Merged-CM020922.1-26324753-6    4.394874595 5.272263773 239.8025921 4.34E-54    4.77E-50    4.34E-54    4.77E-50    4.34E-54    4.77E-50
Merged-CM020916.1-41472586-6    4.293360341 5.60905425  238.0342196 1.06E-53    1.01E-49    1.06E-53    1.01E-49    1.06E-53    1.01E-49

Where we have P-values and FDR values at every ATAC-seq region for each of the 18 samples. What do you think?

ADD REPLY • link 12 months ago by Ronin ▴ 10

0

Entering edit mode

Where we have P-values and FDR values at every ATAC-seq region for each of the 18 samples. What do you think?

No, that doesn't make sense to me. You get one pvalue and FDR for each hypothesis you test, not for each sample. If you do a t.test (and edgeR/DESeq could be thought of as highly sophisticated t-tests) between 9 controls and 9 treated you get one estimate of difference and one associated p-value, not 18.

ADD REPLY • link 12 months ago by dariober 15k

0

Entering edit mode

Got it. Thank you for describing it that way, when I think about it as a sophisticated t-test it makes total sense to me why we get one p-value/FDR value for each row, because each row gives us 1 t-test!

ADD REPLY • link 12 months ago by Ronin ▴ 10