Hi, I am trying to find differentially bound regions (ChIPseq data) by edgeR. The experiment has two groups (A and B) with only one replicate in each.
I used the following code for the group B vs. group A analysis:
reads <- readCounts.txt
group <- factor(c("A", "B"))
y <- DGEList(counts = reads, group = group)
keep <- filterByExpr(y)
y <- y[keep,,keep.lib.sizes=FALSE]
y <- calcNormFactors(y, method = "TMM")
bcv <- 0.2
et <- exactTest(y, dispersion=bcv^2, pair = c("A", "B"))
FDR <- p.adjust(et$table$PValue, method="BH")
et$table$FDR <- FDR
results_EXT <- as.data.frame(et$table)
results_EXT_sig_dn <- filter(results_EXT, FDR < 0.05 & logFC < 0)
The problem I am facing is, for some regions in the decreased binding (results_EXT_sig_dn) results, the logFC is negative even when the group B reads are higher than group A . I have also looked at the scaled reads (reads multiplied by normalization factors) which are labelled as "_s"
"chr" "start" "end" "logFC" "logCPM" "PValue" "FDR" "reads_A" "reads_B" "reads_A_s" "reads_B_s"
"chr10" "42683394" "42683726" "-1.627117" "8.793078" "0.0001153746" "0.002275910" "2691" "3136" "2852.46" "2947.84"
"chr11" "92788101" "92788347" "-1.604364" "6.331717" "0.0001889196" "0.003331863" "490" "580" "519.40" "545.20"
"chr13" "73985586" "73985745" "-6.205887" "4.309577" "1.710055e-23" "6.610901e-19" "168" "8" "178.08" "7.52"
"chr22" "46897721" "46897850" "-1.748511" "3.734153" "0.0002410538" "0.004000832" "86" "92" "91.16" "86.48"
In the above data, line 1 is the header, lines 2 & 3 are anomalies, and lines 4 & 5 are as expected
I will be thankful if anyone can explain this anomaly.
Thank you for your reply. Please have look at the below and let me know your view. The code I used to join two data frames:
line 5 from two tables:
The logFC calculation by exactTest uses 4 variables: dispersion, offset, prior count, and reads. Dispersion depends on BCV (0.2). Offset and prior.count values are dependent on the library size and norm.factors. Reads are as provided in the table.
I tried my code with another set of data (RNAseq with no replicates) and I did not find this anomaly. This gives me confidence that my code works. I am not able to comprehend why this anomaly is happening to certain regions in the downregulated. Other regions in the downregulated and all upregulated regions' logFC direction (positive or negative) is in coherence with the reads (more or less reads in B).
I think the discrepancy of a
-0.08
log2FoldChange showing up as-1.7
is far more worrisome than whether the gene is up or down-regulatedThe latter it may feel like a big deal because it looks like a binary choice: up or down, plus or minus, but remember that is just an artifact of the logarithm switching sign at 1.
A small change around 1 is meaningless because tiny changes can make it look up or down-regulated.
A huge change, such a gene that show little change showing up with 4x fold change ... now that is a big error
I would suggest looking closely at how you combined the data. Whether or not it seems to work elsewhere is not all that comforting; the whole point of bugs is that they only manifest themselves only in some cases,
Thank you for your reply. I do not disagree with you on the biological importance of the level of fold change. Both the fold change direction and the fold change level are equally important. It would be excellent if I can find solutions for both. I looked at the source code and I am not able to figure out where it can be corrected. I am seeking for pointers toward solutions.
We can't easily troubleshoot because this is about combining data sets, your edgeR data is already combined with coordinates. The problem could also in producing that data.
in general, my advice in such cases is to merge on a single unique column, that way it becomes pretty clear what is going on
just recently, I had a very similar error when I merged gene names onto an edgeR object, I ended up with wrong gene names and noticed it during testing only.
I would go back to the edgeR and closely investigate how you are merging the chromosomal coordinates onto the edgeR result file
Hi, I could not figure it out with edgeR. I was reading multiple posts suggesting NOISeq as an option to analyze the "no replicate" data. The
noiseq
function seems to be specifically designed for "no replicate" data. I didn't find any discrepancies that I found using edgeR for my data. The direction and the level of fold change agree with the reads (normalized) provided. Thank you for the conversation.