Hi everybody, I'm new in this field. I'm trying to replicate a paper to train my self . The results come out pretty the same but not exactly the same, so I wanted to know if all my steps are right or if I'm missing something ( or even completely doing it wrong ). I did follow edgeR guide for help.
Those are the steps ( I jump not important code lines ) :
1 - Prior to computation of differentially expressed transcripts (‘ANOVA testing for differences’), genes with less than five (non-normalized) read counts in all samples were excluded from analyses.
counts <- counts[!apply(counts<5,1,all,na.rm=TRUE),]
2 - Next, data normalization factors were calculated that are incorporated in further analyses by the weighted trimmed mean of M-values (”TMM”) method.
list <- DGEList(counts=counts,group=group)
list <- calcNormFactors(list,method = "TMM")
3 - After fitting of negative binominal models and both common, tagwise and trended dispersion estimates were obtained, differentially expressed transcripts were determined using a generalized linear model (GLM) likelihood ratio test.
design <- model.matrix(~ 0 + group)
list <- estimateDisp(list,design,robust = TRUE)
fit <- glmQLFit(list,design,robust = TRUE)
single.contrast <- makeContrasts(Tumor - HC, levels = design)
test <- glmQLFTest(fit,contrast = single.contrast)
4 - To minimize the effect of batch effect, resulting in possibly false positive results, ‘sequencing batch’ was included as a covariate in the GLM likelihood ratio test design matrix. DONT KNOW HOW TO DO
5 - To determine differentially expressed transcripts, we only focused on transcripts with expression levels of logarithmic counts per million (LogCPM) > 3.
tags <- topTags(test, n=nrow(test$table))$table
tags.logcpm <- tagsA[tags$logCPM>3,]
At this point they obtain 5003 genes , I do obtain 4986 instead. But the logCPM values come out the same , just differences between FDR values (not so much tho).
Important Question : They do mention another thing that is the following -> To determine the specific input gene lists for the classifying algorithms we performed ANOVA testing for differences (as implemented in the R-package edgeR), yielding classifier-specific gene lists. Here they instead retrive 1072 genes that are actually inside my 4986 but I dont really understand what is this ANOVA testing for differences , and on what they did this (step 1? or after step 5?). This ANOVA testing is made to obtain genes for SVM classification so I assume they wanted genes with major differences.
Kevin Blighe please review this post.
I am in the lengthy process of retiring from the field; so, my time for Biostars has now decreased to almost nil.
okay, thank you, Kevin Blighe I can understand. If possible, would you please recommend someone who can review that post for us? If anybody else could help with this then that would be great for us.
My good colleague ATpoint seems to have responded.
thank you so much