There are definitely some issues with what you're asking, so I'll try to address them one by one.
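By default, minfi does not remove probes with SNPs at the CpG site, in the probe sequence, or at the single-base extension (SBE) site. You'll need to use the dropLociWithSnps() function, whose maf argument specifies your minor allele frequency cutoff. Note that its snps argument defaults to c("CpG", "SBE"), so add "Probe" if you also want to drop probes with a SNP in the probe sequence.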
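To make the maf cutoff concrete, here is a minimal base-R sketch of the kind of filter involved. The table and its values are invented; the column names follow what I understand getSnpInfo() to return, and the "at or above the cutoff" comparison is my reading of dropLociWithSnps(), so treat both as assumptions. On real data you would simply call dropLociWithSnps() on a genome-mapped object, as in the last comment (gset is a hypothetical object name).

```r
# Toy SNP annotation: one row per probe, NA means no known SNP (values invented)
snp_info <- data.frame(
  CpG_maf   = c(NA, 0.20, NA, 0.01),
  SBE_maf   = c(NA, NA, 0.10, NA),
  Probe_maf = c(0.30, NA, NA, NA),
  row.names = c("cg01", "cg02", "cg03", "cg04")
)

maf_cutoff <- 0.05
snp_cols <- c("CpG_maf", "SBE_maf", "Probe_maf")

# Flag a probe if any selected SNP position has a MAF at or above the cutoff
has_snp <- rowSums(snp_info[, snp_cols] >= maf_cutoff, na.rm = TRUE) > 0
probes_to_drop <- rownames(snp_info)[has_snp]
print(probes_to_drop)

# With minfi loaded and `gset` a genome-mapped object (hypothetical name):
# gset_f <- dropLociWithSnps(gset, snps = c("CpG", "SBE", "Probe"), maf = maf_cutoff)
```

Here cg04 is kept because its only SNP has a MAF of 0.01, below the 0.05 cutoff chosen for the example.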
For detection p-value filtering, the older versions of the minfi guide do include some clues as to how to do it. The idea is to identify the failing probes before normalisation and remove them after normalisation. Here's an example where raw_idat is the raw data read with read.metharray.exp(), which removes probes whose detection p-value is > 0.01 in more than 50% of samples:
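library(minfi)
# Detection p-values from the raw RGChannelSet, computed before normalisation
lumi_dpval <- detectionP(raw_idat, type = "m+u")
lumi_failed <- lumi_dpval > 0.01
# Probes that fail in more than 50% of samples
lumi_dpval_remove <- names(which(rowMeans(lumi_failed) > 0.5))
rm(lumi_dpval, lumi_failed); gc(); set.seed(73)
norm_data <- preprocessFunnorm(raw_idat, bgCorr = TRUE, dyeCorr = TRUE, verbose = TRUE)
# Drop flagged probes by name; a logical index is safe even if none were flagged
keep <- !(rownames(norm_data) %in% lumi_dpval_remove)
norm_data_f <- norm_data[keep, ]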
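If the thresholding logic is unclear, the same filter can be sketched on a toy matrix in base R, no minfi needed. All values and probe names below are invented for illustration; det_p stands in for the output of detectionP() and beta for the normalised data.

```r
# Toy detection p-value matrix: 4 probes x 4 samples (values invented)
det_p <- matrix(
  c(0.001, 0.002, 0.001, 0.003,   # cg01: passes in every sample
    0.200, 0.300, 0.001, 0.500,   # cg02: fails in 3 of 4 samples
    0.020, 0.001, 0.001, 0.001,   # cg03: fails in 1 of 4 samples
    0.050, 0.060, 0.001, 0.002),  # cg04: fails in exactly 2 of 4 samples
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("cg0", 1:4), paste0("s", 1:4))
)

# Fraction of samples in which each probe fails the 0.01 threshold
failed <- det_p > 0.01
frac_failed <- rowMeans(failed)
probes_to_remove <- names(frac_failed)[frac_failed > 0.5]

# Stand-in for normalised data; drop flagged probes by name
beta <- matrix(runif(16), nrow = 4, dimnames = dimnames(det_p))
beta_f <- beta[!(rownames(beta) %in% probes_to_remove), , drop = FALSE]
print(rownames(beta_f))
```

Note that cg04 survives: it fails in exactly 50% of samples, and the rule is strictly "more than 50%". Use >= if you want the boundary case removed too.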
In terms of normalisation, pick a single method and do not combine them except in very specific circumstances. preprocessFunnorm() does not run SWAN: with bgCorr = TRUE and dyeCorr = TRUE it first applies noob background and dye-bias correction, then regresses out technical variation based on the control probes. Also, it should be noted that preprocessSWAN() uses a random subset of probes, so SWAN is the one that needs set.seed() for reproducible results. As far as I know preprocessFunnorm() is deterministic, but setting the seed first as in my example above does no harm.
If you're still convinced that you should be using both preprocessFunnorm() and preprocessSWAN(), then please expand on why.
For bead count information you need to load your IDATs as an extended RGChannelSet (read.metharray.exp() with extended = TRUE) and then either use getNBeads() to get a matrix of bead counts, or wateRmelon::beadcount(). I would recommend the latter because getNBeads() returns bead counts on a per-probe basis instead of a per-CpG-site basis (use dim() on the matrices returned by both functions to see what I mean).
Once you get beadcount info you can filter based on your own thresholds using subsetByLoci().
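As a rough sketch of what "your own thresholds" could look like, the base-R example below flags probes with too few beads in too many samples. The matrix, the minimum of 3 beads, and the 25% sample fraction are all assumptions picked for illustration, not recommendations; rgset in the last comment is a hypothetical object name.

```r
# Toy per-CpG bead count matrix: 3 probes x 4 samples (values invented)
bead_n <- matrix(
  c(12, 15,  9, 11,   # cg01: well covered everywhere
     2, 14, 10,  1,   # cg02: low bead count in 2 of 4 samples
     8,  7,  2, 13),  # cg03: low bead count in 1 of 4 samples
  nrow = 3, byrow = TRUE,
  dimnames = list(c("cg01", "cg02", "cg03"), paste0("s", 1:4))
)

min_beads    <- 3     # assumed minimum acceptable bead count
max_frac_low <- 0.25  # assumed tolerated fraction of low-count samples

# Fraction of samples in which each probe falls below the minimum
low <- bead_n < min_beads
frac_low <- rowMeans(low)
probes_low_beads <- names(frac_low)[frac_low > max_frac_low]
print(probes_low_beads)

# With minfi loaded, the flagged loci could then be excluded, e.g.:
# rgset_f <- subsetByLoci(rgset, excludeLoci = probes_low_beads)
```

If your bead count matrix already encodes low counts as NA (which, as far as I recall, is how wateRmelon::beadcount() reports counts below 3), replace the comparison with is.na(bead_n).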
Most people use only one normalization method, chosen depending on dataset characteristics and personal preference (at least that's what it seems like to me). I wouldn't try to use two if you are not sure what you are doing.
Good luck