9.1 years ago
ha.hassanzadeh
Hello guys,
I have normalized RNA-seq data as well as methylation data for a couple of hundred samples, and each sample has a couple of hundred thousand features. Before I do feature selection, I need to pre-filter the features so that at least 90% of the useless ones are removed. What method is best for that? Is there an R script or package that does this?
I think the most important question is: how do you define a useless feature? Do you mean a feature that doesn't respond to a treatment or condition? In that case, you can run a differential expression analysis between conditions to "preselect" features. Or do you want to see whether there is some relationship between your methylation and RNA-seq data? Then you could set up a correlation matrix between the RNA-seq counts and the methylation peaks (assuming it is ChIP-Seq?) and only look at features with high enough correlations (I am thinking of something similar to an eQTL analysis).
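To make the correlation idea concrete, here is a minimal sketch in plain Python (the same logic carries over to R). It assumes both data sets are already matched on the same samples in the same order. The function names, the toy values, and the 0.8 cutoff are all made up for illustration. With hundreds of thousands of features you would not loop over every pair like this; you would restrict the comparison to, say, probes near each gene.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def correlated_pairs(expr, meth, min_abs_r=0.8):
    """Return (gene, probe, r) for every pair with |r| >= min_abs_r.

    expr and meth map feature names to per-sample values; the threshold
    is an arbitrary illustrative choice, not a recommended default.
    """
    hits = []
    for gene, e in expr.items():
        for probe, m in meth.items():
            r = pearson(e, m)
            if abs(r) >= min_abs_r:
                hits.append((gene, probe, r))
    return hits

# Toy data: four samples, two expression features, two methylation features.
expr = {"geneA": [1.0, 2.0, 3.0, 4.0], "geneB": [2.0, 2.1, 1.9, 2.0]}
meth = {"cg01": [0.9, 0.7, 0.5, 0.2], "cg02": [0.5, 0.6, 0.6, 0.5]}

hits = correlated_pairs(expr, meth)
for gene, probe, r in hits:
    print(gene, probe, round(r, 3))
```

Only features that show up in at least one retained pair would then go on to the actual feature selection step.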
Aside from removing features that are not expressed at all (simple R commands to do that are easy to find), you can filter based on variance or median absolute deviation. For instance, the M3C package includes a function to do this; see section 5.2 of the package vignette (https://bioconductor.org/packages/devel/bioc/html/M3C.html).
That said, it is relatively simple to write the commands yourself as well.
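As a rough illustration of how little code the hand-written version needs, here is a sketch in plain Python (it translates almost line for line to R). It drops features that are zero in every sample, ranks the rest by median absolute deviation (MAD), and keeps the top fraction. The function names and the toy data are hypothetical. The 10% default simply mirrors the "remove at least 90%" goal from the question. This is not the M3C implementation.

```python
from statistics import median

def mad(values):
    """Median absolute deviation: median of |x - median(x)|."""
    med = median(values)
    return median([abs(v - med) for v in values])

def top_by_mad(features, keep_fraction=0.10):
    """Return names of the most variable features, ranked by MAD.

    features maps a feature name to its per-sample values. Features that
    are zero in every sample are removed before ranking.
    """
    expressed = {name: vals for name, vals in features.items()
                 if any(v != 0 for v in vals)}
    scores = {name: mad(vals) for name, vals in expressed.items()}
    n_keep = max(1, int(len(scores) * keep_fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_keep]

# Toy data: ten features over three samples; f0 is all zeros.
features = {f"f{i}": [0, i, 2 * i] for i in range(10)}

kept = top_by_mad(features, keep_fraction=0.10)
print(kept)
```

Swapping `mad` for a variance function gives the variance-based version of the same filter. MAD is less sensitive to a few outlier samples.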