I have a set of genotypic and phenotypic features like this:
          SNP1  SNP2  SNP3  survival  blood_pressure  gender
patient1  0     1     1     23        24              0
patient2  1     0     2     34        4               1
patient3  1     1     2     43        23              1
patient4  2     1     0     23        3               2
I want to do feature selection on these mixed continuous and categorical features before feeding them into a machine learning algorithm. Does anyone know of a Python library (or Python code) that is suitable for this?
Amazing links and advice. Just one point on categorical variables: SNPs and Gender are categorical, aren't they?
The way the data is formatted in the original post, all features are numerical. As you correctly noted, SNPs and Gender are discrete numerical values. It is safe to assume that the Gender column doesn't have many unique states, but it is impossible to know from the sample how many there are for the SNPs.
The tree methods I mentioned above, for example gradient boosting machines, can be instructed specifically to treat columns as categorical even when their contents are purely numerical. For these methods the distinction probably wouldn't make much of a difference.
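To make the distinction concrete, here is a toy sketch in plain Python of a single tree split evaluated two ways: as an ordered threshold (`value <= t`) and as an unordered category test (`value == c`). It is not how any particular gradient boosting library implements categorical splits. The function names and the six-sample `snp`/`outcome` data are made up for illustration, with the outcome deliberately tied to genotype 1 only.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(left, right):
    """Size-weighted impurity of the two children of a split."""
    n = len(left) + len(right)
    return (len(left) * gini(left) + len(right) * gini(right)) / n


def best_numeric_split(values, labels):
    """Best split of the form 'value <= t' (column treated as ordered)."""
    best_score, best_t = float("inf"), None
    for t in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = weighted_gini(left, right)
        if score < best_score:
            best_score, best_t = score, t
    return best_score, best_t


def best_categorical_split(values, labels):
    """Best split of the form 'value == c' (column treated as unordered)."""
    best_score, best_c = float("inf"), None
    for c in sorted(set(values)):
        left = [y for v, y in zip(values, labels) if v == c]
        right = [y for v, y in zip(values, labels) if v != c]
        score = weighted_gini(left, right)
        if score < best_score:
            best_score, best_c = score, c
    return best_score, best_c


# Hypothetical data: the outcome is 1 only for genotype 1.
snp = [0, 0, 1, 1, 2, 2]
outcome = [0, 0, 1, 1, 0, 0]

numeric_result = best_numeric_split(snp, outcome)
categorical_result = best_categorical_split(snp, outcome)
print("numeric split (impurity, threshold):", numeric_result)
print("categorical split (impurity, category):", categorical_result)
```

On this contrived data a single category test separates the classes perfectly, while a single threshold cannot. A deeper tree would get there with two threshold splits, which is why for a column with only a few levels the choice usually matters little to tree ensembles.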
To linear models, however, it does matter when numerical values are meant to represent categories rather than smaller/greater relationships. In such a case some kind of feature encoding is needed, such as weight of evidence or frequency encoding.
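As a rough sketch of the two encodings just mentioned, here they are in plain Python with no external libraries. The function names, the `smoothing` parameter and the binary `target` column are assumptions made for illustration. The `gender` values are taken from the question's table, but the question has no binary outcome, so `target` is invented. Weight of evidence is written here as ln(P(category | y=1) / P(category | y=0)); some references use the opposite sign.

```python
import math
from collections import Counter


def frequency_encode(values):
    """Replace each category with its relative frequency in the column."""
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in values]


def woe_encode(values, target, smoothing=0.5):
    """Replace each category with its weight of evidence against a 0/1 target.

    `smoothing` is added to every count so that categories seen in only one
    class do not produce log(0) or a division by zero (an assumption of this
    sketch, not a fixed part of the definition).
    """
    categories = sorted(set(values))
    pos = Counter(v for v, y in zip(values, target) if y == 1)
    neg = Counter(v for v, y in zip(values, target) if y == 0)
    total_pos = sum(pos.values()) + smoothing * len(categories)
    total_neg = sum(neg.values()) + smoothing * len(categories)
    table = {
        c: math.log(
            ((pos[c] + smoothing) / total_pos) / ((neg[c] + smoothing) / total_neg)
        )
        for c in categories
    }
    return [table[v] for v in values]


gender = [0, 1, 1, 2]   # gender column from the question's table
target = [0, 1, 1, 0]   # hypothetical binary outcome, not in the original data

freq = frequency_encode(gender)
woe = woe_encode(gender, target)
print(freq)
print(woe)
```

Either encoding replaces the arbitrary codes 0/1/2 with numbers that carry some meaning for a linear model. Weight of evidence uses the target, so in practice it should be fitted on the training split only to avoid leaking label information.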