I've been working with huge datasets ever since I started in bioinformatics, but now I'm facing a problem with a new dataset that has very few samples.
I have 5 groups, Control and 4 Diseases, whose frequencies vary across a set of 10 features corresponding to the $\log_2(1+2^{-\Delta CT})$ values of gene expression (I had to use a pseudocount to "nullify" my zeros, preventing them from becoming NA or -Inf).
Yet I only have a maximum of 20 entries per group (and a minimum of 3, because the data is full of NAs in some features). My plan is to combine some of these features with clinical values in order to fit a model; I have a complete dataset of 87 of them with very few NAs.
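To make the transformation concrete, here is a minimal sketch of the pseudocount step in Python. The function names and the example ΔCT values are hypothetical and only illustrate the formula above; they are not taken from my actual pipeline.

```python
import math

def rel_expression(delta_ct):
    """Relative expression 2^(-delta_ct) for a single Delta CT value."""
    return 2.0 ** (-delta_ct)

def log2_pseudocount(rel_expr, pseudocount=1.0):
    """log2(pseudocount + rel_expr); with pseudocount=1 a zero maps to 0, not -Inf."""
    return math.log2(pseudocount + rel_expr)

# Hypothetical inputs: an undetected gene (relative expression 0),
# Delta CT = 0 and Delta CT = 3.
values = [0.0, rel_expression(0.0), rel_expression(3.0)]
transformed = [log2_pseudocount(v) for v in values]
print(transformed)
```

With a pseudocount of 1, a zero stays at 0 after the transform, while larger relative expression values are compressed by the logarithm.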
But I'm stuck with:
a) How do I divide the data into train and test sets to fit my models when I have so little of it?
b) How can I do feature selection with my first 8 gene features? I ran some ANOVAs (despite the data not being normal, and the dataset being full of extreme outliers detected by 'identify_outliers()' and easily visible in boxplots), and some MANOVAs with very few features that are "significant" (despite the data not fulfilling the assumptions, such as normality).
c) Should I use a multinomial logistic regression? By the rule of thumb I need about 10 samples per feature, but I don't know any other multiclass models that assign a probability.
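To show why the rule of thumb in (c) worries me, here is a rough back-of-the-envelope sketch in Python. It assumes the best case is 20 complete entries in each of the 5 groups and the worst case is 3 per group; whether the rule should be applied to the total sample size or to the smallest group is itself an assumption, and the helper names are made up for illustration. The parameter count uses the usual baseline-category form of multinomial logistic regression, (classes - 1) x (features + 1).

```python
def feature_budget(n_samples, samples_per_feature=10):
    """Number of features allowed by an 'n samples per feature' rule of thumb."""
    return n_samples // samples_per_feature

def multinomial_param_count(n_classes, n_features):
    """Coefficients (including intercepts) in a baseline-category multinomial logit."""
    return (n_classes - 1) * (n_features + 1)

n_groups = 5
best_case_total = n_groups * 20   # every group has 20 complete entries
worst_case_total = n_groups * 3   # every group has only 3 complete entries

best_budget = feature_budget(best_case_total)
worst_budget = feature_budget(worst_case_total)
params_with_10_features = multinomial_param_count(n_groups, 10)

print(best_budget, worst_budget, params_with_10_features)
```

Even in the best case the budget only just covers the 10 gene features, and a 5-class model on those 10 features already has 44 coefficients to estimate.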
Any recommendations?