6.2 years ago
druggable
Hi,
I have multiple features as input for machine learning. They are normalized feature counts based on ChIP-Seq data, which I wish to use for logistic regression and a neural network. However, the features have different distributions. I am new to machine learning and would like to know whether it is valid to use some of the read counts as they are (because they are already normally distributed) while log-transforming the others.
Thanks, teabonng
Hi teabonng,
It is unclear how this question is related to bioinformatics, which is the scope of Biostars. Please elaborate or this question might get closed for being off topic.
Cheers,
Wouter
Hi Wouter,
My input features are normalized feature counts based on ChIP-Seq data, which I wish to use for logistic regression. However, the features have different distributions. I am new to machine learning and would like to know whether it is valid to use some of the read counts as they are while log-transforming the others.
Thanks.
Then why don't you mention that?
Please plot the distributions. It is not really good practice to use features with different distributions, as the models will [by default] assume that they are the same. You will have to standardise the two distributions.
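As a numeric complement to plotting, here is a minimal sketch that compares the shape of each feature before and after a log transform using skewness. The feature names and values are made up for illustration, and `log1p` (log of count + 1) is an assumed choice to cope with zero counts; neither comes from the original data.

```python
import math
import statistics

def skewness(values):
    """Population skewness (third standardised moment); near 0 means symmetric."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return 0.0  # constant feature: no shape to measure
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

# Hypothetical normalised counts for two features (illustrative only)
features = {
    "feature_a": [4.8, 5.1, 5.0, 4.9, 5.2, 5.0],      # roughly symmetric
    "feature_b": [1.0, 2.0, 1.5, 3.0, 40.0, 250.0],   # long right tail
}

for name, values in features.items():
    raw = skewness(values)
    logged = skewness([math.log1p(v) for v in values])
    print(f"{name}: skew raw={raw:.2f}, skew log1p={logged:.2f}")
```

A strongly positive skew that shrinks after the log transform suggests the feature is a candidate for transformation. A histogram per feature is still worth looking at.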
Hi Kevin,
Thanks for your reply. I have decided to log-transform all the features, then standardize them so that they have more or less the same range, and then use them as input for the neural network. Would this make sense?
Thanks, teabonng
Sounds good. I have done this before for metabolomics datasets. Just be aware that there is still likely bias in the data somewhere when you do this.
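The approach discussed here (log-transform, then standardise) could be sketched as follows for a single feature. This is only an illustration: the function name `log_then_standardise`, the pseudocount of 1, and the example counts are assumptions, not something taken from the original data.

```python
import math
import statistics

def log_then_standardise(values, pseudocount=1.0):
    """Log-transform one feature, then z-score it (mean 0, standard deviation 1).

    The pseudocount is an assumed choice to avoid log(0) for zero counts.
    """
    logged = [math.log(v + pseudocount) for v in values]
    mean = statistics.fmean(logged)
    sd = statistics.pstdev(logged)
    if sd == 0:
        return [0.0 for _ in logged]  # constant feature carries no information
    return [(x - mean) / sd for x in logged]

# Hypothetical normalised counts for one feature across five regions
counts = [0.0, 3.0, 12.0, 150.0, 2000.0]
scaled = log_then_standardise(counts)
print([round(x, 2) for x in scaled])
```

Each feature would be processed separately in this way. When the data are split into training and test sets, the mean and standard deviation are usually estimated on the training set only and then reused for the test set.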
It would still help to clarify what you mean by "the different features have different distributions". Are these ChIP-seq normalised counts and metadata? Presumably, at least all of the ChIP-seq data has been processed in the same way.