Data Transformations for machine learning
6.2 years ago
druggable ▴ 60

Hi,

I have multiple input features for machine learning. They are normalized feature counts derived from ChIP-Seq data, which I wish to use for logistic regression and a neural network. However, the features have different distributions. I am new to machine learning and would like to know whether it is valid to use some of the read counts as they are (because they are already normally distributed) while log-transforming the others.

Thanks, teabonng

machine learning data transformation • 1.9k views

Hi teabonng,

It is unclear how this question is related to bioinformatics, which is the scope of Biostars. Please elaborate or this question might get closed for being off topic.

Cheers,
Wouter


Hi Wouter,

My input features are normalized feature counts based on ChIP-Seq data, which I wish to use for logistic regression. However, the features have different distributions. I am new to machine learning and would like to know whether it is valid to use some of the read counts as they are while log-transforming the others.

Thanks.


Then why don't you mention that?


Please plot the distributions. It is not really good practice to use different distributions, as the models will [by default] assume that they are the same. You will have to standardise the 2 distributions.
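
Alongside plotting, a quick numeric summary can help decide which features are skewed enough to need a transformation. The following is only a minimal sketch using the Python standard library: the feature names and values are hypothetical, and the `skewness` helper (mean of cubed z-scores) is an illustrative choice, not something prescribed in this thread.

```python
import math
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return 0.0  # a constant feature has no skew
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Hypothetical normalised counts for two features (made-up numbers).
features = {
    "roughly_symmetric": [8.0, 9.0, 10.0, 11.0, 12.0],
    "right_skewed": [1.0, 1.0, 2.0, 3.0, 50.0],
}

for name, values in features.items():
    raw = skewness(values)
    # log1p(x) = log(1 + x); the pseudocount of 1 is an assumption
    # that keeps zero counts finite.
    logged = skewness([math.log1p(v) for v in values])
    print(f"{name}: skewness raw={raw:.2f}, after log1p={logged:.2f}")
```

A skewness near zero suggests a roughly symmetric feature, while a large positive value points to a long right tail, which is common for count data.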


Hi Kevin,

Thanks for your reply. I have decided to log-transform all the features, then standardize them so that they have more or less the same range, and then use them as input for the neural network. Would this make sense?

Thanks, teabonng


Sounds good. I have done this before for metabolomics datasets. Just be aware that there is still likely bias in the data somewhere when you do this.
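
The log-transform-then-standardise approach discussed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the exact procedure used by anyone in this thread: the counts are hypothetical, `log1p` (a pseudocount of 1) is assumed so that zero counts stay finite, and the helper names are made up for the example. The mean and standard deviation are estimated on the training data only and then reused for the test data, which avoids leaking test-set information into the model.

```python
import math
import statistics


def log_transform(column):
    """Apply log(1 + x) to every value; assumes non-negative counts."""
    return [math.log1p(v) for v in column]


def fit_standardiser(column):
    """Return the (mean, sd) needed to z-score a column."""
    mean = statistics.fmean(column)
    sd = statistics.pstdev(column)
    # Guard against constant features to avoid division by zero.
    return mean, (sd if sd > 0 else 1.0)


def standardise(column, mean, sd):
    """Z-score a column with previously fitted parameters."""
    return [(v - mean) / sd for v in column]


# Hypothetical normalised counts for one feature (made-up numbers).
train = [0.0, 3.0, 12.0, 150.0, 40.0]
test = [7.0, 90.0]

train_log = log_transform(train)
mean, sd = fit_standardiser(train_log)  # fitted on training data only

train_z = standardise(train_log, mean, sd)
test_z = standardise(log_transform(test), mean, sd)

print("train:", [round(z, 3) for z in train_z])
print("test: ", [round(z, 3) for z in test_z])
```

Each feature would get its own fitted mean and standard deviation; after this step the training columns have mean 0 and unit variance, which puts them on a comparable scale for logistic regression or a neural network.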


It would still help to clarify what you mean by "the different features have different distributions". Are these ChIP-seq normalised counts and metadata? Presumably, at least all of the ChIP-seq data has been processed in the same way.
