How to reduce protein sequence dataset size for machine learning?
8.0 years ago

Hello everyone,

I have to train a neural network for protein class prediction, using protein sequences for training and testing. The problem I am facing is that I have too many sequences for each class. What is the best way to reduce the number of sequences in the dataset? I clustered them with CD-HIT at 40% identity, which reduced the dataset to around 12,000 protein sequences (class A) and about a hundred thousand protein sequences (class B), but the dataset is still too big. What should the dataset size be for this purpose? Can I just pick a few hundred protein sequences arbitrarily?

protein machine learning • 1.7k views
8.0 years ago

I am assuming your problem is that the training set is too big. The usual way to handle that is to train on a random sample of the data. However, looking at your numbers, the bigger problem seems to be class imbalance (class A: ~10^4 sequences, class B: ~10^5). You should look at balancing the data and/or weighting the error by class size. Also keep in mind that, as a rule of thumb for neural networks, it is recommended to have many times (>10x) more training samples than weights in the network. You could try progressive sampling: repeat the training with increasing training set sizes, checking each time whether the accuracy still improves, and stop growing the set once it no longer does.
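A minimal sketch of the two balancing options mentioned above, random undersampling to the smallest class and inverse-frequency class weights. The function names and the toy records are made up for illustration (they are not from CD-HIT or any particular library), and the weight formula is the common `n_samples / (n_classes * class_count)` heuristic, which is only one of several reasonable choices.

```python
import random
from collections import Counter


def undersample(records, seed=0):
    """Randomly downsample every class to the size of the smallest class.

    `records` is a list of (sequence, label) pairs (hypothetical format).
    """
    rng = random.Random(seed)
    by_label = {}
    for seq, label in records:
        by_label.setdefault(label, []).append((seq, label))
    n_min = min(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        # Sample without replacement so no sequence is duplicated.
        balanced.extend(rng.sample(items, n_min))
    rng.shuffle(balanced)
    return balanced


def inverse_frequency_weights(labels):
    """Per-class loss weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}


# Toy data with roughly the 1:10 ratio from the question (placeholder sequences).
records = [("SEQ_A%d" % i, "A") for i in range(100)]
records += [("SEQ_B%d" % i, "B") for i in range(1000)]

balanced = undersample(records)
weights = inverse_frequency_weights([label for _, label in records])
```

Undersampling throws away class B data, so if training time allows it may be preferable to keep everything and pass the weights to the loss function instead.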


Thanks for the help.

One thing about what you said on balancing the data: should I take both classes in a 1:1 ratio?


Yes. In your case, with class B making up about 90% of the data, a classifier could learn to always return class B, because that alone already gives roughly 90% accuracy. For ideas on how to deal with imbalanced data, have a look here.
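To make the point concrete, here is a small sketch using the approximate class sizes from the question (12,000 vs 100,000). It scores a "classifier" that always answers B with plain accuracy and with balanced accuracy (the mean of per-class recall). The helper functions are illustrative, written out by hand rather than taken from a library.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += (t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Approximate class sizes after CD-HIT clustering, as given in the question.
y_true = ["A"] * 12000 + ["B"] * 100000
# A degenerate model that ignores its input and always predicts class B.
y_pred = ["B"] * len(y_true)

acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
print("accuracy: %.3f, balanced accuracy: %.3f" % (acc, bal_acc))
```

Plain accuracy looks good here even though class A is never predicted, which is why a balanced metric (or a balanced test set) is worth reporting alongside it.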

