Question

How to reduce protein sequence dataset size for machine learning?

8.0 years ago

rajeshkumar_vinod ▴ 30

Hello everyone,

I have to train a neural network for protein class prediction, using protein sequences for training and testing. The problem I am facing is that I have too many sequences for each class. What is the best way to reduce the number of sequences in the dataset? I clustered the sequences with CD-HIT at 40% identity, which reduced the dataset to around 12,000 protein sequences (class A) and a hundred thousand protein sequences (class B), but the dataset is still too big. What should the dataset size be for this purpose? Can I just choose a few hundred protein sequences arbitrarily?

protein machine learning • 1.7k views

score 0 · Answer 1 · 2016-11-28

8.0 years ago

Jean-Karim Heriche 27k

I am assuming that your problem is that the training set is too big. The usual way to handle this is to train on a random sample of the data. However, looking at your numbers, the real problem seems to be class imbalance (class A: ~1e4 sequences, class B: ~1e5). You should look at balancing the data and/or weighting the error by the class sizes. Also keep in mind that, for neural networks, a common rule of thumb is to have many times (>10x) more training samples than weights in the network. You could try progressive sampling: repeat the training with increasing training set sizes, checking each time whether the accuracy still improves, and stop once it levels off.
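
As a rough sketch of the weighting and progressive sampling ideas (not a definitive implementation): the weights below use one common heuristic, N / (K * n_c), where N is the total number of samples, K the number of classes and n_c the size of class c. The names `class_weights`, `progressive_sampling`, `train_and_score` and `min_gain` are made up for illustration; `train_and_score` stands in for whatever code trains your network on a subset and returns a validation score.

```python
import random
from collections import Counter


def class_weights(labels):
    """Weight each class by N / (K * n_c), so rarer classes count more in the loss."""
    counts = Counter(labels)
    n_total, n_classes = len(labels), len(counts)
    return {c: n_total / (n_classes * n) for c, n in counts.items()}


def progressive_sampling(data, train_and_score, sizes, min_gain=0.005, seed=0):
    """Train on growing random subsets; stop once the score gain drops below min_gain.

    train_and_score is a hypothetical callback: it receives a list of training
    examples and returns a validation score (higher is better).
    """
    rng = random.Random(seed)
    shuffled = data[:]          # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    history = []
    for size in sizes:
        score = train_and_score(shuffled[:size])
        history.append((size, score))
        # Compare against the previous subset size; stop when the curve flattens.
        if len(history) > 1 and score - history[-2][1] < min_gain:
            break
    return history


# Class sizes taken from the question: ~12,000 (A) and ~100,000 (B).
labels = ["A"] * 12_000 + ["B"] * 100_000
weights = class_weights(labels)
print(weights)
```

With these numbers class A gets a weight of roughly 4.7 and class B 0.56, so an error on a class A sequence costs about eight times as much as one on class B. For `progressive_sampling`, something like `sizes=[1000, 2000, 4000, 8000]` would be a reasonable starting point, and the score should be computed on a held-out set that stays the same for every subset size.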

Thanks for the help.

One thing: you mentioned balancing the data. Should I take both classes in a 1:1 ratio?

ADD REPLY • link 8.0 years ago by rajeshkumar_vinod ▴ 30

Yes. In your case, with class B representing about 90% of the data, a classifier could learn to always return class B, because that alone already gives about 90% accuracy. For ideas on how to deal with imbalanced data, have a look here.
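
To make that concrete, here is a minimal sketch, assuming the data is held as a list of (sequence, label) pairs. It computes the accuracy of an "always class B" classifier for the class sizes in the question, then randomly undersamples to a 1:1 ratio. The function name `undersample_to_balance` and the placeholder sequence names are made up for illustration.

```python
import random
from collections import defaultdict


def undersample_to_balance(examples, seed=0):
    """Randomly downsample every class to the size of the smallest class.

    examples is a list of (sequence, label) pairs.
    """
    by_class = defaultdict(list)
    for seq, label in examples:
        by_class[label].append((seq, label))
    n_min = min(len(items) for items in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for items in by_class.values():
        balanced.extend(rng.sample(items, n_min))  # sampling without replacement
    rng.shuffle(balanced)
    return balanced


# Placeholder data with the class sizes from the question.
examples = [(f"seqA{i}", "A") for i in range(12_000)]
examples += [(f"seqB{i}", "B") for i in range(100_000)]

# Accuracy of a classifier that ignores its input and always predicts B.
majority_accuracy = sum(label == "B" for _, label in examples) / len(examples)
print(f"always-B accuracy: {majority_accuracy:.3f}")  # 0.893

balanced = undersample_to_balance(examples)
print(len(balanced))  # 24000
```

Undersampling throws away most of class B, so it is worth comparing against keeping all the data and weighting the error by class instead. Either way, report a per-class metric (for example recall for each class) rather than plain accuracy.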

ADD REPLY • link 8.0 years ago by Jean-Karim Heriche 27k

score 0 · Answer 2 · 2016-11-29

8.0 years ago

rajeshkumar_vinod ▴ 30

Thanks for the help.

One thing: you mentioned balancing the data. Should I take both classes in a 1:1 ratio?
