Hello everyone,
I have to train a neural network for protein class prediction. I am using protein sequences for training and testing this network. The problem I am facing is that I have too many sequences for each class. What is the best way to reduce the number of sequences in the dataset? I clustered the sequences with CD-HIT at 40% identity, which reduced the dataset to around 12,000 protein sequences (class A) and a hundred thousand protein sequences (class B), but the dataset still seems too big. What should the dataset size be for this purpose? Can I just pick a few hundred protein sequences arbitrarily?
Thanks for the help.
One thing about what you said on balancing the data: should I take both classes in a 1:1 ratio?
Yes. In your case, with class B representing about 90% of the data, a classifier could learn to always return a label of class B, because that alone already gives roughly 90% accuracy without learning anything about the sequences. For ideas on how to deal with imbalanced data, have a look here.
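To make the 1:1 idea concrete, here is a minimal sketch of random undersampling in plain Python. It is only one possible way to balance the classes, not necessarily the approach described behind the link. The function names and the placeholder sequence IDs are made up for illustration, and the class sizes are the approximate numbers from the question.

```python
import random

def majority_baseline_accuracy(n_a, n_b):
    """Accuracy of a classifier that always predicts the larger class."""
    return max(n_a, n_b) / (n_a + n_b)

def undersample_to_balance(class_a, class_b, seed=0):
    """Randomly downsample the larger class so both classes have equal size."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    n = min(len(class_a), len(class_b))
    return rng.sample(class_a, n), rng.sample(class_b, n)

# Placeholder IDs standing in for the real CD-HIT representative sequences
# (hypothetical data; sizes taken from the question).
class_a = [f"A_{i}" for i in range(12000)]
class_b = [f"B_{i}" for i in range(100000)]

# Accuracy you get for free by always answering "class B"
baseline = majority_baseline_accuracy(len(class_a), len(class_b))
print(f"always-B baseline accuracy: {baseline:.3f}")

balanced_a, balanced_b = undersample_to_balance(class_a, class_b)
print(len(balanced_a), len(balanced_b))
```

After balancing, the always-B baseline drops to 50%, so accuracy becomes a more meaningful number. The downside is that most class B sequences are thrown away, so it may be worth comparing against alternatives such as class weights in the loss.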