Hello everyone. I have a data set containing 1406 class 0 and 1406 class 1 instances. I want to split it into training and test sets using Python's sklearn library, and I want the training set to remain balanced after splitting. Does the sklearn package handle this? I would appreciate your help.
If you are splitting proteins into a training and test set you also want to eliminate pairs of homologous proteins across the training/test set, otherwise you might just end up learning how to recognise homology. As an absolute minimum there shouldn't be any sequences in the test set with >30% sequence identity to the training set (e.g. using blastclust). However it is better to split taking into account evolutionary classifications such as ECOD/CATH, as proteins can be homologous below 30% sequence identity. See https://www.nature.com/articles/s41580-019-0176-5 for more.
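To illustrate the point about keeping homologues on one side of the split, here is a minimal sketch using scikit-learn's `GroupShuffleSplit`. It assumes you already have one cluster or superfamily label per protein (for example from a clustering tool or a CATH/ECOD assignment). The `groups` array and the random data below are made-up placeholders, not real proteins.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Placeholder data: 200 proteins, 20 features each, binary labels.
rng = np.random.default_rng(0)
n_proteins = 200
X = rng.random((n_proteins, 20))
y = rng.integers(0, 2, size=n_proteins)

# Hypothetical homology group id per protein (e.g. a cluster or superfamily).
groups = rng.integers(0, 40, size=n_proteins)

# Every protein sharing a group id ends up on the same side of the split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

shared = set(groups[train_idx]) & set(groups[test_idx])
print("groups shared between train and test:", len(shared))
```

Note that `test_size` here refers to the fraction of groups, not of proteins, and that this splitter does not balance the classes, so the class counts in each subset should be checked afterwards.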
I have already removed similar sequences with CD-HIT to reduce redundancy, using a 40% cutoff. I will read the article anyway. Thank you so much for your valuable help.
Is this a bioinformatics question? It's not obvious what your data are from the description "class 0" and "class 1".
I'm sorry. I should have mentioned the types of my data. Yes it’s a bioinformatics question. My classes represent thermophilic and mesophilic proteins and the length of each feature vector is 20 (amino acid composition) for each protein.
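For readers unfamiliar with the feature described here, this is a rough sketch of how a 20-element amino acid composition vector can be computed. It is one common definition, not necessarily the exact one used in the question; the function name and the choice to ignore non-standard residues are assumptions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in `sequence`."""
    counts = Counter(sequence.upper())
    # Assumption: non-standard symbols (X, B, Z, ...) are ignored.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Made-up example sequence, for illustration only.
features = amino_acid_composition("MKVLAAGIVGLK")
print(len(features))
```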
Can you also define what you mean by having 'balanced' data sets in this context?
If you have more instances of the thermophilic class than of the other class (here, the mesophilic class), your results will be biased toward the majority class (in this example, the thermophilic proteins). Hence, to obtain reliable results you should balance your data set before training. You can read more here
I see, you mean balanced in terms of pure numbers.
I'm no ML expert, but intuitively I would assume you can simply randomly choose an equal number from each class since your input data is already balanced?
Yes, it’s already balanced, but I’m not sure whether it will remain balanced after splitting. I could do it myself, but I want to do it with Python’s scikit-learn library, and I’m not sure whether scikit-learn handles this.
You can follow this link here and look for the response by Guiem Bosch; it might work if you try it. I could have tested it, but you did not provide a small example of the code and data that you tried.
Yes, you can split your data into training and test sets with sklearn: https://machinelearningmastery.com/evaluate-performance-machine-learning-algorithms-python-using-resampling/
You can check the above site for many other examples with code.
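For the specific question of keeping the classes balanced after the split, a minimal sketch with `train_test_split` and its `stratify` argument might look like this. The random feature matrix is only a placeholder for the real 20-column composition data, and the 80/20 ratio and `random_state` are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the real features:
# 2812 proteins x 20 amino acid fractions, 1406 labels per class.
rng = np.random.default_rng(0)
X = rng.random((2812, 20))
y = np.array([0] * 1406 + [1] * 1406)

# stratify=y preserves the class proportions of y in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("train class counts:", np.bincount(y_train))
print("test class counts:", np.bincount(y_test))
```

Because the subset sizes here are odd, the two class counts in each subset differ by one rather than being exactly equal. Without `stratify`, the split is purely random and the class counts can drift further apart.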