Question

How to split a dataset into train and test sets non-randomly in Python

0

4.8 years ago

necnec • 0

How can I split my dataset into training and test sets by deciding myself which data points go into the training set, with the rest going into the test set? I do not want Python to select them randomly; I want the user to decide. Is that possible in Python?

I have a small dataset: 20 data points in two classes (10 in class 1, 10 in class 2), with 30 features each. I also have a second dataset that is even smaller: 10 data points, again in two classes. I want to build my model using the first dataset and then use the second, smaller dataset to validate the model externally. The aim is to see how accurate the model is on new datasets, which is why I don't want to mix the two.

Thanks in advance.

python machine learning • 1.9k views

updated 4.8 years ago by Mensur Dlakic ★ 28k • written 4.8 years ago by necnec • 0

1

I don't think you should manually decide which data points go into the test vs. training set. Doesn't that defeat the point of training an algorithm?

But OK. What do you want to separate on?

4.8 years ago by N15 ▴ 160

0

Hi NRC,

Thank you for the reply. Maybe I did not explain my problem clearly, as I am new to this area. I have a small dataset: 20 data points in two classes (10 in class 1, 10 in class 2), with 30 features each. I also have a second dataset that is even smaller: 10 data points, again in two classes. I want to build my model using the first dataset and then use the second, smaller dataset to validate the model externally. The aim is to see how accurate the model is on new datasets, which is why I don't want to mix the two. I hope it is clear now. Please let me know if it is not.

4.8 years ago by necnec • 0

score 3 · Answer 1 · 2020-04-26

Maybe you already know this, but I will say it just in case: it is unlikely that you will be able to make a model that will generalize well based on 20 data points.

With such small datasets, a commonly used approach is leave-one-out (LOO) cross-validation (CV). It is a special case of N-fold CV in which the number of folds equals the number of data points; see the LOO section for details. In your case, that means holding out 1 data point out of 20 for internal validation and training on the remaining 19. Repeat that another 19 times, each time holding out a different data point. You will have 20 models when you are done, which means 20 sets of predictions on your external validation data. Those are averaged to give a final prediction. Sklearn's model selection module has a LOO section (the LeaveOneOut splitter) that will automate most of this process for you.
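
The procedure above could look roughly like the sketch below. It is only an illustration under several assumptions: the arrays are random placeholders with the shapes described in the question, the external set is assumed to be split 5/5 between the classes (the question does not say), and logistic regression is an arbitrary stand-in for whatever model you actually use. Note that no random train/test split is involved: the first dataset is used only for fitting and LOO, and the second is touched only for the external check.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)

# Placeholder data (assumption): replace with your own arrays.
X_train = rng.normal(size=(20, 30))        # first dataset: 20 points, 30 features
y_train = np.array([0] * 10 + [1] * 10)    # 10 per class
X_ext = rng.normal(size=(10, 30))          # second dataset: 10 points
y_ext = np.array([0] * 5 + [1] * 5)        # class sizes assumed, not from the question

loo = LeaveOneOut()
held_out_pred = np.empty(len(y_train), dtype=int)
ext_proba = []

for fit_idx, val_idx in loo.split(X_train):
    # Train on 19 points; the model choice here is only an example.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[fit_idx], y_train[fit_idx])

    # Internal validation on the single held-out point.
    held_out_pred[val_idx] = model.predict(X_train[val_idx])

    # Each of the 20 models also predicts the external set.
    ext_proba.append(model.predict_proba(X_ext)[:, 1])

ext_proba = np.array(ext_proba)            # shape (20 models, 10 external points)
mean_ext_proba = ext_proba.mean(axis=0)    # average over the 20 models
ext_pred = (mean_ext_proba >= 0.5).astype(int)

loo_accuracy = (held_out_pred == y_train).mean()
ext_accuracy = (ext_pred == y_ext).mean()
print(f"LOO accuracy: {loo_accuracy:.2f}")
print(f"External accuracy: {ext_accuracy:.2f}")
```

With random placeholder data the accuracies are meaningless; the point is only the structure of the loop and the averaging step.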