Splitting data into training and test sets
- Having completed the EDA, we split the dataset into training and test sets. The idea is to have two separate datasets:
- Training set — On this part of the data, we train a machine learning model
- Test set — This part of the data is not seen by the model during training and is used to evaluate its performance
- The goal of splitting the data is to prevent overfitting. Overfitting is a phenomenon whereby a model finds too many patterns in the training data and performs well only on that particular data. In other words, it fails to generalize to unseen data.
- This is a very important step in the analysis, as doing it incorrectly can introduce bias, for example, in the form of data leakage. Data leakage occurs when, during the training phase, a model observes information to which it should not have access. A common scenario is imputing missing values with the feature's average. If we did this before splitting the data, we would also use data from the test set to calculate the average, introducing data leakage. That is why the proper order is to split the data into training and test sets first and only then carry out the imputation, using only the values observed in the training set (a short sketch follows this list).
- Additionally, this approach ensures consistency, as unseen data in the future (new customers who will be scored by the model) will be treated in the same way as the data in the test set.
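- As a minimal sketch of this ordering (assuming the features are stored in a pandas DataFrame X with a numeric age column containing missing values; the column name is purely illustrative and not part of this recipe), we can fit the imputer on the training data only:
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
# split first, so the test set does not influence the imputation statistics
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# fit the imputer using the training data only...
imputer = SimpleImputer(strategy="mean")
X_train[["age"]] = imputer.fit_transform(X_train[["age"]])
# ...and reuse the training set's mean to fill in missing values in the test set
X_test[["age"]] = imputer.transform(X_test[["age"]])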
How to do it...
Execute the following steps to split the dataset into training and test sets.
- Import the function from sklearn:
from sklearn.model_selection import train_test_split
- Split the data into training and test sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
- Verify that the ratio of the target is preserved:
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
0    0.980213
1    0.019787
Name: target, dtype: float64
0    0.980211
1    0.019789
Name: target, dtype: float64
How it works...
- We first imported the train_test_split function from the model_selection module of scikit-learn.
- In the second step, we performed the split. We passed the X and y objects to the train_test_split function and specified the size of the test set as a fraction of all observations. For reproducibility, we also specified the random state. We then assigned the output of the function to four new objects.
- We also specified the stratification argument by passing the target variable (stratify=y). Splitting the data with stratification means that both the training and test sets will have as close to identical a distribution of the specified variable as possible. This argument is very important when dealing with imbalanced data. In the last step, we verified that the ratio of the target is indeed preserved in both sets.
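- As a quick sanity check (a sketch reusing the same X and y objects; the _ns suffix simply marks the non-stratified variant), we can repeat the split without stratification and compare the class ratios; with highly imbalanced data, the proportions in the two sets can then differ noticeably:
# split without stratification, for comparison only
X_train_ns, X_test_ns, y_train_ns, y_test_ns = train_test_split(X, y, test_size=0.2, random_state=42)
# the class ratios are no longer guaranteed to be (almost) identical
print(y_train_ns.value_counts(normalize=True))
print(y_test_ns.value_counts(normalize=True))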
There's more...
- It is also common to split the data into three sets: training, validation, and test. The validation set is used for frequent evaluation and tuning of the model's hyperparameters. Suppose we want to train a gradient boosting classifier (for example, XGBoost) and find the optimal value of the max_depth hyperparameter, which controls the maximum depth of the trees. To do so, we can train the model multiple times using the training set, each time with a different value of the hyperparameter. Then, we can evaluate the performance of all these models using the validation set. We pick the best of those models and, finally, evaluate its performance on the test set.
- In the following code block, we illustrate a possible way of creating a train-validation-test split, using the same train_test_split function:
import numpy as np
# define the sizes of the validation and test sets
VALID_SIZE = 0.1
TEST_SIZE = 0.2
# create the initial split - training and temporary (validation + test)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=(VALID_SIZE + TEST_SIZE), stratify=y, random_state=42
)
# calculate the test size relative to the temporary set
NEW_TEST_SIZE = np.around(TEST_SIZE / (VALID_SIZE + TEST_SIZE), 2)
# create the validation and test sets
X_valid, X_test, y_valid, y_test = train_test_split(
    X_temp, y_temp, test_size=NEW_TEST_SIZE, stratify=y_temp, random_state=42
)
- We basically ran train_test_split twice; however, we had to adjust the test_size input in such a way that the initially defined proportions (70-10-20) were preserved. The first split sets aside 30% of the observations as a temporary set, and the second split assigns 0.2 / (0.1 + 0.2) ≈ 0.67 of that temporary set to the test set, which corresponds to 20% of all observations.
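- To illustrate how the validation set could then be used, the following sketch (assuming the xgboost package is installed; the candidate values of max_depth and the choice of ROC AUC as the metric are illustrative, not part of the original recipe) selects the best max_depth on the validation set and evaluates the chosen model only once on the test set:
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier
# evaluate a few candidate values of max_depth on the validation set
valid_scores = {}
for max_depth in [2, 3, 4, 5]:
    model = XGBClassifier(max_depth=max_depth, random_state=42)
    model.fit(X_train, y_train)
    valid_scores[max_depth] = roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])
# pick the best value and refit the model on the training set
best_max_depth = max(valid_scores, key=valid_scores.get)
best_model = XGBClassifier(max_depth=best_max_depth, random_state=42)
best_model.fit(X_train, y_train)
# final, one-time evaluation on the test set
print(f"Test ROC AUC: {roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1]):.4f}")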
- Sometimes, we do not have enough data to split it into three sets, either because we do not have that many observations in general or because the data is highly imbalanced and we would lose valuable samples from the training set.