Test/Train Split


When building a machine learning model, a critical technique for evaluating how well it generalizes to unseen data is to reserve a portion of your dataset as a "test" set that is not used for training. This concept is known as a test/train split.

The split yields two subsets of your data:

  • the training set, which is used to train the model, and
  • the test set, which is used to assess its performance.

Common splits are "80/20" and "70/30". An 80/20 split, for example, uses 80% of the data for training and 20% for testing. This separation ensures that the model is not evaluated on the same data it was trained on.
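As a rough illustration of an 80/20 split, here is a minimal sketch using only the Python standard library. The function name split_train_test, its parameters, and the toy dataset are assumptions made for this example. The rows are shuffled first so that any ordering in the original data does not leak into the split.

```python
import random

def split_train_test(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists.

    Hypothetical helper for illustration; not taken from any library.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                      # copy, so the input is not modified
    random.Random(seed).shuffle(shuffled)      # fixed seed makes the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))                        # stand-in for 100 data points
train, test = split_train_test(rows, test_fraction=0.2)
print(len(train), len(test))                   # prints "80 20"
```

In practice, many libraries ship an equivalent utility. The point of the sketch is only that every data point ends up in exactly one of the two subsets.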

The test/train split is important because it simulates data the model has never seen before: the test set contains no data found in the training set. By evaluating the model's accuracy on the test set, we get an estimate of its predictive accuracy on, and robustness to, never-before-seen data.
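To make the difference between training accuracy and test accuracy concrete, the sketch below fits a deliberately simple 1-nearest-neighbour classifier on made-up data. The dataset, the 10% label noise, and the helper names accuracy and predict_1nn are all assumptions chosen for illustration. Because this classifier memorizes its training points, it scores perfectly on the training set, which says nothing about how it handles new data. The test set gives the more honest estimate.

```python
import random

def accuracy(predict, examples):
    """Fraction of (x, label) examples for which predict(x) == label."""
    correct = sum(1 for x, label in examples if predict(x) == label)
    return correct / len(examples)

# Synthetic data (an assumption for this sketch): the label is 1 when x > 0.5,
# and roughly 10% of the labels are flipped to act as noise.
rng = random.Random(0)
examples = []
for _ in range(200):
    x = rng.random()
    label = int(x > 0.5)
    if rng.random() < 0.1:
        label = 1 - label
    examples.append((x, label))

# 80/20 split: the first 160 examples train the model, the last 40 test it.
train, test = examples[:160], examples[160:]

def predict_1nn(x):
    """1-nearest-neighbour: copy the label of the closest training point."""
    _, nearest_label = min(train, key=lambda ex: abs(ex[0] - x))
    return nearest_label

train_acc = accuracy(predict_1nn, train)   # evaluated on data the model memorized
test_acc = accuracy(predict_1nn, test)     # evaluated on data the model never saw
print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

The training accuracy here is perfect by construction, since each training point is its own nearest neighbour. The test accuracy is the figure to report as an estimate of performance on new data.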