In machine learning, we split the data into a training set and a testing set. The training set is used to fit the model, and the testing set is used to evaluate its performance on unseen data. A typical split is 70-30% or 80-20% training-testing, depending on the size of the dataset. Evaluating on unseen data simulates real-world performance and shows whether the model generalizes well. Keeping the two sets separate also helps detect over-fitting, because the model is scored on data it has never seen during training.
Scikit-learn’s train_test_split shuffles and splits the data randomly; with the stratify option it also preserves the class distribution, so the training and testing sets contain the same proportion of each class as the original dataset.
Code
import pandas as pd
from sklearn.model_selection import train_test_split
# Load Iris dataset
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
# Define features (X) and target (y)
X = df.drop('species', axis=1) # Features: sepal_length, sepal_width, etc.
y = df['species'] # Target: species
# Split data into training (80%) and testing (20%) sets, stratified by class
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Verify split sizes
print("Training set shape (X_train, y_train):", X_train.shape, y_train.shape)
print("Test set shape (X_test, y_test):", X_test.shape, y_test.shape)
# Check class distribution in training and test sets
print("\nTraining set class distribution:\n", y_train.value_counts())
print("Test set class distribution:\n", y_test.value_counts())
Expected Output
Training set shape (X_train, y_train): (120, 4) (120,)
Test set shape (X_test, y_test): (30, 4) (30,)
Training set class distribution:
species
versicolor 40
virginica 40
setosa 40
Name: count, dtype: int64
Test set class distribution:
species
versicolor 10
virginica 10
setosa 10
Name: count, dtype: int64
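To see why the stratify option matters, the short sketch below (a minimal illustration reusing the X and y defined above, not part of the original example) repeats the split without stratify=y and prints the class proportions of the test set. With only 30 test rows, these proportions can drift away from the exact one-third per class that stratification guarantees.
# For comparison: the same split without stratification (illustrative sketch).
# X and y are assumed to be the features and target defined earlier.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
print("Unstratified test set class proportions:\n", y_te.value_counts(normalize=True))
# Without stratify=y the per-class proportions in the test set are left to chance,
# whereas the stratified split above keeps them at exactly 1/3 each.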