In machine learning, we split the data into a training set and a testing set. The training set is used to fit the model, and the testing set is used to evaluate its performance on unseen data. A typical split is 70-30% or 80-20% training-testing, depending on the size of the dataset. Evaluating on unseen data simulates real-world performance and shows whether the model generalizes well. Keeping the two sets separate also helps detect over-fitting, because the model is scored on data it has never seen during training.
Scikit-learn’s train_test_split shuffles and splits the data randomly; with the stratify option it also preserves the class distribution, so the training and testing sets contain the same proportion of each class as the original dataset.
Code
import pandas as pd
from sklearn.model_selection import train_test_split
# Load Iris dataset
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
# Define features (X) and target (y)
X = df.drop('species', axis=1) # Features: sepal_length, sepal_width, etc.
y = df['species'] # Target: species
# Split data into training (80%) and testing (20%) sets, stratified by class
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Verify split sizes
print("Training set shape (X_train, y_train):", X_train.shape, y_train.shape)
print("Test set shape (X_test, y_test):", X_test.shape, y_test.shape)
# Check class distribution in training and test sets
print("\nTraining set class distribution:\n", y_train.value_counts())
print("Test set class distribution:\n", y_test.value_counts())
Expected Output
Training set shape (X_train, y_train): (120, 4) (120,)
Test set shape (X_test, y_test): (30, 4) (30,)
Training set class distribution:
species
versicolor 40
virginica 40
setosa 40
Name: count, dtype: int64
Test set class distribution:
species
versicolor 10
virginica 10
setosa 10
Name: count, dtype: int64
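To see why the stratify option matters, the short sketch below (a minimal illustration reusing the X and y defined above, not part of the original example) repeats the split without stratify=y and prints the class proportions of the test set. With only 30 test rows, these proportions can drift away from the exact one-third per class that stratification guarantees.
# For comparison: the same split without stratification (illustrative sketch).
# X and y are assumed to be the features and target defined earlier.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
print("Unstratified test set class proportions:\n", y_te.value_counts(normalize=True))
# Without stratify=y the per-class proportions in the test set are left to chance,
# whereas the stratified split above keeps them at exactly 1/3 each.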