52. The Rule That Prevents You From Cheating Your Own Model

Series: How Machines Learn: A Complete Guide from Zero to AI Engineer
Phase 6: Machine Learning (The Core)

You built your first model. You trained it. You tested it on the same data. It got 98% accuracy.

You felt great.

Then you tried it on real data. It bombed.

That moment right there is what this post is about. It happens to almost every beginner and it happens because of one mistake: you let your model peek at the answer key before the exam.

What You'll Learn Here

  • Why you can't test a model on data it already saw
  • What a train/test split is and how to do it right
  • What data leakage is and the different ways it sneaks in
  • Cross-validation, the upgrade to a single split
  • Code for all of it, with real examples

The Exam Analogy

Imagine you're a teacher. You want to know if your students actually learned the material.

You hand them a test. But here's the twist: you gave them that exact test as homework last week and they all memorized the answers.

Every student scores 100%. Did they learn? No. They just memorized.

Your ML model does the exact same thing if you train and test on the same data. It memorizes the training examples. It doesn't generalize. It doesn't actually learn the pattern. It just stores what it saw.

And you'd never catch it, because it looks perfect on paper.
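
You can see this memorization effect directly. Here's a small sketch (not from the original post): a decision tree trained on pure noise, where there is no pattern to learn at all, still scores near 100% on the data it memorized and about coin-flip on anything held out.

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import numpy as np

# Pure noise: the features say nothing about the labels,
# so there is nothing to learn -- only things to memorize.
rng = np.random.RandomState(0)
X = rng.rand(500, 10)
y = rng.randint(0, 2, 500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

print(f"Accuracy on data it saw:    {model.score(X_train, y_train):.2f}")  # ~1.00
print(f"Accuracy on data it hasn't: {model.score(X_test, y_test):.2f}")    # ~0.50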

The Fix: Keep Test Data Locked Away

The rule is simple. Before you touch your data, split it into two piles:

  1. Training set - The model sees this. It learns from this.
  2. Test set - The model never sees this during training. You use it at the very end to evaluate.

That's it. The test set is like a sealed envelope. You only open it once, when you're done building the model.

from sklearn.model_selection import train_test_split
import numpy as np

# Fake dataset: 1000 examples, 5 features
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, 1000)

# 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,      # 20% goes to testing
    random_state=42     # Makes the split reproducible
)

print(f"Training size: {X_train.shape[0]}")   # 800
print(f"Testing size:  {X_test.shape[0]}")    # 200

The random_state=42 just means: every time you run this, you get the same split. Without it, you'd get a different random split each time, and your results would change every run. That makes debugging a nightmare.

Why the Split Size Matters

A common question: how much should I put in test?

The usual answer is 80/20 or 70/30. Here's the thinking:

  • Too little training data: model doesn't learn enough
  • Too little test data: your accuracy estimate isn't trustworthy (imagine judging a model on just 10 examples)

With smaller datasets (under 1000 rows), lean toward 70/30. With big datasets (100k+ rows), you can go 90/10 because 10% is still a lot of data.

# Small dataset - use 70/30
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Big dataset - 90/10 is fine
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

What Data Leakage Actually Is

Data leakage is when information from your test set sneaks into your training process. It makes your model look better than it actually is.

There are two kinds.

Leakage Type 1: Training on test data directly

The obvious one. You forget to split and train on everything.

# WRONG - never do this
model.fit(X, y)          # trained on ALL data
score = model.score(X, y)  # tested on same data
print(score)  # Looks amazing. Means nothing.

# RIGHT
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print(score)  # This number actually tells you something

Leakage Type 2: Preprocessing before splitting

This one is sneaky. You scale your data before splitting. Sounds harmless. It's not.

from sklearn.preprocessing import StandardScaler

# WRONG - scaling before split
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # uses ALL data to calculate mean/std

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2)
# The test set influenced the scaler. Leakage.

# RIGHT - split first, then scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns from train only
X_test_scaled  = scaler.transform(X_test)       # applies same scaling to test

See the difference? In the wrong version, when you called fit_transform(X), the scaler calculated mean and standard deviation using the test data too. That information then flowed into how your model was trained. The test set is no longer truly unseen.

Always: split first, preprocess second.
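
If you want a guardrail that makes this mistake harder to make, scikit-learn's Pipeline bundles the scaler and the model together so the scaler is always fit on training data only, even inside cross-validation. A minimal sketch:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

# The scaler is re-fit on the training portion of every fold,
# so the held-out fold never influences the mean and std.
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

scores = cross_val_score(pipeline, X, y, cv=5)
print(f"Leak-free CV accuracy: {scores.mean():.3f}")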

The Problem With a Single Split

Even with a proper train/test split, there's still a risk. If your dataset is small or your split was unlucky, the test set might not represent your real data well.

You could split and happen to get all the "easy" examples in your test set. Your accuracy looks great. But you got lucky, not good.

The solution is cross-validation.

Instead of one split, you make K splits. You train and test K times. You average the results.

Example with K=5 (called 5-fold cross-validation):

Fold 1:  [TEST ] [train] [train] [train] [train]
Fold 2:  [train] [TEST ] [train] [train] [train]
Fold 3:  [train] [train] [TEST ] [train] [train]
Fold 4:  [train] [train] [train] [TEST ] [train]
Fold 5:  [train] [train] [train] [train] [TEST ]

Final score = average of all 5 test scores

Every example gets used for testing exactly once. The final score is much more reliable than a single split.

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

model = KNeighborsClassifier(n_neighbors=3)

# 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5)

print(f"Scores per fold: {scores.round(3)}")
print(f"Mean accuracy:   {scores.mean():.3f}")
print(f"Std deviation:   {scores.std():.3f}")

Output:

Scores per fold: [0.967 1.    0.933 0.967 1.   ]
Mean accuracy:   0.973
Std deviation:   0.027

The mean gives you the best estimate of real-world performance. The standard deviation tells you how consistent the model is. Small std = reliable. Large std = something is off.
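
If you're curious what cross_val_score is doing behind the scenes, here is a rough hand-written equivalent using KFold (a sketch, not the exact internals; for classifiers, cross_val_score actually defaults to stratified folds):

from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
import numpy as np

X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier(n_neighbors=3)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []

for train_idx, test_idx in kf.split(X):
    # Train on this fold's training rows...
    model.fit(X[train_idx], y[train_idx])
    # ...and score on the rows held out from this fold.
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

print(f"Mean accuracy: {np.mean(fold_scores):.3f}")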

Stratified Splits: When Class Balance Matters

Imagine you have a dataset where 95% of examples are "not fraud" and 5% are "fraud." If your random split is unlucky, your test set might have only 1% fraud. Your accuracy numbers will be misleading.

Stratified splitting fixes this. It makes sure both train and test have the same percentage of each class.

from sklearn.model_selection import train_test_split, StratifiedKFold
import numpy as np

# Imbalanced dataset: 950 negatives, 50 positives
X = np.random.rand(1000, 4)
y = np.array([0]*950 + [1]*50)

# stratify=y makes sure both sets keep the 95/5 ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y       # <-- this line
)

print(f"Train class 1 ratio: {y_train.mean():.3f}")  # ~0.050
print(f"Test class 1 ratio:  {y_test.mean():.3f}")   # ~0.050

Without stratify=y, you might get 0.04 in one set and 0.08 in another. With it, both sets reflect the real distribution. Always use stratify=y for classification problems.
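
The same idea extends to cross-validation. The StratifiedKFold import above hasn't been used yet; here is a short sketch (continuing from the imbalanced X and y defined above) showing that every held-out fold keeps roughly the 5% positive rate:

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    # Each held-out fold keeps roughly the same 5% positive rate.
    print(f"Fold {fold}: test positive ratio = {y[test_idx].mean():.3f}")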

Putting It All Together: The Right Workflow

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer

# 1. Load data
data = load_breast_cancer()
X, y = data.data, data.target

# 2. Split FIRST
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 3. Preprocess AFTER splitting
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit only on train
X_test  = scaler.transform(X_test)       # transform only

# 4. Check with cross-validation on training data
model = KNeighborsClassifier(n_neighbors=5)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"CV accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")

# 5. Final train on full training set
model.fit(X_train, y_train)

# 6. Evaluate on test set ONCE at the end
y_pred = model.predict(X_test)
final_accuracy = accuracy_score(y_test, y_pred)
print(f"Final test accuracy: {final_accuracy:.3f}")

This is the correct order. Every step is in the right place. No leakage.

The Things Everyone Gets Wrong

Mistake 1: Looking at test results and then tweaking the model

The moment you check your test score and adjust your model based on it, you've contaminated the test set. You're now fitting to the test set, just indirectly. Use cross-validation for tuning. Use the test set only once.
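
In code, "use cross-validation for tuning" often looks something like this. A sketch using GridSearchCV, where the parameter grid is just an illustration:

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# All the tweaking happens inside cross-validation on the training set.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7, 9]},  # illustrative grid
    cv=5,
)
search.fit(X_train, y_train)

print(f"Best parameters (chosen without touching the test set): {search.best_params_}")
print(f"Test accuracy, checked once at the end: {search.score(X_test, y_test):.3f}")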

Mistake 2: Scaling before splitting

Covered above. Split first, always.

Mistake 3: Forgetting random_state

Without a fixed random state, your results change every run. You can't tell if your improvements are real or just lucky splits.
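
A tiny demonstration (just an illustration, not from the post):

from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(10).reshape(-1, 1)
y = np.arange(10)

# Without random_state, the test rows change on every call.
_, test_a, _, _ = train_test_split(X, y, test_size=0.3)
_, test_b, _, _ = train_test_split(X, y, test_size=0.3)
print(test_a.ravel(), test_b.ravel())   # usually different

# With random_state, the split is identical every run.
_, test_c, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
_, test_d, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
print(test_c.ravel(), test_d.ravel())   # always the same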

Mistake 4: Tiny test sets

Evaluating on 10 examples tells you nothing. One or two random flukes and your accuracy is off by 10-20%. Always make sure your test set has enough examples to be statistically meaningful (at least 100, ideally 200+).
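
To put rough numbers on that, here is a quick simulation (an assumed true accuracy of 85%, purely illustrative) of how much the estimate swings with test-set size:

import numpy as np

rng = np.random.RandomState(0)
true_accuracy = 0.85  # pretend the model is really 85% accurate

for n_test in [10, 100, 1000]:
    # Simulate 1000 test sets: each prediction is correct with probability 0.85.
    estimates = rng.binomial(n_test, true_accuracy, size=1000) / n_test
    print(f"n={n_test:4d}: estimates swing from {estimates.min():.2f} "
          f"to {estimates.max():.2f} (std {estimates.std():.3f})")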

Quick Cheat Sheet

  • Default split: train_test_split(X, y, test_size=0.2, random_state=42)
  • Imbalanced classes: add stratify=y to the split
  • Need a reliable estimate: use cross_val_score(model, X, y, cv=5)
  • Preprocessing: fit_transform on train, transform on test
  • Final evaluation: use the test set only once, at the very end

Practice Challenges

Level 1:
Load load_wine() from sklearn. Split it 80/20 with stratify. Train a KNN model. Print the test accuracy.

Level 2:
Try different test sizes: 0.1, 0.2, 0.3, 0.4. Plot how the test accuracy changes. Does more training data always help?

Level 3:
Run 5-fold cross-validation on the breast cancer dataset. Then change to 10-fold. Compare the mean and standard deviation. Which gives a more stable estimate?

Next up, Post 53: Overfitting: When Your Model Is Too Good at Being Wrong. We go deep into bias, variance, and the tradeoff that decides everything in ML.

Source

This article was originally published by DEV Community and written by Akhilesh.