Machine Learning Interactive

Cross-Validation

Cross-validation estimates model performance on unseen data: by rotating through folds, every observation is used for training in some rounds and for testing in exactly one.

📊 How K-Fold CV Works

  1. Split: Divide the data into K equal folds
  2. Rotate: Use K-1 folds for training, 1 for testing
  3. Repeat: Train/test K times, so each fold serves as the test set once
  4. Average: Report the mean ± std of the K scores (see the base-R sketch below the diagram)
[Fold diagram: Folds 1-5, each fold used once as the test set (■ Train, ■ Test)]
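
A minimal base-R sketch of the four steps above. It assumes X is a predictor matrix, y is the label vector, and train_model() is the same placeholder fitting function used in the R Code Equivalent section further down.

# Base-R K-fold CV: split, rotate, repeat, average
k <- 5
n <- nrow(X)                                  # X, y assumed to exist
fold_id <- sample(rep(1:k, length.out = n))   # 1. Split: assign each row to a fold

scores <- sapply(1:k, function(i) {
  test_idx <- which(fold_id == i)             # 2. Rotate: fold i is the test set
  model <- train_model(X[-test_idx, ], y[-test_idx])
  preds <- predict(model, X[test_idx, ])      # 3. Repeat: evaluate on the held-out fold
  mean(preds == y[test_idx])
})

c(mean = mean(scores), sd = sd(scores))       # 4. Average: report mean ± std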

CV Settings

  • Number of Folds (K): 5 (adjustable 2-10)
  • Dataset Size: 100 (adjustable 50-500)

📊 Data Split

  • Training samples: 80
  • Test samples: 20
  • Test %: 20.0%
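
The split above is just the settings arithmetic (K = 5, N = 100): each round holds out N/K samples and trains on the rest. A quick check in R:

k <- 5; n <- 100
test_n  <- n / k                # 20 test samples per fold
train_n <- n - test_n           # 80 training samples per fold
test_pct <- 100 * test_n / n    # 20% of the data held out each round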

[Chart: Accuracy Per Fold]

Cross-Validation Summary

  • Mean Accuracy: 63.3%
  • Std Dev: ±2.7%
  • 95% CI: 58.0-68.6%

Report as: 63.3% ± 2.7% (5-fold CV)
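
How the summary is computed from the K fold scores. The per-fold values below are hypothetical placeholders, and the interval follows the mean ± 1.96 × std convention reported above.

fold_acc <- c(0.61, 0.66, 0.62, 0.65, 0.63)   # hypothetical per-fold accuracies
m  <- mean(fold_acc)
s  <- sd(fold_acc)
ci <- m + c(-1, 1) * 1.96 * s                 # 95% CI as mean ± 1.96 * std
sprintf("Report as: %.1f%% ± %.1f%% (%d-fold CV)", 100 * m, 100 * s, length(fold_acc))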

📋 CV Methods

K-Fold

Split into K equal parts, rotate test set

Best for: General purpose

✓ Balanced, uses all data

Leave-One-Out

K = N, one sample per test fold

Best for: Small datasets

✓ Max training data

Time Series

Train on past, test on future

Best for: Sequential data

✓ Prevents leakage

Stratified

Maintain class proportions in folds

Best for: Imbalanced data

✓ Representative folds
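
A sketch of how each method above is requested with caret. It assumes y is the outcome vector; createFolds() stratifies automatically when y is a factor, and the time-series option reuses the timeslice settings shown in the R code further down.

library(caret)

# K-Fold: general purpose
ctrl_kfold <- trainControl(method = "cv", number = 5)

# Leave-One-Out: small datasets
ctrl_loocv <- trainControl(method = "LOOCV")

# Stratified: createFolds() keeps class proportions when y is a factor
strat_folds <- createFolds(factor(y), k = 5, returnTrain = TRUE)

# Time series: train on past, test on future (see the timeslice example below)
ctrl_ts <- trainControl(method = "timeslice",
                        initialWindow = 80, horizon = 20, fixedWindow = FALSE)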

โš ๏ธ Time Series Data Warning

Standard K-Fold CV shuffles data randomly. For sports betting (sequential games):

  • Problem: Future games could appear in training and past games in test (leakage)
  • Solution: Use time-series CV: always train on the past, test on the future
  • Expanding window: Train on all data before the test period
  • Rolling window: Fixed-length training window that slides forward

R Code Equivalent

# K-Fold Cross-Validation
library(caret)

# Create folds
folds <- createFolds(y, k = 5, returnTrain = TRUE)

# Manual CV loop
cv_results <- sapply(folds, function(train_idx) { 
  model <- train_model(X[train_idx, ], y[train_idx])   # train_model(): placeholder for your model-fitting function
  predictions <- predict(model, X[-train_idx, ])
  accuracy <- mean(predictions == y[-train_idx])
  return(accuracy)
})

cat(sprintf("CV Accuracy: %.1f%% ± %.1f%%\n", 
            mean(cv_results) * 100, sd(cv_results) * 100))

# Time-series CV (expanding window)
library(rsample)
ts_cv <- rolling_origin(data, initial = 80, assess = 20, cumulative = TRUE)

# Or using caret trainControl
ctrl <- trainControl(method = "timeslice", 
                     initialWindow = 80, horizon = 20, fixedWindow = FALSE)
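
The block above uses an expanding window (cumulative = TRUE, fixedWindow = FALSE). The rolling-window variant described in the warning section only changes those flags; the 80/20 window sizes are reused here for illustration.

# Rolling window: fixed-length training window that slides forward
ts_cv_roll <- rolling_origin(data, initial = 80, assess = 20, cumulative = FALSE)

ctrl_roll <- trainControl(method = "timeslice",
                          initialWindow = 80, horizon = 20, fixedWindow = TRUE)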

✅ Key Takeaways

  • CV uses all data for training AND testing
  • Report the mean ± std across folds
  • K = 5 or K = 10 is standard
  • Use time-series CV for sequential data
  • Use stratified CV for imbalanced classes
  • Use LOOCV for small datasets
