Model Calibration
Ensure predicted probabilities match observed frequencies. A 70% prediction should win 70% of the time. Critical for pricing and risk assessment.
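As a quick illustration of that requirement, here is a minimal sketch on simulated data (variable names pred and won are hypothetical): take the bets priced near 70% and check their observed win rate.
# Sanity check on simulated data: do ~70% predictions win ~70% of the time?
set.seed(42)
pred <- runif(10000)                 # hypothetical model probabilities
won  <- rbinom(10000, 1, pred)       # outcomes drawn from those probabilities (calibrated by construction)
near_70 <- abs(pred - 0.70) < 0.02   # bets priced close to 70%
mean(won[near_70])                   # should come out near 0.70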
🎯 Why Calibration Matters
Accurate Pricing
Miscalibrated probabilities mean mispriced lines: you either give away edge or lose customers.
Risk Assessment
VaR and stress tests need calibrated probabilities for valid estimates.
Bettor Trust
Well-calibrated odds feel fair. Builds long-term customer relationships.
Overconfidence: an overconfident model pushes its predictions away from 50%, reporting probabilities more extreme than the frequencies actually observed.
📊 Metrics
Expected Calibration Error (ECE): the bin-weighted average gap between predicted and observed frequencies; values near zero indicate acceptable calibration.
Reliability Diagram
Perfect calibration: line follows diagonal. Above = underconfident, Below = overconfident.
Reading the Chart
Well Calibrated
Green (actual) line follows yellow (predicted) line closely. 70% predictions win ~70% of the time.
Overconfident
Actual outcomes fall below predictions at high probabilities: the model says 80%, but the event happens only about 60% of the time.
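A reliability diagram like the one described above can be sketched in base R by binning predictions and comparing each bin's observed win rate with its average predicted probability (a minimal sketch on simulated data; variable names are hypothetical):
# Reliability diagram sketch: observed frequency vs. predicted probability per bin
set.seed(1)
pred <- runif(5000)
won  <- rbinom(5000, 1, pred)                          # well calibrated by construction
bins <- cut(pred, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
avg_pred   <- tapply(pred, bins, mean)                 # average prediction per bin
avg_actual <- tapply(won, bins, mean)                  # observed win rate per bin
plot(avg_pred, avg_actual, type = "b", xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Predicted probability", ylab = "Observed frequency")
abline(0, 1, lty = 2)                                  # diagonal = perfect calibration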
🔧 Calibration Methods
Platt Scaling
Parametric: fit a sigmoid to the model outputs.
Isotonic Regression
Non-parametric: monotonic fit of outcomes on predictions (see the sketch after this list).
Temperature Scaling
Parametric: single-parameter softmax rescaling.
Histogram Binning
Non-parametric: bin-specific corrections.
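Platt and temperature scaling appear in the R code section below; isotonic regression can be sketched with base R's isoreg() (a minimal sketch on simulated data; variable names are hypothetical):
# Isotonic regression calibration using base R's isoreg()
set.seed(7)
pred <- runif(2000)
won  <- rbinom(2000, 1, plogis(0.5 * qlogis(pred)))   # true win rates shrunk toward 50% -> overconfident model
o    <- order(pred)                                   # sort so fitted values line up with x
fit  <- isoreg(pred[o], won[o])                       # monotonic fit of outcomes on predictions
calib_map  <- approxfun(fit$x, fit$yf, rule = 2, ties = mean)  # step fit -> calibration map
calibrated <- calib_map(pred)                         # calibrated probabilities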
💡 Practical Tips
When to Calibrate
- ✅ After training any classifier (SVM, NN, tree-based)
- ✅ Before using probabilities for pricing
- ✅ Regularly as the model drifts
Best Practices
- ✅ Use a holdout set for calibration, not the training data (see the sketch after this list)
- ✅ Platt scaling works well when the calibration set is small; isotonic regression needs more data
- ✅ Temperature scaling for neural networks
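A minimal sketch of the holdout workflow (simulated data; variable names are hypothetical): fit the calibration map on a held-out split, then apply it to unseen predictions.
# Fit the calibrator on a held-out split, apply it to the rest
set.seed(3)
pred <- runif(4000)
won  <- rbinom(4000, 1, plogis(0.5 * qlogis(pred)))   # simulate an overconfident model
cal_idx <- sample(seq_along(pred), 2000)              # calibration (holdout) split
cal_df  <- data.frame(logit = qlogis(pred[cal_idx]), won = won[cal_idx])
fit     <- glm(won ~ logit, family = binomial, data = cal_df)   # Platt-style logistic fit
test_df <- data.frame(logit = qlogis(pred[-cal_idx]))
calibrated <- predict(fit, newdata = test_df, type = "response")  # applied only to unseen predictions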
R Code Equivalent
# Calibration analysis
library(CalibrationCurves)  # optional: calibration-curve plotting helpers (not used by the functions below)
# Calculate Expected Calibration Error
calculate_ece <- function(predictions, actuals, n_bins = 10) {
  # Bin predictions into equal-width probability bins (include 0 in the first bin)
  bins <- cut(predictions, breaks = seq(0, 1, length.out = n_bins + 1),
              include.lowest = TRUE)
  ece <- 0
  n_total <- length(predictions)
  for (b in levels(bins)) {
    in_bin <- which(bins == b)
    if (length(in_bin) > 0) {
      avg_pred <- mean(predictions[in_bin])    # mean predicted probability in the bin
      avg_actual <- mean(actuals[in_bin])      # observed frequency in the bin
      weight <- length(in_bin) / n_total       # bin weight = share of observations
      ece <- ece + weight * abs(avg_pred - avg_actual)
    }
  }
  return(ece)
}
# Platt scaling calibration (logistic fit on the logit of the predictions)
platt_calibrate <- function(predictions, actuals) {
  # Clip away from 0/1 so the logit transform stays finite
  p <- pmin(pmax(predictions, 1e-6), 1 - 1e-6)
  model <- glm(actuals ~ qlogis(p), family = binomial)
  calibrated <- predict(model, type = "response")
  return(calibrated)  # fitted on the data passed in; use a holdout set in practice
}
# Temperature scaling (for neural nets): divide logits by a learned temperature
temperature_scale <- function(logits, temperature) {
  scaled <- logits / temperature
  probs <- plogis(scaled)  # numerically stable sigmoid (binary case)
  return(probs)
}
✅ Key Takeaways
- A 70% prediction should win 70% of the time
- Use reliability diagrams to visualize calibration
- ECE measures overall calibration quality
- Platt scaling and isotonic regression are common fixes
- Always calibrate on held-out data
- Re-calibrate as the model drifts