Machine Learning Interactive
Feature Engineering
Transform raw data into predictive features. Good features are the difference between a mediocre model and a winning one.
The Core Principle
Raw Data → Feature Engineering → Predictive Features
Raw data (points scored, game logs, box scores) passes through feature engineering (rolling averages, pace adjustment, context) to become predictive features: signals that actually improve predictions.
Feature Selection (interactive demo): with 4 features selected, the model reaches 60.0% prediction accuracy.
Feature Importance

| Feature | Importance | Category |
|---|---|---|
| Season Average | 35% | Base |
| Rolling Avg (L5) | 18% | Trend |
| Matchup Stats | 15% | Context |
| Home/Away | 8% | Context |
| Rest Days | 6% | Context |
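Importances like the ones above are typically read off a fitted tree ensemble. Below is a minimal scikit-learn sketch on synthetic data; the feature names mirror this section and the coefficients are invented, so the learned importances will not match the percentages shown.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 5))
# Make the target depend most on column 0, mimicking "season average"
# being the strongest signal (coefficients are illustrative).
y = 2.0 * X[:, 0] + 0.8 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=n)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
names = ["season_avg", "roll_5", "matchup", "home_away", "rest_days"]
importances = dict(zip(names, model.feature_importances_))
```

The `feature_importances_` vector sums to 1, so each entry reads as a share of the model's total split gain.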
Engineering Best Practices
✓ Do
- Create rolling averages at multiple windows
- Normalize features to the same scale
- Create interaction terms
- Handle missing values explicitly
✗ Don't
- Use future information (data leakage)
- Create too many features (overfitting)
- Ignore correlation between features
- Use raw counts without normalization
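The data-leakage warning deserves one concrete illustration: a rolling average that includes the current game silently uses the very value being predicted. A pandas sketch with an invented game log:

```python
import pandas as pd

# Hypothetical game log: points per game in chronological order.
games = pd.DataFrame({"pts": [22, 30, 18, 25, 35, 28]})

# LEAKY: the rolling window includes the current game, so the
# "feature" for game i already contains the target for game i.
games["roll3_leaky"] = games["pts"].rolling(3).mean()

# SAFE: shift(1) moves the series back one game, so the feature
# for game i only uses games 0..i-1.
games["roll3_safe"] = games["pts"].shift(1).rolling(3).mean()
```

The safe version pays a price of one extra NaN row at the start, which is exactly why missing values should be handled explicitly rather than silently dropped.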
Feature Categories
Base Statistics
Historical performance baselines
- Season average
- Career average
- Position average
Trend Features
Recent form and direction
- Rolling averages (L5, L10)
- Momentum
- Hot/cold streak
Context Features
Situational adjustments
- Home/away
- Rest days
- Back-to-back
- Opponent strength
Pace/Tempo
Opportunity-based scaling
- Team pace
- Opponent pace
- Projected possessions
Usage Features
Role and opportunity
- Projected minutes
- Usage rate
- Lineup impact
Derived/Interaction
Combined effects
- Pace × Minutes
- Home × Rest
- Matchup × Usage
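The derived/interaction features above reduce to simple column products: a fast game only helps a player who is actually on the floor. A pandas sketch with invented values:

```python
import pandas as pd

# Hypothetical per-game rows with the raw ingredients.
df = pd.DataFrame({
    "team_pace": [102.0, 96.5],   # possessions per 48 minutes
    "proj_minutes": [34.0, 28.0],
    "is_home": [1, 0],
    "rest_days": [2, 0],
})

# Interaction terms: multiply the columns whose effects compound.
df["pace_x_minutes"] = df["team_pace"] * df["proj_minutes"]
df["home_x_rest"] = df["is_home"] * df["rest_days"]
```

Because products blow up the scale, interaction terms are a prime candidate for the normalization step listed under best practices.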
Player Props Feature Set

| Feature | Formula | Why It Matters |
|---|---|---|
| Pace-Adjusted Avg | season_avg × (matchup_pace / league_pace) | Scales for fast/slow games |
| Minutes-Weighted | per_min_rate × proj_minutes | Accounts for opportunity |
| Matchup Factor | opp_def_rating / league_avg | Defense quality adjustment |
| Rest Impact | 1 + 0.02 × (rest_days - 1) | Back-to-back penalty |
| Trend Score | (L5_avg - season_avg) / season_std | Hot/cold streak signal |
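The table's formulas can be checked end to end with made-up inputs; none of these numbers come from real data.

```python
# Illustrative inputs only.
season_avg, season_std = 24.0, 6.0
matchup_pace, league_pace = 103.0, 100.0
per_min_rate, proj_minutes = 0.75, 34.0
opp_def_rating, league_avg_def = 110.0, 112.0
rest_days = 0            # back-to-back
l5_avg = 28.0            # average over the last 5 games

pace_adjusted = season_avg * (matchup_pace / league_pace)   # scaled up: fast game
minutes_weighted = per_min_rate * proj_minutes              # opportunity-based line
matchup_factor = opp_def_rating / league_avg_def            # < 1: tough defense
rest_impact = 1 + 0.02 * (rest_days - 1)                    # 0.98 penalty on a B2B
trend_score = (l5_avg - season_avg) / season_std            # positive: running hot
```

Note that rest_days = 0 yields a factor below 1, which is exactly the back-to-back penalty the table describes.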
R Code Equivalent

```r
# Feature engineering for player props
library(dplyr)
library(zoo)

# Assumes player_games already carries per-game columns such as
# season_avg, season_std, team_pace, league_avg_pace, minutes,
# proj_minutes, opp_pts_allowed and league_avg.
create_features <- function(player_games) {
  player_games %>%
    arrange(game_date) %>%
    mutate(
      # Rolling averages over the last 5 and 10 games
      roll_5  = rollmean(pts, k = 5,  fill = NA, align = "right"),
      roll_10 = rollmean(pts, k = 10, fill = NA, align = "right"),
      # Trend: distance of recent form from the season baseline
      trend = (roll_5 - season_avg) / season_std,
      # Pace adjustment
      pace_factor = team_pace / league_avg_pace,
      pace_adj_pts = season_avg * pace_factor,
      # Minutes projection
      per_min = pts / minutes,
      min_proj_pts = per_min * proj_minutes,
      # Matchup adjustment
      matchup_factor = opp_pts_allowed / league_avg,
      # Combined projection
      projection = pace_adj_pts * matchup_factor * (1 + trend * 0.1)
    )
}

# Feature importance
library(randomForest)
rf_model <- randomForest(pts ~ ., data = features)
importance(rf_model)
```

Key Takeaways
- Features matter more than model choice
- Rolling averages capture recent form
- Pace/tempo adjusts for opportunity
- Watch for data leakage (using future info)
- More features ≠ better model
- Domain knowledge guides feature creation