Dynamic Pricing Interactive
Multi-Armed Bandit
Balance exploration and exploitation to learn optimal pricing in real time. Test different strategies while minimizing regret.
The Exploration-Exploitation Tradeoff
Exploitation
Use current best option. Maximize immediate reward.
- "Play it safe with what works"
- Risk: stuck on a suboptimal arm
- Example: Always use the 1.9x payout
Exploration
Try alternatives to learn more. Sacrifice short-term for information.
- "Gather more data on options"
- Risk: wasting resources on bad arms
- Example: Test 2.1x on 10% of users (sketched in R below)
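As a concrete illustration of the tradeoff, here is a minimal R sketch of the "test 2.1x on 10% of users" split. The multipliers and the 10% rate are the illustrative values from the examples above, not part of the simulation.
# Route ~10% of users to the experimental 2.1x payout (exploration),
# the rest to the current best 1.9x payout (exploitation)
explore_rate <- 0.10
choose_payout <- function() {
  if (runif(1) < explore_rate) 2.1 else 1.9
}
payouts <- replicate(1000, choose_payout())
table(payouts)   # roughly 90% at 1.9, 10% at 2.1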
True Arm Probabilities (hidden from the algorithm)
Sliders set each arm's true success probability, adjustable from 0.01 to 0.2.
Algorithm Settings
Sliders set the exploration rate (0 to 0.3) and the number of trials (100 to 5000).
Results
Total Regret: 15.5
Regret/Trial: 0.0155
Best Arm Pulls: 679
Learned vs True Probabilities
- Arm 1: True 8.0%, Estimated 9.4%, Pulls 232
- Arm 2: True 12.0%, Estimated 13.4%, Pulls 679 (best arm)
- Arm 3: True 5.0%, Estimated 4.4%, Pulls 89
Cumulative Regret Over Time
Regret = sum over trials of (best arm's expected reward - chosen arm's expected reward). Good algorithms achieve sublinear regret, so regret per trial shrinks toward zero.
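As a sanity check, the headline numbers above can be reconstructed in R from the pull counts and each arm's gap to the best arm (this treats regret as expected, i.e. pseudo-regret, which is what the displayed values appear to use).
# Expected (pseudo-)regret: gap to the best arm times number of pulls
true_probs <- c(0.08, 0.12, 0.05)
pulls <- c(232, 679, 89)
gaps <- max(true_probs) - true_probs      # 0.04, 0.00, 0.07
sum(gaps * pulls)                         # ~15.5   (Total Regret)
sum(gaps * pulls) / sum(pulls)            # ~0.0155 (Regret/Trial)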
Bandit Algorithms
ε-Greedy
Explore with probability ε, otherwise exploit the current best arm (sketch below)
✓ Simple to implement
✗ Fixed exploration rate, not adaptive
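A minimal sketch of the ε-greedy selection rule. It is standalone: the function name is illustrative, and the estimates passed in are the ones from the results above.
# epsilon-greedy: explore a random arm with probability epsilon,
# otherwise pull the arm with the highest estimated reward
epsilon_greedy_choice <- function(estimates, epsilon = 0.1) {
  if (runif(1) < epsilon) {
    sample(length(estimates), 1)   # explore
  } else {
    which.max(estimates)           # exploit
  }
}
epsilon_greedy_choice(c(0.094, 0.134, 0.044))   # usually picks arm 2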
Thompson Sampling
Sample from each arm's posterior, choose the highest sample (selection step shown below)
✓ Naturally adaptive, theoretically optimal
✗ Requires a Bayesian framework
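The selection step on its own looks like this; the posterior parameters below are illustrative values roughly matching the estimated probabilities above, and the full loop with posterior updates is in the R Code Equivalent section further down.
# Thompson Sampling choice: draw one sample from each arm's Beta posterior
# and pull the arm whose sampled success probability is highest
alpha <- c(22, 91, 4)     # illustrative: observed successes + 1
beta  <- c(212, 590, 87)  # illustrative: observed failures + 1
chosen <- which.max(rbeta(length(alpha), alpha, beta))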
UCB
Upper Confidence Bound: optimism in the face of uncertainty (rule sketched below)
✓ Deterministic, strong guarantees
✗ Less practical for complex rewards
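A minimal sketch of the UCB1 variant, which adds an optimism bonus that shrinks as an arm accumulates pulls. UCB1 is one common form of UCB; the bonus formula below assumes it.
# UCB1: pick the arm maximizing estimated mean + exploration bonus
# (each arm must have been pulled at least once before using this rule)
ucb1_choice <- function(means, pulls, t) {
  bonus <- sqrt(2 * log(t) / pulls)   # larger for rarely pulled arms
  which.max(means + bonus)
}
# With the counts from the results above, the bonus still favors
# the under-explored Arm 3 despite its low estimated mean
ucb1_choice(means = c(0.094, 0.134, 0.044), pulls = c(232, 679, 89), t = 1000)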
Pricing Applications
What to Test
- Different payout multipliers (1.8x vs 1.9x vs 2.0x), as sketched after this list
- Hold rates by market type
- Promotional offers and bonuses
- UI variations for conversion
Why Bandits Beat A/B Tests
- Less regret: shift traffic to winners faster
- Continuous learning, not a fixed sample
- Handle changing environments
- Optimal for limited traffic
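One way the payout-multiplier test maps onto the simulation above: treat each multiplier as an arm and a user conversion as the Bernoulli reward. The conversion rates below are hypothetical placeholders, and thompson_sampling() is the function defined in the next section.
# Hypothetical mapping: arms = payout multipliers, reward = 1 if the user converts
multipliers <- c(1.8, 1.9, 2.0)
conversion_rates <- c(0.08, 0.12, 0.05)   # unknown in practice; assumed for this sketch
result <- thompson_sampling(conversion_rates, n_trials = 1000)
multipliers[which.max(result$estimated)]  # multiplier the bandit converges toward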
R Code Equivalent
# Thompson Sampling for a multi-armed bandit (with optional epsilon-greedy exploration)
thompson_sampling <- function(true_probs, n_trials, epsilon = 0.1) {
  n_arms <- length(true_probs)
  alpha <- rep(1, n_arms)   # Beta prior: successes + 1
  beta  <- rep(1, n_arms)   # Beta prior: failures + 1
  pulls <- rep(0, n_arms)
  rewards <- rep(0, n_arms)

  for (t in 1:n_trials) {
    # Sample once from each arm's Beta posterior
    samples <- rbeta(n_arms, alpha, beta)

    # Optional epsilon-greedy exploration on top of Thompson Sampling
    if (runif(1) < epsilon) {
      chosen <- sample(n_arms, 1)
    } else {
      chosen <- which.max(samples)
    }

    # Observe a Bernoulli reward from the chosen arm
    reward <- rbinom(1, 1, true_probs[chosen])

    # Update posterior counts and bookkeeping
    alpha[chosen] <- alpha[chosen] + reward
    beta[chosen] <- beta[chosen] + 1 - reward
    pulls[chosen] <- pulls[chosen] + 1
    rewards[chosen] <- rewards[chosen] + reward
  }

  return(list(pulls = pulls, rewards = rewards,
              estimated = alpha / (alpha + beta)))
}

# Run simulation: true probabilities 8%, 12%, 5% over 1000 trials
result <- thompson_sampling(c(0.08, 0.12, 0.05), 1000, 0.1)
print(result)

Key Takeaways
- Balance exploration vs exploitation
- Thompson Sampling is Bayesian and adaptive
- Minimize regret, not just find the best arm
- Bandits beat A/B tests for limited traffic
- Use for real-time pricing optimization
- Algorithms converge to the best arm over time