Dynamic Pricing Interactive

Multi-Armed Bandit

Balance exploration and exploitation to learn optimal pricing in real time. Test different strategies while minimizing regret.

🎰 The Exploration-Exploitation Tradeoff

Exploitation

Use current best option. Maximize immediate reward.

  • โ€ข "Play it safe with what works"
  • โ€ข Risk: stuck on suboptimal arm
  • โ€ข Example: Always use 1.9x payout

Exploration

Try alternatives to learn more. Sacrifice short-term reward for information (see the sketch after this list).

  • โ€ข "Gather more data on options"
  • โ€ข Risk: wasting resources on bad arms
  • โ€ข Example: Test 2.1x on 10% of users
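
A minimal sketch of this explore/exploit decision in R; the function name and the tracked mean rewards are illustrative, not part of the interactive:

# One epsilon-greedy decision: explore with probability epsilon,
# otherwise exploit the arm with the highest observed mean reward.
choose_arm <- function(mean_rewards, epsilon = 0.1) {
  if (runif(1) < epsilon) {
    sample(length(mean_rewards), 1)   # explore: pick a random arm
  } else {
    which.max(mean_rewards)           # exploit: pick the current best
  }
}

# Example: observed conversion rates for three payout options
choose_arm(c(0.08, 0.12, 0.05), epsilon = 0.1)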

True Arm Probabilities

(Hidden from algorithm)

Arm 1 (Payout A): 0.08 (adjustable 0.01 to 0.2)
Arm 2 (Payout B): 0.12 (adjustable 0.01 to 0.2)
Arm 3 (Payout C): 0.05 (adjustable 0.01 to 0.2)

Algorithm Settings

Epsilon (exploration rate): 0.1 (adjustable 0 to 0.3)
Total Trials: 1000 (adjustable 100 to 5000)

📊 Results

Total Regret: 15.5
Regret per Trial: 0.0155
Best Arm Pulls: 679

Learned vs True Probabilities

Arm 1: True 8.0%, Estimated 9.4%, Pulls 232
Arm 2: True 12.0%, Estimated 13.4%, Pulls 679
Arm 3: True 5.0%, Estimated 4.4%, Pulls 89

Cumulative Regret Over Time

Regret = sum over trials of (best arm's expected reward - chosen arm's expected reward). Good algorithms have sublinear regret: the regret added per trial shrinks as they learn.
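
As a rough illustration of that definition, here is a sketch in R that computes cumulative expected regret from a record of chosen arms; the function name and inputs are hypothetical, not the interactive's internals.

# Cumulative expected regret: at each trial, the gap between the best
# arm's true success probability and the chosen arm's true probability.
cumulative_regret <- function(chosen_arms, true_probs) {
  per_trial_gap <- max(true_probs) - true_probs[chosen_arms]
  cumsum(per_trial_gap)
}

# Example: five trials pulling arms 1, 2, 2, 3, 2
cumulative_regret(c(1, 2, 2, 3, 2), c(0.08, 0.12, 0.05))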

🔬 Bandit Algorithms

ε-Greedy

Explore a random arm with probability ε, exploit the current best arm with probability 1 - ε

✓ Simple to implement
✗ Fixed exploration, not adaptive

Thompson Sampling

Sample from posterior, choose highest

✓ Naturally adaptive, theoretically optimal
✗ Requires Bayesian framework
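
A minimal sketch of a single Thompson Sampling step in R, assuming each arm keeps a Beta(successes + 1, failures + 1) posterior (the counts below are illustrative):

# One Thompson Sampling step: draw one sample from each arm's Beta
# posterior, then play the arm with the highest sampled value.
thompson_step <- function(successes, failures) {
  samples <- rbeta(length(successes), successes + 1, failures + 1)
  which.max(samples)
}

# Example: arm 2 has the strongest evidence of a high conversion rate
thompson_step(successes = c(20, 90, 4), failures = c(212, 589, 85))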

UCB

Upper Confidence Bound: optimism in the face of uncertainty

✓ Deterministic, strong guarantees
✗ Less practical for complex rewards
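
A minimal sketch of the UCB1 selection rule in R, assuming every arm has already been pulled at least once (the numbers are illustrative):

# UCB1: pick the arm maximizing observed mean reward plus an uncertainty
# bonus that shrinks as an arm accumulates pulls.
ucb1_choose <- function(mean_rewards, pulls, t) {
  bonus <- sqrt(2 * log(t) / pulls)   # requires pulls > 0 for every arm
  which.max(mean_rewards + bonus)
}

# Example at trial t = 100
ucb1_choose(mean_rewards = c(0.09, 0.13, 0.05), pulls = c(30, 60, 10), t = 100)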

💰 Pricing Applications

What to Test

  • Different payout multipliers (1.8x vs 1.9x vs 2.0x)
  • Hold rates by market type
  • Promotional offers and bonuses
  • UI variations for conversion

Why Bandits Beat A/B Tests

  • Lower regret: shift traffic to winners faster
  • Continuous learning, not a fixed sample size
  • Handle changing environments
  • Make better use of limited traffic

R Code Equivalent

# Thompson Sampling for multi-armed bandit
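# Note: this sketch layers epsilon-greedy exploration on top of Thompson
# Sampling, mirroring the interactive's epsilon setting; set epsilon = 0
# for standard Thompson Sampling.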
thompson_sampling <- function(true_probs, n_trials, epsilon = 0.1) { 
  n_arms <- length(true_probs)
  alpha <- rep(1, n_arms)  # Beta prior
  beta <- rep(1, n_arms)
  pulls <- rep(0, n_arms)
  rewards <- rep(0, n_arms)
  
  for (t in 1:n_trials) { 
    # Sample from posterior
    samples <- rbeta(n_arms, alpha, beta)
    
    # Epsilon-greedy exploration
    if (runif(1) < epsilon) { 
      chosen <- sample(n_arms, 1)
    } else { 
      chosen <- which.max(samples)
    }
    
    # Observe reward
    reward <- rbinom(1, 1, true_probs[chosen])
    
    # Update beliefs
    alpha[chosen] <- alpha[chosen] + reward
    beta[chosen] <- beta[chosen] + 1 - reward
    pulls[chosen] <- pulls[chosen] + 1
    rewards[chosen] <- rewards[chosen] + reward
  }
  
  return(list(pulls = pulls, rewards = rewards, 
              estimated = alpha / (alpha + beta)))
}

# Run simulation
result <- thompson_sampling(c(0.08, 0.12, 0.05), 1000, 0.1)
print(result)

✅ Key Takeaways

  • Balance exploration vs exploitation
  • Thompson Sampling is Bayesian and adaptive
  • Minimize regret, not just find the best arm
  • Bandits beat A/B for limited traffic
  • Use for real-time pricing optimization
  • Algorithms converge to the best arm over time

Pricing Models & Frameworks Tutorial

Built for mastery · Interactive learning