Dynamic Pricing Interactive

Multi-Armed Bandit

Balance exploration and exploitation to learn optimal pricing in real time. Test different strategies while minimizing regret.

🎰 The Exploration-Exploitation Tradeoff

Exploitation

Use current best option. Maximize immediate reward.

  • โ€ข "Play it safe with what works"
  • โ€ข Risk: stuck on suboptimal arm
  • โ€ข Example: Always use 1.9x payout

Exploration

Try alternatives to learn more. Sacrifice short-term reward for information (see the sketch after this list).

  • โ€ข "Gather more data on options"
  • โ€ข Risk: wasting resources on bad arms
  • โ€ข Example: Test 2.1x on 10% of users
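
A minimal sketch of this explore/exploit decision in R; the function name and the tracked mean rewards are illustrative, not part of the interactive:

# One epsilon-greedy decision: explore with probability epsilon,
# otherwise exploit the arm with the highest observed mean reward.
choose_arm <- function(mean_rewards, epsilon = 0.1) {
  if (runif(1) < epsilon) {
    sample(length(mean_rewards), 1)   # explore: pick a random arm
  } else {
    which.max(mean_rewards)           # exploit: pick the current best
  }
}

# Example: observed conversion rates for three payout options
choose_arm(c(0.08, 0.12, 0.05), epsilon = 0.1)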

True Arm Probabilities

(Hidden from algorithm)

Arm 1 (Payout A): 0.08 (adjustable 0.01 to 0.2)
Arm 2 (Payout B): 0.12 (adjustable 0.01 to 0.2)
Arm 3 (Payout C): 0.05 (adjustable 0.01 to 0.2)

Algorithm Settings

Epsilon (exploration rate): 0.1 (adjustable 0 to 0.3)
Total Trials: 1000 (adjustable 100 to 5000)

📊 Results

Total Regret: 15.5
Regret per Trial: 0.0155
Best Arm Pulls: 679

Learned vs True Probabilities

Arm 1: True 8.0%, Estimated 9.4%, Pulls 232
Arm 2: True 12.0%, Estimated 13.4%, Pulls 679
Arm 3: True 5.0%, Estimated 4.4%, Pulls 89

Cumulative Regret Over Time

Regret = sum over trials of (best arm's expected reward - chosen arm's expected reward). Good algorithms have sublinear regret: the regret added per trial shrinks as they learn.
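
As a rough illustration of that definition, here is a sketch in R that computes cumulative expected regret from a record of chosen arms; the function name and inputs are hypothetical, not the interactive's internals.

# Cumulative expected regret: at each trial, the gap between the best
# arm's true success probability and the chosen arm's true probability.
cumulative_regret <- function(chosen_arms, true_probs) {
  per_trial_gap <- max(true_probs) - true_probs[chosen_arms]
  cumsum(per_trial_gap)
}

# Example: five trials pulling arms 1, 2, 2, 3, 2
cumulative_regret(c(1, 2, 2, 3, 2), c(0.08, 0.12, 0.05))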

🔬 Bandit Algorithms

ε-Greedy

Explore a random arm with probability ε, exploit the current best arm with probability 1 - ε

✓ Simple to implement
✗ Fixed exploration, not adaptive

Thompson Sampling

Sample from posterior, choose highest

✓ Naturally adaptive, theoretically optimal
✗ Requires Bayesian framework
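
A minimal sketch of a single Thompson Sampling step in R, assuming each arm keeps a Beta(successes + 1, failures + 1) posterior (the counts below are illustrative):

# One Thompson Sampling step: draw one sample from each arm's Beta
# posterior, then play the arm with the highest sampled value.
thompson_step <- function(successes, failures) {
  samples <- rbeta(length(successes), successes + 1, failures + 1)
  which.max(samples)
}

# Example: arm 2 has the strongest evidence of a high conversion rate
thompson_step(successes = c(20, 90, 4), failures = c(212, 589, 85))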

UCB

Upper Confidence Bound: optimism in the face of uncertainty

✓ Deterministic, strong guarantees
✗ Less practical for complex rewards
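
A minimal sketch of the UCB1 selection rule in R, assuming every arm has already been pulled at least once (the numbers are illustrative):

# UCB1: pick the arm maximizing observed mean reward plus an uncertainty
# bonus that shrinks as an arm accumulates pulls.
ucb1_choose <- function(mean_rewards, pulls, t) {
  bonus <- sqrt(2 * log(t) / pulls)   # requires pulls > 0 for every arm
  which.max(mean_rewards + bonus)
}

# Example at trial t = 100
ucb1_choose(mean_rewards = c(0.09, 0.13, 0.05), pulls = c(30, 60, 10), t = 100)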

💰 Pricing Applications

What to Test

  • Different payout multipliers (1.8x vs 1.9x vs 2.0x)
  • Hold rates by market type
  • Promotional offers and bonuses
  • UI variations for conversion

Why Bandits Beat A/B Tests

  • Lower regret: shift traffic to winners faster
  • Continuous learning, not a fixed sample size
  • Handle changing environments
  • Make better use of limited traffic

R Code Equivalent

# Thompson Sampling for multi-armed bandit
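# Note: this sketch layers epsilon-greedy exploration on top of Thompson
# Sampling, mirroring the interactive's epsilon setting; set epsilon = 0
# for standard Thompson Sampling.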
thompson_sampling <- function(true_probs, n_trials, epsilon = 0.1) { 
  n_arms <- length(true_probs)
  alpha <- rep(1, n_arms)  # Beta prior
  beta <- rep(1, n_arms)
  pulls <- rep(0, n_arms)
  rewards <- rep(0, n_arms)
  
  for (t in 1:n_trials) { 
    # Sample from posterior
    samples <- rbeta(n_arms, alpha, beta)
    
    # Epsilon-greedy exploration
    if (runif(1) < epsilon) { 
      chosen <- sample(n_arms, 1)
    } else { 
      chosen <- which.max(samples)
    }
    
    # Observe reward
    reward <- rbinom(1, 1, true_probs[chosen])
    
    # Update beliefs
    alpha[chosen] <- alpha[chosen] + reward
    beta[chosen] <- beta[chosen] + 1 - reward
    pulls[chosen] <- pulls[chosen] + 1
    rewards[chosen] <- rewards[chosen] + reward
  }
  
  return(list(pulls = pulls, rewards = rewards, 
              estimated = alpha / (alpha + beta)))
}

# Run simulation
result <- thompson_sampling(c(0.08, 0.12, 0.05), 1000, 0.1)
print(result)

✅ Key Takeaways

  • Balance exploration vs exploitation
  • Thompson Sampling is Bayesian and adaptive
  • Minimize regret, not just find the best arm
  • Bandits beat A/B for limited traffic
  • Use for real-time pricing optimization
  • Algorithms converge to the best arm over time

Pricing Models & Frameworks Tutorial

Built for mastery · Interactive learning