Reinforcement Learning
Learn optimal strategies through trial and error. The agent explores, receives rewards, and gradually discovers the best policy.
The RL Framework
State
Current situation (bankroll, game state)
Bankroll = $1000
Action
Decision to make (bet size, hedge)
Bet $50
Reward
Immediate feedback (+win, -loss)
+$50 or -$50
Policy
Strategy mapping states to actions
Kelly fraction
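Put together, one pass through this loop looks like the sketch below: a toy, self-contained illustration in which the flat 5% staking policy and the 50/50 outcome are placeholder assumptions, not the trained agent shown later on this page.

# Toy walk through the state -> action -> reward loop for a betting agent
# (flat 5% staking policy and 50/50 outcome are illustrative placeholders)
import numpy as np

state = 1000.0                               # state: current bankroll
policy = lambda bankroll: 0.05 * bankroll    # policy: maps state to an action
action = policy(state)                       # action: bet size in dollars
win = np.random.random() < 0.5               # environment resolves the bet
reward = action if win else -action          # reward: immediate profit or loss
next_state = state + reward                  # bankroll transitions to a new state
print(f"bet {action:.0f}, reward {reward:+.0f}, bankroll {next_state:.0f}")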
Training Results
Learning Curve
Hyperparameters
Higher learning rate → faster but noisier learning. Higher exploration rate → more diverse experience but slower convergence, as the sketch below illustrates.
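To see that trade-off concretely, the self-contained sketch below trains a trivial two-armed bandit (not a betting model) with two learning-rate settings and smooths the per-episode rewards into the kind of learning curve described above; all constants are illustrative.

# How learning rate and exploration shape a learning curve, on a toy two-armed bandit
import numpy as np

def train(lr, eps, episodes=500, win_prob=(0.45, 0.55)):
    q = np.zeros(2)                      # value estimate for each arm
    rewards = []
    for _ in range(episodes):
        explore = np.random.random() < eps
        a = np.random.randint(2) if explore else int(np.argmax(q))
        r = 1.0 if np.random.random() < win_prob[a] else -1.0
        q[a] += lr * (r - q[a])          # incremental value update
        rewards.append(r)
    # Smooth per-episode rewards into a learning curve
    return np.convolve(rewards, np.ones(50) / 50, mode="valid")

fast_noisy = train(lr=0.5, eps=0.2)      # high learning rate: quick but jumpy
slow_stable = train(lr=0.05, eps=0.2)    # low learning rate: smoother, slower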
Exploration vs Exploitation
Exploration
Try new actions to discover better strategies
20% random actions
Exploitation
Use best known action to maximize reward
80% greedy actions
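That 20/80 split is exactly what an epsilon-greedy rule with eps = 0.2 produces. A common refinement, sketched below with illustrative constants, is to decay epsilon over episodes so the agent explores heavily early on and exploits more as its estimates improve.

# Epsilon-greedy selection with decaying exploration (illustrative constants)
import numpy as np

def epsilon_greedy(q_values, eps):
    if np.random.random() < eps:              # explore: random action
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))           # exploit: best known action

q_values = np.array([0.1, 0.4, -0.2])         # value estimates for 3 bet sizes
eps, eps_min, decay = 0.2, 0.01, 0.995
for episode in range(1000):
    action = epsilon_greedy(q_values, eps)
    eps = max(eps_min, eps * decay)           # shift gradually toward exploitation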
RL Algorithms
Q-Learning
Value-based
Discrete actions
DQN
Deep Q-Network
Large state spaces
PPO
Policy gradient
Continuous actions
A2C
Actor-Critic
Stable learning
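For the policy-gradient and actor-critic rows, most practitioners reach for an off-the-shelf implementation rather than hand-coding the update. The sketch below assumes the stable-baselines3 and gymnasium packages are installed, and uses the standard CartPole toy environment purely as a stand-in for a custom betting environment.

# Sketch: training PPO with stable-baselines3 (assumes the library is installed;
# CartPole-v1 is a stand-in for a custom betting environment)
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, learning_rate=3e-4, verbose=0)
model.learn(total_timesteps=10_000)                  # collect rollouts, update policy

obs, _ = env.reset()
action, _ = model.predict(obs, deterministic=True)   # greedy action from the learned policy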
Betting Applications
Dynamic Bet Sizing
Learn optimal Kelly fraction based on bankroll state
Live Betting
When to bet in-play based on game state
Market Making
Dynamic line adjustment based on order flow
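For dynamic bet sizing, the closed-form Kelly fraction is the natural benchmark for whatever staking policy the agent learns; the sketch below computes it for illustrative numbers (55% win probability at even odds) alongside the common half-Kelly risk reduction.

# Closed-form Kelly fraction as a benchmark for a learned bet-sizing policy
def kelly_fraction(p, b):
    """p = win probability, b = net odds (profit per unit staked on a win)."""
    return (b * p - (1 - p)) / b

p, b, bankroll = 0.55, 1.0, 1000.0      # illustrative numbers
f = kelly_fraction(p, b)                # 0.10 -> full Kelly bets 10% of bankroll
stake = 0.5 * f * bankroll              # half-Kelly is a common safety margin
print(f, stake)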
Python Code
# Q-Learning for bet sizing
import numpy as np

class BettingAgent:
    def __init__(self, states=10, actions=3, lr=0.1, gamma=0.95, eps=0.2):
        self.q_table = np.zeros((states, actions))
        self.lr = lr        # learning rate
        self.gamma = gamma  # discount factor
        self.eps = eps      # exploration rate

    def choose_action(self, state):
        # Epsilon-greedy: explore with probability eps, otherwise act greedily
        if np.random.random() < self.eps:
            return np.random.randint(len(self.q_table[state]))
        return int(np.argmax(self.q_table[state]))

    def learn(self, state, action, reward, next_state):
        # Q-learning temporal-difference update
        max_next_q = np.max(self.q_table[next_state])
        td_target = reward + self.gamma * max_next_q
        td_error = td_target - self.q_table[state, action]
        self.q_table[state, action] += self.lr * td_error

# Train against a toy simulator: actions are bet sizes 0/1/2 units,
# resolved as a 52%-win coin flip (placeholder -- swap in a real simulator)
agent = BettingAgent()
for episode in range(100):
    state = 5  # Mid bankroll
    for step in range(20):
        action = agent.choose_action(state)
        win = np.random.random() < 0.52
        reward = action if win else -action
        next_state = int(np.clip(state + ((1 if win else -1) if action > 0 else 0), 0, 9))
        agent.learn(state, action, reward, next_state)
        state = next_state

Key Takeaways
- RL learns from interaction, not labeled data
- Balance exploration (discover) vs exploitation (profit)
- Q-learning for discrete, policy gradient for continuous
- Requires many episodes to converge
- Great for sequential decision problems
- Simulator needed for safe training