
Reinforcement Learning

Learn optimal strategies through trial and error. The agent explores, receives rewards, and gradually discovers the best policy.

📊 The RL Framework

State: the current situation (bankroll, game state). Example: bankroll = $1000

Action: the decision to make (bet size, hedge). Example: bet $50

Reward: immediate feedback (positive for a win, negative for a loss). Example: +$50 or -$50

Policy: the strategy mapping states to actions. Example: a Kelly fraction
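To make the framework concrete, one interaction can be recorded as a (state, action, reward, next_state) transition. This is a minimal sketch; the field names and dollar values are illustrative assumptions, not part of any particular library.

from typing import NamedTuple

class Transition(NamedTuple):
    state: float        # bankroll before the bet, e.g. 1000.0
    action: float       # stake chosen by the agent, e.g. 50.0
    reward: float       # realised profit or loss, e.g. +50.0 or -50.0
    next_state: float   # bankroll after the outcome

# One step of experience matching the example values above
t = Transition(state=1000.0, action=50.0, reward=50.0, next_state=1050.0)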

Hyperparameters

Learning Rate (α): 0.1 (range 0.01 to 0.5)
Discount Factor (γ): 0.95 (range 0.8 to 0.99)
Exploration (ε): 0.2 (range 0.05 to 0.5)
Episodes: 100 (range 50 to 200)
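The first three values plug directly into the Q-learning update Q(s,a) <- Q(s,a) + α * (r + γ * max_a' Q(s',a') - Q(s,a)). A minimal sketch of a single update with the slider defaults; the reward and Q-values are made-up numbers for illustration only:

alpha, gamma = 0.1, 0.95              # learning rate and discount factor from the sliders

q_sa, r, max_next_q = 0.0, 1.0, 2.0   # illustrative values, not from a real run
td_target = r + gamma * max_next_q    # 1.0 + 0.95 * 2.0 = 2.9
td_error = td_target - q_sa           # 2.9 - 0.0 = 2.9
q_sa += alpha * td_error              # 0.0 + 0.1 * 2.9 = 0.29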

📊 Training Results

Final Avg Reward: 2.3
Episodes Run: 100
✓ Agent learned profitable strategy

Learning Curve

Higher learning rate → faster but noisier. Higher exploration → more diverse experience but slower convergence.

Exploration vs Exploitation

🔍 Exploration

Try new actions to discover better strategies

20% random actions

💰 Exploitation

Use best known action to maximize reward

80% greedy actions
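The ε-greedy rule used by the agent further below implements exactly this split: with probability ε take a random action, otherwise the best known one. A common refinement, not used in this lesson's agent and shown here only as an optional assumption, is to decay ε over episodes so early training explores and late training exploits:

import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_row, eps):
    # Random action with probability eps, otherwise the best known action
    if rng.random() < eps:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))

# Illustrative linear decay from 0.5 down to 0.05 over 100 episodes (assumed schedule)
eps_start, eps_end, n_episodes = 0.5, 0.05, 100
for ep in range(n_episodes):
    eps = eps_start + (eps_end - eps_start) * ep / (n_episodes - 1)
    # ... call epsilon_greedy(q_table[state], eps) inside the episode loop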

🤖 RL Algorithms

Q-Learning: value-based, for discrete actions
DQN: Deep Q-Network, for large state spaces
PPO: policy gradient, for continuous actions (library sketch below)
A2C: Actor-Critic, for stable learning
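Most of these algorithms are available off the shelf. As a hedged illustration, assuming stable-baselines3 and gymnasium are installed and using CartPole-v1 purely as a stand-in for a custom betting simulator you would have to write yourself, training PPO looks roughly like this:

import gymnasium as gym
from stable_baselines3 import PPO

# CartPole-v1 is a placeholder environment; a real application would wrap a
# betting simulator in the gymnasium.Env interface instead.
env = gym.make("CartPole-v1")

model = PPO("MlpPolicy", env, verbose=0)   # policy-gradient agent with an MLP policy
model.learn(total_timesteps=10_000)        # training budget, loosely comparable to "episodes"

obs, info = env.reset()
action, _ = model.predict(obs, deterministic=True)  # greedy action from the trained policy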

🎰 Betting Applications

Dynamic Bet Sizing

Learn optimal Kelly fraction based on bankroll state

Live Betting

When to bet in-play based on game state

Market Making

Dynamic line adjustment based on order flow
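As a concrete sketch of the dynamic bet sizing idea above, the discrete states can be bankroll buckets and the discrete actions can be fractions of a full Kelly stake. The bucket edges, fractions, and 5% Kelly stake below are assumptions chosen for illustration, not recommended values:

import numpy as np

# Assumed discretisation: 10 bankroll buckets between $0 and $2,000
BUCKET_EDGES = np.linspace(0, 2000, 11)

def bankroll_to_state(bankroll):
    # Map a dollar bankroll to a state index 0..9
    return int(np.clip(np.digitize(bankroll, BUCKET_EDGES) - 1, 0, 9))

# Assumed action set: fraction of the full Kelly stake to bet
KELLY_FRACTIONS = [0.25, 0.5, 1.0]

def action_to_stake(action, bankroll, kelly_fraction=0.05):
    # Translate an action index into a dollar stake
    return bankroll * kelly_fraction * KELLY_FRACTIONS[action]

print(bankroll_to_state(1000))   # -> 5, the mid-bankroll state used in the code below
print(action_to_stake(1, 1000))  # -> 25.0 dollars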

Python Code

# Q-Learning for bet sizing
import numpy as np

class BettingAgent:
    def __init__(self, states=10, actions=3, lr=0.1, gamma=0.95, eps=0.2):
        self.q_table = np.zeros((states, actions))  # Q(s, a) value estimates
        self.lr = lr          # learning rate (alpha)
        self.gamma = gamma    # discount factor
        self.eps = eps        # exploration probability (epsilon)

    def choose_action(self, state):
        # Epsilon-greedy: explore with probability eps, otherwise act greedily
        if np.random.random() < self.eps:
            return np.random.randint(len(self.q_table[state]))
        return int(np.argmax(self.q_table[state]))

    def learn(self, state, action, reward, next_state):
        # One-step Q-learning (temporal-difference) update
        max_next_q = np.max(self.q_table[next_state])
        td_target = reward + self.gamma * max_next_q
        td_error = td_target - self.q_table[state, action]
        self.q_table[state, action] += self.lr * td_error

# Train (the environment step below is a stand-in simulation: the 55% win
# probability and action-scaled payouts are assumed purely for illustration)
agent = BettingAgent()
for episode in range(100):
    state = 5  # start in the mid-bankroll state
    for step in range(20):
        action = agent.choose_action(state)
        # Stand-in outcome: simulated coin flip, stake scales with the action index
        win = np.random.random() < 0.55
        reward = (action + 1) * (1 if win else -1)
        next_state = int(np.clip(state + (1 if win else -1), 0, 9))
        agent.learn(state, action, reward, next_state)
        state = next_state  # continue from the updated bankroll state
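After training, the greedy policy can be read straight off the Q-table, one best action per bankroll state; logging the summed reward of each episode inside the loop above is what produces the learning curve shown earlier. A short follow-up to the block above:

# Greedy policy after training: the best action index for each of the 10 states
learned_policy = np.argmax(agent.q_table, axis=1)
print(learned_policy)  # e.g. an array of action indices, one per bankroll state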

✅ Key Takeaways

  • RL learns from interaction, not labeled data
  • Balance exploration (discover) vs exploitation (profit)
  • Q-learning for discrete actions, policy gradients for continuous ones
  • Requires many episodes to converge
  • Great for sequential decision problems
  • A simulator is needed for safe training (see the sketch below)
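The last point deserves emphasis: training against real money is not an option, so the agent needs a simulator that mimics the state/action/reward loop. A minimal sketch, assuming a fixed win probability and even-money payouts, both of which you would replace with your own model:

import numpy as np

class BettingSimulator:
    # Toy environment: 10 bankroll states, even-money bets, assumed 55% win rate

    def __init__(self, win_prob=0.55, n_states=10, start_state=5):
        self.win_prob = win_prob
        self.n_states = n_states
        self.start_state = start_state

    def reset(self):
        self.state = self.start_state
        return self.state

    def step(self, action):
        # Larger action index = larger stake; the outcome is a simulated coin flip
        stake = action + 1
        win = np.random.random() < self.win_prob
        reward = stake if win else -stake
        self.state = int(np.clip(self.state + (1 if win else -1), 0, self.n_states - 1))
        done = self.state in (0, self.n_states - 1)  # stop at ruin or at the cap
        return self.state, reward, done

# Hypothetical wiring with the agent defined above:
# env = BettingSimulator(); state = env.reset()
# action = agent.choose_action(state); next_state, reward, done = env.step(action)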
