Reinforcement Learning
Learn optimal strategies through trial and error. The agent explores, receives rewards, and gradually discovers the best policy.
The RL Framework
State
Current situation (bankroll, game state)
Bankroll = $1000
Action
Decision to make (bet size, hedge)
Bet $50
Reward
Immediate feedback (+win, -loss)
+$50 or -$50
Policy
Strategy mapping states to actions
Kelly fraction
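Put together, one pass through this loop looks like the sketch below: a toy, self-contained illustration in which the flat 5% staking policy and the 50/50 outcome are placeholder assumptions, not the trained agent shown later on this page.

# Toy walk through the state -> action -> reward loop for a betting agent
# (flat 5% staking policy and 50/50 outcome are illustrative placeholders)
import numpy as np

state = 1000.0                               # state: current bankroll
policy = lambda bankroll: 0.05 * bankroll    # policy: maps state to an action
action = policy(state)                       # action: bet size in dollars
win = np.random.random() < 0.5               # environment resolves the bet
reward = action if win else -action          # reward: immediate profit or loss
next_state = state + reward                  # bankroll transitions to a new state
print(f"bet {action:.0f}, reward {reward:+.0f}, bankroll {next_state:.0f}")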
Training Results
Learning Curve
Hyperparameters
Higher learning rate → faster but noisier learning. Higher exploration rate → more diverse experience but slower convergence, as the sketch below illustrates.
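To see that trade-off concretely, the self-contained sketch below trains a trivial two-armed bandit (not a betting model) with two learning-rate settings and smooths the per-episode rewards into the kind of learning curve described above; all constants are illustrative.

# How learning rate and exploration shape a learning curve, on a toy two-armed bandit
import numpy as np

def train(lr, eps, episodes=500, win_prob=(0.45, 0.55)):
    q = np.zeros(2)                      # value estimate for each arm
    rewards = []
    for _ in range(episodes):
        explore = np.random.random() < eps
        a = np.random.randint(2) if explore else int(np.argmax(q))
        r = 1.0 if np.random.random() < win_prob[a] else -1.0
        q[a] += lr * (r - q[a])          # incremental value update
        rewards.append(r)
    # Smooth per-episode rewards into a learning curve
    return np.convolve(rewards, np.ones(50) / 50, mode="valid")

fast_noisy = train(lr=0.5, eps=0.2)      # high learning rate: quick but jumpy
slow_stable = train(lr=0.05, eps=0.2)    # low learning rate: smoother, slower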
Exploration vs Exploitation
Exploration
Try new actions to discover better strategies
20% random actions
Exploitation
Use best known action to maximize reward
80% greedy actions
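That 20/80 split is exactly what an epsilon-greedy rule with eps = 0.2 produces. A common refinement, sketched below with illustrative constants, is to decay epsilon over episodes so the agent explores heavily early on and exploits more as its estimates improve.

# Epsilon-greedy selection with decaying exploration (illustrative constants)
import numpy as np

def epsilon_greedy(q_values, eps):
    if np.random.random() < eps:              # explore: random action
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))           # exploit: best known action

q_values = np.array([0.1, 0.4, -0.2])         # value estimates for 3 bet sizes
eps, eps_min, decay = 0.2, 0.01, 0.995
for episode in range(1000):
    action = epsilon_greedy(q_values, eps)
    eps = max(eps_min, eps * decay)           # shift gradually toward exploitation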
RL Algorithms
Q-Learning
Value-based
Discrete actions
DQN
Deep Q-Network
Large state spaces
PPO
Policy gradient
Continuous actions
A2C
Actor-Critic
Stable learning
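For the policy-gradient and actor-critic rows, most practitioners reach for an off-the-shelf implementation rather than hand-coding the update. The sketch below assumes the stable-baselines3 and gymnasium packages are installed, and uses the standard CartPole toy environment purely as a stand-in for a custom betting environment.

# Sketch: training PPO with stable-baselines3 (assumes the library is installed;
# CartPole-v1 is a stand-in for a custom betting environment)
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, learning_rate=3e-4, verbose=0)
model.learn(total_timesteps=10_000)                  # collect rollouts, update policy

obs, _ = env.reset()
action, _ = model.predict(obs, deterministic=True)   # greedy action from the learned policy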
Betting Applications
Dynamic Bet Sizing
Learn optimal Kelly fraction based on bankroll state
Live Betting
When to bet in-play based on game state
Market Making
Dynamic line adjustment based on order flow
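For dynamic bet sizing, the closed-form Kelly fraction is the natural benchmark for whatever staking policy the agent learns; the sketch below computes it for illustrative numbers (55% win probability at even odds) alongside the common half-Kelly risk reduction.

# Closed-form Kelly fraction as a benchmark for a learned bet-sizing policy
def kelly_fraction(p, b):
    """p = win probability, b = net odds (profit per unit staked on a win)."""
    return (b * p - (1 - p)) / b

p, b, bankroll = 0.55, 1.0, 1000.0      # illustrative numbers
f = kelly_fraction(p, b)                # 0.10 -> full Kelly bets 10% of bankroll
stake = 0.5 * f * bankroll              # half-Kelly is a common safety margin
print(f, stake)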
Python Code
# Q-Learning for bet sizing
import numpy as np

class BettingAgent:
    def __init__(self, states=10, actions=3, lr=0.1, gamma=0.95, eps=0.2):
        self.q_table = np.zeros((states, actions))
        self.lr = lr        # learning rate
        self.gamma = gamma  # discount factor
        self.eps = eps      # exploration rate

    def choose_action(self, state):
        # Epsilon-greedy: explore with probability eps, otherwise act greedily
        if np.random.random() < self.eps:
            return np.random.randint(len(self.q_table[state]))
        return int(np.argmax(self.q_table[state]))

    def learn(self, state, action, reward, next_state):
        # Q-learning temporal-difference update
        max_next_q = np.max(self.q_table[next_state])
        td_target = reward + self.gamma * max_next_q
        td_error = td_target - self.q_table[state, action]
        self.q_table[state, action] += self.lr * td_error

# Train against a toy simulator: actions are bet sizes 0/1/2 units,
# resolved as a 52%-win coin flip (placeholder -- swap in a real simulator)
agent = BettingAgent()
for episode in range(100):
    state = 5  # Mid bankroll
    for step in range(20):
        action = agent.choose_action(state)
        win = np.random.random() < 0.52
        reward = action if win else -action
        next_state = int(np.clip(state + ((1 if win else -1) if action > 0 else 0), 0, 9))
        agent.learn(state, action, reward, next_state)
        state = next_state

Key Takeaways
- RL learns from interaction, not labeled data
- Balance exploration (discover) vs exploitation (profit)
- Q-learning for discrete, policy gradient for continuous
- Requires many episodes to converge
- Great for sequential decision problems
- Simulator needed for safe training