Data Pipelines
Move data from sources to destinations reliably. The backbone of any data-driven betting operation.
ETL/ELT Pipeline Stages

1. Ingest: collect raw data from sources (Kafka, Airflow, Fivetran)
2. Transform: clean, validate, feature engineering (dbt, Spark, Pandas)
3. Store: land it in a data warehouse/lake (Snowflake, BigQuery, S3)
4. Serve: expose it via APIs, dashboards, and models (FastAPI, Tableau, MLflow)
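To make the stage boundaries concrete, here is a minimal sketch of the first three stages as plain Python functions; the record shape and values are illustrative, not a real feed.

```python
# Minimal ingest -> transform -> store sketch; all data is illustrative.
from typing import Any

def ingest() -> list[dict[str, Any]]:
    """Collect raw records from a source (stubbed with one fake row)."""
    return [{"match_id": 101, "home_odds": "2.10", "away_odds": None}]

def transform(raw: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Clean and standardize: drop incomplete rows, cast odds to float."""
    return [
        {"match_id": r["match_id"], "home_odds": float(r["home_odds"])}
        for r in raw
        if r.get("home_odds") is not None
    ]

def store(rows: list[dict[str, Any]]) -> None:
    """Persist to the warehouse/lake (stubbed as a print)."""
    print(f"storing {len(rows)} rows")

store(transform(ingest()))  # prints: storing 1 rows
```

In a real pipeline each stage runs as a separate task so failures retry independently, as in the Airflow example below.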
Betting Data Sources

| Source | Frequency | Volume | Examples |
|---|---|---|---|
| Odds Feeds | Real-time | High | Pinnacle, Betfair API |
| Stats Providers | Daily | Medium | Sportradar, Stats LLC |
| Betting Activity | Real-time | Very High | Internal DB |
| External Data | Weekly | Low | Weather, injuries |
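A pattern that works well here is a single source registry recording each feed's cadence, so schedules are derived from one place rather than scattered across DAGs. A hedged sketch (the names and cron strings are assumptions, not from the source):

```python
# Hypothetical registry mapping each betting data source to its cadence.
SOURCES = {
    "odds_feeds":       {"frequency": "real-time", "volume": "high",      "cron": "*/1 * * * *"},
    "stats_providers":  {"frequency": "daily",     "volume": "medium",    "cron": "0 6 * * *"},
    "betting_activity": {"frequency": "real-time", "volume": "very high", "cron": "*/1 * * * *"},
    "external_data":    {"frequency": "weekly",    "volume": "low",       "cron": "0 6 * * 1"},
}

def cron_for(source: str) -> str:
    """Look up a source's schedule; raises KeyError for unknown sources."""
    return SOURCES[source]["cron"]
```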
Pipeline Status
| Pipeline | Status | Last Run | Duration |
|---|---|---|---|
| odds_ingest | running | 2 min ago | 45s |
| player_stats_daily | success | 4h ago | 12m |
| feature_engineering | success | 1h ago | 8m |
| model_training | pending | scheduled | - |
| liability_snapshot | failed | 30 min ago | 2m |
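The failed liability_snapshot above is exactly what a freshness monitor should catch. A minimal sketch of an SLA check, assuming run timestamps are stored as UTC-aware datetimes and with illustrative thresholds:

```python
# Freshness check: flag pipelines whose last successful run breaches an SLA.
from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = {  # illustrative thresholds; tune per pipeline
    "odds_ingest": timedelta(minutes=10),
    "player_stats_daily": timedelta(hours=26),
    "liability_snapshot": timedelta(hours=1),
}

def is_stale(pipeline: str, last_success: datetime) -> bool:
    """True if the pipeline's last successful run is older than its SLA."""
    sla = FRESHNESS_SLA.get(pipeline)
    if sla is None:
        return False  # no SLA configured for this pipeline
    return datetime.now(timezone.utc) - last_success > sla

print(is_stale("odds_ingest", datetime.now(timezone.utc) - timedelta(minutes=30)))  # True
```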
Best Practices

- Idempotency: same input → same output, safe to retry (see the sketch after this list)
- Backfill Support: ability to reprocess historical data
- Schema Evolution: handle changing data formats gracefully
- Data Quality: validation, anomaly detection, alerts
- Lineage Tracking: know where data came from
- Monitoring: latency, freshness, error rates
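Idempotency is the practice most worth internalizing, so here is a minimal sketch: write each run's output to a path derived from its logical date, so a retry overwrites the same partition instead of appending duplicates. The bucket path and helper name are illustrative (writing to s3:// also assumes s3fs is installed):

```python
# Idempotent write: same input + same logical date -> same output file,
# so a retried task cannot create duplicate rows.
import pandas as pd

def write_partition(df: pd.DataFrame, logical_date: str,
                    base: str = "s3://bets/odds") -> str:  # illustrative bucket
    """Write one partition per run; a retry overwrites rather than appends."""
    path = f"{base}/dt={logical_date}/odds.parquet"
    df.to_parquet(path, index=False)
    return path
```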
Python / Airflow Example

```python
# Airflow DAG for odds ingestion: fetch -> transform -> validate every 5 minutes.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

default_args = {
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
}

def ingest_odds():
    """Fetch odds from the API and store them in the raw layer."""
    import requests
    response = requests.get("https://api.oddsapi.com/v4/sports/...", timeout=10)
    response.raise_for_status()
    # Store the payload to S3/GCS (raw layer)

def transform_odds():
    """Clean and standardize odds data."""
    # e.g. dbt run --select odds_staging

def validate_data():
    """Run data quality checks."""
    # Check for nulls, outliers, staleness

with DAG('odds_pipeline',
         start_date=datetime(2024, 1, 1),   # any fixed date in the past
         schedule_interval='*/5 * * * *',   # every 5 min
         catchup=False,                     # don't backfill missed intervals on deploy
         default_args=default_args) as dag:
    ingest = PythonOperator(task_id='ingest', python_callable=ingest_odds)
    transform = PythonOperator(task_id='transform', python_callable=transform_odds)
    validate = PythonOperator(task_id='validate', python_callable=validate_data)

    ingest >> transform >> validate
```

Key Takeaways
- Pipelines: Extract → Transform → Load
- Real-time for odds, batch for stats
- Idempotency enables safe retries
- Use Airflow/Prefect for orchestration
- dbt for transformations in the warehouse
- Monitor freshness and data quality