Building a Horse Racing Prediction Model with Python and Renavon Data

Horse racing prediction is one of those problems that sounds impossible until you start looking at the data. The market favourite wins about 31% of the time, and with 360,000+ race entries in Renavon's hkjc_race_results dataset there's enough data to test whether a model can find edges the market misses.

This isn't a guide to building a profitable betting system. It's a practical walkthrough of the process: pulling data, engineering features, training a model, and evaluating its predictions. Think of it as a starting point for your own experiments.

Getting the data

Connect to your Renavon data via DuckDB and pull what we need:

import duckdb
import pandas as pd

conn = duckdb.connect("md:?motherduck_token=YOUR_TOKEN")

# Pull recent seasons with key features
df = conn.sql("""
    SELECT
        race_date,
        race_number,
        horse_name,
        horse_number,
        finishing_position,
        is_winner,
        odds,
        draw,
        weight,
        declared_horse_weight,
        horse_age,
        jockey,
        trainer,
        venue,
        distance,
        surface_type,
        race_class,
        field_size,
        entry_rating,
        career_win_rate,
        career_total_starts
    FROM renavon_hkjc_race_results.main.hkjc_race_results
    WHERE race_date >= '2022-09-01'
      AND finishing_position > 0
      AND odds > 0
    ORDER BY race_date, race_number, finishing_position
""").fetchdf()

print(f"Loaded {len(df):,} entries across {df['race_date'].nunique():,} race days")
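Before engineering features, it's worth sanity-checking the headline claim that the favourite wins roughly 31% of the time. A minimal sketch of the logic, shown on a hypothetical two-race frame (on the real data, run the same three lines against `df`):

```python
import pandas as pd

# Toy frame standing in for df; the favourite is the lowest-odds entry per race
demo = pd.DataFrame({
    'race_date':   ['2024-01-01'] * 6,
    'race_number': [1, 1, 1, 2, 2, 2],
    'odds':        [2.5, 4.0, 10.0, 3.0, 1.8, 6.5],
    'is_winner':   [1, 0, 0, 0, 0, 1],
})

# One row per race: the entry with the minimum odds
fav = demo.loc[demo.groupby(['race_date', 'race_number'])['odds'].idxmin()]
print(f"Favourite win rate: {fav['is_winner'].mean():.1%} over {len(fav)} races")
# → Favourite win rate: 50.0% over 2 races
```

On the full dataset the same logic should land near the 31% figure quoted above; if it doesn't, check the query filters before going further.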

Feature engineering

Raw data needs to be transformed into features a model can learn from. Here are the features that tend to matter in Hong Kong racing:

import numpy as np

# Binary target
df['won'] = df['is_winner'].astype(int)

# Odds-implied probability (market's view)
df['implied_prob'] = 1 / df['odds']

# Draw advantage: lower draws are generally better in HK
# Normalise by field size
df['draw_pct'] = df['draw'] / df['field_size']

# Weight carried relative to field
df['weight_diff'] = df.groupby(['race_date', 'race_number'])['weight'].transform(
    lambda x: x - x.mean()
)

# Horse weight (condition indicator)
df['horse_weight_diff'] = df.groupby(['race_date', 'race_number'])['declared_horse_weight'].transform(
    lambda x: x - x.mean()
)

# Experience
df['log_starts'] = np.log1p(df['career_total_starts'].fillna(0))

# Rating (handicap mark)
df['rating_diff'] = df.groupby(['race_date', 'race_number'])['entry_rating'].transform(
    lambda x: x - x.mean()
)

# Encode categoricals
df['is_happy_valley'] = (df['venue'] == 'Happy Valley').astype(int)
df['is_turf'] = (df['surface_type'] == 'Turf').astype(int)

# Distance buckets
df['is_sprint'] = (df['distance'] <= 1200).astype(int)
df['is_mile'] = ((df['distance'] > 1200) & (df['distance'] <= 1600)).astype(int)
df['is_middle'] = (df['distance'] > 1600).astype(int)
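One subtlety with implied_prob above: raw 1/odds values within a race sum to more than 1 because of the overround (the track take), so they aren't true probabilities. A quick check on a hypothetical three-horse race, plus a normalised version you may prefer as a feature:

```python
import pandas as pd

# Hypothetical three-horse race
demo = pd.DataFrame({'race_number': [1, 1, 1], 'odds': [1.8, 2.5, 5.0]})
demo['implied_prob'] = 1 / demo['odds']

# Raw implied probabilities overshoot 1 by the overround
overround = demo.groupby('race_number')['implied_prob'].sum()
print(overround.iloc[0])  # ≈ 1.156, i.e. ~15.6% take on this toy book

# Normalising within the race gives probabilities that sum to exactly 1
demo['market_prob'] = demo.groupby('race_number')['implied_prob'].transform(
    lambda p: p / p.sum()
)
```

Whether to feed the model raw or normalised implied probabilities is a judgment call; the normalised version removes race-to-race variation in the take.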

Train-test split

In racing prediction, you must split by time, not randomly. A random split leaks future information into training: the model would effectively see the outcomes of races run after the ones it's asked to predict.

# Use the most recent season as test data
train = df[df['race_date'] < '2025-09-01']
test = df[df['race_date'] >= '2025-09-01']

features = [
    'implied_prob', 'draw_pct', 'weight_diff', 'horse_weight_diff',
    'log_starts', 'career_win_rate', 'rating_diff',
    'is_happy_valley', 'is_turf', 'is_sprint', 'is_mile', 'is_middle',
    'horse_age', 'field_size'
]

X_train = train[features].fillna(0)
y_train = train['won']
X_test = test[features].fillna(0)
y_test = test['won']

print(f"Train: {len(X_train):,} entries, {y_train.sum():,} winners")
print(f"Test:  {len(X_test):,} entries, {y_test.sum():,} winners")

Training a model

Start simple. Logistic regression is interpretable and gives you a baseline. If it can't beat the market, fancier models probably can't either.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score

model = LogisticRegression(max_iter=1000, C=0.1)
model.fit(X_train, y_train)

# Predict probabilities
train_probs = model.predict_proba(X_train)[:, 1]
test_probs = model.predict_proba(X_test)[:, 1]

print(f"Train AUC: {roc_auc_score(y_train, train_probs):.4f}")
print(f"Test AUC:  {roc_auc_score(y_test, test_probs):.4f}")
print(f"Train log loss: {log_loss(y_train, train_probs):.4f}")
print(f"Test log loss:  {log_loss(y_test, test_probs):.4f}")
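AUC only measures ranking; since we'll compare probabilities against odds below, calibration matters too. A sketch of a calibration table, run here on synthetic data that is perfectly calibrated by construction (on the real split, substitute `test_probs` and `y_test` for the synthetic columns):

```python
import numpy as np
import pandas as pd

# Synthetic outcomes drawn at exactly the predicted rate, so the table
# should show predicted ≈ actual in every bin
rng = np.random.default_rng(0)
probs = rng.uniform(0.02, 0.5, 5000)
outcomes = rng.binomial(1, probs)

calib = pd.DataFrame({'prob': probs, 'won': outcomes})
calib['bin'] = pd.cut(calib['prob'], bins=[0, 0.05, 0.1, 0.2, 0.3, 0.5])
print(calib.groupby('bin', observed=True)
           .agg(predicted=('prob', 'mean'), actual=('won', 'mean'), n=('won', 'size')))
```

If a real model's predicted column runs consistently above or below the actual column, its probabilities need recalibrating before any value comparison against the market is meaningful.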

What the features tell us

After training, check which features the model weights most:

feature_importance = pd.DataFrame({
    'feature': features,
    'coefficient': model.coef_[0]
}).sort_values('coefficient', key=abs, ascending=False)

print(feature_importance.to_string(index=False))

You'll likely find that implied_prob (the odds-implied probability) is by far the strongest predictor. This makes sense — the betting market already aggregates the opinions of thousands of punters. The question is whether the other features can add anything the market misses.
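One way to quantify that is to score the market on its own: normalise implied probabilities within each race (removing the overround) and compute the same log loss. If the model can't beat this number on the test set, the extra features add nothing. A sketch on a hypothetical two-race frame:

```python
import pandas as pd
from sklearn.metrics import log_loss

# Toy data; on the real split, apply the same transform to `test`
# and compare the result against the model's test log loss
demo = pd.DataFrame({
    'race_number': [1, 1, 1, 2, 2, 2],
    'odds': [1.8, 2.5, 5.0, 2.0, 3.0, 4.0],
    'won':  [1, 0, 0, 0, 1, 0],
})
demo['implied_prob'] = 1 / demo['odds']

# Normalise within each race so probabilities sum to 1
demo['market_prob'] = demo.groupby('race_number')['implied_prob'].transform(
    lambda p: p / p.sum()
)
print(f"Market-only log loss: {log_loss(demo['won'], demo['market_prob']):.4f}")
```

The market baseline is the number to beat; a model that merely matches it has learned nothing the odds didn't already know.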

Evaluating against the market

The real test isn't AUC — it's whether the model identifies value that the market doesn't. Compare your model's probability estimates against the implied probabilities from the odds:

# For each race, rank horses by model probability
test_with_preds = test.copy()
test_with_preds['model_prob'] = test_probs

# Find cases where model disagrees with market
test_with_preds['value'] = test_with_preds['model_prob'] - test_with_preds['implied_prob']

# Top picks: model thinks horse is undervalued
top_picks = test_with_preds[test_with_preds['value'] > 0.05]
print(f"Model's value picks: {len(top_picks)} entries")
print(f"Win rate of picks: {top_picks['won'].mean():.1%}")
print(f"Avg odds of picks: {top_picks['odds'].mean():.1f}")

# Simulate flat betting on value picks
if len(top_picks) > 0:
    pnl = (top_picks['won'] * (top_picks['odds'] - 1) - (1 - top_picks['won'])).sum()
    roi = pnl / len(top_picks) * 100
    print(f"Flat bet P&L: {pnl:.1f} units")
    print(f"ROI: {roi:.1f}%")
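A positive ROI on a few hundred picks can easily be noise. A quick bootstrap gives a rough interval around it — sketched here on hypothetical per-bet outcomes; on the real picks, build `pnl_per_bet` from top_picks using the same formula as above:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical flat-bet outcomes: lose 1 unit 85% of the time, win 5 units 15%
pnl_per_bet = rng.choice([-1.0, 5.0], size=300, p=[0.85, 0.15])

# Resample with replacement and look at the spread of mean P&L per bet
boot = [rng.choice(pnl_per_bet, size=len(pnl_per_bet)).mean() for _ in range(2000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"ROI 95% interval: {lo * 100:.1f}% to {hi * 100:.1f}%")
```

If the interval comfortably spans zero — and at this sample size it usually will — treat the headline ROI as unproven rather than as an edge.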

Where to go from here

This baseline model is deliberately simple. Directions to explore:

  1. More features: Jockey/trainer form over the last N races, track condition preferences, barrier performance by distance, days since last run
  2. Better models: Gradient boosting (XGBoost, LightGBM) can capture non-linear interactions between features
  3. Race-level modelling: Instead of predicting each horse independently, model the race as a competition (conditional logit, softmax)
  4. Odds data: Renavon's hkjc_odds_combinations dataset has minute-by-minute odds snapshots. Early odds vs final odds contain information about market movement and late money.
  5. Sectional times: The hkjc_race_results dataset includes sectional times — actual pace data that reveals how a horse ran, not just where it finished.
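For direction 3, the core idea fits in a few lines: score each horse, then apply a softmax over the scores within a race so the win probabilities sum to 1, reflecting that the horses compete for a single win. The scores below are hypothetical; in a conditional-logit model they would be linear functions of the features above:

```python
import numpy as np

def race_win_probs(scores: np.ndarray) -> np.ndarray:
    """Softmax over per-horse scores within one race."""
    z = scores - scores.max()   # subtract the max to stabilise the exponentials
    expz = np.exp(z)
    return expz / expz.sum()

# Hypothetical scores for a 4-horse race
probs = race_win_probs(np.array([1.2, 0.4, 0.1, -0.5]))
print(probs, probs.sum())  # probabilities sum to 1; higher score, higher probability
```

Unlike the per-horse logistic model, this formulation can't assign high win probability to every runner in the same race, which tends to improve calibration.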

The data is there. The question is what patterns you can find in it.
