Building a Horse Racing Prediction Model with Python and Renavon Data

Horse racing prediction is one of those problems that sounds impossible until you start looking at the data. The market favourite wins about 31% of the time, and with 360,000+ race entries in Renavon's hkjc_race_results dataset there's enough data to test whether a model can find edges the market misses.

This isn't a guide to building a profitable betting system. It's a practical walkthrough of the process: pulling data, engineering features, training a model, and evaluating its predictions. Think of it as a starting point for your own experiments.

Getting the data

Connect to your Renavon data via DuckDB and pull what we need:

import duckdb
import pandas as pd

conn = duckdb.connect("md:?motherduck_token=YOUR_TOKEN")

# Pull recent seasons with key features
df = conn.sql("""
    SELECT
        race_date,
        race_number,
        horse_name,
        horse_number,
        finishing_position,
        is_winner,
        odds,
        draw,
        weight,
        declared_horse_weight,
        horse_age,
        jockey,
        trainer,
        venue,
        distance,
        surface_type,
        race_class,
        field_size,
        entry_rating,
        career_win_rate,
        career_total_starts
    FROM renavon_hkjc_race_results.main.hkjc_race_results
    WHERE race_date >= '2022-09-01'
      AND finishing_position > 0
      AND odds > 0
    ORDER BY race_date, race_number, finishing_position
""").fetchdf()

print(f"Loaded {len(df):,} entries across {df['race_date'].nunique():,} race days")
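Before engineering features, it's worth sanity-checking the headline claim that the favourite wins roughly 31% of the time. A minimal sketch of the logic, shown on a hypothetical two-race frame (on the real data, run the same three lines against `df`):

```python
import pandas as pd

# Toy frame standing in for df; the favourite is the lowest-odds entry per race
demo = pd.DataFrame({
    'race_date':   ['2024-01-01'] * 6,
    'race_number': [1, 1, 1, 2, 2, 2],
    'odds':        [2.5, 4.0, 10.0, 3.0, 1.8, 6.5],
    'is_winner':   [1, 0, 0, 0, 0, 1],
})

# One row per race: the entry with the minimum odds
fav = demo.loc[demo.groupby(['race_date', 'race_number'])['odds'].idxmin()]
print(f"Favourite win rate: {fav['is_winner'].mean():.1%} over {len(fav)} races")
# → Favourite win rate: 50.0% over 2 races
```

On the full dataset the same logic should land near the 31% figure quoted above; if it doesn't, check the query filters before going further.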

Feature engineering

Raw data needs to be transformed into features a model can learn from. Here are the features that tend to matter in Hong Kong racing:

import numpy as np

# Binary target
df['won'] = df['is_winner'].astype(int)

# Odds-implied probability (market's view)
df['implied_prob'] = 1 / df['odds']

# Draw advantage: lower draws are generally better in HK
# Normalise by field size
df['draw_pct'] = df['draw'] / df['field_size']

# Weight carried relative to field
df['weight_diff'] = df.groupby(['race_date', 'race_number'])['weight'].transform(
    lambda x: x - x.mean()
)

# Horse weight (condition indicator)
df['horse_weight_diff'] = df.groupby(['race_date', 'race_number'])['declared_horse_weight'].transform(
    lambda x: x - x.mean()
)

# Experience
df['log_starts'] = np.log1p(df['career_total_starts'].fillna(0))

# Rating (handicap mark)
df['rating_diff'] = df.groupby(['race_date', 'race_number'])['entry_rating'].transform(
    lambda x: x - x.mean()
)

# Encode categoricals
df['is_happy_valley'] = (df['venue'] == 'Happy Valley').astype(int)
df['is_turf'] = (df['surface_type'] == 'Turf').astype(int)

# Distance buckets
df['is_sprint'] = (df['distance'] <= 1200).astype(int)
df['is_mile'] = ((df['distance'] > 1200) & (df['distance'] <= 1600)).astype(int)
df['is_middle'] = (df['distance'] > 1600).astype(int)
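One subtlety with implied_prob above: raw 1/odds values within a race sum to more than 1 because of the overround (the track take), so they aren't true probabilities. A quick check on a hypothetical three-horse race, plus a normalised version you may prefer as a feature:

```python
import pandas as pd

# Hypothetical three-horse race
demo = pd.DataFrame({'race_number': [1, 1, 1], 'odds': [1.8, 2.5, 5.0]})
demo['implied_prob'] = 1 / demo['odds']

# Raw implied probabilities overshoot 1 by the overround
overround = demo.groupby('race_number')['implied_prob'].sum()
print(overround.iloc[0])  # ≈ 1.156, i.e. ~15.6% take on this toy book

# Normalising within the race gives probabilities that sum to exactly 1
demo['market_prob'] = demo.groupby('race_number')['implied_prob'].transform(
    lambda p: p / p.sum()
)
```

Whether to feed the model raw or normalised implied probabilities is a judgment call; the normalised version removes race-to-race variation in the take.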

Train-test split

In racing prediction, you must split by time, not randomly. A random split leaks future information into training: the model would effectively see the outcomes of races run after the ones it's asked to predict.

# Use the most recent season as test data
train = df[df['race_date'] < '2025-09-01']
test = df[df['race_date'] >= '2025-09-01']

features = [
    'implied_prob', 'draw_pct', 'weight_diff', 'horse_weight_diff',
    'log_starts', 'career_win_rate', 'rating_diff',
    'is_happy_valley', 'is_turf', 'is_sprint', 'is_mile', 'is_middle',
    'horse_age', 'field_size'
]

X_train = train[features].fillna(0)
y_train = train['won']
X_test = test[features].fillna(0)
y_test = test['won']

print(f"Train: {len(X_train):,} entries, {y_train.sum():,} winners")
print(f"Test:  {len(X_test):,} entries, {y_test.sum():,} winners")

Training a model

Start simple. Logistic regression is interpretable and gives you a baseline. If it can't beat the market, fancier models probably can't either.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score

model = LogisticRegression(max_iter=1000, C=0.1)
model.fit(X_train, y_train)

# Predict probabilities
train_probs = model.predict_proba(X_train)[:, 1]
test_probs = model.predict_proba(X_test)[:, 1]

print(f"Train AUC: {roc_auc_score(y_train, train_probs):.4f}")
print(f"Test AUC:  {roc_auc_score(y_test, test_probs):.4f}")
print(f"Train log loss: {log_loss(y_train, train_probs):.4f}")
print(f"Test log loss:  {log_loss(y_test, test_probs):.4f}")
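AUC only measures ranking; since we'll compare probabilities against odds below, calibration matters too. A sketch of a calibration table, run here on synthetic data that is perfectly calibrated by construction (on the real split, substitute `test_probs` and `y_test` for the synthetic columns):

```python
import numpy as np
import pandas as pd

# Synthetic outcomes drawn at exactly the predicted rate, so the table
# should show predicted ≈ actual in every bin
rng = np.random.default_rng(0)
probs = rng.uniform(0.02, 0.5, 5000)
outcomes = rng.binomial(1, probs)

calib = pd.DataFrame({'prob': probs, 'won': outcomes})
calib['bin'] = pd.cut(calib['prob'], bins=[0, 0.05, 0.1, 0.2, 0.3, 0.5])
print(calib.groupby('bin', observed=True)
           .agg(predicted=('prob', 'mean'), actual=('won', 'mean'), n=('won', 'size')))
```

If a real model's predicted column runs consistently above or below the actual column, its probabilities need recalibrating before any value comparison against the market is meaningful.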

What the features tell us

After training, check which features the model weights most:

feature_importance = pd.DataFrame({
    'feature': features,
    'coefficient': model.coef_[0]
}).sort_values('coefficient', key=abs, ascending=False)

print(feature_importance.to_string(index=False))

You'll likely find that implied_prob (the odds-implied probability) is by far the strongest predictor. This makes sense — the betting market already aggregates the opinions of thousands of punters. The question is whether the other features can add anything the market misses.
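One way to quantify that is to score the market on its own: normalise implied probabilities within each race (removing the overround) and compute the same log loss. If the model can't beat this number on the test set, the extra features add nothing. A sketch on a hypothetical two-race frame:

```python
import pandas as pd
from sklearn.metrics import log_loss

# Toy data; on the real split, apply the same transform to `test`
# and compare the result against the model's test log loss
demo = pd.DataFrame({
    'race_number': [1, 1, 1, 2, 2, 2],
    'odds': [1.8, 2.5, 5.0, 2.0, 3.0, 4.0],
    'won':  [1, 0, 0, 0, 1, 0],
})
demo['implied_prob'] = 1 / demo['odds']

# Normalise within each race so probabilities sum to 1
demo['market_prob'] = demo.groupby('race_number')['implied_prob'].transform(
    lambda p: p / p.sum()
)
print(f"Market-only log loss: {log_loss(demo['won'], demo['market_prob']):.4f}")
```

The market baseline is the number to beat; a model that merely matches it has learned nothing the odds didn't already know.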

Evaluating against the market

The real test isn't AUC — it's whether the model identifies value that the market doesn't. Compare your model's probability estimates against the implied probabilities from the odds:

# For each race, rank horses by model probability
test_with_preds = test.copy()
test_with_preds['model_prob'] = test_probs

# Find cases where model disagrees with market
test_with_preds['value'] = test_with_preds['model_prob'] - test_with_preds['implied_prob']

# Top picks: model thinks horse is undervalued
top_picks = test_with_preds[test_with_preds['value'] > 0.05]
print(f"Model's value picks: {len(top_picks)} entries")
print(f"Win rate of picks: {top_picks['won'].mean():.1%}")
print(f"Avg odds of picks: {top_picks['odds'].mean():.1f}")

# Simulate flat betting on value picks
if len(top_picks) > 0:
    pnl = (top_picks['won'] * (top_picks['odds'] - 1) - (1 - top_picks['won'])).sum()
    roi = pnl / len(top_picks) * 100
    print(f"Flat bet P&L: {pnl:.1f} units")
    print(f"ROI: {roi:.1f}%")
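A positive ROI on a few hundred picks can easily be noise. A quick bootstrap gives a rough interval around it — sketched here on hypothetical per-bet outcomes; on the real picks, build `pnl_per_bet` from top_picks using the same formula as above:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical flat-bet outcomes: lose 1 unit 85% of the time, win 5 units 15%
pnl_per_bet = rng.choice([-1.0, 5.0], size=300, p=[0.85, 0.15])

# Resample with replacement and look at the spread of mean P&L per bet
boot = [rng.choice(pnl_per_bet, size=len(pnl_per_bet)).mean() for _ in range(2000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"ROI 95% interval: {lo * 100:.1f}% to {hi * 100:.1f}%")
```

If the interval comfortably spans zero — and at this sample size it usually will — treat the headline ROI as unproven rather than as an edge.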

Where to go from here

This baseline model is deliberately simple. Directions to explore:

  1. More features: Jockey/trainer form over the last N races, track condition preferences, barrier performance by distance, days since last run
  2. Better models: Gradient boosting (XGBoost, LightGBM) can capture non-linear interactions between features
  3. Race-level modelling: Instead of predicting each horse independently, model the race as a competition (conditional logit, softmax)
  4. Odds data: Renavon's hkjc_odds_combinations dataset has minute-by-minute odds snapshots. Early odds vs final odds contain information about market movement and late money.
  5. Sectional times: The hkjc_race_results dataset includes sectional times — actual pace data that reveals how a horse ran, not just where it finished.
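For direction 3, the core idea fits in a few lines: score each horse, then apply a softmax over the scores within a race so the win probabilities sum to 1, reflecting that the horses compete for a single win. The scores below are hypothetical; in a conditional-logit model they would be linear functions of the features above:

```python
import numpy as np

def race_win_probs(scores: np.ndarray) -> np.ndarray:
    """Softmax over per-horse scores within one race."""
    z = scores - scores.max()   # subtract the max to stabilise the exponentials
    expz = np.exp(z)
    return expz / expz.sum()

# Hypothetical scores for a 4-horse race
probs = race_win_probs(np.array([1.2, 0.4, 0.1, -0.5]))
print(probs, probs.sum())  # probabilities sum to 1; higher score, higher probability
```

Unlike the per-horse logistic model, this formulation can't assign high win probability to every runner in the same race, which tends to improve calibration.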

The data is there. The question is what patterns you can find in it.
