I Built a Cricket Statistics Brain for Claude (And You Can Too)

Mihir Wagle 5 min read
#cricket #duckdb #claude #mcp

If you know me, you know I'm not religious, but cricket is effectively my religion. So why did I wait so long to write a cricket post? Because I wanted to build something fun first! Thanks to Tom Peplow and Mim for inspiring me to do this.

I've always wanted to ask Claude cricket trivia the way I'd ask a friend who's memorized Wisden cover to cover. "How does Kohli bat against Hazlewood in ODIs?" "Who has Bumrah dismissed the most in T20s?" Not vague, hallucinated answers - real numbers from real matches.

So I built one.

cricket-mcp is an MCP server that ingests every ball bowled in 21,000+ cricket matches (courtesy of Cricsheet), loads them into a DuckDB database, and exposes 23 query tools. Ask a question in English, Claude picks the right tool, and you get actual stats back.

The Stack

  • Data: Cricsheet - ball-by-ball JSON for every international and major domestic match
  • Database: DuckDB - columnar, vectorized, handles GROUP BY batter, bowler over 11M rows without breaking a sweat. Ingests all the data in about three minutes on a Mac Mini.
  • Protocol: MCP - Anthropic's Model Context Protocol
  • Language: TypeScript, ~3,500 lines. Most of it SQL.

How It Works

Cricsheet's all_json.zip (~94 MB) contains every match they've processed - Tests, ODIs, T20Is, IPL, BBL, The Hundred, PSL, all of it. The ingest command downloads, extracts 21,270 JSON files, and flattens them into four DuckDB tables: matches, innings, deliveries (10.9M rows), and players (14,406). Three minutes, 600 MB on disk, sub-second queries.
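As a rough illustration of the flattening step, here is a minimal sketch that turns one parsed match file into one row per ball. The input types cover only the subset of Cricsheet's JSON layout used here (as I understand the format), and the row shape, column names, and flattenMatch itself are assumptions for illustration, not the project's actual schema or code.

```typescript
// Subset of the Cricsheet match JSON used in this sketch; other fields are omitted.
interface Delivery {
  batter: string;
  bowler: string;
  runs: { batter: number; extras: number; total: number };
  wickets?: { player_out: string; kind: string }[];
}
interface Over { over: number; deliveries: Delivery[] }
interface Innings { team: string; overs?: Over[] } // overs can be absent, e.g. a forfeited innings
interface Match { innings: Innings[] }

// One flattened row per ball. Column names are hypothetical.
interface DeliveryRow {
  matchId: string;
  inningsNumber: number; // 1-based
  overNumber: number;    // copied as-is from the file
  ballInOver: number;    // 1-based position in the over, counting wides and no-balls
  batter: string;
  bowler: string;
  runsBatter: number;
  runsTotal: number;
  playerOut: string | null;
  dismissalKind: string | null;
}

function flattenMatch(matchId: string, match: Match): DeliveryRow[] {
  const rows: DeliveryRow[] = [];
  match.innings.forEach((innings, i) => {
    for (const over of innings.overs ?? []) {
      over.deliveries.forEach((d, j) => {
        // Simplification: keep only the first wicket recorded on a ball.
        const wicket = d.wickets?.[0] ?? null;
        rows.push({
          matchId,
          inningsNumber: i + 1,
          overNumber: over.over,
          ballInOver: j + 1,
          batter: d.batter,
          bowler: d.bowler,
          runsBatter: d.runs.batter,
          runsTotal: d.runs.total,
          playerOut: wicket ? wicket.player_out : null,
          dismissalKind: wicket ? wicket.kind : null,
        });
      });
    }
  });
  return rows;
}
```

Rows in this shape can then be bulk-loaded into the deliveries table, with the matches, innings, and players tables built from the rest of each file.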

After the initial ingest, npm run update keeps things current. Cricsheet publishes recently_played_N_json.zip files (N = 2, 7, or 30 days). The update command downloads the delta ZIP, checks each match_id against what's already in the DB, and inserts only new ones. Seconds, not minutes. Full rebuild via ingest --force for when Cricsheet corrects historical data.
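The "insert only new ones" check boils down to a set difference. A minimal sketch, assuming the match_id is derived from each JSON file's name inside the ZIP (an assumption on my part) and that the IDs already in the database have been loaded into a Set; selectNewMatchIds is a hypothetical helper name.

```typescript
// Returns the match IDs from the delta ZIP that are not yet in the database.
// Assumption: each match lives in "<match_id>.json"; other files are skipped.
function selectNewMatchIds(existingIds: Set<string>, zipEntries: string[]): string[] {
  return zipEntries
    .filter((name) => name.endsWith(".json"))
    .map((name) => name.replace(/\.json$/, ""))
    .filter((id) => !existingIds.has(id));
}
```

Because the check is keyed on match_id, running the update twice in a row inserts nothing the second time.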

The MCP server exposes 23 tools over stdio. Each takes structured input, builds a parameterized SQL query, runs it, returns JSON. Claude sees tool descriptions and schemas, picks the right one for the question, calls it with the right parameters.
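To make the "structured input in, parameterized SQL, JSON out" flow concrete, here is a hedged sketch of what one tool handler could look like. The table and column names, the handler name, and the injected runQuery function are all hypothetical; the real server wires this into the MCP SDK and DuckDB, which this sketch deliberately leaves out.

```typescript
type Row = Record<string, unknown>;
// Stand-in for the database call; injected so the sketch runs without DuckDB.
type RunQuery = (sql: string, params: unknown[]) => Row[];

interface BatterVsBowlerInput {
  batter: string;
  bowler: string;
  format?: string; // e.g. "ODI"
}

function handleBatterVsBowler(input: BatterVsBowlerInput, runQuery: RunQuery): string {
  if (input.batter.trim() === "" || input.bowler.trim() === "") {
    throw new Error("batter and bowler are required");
  }
  // User-supplied values only ever travel as bound parameters, never as SQL text.
  const params: unknown[] = [input.batter, input.bowler];
  let sql =
    "SELECT COUNT(*) AS balls, SUM(d.runs_batter) AS runs " + // simplified: a real balls-faced count would exclude wides
    "FROM deliveries d JOIN matches m ON m.match_id = d.match_id " +
    "WHERE d.batter = ? AND d.bowler = ?";
  if (input.format !== undefined) {
    sql += " AND m.match_type = ?";
    params.push(input.format);
  }
  // The tool result is returned to the model as a JSON string.
  return JSON.stringify(runQuery(sql, params));
}
```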

I got the cricket math right, mostly, but please correct me if you find glitches.

Kohli vs Hazlewood in ODIs

Ask "How does Kohli bat against Hazlewood in ODIs?" and get_batter_vs_bowler returns: 106 balls faced, 67 runs, 5 dismissals. Strike rate 63.21 - well below Kohli's career ODI SR of ~93. Average 13.40. Mostly caught dismissals.
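The two derived numbers follow directly from the raw counts: strike rate is runs per 100 balls, and average is runs per dismissal. A small sketch of that arithmetic (the function names are mine, not the project's):

```typescript
// Runs per 100 balls, rounded to two decimals; null when no balls were faced.
function strikeRate(runs: number, balls: number): number | null {
  return balls === 0 ? null : Math.round((runs / balls) * 10000) / 100;
}

// Runs per dismissal, rounded to two decimals; null when never dismissed.
function battingAverage(runs: number, dismissals: number): number | null {
  return dismissals === 0 ? null : Math.round((runs / dismissals) * 100) / 100;
}

console.log(strikeRate(67, 106));   // 63.21
console.log(battingAverage(67, 5)); // 13.4
```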

HazleGod's metronomic length and away shape from the right-hander make Kohli play and nick. He's one of the few bowlers who consistently wins that battle, and the numbers make it concrete.

Kohli vs Bumrah in T20s

Different picture. RCB vs MI, year after year. Strike rate suppressed compared to Kohli's T20 norm, high dot ball percentage, but relatively few dismissals. Bumrah restricts rather than removes. Classic death-bowling supremacy vs elite batting defense - the data makes the distinction visible.

A Few Interesting Tools

There are 23 tools in total - the full list is on GitHub. A few worth calling out:

get_what_if - The fun one. "What would Kohli average without Hazlewood?" runs two parallel aggregations (full career vs career-minus-exclusion) and returns the delta. Hazlewood's 5 dismissals for 67 runs come out, Kohli's average ticks up, and you see exactly how much one bowler drags the number. Works with any combination of excluded opponents, bowlers, venues, or tournaments.
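The idea behind the two aggregations can be sketched in a few lines. This version works in memory on per-bowler summary rows rather than in SQL, and the types and function names are illustrative assumptions, not the tool's real interface.

```typescript
// One summary row per bowler faced; a stand-in for what the SQL would aggregate.
interface VsBowlerLine {
  bowler: string;
  runs: number;
  dismissals: number;
}

function averageOf(lines: VsBowlerLine[]): number | null {
  const runs = lines.reduce((sum, l) => sum + l.runs, 0);
  const outs = lines.reduce((sum, l) => sum + l.dismissals, 0);
  return outs === 0 ? null : runs / outs;
}

// Computes the full-career average, the average with some bowlers removed,
// and the difference between the two.
function whatIf(lines: VsBowlerLine[], excluded: Set<string>) {
  const actual = averageOf(lines);
  const hypothetical = averageOf(lines.filter((l) => !excluded.has(l.bowler)));
  const delta = actual !== null && hypothetical !== null ? hypothetical - actual : null;
  return { actual, hypothetical, delta };
}
```

With made-up rows of 20 runs for 2 dismissals against bowler A and 80 runs for 2 dismissals against bowler B, the career average is 25; excluding A gives 40, a delta of 15.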

get_situational_stats - Format-aware. "Chasing" means 4th innings in Tests, innings 2 in LOIs. "Pressure" means 3+ wickets down in the first 10 overs. You can also slice by batting position.
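The format-aware definitions are easy to state as predicates. A sketch, assuming 1-based innings and over numbers and a three-value format type; the names are hypothetical and the real tool expresses this in SQL.

```typescript
type Format = "Test" | "ODI" | "T20";

// "Chasing": 4th innings in Tests, 2nd innings in limited-overs formats.
function isChasing(format: Format, inningsNumber: number): boolean {
  return format === "Test" ? inningsNumber === 4 : inningsNumber === 2;
}

// "Pressure": 3 or more wickets down within the first 10 overs.
// overNumber is 1-based here (assumption), so over 10 still counts.
function isUnderPressure(wicketsDown: number, overNumber: number): boolean {
  return wicketsDown >= 3 && overNumber <= 10;
}
```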

get_phase_stats - Powerplay/middle/death splits for batting and bowling. The kind of thing that settles arguments about who's actually good at the death vs who just has a good reputation. The powerplay is limited to T20s since ODI powerplay rules have changed so much. Open to feedback on fixing this.
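For T20s, a phase split can be sketched as a simple over-to-phase mapping. The boundaries below (overs 1-6 powerplay, 7-15 middle, 16-20 death) are a commonly used convention and an assumption on my part; the post does not state the exact cut-offs the tool uses.

```typescript
type Phase = "powerplay" | "middle" | "death";

// Maps a 1-based T20 over number to a phase (assumed boundaries: 6 and 15).
function t20Phase(over: number): Phase {
  if (!Number.isInteger(over) || over < 1 || over > 20) {
    throw new RangeError(`over must be an integer from 1 to 20, got ${over}`);
  }
  if (over <= 6) return "powerplay";
  if (over <= 15) return "middle";
  return "death";
}

// Totals runs per phase for a list of balls.
function runsByPhase(balls: { over: number; runs: number }[]): Record<Phase, number> {
  const totals: Record<Phase, number> = { powerplay: 0, middle: 0, death: 0 };
  for (const ball of balls) {
    totals[t20Phase(ball.over)] += ball.runs;
  }
  return totals;
}
```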

get_emerging_players - Window function comparing recent season to career baseline, surfaces players whose form is diverging from their career numbers. Rising stocks.
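The comparison itself can be illustrated without SQL. The sketch below swaps the window function for plain in-memory grouping: it compares each player's most recent season average against their career average and keeps those above a threshold ratio. The 1.25 default, the types, and the function name are assumptions for illustration only.

```typescript
interface SeasonLine {
  player: string;
  season: string; // assumed to sort correctly as a string, e.g. "2023"
  runs: number;
  dismissals: number;
}

interface Divergence {
  player: string;
  careerAvg: number;
  recentAvg: number;
  ratio: number;
}

function findRisingPlayers(lines: SeasonLine[], minRatio = 1.25): Divergence[] {
  // Group season lines by player.
  const byPlayer = new Map<string, SeasonLine[]>();
  lines.forEach((line) => {
    const list = byPlayer.get(line.player) ?? [];
    list.push(line);
    byPlayer.set(line.player, list);
  });

  const rising: Divergence[] = [];
  byPlayer.forEach((seasons, player) => {
    const runs = seasons.reduce((sum, s) => sum + s.runs, 0);
    const outs = seasons.reduce((sum, s) => sum + s.dismissals, 0);
    const recent = seasons.reduce((a, b) => (b.season > a.season ? b : a));
    // Skip players whose averages are undefined or whose baseline is zero.
    if (outs === 0 || recent.dismissals === 0 || runs === 0) return;
    const careerAvg = runs / outs;
    const recentAvg = recent.runs / recent.dismissals;
    const ratio = recentAvg / careerAvg;
    if (ratio >= minRatio) rising.push({ player, careerAvg, recentAvg, ratio });
  });
  return rising;
}
```

Note that the career baseline here includes the recent season, which dampens the ratio slightly; a stricter version would compare against the career excluding it.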

Every tool shares the same filter set: format, gender, team, opposition, venue, city, season, tournament, date range. One MatchFilterSchema → SQL WHERE builder, composed everywhere.
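A shared filter-to-WHERE builder might look roughly like the sketch below. The MatchFilter fields are a subset of the filters listed above, and the column names (m.match_type, m.team1, and so on) are guesses at a plausible schema rather than the project's real one.

```typescript
interface MatchFilter {
  format?: string;
  gender?: string;
  team?: string;
  venue?: string;
  season?: string;
  dateFrom?: string; // YYYY-MM-DD
  dateTo?: string;   // YYYY-MM-DD
}

// Builds a WHERE clause with "?" placeholders plus the matching parameter list.
function buildWhere(filter: MatchFilter): { clause: string; params: string[] } {
  const conditions: string[] = [];
  const params: string[] = [];
  const add = (sql: string, value: string | undefined) => {
    if (value !== undefined) {
      conditions.push(sql);
      params.push(value);
    }
  };

  add("m.match_type = ?", filter.format);
  add("m.gender = ?", filter.gender);
  if (filter.team !== undefined) {
    // A team filter matches either side of the fixture.
    conditions.push("(m.team1 = ? OR m.team2 = ?)");
    params.push(filter.team, filter.team);
  }
  add("m.venue = ?", filter.venue);
  add("m.season = ?", filter.season);
  add("m.date >= ?", filter.dateFrom);
  add("m.date <= ?", filter.dateTo);

  const clause = conditions.length > 0 ? "WHERE " + conditions.join(" AND ") : "";
  return { clause, params };
}
```

Each tool can then append the returned clause to its own SELECT and pass params straight through, so adding a new filter is a one-line change in one place.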

Lessons

DuckDB over SQLite, easily. Columnar storage and vectorized execution make aggregation queries over 11M rows come back in milliseconds. SQLite's row-oriented storage is a much poorer fit for scan-heavy aggregations like these.

The Appender API is fast but strict. Demands exact type matching - schema says DATE, you pass a string, it throws. I switched dates to VARCHAR. String comparison works fine for YYYY-MM-DD and saved a lot of headaches.
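The reason plain strings are enough: zero-padded YYYY-MM-DD values sort lexicographically in the same order as the dates they represent. A tiny sketch of a range check built on that property (the helper name is hypothetical):

```typescript
const ISO_DATE = /^\d{4}-\d{2}-\d{2}$/;

// Inclusive range check using plain string comparison.
// Only valid because every value is zero-padded YYYY-MM-DD.
function inDateRange(date: string, from: string, to: string): boolean {
  for (const value of [date, from, to]) {
    if (!ISO_DATE.test(value)) {
      throw new Error(`expected YYYY-MM-DD, got "${value}"`);
    }
  }
  return date >= from && date <= to;
}
```

The trade-off is that the format check moves from the database into the code, so anything that is not strictly YYYY-MM-DD has to be rejected or normalized at ingest time.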

Tool descriptions matter more than schemas. With 23 tools, Claude needs to pick the right one. Each description includes example natural language queries ("Use for 'Best death bowlers in IPL'"). Claude pattern-matches against descriptions, not schemas. Time spent on descriptions pays off immediately.

Write evals. I have 55 of them. Cricket is so richly documented that it's trivial to expand this eval set.

Try It

Open source: github.com/mavaali/cricket-mcp

git clone https://github.com/mavaali/cricket-mcp.git
cd cricket-mcp && npm install
npm run ingest   # downloads Cricsheet data, builds DB (~3 min)
npm run update   # pull recent matches (daily driver)

Add to Claude Desktop config:

{
  "mcpServers": {
    "cricket": {
      "command": "npx",
      "args": ["tsx", "/path/to/cricket-mcp/src/index.ts", "serve"]
    }
  }
}

Right now this MCP server is limited by my imagination, so cricket fanatics are welcome to expand it. PRs are welcome!


All data sourced from Cricsheet. Built with Claude, DuckDB, and an unhealthy obsession with cricket.
