This dataset was collected for and supports the analysis in The Microstructure of Wealth Transfer in Prediction Markets.
It comes with a framework for analyzing Kalshi prediction market data, including tools for data collection, storage, and running analysis scripts that generate figures and statistics.
The dataset was acquired from Kalshi's public REST API, and spans from 16:09 ET 2021-06-30 to 17:00 ET 2025-11-25. All market and trade data during this period is included.
Setup
Requires Python 3.9+. Install dependencies with uv:
```bash
uv sync
```

Running Analyses
The data is stored as compressed chunks (data.zip.*). The analysis framework handles extraction and cleanup automatically.
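For readers curious what the reassembly step might involve, here is a minimal sketch. It assumes the `data.zip.*` parts are plain byte-wise splits of one zip archive and that their names sort in the correct order; neither is confirmed here, and the function name `reassemble_and_extract` is hypothetical rather than part of the framework.

```python
import shutil
import zipfile
from pathlib import Path


def reassemble_and_extract(chunk_dir: Path, out_dir: Path, prefix: str = "data.zip.") -> Path:
    """Concatenate split archive parts in name order, then unzip the result.

    Assumption: parts are byte-wise splits whose names sort correctly
    (e.g. data.zip.aa, data.zip.ab or data.zip.001, data.zip.002).
    """
    parts = sorted(p for p in chunk_dir.iterdir() if p.name.startswith(prefix))
    if not parts:
        raise FileNotFoundError(f"no files matching {prefix}* in {chunk_dir}")

    out_dir.mkdir(parents=True, exist_ok=True)
    combined = out_dir / "data.zip"

    # Stream each part into one combined archive without loading it all in memory.
    with combined.open("wb") as dst:
        for part in parts:
            with part.open("rb") as src:
                shutil.copyfileobj(src, dst)

    with zipfile.ZipFile(combined) as zf:
        zf.extractall(out_dir)

    combined.unlink()  # remove the temporary combined archive
    return out_dir
```

In normal use the `make` targets and `main.py` commands described in this README do this for you; the sketch is only meant to show the idea.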
Run all analyses
```bash
make analysis
```

This will:
- Reassemble and extract the data archive
- Run all scripts in research/analysis/ in parallel
- Clean up the extracted data when complete
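The "run all scripts in parallel" step can be pictured roughly as follows. This is a sketch under assumptions, not the framework's actual runner: the names `run_script` and `run_all`, the choice of a thread pool, and the rule of skipping files that start with an underscore are all illustrative.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_script(script: Path) -> tuple[str, int]:
    """Run one analysis script in a child interpreter; return (name, exit code)."""
    result = subprocess.run(
        [sys.executable, str(script)],
        capture_output=True,
        text=True,
    )
    return script.name, result.returncode


def run_all(analysis_dir: Path, max_workers: int = 4) -> dict[str, int]:
    """Run every top-level *.py file in analysis_dir concurrently."""
    scripts = sorted(
        p for p in analysis_dir.glob("*.py") if not p.name.startswith("_")
    )
    # Threads suffice here: each worker only waits on its subprocess.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_script, scripts))
```

A non-zero exit code in the returned mapping marks a script that failed, which makes it easy to report failures after the whole batch has finished.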
Run a single analysis
```bash
make analyze <script_name>
```

For example:

```bash
make analyze mispricing_by_price
make analyze total_volume_by_price.py  # .py extension is optional
```

Manual commands
You can also run the CLI directly:
```bash
uv run main.py setup                         # Extract data
uv run main.py analysis                      # Run all analyses
uv run main.py analysis mispricing_by_price  # Run single analysis
uv run main.py teardown                      # Clean up data
```

Data Schemas
Data is stored as Parquet files. When extracted, the directory structure is:
```text
data/
  markets/
    markets_0_10000.parquet
    markets_10000_20000.parquet
    ...
  trades/
    <TICKER>_trades.parquet
    ...
```

Markets Schema
Each row represents a prediction market contract.
| Column | Type | Description |
|---|---|---|
| ticker | string | Unique market identifier (e.g., PRES-2024-DJT) |
| event_ticker | string | Parent event identifier, used for categorization |
| market_type | string | Market type (typically binary) |
| title | string | Human-readable market title |
| yes_sub_title | string | Label for the "Yes" outcome |
| no_sub_title | string | Label for the "No" outcome |
| status | string | Market status: open, closed, finalized |
| yes_bid | int (nullable) | Best bid price for Yes contracts (cents, 1-99) |
| yes_ask | int (nullable) | Best ask price for Yes contracts (cents, 1-99) |
| no_bid | int (nullable) | Best bid price for No contracts (cents, 1-99) |
| no_ask | int (nullable) | Best ask price for No contracts (cents, 1-99) |
| last_price | int (nullable) | Last traded price (cents, 1-99) |
| volume | int | Total contracts traded |
| volume_24h | int | Contracts traded in last 24 hours |
| open_interest | int | Outstanding contracts |
| result | string | Market outcome: yes, no, or empty if unresolved |
| created_time | datetime | When the market was created |
| open_time | datetime (nullable) | When trading opened |
| close_time | datetime (nullable) | When trading closed |
| _fetched_at | datetime | When this record was fetched |
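Because the quote columns are nullable, any arithmetic on them has to handle missing values. A minimal sketch of None-safe helpers for the Yes side follows; the function names `yes_spread` and `yes_midpoint` are hypothetical, and the sketch assumes a missing quote arrives as Python `None`.

```python
from typing import Optional


def yes_spread(yes_bid: Optional[int], yes_ask: Optional[int]) -> Optional[int]:
    """Bid-ask spread in cents, or None when either quote is missing."""
    if yes_bid is None or yes_ask is None:
        return None
    return yes_ask - yes_bid


def yes_midpoint(yes_bid: Optional[int], yes_ask: Optional[int]) -> Optional[float]:
    """Midpoint of the Yes quotes in cents, or None when either is missing."""
    if yes_bid is None or yes_ask is None:
        return None
    return (yes_bid + yes_ask) / 2


# A market quoted 40 bid / 44 ask has a 4-cent spread and a 42-cent midpoint.
print(yes_spread(40, 44), yes_midpoint(40, 44))
# A market with no resting bid yields None for both.
print(yes_spread(None, 44), yes_midpoint(None, 44))
```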
Trades Schema
Each row represents a single trade execution.
| Column | Type | Description |
|---|---|---|
| trade_id | string | Unique trade identifier |
| ticker | string | Market ticker this trade belongs to |
| count | int | Number of contracts traded |
| yes_price | int | Yes contract price (cents, 1-99) |
| no_price | int | No contract price (cents, 1-99), always 100 - yes_price |
| taker_side | string | Which side the taker bought: yes or no |
| created_time | datetime | When the trade occurred |
| _fetched_at | datetime | When this record was fetched |
Note on prices: Prices are in cents. A yes_price of 65 means the Yes contract costs $0.65 and pays $1.00 if the outcome is "Yes" (implied probability: 65%). The no_price is always 100 - yes_price.
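The price arithmetic can be made concrete with a small worked example. The sketch below ignores exchange fees (an assumption) and uses hypothetical helper names; it computes what the taker paid per contract and the taker's total profit once the market's result is known.

```python
def taker_cost_cents(yes_price: int, taker_side: str) -> int:
    """Price the taker paid per contract, in cents."""
    if taker_side == "yes":
        return yes_price
    if taker_side == "no":
        return 100 - yes_price  # identical to the no_price column
    raise ValueError(f"unexpected taker_side: {taker_side!r}")


def taker_pnl_cents(yes_price: int, count: int, taker_side: str, result: str) -> int:
    """Total taker profit in cents for a resolved market (fees ignored)."""
    cost = taker_cost_cents(yes_price, taker_side)
    payout = 100 if result == taker_side else 0  # a winning contract pays $1.00
    return (payout - cost) * count


# Taker buys 10 Yes contracts at 65 cents.
print(taker_pnl_cents(65, 10, "yes", "yes"))  # 350: wins 35 cents per contract
print(taker_pnl_cents(65, 10, "yes", "no"))   # -650: loses the 65 cents paid
```

Ignoring fees, the maker on the other side of the trade holds the opposite position, so the maker's profit is the negative of the taker's.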
Writing Analysis Scripts
Analysis scripts live in research/analysis/ and output to research/fig/.
Basic template
```python
#!/usr/bin/env python3
"""Brief description of what this analysis does."""

from pathlib import Path

import duckdb
import matplotlib.pyplot as plt


def main():
    # Standard path setup
    base_dir = Path(__file__).parent.parent.parent
    trades_dir = base_dir / "data" / "trades"
    markets_dir = base_dir / "data" / "markets"
    fig_dir = base_dir / "research" / "fig"
    fig_dir.mkdir(parents=True, exist_ok=True)

    # Connect to DuckDB (in-memory)
    con = duckdb.connect()

    # Query parquet files directly with glob patterns
    df = con.execute(
        f"""
        SELECT
            yes_price,
            count,
            taker_side
        FROM '{trades_dir}/*.parquet'
        WHERE yes_price BETWEEN 1 AND 99
        LIMIT 1000
        """
    ).df()

    # Save data output
    df.to_csv(fig_dir / "my_analysis.csv", index=False)

    # Create visualization
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.bar(df["yes_price"], df["count"])
    ax.set_xlabel("Price (cents)")
    ax.set_ylabel("Count")
    ax.set_title("My Analysis")

    plt.tight_layout()
    fig.savefig(fig_dir / "my_analysis.png", dpi=300, bbox_inches="tight")
    fig.savefig(fig_dir / "my_analysis.pdf", bbox_inches="tight")
    plt.close(fig)

    print(f"Outputs saved to {fig_dir}")


if __name__ == "__main__":
    main()
```

Common query patterns
Join trades with market outcomes:
```sql
WITH resolved_markets AS (
    SELECT ticker, result
    FROM '{markets_dir}/*.parquet'
    WHERE status = 'finalized'
      AND result IN ('yes', 'no')
)
SELECT
    t.yes_price,
    t.count,
    t.taker_side,
    m.result,
    CASE WHEN t.taker_side = m.result THEN 1 ELSE 0 END AS taker_won
FROM '{trades_dir}/*.parquet' t
INNER JOIN resolved_markets m ON t.ticker = m.ticker
```

Analyze both taker and maker positions:
```sql
WITH all_positions AS (
    -- Taker positions
    SELECT
        CASE WHEN taker_side = 'yes' THEN yes_price ELSE no_price END AS price,
        count,
        'taker' AS role
    FROM '{trades_dir}/*.parquet'

    UNION ALL

    -- Maker positions (counterparty)
    SELECT
        CASE WHEN taker_side = 'yes' THEN no_price ELSE yes_price END AS price,
        count,
        'maker' AS role
    FROM '{trades_dir}/*.parquet'
)
SELECT price, role, SUM(count) AS total_contracts
FROM all_positions
GROUP BY price, role
ORDER BY price
```

Extract category from event_ticker:
```sql
SELECT
    CASE
        WHEN event_ticker IS NULL OR event_ticker = '' THEN 'independent'
        ELSE regexp_extract(event_ticker, '^([A-Z0-9]+)', 1)
    END AS category,
    COUNT(*) AS market_count
FROM '{markets_dir}/*.parquet'
GROUP BY category
```

Using the categories utility
For grouping markets into high-level categories (Sports, Politics, Crypto, etc.):
```python
from research.analysis.util.categories import get_group, get_hierarchy, GROUP_COLORS

# Get high-level group
group = get_group("NFLGAME")  # Returns "Sports"

# Get full hierarchy (group, category, subcategory)
hierarchy = get_hierarchy("NFLGAME")  # Returns ("Sports", "NFL", "Games")

# Use predefined colors for consistent visualizations
color = GROUP_COLORS["Sports"]  # Returns "#1f77b4"
```

Output conventions
- Save CSV/JSON for raw data: fig_dir / "analysis_name.csv"
- Save PNG at 300 DPI for presentations: fig_dir / "analysis_name.png"
- Save PDF for papers: fig_dir / "analysis_name.pdf"
- Print a completion message: print(f"Outputs saved to {fig_dir}")
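The raw-data part of these conventions can be wrapped in a small helper. The sketch below is illustrative only: `save_raw_outputs` is a hypothetical name, it writes both CSV and JSON for demonstration (a script may need just one), and it uses the standard library instead of pandas so that it runs on its own.

```python
import csv
import json
from pathlib import Path


def save_raw_outputs(rows: list[dict], fig_dir: Path, name: str) -> tuple[Path, Path]:
    """Write rows to <name>.csv and <name>.json inside fig_dir."""
    fig_dir.mkdir(parents=True, exist_ok=True)
    csv_path = fig_dir / f"{name}.csv"
    json_path = fig_dir / f"{name}.json"

    # Column order follows the first row; an empty input gives an empty CSV header.
    fieldnames = list(rows[0].keys()) if rows else []
    with csv_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

    json_path.write_text(json.dumps(rows, indent=2))

    print(f"Outputs saved to {fig_dir}")
    return csv_path, json_path
```

Inside a script that already has a DataFrame, `df.to_csv(fig_dir / "analysis_name.csv", index=False)` as in the basic template achieves the same result for the CSV file.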
Dependencies available
Scripts have access to these libraries (see pyproject.toml):
- duckdb - SQL queries on Parquet files
- pandas - DataFrames
- matplotlib - Plotting
- scipy - Statistical functions
- brokenaxes - Plots with broken axes
- squarify - Treemap visualizations