January 2026

Rust Project Specifications

Technical specifications and requirements for Rust projects

Rust Project Specifications — Detailed Reference

Companion to: The Rust Learning Program (1-Year Guide)

Ashwin Hebbar | March 2026


This document contains the full specifications, expectations, concepts covered, and success criteria for every project in the Rust Learning Program. Use this as your reference while building.


Table of Contents

  1. Phase 1 — csvq: Command-Line CSV Query Tool
  2. Phase 2A — schemaguard: Streaming Data Contract Validator
  3. Phase 2B — logr: Terminal-Native Activity Tracker
  4. Phase 3A — HTTP/1.1 Server From Scratch
  5. Phase 3B — ticknorm: Real-Time Market Data Feed Normalizer ⚠️ COI risk — see notice
  6. Phase 3B (Replacement) — airpost: Real-Time Air Quality Aggregator
  7. Phase 4A — riskbook: Real-Time Portfolio Risk Engine
  8. Phase 4B — xlsxfmt: High-Fidelity Bidirectional Excel Library
  9. Phase 5 — docforge: Document Conversion Library (Capstone, Option A)
  10. Phase 5 — formulaengine: Spreadsheet Formula Computation Engine (Capstone, Option B)
  11. Phase 5 — ruleforge: Embeddable Business Rules Engine (Capstone, Option C)

Phase 1 — csvq

Command-Line CSV Query Tool

Timeline: Weeks 1–6 (primarily weekend work in weeks 4–6)


The Problem It Solves

Data engineers and analysts regularly need to do quick, ad-hoc analysis on CSV files: filter rows, select columns, compute aggregates. The current options are all unsatisfying:

  • Python/Pandas: Requires a full Python environment, slow startup (2–5 seconds even for small files), heavy memory usage
  • awk/sed: Fragile, no header awareness, painful syntax for anything beyond trivial operations
  • DuckDB CLI: Excellent but is a full database engine — overkill for “show me the top 10 rows where age > 25”
  • xsv: The original Rust CSV tool by BurntSushi, but it’s been unmaintained since 2018
  • qsv: A maintained fork of xsv, but complex, with an interface that is not beginner-friendly

csvq fills the space of “fast, single-binary, zero-dependency CLI for CSV analysis” — think ripgrep for tabular data.


Functional Requirements

Core CLI Interface:

csvq <file> [options]

Options:
  --select <columns>        Comma-separated column names to display
  --filter <expression>     Filter rows (e.g., "age > 25", "name == 'Alice'")
  --sort <column> [asc|desc]  Sort by column
  --limit <n>               Show first n results
  --group-by <column>       Group rows by column
  --agg <expressions>       Aggregations: count(*), sum(col), avg(col), min(col), max(col)
  --stats                   Show per-column statistics (min, max, mean, std, null count)
  --head <n>                Show first n rows (default 10)
  --tail <n>                Show last n rows
  --count                   Count total rows
  --output <format>         Output format: table (default), csv, json

Example Usage:

# Basic filtering and selection
csvq employees.csv --filter "department == 'Engineering'" --select "name,salary" --sort "salary desc" --limit 10

# Aggregation
csvq sales.csv --group-by "region" --agg "sum(revenue),avg(deal_size),count(*)"

# Quick statistics
csvq dataset.csv --stats

# Pipe-friendly (requires the stdin stretch goal)
cat large_file.csv | csvq --filter "status == 'active'" --count

Technical Requirements

Must implement yourself (before reaching for crates):

  • CSV parsing: handle quoted fields, escaped commas, newlines within quotes, different delimiters
  • Type inference: detect integer, float, string, date columns from data
  • Expression parsing: build a simple parser for filter expressions (comparison operators, AND/OR)
  • Sorting: implement at least one sorting algorithm yourself before using .sort()
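
As a starting point for the CSV parsing requirement, here is a minimal sketch of a single-record parser. The function name parse_record and its error type are assumptions made for illustration, not part of the spec. It handles quoted fields, doubled quotes, and a configurable delimiter; a full parser would also need to read records that span several input lines.

```rust
/// Splits one CSV record into fields, honouring double-quoted fields.
/// Inside quotes, `""` is an escaped quote, and the delimiter or a newline
/// is treated as ordinary data. (Hypothetical helper, not a spec'd API.)
fn parse_record(record: &str, delimiter: char) -> Result<Vec<String>, String> {
    let mut fields = Vec::new();
    let mut current = String::new();
    let mut in_quotes = false;
    let mut chars = record.chars().peekable();

    while let Some(c) = chars.next() {
        if in_quotes {
            if c == '"' {
                if chars.peek() == Some(&'"') {
                    // Escaped quote ("") inside a quoted field.
                    current.push('"');
                    chars.next();
                } else {
                    // Closing quote.
                    in_quotes = false;
                }
            } else {
                current.push(c);
            }
        } else if c == '"' && current.is_empty() {
            // Opening quote at the start of a field.
            in_quotes = true;
        } else if c == delimiter {
            fields.push(std::mem::take(&mut current));
        } else {
            current.push(c);
        }
    }

    if in_quotes {
        return Err("unterminated quoted field".to_string());
    }
    fields.push(current);
    Ok(fields)
}
```

Returning a Result rather than panicking keeps the "clear error messages" goal reachable: the caller can attach the line number before reporting.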

Allowed crates:

  • clap — CLI argument parsing
  • csv — you may use this AFTER you’ve written your own parser and understand the edge cases
  • serde — for JSON output format

Not allowed:

  • polars, datafusion, or any dataframe library (defeats the purpose)

Expected Concepts Covered

Concept | How It Appears
--- | ---
Ownership & borrowing | Passing CSV rows through filter → sort → select pipeline without unnecessary clones
String vs &str | Column names as owned strings, cell values as borrowed references during processing
Enums | Value::Int(i64), Value::Float(f64), Value::Text(String) for type-inferred cells
Pattern matching | Matching on Value variants for comparison operations
Result<T, E> | CSV parsing errors, file I/O errors, invalid filter expressions
Structs | Row, Column, Filter, AggregateResult
Iterators | .filter(), .map(), .fold() chains for the query pipeline
Vec<T> and HashMap | Row storage, group-by accumulation
CLI design | clap derive macros, subcommands, argument validation
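
The enum and pattern-matching rows can be made concrete with a small sketch. The Null variant and the function names infer_value and compare are assumptions added for illustration; date detection is left out.

```rust
use std::cmp::Ordering;

/// A type-inferred cell. `Null` is an assumed extra variant for empty cells.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Float(f64),
    Text(String),
    Null,
}

/// Tries the narrowest type first: integer, then float, then text.
/// Caveat: Rust's f64 parser also accepts "inf" and "NaN", so a real
/// implementation may want to reject those before treating a cell as Float.
fn infer_value(cell: &str) -> Value {
    let trimmed = cell.trim();
    if trimmed.is_empty() {
        Value::Null
    } else if let Ok(i) = trimmed.parse::<i64>() {
        Value::Int(i)
    } else if let Ok(f) = trimmed.parse::<f64>() {
        Value::Float(f)
    } else {
        Value::Text(trimmed.to_string())
    }
}

/// Compares two cells for filter expressions such as `age > 25`.
/// Mixed Int/Float pairs are compared numerically; other mixed pairs
/// (and anything involving Null) are not comparable.
fn compare(a: &Value, b: &Value) -> Option<Ordering> {
    match (a, b) {
        (Value::Int(x), Value::Int(y)) => Some(x.cmp(y)),
        (Value::Float(x), Value::Float(y)) => x.partial_cmp(y),
        (Value::Int(x), Value::Float(y)) => (*x as f64).partial_cmp(y),
        (Value::Float(x), Value::Int(y)) => x.partial_cmp(&(*y as f64)),
        (Value::Text(x), Value::Text(y)) => Some(x.cmp(y)),
        _ => None,
    }
}
```

Returning Option<Ordering> lets the filter decide what an incomparable pair means (skip the row, or report an error) instead of hiding that choice inside the comparison.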

Definition of Done

  • Can parse a 1M-row CSV file and filter it in under 2 seconds
  • Handles CSV edge cases and malformed input gracefully (quoted fields, embedded commas, missing values)
  • --stats produces correct min/max/mean/std for numeric columns
  • --group-by with --agg produces correct grouped aggregations
  • Error messages are clear and helpful (not just “Error: something went wrong”)
  • Has at least 10 unit tests covering edge cases
  • README.md with usage examples
  • Published as a binary you can install with cargo install

Stretch Goals

  • Support reading from stdin (piped input)
  • Support TSV and custom delimiters (--delimiter '\t')
  • Column type override (--types "name:string,age:int")
  • Basic --join support (join two CSV files on a key column)


Phase 2A — schemaguard

Streaming Data Contract Validator

Timeline: Weeks 11–13 (primary project work on weekends)


The Problem It Solves

Data pipelines break silently. An upstream schema changes (a field is renamed, a type changes from integer to string, a required field becomes nullable) and the downstream pipeline produces garbage without crashing. This is one of the most common sources of data quality issues in production systems.

Current solutions and their gaps:

  • Great Expectations (Python): Powerful but slow at scale; validating 1M rows can take minutes. Configuration is verbose.
  • Pandera (Python): Lighter than Great Expectations but still Python-speed. Tied to Pandas DataFrames.
  • Kafka Schema Registry: Enterprise-grade but tied to a running Kafka deployment. Overkill for simple validation.
  • JSON Schema validators: Exist in every language but validate individual documents, not streaming data at high throughput.
  • dbt tests: SQL-based, post-load only. Cannot validate data in-flight.

The gap: There is no fast, standalone, language-agnostic CLI tool that validates streaming data (NDJSON, CSV) against a schema contract at high throughput and produces actionable error reports. schemaguard fills this gap.


Functional Requirements

Core CLI Interface:

schemaguard validate --schema contract.json --input data.ndjson [--report errors.csv]
schemaguard validate --schema contract.json --input data.csv --format csv
schemaguard diff schema_v1.json schema_v2.json
schemaguard generate --input sample.ndjson --output inferred_schema.json

Schema Contract Format (JSON):

{
  "name": "user_events",
  "version": "1.2.0",
  "fields": [
    {
      "name": "user_id",
      "type": "string",
      "required": true,
      "pattern": "^USR-[0-9]{6}$"
    },
    {
      "name": "event_type",
      "type": "string",
      "required": true,
      "allowed_values": ["click", "view", "purchase", "signup"]
    },
    {
      "name": "amount",
      "type": "float",
      "required": false,
      "min": 0.0,
      "max": 100000.0
    },
    {
      "name": "timestamp",
      "type": "datetime",
      "required": true,
      "format": "ISO8601"
    }
  ],
  "constraints": [
    {
      "type": "conditional_required",
      "condition": "event_type == 'purchase'",
      "required_fields": ["amount"]
    }
  ]
}

Validation Behavior:

  • Read input line-by-line (streaming — never load entire file into memory)
  • For each record: check all field types, required fields, value ranges, regex patterns, conditional constraints
  • Emit violations to stderr or a structured report file
  • Exit code 0 = all valid, exit code 1 = violations found (CI/CD friendly)
  • Print summary at end: Validated 1,000,000 records in 2.3s. 47 violations found.
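
The streaming behavior above can be sketched with the standard library alone. The names validate_stream, Summary, and exit_code are assumptions for illustration, and check stands in for the real per-record schema validation.

```rust
use std::io::{BufRead, Write};

/// Totals reported at the end of a run.
#[derive(Debug, Default, PartialEq)]
struct Summary {
    records: u64,
    violations: u64,
}

/// Validates `reader` one line at a time, so memory use does not grow with
/// input size. `check` is a hypothetical per-record validator returning one
/// message per violation. Violations are written to `report`, tagged with
/// the 1-based line number.
fn validate_stream<R, W, F>(reader: R, mut report: W, check: F) -> std::io::Result<Summary>
where
    R: BufRead,
    W: Write,
    F: Fn(&str) -> Vec<String>,
{
    let mut summary = Summary::default();
    for (index, line) in reader.lines().enumerate() {
        let line = line?;
        if line.trim().is_empty() {
            continue; // blank lines are skipped, not counted as records
        }
        summary.records += 1;
        for message in check(&line) {
            summary.violations += 1;
            writeln!(report, "line {}: {}", index + 1, message)?;
        }
    }
    Ok(summary)
}

/// Maps a summary to the CI-friendly exit code: 0 = all valid, 1 = violations.
fn exit_code(summary: &Summary) -> i32 {
    if summary.violations == 0 { 0 } else { 1 }
}
```

Because the reader is generic over BufRead, the same function works for a file wrapped in BufReader, for locked stdin, and for an in-memory byte slice in tests.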

Diff Mode:

  • Compare two schema versions
  • Report: added fields, removed fields, type changes, constraint changes
  • Classify each as “breaking” or “non-breaking”
  • Example output:
Breaking changes:
  - Field 'user_id' type changed: string → integer
  - Field 'email' removed (was required)

Non-breaking changes:
  - Field 'preferences' added (optional)
  - Field 'amount' max changed: 100000.0 → 500000.0
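
One possible shape for the classification logic is sketched below. The Field and Change types are assumptions, and so are two of the rules: the spec's example only shows a required field being removed and an optional field being added, so treating "optional field removed" as non-breaking and "required field added" as breaking is a guess you should revisit.

```rust
use std::collections::BTreeMap;

/// Minimal stand-in for a schema field (hypothetical; the real type
/// would also carry patterns, ranges, and allowed values).
#[derive(Debug, Clone, PartialEq)]
struct Field {
    field_type: String,
    required: bool,
}

#[derive(Debug, PartialEq)]
enum Change {
    Breaking(String),
    NonBreaking(String),
}

/// Compares two schema versions field by field. BTreeMap keeps the
/// output order deterministic (sorted by field name).
fn diff_fields(old: &BTreeMap<String, Field>, new: &BTreeMap<String, Field>) -> Vec<Change> {
    let mut changes = Vec::new();
    for (name, old_field) in old {
        match new.get(name) {
            None if old_field.required => {
                changes.push(Change::Breaking(format!("Field '{}' removed (was required)", name)));
            }
            // Assumption: dropping an optional field is non-breaking.
            None => {
                changes.push(Change::NonBreaking(format!("Field '{}' removed (was optional)", name)));
            }
            Some(new_field) if new_field.field_type != old_field.field_type => {
                changes.push(Change::Breaking(format!(
                    "Field '{}' type changed: {} → {}",
                    name, old_field.field_type, new_field.field_type
                )));
            }
            Some(_) => {}
        }
    }
    for (name, new_field) in new {
        if !old.contains_key(name) {
            if new_field.required {
                // Assumption: a new required field breaks existing producers.
                changes.push(Change::Breaking(format!("Field '{}' added (required)", name)));
            } else {
                changes.push(Change::NonBreaking(format!("Field '{}' added (optional)", name)));
            }
        }
    }
    changes
}
```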

Schema Inference:

  • Given a sample data file, infer a schema contract
  • Detect types, required/optional status, value ranges, common patterns
  • Output a draft schema that the user can refine

Technical Requirements

Performance target: Validate at least 100,000 NDJSON records per second on a single core.

Required crates:

  • serde + serde_json — JSON parsing and schema deserialization
  • csv — CSV format support
  • clap — CLI
  • regex — pattern validation
  • chrono — datetime validation
  • thiserror — structured error types

Architecture:

main.rs          → CLI entry point (clap)
schema/
  mod.rs         → Schema types (structs, enums for field types)
  parser.rs      → Schema file parser
  diff.rs        → Schema diff engine
validator/
  mod.rs         → Core validation engine (trait-based)
  json.rs        → NDJSON validator
  csv.rs         → CSV validator
report/
  mod.rs         → Error report generation
  formats.rs     → CSV, JSON, plain text report formatters
infer/
  mod.rs         → Schema inference from sample data

Expected Concepts Covered

Concept | How It Appears
--- | ---
Traits | trait Validator with implementations for JSON and CSV formats
Generics | fn validate<R: BufRead>(reader: R, schema: &Schema), generic over the input source
Enums (rich) | FieldType::String, FieldType::Integer, FieldType::Float, FieldType::DateTime, FieldType::Boolean with validation logic per variant
Lifetimes | Schema references passed into validation functions without cloning
Iterators (lazy) | Streaming line-by-line without loading the entire file, e.g. .lines().filter_map().for_each()
Error types | Custom ValidationError enum with thiserror, aggregated into a report
serde (advanced) | Deserializing the schema contract, handling optional fields, custom deserializers
Module organization | Multi-file crate structure with clean public API
Testing | Unit tests for each field type validation, integration tests with sample files

Definition of Done

  • Validates NDJSON at 100k+ records/second
  • Validates CSV files with header detection
  • Schema diff correctly identifies breaking vs non-breaking changes
  • Schema inference generates a usable draft from sample data
  • Exit codes are CI/CD compatible (0 = pass, 1 = fail)
  • Error report includes line numbers and specific violation details
  • 20+ unit tests, 5+ integration tests with test fixture files
  • README.md with clear examples for each command
  • Clean public library API (not just a CLI binary)

Real-World Value

This tool would be immediately useful in:

  • CI/CD pipelines (validate data contracts before deployment)
  • Kafka consumer validation layers
  • Data warehouse ingestion checks
  • ML feature store quality gates
  • Any team that has ever been bitten by a silent schema change


Phase 2B — logr

Terminal-Native Personal Activity Tracker

Timeline: Weeks 15–16 (two focused weekends + evenings)


The Problem It Solves

Your own Quantified Self Flask app requires a running server, a browser, and a database daemon. For a daily habit tracker, that is massive overhead. logr is the terminal-native, zero-dependency replacement you will actually use every day.


Functional Requirements

# Log an activity
logr log "study rust" --duration 90 --tags "phase2,lifetimes" --notes "finally understood lifetimes"

# View today's log
logr today

# View weekly summary
logr stats --week

# View streak for an activity
logr streak "study rust"

# List all tracked activities
logr activities

# Export data
logr export --format csv --range "2026-03-01..2026-03-31" > march.csv

# Edit last entry
logr edit --last

# Delete an entry by ID
logr delete <id>

Data Storage: Local JSON files in $XDG_DATA_HOME/logr/, which defaults to ~/.local/share/logr/ (XDG compliant). One file per month: 2026-03.json. No database, no server.
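
A minimal sketch of resolving the storage location follows. The function names data_dir and month_file are assumptions; the environment values are passed in as parameters so the logic can be tested without touching the real environment. The sketch assumes a Unix-like system.

```rust
use std::path::{Path, PathBuf};

/// Resolves the logr data directory following the XDG Base Directory rules:
/// use $XDG_DATA_HOME when it is set to an absolute path, otherwise fall
/// back to $HOME/.local/share. (Hypothetical helper.)
fn data_dir(xdg_data_home: Option<&str>, home: &str) -> PathBuf {
    let base = match xdg_data_home {
        // The XDG spec says relative paths must be ignored.
        Some(dir) if Path::new(dir).is_absolute() => PathBuf::from(dir),
        _ => PathBuf::from(home).join(".local").join("share"),
    };
    base.join("logr")
}

/// File name for the month containing an entry, e.g. "2026-03.json".
fn month_file(year: i32, month: u32) -> String {
    format!("{:04}-{:02}.json", year, month)
}
```

In the binary you would call it with something like data_dir(std::env::var("XDG_DATA_HOME").ok().as_deref(), &home), where home comes from the HOME variable.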

Entry format:

{
  "id": "a7f3b2c1",
  "activity": "study rust",
  "duration_minutes": 90,
  "tags": ["phase2", "lifetimes"],
  "notes": "finally understood lifetimes",
  "timestamp": "2026-03-15T21:30:00+05:30"
}

Expected Concepts Covered

Concept | How It Appears
--- | ---
File I/O | std::fs for reading/writing JSON data files
Serialization | serde_json for reading and writing entry data
Date/time handling | chrono for timestamps, date ranges, streak calculation
Structs with derives | #[derive(Serialize, Deserialize, Debug, Clone)] on entry types
Error handling | Graceful handling of missing files, corrupt data, invalid dates
CLI design | clap with subcommands (log, stats, streak, export)
Iterators | Filtering entries by date range, activity, tags
String formatting | Pretty-printing tables and statistics to terminal
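
The streak calculation is the one piece of logic here that is easy to get subtly wrong, so a small sketch may help. It assumes a streak means consecutive calendar days with at least one entry, counted backwards from today; the spec does not define this precisely. Days are plain integers (for example, days since the Unix epoch) to keep the example dependency-free; in the real tool they would come from chrono dates.

```rust
use std::collections::BTreeSet;

/// Length of the current streak: consecutive days with at least one entry,
/// counting backwards from `today`. Returns 0 if nothing was logged today.
/// (Hypothetical helper; day numbers stand in for chrono dates.)
fn current_streak(logged_days: &[i64], today: i64) -> u32 {
    // Deduplicate: several entries on one day count once.
    let days: BTreeSet<i64> = logged_days.iter().copied().collect();
    let mut streak = 0;
    let mut day = today;
    while days.contains(&day) {
        streak += 1;
        day -= 1;
    }
    streak
}
```

Whether a streak should survive until the end of the following day (so that checking in the morning does not show 0) is a design choice worth making explicitly.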

Definition of Done

  • All CRUD operations work (log, view, edit, delete)
  • Stats show correct totals, averages, and streaks
  • Data persists between sessions in XDG-compliant directory
  • CSV export works with configurable date range
  • Handles empty state gracefully (first run, no data yet)
  • 10+ unit tests
  • You are actually using it daily to track your Rust study time


Phase 3A — HTTP/1.1 Server From Scratch

No Frameworks, No Libraries — Raw TCP

Timeline: Weeks 21–23 (spread across 3 weekends — one per week)


Why This Project Is Non-Negotiable

When Flask, FastAPI, or Axum handle a request, many layers of abstraction sit between your handler function and the raw TCP bytes. This project strips them away. After building this, every web framework will feel transparent rather than magical.


Functional Requirements

Build a minimal HTTP/1.1 server whose only networking primitive is std::net::TcpListener (the full allow-list is under Technical Constraints). It must:

  1. Accept TCP connections on a configurable port
  2. Parse raw HTTP/1.1 request bytes into a structured type:
    use std::collections::HashMap;

    // Extend with further methods as needed.
    enum Method { Get, Post }

    struct Request {
        method: Method,      // GET, POST, etc.
        path: String,        // "/api/logs"
        headers: HashMap<String, String>,
        body: Option<String>,
    }
    
  3. Route requests to handler functions based on method + path
  4. Return proper HTTP responses with status line, headers, and body:
    HTTP/1.1 200 OK\r\n
    Content-Type: application/json\r\n
    Content-Length: 20\r\n
    \r\n
    {"message": "hello"}
    
  5. Handle concurrent connections using std::thread::spawn
  6. Serve your logr data as a JSON API:
    • GET /logs — list all entries
    • GET /logs?activity=rust — filter by activity
    • GET /stats/week — weekly summary
    • POST /logs — add a new entry (parse JSON body)
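
To make requirement 2 concrete, here is a minimal sketch of a request parser. It assumes the whole request is already in memory as a string; on a real socket you would read the headers first and then read exactly Content-Length bytes of body. The function name parse_request and the String error type are assumptions.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Method { Get, Post }

#[derive(Debug)]
struct Request {
    method: Method,
    path: String, // request target, including any query string
    headers: HashMap<String, String>,
    body: Option<String>,
}

/// Parses a complete HTTP/1.1 request held in memory. (Hypothetical helper.)
fn parse_request(raw: &str) -> Result<Request, String> {
    // Head and body are separated by the first empty line.
    let (head, body) = raw
        .split_once("\r\n\r\n")
        .ok_or("missing blank line after headers")?;
    let mut lines = head.split("\r\n");

    // Request line: METHOD SP request-target SP HTTP-version
    let request_line = lines.next().unwrap_or("");
    let mut parts = request_line.split(' ');
    let method = match parts.next() {
        Some("GET") => Method::Get,
        Some("POST") => Method::Post,
        other => return Err(format!("unsupported method: {:?}", other)),
    };
    let path = parts.next().ok_or("missing request target")?.to_string();
    match parts.next() {
        Some("HTTP/1.1") => {}
        other => return Err(format!("unsupported version: {:?}", other)),
    }

    // Header names are case-insensitive, so store them lowercased.
    let mut headers = HashMap::new();
    for line in lines {
        let (name, value) = line
            .split_once(':')
            .ok_or_else(|| format!("malformed header: {}", line))?;
        headers.insert(name.trim().to_ascii_lowercase(), value.trim().to_string());
    }

    let body = if body.is_empty() { None } else { Some(body.to_string()) };
    Ok(Request { method, path, headers, body })
}
```

Every Err returned here maps naturally to a 400 Bad Request response in the connection handler.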

Technical Constraints

Allowed:

  • std::net (TcpListener, TcpStream)
  • std::thread
  • std::sync (Arc, Mutex)
  • std::io (Read, Write, BufReader)
  • serde_json (for JSON serialization only — you parse HTTP yourself)

NOT allowed:

  • Any HTTP framework (axum, actix, warp, hyper, rocket)
  • Any HTTP parsing library
  • tokio or any async runtime

Expected Concepts Covered

Concept | How It Appears
--- | ---
TCP networking | Raw TcpListener::bind(), accepting connections, reading bytes
Byte parsing | Reading HTTP request as raw bytes, splitting on \r\n, parsing headers
Threading | thread::spawn for concurrent connection handling
Arc<Mutex<T>> | Shared state (logr data) across threads
String manipulation | Parsing HTTP method, path, headers from raw text
Trait implementation | impl Display for Response for formatting HTTP responses
Error handling | Malformed requests, connection drops, parse failures
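
The impl Display for Response row might look like the sketch below. The field names are assumptions. The important detail is that Content-Length is computed from the body's byte length rather than written by hand.

```rust
use std::fmt;

/// A minimal response type (hypothetical field layout).
struct Response {
    status: u16,
    reason: &'static str,
    content_type: &'static str,
    body: String,
}

impl fmt::Display for Response {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // String::len() is the length in bytes, which is what
        // Content-Length must report.
        write!(
            f,
            "HTTP/1.1 {} {}\r\nContent-Type: {}\r\nContent-Length: {}\r\n\r\n{}",
            self.status,
            self.reason,
            self.content_type,
            self.body.len(),
            self.body
        )
    }
}
```

With this in place, writing a response to the socket is stream.write_all(response.to_string().as_bytes()).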

Definition of Done

  • Server accepts concurrent HTTP connections on a port
  • Correctly parses GET and POST requests with headers and body
  • Returns properly formatted HTTP/1.1 responses
  • Serves logr data through at least 4 endpoints
  • Handles malformed requests with 400 Bad Request
  • Returns 404 for unknown paths
  • Does not crash on connection drop or malformed input
  • Can be tested with curl and a web browser


Phase 3B — ticknorm

⚠️ CONFLICT OF INTEREST NOTICE

This project poses a potential conflict of interest with your employment at LSEG. LSEG’s core business is market data normalization and distribution (via Refinitiv / LSEG Data & Analytics). Building an open-source market data feed normalizer — even a simplified one — while employed at LSEG could violate IP assignment clauses, non-compete clauses, or moonlighting policies in your employment contract. The optics of an LSEG data scientist publishing an “open-source market data normalizer” are also problematic regardless of legal technicalities.

This project is retained for reference only. Use airpost (below) instead. It teaches identical Rust concepts (Tokio async, multi-source TCP ingestion, channels, sqlx, axum, error isolation) in a domain with zero LSEG conflict.

Real-Time Market Data Feed Normalizer

Timeline: (Retained for reference only — see airpost below. Not on the active schedule.)


The Problem It Solves

Financial market data arrives from multiple exchanges in incompatible formats. Exchange A sends price as a scaled integer with a separate denominator. Exchange B sends it as a decimal string. Timestamps come in Unix milliseconds from one, ISO8601 from another, and epoch seconds from a third. Field names differ (last_price vs lastTrade vs lt). Normalizing this in real-time is critical infrastructure at every financial data company, and it is directly relevant to LSEG.

Current landscape:

  • Python solutions are too slow for tick-by-tick real-time normalization (GIL + overhead per record)
  • Java solutions (common in finance) work but are verbose and memory-heavy
  • Existing Rust financial tools (like hftbacktest, barter-rs) focus on backtesting, not feed normalization
  • Most firms build this as internal proprietary infrastructure. There is no good open-source version.

Functional Requirements

Architecture:

[Mock Exchange Feed A] ──TCP──┐
[Mock Exchange Feed B] ──TCP──┤──→ ticknorm ──→ [PostgreSQL]
[Mock Exchange Feed C] ──TCP──┘        │
                                       └──→ [JSON API via axum]

Mock Exchange Feeds (you build these too): Three simulated exchange feeds, each sending tick data in a different format:

Feed A (JSON over TCP):

{"sym": "AAPL", "px": 18750, "px_denom": 100, "qty": 200, "ts": 1711234567890, "side": "B"}

Feed B (CSV over TCP):

AAPL,187.50,200,2026-03-15T14:30:00.123Z,BUY

Feed C (JSON over TCP, different schema):

{"ticker": "AAPL", "lastTrade": "187.50", "volume": 200, "timestamp": "1711234567", "direction": "buy"}

Canonical Output Format:

struct NormalizedTick {
    symbol: String,          // "AAPL"
    exchange: Exchange,      // FeedA, FeedB, FeedC
    timestamp_ns: i64,       // nanoseconds since epoch, always UTC
    bid: Option<f64>,
    ask: Option<f64>,
    last_price: f64,         // always as f64 decimal
    volume: u64,
    side: Side,              // Buy, Sell, Unknown
}
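
For reference, the per-field conversions implied by the three sample feeds can be sketched with small pure functions. All function names are assumptions. Two readings of the samples are also assumptions: that Feed C's timestamp string is in epoch seconds, and that "S" and "SELL" are the sell-side counterparts of the "B" and "BUY" values shown.

```rust
#[derive(Debug, PartialEq)]
enum Side { Buy, Sell, Unknown }

/// Feed A: scaled integer price plus denominator, e.g. 18750 / 100.
fn price_from_scaled(px: i64, px_denom: i64) -> Option<f64> {
    if px_denom <= 0 {
        None
    } else {
        Some(px as f64 / px_denom as f64)
    }
}

/// Feed A: Unix milliseconds to nanoseconds, rejecting overflow.
fn millis_to_ns(ts_ms: i64) -> Option<i64> {
    ts_ms.checked_mul(1_000_000)
}

/// Feed C: epoch seconds sent as a string (assumed unit) to nanoseconds.
fn seconds_str_to_ns(ts: &str) -> Option<i64> {
    ts.trim().parse::<i64>().ok()?.checked_mul(1_000_000_000)
}

/// Maps each feed's side/direction spelling to the canonical enum.
/// "S" / "SELL" are assumed; the samples only show the buy side.
fn parse_side(raw: &str) -> Side {
    match raw.trim().to_ascii_uppercase().as_str() {
        "B" | "BUY" => Side::Buy,
        "S" | "SELL" => Side::Sell,
        _ => Side::Unknown,
    }
}
```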

Service Behavior:

  • Connect to all three feeds simultaneously using Tokio async tasks
  • Parse each feed’s proprietary format
  • Normalize to the canonical NormalizedTick struct
  • Write normalized ticks to PostgreSQL via sqlx (async, compile-time checked)
  • Expose a JSON API via axum:
    • GET /ticks/:symbol — last N ticks for a symbol
    • GET /stats/:symbol — VWAP, spread, volume over configurable window
    • GET /feeds/status — health check for each connected feed
  • Log throughput: print ticks/second every 10 seconds

Performance target: Process and normalize at least 10,000 ticks/second across all feeds combined.


Technical Requirements

Required crates:

  • tokio — async runtime (tasks, TCP, channels)
  • serde + serde_json — JSON parsing
  • sqlx — async PostgreSQL with compile-time query checking
  • axum — HTTP API (you’ve earned the right to use a framework now)
  • chrono — timestamp normalization
  • tracing — structured logging

Architecture (Rust modules):

main.rs              → Tokio runtime setup, task orchestration
feeds/
  mod.rs             → Feed trait definition
  feed_a.rs          → Feed A parser (JSON, basis points)
  feed_b.rs          → Feed B parser (CSV, ISO timestamps)
  feed_c.rs          → Feed C parser (JSON, different schema)
normalizer.rs        → Core normalization logic
storage.rs           → PostgreSQL writer (sqlx)
api/
  mod.rs             → Axum router setup
  handlers.rs        → API endpoint handlers
mock_exchange.rs     → Mock feed generators (separate binary)

Expected Concepts Covered

Concept | How It Appears
--- | ---
Tokio async tasks | One tokio::spawn per feed connection, shared state via channels
Channels (mpsc) | Feed tasks send NormalizedTick to a central processor via channel
Arc<T> for shared state | Shared connection pool, shared tick buffer for API queries
Traits | trait FeedParser with fn parse(&self, raw: &[u8]) -> Result<NormalizedTick>
Async I/O | TcpStream reading with tokio::io::AsyncReadExt
sqlx compile-time queries | sqlx::query!("INSERT INTO ticks ...") with queries checked at compile time
axum | Router, extractors, state management, JSON responses
Error handling (production) | Per-feed error isolation (one feed crashing doesn’t kill others)
Structured logging | tracing spans for each feed, throughput metrics
Enums with data | enum Exchange { FeedA, FeedB, FeedC }, enum Side { Buy, Sell, Unknown }

Definition of Done

  • Three mock exchange generators produce tick data at configurable rates
  • ticknorm connects to all three simultaneously and normalizes in real-time
  • Normalized data is written to PostgreSQL with correct types
  • JSON API returns correct data for all three endpoints
  • If one feed disconnects, the others continue processing
  • Achieves 10k+ ticks/second throughput
  • sqlx queries are compile-time verified (not runtime string SQL)
  • 15+ tests (unit tests for each parser, integration test for the pipeline)
  • README with architecture diagram and setup instructions

Real-World Value

This is a simplified version of infrastructure that runs in production at LSEG, Bloomberg, ICE, and every financial data vendor. Building it teaches you the core challenges of real-time financial data: schema normalization, timestamp alignment, feed resilience, and high-throughput database writes. Directly applicable to your current role.



Phase 3B (Replacement) — airpost

Real-Time Air Quality Aggregator

Timeline: Weeks 24–28 (primary weekend work + evening coding)

This project replaces ticknorm to avoid conflict of interest with LSEG. It teaches identical Rust concepts in the domain of public environmental data.


The Problem It Solves

Air quality data is fragmented across dozens of incompatible sources. Researchers, journalists, public health advocates, and concerned citizens cannot easily answer the question: “What is the air quality in my city right now, according to all available sensors?”

Each data source uses different formats, different pollutant measurements, different units, different scales, and different update frequencies:

  • OpenAQ (global): JSON REST API. Reports PM2.5, PM10, O3, NO2, SO2, CO in µg/m³. Timestamps in ISO 8601 UTC. Station IDs are alphanumeric strings.
  • India CPCB (Central Pollution Control Board): India’s government sensor network. Reports AQI (a composite index, not raw pollutant concentrations). Different station naming conventions. Data must be obtained via web scraping or XML feeds.
  • PurpleAir (citizen science): JSON API. Reports raw PM2.5 particle counts and applies correction factors. Uses sensor IDs (integers). Timestamps in Unix epoch seconds. Data quality varies wildly (sensors are installed by individuals, some are indoors, some are malfunctioning).
  • WAQI (World Air Quality Index): Another JSON API. Reports AQI values. Different station identifiers. Rate-limited.

A researcher in Bengaluru who wants to compare government CPCB readings with nearby PurpleAir citizen sensors and OpenAQ global data currently needs to write custom scripts for each source, handle unit conversions manually, and maintain their own storage.


Functional Requirements

Architecture:

[OpenAQ API] ──HTTP──┐
[CPCB Data]  ──HTTP──┤──→ airpost ──→ [PostgreSQL]
[PurpleAir]  ──HTTP──┤        │
[WAQI API]   ──HTTP──┘        └──→ [JSON API via axum]

Data Sources (you implement adapters for each):

Source A — OpenAQ (JSON REST):

{
  "location": "Silk Board Junction, Bengaluru",
  "parameter": "pm25",
  "value": 85.3,
  "unit": "µg/m³",
  "date": {"utc": "2026-03-15T14:30:00.000Z"}
}

Source B — CPCB India (different JSON structure):

{
  "station_name": "BTM Layout, Bangalore",
  "pollutant_id": "PM2.5",
  "pollutant_avg": "142",
  "last_update": "15-Mar-2026 20:00"
}

Source C — PurpleAir (different JSON, raw particle counts):

{
  "sensor_index": 131075,
  "name": "Koramangala Sensor",
  "pm2.5": 45.2,
  "pm2.5_cf_1": 52.1,
  "humidity": 65,
  "temperature": 82,
  "last_seen": 1710512400
}

Canonical Output Format:

struct AirReading {
    station_id: String,           // normalized unique ID
    station_name: String,         // human-readable name
    source: DataSource,           // OpenAQ, CPCB, PurpleAir, WAQI
    latitude: f64,
    longitude: f64,
    city: String,
    pollutant: Pollutant,         // PM25, PM10, O3, NO2, SO2, CO
    value_ugm3: f64,              // always in µg/m³ (converted if needed)
    aqi: Option<u32>,             // computed AQI if raw value available
    timestamp_utc: DateTime<Utc>, // always UTC
    quality_flag: QualityFlag,    // Good, Suspect, Calibrating, Unknown
}

Service Behavior:

  • Spawn one Tokio task per data source, each polling on its own schedule (OpenAQ every 5 min, PurpleAir every 2 min, CPCB every 15 min)
  • Each adapter parses its source’s proprietary format and converts to AirReading
  • Apply unit conversions: AQI → µg/m³ (using EPA breakpoint tables), raw PM counts → corrected µg/m³ (PurpleAir correction factor)
  • Quality flagging: mark PurpleAir sensors with humidity > 70% as “Suspect” (high humidity inflates particle counts)
  • Write normalized readings to PostgreSQL via sqlx
  • Expose JSON API via axum:
    • GET /stations?city=bengaluru — list all stations in a city
    • GET /readings/:station_id?hours=24 — last 24 hours for a station
    • GET /aqi/current?city=bengaluru — current AQI for all stations in a city
    • GET /compare?sources=openaq,purpleair&city=bengaluru — compare readings across sources for the same location
    • GET /sources/status — health check for each data source connection
  • Log throughput: print readings ingested/second every polling cycle
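
Two of the normalization steps above are plain arithmetic and can be sketched without any crates: computing an AQI from a concentration by linear interpolation within a breakpoint row, and the humidity-based quality flag. The type and function names are assumptions. The breakpoint rows follow the EPA PM2.5 table as published before its 2024 revision and are included for illustration only; load the current table in real code. The EPA method truncates the concentration to one decimal place before the lookup, and this sketch expects input that has already been truncated.

```rust
#[derive(Debug, PartialEq)]
enum QualityFlag { Good, Suspect }

/// One row of a breakpoint table: concentration range -> index range.
struct Breakpoint {
    c_lo: f64,
    c_hi: f64,
    i_lo: f64,
    i_hi: f64,
}

/// Illustrative PM2.5 rows in µg/m³ (pre-2024 EPA values, first four rows).
const PM25_BREAKPOINTS: [Breakpoint; 4] = [
    Breakpoint { c_lo: 0.0, c_hi: 12.0, i_lo: 0.0, i_hi: 50.0 },
    Breakpoint { c_lo: 12.1, c_hi: 35.4, i_lo: 51.0, i_hi: 100.0 },
    Breakpoint { c_lo: 35.5, c_hi: 55.4, i_lo: 101.0, i_hi: 150.0 },
    Breakpoint { c_lo: 55.5, c_hi: 150.4, i_lo: 151.0, i_hi: 200.0 },
];

/// AQI = (I_hi - I_lo) / (C_hi - C_lo) * (C - C_lo) + I_lo, rounded to the
/// nearest integer. Returns None when no row covers the concentration.
fn aqi_from_concentration(c: f64, table: &[Breakpoint]) -> Option<u32> {
    table
        .iter()
        .find(|b| c >= b.c_lo && c <= b.c_hi)
        .map(|b| {
            let aqi = (b.i_hi - b.i_lo) / (b.c_hi - b.c_lo) * (c - b.c_lo) + b.i_lo;
            aqi.round() as u32
        })
}

/// Flags a PurpleAir reading as Suspect when relative humidity exceeds 70%.
fn humidity_flag(humidity_pct: f64) -> QualityFlag {
    if humidity_pct > 70.0 { QualityFlag::Suspect } else { QualityFlag::Good }
}
```

Passing the table as a parameter keeps the interpolation independent of any one standard, which matters here because CPCB and EPA define their indices differently.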

Performance target: Handle 50+ concurrent source polls, normalize and store 1,000+ readings per polling cycle, API response time < 50ms.


Technical Requirements

Required crates:

  • tokio — async runtime (tasks, timers, HTTP client via reqwest)
  • reqwest — async HTTP client for API polling
  • serde + serde_json — JSON deserialization from different schemas
  • sqlx — async PostgreSQL with compile-time query checking
  • axum — HTTP API server
  • chrono — timestamp normalization and timezone conversion
  • tracing — structured logging

Architecture (Rust modules):

main.rs              → Tokio runtime setup, task orchestration
sources/
  mod.rs             → DataSource trait definition
  openaq.rs          → OpenAQ adapter (JSON, µg/m³)
  cpcb.rs            → CPCB India adapter (different JSON, AQI scale)
  purpleair.rs       → PurpleAir adapter (raw particle counts)
  waqi.rs            → WAQI adapter (AQI values)
normalizer.rs        → Unit conversion, AQI calculation, quality flagging
storage.rs           → PostgreSQL writer (sqlx)
api/
  mod.rs             → Axum router setup
  handlers.rs        → API endpoint handlers
  responses.rs       → JSON response types
config.rs            → Source URLs, polling intervals, API keys

Expected Concepts Covered

Concept | How It Appears
--- | ---
Tokio async tasks | One tokio::spawn per data source, each on its own polling interval
tokio::time::interval | Configurable polling schedules per source
Channels (mpsc) | Source tasks send AirReading to a central processor via channel
Arc<T> for shared state | Shared config, shared connection pool for API and storage
Traits | trait DataSource with async fn poll(&self) -> Result<Vec<AirReading>>
Async HTTP | reqwest::Client for concurrent API calls
sqlx compile-time queries | sqlx::query!("INSERT INTO readings ...") with queries checked at compile time
axum | Router, extractors, query parameters, JSON responses, shared state
Error handling (production) | Per-source error isolation (one API being down doesn’t kill others)
Structured logging | tracing spans for each source, poll timing metrics
Enums with data | enum Pollutant { PM25, PM10, O3, NO2, SO2, CO }, enum QualityFlag { Good, Suspect, ... }
Unit conversion | AQI breakpoint tables, PurpleAir correction factors (real math, not toy examples)

Definition of Done

  • Four data source adapters implemented (OpenAQ, CPCB, PurpleAir, WAQI)
  • All readings normalized to common schema with correct unit conversions
  • AQI calculation from raw µg/m³ values using EPA breakpoint tables
  • PurpleAir readings flagged when humidity > 70%
  • Normalized data written to PostgreSQL with correct types
  • JSON API returns correct data for all five endpoints
  • If one data source API is down, the others continue polling
  • sqlx queries are compile-time verified
  • 15+ tests (unit tests for each adapter + unit conversion, integration test for pipeline)
  • README with architecture diagram, setup instructions, and API examples

Real-World Value

Why this matters:

  • Bengaluru’s air quality has deteriorated significantly in recent years. Citizens and researchers need access to unified, real-time data from all available sources.
  • India’s CPCB network has limited coverage. PurpleAir citizen sensors fill gaps but report in incompatible units.
  • No open-source tool currently unifies these fragmented sources into a single, queryable, self-hostable service.
  • This could be deployed by environmental NGOs, journalism organizations investigating pollution, or citizen science communities.

Why it teaches the same concepts as ticknorm: The architecture is structurally identical — multiple concurrent data sources with incompatible formats, normalization to a canonical schema, async database writes, and a query API. The domain is different; the Rust learning is the same.



Phase 4A — riskbook

Real-Time Portfolio Risk Engine

Timeline: Weeks 31–35 (primary weekend project)


The Problem It Solves

Computing portfolio risk metrics — Value-at-Risk (VaR), Conditional VaR (CVaR), P&L attribution, and options Greeks — in real-time requires processing thousands of positions against continuously updating market data. Every bank, asset manager, and hedge fund needs this. The landscape:

  • Python: Too slow for real-time. Even with NumPy vectorization, per-call overhead and the GIL make tick-by-tick risk updates infeasible. Used only for end-of-day batch risk.
  • Java/C++: Common in production trading systems but heavy and proprietary.
  • Commercial solutions: Bloomberg PORT, MSCI RiskMetrics, Axioma — all extremely expensive.
  • Open-source: QuantLib (C++) is comprehensive but monstrous in complexity. There is no lean, focused, modern open-source risk engine, and the Rust ecosystem has few mature options in this space.

riskbook fills this gap: a fast, focused, well-documented Rust library for portfolio risk computation, with optional PyO3 bindings.


Functional Requirements

CLI Interface:

# One-shot risk computation
riskbook compute --portfolio positions.csv --prices market_data.csv --var-confidence 0.99

# Streaming mode: update risk as prices change
riskbook stream --portfolio positions.csv --price-feed stdin

# Options Greeks for a single position
riskbook greeks --type call --strike 185.0 --spot 187.50 --vol 0.25 --rate 0.05 --expiry 30d

Portfolio File Format (positions.csv):

symbol,type,quantity,entry_price,strike,expiry,option_type
AAPL,equity,1000,185.00,,,
MSFT,equity,500,420.00,,,
AAPL,option,10,5.50,190.00,2026-06-19,call
TSLA,option,5,12.00,250.00,2026-04-17,put

Output — Risk Report:

Portfolio Risk Report (2026-03-15 14:30:00 UTC)
================================================
Positions:     4
Total Notional: $587,500.00
Mark-to-Market: $591,250.00
Unrealized P&L: +$3,750.00 (+0.64%)

Value-at-Risk (99%, 1-day):
  Historical VaR:    -$8,420.00
  Parametric VaR:    -$7,890.00

Options Greeks (Portfolio):
  Delta:  1,342.5
  Gamma:     12.3
  Theta:   -145.2
  Vega:     890.4

Concentration:
  AAPL: 63.4% of notional
  MSFT: 35.6% of notional
  TSLA:  1.0% of notional

Top Risk Contributors:
  1. AAPL equity    (-$5,200 marginal VaR)
  2. AAPL call      (-$1,890 marginal VaR)
  3. MSFT equity    (-$1,120 marginal VaR)

Mathematical Implementations (YOU MUST IMPLEMENT THESE FROM THE FORMULAS — NO LIBRARIES)

1. Black-Scholes Option Pricing:

C = S·N(d1) - K·e^(-rT)·N(d2)
P = K·e^(-rT)·N(-d2) - S·N(-d1)

where:
  d1 = (ln(S/K) + (r + σ²/2)·T) / (σ·√T)
  d2 = d1 - σ·√T
  N(x) = cumulative normal distribution function

You must implement:

  • The Gaussian CDF N(x) yourself (use the Abramowitz & Stegun rational approximation, formula 26.2.17, and evaluate its polynomial in Horner form)
  • ln(), exp(), sqrt() on f64 from the standard library are allowed (they are floating-point primitives, not a statistics library)
  • The full Black-Scholes formula for calls and puts
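As a reference to check your own implementation against, here is a minimal sketch of the pricing formulas above. It assumes T is measured in years and that σ > 0 and T > 0. The function names (norm_pdf, norm_cdf, call_price, put_price) are illustrative and not part of any required riskbook API. The CDF uses the Abramowitz & Stegun 26.2.17 coefficients, whose absolute error is roughly 7.5e-8.

```rust
use std::f64::consts::PI;

/// Standard normal PDF n(x).
fn norm_pdf(x: f64) -> f64 {
    (-0.5 * x * x).exp() / (2.0 * PI).sqrt()
}

/// Standard normal CDF N(x) via Abramowitz & Stegun 26.2.17.
fn norm_cdf(x: f64) -> f64 {
    const P: f64 = 0.231_641_9;
    const B: [f64; 5] = [
        0.319_381_530,
        -0.356_563_782,
        1.781_477_937,
        -1.821_255_978,
        1.330_274_429,
    ];
    let t = 1.0 / (1.0 + P * x.abs());
    // Horner evaluation of b1*t + b2*t^2 + ... + b5*t^5.
    let poly = t * (B[0] + t * (B[1] + t * (B[2] + t * (B[3] + t * B[4]))));
    // The approximation is stated for x >= 0; use N(-x) = 1 - N(x) otherwise.
    let upper = 1.0 - norm_pdf(x) * poly;
    if x >= 0.0 { upper } else { 1.0 - upper }
}

/// d1 and d2 exactly as defined in the formulas above.
fn d1_d2(s: f64, k: f64, r: f64, sigma: f64, t: f64) -> (f64, f64) {
    let sqrt_t = t.sqrt();
    let d1 = ((s / k).ln() + (r + 0.5 * sigma * sigma) * t) / (sigma * sqrt_t);
    (d1, d1 - sigma * sqrt_t)
}

/// European call: C = S*N(d1) - K*e^(-rT)*N(d2).
pub fn call_price(s: f64, k: f64, r: f64, sigma: f64, t: f64) -> f64 {
    let (d1, d2) = d1_d2(s, k, r, sigma, t);
    s * norm_cdf(d1) - k * (-r * t).exp() * norm_cdf(d2)
}

/// European put: P = K*e^(-rT)*N(-d2) - S*N(-d1).
pub fn put_price(s: f64, k: f64, r: f64, sigma: f64, t: f64) -> f64 {
    let (d1, d2) = d1_d2(s, k, r, sigma, t);
    k * (-r * t).exp() * norm_cdf(-d2) - s * norm_cdf(-d1)
}
```

For S = 100, K = 100, r = 0.05, σ = 0.2, T = 1 the textbook values are about 10.4506 for the call and 5.5735 for the put, which makes a convenient first test case.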

2. Greeks (partial derivatives of Black-Scholes):

  • Delta: ∂C/∂S = N(d1) for calls, N(d1) - 1 for puts
  • Gamma: ∂²C/∂S² = n(d1) / (S·σ·√T) where n(x) is the standard normal PDF
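  • Theta: ∂C/∂t = -S·n(d1)·σ/(2·√T) - r·K·e^(-rT)·N(d2) for calls, -S·n(d1)·σ/(2·√T) + r·K·e^(-rT)·N(-d2) for puts (per year; divide by 365 for time decay per day)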
  • Vega: ∂C/∂σ = S·n(d1)·√T
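The Greeks listed above can be sketched in the same style. This is one possible shape, not a prescribed API: OptionKind, Greeks, and greeks() are hypothetical names, theta is converted to a per-day figure by dividing the annual value by 365 (an assumed day count), and the CDF is again the Abramowitz & Stegun 26.2.17 approximation.

```rust
use std::f64::consts::PI;

fn norm_pdf(x: f64) -> f64 {
    (-0.5 * x * x).exp() / (2.0 * PI).sqrt()
}

/// Abramowitz & Stegun 26.2.17 approximation of the standard normal CDF.
fn norm_cdf(x: f64) -> f64 {
    const P: f64 = 0.231_641_9;
    const B: [f64; 5] = [
        0.319_381_530,
        -0.356_563_782,
        1.781_477_937,
        -1.821_255_978,
        1.330_274_429,
    ];
    let t = 1.0 / (1.0 + P * x.abs());
    let poly = t * (B[0] + t * (B[1] + t * (B[2] + t * (B[3] + t * B[4]))));
    let upper = 1.0 - norm_pdf(x) * poly;
    if x >= 0.0 { upper } else { 1.0 - upper }
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum OptionKind {
    Call,
    Put,
}

#[derive(Debug, Clone, Copy)]
pub struct Greeks {
    pub delta: f64,
    pub gamma: f64,
    pub theta_per_day: f64,
    pub vega: f64,
}

/// Black-Scholes Greeks for one option; `t` is time to expiry in years.
pub fn greeks(kind: OptionKind, s: f64, k: f64, r: f64, sigma: f64, t: f64) -> Greeks {
    let sqrt_t = t.sqrt();
    let d1 = ((s / k).ln() + (r + 0.5 * sigma * sigma) * t) / (sigma * sqrt_t);
    let d2 = d1 - sigma * sqrt_t;
    let discounted_strike = k * (-r * t).exp();
    // Term shared by call and put theta: -S*n(d1)*sigma / (2*sqrt(T)).
    let decay = -s * norm_pdf(d1) * sigma / (2.0 * sqrt_t);
    let (delta, theta) = match kind {
        OptionKind::Call => (
            norm_cdf(d1),
            decay - r * discounted_strike * norm_cdf(d2),
        ),
        OptionKind::Put => (
            norm_cdf(d1) - 1.0,
            decay + r * discounted_strike * norm_cdf(-d2),
        ),
    };
    Greeks {
        delta,
        gamma: norm_pdf(d1) / (s * sigma * sqrt_t), // same for calls and puts
        theta_per_day: theta / 365.0,               // assumed 365-day convention
        vega: s * norm_pdf(d1) * sqrt_t,            // same for calls and puts
    }
}
```

A useful identity to test against: call delta minus put delta equals 1 for the same strike and expiry, while gamma and vega are identical for both.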

3. Historical VaR:

  • Given N days of historical returns, compute portfolio return for each day
  • Sort returns, take the (1-α) percentile as VaR
  • For 99% VaR with 252 days: the 2.52nd worst day (interpolate)
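The percentile step is where implementations tend to diverge, so here is a small sketch of one interpretation. It assumes "the 2.52nd worst day" means a 1-based rank of (1 − α)·N counted from the worst return, linearly interpolated between the neighbouring sorted returns. Other percentile conventions exist and give slightly different numbers. The function name historical_var is illustrative.

```rust
/// Historical VaR expressed as a return (typically negative).
/// `confidence` is the confidence level, for example 0.99.
pub fn historical_var(returns: &[f64], confidence: f64) -> Option<f64> {
    if returns.is_empty() || !(0.0..1.0).contains(&confidence) {
        return None;
    }
    // Sort ascending so the worst (most negative) return comes first.
    let mut sorted = returns.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));

    // 1-based fractional rank among the worst days, e.g. 0.01 * 252 = 2.52.
    let rank = (1.0 - confidence) * sorted.len() as f64;
    if rank <= 1.0 {
        return Some(sorted[0]);
    }
    let lower = rank.floor() as usize; // 1-based index of the lower neighbour
    let frac = rank - lower as f64;
    let lo = sorted[lower - 1];
    let hi = sorted[lower.min(sorted.len() - 1)];
    // Linear interpolation between the two neighbouring sorted returns.
    Some(lo + frac * (hi - lo))
}
```

Multiply the result by the portfolio's mark-to-market value to get the dollar figure shown in the risk report.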

4. Parametric VaR:

  • Assume returns are normally distributed
  • VaR = μ - z_α · σ (where z_0.99 ≈ 2.326)
  • Compute portfolio σ from individual position volatilities and correlations
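The aggregation in the last bullet is the standard quadratic form σ_p² = Σ_i Σ_j w_i·w_j·σ_i·σ_j·ρ_ij. A minimal sketch follows, assuming weights are fractions of portfolio value and the correlation matrix is passed as nested slices; portfolio_sigma and parametric_var are illustrative names.

```rust
/// Portfolio volatility from position weights, volatilities, and correlations.
/// Panics if the inputs do not describe the same number of positions.
pub fn portfolio_sigma(weights: &[f64], vols: &[f64], corr: &[Vec<f64>]) -> f64 {
    let n = weights.len();
    assert!(vols.len() == n && corr.len() == n && corr.iter().all(|row| row.len() == n));
    let mut variance = 0.0;
    for i in 0..n {
        for j in 0..n {
            variance += weights[i] * weights[j] * vols[i] * vols[j] * corr[i][j];
        }
    }
    // Guard against a tiny negative value caused by rounding.
    variance.max(0.0).sqrt()
}

/// Parametric VaR as a return: mu - z * sigma (z is about 2.326 at 99%).
pub fn parametric_var(mu: f64, sigma: f64, z: f64) -> f64 {
    mu - z * sigma
}
```

With two equally weighted positions at 20% volatility each, the portfolio volatility is about 14.14% when they are uncorrelated and exactly 20% when the correlation is 1, which is an easy sanity check for the double loop.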

Technical Requirements

Required crates:

  • clap — CLI
  • csv + serde — data parsing
  • chrono — date handling for option expiry
  • criterion — benchmarking
  • pyo3 (optional) — Python bindings

NOT allowed for math:

  • statrs, nalgebra, ndarray, or any statistics/math library for the core computations
  • You implement the math. That’s the point.

Performance target:

  • Compute full risk report for 1,000 positions in < 100ms
  • Update P&L for a single price change in < 1μs

Expected Concepts Covered

Concept | How It Appears
Floating-point precision | Black-Scholes requires careful handling of very small/large numbers, NaN checks
Mathematical implementation | Implementing CDF, PDF, option pricing from formulas
Structs with methods | Position, Portfolio, RiskReport with computation methods
Enums | PositionType::Equity, PositionType::Option { strike, expiry, kind }
Iterators | Portfolio-level aggregation: .iter().map(compute_greeks).fold(...)
Performance profiling | criterion benchmarks, cargo flamegraph for hotspot identification
unsafe (optional) | SIMD hints for vectorized portfolio computation if you reach stretch goals
PyO3 | Exposing Portfolio and compute_risk() to Python
Testing | Property-based tests: put-call parity must hold, Greeks must satisfy mathematical identities

Definition of Done

  • Black-Scholes pricing matches known test values to 4 decimal places
  • Greeks match known test values (use options pricing calculators as reference)
  • Put-call parity holds: C - P = S - K·e^(-rT) (within floating-point tolerance)
  • Historical VaR computation is correct against a hand-calculated example
  • Full risk report for 1,000 positions runs in < 100ms
  • CLI produces the formatted report shown above
  • Streaming mode updates on new price data from stdin
  • criterion benchmarks for all core computations
  • 20+ tests including mathematical identity checks
  • PyO3 binding works: from riskbook import Portfolio, compute_risk

Real-World Value

This is directly applicable to your LSEG role. A well-documented, tested, open-source Rust risk library with Python bindings would attract real users from quantitative finance. The PyO3 path means Python quants can use it without learning Rust. The SaaS path: a risk computation API endpoint.



Phase 4B — xlsxfmt

High-Fidelity Bidirectional Excel Processing Library

Timeline: Weeks 37–40 (four focused weekends + evenings)


The Problem It Solves

You lived this problem at LSEG. The Rust Excel ecosystem today:

Crate | Read | Write | Preserve Formatting | Status
calamine | Yes | No | No (data only) | Mature, maintained
rust_xlsxwriter | No | Yes | Yes (new files only) | Mature, maintained
edit-xlsx | Yes | Yes | Partial | Early stage, unstable API
xlsxwriter (C bindings) | No | Yes | Yes | C dependency, write only

The gap: No production-quality, pure-Rust library can read an existing XLSX file with full formatting fidelity, modify cell values, and write it back with formatting preserved. This is the core operation in document processing pipelines, report generation, and template-based systems.

xlsxfmt fills this gap.


Functional Requirements

Library API:

use xlsxfmt::Workbook;

// Read with full formatting
let wb = Workbook::open("report.xlsx")?;
let sheet = wb.sheet("Q1 Revenue")?;

// Inspect cell with formatting
let cell = sheet.cell("B5")?;
println!("Value: {}", cell.value());           // "1,234.56"
println!("Font: {} {}pt", cell.font_name(), cell.font_size()); // "Calibri 11pt"
println!("Bold: {}", cell.is_bold());          // true
println!("Fill: {}", cell.fill_color());       // "#4472C4"

// Modify value, preserve ALL formatting
sheet.set_value("B5", 1500.00)?;

// Save — formatting is preserved
wb.save("report_updated.xlsx")?;

CLI Tool:

xlsxfmt inspect report.xlsx                    # List sheets, dimensions, formatting summary
xlsxfmt cell report.xlsx "Sheet1!B5"           # Show cell value + all formatting attributes
xlsxfmt set report.xlsx "Sheet1!B5" "1500.00"  # Update cell, preserve formatting
xlsxfmt diff original.xlsx modified.xlsx       # Show cell-by-cell differences

Technical Deep Dive: XLSX Format

XLSX is a ZIP archive containing XML files (OOXML standard). Key files inside the ZIP:

[Content_Types].xml          — file type declarations
xl/workbook.xml              — sheet names and order
xl/worksheets/sheet1.xml     — cell values and references to shared strings/styles
xl/sharedStrings.xml         — deduplicated string table
xl/styles.xml                — all formatting (fonts, fills, borders, number formats)
xl/theme/theme1.xml          — color themes
_rels/.rels                  — relationship definitions

The critical insight: cell values and cell formatting are stored separately. sheet1.xml has cell references like <c r="B5" s="12"> where s="12" points to style index 12 in styles.xml. To preserve formatting, you must:

  1. Parse and retain the full styles.xml
  2. When modifying a cell value, keep its style reference (s attribute) unchanged
  3. When writing back, ensure the shared strings table is updated correctly
  4. Preserve all XML elements and attributes you don’t understand (round-trip fidelity)
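Points 2 and 3 can be sketched with two small in-memory types. This is only an illustration of the idea, using the standard library alone: SharedStrings, CellValue, and Cell are hypothetical names, and a real implementation would also have to parse and serialize the XML around them.

```rust
use std::collections::HashMap;

/// Deduplicated string table, mirroring the role of xl/sharedStrings.xml.
#[derive(Default)]
pub struct SharedStrings {
    strings: Vec<String>,
    index: HashMap<String, usize>,
}

impl SharedStrings {
    /// Returns the index of `s`, appending it only if it is not present yet.
    pub fn intern(&mut self, s: &str) -> usize {
        if let Some(&i) = self.index.get(s) {
            return i;
        }
        let i = self.strings.len();
        self.strings.push(s.to_owned());
        self.index.insert(s.to_owned(), i);
        i
    }

    pub fn get(&self, i: usize) -> Option<&str> {
        self.strings.get(i).map(String::as_str)
    }
}

#[derive(Debug, Clone, PartialEq)]
pub enum CellValue {
    Number(f64),
    SharedString(usize), // index into the shared strings table
    Bool(bool),
}

/// One cell: the value can change, the style index (the `s` attribute) stays put.
#[derive(Debug, Clone, PartialEq)]
pub struct Cell {
    pub reference: String,
    pub style: Option<u32>,
    pub value: CellValue,
}

impl Cell {
    /// Replaces the value with a number; `style` is deliberately not touched.
    pub fn set_number(&mut self, n: f64) {
        self.value = CellValue::Number(n);
    }

    /// Replaces the value with text, storing the text by index reference.
    pub fn set_text(&mut self, text: &str, strings: &mut SharedStrings) {
        self.value = CellValue::SharedString(strings.intern(text));
    }
}
```

Keeping the style index as an opaque number is the design choice that makes preservation cheap: the library never needs to understand style 12 in order to write it back.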

Expected Concepts Covered

Concept | How It Appears
Binary I/O | Reading ZIP archives with the zip crate
XML parsing | quick-xml for streaming XML parsing/writing
Complex data structures | Shared strings table, styles table, theme colors (interconnected data)
Builder pattern | WorkbookBuilder for creating new formatted workbooks
Trait implementation | Display for cell values, From/Into for type conversions
Error handling (production) | Graceful handling of corrupt files, missing XML elements, unknown formats
Testing (round-trip) | Open → modify → save → reopen → verify (formatting preserved)
Documentation | rustdoc with examples for every public API method

In-Program Scope (Weeks 37–40 — Basic Implementation)

This is what gets built during the program, with an honest statement of what it covers.

Week 37: Understand the format before writing code. Open a .xlsx file with unzip, read the XML by hand. Implement reading: parse cell values (strings, numbers, booleans), handle the shared string table correctly (values are stored by index reference — getting this wrong is the most common mistake in XLSX parsers).

Week 38: Implement writing — create a new workbook with one or more sheets, write cell values, write basic formatting (bold, font size, background color) through correct styles.xml construction.

Week 39: Cell range reading (A1:D50 → Vec<Vec<Value>>), formula string preservation (read a formula cell, store the formula string, write it back; do not evaluate it, that is formulaengine’s job), row height and column width preservation.
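Range reading starts with turning an A1-style reference into zero-based indices. A minimal sketch, assuming plain relative references only (no $ markers, no sheet prefix); parse_cell_ref and parse_range are illustrative names:

```rust
/// Parses "B5" into zero-based (row, column) = (4, 1).
/// Returns None for anything that is not uppercase letters followed by digits.
pub fn parse_cell_ref(reference: &str) -> Option<(usize, usize)> {
    let split = reference.find(|c: char| c.is_ascii_digit())?;
    let (letters, digits) = reference.split_at(split);
    if letters.is_empty() || !letters.chars().all(|c| c.is_ascii_uppercase()) {
        return None;
    }
    if !digits.chars().all(|c| c.is_ascii_digit()) {
        return None;
    }
    // Column letters are a base-26 number without a zero digit: A=1, Z=26, AA=27.
    let mut col = 0usize;
    for c in letters.chars() {
        col = col
            .checked_mul(26)?
            .checked_add((c as u8 - b'A') as usize + 1)?;
    }
    let row: usize = digits.parse().ok()?;
    if row == 0 {
        return None; // rows are 1-based in the file format
    }
    Some((row - 1, col - 1))
}

/// Parses "A1:D50" into its top-left and bottom-right corners.
pub fn parse_range(range: &str) -> Option<((usize, usize), (usize, usize))> {
    let (start, end) = range.split_once(':')?;
    Some((parse_cell_ref(start)?, parse_cell_ref(end)?))
}
```

With the two corners in hand, building the Vec<Vec<Value>> is a nested loop over rows and columns that fills missing cells with an empty value.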

Week 40: PyO3 binding, integration tests (read → modify → write → re-read → assert), README.md that explicitly states what is and is not supported, publish as 0.1.0-alpha.

Definition of Done (In-Program)

  • Can read XLSX files created by Excel, LibreOffice, and Google Sheets (cell values, shared strings, basic formatting)
  • Basic formatting preserved after read-modify-write: bold, font size, fill color
  • Shared strings table is updated correctly when cell values change
  • Formula string preservation: formulas read and written back unchanged (not evaluated)
  • Row height and column width preserved on write
  • CLI inspect shows useful summary of workbook structure
  • 15+ tests including round-trip tests
  • README explicitly documents what is NOT supported
  • Published as 0.1.0-alpha to signal scope
  • Benchmarks showing read/write performance vs. calamine + rust_xlsxwriter separately

What is explicitly NOT in scope for Weeks 37–40: merged cells, charts, images, conditional formatting, pivot tables, full number format string parsing (the Excel custom format language is its own mini-language), and complex border/alignment fidelity.


Post-Program Extension Track (Weeks 53–60+ — Advanced Implementation)

Pursue if xlsxfmt becomes a serious OSS project after the program ends. This is outside the 1-year program.

  • Weeks 53–54: Full formatting fidelity — borders, alignment, wrap text, number format string parsing (dates, currency, percentages, custom formats)
  • Weeks 55–56: Merged cell handling — mergeCells in sheet1.xml, reading/writing merged ranges correctly
  • Weeks 57–58: Named ranges, cross-sheet formula references (Sheet2!A1), sheet visibility and protection
  • Weeks 59–60: Streaming reader for large files (>100MB) using quick-xml’s streaming API, formulaengine integration as optional feature flag, benchmark vs. Python openpyxl, upgrade from 0.1.0-alpha to 0.2.0

Real-World Value

Every Python developer doing Excel automation hits this wall. LibreOffice is slow and fragile. openpyxl is slow at scale. A fast, safe, pure-Rust library for bidirectional Excel processing would be immediately useful for document pipelines, report generators, and template systems. Directly connects to your LSEG work. Publishable to crates.io. SaaS path: an API that accepts XLSX modifications and returns the result.



THREE CAPSTONE OPTIONS: Choose one of docforge (Option A — document conversion), formulaengine (Option B — spreadsheet formula evaluator), or ruleforge (Option C — business rules engine). All three teach the same advanced Rust concepts (parsing, trait-based architecture, PyO3, publishing). Pick whichever excites you most.

Phase 5 — docforge

Document Conversion Library (Capstone — Option A)

Timeline: Weeks 41–52


The Problem It Solves

Converting documents (XLSX → PDF, DOCX → PDF) with formatting fidelity is universally needed and universally painful. The open-source standard is LibreOffice in headless mode:

  • 500MB+ install size (includes a full office suite you don’t need)
  • 3–8 second startup time per conversion (JVM-like cold start)
  • Documented formatting bugs: pagination breaks, font substitution issues, merged cell rendering errors (hundreds of Stack Overflow threads)
  • Sequential only: one conversion at a time per process instance
  • 700 seconds for a batch — your own LSEG measurement

wkhtmltopdf is abandoned. Headless Chrome is 200MB+ and only handles HTML. Python wrappers (python-docx + reportlab) are slow and lose formatting.

You already proved this can be done better — your LSEG tool does it in 3.5 seconds with parallel processing. docforge takes that proof-of-concept and makes it a proper, open-source, publishable Rust library.


Phased Build Plan

Phase 5a — XLSX → PDF (Weeks 21–22):

Build on xlsxfmt from Phase 4B. The rendering pipeline:

XLSX file
  → xlsxfmt reads cells + formatting
  → Layout engine computes column widths, row heights, page breaks
  → PDF renderer draws cells, borders, text with correct fonts/colors
  → Output PDF file
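The page-break step of the layout engine can start as a simple greedy pass over row heights. The sketch below assumes heights and the usable page height share one unit (for example points) and ignores manual page breaks and repeated header rows; paginate_rows is an illustrative name.

```rust
use std::ops::Range;

/// Splits rows into pages: each page takes rows until the next one would not fit.
/// A single row taller than the page still gets a page of its own.
pub fn paginate_rows(row_heights: &[f64], page_height: f64) -> Vec<Range<usize>> {
    let mut pages = Vec::new();
    let mut start = 0; // first row of the page being filled
    let mut used = 0.0; // height already consumed on that page
    for (i, &height) in row_heights.iter().enumerate() {
        // Start a new page if this row overflows and the page is not empty.
        if i > start && used + height > page_height {
            pages.push(start..i);
            start = i;
            used = 0.0;
        }
        used += height;
    }
    if start < row_heights.len() {
        pages.push(start..row_heights.len());
    }
    pages
}
```

Each returned range is the set of row indices the PDF renderer draws on one page.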

Must handle:

  • Column widths and row heights (from XLSX metadata)
  • Cell borders (thin, medium, thick, dashed, dotted)
  • Fill colors (solid fills)
  • Font styles (bold, italic, size, color)
  • Number formatting (currency $1,234.56, percentages 45.2%, dates)
  • Merged cells
  • Multi-sheet documents (one PDF page per sheet, or configurable)
  • Page orientation (portrait/landscape from XLSX print settings)

Phase 5b — DOCX → PDF (Weeks 23–24):

DOCX is also ZIP + XML (OOXML). The XML structure is different from XLSX but the parsing approach is similar.

Must handle:

  • Paragraphs with alignment (left, center, right, justified)
  • Headings (H1–H6 mapped from DOCX heading styles)
  • Bold, italic, underline, strikethrough
  • Font size and color
  • Tables with borders and cell formatting
  • Page breaks (explicit and auto)
  • Basic list items (numbered and bulleted)

Phase 5c — Polish and Publish (Weeks 25–26):

  • Clean, ergonomic public API for both XLSX and DOCX conversion
  • CLI: docforge convert input.xlsx --output output.pdf
  • Comprehensive documentation with examples
  • Benchmarks: conversion time and file size vs. LibreOffice headless
  • PyO3 binding: docforge.xlsx_to_pdf("input.xlsx", "output.pdf")
  • Publish to crates.io
  • Build CI/CD with GitHub Actions

Technical Requirements

Required crates:

  • xlsxfmt (your Phase 4B library) — XLSX reading
  • quick-xml — XML parsing for DOCX
  • zip — ZIP archive handling
  • printpdf — PDF generation (or lopdf for lower-level control)
  • rusttype or ab_glyph — font rendering and text measurement
  • clap — CLI
  • pyo3 — Python binding
  • criterion — benchmarking

Architecture:

docforge/
  src/
    lib.rs           → Public API: convert_xlsx_to_pdf(), convert_docx_to_pdf()
    xlsx/
      reader.rs      → Uses xlsxfmt for XLSX reading
      layout.rs      → Page layout computation (column widths, pagination)
    docx/
      reader.rs      → DOCX XML parsing
      layout.rs      → Paragraph/table layout computation
    pdf/
      renderer.rs    → PDF rendering engine (shared between XLSX and DOCX)
      fonts.rs       → Font loading and text measurement
      page.rs        → Page model (dimensions, margins)
    cli.rs           → CLI entry point
  python/
    src/lib.rs       → PyO3 bindings
  benches/
    conversion.rs    → Benchmark suite vs LibreOffice
  tests/
    fixtures/        → Test XLSX and DOCX files created in Excel/Word
    round_trip.rs    → Convert and verify output

Expected Concepts Covered

Concept | How It Appears
Library design | Clean public API with builder patterns, good defaults
Module architecture | Multi-module crate with clear separation of concerns
Binary I/O | ZIP parsing, PDF binary format
XML processing at scale | Streaming XML parsing for large documents
Trait-based abstraction | trait DocumentReader, trait PdfRenderable
Font rendering | Text measurement, font metrics, glyph positioning
Performance engineering | Profiling, benchmarking, comparison to LibreOffice
PyO3 | Full Python binding with type stubs
Publishing | crates.io packaging, semantic versioning, CI/CD
Documentation | rustdoc with comprehensive examples, architecture doc
Testing | Visual regression testing (convert → compare output), fixture-based tests

Definition of Done

  • XLSX → PDF conversion handles all listed formatting features
  • DOCX → PDF conversion handles paragraphs, headings, tables, basic styles
  • CLI works: docforge convert input.xlsx -o output.pdf
  • Benchmarks show measurable speed improvement over LibreOffice headless
  • PyO3 binding works: docforge.xlsx_to_pdf(input_path, output_path)
  • Published on crates.io with version 0.1.0
  • README with architecture overview, usage examples, and benchmark results
  • 30+ tests including fixture-based visual verification
  • CI/CD with GitHub Actions (test on Linux, macOS, Windows)
  • At minimum 3 real-world test files (created in Excel/Word/Google Docs) convert correctly

Real-World Value and SaaS Potential

Open-source impact: Every data engineering team, every report generation system, every document pipeline hits the LibreOffice problem. A fast, native-code, well-documented alternative would get significant adoption.

SaaS path: A REST API that accepts document uploads and returns PDFs:

POST /convert
Content-Type: multipart/form-data
file: report.xlsx

→ 200 OK
Content-Type: application/pdf
[PDF bytes]

Hosted on a single server, this could serve hundreds of concurrent conversions (unlike LibreOffice which handles one at a time). Real businesses pay $50–500/month for document conversion APIs. You already proved the concept at LSEG. This is the productizable version.

Career impact: A published crate that solves a real problem, with benchmarks proving it outperforms the incumbent, with Python bindings for broad accessibility — this is the kind of project that transforms a resume from “I used frameworks” to “I built infrastructure.”



Phase 5 (Alternative) — formulaengine

Spreadsheet Formula Computation Engine (Capstone — Option B)

Timeline: Weeks 41–52


The Problem It Solves

Every product with a “calculated field,” “formula column,” or “computed cell” needs a formula evaluation engine under the hood — Airtable, Notion databases, Google Sheets, internal pricing tools, financial models, report builders. Building one from scratch is hard: you need a parser, a dependency resolver, circular reference detection, type coercion, and hundreds of built-in functions.

Today, companies solve this by:

  • Embedding full Excel via COM automation — Windows-only, fragile, requires a licensed Excel installation, impossible to deploy in containers
  • Wrapping LibreOffice Calc — same 500MB+ footprint problems as document conversion, slow, unstable
  • Reimplementing every formula in application code — no reuse, error-prone, impossible to let end-users define their own formulas
  • Using JavaScript-based engines (HyperFormula, FormulaJS) — single-threaded, slow on large sheets, no systems-level performance

Standalone, embeddable formula evaluation libraries are scarce in Rust, Go, and C++. The most capable engines live inside full spreadsheet applications (LibreOffice Calc, Gnumeric), where the formula engine is deeply coupled to the UI and cannot be extracted.

This pairs directly with xlsxfmt — together, they form a “Rust spreadsheet toolkit” where xlsxfmt handles reading/writing the file format and formulaengine handles computing the values. This combination does not exist in any language as a pair of composable libraries.


Phased Build Plan

Phase 5a — Parser and Core Engine (Weeks 21–22):

Build the formula language parser and basic evaluation engine.

"=IF(AND(A1>100, B1<50), A1*0.15, A1*0.10)"
  → Tokenizer: [EQUALS, FUNC("IF"), LPAREN, FUNC("AND"), ...]
  → AST: FunctionCall("IF", [FunctionCall("AND", [...]), Multiply(...), Multiply(...)])
  → Evaluator: walks the AST, resolves cell references, returns Value::Number(15.0)

Must handle:

  • Tokenizer/Lexer: Numbers, strings, booleans, cell references (A1, $A$1, A1:B10), operators (+, -, *, /, ^, &, =, <>, <, >, <=, >=), function names, parentheses, commas
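  • Parser: Recursive descent parser producing an AST. Operator precedence following Excel’s rules rather than plain PEMDAS (unary minus binds tighter than ^, then * and /, then + and -, then &, then comparisons). Nested function calls.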
  • Cell reference resolution: Absolute ($A$1), relative (A1), ranges (A1:B10), cross-sheet (Sheet2!A1)
  • Type system: Value enum — Number(f64), Text(String), Boolean(bool), Error(FormulaError), Empty
  • Type coercion: “123” + 1 = 124 (Excel-compatible coercion rules)
  • Core functions (20+): SUM, AVERAGE, COUNT, COUNTA, MIN, MAX, IF, AND, OR, NOT, CONCATENATE, LEFT, RIGHT, MID, LEN, TRIM, UPPER, LOWER, ROUND, ABS, MOD
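To make the recursive descent idea concrete, here is a deliberately tiny sketch that parses and evaluates in one pass. It covers only numbers, parentheses, unary minus, and + - * / ^, with no cell references, functions, or AST. It follows the commonly documented Excel behaviour that ^ is left-associative and binds looser than unary minus. Parser and eval_formula are illustrative names, and errors are plain strings for brevity.

```rust
use std::iter::Peekable;
use std::str::Chars;

struct Parser<'a> {
    chars: Peekable<Chars<'a>>,
}

impl<'a> Parser<'a> {
    fn skip_ws(&mut self) {
        while self.chars.peek().map_or(false, |c| c.is_whitespace()) {
            self.chars.next();
        }
    }

    /// Consumes `expected` if it is the next non-whitespace character.
    fn eat(&mut self, expected: char) -> bool {
        self.skip_ws();
        if self.chars.peek() == Some(&expected) {
            self.chars.next();
            true
        } else {
            false
        }
    }

    // expr := term (('+' | '-') term)*
    fn expr(&mut self) -> Result<f64, String> {
        let mut value = self.term()?;
        loop {
            if self.eat('+') {
                value += self.term()?;
            } else if self.eat('-') {
                value -= self.term()?;
            } else {
                return Ok(value);
            }
        }
    }

    // term := power (('*' | '/') power)*
    fn term(&mut self) -> Result<f64, String> {
        let mut value = self.power()?;
        loop {
            if self.eat('*') {
                value *= self.power()?;
            } else if self.eat('/') {
                let rhs = self.power()?;
                if rhs == 0.0 {
                    return Err("#DIV/0!".to_string());
                }
                value /= rhs;
            } else {
                return Ok(value);
            }
        }
    }

    // power := unary ('^' unary)*   (left-associative)
    fn power(&mut self) -> Result<f64, String> {
        let mut value = self.unary()?;
        while self.eat('^') {
            let exponent = self.unary()?;
            value = value.powf(exponent);
        }
        Ok(value)
    }

    // unary := '-' unary | primary
    fn unary(&mut self) -> Result<f64, String> {
        if self.eat('-') {
            Ok(-self.unary()?)
        } else {
            self.primary()
        }
    }

    // primary := NUMBER | '(' expr ')'
    fn primary(&mut self) -> Result<f64, String> {
        if self.eat('(') {
            let value = self.expr()?;
            return if self.eat(')') { Ok(value) } else { Err("expected ')'".to_string()) };
        }
        self.skip_ws();
        let mut text = String::new();
        while let Some(&c) = self.chars.peek() {
            if c.is_ascii_digit() || c == '.' {
                text.push(c);
                self.chars.next();
            } else {
                break;
            }
        }
        text.parse::<f64>().map_err(|_| "expected a number".to_string())
    }
}

/// Evaluates a numeric formula such as "=1+2*3"; the leading '=' is optional.
pub fn eval_formula(src: &str) -> Result<f64, String> {
    let trimmed = src.trim();
    let body = trimmed.strip_prefix('=').unwrap_or(trimmed);
    let mut parser = Parser { chars: body.chars().peekable() };
    let value = parser.expr()?;
    parser.skip_ws();
    if parser.chars.peek().is_some() {
        return Err("unexpected trailing input".to_string());
    }
    Ok(value)
}
```

Each grammar rule becomes one method, and precedence falls out of which method calls which. The real engine would return AST nodes from these methods instead of f64, then evaluate the tree separately.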

Phase 5b — Dependency Graph and Advanced Functions (Weeks 23–24):

Cells reference other cells. You must evaluate them in the right order.

A1 = 10
B1 = =A1 * 2          → depends on A1
C1 = =B1 + A1          → depends on B1 and A1
D1 = =SUM(A1:C1)       → depends on A1, B1, C1

Dependency graph:
  A1 → B1 → C1 → D1

Topological sort → evaluate A1 first, then B1, then C1, then D1
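One way to produce that order is Kahn's algorithm, which also gives cycle detection for free: if some cells never reach in-degree zero, they sit on or behind a cycle. The sketch below uses only the standard library (the hand-rolled alternative to petgraph), takes a map from each formula cell to the cells it reads, and uses BTreeMap so the output order is deterministic. evaluation_order is an illustrative name.

```rust
use std::collections::{BTreeMap, BTreeSet, VecDeque};

/// Returns cells in a valid evaluation order, or Err with the cells that
/// could not be ordered because they are part of (or depend on) a cycle.
pub fn evaluation_order(
    deps: &BTreeMap<String, Vec<String>>,
) -> Result<Vec<String>, Vec<String>> {
    // in_degree[c] = number of distinct cells that c reads.
    let mut in_degree: BTreeMap<&str, usize> = BTreeMap::new();
    // dependents[c] = cells that read c.
    let mut dependents: BTreeMap<&str, Vec<&str>> = BTreeMap::new();

    for (cell, reads) in deps {
        in_degree.entry(cell.as_str()).or_insert(0);
        let unique: BTreeSet<&str> = reads.iter().map(String::as_str).collect();
        for dep in unique {
            in_degree.entry(dep).or_insert(0);
            *in_degree.entry(cell.as_str()).or_insert(0) += 1;
            dependents.entry(dep).or_default().push(cell.as_str());
        }
    }

    // Start with cells that read nothing (plain values).
    let mut ready: VecDeque<&str> = in_degree
        .iter()
        .filter(|(_, degree)| **degree == 0)
        .map(|(cell, _)| *cell)
        .collect();

    let mut order = Vec::with_capacity(in_degree.len());
    while let Some(cell) = ready.pop_front() {
        order.push(cell.to_string());
        if let Some(list) = dependents.get(cell) {
            for &next in list {
                let degree = in_degree.get_mut(next).expect("dependent was registered");
                *degree -= 1;
                if *degree == 0 {
                    ready.push_back(next);
                }
            }
        }
    }

    if order.len() == in_degree.len() {
        Ok(order)
    } else {
        Err(in_degree
            .into_iter()
            .filter(|(_, degree)| *degree > 0)
            .map(|(cell, _)| cell.to_string())
            .collect())
    }
}
```

For the four-cell example above this yields A1, B1, C1, D1. For A1 = =B1 and B1 = =A1 it returns both cells in the error, which the engine can then map to its circular reference error.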

Must handle:

  • Dependency graph: Build a directed graph of cell → cell dependencies
  • Topological sort: Evaluate cells in dependency order
  • Circular reference detection: A1 = =B1, B1 = =A1 → detect and return Error::CircularReference
  • Incremental recalculation: When A1 changes, only recalculate cells that depend on A1 (not the entire sheet). Use the dependency graph to find the dirty set.
  • Range functions at scale: SUM(A1:A100000) must not clone 100k values
  • Advanced functions (20+): VLOOKUP, HLOOKUP, INDEX, MATCH, SUMIF, COUNTIF, AVERAGEIF, IFERROR, ISBLANK, ISNA, DATE, YEAR, MONTH, DAY, NOW, TODAY, TEXT, VALUE, INDIRECT, OFFSET
  • Array formulas: Basic support for functions that return ranges
  • Named ranges: revenue → A1:A12
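The dirty set for incremental recalculation is a breadth-first walk over the reverse edges of the dependency graph. A minimal sketch, assuming the engine already maintains a map from each cell to the cells that read it; dirty_set is an illustrative name.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Returns every cell that must be recalculated after `changed` is edited.
/// `dependents` maps a cell to the cells whose formulas read it directly.
pub fn dirty_set(dependents: &HashMap<String, Vec<String>>, changed: &str) -> HashSet<String> {
    let mut dirty = HashSet::new();
    let mut queue = VecDeque::new();
    queue.push_back(changed.to_string());

    while let Some(cell) = queue.pop_front() {
        if let Some(children) = dependents.get(&cell) {
            for child in children {
                // `insert` returns false for cells already seen, so each cell
                // is queued at most once and the walk terminates on cycles.
                if dirty.insert(child.clone()) {
                    queue.push_back(child.clone());
                }
            }
        }
    }
    dirty
}
```

Evaluating only the dirty cells, in topological order, gives the incremental recalculation described above; everything outside the set keeps its cached value.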

Phase 5c — Polish and Publish (Weeks 25–26):

  • Clean, ergonomic public API with builder pattern
  • Integration with xlsxfmt: load an XLSX file, evaluate all formulas, write results back
  • CLI: formulaengine eval spreadsheet.xlsx --recalculate --output computed.xlsx
  • CLI: formulaengine eval --formula "=SUM(1,2,3)" --format value
  • Benchmarks: evaluate 100k cells with dependencies, compare to Python (openpyxl + manual eval)
  • PyO3 binding: engine.set_cell("A1", 100); engine.set_formula("B1", "=A1*2"); engine.evaluate("B1")
  • Comprehensive documentation with examples
  • Publish to crates.io
  • Build CI/CD with GitHub Actions

Technical Requirements

Required crates:

  • xlsxfmt (your Phase 4B library) — XLSX reading/writing for integration
  • clap — CLI
  • pyo3 — Python binding
  • criterion — benchmarking
  • petgraph — dependency graph (or hand-roll with HashMap + topological sort)
  • chrono — date/time functions
  • tracing — debugging complex evaluation chains

Architecture:

formulaengine/
  src/
    lib.rs              → Public API: Engine::new(), set_cell(), set_formula(), evaluate()
    parser/
      tokenizer.rs      → Lexer: input string → token stream
      ast.rs            → AST node types: Expr, CellRef, Range, FunctionCall
      parser.rs         → Recursive descent parser: tokens → AST
    eval/
      evaluator.rs      → AST walker: resolves references, applies operators, calls functions
      value.rs          → Value enum (Number, Text, Boolean, Error, Empty)
      coercion.rs       → Type coercion rules (Excel-compatible)
    functions/
      registry.rs       → FunctionRegistry: trait-based, extensible
      math.rs           → SUM, AVERAGE, ROUND, ABS, MOD, etc.
      text.rs           → CONCATENATE, LEFT, RIGHT, MID, LEN, TRIM, etc.
      logical.rs        → IF, AND, OR, NOT, IFERROR, etc.
      lookup.rs         → VLOOKUP, HLOOKUP, INDEX, MATCH, etc.
      date.rs           → DATE, YEAR, MONTH, DAY, NOW, TODAY, etc.
      statistical.rs    → COUNT, COUNTA, COUNTIF, SUMIF, AVERAGEIF, etc.
    graph/
      dependency.rs     → Build dependency graph from cell formulas
      topo.rs           → Topological sort for evaluation order
      incremental.rs    → Dirty-set tracking for incremental recalc
    sheet.rs            → Sheet model: cells, values, formulas
    cli.rs              → CLI entry point
  python/
    src/lib.rs          → PyO3 bindings
  benches/
    evaluation.rs       → Benchmark: 100k cells, complex dependency chains
    functions.rs        → Benchmark: individual function performance
  tests/
    excel_compat.rs     → Test against known Excel outputs
    circular.rs         → Circular reference detection tests
    incremental.rs      → Incremental recalculation tests
    fixtures/           → Test XLSX files with formulas and expected results

Expected Concepts Covered

Concept | How It Appears
Parsing / Compilers | Tokenizer, recursive descent parser, AST design (classic CS)
Graph algorithms | Dependency graph, topological sort, cycle detection
Enums with data | Value enum, Expr AST nodes (Rust’s killer feature)
Trait-based extensibility | trait Function with fn evaluate(&self, args: &[Value]) -> Value
Generics | Engine<S: SheetProvider>, abstracting over the data source
Lifetimes | Cell references borrowing from sheet data during evaluation
Performance engineering | Incremental recalc, benchmarking, avoiding clones on large ranges
Error handling | FormulaError enum (DivZero, CircularRef, NameError, TypeError, etc.)
PyO3 | Full Python binding with type stubs
Publishing | crates.io packaging, semantic versioning, CI/CD
Documentation | rustdoc with comprehensive examples
Testing | Excel-compatibility test suite, property-based testing for parser

Definition of Done

  • Parser handles all standard Excel formula syntax (operators, functions, cell refs, ranges)
  • 40+ built-in functions across math, text, logical, lookup, date, and statistical categories
  • Dependency graph correctly resolves evaluation order via topological sort
  • Circular references detected and reported as Error::CircularReference
  • Incremental recalculation: changing one cell only recalculates its dependents
  • XLSX integration: load formulas from xlsxfmt, evaluate, write results back
  • CLI works: formulaengine eval spreadsheet.xlsx --recalculate --output computed.xlsx
  • Benchmarks: 100k cell evaluation under 1 second
  • PyO3 binding works: engine.evaluate("B1") returns correct value from Python
  • Published on crates.io with version 0.1.0
  • README with architecture overview, usage examples, and benchmark results
  • 50+ tests including Excel-compatibility verification
  • CI/CD with GitHub Actions (test on Linux, macOS, Windows)
  • Custom functions: users can register their own via the Function trait

Real-World Value and SaaS Potential

Open-source impact: Every “smart spreadsheet” product, every internal tool with calculated fields, every reporting engine that needs user-defined formulas would benefit from an embeddable, high-performance formula engine. This is infrastructure-level software — the kind of dependency that gets millions of downloads.

SaaS path: A REST API for formula evaluation:

POST /evaluate
{
  "cells": {
    "A1": {"value": 1000},
    "A2": {"value": 2000},
    "A3": {"formula": "=SUM(A1:A2)"},
    "B1": {"formula": "=A3 * 0.15"}
  },
  "evaluate": ["A3", "B1"]
}

→ 200 OK
{
  "results": {
    "A3": 3000,
    "B1": 450
  }
}

This powers pricing calculators, financial models, and report builders without requiring each product to implement their own formula engine. Companies like Rows.com, Causal.app, and Equals.app are building exactly this — they’d pay for a reliable, fast engine they can embed.

Career impact: Building a parser + evaluator + dependency graph + incremental computation engine is the most “computer science” project in this entire program. It demonstrates compiler knowledge, graph algorithm fluency, and systems-level thinking. Combined with xlsxfmt, you’d have a two-crate spreadsheet toolkit that positions you uniquely in the Rust ecosystem.



Phase 5 (Alternative) — ruleforge

Embeddable Business Rules Engine (Capstone — Option C)

Timeline: Weeks 41–52


The Problem It Solves

Every enterprise application is full of business logic that changes frequently: pricing rules, eligibility checks, approval workflows, risk thresholds, compliance policies, discount tiers, routing logic. Today, this logic lives scattered across application code:

# This is in 47 different files across 3 microservices
if customer.tier == "gold" and order.total > 5000:
    discount = 0.15
elif customer.tier == "gold" and order.total > 1000:
    discount = 0.10
elif customer.loyalty_years > 5:
    discount = 0.08
else:
    discount = 0.0

When these rules change — and they change constantly — a developer must find every instance, modify code, get it reviewed, test it, and redeploy. A rules engine externalizes this logic into a simple, readable format that non-developers can modify:

rule "Gold tier high-value discount"
  when customer.tier == "gold" AND order.total > 5000
  then set discount = 0.15

rule "Gold tier standard discount"
  when customer.tier == "gold" AND order.total > 1000
  then set discount = 0.10

rule "Loyalty discount"
  when customer.loyalty_years > 5
  then set discount = 0.08

Rules are loaded at runtime. No redeployment. No developer needed for business logic changes. This is how banks process loan applications, how insurance companies evaluate claims, how e-commerce platforms manage pricing, and how healthcare systems enforce compliance.

The current landscape:

  • Drools (Java): The dominant open-source rules engine. 20+ years old, massive codebase, requires JVM. Powerful but heavyweight — overkill for embedding in a microservice.
  • Open Policy Agent (OPA) (Go): Policy engine from CNCF. Excellent for authorization and infrastructure policy. Uses Rego, a purpose-built query language. Not designed for general business rules (no priority-based conflict resolution, no stateful rule chaining).
  • json-rules-engine (JavaScript): Lightweight JSON-based engine. Decent for simple conditions but no DSL, no rule chaining, no conflict resolution, limited operator set.
  • Grule (Go): Go rules engine inspired by Drools. Active but limited documentation and smaller ecosystem.

Embeddable rules engines in Rust are few and young. A search for “rules engine” on crates.io turns up little that is mature or widely adopted, so the gap is real.


Phased Build Plan

Phase 5a — Rule Language and Parser (Weeks 21–22):

Design and implement the rule DSL (Domain-Specific Language). This is the core CS challenge — you’re building a small programming language.

Rule language grammar (simplified):

ruleset     = rule+
rule        = "rule" STRING priority? condition action
priority    = "priority" INTEGER
condition   = "when" expression
action      = "then" statement ("AND" statement)*
expression  = term (("AND" | "OR") term)*
term        = "NOT"? comparison
comparison  = accessor operator value
accessor    = IDENTIFIER ("." IDENTIFIER)*
operator    = "==" | "!=" | ">" | "<" | ">=" | "<=" | "contains" | "matches"
value       = STRING | NUMBER | BOOLEAN | accessor
statement   = "set" accessor "=" value
            | "emit" STRING
            | "flag" STRING

Must handle:

  • Tokenizer/Lexer: Keywords (rule, when, then, set, AND, OR, NOT), identifiers, operators, literals (strings, numbers, booleans)
  • Parser: Recursive descent parser producing a Rule AST
  • Rule types: Conditions with AND/OR/NOT combinators, nested field access (customer.address.country), comparison operators including contains (for lists) and matches (for regex)
  • Actions: set (modify a field), emit (produce an output event/message), flag (mark for downstream processing)
  • Priority system: Rules have optional priority (higher runs first). When multiple rules match, priority determines order.
  • Rule validation: Catch syntax errors with clear error messages (“Line 3: expected operator after ‘customer.tier’, found ‘gold’”)

Phase 5b — Evaluation Engine and Conflict Resolution (Weeks 23–24):

Build the engine that evaluates rules against data (called “facts” in rules engine terminology).

Facts (input):
{
  "customer": { "tier": "gold", "loyalty_years": 7 },
  "order": { "total": 6200, "items": 3 }
}

Rules evaluated → matching rules:
  1. "Gold tier high-value discount" (priority 10) → set discount = 0.15
  2. "Loyalty discount" (priority 5) → set discount = 0.08

Conflict resolution (highest priority wins):
  → discount = 0.15

Output:
{
  "discount": 0.15,
  "matched_rules": ["Gold tier high-value discount", "Loyalty discount"],
  "applied_rule": "Gold tier high-value discount"
}
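
The resolution step in this example can be sketched with a small strategy trait. This is only one possible shape: the Matched struct (with its discount field standing in for a real action) and the resolver type names are assumptions for illustration. The spec itself only names the strategies priority, first_match, all, and custom.

```rust
/// A rule that matched the facts, reduced to what resolution needs.
/// `discount` stands in for the rule's real action (an assumption of this sketch).
#[derive(Debug, Clone, PartialEq)]
pub struct Matched {
    pub name: &'static str,
    pub priority: i64,
    pub discount: f64,
}

/// Strategy hook: given the matching rules in file order, choose which to apply.
pub trait ConflictResolver {
    fn resolve<'a>(&self, matched: &'a [Matched]) -> Vec<&'a Matched>;
}

pub struct HighestPriority;
pub struct FirstMatch;
pub struct All;

impl ConflictResolver for HighestPriority {
    fn resolve<'a>(&self, matched: &'a [Matched]) -> Vec<&'a Matched> {
        // max_by_key keeps the last maximum, so iterate in reverse
        // to let the earlier rule win a priority tie.
        matched.iter().rev().max_by_key(|m| m.priority).into_iter().collect()
    }
}

impl ConflictResolver for FirstMatch {
    fn resolve<'a>(&self, matched: &'a [Matched]) -> Vec<&'a Matched> {
        matched.first().into_iter().collect()
    }
}

impl ConflictResolver for All {
    fn resolve<'a>(&self, matched: &'a [Matched]) -> Vec<&'a Matched> {
        // Apply every matching rule, in the order given.
        matched.iter().collect()
    }
}

/// The two matching rules from the example above, in file order.
pub fn example_matches() -> Vec<Matched> {
    vec![
        Matched { name: "Loyalty discount", priority: 5, discount: 0.08 },
        Matched { name: "Gold tier high-value discount", priority: 10, discount: 0.15 },
    ]
}
```

A user-defined strategy would simply be another type implementing ConflictResolver, which is what the custom option amounts to.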

Must handle:

  • Fact model: Accept JSON-like data structures. Support nested field access via dot notation.
  • Expression evaluator: Walk the AST, resolve field references against facts, apply operators, short-circuit AND/OR
  • Conflict resolution strategies: Configurable — priority (highest wins), first_match (stop at first match), all (apply all matching rules in order), custom (user-defined via trait)
  • Rule chaining: When a rule’s action modifies a fact, re-evaluate other rules that depend on that fact. Detect infinite loops (rule A triggers rule B which triggers rule A).
  • Type coercion: “123” == 123 (configurable strict/loose mode)
  • Built-in functions: len(), sum(), avg(), min(), max(), now(), contains(), matches() (regex)
  • Audit trail: Log which rules fired, in what order, what they changed — critical for compliance
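
Two of the items above, dot-notation access and strict/loose coercion, are small enough to sketch with the standard library alone. The Fact enum below is a hypothetical stand-in for whatever fact model you choose (a real implementation would more likely build on serde_json). The names get_path, Coercion, and facts_equal are likewise illustrative assumptions. How strict mode treats mismatched types is also an assumption: here they are simply unequal.

```rust
use std::collections::BTreeMap;

/// JSON-like fact value (a std-only stand-in for a real JSON type).
#[derive(Debug, Clone, PartialEq)]
pub enum Fact {
    Null,
    Bool(bool),
    Num(f64),
    Str(String),
    List(Vec<Fact>),
    Object(BTreeMap<String, Fact>),
}

impl Fact {
    /// Resolve a dotted path like "customer.tier".
    /// Returns None if a segment is missing or a non-object is traversed.
    pub fn get_path(&self, path: &str) -> Option<&Fact> {
        path.split('.').try_fold(self, |node, key| match node {
            Fact::Object(map) => map.get(key),
            _ => None,
        })
    }
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Coercion { Strict, Loose }

/// Equality under the chosen mode. Loose mode additionally lets a
/// numeric string equal a number; strict mode compares types as-is.
pub fn facts_equal(a: &Fact, b: &Fact, mode: Coercion) -> bool {
    match (a, b, mode) {
        (Fact::Str(s), Fact::Num(n), Coercion::Loose)
        | (Fact::Num(n), Fact::Str(s), Coercion::Loose) => {
            s.trim().parse::<f64>().map_or(false, |v| v == *n)
        }
        _ => a == b,
    }
}

/// The facts from the example earlier in this phase.
pub fn example_facts() -> Fact {
    fn obj(pairs: Vec<(&str, Fact)>) -> Fact {
        Fact::Object(pairs.into_iter().map(|(k, v)| (k.to_string(), v)).collect())
    }
    obj(vec![
        ("customer", obj(vec![
            ("tier", Fact::Str("gold".to_string())),
            ("loyalty_years", Fact::Num(7.0)),
        ])),
        ("order", obj(vec![
            ("total", Fact::Num(6200.0)),
            ("items", Fact::Num(3.0)),
        ])),
    ])
}
```

Returning Option from get_path leaves the null-handling policy to the evaluator: a missing field can become a failed comparison, a warning, or an error, depending on the mode you settle on.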

Phase 5c — Polish and Publish (Weeks 25–26):

  • Clean, ergonomic public API with builder pattern
  • CLI: ruleforge eval --rules pricing.rules --facts order.json
  • CLI: ruleforge validate --rules pricing.rules (syntax check without evaluation)
  • CLI: ruleforge explain --rules pricing.rules --facts order.json (show which rules matched and why)
  • Benchmarks: evaluate 10k rules against complex facts, compare to json-rules-engine (JS) and Grule (Go)
  • PyO3 binding: engine.load_rules("pricing.rules"); result = engine.evaluate(facts)
  • Comprehensive documentation with examples
  • Publish to crates.io
  • Build CI/CD with GitHub Actions

Technical Requirements

Required crates:

  • serde + serde_json — JSON fact parsing and result serialization
  • clap — CLI
  • pyo3 — Python binding
  • criterion — benchmarking
  • regex — matches operator support
  • chrono — date/time comparisons and now() function
  • tracing — audit trail and rule execution logging

Architecture:

ruleforge/
  src/
    lib.rs              → Public API: Engine::new(), load_rules(), evaluate()
    parser/
      tokenizer.rs      → Lexer: rule DSL text → token stream
      ast.rs            → AST node types: Rule, Condition, Action, Expr, Accessor
      parser.rs         → Recursive descent parser: tokens → Vec<Rule>
      validator.rs      → Semantic validation (type checking, undefined field warnings)
    engine/
      evaluator.rs      → Evaluate conditions against facts
      resolver.rs       → Conflict resolution strategies (priority, first_match, all)
      chaining.rs       → Rule chaining with loop detection
      audit.rs          → Execution trace: which rules fired, what changed
    facts/
      model.rs          → Fact data structure (JSON-like nested key-value)
      accessor.rs       → Dot-notation field access (customer.address.country)
      coercion.rs       → Type coercion rules (strict/loose mode)
    functions/
      registry.rs       → Built-in function registry (len, sum, contains, matches, etc.)
      builtins.rs       → Built-in function implementations
    cli.rs              → CLI entry point (eval, validate, explain)
  python/
    src/lib.rs          → PyO3 bindings
  benches/
    evaluation.rs       → Benchmark: 10k rules × complex facts
    parsing.rs          → Benchmark: parse large rule files
  tests/
    fixtures/           → Test rule files and fact JSON files
    evaluation.rs       → Rule evaluation correctness tests
    chaining.rs         → Rule chaining and loop detection tests
    conflict.rs         → Conflict resolution strategy tests
    edge_cases.rs       → Type coercion, null handling, nested access
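
To make the chaining.rs responsibility more concrete, here is a deliberately simplified sketch of forward chaining with loop detection. It assumes flat integer facts and rules whose only action is setting one field to a constant. The names Facts, Rule, run_chain, and the demo rules are all hypothetical. The loop check used here (remember every fact state already visited) is just one option; a plain iteration cap would be another.

```rust
use std::collections::{BTreeMap, HashSet};

pub type Facts = BTreeMap<String, i64>;

pub struct Rule {
    pub name: &'static str,
    pub when: fn(&Facts) -> bool,
    pub set: (&'static str, i64),
}

#[derive(Debug, PartialEq)]
pub struct ChainLoopError { pub fired: Vec<&'static str> }

/// Re-run the rules until a full pass changes nothing.
/// Returns the names of the rules that fired, in firing order.
pub fn run_chain(rules: &[Rule], facts: &mut Facts) -> Result<Vec<&'static str>, ChainLoopError> {
    let mut fired = Vec::new();
    let mut seen: HashSet<Facts> = HashSet::new();
    seen.insert(facts.clone());
    loop {
        let mut changed = false;
        for rule in rules {
            let (key, value) = rule.set;
            // Fire only if the condition holds and the action would change something.
            if (rule.when)(facts) && facts.get(key) != Some(&value) {
                facts.insert(key.to_string(), value);
                fired.push(rule.name);
                changed = true;
                // Returning to an already-visited fact state means the chain cycles.
                if !seen.insert(facts.clone()) {
                    return Err(ChainLoopError { fired });
                }
            }
        }
        if !changed {
            return Ok(fired);
        }
    }
}

fn has_points(f: &Facts) -> bool { f.get("points").map_or(false, |p| *p >= 100) }
fn is_gold(f: &Facts) -> bool { f.get("gold") == Some(&1) }
fn flag_off(f: &Facts) -> bool { f.get("flag") == Some(&0) }
fn flag_on(f: &Facts) -> bool { f.get("flag") == Some(&1) }

/// The second rule's action makes the first rule's condition true on the next pass.
pub fn demo_chain() -> Result<Vec<&'static str>, ChainLoopError> {
    let rules = [
        Rule { name: "discount for gold", when: is_gold, set: ("discount", 15) },
        Rule { name: "promote to gold", when: has_points, set: ("gold", 1) },
    ];
    let mut facts = Facts::new();
    facts.insert("points".to_string(), 120);
    run_chain(&rules, &mut facts)
}

/// Two rules that keep undoing each other: A triggers B, which triggers A.
pub fn demo_loop() -> Result<Vec<&'static str>, ChainLoopError> {
    let rules = [
        Rule { name: "turn on", when: flag_off, set: ("flag", 1) },
        Rule { name: "turn off", when: flag_on, set: ("flag", 0) },
    ];
    let mut facts = Facts::new();
    facts.insert("flag".to_string(), 0);
    run_chain(&rules, &mut facts)
}
```

The fired list doubles as a minimal audit trail, and carrying it inside ChainLoopError gives the user the sequence of rules that led into the cycle.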

Expected Concepts Covered

| Concept | How It Appears |
|---|---|
| Parsing / Language design | Tokenizer, recursive descent parser, AST — you’re building a DSL |
| Enums with data | Expr, Value, Action, ConflictStrategy — pervasive use of Rust enums |
| Trait-based extensibility | trait ConflictResolver, trait Function — users extend the engine |
| Generics | Engine<R: ConflictResolver> — generic over resolution strategy |
| Lifetimes | Rule ASTs borrowing from source text during parsing |
| Error handling | ParseError, EvalError, ChainLoopError — rich, typed errors |
| Serde (advanced) | Custom deserialization for facts, serialization for results and audit trail |
| Performance engineering | Benchmarking rule evaluation, optimizing condition short-circuiting |
| PyO3 | Full Python binding with type stubs |
| Publishing | crates.io packaging, semantic versioning, CI/CD |
| Documentation | rustdoc with comprehensive examples, rule language reference |
| Testing | Property-based testing for parser, fixture-based evaluation tests |

Definition of Done

  • Rule DSL parser handles conditions, actions, priorities, nested field access, AND/OR/NOT
  • 10+ built-in functions (len, sum, avg, min, max, contains, matches, now, etc.)
  • Three conflict resolution strategies: priority, first_match, all
  • Rule chaining works: action modifying a fact re-triggers dependent rules
  • Infinite chain loop detection with clear error message
  • Audit trail: every evaluation produces a log of which rules fired and what changed
  • CLI works: ruleforge eval --rules pricing.rules --facts order.json
  • CLI explain mode: shows human-readable reasoning for each matched rule
  • Benchmarks: 10k rule evaluation under 100ms for simple rules
  • PyO3 binding works: engine.evaluate(facts) returns results from Python
  • Published on crates.io with version 0.1.0
  • README with rule language reference, usage examples, and benchmark results
  • 50+ tests including edge cases (null fields, type mismatches, empty rules)
  • CI/CD with GitHub Actions (test on Linux, macOS, Windows)
  • Custom functions: users can register their own via the Function trait
  • Custom conflict resolvers: users can implement ConflictResolver trait
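
The custom-function item could be met with a trait-object registry along the lines below. This is a sketch under assumptions: the spec names a Function trait but not its methods, so name and call, the Registry type, and the user-defined Double function are all hypothetical.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Num(f64),
    Str(String),
    List(Vec<Value>),
}

/// Extension point for built-in and user-defined functions.
pub trait Function {
    fn name(&self) -> &str;
    fn call(&self, args: &[Value]) -> Result<Value, String>;
}

/// Maps function names to implementations; registering a name again replaces it.
#[derive(Default)]
pub struct Registry {
    functions: HashMap<String, Box<dyn Function>>,
}

impl Registry {
    pub fn register(&mut self, f: Box<dyn Function>) {
        self.functions.insert(f.name().to_string(), f);
    }

    pub fn call(&self, name: &str, args: &[Value]) -> Result<Value, String> {
        match self.functions.get(name) {
            Some(f) => f.call(args),
            None => Err(format!("unknown function '{}'", name)),
        }
    }
}

/// Built-in: len(list) or len(string).
pub struct Len;

impl Function for Len {
    fn name(&self) -> &str { "len" }
    fn call(&self, args: &[Value]) -> Result<Value, String> {
        match args {
            [Value::List(items)] => Ok(Value::Num(items.len() as f64)),
            [Value::Str(s)] => Ok(Value::Num(s.chars().count() as f64)),
            _ => Err("len() expects one list or string".to_string()),
        }
    }
}

/// A user-defined function registered exactly like a built-in.
pub struct Double;

impl Function for Double {
    fn name(&self) -> &str { "double" }
    fn call(&self, args: &[Value]) -> Result<Value, String> {
        match args {
            [Value::Num(n)] => Ok(Value::Num(n * 2.0)),
            _ => Err("double() expects one number".to_string()),
        }
    }
}

/// Registry preloaded with the built-ins implemented in this sketch.
pub fn default_registry() -> Registry {
    let mut registry = Registry::default();
    registry.register(Box::new(Len));
    registry
}
```

Because built-ins and user functions go through the same register call, the evaluator never needs to know which kind it is dispatching to.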

Real-World Value and SaaS Potential

Open-source impact: Every enterprise that processes decisions — loan approvals, insurance claims, pricing, compliance checks, access control — needs a rules engine. The only serious open-source option (Drools) requires a JVM. A lightweight, embeddable Rust engine with Python bindings would fill a massive gap in the cloud-native ecosystem.

SaaS path: A Rules-as-a-Service API:

POST /evaluate
{
  "ruleset_id": "pricing-v3",
  "facts": {
    "customer": { "tier": "gold", "loyalty_years": 7 },
    "order": { "total": 6200 }
  }
}

→ 200 OK
{
  "results": {
    "discount": 0.15,
    "matched_rules": ["Gold tier high-value discount", "Loyalty discount"],
    "audit_trail": [
      { "rule": "Gold tier high-value discount", "matched": true, "action": "set discount = 0.15" },
      { "rule": "Loyalty discount", "matched": true, "action": "set discount = 0.08", "superseded_by": "Gold tier high-value discount" }
    ]
  }
}

Companies would pay to externalize their business rules into a managed service with versioning, audit trails, and A/B testing of rule sets. This is exactly what LaunchDarkly does for feature flags — but for business logic.

Career impact: Building a DSL parser + evaluation engine + conflict resolution system demonstrates language design thinking — the kind of work that senior engineers and compiler teams do. The audit trail feature shows compliance awareness. The embeddable architecture shows systems thinking. This is a project that speaks directly to enterprise engineering leadership.



Project Summary Table

| # | Project | Phase | Weeks | Real-World Gap | Domain | Open-Sourceable | SaaS Potential |
|---|---|---|---|---|---|---|---|
| 1 | csvq | 1 | 1–3 | Partial (xsv unmaintained) | Dev tools | Yes | No |
| 2 | schemaguard | 2 | 4–7 | Yes (no fast CLI validator) | Data eng | Yes | Yes |
| 3 | logr | 2 | 8 | No (personal tool) | Personal | Yes | No |
| 4 | HTTP Server | 3 | 11 | No (learning exercise) | Learning | No | No |
| 5 | ticknorm | 3 | 12–14 | Yes | Finance | Yes | Yes |
| 5* | airpost | 3 | 12–14 | Yes (no OSS air quality aggregator) | Public good | Yes | Yes |
| 6 | riskbook | 4 | 15–18 | Yes (no OSS Rust risk engine) | Finance | Yes | Yes |
| 7 | xlsxfmt | 4 | 19–20 | Yes (no bidirectional XLSX) | Dev tools | Yes | Yes |
| 8A | docforge | 5 | 21–26 | Yes (LibreOffice is the only option) | Doc processing | Yes | Yes |
| 8B | formulaengine | 5 | 21–26 | Yes (no standalone formula evaluator in any language) | Spreadsheet/Infra | Yes | Yes |
| 8C | ruleforge | 5 | 21–26 | Yes (no Rust rules engine; Drools = JVM, OPA = policy-only) | Enterprise/Infra | Yes | Yes |

ticknorm is retained for reference but replaced by airpost due to conflict of interest with LSEG employment. See notice in ticknorm section.

Three capstone options are provided (8A/8B/8C). Choose one. All three teach the same advanced Rust concepts: parsing, trait-based architecture, PyO3 bindings, benchmarking, and crates.io publishing.

  • Real-world problem solvers: 5 of 8 (63%)
  • Finance projects: 1 of 8 (riskbook)
  • Public good projects: 1 of 8 (airpost)
  • Open-sourceable: 7 of 8
  • SaaS potential: 5 of 8