Real-World Gap Analysis — Why These Projects Matter
Evidence-Based Justification for Each Project in the Rust Learning Program
Ashwin Hebbar | March 2026
This document exists for one reason: to prove that the projects in your Rust program aren’t invented exercises. Each section investigates the current landscape, names the existing tools, acknowledges what they do well, identifies what’s genuinely missing, and explains why a Rust solution has a defensible reason to exist. Where the gap is partial rather than absolute, that’s stated honestly.
1. csvq — Command-Line CSV Query Tool
Gap Strength: Partial (Niche, Not Wide Open)
What Exists Today
The Rust CSV CLI space has been explored. The major players:
- xsv (by BurntSushi, the creator of ripgrep): The original Rust CSV toolkit. Fast, elegant, well-designed. However, as of 2024, xsv is officially unmaintained. BurntSushi’s own README now recommends alternatives.
- qsv (by dathere): A maintained fork of xsv that has grown into a full data-wrangling toolkit. It adds ~50 commands beyond xsv, supports Lua/Python scripting, has a --progressbar flag, and supports Polars for analytics. It is actively maintained and well-documented.
- xsv2 (by Faraday): A minimalist fork that modernizes xsv’s codebase without the feature expansion of qsv. Designed for people who found qsv too bloated.
- xan: Another successor to xsv recommended by BurntSushi himself.
Where the Gap Actually Is
The gap is not “no CSV tool exists in Rust.” The gap is narrower and more specific:
qsv is powerful but overwhelming. It has 50+ subcommands, optional Python/Lua scripting runtimes, and a feature set that rivals a small database. For a data engineer who just wants to run tool data.csv --filter "age > 25" --select "name,salary" --sort "salary desc", qsv’s learning curve is a barrier. Its documentation is extensive but assumes familiarity.
xsv2 is simple but limited. It preserves xsv’s elegance but doesn’t add the query-language-style interface that would make ad-hoc analysis fast.
No existing tool provides a unified query interface. All of them (xsv, qsv, xan) use separate subcommands for each operation: xsv select, xsv search, xsv sort. To filter, select, and sort, you pipe three commands together. A tool where you write csvq data.csv --filter "age > 25" --select "name,salary" --sort "salary desc" in a single invocation — composing a query rather than chaining shell pipes — doesn’t exist.
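To make the "compose a query instead of chaining pipes" idea concrete, here is a minimal sketch of how one invocation could be held as a single value and executed in one pass. Everything in it (the Row alias, the Query struct, the run function, and the restriction to a numeric "greater than" filter) is a hypothetical illustration, not csvq's actual design.

```rust
use std::collections::HashMap;

/// One parsed CSV record keyed by column name (hypothetical in-memory form).
pub type Row = HashMap<String, String>;

/// A whole query captured as data instead of as a shell pipeline.
pub struct Query {
    /// Numeric filter of the form `column > threshold`.
    pub filter_gt: Option<(String, f64)>,
    /// Columns to keep, in output order.
    pub select: Vec<String>,
    /// Column to sort by, descending, compared numerically.
    pub sort_desc: Option<String>,
}

/// Reads a column as a number; missing or non-numeric cells yield None.
fn numeric(row: &Row, column: &str) -> Option<f64> {
    row.get(column).and_then(|v| v.trim().parse::<f64>().ok())
}

pub fn run(query: &Query, rows: Vec<Row>) -> Vec<Vec<String>> {
    // 1. Filter: keep rows whose column is a number above the threshold.
    let mut kept: Vec<Row> = rows
        .into_iter()
        .filter(|row| match &query.filter_gt {
            Some((column, min)) => numeric(row, column).map_or(false, |v| v > *min),
            None => true,
        })
        .collect();

    // 2. Sort: descending by the chosen column; non-numeric cells sort last.
    if let Some(column) = &query.sort_desc {
        let key = |row: &Row| numeric(row, column).unwrap_or(f64::NEG_INFINITY);
        kept.sort_by(|a, b| key(b).total_cmp(&key(a)));
    }

    // 3. Select: project each surviving row onto the requested columns.
    kept.iter()
        .map(|row| {
            query
                .select
                .iter()
                .map(|column| row.get(column).cloned().unwrap_or_default())
                .collect()
        })
        .collect()
}
```

Because the three steps live in one Query value, the tool can choose the execution order itself (filter before sort, project last) rather than depending on how the user arranged a pipeline.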
DuckDB CLI fills this niche somewhat (you can run SQL on CSV files), but it’s a full database engine (~20MB binary), not a lightweight single-purpose CLI.
Honest Assessment
This is the weakest gap of the six real-world projects. You’re competing with mature, maintained tools. The justification for building csvq is primarily educational (it’s your Phase 1 project), with a secondary niche in the “simple unified query” space. If you want to make this genuinely useful post-program, the differentiator would be: single-command queries with composable flags, zero configuration, and a binary smaller than 5MB.
2. schemaguard — Streaming Data Contract Validator
Gap Strength: Strong
What Exists Today
Data quality validation is a crowded space — but with a critical blind spot.
Batch-oriented tools (the dominant category):
- Great Expectations (Python): The market leader for data quality testing. Extremely powerful, with 300+ built-in expectations and a rich ecosystem. However, it is fundamentally batch-oriented. The official community forum confirms that streaming data validation is a known limitation. Users report that validating large datasets “can be slow without optimization,” and the recommended workaround for streaming is to use Spark Structured Streaming with micro-batches, not true record-by-record validation. Great Expectations also requires a full Python environment, making it heavy for CI/CD pipelines.
- Soda Core (Python/YAML): A lighter alternative that uses a YAML-based declarative language (SodaCL) to define checks. Simpler than Great Expectations, but SQL-native at its core: it turns your checks into SQL queries against a database. It cannot validate a raw NDJSON stream or CSV file without first loading it into a database.
- Pandera (Python): Schema validation for Pandas DataFrames. Excellent for Python notebooks. Cannot validate streaming data. Requires loading the entire dataset into memory as a DataFrame.
- dbt tests (SQL): Data quality tests that run inside your data warehouse. Post-load only: by the time the test runs, bad data is already in your warehouse.
- Deequ (Scala/Spark): Amazon’s data quality library built on Apache Spark. Powerful but requires a Spark cluster to run. Massive operational overhead for simple validation tasks.
Schema registries (a different category):
- Confluent Schema Registry: Validates Kafka messages against Avro/Protobuf/JSON Schema. Powerful but requires the full Confluent platform. It validates schema compatibility (can consumers handle this schema change?), not data content (is the value of field X within range?).
Where the Gap Actually Is
The gap is specific and well-defined:
There is no fast, standalone, language-agnostic CLI tool that validates streaming data (NDJSON, CSV) against a schema contract at high throughput.
Every existing solution is either:
- Python-speed (Great Expectations, Pandera) — too slow for high-throughput validation
- SQL-native (Soda Core, dbt) — requires a database, can’t validate raw files or streams
- Tied to a specific ecosystem (Confluent Schema Registry for Kafka, Deequ for Spark)
- Batch-only (all of the above, to varying degrees)
What’s missing is the equivalent of ripgrep but for data validation: a single binary you can drop into any CI/CD pipeline, point at a data file or stream, and get instant validation results. Something a data engineer can run as:
schemaguard validate --schema contract.json --input events.ndjson
# Validated 2,000,000 records in 1.8s. 12 violations found.
No Python. No JVM. No database connection. No 500MB dependency tree. Just a fast binary that reads data and checks it against a contract.
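As a rough illustration of what record-by-record validation means, the sketch below checks one line at a time against a contract and never holds more than the current record in memory. The Rule enum, the Violation struct, and the validate function are hypothetical names, and the input is simplified to comma-separated text without quoting; a real tool would parse NDJSON or RFC 4180 CSV properly.

```rust
use std::io::BufRead;

/// Expected type (and optional range) of one column; a hypothetical contract format.
pub enum Rule {
    Text,
    Integer { min: i64, max: i64 },
}

pub struct Violation {
    pub line: usize,
    pub message: String,
}

/// Validates comma-separated records one line at a time, so memory use does
/// not grow with input size. Quoting and escaping are deliberately not handled.
/// Returns the number of records checked and every violation found.
pub fn validate<R: BufRead>(
    input: R,
    contract: &[(&str, Rule)],
) -> std::io::Result<(usize, Vec<Violation>)> {
    let mut checked = 0;
    let mut violations = Vec::new();
    for (index, line) in input.lines().enumerate() {
        let line = line?;
        let line_no = index + 1;
        checked += 1;
        let fields: Vec<&str> = line.split(',').collect();
        if fields.len() != contract.len() {
            violations.push(Violation {
                line: line_no,
                message: format!("expected {} fields, found {}", contract.len(), fields.len()),
            });
            continue;
        }
        for (value, (name, rule)) in fields.iter().zip(contract) {
            match rule {
                Rule::Text => {}
                Rule::Integer { min, max } => match value.trim().parse::<i64>() {
                    Ok(n) if n >= *min && n <= *max => {}
                    Ok(n) => violations.push(Violation {
                        line: line_no,
                        message: format!("{}: {} is outside {}..={}", name, n, min, max),
                    }),
                    Err(_) => violations.push(Violation {
                        line: line_no,
                        message: format!("{}: '{}' is not an integer", name, value),
                    }),
                },
            }
        }
    }
    Ok((checked, violations))
}
```

Because validate accepts any BufRead, the same function works on a file, on standard input, or on a network stream, which is the property the batch-oriented tools above lack.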
Why Rust Specifically
Rust’s performance makes 100k+ records/second validation feasible on a single core. The type system makes schema definition and validation logic safe and expressive. The single-binary compilation means zero runtime dependencies — critical for CI/CD and edge deployment.
Who Would Use This
- Data engineers adding quality gates to CI/CD pipelines
- Teams migrating from batch to streaming who need inline validation
- Anyone who has been bitten by a silent schema change breaking a downstream pipeline
- ML engineers validating feature store data before training
3. ticknorm — Real-Time Market Data Feed Normalizer
⚠️ CONFLICT OF INTEREST NOTICE
This project has been replaced by airpost (see Section 3A below) due to a conflict of interest with LSEG employment. LSEG’s core business is market data normalization and distribution. Building an open-source market data feed normalizer while employed at LSEG risks violating IP assignment and non-compete clauses. This section is retained for reference only.
Gap Strength: Very Strong (No Open-Source Solution Exists)
What Exists Today
Market data normalization is a core infrastructure function in financial services. The landscape is entirely commercial:
- Databento: Provides 15 normalized schemas derived from raw PCAPs with nanosecond precision. Pricing: per-data-unit fees that scale with usage. Not open-source.
- dxFeed: Streams normalized real-time and delayed market data directly from exchanges. Enterprise licensing. Not open-source.
- Atlas Content Platform (Options Technology): A full integrated stack of feed handlers, real-time databases, distribution gateways, and entitlement tools. Enterprise product. Not open-source.
- RedlineFeed (Pico): Normalized low-latency multi-asset class market data. Enterprise product. Not open-source.
- LSEG Real-Time (your own employer): LSEG provides its own normalized data feeds through Refinitiv (now LSEG Data & Analytics). Proprietary.
Where the Gap Actually Is
There is no open-source market data feed normalizer. Period.
Databento itself explains the normalization problem clearly: “Normalization is the process of converting financial data in various source formats from different trading venues, exchanges, or data publishers to a single, standardized format.” Every financial data company builds this internally as proprietary infrastructure. There is no community-shared version.
The closest open-source tools in the Rust ecosystem focus on different problems:
- barter-rs: A Rust framework for live-trading and backtesting. Not a normalization service.
- hftbacktest: High-frequency trading backtesting with tick data support. Consumes normalized data; doesn’t produce it.
- The ohlcv crate: Downloads historical OHLCV data from crypto exchanges. A data downloader, not a normalizer.
Why This Gap Persists
- Commercial incentive: Companies like Databento, dxFeed, and LSEG charge significant fees for normalized data. There’s no business motivation to open-source this.
- Complexity: Real normalization handles exchange-specific quirks (different tick sizes, lot sizes, corporate actions, halt statuses). It’s genuinely hard to build correctly.
- Domain expertise required: You need to understand financial market microstructure to build this, which is rare in the open-source community.
Why Rust Specifically
Tick-by-tick normalization has hard performance requirements. At a busy exchange, you might receive 100,000+ messages per second. Python’s per-object overhead and GIL make this infeasible. Java works (and is commonly used in finance) but is memory-heavy. Rust provides the throughput of C++ with memory safety, and the async ecosystem (Tokio) handles concurrent feed connections elegantly.
Why This Matters for You Specifically
You work at LSEG. Market data normalization is the core of what your employer does. Building a simplified open-source version demonstrates domain understanding that is directly career-relevant. Even a simplified version (simulated feeds, basic normalization rules) teaches the real architectural patterns.
3A. airpost — Real-Time Air Quality Aggregator (Replacement for ticknorm)
Gap Strength: Strong (Fragmented Data, No Unified Open-Source Tool)
What Exists Today
Air quality monitoring is a global concern with multiple data sources, none of which interoperate well:
Government networks:
- India’s CPCB (Central Pollution Control Board): Operates the National Air Quality Monitoring Programme with ~800 stations. Data is published on the CPCB website and app, but the API is undocumented, the data format is inconsistent, and historical data access is limited. Researchers have documented persistent issues with data gaps, delayed updates, and stations going offline without notice. CPCB reports in AQI (a composite index) rather than raw pollutant concentrations, making cross-source comparison difficult without conversion.
- US EPA AirNow: Reports AQI for the United States. Well-documented API, but US-only.
Open data aggregators:
- OpenAQ: The most significant open-source air quality platform. Aggregates data from government networks worldwide into a unified API. However, OpenAQ is a data mirror, not a normalization tool: it stores what each government publishes in (roughly) the format each government provides. Coverage of Indian CPCB data has been intermittent. OpenAQ is a cloud-hosted service, not a self-hostable tool.
- WAQI (World Air Quality Index): Another aggregator that provides AQI data globally via a JSON API. Rate-limited on the free tier. Reports AQI values only (not raw concentrations), making it insufficient for researchers who need µg/m³ measurements.
Citizen science networks:
- PurpleAir: A network of ~30,000 low-cost air quality sensors deployed by individuals worldwide. Growing rapidly in India. PurpleAir sensors report raw PM2.5 particle counts, which require correction factors (the EPA correction factor, or the ALT-CF3 correction) to convert to µg/m³ equivalent. These corrections depend on humidity, which PurpleAir also measures. Without applying corrections, PurpleAir readings can be 30-50% too high. PurpleAir has its own JSON API with a completely different schema from government sources.
- Sensor.Community (formerly Luftdaten): A European-origin citizen science network with growing presence in India. Different API, different data format.
Where the Gap Actually Is
No self-hostable, open-source tool unifies these fragmented sources into a single, normalized, queryable service.
The specific problems:
- Unit incompatibility: CPCB reports AQI (a composite index). PurpleAir reports raw particle counts. OpenAQ reports µg/m³. WAQI reports AQI. A researcher comparing readings from two sensors 500 meters apart (one CPCB, one PurpleAir) must manually convert between AQI and µg/m³ using EPA breakpoint tables, and apply PurpleAir humidity correction factors. No tool does this automatically.
- Schema incompatibility: Every source uses different field names, different station identifiers, different timestamp formats, and different coordinate precision. Joining data across sources requires custom scripts for each combination.
- Quality assessment: PurpleAir sensors can malfunction, be placed indoors, or be affected by nearby cooking. CPCB stations can go offline for days. No tool applies automated quality flagging based on known issues (e.g., PurpleAir sensors reporting in high humidity should be flagged as less reliable).
- Self-hosting: OpenAQ and WAQI are cloud services. Researchers, NGOs, and citizen groups in India who want to run their own monitoring infrastructure, especially for hyperlocal analysis (comparing pollution in one neighborhood to another), need a tool they can deploy on their own servers.
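The breakpoint-table conversion mentioned under unit incompatibility is a piecewise linear interpolation: AQI = (I_hi - I_lo) / (C_hi - C_lo) * (C - C_lo) + I_lo, with the concentration truncated to one decimal place first. The sketch below applies it to PM2.5 using the US EPA 24-hour table as it stood before the 2024 revision; treat the table as an assumption to verify against current EPA guidance, and note that India's national AQI uses different breakpoints. The function name pm25_to_aqi is hypothetical.

```rust
/// US EPA PM2.5 (24-hour) breakpoints as published before the 2024 revision.
/// Columns: low and high concentration in tenths of a µg/m³, then low and high AQI.
/// Tenths are stored as integers so the range lookup avoids float comparisons.
const PM25_BREAKPOINTS: [(i64, i64, i64, i64); 7] = [
    (0, 120, 0, 50),
    (121, 354, 51, 100),
    (355, 554, 101, 150),
    (555, 1504, 151, 200),
    (1505, 2504, 201, 300),
    (2505, 3504, 301, 400),
    (3505, 5004, 401, 500),
];

/// Converts a PM2.5 concentration in µg/m³ to an AQI value.
/// Returns None for negative, non-finite, or off-scale inputs.
pub fn pm25_to_aqi(ug_per_m3: f64) -> Option<u16> {
    if !ug_per_m3.is_finite() || ug_per_m3 < 0.0 {
        return None;
    }
    // The EPA method truncates PM2.5 to one decimal place before the lookup.
    let tenths = (ug_per_m3 * 10.0).floor() as i64;
    PM25_BREAKPOINTS
        .iter()
        .find(|(lo, hi, _, _)| tenths >= *lo && tenths <= *hi)
        .map(|(c_lo, c_hi, i_lo, i_hi)| {
            // Linear interpolation inside the matching breakpoint row.
            let slope = (i_hi - i_lo) as f64 / (c_hi - c_lo) as f64;
            (slope * (tenths - c_lo) as f64 + *i_lo as f64).round() as u16
        })
}
```

Going the other way (AQI back to µg/m³) inverts the same line segment, but only works when the published AQI is known to be driven by PM2.5, since a composite index does not say which pollutant set its value.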
Why This Matters — Especially in India
Air quality in Indian cities has reached crisis levels. Bengaluru, once known for its pleasant climate, has seen AQI regularly exceed 200 (classified as “Very Unhealthy” on the US EPA scale) in winter months. Delhi’s annual smog crisis makes global headlines. The data infrastructure to understand and respond to this crisis is fragmented and inaccessible:
- Citizens cannot easily compare government readings with nearby citizen sensors
- Researchers spend weeks writing data collection scripts before beginning actual analysis
- Journalists investigating pollution patterns need engineering support to aggregate data
- Environmental NGOs lack the technical resources to build and maintain data pipelines
A self-hostable air quality aggregation tool with proper unit normalization, quality flagging, and a clean API would be genuinely useful to all of these groups.
Why Rust Specifically
Air quality monitoring requires polling multiple APIs concurrently on different schedules, handling API rate limits and failures gracefully, and serving normalized data with low latency. Rust’s async ecosystem (Tokio) handles concurrent polling elegantly. The type system ensures unit conversions are correct at compile time (you can’t accidentally store AQI where µg/m³ is expected). The single-binary deployment is ideal for NGOs and researchers who need to deploy on a VPS without complex infrastructure.
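The compile-time unit safety claim rests on the newtype pattern: wrap each unit in its own struct so the compiler treats them as unrelated types. A minimal sketch, with hypothetical type and function names:

```rust
/// A mass concentration in µg/m³. The wrapper makes the unit part of the type.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct MicrogramsPerM3(pub f64);

/// A unitless air quality index value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Aqi(pub u16);

/// A normalized reading always stores a concentration, never an index.
pub struct Reading {
    pub station: String,
    pub pm25: MicrogramsPerM3,
}

/// Averages concentrations; returns None when there are no readings.
pub fn mean_pm25(readings: &[Reading]) -> Option<MicrogramsPerM3> {
    if readings.is_empty() {
        return None;
    }
    let sum: f64 = readings.iter().map(|r| r.pm25.0).sum();
    Some(MicrogramsPerM3(sum / readings.len() as f64))
}

// The following would be rejected by the compiler with a mismatched-types
// error, because Aqi and MicrogramsPerM3 are distinct types:
//
//     Reading { station: "demo".to_string(), pm25: Aqi(153) }
```

The wrappers cost nothing at runtime, and the only way to turn an Aqi into a MicrogramsPerM3 is through an explicit conversion function, which is where the breakpoint logic would live.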
Why It Teaches the Same Concepts as ticknorm
The architecture is structurally identical to ticknorm:
- Multiple concurrent data sources with incompatible formats → Tokio async tasks
- Normalization to a canonical schema → trait-based adapters, unit conversion
- Persistent storage → sqlx with compile-time checked PostgreSQL queries
- Query API → axum HTTP server
- Source resilience → error isolation (one API down doesn’t crash others)
- Structured logging → tracing
The domain is different. The Rust learning is identical.
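The adapter and error-isolation items in that list can be sketched without any external crates. The example below swaps Tokio tasks for standard-library threads so it stays self-contained; the shape (one trait, one worker per source, failures collected instead of propagated) is what would carry over to an async version. All names here, including SourceAdapter, FixedAdapter, and OfflineAdapter, are hypothetical.

```rust
use std::thread;

/// Canonical record that every source is normalized into.
#[derive(Debug, Clone, PartialEq)]
pub struct Observation {
    pub source: &'static str,
    pub station: String,
    pub pm25_ug_m3: f64,
}

/// One implementation per upstream API.
pub trait SourceAdapter: Send {
    fn name(&self) -> &'static str;
    fn fetch(&self) -> Result<Vec<Observation>, String>;
}

/// Stand-in for a healthy source that returns a single fixed reading.
pub struct FixedAdapter {
    pub label: &'static str,
    pub pm25_ug_m3: f64,
}

impl SourceAdapter for FixedAdapter {
    fn name(&self) -> &'static str {
        self.label
    }
    fn fetch(&self) -> Result<Vec<Observation>, String> {
        Ok(vec![Observation {
            source: self.label,
            station: "demo-station".to_string(),
            pm25_ug_m3: self.pm25_ug_m3,
        }])
    }
}

/// Stand-in for a source whose API is currently unreachable.
pub struct OfflineAdapter;

impl SourceAdapter for OfflineAdapter {
    fn name(&self) -> &'static str {
        "offline"
    }
    fn fetch(&self) -> Result<Vec<Observation>, String> {
        Err("connection timed out".to_string())
    }
}

/// Polls every adapter on its own thread. A failing or panicking adapter is
/// recorded in the error list and never prevents the others from reporting.
pub fn poll_all(adapters: Vec<Box<dyn SourceAdapter>>) -> (Vec<Observation>, Vec<String>) {
    let handles: Vec<_> = adapters
        .into_iter()
        .map(|adapter| thread::spawn(move || (adapter.name(), adapter.fetch())))
        .collect();

    let mut observations = Vec::new();
    let mut errors = Vec::new();
    for handle in handles {
        match handle.join() {
            Ok((_, Ok(mut batch))) => observations.append(&mut batch),
            Ok((name, Err(reason))) => errors.push(format!("{}: {}", name, reason)),
            Err(_) => errors.push("adapter thread panicked".to_string()),
        }
    }
    (observations, errors)
}
```

Adding a new source means writing one more impl of the trait; poll_all and everything downstream of Observation stay unchanged.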
4. riskbook — Real-Time Portfolio Risk Engine
Gap Strength: Moderate-Strong (Partial Coverage Exists, But Not as an Integrated Engine)
What Exists Today
This requires an honest accounting, because the Rust finance ecosystem has grown:
Rust crates for options pricing:
- OptionStratLib: A comprehensive Rust library for options trading. Supports Black-Scholes, Binomial Tree, Monte Carlo, and Telegraph Process models. Full Greeks suite including Delta, Gamma, Theta, Vega, Rho, Vanna, Vomma, Charm, and Color. This is a mature library.
- quantrs: Options pricing, portfolio optimization, and risk analysis. Supports European, American, Asian, Rainbow, and Binary options with multiple pricing methods.
- RustQuant: A broader quantitative finance library covering stochastic processes, option pricing, and Monte Carlo simulation.
- QuantMath: Financial maths library for risk-neutral pricing and risk, utilizing ndarray and statrs.
- options-rusty: Options pricing and Greeks algorithms.
C++ ecosystem:
- QuantLib: The 800-pound gorilla. 20+ years of development. Comprehensive coverage of derivatives pricing, term structure modeling, and risk analytics. Available in C++ with Python bindings via QuantLib-Python. However, the official documentation acknowledges that “C++ is known to be challenging for new developers”, and the library’s codebase is enormous and difficult to navigate.
- Open Source Risk Engine (ORE): Built on top of QuantLib, ORE provides exposure calculation and XVA. Notably, LSEG itself provides ORE support as part of their post-trade solutions.
Where the Gap Actually Is
The existing Rust crates solve pricing (given parameters, what is this option worth?). What none of them provide is an integrated portfolio risk engine — a system that:
- Takes a portfolio of mixed instruments (equities + options)
- Ingests streaming market data updates
- Computes portfolio-level risk metrics in real-time: aggregate P&L, Value-at-Risk, concentration risk, marginal contribution to risk
- Updates these metrics as prices change, at tick speed
This is the difference between a calculator and a dashboard. OptionStratLib can price an individual option. It cannot tell you: “your portfolio’s 99% VaR is $8,400, and 63% of your risk comes from your AAPL position, and if AAPL drops 5% your total P&L impact would be $12,300.”
QuantLib + ORE can do this in C++, but the learning curve is measured in months, the codebase is 1M+ lines, and the architecture was designed in the early 2000s.
No Rust crate provides a lightweight, real-time, portfolio-level risk aggregation engine with streaming market data input.
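To show the difference between pricing one instrument and aggregating across a portfolio, here is a small sketch of three portfolio-level quantities: a single-name shock P&L, a concentration share, and a historical-simulation VaR. The Position struct and all function names are hypothetical, positions are treated as linear (no option Greeks), and concentration is measured by market value as a simple proxy rather than by true risk contribution.

```rust
pub struct Position {
    pub symbol: String,
    pub quantity: f64,
    pub price: f64,
}

impl Position {
    pub fn market_value(&self) -> f64 {
        self.quantity * self.price
    }
}

/// Portfolio P&L if `symbol` moves by `shock` (for example -0.05 for a 5% drop)
/// while every other price stays unchanged.
pub fn shock_pnl(positions: &[Position], symbol: &str, shock: f64) -> f64 {
    positions
        .iter()
        .filter(|p| p.symbol == symbol)
        .map(|p| p.market_value() * shock)
        .sum()
}

/// Share of gross market value held in `symbol`.
pub fn concentration(positions: &[Position], symbol: &str) -> f64 {
    let gross: f64 = positions.iter().map(|p| p.market_value().abs()).sum();
    if gross == 0.0 {
        return 0.0;
    }
    let held: f64 = positions
        .iter()
        .filter(|p| p.symbol == symbol)
        .map(|p| p.market_value().abs())
        .sum();
    held / gross
}

/// Historical-simulation VaR: the loss that is exceeded in at most
/// (100 - confidence_pct) percent of the supplied P&L scenarios.
/// Returns None for an empty scenario set or a confidence outside 1..=99.
pub fn historical_var(pnl_scenarios: &[f64], confidence_pct: usize) -> Option<f64> {
    if pnl_scenarios.is_empty() || confidence_pct == 0 || confidence_pct >= 100 {
        return None;
    }
    let mut sorted = pnl_scenarios.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b)); // worst outcome first
    let index = sorted.len() * (100 - confidence_pct) / 100;
    Some(-sorted[index])
}
```

A real-time engine would recompute these incrementally on each tick rather than rescanning every position, but the outputs are the same kind of numbers the dashboard example above describes.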
Why Rust Specifically
Real-time risk computation at tick speed requires processing thousands of positions against every market data update. Python can do end-of-day batch risk. Java can do near-real-time. Rust can do true real-time — sub-millisecond portfolio risk updates — which is what trading desks need.
Honest Caveat
If your goal were purely to use an options pricing library, OptionStratLib or quantrs would be sufficient. The justification for riskbook is the integrated portfolio engine that doesn’t exist, plus the educational value of implementing Black-Scholes from the formula yourself (which is the whole point of this learning program).
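For reference, this is roughly what "implementing Black-Scholes from the formula" comes down to for a European call on a non-dividend-paying asset. The standard library has no error function, so the sketch uses the Abramowitz and Stegun 7.1.26 polynomial approximation for the normal CDF (absolute error around 1.5e-7); the function names are hypothetical.

```rust
/// Standard normal CDF, using the Abramowitz and Stegun 7.1.26 approximation
/// of erf. Accurate to roughly 1.5e-7, which is plenty for a learning exercise.
fn norm_cdf(x: f64) -> f64 {
    let z = x.abs() / std::f64::consts::SQRT_2;
    let t = 1.0 / (1.0 + 0.3275911 * z);
    let poly = t
        * (0.254829592
            + t * (-0.284496736 + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
    let erf = 1.0 - poly * (-z * z).exp();
    if x >= 0.0 {
        0.5 * (1.0 + erf)
    } else {
        0.5 * (1.0 - erf)
    }
}

/// d1 term of the Black-Scholes formula.
fn d1(spot: f64, strike: f64, rate: f64, vol: f64, years: f64) -> f64 {
    ((spot / strike).ln() + (rate + 0.5 * vol * vol) * years) / (vol * years.sqrt())
}

/// Price of a European call with no dividends.
/// `rate` and `vol` are annualized decimals (0.05 means 5%).
pub fn black_scholes_call(spot: f64, strike: f64, rate: f64, vol: f64, years: f64) -> f64 {
    let d1 = d1(spot, strike, rate, vol, years);
    let d2 = d1 - vol * years.sqrt();
    spot * norm_cdf(d1) - strike * (-rate * years).exp() * norm_cdf(d2)
}

/// Delta of the same call: the sensitivity of its price to the spot price.
pub fn call_delta(spot: f64, strike: f64, rate: f64, vol: f64, years: f64) -> f64 {
    norm_cdf(d1(spot, strike, rate, vol, years))
}
```

With spot 100, strike 100, rate 5%, volatility 20%, and one year to expiry, the call price comes out at about 10.45 and the delta at about 0.637, which matches the usual textbook values.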
5. xlsxfmt — High-Fidelity Bidirectional Excel Processing Library
Gap Strength: Strong (Clearly Documented, Widely Felt)
What Exists Today
The Rust XLSX ecosystem has a well-known structural problem. Here’s the exact landscape:
- calamine (by tafia): The standard Rust XLSX reader. Mature, fast, well-maintained. But its own documentation states explicitly: calamine “does not support reading extra content, such as formatting, excel parameters, encrypted components” and is “read-only”. It reads cell values: numbers, strings, dates. It discards all formatting information entirely: no fonts, no colors, no borders, no merged cells, no number formats.
- rust_xlsxwriter (by jmcnamara): The standard Rust XLSX writer. Excellent quality: the author also maintains the widely-used Python xlsxwriter and Perl Excel::Writer::XLSX. Supports full formatting, charts, conditional formatting, sparklines, images. But it is write-only: it creates new files from scratch. It cannot read an existing XLSX file.
- edit-xlsx (by MortalreminderPT): The only Rust crate that attempts bidirectional XLSX operations (read + modify + write). However, it has documented stability issues: a June 2024 bug report shows the library panicking with ParseIntError on certain real-world Excel files. The crate has ~2,500 total downloads on crates.io (compared to calamine’s 15M+), indicating limited adoption and testing.
- rust-excel-core: A recent composite library that wraps calamine (for reading), rust_xlsxwriter (for writing), and umya-spreadsheet (for mutation). A pragmatic workaround, but it’s glue code over three different libraries with different data models, not a unified solution.
Where the Gap Actually Is
The Rust community has discussed this gap explicitly. A Rust Users forum thread titled “Crate for reading and writing excel files” from 2023 captures the exact problem: people want to read an existing Excel file, modify some cells, and write it back — the most basic document processing operation — and there is no production-quality crate that does this with formatting preservation.
The fundamental operation that every document processing pipeline needs — read, modify, write with formatting intact — does not have a reliable, single-crate solution in Rust.
In the Python ecosystem, openpyxl handles this (read with formatting, modify, save). In Java, Apache POI handles it. In C#, ClosedXML handles it. In Rust, you either lose formatting (calamine → rust_xlsxwriter), use an unstable crate (edit-xlsx), or assemble a Frankenstein from three libraries (rust-excel-core).
Why This Matters at Scale
The XLSX format is the lingua franca of business data. Financial reports, regulatory filings, data exports, template-based report generation — all of these involve reading XLSX files, modifying values, and saving with formatting preserved. Your own work at LSEG (the Excel-to-PDF renderer) demonstrates this need firsthand. Every company that processes Excel files at scale hits this wall.
Why Rust Specifically
XLSX processing at scale is CPU and memory intensive (XLSX is a ZIP of XML files — parsing XML is expensive). Rust’s zero-cost abstractions and lack of GC make it ideal for high-throughput document processing. A well-built Rust XLSX library could be orders of magnitude faster than openpyxl, which is the current Python standard.
THREE CAPSTONE OPTIONS:
docforge (Section 6, Option A), formulaengine (Section 6A, Option B), and ruleforge (Section 6B, Option C). All three have strong real-world gaps. Choose one.
6. docforge — Document Conversion Library (XLSX/DOCX → PDF)
Gap Strength: Very Strong (The Current Standard Is Universally Hated)
What Exists Today
Document conversion (Office formats → PDF) is one of the most universally needed operations in software. The landscape is bleak:
The incumbent: LibreOffice headless
LibreOffice in headless mode is the de facto standard for open-source document conversion. It is also universally acknowledged as problematic:
- Startup overhead: LibreOffice adds approximately 2-3 seconds of startup time per conversion when a new instance is spawned. For batch processing, this overhead multiplies. This is documented in the Gotenberg project (a popular Go-based document conversion server that wraps LibreOffice).
- Formatting fidelity issues: The Ask LibreOffice forums contain hundreds of threads documenting conversion issues. Common problems include: single-page Excel content broken into multiple PDF pages, font substitution causing layout shifts, formula cells rendering as zeroes, and headless mode producing different results than the GUI.
- Stability issues at scale: The Gotenberg project documents that running multiple instances of headless LibreOffice inside a container leads to “memory leaks and zombie processes,” and running LibreOffice in server mode is “even more unstable.”
- Resource footprint: A LibreOffice installation is 500MB+. It includes an entire office suite (word processor, spreadsheet, presentation, drawing, database) when you only need the conversion engine.
- Sequential processing: Each LibreOffice instance handles one conversion at a time. Parallel processing requires multiple instances, each consuming hundreds of MB of RAM.
Your own evidence: At LSEG, your team measured 700 seconds for a batch conversion job. Your Python replacement (which bypasses LibreOffice entirely) brought this to 3.5 seconds — a 200x improvement. This is not a theoretical problem.
Other options and why they don’t solve it:
- wkhtmltopdf: Converts HTML to PDF. Abandoned/archived as of 2023. Only handles HTML, not XLSX or DOCX.
- Headless Chrome / Puppeteer: Converts HTML to PDF. Requires a full Chrome installation (~200MB). Cannot natively handle XLSX or DOCX; you’d need to convert to HTML first, losing formatting.
- Gotenberg: A Go-based API server that wraps LibreOffice (and Chrome). Well-engineered API layer, but the conversion engine underneath is still LibreOffice with all its problems.
- Aspose (commercial): High-quality .NET-based document conversion. Expensive commercial license ($1,000+/year per developer). Not open-source.
- Python libraries (python-docx + reportlab, openpyxl + fpdf): Can technically read Office formats and generate PDFs, but at Python speed, with significant formatting loss, and requiring a full Python environment.
Where the Gap Actually Is
There is no native-code, open-source library in any modern language that converts XLSX or DOCX to PDF with formatting fidelity, without depending on LibreOffice.
Not in Rust. Not in Go. Not in C++ (outside of LibreOffice itself). Not in Java (outside of Apache POI + iText, where iText is AGPL-licensed unless you buy a commercial license).
This is arguably the strongest gap in this entire project list. Every developer, every DevOps engineer, every data pipeline that needs to generate PDFs from Office documents currently depends on a 500MB LibreOffice installation that is slow, has formatting bugs, leaks memory at scale, and produces different output in headless mode than it does in GUI mode.
The SaaS Opportunity
Document conversion APIs are a real, revenue-generating product category:
- CloudConvert: Charges per conversion, with plans starting at $8/month for 500 conversions
- Zamzar: API plans from $25/month
- Adobe Document Services: Enterprise pricing
- Aspose Cloud: Per-conversion pricing
These services exist because the problem is real and businesses will pay to not deal with LibreOffice. A Rust-based conversion engine that runs as a single binary, handles concurrent conversions, and produces faithful output would be a credible competitor — especially with a PyO3 binding for Python users and a REST API wrapper for SaaS deployment.
Why Rust Specifically
Document conversion is CPU-bound (XML parsing, layout computation, PDF rendering) and benefits enormously from:
- No GC pauses: Consistent conversion times, critical for SLA-bound API services
- Parallelism: Safe concurrent conversions without LibreOffice’s “one-at-a-time” limitation
- Small binary: A Rust conversion binary could be <20MB vs. LibreOffice’s 500MB+
- Deployment simplicity: Single binary, no runtime dependencies, works in Docker alpine images
6A. formulaengine — Spreadsheet Formula Computation Engine (Capstone — Primary)
Gap Strength: Very Strong (Nothing Exists as a Standalone Library in Any Systems Language)
What Exists Today
Spreadsheet formula evaluation is one of the most universally needed computational capabilities in software. Every product with “calculated fields” needs it. The landscape:
Full spreadsheet applications (not embeddable):
- LibreOffice Calc / Excel: Complete spreadsheet applications with formula engines deeply coupled to their UI, file format handling, and rendering. You cannot extract the formula engine and embed it in your own application. Using LibreOffice as a formula evaluator means spawning a 500MB headless process to compute =SUM(A1:A10).
- Gnumeric: Open-source spreadsheet with a capable formula engine, but it’s a GTK application written in C. The formula engine is intertwined with the UI layer and cannot be used as a standalone library.
JavaScript libraries (single-threaded, browser-focused):
- HyperFormula (Handsontable): The closest thing to a standalone formula engine. It’s a JavaScript/TypeScript library that evaluates Excel-like formulas. However: it’s single-threaded (JavaScript), designed for browser use, and benchmarks show it handles ~100k cells but struggles beyond that. Not suitable for server-side batch processing of large spreadsheets.
- FormulaJS: A JavaScript implementation of Excel functions. It evaluates individual formulas but has no dependency graph, no incremental recalculation, and no cell reference resolution. It’s a function library, not an engine.
Python libraries (slow, partial):
- formulas (PyPI): A Python library that parses and evaluates Excel formulas. It supports a subset of Excel functions and can build dependency graphs. However, it’s pure Python (slow), has limited function coverage, and is primarily designed for offline model calculation rather than real-time evaluation.
- openpyxl: Can read formulas from XLSX files but does not evaluate them. It reads the cached value (last computed by Excel) but cannot compute formula results itself.
- xlcalc: Another Python formula evaluator. Supports basic formulas but hasn’t been updated since 2022 and has limited function coverage.
What doesn’t exist:
There is no standalone, embeddable formula evaluation library in Rust, Go, C++, or Java that provides:
- A formula parser (tokenizer → AST)
- A dependency graph with topological sort
- Circular reference detection
- Incremental recalculation (only recompute what changed)
- A comprehensive function library (40+ functions)
- The ability to embed in any application via library linking or FFI
This is a foundational piece of infrastructure that every “smart spreadsheet” product rebuilds from scratch internally.
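Two of the items in that list, the dependency graph with topological sort and circular reference detection, fall out of the same algorithm. The sketch below uses Kahn's algorithm: cells with no unevaluated precedents are emitted first, and any cell never emitted must sit on or downstream of a cycle. The function name evaluation_order and the map-of-precedents input are hypothetical; a real engine would build that map from parsed formulas.

```rust
use std::collections::{BTreeMap, VecDeque};

/// `deps` maps each formula cell to the cells it reads (its precedents).
/// Returns a safe evaluation order, or Err with the cells that can never be
/// evaluated because they are part of, or depend on, a circular reference.
pub fn evaluation_order(deps: &BTreeMap<&str, Vec<&str>>) -> Result<Vec<String>, Vec<String>> {
    // indegree[cell]   = number of precedents not yet evaluated
    // dependents[cell] = cells whose formulas read `cell`
    let mut indegree: BTreeMap<&str, usize> = BTreeMap::new();
    let mut dependents: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for (&cell, precedents) in deps {
        *indegree.entry(cell).or_insert(0) += precedents.len();
        for &p in precedents {
            indegree.entry(p).or_insert(0);
            dependents.entry(p).or_default().push(cell);
        }
    }

    // Start with cells that depend on nothing (plain values).
    let mut ready: VecDeque<&str> = indegree
        .iter()
        .filter(|(_, d)| **d == 0)
        .map(|(&c, _)| c)
        .collect();

    let mut order = Vec::new();
    while let Some(cell) = ready.pop_front() {
        order.push(cell.to_string());
        for &d in dependents.get(cell).map(Vec::as_slice).unwrap_or(&[]) {
            let remaining = indegree.get_mut(d).unwrap();
            *remaining -= 1;
            if *remaining == 0 {
                ready.push_back(d);
            }
        }
    }

    if order.len() == indegree.len() {
        Ok(order)
    } else {
        // Anything still waiting on a precedent is blocked by a cycle.
        Err(indegree
            .into_iter()
            .filter(|&(_, d)| d > 0)
            .map(|(c, _)| c.to_string())
            .collect())
    }
}
```

For a sheet where B1 reads A1 and C1 reads both A1 and B1, the order is A1, B1, C1. If X reads Y and Y reads X, neither ever reaches zero precedents, so both come back in the error list.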
Where the Gap Actually Is
The gap is at the infrastructure layer. Companies building products with formula capabilities (Airtable, Notion, Rows.com, Causal.app, Equals.app, Sigma Computing, every internal business tool with calculated fields) each implement their own formula engine internally. There is no shared, high-quality, open-source engine they can embed.
This is analogous to how SQLite provides an embeddable database engine — before SQLite, every application that needed local data storage rolled its own file format. formulaengine would be the “SQLite of formula evaluation”: a reliable, fast, embeddable engine that any application can link against.
The SaaS Opportunity
Formula-as-a-Service is an emerging category:
- Rows.com: Raised $16M building a “spreadsheet with superpowers” — their core technology is a formula engine connected to APIs
- Causal.app: Financial modeling tool built around formula evaluation, acquired by LucaNet in 2024
- Equals.app: Spreadsheet for data teams — their differentiator is formula performance on large datasets
- Sigma Computing: $1B+ valuation, cloud-native spreadsheet interface for databases
These companies each built their own formula engine. A high-performance, embeddable Rust engine with Python bindings would be immediately useful to:
- SaaS companies building spreadsheet-like products
- Data teams needing to evaluate Excel formulas in pipelines (without Excel)
- Financial modeling tools needing fast, server-side formula computation
- Report generators that process XLSX templates with formulas
Why Rust Specifically
Formula evaluation benefits enormously from Rust’s strengths:
- Parsing performance: Tokenizer and parser with zero-copy string handling — critical for large spreadsheets
- Graph algorithms: Dependency graph with topological sort, cycle detection — Rust’s ownership model prevents dangling references in the graph
- Incremental computation: Only recalculate dirty cells — Rust’s fine-grained ownership makes it natural to track what’s changed
- No GC pauses: Consistent evaluation times, critical for interactive applications
- Embeddability: Compiles to a native library with C FFI — can be embedded in any language. PyO3 gives first-class Python support.
- Pairs with xlsxfmt: Together, they form a complete spreadsheet toolkit: read/write the file format (xlsxfmt) and compute the values (formulaengine). This combination does not exist in any language.
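The dependency-graph point above can be made concrete with a small sketch. It assumes a hypothetical representation in which each formula cell maps to the cells it reads; the function name `recalc_order` and the string cell IDs are illustrative and are not taken from formulaengine or any existing crate. It uses Kahn's algorithm: cells with no unevaluated precedents are processed first, and any cell that never becomes ready is reported as part of (or downstream of) a circular reference.

```rust
use std::collections::{BTreeMap, VecDeque};

/// Computes a recalculation order for a sheet (illustrative sketch only).
/// `deps` maps each formula cell to the cells it reads (its precedents).
/// Ok(order): every cell appears after all of its precedents.
/// Err(stuck): cells that sit on, or downstream of, a circular reference.
pub fn recalc_order<'a>(
    deps: &BTreeMap<&'a str, Vec<&'a str>>,
) -> Result<Vec<&'a str>, Vec<&'a str>> {
    // pending[c] = number of precedents of c not yet evaluated.
    let mut pending: BTreeMap<&str, usize> = BTreeMap::new();
    // dependents[p] = cells whose formulas read p.
    let mut dependents: BTreeMap<&str, Vec<&str>> = BTreeMap::new();

    for (&cell, precedents) in deps {
        pending.entry(cell).or_insert(0);
        for &p in precedents {
            pending.entry(p).or_insert(0);
            *pending.entry(cell).or_insert(0) += 1;
            dependents.entry(p).or_default().push(cell);
        }
    }

    // Start with cells that read nothing (constants and leaf inputs).
    let mut ready: VecDeque<&str> = pending
        .iter()
        .filter(|(_, n)| **n == 0)
        .map(|(c, _)| *c)
        .collect();

    let mut order = Vec::with_capacity(pending.len());
    while let Some(cell) = ready.pop_front() {
        order.push(cell);
        if let Some(readers) = dependents.get(cell) {
            for &r in readers {
                let n = pending.get_mut(r).expect("every cell was registered");
                *n -= 1;
                if *n == 0 {
                    ready.push_back(r);
                }
            }
        }
    }

    if order.len() == pending.len() {
        Ok(order)
    } else {
        // Anything still waiting can never be evaluated: a cycle exists.
        Err(pending
            .into_iter()
            .filter(|(_, n)| *n > 0)
            .map(|(c, _)| c)
            .collect())
    }
}
```

For a sheet where B1 reads A1 and C1 reads both, this returns A1, B1, C1. An incremental engine could plausibly run the same walk over only the dirty subgraph, though that refinement is not shown here.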
6B. ruleforge — Embeddable Business Rules Engine (Capstone — Option C)
Gap Strength: Very Strong (Complete Absence in Rust, Only JVM Alternative Is Heavyweight)
What Exists Today
Business rules engines externalize decision logic — pricing, eligibility, compliance, routing — from application code into a readable format that can be modified at runtime without redeploying. This is critical infrastructure in banking, insurance, healthcare, and e-commerce.
The incumbent: Drools (Java)
Drools has dominated the open-source rules engine space for over 20 years. It’s part of the KIE (Knowledge Is Everything) suite, long stewarded by Red Hat and since donated to the Apache Software Foundation. Drools is powerful but:
- Requires JVM: Cannot be embedded in non-JVM applications without significant overhead. Deploying Drools means deploying a Java runtime.
- Heavyweight: The Drools engine plus dependencies is tens of megabytes. The learning curve is steep — it uses its own DRL (Drools Rule Language) plus MVEL expressions, and the documentation assumes enterprise Java familiarity.
- Overkill for microservices: Drools was designed for monolithic enterprise applications. Using it in a lightweight microservice means pulling in a massive dependency for what might be 20 rules.
- Not embeddable via FFI: You cannot link Drools into a Rust, Go, Python, or C++ application. You must run it as a separate JVM process or use REST API wrappers.
Open Policy Agent (Go)
OPA is a CNCF-graduated project focused on policy evaluation (authorization, infrastructure policy, API access control). It uses Rego, a purpose-built query language:
- Designed for policy, not business rules: OPA excels at “is this user allowed to do X?” but is not designed for general business logic like pricing calculations, discount tiers, or loan eligibility scoring.
- No rule actions: Rego can evaluate conditions and return decisions, including computed values, but it has no notion of actions that modify shared facts, like “set discount = 0.15” or “flag for review.” It’s a query language, not a rules engine.
- No rule chaining: OPA evaluates policies independently. It doesn’t support forward chaining where Rule A’s output becomes Rule B’s input.
- Go-only embedding: OPA can be embedded in Go applications natively, but other languages must use REST API or WASM (with limitations).
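To illustrate what "actions" and "rule chaining" mean in the list above, here is a minimal forward-chaining sketch. Everything in it is an assumption made for illustration: the `Rule` struct, the `run` function, the fact names, and the thresholds are hypothetical and do not describe ruleforge's eventual design or any existing engine. Each rule pairs a condition with an action that mutates shared facts, and the loop repeats until a full pass fires nothing, so one rule's output can trigger another.

```rust
use std::collections::BTreeMap;

/// Shared working memory: named numeric facts (a deliberate simplification).
pub type Facts = BTreeMap<&'static str, f64>;

/// A rule is a condition plus an action that may add or change facts.
pub struct Rule {
    pub name: &'static str,
    pub when: fn(&Facts) -> bool,
    pub then: fn(&mut Facts),
}

/// Fires rules until a full pass fires nothing or `max_passes` is reached.
/// Each rule fires at most once, which also guards against infinite loops.
/// Returns the names of fired rules in firing order (a tiny audit trail).
pub fn run(rules: &[Rule], facts: &mut Facts, max_passes: usize) -> Vec<&'static str> {
    let mut fired: Vec<&'static str> = Vec::new();
    for _ in 0..max_passes {
        let mut progressed = false;
        for rule in rules {
            if !fired.contains(&rule.name) && (rule.when)(facts) {
                (rule.then)(facts);
                fired.push(rule.name);
                progressed = true;
            }
        }
        if !progressed {
            break; // fixpoint: no rule has anything left to do
        }
    }
    fired
}

/// Hypothetical pricing rules. The review rule is listed first on purpose:
/// it can only fire on a later pass, after the discount rule has set a fact.
pub fn pricing_rules() -> Vec<Rule> {
    vec![
        Rule {
            name: "flag_for_review",
            when: |f: &Facts| f.get("discount").is_some_and(|d| *d >= 0.15),
            then: |f: &mut Facts| {
                f.insert("needs_review", 1.0);
            },
        },
        Rule {
            name: "gold_discount",
            when: |f: &Facts| f.get("order_total").is_some_and(|t| *t > 1000.0),
            then: |f: &mut Facts| {
                f.insert("discount", 0.15);
            },
        },
    ]
}
```

Starting from a single fact, `order_total = 1200.0`, the first pass fires `gold_discount` and the second pass fires `flag_for_review`, because the discount fact now exists. A production engine would use a proper matching algorithm and a conflict-resolution strategy rather than this list scan.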
JavaScript options
- json-rules-engine: A lightweight JSON-based engine. Decent for simple conditions but no DSL (rules are JSON objects, not human-readable), no rule chaining, no conflict resolution beyond “all matching rules fire,” limited operator set. Stars: ~2.5k on GitHub — popular because there’s nothing better in the JS ecosystem, not because it’s good.
- nools: Node.js rules engine inspired by Drools. Abandoned since 2016.
Go options
- Grule: Go rules engine inspired by Drools. Active development but limited documentation, small community, and missing features like audit trails and explain mode.
Rust options
None turned up in this analysis. A search for “rules engine” on crates.io returned no meaningful results: no Rust crate, experimental or otherwise, that provides a business rules engine with a DSL, evaluation engine, and conflict resolution.
Where the Gap Actually Is
The gap exists on two axes:
- Language gap: There is no rules engine in Rust. Period. Not even a basic one. Anyone building a Rust application that needs externalized business rules has zero options: they either embed a JVM for Drools, call OPA over REST, or hardcode the logic.
- Architecture gap: Even across all languages, there is no lightweight, embeddable rules engine designed for the cloud-native era. Drools is JVM-heavy. OPA is policy-specific. json-rules-engine is too basic. The market needs a rules engine that compiles to a single binary, embeds via FFI into any language, evaluates rules in microseconds, and includes an audit trail for compliance.
This gap is particularly acute in regulated industries (banking, insurance, healthcare) where:
- Business rules change frequently (new regulations, new products, updated risk models)
- Audit trails are legally required (who changed what rule, when, and what decisions it affected)
- Performance matters (real-time loan decisioning, insurance claim processing)
- Deployment must be lightweight (microservices, serverless, edge computing)
The SaaS Opportunity
Rules-as-a-Service is an active commercial category:
- InRule: Enterprise rules engine, pricing starts at tens of thousands per year
- Progress Corticon: Business rules platform, enterprise pricing
- IBM Operational Decision Manager (formerly ILOG): IBM’s enterprise decision management platform, enterprise pricing
- LaunchDarkly: While primarily feature flags, their “targeting rules” are essentially a simplified rules engine — valued at $3B+
These products exist because enterprises will pay significant money to externalize and manage business logic. An open-source Rust engine with Python bindings would be immediately attractive to:
- Fintech companies building lending, insurance, or payment platforms
- SaaS products needing configurable business logic without redeployment
- Compliance teams needing auditable, versionable rule sets
- Data pipeline teams needing configurable data routing and transformation rules
Why Rust Specifically
A rules engine benefits uniquely from Rust’s capabilities:
- Parsing performance: Tokenizer and parser with zero-copy string handling — rules are parsed once and evaluated millions of times
- Evaluation speed: Pattern matching and condition evaluation in tight loops with zero allocation — critical for high-throughput rule evaluation (payment processing, real-time pricing)
- Embeddability: Compiles to a native library with C FFI — embed in Python (PyO3), Node.js (napi-rs), Go (cgo), or any language. No JVM required.
- Safety: Rule chaining with loop detection benefits from Rust’s ownership model — no dangling references in the rule dependency graph
- Small binary: A Rust rules engine could be <5MB vs. Drools’ JVM requirement (200MB+)
- WASM target: Rules engine compiled to WASM runs in browsers, edge functions, and serverless — evaluate rules client-side without a server round-trip
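The embeddability claim above rests on Rust's ability to export functions with the C ABI. The sketch below shows the general shape of such an export. The function name `ruleforge_discount_for_tier`, the tier strings, the rates, and the `-1.0` error convention are all hypothetical stand-ins for a real rule evaluation; only the FFI mechanics are the point. The `#[unsafe(no_mangle)]` spelling assumes Rust 1.82 or later (older toolchains write `#[no_mangle]`).

```rust
use std::ffi::{c_char, CStr};

/// Hypothetical C-callable entry point (illustration only).
/// Takes a NUL-terminated tier name and returns a discount rate,
/// or -1.0 if the pointer is null or the bytes are not valid UTF-8.
///
/// # Safety
/// `tier` must be null or point to a valid NUL-terminated string that
/// stays alive for the duration of the call.
#[unsafe(no_mangle)] // keep the symbol name stable so C, Python, or Go can find it
pub unsafe extern "C" fn ruleforge_discount_for_tier(tier: *const c_char) -> f64 {
    if tier.is_null() {
        return -1.0;
    }
    // Borrow the caller's bytes without copying them.
    let tier = match unsafe { CStr::from_ptr(tier) }.to_str() {
        Ok(s) => s,
        Err(_) => return -1.0,
    };
    // Stand-in for real rule evaluation.
    match tier {
        "gold" => 0.15,
        "silver" => 0.05,
        _ => 0.0,
    }
}
```

Built as a `cdylib`, a library exposing functions like this can be loaded from C, from Go through cgo, or from Python. For Python, PyO3 offers a more ergonomic route than calling a raw C ABI by hand.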
Summary: Gap Strength by Project
| Project | Gap Strength | Existing Landscape | What’s Missing |
|---|---|---|---|
| csvq | Partial | qsv, xan, xsv2 exist | Unified query-style CLI (vs. pipe-chaining subcommands) |
| schemaguard | Strong | Great Expectations (batch, slow), Soda (SQL-native) | Fast, standalone streaming validator CLI |
| ticknorm | | | |
| airpost | Strong | OpenAQ (cloud-only mirror), CPCB (inconsistent), PurpleAir (raw/uncorrected) | Self-hostable tool unifying fragmented air quality sources with unit normalization |
| riskbook | Moderate-Strong | OptionStratLib, quantrs (pricing libs) | Integrated portfolio risk engine with streaming updates |
| xlsxfmt | Strong | calamine (read-only, no fmt), rust_xlsxwriter (write-only), edit-xlsx (unstable) | Reliable bidirectional XLSX with formatting |
| docforge (Option A) | Very Strong | LibreOffice headless (slow, buggy, 500MB) | Native-code document conversion without LibreOffice |
| formulaengine (Option B) | Very Strong | HyperFormula (JS, single-threaded), formulas (Python, slow), nothing in Rust/Go/C++ | Standalone embeddable formula evaluation engine |
| ruleforge (Option C) | Very Strong | Drools (JVM-heavy), OPA (policy-only), json-rules-engine (basic), nothing in Rust | Lightweight embeddable business rules engine |
Sources
- xsv GitHub — unmaintained notice
- qsv — maintained fork of xsv
- xsv2 — minimalist fork
- Great Expectations — streaming support discussion
- Soda Core vs. Great Expectations comparison
- Great Expectations vs Deequ vs Soda comparison
- Databento — market data normalization explained
- dxFeed — commercial real-time data
- Atlas Content Platform
- RedlineFeed
- OptionStratLib — Rust options library
- quantrs — Rust quant library
- RustQuant
- QuantMath
- QuantLib
- Open Source Risk Engine (ORE)
- LSEG ORE support
- QuantLib ecosystem complexity
- calamine — read-only, no formatting
- rust_xlsxwriter — write-only
- edit-xlsx — panic issue on complex files
- edit-xlsx crate
- Rust Users forum — bidirectional XLSX discussion
- Gotenberg — LibreOffice performance issues
- LibreOffice — XLSX to PDF pagination bug
- LibreOffice — headless vs GUI output difference
- LibreOffice — formula cells rendering as zeroes
- Rust finance crates listing
- OpenAQ — open air quality data platform
- OpenAQ API documentation
- CPCB India — Central Pollution Control Board real-time data
- PurpleAir — low-cost sensor network
- PurpleAir correction factors — EPA study
- WAQI — World Air Quality Index project
- Sensor.Community — open environmental data
- EPA AQI breakpoint table
- AQI calculation methodology — EPA Technical Assistance Document
- India NAQI — National Air Quality Index methodology
- HyperFormula — JavaScript formula engine by Handsontable
- HyperFormula GitHub
- FormulaJS — JavaScript Excel function library
- formulas — Python Excel formula evaluator
- xlcalc — Python Excel calculator
- openpyxl — does not evaluate formulas
- Rows.com — spreadsheet startup ($16M raised)
- Crafting Interpreters — Robert Nystrom (parser/evaluator reference)
- petgraph — Rust graph data structure library
- Drools — JVM rules engine
- Open Policy Agent (OPA)
- OPA Rego language reference
- json-rules-engine — JavaScript rules engine
- nools — abandoned Node.js rules engine
- Grule — Go rules engine
- crates.io search: rules engine (no results)
- InRule — commercial rules engine
- LaunchDarkly — feature flags and targeting rules ($3B valuation)