Why Specwright¶
The Problem¶
Modern software development increasingly involves LLMs generating code. This creates a fundamental trust problem: how do you know the generated code is correct?
The traditional answer — code review — doesn't scale. Reviewing LLM-generated code line by line defeats the purpose of using LLMs. And automated testing alone isn't enough, because someone still needs to decide what to test.
The Approach¶
Specwright inverts the problem. Instead of reviewing generated code, you define contracts that the code must satisfy, and the framework enforces those contracts automatically.
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Human │ │ LLM │ │ Specwright │
│ │ │ │ │ │
│ Writes: │────▶│ Writes: │────▶│ Enforces: │
│ - Types │ │ - Logic │ │ - Types │
│ - Docstring │ │ - Algorithm │ │ - Contracts │
│ - States │ │ - Tests │ │ - States │
│ - Tests req │ │ │ │ - Coverage │
└─────────────┘ └─────────────┘ └─────────────┘
The human stays in control of what the software does. The LLM handles how. The framework bridges the two.
Three Layers of Enforcement¶
Layer 1: Decoration Time¶
When a module is imported, Specwright validates:
- Every
@specfunction has complete type annotations - Every
@specfunction has a docstring - Every
StateMachinesubclass has valid states and transitions
These checks catch structural problems immediately — before any code runs.
Layer 2: Runtime¶
When functions are called, Specwright validates:
- Input arguments match declared type hints
- Return values match declared return types
- State transitions are valid for the current state
These checks catch behavioral problems as they happen.
Layer 3: Test Time¶
When pytest collects tests, Specwright validates:
- Every
@requires_testsfunction has the expected test functions - Test names follow the naming convention
This catches coverage gaps before the test suite runs.
Why Not Just Use Type Checkers?¶
Tools like mypy catch type errors statically — at analysis time. Specwright catches them dynamically — at runtime. Both are valuable, but they serve different purposes:
| Feature | mypy | Specwright |
|---|---|---|
| When | Static analysis | Runtime |
| Catches | Type mismatches in your code | Type mismatches in generated or dynamic code |
| Requires | Running a separate tool | Nothing — works at import time |
| Complex types | Full support | Full support (via Pydantic) |
| Documentation | No | Yes — docstrings enforced, metadata generated |
| State machines | No | Yes — validated transitions |
| Test requirements | No | Yes — enforced coverage |
Specwright and mypy are complementary. Use both.
Why Not Just Write Tests?¶
Tests verify specific scenarios. Specwright enforces structural invariants:
- "This function always takes an int and returns a str" (not just in the test cases you wrote)
- "This state machine never transitions from shipped to cancelled" (not just when you test it)
- "This function has tests for all edge cases" (not just the ones you remember)
Tests tell you "this specific input produces this specific output." Specwright tells you "this function's contract is always satisfied."
The Specification-First Workflow¶
- Define the spec — Type hints, docstrings, state machines, test requirements
- Generate the implementation — Ask an LLM, or write it yourself
- Run and validate — Specwright enforces the spec at every level
- Iterate — If the implementation is wrong, the spec catches it
This workflow works because specs are concise (a few lines of type hints and a docstring) but powerful (they constrain the entire behavior of the function). You write 5 lines of spec, the LLM writes 50 lines of implementation, and Specwright ensures the 50 lines satisfy the 5.