Proof

Proof, not benchmark theater.

Most AI-coding numbers are unaudited and unrepeatable. We hold ourselves to a higher bar: a result only counts if it carries a real tool-call trace. Every headline run is bit-reproducible. We publish exactly one performance number: scoped, labeled, and honest.

The one citable number

We publish a single performance claim, and we name it precisely. It is a frozen benchmark artifact, scoped to file localization: which file the answer lives in.

Strict file-localization F1

+89%

Kin scores 0.219 versus a tuned BM25 + cross-encoder baseline at 0.116.

Suite: Frozen, 26 tasks
Reproducibility: n=3, bit-identical
Scope: File localization
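To check the headline arithmetic yourself, here is a minimal sketch. It assumes "strict file-localization F1" means set-based F1 over predicted versus gold file paths; this page does not spell out the metric, so treat `file_f1` as an illustrative definition rather than the harness's scorer. Both function names are hypothetical.

```python
def file_f1(predicted: set[str], gold: set[str]) -> float:
    """Set-based F1 over file paths (assumed definition, not confirmed by the source)."""
    hits = len(predicted & gold)
    if hits == 0:
        return 0.0  # covers empty predictions, empty gold, and zero overlap
    precision = hits / len(predicted)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def relative_gain(score: float, baseline: float) -> float:
    """Relative improvement of score over baseline, as a fraction."""
    return (score - baseline) / baseline


# The two published scores: Kin 0.219, baseline 0.116.
gain = relative_gain(0.219, 0.116)
print(f"{gain:+.0%}")  # +89%
```

The gain is relative to the baseline: (0.219 - 0.116) / 0.116 is roughly 0.888, which rounds to +89%.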

The baseline is deliberately strong. BM25 + a cross-encoder reranker is a tuned, competitive retrieval stack, not a strawman. We beat it on its own terms, on a suite we froze in advance.

How we prove it

The number above is only as trustworthy as the method behind it. So the method is the product.

A proof-gate, not a scoreboard

A result only counts if it carries a real Kin tool-call trace. No trace, no credit. The gate throws out answers that cannot show their work, so the number reflects what Kin actually retrieved, not what a harness backfilled.
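The gate rule can be sketched in a few lines. This is an illustration under stated assumptions, not Kin's implementation: the `Result` shape, the `tool_calls` field, and `credited_files` are hypothetical names, and the sketch treats "no credit" as scoring the answer as if nothing had been predicted.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    task_id: str
    predicted_files: list[str]
    # Hypothetical trace shape: one dict per recorded tool call.
    tool_calls: list[dict] = field(default_factory=list)


def credited_files(result: Result) -> list[str]:
    """Return the predictions that count: none at all if the trace is empty."""
    return result.predicted_files if result.tool_calls else []


traced = Result("t1", ["src/a.py"], tool_calls=[{"tool": "search", "query": "a"}])
untraced = Result("t2", ["src/b.py"])

print(credited_files(traced))    # ['src/a.py']
print(credited_files(untraced))  # []
```

Zeroing the credit, rather than dropping the task from the suite, keeps the denominator fixed so an untraced answer cannot raise the score.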

Bit-reproducible runs (n=3 identical)

Every headline run is executed three times and must come back bit-identical. Retrieval, ranking, and scoring are deterministic and key-derived, so the result is an artifact you can re-cut, not a lucky single pass.
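One way to check a claim like this is to hash each run's output and require a single distinct digest. The sketch below assumes each run produces a byte artifact and uses SHA-256 for the comparison. The page does not describe how the harness compares runs, so both are assumptions and `bit_identical` is a hypothetical helper.

```python
import hashlib


def digest(artifact: bytes) -> str:
    """SHA-256 hex digest of one run's output bytes."""
    return hashlib.sha256(artifact).hexdigest()


def bit_identical(artifacts: list[bytes]) -> bool:
    """True only if there are at least two runs and all share one digest."""
    return len(artifacts) >= 2 and len({digest(a) for a in artifacts}) == 1


# Three placeholder run outputs (n=3), made up for illustration.
runs = [b'{"f1": 0.219}', b'{"f1": 0.219}', b'{"f1": 0.219}']
print(bit_identical(runs))                        # True
print(bit_identical(runs[:2] + [b'{"f1": 0.22}']))  # False
```

Comparing digests of raw bytes is stricter than comparing scores: two runs with the same F1 but different ranked output would still fail.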

One-command reproduction

The benchmark is a frozen suite with a pinned baseline and a single reproduce command. Anyone with the harness can re-run it against the exact same tasks and binaries and land on the same score.

We self-invalidate

We threw out an earlier headline result when it didn’t honestly answer the thesis question. If a number can’t survive its own proof-gate, it doesn’t ship. We would rather retract a claim than defend a soft one.

What we do not claim yet

This list is the point, not a footnote. Our competitors publish numbers they can’t reproduce. We’d rather underclaim than overclaim. So here is everything we are not asserting.

× No token-reduction percentage. We make no “X% fewer tokens” claim.
× No “beats grep” or “beats Git” headline. That comparison isn’t settled, so we don’t assert it.
× No merge-time percentage. No “N% less time to merge” or “N× faster review.”
× No bug-catch rate. No “catches N× more bugs” or defect-detection percentage.
× No adoption or customer numbers beyond what’s publicly verifiable on GitHub.

When one of these becomes provable under the same proof-gate, it graduates to the page above with a number. Until then, it stays here.

Pillar maturity

Three pillars, three honest maturities. We label each one so you always know whether you’re looking at a measured result, live code, or the roadmap.

Agent Context

Citable now

Graph-native retrieval and context, measured by the frozen file-localization benchmark above. This is the pillar the +89% number proves.

AI Merge Trust

Live code, benchmark in progress

Semantic diff, graph impact analysis, and review gates are implemented and demonstrable today. The benchmark that would put a defensible number on it is still being built, so we publish no merge-trust metric yet.

Org Spine / cross-repo blast radius

Live code, hosted proof in progress

Cross-repo impact, security-exposure blast radius, and org-graph review are implemented and demonstrable. Daemon, CLI, MCP, and KinLab surfaces are all wired. The hosted proof over the full Kin ecosystem is in progress. Live daemon indexing over real repos is not yet complete, so we publish no org-spine metric yet.

See the harness

The proof-gate and reproducible runner live in the open. Read how the suite is frozen and how a run is scored.

kin-bench docs firelock-ai

Claims you can audit.

One number, proven the hard way. Plus an honest list of everything we haven’t proven yet. Get early access and read the rest.

Request early access
Explore the ecosystem