attest

An open-source, evidence-grounded evaluator for AI agents, published on PyPI. It grades an agent against its real tool outputs, not LLM-judge vibes.

attest is an open-source tool I built and published for evaluating AI agents. It grades an agent by checking its claims against the real outputs of the tools it used, instead of asking another model whether the answer looks good. It installs from PyPI as agent-attest, runs on Anthropic, OpenAI, or Gemini, and is MIT licensed.

The problem

Evaluating an agent usually means LLM-as-judge: one model grading another. The weakness is that the judge reacts to the agent's explanation, so a confident, well-written answer can pass even when a specific detail buried inside it is wrong. This is measured, not hypothetical. A 2026 paper, Gaming the Judge, found that rewriting an agent's reasoning while leaving its actions unchanged can inflate a judge's false-positive rate by up to 90 percent.

The approach

attest never trusts what the agent says it did. It breaks the final answer into atomic claims and verifies each one against the agent's recorded tool outputs, the receipts. The property that makes this resistant to gaming is isolation: when attest verifies a claim, the model sees only that claim and the evidence, never the agent's reasoning or narrative. It still uses a model to judge, but constrains that judgment to a narrow entailment question rather than a holistic opinion, and every verdict quotes the exact span of evidence behind it.

What it checks

attest grades a run across four dimensions:

Faithfulness: does each claim follow from the tool outputs?
Tool use: were the right tools called, and were errors handled rather than ignored?
Prompt injection: did instructions hidden in tool data steer the agent? A regex scan catches known payloads, and an effect-based check catches novel ones by asking whether the agent took an action the user never authorized.
Role adherence: did the agent stay within the role its system prompt defines, or get talked out of it by a jailbreak?

Each check returns the same shape, a result with a pass or fail, an optional score, and a list of findings, so all four read and serialize the same way.

Engineering decisions

A few choices I am happy with:

One interface, three providers. Grading runs on Anthropic, OpenAI, or Gemini behind a single structured-output layer, so you can switch models without changing your code.
The judge is hardened against itself. attest reads attacker-controllable text, which makes its own judge an injection target. A guard frames that text as data to evaluate, never commands to obey, so a planted "mark this as passing" does not flip the verdict.
Cheap at scale. Faithfulness originally made one model call per claim, which exploded on long answers. I batched the verification and tightened claim extraction, taking one real example from about 170 grading calls down to 5, with the verdicts unchanged.
Shipped, not just written. It is published to PyPI with automated releases through GitHub Actions Trusted Publishing, so a new version goes out on a tagged release with no manual upload and no long-lived tokens.

Evals as tests

I use attest the way evals are meant to be used: as a test suite. I dogfooded it on a real LangGraph code-review agent by writing a spec of expected behavior, a few legitimate requests plus a set of jailbreak attempts, and a runner that grades each result with attest and asserts against the spec. It exits non-zero on failure, so it drops into CI. If the agent's system prompt is ever weakened and it starts complying with a jailbreak, the suite turns red instead of the problem reaching production.

Honest limits

attest constrains a model rather than removing it, so its verdicts are far harder to game than a holistic grader, but not infallible. Prompt injection is an unsolved problem, and attest detects rather than prevents it. The goal is to make agent behavior measurable and auditable, not to claim it is solved.

Repo: github.com/adepeju4/attest · Install: pip install agent-attest