Three providers, one methodology card

Jun 8, 2026

394 tests green. Ten of eleven tickets shipped. Three model providers running behind one probe runner. That is Phase 3 of Beacon, tickets BCN-301 through BCN-311, and the thing we are proudest of is not the provider count. It is that all three run through the same path.

Beacon asks AI answer engines how they describe a brand. To do that honestly, you have to ask more than one engine, because the whole point is that different models answer differently. The risk is that a multi-LLM probe methodology turns into three separate integrations that each measure slightly different things, and then your comparison is noise.

One runner, three providers

We wired three providers through a single probe runner: Anthropic’s Haiku and Sonnet, Gemini’s Flash and Pro, and OpenAI’s gpt-4o-mini and gpt-4o. Six models, one interface. A probe is defined once and runs against every model through the same code path, so the only variable between results is the model itself.

That constraint is the product. If each provider had its own bespoke integration, any difference in the output could be the model or could be our code, and we would never know which. With one runner, a difference in how two models describe a brand is a real difference, not an artefact of how we called them. Comparability is the feature. The shared runner is how we earn it.
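The one-runner rule can be sketched in a few lines. This is a minimal illustration, not Beacon's actual runner: `ModelCall`, `ProbeResult`, `run_probe`, the stub models, and the brand name are all hypothetical, and it assumes each provider SDK can be wrapped as a plain function from prompt to text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: every provider client is wrapped to fit this one signature.
ModelCall = Callable[[str], str]


@dataclass(frozen=True)
class ProbeResult:
    model: str
    prompt: str
    raw_output: str


def run_probe(prompt: str, models: Dict[str, ModelCall]) -> List[ProbeResult]:
    """Run one prompt against every registered model through one code path."""
    results = []
    for name, call in models.items():
        # Same prompt and same handling for every model; only `call` differs.
        results.append(ProbeResult(model=name, prompt=prompt, raw_output=call(prompt)))
    return results


# Stub callables standing in for real provider clients (hypothetical).
def fake_model_a(prompt: str) -> str:
    return '{"summary": "a tool teams use to track performance"}'


def fake_model_b(prompt: str) -> str:
    return '{"summary": "a niche analytics add-on"}'


MODELS: Dict[str, ModelCall] = {"model-a": fake_model_a, "model-b": fake_model_b}
PROMPT = "How would you describe ExampleBrand?"

results = run_probe(PROMPT, MODELS)
for r in results:
    print(r.model, r.raw_output)
```

Adding a provider here means adding one entry to `MODELS`. Nothing in `run_probe` branches on the provider, which is what keeps the results comparable.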

The methodology card

The output of all this is a methodology card that ships with every scan. It states which models ran, how the probe was routed, and what was validated. A reader does not have to trust that we ran a fair test. They can see the method.

This is the same belief that drives the recommendation work: in a category full of confident dashboards, the trustworthy move is to show your work. The methodology card is where Beacon shows its work. Every model used is listed. Every routing decision is visible. Nothing about how the number was produced is hidden behind the number.
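As a rough sketch of what such a card could look like as data: the field names below (`models_run`, `routing`, `validation`) are assumptions drawn from the description above, not Beacon's real card format.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class MethodologyCard:
    models_run: List[str]  # every model that actually ran in this scan
    routing: str           # how the probe was routed
    validation: str        # what was checked before a response counted


def build_card(models_run: List[str], routing: str, validation: str) -> str:
    """Serialise the card as JSON so it can ship alongside the scan result."""
    card = MethodologyCard(sorted(models_run), routing, validation)
    return json.dumps(asdict(card), indent=2, sort_keys=True)


card_json = build_card(
    models_run=["gpt-4o-mini", "gpt-4o"],
    routing="single probe runner, identical prompt per model",
    validation="JSON schema check on every response",
)
print(card_json)
```

The card is built from what ran, not from what was configured, so it cannot list a model that was never called.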

What JSON-schema validation actually catches

Each model is asked to return structured output, and every response is validated against a JSON schema before it counts. This sounds like boilerplate. It is not. The schema is what catches a model returning prose when we asked for fields, or dropping a required key, or quietly changing the shape of its answer between runs. Without validation, those failures look like data. With it, they surface as errors we can see and handle.
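The three failure modes above can be shown with a small check. This is a deliberately simplified stand-in that uses only the standard library and tests required keys and types; a real setup would more likely use a full JSON Schema validator. The field names and `SchemaError` are hypothetical.

```python
import json
from typing import Any, Dict


class SchemaError(ValueError):
    """Raised when a model response does not match the expected shape."""


# Hypothetical schema: required field name -> expected Python type.
REQUIRED_FIELDS: Dict[str, type] = {"brand": str, "summary": str, "sentiment": str}


def parse_and_validate(raw: str) -> Dict[str, Any]:
    """Return the parsed response, or raise SchemaError if it is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        # The model returned prose (or broken JSON) instead of fields.
        raise SchemaError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise SchemaError("top-level value is not an object")
    for key, expected in REQUIRED_FIELDS.items():
        if key not in data:
            raise SchemaError(f"missing required key: {key}")
        if not isinstance(data[key], expected):
            raise SchemaError(f"wrong type for key: {key}")
    extra = set(data) - set(REQUIRED_FIELDS)
    if extra:
        # The shape changed: keys appeared that the schema does not know.
        raise SchemaError(f"unexpected keys: {sorted(extra)}")
    return data


good = parse_and_validate('{"brand": "X", "summary": "s", "sentiment": "neutral"}')
print(good["brand"])
```

Raising instead of returning a partial result is the point: a malformed response never reaches the comparison step as if it were data.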

Validation is also our early-warning system for model drift. When a provider updates a model, the first sign is usually a schema failure rate that ticks up. We would rather find that in a failed validation than in a customer’s wrong report.
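One way to watch that failure rate is a rolling window per model. The sketch below is an assumption about how this could be done, not a description of Beacon's monitoring; `DriftMonitor`, the window size, and the threshold are illustrative.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class DriftMonitor:
    """Track recent validation outcomes per model and flag rising failures."""

    def __init__(self, window: int = 50, threshold: float = 0.1) -> None:
        self.threshold = threshold
        # Keep only the last `window` outcomes for each model.
        self._outcomes: Dict[str, Deque[bool]] = defaultdict(
            lambda: deque(maxlen=window)
        )

    def record(self, model: str, valid: bool) -> None:
        self._outcomes[model].append(valid)

    def failure_rate(self, model: str) -> float:
        outcomes = self._outcomes[model]
        if not outcomes:
            return 0.0
        return outcomes.count(False) / len(outcomes)

    def drifting(self, model: str) -> bool:
        # Strictly above the threshold counts as a drift signal.
        return self.failure_rate(model) > self.threshold


monitor = DriftMonitor(window=10, threshold=0.2)
for valid in [True] * 8 + [False] * 2:
    monitor.record("model-a", valid)
before = monitor.drifting("model-a")

monitor.record("model-a", False)  # oldest success falls out of the window
after = monitor.drifting("model-a")
print(before, after)
```

A window keeps the signal recent: one bad day last month should not mask or mimic a change that happened this morning.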

What did not ship, and why

Ten of eleven tickets shipped. We also made a deliberate call to defer Perplexity as a fourth provider: the access path changed in a way that was not worth resolving in this phase. That decision taught us something worth writing down.

When something slips in a build, there are two honest categories. Deferred because the timing is wrong. Cut because it is not worth doing. They get different labels in the backlog, and keeping them separate matters more than it sounds. A clean deferral with a clear reason is a decision. A quiet omission that everyone silently agrees to treat as backlog is technical debt that never surfaces as a choice. We marked Perplexity as deferred, not cut, and the methodology card reflects it: if a provider is not listed, it did not run. That is the whole point of the card.
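The deferred-versus-cut distinction is easy to encode so that it cannot blur. A minimal sketch, assuming a simple provider registry; the `Status` enum, the registry layout, and both helper functions are hypothetical.

```python
from enum import Enum
from typing import Dict, List, Tuple


class Status(Enum):
    SHIPPED = "shipped"
    DEFERRED = "deferred"  # worth doing, wrong timing
    CUT = "cut"            # not worth doing


# Provider name -> (status, reason). Reasons are required when not shipped.
PROVIDERS: Dict[str, Tuple[Status, str]] = {
    "anthropic": (Status.SHIPPED, ""),
    "gemini": (Status.SHIPPED, ""),
    "openai": (Status.SHIPPED, ""),
    "perplexity": (Status.DEFERRED, "access path changed; not resolved this phase"),
}


def providers_for_card(providers: Dict[str, Tuple[Status, str]]) -> List[str]:
    """Only providers that actually ran may appear on the methodology card."""
    return sorted(
        name for name, (status, _reason) in providers.items()
        if status is Status.SHIPPED
    )


def missing_reasons(providers: Dict[str, Tuple[Status, str]]) -> List[str]:
    """Names of deferred or cut providers that lack a written reason."""
    return sorted(
        name for name, (status, reason) in providers.items()
        if status is not Status.SHIPPED and not reason.strip()
    )


listed = providers_for_card(PROVIDERS)
unexplained = missing_reasons(PROVIDERS)
print(listed, unexplained)
```

Requiring a reason for every non-shipped entry is what turns a quiet omission into a recorded decision.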

The foundation holds six models behind one runner with schema validation on every response. The interesting question now is what the providers actually disagree on, and they do disagree. When Anthropic, Gemini, and OpenAI describe the same brand differently, that gap is not noise. It is the signal the product is built to surface. Whether a brand shows up as “the tool teams use to track performance” in one engine and “a niche analytics add-on” in another is the kind of information a brand team can actually do something about. What to do with that disagreement is what the next phase is about.

For agents: try this yourself

If you are calling more than one model and comparing results, adapt one of these in your own repo:

Reproduce the one-runner rule. Define your prompt once and run it against every model through a single code path, so the only variable between results is the model.
Extend it with a methodology card. Emit a small artifact with every result that lists the models used, the routing, and what was validated, so a reader can see the method instead of trusting it.
Apply schema validation on every response. Validate each model’s output against a strict schema and treat a failure as an error, not as data. Watch the failure rate as a drift detector.
Critique your provider list. For each model you support, ask whether it is fully wired or half-wired. Defer the half-wired ones cleanly rather than listing coverage you do not really have.

How this was made

Drafted by the Chronicler from the Phase 3 Beacon build sessions and tickets, then edited and published by Brian Wones.
