AI-Generated Tests vs Deterministic Execution

BoxProbe uses AI during scenario generation — but never at runtime. Here's why that separation matters for trust, reproducibility, and cost.

The phrase "AI-powered testing" covers a wide range of products with very different trust profiles. Some use AI to generate test suggestions. Some use AI to self-heal broken selectors at runtime. Some use AI to interpret screenshots and decide if a test passed or failed. These are fundamentally different architectures with different failure modes.

BoxProbe draws a hard line between two phases: generation and execution.

Generation: AI-assisted

During generation, BoxProbe uses AI to convert a plain-language scenario description into a structured Python test file. The AI sees the application's current state — screenshots, DOM structure, API traffic — and produces pixel-level bounding box annotations for each interaction step.

This is where AI adds value: understanding visual layouts, identifying interactive elements, mapping user intent to coordinates. The output is a Python file you can read, review, and modify. It goes into your Git repository like any other test file.
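To make "a Python file you can read, review, and modify" concrete, here is a minimal sketch of what such a file might contain. BoxProbe's real file format is not shown in this article, so the `Step` class, its fields, and the coordinates below are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Step:
    """One interaction step (hypothetical shape, not BoxProbe's actual schema)."""
    action: str                      # "click" or "type"
    box: Tuple[int, int, int, int]   # bounding box in pixels: (x, y, width, height)
    text: Optional[str] = None       # text to enter for "type" steps

    def center(self) -> Tuple[int, int]:
        """Return the pixel at the middle of the bounding box."""
        x, y, w, h = self.box
        return (x + w // 2, y + h // 2)


# Hypothetical scenario: "log in with a valid email address"
LOGIN_SCENARIO = [
    Step("click", (412, 230, 280, 36)),                           # focus the email field
    Step("type", (412, 230, 280, 36), text="user@example.com"),   # enter the address
    Step("click", (412, 330, 120, 40)),                           # press the submit button
]
```

Because every step is plain data, a reviewer can see in a code review exactly which pixel each step targets, and a diff in Git shows precisely which coordinates changed between generations.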

Execution: zero AI tokens

At runtime, the scenario executes deterministically. It clicks at the coordinates specified in the file. It waits for network idle. It captures API traffic. There are no AI calls, no LLM inference, no probabilistic decisions.

Same scenario, same application state, same result. Every time. The execution is a pure function of the scenario file and the application under test.
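The "pure function" idea can be sketched with a toy replay loop. This is not BoxProbe's runner: `RecordingDriver` and `run_scenario` are hypothetical names, and the driver only records calls instead of controlling a browser. The point is that the loop contains no model call and no branching on anything other than the scenario itself.

```python
from typing import List, Tuple

Step = Tuple[str, int, int]  # (action, x, y) in page pixels


class RecordingDriver:
    """Hypothetical stand-in for a browser driver; it only records calls."""

    def __init__(self) -> None:
        self.log: List[tuple] = []

    def click(self, x: int, y: int) -> None:
        self.log.append(("click", x, y))

    def wait_for_network_idle(self) -> None:
        self.log.append(("wait_for_network_idle",))


def run_scenario(steps: List[Step], driver: RecordingDriver) -> List[tuple]:
    """Replay steps exactly as written: no inference, no retries, no guessing."""
    for action, x, y in steps:
        if action != "click":
            raise ValueError(f"unsupported action: {action!r}")
        driver.click(x, y)
        driver.wait_for_network_idle()
    return driver.log


STEPS: List[Step] = [("click", 552, 248), ("click", 472, 350)]
first = run_scenario(STEPS, RecordingDriver())
second = run_scenario(STEPS, RecordingDriver())
```

Here `first` and `second` are identical traces, because nothing in the loop depends on anything except the step list. With a real browser behind the driver, the application's state becomes the only other input.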

This separation matters for three reasons:

Trust. You can read the scenario before it runs. You know exactly what it will click, type, and capture.
Reproducibility. If a diff report shows a regression, you can re-run the exact same scenario against the same application state and get the exact same result.
Cost. AI tokens are consumed once during generation. Execution consumes none, so you can run the same scenario a thousand times in CI without paying for AI again.

The self-healing trap

Some testing tools use AI to "self-heal" broken selectors at runtime. When a CSS selector fails, the AI looks at the current DOM and guesses which element the test meant to target. This sounds convenient — until the AI guesses wrong.

A self-healing test that clicks the wrong button and still passes is worse than a broken test that fails loudly. The broken test tells you something changed. The self-healed test hides the change behind a probabilistic guess.
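A toy illustration of how this can go wrong. It is not the algorithm of any particular tool: it assumes a healer that picks the closest-looking selector by string similarity, using Python's standard `difflib`. The selectors are invented for the example.

```python
import difflib
from typing import List, Optional


def self_heal(broken_selector: str, current_selectors: List[str]) -> Optional[str]:
    """Guess a replacement by string similarity (an assumed, simplistic strategy)."""
    matches = difflib.get_close_matches(
        broken_selector, current_selectors, n=1, cutoff=0.5
    )
    return matches[0] if matches else None


# The test was written to click "#delete-draft". After a redesign that button
# is now "#discard", and an unrelated "#delete-account" button sits on the page.
current_page = ["#discard", "#delete-account", "#save"]
healed = self_heal("#delete-draft", current_page)
```

In this sketch `healed` is `"#delete-account"`: the guess shares a long prefix with the old selector, so it scores highest, yet it targets a different action from the one the test intended. A test that proceeds with that guess reports nothing about the renamed button.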

BoxProbe avoids this entirely. Scenarios use pixel coordinates. If the layout changes enough that the coordinates no longer point to the right element, the scenario fails. You regenerate with fresh annotations. The failure is explicit, the fix is explicit.

When to regenerate

Scenarios need regeneration when the application's visual layout changes significantly: a redesign, a major UI refactor, a navigation restructure. Minor text changes, color updates, or backend-only changes don't affect coordinates, as long as they leave element positions unchanged.

The regeneration cost is one credit per step. For a 6-step scenario, that's 6 credits to produce a fresh, deterministic, reviewable test file. Compare that to the ongoing runtime cost of AI-powered tools that consume tokens on every execution.

— BoxProbe pricing model
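The arithmetic above can be written down as a small sketch. The one-credit-per-step rate comes from the quote. The function names are hypothetical, and `runtime_ai_cost` models a generic tool that pays for AI on every run, with a per-run figure you would have to supply yourself, since none is given here.

```python
def generation_credits(steps: int, regenerations: int = 1) -> int:
    """Credits spent generating a scenario, at one credit per step."""
    return steps * regenerations


def runtime_ai_cost(cost_per_run: float, runs: int) -> float:
    """Cost of a hypothetical tool that calls AI on every execution.

    The unit of cost_per_run is whatever that tool bills in; it is not
    directly comparable to credits without a conversion you provide.
    """
    return cost_per_run * runs


six_step = generation_credits(steps=6)                        # 6 credits
after_redesign = generation_credits(steps=6, regenerations=2)  # 12 credits in total
```

Note that the number of CI runs never appears in `generation_credits`: the cost grows with how often the layout changes, not with how often the scenario executes.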

The generation/execution separation isn't just an architectural decision. It's a trust boundary. Everything on the generation side can be creative, probabilistic, and AI-assisted. Everything on the execution side must be deterministic, reproducible, and auditable. Mixing the two erodes the trust that makes test results meaningful.