VISUAL REGRESSION TESTING FOR YOUR CODING AGENT

YOUR CODING AGENT RUNS THE TESTS AND JUDGES AMBIGUOUS DIFFS FROM COMPACT REGION TILES. NO EMBEDDED LLM. NO API KEY. NO PER-SNAPSHOT BILL. OPEN SOURCE FROM CLI TO CI.

INSTALL
npm install -g @blazediff/agent
blazediff-agent onboard
~/Projects/blazediff - claude
(base) ➜ blazediff git:(main)  claude

 /blazediff --cwd apps/website

 Bash(blazediff-agent check --judge host --json)
 ⎿ 21/23 passed, 2 ambiguous

 Read(.blazediff/judgments/*/regions.png)

 Both: em-dash → hyphen. Intentional. Writing verdicts.

 21/23 passed, 2 intentional-likely. Rewrite baselines?

 

WHY THIS DESIGN

FOUR DECISIONS THAT KEEP THE LOOP CHEAP, AUDITABLE, AND OUTSIDE A VENDOR'S CLOUD.

01

YOUR AGENT IS THE JUDGE

When the heuristic can't decide, the agent judges compact region tiles and writes a verdict file. No API call leaves your machine.

02

TOKEN-EFFICIENT

Region tiles are 10x to 100x smaller than full-page PNGs. The host agent reads the changed crops first and opens full pages only on demand.
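To make the size claim concrete, here is a rough back-of-the-envelope sketch. The page and tile dimensions are assumed values chosen for illustration, and the comparison uses raw RGBA bytes; compressed PNG sizes and the token cost an agent actually pays will vary.

```typescript
// Hypothetical shape for illustration; not a blazediff type.
interface Size {
  width: number;
  height: number;
}

// Raw RGBA footprint: 4 bytes per pixel.
function rawBytes(size: Size): number {
  return size.width * size.height * 4;
}

// How many times smaller the tile is than the full page, by raw bytes.
function reductionFactor(page: Size, tile: Size): number {
  return rawBytes(page) / rawBytes(tile);
}

// Assumed dimensions: a 1280x800 page capture and a 200x100 changed region.
const fullPage: Size = { width: 1280, height: 800 };
const regionTile: Size = { width: 200, height: 100 };

console.log(reductionFactor(fullPage, regionTile)); // 51.2
```

Under these assumptions one tile is about 51 times smaller than the page, which sits inside the 10x to 100x range quoted above. Taller pages or smaller regions push the factor higher.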

03

ONE PLAYBOOK, THREE HARNESSES

One onboard command installs the same skill into Claude Code, Cursor, and Codex. Switch tools without rewriting your testing setup.

04

MASK, DON'T REBASELINE

Carousels, iframes, clocks, randomized avatars. Tag them with a CSS selector once. The agent paints them out in both baseline and actual, so flakiness stops at the source.
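The idea of painting out a region in both images can be sketched as follows. This is a minimal illustration, not blazediff's implementation: the `Rect` type, `paintMask`, and `countDifferentPixels` are hypothetical names, and the sketch assumes the CSS selector has already been resolved to a pixel rectangle.

```typescript
// Hypothetical rectangle in pixel coordinates (e.g. a resolved element box).
interface Rect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Paint every masked rectangle opaque black, in place, clamped to the image.
function paintMask(
  pixels: Uint8ClampedArray,
  imageWidth: number,
  imageHeight: number,
  rects: Rect[],
): void {
  for (const r of rects) {
    const x0 = Math.max(0, r.x);
    const y0 = Math.max(0, r.y);
    const x1 = Math.min(imageWidth, r.x + r.width);
    const y1 = Math.min(imageHeight, r.y + r.height);
    for (let y = y0; y < y1; y++) {
      for (let x = x0; x < x1; x++) {
        const i = (y * imageWidth + x) * 4;
        pixels[i] = 0;
        pixels[i + 1] = 0;
        pixels[i + 2] = 0;
        pixels[i + 3] = 255;
      }
    }
  }
}

// Count pixels whose RGBA values differ between two same-sized buffers.
function countDifferentPixels(a: Uint8ClampedArray, b: Uint8ClampedArray): number {
  let count = 0;
  for (let i = 0; i < a.length; i += 4) {
    if (a[i] !== b[i] || a[i + 1] !== b[i + 1] || a[i + 2] !== b[i + 2] || a[i + 3] !== b[i + 3]) {
      count++;
    }
  }
  return count;
}

// Two 4x4 white images; the "actual" one has a changed pixel at (1, 1),
// standing in for a clock or a randomized avatar.
const W = 4;
const H = 4;
const baseline = new Uint8ClampedArray(W * H * 4).fill(255);
const actual = new Uint8ClampedArray(W * H * 4).fill(255);
actual[(1 * W + 1) * 4] = 0; // red channel of pixel (1, 1)

const diffBeforeMask = countDifferentPixels(baseline, actual);

const masks: Rect[] = [{ x: 1, y: 1, width: 1, height: 1 }];
paintMask(baseline, W, H, masks);
paintMask(actual, W, H, masks);

const diffAfterMask = countDifferentPixels(baseline, actual);
console.log(diffBeforeMask, diffAfterMask); // 1 0
```

Because the same rectangle is painted in both buffers, the dynamic region can no longer contribute to the diff, while changes everywhere else are still detected.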

HOW IT WORKS

TWO TOUCHPOINTS. ONE COMMAND TO AUTHOR FROM YOUR CODING AGENT, ONE STEP IN CI TO ENFORCE.

LOCAL - RUN /BLAZEDIFF IN YOUR CODING AGENT

$ /blazediff

One slash command in Claude Code, Cursor, or Codex. The skill walks your router, boots the dev server, captures deterministic baselines, and commits a manifest. You review the screenshots and merge.

CI - RUN CHECK ON EVERY PR

$ blazediff-agent check

One step in CI. Every PR re-renders every route, diffs against the committed baseline, and writes a structured report (change type, position, severity, bbox) per regression. Exit code 1 fails the build.
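As a rough sketch of how a structured report could map to the exit code, consider the following. The `Regression` type and both functions are hypothetical: the field names mirror the report fields listed above, the sample values are borrowed from the sample report on this page, and the real report schema may differ.

```typescript
// Hypothetical types; the actual blazediff report schema may differ.
interface BBox {
  x: number;
  y: number;
  width: number;
  height: number;
}

interface Regression {
  route: string;
  changeType: string; // e.g. "deletion"
  position: string; // e.g. "bottom"
  severity: string; // e.g. "medium"
  bbox: BBox;
}

// Convention from the text: exit code 1 fails the build, 0 lets it pass.
function exitCodeFor(regressions: Regression[]): number {
  return regressions.length > 0 ? 1 : 0;
}

// One human-readable line per regression, for CI logs.
function formatRegression(r: Regression): string {
  const { x, y, width, height } = r.bbox;
  return `${r.route}: ${r.changeType} at ${r.position} (${r.severity}), bbox x${x} y${y} ${width}x${height}`;
}

const sampleRegression: Regression = {
  route: "examples-interpret",
  changeType: "deletion",
  position: "bottom",
  severity: "medium",
  bbox: { x: 670, y: 977, width: 199, height: 89 },
};

console.log(formatRegression(sampleRegression));
console.log(exitCodeFor([sampleRegression])); // 1
console.log(exitCodeFor([])); // 0
```

In a Node-based wrapper, the result of `exitCodeFor` could be assigned to `process.exitCode` so that the CI runner marks the step as failed.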

REPORT OUTPUT

EVERY CHECK WRITES A 5-COLUMN MARKDOWN REPORT WITH BASELINE, ACTUAL, AND DIFF THUMBNAILS PER ROUTE. THE SAMPLE BELOW IS FROM A REAL RUN ON THIS WEBSITE.

.blazediff/summary.md
TOTAL 23 · PASSED 22 · PENDING 1
examples-interpret · FAIL · 10 regions · 1.87% · medium
BASELINE: [image: examples-interpret baseline]
CURRENT: [image: examples-interpret current]
changeType: deletion
position: bottom
shape: mixed-region
bbox: x670 y977 · 199×89
pixels: 9,449 (0.58%)
region: 1 / 10