🐝

How to effectively test AI?

So here are my takeaway on how to test “modern” aka. “Software 3.0” aka. vibe-coded apps…

include “Goldens” + “Rubrics” + “Property-based” tests

What they mean:

Golden: an expected (“golden”) output for a given input. Think snapshot: this is our reference for “good”.
Rubric (scoring): a checklist with weights. Instead of strict word-for-word match, you score criteria (e.g., “does it name the error?”, “does it list 3 steps?”).
Property-based: you validate properties of the output rather than the exact wording (e.g., JSON must be valid; must contain a 6-digit order ID; must not include personal data, etc.).

Simple example (error-message explainer AI):

Input: TypeError: cannot read properties of undefined (reading 'length')
Expected (golden) key points:

Explain what “undefined” means.
Give 1–2 typical causes.
Provide 2 troubleshooting steps.

Rubric (example, with weights)

Clear explanation of root cause (0–3 pts)
At least 2 concrete causes (0–2 pts)
At least 2 numbered troubleshooting steps (0–2 pts)
Conciseness (<180 words) (0–1 pt)
No false claims (0–2 pts)

Total: 10; threshold: ≥8 = PASS

Properties (binary checks)

Includes the term “undefined”
Has a numbered list for the steps
Text < 180 words
Avoids forbidden phrases (e.g., “guaranteed solution”)

Ideas for implementation:

Don’t make the golden overly literal. Keep an “acceptable variations” list: the key points that must appear — wording may vary.
Keep each rubric in one file per use case (e.g., evals/bug_explainer/rubric.json).
Property checks are easy with regex and a JSON validator.

❤️ Smoke tests and nightly runs ❤️

Why?

LLMs are non-deterministic: the same prompt can yield slightly different answers. A single run can give false confidence.

What we do:

Run the same test N times (e.g., N=5 in PR smoke, N=20 nightly).
For each run, compute the rubric score and property PASS/FAIL.
Track:

Pass rate = runs that meet the threshold / N
Average score + standard deviation (volatility)
Drift: compare current averages vs. last week/month
Cost: number of calls × unit price (or credits)
Latency: average response time

Starter thresholds:

Smoke (in PR): N=3–5, pass rate ≥ 80%, average score ≥ threshold
Nightly: N=20, pass rate ≥ 90%, stdev ≤ 1.2 pts
Drift alert: if average score drops ≥1 point vs. 7-day baseline

Why it helps

You see whether the system is stable instead of just “got lucky” a few times.

treat prompts as source code

Principle:

Prompts, eval rubrics, and test data (RAG corpus, examples) are source code. Version them and review them like code.

Structure (example):

/prompts/
  bug_explainer.prompt.md
  summarizer.prompt.md
/evals/
  bug_explainer/
    rubric.json
    properties.json
  summarizer/
    rubric.json
    properties.json
/data/
  rag/
    guides/
    api_refs/

CI rules:

When /prompts/, /evals/ or /data/ changes → run the related eval suites.
In PRs show the diff (prompt-diff). Reviewers see exactly which instructions changed.
Data-diff: when the RAG corpus updates, run the suites that depend on it (use tags/manifest mapping).

GitHub Actions — sample logic (sketch):

on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Detect changed areas
        run: |
          git diff --name-only origin/main... > changed.txt
          python scripts/select_eval_suites.py changed.txt > suites.txt
      - name: Run evals
        run: |
          while read SUITE; do
            python tools/run_eval.py --suite "$SUITE" --n 5 --threshold 8
          done < suites.txt

when to test what?

Regression eval

Goal: don’t break what already worked.
Include past bug fixes, common customer questions, internal goldens.
Run on every model / prompt / tool change and any PR touching /prompts, /evals, /data.

Adversarial eval

Goal: deliberately tricky, “break it” cases.
Examples:

Vague request (“Can you do that for me?” — zero context)
Conflicting instructions (contradictions in one prompt)
Language switching (EN↔HU mix, etc.)
Extreme length (very long input)
Format traps (strict JSON schema required)
Prompt injection (“Ignore the previous rules…”)
Tool failure simulation (API 500, timeout) → expected: clear error + fallback strategy

Run weekly and on major releases. Beyond PASS/FAIL, the aim is to harvest new red flags to add into the regression pack.

Gating:

Block the PR if:

Any critical property fails (e.g., JSON validity, PII guard)
Regression pass rate < 90% (or your team’s threshold)
Any defined catastrophic adversarial case fails

Quick starter kit

1) Minimal files

evals/<usecase>/rubric.json — criteria + weights + threshold
evals/<usecase>/properties.json — regexes, JSON schemas, forbidden terms
prompts/<usecase>.prompt.md — versioned instruction set
data/rag/... — referenced knowledge base

2) Minimal metrics (table friendly)

Date, Use case, Model/Version, Prompt hash, N, Pass rate, Avg score, Stdev, Drift vs. 7-day, Avg latency (ms), Cost (unit/call × N)

3) Baseline rules

Rubric threshold: pre-agreed (e.g., 8/10)
N: PR=5, nightly=20 (starting values)
Critical properties: JSON valid, privacy/PII, format, number of references, etc.
Gates: if a critical property fails → PR blocked

Example snippet (illustrative)

evals/bug_explainer/rubric.json

{
  "threshold": 8,
  "criteria": [
    {"name": "root_cause_clarity", "weight": 3, "pattern": "undefined|null"},
    {"name": "num_causes", "weight": 2, "min_count": 2},
    {"name": "num_steps", "weight": 2, "min_count": 2},
    {"name": "conciseness", "weight": 1, "max_words": 180},
    {"name": "no_false_claims", "weight": 2, "forbidden": ["guaranteed", "100%"]}
  ]
}

evals/bug_explainer/properties.json

{
  "must_include": ["undefined"],
  "must_have_numbered_list": true,
  "max_words": 180,
  "forbid": ["guaranteed solution"]
}

Checklist (pin it to the wall)

Prompts are versioned; code review is required

Each use case has a rubric and properties file

PR: N=5 stochastic runs; gates: critical properties + threshold

Nightly N=20, drift tracking, weekly chart

Adversarial pack runs weekly + updated with new cases

Data-diff triggers relevant suites

Latency and cost measured and trended

Every AI call logged: prompt, output, score, property results