This is only a Test
/Blog Posts
Blog Posts
/
🐝
How to effectively test AI?
🐝

How to effectively test AI?

So here are my takeaway on how to test “modern” aka. “Software 3.0” aka. vibe-coded apps…

include “Goldens” + “Rubrics” + “Property-based” tests

What they mean:

  • Golden: an expected (“golden”) output for a given input. Think snapshot: this is our reference for “good”.
  • Rubric (scoring): a checklist with weights. Instead of strict word-for-word match, you score criteria (e.g., “does it name the error?”, “does it list 3 steps?”).
  • Property-based: you validate properties of the output rather than the exact wording (e.g., JSON must be valid; must contain a 6-digit order ID; must not include personal data, etc.).

Simple example (error-message explainer AI):

  • Input: TypeError: cannot read properties of undefined (reading 'length')
  • Expected (golden) key points:
    • Explain what “undefined” means.
    • Give 1–2 typical causes.
    • Provide 2 troubleshooting steps.
  • Rubric (example, with weights)
    • Clear explanation of root cause (0–3 pts)
    • At least 2 concrete causes (0–2 pts)
    • At least 2 numbered troubleshooting steps (0–2 pts)
    • Conciseness (<180 words) (0–1 pt)
    • No false claims (0–2 pts)
    • Total: 10; threshold: ≥8 = PASS

  • Properties (binary checks)
    • Includes the term “undefined”
    • Has a numbered list for the steps
    • Text < 180 words
    • Avoids forbidden phrases (e.g., “guaranteed solution”)

Ideas for implementation:

  • Don’t make the golden overly literal. Keep an “acceptable variations” list: the key points that must appear — wording may vary.
  • Keep each rubric in one file per use case (e.g., evals/bug_explainer/rubric.json).
  • Property checks are easy with regex and a JSON validator.

❤️ Smoke tests and nightly runs ❤️

Why?

LLMs are non-deterministic: the same prompt can yield slightly different answers. A single run can give false confidence.

What we do:

  • Run the same test N times (e.g., N=5 in PR smoke, N=20 nightly).
  • For each run, compute the rubric score and property PASS/FAIL.
  • Track:
    • Pass rate = runs that meet the threshold / N
    • Average score + standard deviation (volatility)
    • Drift: compare current averages vs. last week/month
    • Cost: number of calls × unit price (or credits)
    • Latency: average response time

Starter thresholds:

  • Smoke (in PR): N=3–5, pass rate ≥ 80%, average score ≥ threshold
  • Nightly: N=20, pass rate ≥ 90%, stdev ≤ 1.2 pts
  • Drift alert: if average score drops ≥1 point vs. 7-day baseline

Why it helps

You see whether the system is stable instead of just “got lucky” a few times.

treat prompts as source code

Principle:

Prompts, eval rubrics, and test data (RAG corpus, examples) are source code. Version them and review them like code.

Structure (example):

/prompts/
  bug_explainer.prompt.md
  summarizer.prompt.md
/evals/
  bug_explainer/
    rubric.json
    properties.json
  summarizer/
    rubric.json
    properties.json
/data/
  rag/
    guides/
    api_refs/

CI rules:

  • When /prompts/, /evals/ or /data/ changes → run the related eval suites.
  • In PRs show the diff (prompt-diff). Reviewers see exactly which instructions changed.
  • Data-diff: when the RAG corpus updates, run the suites that depend on it (use tags/manifest mapping).

GitHub Actions — sample logic (sketch):

when to test what?

Regression eval

  • Goal: don’t break what already worked.
  • Include past bug fixes, common customer questions, internal goldens.
  • Run on every model / prompt / tool change and any PR touching /prompts, /evals, /data.

Adversarial eval

  • Goal: deliberately tricky, “break it” cases.
  • Examples:
    • Vague request (“Can you do that for me?” — zero context)
    • Conflicting instructions (contradictions in one prompt)
    • Language switching (EN↔HU mix, etc.)
    • Extreme length (very long input)
    • Format traps (strict JSON schema required)
    • Prompt injection (“Ignore the previous rules…”)
    • Tool failure simulation (API 500, timeout) → expected: clear error + fallback strategy
  • Run weekly and on major releases. Beyond PASS/FAIL, the aim is to harvest new red flags to add into the regression pack.

Gating:

  • Block the PR if:
    • Any critical property fails (e.g., JSON validity, PII guard)
    • Regression pass rate < 90% (or your team’s threshold)
    • Any defined catastrophic adversarial case fails

Quick starter kit

1) Minimal files

  • evals/<usecase>/rubric.json — criteria + weights + threshold
  • evals/<usecase>/properties.json — regexes, JSON schemas, forbidden terms
  • prompts/<usecase>.prompt.md — versioned instruction set
  • data/rag/... — referenced knowledge base

2) Minimal metrics (table friendly)

  • Date, Use case, Model/Version, Prompt hash, N, Pass rate, Avg score, Stdev, Drift vs. 7-day, Avg latency (ms), Cost (unit/call × N)

3) Baseline rules

  • Rubric threshold: pre-agreed (e.g., 8/10)
  • N: PR=5, nightly=20 (starting values)
  • Critical properties: JSON valid, privacy/PII, format, number of references, etc.
  • Gates: if a critical property fails → PR blocked

Example snippet (illustrative)

evals/bug_explainer/rubric.json

evals/bug_explainer/properties.json

{
  "must_include": ["undefined"],
  "must_have_numbered_list": true,
  "max_words": 180,
  "forbid": ["guaranteed solution"]
}

Checklist (pin it to the wall)

Prompts are versioned; code review is required
Each use case has a rubric and properties file
PR: N=5 stochastic runs; gates: critical properties + threshold
Nightly N=20, drift tracking, weekly chart
Adversarial pack runs weekly + updated with new cases
Data-diff triggers relevant suites
Latency and cost measured and trended
Every AI call logged: prompt, output, score, property results
Logo

Terms and Conditions

Blog

Test Management System

Created with ❤️ by Clean Cut Kft. - 2025

DiscordYouTube
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Detect changed areas
        run: |
          git diff --name-only origin/main... > changed.txt
          python scripts/select_eval_suites.py changed.txt > suites.txt
      - name: Run evals
        run: |
          while read SUITE; do
            python tools/run_eval.py --suite "$SUITE" --n 5 --threshold 8
          done < suites.txt
{
  "threshold": 8,
  "criteria": [
    {"name": "root_cause_clarity", "weight": 3, "pattern": "undefined|null"},
    {"name": "num_causes", "weight": 2, "min_count": 2},
    {"name": "num_steps", "weight": 2, "min_count": 2},
    {"name": "conciseness", "weight": 1, "max_words": 180},
    {"name": "no_false_claims", "weight": 2, "forbidden": ["guaranteed", "100%"]}
  ]
}