So here are my takeaway on how to test “modern” aka. “Software 3.0” aka. vibe-coded apps…
include “Goldens” + “Rubrics” + “Property-based” tests
What they mean:
- Golden: an expected (“golden”) output for a given input. Think snapshot: this is our reference for “good”.
- Rubric (scoring): a checklist with weights. Instead of strict word-for-word match, you score criteria (e.g., “does it name the error?”, “does it list 3 steps?”).
- Property-based: you validate properties of the output rather than the exact wording (e.g., JSON must be valid; must contain a 6-digit order ID; must not include personal data, etc.).
Simple example (error-message explainer AI):
- Input:
TypeError: cannot read properties of undefined (reading 'length')
- Expected (golden) key points:
- Explain what “undefined” means.
- Give 1–2 typical causes.
- Provide 2 troubleshooting steps.
- Rubric (example, with weights)
- Clear explanation of root cause (0–3 pts)
- At least 2 concrete causes (0–2 pts)
- At least 2 numbered troubleshooting steps (0–2 pts)
- Conciseness (<180 words) (0–1 pt)
- No false claims (0–2 pts)
- Properties (binary checks)
- Includes the term “undefined”
- Has a numbered list for the steps
- Text < 180 words
- Avoids forbidden phrases (e.g., “guaranteed solution”)
Total: 10; threshold: ≥8 = PASS
Ideas for implementation:
- Don’t make the golden overly literal. Keep an “acceptable variations” list: the key points that must appear — wording may vary.
- Keep each rubric in one file per use case (e.g.,
evals/bug_explainer/rubric.json
). - Property checks are easy with regex and a JSON validator.
❤️ Smoke tests and nightly runs ❤️
Why?
LLMs are non-deterministic: the same prompt can yield slightly different answers. A single run can give false confidence.
What we do:
- Run the same test N times (e.g., N=5 in PR smoke, N=20 nightly).
- For each run, compute the rubric score and property PASS/FAIL.
- Track:
- Pass rate = runs that meet the threshold / N
- Average score + standard deviation (volatility)
- Drift: compare current averages vs. last week/month
- Cost: number of calls × unit price (or credits)
- Latency: average response time
Starter thresholds:
- Smoke (in PR): N=3–5, pass rate ≥ 80%, average score ≥ threshold
- Nightly: N=20, pass rate ≥ 90%, stdev ≤ 1.2 pts
- Drift alert: if average score drops ≥1 point vs. 7-day baseline
Why it helps
You see whether the system is stable instead of just “got lucky” a few times.
treat prompts as source code
Principle:
Prompts, eval rubrics, and test data (RAG corpus, examples) are source code. Version them and review them like code.
Structure (example):
/prompts/
bug_explainer.prompt.md
summarizer.prompt.md
/evals/
bug_explainer/
rubric.json
properties.json
summarizer/
rubric.json
properties.json
/data/
rag/
guides/
api_refs/
CI rules:
- When /prompts/, /evals/ or /data/ changes → run the related eval suites.
- In PRs show the diff (prompt-diff). Reviewers see exactly which instructions changed.
- Data-diff: when the RAG corpus updates, run the suites that depend on it (use tags/manifest mapping).
GitHub Actions — sample logic (sketch):
on: [pull_request]
jobs:
eval:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Detect changed areas
run: |
git diff --name-only origin/main... > changed.txt
python scripts/select_eval_suites.py changed.txt > suites.txt
- name: Run evals
run: |
while read SUITE; do
python tools/run_eval.py --suite "$SUITE" --n 5 --threshold 8
done < suites.txt
when to test what?
Regression eval
- Goal: don’t break what already worked.
- Include past bug fixes, common customer questions, internal goldens.
- Run on every model / prompt / tool change and any PR touching /prompts, /evals, /data.
Adversarial eval
- Goal: deliberately tricky, “break it” cases.
- Examples:
- Vague request (“Can you do that for me?” — zero context)
- Conflicting instructions (contradictions in one prompt)
- Language switching (EN↔HU mix, etc.)
- Extreme length (very long input)
- Format traps (strict JSON schema required)
- Prompt injection (“Ignore the previous rules…”)
- Tool failure simulation (API 500, timeout) → expected: clear error + fallback strategy
- Run weekly and on major releases. Beyond PASS/FAIL, the aim is to harvest new red flags to add into the regression pack.
Gating:
- Block the PR if:
- Any critical property fails (e.g., JSON validity, PII guard)
- Regression pass rate < 90% (or your team’s threshold)
- Any defined catastrophic adversarial case fails
Quick starter kit
1) Minimal files
evals/<usecase>/rubric.json
— criteria + weights + thresholdevals/<usecase>/properties.json
— regexes, JSON schemas, forbidden termsprompts/<usecase>.prompt.md
— versioned instruction setdata/rag/...
— referenced knowledge base
2) Minimal metrics (table friendly)
- Date, Use case, Model/Version, Prompt hash, N, Pass rate, Avg score, Stdev, Drift vs. 7-day, Avg latency (ms), Cost (unit/call × N)
3) Baseline rules
- Rubric threshold: pre-agreed (e.g., 8/10)
- N: PR=5, nightly=20 (starting values)
- Critical properties: JSON valid, privacy/PII, format, number of references, etc.
- Gates: if a critical property fails → PR blocked
Example snippet (illustrative)
evals/bug_explainer/rubric.json
{
"threshold": 8,
"criteria": [
{"name": "root_cause_clarity", "weight": 3, "pattern": "undefined|null"},
{"name": "num_causes", "weight": 2, "min_count": 2},
{"name": "num_steps", "weight": 2, "min_count": 2},
{"name": "conciseness", "weight": 1, "max_words": 180},
{"name": "no_false_claims", "weight": 2, "forbidden": ["guaranteed", "100%"]}
]
}
evals/bug_explainer/properties.json
{
"must_include": ["undefined"],
"must_have_numbered_list": true,
"max_words": 180,
"forbid": ["guaranteed solution"]
}