aka. Testing “Software 3.0”
Goals
- Understand how AI-built (“vibe-coded”) systems fail and how to test them.
- Keep what works from classic QA; replace what doesn’t.
- Ship a minimal evaluation stack: goldens, rubrics, properties, N-sample runs, prompt/data diffs, and CI gates.
- Operate day-to-day in AI-assisted teams (pair-programming tools, agent workflows, human-in-the-loop UIs).
Concepts in Plain Language
Vibe-coded app
An app built rapidly with AI co-dev tools where parts of the behavior are described in natural language (prompts) rather than only in code. Specs are fuzzy; outputs can vary.
AI-generated component
Any feature whose output comes from an LLM/agent (summaries, code fixes, recommendations, form fills, chatbot flows, “autofix” buttons, etc.).
Key reality
Determinism drops. QA moves from exact matching to evaluation with thresholds, sampling, and guardrails.
What to Keep vs. What to Ditch
Area | Keep | Ditch / Replace |
Unit tests | Deterministic tests for pure functions, adapters, safety checks | String-snapshot tests for model text |
Integration | Contract tests for tools/APIs, schema validation | Single-path “happy path only” E2E |
Test strategy | Risk-based testing, exploratory testing, accessibility & UX heuristics | “Test once” mindset; assume one pass means stable |
CI/CD gates | Blocking on critical failures; rollbacks; canaries | Gating solely on code coverage % |
Data | Test data management, fixtures, PII handling | Untagged, unversioned RAG corpora |
Metrics | Latency, error budgets, availability SLOs | Pass/Fail without stability or drift tracking |
Review | Code review for prompts/evals like code | Unreviewed prompt tweaks in production |
Security | Threat modeling, input validation, output filtering | Trusting model outputs as “facts” |
QA Architecture for AI Systems
Layered model
- Deterministic core (keep classic tests)
- Business rules, calculations, feature flags
- Tool/API adapters (schema-strict)
- Model orchestration
- Prompt templates, system instructions, tool selection
- Evaluation & guardrails
- Goldens, rubrics, property checks (JSON schemas, PII filters)
- N-sample runs + stability metrics (pass rate, variance, drift, latency, cost)
- Human-in-the-loop (HITL) UI
- Diff view, autonomy slider, accept/reject, rollback
- Observability
- Prompt hash, model version, context snapshot, tool calls, grader scores
- Data plane
- Versioned RAG/data with lineage and access control
See details in this blogpost: How to effectively test AI? How to effectively test AI?
Minimal Evaluation Stack (ship this first)
1) Goldens
Reference outputs or key points for known inputs.
2) Rubrics
Weighted criteria (clarity, correctness, steps, citations). Threshold (e.g., ≥8/10) for PASS.
3) Property checks (binary)
- Valid JSON (if structured)
- Required fields present, ranges satisfied
- Forbidden content blocked (PII, unsafe ops, disallowed actions)
4) Stochastic runs (N samples)
- PR smoke: N=5; Nightly: N=20
- Track: pass rate, average score, stdev, drift, latency, cost
5) Prompt-diff & Data-diff in CI
- Any change to
/prompts
,/evals
, or/data
triggers only the relevant suites. - Show diffs in PR so reviewers see what changed in instructions/knowledge.
6) Suites
- Regression: don’t break what works
- Adversarial: try to break on purpose (vague inputs, schema traps, long inputs, multi-language, prompt injection, tool failures)
7) Gates
- Block merges if: critical property fails (e.g., JSON validity/PII), regression pass rate < 90%, or any “catastrophic” adversarial case fails.
Day-to-Day Workflow in AI-Assisted Teams
During planning / grooming
- Convert user stories for AI features into evaluation contracts: inputs, expected key points, rubric, properties, cost/latency budgets.
- Identify autonomy level (read-only suggestion → one-click apply → multi-step agent). Lower autonomy = lower risk.
During implementation
- Treat prompt files and data like code (branch, PR, review).
- Add evals alongside features; start with 5–10 golden cases.
- For structured output, enforce JSON schemas and reject/recover on invalid output.
During PR
- Auto-run evals on changed suites (N=5).
- Reviewer checks: prompt diff, rubric threshold met, properties pass, latency/cost within budget, no new PII risk.
Post-merge / pre-release
- Nightly N=20, drift alerts, adversarial pack weekly.
- Canary release: sample % of users, shadow traffic (if possible), kill switch ready.
Operations
- Log everything: prompt hash, model version, tool calls, grader scores, user decisions.
- Build error taxonomy (e.g., Hallucination, FormatError, ToolTimeout, PolicyBlock, InjectionSuspected).
- Make a weekly “eval health” report.
Failure Modes & How to Test Them
Failure | Symptom | Test/Guardrail |
Hallucination | Confident but wrong output | Rubric “no false claims”; require citations; cross-checks; second-model checker |
Format error | Invalid JSON / schema breach | JSON schema validator; auto-retry with “fix-format” prompt |
Prompt injection | Instructions overwritten by input | Adversarial injection cases; policy filter; instruction isolation |
Tool misuse | Wrong API call/arguments | Contract tests; mock tool failures; rate/permission limits |
Data leak | Sensitive text in prompt/output | PII scanners, redaction, allowlists, egress policies |
Drift | Quality degrades over time | Trend pass rate/score; weekly adversarial; canary alarms |
Latency/cost blow-ups | Timeouts, budget overruns | SLOs in CI; p95/p99 latency alarms; cost budget gates |
Multi-language confusion | Wrong language/format | Properties: response language, locale formats |
Over-autonomy | Risky actions without review | Autonomy slider, HITL approval, audit logs |
Human-in-the-Loop (HITL) Patterns Testers Should Demand
- Diff view (before/after code or text) with highlight
- Autonomy slider (suggest → auto with confirm → full auto)
- Accept / Reject / Edit with reason capture (for dataset feedback)
- Step-by-step trace of agent actions; each step reversible
- Audit log: who approved what, when, with which prompt/model
Data, Privacy, Security
Data classification
- Mark PII/Secrets. Redact before prompting.
- Maintain “no-go” inputs (legal/compliance).
Access & retention
- Separate dev/test/prod datasets.
- Expire logs containing user content; hash references to large payloads.
Safety
- Block unsafe actions (e.g., file system writes, external POSTs) unless explicitly allowed.
- Verify tool manifests and allowed domains.
Practical Labs
Lab 1: Build a minimal eval suite
- Create 10 goldens for a summarizer.
- Write
rubric.json
(clarity, factuality, length) andproperties.json
(must include 3 bullets, <150 words, no PII). - Run N=5, set threshold 8/10.
Lab 2: Break it (adversarial)
- Add 8 adversarial cases: very long input, language mix (EN+HU), JSON schema trap, injection string.
- Verify at least one fails; patch prompts/guards until passes.
Lab 3: Prompt-diff gate
- Change a single instruction; observe failing rubric; fix and re-run.
Lab 4: Tool failure drills
- Mock API 500/timeout; ensure model surfaces a clear, actionable error and retries or falls back.
Lab 5: HITL workflow
- Review 10 suggestions in a diff UI; track acceptance rate, average review time, reasons for rejection → feed back as new goldens/properties.
PR checklist (tester)
Team Roles & Routines
Roles
- Eval Engineer (QA+): owns eval suites, metrics, gates
- Data Steward: labels goldens, curates adversarial cases, PII policy
- Prompt Owner: maintains prompt files; pairs with QA on rubrics/properties
Rituals
- Weekly eval review: drift, new adversarial, false-positive/negative tuning
- Bug triage: use error taxonomy; every fix adds/updates a golden
- Post-incident: root cause → new property or rubric criterion