Testing Vibe-coded apps

aka. Testing “Software 3.0”

Goals

Understand how AI-built (“vibe-coded”) systems fail and how to test them.
Keep what works from classic QA; replace what doesn’t.
Ship a minimal evaluation stack: goldens, rubrics, properties, N-sample runs, prompt/data diffs, and CI gates.
Operate day-to-day in AI-assisted teams (pair-programming tools, agent workflows, human-in-the-loop UIs).

Concepts in Plain Language

Vibe-coded app

An app built rapidly with AI co-dev tools where parts of the behavior are described in natural language (prompts) rather than only in code. Specs are fuzzy; outputs can vary.

AI-generated component

Any feature whose output comes from an LLM/agent (summaries, code fixes, recommendations, form fills, chatbot flows, “autofix” buttons, etc.).

Key reality

Determinism drops. QA moves from exact matching to evaluation with thresholds, sampling, and guardrails.

What to Keep vs. What to Ditch

Area	Keep	Ditch / Replace
Unit tests	Deterministic tests for pure functions, adapters, safety checks	String-snapshot tests for model text
Integration	Contract tests for tools/APIs, schema validation	Single-path “happy path only” E2E
Test strategy	Risk-based testing, exploratory testing, accessibility & UX heuristics	“Test once” mindset; assume one pass means stable
CI/CD gates	Blocking on critical failures; rollbacks; canaries	Gating solely on code coverage %
Data	Test data management, fixtures, PII handling	Untagged, unversioned RAG corpora
Metrics	Latency, error budgets, availability SLOs	Pass/Fail without stability or drift tracking
Review	Code review for prompts/evals like code	Unreviewed prompt tweaks in production
Security	Threat modeling, input validation, output filtering	Trusting model outputs as “facts”

QA Architecture for AI Systems

Layered model

Deterministic core (keep classic tests)

Business rules, calculations, feature flags
Tool/API adapters (schema-strict)

Model orchestration

Prompt templates, system instructions, tool selection

Evaluation & guardrails

Goldens, rubrics, property checks (JSON schemas, PII filters)
N-sample runs + stability metrics (pass rate, variance, drift, latency, cost)

Human-in-the-loop (HITL) UI

Diff view, autonomy slider, accept/reject, rollback

Observability

Prompt hash, model version, context snapshot, tool calls, grader scores

Data plane

Versioned RAG/data with lineage and access control

See details in this blogpost: How to effectively test AI? How to effectively test AI?

Minimal Evaluation Stack (ship this first)

1) Goldens

Reference outputs or key points for known inputs.

2) Rubrics

Weighted criteria (clarity, correctness, steps, citations). Threshold (e.g., ≥8/10) for PASS.

3) Property checks (binary)

Valid JSON (if structured)
Required fields present, ranges satisfied
Forbidden content blocked (PII, unsafe ops, disallowed actions)

4) Stochastic runs (N samples)

PR smoke: N=5; Nightly: N=20
Track: pass rate, average score, stdev, drift, latency, cost

5) Prompt-diff & Data-diff in CI

Any change to /prompts, /evals, or /data triggers only the relevant suites.
Show diffs in PR so reviewers see what changed in instructions/knowledge.

6) Suites

Regression: don’t break what works
Adversarial: try to break on purpose (vague inputs, schema traps, long inputs, multi-language, prompt injection, tool failures)

7) Gates

Block merges if: critical property fails (e.g., JSON validity/PII), regression pass rate < 90%, or any “catastrophic” adversarial case fails.

Day-to-Day Workflow in AI-Assisted Teams

During planning / grooming

Convert user stories for AI features into evaluation contracts: inputs, expected key points, rubric, properties, cost/latency budgets.
Identify autonomy level (read-only suggestion → one-click apply → multi-step agent). Lower autonomy = lower risk.

During implementation

Treat prompt files and data like code (branch, PR, review).
Add evals alongside features; start with 5–10 golden cases.
For structured output, enforce JSON schemas and reject/recover on invalid output.

During PR

Auto-run evals on changed suites (N=5).
Reviewer checks: prompt diff, rubric threshold met, properties pass, latency/cost within budget, no new PII risk.

Post-merge / pre-release

Nightly N=20, drift alerts, adversarial pack weekly.
Canary release: sample % of users, shadow traffic (if possible), kill switch ready.

Operations

Log everything: prompt hash, model version, tool calls, grader scores, user decisions.
Build error taxonomy (e.g., Hallucination, FormatError, ToolTimeout, PolicyBlock, InjectionSuspected).
Make a weekly “eval health” report.

Failure Modes & How to Test Them

Failure	Symptom	Test/Guardrail
Hallucination	Confident but wrong output	Rubric “no false claims”; require citations; cross-checks; second-model checker
Format error	Invalid JSON / schema breach	JSON schema validator; auto-retry with “fix-format” prompt
Prompt injection	Instructions overwritten by input	Adversarial injection cases; policy filter; instruction isolation
Tool misuse	Wrong API call/arguments	Contract tests; mock tool failures; rate/permission limits
Data leak	Sensitive text in prompt/output	PII scanners, redaction, allowlists, egress policies
Drift	Quality degrades over time	Trend pass rate/score; weekly adversarial; canary alarms
Latency/cost blow-ups	Timeouts, budget overruns	SLOs in CI; p95/p99 latency alarms; cost budget gates
Multi-language confusion	Wrong language/format	Properties: response language, locale formats
Over-autonomy	Risky actions without review	Autonomy slider, HITL approval, audit logs

Human-in-the-Loop (HITL) Patterns Testers Should Demand

Diff view (before/after code or text) with highlight
Autonomy slider (suggest → auto with confirm → full auto)
Accept / Reject / Edit with reason capture (for dataset feedback)
Step-by-step trace of agent actions; each step reversible
Audit log: who approved what, when, with which prompt/model

Data, Privacy, Security

Data classification

Mark PII/Secrets. Redact before prompting.
Maintain “no-go” inputs (legal/compliance).

Access & retention

Separate dev/test/prod datasets.
Expire logs containing user content; hash references to large payloads.

Safety

Block unsafe actions (e.g., file system writes, external POSTs) unless explicitly allowed.
Verify tool manifests and allowed domains.

Practical Labs

Lab 1: Build a minimal eval suite

Create 10 goldens for a summarizer.
Write rubric.json (clarity, factuality, length) and properties.json (must include 3 bullets, <150 words, no PII).
Run N=5, set threshold 8/10.

Lab 2: Break it (adversarial)

Add 8 adversarial cases: very long input, language mix (EN+HU), JSON schema trap, injection string.
Verify at least one fails; patch prompts/guards until passes.

Lab 3: Prompt-diff gate

Change a single instruction; observe failing rubric; fix and re-run.

Lab 4: Tool failure drills

Mock API 500/timeout; ensure model surfaces a clear, actionable error and retries or falls back.

Lab 5: HITL workflow

Review 10 suggestions in a diff UI; track acceptance rate, average review time, reasons for rejection → feed back as new goldens/properties.

PR checklist (tester)

Prompt diff reviewed; intent unchanged or justified

Rubric threshold met (≥8/10) on N=5

Critical properties PASS (JSON valid, PII blocked, language correct)

Latency p95 within SLO; cost within budget

Regression pass rate ≥90%; adversarial catastrophics PASS

Logs include prompt hash, model version, tool calls

Team Roles & Routines

Roles

Eval Engineer (QA+): owns eval suites, metrics, gates
Data Steward: labels goldens, curates adversarial cases, PII policy
Prompt Owner: maintains prompt files; pairs with QA on rubrics/properties

Rituals

Weekly eval review: drift, new adversarial, false-positive/negative tuning
Bug triage: use error taxonomy; every fix adds/updates a golden
Post-incident: root cause → new property or rubric criterion