Testing Vibe-coded apps

a.k.a. Testing “Software 3.0”

Goals

  • Understand how AI-built (“vibe-coded”) systems fail and how to test them.
  • Keep what works from classic QA; replace what doesn’t.
  • Ship a minimal evaluation stack: goldens, rubrics, properties, N-sample runs, prompt/data diffs, and CI gates.
  • Operate day-to-day in AI-assisted teams (pair-programming tools, agent workflows, human-in-the-loop UIs).

Concepts in Plain Language

Vibe-coded app

An app built rapidly with AI co-dev tools where parts of the behavior are described in natural language (prompts) rather than only in code. Specs are fuzzy; outputs can vary.

AI-generated component

Any feature whose output comes from an LLM/agent (summaries, code fixes, recommendations, form fills, chatbot flows, “autofix” buttons, etc.).

Key reality

Determinism drops. QA moves from exact matching to evaluation with thresholds, sampling, and guardrails.
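
A minimal sketch of the shift, assuming a `generate` model call and a `grade` function that scores an output 0–10 (both are placeholders, not a prescribed API): instead of asserting one exact string, sample several outputs and gate on the pass rate.

```python
from typing import Callable

def evaluate(prompt: str,
             generate: Callable[[str], str],
             grade: Callable[[str], float],
             n: int = 5,
             threshold: float = 8.0) -> bool:
    """Sample the model n times and gate on the pass rate, not on one exact string."""
    scores = [grade(generate(prompt)) for _ in range(n)]      # grader returns 0-10
    pass_rate = sum(score >= threshold for score in scores) / n
    return pass_rate >= 0.9                                    # e.g. 90% of samples must pass
```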

What to Keep vs. What to Ditch

  • Unit tests. Keep: deterministic tests for pure functions, adapters, and safety checks. Ditch/replace: string-snapshot tests for model text.
  • Integration. Keep: contract tests for tools/APIs, schema validation. Ditch/replace: single-path “happy path only” E2E.
  • Test strategy. Keep: risk-based testing, exploratory testing, accessibility & UX heuristics. Ditch/replace: the “test once” mindset that assumes one passing run means stable behavior.
  • CI/CD gates. Keep: blocking on critical failures; rollbacks; canaries. Ditch/replace: gating solely on code coverage %.
  • Data. Keep: test data management, fixtures, PII handling. Ditch/replace: untagged, unversioned RAG corpora.
  • Metrics. Keep: latency, error budgets, availability SLOs. Ditch/replace: pass/fail without stability or drift tracking.
  • Review. Keep: code review for prompts/evals, treated like code. Ditch/replace: unreviewed prompt tweaks in production.
  • Security. Keep: threat modeling, input validation, output filtering. Ditch/replace: trusting model outputs as “facts”.

QA Architecture for AI Systems

Layered model

  1. Deterministic core (keep classic tests)
    • Business rules, calculations, feature flags
    • Tool/API adapters (schema-strict)
  2. Model orchestration
    • Prompt templates, system instructions, tool selection
  3. Evaluation & guardrails
    • Goldens, rubrics, property checks (JSON schemas, PII filters)
    • N-sample runs + stability metrics (pass rate, variance, drift, latency, cost)
  4. Human-in-the-loop (HITL) UI
    • Diff view, autonomy slider, accept/reject, rollback
  5. Observability
    • Prompt hash, model version, context snapshot, tool calls, grader scores
  6. Data plane
    • Versioned RAG/data with lineage and access control
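
A rough sketch of how these layers can hang together; `render_prompt`, `call_model`, and `guardrails` stand in for your own orchestration, model client, and eval code (all names here are assumptions, not a prescribed API).

```python
from dataclasses import dataclass, field
import hashlib
import time
from typing import Callable

@dataclass
class RunRecord:
    """Layer 5 observability record: enough to replay and audit one model call."""
    prompt_hash: str
    model_version: str
    context_snapshot: str
    tool_calls: list = field(default_factory=list)
    grader_scores: dict = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)

def run_feature(user_input: str,
                render_prompt: Callable[[str], str],   # layer 2: prompt template
                call_model: Callable[[str], str],      # layer 2: model call
                guardrails: Callable[[str], dict],     # layer 3: properties + rubric scores
                model_version: str = "model-v1") -> tuple[str, RunRecord]:
    prompt = render_prompt(user_input)
    output = call_model(prompt)
    scores = guardrails(output)
    record = RunRecord(
        prompt_hash=hashlib.sha256(prompt.encode()).hexdigest()[:12],
        model_version=model_version,
        context_snapshot=user_input,
        grader_scores=scores,
    )
    # Layer 4: a HITL UI decides whether `output` is applied, edited, or rejected.
    return output, record
```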

See details in this blog post:

How to effectively test AI?

Minimal Evaluation Stack (ship this first)

1) Goldens

Reference outputs or key points for known inputs.
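
For example, goldens for a summarizer can be as small as a list of inputs plus the key points an acceptable answer must mention (the field names and cases below are illustrative only).

```python
GOLDENS = [
    {
        "id": "invoice-summary-001",
        "input": "Customer reports double billing on the March invoice...",
        "key_points": ["double billing", "March invoice", "refund or credit"],
    },
    {
        "id": "bugfix-summary-002",
        "input": "Stack trace shows a NullPointerException in the checkout flow...",
        "key_points": ["NullPointerException", "checkout", "null check"],
    },
]

def covers_key_points(output: str, key_points: list[str]) -> bool:
    """Loose golden check: every key point appears somewhere in the output."""
    return all(point.lower() in output.lower() for point in key_points)
```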

2) Rubrics

Weighted criteria (clarity, correctness, steps, citations). Threshold (e.g., ≥8/10) for PASS.
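
A small sketch of such a rubric, assuming per-criterion scores of 0–10 from a human or LLM grader; the weights are examples, not a recommendation.

```python
# Illustrative weighted rubric; criterion names mirror the text above.
RUBRIC = {
    "clarity":     {"weight": 0.2},
    "correctness": {"weight": 0.4},
    "steps":       {"weight": 0.2},
    "citations":   {"weight": 0.2},
}

def rubric_score(criterion_scores: dict[str, float]) -> float:
    """Combine per-criterion scores (0-10) into one weighted 0-10 score."""
    return sum(RUBRIC[name]["weight"] * score for name, score in criterion_scores.items())

# PASS if the weighted score meets the threshold (>= 8/10).
assert rubric_score({"clarity": 9, "correctness": 8, "steps": 8, "citations": 9}) >= 8
```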

3) Property checks (binary)

  • Valid JSON (if structured)
  • Required fields present, ranges satisfied
  • Forbidden content blocked (PII, unsafe ops, disallowed actions)
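
Property checks can be ordinary deterministic functions; a sketch follows (the PII patterns are deliberately simplistic placeholders, real scanners cover far more).

```python
import json
import re

PII_PATTERNS = [r"\b\d{3}-\d{2}-\d{4}\b",          # SSN-like identifier (example only)
                r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"]    # e-mail address

def check_valid_json(output: str) -> bool:
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def check_required_fields(output: str, required: list[str]) -> bool:
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return all(key in data for key in required)

def check_no_pii(output: str) -> bool:
    return not any(re.search(pattern, output) for pattern in PII_PATTERNS)
```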

4) Stochastic runs (N samples)

  • PR smoke: N=5; Nightly: N=20
  • Track: pass rate, average score, stdev, drift, latency, cost
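
A minimal runner for this, assuming a `generate` model call and a 0–10 `grade` function as above; latency and cost can be collected in the same loop.

```python
import statistics
from typing import Callable

def run_samples(case: dict,
                generate: Callable[[str], str],
                grade: Callable[[str], float],
                n: int = 5) -> dict:
    """Run one eval case n times (PR smoke: n=5, nightly: n=20) and report stability metrics."""
    scores = [grade(generate(case["input"])) for _ in range(n)]
    return {
        "pass_rate": sum(score >= 8 for score in scores) / n,   # rubric threshold from above
        "avg_score": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if n > 1 else 0.0,
        # latency and cost can be captured the same way from the model client
    }
```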

5) Prompt-diff & Data-diff in CI

  • Any change to /prompts, /evals, or /data triggers only the relevant suites.
  • Show diffs in PR so reviewers see what changed in instructions/knowledge.
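
One way to pick only the relevant suites, assuming the prompts, evals, and data live in directories with those names and the CI job can run `git diff` against the base branch; the suite names are placeholders.

```python
import subprocess

# Map watched directories to the eval suites they should trigger (paths are assumptions).
SUITE_MAP = {"prompts/": "eval/prompt_suite",
             "evals/": "eval/regression",
             "data/": "eval/rag_suite"}

def changed_suites(base: str = "origin/main") -> set[str]:
    """Select only the suites relevant to files changed in this PR."""
    diff = subprocess.run(["git", "diff", "--name-only", f"{base}...HEAD"],
                          capture_output=True, text=True, check=True).stdout.splitlines()
    return {suite for path in diff
            for prefix, suite in SUITE_MAP.items()
            if path.startswith(prefix)}
```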

6) Suites

  • Regression: don’t break what works
  • Adversarial: try to break on purpose (vague inputs, schema traps, long inputs, multi-language, prompt injection, tool failures)

7) Gates

  • Block merges if: critical property fails (e.g., JSON validity/PII), regression pass rate < 90%, or any “catastrophic” adversarial case fails.
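
A sketch of such a gate as a small script CI runs after the eval suites, assuming the suites write their results into a dict with these (illustrative) keys; exiting non-zero is what blocks the merge.

```python
import sys

def gate(results: dict) -> None:
    """Exit non-zero so CI blocks the merge when any hard gate fails."""
    failures = []
    if not results["critical_properties_pass"]:            # e.g. JSON validity, PII filter
        failures.append("critical property failed")
    if results["regression_pass_rate"] < 0.90:
        failures.append(f"regression pass rate {results['regression_pass_rate']:.0%} < 90%")
    if results["catastrophic_adversarial_failures"] > 0:
        failures.append("catastrophic adversarial case failed")
    if failures:
        print("BLOCKED: " + "; ".join(failures))
        sys.exit(1)
    print("Gates passed.")
```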

Day-to-Day Workflow in AI-Assisted Teams

During planning / grooming

  • Convert user stories for AI features into evaluation contracts: inputs, expected key points, rubric, properties, cost/latency budgets.
  • Identify autonomy level (read-only suggestion → one-click apply → multi-step agent). Lower autonomy = lower risk.

During implementation

  • Treat prompt files and data like code (branch, PR, review).
  • Add evals alongside features; start with 5–10 golden cases.
  • For structured output, enforce JSON schemas and reject/recover on invalid output.
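
For the schema enforcement and recovery step, a sketch assuming the `jsonschema` package and a `call_model` function of your own; the schema itself is only an example.

```python
import json
from typing import Callable
from jsonschema import validate, ValidationError   # assumes the `jsonschema` package

SCHEMA = {  # example schema for a structured summarizer output
    "type": "object",
    "required": ["summary", "bullets"],
    "properties": {"summary": {"type": "string"},
                   "bullets": {"type": "array", "maxItems": 3}},
}

def structured_call(call_model: Callable[[str], str], prompt: str, max_retries: int = 2) -> dict:
    """Reject invalid output, retry with a fix-format instruction, then fail loudly."""
    for _ in range(max_retries + 1):
        raw = call_model(prompt)
        try:
            data = json.loads(raw)
            validate(instance=data, schema=SCHEMA)
            return data
        except (json.JSONDecodeError, ValidationError) as err:
            prompt = f"{prompt}\n\nYour last answer was invalid ({err}). Return only JSON matching the schema."
    raise ValueError("Model failed to produce schema-valid JSON")
```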

During PR

  • Auto-run evals on changed suites (N=5).
  • Reviewer checks: prompt diff, rubric threshold met, properties pass, latency/cost within budget, no new PII risk.

Post-merge / pre-release

  • Nightly N=20, drift alerts, adversarial pack weekly.
  • Canary release: sample % of users, shadow traffic (if possible), kill switch ready.

Operations

  • Log everything: prompt hash, model version, tool calls, grader scores, user decisions.
  • Build error taxonomy (e.g., Hallucination, FormatError, ToolTimeout, PolicyBlock, InjectionSuspected).
  • Make a weekly “eval health” report.
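
A minimal sketch of the taxonomy plus the weekly tally, with the categories taken from the list above.

```python
from collections import Counter
from enum import Enum

class ErrorKind(Enum):
    """Error taxonomy from the text; extend it as new failure modes show up."""
    HALLUCINATION = "Hallucination"
    FORMAT_ERROR = "FormatError"
    TOOL_TIMEOUT = "ToolTimeout"
    POLICY_BLOCK = "PolicyBlock"
    INJECTION_SUSPECTED = "InjectionSuspected"

def weekly_report(errors: list[ErrorKind]) -> dict[str, int]:
    """Tally logged errors by kind for the weekly 'eval health' report."""
    return {kind.value: count for kind, count in Counter(errors).items()}
```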

Failure Modes & How to Test Them

  • Hallucination. Symptom: confident but wrong output. Test/guardrail: rubric criterion “no false claims”; require citations; cross-checks; second-model checker.
  • Format error. Symptom: invalid JSON / schema breach. Test/guardrail: JSON schema validator; auto-retry with a “fix-format” prompt.
  • Prompt injection. Symptom: instructions overwritten by input. Test/guardrail: adversarial injection cases; policy filter; instruction isolation.
  • Tool misuse. Symptom: wrong API call/arguments. Test/guardrail: contract tests; mock tool failures; rate/permission limits.
  • Data leak. Symptom: sensitive text in prompt/output. Test/guardrail: PII scanners, redaction, allowlists, egress policies.
  • Drift. Symptom: quality degrades over time. Test/guardrail: trend pass rate/score; weekly adversarial runs; canary alarms.
  • Latency/cost blow-ups. Symptom: timeouts, budget overruns. Test/guardrail: SLOs in CI; p95/p99 latency alarms; cost budget gates.
  • Multi-language confusion. Symptom: wrong language/format. Test/guardrail: properties for response language and locale formats.
  • Over-autonomy. Symptom: risky actions without review. Test/guardrail: autonomy slider, HITL approval, audit logs.

Human-in-the-Loop (HITL) Patterns Testers Should Demand

  • Diff view (before/after code or text) with highlight
  • Autonomy slider (suggest → auto with confirm → full auto)
  • Accept / Reject / Edit with reason capture (for dataset feedback)
  • Step-by-step trace of agent actions; each step reversible
  • Audit log: who approved what, when, with which prompt/model

Data, Privacy, Security

Data classification

  • Mark PII/Secrets. Redact before prompting.
  • Maintain “no-go” inputs (legal/compliance).
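
A deliberately tiny redaction sketch for the “redact before prompting” step; the patterns are placeholders for whatever classifier or scanner the team actually uses.

```python
import re

REDACTIONS = {r"\b[\w.+-]+@[\w-]+\.[\w.]+\b": "<EMAIL>",   # e-mail addresses
              r"\b\d{3}-\d{2}-\d{4}\b": "<ID>"}            # SSN-like identifiers

def redact(text: str) -> str:
    """Replace marked PII with placeholders before the text ever reaches a prompt."""
    for pattern, placeholder in REDACTIONS.items():
        text = re.sub(pattern, placeholder, text)
    return text
```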

Access & retention

  • Separate dev/test/prod datasets.
  • Expire logs containing user content; hash references to large payloads.

Safety

  • Block unsafe actions (e.g., file system writes, external POSTs) unless explicitly allowed.
  • Verify tool manifests and allowed domains.

Practical Labs

Lab 1: Build a minimal eval suite

  • Create 10 goldens for a summarizer.
  • Write rubric.json (clarity, factuality, length) and properties.json (must include 3 bullets, <150 words, no PII).
  • Run N=5, set threshold 8/10.

Lab 2: Break it (adversarial)

  • Add 8 adversarial cases: very long input, language mix (EN+HU), JSON schema trap, injection string.
  • Verify that at least one case fails; patch prompts/guards until all of them pass.

Lab 3: Prompt-diff gate

  • Change a single instruction; observe failing rubric; fix and re-run.

Lab 4: Tool failure drills

  • Mock API 500/timeout; ensure model surfaces a clear, actionable error and retries or falls back.
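
A self-contained sketch of such a drill; `call_with_fallback` is an illustrative stand-in for the real orchestration entry point.

```python
from unittest.mock import MagicMock

def call_with_fallback(tool, request: str, retries: int = 1) -> str:
    """Stand-in for the orchestration layer: retry once, then fall back with a clear message."""
    for _ in range(retries + 1):
        try:
            return tool(request)
        except (TimeoutError, ConnectionError):
            continue
    return "The ticket service is temporarily unavailable; please retry in a minute."

def test_tool_timeout_surfaces_actionable_error():
    flaky_tool = MagicMock(side_effect=TimeoutError("upstream API timed out"))
    result = call_with_fallback(flaky_tool, "Summarize ticket #42")
    assert flaky_tool.call_count == 2                     # it retried once
    assert "temporarily unavailable" in result.lower()    # clear, actionable fallback
```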

Lab 5: HITL workflow

  • Review 10 suggestions in a diff UI; track acceptance rate, average review time, reasons for rejection → feed back as new goldens/properties.

PR checklist (tester)

  • Prompt diff reviewed; intent unchanged or justified
  • Rubric threshold met (≥8/10) on N=5
  • Critical properties PASS (JSON valid, PII blocked, language correct)
  • Latency p95 within SLO; cost within budget
  • Regression pass rate ≥90%; catastrophic adversarial cases PASS
  • Logs include prompt hash, model version, tool calls

Team Roles & Routines

Roles

  • Eval Engineer (QA+): owns eval suites, metrics, gates
  • Data Steward: labels goldens, curates adversarial cases, PII policy
  • Prompt Owner: maintains prompt files; pairs with QA on rubrics/properties

Rituals

  • Weekly eval review: drift, new adversarial, false-positive/negative tuning
  • Bug triage: use error taxonomy; every fix adds/updates a golden
  • Post-incident: root cause → new property or rubric criterion