Example output · Prompt Engineer AI

What the Regression Report Generator actually produces

Takes a prompt change description, eval results across multiple criteria, subgroup pass rates, failure samples, and a go/no-go recommendation — then writes a structured regression report with a headline decision up front, per-criterion delta analysis with threshold checks, and a root-cause failure mode breakdown.

Real output from this tool's prompt

Prompt Regression Report: v2.4 → v2.5

The input

Change Description: Refactored the system prompt for the customer-facing support assistant (v2.4 → v2.5) to improve conciseness. Key changes: (1) removed the explicit "always greet the user by name" instruction, (2) tightened the tone directive from "warm and empathetic" to "professional and efficient," and (3) added a hard instruction to keep all responses under 120 words.

Eval Results: Overall pass rate: 81.3% (v2.5) vs. 88.7% (v2.4), delta: -7.4pp.
Per-criterion breakdown:
• Accuracy (factual correctness): 94.1% → 95.0% (+0.9pp) ✅ above 90% threshold
• Conciseness (≤120 words): 61.2% → 89.4% (+28.2pp) ✅ above 85% threshold
• Tone appropriateness: 91.0% → 73.6% (-17.4pp) ❌ below 85% threshold
• Personalization (name usage, context recall): 88.3% → 54.7% (-33.6pp) ❌ below 80% threshold
• Task completion rate: 90.2% → 88.8% (-1.4pp) ✅ above 85% threshold
Subgroup results (v2.5):
• Billing queries: 79.1%
• Technical troubleshooting: 83.4%
• Account management: 71.2% ❌
• Complaints/escalations: 58.9% ❌

Failure Samples:
1. [Complaints/Escalations] User: "I've been double-charged three times this month and nobody has helped me." The v2.5 response opened with "To resolve a billing discrepancy, submit a refund request via the portal." Missing empathy, no name use, abrupt. Evaluator note: tone score 1/5, personalization 0/5.
2. [Account Management] User: "Hi, I'm Sarah, I set up auto-pay last week but it didn't go through." The v2.5 response never used the name "Sarah" and gave a generic troubleshooting list. Evaluator note: personalization 1/5.
3. [Tone] User asked a sensitive question about account suspension due to a family bereavement. v2.5 responded with a bullet-pointed procedural list with no acknowledgment of the situation. Evaluator note: tone 1/5, flagged as brand risk.

Recommendation: DO NOT SHIP v2.5 as written. Roll back to v2.4 for production. Initiate a targeted revision (v2.5.1) that retains the word-count constraint but restores the name-usage instruction and reintroduces empathetic tone guidance, particularly for complaint and account-management intents. Re-run the full eval suite before any further deployment. Proposed threshold gates for sign-off: Tone ≥ 85%, Personalization ≥ 80%, all subgroups ≥ 75%.

Audience: Internal ML/Prompt Engineering team and the Head of CX Product at Velarion Technologies
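The sign-off gates proposed in the recommendation (Tone ≥ 85%, Personalization ≥ 80%, all subgroups ≥ 75%) can be expressed as a small check. The following is a minimal sketch, not part of the tool itself; the function name `gate_failures` and the dictionary keys are hypothetical and chosen only for illustration.

```python
# Hypothetical sketch of the proposed sign-off gates; all names are illustrative.
CRITERION_GATES = {"tone": 85.0, "personalization": 80.0}  # minimum pass rate, %
SUBGROUP_FLOOR = 75.0  # every subgroup must reach this pass rate, %


def gate_failures(criteria, subgroups):
    """Return a list of gate violations; an empty list means sign-off is allowed."""
    failures = []
    for name, minimum in CRITERION_GATES.items():
        rate = criteria[name]
        if rate < minimum:
            failures.append(f"{name}: {rate:.1f}% < {minimum:.1f}%")
    for name, rate in subgroups.items():
        if rate < SUBGROUP_FLOOR:
            failures.append(f"{name}: {rate:.1f}% < {SUBGROUP_FLOOR:.1f}% floor")
    return failures


# v2.5 figures taken from the eval results above.
v25_criteria = {"tone": 73.6, "personalization": 54.7}
v25_subgroups = {
    "billing": 79.1,
    "technical_troubleshooting": 83.4,
    "account_management": 71.2,
    "complaints_escalations": 58.9,
}

for violation in gate_failures(v25_criteria, v25_subgroups):
    print(violation)
```

With the v2.5 numbers, four gates are violated (tone, personalization, account management, complaints/escalations), which matches the no-ship recommendation.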

Headline & Ship Decision

RECOMMENDATION: DO NOT SHIP v2.5. Roll back to v2.4 for production immediately. v2.5 achieves its design goal (conciseness: +28.2pp to 89.4%) but creates a critical regression in customer-facing quality, which shows up in the topline pass-rate delta of -7.4pp. The refactor traded empathy and personalization for brevity, causing a 17.4-point drop in tone appropriateness and a 33.6-point collapse in personalization; both now sit below their thresholds. Subgroup analysis reveals severe failures in complaint handling (58.9%) and account management (71.2%), both below the 75% subgroup floor and both high-stakes, brand-sensitive interactions. This change violates pre-committed guardrails on tone (≥85%) and personalization (≥80%) and introduces material risk to customer trust and retention. A targeted revision (v2.5.1) is recommended that restores empathetic framing and name-usage instructions while preserving the word-count constraint.

Per-Criterion Eval Analysis

OVERALL PASS RATE: 88.7% (v2.4) → 81.3% (v2.5), delta -7.4pp. This macro regression is the primary signal and runs counter to the intent of the change, which was meant to improve efficiency, not degrade overall quality.

PER-CRITERION RESULTS:
• Accuracy (factual correctness): 94.1% → 95.0%, delta +0.9pp. ✅ PASS. Remains above the 90% threshold. The refactor did not compromise factual content.
• Conciseness (≤120 words): 61.2% → 89.4%, delta +28.2pp. ✅ PASS. Exceeds the 85% threshold by 4.4pp. The word-count constraint is functioning as intended.
• Tone Appropriateness: 91.0% → 73.6%, delta -17.4pp. ❌ FAIL. Now sits 11.4pp below the pre-committed 85% threshold. This is a blocking regression. The shift from "warm and empathetic" to "professional and efficient" stripped emotional validation from responses, which is particularly damaging in high-sensitivity contexts (complaints, bereavement, escalations).
• Personalization (name usage, context recall): 88.3% → 54.7%, delta -33.6pp. ❌ FAIL. Catastrophic regression. Now sits 25.3pp below the 80% threshold. Removing the "always greet the user by name" instruction took away an explicit personalization cue; customers expect the assistant to acknowledge context they have provided.
• Task Completion Rate: 90.2% → 88.8%, delta -1.4pp. ✅ PASS. Above the 85% threshold. Transactional capability is minimally affected, but this masks the tone and personalization issues that drive customer satisfaction.

SUBGROUP ANALYSIS (v2.5):
• Billing queries: 79.1%. Below the 81.3% overall rate but above the 75% subgroup floor.
• Technical troubleshooting: 83.4%. Above both the overall rate and the 75% floor.
• Account Management: 71.2%. ❌ BLOCKING. Falls 3.8pp below the agreed 75% floor for subgroups. High-touch, relationship-sensitive domain where tone and personalization matter most.
• Complaints/Escalations: 58.9%. ❌ BLOCKING. Severe failure. Falls 16.1pp below the 75% floor. This is the highest-risk category; poor performance here is likely to drive churn and brand damage.

CONTRADICTION NOTE: The macro pass-rate decline (-7.4pp) appears to be driven primarily by the tone and personalization regressions, with a smaller contribution from task completion (-1.4pp). The design of the change did not anticipate these regressions. The instruction set was simplified on the assumption that "professional and efficient" would preserve quality while tightening output. Instead, removing the empathy and personalization instructions created a second-order effect: responses read as dismissive, which pulled down scores in the subgroups where emotional tone is the primary quality signal.
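The delta and threshold arithmetic in this section can be reproduced with a few lines of Python. This is a minimal sketch under stated assumptions: the `Criterion` class and its field names are hypothetical, and the figures are copied from the eval results in the input. Values are rounded to one decimal place to avoid floating-point noise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One eval criterion; the class and field names are illustrative."""
    name: str
    old: float        # v2.4 pass rate, %
    new: float        # v2.5 pass rate, %
    threshold: float  # minimum acceptable pass rate, %

    @property
    def delta(self) -> float:
        # Change in percentage points between versions.
        return round(self.new - self.old, 1)

    @property
    def margin(self) -> float:
        # Distance from the threshold; negative means below it.
        return round(self.new - self.threshold, 1)

    @property
    def passed(self) -> bool:
        return self.new >= self.threshold


CRITERIA = [
    Criterion("Accuracy", 94.1, 95.0, 90.0),
    Criterion("Conciseness", 61.2, 89.4, 85.0),
    Criterion("Tone appropriateness", 91.0, 73.6, 85.0),
    Criterion("Personalization", 88.3, 54.7, 80.0),
    Criterion("Task completion", 90.2, 88.8, 85.0),
]

for c in CRITERIA:
    status = "PASS" if c.passed else "FAIL"
    print(f"{c.name:<22} delta {c.delta:+6.1f}pp  margin {c.margin:+6.1f}pp  {status}")
```

The margins for the two failing criteria come out to -11.4pp (tone) and -25.3pp (personalization), the same shortfalls quoted in the analysis.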

Failure Mode Breakdown

FAILURE PATTERN 1: Tone Collapse in High-Sensitivity Contexts (Complaints/Escalations, Bereavement)
• Frequency: Observed in 2 of the 3 failure samples (the double-charge complaint and the bereavement query); the subgroup score of 58.9% implies a ~41% failure rate across the complaints/escalations category.
• Specific manifestations:
- Complaints: Responses opened with procedural directives ("submit a refund request via the portal") without acknowledging emotional content or validating frustration. The empathetic preamble is missing entirely.
- Bereavement: Bullet-pointed procedural list with no acknowledgment of the sensitive situation. Flagged as brand risk by evaluators.
• Root cause: Replacement of the "warm and empathetic" tone directive with "professional and efficient." The new phrasing optimizes for speed and clarity but no longer tells the model to acknowledge the customer's emotional state. Combined with the 120-word cap, the model prioritizes task resolution over relationship preservation.
• Severity: BLOCKING. Complaints and escalations are the highest-stakes customer touchpoints; failures here are likely to drive churn, negative reviews, and support escalation costs. Brand risk is explicitly noted.

FAILURE PATTERN 2: Loss of Personalization / Name Non-Usage
• Frequency: Observed in 2 of the 3 failure samples (the double-charge complaint, scored 0/5, and the Sarah example, scored 1/5). The personalization score of 54.7% (vs. baseline 88.3%) implies a ~45% failure rate on this criterion across the dataset.
• Specific manifestation: Customers provide their name explicitly; v2.5 responses fail to use it or reference prior context. Generic responses feel transactional and impersonal.
• Root cause: Direct removal of the "always greet the user by name" instruction. No alternative personalization directive was added. The 120-word cap may further penalize context recall (which takes additional words), but the primary cause is the missing instruction.
• Severity: BLOCKING. Personalization is a low-cost, high-impact quality lever in support. Losing it is a policy choice that contradicts customer expectations and reduces perceived care. The score falls 25.3pp below the 80% threshold.

FAILURE PATTERN 3: Subgroup Collapse in Account Management (71.2% pass rate)
• Frequency: One of two subgroups below the 75% floor. Account management queries typically involve context-heavy requests (auto-pay issues, settings changes) where name use and tone matter.
• Specific manifestation: Generic troubleshooting lists without acknowledgment of prior user actions or customer-specific context.
• Root cause: Combination of the tone and personalization regressions. Account management queries are relationship-sensitive and context-dependent; removing the name-usage and empathy instructions disproportionately hurts this category.
• Severity: BLOCKING. The 75% subgroup floor is a sign-off gate, so the 3.8pp miss is a policy breach.

SUMMARY: All three failure modes trace to deliberate instruction removals (empathy phrasing, name-usage directive) rather than execution errors or model instability. They are not edge-case bugs; they are systematic consequences of the prompt refactor. They can be remediated by restoring the removed instructions and refining the tone language so that it stays compatible with the 120-word constraint. None of the failures is acceptable for production; all are BLOCKING with respect to the ship decision.

What to edit for your situation

Replace the change description, eval percentages, subgroup labels, failure samples, and recommendation with your actual prompt diff and eval run output. Update the threshold values to match your team's pre-committed guardrails and the audience field to reflect who will receive the report.

Human review: Verify that all delta calculations, threshold pass/fail calls, and root-cause attributions are consistent with your actual eval data before sharing this report with stakeholders or using it to make a ship/no-ship decision.
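One way to support this review step is to recompute each reported figure from the raw pass rates and flag any row where the written report disagrees. The sketch below is illustrative only: `check_row`, the tuple layout, and the 0.05pp tolerance are assumptions, and the rows are the per-criterion figures from the example above.

```python
def check_row(name, old, new, reported_delta, threshold, reported_pass, tol=0.05):
    """Return a list of inconsistencies found in one reported criterion row."""
    problems = []
    computed = round(new - old, 1)
    if abs(computed - reported_delta) > tol:
        problems.append(
            f"{name}: reported delta {reported_delta:+.1f}pp, computed {computed:+.1f}pp"
        )
    if (new >= threshold) != reported_pass:
        verdict = "PASS" if reported_pass else "FAIL"
        problems.append(f"{name}: reported {verdict} disagrees with {threshold:.1f}% threshold")
    return problems


# (name, v2.4 %, v2.5 %, reported delta in pp, threshold %, reported pass?)
REPORTED = [
    ("Accuracy", 94.1, 95.0, 0.9, 90.0, True),
    ("Conciseness", 61.2, 89.4, 28.2, 85.0, True),
    ("Tone appropriateness", 91.0, 73.6, -17.4, 85.0, False),
    ("Personalization", 88.3, 54.7, -33.6, 80.0, False),
    ("Task completion", 90.2, 88.8, -1.4, 85.0, True),
]

issues = [problem for row in REPORTED for problem in check_row(*row)]
print("\n".join(issues) if issues else "All reported rows are consistent.")
```

A script like this only checks arithmetic and pass/fail calls; root-cause attributions still need a human reader.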
