Example output · Prompt Engineer AI
What the Eval Rubric Generator actually produces
Takes a description of a prompt or skill's purpose, its success criteria, and known failure modes, then generates a structured eval rubric with observable pass/fail criteria, edge case coverage, and a full run plan with kill criteria.
- Prompt Or Skill Purpose:
- Customer-facing order status chatbot skill — retrieves order state, estimated delivery, and handles "where is my order" intents for an e-commerce platform
- Success Criteria:
- Returns correct order status and ETA in under 2 sentences; always includes a tracking link; gracefully handles unknown order IDs without hallucinating a status
- Failure Modes:
- Hallucinating a delivery date when order ID is not found; omitting the tracking link; confusing order statuses (e.g., "shipped" vs "delivered"); responding to out-of-scope intents (e.g., returns) without a handoff
- Scoring Approach:
- LLM-as-judge (GPT-4o) with a binary pass/fail per criterion plus an aggregate 0–4 score; flag any response scoring ≤2 for human review
- Eval Data Available:
- 120 labeled golden examples: 40 valid order lookups, 30 invalid/unknown IDs, 25 edge-case statuses (delayed, lost, returned), 25 out-of-scope intents
{ "criteria": [ { "name": "Correct Order Status Returned", "measures": "Response states the actual order status (pending/processing/shipped/delivered/delayed/lost/returned) matching the backend system state. Does not confuse statuses (e.g., 'shipped' when status is 'delivered').", "scoring": "Binary (pass/fail)", "examples": { "pass": "User asks about order #12345 with status 'shipped'. Response: 'Your order #12345 is shipped and on the way. Tracking: [link]'", "fail": "User asks about order #12345 with status 'shipped'. Response: 'Your order has been delivered and arrived.' OR 'Your order is being processed.'" }, "threshold": "100% pass required — any status mismatch is a critical failure", "required": true }, { "name": "Accurate ETA or Delivery Date Provided", "measures": "Response includes estimated delivery date/window when available in backend. For orders with no ETA (e.g., pending), response explicitly states 'ETA not available' or 'will update soon' instead of inventing a date.", "scoring": "Binary (pass/fail)", "examples": { "pass": "Backend shows 'processing, no ETA yet'. Response: 'Still being prepared — ETA will update when shipped.'", "fail": "Backend has no ETA. Response: 'Your order arrives January 20.' (hallucinated date)" }, "threshold": "100% pass required — hallucinated dates cause customer confusion and erode trust", "required": true }, { "name": "Tracking Link Always Included", "measures": "Response includes a valid, clickable tracking link (or explicit URL) when order status is shipped/in-transit or later. Not required for pending/processing statuses if policy allows.", "scoring": "Binary (pass/fail)", "examples": { "pass": "Response: 'Your order is shipped. Track here: [tracking_url]'", "fail": "Response: 'Your order is shipped and on the way.' (no link provided)" }, "threshold": "95% pass required — missing links on shipped orders block the core task", "required": true }, { "name": "No Hallucination on Invalid Order ID", "measures": "When order ID is unknown/not found, response states 'Order not found' or equivalent. Does NOT invent a status, ETA, or tracking number. Gracefully offers next steps (verify ID, contact support).", "scoring": "Binary (pass/fail)", "examples": { "pass": "User: 'Where is order #FAKE123?'. Response: 'I couldn't find order #FAKE123 in our system. Please verify the order ID or contact support.'", "fail": "User: 'Where is order #FAKE123?'. Response: 'Your order #FAKE123 is being shipped and will arrive January 30.'" }, "threshold": "100% pass required — hallucination on invalid orders is a hard blocker", "required": true }, { "name": "Conciseness (≤2 sentences for valid lookups)", "measures": "Response answers the 'where is my order' question in ≤2 sentences for valid orders (status + ETA + link). Longer responses for error cases (invalid ID, handoff) are acceptable.", "scoring": "Binary (pass/fail)", "examples": { "pass": "Response: 'Your order #12345 is shipped and arriving Jan 15–17. Track it here: [link]'", "fail": "Response: 'Your order #12345 was placed on January 5th. We received payment successfully. Your item is currently being prepared by our warehouse team. Once it ships, you'll get a notification. The estimated delivery window is January 15–17, though this may vary based on your location. You can track your package here: [link]'" }, "threshold": "90% pass — verbose responses delay resolution but don't block it", "required": false }, { "name": "Out-of-Scope Intent Handled with Handoff", "measures": "When user intent is outside 'where is my order' scope (e.g., returns, refunds, complaints), response does NOT attempt to answer the out-of-scope question. Instead, acknowledges the request and offers handoff to appropriate team (human agent, returns portal, etc.).", "scoring": "Binary (pass/fail)", "examples": { "pass": "User: 'I want to return this order.' Response: 'I handle order tracking. For returns, please contact our support team or visit [returns_portal].'", "fail": "User: 'I want to return this order.' Response: 'To return an order, you need to email us within 30 days with your reason. We process returns in 5–7 business days...' (answers out-of-scope question)" }, "threshold": "100% pass required — answering out-of-scope questions creates support burden and customer confusion", "required": true }, { "name": "Response Length ≤2 sentences", "measures": "For orders with known status, total response is ≤2 sentences. Error/handoff responses (invalid ID, out-of-scope) may exceed this.", "scoring": "Binary (pass/fail)", "examples": { "pass": "2-sentence response for valid lookup", "fail": "4+ sentence response for valid lookup without substantive additional information" }, "threshold": "90% pass — diagnostic criterion to catch verbose drift", "required": false } ], "aggregate_scoring": "Pass: all required criteria pass. Fail: any required criterion fails. Optional criteria inform diagnostic scoring (0–4 scale) but do not block.", "judge_setup": { "approach": "GPT-4o as judge. Prompt template: 'You are evaluating an order status chatbot response. Given the backend order state [JSON], evaluate the chatbot's response against these criteria: [criteria]. For each criterion, output PASS or FAIL with a 1-sentence justification. Then output an aggregate PASS or FAIL.'", "variance_check": "Run judge 2× on 10% of eval set (12 examples). Flag disagreements for manual review. If >15% disagreement on a criterion across runs, escalate to human labeling for that criterion.", "uncertainty_fallback": "If judge response is ambiguous or meta-comments on its uncertainty, route to human review (do not auto-score)." } }
[ { "case": "Order ID found but status is 'lost' or 'undeliverable'", "criterion": "Correct Order Status Returned + No Hallucination on Invalid Order ID", "coverage": "Must return actual 'lost' status, not invent recovery date. Offer support escalation." }, { "case": "Order ID is valid but has no ETA (e.g., processing stage without shipment estimate)", "criterion": "Accurate ETA or Delivery Date Provided", "coverage": "Must explicitly state 'ETA not available yet' or similar, never invent a date." }, { "case": "Order ID is malformed, empty, or contains special characters", "criterion": "No Hallucination on Invalid Order ID", "coverage": "Must handle gracefully without assuming a status." }, { "case": "User asks 'Where is my order?' without providing an order ID", "criterion": "No Hallucination on Invalid Order ID", "coverage": "Must ask for order ID, not guess or invent one." }, { "case": "User asks about order status but also mentions a return/refund in the same message", "criterion": "Out-of-Scope Intent Handled with Handoff", "coverage": "Must answer the status question but explicitly separate the return request and offer handoff." }, { "case": "Order status is 'delayed' with revised ETA", "criterion": "Correct Order Status Returned + Accurate ETA or Delivery Date Provided", "coverage": "Must clearly communicate 'delayed' status and include revised ETA, not just original ETA." }, { "case": "Tracking link is broken or points to wrong carrier", "criterion": "Tracking Link Always Included", "coverage": "Judge must verify link format matches backend state. Human reviewer checks a sample of links for validity." }, { "case": "User provides order ID in non-standard format (e.g., lowercase, with spaces, mixed with text)", "criterion": "No Hallucination on Invalid Order ID", "coverage": "Must normalize or ask for clarification, not reject valid IDs due to formatting." } ]
{ "cadence": "Per-PR eval on staging; nightly regression test on production trace sample (50 recent interactions). On-demand eval for hotfixes.", "trigger": "PR opens → eval runs automatically on staging golden set. Merge blocked if required criterion fails. Nightly eval posts results to #bot-evals Slack.", "kill_criteria": [ "Any required criterion drops below 100% pass on golden set → block PR", "Required criterion drops below 95% pass on nightly production trace → page oncall, roll back last deploy", "Judge disagreement >15% on any criterion in 2× variance check → escalate to human labeling, hold eval" ], "communication": "PR eval results posted as GitHub check (pass/fail per criterion). Nightly results: Slack summary (% pass per criterion) + link to detailed log. Weekly digest: trends in optional criteria, high-failure edge cases.", "archival": "Store all eval runs (golden + production) in timestamped S3 bucket. Retain 90 days. Archive final pass/fail + judge reasoning (text) for each response. Link runs to git commit SHAs for regression tracking." }
Replace the sample prompt purpose, success criteria, failure modes, and eval dataset description with your own skill details. The scoring approach (LLM-as-judge, human review thresholds) can also be adjusted to match your stack and risk tolerance.
Human review: Verify that each generated criterion actually maps to a failure mode you care about, and that kill thresholds (e.g., 95% pass on production traces) are calibrated to your product's real risk level before wiring this into CI.
Generate this for your own situation — free.
5 runs a day, no credit card.
Try the Eval Rubric Generator