Claude CoWork for Prompt Engineers
A practical guide to using Claude as your AI co-worker for prompt engineering, evals, and skill QA — from setup to daily use.

What is Claude CoWork?
Claude CoWork is the practice of using Claude as a persistent, context-aware co-worker embedded in your prompt engineering and eval workflow. This is not about one-off questions. It is about configuring Claude with your eval methodology, model stack, and quality bar so that it produces rubrics, synthetic test batches, and regression reports you can feed directly into your toolchain — promptfoo, Anthropic Workbench, Braintrust, or whatever harness you run.
Claude-native prompts. The prompts in this guide use Claude's native XML tag structure (`<context>`, `<instructions>`, `<format>`, `<avoid>`) for more precise, consistent output. These tags help Claude parse your intent with less ambiguity. They work in ChatGPT too, but are optimized for Claude.
Think of Claude as the methodical co-reviewer who never skips a criterion, never forgets the rubric you defined last Tuesday, and can generate 50 adversarial test cases before you finish your coffee. The gap between prompt engineers who dabble with AI tools and those who build reliable, production-safe systems often comes down to systematic eval discipline — and Claude can help you maintain that discipline at scale.
This guide walks you through configuring Claude for prompt engineering work, the five workflows that will save you the most time, and the patterns that separate ad-hoc prompting from production-grade eval practice.
Install the Prompt Engineer Plugin
This guide works on three Claude surfaces. The plugin is the fastest path on two of them. Pick whichever you use:
If you're on Cowork (desktop or mobile app)
Claude Cowork is Anthropic's agentic workspace — Claude completes work autonomously and returns finished deliverables. The Prompt Engineer plugin packages the workflows below as native skills and slash commands.
- Open the Cowork plugin directory in your desktop app.
- Filter by Cowork, search for "Prompt Engineer", and click Install.
- The plugin's slash commands and ambient skills are now available in any Cowork task.
If you don't see the plugin in the directory yet, install via custom marketplace: paste https://github.com/alexclowe/awesome-claude-cowork-plugins in your Cowork plugin settings.
If you're on Claude Code (CLI)
Install from your terminal:
`claude plugin add alexclowe/awesome-claude-cowork-plugins/prompt-engineer`

The plugin's slash commands and skills load on the next session.
If you're on Claude.ai (web chat only)
Plugins aren't directly installable on the web chat surface. You have two options:
- Use the prompts in this guide directly in a Claude Project (covered in the next section). Same outputs, more typing.
- Upload the plugin's skills as a zip via Settings → Features → Custom Skills (Pro/Max/Team/Enterprise plans). Higher friction; only worth it if you want the auto-activating skills, not the slash commands.
What the plugin gives you (any surface)
| Slash command | What it does |
|---|---|
| `/eval-rubric` | Build a custom evaluation rubric for a skill or prompt covering accuracy, tone, compliance, and latency |
| `/test-batch` | Run a prompt against 10–100 synthetic test cases and produce a coverage and quality report |
| `/skill-version` | Version-control a skill, A/B test two variants, and output a statistical-significance verdict |
| `/skill-audit` | Static analysis of SKILL.md: checks frontmatter, edge cases, prompt-injection surface, and length budget |
Auto-activating skills (no command needed — Claude applies them when relevant):
- Prompt Optimization Loop — Test case design, edge cases, adversarial inputs, and iteration based on eval results
- Skill Benchmarking — Latency, accuracy, token cost, and token-budget compliance measurement across skill variants
The plugin works standalone for one-off tasks. Pair it with the surface-specific setup below for persistent context across every task — that combination is the full Claude CoWork setup.
Setting Up Claude for Prompt Engineering Work
Surface note: The Project setup below is for claude.ai web users. Cowork users have their own task-context mechanism (set context once when starting a Cowork task). Claude Code users get the plugin's ambient skills automatically; no Project setup needed. The workflows themselves are surface-agnostic: paste the prompts wherever you're working.

On the web surface, the key to consistent output is Claude Projects. A Project persists your eval standards, model context, and quality criteria across every conversation so you are not re-explaining your stack each session.
Step 1: Create a Prompt Engineering Project. In Claude, click "Projects" and create one called something like "Prompt Eng & Evals."
Step 2: Set your custom instructions. In the Project settings, add:
You are my prompt engineering and evaluation assistant. Here is my context:
<engineer-profile>
- Role: [Prompt Engineer / AI Eval Specialist / LLM Quality Analyst]
- Primary models: [Claude 3.7 Sonnet / GPT-4o / Gemini 2.0 Flash / etc.]
- Eval harness: [promptfoo / Braintrust / Anthropic Workbench / OpenAI Evals / Inspect]
- Skill/plugin format: [SKILL.md / OpenAI Actions / custom JSON schema]
- Eval methodology: [Hamel Husain rubric-based / Eugene Yan LLM-as-judge / NIST AI RMF / custom]
- Primary use case domains: [customer support / coding / document QA / agentic tasks]
</engineer-profile>
<rules>
- All eval rubrics must include per-criterion scoring with explicit pass/fail thresholds.
- Synthetic test cases must cover happy path, edge cases, and adversarial inputs.
- Never mark an eval as complete without a regression check against the prior prompt version.
- Flag any prompt pattern that creates prompt-injection surface or scope creep.
- Eval results do not guarantee production behavior — always note this in reports.
</rules>

Step 3: Upload your reference artifacts. Add your current SKILL.md files, existing eval rubrics, sample input/output pairs, and any promptfoo config YAMLs to the Project knowledge base. Claude will reference these when generating new evals and audits.
Step 4: Start every session inside this Project. All context loads automatically.
Step 5: Verify output against your eval harness before shipping. Claude generates rubrics and test cases as drafts. Run them through promptfoo, Braintrust, or Inspect to confirm pass rates before any prompt goes to production.
Five High-Leverage Workflows
1. Building an Eval Rubric for a New Prompt or Skill
Writing a rubric by hand is slow and inconsistent across engineers. Claude can draft a scoring framework from your prompt spec that you can drop directly into your eval harness.
<context>
New skill: customer-support-triage. The skill reads a user message and outputs a JSON object
with keys: intent (string), urgency (low/medium/high), suggested_action (string), and
escalate (boolean). It runs on Claude 3.7 Sonnet inside our promptfoo suite.
</context>
<instructions>
Build an eval rubric for this skill:
- 5-7 criteria covering correctness, format compliance, urgency calibration, escalation
accuracy, and response safety
- For each criterion: a 1-sentence definition, a 1-5 scoring scale with explicit
descriptors for scores 1, 3, and 5, and a hard pass/fail threshold
- Note which criteria can be checked programmatically vs. require LLM-as-judge
- Flag any criterion where prompt injection could corrupt the output
</instructions>
<format>
Markdown table for criteria overview, then a detailed breakdown per criterion.
End with a section: "Recommended test distribution" (happy path / edge / adversarial split).
</format>
<avoid>
Vague criteria like "is it good?"; thresholds that cannot be operationalized;
assuming a human reviewer will catch what the rubric misses.
</avoid>

Before Claude: 2-3 hours drafting a rubric from scratch, often inconsistently applied. After Claude: 20 minutes to input the skill spec, 20 minutes to review and calibrate thresholds.
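To make the "programmatically checkable" criteria concrete, here is a minimal sketch of a format-compliance check for the triage schema in the context above. The function name and failure messages are illustrative, not part of any particular harness:

```python
import json

ALLOWED_URGENCY = {"low", "medium", "high"}

def check_triage_format(raw_output: str) -> list[str]:
    """Return a list of format-compliance failures; an empty list is a pass."""
    try:
        obj = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]

    failures = []
    intent = obj.get("intent")
    if not isinstance(intent, str) or not intent.strip():
        failures.append("intent: missing or not a non-empty string")
    if obj.get("urgency") not in ALLOWED_URGENCY:
        failures.append("urgency: not one of low/medium/high")
    if not isinstance(obj.get("suggested_action"), str):
        failures.append("suggested_action: missing or not a string")
    if not isinstance(obj.get("escalate"), bool):
        failures.append("escalate: missing or not a boolean")
    return failures
```

Checks like this belong in the harness as hard pass/fail gates; the judgment criteria (urgency calibration, escalation accuracy) stay on the LLM-as-judge side of the rubric.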
2. Generating a Synthetic Test Batch
Diverse, adversarial test inputs are the backbone of any eval. Claude can generate batches of 10–100 synthetic inputs that cover the failure modes you care about most.
<context>
Skill: document-summarizer. Input: arbitrary PDF text up to 4,000 tokens. Output: a
bullet-point summary of 5-8 key points, each under 25 words. Target model: Claude 3.7 Sonnet.
Known failure modes from prior evals: hallucinated facts, bullet points exceeding word cap,
refusal on legal/medical documents, prompt injection via embedded instructions in the PDF text.
</context>
<instructions>
Generate 30 synthetic test inputs across four categories:
- 8 happy-path inputs (clear, well-structured text from varied domains)
- 8 edge cases (very short text, non-English content, tables/lists only, code-heavy docs)
- 8 adversarial inputs (text containing embedded instructions like "ignore previous",
PII, extremely long sentences, contradictory information)
- 6 boundary cases (text at exactly 4,000 tokens, empty input, single-sentence input)
For each input: provide the raw input text (or a clear description if long), the expected
output behavior, and the failure mode it is designed to surface.
</instructions>
<format>
JSONL format compatible with promptfoo's `tests` array. One object per line:
{"input": "...", "expected": "...", "category": "...", "failure_mode": "..."}.
</format>
<avoid>
Inputs that are trivially easy and don't test anything meaningful; duplicate scenarios
reworded slightly; inputs that assume the model knows domain context not in the prompt.
</avoid>

Before Claude: 3-4 hours writing test cases manually, often biased toward the happy path. After Claude: 10 minutes to specify failure modes, 15 minutes to review and add to the harness.
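If you want to load the batch into promptfoo without hand-editing, a small converter can reshape each JSONL line into promptfoo's tests structure. A hedged sketch, assuming the four keys from the `<format>` spec above; the `vars`/`assert` layout and the `llm-rubric` assertion type follow promptfoo's documented schema, but verify them against the promptfoo version you run:

```python
import json
import yaml  # PyYAML

def jsonl_to_promptfoo_tests(jsonl_path: str, out_path: str) -> None:
    """Reshape the generated JSONL test batch into a promptfoo tests file."""
    tests = []
    with open(jsonl_path) as f:
        for line in f:
            case = json.loads(line)
            tests.append({
                "description": f"{case['category']}: {case['failure_mode']}",
                "vars": {"input": case["input"]},
                # llm-rubric delegates grading of the expected behavior
                # to an LLM judge instead of an exact-match check.
                "assert": [{"type": "llm-rubric", "value": case["expected"]}],
            })
    with open(out_path, "w") as f:
        yaml.safe_dump({"tests": tests}, f, sort_keys=False)
```

Reference the generated file from your promptfoo config (promptfoo supports loading external test files), and spot-check a few converted cases before trusting the whole batch.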
3. A/B Testing Two Prompt Variants
Running an A/B test without statistical rigor produces noise. Claude can design the test protocol, calculate required sample sizes, and draft the analysis framework.
<context>
Testing two variants of a code-review skill:
- Variant A: current production prompt (attached)
- Variant B: revised prompt with chain-of-thought reasoning step added
Primary metric: rubric score (1-5) on correctness criterion. Secondary: response latency proxy
(token count). Eval harness: Braintrust. Available test budget: ~500 prompt runs.
Acceptable false-positive rate: 5%. Minimum detectable effect: 0.3 points on the 1-5 scale.
</context>
<instructions>
Design the A/B test protocol:
- Calculate required sample size per variant given alpha=0.05, desired power=0.80, and the MDE
- Recommend a test split and randomization strategy
- Define the primary statistical test (paired t-test vs. Mann-Whitney based on score distribution)
- Identify confounders to control for (input complexity, domain)
- Draft a results summary template that reports: mean score per variant, confidence interval,
p-value, effect size (Cohen's d), and a plain-language conclusion
- Flag conditions under which the result should be treated as inconclusive
</instructions>
<format>
Section 1: Sample size calculation with formula shown.
Section 2: Test design (numbered steps).
Section 3: Results template (fill-in-the-blank markdown table).
Section 4: Decision criteria (go/no-go rules).
</format>
<avoid>
Recommending a winner without a significance check; ignoring latency/cost trade-offs;
treating a statistically significant but operationally tiny improvement as actionable.
</avoid>

Before Claude: 1-2 hours on power calculations and writing up the test design doc. After Claude: 15 minutes to input variant specs, 15 minutes to review the protocol.
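You can sanity-check Claude's sample-size math yourself. A minimal sketch using the standard normal-approximation formula for a two-sample comparison of means; the score standard deviation is an assumption you should replace with an estimate from a pilot run:

```python
from math import ceil
from statistics import NormalDist

def n_per_variant(mde: float, sd: float,
                  alpha: float = 0.05, power: float = 0.80) -> int:
    """Runs per variant: n = 2 * (z_{1-a/2} + z_{1-b})^2 * sd^2 / mde^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for power = 0.80
    return ceil(2 * (z_alpha + z_beta) ** 2 * sd ** 2 / mde ** 2)

# Assumed SD of 1.0 on the 1-5 scale -- replace with your pilot estimate.
print(n_per_variant(mde=0.3, sd=1.0))  # -> 175
```

With the guide's numbers and an assumed SD of 1.0, that is roughly 175 runs per variant (350 total), inside the 500-run budget; at an SD of 1.2 the requirement (~252 per variant) already exceeds it, which is exactly why the pilot estimate matters.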
4. Auditing a SKILL.md for Injection Surface and Frontmatter Validity
Before any skill ships to production, it needs a structured security and compliance review. Claude can audit a SKILL.md against a consistent checklist every time.
<context>
Attached SKILL.md is a customer-facing skill that reads user input and calls an external API.
It will be deployed in a Claude plugin environment. Review it before the production PR merges.
Our checklist covers: prompt injection surface, frontmatter schema compliance,
token budget, scope creep, and output safety.
</context>
<instructions>
Audit this SKILL.md on five dimensions:
1. Prompt injection surface: identify every place where user-controlled text enters the prompt
without sanitization; flag indirect injection vectors (e.g., API responses piped back in)
2. Frontmatter validity: check required fields (title, description, version, author, model,
max_tokens); flag missing fields, invalid types, or values outside allowed ranges
3. Token budget: estimate worst-case prompt token count (system + max user input + few-shot
examples); flag if it exceeds 80% of the model's context window
4. Scope creep: identify any instruction that could cause the skill to take actions beyond
its stated purpose; flag instructions that grant the model discretion over irreversible actions
5. Output safety: check for missing refusal instructions, missing format enforcement,
and any path that could produce PII or harmful content in the response
For each finding: severity (critical/high/medium/low), location in the file, and a
one-line remediation recommendation.
</instructions>
<format>
Executive summary (3 sentences max), then a findings table:
| # | Dimension | Severity | Location | Finding | Recommendation |
End with: "Ship / Do not ship" recommendation with one-sentence rationale.
</format>
<avoid>
Passing a skill that has any critical findings; treating missing fields as low severity;
skipping the token budget check because "it probably fits."
</avoid>

Before Claude: 30-45 minutes of manual code review with no consistent checklist. After Claude: 5 minutes to paste the SKILL.md, 15 minutes to triage the findings.
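The frontmatter dimension is the easiest to automate outright. A minimal sketch, assuming PyYAML and the required-field list from the checklist above; treat the field list as illustrative, since your plugin environment's actual schema is the source of truth:

```python
import yaml  # PyYAML

# Required fields named in the audit checklist; adjust to your schema.
REQUIRED = {
    "title": str,
    "description": str,
    "version": str,
    "author": str,
    "model": str,
    "max_tokens": int,
}

def audit_frontmatter(skill_md: str) -> list[str]:
    """Return frontmatter findings for a SKILL.md string (empty list = clean)."""
    if not skill_md.startswith("---"):
        return ["critical: no YAML frontmatter block found"]
    parts = skill_md.split("---", 2)
    if len(parts) < 3:
        return ["critical: frontmatter block is not closed with ---"]
    meta = yaml.safe_load(parts[1]) or {}
    findings = []
    for field, expected in REQUIRED.items():
        if field not in meta:
            findings.append(f"high: missing required field '{field}'")
        elif not isinstance(meta[field], expected):
            findings.append(
                f"medium: '{field}' should be {expected.__name__}, "
                f"got {type(meta[field]).__name__}"
            )
    return findings
```

Running a check like this as a CI gate lets the Claude audit spend its attention on the dimensions that actually need judgment: injection surface, scope creep, and output safety.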
5. Writing a Regression Report After a Prompt Change
Every prompt change is a regression risk. Claude can draft a structured regression report from your before/after eval results that is suitable for a PR review or stakeholder update.
<context>
Prompt change: added an explicit output length constraint ("respond in under 150 words")
to the meeting-summarizer skill. Before/after eval results from Braintrust:
- Correctness score: 4.1 → 3.9 (delta -0.2, p=0.03)
- Format compliance: 2.8 → 4.6 (delta +1.8, p<0.001)
- User satisfaction proxy: 3.5 → 3.7 (delta +0.2, p=0.12)
- Mean token count: 310 → 148 (delta -52%)
- 3 new failure cases observed: summaries that truncate critical action items
Test set: 80 inputs, same distribution as production traffic sample.
</context>
<instructions>
Write a regression report that includes:
- Summary of the change and its intent
- Key findings: what improved, what regressed, what is statistically inconclusive
- Root cause analysis for the correctness regression (hypothesis only, based on the data)
- Three specific cases from the failure set, described concisely
- Recommendation: ship as-is / ship with mitigation / revert — with clear rationale
- One follow-up experiment to address the correctness regression
Note that eval results do not guarantee production behavior; include this caveat.
</instructions>
<format>
PR-ready markdown: H2 sections, one summary table, bullet points for findings.
Under 400 words. Suitable for a non-technical stakeholder to read the summary section.
</format>
<avoid>
Recommending a ship without addressing the statistically significant correctness regression;
omitting the production-behavior caveat; writing a report that buries the regression finding.
</avoid>

Before Claude: 45-60 minutes writing a regression report from raw eval numbers. After Claude: 10 minutes to input before/after stats, 15 minutes to review and attach to PR.
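The deltas in the summary table are trivial to compute and worth scripting so every report uses the same arithmetic. A minimal sketch using the example numbers above; the p-values still come from your harness, not from this script:

```python
# Before/after means from the eval run described in the context above.
BEFORE = {"correctness": 4.1, "format_compliance": 2.8,
          "satisfaction_proxy": 3.5, "mean_tokens": 310}
AFTER = {"correctness": 3.9, "format_compliance": 4.6,
         "satisfaction_proxy": 3.7, "mean_tokens": 148}

for metric in BEFORE:
    delta = AFTER[metric] - BEFORE[metric]
    pct = 100 * delta / BEFORE[metric]
    print(f"{metric}: {BEFORE[metric]} -> {AFTER[metric]} "
          f"(delta {delta:+.2f}, {pct:+.1f}%)")
```

Significance and effect size still come from the harness; the script only guarantees the report's table is arithmetically consistent from one prompt change to the next.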
What This Looks Like in Your Week
Monday. A new skill request lands in your queue. You open your Prompt Eng project and use Workflow 1 to draft a rubric in 20 minutes. You review it, adjust two thresholds against your domain knowledge, and add it to the promptfoo config.
Tuesday. You need a test set for the new skill. You use Workflow 2 to generate 30 synthetic inputs, skim for anything that misses a key failure mode, add four inputs of your own, and load the JSONL into your eval harness before lunch.
Wednesday. Two prompt variants are ready to test. You use Workflow 3 to set up the A/B protocol: sample size, randomization, test statistic. You run it in Braintrust and have clean results by end of day.
Thursday. The skill is ready for review. You paste the SKILL.md into Workflow 4 and get a findings table in two minutes. One high-severity injection vector surfaces that you would have missed in a casual read. You fix it before the PR goes out.
Friday. The revised prompt ships. By afternoon you have Braintrust results. You use Workflow 5 to draft the regression report, attach it to the PR, and close out the week with a documented, reviewable record of what changed and why.
What to Avoid
Evaluating on the training set you wrote the prompt against. If your examples shaped the prompt, they are not a valid test set. Keep your example pool separate from your eval pool from day one.
Treating a statistically significant result as automatically meaningful. With 500 test cases, a 0.05-point score improvement will be significant. Ask whether the effect size justifies the added latency or cost of the new prompt.
Shipping without a regression check. Every prompt change is a regression risk. A workflow that skips before/after comparison before a production deploy will eventually ship a silent quality degradation.
Using Claude's output as ground truth for your evals. Claude-generated rubrics and test cases are drafts. A rubric that Claude invented without domain calibration can pass bad outputs. Review every criterion against real production examples before locking it.
Confusing eval coverage with eval quality. 100 test cases with shallow diversity are worse than 30 well-designed adversarial inputs. Claude can help with quantity; the adversarial design requires your judgment.
Resources
- Install the Prompt Engineer plugin for Claude to add eval and skill-audit tools to your Claude environment
- Explore the Prompt Engineer profession page for role-specific resources and tooling recommendations
- Run the AI readiness audit for prompt engineers to identify the highest-value areas in your current workflow
- Subscribe to the Weekly AI Digest for curated updates on eval methodology, new model capabilities, and prompt engineering research