
Claude CoWork for Data Scientists

A practical guide to using Claude as your AI co-worker for experiment design, dataset profiling, and model audits — from setup to daily use.

What is Claude CoWork?

Claude CoWork is the practice of using Claude as a persistent, knowledgeable co-worker embedded in your data science workflow. This is not about asking a chatbot to write pandas code. It is about configuring Claude with your experiment design standards, modeling constraints, and reporting audience so that it consistently produces power calculations, dataset profiles, and model audit documents you can actually use — not generic output you have to rewrite from scratch.

Claude-native prompts. The prompts in this guide use Claude's native XML tag structure (<context>, <instructions>, <format>, <avoid>) for more precise, consistent output. These tags help Claude parse your intent with less ambiguity. They work in ChatGPT too, but are optimized for Claude.

Data scientists carry a disproportionate amount of undocumented judgment: which test to run, what power level is acceptable, how to describe a model's limitations without alarming a stakeholder who does not know what confidence intervals mean. Claude can make that judgment visible and reproducible. The practical edge comes from treating Claude as the colleague who has read Kohavi et al.'s Trustworthy Online Controlled Experiments cover to cover and can draft a Model Card in the Mitchell et al. (2019) format in ten minutes while you focus on the analysis that requires your domain knowledge.

This guide covers configuration, the five workflows that will save you the most time, and the discipline anti-patterns that quietly undermine otherwise solid data science work.

Install the Data Scientist Plugin

This guide works on three Claude surfaces. The plugin is the fastest path on two of them. Pick whichever you use:

If you're on Cowork (desktop or mobile app)

Claude Cowork is Anthropic's agentic workspace — Claude completes work autonomously and returns finished deliverables. The Data Scientist plugin packages the workflows below as native skills and slash commands.

  1. Open the Cowork plugin directory in your desktop app.
  2. Filter by Cowork, search for "Data Scientist", and click Install.
  3. The plugin's slash commands and ambient skills are now available in any Cowork task.

If you don't see the plugin in the directory yet, install via custom marketplace: paste https://github.com/alexclowe/awesome-claude-cowork-plugins in your Cowork plugin settings.

If you're on Claude Code (CLI)

Install from your terminal:

claude plugin add alexclowe/awesome-claude-cowork-plugins/data-scientist

The plugin's slash commands and skills load on next session.

If you're on Claude.ai (web chat only)

Plugins aren't directly installable on the web chat surface. You have two options:

  1. Use the prompts in this guide directly in a Claude Project (covered in the next section). Same outputs, more typing.
  2. Upload the plugin's skills as a zip via Settings → Features → Custom Skills (Pro/Max/Team/Enterprise plans). Higher friction; only worth it if you want the auto-activating skills, not the slash commands.

What the plugin gives you (any surface)

| Slash command | What it does |
| --- | --- |
| /design-experiment | Plan an A/B test or experiment with sample size, power analysis, and recommended duration |
| /analyze-dataset | Generate a data quality report from a dataset description, flag outliers and biases, suggest transforms |
| /build-model | Recommend a model architecture, generate a scikit-learn or XGBoost template, document assumptions |
| /eval-model | Run cross-validation, generate a confusion matrix, audit feature importance for bias |

Auto-activating skills (no command needed — Claude applies them when relevant):

  • Dataset Profiling — Auto-scans for missing values, outliers, class imbalance, correlation issues, and schema drift
  • Model Card Generation — Auto-generates Model Cards (intended use, limitations, training data provenance) per HuggingFace and Mitchell et al. standard

The plugin works standalone for one-off tasks. Pair it with the surface-specific setup below for persistent context across every task — that combination is the full Claude CoWork setup.

Setting Up Claude for Data Science Work

Surface note: The Project setup below is for claude.ai web users. Cowork users have their own task-context mechanism (set context once when starting a Cowork task). Claude Code users get the plugin's ambient skills automatically — no Project setup needed. The workflows themselves are surface-agnostic — paste the prompts wherever you're working.

Step 1: Create a Data Science Project. In Claude, click "Projects" and create one called something like "DS Experiments & Models."

Step 2: Set your custom instructions. In the Project settings, add:

You are my data science and ML workflow assistant. Here is my context:

<ds-profile>
- Role: [Data Scientist / ML Engineer / Applied Scientist]
- Primary stack: [Python / R / SQL — scikit-learn / XGBoost / PyTorch / TensorFlow]
- Experiment tracking: [MLflow / Weights & Biases / Neptune / DVC]
- Data environment: [Snowflake / BigQuery / Redshift / local / S3]
- Primary modeling domains: [classification / regression / ranking / NLP / tabular]
- Stakeholder audience: [product managers / executives / engineering / regulators]
- Model Card format: [HuggingFace Hub template / internal standard / Mitchell et al. 2019]
</ds-profile>

<rules>
- Power calculations must state alpha, desired power, and MDE explicitly.
- Dataset profiles must flag missing data rates, class imbalance, and schema anomalies.
- Model audits are structured reviews, not formal certifications — always note this.
- Stakeholder summaries must avoid p-values and statistical jargon unless the audience is technical.
- Do not recommend a model architecture without noting trade-offs on interpretability and compute.
</rules>

Step 3: Upload reference artifacts. Add your team's experiment design template, existing Model Cards, sample dataset schemas, and any internal bias checklist to the Project knowledge base.

Step 4: Start every session inside this Project. Context loads automatically across conversations.

Step 5: Validate outputs against your experiment log and domain knowledge before shipping. Claude's power calculations and architecture recommendations are starting points — calibrate them against your historical effect sizes and infrastructure constraints before committing to a test design.

Five High-Leverage Workflows

1. Designing an A/B Test

Getting a power calculation right before launch saves weeks of wasted experimentation. Claude can design the full test protocol from your effect size assumptions and traffic constraints.

<context>
Product: recommendation feed. Proposed change: new collaborative filtering model replacing
the current item-based CF model. Primary metric: 7-day retention rate. Current baseline
retention: 34%. Minimum detectable effect: 2 percentage points (absolute). Available
daily traffic: ~50,000 users. Timeline constraint: 4 weeks maximum. Team uses a
two-tailed test with alpha=0.05 and desired power=0.80.
</context>

<instructions>
Design the A/B test:
- Calculate required sample size per variant (show the formula and intermediate values)
- Calculate required duration in days given daily traffic
- Recommend a traffic split (50/50 vs. uneven) with rationale
- List the top 3 confounders to monitor or stratify on (e.g., new vs. returning users, device type)
- Recommend a guardrail metric to monitor for unintended degradation
- Describe the statistical test to use at readout and why
- Note conditions under which to call the test early (stopping rules) and when not to
</instructions>

<format>
Section 1: Sample size calculation (formula + result).
Section 2: Duration estimate.
Section 3: Design decisions (numbered list).
Section 4: Guardrail metrics table.
Section 5: Readout checklist.
</format>

<avoid>
Peeking at results before the planned sample size is reached; recommending a one-tailed test
without explicit justification; omitting the effect size assumption that drives the whole design.
</avoid>

Before Claude: 45-60 minutes on power calculations, writeup, and stakeholder alignment doc. After Claude: 10 minutes to input parameters, 15 minutes to review and adapt to your context.
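To sanity-check the sample size Claude returns, the same calculation is a few lines of Python. A minimal sketch using statsmodels (an assumption on our part; any power library works) with the parameters from the context block above:

from math import ceil

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline, mde = 0.34, 0.02        # current 7-day retention and absolute MDE

# Cohen's h: variance-stabilized effect size for two proportions
effect = proportion_effectsize(baseline + mde, baseline)

# Per-variant sample size for a two-sided z-test, alpha=0.05, power=0.80
n = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80,
    ratio=1.0, alternative="two-sided",
)
print(f"required per variant: {ceil(n):,}")   # roughly 8,900

With ~50,000 users per day, enrollment would finish in under a day, so the real duration driver here is the 7-day maturation window for the retention metric plus at least one full weekly cycle. That is exactly the kind of nuance to verify in Claude's duration estimate rather than take on faith.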

2. Profiling a Fresh Dataset

Walking into a new dataset blind is how subtle data quality issues become production model failures. Claude can draft a systematic profiling checklist and interpret outputs from your profiling run.

<context>
New dataset: customer transaction records, ~2.4M rows, 28 features. Features include:
transaction_amount (float), merchant_category (categorical, 87 levels), customer_age (int),
days_since_last_transaction (int), is_fraud (binary target, provided for labeled subset only).
Initial pandas-profiling report is attached. The model task is fraud detection (binary classification).
We are concerned about class imbalance and data leakage from features computed post-transaction.
</context>

<instructions>
Produce a dataset profile covering:
1. Missing values: flag any feature with >5% missing; recommend imputation strategy or exclusion
2. Outliers: identify likely outlier candidates in numeric features; recommend detection method
   (IQR / z-score / isolation forest) appropriate for this domain
3. Class imbalance: calculate and report the imbalance ratio; recommend handling strategy
   (SMOTE / class weights / threshold calibration) with trade-off notes
4. Schema anomalies: flag features with unexpected cardinality, type mismatches, or
   constant/near-constant values
5. Leakage risk: identify any feature that could be computed only after the fraud event occurs;
   flag for removal before modeling
6. Recommended next steps: ordered list of data cleaning actions before model training
</instructions>

<format>
Issue log table: | Feature | Issue Type | Severity | Recommendation |
Followed by a "Next Steps" numbered list.
</format>

<avoid>
Recommending dropping a feature without noting what signal might be lost; treating a
leakage risk as low-severity; recommending oversampling without noting its effect on
probability calibration.
</avoid>

Before Claude: 1-2 hours writing a profiling checklist and interpreting outputs. After Claude: 10 minutes to paste summary stats, 20 minutes to triage the issue log.
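To produce the summary stats this prompt consumes, a small pandas helper covers the first four checks. A sketch, assuming a DataFrame df with the schema above; the function name and thresholds are illustrative, and the leakage check (item 5) still requires knowing when each feature is computed:

import pandas as pd

def quick_profile(df: pd.DataFrame, target: str = "is_fraud") -> pd.DataFrame:
    """Per-feature issue summary to paste into the profiling prompt."""
    rows = []
    for col in df.columns:
        s = df[col]
        issues = []
        missing_pct = s.isna().mean() * 100
        if missing_pct > 5:
            issues.append(f"missing {missing_pct:.1f}%")
        if s.notna().any():
            top_freq = s.value_counts(normalize=True).iloc[0]
            if top_freq > 0.99:
                issues.append("near-constant")
        if pd.api.types.is_numeric_dtype(s) and s.notna().any():
            q1, q3 = s.quantile(0.25), s.quantile(0.75)
            iqr = q3 - q1
            n_out = int(((s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)).sum())
            if n_out:
                issues.append(f"{n_out} IQR outliers")
        rows.append({"feature": col, "issues": "; ".join(issues) or "none"})

    # Class imbalance on the labeled subset only (target is partially labeled)
    counts = df[target].dropna().value_counts()
    print(f"class imbalance: {counts.max() / counts.min():.1f} : 1")
    return pd.DataFrame(rows)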

3. Recommending a Model Architecture

Choosing the right architecture for a new problem involves trade-offs that are easy to get wrong without a structured comparison. Claude can map the problem constraints to a shortlist of candidates with honest trade-off notes.

<context>
Problem: predict 30-day customer churn for a B2B SaaS product. Training set: 18,000 rows,
42 features (mix of numeric, categorical, and time-series aggregates). Features include
product usage signals, support ticket history, and contract metadata. No text or image data.
Constraints: model must be explainable to a non-technical customer success team;
inference runs in batch overnight; no GPU budget; scikit-learn and XGBoost are approved.
Relevant prior work: Grinsztajn et al. 2022 benchmarks on tabular data.
</context>

<instructions>
Recommend 3 candidate architectures for this problem:
- For each: name, brief description, expected performance range (qualitative), training cost
  estimate, inference cost, explainability method (SHAP / LIME / feature importance),
  and one concrete reason it might fail on this specific dataset
- Rank the three by suitability given the stated constraints
- Note whether tree-based methods are likely to outperform deep learning on this dataset
  size and feature type, referencing Grinsztajn et al. 2022
- Recommend a baseline model to establish before trying the top candidate
- Flag any feature engineering step that would be required regardless of architecture choice
</instructions>

<format>
Comparison table: | Model | Explainability | Training Cost | Failure Risk | Rank |
Followed by a "Recommended path" paragraph (3-4 sentences).
</format>

<avoid>
Recommending a neural network without justifying why it would outperform gradient boosting
on tabular data of this size; treating interpretability as optional when the audience
requirement is explicit; omitting the baseline recommendation.
</avoid>

Before Claude: 1 hour of literature review and trade-off notes for stakeholder alignment. After Claude: 15 minutes to specify constraints, 15 minutes to review the shortlist.
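The baseline step is worth operationalizing in code before any tuning. A sketch with scikit-learn and XGBoost, assuming an already-encoded feature matrix X and churn labels y (both placeholders):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

# Baseline: regularized logistic regression. If XGBoost cannot beat it
# by a margin that matters, ship the model the CS team can read directly.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
candidate = XGBClassifier(n_estimators=300, max_depth=4, eval_metric="logloss")

for name, model in [("logistic baseline", baseline), ("xgboost", candidate)]:
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: AUC {auc.mean():.3f} (std {auc.std():.3f})")

If the gap between the two is small, the logistic model's coefficients may be worth more to the customer success team than a marginal AUC gain.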

4. Auditing a Model for Bias and Writing a Model Card

Model bias audits and Model Cards are consistently under-resourced. Claude can structure both from your evaluation results, following Mitchell et al. 2019 and the HuggingFace Hub Model Card template.

<context>
Model: XGBoost binary classifier predicting loan default. Trained on 3 years of
historical loan application data. Protected attributes available for audit:
applicant_age_group (18-30, 31-50, 51+), gender (M/F/unknown), and zip_code_income_decile.
Eval metrics available: overall AUC 0.81, precision 0.74, recall 0.68.
Subgroup metrics attached (by age group and gender). The model is used to flag applications
for additional manual review — it does not make final credit decisions.
Intended audience for the Model Card: internal risk committee and external auditors.
</context>

<instructions>
Part 1 — Bias audit:
- Report AUC, precision, and recall gaps across subgroups; flag gaps exceeding 5 percentage
  points as potentially material
- Apply the 80% rule (four-fifths rule) to flag disparate impact on approval-rate proxy
- Identify whether the bias pattern suggests measurement bias, historical bias,
  or representation bias, with a one-sentence rationale for each finding
- Recommend one mitigation per material finding (re-weighting, threshold adjustment,
  feature audit) with a note on what it trades off

Part 2 — Model Card (Mitchell et al. 2019 structure):
- Model details, intended use, out-of-scope uses, training data summary, evaluation data,
  performance metrics table, ethical considerations, caveats and recommendations
- Note that this is a structured review document, not a formal regulatory certification
</instructions>

<format>
Part 1: Subgroup metrics table + findings list with severity labels.
Part 2: Model Card in markdown using H3 headers matching Mitchell et al. 2019 sections.
</format>

<avoid>
Declaring a model "fair" based on a single fairness metric; omitting the certification caveat;
writing a Model Card that describes intended use so broadly that out-of-scope uses are not
meaningfully constrained.
</avoid>

Before Claude: 3-4 hours writing a Model Card from scratch; bias audit often skipped. After Claude: 15 minutes to input metrics, 30 minutes to review and add context-specific language.
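The subgroup gap and four-fifths checks in Part 1 are mechanical enough to compute before you prompt. A sketch, assuming a DataFrame with true outcomes, model scores, review flags, and a subgroup column (all column names illustrative); note the four-fifths rule is conventionally applied to the favorable-outcome rate, so we treat "not flagged" as favorable:

import pandas as pd
from sklearn.metrics import roc_auc_score

def subgroup_audit(df: pd.DataFrame, group_col: str,
                   y_true: str = "defaulted", score: str = "model_score",
                   flagged: str = "flagged_for_review") -> pd.DataFrame:
    rows = []
    for grp, sub in df.groupby(group_col):
        rows.append({
            group_col: grp,
            "n": len(sub),
            # assumes each subgroup contains both outcome classes
            "auc": roc_auc_score(sub[y_true], sub[score]),
            "favorable_rate": 1 - sub[flagged].mean(),  # not flagged = favorable
        })
    out = pd.DataFrame(rows)

    # Gap check: flag subgroups more than 5 points below the best AUC
    out["auc_gap_flag"] = out["auc"].max() - out["auc"] > 0.05

    # Four-fifths rule: favorable rate below 80% of the best subgroup's rate
    out["four_fifths_flag"] = (
        out["favorable_rate"] / out["favorable_rate"].max() < 0.8
    )
    return out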

5. Translating Model Behavior for Stakeholders

The last mile between a model analysis and a decision is a clear plain-language summary. Claude can translate your technical results into a stakeholder-readable narrative without losing the substance.

<context>
Audience: VP of Product and two non-technical product managers. They will decide whether
to ship a new ranking model to 20% of users. The model improved NDCG@10 by 0.031 (from
0.412 to 0.443) in offline eval. Online A/B test: +1.8pp click-through rate (significant,
p=0.004), -0.4pp 7-day retention (not significant, p=0.11, 90% CI: -1.2pp to +0.4pp).
Inference latency increased by 14ms (p99). Rollout risk: if retention effect is real at
the low end of CI, it represents ~12,000 users per week at current scale.
</context>

<instructions>
Write a stakeholder summary that:
- Opens with a one-sentence bottom line (ship / do not ship / ship with conditions)
- Explains what NDCG@10 and click-through rate mean in plain language, without using
  those terms in the explanation
- Describes what the retention finding means: what is known, what is uncertain, and why
  it matters at scale — without using "confidence interval" or "p-value"
- States the specific risk at the low end of the uncertainty range in concrete user terms
- Provides a recommended decision path with one condition to monitor post-launch
- Is under 300 words
</instructions>

<format>
Bottom Line (1 sentence).
What we measured (2-3 sentences).
What is uncertain (2-3 sentences).
Recommendation (3-4 sentences).
</format>

<avoid>
Using statistical jargon without translation; framing uncertainty as "the test failed";
omitting the concrete user-impact number that makes the risk tangible to a non-technical reader.
</avoid>

Before Claude: 30-45 minutes translating results into a format that actually drives decisions. After Claude: 10 minutes to input the metrics, 10 minutes to review and personalize the tone.

What This Looks Like in Your Week

Monday. A new experiment request arrives. You use Workflow 1 to produce a test design doc: sample size, duration, guardrail metrics. You review the power calculation against your historical effect sizes and align with the product team before a single line of code is written.

Tuesday. A new dataset lands from a partner team. You use Workflow 2 to generate a structured profile while pandas-profiling runs. Before the morning standup, you have a prioritized issue log and a data cleaning plan.

Wednesday. You need to choose a modeling approach for a new churn prediction task. You use Workflow 3 to map constraints to a shortlist of three architectures. You confirm the Grinsztajn et al. benchmark logic applies to your dataset size and ship the recommendation to the team.

Thursday. The model is ready for pre-launch review. You use Workflow 4 to run a bias audit and generate the first draft of the Model Card. Two subgroup gaps surface. You document them in the Card and flag one for a threshold adjustment experiment.

Friday. The A/B test results are in and the VP of Product is waiting. You use Workflow 5 to translate the NDCG and retention findings into a 250-word stakeholder summary. The decision gets made in the review meeting, not a week later after a round of "can you explain what that number means?"

What to Avoid

Peeking at A/B results before the planned sample size is reached. Optional stopping inflates false positive rates badly. Set your readout date before the experiment starts and commit to it, even when early results look promising.
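The inflation is easy to demonstrate with an A/A simulation: no true effect exists, yet repeated interim looks "find" one far more often than the nominal 5%. A minimal sketch (parameters illustrative):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, looks = 2_000, [2_500, 5_000, 7_500, 10_000]

early_calls = 0
for _ in range(n_sims):
    # A/A test: both arms share the same 34% conversion rate
    a = rng.binomial(1, 0.34, looks[-1])
    b = rng.binomial(1, 0.34, looks[-1])
    for n in looks:                      # peek at four interim points
        if stats.ttest_ind(a[:n], b[:n]).pvalue < 0.05:
            early_calls += 1             # a "significant" result on pure noise
            break

print(f"false positive rate with peeking: {early_calls / n_sims:.1%}")
# a single pre-committed readout would hold this near 5%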

Running a power calculation with an optimistic MDE to shorten the test. Choosing an MDE because it gets you to the sample size you want, rather than the smallest effect that would actually change a product decision, is a form of p-hacking before the experiment starts.

Writing a Model Card after the fact as a compliance checkbox. A Model Card written during development catches out-of-scope use definitions and data limitations before they become incidents. Writing one at the end only documents what you already shipped.

Treating model audit results as a certification. Claude-assisted bias audits are structured reviews, not formal regulatory certifications. They surface material findings and document your methodology. Regulatory compliance determinations require qualified legal and compliance review.

Using a single fairness metric to declare a model unbiased. Demographic parity, equalized odds, and calibration can conflict with each other. Flag which metric you optimized for, why, and what the trade-off is — do not suppress the metrics that do not look favorable.
