Example output · Data Scientist AI

What the A/B Test Experiment Design Generator actually produces

The generator takes a one-paragraph hypothesis, a primary metric, a baseline rate, an MDE (minimum detectable effect), available traffic, and test constraints. It outputs a full pre-committed experiment plan: a power analysis with sample size and duration, a design document with metric definitions and decision rules, and SRM (sample ratio mismatch) checks to run before and during the test.

Real output from this tool's prompt: Mobile sticky "Add to Cart" A/B test

The input

Hypothesis: Showing a sticky "Add to Cart" button on mobile product pages will increase purchase conversion rate compared to the current scroll-to-top button, because reducing friction at the point of intent will lower drop-off before checkout.
Primary Metric: Purchase conversion rate (orders / unique product page visitors)
Baseline Rate: 3.8%
MDE: 0.5 percentage points (relative ~13% lift)
Traffic Available: ~42,000 mobile product page visitors per week, 50/50 control/treatment split
Test Constraints: Max 4-week runtime; no major promotions or holiday periods during test; mobile only (iOS + Android); ship decision if p < 0.05 two-tailed, 80% power; exclude bot traffic via existing filter

Power Analysis & Duration

{
  "assumptions": {
    "alpha": 0.05,
    "power": 0.8,
    "test_type": "Two-tailed z-test for proportions",
    "baseline_conversion_rate": 0.038,
    "minimum_detectable_effect_absolute": 0.005,
    "effect_size_relative": "13.2% relative lift",
    "traffic_per_week": 42000,
    "control_treatment_split": "50/50"
  },
  "formula": "For two proportions, per arm: n = (z_alpha/2 + z_beta)^2 * (p0*(1-p0) + p1*(1-p1)) / (p1 - p0)^2, where p0=0.038, p1=0.043, z_alpha/2=1.96, z_beta=0.84",
  "calculation": {
    "p0_baseline": 0.038,
    "p1_treatment": 0.043,
    "effect_numerator": "(1.96 + 0.84)^2 * (0.038*0.962 + 0.043*0.957)",
    "effect_numerator_value": "7.8400 * (0.03656 + 0.04115)",
    "effect_numerator_computed": 0.60922,
    "effect_denominator": "(0.043 - 0.038)^2",
    "effect_denominator_computed": 0.000025,
    "sample_size_per_arm": 24369,
    "total_sample_size": 48738,
    "rounding_note": "Rounded up from 24,368.9 per arm"
  },
  "duration": {
    "weekly_traffic": 42000,
    "total_traffic_needed": 48738,
    "weeks_required": 1.2,
    "days_required": 8.1,
    "rounded_up_days": 9,
    "feasibility": "FEASIBLE. 9 days (1.3 weeks) is well within the 4-week constraint. Recommend running full 4 weeks to capture any day-of-week seasonality and provide buffer for unexpected traffic variance."
  }
}
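The per-arm sample size and duration can be cross-checked independently of the generator. The sketch below is one way to do that, assuming the standard unpooled-variance formula for a two-sided two-proportion z-test, n = (z_alpha/2 + z_beta)^2 * (p0*(1-p0) + p1*(1-p1)) / (p1 - p0)^2 per arm. This form has no leading factor of 2; that factor belongs to the pooled-variance variant, which uses 2 * p_bar * (1 - p_bar) in place of the sum of the two variances. The function names are illustrative, and because the code uses exact normal quantiles rather than the rounded 1.96 and 0.84, its result can differ by a few dozen visitors from a hand calculation.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(p0: float, p1: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm n for a two-sided two-proportion z-test (unpooled variance)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance_sum = p0 * (1 - p0) + p1 * (1 - p1)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / (p1 - p0) ** 2)


def days_required(n_per_arm: int, weekly_traffic: int, arms: int = 2) -> int:
    """Whole days needed to fill every arm, assuming traffic is even across days."""
    daily_traffic = weekly_traffic / 7
    return math.ceil(arms * n_per_arm / daily_traffic)


# Inputs taken from the example above: 3.8% baseline, 0.5 pp absolute MDE.
n = sample_size_per_arm(p0=0.038, p1=0.043)
print(n, days_required(n, weekly_traffic=42_000))
```

The even-traffic assumption in `days_required` is a simplification; real mobile traffic usually varies by day of week, which is one reason to run in whole-week multiples.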

Experiment Design Document

{
  "primary_metric": {
    "name": "Purchase Conversion Rate",
    "definition": "Orders / Unique mobile product page visitors, computed as (count of unique user sessions that resulted in ≥1 purchase order within 24 hours of product page view) / (total unique user sessions on mobile product pages during analysis window)",
    "computation_notes": "Use session ID as unit of analysis. A session is attributed to control or treatment based on sticky button assignment at time of product page load. Count orders placed within 24 hours to capture immediate intent; do not count orders >24h post-pageview to avoid confounding with re-engagement or external marketing.",
    "metric_type": "Proportion (binary conversion)"
  },
  "secondary_metrics_monitored_not_tested": [
    {
      "name": "Add-to-Cart Rate",
      "definition": "Count of add-to-cart events / unique product page visitors",
      "purpose": "Directional signal on whether sticky button increases engagement before conversion. Do NOT run significance test; report point estimate and 95% CI for context only."
    },
    {
      "name": "Bounce Rate (Product Page)",
      "definition": "Sessions with no interaction (scroll, tap, add-to-cart) / total sessions",
      "purpose": "Check for unintended friction from sticky button (e.g., UI obstruction, annoyance). Flag if treatment bounce rate increases significantly."
    },
    {
      "name": "Session Duration (Product Page)",
      "definition": "Median time on product page before navigating away or checking out",
      "purpose": "Detect if sticky button encourages faster checkout (good) or longer deliberation (possible friction). Monitored only; not powered."
    },
    {
      "name": "Revenue per Visitor (RPV)",
      "definition": "Sum of order revenue / unique visitors",
      "purpose": "Flag if treatment increases order size or drives lower-margin purchases. High variance metric; report with caution and do not run formal test."
    }
  ],
  "analysis_plan": {
    "primary_test": "Two-tailed z-test for proportions; reject null if p < 0.05 and 95% CI for treatment effect does not overlap zero.",
    "estimate_reported": "Absolute percentage point lift and relative % lift with 95% Agresti-Coull confidence intervals.",
    "cutoff_date": "Pre-commit: Run test for 4 full weeks (28 days) from launch; analysis conducted 3 calendar days after final day to allow data pipeline lag. No peeking or interim analyses.",
    "exclusions": "Remove bot traffic via existing mobile bot filter. Exclude users with VPN/proxy flags if available. Do not exclude users by geography or device type unless pre-specified.",
    "segment_analysis": "Post-hoc only (not powered): conversion lift stratified by device OS (iOS vs Android), traffic source (organic vs paid), and new vs returning user. Report effect estimates with 90% CIs for transparency but do not claim significance.",
    "intention_to_treat": "Analyze all visitors assigned to treatment or control, regardless of whether sticky button was successfully rendered or user interacted with it."
  },
  "decision_rule": {
    "ship_treatment": "If p < 0.05 (two-tailed) and point estimate of treatment effect is positive AND no material SRM violations detected.",
    "hold_or_revert": "If p ≥ 0.05, OR if point estimate is negative, OR if SRM checks fail (see SRM section). Do not re-run test on subset of data; report as inconclusive.",
    "secondary_metric_guard_rails": "If bounce rate increases >1.5 percentage points in treatment and is statistically significant at p<0.10, flag for product review even if primary metric ships."
  },
  "what_this_test_does_not_measure": [
    "Long-term effects: Test measures conversion within 24 hours of product page view. Does not capture repeat purchase behavior, customer lifetime value, or cohort retention over weeks/months.",
    "Network effects: No mechanism to detect social sharing, word-of-mouth, or marketplace effects from sticky button visibility.",
    "Novelty effects: Design assumes 4 weeks is sufficient to wash out initial user surprise. If sticky button novelty wears off after 2 weeks, test will not detect decay; recommend 8-week post-launch monitoring.",
    "Device-level persistence: Mobile browsers do not reliably persist cookies; sticky button assignment is per-session, not per-user. Users on same device but different sessions may see different treatment assignment.",
    "Interaction with other friction points: Test does not measure whether sticky button interacts with checkout funnel friction (payment form complexity, shipping costs, etc.). May be additive only if checkout is already optimized.",
    "Cohort selection bias: Cannot distinguish between users who self-select to use sticky button (organic) vs passive exposure. If treatment effect is driven only by high-intent users who actively tap it, effect may not generalize."
  ]
}
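The primary test and ship rule in the design document can be sketched in a few lines. This is a minimal illustration rather than the generator's own analysis code: it assumes a pooled-variance z statistic for the test and, as a simplification, a plain Wald interval for the difference in place of the Agresti-Coull interval the plan names. The function names and the visitor and order counts are hypothetical.

```python
import math
from statistics import NormalDist


def two_proportion_ztest(x_c, n_c, x_t, n_t, alpha=0.05):
    """Two-sided z-test for treatment minus control conversion rate.

    x_* are conversion counts and n_* are visitors per arm.
    Returns (absolute lift, p-value, Wald confidence interval for the lift).
    """
    p_c, p_t = x_c / n_c, x_t / n_t
    diff = p_t - p_c
    # Pooled standard error under the null hypothesis of equal rates.
    p_pool = (x_c + x_t) / (n_c + n_t)
    se_null = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = diff / se_null
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error for the interval (Wald, not Agresti-Coull).
    se_diff = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return diff, p_value, (diff - z_crit * se_diff, diff + z_crit * se_diff)


def decide(diff, p_value, srm_ok, alpha=0.05):
    """Apply the pre-committed rule: ship only on a significant positive lift with no SRM."""
    if p_value < alpha and diff > 0 and srm_ok:
        return "ship_treatment"
    return "hold_or_revert"


# Hypothetical counts for illustration only; these are not results from the test.
diff, p_value, ci = two_proportion_ztest(x_c=2000, n_c=50_000, x_t=2250, n_t=50_000)
print(round(diff, 4), p_value < 0.05, decide(diff, p_value, srm_ok=True))
```

Keeping the decision in a single function that takes the SRM result as an input makes it harder to ship on a significant p-value while ignoring a failed randomization check.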

SRM & Randomization Checks

{
  "pre_test_setup": [
    {
      "check": "Assignment Randomization Balance",
      "method": "Chi-square test on assignment (control vs treatment) by day-of-week for first 3 days post-launch. Expect ~50/50 split each day; flag if any day deviates >2 percentage points or p_chi2 < 0.05.",
      "threshold": "p > 0.05 to proceed; if fails, check randomization code and re-randomize if needed."
    },
    {
      "check": "Covariate Balance (Pre-Period Baseline)",
      "method": "For ≥3 days before treatment launch, compare mobile traffic distribution across control and treatment buckets on: traffic source (organic/paid/direct), device OS (iOS/Android), new vs returning user (via first-party cookie). Compute standardized mean differences (SMD) for each covariate.",
      "threshold": "SMD < 0.1 for all covariates; if SMD > 0.1 for any covariate, document and check for random seed issues in assignment logic."
    }
  ],
  "during_test_monitoring": [
    {
      "check": "Daily Assignment Ratio",
      "method": "Compute control/treatment ratio and 95% binomial CI each day. Expect 50/50 ± ~1-2% depending on volume. Plot over time to detect drift.",
      "action": "If ratio deviates outside [0.48, 0.52] for 2+ consecutive days, pause and investigate traffic source or sampling biases."
    },
    {
      "check": "Traffic Volume Anomalies",
      "method": "Compute daily unique visitors for control and treatment. Flag if either arm drops >20% vs rolling 7-day average (possible infrastructure or traffic source issue).",
      "action": "Document and communicate; do not exclude data, but may warrant test extension if traffic drops materially."
    },
    {
      "check": "Bounce Rate Parity",
      "method": "Daily bounce rate (no interaction) for control vs treatment. Plot and flag if treatment bounce consistently >1 percentage point above control by day 7.",
      "action": "Non-fatal but worth noting; may indicate sticky button is perceived as intrusive."
    }
  ],
  "pre_analysis_sanity_checks": [
    {
      "check": "Final Assignment Ratio (Chi-Square Test)",
      "method": "Chi-square goodness-of-fit on total assignment (control vs treatment) across all 4 weeks. Expected: 50/50. Compute chi-square statistic and p-value.",
      "threshold": "p > 0.05. If p < 0.05, document as SRM violation and report in results; do not publish causal claim without explaining potential bias direction."
    },
    {
      "check": "Post-Stratification Covariate Balance",
      "method": "Recompute SMD for traffic source, device OS, and user type (new vs returning) across the full analysis period. Compare to baseline.",
      "threshold": "SMD < 0.1; if any SMD > 0.15, consider post-stratified analysis as secondary check."
    },
    {
      "check": "Day-of-Week Assignment Patterns",
      "method": "Compute assignment ratio (control/treatment) stratified by day-of-week (Mon–Sun) across all 4 weeks. Visualize and test for interaction via chi-square.",
      "threshold": "No systematic pattern; if weekdays differ materially from weekends (e.g., Fri/Sat have 45/55 split while Mon has 50/50), document and consider sensitivity analysis."
    },
    {
      "check": "Bot Filter Effectiveness",
      "method": "Confirm that bot filter was applied uniformly to control and treatment (i.e., % of sessions flagged as bot should be ~equal). Compute proportion of bot sessions per arm.",
      "threshold": "Bot rate difference < 0.5 percentage points; if >0.5 pp difference, bot filter may be correlated with treatment assignment."
    },
    {
      "check": "Conversion Rate Sanity",
      "method": "Plot cumulative conversion rate for control and treatment by day. Expect control to stabilize around 3.8% by day 5. Flag if either arm shows unexpected drift or inflection.",
      "action": "Check for data pipeline issues (e.g., order attribution lag, timezone misalignment) if rates are erratic."
    }
  ]
}
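Two of the checks above, the chi-square goodness-of-fit on the final assignment ratio and the standardized mean difference (SMD) on a binary covariate, are simple enough to sketch with the standard library. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the SMD form is the common one for proportions (difference divided by the square root of the average of the two binomial variances), and the counts and shares are made up rather than taken from a real test.

```python
import math


def srm_chi_square(n_control: int, n_treatment: int, expected_share: float = 0.5):
    """Chi-square goodness-of-fit for a two-arm split; returns (statistic, p-value)."""
    total = n_control + n_treatment
    exp_control = total * expected_share
    exp_treatment = total * (1 - expected_share)
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_treatment - exp_treatment) ** 2 / exp_treatment)
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))


def smd_binary(share_control: float, share_treatment: float) -> float:
    """Absolute standardized mean difference for a binary covariate (e.g. share on iOS)."""
    avg_variance = (share_control * (1 - share_control)
                    + share_treatment * (1 - share_treatment)) / 2
    return abs(share_treatment - share_control) / math.sqrt(avg_variance)


# Hypothetical 4-week totals (about 168,000 visitors at 42,000 per week).
chi2, p_value = srm_chi_square(n_control=84_100, n_treatment=83_900)
print("SRM detected" if p_value < 0.05 else "No SRM detected", round(p_value, 3))

# Hypothetical iOS shares per arm, compared against the 0.1 threshold above.
print(smd_binary(share_control=0.60, share_treatment=0.61) < 0.1)
```

For covariates with more than two levels, such as traffic source, one option is to compute the SMD per level treated as a binary indicator.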

What to edit for your situation

Replace the hypothesis, baseline conversion rate, MDE, and weekly traffic figures with your actual product and metric values; also update the test constraints (runtime, significance threshold, traffic filters) to match your team's standards.

Human review: Verify the sample size formula and computed figures against your own stats library or a trusted calculator before committing to the design — rounding choices and variance assumptions can meaningfully affect required duration.
