Example output · Data Scientist AI
What the Dataset Profiling Plan Generator actually produces
Takes a dataset description, schema, intended ML use case, and known issues, then generates a structured profiling plan covering missing-value analysis, distribution checks, outlier detection, target leakage red flags, and a concrete pre-training checklist.
- Dataset Description:
- Historical e-commerce transactions from a mid-size fashion retailer (2019–2023). ~4.2M rows covering online and in-store purchases, returns, and cancellations.
- Schema:
- order_id (UUID), customer_id (int), order_date (datetime), delivery_date (datetime), channel (categorical: online/in-store), product_id (int), category (string), unit_price (float), quantity (int), discount_pct (float), return_flag (bool), return_date (datetime), customer_lifetime_value (float), fraud_label (bool)
- Intended Use:
- Train a binary classifier to predict order-level fraud at the time of purchase, to be served in real-time at checkout.
- Known Issues:
- return_date is null for non-returned orders; delivery_date has ~8% nulls due to in-store orders; some duplicate order_ids suspected from a 2021 ETL migration; fraud_label coverage is only 60% (labels missing for early 2019 data).
MISSING VALUES 1. Column-level nullity: Report % missing for each column. Flag if >30% for numeric features (except delivery_date, return_date by design). 2. Structural missingness: Chi-square test (α=0.05) of return_date nullness vs. return_flag. Should be perfectly aligned; any deviation flags data quality issue. 3. Delivery_date missingness by channel: Cross-tab delivery_date nulls vs. channel. Expect ~100% nulls for in-store; flag if online has >2% nulls. 4. Fraud_label coverage by date: Group by order_date month, calculate % non-null fraud_label. Flag sharp cutoff before 2019 Q3 (known issue); flag any unexplained drops post-cutoff (threshold: >5% month-over-month decrease). 5. Missingness correlated with target: For each column, calculate phi coefficient (categorical) or point-biserial correlation (numeric) with fraud_label. Flag if |r| > 0.10, indicating missing data is not MCAR. DISTRIBUTIONS & OUTLIERS 1. Unit_price distribution: Plot histogram and compute (min, 25th, 50th, 75th, max, skewness). Fit lognormal; if KS test p < 0.05 vs. lognormal, flag as non-standard. Check for mode at 0 or suspiciously round numbers (bulk discounts). 2. Quantity distribution: Tabulate top 20 values. Flag if single value covers >10% of rows (indicator of data binning or aggregation). Check max quantity; if >1000 units, manually inspect top 50 rows. 3. Discount_pct: Calculate % of orders with discount_pct=0 vs. >0. Check for bimodal distribution (structural: discounted vs. non-discounted). Flag if range exceeds [0, 100] or negative values present. 4. Customer_lifetime_value: Use isolation forest (contamination=0.05) to detect outliers. Cross-reference with order count; if single order has CLV >$100k, flag as likely aggregation error. Compute IQR on log-scale (heavy-tailed expected); flag if max > Q3 + 10*IQR. 5. Order inter-arrival times (per customer): Calculate days between consecutive orders. Flag customers with <1 day between orders (bot/test traffic indicator). Count and threshold: flag if >0.1% of customer-order pairs. CLASS IMBALANCE & TARGET DISTRIBUTION 1. Overall fraud_label distribution: Report n_positive, n_negative, % positive. Expected imbalance is acceptable; flag if <1% or >50% positive (sign of label error or dataset shift). 2. Imbalance by channel: Cross-tab fraud_label vs. channel (online/in-store). Calculate fraud_pct per channel. Flag if fraud_pct(online) > 3x fraud_pct(in-store) — may indicate detection bias. 3. Imbalance by customer_lifetime_value quartile: Stratify customers into CLV quartiles; calculate fraud_pct per quartile. Flag if range (max_fraud_pct - min_fraud_pct) > 15 percentage points (unequal learning signal). 4. Imbalance by product_category: Calculate fraud_pct per category. Flag categories with <100 labeled examples (unreliable precision) or fraud_pct outside [global_rate - 3*SE, global_rate + 3*SE]. 5. Temporal fraud_pct: Bin order_date into months; plot fraud_pct over time. Flag any trend >10 pp change month-over-month post-2019-Q3 (suggests label quality change or real drift). Flag if recent months have <50 labeled positives. KEYS & JOINS 1. order_id cardinality: Count distinct order_ids; should equal row count. If count(distinct) < count(*), flag as duplicate orders. For each duplicate set, inspect order_date, delivery_date, product_id; if identical row except order_id, confirm ETL issue from 2021 migration. 2. customer_id validity: Count nulls in customer_id. Count distinct customer_ids per order_id; flag if any order_id has >1 customer_id (data corruption). Flag if customer_id > max_int or <0. 3. product_id validation: Count nulls and negatives. Cross-check distinct product_ids against product master (if available); flag orphaned product_ids (sign of join failure). 4. Referential integrity: For orders in 2021-2022 (peak of ETL migration), compare row counts before/after deduplication. Document n_rows_removed; flag if >1% of 2021 data is duplicate. SCHEMA DRIFT (comparing versions 2019–2023 if historical snapshots available) 1. Column existence: Verify all 14 columns present in each year's snapshot. Flag any dropped/added columns. 2. Type stability: Confirm order_date, delivery_date, return_date remain datetime; discount_pct, unit_price remain float. Flag any implicit type conversions (e.g., unit_price stored as string in subset of data). 3. Category enum values: List all unique values in 'category' and 'channel' per year. Flag new categories or spelling drift (e.g., "online" vs. "Online"). Document categorization scheme change if any. 4. Value range drift on numeric columns: For unit_price, discount_pct, quantity, calculate (min, max, median, std) per year. Flag year-over-year change in max by >50% (new product lines or data error). Flag if median drifts >20%. 5. Fraud_label coverage drift: % non-null fraud_label per year (already tracked above, but confirm trend).
TARGET LEAKAGE 1. return_date and return_flag are downstream of fraud decision: Confirm they are NOT used as predictors. A fraudulent order may or may not be returned; using return info at prediction time (before return window closes) is temporal leakage. 2. delivery_date: If order was fraudulent, delivery may not occur. delivery_date is post-decision; flag if used as feature. Use only order_date and channel pre-decision features. 3. Customer_lifetime_value (CLV): CRITICAL. If CLV is computed using the target order, it leaks the target (the order itself contributes to CLV). Compute CLV using only orders strictly before order_date for each row. Verify: for each row, CLV should equal sum(revenue) for all orders with order_date < current_order_date. If CLV includes current order, remove and recompute. TRAIN-TEST LEAKAGE (temporal & entity-level) 1. Time-based split must be on order_date: If splitting on order_id or customer_id, train and test may have same customer with orders on both sides; customer behavior leaks between splits. Implement strict temporal cutoff: train = order_date < cutoff_date, test = order_date >= cutoff_date. Verify no customer_id appears in both splits. 2. Homogeneity check: After temporal split, calculate % of customers in train that also appear in test. Flag if >5% (indicates time overlap or customer behavior leakage). 3. 2021 ETL migration duplicate issue: If duplicates are not removed before split, train-test may have near-identical rows on different sides. Deduplicate BEFORE any split. PROXY LEAKAGE 1. Fraud_label vs. discount_pct correlation: Calculate point-biserial r(fraud_label, discount_pct). If |r| > 0.40, investigate: are fraudsters systematically using higher discounts? Or does the fraud detection system flag high-discount orders, creating spurious correlation? If latter, discount_pct is a proxy for fraud_score, not a causal feature. 2. Fraud_label vs. channel correlation: Calculate Cramér's V(fraud_label, channel). If V > 0.30, flag: does one channel have intrinsically higher fraud risk, or is fraud detection deployed only on one channel (selection bias)? Inspect fraud_label coverage by channel; if in-store is all 0 or all unlabeled, this is proxy leakage. 3. Customer_lifetime_value as proxy: If fraud_label is correlated with CLV (r > 0.35), and CLV is recomputed correctly (post-order), assess causality: do high-value customers commit fraud, or are high-value customers more scrutinized (detection bias)? If bias, CLV will overfit in training and fail in production. TEMPORAL LEAKAGE 1. Fraud_label availability: Confirm fraud_label is known AT order_date (or before checkout) for all labeled rows. If fraud_label is filled in post-delivery or post-return window, it is not a valid training target for a real-time checkout model. Re-examine: when was label assigned relative to order placement? 2. Future data in feature computation: If any feature (e.g., CLV, category trends) is computed from orders AFTER the current order_date, flag as temporal leakage. Use only data with order_date <= current_order_date. 3. Delivery_date in training: Even if not directly used, if model is trained on completed orders (non-null delivery_date) and deployed on orders at checkout (no delivery_date yet), the train set has information the test set doesn't. Confirm model only uses features available at checkout time: order_date, channel, product_id, unit_price, quantity, discount_pct, (pre-computed) customer_lifetime_value. FAIRNESS & PROTECTED ATTRIBUTES 1. Indirect proxy check: Dataset contains no explicit protected attributes (age, gender, race). However, customer_id and channel may proxy for socioeconomic status or geography. Calculate fraud_label distribution by channel and by customer_id decile (ranked by CLV). Flag if any segment has fraud_pct > 2x baseline (sign of disparate impact or detection bias). Document findings; do not remove, but flag for model card and fairness audit post-hoc.
[ { "item": "Duplicate order_ids removed", "criterion": "count(distinct order_id) == row_count after deduplication. Document n_rows removed from 2021 period; must be <1% of 2021 volume." }, { "item": "Fraud_label is valid at prediction time", "criterion": "Manual inspection: 20-sample audit confirms fraud_label was assigned before or at order_date (not post-delivery or post-return window). If label timing is post-delivery, dataset is unsuitable for real-time fraud model." }, { "item": "Customer_lifetime_value excludes current order", "criterion": "For 1000 random rows, verify CLV = sum(revenue) for all prior orders (order_date < current_order_date). If any row's CLV includes current order amount, recompute on entire dataset." }, { "item": "No target leakage from return_date or delivery_date", "criterion": "Confirm neither return_date nor delivery_date are included in feature set. Use only: order_date, channel, product_id, unit_price, quantity, discount_pct, customer_lifetime_value, category." }, { "item": "Missing fraud_label is structural, not random", "criterion": "Fraud_label nulls are confined to order_date < 2019-07-01 (or known cutoff). Chi-square test: fraud_label missingness must NOT be significantly correlated with any other column (p > 0.05). If correlated, investigate root cause before proceeding." }, { "item": "Class imbalance is acceptable and documented", "criterion": "Overall fraud_pct reported. Stratified fraud_pct by channel, product_category, and CLV quartile computed and reviewed. Flag any segment with <100 labeled examples or fraud_pct >3 SEs from mean. Sampling strategy (e.g., stratified split, class weights) defined before training." }, { "item": "No train-test leakage on customers", "criterion": "If using temporal split (train: order_date < cutoff, test: order_date >= cutoff), verify no customer_id in both train and test (report % overlap; must be 0%). If using random split, only acceptable if fraud_label distribution is uniform over time; else use temporal split." }, { "item": "No proxy leakage from discount_pct or channel", "criterion": "Point-biserial r(fraud_label, discount_pct) reported; if |r| > 0.40, investigate and document whether correlation is causal or detection artifact. Cramér's V(fraud_label, channel) reported; if V > 0.30, confirm fraud detection coverage is equal across channels." }, { "item": "Outliers assessed and decision made", "criterion": "Isolation forest outliers in customer_lifetime_value and unit_price identified. IQR outliers in quantity flagged. Top 1% of orders by CLV manually inspected (n >= 42k rows). Decision made: remove, cap, or keep. If kept, confirm model is robust to outliers (e.g., via tree-based model or robust scaling)." }, { "item": "Schema is stable across train and test", "criterion": "If historical versions available: column existence, types, and value ranges (min, max, median for numeric; enum values for categorical) are identical across 2019–2023. Any drift documented and justified (e.g., new product category added in 2022). No unexplained type changes." }, { "item": "Data quality threshold met", "criterion": "Rows with all required features (no nulls in: order_date, customer_id, product_id, unit_price, quantity, discount_pct, channel, category, fraud_label) = >=95% of labeled dataset. Rows excluded due to nulls documented and rationale recorded." }, { "item": "Fairness check: disparate impact screening", "criterion": "Fraud_pct by channel and by customer CLV decile computed. Max fraud_pct / min fraud_pct < 2.5. If exceeded, document in model card that fraud rates differ by segment; recommend stratified evaluation and fairness audit post-training." } ]
Swap in your own dataset description, schema column names and types, intended model use case, and any known data quality issues — the more specific you are, the more targeted the leakage flags and checklist thresholds will be.
Human review: Verify that the leakage red flags and checklist criteria match your actual data pipeline and label-generation process before treating them as ground truth — the tool reasons from the schema you provide, not from your source system.
Generate this for your own situation — free.
5 runs a day, no credit card.
Try the Dataset Profiling Plan Generator