How to Write an Experiment Pre-Registration Doc That Stops P-Hacking in Growth Teams

Ever had an A/B test that “won” on Friday and “lost” by Tuesday? That swing can be real variance, but it’s often also a sign the team is touching the dials mid-flight. When goals are aggressive and dashboards update in real time, it’s easy to chase a green number.

An experiment pre-registration doc fixes that by doing one simple thing: it forces you to write down your intent before you see the outcome. Think of it like sealing your analysis plan in an envelope before you open the results.

What p-hacking looks like in growth teams (and why it happens)

Common p-hacking traps versus a locked pre-registration plan, created with AI.

P-hacking in growth work rarely looks like fraud. It looks like “being agile.” Common patterns:

  • Metric switching: You planned to judge on activation rate, but retention moved, so retention becomes the headline.
  • Optional stopping: The test is called early when it looks good, or extended when it doesn’t.
  • Repeated peeks: You check results daily and stop the moment p < 0.05.
  • Post-hoc segments: “It didn’t work overall, but it worked for mobile users in Canada.”
  • Removing ‘bad’ data: Excluding outliers, refunds, or “weird days” after seeing they hurt the result.

These behaviors are so common that many teams barely notice them anymore. If you want a practical, growth-focused breakdown, Jason Cohen’s write-up on p-hacking your A/B tests is a good mirror to hold up to your process.

What an experiment pre-registration doc is (for A/B tests)

Pre-registration is popular in academic research, but it maps cleanly to product, marketing, and lifecycle tests. You write down:

  • what you’re changing
  • what “success” means
  • how long you’ll run
  • what analysis you’ll use
  • what you will not change after launch

If you want a canonical reference, Open Science Framework’s overview of registrations and preregistrations is a solid starting point.

This is also aligned with the American Statistical Association’s guidance on not treating p-values like a magic pass or fail button. The ASA statement is short and worth bookmarking: ASA statement on p-values (PDF).

The doc sections that block the usual p-hacking moves

An example of a structured pre-registration document layout, created with AI.

A good pre-reg doc is short, but it’s opinionated. These fields do most of the work.

1) Primary metric + decision rule (stops metric switching)

Write one primary metric, one definition, one decision rule.

Example: “Primary metric = activation within 24 hours. Ship only if effect is positive and statistically significant at alpha 0.05, and guardrails pass.”

Also list secondary metrics, but label them as supporting evidence, not the thing you will use to declare victory.
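
If your tooling doesn’t enforce the rule, it’s easy to encode. Here’s a minimal Python sketch, assuming a standard pooled two-proportion z-test; the function name, counts, and guardrail flag are illustrative, not from any particular platform:

```python
# Minimal sketch of a pre-registered decision rule: ship only if the
# effect is positive, significant at alpha, and guardrails pass.
# Illustrative names and numbers; assumes a pooled two-proportion z-test.
from math import sqrt
from scipy.stats import norm

def decide(ctrl_conv, ctrl_n, treat_conv, treat_n, guardrails_pass, alpha=0.05):
    p1, p2 = ctrl_conv / ctrl_n, treat_conv / treat_n
    pooled = (ctrl_conv + treat_conv) / (ctrl_n + treat_n)
    se = sqrt(pooled * (1 - pooled) * (1 / ctrl_n + 1 / treat_n))
    z = (p2 - p1) / se
    p_value = 2 * norm.sf(abs(z))  # two-sided p-value
    ship = (p2 > p1) and (p_value < alpha) and guardrails_pass
    return {"uplift": p2 - p1, "p_value": round(p_value, 4), "ship": ship}

# Positive uplift, but p ≈ 0.09: the pre-registered rule says don't ship.
print(decide(180, 1000, 210, 1000, guardrails_pass=True))
```

The point isn’t the stats. It’s that the rule is written down, and executable, before anyone sees a dashboard.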

2) Fixed run length + stopping rule (stops optional stopping and peeking)

Pre-commit to either:

  • Fixed horizon: "Run for 14 full days, evaluate once at the end."
  • Sequential testing (allowed peeks): "Evaluate at day 7 and day 14 with alpha spending." You don’t need heavy math in the doc; just state the method. Two readable intros are Understanding Group Sequential Testing and Error Spending in Sequential Testing Explained.

Key point: if peeking is allowed, it must be structured. If it’s not structured, it’s p-hacking with better charts.
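
To make “structured” concrete, here’s a minimal sketch of an alpha-spending schedule for the two looks above. It uses an O’Brien-Fleming-type spending function (one common choice, not the only one) and only shows how much alpha each look may spend; turning spent alpha into exact critical values needs the joint distribution of the test statistics, which dedicated tools like the R package gsDesign handle numerically.

```python
# Minimal sketch of structured peeking: an O'Brien-Fleming-type alpha
# spending schedule for two looks (day 7 ≈ half the data, day 14 = all).
# This only shows how much alpha each look may spend; real tools derive
# the matching critical values numerically.
from math import sqrt
from scipy.stats import norm

ALPHA = 0.05

def obf_spent(t, alpha=ALPHA):
    """Cumulative two-sided alpha spent at information fraction t."""
    return 2 * norm.sf(norm.ppf(1 - alpha / 2) / sqrt(t))

for t in (0.5, 1.0):
    print(f"info fraction {t:.1f}: cumulative alpha spent ≈ {obf_spent(t):.4f}")
# info fraction 0.5: cumulative alpha spent ≈ 0.0056
# info fraction 1.0: cumulative alpha spent ≈ 0.0500
```

Notice how little alpha the early look gets to spend. That’s the price of peeking, paid up front instead of hidden in the readout.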

3) Population, unit, and bucketing (stops “we changed who counts”)

Lock:

  • Unit of randomization (user, account, session)
  • Eligibility window (new signups only, last 30 days)
  • Exposure definition (what counts as “saw treatment”)
  • One user, one bucket rule (no cross-device reassignment, if possible)

This prevents redefining the denominator after the fact.
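
Deterministic hashing is the usual way to honor “one user, one bucket.” A minimal sketch, assuming a per-experiment salt (names are illustrative):

```python
# Minimal sketch of "one user, one bucket": salt the hash with the
# experiment name so assignments are stable within a test and
# independent across tests. Names are illustrative.
import hashlib

def assign_bucket(user_id: str, experiment: str, n_buckets: int = 2) -> int:
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_buckets  # 0 = control, 1 = treatment

# Same user, same experiment -> same bucket, every time it's called.
print(assign_bucket("user_42", "onboarding_email_1"))
```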

4) Data exclusions and quality rules (stops removing ‘bad’ data)

Write exclusions before launch. Keep them narrow and operational.

Good: bot-traffic filters, internal users, known tracking outages with timestamps, a duplicate-accounts rule.

Risky: “Remove extreme spenders,” “remove angry users,” or “remove days where conversion was weird.”

If you must exclude anything subjective, require an amendment and a separate “exploratory” result.
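
One way to make exclusions operational is to write them as a closed list of predicates that ships with the pre-reg doc. A minimal sketch, with illustrative field names and rules:

```python
# Minimal sketch of exclusions as a closed, reviewable list of
# predicates committed before launch. Field names are illustrative.
EXCLUSION_RULES = {
    "internal_user": lambda r: r["email"].endswith("@yourcompany.com"),
    "bot_signup":    lambda r: r["is_bot"],
    # Outage windows get appended via the amendment log, with timestamps.
}

def include_in_analysis(row: dict) -> bool:
    """A row is analyzed unless a pre-registered rule excludes it."""
    return not any(rule(row) for rule in EXCLUSION_RULES.values())

print(include_in_analysis({"email": "jane@example.com", "is_bot": False}))  # True
```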

5) Segmentation plan (stops post-hoc segments)

Pre-specify the only segments you’ll treat as confirmatory.

Example: “Confirmatory segments: device (mobile vs desktop) and plan (free vs trial). All other slices are exploratory.”

This doesn’t ban exploration. It just stops you from presenting a lucky slice as if you planned it.

6) Multiple comparisons controls (stops false wins when you test many things)

Growth teams often test:

  • many metrics
  • many variants
  • many segments
  • many experiments per month

That’s a multiple comparisons problem. Your pre-reg doc should pick one approach:

  • Pre-specified hierarchy: one primary metric, then only test secondary metrics if primary passes.
  • Bonferroni or Holm: more conservative, simple to explain for a small set of metrics.
  • False Discovery Rate (FDR) control: useful when you’re screening many hypotheses.

You don’t need to teach stats in the doc. You just need to state what rule you’ll follow.
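
For reference, the Holm procedure is small enough to read in one sitting. A minimal sketch with made-up p-values:

```python
# Minimal sketch of Holm's step-down procedure: compare the k-th
# smallest p-value against alpha / (m - k) and stop at the first miss.
# The p-values below are made-up illustration data.
def holm(p_values, alpha=0.05):
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    rejected = [False] * len(p_values)
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (len(p_values) - rank):
            rejected[i] = True
        else:
            break  # step-down: once one fails, all larger p-values fail
    return rejected

print(holm([0.004, 0.03, 0.04]))  # [True, False, False]
```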

Governance: what must be locked before launch vs what can change

In 2025, experimentation is faster than ever, but governance still matters. The easiest policy is “lock the parts that can create a false win.”

| Item | Must be locked before launch | Can change with amendment log |
| --- | --- | --- |
| Hypothesis and primary metric | Yes | No (start a new experiment) |
| Eligibility, unit, bucketing | Yes | Rarely (only for bugs) |
| Stopping rule and peek schedule | Yes | No (start a new experiment) |
| Exclusions and data quality rules | Yes | Yes (with timestamps and reason) |
| Secondary metrics and segments | Yes | Yes (but marked exploratory) |
| Instrumentation details | No | Yes |
| Run dates (if incident occurs) | No | Yes (with documented incident) |

Amendment log rule: if you change anything that would make the result easier to “win,” you either restart the test or treat outcomes as exploratory.

Copy/paste experiment pre-registration template (Markdown)

Experiment pre-registration (v1.0)

  • Experiment name:
  • Owner:
  • Reviewer (data/analytics):
  • Decision maker:
  • Created on (date):
  • Planned launch (date):

1) Goal and hypothesis

  • Change description:
  • Hypothesis (directional):
  • Primary decision: ship, iterate, or stop

2) Primary metric (confirmatory)

  • Primary metric name:
  • Metric definition (numerator/denominator, window):
  • Decision rule (include alpha and direction):

3) Guardrails

  • Guardrail metrics (and fail thresholds):

4) Population and assignment

  • Eligibility:
  • Unit of randomization:
  • Variants (control, treatment):
  • Bucketing method:
  • Exposure definition:

5) Sample size and duration

  • Planned duration:
  • Target sample size (or MDE assumptions):
  • Seasonality risks (if any):

6) Stopping and peeking

  • Stopping rule (fixed horizon or sequential):
  • Peek schedule (if any):
  • Early stop criteria (efficacy, futility, safety):

7) Analysis plan

  • Primary test method:
  • Handling repeated users/sessions:
  • Multiple comparisons control (hierarchy, Holm, FDR):
  • Segment plan (confirmatory segments only):
  • Missing data and tracking checks:

8) Exclusions (pre-committed)

  • Exclude:
  • Do not exclude:

9) Reporting plan

  • Where results will be posted:
  • Template for final readout:

Amendment log

  • Date:
  • Change:
  • Reason:
  • Impact on confirmatory vs exploratory:
  • Approved by:

Filled example: onboarding email subject line test (growth team)

Experiment name: Onboarding Email 1 Subject Line
Owner: Lifecycle PM
Reviewer: Analytics Lead
Planned launch: Jan 6, 2026

Goal and hypothesis
Change: Subject line “Welcome to Acme” (control) vs “Your first win in 5 minutes” (treatment).
Hypothesis: Treatment increases activation within 24 hours.

Primary metric (confirmatory)
Primary metric: Activation rate within 24 hours of signup.
Definition: Activated users / delivered-email recipients, 24-hour window from signup.
Decision rule: Ship if uplift > 0 and significant at 0.05, and guardrails pass.

Guardrails
Unsubscribe rate: do not increase by more than 0.15 percentage points.
Spam complaint rate: do not increase by more than 0.02 percentage points.

Population and assignment
Eligibility: New signups, excluding internal domains and known bots.
Unit: User.
Exposure: Email delivered within 30 minutes of signup.
Bucketing: 50/50 split by user_id hash.

Sample size and duration
Duration: 14 days to cover weekday cycles.
Sample size: Run until 20,000 delivered emails total (based on prior baseline variance).

Stopping and peeking
Sequential plan: Two looks (day 7 and day 14) using alpha spending (pre-set). No other peeks.

Analysis plan
Primary method: Two-proportion test on activation rate, report effect size and confidence interval.
Multiple comparisons: Hierarchy (primary metric first; then guardrails; then secondary metrics).
Segments: Confirmatory segments are device (mobile/desktop) only. Any other segments are exploratory.
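
For what the readout might look like in code, here is a minimal sketch of the uplift plus a 95% Wald confidence interval. The counts are illustrative, not results:

```python
# Minimal sketch of the confirmatory readout: uplift in activation rate
# with a 95% Wald confidence interval. Counts are illustrative.
from math import sqrt
from scipy.stats import norm

def uplift_ci(x_ctrl, n_ctrl, x_treat, n_treat, alpha=0.05):
    p1, p2 = x_ctrl / n_ctrl, x_treat / n_treat
    se = sqrt(p1 * (1 - p1) / n_ctrl + p2 * (1 - p2) / n_treat)
    z = norm.ppf(1 - alpha / 2)
    diff = p2 - p1
    return diff, (diff - z * se, diff + z * se)

diff, (lo, hi) = uplift_ci(1800, 10000, 1950, 10000)
print(f"uplift = {diff:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
# uplift = 0.015, 95% CI = (0.004, 0.026)
```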

Exclusions
Exclude: internal users, bot signups, known tracking outage window (if it occurs, logged).
Do not exclude: low-engagement users, refunds, “weird days” without incident ticket.
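
If you’re wondering where a target like “20,000 delivered emails” comes from, it’s the standard two-proportion sample-size arithmetic. A minimal sketch with illustrative baseline and MDE assumptions (not the actual numbers behind this example):

```python
# Minimal sketch of two-proportion sample sizing: n per arm to detect
# a given lift at two-sided alpha with power 1 - beta. The baseline
# and MDE below are illustrative assumptions.
from math import ceil, sqrt
from scipy.stats import norm

def n_per_arm(p_base, mde, alpha=0.05, power=0.8):
    p_treat = p_base + mde
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    p_bar = (p_base + p_treat) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p_base * (1 - p_base) + p_treat * (1 - p_treat))) ** 2
    return ceil(num / mde ** 2)

print(n_per_arm(0.18, 0.03))  # ≈ 2737 per arm for an 18% baseline, +3 pts
```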

Conclusion

A strong experiment pre-registration doc doesn’t slow growth teams down; it stops you from arguing with your past self. It makes wins more believable, losses more useful, and post-test decisions less political. Start with one template, enforce the locked fields, and keep an amendment log that’s painful to abuse. If your next “win” can’t survive that process, it wasn’t a win you could trust.
