Most teams don’t fail because they ship nothing. They fail because they ship a lot of work that never moves the numbers, and they still pay the full cost of every unsuccessful feature.
When I’m under pressure, the trap is simple: I treat “a good idea” as “a shippable idea.” Then two weeks pass, the result is muddy, and I’m arguing over anecdotes.
The fix is choosing an effect worth shipping before I write the first ticket. Not a perfect forecast, just a clear threshold tied to money, time-to-learn, and risk. This is how I keep experimentation honest and keep a growth roadmap from turning into a wish list.
Start with the money, then constrain the measurement window
If I can’t translate a change into dollars (or a leading indicator that reliably predicts dollars), I’m not doing decision making, I’m doing storytelling.
I start with one target metric and one baseline. For most startup growth teams, that’s a funnel conversion point: visit to signup, signup to activation, activation to paid. I avoid “engagement” unless I can prove it leads to revenue.
Next, I force a time constraint: can I measure this in 2 weeks or less? If the answer is no, I’m either shipping something smaller, or I’m running a different kind of test (more on that later). Time is a cost, not a detail.
Here’s the quick math I use to keep myself honest. I don’t need precision, I need a sane order of magnitude.
| Input | Example | Why it matters |
|---|---|---|
| Monthly visitors to the step | 200,000 | Sets the ceiling on learnings per month |
| Baseline conversion rate | 3.0% | Defines your starting point |
| Value per conversion (gross profit) | $40 | Keeps you from optimizing vanity |
| Candidate lift | +0.2 percentage points absolute (3.0% to 3.2%) | Converts “small” into “real” |
| Monthly impact | 200,000 × 0.2% × $40 = $16,000 | The number you can argue about |
If a change has a plausible path to $16,000 per month and I can learn in 2 weeks, I pay attention. If it’s $1,600 per month, the bar goes way up, unless it’s also a risk reducer (fraud, churn, support load).
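The arithmetic from the table fits in a few lines. This is a minimal sketch, and the function and parameter names are my own for illustration, not from any particular tool.

```python
def monthly_impact(monthly_visitors: int,
                   absolute_lift: float,
                   value_per_conversion: float) -> float:
    """Extra gross profit per month from an absolute conversion lift.

    absolute_lift is a fraction, so +0.2 percentage points is 0.002.
    """
    return monthly_visitors * absolute_lift * value_per_conversion


# The table's example: 200,000 visitors, 3.0% to 3.2%, $40 per conversion.
strong_bet = monthly_impact(200_000, 0.002, 40.0)

# A lift ten times smaller, where the bar goes way up.
weak_bet = monthly_impact(200_000, 0.0002, 40.0)

print(f"strong: ${strong_bet:,.0f}/month, weak: ${weak_bet:,.0f}/month")
```

Keeping the lift as an absolute fraction avoids the common mistake of mixing relative and absolute lifts in the same estimate.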
Also, I sanity check whether the lift is even detectable with my traffic. If you don’t do this, you’ll run underpowered A/B tests and call them “inconclusive,” which is just expensive ambiguity. I keep a sample size tool nearby, for example an A/B test sample size calculator, and I use it before I commit engineering time.
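If no calculator is handy, the detectability check can be approximated with the standard two-proportion z-test formula. The sketch below assumes a two-sided test at 5% significance and 80% power, an even split between two arms, and a 30-day month. Those defaults are my assumptions, and a dedicated calculator may use a slightly different formula.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_arm(baseline: float, absolute_lift: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate visitors per arm for a two-sided two-proportion z-test."""
    p1, p2 = baseline, baseline + absolute_lift
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / absolute_lift ** 2)


def days_to_learn(n_per_arm: int, monthly_visitors: int, arms: int = 2) -> float:
    """Days of traffic needed, assuming all visitors enter the test."""
    daily_visitors = monthly_visitors / 30
    return n_per_arm * arms / daily_visitors


n = sample_size_per_arm(baseline=0.03, absolute_lift=0.002)
print(n, round(days_to_learn(n, monthly_visitors=200_000), 1))
```

Under these assumptions, the table's example (3.0% to 3.2% on 200,000 monthly visitors) comes out on the order of 118,000 visitors per arm, which is more traffic than that step sees in two weeks. That is exactly the kind of mismatch this check is meant to catch before engineering time is spent.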
If I can’t explain the expected impact in one sentence, I’m not ready to ship or test.
Define “smallest effect worth shipping” as a threshold, not a hope
The smallest effect worth shipping (SEWS) is not “the smallest lift I’d be happy about.” It’s the smallest lift that beats the full cost of shipping, including hidden costs I used to ignore.
I set SEWS with four inputs:
First, cost. Engineering time is obvious, but I also price in QA, analytics instrumentation, design review, and the meeting tax. If I think it’s a one-day change, I still ask, “What’s the chance this becomes three days because of edge cases?”
Second, risk. Some changes can quietly hurt conversion, even if they look like “cleanup.” Behavioral science helps here. Users are loss averse, so removing familiar elements can backfire. Behavioral economics also shows friction matters more than you think. A “small” extra step can cause a big drop-off.
Third, confidence. I don’t pretend to have a single lift estimate. I write three numbers: best case, expected, worst case. Then I ask, “What’s the probability I’m wrong in a painful way?”
Fourth, time-to-learn. If the measurement needs a long payback window, I treat the SEWS threshold as higher. Slow feedback is expensive because it blocks other bets.
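One way to turn the cost input into a number is a break-even lift: the absolute lift at which the gain over a chosen payback period equals the expected cost. This is a rough sketch, not a full SEWS model. The dollar figures, the 30% overrun probability, and the three-month payback period are hypothetical, and risk, confidence, and time-to-learn still have to be layered on top as judgment.

```python
def expected_cost(base_cost: float, overrun_cost: float, p_overrun: float) -> float:
    """Blend the planned cost with the cost if edge cases blow up the estimate."""
    return (1 - p_overrun) * base_cost + p_overrun * overrun_cost


def break_even_lift(cost: float, monthly_visitors: int,
                    value_per_conversion: float, payback_months: float) -> float:
    """Absolute lift (as a fraction) that pays back the cost within payback_months."""
    return cost / (monthly_visitors * value_per_conversion * payback_months)


# Hypothetical: a "one-day" change at $1,500 that has a 30% chance
# of becoming a three-day change at $4,500.
cost = expected_cost(base_cost=1_500, overrun_cost=4_500, p_overrun=0.3)

# Same traffic and value per conversion as the table above.
floor = break_even_lift(cost, monthly_visitors=200_000,
                        value_per_conversion=40.0, payback_months=3)
print(f"expected cost ${cost:,.0f}, break-even lift {floor:.4%}")
```

A shorter payback period raises the break-even lift, which is one simple way to encode the idea that slow feedback should make the bar higher.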
Here’s the decision rule I use most weeks:
- If the expected impact clears SEWS and the worst case won’t sink me (including the cost of a rollback), I ship, often behind a flag.
- If the expected impact clears SEWS but the worst case is ugly, I only proceed with a contained experiment.
- If only the best case clears SEWS, I don’t ship. I shrink the idea until it becomes testable.
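The rule above can be written down as a small function, which makes it harder to bend in the moment. This is a sketch under my own assumptions: lifts are absolute fractions, `worst_case_floor` is a hypothetical name for the most negative lift I could tolerate, and the final branch (nothing clears SEWS) is not spelled out in the rule, so I treat it as “do not ship.”

```python
def decide(best: float, expected: float, worst: float,
           sews: float, worst_case_floor: float) -> str:
    """Map three lift estimates and a SEWS threshold to a decision."""
    if expected >= sews:
        if worst >= worst_case_floor:
            return "ship behind a flag"
        return "contained experiment"
    if best >= sews:
        # Only the optimistic estimate clears the bar.
        return "shrink the idea"
    # Assumption: not covered by the three bullets above.
    return "do not ship"


# Hypothetical estimates, with SEWS at +0.1 points and a tolerable
# worst case of -0.05 points.
print(decide(best=0.004, expected=0.002, worst=-0.0002,
             sews=0.001, worst_case_floor=-0.0005))
```

Writing the three estimates on the ticket before calling the function matters more than the function itself, because it forces the worst case to be stated out loud.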

One warning: SEWS fails when teams use it as a weapon to kill anything uncertain. Growth is uncertain by nature. The goal is faster learning with fewer expensive mistakes, not a fake sense of safety.
Choose experiments that teach fast, even when the “real” win is long term
A/B testing is great when you have stable traffic, clean instrumentation, and a clear conversion event. Still, I don’t start by asking, “Can we A/B test it?” I start with, “What’s the cheapest experiment that can prove or disprove the mechanism?”
Mechanism matters because it tells me why something should work. In growth work, mechanisms tend to fall into a few buckets: reduce effort, reduce doubt, increase clarity, increase motivation, or reduce perceived risk. If I can’t name the mechanism, I’m guessing.
Then I pick the smallest test that validates the mechanism:
- If the mechanism is “users don’t notice the value,” I can test messaging, information order, or defaults.
- If it’s “users don’t trust us,” I can test social proof placement, guarantees, or pricing transparency.
- If it’s “users can’t complete the step,” I can test error handling, field reduction, or a guided flow.
This is where analytics discipline matters. I define one primary metric, one guardrail (like refunds, churn, or support tickets), and one segmentation cut I care about (such as new vs returning, or mobile vs desktop). I also check for obvious issues like sample ratio mismatch, because broken assignment can create fake winners.
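A sample ratio mismatch check is a chi-square goodness-of-fit test on the assignment counts. The sketch below handles the two-arm case using only the standard library. The 0.001 alert threshold is a commonly used convention that I am assuming here, not a rule from this post.

```python
from math import erfc, sqrt


def srm_p_value(control_n: int, variant_n: int,
                expected_control_share: float = 0.5) -> float:
    """P-value of a chi-square test (1 degree of freedom) on arm counts."""
    total = control_n + variant_n
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi2 = ((control_n - expected_control) ** 2 / expected_control
            + (variant_n - expected_variant) ** 2 / expected_variant)
    # With 1 degree of freedom, the chi-square tail equals erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))


def has_srm(control_n: int, variant_n: int, threshold: float = 0.001) -> bool:
    """Flag the experiment when the split is very unlikely under the planned ratio."""
    return srm_p_value(control_n, variant_n) < threshold


# Hypothetical counts from a test planned as a 50/50 split.
print(has_srm(50_000, 50_100), has_srm(50_000, 51_500))
```

If the check fires, I would treat the result as untrustworthy and debug assignment and logging before reading any conversion numbers.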

Finally, I protect iteration speed. A win that doesn’t get followed up is wasted. If you want compounding results, set a rule that every “win” must produce a next test within 48 hours. When I need help keeping follow-ups tight, I like having next test suggestions tied to past results, because memory fades fast under deadline.
Where applied AI helps, and where it can lie to you
Applied AI is useful when it cuts cycle time without inventing truth.
I’ll use AI to draft variant copy, generate alternative layouts, cluster qualitative feedback, or scan experiment notes for repeated patterns. It’s also good at spotting oddities in event streams, which helps when instrumentation breaks. These are high-volume, low-stakes tasks.
Still, I don’t let AI set my SEWS threshold. That’s a business choice tied to cash, runway, and opportunity cost. AI also doesn’t feel the cost of a false positive. If it convinces you to ship a “winner” that’s noise, your product-led growth motion can drift for months. I keep a strict limit on how far I trust AI without human oversight.
So I keep the boundary clear: AI can propose options, but measurement decides. If the change can’t be measured cleanly, I treat it as a product decision, not a growth bet.
Conclusion: the decision I make before I build anything
When I choose the smallest effect worth shipping, I’m buying clarity and avoiding features shipped without a follow-up plan. Small, low-risk changes can skip heavy SEWS analysis. For everything else, I tie the bet to money, I size it to my measurement window, and I pick an experiment that can teach fast. That keeps my growth strategy grounded, even when data is messy.
Actionable takeaway: write your effect worth shipping on the ticket before work starts: baseline, minimum lift, time-to-learn, and worst-case downside. If you can’t fill those in, shrink the scope until you can.
