“We ran an A/B test. B won. Ship it.”
I’ve heard this exact conversation at least 50 times in my 18 years of UX work. And almost every time, the team is about to make a mistake.
Not because A/B testing doesn’t work—it does. But because most teams fundamentally misunderstand what their A/B tests are actually telling them.
They think A/B testing is simple:
- Create two versions
- Split traffic 50/50
- Wait for a winner
- Implement the winner everywhere
This approach is so common that entire platforms are built around it. And it’s leading to terrible decisions backed by “data.”
The Five Ways Teams Screw Up A/B Tests
Let me walk you through the most common mistakes I see—and how they lead to wrong conclusions.
Mistake #1: Using Inadequate Sample Sizes
What teams do: Run a test with 500 visitors to each variant, see that variant B converted 2% better than variant A, and call it a win.
Why this is wrong: With small sample sizes, random variation can easily produce a 2% difference even if there’s no real effect.
Real example from a client:
Their test:
- Variant A: 247 visitors, 12 conversions (4.9%)
- Variant B: 251 visitors, 17 conversions (6.8%)
- Conclusion: “B wins by 39%! Ship it!”
The problem: Sample size was nowhere near large enough for statistical significance. I ran the numbers:
- Statistical significance: p = 0.34
- Translation: if A and B actually performed identically, random variation alone would produce a gap this large about a third of the time
- That’s nowhere near the conventional p < 0.05 bar
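You don’t need a statistician on staff to catch this. A two-proportion z-test is a few lines of Python; here’s a minimal sketch (assuming SciPy is available). The exact p-value shifts a little depending on which test you run, but for these numbers every version of it lands far above 0.05:

```python
# Sanity-check an A/B result with a pooled two-proportion z-test.
# The counts below are the client figures quoted above.
from math import sqrt
from scipy.stats import norm

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - norm.cdf(abs(z)))

print(two_proportion_p_value(12, 247, 17, 251))  # ~0.36, nowhere near p < 0.05
```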
What happened: They implemented B. Conversion rate stayed exactly the same over the next month. They’d wasted two weeks of testing and engineering time on a false positive.
What they should have done: Calculate required sample size BEFORE running the test.
For their baseline conversion rate (5%) and desired lift (39%), they needed:
- Minimum 1,200 visitors per variant for 95% confidence
- Test duration: 3-4 weeks at their traffic levels
How to Calculate Proper Sample Size
Don’t guess. Run the numbers properly (or use an online calculator).
Required sample size depends on:
- Baseline conversion rate: Your current performance
- Minimum detectable effect (MDE): Smallest improvement worth detecting
- Statistical power: Usually 80% (probability of detecting a real effect)
- Significance level: Usually 95% (p < 0.05)
Example:
- Baseline conversion: 3%
- Want to detect: 20% relative improvement (3% → 3.6%)
- Power: 80%
- Significance: 95%
- Required sample: ~7,800 visitors per variant
If you only get 1,000 visitors per day to each variant, that’s an 8-day test minimum.
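If you want to sanity-check whatever calculator you use, the standard normal-approximation formula fits in a few lines of Python. This is a minimal sketch (assuming SciPy is available); calculators differ in their exact assumptions (one-sided vs. two-sided tests, continuity corrections), so expect the output to vary somewhat between tools:

```python
# Required visitors per variant for a two-proportion test,
# using the classic normal-approximation formula.
from math import ceil
from scipy.stats import norm

def sample_size_per_variant(baseline, relative_mde, power=0.80, alpha=0.05):
    p1 = baseline
    p2 = baseline * (1 + relative_mde)      # rate you want to be able to detect
    z_alpha = norm.ppf(1 - alpha / 2)       # two-sided significance threshold
    z_beta = norm.ppf(power)                # power requirement
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# 3% baseline, detect a 20% relative lift, 80% power, 95% significance
print(sample_size_per_variant(0.03, 0.20))
```

Whichever tool you trust, the point is to run this before launch, not after.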
The rule: If you can’t reach adequate sample size in a reasonable timeframe, don’t run the test. You’ll just waste time.
Mistake #2: Ending Tests Too Early
What teams do: Check results daily, see B ahead by 15% on day 3, and declare victory.
Why this is wrong: Early in a test, random variance creates big swings. What looks like a massive win on day 3 often regresses to zero by day 14.
This is called “peeking” and it’s dangerous.
Real example:
I consulted with an e-commerce company that was “data-driven.” They ran A/B tests constantly. And they were making worse decisions because of it.
Their process:
- Launch test Monday morning
- Check results Wednesday afternoon
- If B is winning, end test and implement
- Move to next test
Their reasoning: “We don’t want to waste time. If we see a winner, we implement it fast.”
The problem:
I audited their last 15 tests. Here’s a representative slice of what actually happened:
| Test | Day 3 Winner | Final Winner (Day 14) | Did They Ship? | Actual Result |
|---|---|---|---|---|
| 1 | B (+22%) | A (-3%) | Shipped B | Conversion dropped 4% |
| 2 | B (+18%) | B (+4%, p=0.09) | Shipped B | No significant change |
| 3 | A (+31%) | B (+7%, p=0.03) | Kept A | Missed 7% lift opportunity |
| 4 | B (+15%) | No difference | Shipped B | No change |
| 5 | B (+26%) | A (-8%) | Shipped B | Conversion dropped 9% |
Out of 15 tests, they made the wrong call 11 times because they peeked and ended early.
Why this happens:
Early in a test, you have small sample sizes, which means high variance. A few random conversions can swing the numbers dramatically.
Imagine:
- Day 1: 50 visitors to A, 50 to B
- Variant A: 1 conversion (2%)
- Variant B: 4 conversions (8%)
- “B is winning by 300%!”
But this means nothing. With such small samples, one or two conversions change everything.
By day 14, with proper sample sizes, the difference usually disappears.
The solution:
1. Decide your test duration BEFORE launching
Based on:
- Required sample size
- Your traffic levels
- Seasonality (include different days of week)
2. Don’t peek at results
Or if you must peek, use sequential testing methods that account for multiple looks (Bayesian approaches or adjusted significance thresholds).
3. Commit to running the full duration
Even if it looks like you have a winner early. Discipline beats impatience.
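If your team needs convincing that peeking is a real problem and not statistical pedantry, simulate it. The sketch below runs A/A tests (two identical variants), peeks every day, and declares a winner the first time the z-test crosses significance; the traffic and conversion numbers are illustrative, not from any client:

```python
# How often does daily peeking declare a "winner" when there is no
# real difference at all? Simulate A/A tests to find out.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

def peeking_false_positive_rate(n_tests=2000, days=14, daily_visitors=500,
                                true_rate=0.03, alpha=0.05):
    z_crit = norm.ppf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_tests):
        conv_a = conv_b = n_a = n_b = 0
        for _day in range(days):
            conv_a += rng.binomial(daily_visitors, true_rate)
            conv_b += rng.binomial(daily_visitors, true_rate)
            n_a += daily_visitors
            n_b += daily_visitors
            pooled = (conv_a + conv_b) / (n_a + n_b)
            se = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
            if se > 0 and abs((conv_b / n_b - conv_a / n_a) / se) > z_crit:
                false_positives += 1   # a "winner" was declared on some peek
                break
    return false_positives / n_tests

print(peeking_false_positive_rate())  # usually several times the nominal 0.05
```

Even with zero real difference, daily peeking declares a winner far more often than the 5% you were promised, which is exactly the pattern in the audit above.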
Mistake #3: Using the Wrong Baseline Metrics
What teams do: Test a new checkout flow and only measure final conversion rate.
Why this is wrong: Final conversion rate doesn’t tell you where and why behavior changed.
Real example:
Client: SaaS company with complex 5-step signup flow
Their test:
- Variant A (control): 5-step signup
- Variant B: Simplified 3-step signup
- Metric: Signup completion rate
Their result:
- Variant B: +12% signup completion
- “Ship it immediately!”
What they didn’t measure:
I ran a funnel analysis on both variants:
| Step | Variant A | Variant B | Change |
|---|---|---|---|
| Start signup | 100% | 100% | - |
| Enter email | 78% | 89% | +14% |
| Choose plan | 62% | 71% | +15% |
| Enter payment | 51% | 58% | +14% |
| Complete signup | 43% | 48% | +12% |
| Activate (7 days) | 35% | 21% | -40% |
| Paying (30 days) | 29% | 15% | -48% |
What actually happened:
Variant B made signup easier—too easy. It removed friction that helped qualify serious users.
Result:
- More signups (+12%)
- But worse quality leads
- Lower activation (-40%)
- Lower conversion to paying customers (-48%)
Net business impact: -36% revenue
If they’d measured only signup rate, they’d have shipped a variant that destroyed their business.
The lesson:
Measure your actual business goal, not just the proximate metric.
- Selling products? Measure revenue, not just checkout completion
- SaaS signup? Measure activation and retention, not just registration
- Content site? Measure engagement and return visits, not just clicks
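In practice this means joining experiment assignment to downstream outcomes, not just the event your testing tool happens to track. A minimal pandas sketch; the file and column names (user_id, variant, signed_up, activated_7d, paying_30d, revenue_30d) are hypothetical, so map them onto your own schema:

```python
# Compare downstream business metrics per variant, not just the
# proximate metric. One row per user in the experiment.
import pandas as pd

users = pd.read_csv("experiment_users.csv")  # hypothetical export

summary = users.groupby("variant").agg(
    users=("user_id", "count"),
    signup_rate=("signed_up", "mean"),          # the proximate metric
    activation_rate=("activated_7d", "mean"),   # did they actually use it?
    paying_rate=("paying_30d", "mean"),         # did they become customers?
    revenue_per_user=("revenue_30d", "mean"),   # what the business actually cares about
)
print(summary)
```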
Mistake #4: Ignoring Segment Performance
This is the most common and most damaging mistake.
What teams do: Look at overall results: “B won by 10%, ship it!”
Why this is wrong: Variant B might work great for some users and terribly for others. Overall performance hides this.
Real example (this happens constantly):
Client: E-commerce site testing new product page design
Their result:
- Overall conversion: Variant B +8.2% (p = 0.03)
- “Winner! Implement B everywhere!”
What I found when I segmented the data:
| Segment | Variant A | Variant B | Change |
|---|---|---|---|
| Mobile | 2.1% | 3.2% | +52% ✓ |
| Desktop | 4.8% | 4.1% | -15% ✗ |
| New users | 2.7% | 3.4% | +26% ✓ |
| Returning users | 5.1% | 4.6% | -10% ✗ |
| Organic traffic | 3.9% | 4.2% | +8% ✓ |
| Paid traffic | 2.2% | 3.1% | +41% ✓ |
Translation:
Variant B is:
- Excellent for mobile users (+52%)
- Terrible for desktop users (-15%)
- Great for new users (+26%)
- Worse for returning customers (-10%)
What they should have done:
- Implement variant B for mobile only → Capture the +52% lift
- Keep variant A for desktop → Avoid the -15% drop
- Test a new variant C for desktop → Try to find what works for that segment
Net impact of smart segmentation vs. blanket implementation (back-of-envelope, weighting each segment’s lift by its share of traffic):
If they implemented B everywhere:
- Mobile gains: +52% on 40% of traffic = +20.8%
- Desktop losses: -15% on 60% of traffic = -9.0%
- Net effect: +11.8%
If they segmented properly:
- Mobile gains: +52% on 40% of traffic = +20.8%
- Desktop stays same: 0% on 60% of traffic = 0%
- Then test variant C for desktop
- Net effect: +20.8% immediate, with upside potential from variant C
By not segmenting, they left roughly nine points of lift on the table.
The Segments You Must Analyze
Don’t just look at overall numbers. Segment by:
1. Device type
- Mobile vs. tablet vs. desktop
- iOS vs. Android
- Different screen sizes
2. User type
- New vs. returning visitors
- Logged in vs. anonymous
- Customer vs. prospect
3. Traffic source
- Organic search
- Paid search
- Social media
- Direct
4. Demographics (if available)
- Geographic location
- Time of day / day of week
- Language preference
5. Behavior patterns
- High intent (viewed pricing, added to cart)
- Low intent (browsing)
- Cart value (high vs. low AOV)
One variant rarely wins for everyone. Segment analysis reveals the nuance.
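Most testing tools show some of these cuts, but it’s worth being able to run the breakdown yourself. A minimal pandas sketch; it assumes one row per visitor with variant, converted, and segment columns, and the column names are illustrative:

```python
# Per-segment conversion rate, lift, and p-value for an A/B test.
import pandas as pd
from scipy.stats import chi2_contingency

visitors = pd.read_csv("experiment_visitors.csv")  # hypothetical export

def segment_report(df, segment_col):
    rows = []
    for segment, grp in df.groupby(segment_col):
        counts = grp.groupby("variant")["converted"].agg(["sum", "count"])
        rate_a = counts.loc["A", "sum"] / counts.loc["A", "count"]
        rate_b = counts.loc["B", "sum"] / counts.loc["B", "count"]
        # 2x2 table of converted / not converted for each variant
        table = [[counts.loc[v, "sum"], counts.loc[v, "count"] - counts.loc[v, "sum"]]
                 for v in ("A", "B")]
        _, p_value, _, _ = chi2_contingency(table, correction=False)
        rows.append({segment_col: segment, "rate_a": rate_a, "rate_b": rate_b,
                     "lift": rate_b / rate_a - 1, "p_value": p_value})
    return pd.DataFrame(rows)

print(segment_report(visitors, "device"))
print(segment_report(visitors, "user_type"))
```

One caution: the more segments you slice, the more likely one of them “wins” by chance. Treat segment results as hypotheses to confirm in a follow-up test, not as final answers.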
Mistake #5: Misinterpreting Statistical Significance
What teams think: “95% confidence means we’re 95% sure B is better than A.”
What it actually means: “If there’s truly no difference between A and B, there’s only a 5% chance we’d see a result this extreme due to random chance.”
These are not the same thing.
Real example:
Client test result:
- Variant B: +15% conversion
- p-value: 0.04
- Confidence: 96%
Client interpretation: “We’re 96% sure variant B is better!”
Actual meaning: “If A and B were truly the same, we’d see a result this extreme only 4% of the time.”
Why this matters:
A statistically significant result doesn’t tell you:
- How big the real effect is (could be smaller than measured)
- Whether the effect will persist (could be temporary)
- Whether it’s worth the implementation cost (significant ≠ meaningful)
What you should actually look at:
1. Confidence interval, not just p-value
- “B improves conversion by 8-22% (95% CI)”
- This tells you the range of plausible effect sizes
- If the lower bound is still worth implementing, proceed
2. Practical significance, not just statistical significance
- A +2% lift might be statistically significant but not worth the engineering effort
- A +15% lift might not reach significance but still be worth further testing
3. Consistency across segments
- Does the effect hold across different user types?
- Or is it driven entirely by one small segment?
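If your testing tool only reports a point estimate and a p-value, you can put a confidence interval on the relative lift yourself with a quick bootstrap. A minimal sketch with NumPy; the conversion counts are placeholders, not client data:

```python
# Bootstrap a confidence interval for the relative lift (B vs. A).
import numpy as np

rng = np.random.default_rng(7)

def relative_lift_ci(conv_a, n_a, conv_b, n_b, n_boot=10_000, ci=0.95):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Parametric bootstrap: redraw conversion counts from each observed rate.
    rates_a = rng.binomial(n_a, p_a, size=n_boot) / n_a
    rates_b = rng.binomial(n_b, p_b, size=n_boot) / n_b
    lifts = rates_b / rates_a - 1
    lower, upper = np.percentile(lifts, [(1 - ci) / 2 * 100, (1 + ci) / 2 * 100])
    return lower, upper

# Illustrative counts: ~3.1% vs ~3.5% conversion, a +12% point estimate
print(relative_lift_ci(500, 16_000, 560, 16_000))
```

For counts like these the interval runs from roughly break-even to about +25%: the point estimate looks great, but the lower bound tells you whether the worst plausible case still justifies the engineering cost.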
How to Actually Run Meaningful A/B Tests
Here’s my process after 18 years of running (and fixing) A/B tests:
Before Launch: Test Design
1. Define your primary metric
- What business outcome are you trying to improve?
- Not a proxy metric—the actual goal
2. Define secondary and guardrail metrics
- Secondary: Additional positive indicators
- Guardrail: Metrics that shouldn’t get worse (bounce rate, page load time, etc.)
3. Calculate required sample size
- Based on baseline conversion, minimum detectable effect, and power
- Use a calculator: https://www.evanmiller.org/ab-testing/sample-size.html
4. Determine test duration
- Based on traffic and required sample size
- Include full weeks to account for day-of-week variance
- Account for seasonality
5. Decide on segment analysis plan
- Which segments will you analyze?
- Pre-register your analysis plan to avoid data dredging
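Pre-registration doesn’t need special tooling; a plan checked into version control before launch is enough. Here’s a minimal sketch of what that might look like in Python, with illustrative field names and values:

```python
# A pre-registered test plan, committed before the first visitor is bucketed.
# Field names and values are illustrative, not a standard.
TEST_PLAN = {
    "name": "product_page_trust_badges",
    "hypothesis": "Trust badges above the fold increase add-to-cart rate",
    "primary_metric": "add_to_cart_rate",
    "secondary_metrics": ["revenue_per_visitor"],
    "guardrail_metrics": ["bounce_rate", "page_load_time"],
    "baseline_rate": 0.032,
    "minimum_detectable_effect": 0.15,     # relative
    "power": 0.80,
    "alpha": 0.05,
    "required_sample_per_variant": 8_400,
    "planned_duration_days": 14,           # two full weeks
    "segments_to_analyze": ["device", "new_vs_returning", "price_band"],
}
```

Writing it down before launch is what keeps the later segment analysis honest.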
During the Test: Discipline
1. Don’t peek (or use proper sequential testing methods)
2. Don’t stop early unless there’s a technical problem
3. Don’t change the test mid-flight
4. Monitor for technical issues (tracking errors, load time problems)
After the Test: Deep Analysis
1. Check for statistical significance
- Is p < 0.05?
- What’s the confidence interval?
2. Analyze segment performance
- Does the effect hold across all major segments?
- Are there segments where B performs significantly worse?
3. Review secondary and guardrail metrics
- Did we accidentally break something?
- Are there unexpected negative effects?
4. Make segmented decisions
- Implement winning variant for segments where it won
- Keep control for segments where it lost
- Test new variants for losing segments
5. Monitor post-implementation
- Does the lift persist after full rollout?
- Or was the test result a fluke?
Real-World Case Study: Doing It Right
Client: Mid-size e-commerce company, $12M annual revenue
Goal: Improve product page conversion rate
Current performance: 3.2% add-to-cart rate
Test hypothesis: Adding trust badges and reviews above the fold will increase conversions
Proper test design:
Metrics:
- Primary: Add-to-cart rate
- Secondary: Revenue per visitor
- Guardrail: Bounce rate, time on page
Sample size calculation:
- Baseline: 3.2%
- Minimum detectable effect: 15% relative improvement (3.2% → 3.7%)
- Power: 80%
- Significance: 95%
- Required: 8,400 visitors per variant
Test duration:
- Traffic: 2,000 visitors/day to product pages
- Required: 16,800 total visitors
- Duration: 9 days minimum (rounded up to 14 days to include 2 full weeks)
Results after 14 days:
Overall:
- Variant A (control): 3.18% (534 conversions, 16,792 visitors)
- Variant B (trust badges): 3.64% (614 conversions, 16,871 visitors)
- Lift: +14.5% (p = 0.021, 95% CI: 2.3% to 28.1%)
Segment analysis:
| Segment | Variant A | Variant B | Lift | p-value |
|---|---|---|---|---|
| Mobile | 2.1% | 2.9% | +38% | 0.008 ✓ |
| Desktop | 4.7% | 4.9% | +4% | 0.54 ✗ |
| New visitors | 2.4% | 3.1% | +29% | 0.015 ✓ |
| Returning | 5.2% | 5.3% | +2% | 0.78 ✗ |
| Products <$50 | 4.1% | 5.2% | +27% | 0.019 ✓ |
| Products >$50 | 2.2% | 2.1% | -5% | 0.71 ✗ |
Interpretation:
Variant B works significantly better for:
- Mobile users
- New visitors
- Lower-priced products
Variant B shows no significant effect for:
- Desktop users
- Returning visitors
- Higher-priced products
Decision:
Phase 1: Implement for winning segments
- Show trust badges to mobile users → +38% lift on 45% of traffic
- Show trust badges to new visitors → +29% lift on 60% of traffic
- Show trust badges on products under $50 → +27% lift on 70% of traffic
Phase 2: Test new variant for losing segments
- Hypothesis: Desktop users and returning visitors don’t need trust badges (they’re already familiar/trusting)
- Hypothesis: High-value products need different trust signals (warranties, detailed specs, expert reviews)
- Test variant C: Social proof and detailed specifications for high-value products
Projected impact:
Naive implementation (B for everyone): +14.5% overall
Segmented implementation:
- Mobile lift: +38% × 45% = +17.1%
- New visitor lift: +29% × 60% = +17.4%
- Low-price lift: +27% × 70% = +18.9%
- (Overlapping segments, actual combined lift: ~22%)
By segmenting properly, they extracted 50% more value from the same test.
The Bottom Line
Your A/B test isn’t telling you “B is better than A.”
It’s telling you:
- For which users B performs better
- Under what conditions the improvement holds
- How confident you can be in the measured effect size
- What trade-offs you’re making (are any metrics getting worse?)
Most teams see “B wins” and stop thinking.
The best teams see “B wins overall” and start asking:
- For whom?
- Why?
- Where does it lose?
- How can we capture the gains and eliminate the losses?
A/B testing isn’t about finding winners. It’s about understanding your users deeply enough to serve different segments optimally.
The data is there. You just have to look past the surface-level number.
After 18 years of running tests, here’s what I know: The teams that win aren’t the ones running the most tests. They’re the ones extracting the most insight from each test.
Run fewer tests. Analyze them properly. Make segmented decisions.
Your conversion rate will thank you.