“We ran an A/B test. B won. Ship it.”
I’ve heard this exact conversation at least 50 times in my 18 years of UX work. And almost every time, the team is about to make a mistake.
Not because A/B testing doesn’t work—it does. But because most teams fundamentally misunderstand what their A/B tests are actually telling them.
They think A/B testing is simple:
- Create two versions
- Split traffic 50/50
- Wait for a winner
- Implement the winner everywhere
This approach is so common that entire platforms are built around it. And it’s leading to terrible decisions backed by “data.”
The Five Ways Teams Screw Up A/B Tests
Let me walk you through the most common mistakes I see—and how they lead to wrong conclusions.
Mistake #1: Using Inadequate Sample Sizes
What teams do: Run a test with 500 visitors to each variant, see that variant B converted 2% better than variant A, and call it a win.
Why this is wrong: With small sample sizes, random variation can easily produce a 2% difference even if there’s no real effect.
Real example from a client:
Their test:
- Variant A: 247 visitors, 12 conversions (4.9%)
- Variant B: 251 visitors, 17 conversions (6.8%)
- Conclusion: “B wins by 39%! Ship it!”
The problem: Sample size was nowhere near large enough for statistical significance. I ran the numbers:
- Statistical significance: p = 0.34
- Translation: if A and B actually performed identically, random variation alone would produce a gap this large about a third of the time
- That’s nowhere near the conventional p < 0.05 bar
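You don’t need a statistician on staff to catch this. A two-proportion z-test is a few lines of Python; here’s a minimal sketch (assuming SciPy is available). The exact p-value shifts a little depending on which test you run, but for these numbers every version of it lands far above 0.05:

```python
# Sanity-check an A/B result with a pooled two-proportion z-test.
# The counts below are the client figures quoted above.
from math import sqrt
from scipy.stats import norm

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - norm.cdf(abs(z)))

print(two_proportion_p_value(12, 247, 17, 251))  # ~0.36, nowhere near p < 0.05
```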
What happened: They implemented B. Conversion rate stayed exactly the same over the next month. They’d wasted two weeks of testing and engineering time on a false positive.
What they should have done: Calculate required sample size BEFORE running the test.
For their baseline conversion rate (5%) and desired lift (39%), they needed:
- Minimum 1,200 visitors per variant for 95% confidence
- Test duration: 3-4 weeks at their traffic levels
How to Calculate Proper Sample Size
Don’t guess. Run the numbers properly (or use an online calculator).
Required sample size depends on:
- Baseline conversion rate: Your current performance
- Minimum detectable effect (MDE): Smallest improvement worth detecting
- Statistical power: Usually 80% (probability of detecting a real effect)
- Significance level: Usually 95% (p < 0.05)
Example:
- Baseline conversion: 3%
- Want to detect: 20% relative improvement (3% → 3.6%)
- Power: 80%
- Significance: 95%
- Required sample: ~7,800 visitors per variant
If you only get 1,000 visitors per day to each variant, that’s an 8-day test minimum.
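If you want to sanity-check whatever calculator you use, the standard normal-approximation formula fits in a few lines of Python. This is a minimal sketch (assuming SciPy is available); calculators differ in their exact assumptions (one-sided vs. two-sided tests, continuity corrections), so expect the output to vary somewhat between tools:

```python
# Required visitors per variant for a two-proportion test,
# using the classic normal-approximation formula.
from math import ceil
from scipy.stats import norm

def sample_size_per_variant(baseline, relative_mde, power=0.80, alpha=0.05):
    p1 = baseline
    p2 = baseline * (1 + relative_mde)      # rate you want to be able to detect
    z_alpha = norm.ppf(1 - alpha / 2)       # two-sided significance threshold
    z_beta = norm.ppf(power)                # power requirement
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# 3% baseline, detect a 20% relative lift, 80% power, 95% significance
print(sample_size_per_variant(0.03, 0.20))
```

Whichever tool you trust, the point is to run this before launch, not after.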
The rule: If you can’t reach adequate sample size in a reasonable timeframe, don’t run the test. You’ll just waste time.
Mistake #2: Ending Tests Too Early
What teams do: Check results daily, see B ahead by 15% on day 3, and declare victory.
Why this is wrong: Early in a test, random variance creates big swings. What looks like a massive win on day 3 often regresses to zero by day 14.
This is called “peeking” and it’s dangerous.
Real example:
I consulted with an e-commerce company that was “data-driven.” They ran A/B tests constantly. And they were making worse decisions because of it.
Their process:
- Launch test Monday morning
- Check results Wednesday afternoon
- If B is winning, end test and implement
- Move to next test
Their reasoning: “We don’t want to waste time. If we see a winner, we implement it fast.”
The problem:
I audited their last 15 tests. Here’s a representative slice of what actually happened:
| Test | Day 3 Winner | Final Winner (Day 14) | Did They Ship? | Actual Result |
|---|---|---|---|---|
| 1 | B (+22%) | A (-3%) | Shipped B | Conversion dropped 4% |
| 2 | B (+18%) | B (+4%, p=0.09) | Shipped B | No significant change |
| 3 | A (+31%) | B (+7%, p=0.03) | Kept A | Missed 7% lift opportunity |
| 4 | B (+15%) | No difference | Shipped B | No change |
| 5 | B (+26%) | A (-8%) | Shipped B | Conversion dropped 9% |
Out of 15 tests, they made the wrong call 11 times because they peeked and ended early.
Why this happens:
Early in a test, you have small sample sizes, which means high variance. A few random conversions can swing the numbers dramatically.
Imagine:
- Day 1: 50 visitors to A, 50 to B
- Variant A: 1 conversion (2%)
- Variant B: 4 conversions (8%)
- “B is winning by 300%!”
But this means nothing. With such small samples, one or two conversions change everything.
By day 14, with proper sample sizes, the difference usually disappears.
The solution:
1. Decide your test duration BEFORE launching
Based on:
- Required sample size
- Your traffic levels
- Seasonality (include different days of week)
2. Don’t peek at results
Or if you must peek, use sequential testing methods that account for multiple looks (Bayesian approaches or adjusted significance thresholds).
3. Commit to running the full duration
Even if it looks like you have a winner early. Discipline beats impatience.
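If your team needs convincing that peeking is a real problem and not statistical pedantry, simulate it. The sketch below runs A/A tests (two identical variants), peeks every day, and declares a winner the first time the z-test crosses significance; the traffic and conversion numbers are illustrative, not from any client:

```python
# How often does daily peeking declare a "winner" when there is no
# real difference at all? Simulate A/A tests to find out.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

def peeking_false_positive_rate(n_tests=2000, days=14, daily_visitors=500,
                                true_rate=0.03, alpha=0.05):
    z_crit = norm.ppf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_tests):
        conv_a = conv_b = n_a = n_b = 0
        for _day in range(days):
            conv_a += rng.binomial(daily_visitors, true_rate)
            conv_b += rng.binomial(daily_visitors, true_rate)
            n_a += daily_visitors
            n_b += daily_visitors
            pooled = (conv_a + conv_b) / (n_a + n_b)
            se = np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
            if se > 0 and abs((conv_b / n_b - conv_a / n_a) / se) > z_crit:
                false_positives += 1   # a "winner" was declared on some peek
                break
    return false_positives / n_tests

print(peeking_false_positive_rate())  # usually several times the nominal 0.05
```

Even with zero real difference, daily peeking declares a winner far more often than the 5% you were promised, which is exactly the pattern in the audit above.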
Mistake #3: Using the Wrong Baseline Metrics
What teams do: Test a new checkout flow and only measure final conversion rate.
Why this is wrong: Final conversion rate doesn’t tell you where and why behavior changed.
Real example:
Client: SaaS company with complex 5-step signup flow
Their test:
- Variant A (control): 5-step signup
- Variant B: Simplified 3-step signup
- Metric: Signup completion rate
Their result:
- Variant B: +12% signup completion
- “Ship it immediately!”
What they didn’t measure:
I ran a funnel analysis on both variants:
| Step | Variant A | Variant B | Change |
|---|---|---|---|
| Start signup | 100% | 100% | - |
| Enter email | 78% | 89% | +14% |
| Choose plan | 62% | 71% | +15% |
| Enter payment | 51% | 58% | +14% |
| Complete signup | 43% | 48% | +12% |
| Activate (7 days) | 35% | 21% | -40% |
| Paying (30 days) | 29% | 15% | -48% |
What actually happened:
Variant B made signup easier—too easy. It removed friction that helped qualify serious users.
Result:
- More signups (+12%)
- But worse quality leads
- Lower activation (-40%)
- Lower conversion to paying customers (-48%)
Net business impact: -36% revenue
If they’d measured only signup rate, they’d have shipped a variant that destroyed their business.
The lesson:
Measure your actual business goal, not just the proximate metric.
- Selling products? Measure revenue, not just checkout completion
- SaaS signup? Measure activation and retention, not just registration
- Content site? Measure engagement and return visits, not just clicks
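In practice this means joining experiment assignment to downstream outcomes, not just the event your testing tool happens to track. A minimal pandas sketch; the file and column names (user_id, variant, signed_up, activated_7d, paying_30d, revenue_30d) are hypothetical, so map them onto your own schema:

```python
# Compare downstream business metrics per variant, not just the
# proximate metric. One row per user in the experiment.
import pandas as pd

users = pd.read_csv("experiment_users.csv")  # hypothetical export

summary = users.groupby("variant").agg(
    users=("user_id", "count"),
    signup_rate=("signed_up", "mean"),          # the proximate metric
    activation_rate=("activated_7d", "mean"),   # did they actually use it?
    paying_rate=("paying_30d", "mean"),         # did they become customers?
    revenue_per_user=("revenue_30d", "mean"),   # what the business actually cares about
)
print(summary)
```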
Mistake #4: Ignoring Segment Performance
This is the most common and most damaging mistake.
What teams do: Look at overall results: “B won by 10%, ship it!”
Why this is wrong: Variant B might work great for some users and terribly for others. Overall performance hides this.
Real example (this happens constantly):
Client: E-commerce site testing new product page design
Their result:
- Overall conversion: Variant B +8.2% (p = 0.03)
- “Winner! Implement B everywhere!”
What I found when I segmented the data:
| Segment | Variant A | Variant B | Change |
|---|---|---|---|
| Mobile | 2.1% | 3.2% | +52% ✓ |
| Desktop | 4.8% | 4.1% | -15% ✗ |
| New users | 2.7% | 3.4% | +26% ✓ |
| Returning users | 5.1% | 4.6% | -10% ✗ |
| Organic traffic | 3.9% | 4.2% | +8% ✓ |
| Paid traffic | 2.2% | 3.1% | +41% ✓ |
Translation:
Variant B is:
- Excellent for mobile users (+52%)
- Terrible for desktop users (-15%)
- Great for new users (+26%)
- Worse for returning customers (-10%)
What they should have done:
- Implement variant B for mobile only → Capture the +52% lift
- Keep variant A for desktop → Avoid the -15% drop
- Test a new variant C for desktop → Try to find what works for that segment
Net impact of smart segmentation vs. blanket implementation (back-of-envelope, weighting each segment’s lift by its share of traffic):
If they implemented B everywhere:
- Mobile gains: +52% on 40% of traffic = +20.8%
- Desktop losses: -15% on 60% of traffic = -9.0%
- Net effect: +11.8%
If they segmented properly:
- Mobile gains: +52% on 40% of traffic = +20.8%
- Desktop stays same: 0% on 60% of traffic = 0%
- Then test variant C for desktop
- Net effect: +20.8% immediate, with upside potential from variant C
By not segmenting, they left roughly nine points of lift on the table.
The Segments You Must Analyze
Don’t just look at overall numbers. Segment by:
1. Device type
- Mobile vs. tablet vs. desktop
- iOS vs. Android
- Different screen sizes
2. User type
- New vs. returning visitors
- Logged in vs. anonymous
- Customer vs. prospect
3. Traffic source
- Organic search
- Paid search
- Social media
- Direct
4. Demographics (if available)
- Geographic location
- Time of day / day of week
- Language preference
5. Behavior patterns
- High intent (viewed pricing, added to cart)
- Low intent (browsing)
- Cart value (high vs. low AOV)
One variant rarely wins for everyone. Segment analysis reveals the nuance.
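Most testing tools show some of these cuts, but it’s worth being able to run the breakdown yourself. A minimal pandas sketch; it assumes one row per visitor with variant, converted, and segment columns, and the column names are illustrative:

```python
# Per-segment conversion rate, lift, and p-value for an A/B test.
import pandas as pd
from scipy.stats import chi2_contingency

visitors = pd.read_csv("experiment_visitors.csv")  # hypothetical export

def segment_report(df, segment_col):
    rows = []
    for segment, grp in df.groupby(segment_col):
        counts = grp.groupby("variant")["converted"].agg(["sum", "count"])
        rate_a = counts.loc["A", "sum"] / counts.loc["A", "count"]
        rate_b = counts.loc["B", "sum"] / counts.loc["B", "count"]
        # 2x2 table of converted / not converted for each variant
        table = [[counts.loc[v, "sum"], counts.loc[v, "count"] - counts.loc[v, "sum"]]
                 for v in ("A", "B")]
        _, p_value, _, _ = chi2_contingency(table, correction=False)
        rows.append({segment_col: segment, "rate_a": rate_a, "rate_b": rate_b,
                     "lift": rate_b / rate_a - 1, "p_value": p_value})
    return pd.DataFrame(rows)

print(segment_report(visitors, "device"))
print(segment_report(visitors, "user_type"))
```

One caution: the more segments you slice, the more likely one of them “wins” by chance. Treat segment results as hypotheses to confirm in a follow-up test, not as final answers.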
Mistake #5: Misinterpreting Statistical Significance
What teams think: “95% confidence means we’re 95% sure B is better than A.”
What it actually means: “If there’s truly no difference between A and B, there’s only a 5% chance we’d see a result this extreme due to random chance.”
These are not the same thing.
Real example:
Client test result:
- Variant B: +15% conversion
- p-value: 0.04
- Confidence: 96%
Client interpretation: “We’re 96% sure variant B is better!”
Actual meaning: “If A and B were truly the same, we’d see a result this extreme only 4% of the time.”
Why this matters:
A statistically significant result doesn’t tell you:
- How big the real effect is (could be smaller than measured)
- Whether the effect will persist (could be temporary)
- Whether it’s worth the implementation cost (significant ≠ meaningful)
What you should actually look at:
1. Confidence interval, not just p-value
- “B improves conversion by 8-22% (95% CI)”
- This tells you the range of plausible effect sizes
- If the lower bound is still worth implementing, proceed
2. Practical significance, not just statistical significance
- A +2% lift might be statistically significant but not worth the engineering effort
- A +15% lift might not reach significance but still be worth further testing
3. Consistency across segments
- Does the effect hold across different user types?
- Or is it driven entirely by one small segment?
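If your testing tool only reports a point estimate and a p-value, you can put a confidence interval on the relative lift yourself with a quick bootstrap. A minimal sketch with NumPy; the conversion counts are placeholders, not client data:

```python
# Bootstrap a confidence interval for the relative lift (B vs. A).
import numpy as np

rng = np.random.default_rng(7)

def relative_lift_ci(conv_a, n_a, conv_b, n_b, n_boot=10_000, ci=0.95):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Parametric bootstrap: redraw conversion counts from each observed rate.
    rates_a = rng.binomial(n_a, p_a, size=n_boot) / n_a
    rates_b = rng.binomial(n_b, p_b, size=n_boot) / n_b
    lifts = rates_b / rates_a - 1
    lower, upper = np.percentile(lifts, [(1 - ci) / 2 * 100, (1 + ci) / 2 * 100])
    return lower, upper

# Illustrative counts: ~3.1% vs ~3.5% conversion, a +12% point estimate
print(relative_lift_ci(500, 16_000, 560, 16_000))
```

For counts like these the interval runs from roughly break-even to about +25%: the point estimate looks great, but the lower bound tells you whether the worst plausible case still justifies the engineering cost.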
How to Actually Run Meaningful A/B Tests
Here’s my process after 18 years of running (and fixing) A/B tests:
Before Launch: Test Design
1. Define your primary metric
- What business outcome are you trying to improve?
- Not a proxy metric—the actual goal
2. Define secondary and guardrail metrics
- Secondary: Additional positive indicators
- Guardrail: Metrics that shouldn’t get worse (bounce rate, page load time, etc.)
3. Calculate required sample size
- Based on baseline conversion, minimum detectable effect, and power
- Use a calculator: https://www.evanmiller.org/ab-testing/sample-size.html
4. Determine test duration
- Based on traffic and required sample size
- Include full weeks to account for day-of-week variance
- Account for seasonality
5. Decide on segment analysis plan
- Which segments will you analyze?
- Pre-register your analysis plan to avoid data dredging
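Pre-registration doesn’t need special tooling; a plan checked into version control before launch is enough. Here’s a minimal sketch of what that might look like in Python, with illustrative field names and values:

```python
# A pre-registered test plan, committed before the first visitor is bucketed.
# Field names and values are illustrative, not a standard.
TEST_PLAN = {
    "name": "product_page_trust_badges",
    "hypothesis": "Trust badges above the fold increase add-to-cart rate",
    "primary_metric": "add_to_cart_rate",
    "secondary_metrics": ["revenue_per_visitor"],
    "guardrail_metrics": ["bounce_rate", "page_load_time"],
    "baseline_rate": 0.032,
    "minimum_detectable_effect": 0.15,     # relative
    "power": 0.80,
    "alpha": 0.05,
    "required_sample_per_variant": 8_400,
    "planned_duration_days": 14,           # two full weeks
    "segments_to_analyze": ["device", "new_vs_returning", "price_band"],
}
```

Writing it down before launch is what keeps the later segment analysis honest.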
During the Test: Discipline
1. Don’t peek (or use proper sequential testing methods)
2. Don’t stop early unless there’s a technical problem
3. Don’t change the test mid-flight
4. Monitor for technical issues (tracking errors, load time problems)
After the Test: Deep Analysis
1. Check for statistical significance
- Is p < 0.05?
- What’s the confidence interval?
2. Analyze segment performance
- Does the effect hold across all major segments?
- Are there segments where B performs significantly worse?
3. Review secondary and guardrail metrics
- Did we accidentally break something?
- Are there unexpected negative effects?
4. Make segmented decisions
- Implement winning variant for segments where it won
- Keep control for segments where it lost
- Test new variants for losing segments
5. Monitor post-implementation
- Does the lift persist after full rollout?
- Or was the test result a fluke?
Real-World Case Study: Doing It Right
Client: Mid-size e-commerce company, $12M annual revenue
Goal: Improve product page conversion rate
Current performance: 3.2% add-to-cart rate
Test hypothesis: Adding trust badges and reviews above the fold will increase conversions
Proper test design:
Metrics:
- Primary: Add-to-cart rate
- Secondary: Revenue per visitor
- Guardrail: Bounce rate, time on page
Sample size calculation:
- Baseline: 3.2%
- Minimum detectable effect: 15% relative improvement (3.2% → 3.7%)
- Power: 80%
- Significance: 95%
- Required: 8,400 visitors per variant
Test duration:
- Traffic: 2,000 visitors/day to product pages
- Required: 16,800 total visitors
- Duration: 9 days minimum (rounded up to 14 days to include 2 full weeks)
Results after 14 days:
Overall:
- Variant A (control): 3.18% (534 conversions, 16,792 visitors)
- Variant B (trust badges): 3.64% (614 conversions, 16,871 visitors)
- Lift: +14.5% (p = 0.021, 95% CI: 2.3% to 28.1%)
Segment analysis:
| Segment | Variant A | Variant B | Lift | p-value |
|---|---|---|---|---|
| Mobile | 2.1% | 2.9% | +38% | 0.008 ✓ |
| Desktop | 4.7% | 4.9% | +4% | 0.54 ✗ |
| New visitors | 2.4% | 3.1% | +29% | 0.015 ✓ |
| Returning | 5.2% | 5.3% | +2% | 0.78 ✗ |
| Products <$50 | 4.1% | 5.2% | +27% | 0.019 ✓ |
| Products >$50 | 2.2% | 2.1% | -5% | 0.71 ✗ |
Interpretation:
Variant B works significantly better for:
- Mobile users
- New visitors
- Lower-priced products
Variant B shows no significant effect for:
- Desktop users
- Returning visitors
- Higher-priced products
Decision:
Phase 1: Implement for winning segments
- Show trust badges to mobile users → +38% lift on 45% of traffic
- Show trust badges to new visitors → +29% lift on 60% of traffic
- Show trust badges on products under $50 → +27% lift on 70% of traffic
Phase 2: Test new variant for losing segments
- Hypothesis: Desktop users and returning visitors don’t need trust badges (they’re already familiar/trusting)
- Hypothesis: High-value products need different trust signals (warranties, detailed specs, expert reviews)
- Test variant C: Social proof and detailed specifications for high-value products
Projected impact:
Naive implementation (B for everyone): +14.5% overall
Segmented implementation:
- Mobile lift: +38% × 45% = +17.1%
- New visitor lift: +29% × 60% = +17.4%
- Low-price lift: +27% × 70% = +18.9%
- (Overlapping segments, actual combined lift: ~22%)
By segmenting properly, they extracted 50% more value from the same test.
The Bottom Line
Your A/B test isn’t telling you “B is better than A.”
It’s telling you:
- For which users B performs better
- Under what conditions the improvement holds
- How confident you can be in the measured effect size
- What trade-offs you’re making (are any metrics getting worse?)
Most teams see “B wins” and stop thinking.
The best teams see “B wins overall” and start asking:
- For whom?
- Why?
- Where does it lose?
- How can we capture the gains and eliminate the losses?
A/B testing isn’t about finding winners. It’s about understanding your users deeply enough to serve different segments optimally.
The data is there. You just have to look past the surface-level number.
After 18 years of running tests, here’s what I know: The teams that win aren’t the ones running the most tests. They’re the ones extracting the most insight from each test.
Run fewer tests. Analyze them properly. Make segmented decisions.
Your conversion rate will thank you.