
Your A/B Test Isn't Telling You What You Think It Is

Savelle McThias

“We ran an A/B test. B won. Ship it.”

I’ve heard this exact conversation at least 50 times in my 18 years of UX work. And almost every time, the team is about to make a mistake.

Not because A/B testing doesn’t work—it does. But because most teams fundamentally misunderstand what their A/B tests are actually telling them.

They think A/B testing is simple:

  1. Create two versions
  2. Split traffic 50/50
  3. Wait for a winner
  4. Implement the winner everywhere

This approach is so common that entire platforms are built around it. And it’s leading to terrible decisions backed by “data.”

The Five Ways Teams Screw Up A/B Tests

Let me walk you through the most common mistakes I see—and how they lead to wrong conclusions.

Mistake #1: Using Inadequate Sample Sizes

What teams do: Run a test with 500 visitors to each variant, see that variant B converted 2% better than variant A, and call it a win.

Why this is wrong: With small sample sizes, random variation can easily produce a 2% difference even if there’s no real effect.

Real example from a client:

Their test:

  • Variant A: 247 visitors, 12 conversions (4.9%)
  • Variant B: 251 visitors, 17 conversions (6.8%)
  • Conclusion: “B wins by 39%! Ship it!”

The problem: Sample size was nowhere near large enough for statistical significance. I ran the numbers:

  • Statistical significance: p = 0.34
  • Confidence level (as most testing tools report it): 66%
  • Translation: even if A and B performed identically, you’d see a gap this large roughly a third of the time. That’s nowhere near the conventional bar for calling a winner.
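
If you want to run these numbers yourself, a two-proportion z-test is the standard quick check. Here’s a minimal sketch assuming statsmodels is installed; the exact p-value shifts slightly depending on which test your tool uses (z-test, chi-square, Fisher’s exact), so expect the same neighborhood rather than a perfect match.

```python
# Quick significance check for a two-variant conversion test.
# Sketch only; assumes statsmodels is available.
from statsmodels.stats.proportion import proportions_ztest

conversions = [17, 12]   # variant B, variant A
visitors = [251, 247]

z_stat, p_value = proportions_ztest(conversions, visitors)
print(f"z = {z_stat:.2f}, p = {p_value:.2f}")  # p lands far above 0.05 here
```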

What happened: They implemented B. Conversion rate stayed exactly the same over the next month. They’d wasted two weeks of testing and engineering time on a false positive.

What they should have done: Calculate required sample size BEFORE running the test.

For their baseline conversion rate (5%) and desired lift (39%), they needed:

  • Minimum 1,200 visitors per variant for 95% confidence
  • Test duration: 3-4 weeks at their traffic levels

How to Calculate Proper Sample Size

Don’t guess. Use this formula (or an online calculator):

Required sample size depends on:

  • Baseline conversion rate: Your current performance
  • Minimum detectable effect (MDE): Smallest improvement worth detecting
  • Statistical power: Usually 80% (probability of detecting a real effect)
  • Significance level: Usually 95% (p < 0.05)

Example:

  • Baseline conversion: 3%
  • Want to detect: 20% relative improvement (3% → 3.6%)
  • Power: 80%
  • Significance: 95%
  • Required sample: ~7,800 visitors per variant

If you only get 1,000 visitors per day to each variant, that’s an 8-day test minimum.

The rule: If you can’t reach adequate sample size in a reasonable timeframe, don’t run the test. You’ll just waste time.
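
If you’d rather script the calculation than trust whichever online calculator you found, standard power-analysis helpers do the same job. A minimal sketch, assuming statsmodels is installed; different tools use slightly different approximations, so treat the output as a ballpark figure, not gospel.

```python
# Rough per-variant sample size for a two-proportion test.
# Sketch only; assumes statsmodels is available.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.03               # current conversion rate
mde_relative = 0.20           # smallest relative lift worth detecting
target = baseline * (1 + mde_relative)

effect_size = proportion_effectsize(target, baseline)  # Cohen's h

n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,               # 95% significance level
    power=0.80,               # 80% power
    alternative="two-sided",
)
print(f"Required visitors per variant: {n_per_variant:,.0f}")
```

Divide the total by your daily traffic to get the minimum test duration, then round up to whole weeks.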

Mistake #2: Ending Tests Too Early

What teams do: Check results daily, see B ahead by 15% on day 3, and declare victory.

Why this is wrong: Early in a test, random variance creates big swings. What looks like a massive win on day 3 often regresses to zero by day 14.

This is called “peeking” and it’s dangerous.

Real example:

I consulted with an e-commerce company that was “data-driven.” They ran A/B tests constantly. And they were making worse decisions because of it.

Their process:

  1. Launch test Monday morning
  2. Check results Wednesday afternoon
  3. If B is winning, end test and implement
  4. Move to next test

Their reasoning: “We don’t want to waste time. If we see a winner, we implement it fast.”

The problem:

I audited their last 15 tests. Here’s a sample of what actually happened:

| Test | Day 3 Winner | Final Winner (Day 14) | Did They Ship? | Actual Result |
|---|---|---|---|---|
| 1 | B (+22%) | A (-3%) | Shipped B | Conversion dropped 4% |
| 2 | B (+18%) | B (+4%, p=0.09) | Shipped B | No significant change |
| 3 | A (+31%) | B (+7%, p=0.03) | Kept A | Missed 7% lift opportunity |
| 4 | B (+15%) | No difference | Shipped B | No change |
| 5 | B (+26%) | A (-8%) | Shipped B | Conversion dropped 9% |

Out of 15 tests, they made the wrong call 11 times because they peeked and ended early.

Why this happens:

Early in a test, you have small sample sizes, which means high variance. A few random conversions can swing the numbers dramatically.

Imagine:

  • Day 1: 50 visitors to A, 50 to B
  • Variant A: 1 conversion (2%)
  • Variant B: 4 conversions (8%)
  • “B is winning by 300%!”

But this means nothing. With such small samples, one or two conversions change everything.

By day 14, with proper sample sizes, the difference usually disappears.
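
You can see the peeking problem for yourself with a quick simulation: run a batch of A/A “tests” where both variants are identical by construction, check significance every day, and count how often an early look declares a winner that can’t possibly be real. A rough sketch (traffic and conversion numbers are invented for illustration):

```python
# Simulate A/A tests (no true difference) with daily peeking.
# Counts how often a "significant winner" appears anyway. Illustration only.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(0)
true_rate = 0.03          # both variants convert at exactly this rate
daily_visitors = 500      # per variant, per day
days = 14
n_tests = 1_000
false_wins = 0

for _ in range(n_tests):
    conv_a = conv_b = visitors = 0
    for _ in range(days):
        visitors += daily_visitors
        conv_a += rng.binomial(daily_visitors, true_rate)
        conv_b += rng.binomial(daily_visitors, true_rate)
        _, p = proportions_ztest([conv_a, conv_b], [visitors, visitors])
        if p < 0.05:      # a peek "finds" a winner -> the team ships it
            false_wins += 1
            break

print(f"A/A tests that declared a winner: {false_wins / n_tests:.0%}")
```

With fourteen peeks instead of one, far more than 5% of these no-difference tests cross the significance threshold at some point, which is exactly the trap the table above illustrates.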

The solution:

1. Decide your test duration BEFORE launching

Based on:

  • Required sample size
  • Your traffic levels
  • Seasonality (include different days of week)

2. Don’t peek at results

Or if you must peek, use sequential testing methods that account for multiple looks (Bayesian approaches or adjusted significance thresholds); there’s a minimal sketch of the Bayesian version after this list.

3. Commit to running the full duration

Even if it looks like you have a winner early. Discipline beats impatience.
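
If you truly need to monitor results as they accumulate, a Bayesian read on the data is one sequential-friendly option. Here’s a minimal sketch of the idea, with hypothetical counts; teams that use this typically agree on a decision threshold in advance (for example, only act once P(B > A) clears 95%) rather than reacting to every daily wiggle.

```python
# Bayesian comparison of two conversion rates via Beta posteriors.
# Sketch only; the counts below are hypothetical.
import numpy as np

rng = np.random.default_rng(42)

conv_a, visitors_a = 120, 4_000
conv_b, visitors_b = 145, 4_000

# Beta(1, 1) prior -> posterior is Beta(1 + conversions, 1 + non-conversions)
samples_a = rng.beta(1 + conv_a, 1 + visitors_a - conv_a, size=200_000)
samples_b = rng.beta(1 + conv_b, 1 + visitors_b - conv_b, size=200_000)

prob_b_better = (samples_b > samples_a).mean()
print(f"P(B beats A) = {prob_b_better:.1%}")
```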

Mistake #3: Using the Wrong Baseline Metrics

What teams do: Test a new checkout flow and only measure final conversion rate.

Why this is wrong: Final conversion rate doesn’t tell you where and why behavior changed.

Real example:

Client: SaaS company with complex 5-step signup flow

Their test:

  • Variant A (control): 5-step signup
  • Variant B: Simplified 3-step signup
  • Metric: Signup completion rate

Their result:

  • Variant B: +12% signup completion
  • “Ship it immediately!”

What they didn’t measure:

I ran a funnel analysis on both variants:

| Step | Variant A | Variant B | Change |
|---|---|---|---|
| Start signup | 100% | 100% | - |
| Enter email | 78% | 89% | +14% |
| Choose plan | 62% | 71% | +15% |
| Enter payment | 51% | 58% | +14% |
| Complete signup | 43% | 48% | +12% |
| Activate (7 days) | 35% | 21% | -40% |
| Paying (30 days) | 29% | 15% | -48% |

What actually happened:

Variant B made signup easier—too easy. It removed friction that helped qualify serious users.

Result:

  • More signups (+12%)
  • But worse quality leads
  • Lower activation (-40%)
  • Lower conversion to paying customers (-48%)

Net business impact: -36% revenue

If they’d measured only signup rate, they’d have shipped a variant that destroyed their business.

The lesson:

Measure your actual business goal, not just the proximate metric.

  • Selling products? Measure revenue, not just checkout completion
  • SaaS signup? Measure activation and retention, not just registration
  • Content site? Measure engagement and return visits, not just clicks
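
Mechanically, this means joining your experiment assignments to downstream outcomes instead of stopping at the event the test page fires. A rough pandas sketch, assuming a hypothetical visitor-level export where the file name and columns (`user_id`, `variant`, `signed_up`, `paying_30d`, `revenue_30d`) are stand-ins for whatever your own pipeline produces:

```python
# Report the proximate metric AND the downstream business metrics per variant.
# Sketch only; the file and column names are hypothetical.
import pandas as pd

df = pd.read_csv("experiment_users.csv")   # one row per visitor in the test

summary = df.groupby("variant").agg(
    visitors=("user_id", "count"),
    signup_rate=("signed_up", "mean"),          # proximate metric
    paying_rate_30d=("paying_30d", "mean"),     # the metric the business lives on
    revenue_per_visitor=("revenue_30d", "mean"),
)
print(summary)
```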

Mistake #4: Ignoring Segment Performance

This is the most common and most damaging mistake.

What teams do: Look at overall results: “B won by 10%, ship it!”

Why this is wrong: Variant B might work great for some users and terribly for others. Overall performance hides this.

Real example (this happens constantly):

Client: E-commerce site testing new product page design

Their result:

  • Overall conversion: Variant B +8.2% (p = 0.03)
  • “Winner! Implement B everywhere!”

What I found when I segmented the data:

| Segment | Variant A | Variant B | Change |
|---|---|---|---|
| Mobile | 2.1% | 3.2% | +52% |
| Desktop | 4.8% | 4.1% | -15% |
| New users | 2.7% | 3.4% | +26% |
| Returning users | 5.1% | 4.6% | -10% |
| Organic traffic | 3.9% | 4.2% | +8% |
| Paid traffic | 2.2% | 3.1% | +41% |

Translation:

Variant B is:

  • Excellent for mobile users (+52%)
  • Terrible for desktop users (-15%)
  • Great for new users (+26%)
  • Worse for returning customers (-10%)

What they should have done:

  1. Implement variant B for mobile only → Capture the +52% lift
  2. Keep variant A for desktop → Avoid the -15% drop
  3. Test a new variant C for desktop → Try to find what works for that segment

Net impact of smart segmentation vs. blanket implementation:

If they implemented B everywhere:

  • Mobile gains: +52% on 40% of traffic = +20.8%
  • Desktop losses: -15% on 60% of traffic = -9.0%
  • Net effect: +11.8%

If they segmented properly:

  • Mobile gains: +52% on 40% of traffic = +20.8%
  • Desktop stays same: 0% on 60% of traffic = 0%
  • Then test variant C for desktop
  • Net effect: +20.8% immediate, with upside potential from variant C

By not segmenting, they left +9% lift on the table.

The Segments You Must Analyze

Don’t just look at overall numbers. Segment by:

1. Device type

  • Mobile vs. tablet vs. desktop
  • iOS vs. Android
  • Different screen sizes

2. User type

  • New vs. returning visitors
  • Logged in vs. anonymous
  • Customer vs. prospect

3. Traffic source

  • Organic search
  • Paid search
  • Social media
  • Email
  • Direct

4. Demographics (if available)

  • Geographic location
  • Time of day / day of week
  • Language preference

5. Behavior patterns

  • High intent (viewed pricing, added to cart)
  • Low intent (browsing)
  • Cart value (high vs. low AOV)

One variant rarely wins for everyone. Segment analysis reveals the nuance.
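
Mechanically, a segment breakdown is just a grouped conversion summary per variant. A rough sketch with pandas, again using a hypothetical visitor-level table where `device`, `variant`, and `converted` stand in for your own column names; repeat it for each dimension above:

```python
# Per-segment conversion rates and relative lift of B over A.
# Sketch only; the file and column names are hypothetical.
import pandas as pd

df = pd.read_csv("experiment_visitors.csv")

rates = (
    df.groupby(["device", "variant"])["converted"]
      .mean()
      .unstack("variant")            # one column per variant: A, B
)
rates["lift"] = rates["B"] / rates["A"] - 1
print(rates.sort_values("lift", ascending=False))
```

Treat any segment-level “win” with the same sample-size discipline as the overall result; small segments reach significance far less reliably.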

Mistake #5: Misinterpreting Statistical Significance

What teams think: “95% confidence means we’re 95% sure B is better than A.”

What it actually means: “If there’s truly no difference between A and B, there’s only a 5% chance we’d see a result this extreme due to random chance.”

These are not the same thing.

Real example:

Client test result:

  • Variant B: +15% conversion
  • p-value: 0.04
  • Confidence: 96%

Client interpretation: “We’re 96% sure variant B is better!”

Actual meaning: “If A and B were truly the same, we’d see a result this extreme only 4% of the time.”

Why this matters:

A statistically significant result doesn’t tell you:

  • How big the real effect is (could be smaller than measured)
  • Whether the effect will persist (could be temporary)
  • Whether it’s worth the implementation cost (significant ≠ meaningful)

What you should actually look at:

  1. Confidence interval, not just p-value

    • “B improves conversion by 8-22% (95% CI)”
    • This tells you the range of plausible effect sizes
    • If the lower bound is still worth implementing, proceed
  2. Practical significance, not just statistical significance

    • A +2% lift might be statistically significant but not worth engineering effort
    • A +15% lift might not reach significance yet still be worth further testing
  3. Consistency across segments

    • Does the effect hold across different user types?
    • Or is it driven entirely by one small segment?
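
Computing the interval yourself takes only a few lines. A minimal sketch using the normal-approximation (Wald) interval for the difference in conversion rates; the counts are hypothetical, other interval methods give slightly different bounds, and converting to relative terms by dividing by the control rate is itself an approximation:

```python
# 95% confidence interval for the lift, not just a p-value.
# Wald (normal-approximation) interval; counts are hypothetical.
import math

conv_a, n_a = 320, 10_000   # control
conv_b, n_b = 368, 10_000   # variant

p_a, p_b = conv_a / n_a, conv_b / n_b
diff = p_b - p_a
se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)

low, high = diff - 1.96 * se, diff + 1.96 * se
print(f"Absolute lift: {diff:+.2%} (95% CI: {low:+.2%} to {high:+.2%})")
# Rough conversion to relative lift (ignores uncertainty in the control rate):
print(f"Relative lift: {diff / p_a:+.1%} (95% CI: {low / p_a:+.1%} to {high / p_a:+.1%})")
```

If the bottom of that range is still worth the engineering cost, implement; if it straddles zero or sits below your practical-significance bar, keep testing.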

How to Actually Run Meaningful A/B Tests

Here’s my process after 18 years of running (and fixing) A/B tests:

Before Launch: Test Design

1. Define your primary metric

  • What business outcome are you trying to improve?
  • Not a proxy metric—the actual goal

2. Define secondary and guardrail metrics

  • Secondary: Additional positive indicators
  • Guardrail: Metrics that shouldn’t get worse (bounce rate, page load time, etc.)

3. Calculate required sample size

4. Determine test duration

  • Based on traffic and required sample size
  • Include full weeks to account for day-of-week variance
  • Account for seasonality

5. Decide on segment analysis plan

  • Which segments will you analyze?
  • Pre-register your analysis plan to avoid data dredging

During the Test: Discipline

  1. Don’t peek (or use proper sequential testing methods)
  2. Don’t stop early unless there’s a technical problem
  3. Don’t change the test mid-flight
  4. Monitor for technical issues (tracking errors, load time problems)

After the Test: Deep Analysis

1. Check for statistical significance

  • Is p < 0.05?
  • What’s the confidence interval?

2. Analyze segment performance

  • Does the effect hold across all major segments?
  • Are there segments where B performs significantly worse?

3. Review secondary and guardrail metrics

  • Did we accidentally break something?
  • Are there unexpected negative effects?

4. Make segmented decisions

  • Implement winning variant for segments where it won
  • Keep control for segments where it lost
  • Test new variants for losing segments

5. Monitor post-implementation

  • Does the lift persist after full rollout?
  • Or was the test result a fluke?

Real-World Case Study: Doing It Right

Client: Mid-size e-commerce company, $12M annual revenue

Goal: Improve product page conversion rate

Current performance: 3.2% add-to-cart rate

Test hypothesis: Adding trust badges and reviews above the fold will increase conversions

Proper test design:

Metrics:

  • Primary: Add-to-cart rate
  • Secondary: Revenue per visitor
  • Guardrail: Bounce rate, time on page

Sample size calculation:

  • Baseline: 3.2%
  • Minimum detectable effect: 15% relative improvement (3.2% → 3.7%)
  • Power: 80%
  • Significance: 95%
  • Required: 8,400 visitors per variant

Test duration:

  • Traffic: 2,000 visitors/day to product pages
  • Required: 16,800 total visitors
  • Duration: 9 days minimum (rounded up to 14 days to include 2 full weeks)

Results after 14 days:

Overall:

  • Variant A (control): 3.18% (534 conversions, 16,792 visitors)
  • Variant B (trust badges): 3.64% (614 conversions, 16,871 visitors)
  • Lift: +14.5% (p = 0.021, 95% CI: 2.3% to 28.1%)

Segment analysis:

| Segment | Variant A | Variant B | Lift | p-value |
|---|---|---|---|---|
| Mobile | 2.1% | 2.9% | +38% | 0.008 ✓ |
| Desktop | 4.7% | 4.9% | +4% | 0.54 ✗ |
| New visitors | 2.4% | 3.1% | +29% | 0.015 ✓ |
| Returning | 5.2% | 5.3% | +2% | 0.78 ✗ |
| Products <$50 | 4.1% | 5.2% | +27% | 0.019 ✓ |
| Products >$50 | 2.2% | 2.1% | -5% | 0.71 ✗ |
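
For what it’s worth, segment-level p-values like these come from running the same two-proportion test within each segment rather than on the pooled totals. A minimal sketch of that loop (the counts are invented, not the case study’s actual data):

```python
# Per-segment two-proportion z-tests.
# Sketch only; the per-segment counts are invented for illustration.
from statsmodels.stats.proportion import proportions_ztest

segments = {
    # segment: (conversions_A, visitors_A, conversions_B, visitors_B)
    "mobile":  (150, 7_500, 200, 7_500),
    "desktop": (420, 9_000, 445, 9_000),
}

for name, (ca, na, cb, nb) in segments.items():
    _, p = proportions_ztest([cb, ca], [nb, na])
    lift = (cb / nb) / (ca / na) - 1
    print(f"{name:>8}: lift {lift:+.1%}, p = {p:.3f}")
```

The more segments you slice, the more likely one of them “wins” by chance, which is why the pre-registered segment plan from the test-design checklist matters.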

Interpretation:

Variant B works significantly better for:

  • Mobile users
  • New visitors
  • Lower-priced products

Variant B shows no significant effect for:

  • Desktop users
  • Returning visitors
  • Higher-priced products

Decision:

Phase 1: Implement for winning segments

  • Show trust badges to mobile users → +38% lift on 45% of traffic
  • Show trust badges to new visitors → +29% lift on 60% of traffic
  • Show trust badges on products under $50 → +27% lift on 70% of traffic

Phase 2: Test new variant for losing segments

  • Hypothesis: Desktop users and returning visitors don’t need trust badges (they’re already familiar/trusting)
  • Hypothesis: High-value products need different trust signals (warranties, detailed specs, expert reviews)
  • Test variant C: Social proof and detailed specifications for high-value products

Projected impact:

Naive implementation (B for everyone): +14.5% overall

Segmented implementation:

  • Mobile lift: +38% × 45% = +17.1%
  • New visitor lift: +29% × 60% = +17.4%
  • Low-price lift: +27% × 70% = +18.9%
  • (Overlapping segments, actual combined lift: ~22%)

By segmenting properly, they extracted 50% more value from the same test.

The Bottom Line

Your A/B test isn’t telling you “B is better than A.”

It’s telling you:

  • For which users B performs better
  • Under what conditions the improvement holds
  • How confident you can be in the measured effect size
  • What trade-offs you’re making (are any metrics getting worse?)

Most teams see “B wins” and stop thinking.

The best teams see “B wins overall” and start asking:

  • For whom?
  • Why?
  • Where does it lose?
  • How can we capture the gains and eliminate the losses?

A/B testing isn’t about finding winners. It’s about understanding your users deeply enough to serve different segments optimally.

The data is there. You just have to look past the surface-level number.

After 18 years of running tests, here’s what I know: The teams that win aren’t the ones running the most tests. They’re the ones extracting the most insight from each test.

Run fewer tests. Analyze them properly. Make segmented decisions.

Your conversion rate will thank you.
