Define and test scientific hypotheses with pre-registration support to ensure experimental rigor. The hypothesis testing framework helps prevent p-hacking and post-hoc rationalization.
Scientific experimentation follows a structured process:
- Define hypothesis before collecting data
- Specify endpoints and success criteria
- Calculate sample size for adequate power
- Run experiment and collect outcomes
- Analyze results against pre-registered hypothesis
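The sample-size step above can be worked through by hand for a binary endpoint. The following is a minimal, framework-independent sketch using the standard normal-approximation formula for comparing two proportions. The helper name `RequiredSampleSizePerGroup` and the hard-coded z-values (two-sided alpha of 0.05, power of 0.80) are assumptions for illustration and are not part of ExperimentFramework.

```csharp
using System;

// Hypothetical helper (not an ExperimentFramework API): per-group sample size
// needed to detect a difference between two proportions, using the normal
// approximation n = (zAlpha + zBeta)^2 * (p1*q1 + p2*q2) / (p2 - p1)^2.
static int RequiredSampleSizePerGroup(double baseline, double expected, double zAlpha, double zBeta)
{
    double varianceSum = baseline * (1 - baseline) + expected * (1 - expected);
    double delta = expected - baseline;
    double n = Math.Pow(zAlpha + zBeta, 2) * varianceSum / (delta * delta);
    return (int)Math.Ceiling(n);
}

// z-values assumed for a two-sided alpha of 0.05 and 80% power.
const double ZAlpha = 1.959964; // standard normal quantile at 0.975
const double ZBeta = 0.841621;  // standard normal quantile at 0.80

// Example: 25% baseline conversion, aiming to detect a lift to 30%.
int perGroup = RequiredSampleSizePerGroup(0.25, 0.30, ZAlpha, ZBeta);
Console.WriteLine($"Required sample size per group: {perGroup}");
```

Under these assumptions the requirement comes out at roughly 1,250 users per group. Halving the detectable lift roughly quadruples that number, which is why the expected effect size should be fixed before data collection starts.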
Pre-registration prevents:
- P-hacking: Testing multiple hypotheses until one is significant
- HARKing: Hypothesizing After Results are Known
- Selective reporting: Only reporting significant results
By defining hypotheses before analysis, you maintain scientific integrity and produce credible results.
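To see why testing many hypotheses until one is significant is a problem, it helps to compute how quickly the chance of a false positive grows. The sketch below assumes the tests are independent and that every null hypothesis is true; the function names are illustrative and not part of ExperimentFramework.

```csharp
using System;

// Probability of at least one false positive across k independent tests,
// each run at significance level alpha, when every null hypothesis is true.
static double FamilyWiseErrorRate(double alpha, int tests) =>
    1 - Math.Pow(1 - alpha, tests);

// Bonferroni correction: per-test threshold that keeps the family-wise
// error rate at or below alpha.
static double BonferroniAlpha(double alpha, int tests) => alpha / tests;

foreach (int k in new[] { 1, 5, 20 })
{
    Console.WriteLine(
        $"{k,2} tests: FWER = {FamilyWiseErrorRate(0.05, k):F3}, " +
        $"Bonferroni per-test alpha = {BonferroniAlpha(0.05, k):F4}");
}
```

With 20 tests at alpha = 0.05, the chance of at least one spurious "significant" result is about 64%. Committing to a single primary hypothesis up front, or applying a correction such as Bonferroni, keeps that rate near the nominal 5%.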
Test if treatment is better than control:
using ExperimentFramework.Science.Builders;
using ExperimentFramework.Data.Models;
var hypothesis = new HypothesisBuilder("checkout-optimization")
.Superiority()
.NullHypothesis("The new checkout has no effect on conversion")
.AlternativeHypothesis("The new checkout increases conversion rate")
.PrimaryEndpoint("purchase_completed", OutcomeType.Binary, ep => ep
.Description("Purchase completion rate")
.HigherIsBetter())
.ExpectedEffectSize(0.05) // 5% improvement
.WithSuccessCriteria(c => c
.Alpha(0.05)
.Power(0.80)
.MinimumSampleSize(1000))
.DefinedNow()
.Build();

Test if treatment is not worse than control (by more than a margin):
var hypothesis = new HypothesisBuilder("api-migration")
.NonInferiority()
.NullHypothesis("The new API is worse than the current API by 50ms or more")
.AlternativeHypothesis("The new API is not worse than current by more than 50ms")
.PrimaryEndpoint("response_time", OutcomeType.Duration, ep => ep
.Description("API response latency")
.Unit("milliseconds")
.LowerIsBetter())
.ExpectedEffectSize(-10) // Expect 10ms improvement
.WithSuccessCriteria(c => c
.Alpha(0.05)
.Power(0.80)
.NonInferiorityMargin(50)) // Acceptable if within 50ms
.DefinedNow()
.Build();

Test if treatment and control are equivalent (within a margin):
var hypothesis = new HypothesisBuilder("algorithm-validation")
.Equivalence()
.NullHypothesis("The algorithms produce different results")
.AlternativeHypothesis("The algorithms produce equivalent results")
.PrimaryEndpoint("accuracy", OutcomeType.Continuous, ep => ep
.Description("Prediction accuracy")
.HigherIsBetter())
.ExpectedEffectSize(0.0) // Expect no difference
.WithSuccessCriteria(c => c
.Alpha(0.05)
.Power(0.80)
.EquivalenceMargin(0.02)) // Within 2% is equivalent
.DefinedNow()
.Build();

Test if there is any difference (direction unknown):
var hypothesis = new HypothesisBuilder("layout-experiment")
.TwoSided()
.NullHypothesis("The new layout has no effect on engagement")
.AlternativeHypothesis("The new layout affects engagement (positive or negative)")
.PrimaryEndpoint("session_duration", OutcomeType.Duration, ep => ep
.Description("Time spent on site")
.HigherIsBetter())
.ExpectedEffectSize(0.3) // Cohen's d = 0.3 (small effect)
.WithSuccessCriteria(c => c
.Alpha(0.05)
.Power(0.80))
.DefinedNow()
.Build();

The main outcome that determines success:
.PrimaryEndpoint("conversion", OutcomeType.Binary, ep => ep
.Description("User completes purchase")
.HigherIsBetter()
.ExpectedBaseline(0.25) // Current 25% conversion
.MinimumImportantDifference(0.03)) // 3pp lift is meaningful

Additional outcomes that provide context:
.PrimaryEndpoint("conversion", OutcomeType.Binary, ...)
.SecondaryEndpoint("revenue", OutcomeType.Continuous, ep => ep
.Description("Order value")
.Unit("USD")
.HigherIsBetter()
.ExpectedBaseline(50.0)
.ExpectedVariance(625)) // Std dev = 25
.SecondaryEndpoint("cart_additions", OutcomeType.Count, ep => ep
.Description("Items added to cart")
.HigherIsBetter())
.SecondaryEndpoint("checkout_time", OutcomeType.Duration, ep => ep
.Description("Time to complete checkout")
.Unit("seconds")
.LowerIsBetter())

.WithSuccessCriteria(c => c
.Alpha(0.05) // 5% significance level
.Power(0.80) // 80% power
.MinimumSampleSize(500))

.WithSuccessCriteria(c => c
.Alpha(0.05)
.Power(0.80)
.MinimumSampleSize(1000)
.MinimumEffectSize(0.02) // Reject if effect < 2%
.PrimaryEndpointOnly() // Only primary must be significant
.WithMultipleComparisonCorrection() // Apply correction for multiple tests
.MinimumDuration(TimeSpan.FromDays(14)) // Run at least 2 weeks
.RequirePositiveEffect()) // Effect must be in expected direction

// Require only primary endpoint significant
.PrimaryEndpointOnly()
// Require all endpoints significant
.AllEndpoints()

using ExperimentFramework.Science.Builders;
using ExperimentFramework.Data.Models;
public class ExperimentDefinition
{
public static HypothesisDefinition CheckoutOptimization()
{
return new HypothesisBuilder("checkout-v2-superiority")
.Description("Testing the streamlined checkout flow's impact on conversion")
.Superiority()
.NullHypothesis("The streamlined checkout has no effect on conversion rate")
.AlternativeHypothesis("The streamlined checkout improves conversion by at least 5%")
.Rationale("""
Prior user research showed frustration with the current 5-step checkout.
The new streamlined flow reduces steps to 2 and pre-fills shipping info.
Similar changes at Company X showed a 7% conversion lift.
""")
.Control("legacy-checkout")
.Treatment("streamlined-checkout")
.PrimaryEndpoint("purchase_completed", OutcomeType.Binary, ep => ep
.Description("User successfully completes a purchase")
.HigherIsBetter()
.ExpectedBaseline(0.25)
.MinimumImportantDifference(0.03))
.SecondaryEndpoint("checkout_time", OutcomeType.Duration, ep => ep
.Description("Time from cart to order confirmation")
.Unit("seconds")
.LowerIsBetter()
.ExpectedBaseline(180)
.ExpectedVariance(3600))
.SecondaryEndpoint("cart_abandonment", OutcomeType.Binary, ep => ep
.Description("User abandons cart before purchase")
.LowerIsBetter()
.ExpectedBaseline(0.75))
.ExpectedEffectSize(0.05)
.WithSuccessCriteria(c => c
.Alpha(0.05)
.Power(0.80)
.MinimumSampleSize(2000)
.MinimumEffectSize(0.02)
.PrimaryEndpointOnly()
.WithMultipleComparisonCorrection()
.MinimumDuration(TimeSpan.FromDays(14))
.RequirePositiveEffect())
.WithMetadata("analyst", "data-team@company.com")
.WithMetadata("jira_ticket", "EXP-1234")
.DefinedNow()
.Build();
}
}

using ExperimentFramework.Data;
using ExperimentFramework.Science;
public class ExperimentService
{
private readonly IOutcomeRecorder _recorder;
private readonly HypothesisDefinition _hypothesis;
public ExperimentService(IOutcomeRecorder recorder)
{
_recorder = recorder;
_hypothesis = ExperimentDefinition.CheckoutOptimization();
}
public async Task RecordCheckoutOutcome(
string userId,
string assignedTrial,
bool purchaseCompleted,
TimeSpan checkoutDuration)
{
var experimentName = _hypothesis.Name;
// Record primary endpoint
await _recorder.RecordBinaryAsync(
experimentName, assignedTrial, userId,
_hypothesis.PrimaryEndpoint.Name,
purchaseCompleted);
// Record secondary endpoints
await _recorder.RecordDurationAsync(
experimentName, assignedTrial, userId,
"checkout_time",
checkoutDuration);
await _recorder.RecordBinaryAsync(
experimentName, assignedTrial, userId,
"cart_abandonment",
!purchaseCompleted);
}
}

using ExperimentFramework.Science.Analysis;
using ExperimentFramework.Science.Reporting;
public class AnalysisService
{
private readonly IExperimentAnalyzer _analyzer;
private readonly IExperimentReporter _reporter;
public AnalysisService(
IExperimentAnalyzer analyzer,
IExperimentReporter reporter)
{
_analyzer = analyzer;
_reporter = reporter;
}
public async Task<string> AnalyzeExperiment(HypothesisDefinition hypothesis)
{
var report = await _analyzer.AnalyzeAsync(
hypothesis.Name,
hypothesis,
new AnalysisOptions
{
Alpha = hypothesis.SuccessCriteria.Alpha,
TargetPower = hypothesis.SuccessCriteria.Power,
ApplyMultipleComparisonCorrection =
hypothesis.SuccessCriteria.ApplyMultipleComparisonCorrection
});
// Check against success criteria
var success = EvaluateSuccess(hypothesis, report);
// Generate report
return await _reporter.GenerateAsync(report);
}
private bool EvaluateSuccess(HypothesisDefinition hypothesis, ExperimentReport report)
{
var criteria = hypothesis.SuccessCriteria;
var primaryResult = report.PrimaryResult;
if (primaryResult == null)
return false;
// Check significance
if (!primaryResult.IsSignificant)
return false;
// Check effect direction
if (criteria.RequirePositiveEffect)
{
var effect = primaryResult.PointEstimate;
if (hypothesis.PrimaryEndpoint.HigherIsBetter && effect <= 0)
return false;
if (!hypothesis.PrimaryEndpoint.HigherIsBetter && effect >= 0)
return false;
}
// Check minimum effect size
if (criteria.MinimumEffectSize.HasValue)
{
var absEffect = Math.Abs(primaryResult.PointEstimate);
if (absEffect < criteria.MinimumEffectSize.Value)
return false;
}
return true;
}
}

Store hypothesis definitions for auditing:
using System.Reflection;
using System.Runtime.InteropServices;
using ExperimentFramework.Science.Models.Snapshots;
using ExperimentFramework.Science.Snapshots;
public class PreRegistrationService
{
private readonly ISnapshotStore _snapshots;
public PreRegistrationService(ISnapshotStore snapshots)
{
_snapshots = snapshots;
}
public async Task PreRegisterHypothesis(HypothesisDefinition hypothesis)
{
var snapshot = new ExperimentSnapshot
{
Id = Guid.NewGuid().ToString(),
ExperimentName = hypothesis.Name,
Timestamp = DateTimeOffset.UtcNow,
Type = SnapshotType.PreRegistration,
Hypothesis = hypothesis,
Environment = new EnvironmentInfo
{
ApplicationVersion = Assembly.GetExecutingAssembly().GetName().Version?.ToString(),
RuntimeVersion = RuntimeInformation.FrameworkDescription,
MachineName = Environment.MachineName
}
};
await _snapshots.SaveAsync(snapshot);
}
}

Always define hypotheses before looking at data:
// Good - hypothesis defined at experiment start
var hypothesis = new HypothesisBuilder("experiment")
.Superiority()
.NullHypothesis("No effect")
.AlternativeHypothesis("Treatment improves conversion by 5%")
.ExpectedEffectSize(0.05)
.DefinedNow() // Timestamp for audit trail
.Build();
// Bad - defining after seeing results
var result = await analyzer.AnalyzeAsync("experiment");
if (result.PrimaryResult?.PValue > 0.05)
{
// Don't change the hypothesis now!
}

Base expected effect size on prior evidence:
// Good - based on prior research
.ExpectedEffectSize(0.05)
.Rationale("Similar changes showed 5% lift in previous A/B test")
// Bad - arbitrary or optimistic
.ExpectedEffectSize(0.50) // Unrealistic, leads to underpowered tests

Avoid multiple primary endpoints to prevent multiple testing issues:
// Good - one primary
.PrimaryEndpoint("conversion", OutcomeType.Binary)
.SecondaryEndpoint("revenue", OutcomeType.Continuous)
.SecondaryEndpoint("time_on_site", OutcomeType.Duration)
// Bad - multiple primaries inflate false positive rate
.PrimaryEndpoint("conversion")
.PrimaryEndpoint("revenue") // This is really a secondary

Explain why you expect the treatment to work:
.Rationale("""
Based on:
1. User research showing checkout friction
2. Competitor analysis of streamlined flows
3. Previous test showing 3% lift from removing one step
We expect the combined improvements to yield 5% conversion lift.
""")

Ensure adequate sample before analyzing:
.WithSuccessCriteria(c => c
.MinimumSampleSize(2000)
.MinimumDuration(TimeSpan.FromDays(14)))

Primary Endpoint: purchase_completed
Control: 25.0%, Treatment: 30.2%
Difference: +5.2pp, 95% CI [3.1%, 7.3%]
p-value: 0.0001
Result: SIGNIFICANT, supports alternative hypothesis
Conclusion: The streamlined checkout increases conversion.
Recommend: Roll out to 100% of users.

Primary Endpoint: purchase_completed
Control: 25.0%, Treatment: 22.1%
Difference: -2.9pp, 95% CI [-5.1%, -0.7%]
p-value: 0.0098
Result: SIGNIFICANT, but effect is negative
Conclusion: The streamlined checkout DECREASES conversion.
Recommend: Investigate why. Do not roll out.

Primary Endpoint: purchase_completed
Control: 25.0%, Treatment: 26.1%
Difference: +1.1pp, 95% CI [-0.8%, 3.0%]
p-value: 0.254
Result: NOT SIGNIFICANT
Conclusion: No evidence the streamlined checkout affects conversion.
Recommend: Consider larger sample or different approach.
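The three outcomes above depend on where the confidence interval for the difference falls relative to zero. The sketch below is a framework-independent illustration using a Wald interval for the difference of two proportions. The helper names and the visitor counts are assumptions chosen to roughly reproduce the first example; the analyzer in ExperimentFramework may use a different test or interval method.

```csharp
using System;

// Hypothetical helper (not an ExperimentFramework API): Wald 95% confidence
// interval for the difference in conversion rates (treatment minus control).
static (double Diff, double Lower, double Upper) DifferenceWithCi(
    int controlSuccesses, int controlTotal,
    int treatmentSuccesses, int treatmentTotal)
{
    const double z = 1.959964; // standard normal quantile at 0.975
    double pc = (double)controlSuccesses / controlTotal;
    double pt = (double)treatmentSuccesses / treatmentTotal;
    double se = Math.Sqrt(pc * (1 - pc) / controlTotal + pt * (1 - pt) / treatmentTotal);
    double diff = pt - pc;
    return (diff, diff - z * se, diff + z * se);
}

// Maps an interval onto the three outcomes, for a higher-is-better endpoint.
static string Interpret(double lower, double upper)
{
    if (lower > 0) return "SIGNIFICANT, supports alternative hypothesis";
    if (upper < 0) return "SIGNIFICANT, but effect is negative";
    return "NOT SIGNIFICANT";
}

// Assumed counts: 875/3500 control (25.0%) and 1057/3500 treatment (30.2%).
var (observedDiff, ciLower, ciUpper) = DifferenceWithCi(875, 3500, 1057, 3500);
Console.WriteLine($"Difference: {observedDiff:P1}, 95% CI [{ciLower:P1}, {ciUpper:P1}]");
Console.WriteLine($"Result: {Interpret(ciLower, ciUpper)}");
```

An interval entirely above zero corresponds to the first case, entirely below zero to the second, and an interval that contains zero to the third. The decision rule is fixed by the pre-registered hypothesis, so it must not be adjusted after the interval is known.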
- Statistical Analysis - Statistical test details
- Power Analysis - Sample size calculation
- Data Collection - Recording outcomes