Question

Max Guernsey, III asked

Recommended volume to ensure Experiment scorecard significance?

Hello,

I ran two experiments. Both showed variations that seemed large but had p-values too high to be considered statistically significant.

For future reference, I (and probably others) would like to know:

What is the recommended number of users to run through an experimental variant in order to give yourself a good chance of producing a statistically significant result?

There are some assumptions in this question that I would like to make explicit:

  • The change in the variable will be either 0% or at least 20%.
  • We only care about one variable.
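
To make that concrete under these assumptions, the kind of answer I'm hoping for looks like a back-of-the-envelope power calculation. Here is a sketch; the 5% baseline conversion rate is an assumed, illustrative value, not something from PlayFab:

```python
# Back-of-the-envelope sample size for detecting a 20% relative change
# at alpha = 0.05 with 80% power. The 5% baseline conversion rate is an
# assumed, illustrative value.
from scipy.stats import norm

baseline = 0.05                 # assumed baseline conversion rate
treated = baseline * 1.20       # a 20% relative lift
alpha, power = 0.05, 0.80

z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)

# Standard two-proportion sample-size formula, per variant.
variance = baseline * (1 - baseline) + treated * (1 - treated)
n = (z_alpha + z_beta) ** 2 * variance / (treated - baseline) ** 2
print(f"~{n:,.0f} users per variant")   # roughly 8,200 with these inputs
```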

Facebook, for instance, gives you a guess of your chance of generating a statistically significant outcome when you are setting up an A/B test in an ad campaign. You can pay more for a better chance or less for a lower chance. I'm not asking for a feature like that. I'm just hoping someone has some guidance that can help me build better experiments.

An alternative form of this question that would be equally informative is this:

How did PlayFab engineers and product team members envision experiments being used? e.g.,

  • Expected volume of players
  • Anticipated experiment length
  • Potentially destabilizing external forces

Any information would definitely help me build a business on top of your platform using a Lean Startup approach, and it will probably help a lot of other engineers as well.

Thanks,

Max

analytics
2 comments

Seth Du ♦ commented:

I will consult the team and update this thread if there is any feedback.

Max Guernsey, III replied to Seth Du ♦:

Any word on this?


1 Answer

Seth Du answered

Sorry for the late response. According to feedback from the team, the scorecard does currently indicate statistically significant results, inferred from the p-value, which is evaluated against a threshold of 0.05.
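
For reference, that kind of significance call is analogous to a standard two-sample proportion test. Here is a minimal sketch of comparing a p-value against the 0.05 threshold; the counts are made up, and this is not necessarily the scorecard's exact computation:

```python
# Minimal two-sample proportion z-test compared against the 0.05
# threshold. Conversion counts and sample sizes are made up.
from statsmodels.stats.proportion import proportions_ztest

conversions = [260, 312]   # control, treatment
users = [5000, 5000]

stat, p_value = proportions_ztest(count=conversions, nobs=users)
verdict = "significant" if p_value < 0.05 else "not significant"
print(f"p = {p_value:.4f} -> {verdict}")
```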


Today, we don’t support a built-in recommendation for the required volume of players or experiment length. Please note, however, that these factors are taken into account by the experiment analysis scorecard's algorithm, which examines and determines statistical significance as well as destabilizing factors like sample ratio mismatch.
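
For context, sample ratio mismatch is commonly detected with a chi-square test of the observed traffic split against the intended one. A sketch, assuming a 50/50 allocation and made-up counts (the scorecard's internal check may differ):

```python
# Sample-ratio-mismatch (SRM) check: chi-square test of the observed
# split against the intended 50/50 allocation. Counts are made up.
from scipy.stats import chisquare

observed = [5123, 4877]                 # users actually assigned per variant
expected = [sum(observed) / 2] * 2      # intended 50/50 split

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
# A very small p-value (a common flag is p < 0.001) suggests an SRM.
print(f"SRM p-value: {p_value:.4f}")
```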

Moreover, we do recommend a duration of 14 days for Experiments for optimal results: it includes both weekdays and weekends, and isn't so long as to dramatically slow development cycles.

For a statistically significant effect, we recommend evaluating three sensitivity factors, and their interaction, as defined below:

  1. Consistency of behavior: This refers to how consistently the experiment's goal metric measures user behavior, irrespective of the experiment design. A consistent metric has lower variance, which makes it possible to identify smaller relative changes as statistically significant.
  2. Size of the effect: Any change to an application (treatment) has some effect on user behavior. For a given code change the size of the effect is fixed, but when planning experiments and code changes a team can consider larger or smaller impacts, so this needs to be thought through while formulating the experiment design.
  3. Size of the target audience: The number of players participating in the experiment and how they are distributed across the variants (smaller vs. larger). As a rule of thumb, there are three buckets of population size, measured by the number of users in the smaller variant. Note that the figures below assume two variants (see the minimum-detectable-effect sketch after this list).
    Bucket    Min Users    Max Users
    Ok        100          10,000
    Good      10,000       1,000,000
    Great     1,000,000    unlimited
  4. Interaction of different drivers: It's important to think about all three drivers together, because they interact. With millions of daily users, one can measure the impact of small changes to secondary features. With a thousand users, one needs to make material changes and measure them with well-designed metrics. The exact details will vary for every application and experiment. Running an A/A test, where both variants give users the same experience, is a good way to assess how small a movement can be identified as statistically significant (see the second sketch below).
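
To relate those buckets to detectable effects, here is a rough minimum-detectable-effect sketch at each bucket's lower boundary. The 5% baseline rate, alpha = 0.05, and 80% power are assumptions for illustration; this is a planning aid, not the scorecard's algorithm:

```python
# Approximate minimum detectable effect (MDE) at each bucket's lower
# boundary, assuming a 5% baseline conversion rate, alpha = 0.05, and
# 80% power. A rough planning aid, not the scorecard's algorithm.
from scipy.stats import norm

baseline = 0.05
z = norm.ppf(1 - 0.05 / 2) + norm.ppf(0.80)

for bucket, n in [("Ok", 100), ("Good", 10_000), ("Great", 1_000_000)]:
    # Smallest absolute difference detectable with n users per variant.
    mde = z * (2 * baseline * (1 - baseline) / n) ** 0.5
    print(f"{bucket:>5} (n = {n:>9,}): ~{mde / baseline:.1%} relative change")
```

With these assumptions, 100 users per variant can only resolve enormous swings (on the order of 170% relative), while a million users can resolve changes under 2%.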
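
And a sketch of the A/A idea from point 4: simulate both variants from the same conversion rate and confirm that only about 5% of runs come out "significant" purely by chance. All parameters here are made up:

```python
# A/A test sanity check: both variants draw from the same 5% conversion
# rate, so about 5% of runs should look "significant" at p < 0.05
# purely by chance. All parameters are made up.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(0)
n_users, rate, runs = 10_000, 0.05, 1_000

false_positives = 0
for _ in range(runs):
    a = rng.binomial(n_users, rate)
    b = rng.binomial(n_users, rate)
    _, p = proportions_ztest(count=[a, b], nobs=[n_users, n_users])
    false_positives += p < 0.05

print(f"False-positive rate: {false_positives / runs:.1%}")  # expect ~5%
```
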
1 comment

Max Guernsey, III commented:

Thank you. This is very informative. Although those numbers are kind of daunting for indie developers, it will definitely help people plan experiments in the future.
