Question

Max Guernsey, III asked

Recommended volume to ensure Experiment scorecard significance?

Hello,

I ran two experiments. Both showed variations that seemed large but had p-values too high to be considered statistically significant.

For future reference, I (and probably others) would like to know:

What is the recommended number of users to run through an experimental variant in order to give yourself a good chance of producing a statistically significant result?

There are some assumptions in this question that I would like to make explicit:

  • The change in the variable will be either 0% or at least 20%.
  • We only care about one variable.
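
To make that concrete under these assumptions, the kind of answer I'm hoping for looks like a back-of-the-envelope power calculation. Here is a sketch; the 5% baseline conversion rate is an assumed, illustrative value, not something from PlayFab:

```python
# Back-of-the-envelope sample size for detecting a 20% relative change
# at alpha = 0.05 with 80% power. The 5% baseline conversion rate is an
# assumed, illustrative value.
from scipy.stats import norm

baseline = 0.05                 # assumed baseline conversion rate
treated = baseline * 1.20       # a 20% relative lift
alpha, power = 0.05, 0.80

z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)

# Standard two-proportion sample-size formula, per variant.
variance = baseline * (1 - baseline) + treated * (1 - treated)
n = (z_alpha + z_beta) ** 2 * variance / (treated - baseline) ** 2
print(f"~{n:,.0f} users per variant")   # roughly 8,200 with these inputs
```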

Facebook, for instance, gives you a guess of your chance of generating a statistically significant outcome when you are setting up an A/B test in an ad campaign. You can pay more for a better chance or less for a lower chance. I'm not asking for a feature like that. I'm just hoping someone has some guidance that can help me build better experiments.

An alternative form of this question that would be equally informative is this:

How did PlayFab engineers and product team members envision experiments being used? e.g.,

  • Expected volume of players
  • Anticipated experiment length
  • Potentially destabilizing external forces

Any information would definitely help me build a business on top of your platform using a Lean Startup approach, and it will probably help a lot of other engineers as well.

Thanks,

Max

analytics
2 comments

Seth Du ♦ commented:

I will consult the team and update this thread if there is any feedback.

Max Guernsey, III replied to Seth Du ♦:

Any word on this?


1 Answer

Seth Du answered

Sorry for the late response. According to feedback from the team, the scorecard does currently indicate statistically significant results, inferred from the p-value, which is evaluated against a threshold of 0.05.
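
For reference, that kind of significance call is analogous to a standard two-sample proportion test. Here is a minimal sketch of comparing a p-value against the 0.05 threshold; the counts are made up, and this is not necessarily the scorecard's exact computation:

```python
# Minimal two-sample proportion z-test compared against the 0.05
# threshold. Conversion counts and sample sizes are made up.
from statsmodels.stats.proportion import proportions_ztest

conversions = [260, 312]   # control, treatment
users = [5000, 5000]

stat, p_value = proportions_ztest(count=conversions, nobs=users)
verdict = "significant" if p_value < 0.05 else "not significant"
print(f"p = {p_value:.4f} -> {verdict}")
```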


Today, we don’t support a built-in recommendation for the required volume of players or experiment length. Please note, however, that these factors are taken into account by the experiment analysis scorecard's algorithm, which examines and determines statistical significance as well as destabilizing factors like sample ratio mismatch.
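
For context, sample ratio mismatch is commonly detected with a chi-square test of the observed traffic split against the intended one. A sketch, assuming a 50/50 allocation and made-up counts (the scorecard's internal check may differ):

```python
# Sample-ratio-mismatch (SRM) check: chi-square test of the observed
# split against the intended 50/50 allocation. Counts are made up.
from scipy.stats import chisquare

observed = [5123, 4877]                 # users actually assigned per variant
expected = [sum(observed) / 2] * 2      # intended 50/50 split

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
# A very small p-value (a common flag is p < 0.001) suggests an SRM.
print(f"SRM p-value: {p_value:.4f}")
```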

Moreover, we do recommend a duration of 14 days for Experiments for optimal results: it includes both weekdays and weekends, and isn't so long as to dramatically slow development cycles.

For a statistically significant effect, we recommend evaluating three sensitivity factors, and their interaction, as defined below:

  1. Consistency of behavior: This refers to how consistently the experiment's goal metric measures user behavior, irrespective of the experiment design. A consistent metric has lower variance, which makes it possible to identify smaller relative changes as statistically significant.
  2. Size of the effect: Any change to an application (treatment) has some effect on user behavior. For a given code change the size of the effect is fixed, but when planning experiments and code changes a team can consider larger or smaller impacts, so this needs to be thought through while formulating the experiment design.
  3. Size of the target audience: The number of players participating in the experiment and how they are distributed across the variants (smaller vs. larger). As a rule of thumb, there are three buckets of population size, measured by the number of users in the smaller variant. Note that the figures below assume two variants (see the minimum-detectable-effect sketch after this list).
    Bucket    Min Users    Max Users
    Ok        100          10,000
    Good      10,000       1,000,000
    Great     1,000,000    unlimited
  4. Interaction of different drivers: It's important to think about all three drivers together, because they interact. With millions of daily users, one can measure the impact of small changes to secondary features. With a thousand users, one needs to make material changes and measure them with well-designed metrics. The exact details will vary for every application and experiment. Running an A/A test, where both variants give users the same experience, is a good way to assess how small a movement can be identified as statistically significant (see the second sketch below).
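
To relate those buckets to detectable effects, here is a rough minimum-detectable-effect sketch at each bucket's lower boundary. The 5% baseline rate, alpha = 0.05, and 80% power are assumptions for illustration; this is a planning aid, not the scorecard's algorithm:

```python
# Approximate minimum detectable effect (MDE) at each bucket's lower
# boundary, assuming a 5% baseline conversion rate, alpha = 0.05, and
# 80% power. A rough planning aid, not the scorecard's algorithm.
from scipy.stats import norm

baseline = 0.05
z = norm.ppf(1 - 0.05 / 2) + norm.ppf(0.80)

for bucket, n in [("Ok", 100), ("Good", 10_000), ("Great", 1_000_000)]:
    # Smallest absolute difference detectable with n users per variant.
    mde = z * (2 * baseline * (1 - baseline) / n) ** 0.5
    print(f"{bucket:>5} (n = {n:>9,}): ~{mde / baseline:.1%} relative change")
```

With these assumptions, 100 users per variant can only resolve enormous swings (on the order of 170% relative), while a million users can resolve changes under 2%.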
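
And a sketch of the A/A idea from point 4: simulate both variants from the same conversion rate and confirm that only about 5% of runs come out "significant" purely by chance. All parameters here are made up:

```python
# A/A test sanity check: both variants draw from the same 5% conversion
# rate, so about 5% of runs should look "significant" at p < 0.05
# purely by chance. All parameters are made up.
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(0)
n_users, rate, runs = 10_000, 0.05, 1_000

false_positives = 0
for _ in range(runs):
    a = rng.binomial(n_users, rate)
    b = rng.binomial(n_users, rate)
    _, p = proportions_ztest(count=[a, b], nobs=[n_users, n_users])
    false_positives += p < 0.05

print(f"False-positive rate: {false_positives / runs:.1%}")  # expect ~5%
```
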
1 comment

Max Guernsey, III commented:

Thank you. This is very informative. Although those numbers are kind of daunting for indie developers, it will definitely help people plan experiments in the future.
