Galileo: Analysis and Reporting of Experiments

Summary
To analyze experiments running on Galileo, we built its analysis and reporting module. Instead of teams running their analysis manually, Galileo provides the results of the experiment on the go. The resulting report covers every aspect of the experiment to help teams make the right decision.

Team
Data & AI

Author(s)
Talal Shoukat

About the Author(s)
Talal Shoukat is a Data Scientist at Careem, working in the Data & AI team and helping to build a state-of-the-art experimentation platform.

As the Careem Super App has expanded over the past few years, we have been able to deliver on our purpose of simplifying and improving the lives of people to a whole new level. But building a simple product is complex. With an expanding list of services and a growing Customer base, our ecosystem of processes has become more and more complicated.
As we entered a new stage of growth we knew that leveraging data and insights would be key to building a Customer-centric experience. We needed to build a structured experimentation platform to iterate and test new ideas in a safe space and ensure all business decisions resulted in an efficient allocation of resources and customer satisfaction.
To fulfill this aim, we decided to build an in-house experimentation platform, internally named Galileo, that is trustworthy, easy to use, and extensible. Teams at Careem can now rely on Galileo to accelerate innovation and ultimately make better, data-driven decisions. Galileo randomizes respondents into different groups and analyzes their responses, ensuring that the end-to-end journey, from defining a hypothesis to performing the analysis, is error-free so that teams can make better decisions that drive growth. The platform supports experiments across all of Careem’s services and is widely used to run A/B/N and switchback experiments.

In the past, teams at Careem had to assign a dedicated data scientist to evaluate the results of their experiments, which was costly, inefficient, and led to inconsistent reports across the organization. This resulted in a disparity between team practices, since teams ran their tests independently and used their own judgment in test selection, pre-processing, and visualization. Galileo solves these problems by providing centralized analytics with consistent statistical tests, sanity checks, and post-processing techniques. Analysis reports are generated daily, allowing experiment owners to make informed decisions with a bird’s-eye view.
The Galileo platform was built upon the following standards: 

Trustworthy: Experiment reports should be trustworthy, with reliable statistical results.
Centralized platform: A centralized view of all the experiments running in the company, with interactions between experiments identified.
Efficient: With a large number of experiments running daily, the platform should be able to generate reports as required.
Flexible: Different teams have different experimentation requirements, and the platform should support extending the analysis through custom configuration.
Ease of use: The platform should be designed so that any member of the organization can easily create, analyze, and roll out experiments.

The diagram below illustrates the steps Galileo undertakes to perform the experiment analysis. 

Figure 1: Experiment Analysis Flow in Galileo
The first step in the pipeline of any experiment analysis is the computation of metrics. In Galileo, we fetch experiment data from the events table and join it with the metric repo. The data includes attributes, variant names, and the value of the metric. It is then processed through a series of steps: pre-processing, variance reduction, statistical testing, p-value correction, and post-processing. The output YAML document is validated against a schema before being sent for visualization on the Galileo portal or in a Python notebook.
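As an illustration of the final validation step, here is a minimal sketch using PyYAML and the jsonschema library; the schema and field names are hypothetical and much simpler than Galileo's actual report schema.

```python
import yaml
from jsonschema import validate

# Hypothetical, simplified schema for the analysis output document;
# field names are illustrative and not Galileo's actual schema.
REPORT_SCHEMA = {
    "type": "object",
    "required": ["experiment_id", "metrics"],
    "properties": {
        "experiment_id": {"type": "string"},
        "metrics": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["name", "variant", "lift", "p_value"],
                "properties": {
                    "name": {"type": "string"},
                    "variant": {"type": "string"},
                    "lift": {"type": "number"},
                    "p_value": {"type": "number", "minimum": 0, "maximum": 1},
                },
            },
        },
    },
}

report_yaml = """
experiment_id: exp-123
metrics:
  - name: conversion_rate
    variant: treatment
    lift: 2.7
    p_value: 0.004
"""

# Raises jsonschema.ValidationError if the document does not match the schema.
validate(instance=yaml.safe_load(report_yaml), schema=REPORT_SCHEMA)
print("report document is valid")
```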
The Python SDK uses Kepler and Hubble to fetch metric definitions and compute metrics for the analysis. This separation of concerns improves maintainability, with minimal overlap between functionalities.
The high-level architecture of how the Python code interacts with other services is shown below:

Figure 2: Python SDK interaction with Kepler and Hubble

Metrics quantify the impact that the feature or innovation under test has on the business. The choice of metric plays a critical role in the successful execution of an experiment. In Galileo, users can define a single goal metric (primary metric) as well as multiple supporting metrics.
The purpose of the goal metric is to determine the success of the experiment by either rejecting or accepting the hypothesis. Supporting metrics help highlight the interaction between different metrics: the impact on one metric can have multiple implications across product goals. Having secondary metrics gives teams a deeper understanding of the interactions between metrics and helps them prioritize their goals accordingly. For instance, a change in the application interface may increase the goal metric of engagement time but decrease revenue and conversion rate. Teams running such experiments can quickly identify the trade-off, if any, between their primary and secondary metrics and eventually make a better-informed decision.
Galileo also uses guardrail metrics to ensure a smooth customer experience. These are system-added metrics that help detect experiments that cause the application to crash; Galileo brings such experiments to an immediate halt and alerts the teams concerned.

To ensure the reliability of the results, our platform performs rigorous sanity checks before performing analysis. These include the following:

Sample Size Imbalance: A Sample Ratio Mismatch (SRM) test is performed to check whether the traffic is being split according to the weights assigned in the experiment configuration. In the ideal case, the variation in the observed sample ratio can be modeled as random error, so a chi-square test can check whether the difference between the observed and expected ratios is larger than chance alone would explain. An alert is generated when the p-value of the test falls below the threshold of 0.01. The chi-square statistic is the sum, over all variants, of the squared difference between the observed and expected counts, divided by the expected count; the equation is shown below, and a code sketch of this check follows this list.
χ² = Σᵢ (Oᵢ − Eᵢ)² / Eᵢ,  where Oᵢ and Eᵢ are the observed and expected counts for variant i
In the experiment report, the Sample Size Balance badge informs the user if there is any deviation from the sample size initially configured in the experiment. 

Flickers: Flickers are respondents that have switched between the control and treatment groups. For instance, a Careem application user may have an older version of the OS and choose to upgrade it while the experiment is running, leading them to experience both the control and the treatment. Such a user constitutes a flicker in the experiment. Ideally, there should be no flickers, but depending on the configuration, we may get them. Galileo removes flickers before running the analysis and also reports the percentage of flickers in the experiment.
Bias detection: To check whether there is any bias between the sampled variants, Galileo evaluates the goal metric on pre-experiment data. If bias is detected, a badge notifies the user of the potential bias, and Galileo runs a difference-in-differences method on the data to obtain reliable results.
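Here is the SRM check referred to above, as a minimal sketch using SciPy's chi-square goodness-of-fit test; the counts and weights are illustrative, not real experiment data.

```python
from scipy.stats import chisquare

# Minimal sketch of a Sample Ratio Mismatch (SRM) check using a chi-square
# goodness-of-fit test; the counts and weights below are illustrative.
observed = [50_320, 49_260]          # users actually assigned to each variant
weights = [0.5, 0.5]                 # traffic split from the experiment config
total = sum(observed)
expected = [w * total for w in weights]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)

# Galileo flags an imbalance when the p-value drops below 0.01.
if p_value < 0.01:
    print(f"Possible sample ratio mismatch (chi2={stat:.2f}, p={p_value:.4f})")
else:
    print(f"Traffic split is consistent with the configuration (p={p_value:.4f})")
```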

Variance reduction helps decrease the duration of an experiment. Galileo uses CUPED, in which the variance that can be explained by pre-experiment data is removed. This method increases the statistical power of the hypothesis test without biasing the estimated treatment effect. We will cover CUPED in detail in a future blog post.
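As a rough illustration of the idea (a minimal sketch, not Galileo's implementation), CUPED adjusts each user's in-experiment metric using their pre-experiment value of the same metric as a covariate:

```python
import numpy as np

# Minimal sketch of the CUPED idea: adjust each user's in-experiment metric y
# using their pre-experiment metric x as a covariate.
def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
    theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

# Synthetic example where pre-experiment behaviour explains much of the variance.
rng = np.random.default_rng(42)
x = rng.normal(100, 20, 10_000)            # pre-experiment metric values
y = 0.8 * x + rng.normal(0, 10, 10_000)    # in-experiment metric values
y_adj = cuped_adjust(y, x)
print(f"variance before: {y.var():.1f}, after CUPED: {y_adj.var():.1f}")
```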

For any experimentation platform, the statistical engine is the core component in decision making: it helps teams decide whether or not to roll out a feature. In Galileo, analysis can be performed either from the UI or from the Python SDK, which gives data scientists flexibility in their analysis. They can use the SDK not only to customize the settings of the test but also to compute their metrics in a Python notebook and visualize them however they want.
The p-value helps decide whether a variation is statistically significant and therefore whether the tested hypothesis holds. The significance level set for the test determines the criterion for rejecting the null hypothesis. To guard against false positives, the default significance level in the statistical test configuration is 0.01. While this increases the duration of an experiment compared with higher significance thresholds, it keeps the error rate low. The figure below illustrates how the significance level and statistical power affect Type I (false positive) and Type II (false negative) errors.
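To make this trade-off concrete, here is a small, illustrative calculation (not Galileo code; the effect size and power are arbitrary) showing how a stricter significance level increases the sample size needed per variant:

```python
from statsmodels.stats.power import TTestIndPower

# Illustrative only: how tightening the significance level increases the
# sample size needed per variant for a fixed effect size and 80% power.
analysis = TTestIndPower()
for alpha in (0.05, 0.01):
    n = analysis.solve_power(effect_size=0.05, alpha=alpha, power=0.8)
    print(f"alpha={alpha}: ~{round(n):,} users per variant")
```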
We perform tests based on the category and distribution of the metric. 
The following flow chart illustrates how tests are selected:

Figure 3: Test Selection in Galileo

Welch’s t-test is the default test for continuous metrics. It assumes normality but not equal variances. Welch’s t-test tends to give the same result as Student’s t-test when the variants have equal sample sizes and variances; when sample sizes and variances are unequal, Student’s t-test becomes unreliable and Welch’s t-test performs better.
The Mann-Whitney U test is better suited to skewed distributions. Because it is nonparametric, it makes fewer assumptions about the sample data, so it applies more widely than parametric tests.
The bootstrap method estimates population quantities by repeatedly resampling the data with replacement, so it performs better when the sample size is small.
Difference-in-differences uses the outcome of the control group as a proxy for what would have happened in the treatment group had there been no treatment. The difference in average post-treatment outcomes between the treatment and control groups is then used to measure the treatment effect.
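As an illustration of how such a selection might look in practice, here is a hedged sketch using SciPy; the skewness cut-off and function names are illustrative and not Galileo's actual selection logic from the flow chart above.

```python
import numpy as np
from scipy import stats

# Hypothetical selection logic: use Welch's t-test for roughly normal
# continuous metrics and the Mann-Whitney U test for heavily skewed ones.
# The skewness threshold is illustrative, not Galileo's actual rule.
def compare_variants(control: np.ndarray, treatment: np.ndarray) -> dict:
    skewed = abs(stats.skew(control)) > 2 or abs(stats.skew(treatment)) > 2
    if skewed:
        stat, p_value = stats.mannwhitneyu(treatment, control, alternative="two-sided")
        test = "mann-whitney-u"
    else:
        stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)  # Welch's t-test
        test = "welch-t"
    return {"test": test, "statistic": float(stat), "p_value": float(p_value)}

rng = np.random.default_rng(7)
control = rng.normal(10.0, 2.0, 5_000)
treatment = rng.normal(10.1, 2.0, 5_000)
print(compare_variants(control, treatment))
```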

After the statistical testing is executed and conclusions are drawn, the lift is calculated along with the confidence interval of the lift value. The required sample size to declare a lift statistically significant is also computed. These statistics give users a deeper view into the results of the test.

Figure 4: Snapshot of the experiment report
Lift:
Lift is the difference between the means of the two variants and should be close to 0 in an A/A test. The means of the variants are used to calculate the lift with the following formula:

lift (%) = (mean(treatment_group) − mean(control_group)) / mean(control_group) × 100

Percentage lift measures the change in the treatment_group with respect to the control_group. Since lift can be positive or negative, we use two-tailed tests to calculate the significance of our A/B tests.
Confidence Interval:
The confidence interval is the range in which the true lift is most likely to lie, and the confidence level tells us how certain we can be that it does. Galileo calculates the confidence interval by assuming that the metric follows a normal distribution. The report snapshot (Figure 4) illustrates a metric with a confidence interval of [5.5, 6.9].
Required Sample Size:
The required sample size indicates how many data points are needed for the test result to be significant. It depends on the difference between the mean values of the two variants and the pooled variance of the data. As more data points arrive, the required sample size grows if the distributions become more similar to one another; otherwise, it shrinks.
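A minimal sketch of how these three quantities (percentage lift, a normal-approximation confidence interval, and a required sample size per variant) could be computed; the formulas are standard, and Galileo's exact implementation may differ.

```python
import numpy as np
from scipy import stats

# Illustrative post-processing: percentage lift, a normal-approximation
# confidence interval for the absolute lift, and a required sample size
# per variant to detect the observed difference.
def summarize(control: np.ndarray, treatment: np.ndarray,
              alpha: float = 0.01, power: float = 0.8) -> dict:
    diff = treatment.mean() - control.mean()
    lift_pct = 100 * diff / control.mean()

    # Confidence interval for the absolute lift under a normal approximation.
    se = np.sqrt(control.var(ddof=1) / len(control) +
                 treatment.var(ddof=1) / len(treatment))
    z = stats.norm.ppf(1 - alpha / 2)
    ci = (diff - z * se, diff + z * se)

    # Required sample size per variant: n = 2 * sigma^2 * (z_alpha + z_beta)^2 / diff^2
    pooled_var = (control.var(ddof=1) + treatment.var(ddof=1)) / 2
    z_beta = stats.norm.ppf(power)
    n_required = 2 * pooled_var * (z + z_beta) ** 2 / diff ** 2

    return {"lift_pct": lift_pct, "ci": ci,
            "n_required_per_variant": int(np.ceil(n_required))}

rng = np.random.default_rng(1)
print(summarize(rng.normal(10.0, 2.0, 8_000), rng.normal(10.27, 2.0, 8_000)))
```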

After statistical testing, Galileo applies the Benjamini-Hochberg (BH) correction. The method ranks the p-values from lowest to highest and then adjusts each variant's p-value accordingly.
The need for this arises because Galileo allows users to perform multivariate testing, which increases the chance of false-positive results. For instance, suppose we have nine variants and the statistical significance threshold is set to 0.05. In that scenario, the probability of getting at least one false-positive result rises to 0.37. The probability can be calculated using the formula given below:
P(at least one false positive) = 1 − (1 − α)ⁿ = 1 − (1 − 0.05)⁹ ≈ 0.37,  where α is the significance threshold and n is the number of variants
To mitigate the problem, we use BH correction. 
adjusted p(k) = p(k) × m / k
Where k is the rank of the p-value and m is the number of variants being compared. This approach is less conservative than the Bonferroni correction, which applies the same adjusted alpha to every variant.
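A minimal sketch of the BH adjustment using statsmodels (not necessarily the library Galileo uses; the p-values below are made up):

```python
from statsmodels.stats.multitest import multipletests

# Illustrative Benjamini-Hochberg adjustment of per-variant p-values.
p_values = [0.003, 0.012, 0.021, 0.040, 0.260]
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

for p, p_adj, sig in zip(p_values, p_adjusted, reject):
    print(f"raw p={p:.3f}  adjusted p={p_adj:.3f}  significant={sig}")
```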

The Galileo dashboard provides users with interactive reports which give visibility to teams and company stakeholders. The analysis report provides an overview that enables teams to easily compare all relevant metrics for each of the experiment variations. These reports are generated on a daily basis and after the experiment concludes. Weekly reports are emailed to the stakeholders to update them on the weekly progress.
The snapshot below reports the result of an experiment conducted on Galileo. The goal metric of interest has improved by a lift of 2.7% and a confidence interval of +/- 0.8. It also shows that the sample size required to prove this significance is 4,968. The report highlights the importance of including supporting metrics. It shows that although the goal metric is showing improvement, one of the supporting metrics has worsened. 

Figure 5: Experiment Report
Teams can also use the Python notebook to customize the analysis. The SDK returns a YAML document and metric data, which can then be used to visualize the desired results as required.
A variant-level analysis can also be visualized in the reports, helping the experiment owner understand how the experiment evolves over time.

Figure 6: Graphical detail on the metric level
The above screenshot illustrates time series graphs of the p-value, lift, and sample size. It is recommended not to jump to conclusions before the experiment concludes; the practice of peeking at p-values is discouraged, and users are advised to use their own discretion.

In this article, we have presented an end-to-end workflow of the analytical and reporting components of Galileo. We explained the features and backend implementation of the experimentation platform and provided further evidence to establish Galileo as a trustworthy, efficient, and flexible centralized platform. 
We have seen that the internal teams are more confident in running their experiments on Galileo and their constant feedback has helped us identify scope for improvement. Looking forward, we are planning to incorporate Bayesian analysis which will further help us in improving the accuracy of the results and reducing experiment duration.
