Coframe uses a modified Multi-Armed Bandit (MAB) algorithm to dynamically optimize website content. It extends traditional MAB strategies by continuously evolving variants based on their performance, so the system keeps adapting to user behavior over time.

Key Enhancements

  1. Maintaining an Adaptive Holdout Baseline: A control variant is retained to estimate baseline performance; its traffic allocation is gradually reduced to a minimum level.
  2. Dropping Underperforming Arms: Variants that consistently underperform are removed based on their traffic allocation, which is proportional to the algorithm's belief in their performance.
  3. Intelligent Variant Generation: Our AI models generate new variants by analyzing and incorporating insights from ongoing experimental data.
  4. Incubation Period for New Arms: New variants are given a protection period to gather sufficient data before being evaluated for potential removal.

Standard Multi-Armed Bandit Overview

In the classical MAB problem, an agent selects among $K$ arms (variants), each with an unknown reward distribution. The objective is to maximize the cumulative reward over time by balancing exploration (trying out less-known arms) and exploitation (selecting the best-known arms).

Mathematical Formulation

  • Arms: $\mathcal{A} = \{1, 2, \dots, K\}$
  • At each time $t$:
    • Select an arm $a_t \in \mathcal{A}$.
    • Receive a reward $r_t$ drawn from the probability distribution associated with $a_t$.

The agent aims to minimize regret $R(T)$, defined as:

$$R(T) = T \mu^* - \sum_{t=1}^{T} r_t$$

where $\mu^*$ is the expected reward of the optimal arm.

In the context of website optimization, regret is closely tied to conversion rate. The regret represents the difference between the cumulative reward (conversions) obtained by always choosing the optimal arm and the actual cumulative reward obtained by the algorithm. A lower regret implies that the algorithm is performing closer to the optimal strategy, which in turn means it’s achieving a higher overall conversion rate. By minimizing regret, the algorithm effectively maximizes the conversion rate over time.
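
To make the regret definition concrete, here is a small simulation sketch: Bernoulli arms stand in for variants (a reward of 1 is a conversion), a deliberately naive uniform-random policy selects arms, and regret is computed exactly as defined above. The conversion rates are made-up values for illustration only.

```python
import random

# Hypothetical per-arm conversion rates (Bernoulli reward distributions).
true_rates = [0.030, 0.042, 0.037]   # illustrative values only
mu_star = max(true_rates)            # expected reward of the optimal arm

T = 10_000
total_reward = 0
for t in range(T):
    arm = random.randrange(len(true_rates))                  # naive uniform policy
    total_reward += int(random.random() < true_rates[arm])   # r_t in {0, 1}

# R(T) = T * mu_star - sum of observed rewards r_t
regret = T * mu_star - total_reward
print(f"Empirical regret after {T} exposures: {regret:.1f}")
```

A bandit policy such as Thompson Sampling would shift traffic toward the 0.042 arm as evidence accumulates, driving this number down relative to the uniform policy.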

Coframe’s Approach

Algorithm Steps

Initialization

  • Initial Set of Arms: $\mathcal{A} = \{0, 1, 2, \dots, K\}$, where $0$ represents the baseline or control variant.
  • Initial Traffic Allocation: Distribute traffic equally among all arms, i.e., each arm receives $\frac{1}{K+1}$ of the traffic.
  • Initial Belief: For each arm, we initialize its belief with the same prior.
  • Iteration Start: The algorithm begins its iterations, with each iteration representing a one-hour time period.
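
A minimal sketch of this initialization, assuming each arm's belief is a Beta prior with hypothetical hyperparameters $\alpha_0 = \beta_0 = 1$ (a uniform prior); the `Arm` dataclass and `init_arms` helper are illustrative names, not Coframe's actual implementation:

```python
from dataclasses import dataclass

ALPHA_0 = 1.0  # assumed prior hyperparameters (uniform Beta prior)
BETA_0 = 1.0

@dataclass
class Arm:
    name: str
    successes: int = 0   # S_i
    failures: int = 0    # F_i

def init_arms(k: int) -> tuple[list[Arm], dict[str, float]]:
    """Create arm 0 (the holdout baseline) plus k variants, each starting
    with the same prior and an equal 1/(K+1) share of traffic."""
    arms = [Arm("baseline")] + [Arm(f"variant_{i}") for i in range(1, k + 1)]
    allocation = {arm.name: 1.0 / (k + 1) for arm in arms}
    return arms, allocation
```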

At Each Time Step $t$ (Hourly Iteration)

  1. Arm Selection:
    • Use a selection policy (Thompson Sampling) to choose an arm $a_t$.
  2. Reward Observation:
    • Observe reward $r_t$ ($r_t = 1$ if a conversion occurs, $r_t = 0$ otherwise).
  3. Parameter Update:
    • Update the success ($S_i$) and failure ($F_i$) counts for arm $a_t$.

Note on Experimental Unit: Our experimental unit is “unique visitor hours”. Within each one-hour iteration, exposures are deduplicated by visitor using a cookie. This means that the same person within the same iteration will see the same variant/arm, ensuring consistency in the user experience and data collection.
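
The deduplication can be pictured with a short sketch: within one hourly iteration, a visitor's first assignment is cached (a cookie plays this role in production; a dictionary stands in here), so repeat exposures in the same hour reuse the same arm. The function names are illustrative assumptions.

```python
from typing import Callable

# Per-iteration cache: visitor id -> assigned arm name.
assignments_this_hour: dict[str, str] = {}

def arm_for_visitor(visitor_id: str, select_arm: Callable[[], str]) -> str:
    """Assign an arm on the visitor's first exposure in this hourly iteration;
    all later exposures in the same hour see the same arm."""
    if visitor_id not in assignments_this_hour:
        assignments_this_hour[visitor_id] = select_arm()
    return assignments_this_hour[visitor_id]

def start_new_iteration() -> None:
    """Called at the top of each hour: the unique-visitor-hour unit resets."""
    assignments_this_hour.clear()
```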

Dropping Underperforming Arms

  • Criteria for Dropping:
    • An arm is considered for removal if its traffic allocation falls below a predefined threshold (e.g., 2%) and its incubation period has elapsed.
  • Rationale:
    • This allows new variants to be added in place of underperforming ones, as sketched below.
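
A minimal sketch of this dropping check, using assumed constants (the 2% allocation threshold from above, plus an assumed incubation length and the minimum sample size $n_{\text{min}}$ discussed under "Dropping Criterion Justification" below):

```python
ALLOCATION_THRESHOLD = 0.02   # e.g., 2% of traffic
INCUBATION_ITERATIONS = 48    # assumed incubation length, in hourly iterations
N_MIN = 500                   # assumed minimum sample size before judging an arm

def should_drop(allocation: float, age_iterations: int, n_observations: int) -> bool:
    """An arm is a removal candidate only if its traffic share has collapsed,
    its incubation period has elapsed, and it has enough data."""
    return (
        allocation < ALLOCATION_THRESHOLD
        and age_iterations >= INCUBATION_ITERATIONS
        and n_observations >= N_MIN
    )
```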

Introducing New Arms

  • Generation:
    • New variants are continuously generated by our AI models and can be added to an ongoing MAB when other variants are dropped.
  • Conditioning Factors:
    • Historical rewards $\{ r_i \}$
    • Content features $\{ v_i \}$
    • The model leverages this information to produce variants with a higher expected reward.

Traffic Reallocation

  • Adjust Traffic Allocation:
    • Traffic allocation for each variant is automatically adjusted every iteration.
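
One common way to turn the bandit's beliefs into traffic shares is to estimate, via Monte Carlo over the Beta posteriors, each arm's probability of being the best and allocate traffic proportionally. The sketch below shows that pattern; it is an illustrative approach, not necessarily the exact reallocation rule Coframe uses in production.

```python
import numpy as np

def reallocate(successes: np.ndarray, failures: np.ndarray,
               alpha0: float = 1.0, beta0: float = 1.0,
               n_draws: int = 10_000) -> np.ndarray:
    """Estimate P(arm i is best) from each arm's Beta posterior and use that
    probability as the arm's traffic share for the next iteration."""
    draws = np.random.beta(successes + alpha0, failures + beta0,
                           size=(n_draws, len(successes)))
    best_counts = np.bincount(draws.argmax(axis=1), minlength=len(successes))
    return best_counts / n_draws
```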

Selection Policy: Thompson Sampling

Thompson Sampling is used for its effective balance between exploration and exploitation.

Posterior Sampling

For each arm $i$:

  • Successes: $S_i$
  • Failures: $F_i$
  • Beta Distribution: Sample $\theta_i \sim \text{Beta}(S_i + \alpha_0, F_i + \beta_0)$
  • Arm Selection: Choose arm $a_t$ by drawing a sample for every arm and selecting the arm with the maximum sampled value.
    • Batch processing: We process observations in batches and sample from the posterior distribution after all the observations in one iteration have been accounted for.
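
A minimal sketch of this selection policy under the stated Beta-Bernoulli model, with an assumed uniform prior ($\alpha_0 = \beta_0 = 1$) and the batch-style update described above:

```python
import numpy as np

ALPHA_0, BETA_0 = 1.0, 1.0  # assumed prior hyperparameters

def thompson_select(successes: np.ndarray, failures: np.ndarray) -> int:
    """Sample theta_i ~ Beta(S_i + alpha_0, F_i + beta_0) for every arm and
    choose the arm with the largest sampled value."""
    theta = np.random.beta(successes + ALPHA_0, failures + BETA_0)
    return int(theta.argmax())

def apply_batch(successes: np.ndarray, failures: np.ndarray,
                observations: list[tuple[int, int]]) -> None:
    """Fold one hourly iteration's (arm index, reward) observations into the
    counts; the posterior is only re-sampled after the whole batch is applied."""
    for arm, reward in observations:
        successes[arm] += reward
        failures[arm] += 1 - reward
```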

Traffic Allocation Constraints

  • Minimum Allocation for Holdout Baseline:
    • Ensure the holdout baseline arm $a_0$ retains at least a fraction $\alpha$ of the traffic (e.g., $\alpha = 0.05$).
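
Enforcing this floor can be as simple as clipping the baseline's share and rescaling the remaining arms, as in this sketch (an assumed helper; the allocation vector is assumed to sum to 1, with index 0 as the holdout baseline):

```python
import numpy as np

def enforce_baseline_floor(allocation: np.ndarray, alpha: float = 0.05) -> np.ndarray:
    """Guarantee the holdout baseline (index 0) keeps at least `alpha` of the
    traffic, rescaling the other arms so the allocation still sums to 1."""
    allocation = allocation.copy()
    if allocation[0] < alpha:
        allocation[1:] *= (1.0 - alpha) / allocation[1:].sum()
        allocation[0] = alpha
    return allocation
```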

Advantages Over Standard MAB

  1. Continuous Evolution: New variants are regularly introduced, keeping content fresh and aligned with user preferences.
  2. Efficient Resource Utilization: Underperforming variants are removed, reallocating resources to better-performing ones.
  3. Accelerated Optimization: AI-generated variants, informed by past performance, may lead to higher rewards more quickly.
  4. Robust Benchmarking: Maintaining a constant baseline variant allows for consistent performance tracking over time.

Mathematical Details

Reward Estimation

For each arm $i$:

  • Estimated Conversion Rate:

$$\hat{\mu}_i = \frac{S_i}{n_i}$$

where $n_i = S_i + F_i$ is the total number of times arm $i$ has been selected.

  • Credibility Intervals:

Use Bayesian methods to estimate a credibility interval for $\mu_i$. This provides a probabilistic range of plausible values for the true conversion rate, given the observed data and prior beliefs.

For each arm $i$, we can calculate the posterior distribution:

$$P(\mu_i \mid \text{data}) \propto P(\text{data} \mid \mu_i) \cdot P(\mu_i)$$

where $P(\mu_i)$ is the prior distribution and $P(\text{data} \mid \mu_i)$ is the likelihood.

The 95% credibility interval can then be derived from the posterior distribution, representing the range within which we believe the true conversion rate lies with 95% probability.
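
Under the Beta-Bernoulli model used above, the posterior for arm $i$ is simply $\text{Beta}(S_i + \alpha_0, F_i + \beta_0)$, so the 95% credibility interval comes straight from its quantiles. A small sketch (the counts are illustrative numbers):

```python
from scipy.stats import beta

def credible_interval(successes: int, failures: int,
                      alpha0: float = 1.0, beta0: float = 1.0,
                      level: float = 0.95) -> tuple[float, float]:
    """Equal-tailed credibility interval for the true conversion rate mu_i
    under a Beta(alpha0, beta0) prior."""
    posterior = beta(successes + alpha0, failures + beta0)
    tail = (1.0 - level) / 2.0
    return float(posterior.ppf(tail)), float(posterior.ppf(1.0 - tail))

# Example: 30 conversions in 1,000 unique visitor-hours (illustrative numbers)
low, high = credible_interval(30, 970)
print(f"Point estimate 0.030, 95% credible interval: ({low:.4f}, {high:.4f})")
```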

Dropping Criterion Justification

  • Statistical Significance:

Ensure that enough data has been collected (e.g., a minimum sample size $n_{\text{min}}$) before considering an arm for removal.

  • Balancing Exploration and Exploitation:

By allowing an incubation period, the algorithm balances the need to explore new variants with the need to exploit known good performers.

AI-Conditioned Variant Generation

  • Objective:

Generate a new variant $v_{\text{new}}$ expected to have a higher reward than the average of the dropped arms.

  • Model Inputs:

    • Features of previous variants $\{ v_i \}$
    • Performance metrics $\{ \hat{\mu}_i \}$
  • Model Output:

    • A new variant $v_{\text{new}}$ predicted to perform better.
  • Expected Reward of New Variant:

    The expected reward of the new variant is:

    $$\mu_{\text{new}} = \mathbb{E}\left[ r_{\text{new}} \mid v_{\text{new}} \right]$$

    Our goal is for $\mu_{\text{new}} > \hat{\mu}_{\text{avg}}$, where $\hat{\mu}_{\text{avg}}$ is the average performance of previous arms.

Implementation Considerations

Data Sufficiency

  • Warm-Up Period:

    Allow new arms to collect sufficient data before making decisions about their performance.

Exploration-Exploitation Balance

  • Adaptive Parameters:

    Adjust parameters like the minimum allocation and incubation period based on traffic patterns and performance.

Comparison with Standard MAB

| Aspect | Standard MAB | Coframe’s Modified MAB |
| --- | --- | --- |
| Arm Pool | Fixed | Dynamic (arms can be added/dropped) |
| Underperforming Arms | Continue to receive some traffic | Dropped when below performance thresholds |
| New Variant Introduction | Not typically included | Introduces new arms via AI conditioning |
| Baseline Variant | Not necessarily maintained | Constant holdout baseline is maintained |
| Adaptability | Limited to initial arm set | Continuously adapts with new variants |

Practical Example

Assume we start with:

  • Holdout Baseline: $v_0$
  • Initial Variants: $v_1$, $v_2$, $v_3$

Process Flow

  1. Initialization:

    • Equal traffic allocation: each variant receives 25% of traffic.
  2. Data Collection:

    • Collect rewards over time for each variant.
  3. Evaluation:

    • After sufficient data, suppose $v_2$ falls below the 2% allocation threshold.
  4. Dropping Underperformer:

    • Drop $v_2$ from the active arm set.
  5. Variant Generation:

    • Generate a new variant $v_{\text{new}}$ conditioned on past variants and their performance.
  6. Reallocation:

    • Redistribute traffic among $v_0$, $v_1$, $v_3$, and $v_{\text{new}}$.
  7. Iteration:

    • Repeat the process, continuously refining the variant pool.
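
For intuition, the following self-contained simulation runs the same flow end to end with made-up conversion rates: a Monte Carlo estimate of "probability of being best" drives reallocation each hour, the baseline keeps a 5% floor, an arm whose share collapses after incubation is dropped, and a stand-in "generated" variant takes its place. All of the rates, thresholds, and the way the new variant appears are illustrative assumptions, not Coframe's production logic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative true conversion rates for v0 (holdout baseline), v1, v2, v3.
true_rates = {"v0": 0.030, "v1": 0.036, "v2": 0.022, "v3": 0.034}
S = {n: 0 for n in true_rates}    # successes per arm
F = {n: 0 for n in true_rates}    # failures per arm
age = {n: 0 for n in true_rates}  # iterations since the arm was introduced

DROP_THRESHOLD, BASELINE_FLOOR, INCUBATION = 0.02, 0.05, 48
VISITORS_PER_HOUR = 500

def allocation() -> dict[str, float]:
    """Traffic share = P(arm is best) from the Beta posteriors, with a baseline floor."""
    names = list(S)
    draws = rng.beta([S[n] + 1 for n in names], [F[n] + 1 for n in names],
                     size=(4000, len(names)))
    p_best = np.bincount(draws.argmax(axis=1), minlength=len(names)) / 4000
    alloc = dict(zip(names, p_best))
    if alloc["v0"] < BASELINE_FLOOR:
        scale = (1 - BASELINE_FLOOR) / (1 - alloc["v0"])
        alloc = {n: (BASELINE_FLOOR if n == "v0" else p * scale) for n, p in alloc.items()}
    return alloc

for hour in range(24 * 14):                      # two weeks of hourly iterations
    alloc = allocation()
    names, shares = list(alloc), list(alloc.values())
    for _ in range(VISITORS_PER_HOUR):           # one unique visitor-hour per exposure
        arm = str(rng.choice(names, p=shares))
        reward = int(rng.random() < true_rates[arm])
        S[arm] += reward
        F[arm] += 1 - reward
    for n in names:
        age[n] += 1
    # Drop a non-baseline arm whose share collapsed after incubation, then
    # introduce a stand-in for an AI-generated replacement variant.
    for n in names:
        if n != "v0" and alloc[n] < DROP_THRESHOLD and age[n] >= INCUBATION:
            for d in (true_rates, S, F, age):
                d.pop(n)
            new = f"v_new_{hour}"
            true_rates[new] = 0.038              # pretend the generator found a stronger variant
            S[new], F[new], age[new] = 0, 0, 0
            print(f"hour {hour}: dropped {n}, introduced {new}")
            break

print({n: round(S[n] / max(S[n] + F[n], 1), 4) for n in S})
```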

Conclusion

Coframe’s modified multi-armed bandit algorithm offers a robust and efficient method for website content optimization by:

  • Leveraging Past Performance: AI-generated variants are informed by historical data, increasing the likelihood of success.
  • Dynamic Adaptation: The ability to drop and introduce arms keeps the optimization process responsive to changing user behaviors.
  • Maintaining a Baseline: A constant control variant ensures that performance improvements are measured accurately.
