Counterfactual Data-Fusion for Online Reinforcement Learners

DSA ADS Course - 2021


Causal Reinforcement Learning, Counterfactuals, Counterfactual Data-Fusion, Online Reinforcement Learners

June, 2017


The Multi-Armed Bandit problem with Unobserved Confounders (MABUC) considers decision-making settings where unmeasured variables can influence both the agent’s decisions and the rewards it receives (Bareinboim et al., 2015). Recent findings show that unobserved confounders (UCs) pose a unique challenge to algorithms based on standard randomization (i.e., experimental data): if UCs are naively averaged out, these algorithms behave sub-optimally and may incur infinite regret. In this paper, we show how counterfactual-based decision-making circumvents these problems and leads to a coherent fusion of observational and experimental data. We then demonstrate this new strategy in an enhanced Thompson Sampling bandit player, and support its efficacy with extensive simulations.
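To make the confounding problem concrete, the following is a minimal illustrative sketch, not the paper's algorithm. It assumes a hypothetical two-armed bandit in which an unobserved confounder fixes the agent's intent (the arm it would pull naturally) and also shapes the payout. The `PAYOUT` table, the function name `average_reward`, and all parameters are assumptions made for illustration. A plain Thompson Sampling player keeps one Beta posterior per arm, which averages the confounder out; the intent-aware variant keeps one posterior per (intent, arm) pair.

```python
import random

# Hypothetical payout table (an assumption, not taken from the paper):
# PAYOUT[intent][arm] = P(reward = 1 | intent, arm).
# Averaged over intents, both arms pay 0.45, so the arms look identical
# to a player that ignores intent.
PAYOUT = [[0.2, 0.7],
          [0.7, 0.2]]


def average_reward(intent_aware, rounds=5000, seed=0):
    """Run Beta-Bernoulli Thompson Sampling and return the mean reward."""
    rng = random.Random(seed)
    n_intents, n_arms = len(PAYOUT), len(PAYOUT[0])
    # The intent-aware player keeps one row of posteriors per intent;
    # the plain player keeps a single shared row.
    n_ctx = n_intents if intent_aware else 1
    wins = [[1] * n_arms for _ in range(n_ctx)]    # Beta(1, 1) priors
    losses = [[1] * n_arms for _ in range(n_ctx)]
    total = 0
    for _ in range(rounds):
        # The unobserved confounder determines the arm the agent intends to pull.
        intent = rng.randrange(n_intents)
        ctx = intent if intent_aware else 0
        # Draw one sample per arm from its posterior and play the best draw.
        draws = [rng.betavariate(wins[ctx][a], losses[ctx][a])
                 for a in range(n_arms)]
        arm = max(range(n_arms), key=draws.__getitem__)
        reward = int(rng.random() < PAYOUT[intent][arm])
        wins[ctx][arm] += reward
        losses[ctx][arm] += 1 - reward
        total += reward
    return total / rounds


if __name__ == "__main__":
    print("plain Thompson Sampling:", average_reward(intent_aware=False))
    print("intent-aware variant:   ", average_reward(intent_aware=True))
```

In this toy setting the plain player cannot do better than the averaged payout of 0.45 in expectation, whatever it learns, while the intent-aware player can learn to pick the better arm for each intent. The sketch omits the fusion of observational and experimental data that the abstract describes.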

Resource Type: