\(\newcommand{\bbO}{\mathbb{O}}\) \(\newcommand{\bbD}{\mathbb{D}}\) \(\newcommand{\bbP}{\mathbb{P}}\) \(\newcommand{\bbR}{\mathbb{R}}\) \(\newcommand{\Algo}{\widehat{\mathcal{A}}}\) \(\newcommand{\Algora}{\widetilde{\mathcal{A}}}\) \(\newcommand{\calF}{\mathcal{F}}\) \(\newcommand{\calM}{\mathcal{M}}\) \(\newcommand{\calP}{\mathcal{P}}\) \(\newcommand{\calO}{\mathcal{O}}\) \(\newcommand{\calQ}{\mathcal{Q}}\) \(\newcommand{\defq}{\doteq}\) \(\newcommand{\Exp}{\textrm{E}}\) \(\newcommand{\IC}{\textrm{IC}}\) \(\newcommand{\Gbar}{\bar{G}}\) \(\newcommand{\one}{\textbf{1}}\) \(\newcommand{\psinos}{\psi_{n}^{\textrm{os}}}\) \(\renewcommand{\Pr}{\textrm{Pr}}\) \(\newcommand{\Phat}{P^{\circ}}\) \(\newcommand{\Psihat}{\widehat{\Psi}}\) \(\newcommand{\Qbar}{\bar{Q}}\) \(\newcommand{\tcg}[1]{\textcolor{olive}{#1}}\) \(\DeclareMathOperator{\Dirac}{Dirac}\) \(\DeclareMathOperator{\expit}{expit}\) \(\DeclareMathOperator{\logit}{logit}\) \(\DeclareMathOperator{\Rem}{Rem}\) \(\DeclareMathOperator{\Var}{Var}\)

B Basic results and their proofs

B.1 NPSEM

The experiment can also be summarized by a nonparametric system of structural equations: for some deterministic functions \(f_w\), \(f_a\), \(f_y\) and independent sources of randomness \(U_w\), \(U_a\), \(U_y\),

sample the context where the counterfactual rewards will be generated, the action will be undertaken and the actual reward will be obtained, \(W = f_{w}(U_w)\);
sample the two counterfactual rewards of the two actions that can be undertaken, \(Y_{0} = f_{y}(0, W, U_y)\) and \(Y_{1} = f_{y}(1, W, U_y)\);
sample which action is carried out in the given context, \(A = f_{a} (W, U_a)\);
define the corresponding reward, \(Y = A Y_{1} + (1-A) Y_{0}\);
summarize the course of the experiment with the observation \(O = (W, A, Y)\), thus concealing \(Y_{0}\) and \(Y_{1}\).
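The five steps above can be sketched in a few lines of Python. The specific functions `f_w`, `f_y` and `f_a` below are hypothetical choices made only for illustration (the text leaves them unspecified), and the three sources of randomness are taken to be independent uniform draws on \([0,1]\).

```python
import random

# All three structural functions are hypothetical, chosen only for illustration.
def f_w(u_w):
    # the context is the uniform noise itself, so W lies in [0, 1)
    return u_w

def f_y(a, w, u_y):
    # binary counterfactual reward of action a in context w
    return 1 if u_y < 0.3 + 0.4 * a * w else 0

def f_a(w, u_a):
    # action 1 is undertaken with probability 0.2 + 0.6 * w
    return 1 if u_a < 0.2 + 0.6 * w else 0

def draw_full(rng):
    """Draw the full data (W, Y0, Y1, A, Y) from the structural equations."""
    u_w, u_a, u_y = rng.random(), rng.random(), rng.random()  # independent noises
    w = f_w(u_w)
    y0, y1 = f_y(0, w, u_y), f_y(1, w, u_y)
    a = f_a(w, u_a)
    y = a * y1 + (1 - a) * y0
    return w, y0, y1, a, y

def observe(full):
    """Keep O = (W, A, Y) only, thus concealing Y0 and Y1."""
    w, _, _, a, y = full
    return w, a, y

rng = random.Random(0)
full = draw_full(rng)
obs = observe(full)
```

Because \(A\) depends on \(U_a\) and \(W\) only, while \(Y_0\) and \(Y_1\) depend on \(U_y\) and \(W\) only, this toy system satisfies the randomization assumption of Section B.2 by construction.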

B.2 Identification

Let \(\bbP_{0}\) be an experiment that generates \(\bbO \defq (W, Y_{0}, Y_{1}, A, Y)\). We think of \(W\) as the context where an action is undertaken, of \(Y_{0}\) and \(Y_{1}\) as the counterfactual (potential) rewards that actions \(a=0\) and \(a=1\) would entail, of \(A\) as the action carried out, and of \(Y\) as the reward received in response to action \(A\). Consider the following assumptions:

Randomization: under \(\bbP_{0}\), the counterfactual rewards \(Y_0\), \(Y_1\) and action \(A\) are conditionally independent given \(W\), i.e., \(Y_a \perp A \mid W\) for \(a=0,1\).
Consistency: under \(\bbP_{0}\), if action \(A\) is undertaken then reward \(Y_{A}\) is received, i.e., \(Y = Y_{A}\) (or \(Y=Y_{a}\) given that \(A=a\)).
Positivity: under \(\bbP_{0}\), both actions \(a=0\) and \(a=1\) have (\(\bbP_{0}\)-almost surely) a positive probability to be undertaken given \(W\), i.e., \(\Pr_{\bbP_0}(\ell\Gbar_0(a,W) > 0) = 1\) for \(a=0,1\).

Proposition B.1 (Identification) Under the above assumptions, it holds that \[\begin{equation*} \psi_{0} = \Exp_{\bbP_{0}} \left(Y_{1} - Y_{0}\right) = \Exp_{\bbP_{0}}(Y_1) - \Exp_{\bbP_{0}}(Y_0). \end{equation*}\]

Proof. Fix an arbitrary \(a \in \{0,1\}\). By the tower rule (first equality), the randomization assumption (second equality), and the consistency and positivity assumptions (third equality), it holds that \[\begin{align*} \Exp_{\bbP_0}(Y_a) &= \int \Exp_{\bbP_0}(Y_a \mid W = w) dQ_{0,W}(w) = \int \Exp_{\bbP_0}(Y_a \mid A = a, W = w) dQ_{0,W}(w) \\ &= \int \Exp_{P_0}(Y \mid A = a, W = w) dQ_{0,W}(w) = \int \Qbar_0(a,w) dQ_{0,W}(w). \end{align*}\] The stated result follows by applying this identity with \(a=1\) and \(a=0\) and subtracting.

Remark. The positivity assumption is needed for \(\Exp_{P_0}(Y \mid A = a, W) \defq \Qbar_{0}(a,W)\) to be well-defined.
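As a numerical illustration of the identification result, the following sketch simulates a toy experiment and compares the counterfactual contrast (which uses \(Y_0\) and \(Y_1\)) with its counterpart computed from \((W, A, Y)\) only. Everything specific here is an assumption made for the example: the finite set of contexts `W_VALUES`, the conditional means `q_bar`, the action mechanism `g_bar`, and the sample size. Under these choices the true contrast equals \(0.2\).

```python
import random

W_VALUES = (0.0, 0.5, 1.0)  # hypothetical finite set of contexts, drawn uniformly

def q_bar(a, w):
    # hypothetical conditional mean of Y_a given W = w
    return 0.3 + 0.4 * a * w

def g_bar(w):
    # hypothetical probability that A = 1 given W = w (positivity holds)
    return 0.2 + 0.6 * w

def simulate(n, seed=0):
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        w = rng.choice(W_VALUES)
        y0 = int(rng.random() < q_bar(0, w))
        y1 = int(rng.random() < q_bar(1, w))
        a = int(rng.random() < g_bar(w))  # randomization: A independent of (Y0, Y1) given W
        y = a * y1 + (1 - a) * y0         # consistency: Y = Y_A
        rows.append((w, y0, y1, a, y))
    return rows

rows = simulate(200_000)
n = len(rows)

# Counterfactual side: empirical mean of Y1 - Y0 (not available in practice).
psi_counterfactual = sum(y1 - y0 for _, y0, y1, _, _ in rows) / n

def stratum_mean(a, w):
    # empirical mean of Y among the observations with A = a and W = w
    ys = [y for (w_i, _, _, a_i, y) in rows if w_i == w and a_i == a]
    return sum(ys) / len(ys)

# Observed side: stratum contrasts averaged over the empirical law of W.
psi_observed = sum(
    (sum(1 for r in rows if r[0] == w) / n)
    * (stratum_mean(1, w) - stratum_mean(0, w))
    for w in W_VALUES
)
```

Both `psi_counterfactual` and `psi_observed` should be close to \(0.2\), up to Monte Carlo error. If `g_bar` were allowed to equal 0 or 1 for some context, one of the strata would be empty and `stratum_mean` would fail, which mirrors the role of the positivity assumption.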

B.3 Building a confidence interval

Let \(\Phi\) be the standard normal distribution function. Let \(X_{1}\), \(\ldots\), \(X_{n}\) be independently drawn from a given law.

B.3.1 CLT & Slutsky’s lemma

Assume that \(\sigma^{2} \defq \Var(X_{1})\) is finite. Let \(m \defq \Exp(X_{1})\) be the mean of \(X_{1}\) and \(\bar{X}_{n} \defq n^{-1} \sum_{i=1}^{n} X_{i}\) be the empirical mean. By the central limit theorem (CLT), it holds that \(\sqrt{n} (\bar{X}_{n} - m)\) converges in law as \(n\) grows to the centered Gaussian law with variance \(\sigma^{2}\).

Moreover, if \(\sigma_{n}^{2}\) is a (positive) consistent estimator of \(\sigma^{2}\) then, by Slutsky’s lemma, \((\sqrt{n}/\sigma_{n}) (\bar{X}_{n} - m)\) converges in law to the standard normal law. The empirical variance \(n^{-1} \sum_{i=1}^{n}(X_{i} - \bar{X}_{n})^{2}\) is such an estimator.

Proposition B.2 Under the above assumptions, \[\begin{equation*} \left[\bar{X}_{n} \pm \Phi^{-1}(1-\alpha) \frac{\sigma_{n}}{\sqrt{n}}\right] \end{equation*}\] is a confidence interval for \(m\) with asymptotic level \((1-2\alpha)\).
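A minimal sketch of this interval, using only the Python standard library. The function name `clt_confidence_interval` and the exponential law used in the example are assumptions made for illustration; \(\sigma_n\) is the square root of the empirical variance of Section B.3.1.

```python
import math
import random
from statistics import NormalDist

def clt_confidence_interval(xs, alpha):
    """Interval [mean -/+ Phi^{-1}(1 - alpha) * sigma_n / sqrt(n)], asymptotic level 1 - 2 * alpha."""
    n = len(xs)
    mean = sum(xs) / n
    # empirical variance with the 1/n normalization used in the text
    sigma_n = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    z = NormalDist().inv_cdf(1 - alpha)  # standard normal quantile
    half_width = z * sigma_n / math.sqrt(n)
    return mean - half_width, mean + half_width

# Hypothetical example: exponential law with mean m = 1, and 1 - 2 * alpha = 95%.
rng = random.Random(1)
xs = [rng.expovariate(1.0) for _ in range(5000)]
lower, upper = clt_confidence_interval(xs, alpha=0.025)
```

Note the convention: with \(\alpha = 0.025\) the interval has asymptotic level \(1 - 2\alpha = 95\%\), since \(\alpha\) is the mass left in each tail.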

B.3.2 CLT and order statistics

Suppose that the law of \(X_{1}\) admits a continuous distribution function \(F\). Set \(p \in ]0,1[\) and, assuming that \(n\) is large, find \(k\geq 1\) and \(l \geq 1\) such that \[\begin{equation*} \frac{k}{n} \approx p - \Phi^{-1}(1-\alpha) \sqrt{\frac{p(1-p)}{n}} \end{equation*}\] and \[\begin{equation*} \frac{l}{n} \approx p + \Phi^{-1}(1-\alpha) \sqrt{\frac{p(1-p)}{n}}. \end{equation*}\]

Proposition B.3 Under the above assumptions, \([X_{(k)},X_{(l)}]\) is a confidence interval for \(F^{-1}(p)\) with asymptotic level \(1 - 2\alpha\).
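The construction can be sketched as follows. The text only requires \(k/n\) and \(l/n\) to be approximately equal to the two bounds, so the rounding rule below (floor for \(k\), ceiling for \(l\), both clipped to \(\{1, \ldots, n\}\)) is an assumption of this sketch, as is the function name.

```python
import math
import random
from statistics import NormalDist

def quantile_confidence_interval(xs, p, alpha):
    """Interval [X_(k), X_(l)] for the p-quantile, asymptotic level 1 - 2 * alpha."""
    n = len(xs)
    z = NormalDist().inv_cdf(1 - alpha)
    half = z * math.sqrt(p * (1 - p) / n)
    # Rounding choice (an assumption): widen the interval, then clip ranks to 1..n.
    k = min(n, max(1, math.floor(n * (p - half))))
    l = min(n, max(1, math.ceil(n * (p + half))))
    ordered = sorted(xs)
    return ordered[k - 1], ordered[l - 1]  # ranks are 1-based, lists are 0-based

# Hypothetical example: median of the uniform law on [0, 1], whose true value is 0.5.
rng = random.Random(3)
xs = [rng.random() for _ in range(2000)]
lower, upper = quantile_confidence_interval(xs, p=0.5, alpha=0.025)
```

The interval depends on the sample only through its order statistics, so no estimate of the density of \(X_1\) at \(F^{-1}(p)\) is needed.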

B.4 Another representation of the parameter of interest

For notational simplicity, note that \((2a-1)\) equals 1 if \(a=1\) and \(-1\) if \(a=0\). Now, for each \(a = 0,1\), \[\begin{align*} \Exp_{P_{0}}\left(\frac{\one\{A = a\}Y}{\ell\Gbar_{0}(a,W)}\right) &= \Exp_{P_{0}}\left(\Exp_{P_{0}}\left(\frac{\one\{A = a\}Y}{\ell\Gbar_{0}(a,W)} \middle| A, W \right) \right) \\ &= \Exp_{P_{0}}\left(\frac{\one\{A = a\}}{\ell\Gbar_{0}(a,W)} \Qbar_{0}(A, W) \right) \\ &= \Exp_{P_{0}}\left(\frac{\one\{A = a\}}{\ell\Gbar_{0}(a,W)} \Qbar_{0}(a, W)\right) \\ &= \Exp_{P_{0}}\left(\Exp_{P_{0}}\left(\frac{\one\{A = a\}}{\ell\Gbar_{0}(a,W)} \Qbar_{0}(a, W) \middle| W \right) \right) \\& = \Exp_{P_{0}}\left(\frac{\ell\Gbar_{0}(a,W)}{\ell\Gbar_{0}(a,W)} \Qbar_{0}(a, W) \right) \\& = \Exp_{P_{0}} \left( \Qbar_{0}(a, W) \right), \end{align*}\] where the first and fourth equalities follow from the tower rule²⁵, the second and fifth hold by definition of the conditional expectation, the third holds because \(\one\{A = a\} \Qbar_{0}(A, W) = \one\{A = a\} \Qbar_{0}(a, W)\), and the sixth is a simplification made licit by the positivity assumption. Since \((2A-1)/\ell\Gbar_{0}(A,W) = \one\{A = 1\}/\ell\Gbar_{0}(1,W) - \one\{A = 0\}/\ell\Gbar_{0}(0,W)\), subtracting the identity for \(a=0\) from the identity for \(a=1\) yields \(\Exp_{P_{0}}\left((2A-1) Y / \ell\Gbar_{0}(A,W)\right) = \Exp_{P_{0}}\left(\Qbar_{0}(1,W) - \Qbar_{0}(0,W)\right)\). This completes the proof.
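The identity can be checked by simulation. In the sketch below, `q_bar` and `g_bar` are hypothetical choices, and `lg_bar(a, w)` is read as the conditional probability that \(A = a\) given \(W = w\), in line with the positivity assumption of Section B.2. The reward is drawn directly from its conditional law given \((A, W)\), so only observed data are involved.

```python
import random

def q_bar(a, w):
    # hypothetical conditional mean of Y given A = a, W = w
    return 0.3 + 0.4 * a * w

def g_bar(w):
    # hypothetical conditional probability that A = 1 given W = w
    return 0.2 + 0.6 * w

def lg_bar(a, w):
    # conditional probability that A = a given W = w (assumed reading of the notation)
    return a * g_bar(w) + (1 - a) * (1 - g_bar(w))

def draw_observation(rng):
    w = rng.random()
    a = int(rng.random() < g_bar(w))
    y = int(rng.random() < q_bar(a, w))
    return w, a, y

def ipw_mean(observations, a):
    # empirical counterpart of E( 1{A = a} Y / lg_bar(a, W) )
    total = sum((a_i == a) * y / lg_bar(a, w) for w, a_i, y in observations)
    return total / len(observations)

def plugin_mean(observations, a):
    # empirical counterpart of E( q_bar(a, W) )
    return sum(q_bar(a, w) for w, _, _ in observations) / len(observations)

rng = random.Random(2)
observations = [draw_observation(rng) for _ in range(200_000)]
for a in (0, 1):
    print(a, ipw_mean(observations, a), plugin_mean(observations, a))
```

For each action the two printed averages should agree up to Monte Carlo error. The weighted average is the noisier of the two, because the weights \(1/\ell\Gbar_{0}(a,W)\) grow as the probability of action \(a\) approaches zero.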

B.5 The delta-method

Let \(f\) be a map from \(\Theta \subset \bbR^{p}\) to \(\bbR^{q}\) that is differentiable at \(\theta\in \Theta\). Let \(X_{n}\) be a random vector taking its values in \(\Theta\).

Proposition B.4 If \(\sqrt{n} (X_{n} - \theta)\) converges in law to the Gaussian law with mean \(\mu\) and covariance matrix \(\Sigma\), then \(\sqrt{n} (f(X_{n}) - f(\theta))\) converges in law to the Gaussian law with mean \(\nabla f(\theta) \times \mu\) and covariance matrix \(\nabla f(\theta) \times \Sigma \times \nabla f(\theta)^{\top}\). In addition, if \(\Sigma_{n}\) estimates \(\Sigma\) consistently and \(\nabla f\) is continuous at \(\theta\) then, by Slutsky’s lemma, the asymptotic variance of \(\sqrt{n} (f(X_{n}) - f(\theta))\) is consistently estimated with \(\nabla f(X_{n}) \times \Sigma_{n} \times \nabla f(X_{n})^{\top}\).
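As a small worked example of the variance formula, consider the ratio \(f(x_1, x_2) = x_1 / x_2\), so that \(p = 2\) and \(q = 1\). The helper names, the point `x_n` and the matrix `sigma_n` below are all hypothetical and serve only to illustrate the matrix product \(\nabla f(X_{n}) \times \Sigma_{n} \times \nabla f(X_{n})^{\top}\).

```python
def matmul(a, b):
    # product of two matrices stored as lists of rows
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def transpose(m):
    return [list(col) for col in zip(*m)]

def delta_method_covariance(jacobian, sigma):
    """Return J * Sigma * J^T, where J is the q x p matrix of partial derivatives."""
    return matmul(matmul(jacobian, sigma), transpose(jacobian))

def ratio_jacobian(x):
    # partial derivatives of f(x1, x2) = x1 / x2, as a 1 x 2 matrix
    x1, x2 = x
    return [[1 / x2, -x1 / x2 ** 2]]

x_n = (1.0, 2.0)                      # hypothetical value of X_n
sigma_n = [[1.0, 0.0], [0.0, 4.0]]    # hypothetical estimate of Sigma
cov = delta_method_covariance(ratio_jacobian(x_n), sigma_n)
# Here J = [0.5, -0.25], hence cov = 0.25 * 1 + 0.0625 * 4 = 0.5.
```

The example requires \(x_2 \neq 0\), which is where the ratio is differentiable.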

B.6 The oracle logistic risk

First, let us recall the definition of the Kullback-Leibler divergence between Bernoulli laws of parameters \(p,q\in]0,1[\): \[\begin{equation*}\text{KL}(p,q) \defq p \log\left(\frac{p}{q}\right) + (1-p) \log \left(\frac{1-p}{1-q}\right).\end{equation*}\] It satisfies \(\text{KL}(p,q) \geq 0\) where the equality holds if and only if \(p=q\).

Let \(f:[0,1] \times \{0,1\} \times [0,1] \to [0,1]\) be a (measurable) function. Applying the tower rule shows that the oracle logistic risk satisfies \[\begin{align} \Exp_{P_{0}} \left(L_{y} (f)(O)\right)&=\Exp_{P_{0}} \left(-\Qbar_{0}(A,W) \log f(A,W) - \left(1 - \Qbar_{0} (A,W)\right) \log \left(1 - f(A,W)\right)\right)\notag\\&=\Exp_{P_{0}} \left(\text{KL}\left(\Qbar_{0}(A,W), f(A,W)\right)\right) + \text{constant},\tag{B.1} \end{align}\] where the above constant, which does not depend on \(f\), equals \[\begin{equation*} -\Exp_{P_{0}}\left(\Qbar_{0}(A,W) \log \Qbar_{0}(A,W) + \left(1 - \Qbar_{0} (A,W)\right) \log \left(1 - \Qbar_{0}(A,W)\right)\right). \end{equation*}\]

In light of (B.1), \(\Qbar_{0}\) minimizes \(f \mapsto \Exp_{P_{0}} \left(L_{y} (f)(O)\right)\) over the set of (measurable) functions mapping \([0,1] \times \{0,1\} \times [0,1]\) to \([0,1]\). Moreover, as an average of measures of discrepancy, \(\Exp_{P_{0}} \left(L_{y} (f)(O)\right)\) is also a measure of discrepancy.
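The pointwise version of (B.1) is easy to check numerically. For fixed \(p, q \in ]0,1[\), the expected logistic loss \(-p \log q - (1-p) \log(1-q)\) equals \(\text{KL}(p,q)\) plus the term \(-p \log p - (1-p) \log(1-p)\), which does not depend on \(q\). The function names, the value \(p = 0.3\) and the grid below are assumptions made for this sketch.

```python
import math

def kl(p, q):
    # Kullback-Leibler divergence between Bernoulli laws of parameters p and q
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def logistic_loss(p, q):
    # expected logistic loss of the prediction q when the reward has conditional mean p
    return -p * math.log(q) - (1 - p) * math.log(1 - q)

def entropy_term(p):
    # the part of the loss that does not depend on the prediction q
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

p = 0.3
grid = [i / 100 for i in range(1, 100)]

# largest deviation from the decomposition over the grid (floating-point noise only)
max_gap = max(abs(logistic_loss(p, q) - kl(p, q) - entropy_term(p)) for q in grid)

# the loss is minimized over the grid at the prediction equal to p
best_q = min(grid, key=lambda q: logistic_loss(p, q))
```

Averaging this pointwise decomposition over the law of \((A, W)\) under \(P_0\), with \(p = \Qbar_{0}(A,W)\) and \(q = f(A,W)\), gives (B.1).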

²⁵ For any random variable \((U,V)\) such that \(\Exp(U|V)\) and \(\Exp(U)\) are well defined, it holds that \(\Exp(\Exp(U|V)) = \Exp(U)\).