\(\newcommand{\bbO}{\mathbb{O}}\) \(\newcommand{\bbD}{\mathbb{D}}\) \(\newcommand{\bbP}{\mathbb{P}}\) \(\newcommand{\bbR}{\mathbb{R}}\) \(\newcommand{\Algo}{\widehat{\mathcal{A}}}\) \(\newcommand{\Algora}{\widetilde{\mathcal{A}}}\) \(\newcommand{\calF}{\mathcal{F}}\) \(\newcommand{\calM}{\mathcal{M}}\) \(\newcommand{\calP}{\mathcal{P}}\) \(\newcommand{\calO}{\mathcal{O}}\) \(\newcommand{\calQ}{\mathcal{Q}}\) \(\newcommand{\defq}{\doteq}\) \(\newcommand{\Exp}{\textrm{E}}\) \(\newcommand{\IC}{\textrm{IC}}\) \(\newcommand{\Gbar}{\bar{G}}\) \(\newcommand{\one}{\textbf{1}}\) \(\newcommand{\psinos}{\psi_{n}^{\textrm{os}}}\) \(\renewcommand{\Pr}{\textrm{Pr}}\) \(\newcommand{\Phat}{P^{\circ}}\) \(\newcommand{\Psihat}{\widehat{\Psi}}\) \(\newcommand{\Qbar}{\bar{Q}}\) \(\newcommand{\tcg}[1]{\textcolor{olive}{#1}}\) \(\DeclareMathOperator{\Dirac}{Dirac}\) \(\DeclareMathOperator{\expit}{expit}\) \(\DeclareMathOperator{\logit}{logit}\) \(\DeclareMathOperator{\Rem}{Rem}\) \(\DeclareMathOperator{\Var}{Var}\)

B Basic results and their proofs

B.1 NPSEM

The experiment can also be summarized by a nonparametric system of structural equations: for some deterministic functions \(f_w\), \(f_a\), \(f_y\) and independent sources of randomness \(U_w\), \(U_a\), \(U_y\),

  1. sample the context where the counterfactual rewards will be generated, the action will be undertaken and the actual reward will be obtained, \(W = f_{w}(U_w)\);

  2. sample the two counterfactual rewards of the two actions that can be undertaken, \(Y_{0} = f_{y}(0, W, U_y)\) and \(Y_{1} = f_{y}(1, W, U_y)\);

  3. sample which action is carried out in the given context, \(A = f_{a} (W, U_a)\);

  4. define the corresponding reward, \(Y = A Y_{1} + (1-A) Y_{0}\);

  5. summarize the course of the experiment with the observation \(O = (W, A, Y)\), thus concealing \(Y_{0}\) and \(Y_{1}\).
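The five steps above can be sketched as a small simulation. The structural functions `f_w`, `f_a` and `f_y` below are hypothetical choices made only for illustration (a uniform context, logistic action and reward mechanisms, binary rewards); they are not the ones used elsewhere in this text.

```python
import math
import random


def expit(x):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical structural functions, chosen only for illustration.
def f_w(u_w):
    return u_w  # context uniformly distributed on [0, 1)


def f_a(w, u_a):
    return 1 if u_a < expit(2.0 * w - 1.0) else 0


def f_y(a, w, u_y):
    return 1 if u_y < expit(-0.5 + w + a) else 0


def draw_full(rng):
    """Steps 1 to 4: returns the full data (W, Y_0, Y_1, A, Y)."""
    u_w, u_a, u_y = rng.random(), rng.random(), rng.random()
    w = f_w(u_w)                 # step 1: context
    y0 = f_y(0, w, u_y)          # step 2: both counterfactual rewards
    y1 = f_y(1, w, u_y)          #         share the same source U_y
    a = f_a(w, u_a)              # step 3: action
    y = a * y1 + (1 - a) * y0    # step 4: actual reward
    return (w, y0, y1, a, y)


def observe(full):
    """Step 5: keep O = (W, A, Y), concealing Y_0 and Y_1."""
    w, _y0, _y1, a, y = full
    return (w, a, y)


rng = random.Random(0)
full_data = [draw_full(rng) for _ in range(5)]
observations = [observe(full) for full in full_data]
```

Note that both counterfactual rewards are computed from the same `u_y`, as in step 2, whereas the action uses the independent source `u_a`.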

B.2 Identification

Let \(\bbP_{0}\) be an experiment that generates \(\bbO \defq (W, Y_{0}, Y_{1}, A, Y)\). We think of \(W\) as the context where an action is undertaken, of \(Y_{0}\) and \(Y_{1}\) as the counterfactual (potential) rewards that actions \(a=0\) and \(a=1\) would entail, of \(A\) as the action carried out, and of \(Y\) as the reward received in response to action \(A\). Consider the following assumptions:

  1. Randomization: under \(\bbP_{0}\), the counterfactual rewards \(Y_0\), \(Y_1\) and action \(A\) are conditionally independent given \(W\), i.e., \(Y_a \perp A \mid W\) for \(a=0,1\).

  2. Consistency: under \(\bbP_{0}\), if action \(A\) is undertaken then reward \(Y_{A}\) is received, i.e., \(Y = Y_{A}\) (or \(Y=Y_{a}\) given that \(A=a\)).

  3. Positivity: under \(\bbP_{0}\), both actions \(a=0\) and \(a=1\) have (\(\bbP_{0}\)-almost surely) a positive probability of being undertaken given \(W\), i.e., \(\Pr_{\bbP_0}(\ell\Gbar_0(a,W) > 0) = 1\) for \(a=0,1\).

Proposition B.1 (Identification) Under the above assumptions, it holds that \[\begin{equation*} \psi_{0} = \Exp_{\bbP_{0}} \left(Y_{1} - Y_{0}\right) = \Exp_{\bbP_{0}}(Y_1) - \Exp_{\bbP_{0}}(Y_0). \end{equation*}\]

Proof. Fix an arbitrary \(a \in \{0,1\}\). By the tower rule (first equality), by the randomization assumption (second equality) and by the consistency and positivity assumptions (third equality), it holds that \[\begin{align*} \Exp_{\bbP_0}(Y_a) &= \int \Exp_{\bbP_0}(Y_a \mid W = w) dQ_{0,W}(w) = \int \Exp_{\bbP_0}(Y_a \mid A = a, W = w) dQ_{0,W}(w) \\ &= \int \Exp_{P_0}(Y \mid A = a, W = w) dQ_{0,W}(w) = \int \Qbar_0(a,w) dQ_{0,W}(w). \end{align*}\] The stated result follows by taking the difference between the cases \(a=1\) and \(a=0\).

Remark. The positivity assumption is needed for \(\Exp_{P_0}(Y \mid A = a, W) \defq \Qbar_{0}(a,W)\) to be well-defined.
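As a rough numerical illustration of the identification result, the following sketch simulates a toy experiment in which the three assumptions hold by construction. It then compares the average of \(Y_1 - Y_0\) (computable only because the simulation reveals both counterfactuals) with a quantity computed from the observed \((W, A, Y)\) alone. The law of \(W\) (`P_W`), the action mechanism (`G_BAR`) and the conditional means (`Q_BAR`) are hypothetical values.

```python
import random

# Hypothetical toy experiment with a binary context W.
P_W = {0: 0.3, 1: 0.7}      # Pr(W = w)
G_BAR = {0: 0.4, 1: 0.8}    # Pr(A = 1 | W = w), strictly in (0, 1): positivity
Q_BAR = {(0, 0): 0.2, (1, 0): 0.5, (0, 1): 0.6, (1, 1): 0.7}  # Pr(Y_a = 1 | W = w), keyed (a, w)


def draw(rng):
    w = 1 if rng.random() < P_W[1] else 0
    u_y = rng.random()
    y0 = int(u_y < Q_BAR[(0, w)])
    y1 = int(u_y < Q_BAR[(1, w)])
    a = int(rng.random() < G_BAR[w])   # depends on W only: randomization
    y = a * y1 + (1 - a) * y0          # consistency
    return (w, y0, y1, a, y)


def mean(values):
    return sum(values) / len(values)


def plug_in(data):
    """Average over W of the difference of the empirical means of Y given (A, W)."""
    total = 0.0
    for w in (0, 1):
        stratum = [(a, y) for (ww, _, _, a, y) in data if ww == w]
        q1 = mean([y for (a, y) in stratum if a == 1])
        q0 = mean([y for (a, y) in stratum if a == 0])
        total += len(stratum) / len(data) * (q1 - q0)
    return total


rng = random.Random(1)
data = [draw(rng) for _ in range(50000)]

causal = mean([y1 - y0 for (_, y0, y1, _, _) in data])  # uses the counterfactuals
statistical = plug_in(data)                             # uses (W, A, Y) only
truth = sum(P_W[w] * (Q_BAR[(1, w)] - Q_BAR[(0, w)]) for w in (0, 1))
```

Under these assumptions, `causal` and `statistical` should both be close to `truth`, up to Monte Carlo error.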

B.3 Building a confidence interval

Let \(\Phi\) be the standard normal distribution function. Let \(X_{1}\), \(\ldots\), \(X_{n}\) be independently drawn from a given law.

B.3.1 CLT & Slutsky’s lemma

Assume that \(\sigma^{2} \defq \Var(X_{1})\) is finite and positive. Let \(m \defq \Exp(X_{1})\) be the mean of \(X_{1}\) and \(\bar{X}_{n} \defq n^{-1} \sum_{i=1}^{n} X_{i}\) be the empirical mean. By the central limit theorem (CLT), it holds that \(\sqrt{n} (\bar{X}_{n} - m)\) converges in law as \(n\) grows to the centered Gaussian law with variance \(\sigma^{2}\).

Moreover, if \(\sigma_{n}^{2}\) is a (positive) consistent estimator of \(\sigma^{2}\) then, by Slutsky’s lemma, \((\sqrt{n}/\sigma_{n}) (\bar{X}_{n} - m)\) converges in law to the standard normal law. The empirical variance \(n^{-1} \sum_{i=1}^{n}(X_{i} - \bar{X}_{n})^{2}\) is such an estimator.

Proposition B.2 Under the above assumptions, \[\begin{equation*} \left[\bar{X}_{n} \pm \Phi^{-1}(1-\alpha) \frac{\sigma_{n}}{\sqrt{n}}\right] \end{equation*}\] is a confidence interval for \(m\) with asymptotic level \((1-2\alpha)\).
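A minimal sketch of the interval of Proposition B.2, using the empirical variance as \(\sigma_{n}^{2}\). The function name `clt_confidence_interval` and the simulated exponential sample are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist, fmean


def clt_confidence_interval(xs, alpha):
    """Interval x_bar +/- Phi^{-1}(1 - alpha) * sigma_n / sqrt(n), asymptotic level 1 - 2 * alpha."""
    n = len(xs)
    x_bar = fmean(xs)
    # Empirical variance with the 1/n normalization used in the text.
    sigma_n = math.sqrt(fmean([(x - x_bar) ** 2 for x in xs]))
    half_width = NormalDist().inv_cdf(1.0 - alpha) * sigma_n / math.sqrt(n)
    return (x_bar - half_width, x_bar + half_width)


rng = random.Random(2)
sample = [rng.expovariate(1.0) for _ in range(2000)]  # exponential law with mean 1
lower, upper = clt_confidence_interval(sample, alpha=0.025)
```

With `alpha=0.025` the asymptotic level is \(1 - 2\alpha = 95\%\). The guarantee is asymptotic, so for small samples or heavy-tailed laws the actual coverage may differ.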

B.3.2 CLT and order statistics

Suppose that the law of \(X_{1}\) admits a continuous distribution function \(F\). Set \(p \in ]0,1[\) and, assuming that \(n\) is large, find \(k\geq 1\) and \(l \geq 1\) such that \[\begin{equation*} \frac{k}{n} \approx p - \Phi^{-1}(1-\alpha) \sqrt{\frac{p(1-p)}{n}} \end{equation*}\] and \[\begin{equation*} \frac{l}{n} \approx p + \Phi^{-1}(1-\alpha) \sqrt{\frac{p(1-p)}{n}}. \end{equation*}\]

Proposition B.3 Under the above assumptions, \([X_{(k)},X_{(l)}]\) is a confidence interval for \(F^{-1}(p)\) with asymptotic level \(1 - 2\alpha\).
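The construction of Proposition B.3 can be sketched as follows. The text only requires \(k/n\) and \(l/n\) to approximate the two bounds; rounding \(k\) down and \(l\) up, and clipping both to \(\{1, \ldots, n\}\), is an assumption made here for concreteness.

```python
import math
import random
from statistics import NormalDist


def quantile_confidence_interval(xs, p, alpha):
    """Interval [X_(k), X_(l)] for the p-quantile, asymptotic level 1 - 2 * alpha."""
    n = len(xs)
    z = NormalDist().inv_cdf(1.0 - alpha)
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    # Rounding convention (an assumption): widen the interval, stay within 1..n.
    k = max(1, math.floor(n * (p - half_width)))
    l = min(n, math.ceil(n * (p + half_width)))
    ordered = sorted(xs)
    return (ordered[k - 1], ordered[l - 1])  # order statistics are 1-indexed


rng = random.Random(3)
sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]
lower, upper = quantile_confidence_interval(sample, p=0.5, alpha=0.025)
```

No variance or density estimate is needed: the interval depends on the data only through two order statistics.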

B.4 Another representation of the parameter of interest

For notational simplicity, note that \((2a-1)\) equals 1 if \(a=1\) and \(-1\) if \(a=0\). Now, for each \(a = 0,1\), \[\begin{align*} \Exp_{P_{0}}\left(\frac{\one\{A = a\}Y}{\ell\Gbar_{0}(a,W)}\right) &= \Exp_{P_{0}}\left(\Exp_{P_{0}}\left(\frac{\one\{A = a\}Y}{\ell\Gbar_{0}(a,W)} \middle| A, W \right) \right) \\ &= \Exp_{P_{0}}\left(\frac{\one\{A = a\}}{\ell\Gbar_{0}(a,W)} \Qbar_{0}(A, W) \right) \\ &= \Exp_{P_{0}}\left(\frac{\one\{A = a\}}{\ell\Gbar_{0}(a,W)} \Qbar_{0}(a, W)\right) \\ &= \Exp_{P_{0}}\left(\Exp_{P_{0}}\left(\frac{\one\{A = a\}}{\ell\Gbar_{0}(a,W)} \Qbar_{0}(a, W) \middle| W \right) \right) \\& = \Exp_{P_{0}}\left(\frac{\ell\Gbar_{0}(a,W)}{\ell\Gbar_{0}(a,W)} \Qbar_{0}(a, W) \right) \\& = \Exp_{P_{0}} \left( \Qbar_{0}(a, W) \right), \end{align*}\] where the first and fourth equalities follow from the tower rule (see footnote 1), the second and fifth hold by definition of the conditional expectation, the third holds because \(\one\{A = a\} \Qbar_{0}(A, W) = \one\{A = a\} \Qbar_{0}(a, W)\), and the sixth is a simplification that is valid \(P_{0}\)-almost surely under the positivity assumption. Since \((2A-1)/\ell\Gbar_{0}(A,W) = \one\{A=1\}/\ell\Gbar_{0}(1,W) - \one\{A=0\}/\ell\Gbar_{0}(0,W)\), subtracting the identity for \(a=0\) from the identity for \(a=1\) yields \(\Exp_{P_{0}}\left((2A-1)Y/\ell\Gbar_{0}(A,W)\right) = \Exp_{P_{0}}\left(\Qbar_{0}(1,W) - \Qbar_{0}(0,W)\right)\). This completes the proof.
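The chain of equalities can be checked exactly on a small discrete law by enumerating all values of \((W, A, Y)\). The law below (`P_W`, `G_BAR`, `Q_BAR`, with binary \(W\) and \(Y\)) is hypothetical, and `l_g_bar` is an illustrative name for \(\ell\Gbar_{0}\). The last two lines combine the cases \(a=1\) and \(a=0\) with the signs \((2a-1)\).

```python
from itertools import product

# Hypothetical discrete law of O = (W, A, Y).
P_W = {0: 0.3, 1: 0.7}      # Pr(W = w)
G_BAR = {0: 0.4, 1: 0.8}    # Pr(A = 1 | W = w)
Q_BAR = {(0, 0): 0.2, (1, 0): 0.5, (0, 1): 0.6, (1, 1): 0.7}  # E(Y | A = a, W = w), keyed (a, w)


def l_g_bar(a, w):
    """Pr(A = a | W = w)."""
    return G_BAR[w] if a == 1 else 1.0 - G_BAR[w]


def prob(w, a, y):
    """Pr(W = w, A = a, Y = y)."""
    p_y = Q_BAR[(a, w)] if y == 1 else 1.0 - Q_BAR[(a, w)]
    return P_W[w] * l_g_bar(a, w) * p_y


def expectation(h):
    """Exact expectation of h(W, A, Y) by enumeration of the eight outcomes."""
    return sum(prob(w, a, y) * h(w, a, y) for w, a, y in product((0, 1), repeat=3))


def weighted_side(a0):
    """Left-hand side: E(1{A = a0} Y / lGbar(a0, W))."""
    return expectation(lambda w, a, y: (a == a0) * y / l_g_bar(a0, w))


def regression_side(a0):
    """Right-hand side: E(Qbar(a0, W))."""
    return expectation(lambda w, a, y: Q_BAR[(a0, w)])


psi_weighted = expectation(lambda w, a, y: (2 * a - 1) * y / l_g_bar(a, w))
psi_regression = regression_side(1) - regression_side(0)
```

Both sides agree up to floating-point error for \(a = 0\) and \(a = 1\), so `psi_weighted` and `psi_regression` coincide as well.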

B.5 The delta-method

Let \(f\) be a map from \(\Theta \subset \bbR^{p}\) to \(\bbR^{q}\) that is differentiable at \(\theta\in \Theta\). Let \(X_{n}\) be a random vector taking its values in \(\Theta\).

Proposition B.4 If \(\sqrt{n} (X_{n} - \theta)\) converges in law to the Gaussian law with mean \(\mu\) and covariance matrix \(\Sigma\), then \(\sqrt{n} (f(X_{n}) - f(\theta))\) converges in law to the Gaussian law with mean \(\nabla f(\theta) \times \mu\) and covariance matrix \(\nabla f(\theta) \times \Sigma \times \nabla f(\theta)^{\top}\). In addition, if \(\Sigma_{n}\) estimates \(\Sigma\) consistently and if \(f\) is continuously differentiable in a neighborhood of \(\theta\) then, by Slutsky’s lemma, the asymptotic variance of \(\sqrt{n} (f(X_{n}) - f(\theta))\) is consistently estimated with \(\nabla f(X_{n}) \times \Sigma_{n} \times \nabla f(X_{n})^{\top}\).
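As a small worked instance, take \(p = 2\), \(q = 1\) and the ratio \(f(x_{1}, x_{2}) = x_{1}/x_{2}\), whose gradient is \((1/x_{2}, -x_{1}/x_{2}^{2})\). The sketch below evaluates the plug-in variance \(\nabla f(X_{n}) \times \Sigma_{n} \times \nabla f(X_{n})^{\top}\). The function names and the numerical values of \(X_{n}\) and \(\Sigma_{n}\) are hypothetical.

```python
def ratio(x):
    """f(x1, x2) = x1 / x2, a map from a subset of R^2 to R."""
    return x[0] / x[1]


def ratio_gradient(x):
    """Gradient of the ratio, written as a row vector."""
    return [1.0 / x[1], -x[0] / x[1] ** 2]


def delta_method_variance(gradient, sigma):
    """Quadratic form gradient * sigma * gradient^T for a real-valued f."""
    p = len(gradient)
    return sum(
        gradient[i] * sigma[i][j] * gradient[j]
        for i in range(p)
        for j in range(p)
    )


# Hypothetical values of X_n and of the estimated covariance matrix Sigma_n.
x_n = [2.0, 4.0]
sigma_n = [[1.0, 0.0], [0.0, 4.0]]

estimate = ratio(x_n)
variance = delta_method_variance(ratio_gradient(x_n), sigma_n)
```

Here the gradient at `x_n` is `[0.25, -0.125]`, so the estimated asymptotic variance is \(0.25^{2} \times 1 + 0.125^{2} \times 4 = 0.125\).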

B.6 The oracle logistic risk

First, let us recall the definition of the Kullback-Leibler divergence between Bernoulli laws of parameters \(p,q\in]0,1[\): \[\begin{equation*}\text{KL}(p,q) \defq p \log\left(\frac{p}{q}\right) + (1-p) \log \left(\frac{1-p}{1-q}\right).\end{equation*}\] It satisfies \(\text{KL}(p,q) \geq 0\) where the equality holds if and only if \(p=q\).

Let \(f:[0,1] \times \{0,1\} \times [0,1] \to [0,1]\) be a (measurable) function. Applying the tower rule shows that the oracle logistic risk satisfies \[\begin{align} \Exp_{P_{0}} \left(L_{y} (f)(O)\right)&=\Exp_{P_{0}} \left(-\Qbar_{0}(A,W) \log f(A,W) - \left(1 - \Qbar_{0} (A,W)\right) \log \left(1 - f(A,W)\right)\right)\notag\\&=\Exp_{P_{0}} \left(\text{KL}\left(\Qbar_{0}(A,W), f(A,W)\right)\right) + \text{constant},\tag{B.1} \end{align}\] where the above constant equals \[\begin{equation*} -\Exp_{P_{0}}\left(\Qbar_{0}(A,W) \log \Qbar_{0}(A,W) + \left(1 - \Qbar_{0} (A,W)\right) \log \left(1 - \Qbar_{0}(A,W)\right)\right). \end{equation*}\]

In light of (B.1), \(\Qbar_{0}\) minimizes \(f \mapsto \Exp_{P_{0}} \left(L_{y} (f)(O)\right)\) over the set of (measurable) functions mapping \([0,1] \times \{0,1\} \times [0,1]\) to \([0,1]\). Moreover, as an average of measures of discrepancy, \(\Exp_{P_{0}} \left(L_{y} (f)(O)\right)\) is also a measure of discrepancy.
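The decomposition (B.1) can be checked numerically on a small discrete law of \((A, W)\). In the sketch below, `P_W`, `G_BAR`, `Q_BAR` and the competing function `candidate` are hypothetical. The risk is computed after conditioning on \((A, W)\), that is, in the form given by the first equality of (B.1).

```python
import math

# Hypothetical discrete law of (W, A) and conditional mean Qbar_0, keyed (a, w).
P_W = {0: 0.3, 1: 0.7}
G_BAR = {0: 0.4, 1: 0.8}
Q_BAR = {(0, 0): 0.2, (1, 0): 0.5, (0, 1): 0.6, (1, 1): 0.7}


def cell_prob(a, w):
    """Pr(A = a, W = w)."""
    return P_W[w] * (G_BAR[w] if a == 1 else 1.0 - G_BAR[w])


def kl(p, q):
    """Kullback-Leibler divergence between Bernoulli laws with parameters p, q in ]0,1[."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def oracle_risk(f):
    """E(-Qbar log f - (1 - Qbar) log(1 - f)), with f given as a dict keyed (a, w)."""
    return sum(
        cell_prob(a, w)
        * (-Q_BAR[(a, w)] * math.log(f[(a, w)])
           - (1.0 - Q_BAR[(a, w)]) * math.log(1.0 - f[(a, w)]))
        for (a, w) in Q_BAR
    )


def mean_kl(f):
    """E(KL(Qbar(A, W), f(A, W)))."""
    return sum(cell_prob(a, w) * kl(Q_BAR[(a, w)], f[(a, w)]) for (a, w) in Q_BAR)


# The constant of (B.1): it does not depend on f.
CONSTANT = -sum(
    cell_prob(a, w)
    * (q * math.log(q) + (1.0 - q) * math.log(1.0 - q))
    for (a, w), q in Q_BAR.items()
)

candidate = {(0, 0): 0.3, (1, 0): 0.5, (0, 1): 0.4, (1, 1): 0.9}
excess_risk = oracle_risk(candidate) - oracle_risk(Q_BAR)
```

The excess risk of `candidate` over `Q_BAR` equals `mean_kl(candidate)`, which is nonnegative, in line with the minimizing property of \(\Qbar_{0}\).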


  1. For any random variable \((U,V)\) such that \(\Exp(U|V)\) and \(\Exp(U)\) are well defined, it holds that \(\Exp(\Exp(U|V)) = \Exp(U)\).