# Section 6 A simple inference strategy

## 6.1 A cautionary detour

Let us first introduce the following estimator:

\[\begin{align} \notag \psi_{n}^{a} &\defq \frac{\Exp_{P_{n}} (AY)}{\Exp_{P_{n}} (A)} - \frac{\Exp_{P_{n}} ((1-A)Y)}{\Exp_{P_{n}}(1-A)} \\ &= \frac{\sum_{i=1}^{n} \one\{A_{i}=Y_{i}=1\}}{\sum_{i=1}^{n} \one\{A_{i}=1\}} - \frac{\sum_{i=1}^{n} \one\{A_{i}=0,Y_{i}=1\}}{\sum_{i=1}^{n}\one\{A_{i}=0\}}. \tag{6.1} \end{align}\]

Note that \(\psi_n^a\) is simply the difference in sample means between observations with \(A = 1\) and observations with \(A = 0\). It estimates \[\begin{align*}\Phi(P_{0}) &\defq \frac{\Exp_{P_{0}} (AY)}{\Exp_{P_{0}} (A)} - \frac{\Exp_{P_{0}} ((1-A)Y)}{\Exp_{P_{0}} (1-A)}\\&= \Exp_{P_{0}} (Y | A=1) - \Exp_{P_{0}} (Y | A=0).\end{align*}\]
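The second equality holds because \(A\) is binary; for instance, for the first term,

\[\begin{equation*}\frac{\Exp_{P_{0}}(AY)}{\Exp_{P_{0}}(A)} = \frac{\Exp_{P_{0}}(Y \one\{A=1\})}{P_{0}(A=1)} = \Exp_{P_{0}}(Y \mid A=1),\end{equation*}\]

and similarly for the second term, with \(1-A\) substituted for \(A\).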

We seize this opportunity to demonstrate numerically that
\(\psi_{n}^{a}\) *does not* estimate \(\Psi(P_{0})\) because, in general,
\(\Psi(P_{0})\) and \(\Phi(P_{0})\) differ. This is apparent in the following
alternative expression of \(\Phi(P_{0})\):

\[\begin{align*} \Phi(P_{0}) &= \Exp_{P_{0}} \left(\Exp_{P_0}(Y \mid A, W) \mid A=1 \right) - \Exp_{P_{0}} \left(\Exp_{P_0}(Y \mid A, W) \mid A=0\right)\\ &= \int \Qbar_{0}(1, w) dP_{0,W|A=1}(w) - \int \Qbar_{0}(0, w) dP_{0,W|A=0}(w). \end{align*}\]

Contrast the above equalities with (2.1). In the latter, the outer integral is against the marginal law of \(W\) under \(P_{0}\). In the former, the outer integrals are against the conditional laws of \(W\) given \(A=1\) and \(A=0\) under \(P_{0}\), respectively. Thus \(\Psi(P_0)\) and \(\Phi(P_0)\) coincide when the conditional laws of \(W\) given \(A = 1\) and given \(A = 0\) both equal the marginal law of \(W\). This can sometimes be ensured by design, as in an experiment where \(A\) is randomly allocated to observations irrespective of \(W\). Barring this level of experimental control, however, the naive estimator may in general lead us astray in our goal of drawing causal inferences.
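To make the discrepancy concrete, here is a minimal, self-contained simulation sketch (independent of the `tlrider` experiment; the data-generating mechanism below is hypothetical) in which \(W\) confounds the relation between \(A\) and \(Y\), so that the naive contrast, which targets \(\Phi(P_{0})\), and the plug-in of the true \(\Qbar_{0}\) against the marginal law of \(W\), which targets \(\Psi(P_{0})\), disagree:

```r
## Hypothetical data-generating mechanism: W confounds A and Y.
set.seed(1)
n <- 1e6
W <- runif(n)
A <- rbinom(n, 1, plogis(2 * W - 1))  # probability of action depends on W
Y <- rbinom(n, 1, plogis(W + A - 1))  # outcome depends on W and A
## Naive contrast of sample means: targets Phi(P0)
phi_n <- mean(Y[A == 1]) - mean(Y[A == 0])
## Plug-in of the true Qbar0 against the marginal law of W: targets Psi(P0)
psi_n <- mean(plogis(W + 1 - 1)) - mean(plogis(W - 1))
c(phi_n = phi_n, psi_n = psi_n)  # the two values differ
```

Because larger values of \(W\) make both \(A = 1\) and \(Y = 1\) more likely, `phi_n` overshoots `psi_n` here.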

## 6.2 ⚙ Delta-method

Consider the next chunk of code:

```
compute_irrelevant_estimator <- function(obs) {
  obs <- as_tibble(obs)
  Y <- pull(obs, Y)
  A <- pull(obs, A)
  psi_n <- mean(A * Y) / mean(A) - mean((1 - A) * Y) / mean(1 - A)
  Var_n <- cov(cbind(A * Y, A, (1 - A) * Y, (1 - A)))
  phi_n <- c(1 / mean(A), -mean(A * Y) / mean(A)^2,
             -1 / mean(1 - A),
             mean((1 - A) * Y) / mean(1 - A)^2)
  var_n <- as.numeric(t(phi_n) %*% Var_n %*% phi_n)
  sig_n <- sqrt(var_n / nrow(obs))
  tibble(psi_n = psi_n, sig_n = sig_n)
}
```

The function `compute_irrelevant_estimator` computes the estimator \(\psi_{n}^{a}\) (6.1) based on the data set in `obs`.

Introduce \(X_{n} \defq n^{-1}\sum_{i=1}^{n} \left(A_{i}Y_{i}, A_{i}, (1-A_{i})Y_{i}, 1-A_{i}\right)^{\top}\) and \(X \defq \left(AY, A, (1-A)Y, 1-A\right)^{\top}\). It happens that \(X_{n}\) is asymptotically Gaussian: as \(n\) goes to infinity,\[\begin{equation*}\sqrt{n} \left(X_{n} - \Exp_{P_{0}} (X)\right)\end{equation*}\] converges in law to the centered Gaussian law with covariance matrix \[\begin{equation*}V_{0} \defq \Exp_{P_{0}} \left((X - \Exp_{P_{0}} (X)) \times (X- \Exp_{P_{0}} (X))^{\top}\right).\end{equation*}\]

Let \(f:\bbR\times \bbR^{*} \times \bbR\times \bbR^{*} \to \bbR\) be given by \(f(r,s,t,u) = r/s - t/u\). The function \(f\) is differentiable.

Check that \(\psi_{n}^{a} = f(X_{n})\). Point out the line where \(\psi_{n}^{a}\) is computed in the body of `compute_irrelevant_estimator`. Also point out the line where the above asymptotic variance of \(X_{n}\) is estimated with its empirical counterpart, say \(V_{n}\). ☡ Argue how the delta-method yields that \(\sqrt{n}(\psi_{n}^{a} - \Phi(P_{0}))\) converges in law to the centered Gaussian law with a variance that can be estimated with \[\begin{equation} v_{n}^{a} \defq \nabla f(X_{n}) \times V_{n} \times \nabla f(X_{n})^{\top}. \tag{6.2} \end{equation}\]

Check that the gradient \(\nabla f\) of \(f\) is given by \(\nabla f(r,s,t,u) = (1/s, -r/s^{2}, -1/u, t/u^{2})\). Point out the line where the asymptotic variance of \(\psi_{n}^{a}\) is estimated.
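Indeed, differentiating \(f(r,s,t,u) = r/s - t/u\) coordinate by coordinate gives

\[\begin{equation*}\frac{\partial f}{\partial r} = \frac{1}{s}, \quad \frac{\partial f}{\partial s} = -\frac{r}{s^{2}}, \quad \frac{\partial f}{\partial t} = -\frac{1}{u}, \quad \frac{\partial f}{\partial u} = \frac{t}{u^{2}}.\end{equation*}\]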

For instance,

```
(compute_irrelevant_estimator(head(obs, 1e3)))
#> # A tibble: 1 × 2
#>   psi_n  sig_n
#>   <dbl>  <dbl>
#> 1 0.126 0.0181
```

computes the numerical values of \(\psi_{n}^{a}\) and \(v_{n}^{a}\) based on the
first 1000 observations in `obs`.

## 6.3 IPTW estimator assuming the mechanism of action known

### 6.3.1 A simple estimator

Let us assume for a moment that we know \(\Gbar_{0}\). This would indeed be the case if we had experimental control over the actions taken by the observational units, as in a randomized experiment. In contrast, assuming \(\Qbar_{0}\) known would be difficult to justify.

```
Gbar <- get_feature(experiment, "Gbar")
```

The alternative expression (2.7) suggests estimating \(\Psi(P_{0})\) with

\[\begin{align} \psi_{n}^{b} &\defq \Exp_{P_{n}} \left( \frac{2A-1}{\ell \Gbar_{0}(A,W)} Y \right) \\ & = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{2A_{i}-1}{ \ell\Gbar_{0}(A_{i},W_{i})}Y_{i} \right). \tag{6.3} \end{align}\]

Note how \(P_{n}\) is substituted for \(P_{0}\) in (6.3) relative to
(2.7). However, we cannot call \(\psi_{n}^{b}\) a *substitution
estimator* because it cannot be written as \(\Psi\) evaluated at an element of
\(\calM\). This said, it is dubbed an IPTW (inverse probability of treatment
weighted) estimator because of the denominators \(\ell\Gbar_{0}(A_{i},W_{i})\)
in its definition.^{10}

In Section 8.2, we develop another IPTW estimator that does not assume that \(\Gbar_{0}\) is known beforehand.

### 6.3.2 Elementary statistical properties

It is easy to check that \(\psi_{n}^{b}\) estimates \(\Psi(P_{0})\) consistently, but consistency alone is too little to ask of an estimator of \(\psi_{0}\). Better, \(\psi_{n}^{b}\) also satisfies a central limit theorem: \(\sqrt{n} (\psi_{n}^{b} - \psi_{0})\) converges in law to a centered Gaussian law with asymptotic variance \[\begin{equation*}v^{b} \defq \Var_{P_{0}} \left(\frac{2A-1}{\ell\Gbar_{0}(A,W)}Y\right),\end{equation*}\] where \(v^{b}\) can be consistently estimated by its empirical counterpart

\[\begin{align} \tag{6.4} v_{n}^{b} &\defq \Var_{P_{n}} \left(\frac{2A-1}{\ell\Gbar_{0}(A,W)}Y\right) \\ &= \frac{1}{n} \sum_{i=1}^{n}\left(\frac{2A_{i}-1}{\ell\Gbar_{0} (A_{i},W_{i})} Y_{i} - \psi_{n}^{b}\right)^{2}. \end{align}\]
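For concreteness, here is a minimal sketch of how (6.3) and (6.4) can be computed in R. This is *not* the `tlrider` implementation; it assumes that `obs` is a matrix with columns `W`, `A` and `Y`, and that `Gbar` is a vectorized function such that \(\ell\Gbar_{0}(a,w) = a\Gbar_{0}(w) + (1-a)(1-\Gbar_{0}(w))\):

```r
## A minimal IPTW sketch (not the tlrider implementation).
iptw_sketch <- function(obs, Gbar) {
  W <- obs[, "W"]; A <- obs[, "A"]; Y <- obs[, "Y"]
  lGAW <- A * Gbar(W) + (1 - A) * (1 - Gbar(W))  # lGbar_0(A, W)
  D <- (2 * A - 1) / lGAW * Y                    # summand in (6.3)
  psi_n <- mean(D)                               # psi_n^b, see (6.3)
  var_n <- mean((D - psi_n)^2)                   # v_n^b, see (6.4)
  c(psi_n = psi_n, sig_n = sqrt(var_n / length(D)))
}
```

Up to implementation details, this mirrors what `compute_iptw` reports, with `sig_n` on the standard-error scale \(\sqrt{v_{n}^{b}/n}\).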

We investigate *empirically* the statistical behavior of \(\psi_{n}^{b}\) in
Section 6.3.3.

### 6.3.3 Empirical investigation

The package `tlrider` includes `compute_iptw`, a function that implements the
IPTW inference strategy (simply run `?compute_iptw` to see its man page). For
instance,

```
(compute_iptw(head(obs, 1e3), Gbar))
#> # A tibble: 1 × 2
#>   psi_n  sig_n
#>   <dbl>  <dbl>
#> 1 0.167 0.0533
```

computes the numerical values of \(\psi_{n}^{b}\) and \(v_{n}^{b}\) based
upon the first 1000 observations contained in `obs`.

The next chunk of code investigates the empirical behaviors of estimators
\(\psi_{n}^{a}\) and \(\psi_{n}^{b}\). As explained in Section 5,
we first make 1000 data sets out of the `obs` data set (second line), then
build the estimators on each of them (fourth and fifth lines). After the first
series of commands, the object `psi_hat_ab`, a `tibble`, contains 2000
rows and four columns. For each smaller data set (identified by its `id`),
two rows contain the values of either \(\psi_{n}^{a}\) and
\(\sqrt{v_{n}^{a}}/\sqrt{n}\) (if `type` equals `a`) or \(\psi_{n}^{b}\) and
\(\sqrt{v_{n}^{b}}/\sqrt{n}\) (if `type` equals `b`).

After the second series of commands, the object `psi_hat_ab` contains, in
addition, the values of the recentered (with respect to \(\psi_{0}\)) and
renormalized \(\sqrt{n}/\sqrt{v_{n}^{a}} (\psi_{n}^{a} - \psi_{0})\) and
\(\sqrt{n}/\sqrt{v_{n}^{b}} (\psi_{n}^{b} - \psi_{0})\), where \(v_{n}^{a}\)
(6.2) and \(v_{n}^{b}\) (6.4) estimate the asymptotic
variances of \(\psi_{n}^{a}\) and \(\psi_{n}^{b}\), respectively. Finally,
`bias_ab` reports amounts of bias (at the renormalized scale).

We will refer to the recentering (always with respect to \(\psi_{0}\)) then
renormalizing procedure as a *renormalization scheme*, because what may vary
between two such procedures is the renormalization factor. We will judge
whether or not the renormalization scheme is adequate for instance by
comparing kernel density estimators of the law of estimators gone through a
renormalization procedure with the standard normal density, as in Figure
6.1 below.

```
psi_hat_ab <- obs %>% as_tibble %>%
  mutate(id = (seq_len(n()) - 1) %% iter) %>%
  nest(obs = c(W, A, Y)) %>%
  mutate(est_a = map(obs, ~ compute_irrelevant_estimator(.)),
         est_b = map(obs, ~ compute_iptw(as.matrix(.), Gbar))) %>%
  pivot_longer(c(`est_a`, `est_b`),
               names_to = "type", values_to = "estimates") %>%
  extract(type, "type", "_([ab])$") %>%
  unnest(estimates) %>% select(-obs)
(psi_hat_ab)
#> # A tibble: 2,000 × 4
#>      id type   psi_n  sig_n
#>   <dbl> <chr>  <dbl>  <dbl>
#> 1     0 a     0.140  0.0174
#> 2     0 b     0.0772 0.0546
#> 3     1 a     0.107  0.0169
#> 4     1 b     0.138  0.0541
#> 5     2 a     0.107  0.0172
#> 6     2 b     0.0744 0.0553
#> # … with 1,994 more rows
psi_hat_ab <- psi_hat_ab %>%
  group_by(id) %>%
  mutate(clt = (psi_n - psi_zero) / sig_n)
(psi_hat_ab)
#> # A tibble: 2,000 × 5
#> # Groups:   id [1,000]
#>      id type   psi_n  sig_n    clt
#>   <dbl> <chr>  <dbl>  <dbl>  <dbl>
#> 1     0 a     0.140  0.0174  3.23
#> 2     0 b     0.0772 0.0546 -0.109
#> 3     1 a     0.107  0.0169  1.40
#> 4     1 b     0.138  0.0541  1.01
#> 5     2 a     0.107  0.0172  1.38
#> 6     2 b     0.0744 0.0553 -0.159
#> # … with 1,994 more rows
(bias_ab <- psi_hat_ab %>%
   group_by(type) %>% summarize(bias = mean(clt), .groups = "drop"))
#> # A tibble: 2 × 2
#>   type    bias
#>   <chr>  <dbl>
#> 1 a     1.39
#> 2 b     0.0241
fig_bias_ab <- ggplot() +
  geom_line(aes(x = x, y = y),
            data = tibble(x = seq(-3, 3, length.out = 1e3),
                          y = dnorm(x)),
            linetype = 1, alpha = 0.5) +
  geom_density(aes(clt, fill = type, colour = type),
               psi_hat_ab, alpha = 0.1) +
  geom_vline(aes(xintercept = bias, colour = type),
             bias_ab,
             size = 1.5, alpha = 0.5)
fig_bias_ab +
  labs(y = "",
       x = bquote(paste(sqrt(n/v[n]^{list(a, b)})*
                        (psi[n]^{list(a, b)} - psi[0]))))
```

By the above chunk of code, the averages of \(\sqrt{n/v_{n}^{a}} (\psi_{n}^{a} - \psi_{0})\) and \(\sqrt{n/v_{n}^{b}} (\psi_{n}^{b} - \psi_{0})\)
computed across the realizations of the two estimators are respectively equal
to 1.387 and 0.024 (see `bias_ab`). Interpreted as amounts of bias, those two
quantities are represented by vertical lines in Figure
6.1. The red and blue bell-shaped curves represent the
empirical laws of \(\psi_{n}^{a}\) and \(\psi_{n}^{b}\) (recentered with respect
to \(\psi_{0}\), and renormalized) as estimated by kernel density estimation.
The latter is close to the black curve, which represents the standard normal
density.

We could have used the alternative expression IPAW, where A (like action) is substituted for T (like treatment).↩︎