# Section 9 One-step correction

## 9.1 ☡ General analysis of plug-in estimators

Recall that \(\Algo_{Q_{W}}\) is an algorithm designed for the estimation of \(Q_{0,W}\) (see Section 7.3) and that we denote by \(Q_{n,W} \defq \Algo_{Q_{W}}(P_{n})\) the output of the algorithm trained on \(P_{n}\). Likewise, \(\Algo_{\Gbar}\) and \(\Algo_{\Qbar}\) are two generic algorithms designed for the estimation of \(\Gbar_{0}\) and of \(\Qbar_{0}\) (see Sections 7.4 and 7.6), and \(\Gbar_{n} \defq \Algo_{\Gbar}(P_{n})\) and \(\Qbar_{n} \defq \Algo_{\Qbar}(P_{n})\) are their outputs once trained on \(P_{n}\).

Let us now introduce \(\Phat_n\) a law in \(\calM\) such that the \(Q_{W}\), \(\Gbar\)
and \(\Qbar\) features of \(\Phat_n\) equal \(Q_{n,W}\), \(\Gbar_{n}\) and
\(\Qbar_{n}\), respectively. We say that any such law is *compatible* with
\(Q_{n,W}\), \(\Gbar_n\) and \(\Qbar_n\).

Now, let us substitute \(\Phat_n\) for \(P\) in (4.1):

\[\begin{equation} \tag{9.1} \Psi(\Phat_n) - \Psi(P_0) = - P_0 D^*(\Phat_n) + \Rem_{P_0}(\Phat_n) . \end{equation}\]

We show in Appendix
C.2 that, under conditions on the
complexity/versatility of algorithms \(\Algo_{\Gbar}\) and \(\Algo_{\Qbar}\)
(often referred to as *regularity conditions*) and assuming that their outputs
\(\Gbar_{n}\) and \(\Qbar_{n}\) both consistently estimate their targets
\(\Gbar_{0}\) and \(\Qbar_{0}\), it holds that

\[\begin{align} \Psi(\Phat_n) - \Psi(P_0) &= - P_{n} D^*(\Phat_n) + P_{n} D^*(P_0) + o_{P_0}(1/\sqrt{n}) \tag{9.2} \\ \notag &= - P_{n} D^*(\Phat_n)+ \frac{1}{n} \sum_{i=1}^n D^*(P_0)(O_i) + o_{P_0}(1/\sqrt{n}). \end{align}\]
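Heuristically, the step from (9.1) to (9.2) can be sketched as follows (this is only a sketch; the rigorous argument is developed in Appendix C.2). Since \(P_0 D^{*}(P_0) = 0\),

\[\begin{align*} - P_0 D^*(\Phat_n) &= - P_{n} D^*(\Phat_n) + (P_{n} - P_0) D^*(P_0) + (P_{n} - P_0)\left(D^*(\Phat_n) - D^*(P_0)\right) \\ &= - P_{n} D^*(\Phat_n) + P_{n} D^*(P_0) + (P_{n} - P_0)\left(D^*(\Phat_n) - D^*(P_0)\right), \end{align*}\]

and the regularity and consistency conditions guarantee that both the last (so-called empirical process) term and the remainder \(\Rem_{P_0}(\Phat_n)\) are \(o_{P_0}(1/\sqrt{n})\).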

In light of (3.7), \(\Psi(\Phat_{n})\) would be asymptotically linear with influence curve \(\IC=D^{*}(P_{0})\) in the absence of the random term \(-P_{n} D^*(\Phat_n)\). Unfortunately, it turns out that this term can dramatically degrade the behavior of the plug-in estimator \(\Psi(\Phat_{n})\).

## 9.2 One-step correction

Luckily, *a very simple workaround* makes it possible to circumvent the problem. Proposed
in (Le Cam 1969) (see also (Pfanzagl 1982) and (van der Vaart 1998)), the workaround merely
consists in adding the random term to the initial estimator, that is, in
estimating \(\Psi(P_0)\) not with \(\Psi(\Phat_n)\) but instead with
\[\begin{equation}
\psinos \defq \Psi(\Phat_n) + P_{n} D^*(\Phat_n) = \Psi(\Phat_n) +
\frac{1}{n} \sum_{i=1}^{n} D^*(\Phat_n)(O_{i}). \tag{9.3}
\end{equation}\]

Obviously, in light of (9.2), \(\psinos\) is asymptotically linear with influence curve \(\IC=D^{*}(P_{0})\). Thus, by the central limit theorem, \(\sqrt{n} (\psinos - \Psi(P_0))\) converges in law to a centered Gaussian distribution with variance \[\begin{equation} \Var_{P_0}(D^{*}(P_{0})(O)) = \Exp_{P_0}\left(D^{*}(P_{0})(O)^{2}\right), \end{equation}\] the equality holding because \(D^{*}(P_{0})\) is centered under \(P_{0}\).

The detailed general analysis of plug-in estimators developed in Appendix C.2 also reveals that the above asymptotic variance is consistently estimated with \[\begin{equation} P_{n} D^{*}(\Phat_{n})^{2} = \frac{1}{n} \sum_{i=1}^{n} D^*(\Phat_n)^{2}(O_{i}). \end{equation}\] Therefore, by the central limit theorem and Slutsky’s lemma (see the argument in Appendix B.3.1), \[\begin{equation*} \left[\psinos \pm \Phi^{-1}(1-\alpha) \sqrt{\frac{P_{n} D^{*}(\Phat_{n})^{2}}{n}}\right] \end{equation*}\] is a confidence interval for \(\Psi(P_0)\) with asymptotic level \((1-2\alpha)\).
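To make these formulas concrete, here is a small self-contained numerical illustration, written in Python rather than with the `tlrider` package (the toy sampling scheme, the deliberately biased estimator of \(\Qbar_0\), and all names are ours): in a simulated trial where \(\Gbar_0 \equiv 1/2\) is known, a biased \(\Qbar_n\) drags the plug-in estimator away from the truth, and adding the empirical mean of the efficient influence curve, as in (9.3), removes the first-order bias and yields a confidence interval:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
psi_zero = 0.4                      # true value of the parameter

# Toy data: W ~ U(0,1), A ~ Bernoulli(1/2), Y ~ Bernoulli(Qbar0(A, W))
def Qbar0(a, w):                    # true conditional mean of Y given (A, W)
    return 0.2 + 0.2 * w + 0.4 * a

W = rng.uniform(size=n)
A = rng.binomial(1, 0.5, size=n)
Y = rng.binomial(1, Qbar0(A, W))

# A deliberately biased estimator of Qbar0 (shifted upward when a = 1)
def Qbar_n(a, w):
    return Qbar0(a, w) + 0.15 * a

Gbar_n = 0.5                        # known randomization probability

# Plug-in (G-computation) estimator: inherits the bias of Qbar_n
psi_plugin = np.mean(Qbar_n(1, W) - Qbar_n(0, W))

# Efficient influence curve D*(Phat_n) evaluated at the observations
D_star = (Qbar_n(1, W) - Qbar_n(0, W)
          + (A / Gbar_n - (1 - A) / (1 - Gbar_n)) * (Y - Qbar_n(A, W))
          - psi_plugin)

# One-step correction (9.3) and variance estimator P_n D*(Phat_n)^2
psi_os = psi_plugin + np.mean(D_star)
sig_n = np.sqrt(np.mean(D_star ** 2))
ci = (psi_os - 1.96 * sig_n / np.sqrt(n),
      psi_os + 1.96 * sig_n / np.sqrt(n))

print(psi_plugin, psi_os, ci)
```

Here the plug-in estimate equals 0.55 by construction (the shift in \(\Qbar_n\) translates into a bias of 0.15), while the one-step estimate concentrates around the truth 0.4, with a CLT-based confidence interval built from \(P_{n} D^{*}(\Phat_{n})^{2}\).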

## 9.3 Empirical investigation

In light of (9.3), if the initial estimator is \(\Psi(\Phat_{n})\),
then performing a one-step correction essentially boils down to computing two
quantities: \(-P_{n} D^{*}(\Phat_{n})\) (the bias term) and \(P_{n} D^{*}(\Phat_{n})^{2}\) (an estimator of the asymptotic variance of
\(\psinos\)). The `tlrider` package makes the operation very easy thanks to the
function `apply_one_step_correction`.

Let us illustrate its use by updating the G-computation estimator built on the
\(n=1000\) first observations in `obs` by relying on \(\Algo_{\Qbar,\text{kNN}}\),
that is, on the algorithm for the estimation of \(\Qbar_{0}\) as it is
implemented in `estimate_Qbar` with its argument `algorithm` set to the
built-in `kknn_algo` (see Section 7.6.2). The algorithm has
been trained earlier on this data set and produced the object `Qbar_hat_kknn`.
The following chunk of code re-computes the corresponding G-computation
estimator, using again the estimator `QW_hat` of the marginal law of \(W\) under
\(P_{0}\) (see Section 7.3), then applies the one-step correction:

```
(psin_kknn <- compute_gcomp(QW_hat, wrapper(Qbar_hat_kknn, FALSE), 1e3))
#> # A tibble: 1 × 2
#>   psi_n   sig_n
#>   <dbl>   <dbl>
#> 1 0.101 0.00306
(psin_kknn_os <- apply_one_step_correction(head(obs, 1e3),
                                           wrapper(Gbar_hat, FALSE),
                                           wrapper(Qbar_hat_kknn, FALSE),
                                           psin_kknn$psi_n))
#> # A tibble: 1 × 3
#>    psi_n  sig_n    crit_n
#>    <dbl>  <dbl>     <dbl>
#> 1 0.0999 0.0171 -0.000633
```

In the call to `apply_one_step_correction` we provide *(i)* the data set at
hand (first line), *(ii)* the estimator `Gbar_hat` of \(\Gbar_{0}\) that we
built earlier by using algorithm \(\Algo_{\Gbar,1}\) (second line; see Section
7.4.2), *(iii)* the estimator `Qbar_hat_kknn` of \(\Qbar_{0}\)
and the G-computation estimator `psin_kknn` that resulted from it (third and
fourth lines).

To assess the impact of the one-step correction, let us apply it to the
estimators that we built in Section 8.4.3. The object
`learned_features_fixed_sample_size` already contains the estimated features of
\(P_{0}\) that are needed to perform the one-step correction to the estimators
\(\psi_{n}^{d}\) and \(\psi_{n}^{e}\); thus, we merely have to call the function
`apply_one_step_correction`.

```
psi_hat_de_one_step <- learned_features_fixed_sample_size %>%
  mutate(os_est_d =
           pmap(list(obs, Gbar_hat, Qbar_hat_d, est_d),
                ~ apply_one_step_correction(as.matrix(..1),
                                            wrapper(..2, FALSE),
                                            wrapper(..3, FALSE),
                                            ..4$psi_n)),
         os_est_e =
           pmap(list(obs, Gbar_hat, Qbar_hat_e, est_e),
                ~ apply_one_step_correction(as.matrix(..1),
                                            wrapper(..2, FALSE),
                                            wrapper(..3, FALSE),
                                            ..4$psi_n))) %>%
  select(os_est_d, os_est_e) %>%
  pivot_longer(c(`os_est_d`, `os_est_e`),
               names_to = "type", values_to = "estimates") %>%
  extract(type, "type", "_([de])$") %>%
  mutate(type = paste0(type, "_one_step")) %>%
  unnest(estimates) %>%
  group_by(type) %>%
  mutate(sig_alt = sd(psi_n)) %>%
  mutate(clt_ = (psi_n - psi_zero) / sig_n,
         clt_alt = (psi_n - psi_zero) / sig_alt) %>%
  pivot_longer(c(`clt_`, `clt_alt`),
               names_to = "key", values_to = "clt") %>%
  extract(key, "key", "_(.*)$") %>%
  mutate(key = ifelse(key == "", TRUE, FALSE)) %>%
  rename("auto_renormalization" = key)

(bias_de_one_step <- psi_hat_de_one_step %>%
   group_by(type, auto_renormalization) %>%
   summarize(bias = mean(clt)) %>% ungroup)
#> # A tibble: 4 × 3
#>   type       auto_renormalization     bias
#>   <chr>      <lgl>                   <dbl>
#> 1 d_one_step FALSE                -0.0142 
#> 2 d_one_step TRUE                 -0.0307 
#> 3 e_one_step FALSE                 0.0126 
#> 4 e_one_step TRUE                 -0.00460

fig <- ggplot() +
  geom_line(aes(x = x, y = y),
            data = tibble(x = seq(-4, 4, length.out = 1e3),
                          y = dnorm(x)),
            linetype = 1, alpha = 0.5) +
  geom_density(aes(clt, fill = type, colour = type),
               psi_hat_de_one_step, alpha = 0.1) +
  geom_vline(aes(xintercept = bias, colour = type),
             bias_de_one_step, size = 1.5, alpha = 0.5) +
  facet_wrap(~ auto_renormalization,
             labeller =
               as_labeller(c(`TRUE` = "auto-renormalization: TRUE",
                             `FALSE` = "auto-renormalization: FALSE")),
             scales = "free")

fig +
  labs(y = "",
       x = bquote(paste(sqrt(n/v[n]^{list(d, e, os)})*
                          (psi[n]^{list(d, e, os)} - psi[0]))))
```

It seems that the one-step correction performs quite well (in particular,
compare `bias_de` with `bias_de_one_step`):

```
bind_rows(bias_de, bias_de_one_step) %>%
  filter(!auto_renormalization) %>%
  arrange(type)
#> # A tibble: 4 × 3
#> type auto_renormalization bias
#> <chr> <lgl> <dbl>
#> 1 d FALSE 0.234
#> 2 d_one_step FALSE -0.0142
#> 3 e FALSE 0.107
#> 4 e_one_step FALSE 0.0126
```

What about the estimation of the asymptotic variance, and the mean squared error of the estimators?

```
psi_hat_de %>%
  full_join(psi_hat_de_one_step) %>%
  filter(auto_renormalization) %>%
  group_by(type) %>%
  summarize(sd = mean(sig_n),
            se = sd(psi_n),
            mse = mean((psi_n - psi_zero)^2) * n()) %>%
  arrange(type)
#> # A tibble: 4 × 4
#> type sd se mse
#> <chr> <dbl> <dbl> <dbl>
#> 1 d 0.00206 0.0171 0.309
#> 2 d_one_step 0.0173 0.0171 0.291
#> 3 e 0.00288 0.0172 0.298
#> 4 e_one_step 0.0166 0.0175 0.305
```

The `sd` (*estimator* of the asymptotic standard deviation) and `se`
(*empirical* standard deviation) entries are similar. This indicates that the
inference of the asymptotic variance of the one-step estimators based on the
influence curve is rather accurate for both the `d`- and `e`-variants that we
implemented. As for the mean squared error, it is made slightly smaller by the
one-step update for type `d` and slightly bigger for type `e`, the
`d_one_step` estimator exhibiting the smallest mean squared error.

## 9.4 ⚙ Investigating further the one-step correction methodology

1. Use `estimate_Gbar` to create an oracle algorithm \(\Algora_{\Gbar,s}\) for the estimation of \(\Gbar_{0}\) that, for any \(s > 0\), estimates \(\Gbar_{0}(w)\) with \[\begin{equation*} \Gbar_{n}(w) \defq \Algora_{\Gbar,s} (P_{n})(w) \defq \expit\left(\logit\left(\Gbar_{0}(w)\right) + s Z\right) \end{equation*}\] where \(Z\) is a standard normal random variable.^{20} What would happen if one chose \(s=0\) in the above definition? What happens when \(s\) converges to 0? Explain why the algorithm is said to be an *oracle algorithm*.

2. Use `estimate_Qbar` to create an oracle algorithm \(\Algora_{\Qbar,s}\) for the estimation of \(\Qbar_{0}\) that, for any \(s > 0\), estimates \(\Qbar_{0}(a,w)\) with \[\begin{equation*} \Qbar_{n}(a,w) \defq \Algora_{\Qbar,s} (P_{n})(a,w) \defq \expit\left(\logit\left(\Qbar_{0}(a,w)\right) + s Z\right) \end{equation*}\] where \(Z\) is a standard normal random variable. The comments made about \(\Algora_{\Gbar,s}\) in the above problem also apply to \(\Algora_{\Qbar,s}\).

3. Reproduce the simulation study developed in Sections 8.4.3 and 9.3 with the oracle algorithms \(\Algora_{\Gbar,s}\) and \(\Algora_{\Qbar,s'}\) substituted for the algorithms used in these sections. Change the values of \(s,s' > 0\) and compare how well the estimating procedure performs depending on the product \(ss'\). What do you observe? We invite you to refer to Section C.2.1.

Note that the algorithm does not really need to be trained.↩︎
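As a warm-up for these problems, the logit-scale perturbation that defines the oracle algorithms is easy to probe numerically. The following sketch is ours (plain Python, independent of `estimate_Gbar` and `estimate_Qbar`; the function name `oracle_perturb` is hypothetical):

```python
import numpy as np

def expit(x):
    # inverse of logit: maps the real line to (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    return np.log(p / (1.0 - p))

def oracle_perturb(p_truth, s, rng):
    """Perturb a true probability p_truth on the logit scale with noise s*Z."""
    Z = rng.standard_normal()
    return expit(logit(p_truth) + s * Z)

rng = np.random.default_rng(0)
p = 0.3
# with s = 0 the perturbation vanishes and the output is the truth itself
print(oracle_perturb(p, 0.0, rng))
# as s shrinks, the output concentrates around the truth
print([oracle_perturb(p, s, rng) for s in (1.0, 0.1, 0.01)])
```

The parameter \(s\) thus tunes how far the "estimate" strays from the truth, which is what makes these algorithms convenient probes of the conditions underlying the one-step correction.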