Section 10 Targeted minimum loss-based estimation
10.1 Motivations
10.1.1 Falling outside the parameter space
Section 9 introduced the one-step corrected estimator \(\psinos\) of \(\psi_0\). It is obtained by adding a correction term to an initial plug-in estimator \(\Psi(\Phat_{n})\), resulting in an estimator that is asymptotically linear with influence curve \(\IC = D^{*}(P_{0})\) under mild conditions (see Section 9.2 and the detailed proof there in Appendix C.2).
Unfortunately, the one-step estimator lacks a desirable feature of plug-in estimators: a plug-in estimator always lies in the parameter space, whereas a one-step estimator need not. For example, because \(\Qbar_{0}\) takes its values in \([0,1]\), it must be the case that \(\psi_0 = \Psi(P_{0}) \in [-1,1]\); yet it may happen that \(\psinos \not\in[-1,1]\).
It is typically easy to shape an algorithm \(\Algo_{\Qbar}\) for the estimation of \(\Qbar_0\) to output estimators \(\Qbar_n\) that, like \(\Qbar_{0}\), take their values in \([0,1]\). The plug-in estimator \(\psi_{n}\) (8.1) based on such a \(\Qbar_{n}\) necessarily satisfies \(\psi_{n} \in [-1,1]\). However, the one-step estimator derived from \(\psi_{n}\) may fall outside of the interval if the correction term \(P_{n} D^{*} (\Phat_{n})\) is large.
10.1.2 The influence curve equation
Upon closer examination of the influence curve \(D^*(\Phat_{n})\), we see that this may occur more frequently when \(\ell \Gbar_n(A_i,W_i)\) is close to zero for at least some \(1 \leq i\leq n\). In words, this may happen if there are observations in our data set that we observed under actions \(A_i\) that were estimated to be unlikely given their respective contexts \(W_i\). In such cases, \(D^*(\Phat_n)(O_i)\), and consequently the one-step correction term, may be large and cause the one-step estimate to fall outside \([-1,1]\).
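To fix ideas, here is a toy numerical illustration (with made-up values, not taken from our running example) of how the \(D_{2}^{*}\)-component of \(D^*(\Phat_n)(O_i)\) scales with \(\ell \Gbar_n(A_i,W_i)\):

## toy illustration with made-up numbers: one observation with A_i = 1,
## Y_i = 0.9 and Qbar_n(A_i, W_i) = 0.4, for decreasing values of lGbar_n(A_i, W_i)
lGAW_toy <- c(0.5, 0.1, 0.01, 0.001)
(2 * 1 - 1) / lGAW_toy * (0.9 - 0.4)
#> [1]   1   5  50 500

Dividing by an estimated likelihood of 0.001 instead of 0.5 inflates the contribution of this single observation by a factor of 500.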
Another way to understand this behavior is to recognize the one-step estimator as an initial plug-in estimator that is corrected in the parameter space of \(\psi_0\). One of the pillars of targeted learning is to perform, instead, a correction in the parameter space of \(\Qbar_0\).
In particular, consider a law \(P_{n}^{*}\) estimating \(P_{0}\) that is compatible with the estimators \(\Qbar_n^*\), \(\Gbar_n\), and \(Q_{n,W}\), but moreover is such that \[\begin{equation} P_n D^*(P_{n}^*) = 0 \tag{10.1} \end{equation}\] or, at the very least, \[\begin{equation} P_n D^*(P_{n}^*) = o_{P_0}(1/\sqrt{n}). \tag{10.2} \end{equation}\] Achieving (10.1) is called solving the efficient influence curve equation. Likewise, achieving (10.2) is called approximately solving the efficient influence curve equation.
If such estimators can indeed be generated, then the plug-in estimator \[\begin{equation*} \psi_n^* \defq \Psi(P_{n}^{*}) = \int \left(\Qbar_n^*(1,w) - \Qbar_n^*(0,w)\right) dQ_{n,W}(w) \end{equation*}\] is asymptotically linear with influence curve \(\IC = D^{*}(P_{0})\) under mild conditions. Moreover, by virtue of its plug-in construction, \(\psi_n^*\) always obeys the bounds of the parameter space in finite samples.
10.1.3 A basic fact on the influence curve equation
Our strategy for constructing such a plug-in estimate begins by generating an initial estimator \(\Qbar_n\) of \(\Qbar_0\) based on an algorithm \(\Algo_{\Qbar}\) and an estimator \(\Gbar_n\) of \(\Gbar_0\) based on an algorithm \(\Algo_{\Gbar}\). These initial estimators should strive to be as close as possible to their respective targets. We use the empirical distribution \(Q_{n,W}\) as an estimator of \(Q_{0,W}\).
Now, recall the definition of \(D_{1}^{*}\) (3.4) and note that for any estimator \(\Qbar_n^*\) of \(\Qbar_0\) and any law \(P_n^{*}\) compatible with \(\Qbar_n^*\) and \(Q_{n,W}\), \[\begin{equation*} P_{n} D_{1}^{*} (P_{n}^{*}) = 0. \end{equation*}\] The proof can be found there in Appendix C.4.1.
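In a nutshell (the formal argument is the object of Appendix C.4.1), recalling from (3.4) that \(D_{1}^{*}(P)(O)\) takes the form \(\Qbar(1,W) - \Qbar(0,W) - \Psi(P)\), and since \(Q_{n,W}\) puts mass \(1/n\) on each \(W_{i}\), \[\begin{equation*} P_{n} D_{1}^{*}(P_{n}^{*}) = \frac{1}{n}\sum_{i=1}^{n} \left(\Qbar_{n}^{*}(1,W_{i}) - \Qbar_{n}^{*}(0,W_{i})\right) - \Psi(P_{n}^{*}) = 0, \end{equation*}\] because the average over the \(W_{i}\)s is exactly the integral with respect to \(Q_{n,W}\) that defines \(\Psi(P_{n}^{*})\) for a law \(P_{n}^{*}\) compatible with \(\Qbar_{n}^{*}\) and \(Q_{n,W}\).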
In words, this shows that so long as we use the empirical distribution \(Q_{n,W}\) to estimate \(Q_{0,W}\), by its very construction achieving (10.1) or (10.2) is equivalent to solving \[\begin{equation} P_n D_{2}^*(P_{n}^*) = 0 \tag{10.3} \end{equation}\] or \[\begin{equation} P_n D_{2}^*(P_{n}^*) = o_{P_0}(1/\sqrt{n}). \tag{10.4} \end{equation}\]
It thus remains to devise a strategy for ensuring that \(P_n D_2^*(P_n^*) = 0\) or \(o_{P_0}(1/\sqrt{n})\).
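For later use, recall also the form of the second component of the efficient influence curve: for \(P \in \calM\) with features \(\Gbar\) and \(\Qbar\), \[\begin{equation*} D_{2}^{*}(P)(O) = \frac{2A - 1}{\ell\Gbar(A,W)} \left(Y - \Qbar(A,W)\right), \end{equation*}\] an expression that resurfaces both in the fluctuation (10.6) and in the score equation (10.8) below.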
10.2 Targeted fluctuation
Our approach to satisfying (10.3) hinges on revising our initial estimator \(\Qbar_n\) of \(\Qbar_0\). We propose a method for building an estimator of \(\Qbar_0\) that is “near to” \(\Qbar_n\), but is such that for any law \(P_n^*\) that is compatible with this revised estimator, \(P_n D^*(P_n^*) = 0\). Because \(\Qbar_n\) is our best (initial) estimator of \(\Qbar_0\), the new estimator that we shall build should be at least as good an estimator of \(\Qbar_0\) as \(\Qbar_n\). So first, we must propose a way to move between regression functions, and then we must propose a way to move to a new regression function that fits the data at least as well as \(\Qbar_n\).
Instead of writing “propos[ing] a way to move between regression functions” we may also have written “proposing a way to fluctuate a regression function,” thus suggesting very opportunely that the notion of fluctuation as discussed in Section 3.3 may prove instrumental to achieve the former objective.
10.2.1 ☡ Fluctuating indirectly
Let us resume the discussion where we left it at the end of Section 3.3.1. It is easy to show that the fluctuation \(\{P_{h} : h \in H\}\) of \(P\) in direction of \(s\) in \(\calM\) induces a fluctuation \(\{\Qbar_{h} : h \in H\}\) of \(\Qbar = \left.\Qbar_{h}\right|_{h=0}\) in the space \(\calQ \defq \{\Qbar : P \in \calM\}\) of regression functions induced by model \(\calM\). Specifically, we show there in Appendix C.4.2 that, for every \(h \in H\), the conditional mean \(\Qbar_{h}(A,W)\) of \(Y\) given \((A,W)\) under \(P_{h}\) is given by \[\begin{equation*}\Qbar_{h}(A,W) \defq \frac{\Qbar(A,W) + h \Exp_{P}(Ys(O) | A,W)}{1 + h \Exp_{P}(s(O)|A,W)}.\end{equation*}\]
We note that if \(s(O)\) depends on \(O\) only through \((A,W)\) then, abusing notation and writing \(s(A,W)\) for \(s(O)\), \[\begin{align} \Qbar_{h}(A,W) &= \frac{\Qbar(A,W) + h \Exp_{P}(Ys(A,W) | A,W)}{1 + h \Exp_{P}(s(A,W)|A,W)}\notag\\&=\frac{\Qbar(A,W) + h s(A,W)\Qbar(A,W)}{1 + h s(A,W)}\notag\\&=\Qbar(A,W).\tag{10.5} \end{align}\] In words, \(\Qbar\) is not fluctuated at all, that is, the laws \(P_{h}\) that are elements of the fluctuation share the same conditional mean of \(Y\) given \((A,W)\).
However, we find it easier in the present context, notably from a computational perspective, to fluctuate \(\Qbar_{n}\) directly in \(\calQ\), as opposed to indirectly through a fluctuation, defined in \(\calM\), of a law compatible with \(\Qbar_{n}\). The next section introduces such a direct fluctuation.
10.2.2 Fluctuating directly
Fix an arbitrary \(\Qbar \in \calQ\). The (direct) fluctuation of \(\Qbar\) that we propose to consider depends on a user-supplied \(\Gbar\). For any \(\Gbar\) such that \(0 < \ell\Gbar(A,W) < 1\), \(P_{0}\)-almost surely, the \(\Gbar\)-specific fluctuation model for \(\Qbar\) is \[\begin{equation} \calQ(\Qbar,\Gbar)\defq \left\{(a,w) \mapsto \Qbar_{h}(a,w) \defq \expit\left(\logit\left(\Qbar(a,w)\right) + h \frac{2a - 1}{\ell \Gbar(a,w)} \right) : h \in \bbR\right\} \subset \calQ. \tag{10.6} \end{equation}\]
Fluctuation \(\calQ(\Qbar,\Gbar)\) is a one-dimensional parametric model (indexed by the real-valued parameter \(h\)) that goes through \(\Qbar\) (at \(h=0\)). For each \(h \in \bbR\), \(\Qbar_{h} \in \calQ\) is the conditional mean of \(Y\) given \((A,W)\) under infinitely many laws \(P \in \calM\).
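Before visualizing, here is a minimal sketch (for illustration only) of how an element of \(\calQ(\Qbar,\Gbar)\) can be coded, assuming that \(\Qbar\) takes a matrix with columns A and W and that \(\Gbar\) maps \(w\) to the conditional probability that \(A=1\) given \(W=w\); the fluctuate function used in the next chunk, which comes from the book's companion package, may be implemented differently.

## a minimal sketch of (10.6); not the companion package's implementation
fluctuate_sketch <- function(Qbar, Gbar, h) {
  function(AW) {
    A <- AW[, "A"]
    W <- AW[, "W"]
    lGAW <- A * Gbar(W) + (1 - A) * (1 - Gbar(W))    # ell Gbar(A, W)
    expit(logit(Qbar(AW)) + h * (2 * A - 1) / lGAW)  # equation (10.6)
  }
}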
The following chunk of code represents three elements of the fluctuations \(\calQ(\Qbar_{0}, \Gbar_{0})\) and \(\calQ(\Qbar_{n,\text{trees}}, \Gbar_{0})\), where \(\Qbar_{n,\text{trees}}\) is an estimator of \(\Qbar_{0}\) derived by the boosted trees algorithm (see Section 7.6.3).
Qbar_hminus <- fluctuate(Qbar, Gbar, h = -1)
Qbar_hplus <- fluctuate(Qbar, Gbar, h = +1)

Qbar_trees <- wrapper(Qbar_hat_trees, FALSE)
Qbar_trees_hminus <- fluctuate(Qbar_trees, Gbar, h = -1)
Qbar_trees_hplus <- fluctuate(Qbar_trees, Gbar, h = +1)
tibble(w = seq(0, 1, length.out = 1e3)) %>%
mutate(truth_0_1 = Qbar(cbind(A = 1, W = w)),
truth_0_0 = Qbar(cbind(A = 0, W = w)),
trees_0_1 = Qbar_trees(cbind(A = 1, W = w)),
trees_0_0 = Qbar_trees(cbind(A = 0, W = w)),
truth_hminus_1 = Qbar_hminus(cbind(A = 1, W = w)),
truth_hminus_0 = Qbar_hminus(cbind(A = 0, W = w)),
trees_hminus_1 = Qbar_trees_hminus(cbind(A = 1, W = w)),
trees_hminus_0 = Qbar_trees_hminus(cbind(A = 0, W = w)),
truth_hplus_1 = Qbar_hplus(cbind(A = 1, W = w)),
truth_hplus_0 = Qbar_hplus(cbind(A = 0, W = w)),
trees_hplus_1 = Qbar_trees_hplus(cbind(A = 1, W = w)),
trees_hplus_0 = Qbar_trees_hplus(cbind(A = 0, W = w))) %>%
pivot_longer(-w, names_to = "f", values_to = "value") %>%
extract(f, c("f", "h", "a"), "([^_]+)_([^_]+)_([01]+)") %>%
mutate(f = ifelse(f == "truth", "Q_0", "Q_n"),
h = factor(ifelse(h == "0", 0, ifelse(h == "hplus", 1, -1)))) %>%
mutate(a = paste0("a=", a),
fh = paste0("(", f, ",", h, ")")) %>%
  ggplot() +
  geom_line(aes(x = w, y = value, color = h, linetype = f, group = fh),
size = 1) +
labs(y = bquote(paste(f[h](a,w))),
title = bquote("Visualizing three elements of two fluctuations of"
~ bar(Q)[0] ~ "and" ~ bar(Q)[n])) +
ylim(NA, 1) +
facet_wrap(~ a)
10.2.3 ⚙ More on fluctuations
1. Justify the series of equalities in (10.5).
2. Justify that, in fact, if \(s(O)\) depends on \(O\) only through \((A,W)\) then the conditional law of \(Y\) given \((A,W)\) under \(P_{h}\) equals that under \(P\).
10.2.4 Targeted roaming of a fluctuation
Recall that our goal is to build an estimator of \(\Qbar_0\) that is at least as good as \(\Qbar_n\). We will look for this enhanced estimator in the fluctuation \(\calQ(\Qbar_{n}, \Gbar_{n})\), where \(\Gbar_{n}\) is an estimator of \(\Gbar_{0}\) that is bounded away from 0 and 1. The estimator thus takes the form \(\Qbar_{n,h_{n}}\) for some data-driven \(h_{n} \in \bbR\). We will explain later why we proceed in this way.
To assess the performance of all the candidates included in the fluctuation, we formally rely on the empirical risk function \(h \mapsto \Exp_{P_{n}} \left(L_{y} (\Qbar_{n,h})(O)\right)\) where \[\begin{equation*}\Exp_{P_{n}} \left(L_{y} (\Qbar_{n,h})(O)\right) = \frac{1}{n} \sum_{i=1}^{n} L_{y} (\Qbar_{n,h})(O_{i})\end{equation*}\] and the logistic (or negative binomial) loss function \(L_{y}\) is given by \[\begin{equation*} -L_{y}(f)(O) \defq Y \log f(A,W) + (1 - Y) \log \left(1 - f(A,W)\right) \tag{10.7} \end{equation*}\] for any function \(f:\{0,1\} \times [0,1] \to [0,1]\). This loss function is the counterpart of the loss function \(L_{a}\) defined in (7.1). The justification that we gave in Section 7.4.1 of the relevance of \(L_{a}\) also applies here, mutatis mutandis. In summary, the oracle risk \(\Exp_{P_{0}} \left(L_{y} (f)(O)\right)\) of \(f\) is a real-valued measure of discrepancy between \(\Qbar_{0}\) and \(f\); \(\Qbar_{0}\) minimizes \(f \mapsto \Exp_{P_{0}} \left(L_{y} (f)(O)\right)\) over the set of all (measurable) functions \(f:\{0,1\} \times [0,1] \to [0,1]\); and a minimizer \(h_{0} \in \bbR\) of \(h \mapsto \Exp_{P_{0}} \left(L_{y} (\Qbar_{n,h})(O)\right)\) identifies the element of fluctuation \(\calQ(\Qbar_{n}, \Gbar_{n})\) that is closest to \(\Qbar_{0}\) (some details are given there in Appendix B.6).
It should not come as a surprise after this discussion that the aforementioned data-driven \(h_{n}\) is chosen to be the minimizer of the empirical risk function, i.e., \[\begin{equation*}h_{n} \defq \mathop{\arg\min}_{h \in \bbR} \Exp_{P_{n}} \left(L_{y} (\Qbar_{n,h})(O)\right).\end{equation*}\] The criterion is convex in \(h\), so the optimization program is well-posed.
The next chunk of code illustrates the search for an approximation of \(h_{n}\) over a grid of candidate values.
candidates <- seq(-0.01, 0.01, length.out = 1e4)
W <- obs[1:1e3, "W"]
A <- obs[1:1e3, "A"]
Y <- obs[1:1e3, "Y"]
lGAW <- compute_lGbar_hatAW(A, W, Gbar_hat)
QAW <- compute_Qbar_hatAW(A, W, Qbar_hat_trees)
risk <- sapply(candidates,
               function(h) {
                 QAW_h <- expit(logit(QAW) + h * (2 * A - 1) / lGAW)
                 -mean(Y * log(QAW_h) + (1 - Y) * log(1 - QAW_h))
               })
idx_min <- which.min(risk)
idx_zero <- which.min(abs(candidates))[1]

labels <- c(expression(R[n](bar(Q)[list(n,hn)]^list(o))),
            expression(R[n](bar(Q)[list(n,0)]^list(o))))

risk %>% enframe %>%
  mutate(h = candidates) %>%
filter(abs(h - h[idx_min]) <= abs(h[idx_min])) %>%
ggplot() +
geom_point(aes(x = h, y = value), color = "#CC6666") +
geom_vline(xintercept = c(0, candidates[idx_min])) +
geom_hline(yintercept = c(risk[idx_min], risk[idx_zero])) +
scale_y_continuous(
bquote("empirical logistic risk, " ~ R[n](bar(Q)[list(n,h)]^list(o))),
trans = "exp", labels = NULL,
sec.axis = sec_axis(~ .,
breaks = c(risk[idx_min], risk[idx_zero]),
labels = labels))
Figure 10.2 reveals how moving slightly away from \(h=0\) to the left (i.e., to \(h_{n}\) equal to -0.003, rounded to three decimal places) entails a decrease in the empirical risk. The gain is modest at the scale of the empirical risk, but considerable in terms of inference, as we explain below.
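As a side note (this is not the book's code), since (10.6) is nothing but a logistic regression model with offset \(\logit(\Qbar_{n}(A,W))\) and single covariate \((2A-1)/\ell\Gbar_{n}(A,W)\) (the so-called clever covariate), the minimizer \(h_{n}\) can also be computed exactly with glm, reusing the objects QAW, lGAW, A and Y from the previous chunk:

## exact computation of h_n via a logistic regression with offset;
## quasibinomial() avoids warnings when Y is not binary
H <- (2 * A - 1) / lGAW
fit <- glm(Y ~ -1 + H + offset(logit(QAW)), family = quasibinomial())
coef(fit)[["H"]]  # should be close to candidates[idx_min]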
10.2.5 Justifying the form of the fluctuation
Let us define \(\Qbar_{n}^{*} \defq \Qbar_{n, h_{n}}\). We justify in two steps our assertion that moving from \(\left.\Qbar_{n}\right|_{h=0} = \Qbar_{n}\) to \(\Qbar_{n}^{*}\) along fluctuation \(\calQ(\Qbar_{n}, \Gbar_{n})\) has a considerable impact for the inference of \(\psi_{0}\).
First, we note that there is no need to iterate the updating procedure. Specifically, even if we tried to fluctuate \(\Qbar_{n}^{*}\) along the fluctuation \(\calQ(\Qbar_{n}^{*}, \Gbar_{n}) = \{\Qbar_{n,h'}^{*} : h' \in \bbR\}\) defined like \(\calQ(\Qbar_{n}, \Gbar_{n})\) (10.6) with \(\Qbar_{n,h_{n}}\) substituted for \(\Qbar_{n}\), then we would not move at all. This is because there is a one-to-one smooth correspondence between the parameter \(h'\) indexing \(\calQ(\Qbar_{n}^{*}, \Gbar_{n})\) and the parameter \(h\) indexing \(\calQ(\Qbar_{n}, \Gbar_{n})\) (namely, \(h' = h - h_{n}\)), and because the derivative of (the real-valued function over \(\bbR\)) \(h \mapsto \Exp_{P_n} \left(L_{y} (\Qbar_{n,h})(O)\right)\) evaluated at its minimizer \(h_{n}\) equals 0. Equivalently (see there in Appendix C.4.3 for a justification of the last but one equality below),
\[\begin{align}\notag \frac{d}{d h} & \left. \Exp_{P_{n}} \left(L_{y}(\Qbar_{n,h}^{*})(O)\right)\right|_{h=0} \\ \notag &= -\frac{1}{n}\left.\frac{d}{d h} \sum_{i=1}^{n} \left(Y_{i} \log \Qbar_{n,h}^{*}(A_{i}, W_{i}) + (1 - Y_{i}) \log\left(1 - \Qbar_{n,h}^{*}(A_{i}, W_{i})\right)\right)\right|_{h=0} \\ \notag &= -\frac{1}{n}\sum_{i=1}^{n} \left(\frac{Y_{i}}{\Qbar_{n}^{*}(A_{i}, W_{i})} - \frac{1 - Y_{i}}{1 - \Qbar_{n}^{*}(A_{i}, W_{i})}\right) \times \left.\frac{d}{d h} \Qbar_{n,h}^{*}(A_{i}, W_{i})\right|_{h=0} \\ \notag&= -\frac{1}{n}\sum_{i=1}^{n} \frac{Y_{i} - \Qbar_{n}^{*}(A_{i}, W_{i})}{\Qbar_{n}^{*}(A_{i}, W_{i}) \times \left(1 - \Qbar_{n}^{*}(A_{i}, W_{i})\right)} \left.\frac{d}{d h} \Qbar_{n,h}^{*}(A_{i}, W_{i})\right|_{h=0} \\ &= -\frac{1}{n}\sum_{i=1}^{n} \frac{2A_{i} - 1}{\ell\Gbar_{n}(A_{i}, W_{i})} \left(Y_{i} - \Qbar_{n}^{*}(A_{i}, W_{i})\right) = 0. \tag{10.8}\end{align}\]
Second, let \(P_{n}^{*}\) be a law in \(\calM\) such that the \(Q_{W}\), \(\Gbar\) and \(\Qbar\) features of \(P_{n}^{*}\) equal \(Q_{n,W}\), \(\Gbar_{n}\) and \(\Qbar_{n}^{*}\), respectively. In other words, \(P_{n}^{*}\) is compatible with \(Q_{n,W}\), \(\Gbar_{n}\) and \(\Qbar_{n}^{*}\). Note that the last equality in (10.8) can be equivalently written as \[\begin{equation*}P_{n} D_{2}^{*} (P_{n}^{*}) = 0.\end{equation*}\] Thus \(P_{n}^{*}\) solves (10.3). Now, we have already argued in Section 10.1.3 that it also holds that \[\begin{equation*}P_{n} D_{1}^{*} (P_{n}^{*}) = 0.\end{equation*}\] Consequently, \(P_{n}^{*}\) solves the efficient influence curve equation, that is, satisfies (10.1). As argued in Section 10.1.2, it thus follows that the plug-in estimator \[\begin{equation*} \psi_n^* \defq \Psi(P_{n}^{*}) = \int \left(\Qbar_n^*(1,w) - \Qbar_n^*(0,w)\right) dQ_{n,W}(w) \end{equation*}\] is asymptotically linear with influence curve \(\IC = D^{*}(P_{0})\) under mild conditions. Moreover, by virtue of its plug-in construction, \(\psi_n^*\) always obeys the bounds of the parameter space in finite samples.
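As a numerical sanity check (again, not the book's code), one can verify, reusing QAW, lGAW, A, Y, candidates and idx_min from the chunk of Section 10.2.4, that the empirical score is essentially zero at the grid-approximated minimizer:

## P_n D_2^*(P_n^*) evaluated at the grid-approximated h_n: close to zero
## (it would be exactly zero at the true minimizer)
hn <- candidates[idx_min]
QAW_star <- expit(logit(QAW) + hn * (2 * A - 1) / lGAW)
mean((2 * A - 1) / lGAW * (Y - QAW_star))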
10.2.6 ⚙ Alternative fluctuation
The following exercises will have you consider an alternative means of performing a fluctuation. For \(a = 0, 1\), consider the following fluctuation model for \(\Qbar^a \defq \Qbar(a,\cdot)\): \[\begin{equation} \calQ^a(\Qbar^a) \defq \left\{w \mapsto \Qbar_{h}^a(w) \defq \expit\left(\logit\left(\Qbar(a,w)\right) + h \right) : h \in \bbR\right\} \subset \calQ. \tag{10.9} \end{equation}\] Let \[\begin{equation*} \calQ^{\text{alt}}(\Qbar) \defq \left\{(a,w) \mapsto \Qbar_{h_0, h_1}(a,w) \defq a \times \Qbar_{h_{1}}^1(w) + (1-a) \Qbar_{h_{0}}^0(w) : h_{0}, h_{1} \in \bbR\right\} \end{equation*}\] be the fluctuation model for \(\Qbar\) that is implied by the two submodels for \(\Qbar^1\) and \(\Qbar^0\).
Also, for a given \(\Gbar\) such that \(0 < \ell\Gbar(A,W) < 1\), \(P_{0}\)-almost surely, and both \(a=0,1\), consider the loss function \(L_{y,\Gbar}^{a}\) given by \[\begin{equation*} L_{y,\Gbar}^{a} (f) (O) \defq\frac{\one\{A = a\}}{\ell\Gbar(A, W)} L_{y} (f)(O) \end{equation*}\] for any function \(f:\{0,1\} \times [0,1] \to [0,1]\), where \(L_{y}\) is defined in (10.7) above. It yields the empirical risk function \[\begin{equation*} (h_{0}, h_{1}) \mapsto \sum_{a = 0,1} \Exp_{P_{n}} \left(L^a_{y, \Gbar} (\Qbar^a_{n,h_a})(O)\right). \end{equation*}\]
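As a possible aid for the exercises below, here is a hypothetical sketch (not the book's code) showing that, for each arm \(a\), the minimizer of the \(a\)-specific empirical risk can be obtained as the intercept of a weighted logistic regression with offset, reusing W, A, Y from the chunk of Section 10.2.4 together with the compute_lGbar_hatAW and compute_Qbar_hatAW helpers:

## hypothetical sketch: per-arm minimizer of the empirical risk based on
## L^a_{y, Gbar}, as the intercept of a weighted, offset logistic regression
fit_arm <- function(a) {
  aa <- rep(a, length(W))
  Qa_W <- compute_Qbar_hatAW(aa, W, Qbar_hat_trees)  # Qbar_n(a, W)
  lGa_W <- compute_lGbar_hatAW(aa, W, Gbar_hat)      # ell Gbar_n(a, W)
  wts <- as.numeric(A == a) / lGa_W                  # 1{A = a} / ell Gbar_n(a, W)
  glm(Y ~ offset(logit(Qa_W)), weights = wts, family = quasibinomial())
}
h0_n <- coef(fit_arm(0))[["(Intercept)"]]
h1_n <- coef(fit_arm(1))[["(Intercept)"]]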
1. Comment on the differences between this fluctuation model and loss function and those discussed above.
2. Visualize three elements of \(\calQ^{\text{alt}}(\Qbar_0)\) and \(\calQ^{\text{alt}}(\Qbar_n)\): choose three values for \((h_0, h_1)\) and reproduce Figure 10.1.
3. Argue that in order to minimize the empirical risk over all \((h_0, h_1) \in \bbR^2\), we may minimize over \(h_0 \in \bbR\) and \(h_1 \in \bbR\) separately.
4. Visualize how the empirical risk with \(\Gbar = \Gbar_n\) varies for different elements \(\Qbar_{n, h_0, h_1}\) of \(\calQ^{\text{alt}}(\Qbar_n)\). For simplicity (and with the justification of problem 3), you may wish to make a separate figure (like Figure 10.2) for \(a = 0, 1\) that illustrates how the empirical risk varies as a function of \(h_a\) while setting \(h_{1 - a} = 0\).
5. ☡ Justify the validity of \(L^a_{y, \Gbar}\) as a loss function for \(\Qbar^a_0\): show that amongst all functions that map \(w\) to \((0,1)\), the true risk \(\Exp_{P_0}\left(L^a_{y, \Gbar}(f)(O)\right)\) is minimized when \(f = \Qbar^a_0\).
6. ☡ Argue that your answer to problem 2 also implies that the summed loss function \(\sum_{a=0,1} L^a_{y, \Gbar}\) is valid for \(\Qbar\).
7. ☡ Justify the combination of this loss function and fluctuation by repeating the calculation in equation (10.8) above, mutatis mutandis.
10.3 Summary and perspectives
The procedure laid out in Section 10 is called targeted minimum loss-based estimation (TMLE). The nomenclature reflects how the procedure unfolds. We first generated an (un-targeted) initial estimator of \(\Psi(P_0)\) by substituting for \(P_{0}\) a law \(\Phat_{n}\) compatible with initial estimators of the \(\Psi\)-specific relevant nuisance parameters. Then, through loss minimization, we built a targeted estimator by substituting for \(\Phat_{n}\) a law \(P_{n}^{*}\) compatible with the nuisance parameter estimators that we updated in a targeted fashion.
The TMLE procedure was introduced in 2006 by Mark van der Laan and Dan Rubin (Laan and Rubin 2006). It has since been developed and applied in a great variety of contexts. We refer to the monographs (Laan and Rose 2011) and (Laan and Rose 2018) for a rich overview.
In summary, targeted learning bridges the gap between formal inference of finite-dimensional parameters \(\Psi(P_{0})\) of the law \(P_{0}\) of the data, via bootstrapping or influence curves, and data-adaptive, loss-based, machine learning estimation of \(\Psi\)-specific infinite-dimensional features thereof (the so-called nuisance parameters). A typical example concerns the super learning-based, targeted estimation of effect parameters defined as identifiable versions of causal quantities. The TMLE algorithm fully integrates the estimation of the relevant nuisance parameters by super learning (Laan, Polley, and Hubbard 2007). Under mild assumptions, the targeting step removes the bias of the initial estimators of the targeted effects. The resulting TMLEs enjoy many desirable statistical properties: among others, by being substitution estimators, they lie in the parameter space; they are often double-robust; they lend themselves to the construction of confidence regions.
The scientific community is always engaged in devising and promoting enhanced principles for sounder research through the better design of experimental and nonexperimental studies and the development of more reliable and honest statistical analyses. By focusing on prespecified analytic plans and algorithms that make realistic assumptions in more flexible nonparametric or semiparametric statistical models, targeted learning has been at the forefront of this concerted effort. In this light, targeted learning notably consists in translating knowledge about the data and underlying data-generating mechanism into a realistic model; in expressing the research question under the form of a statistical estimation problem; in analyzing the statistical estimation problem within the frame of the model; in developing ad hoc algorithms grounded in theory and tailored to the question at stake to map knowledge and data into an answer coupled with an assessment of its trustworthiness.
Quoting (Laan and Rose 2018):
Over the last decade, targeted learning has been established as a reliable framework for constructing effect estimators and prediction functions. The continued development of targeted learning has led to new solutions for existing problems in many data structures in addition to discoveries in varied applied areas. This has included work in randomized controlled trials, parameters defined by a marginal structural model, case-control studies, collaborative TMLE, missing and censored data, longitudinal data, effect modification, comparative effectiveness research, aging, cancer, occupational exposures, plan payment risk adjustment, and HIV, as well as others. In many cases, these studies compared targeted learning techniques to standard approaches, demonstrating improved performance in simulations and real-world applications.
It is now time to resume our introduction to TMLE and to carry out an empirical investigation of its statistical properties in the context of the estimation of \(\psi_{0}\).
10.4 Empirical investigation
10.4.1 A first numerical application
Let us illustrate the principle of targeted minimum loss-based estimation by updating the G-computation estimator built on the first \(n=1000\) observations in obs by relying on \(\Algo_{\Qbar,\text{kNN}}\), that is, on the algorithm for the estimation of \(\Qbar_{0}\) as it is implemented in estimate_Qbar with its argument algorithm set to the built-in kknn_algo (see Section 7.6.2). In Section 9.3, we performed a one-step correction of the same initial estimator.

The algorithm has been trained earlier on this data set and produced the object Qbar_hat_kknn. The following chunk of code prints the initial estimator psin_kknn and its one-step update psin_kknn_os, then applies the targeting step and presents the resulting estimator:
(psin_kknn)
#> # A tibble: 1 × 2
#>   psi_n   sig_n
#>   <dbl>   <dbl>
#> 1 0.101 0.00306
(psin_kknn_os)
#> # A tibble: 1 × 3
#>    psi_n  sig_n    crit_n
#>    <dbl>  <dbl>     <dbl>
#> 1 0.0999 0.0171 -0.000633
(psin_kknn_tmle <- apply_targeting_step(head(obs, 1e3),
                                        wrapper(Gbar_hat, FALSE),
                                        wrapper(Qbar_hat_kknn, FALSE)))
#> # A tibble: 1 × 3
#>    psi_n  sig_n    crit_n
#>    <dbl>  <dbl>     <dbl>
#> 1 0.0999 0.0171 -2.05e-17
In the call to apply_targeting_step, we provide (i) the data set at hand (first line), (ii) the estimator Gbar_hat of \(\Gbar_{0}\) that we built earlier by using algorithm \(\Algo_{\Gbar,1}\) (second line; see Section 7.4.2), and (iii) the estimator Qbar_hat_kknn of \(\Qbar_{0}\) (third line). Apparently, in this particular example, the one-step correction and the targeted correction update the initial estimator similarly.
10.4.2 ⚙ A computational exploration
1. Consult the man page of function apply_targeting_step (run ?apply_targeting_step) and explain the role of its input epsilon.
2. Run the chunk of code below. What does it do? Hint: check out the chunks of code of Sections 7.6.2, 8.2.1 and 10.4.1.
epsilon <- seq(-1e-2, 1e-2, length.out = 1e2)
Gbar_hat_w <- wrapper(Gbar_hat, FALSE)
Qbar_kknn <- wrapper(Qbar_hat_kknn, FALSE)

psi_trees_epsilon <- sapply(epsilon, function(h) {
  apply_targeting_step(head(obs, 1e3), Gbar_hat_w,
                       Qbar_trees, epsilon = h) %>%
    select(psi_n, crit_n) %>% as.matrix
})
idx_trees <- which.min(abs(psi_trees_epsilon[2, ]))

psi_kknn_epsilon <- sapply(epsilon, function(h) {
  apply_targeting_step(head(obs, 1e3), Gbar_hat_w,
                       Qbar_kknn, epsilon = h) %>%
    select(psi_n, crit_n) %>% as.matrix
})
idx_kknn <- which.min(abs(psi_kknn_epsilon[2, ]))

rbind(t(psi_trees_epsilon), t(psi_kknn_epsilon)) %>% as_tibble %>%
  rename("psi_n" = V1, "crit_n" = V2) %>%
  mutate(type = rep(c("trees", "kknn"), each = length(epsilon))) %>%
  ggplot() +
  geom_point(aes(x = crit_n, y = psi_n, color = type)) +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = c(psi_trees_epsilon[1, idx_trees],
                            psi_kknn_epsilon[1, idx_kknn])) +
  labs(x = bquote(paste(P[n]~D^"*", (P[list(n,h)]^o))),
       y = bquote(Psi(P[list(n,h)]^o)))
3. Discuss the fact that the colored curves in Figure 10.3 look like segments.
10.4.3 Empirical investigation
To assess more broadly the impact of the targeting step, let us apply it to the estimators that we built in Section 8.4.3. In Section 9.3, we applied the one-step correction to the same estimators.
The object learned_features_fixed_sample_size already contains the estimated features of \(P_{0}\) that are needed to perform the targeting step on the estimators \(\psi_{n}^{d}\) and \(\psi_{n}^{e}\), thus we merely have to call the function apply_targeting_step.
psi_tmle <- learned_features_fixed_sample_size %>%
  mutate(tmle_d =
pmap(list(obs, Gbar_hat, Qbar_hat_d),
~ apply_targeting_step(as.matrix(..1),
wrapper(..2, FALSE),
wrapper(..3, FALSE))),
tmle_e =
pmap(list(obs, Gbar_hat, Qbar_hat_e),
~ apply_targeting_step(as.matrix(..1),
wrapper(..2, FALSE),
wrapper(..3, FALSE)))) %>%
select(tmle_d, tmle_e) %>%
pivot_longer(c(`tmle_d`, `tmle_e`),
names_to = "type", values_to = "estimates") %>%
extract(type, "type", "_([de])$") %>%
mutate(type = paste0(type, "_targeted")) %>%
unnest(estimates) %>%
group_by(type) %>%
mutate(sig_alt = sd(psi_n)) %>%
mutate(clt_ = (psi_n - psi_zero) / sig_n,
clt_alt = (psi_n - psi_zero) / sig_alt) %>%
pivot_longer(c(`clt_`, `clt_alt`),
names_to = "key", values_to = "clt") %>%
extract(key, "key", "_(.*)$") %>%
mutate(key = ifelse(key == "", TRUE, FALSE)) %>%
rename("auto_renormalization" = key)
(bias_tmle <- psi_tmle %>%
   group_by(type, auto_renormalization) %>%
   summarize(bias = mean(clt)) %>% ungroup)
#> # A tibble: 4 × 3
#> type auto_renormalization bias
#> <chr> <lgl> <dbl>
#> 1 d_targeted FALSE -0.0147
#> 2 d_targeted TRUE -0.0313
#> 3 e_targeted FALSE 0.0132
#> 4 e_targeted TRUE -0.00407
fig <- ggplot() +
  geom_line(aes(x = x, y = y),
            data = tibble(x = seq(-4, 4, length.out = 1e3),
                          y = dnorm(x)),
            linetype = 1, alpha = 0.5) +
  geom_density(aes(clt, fill = type, colour = type),
               psi_tmle, alpha = 0.1) +
  geom_vline(aes(xintercept = bias, colour = type),
             bias_tmle, size = 1.5, alpha = 0.5) +
  facet_wrap(~ auto_renormalization,
             labeller =
               as_labeller(c(`TRUE` = "auto-renormalization: TRUE",
                             `FALSE` = "auto-renormalization: FALSE")),
             scales = "free")

fig +
  labs(y = "",
       x = bquote(paste(sqrt(n/v[n]^{"*"})*
                        (psi[n]^{"*"} - psi[0]))))
We see that the targeting step delivers as promised: the bias of the resulting estimators is reduced relative to that of the naive estimators. Comparing these results to those obtained with the one-step estimators (Section 9.3), we find that the one-step and TMLE estimators perform quite similarly.