# Section 7 Nuisance parameters

## 7.1 Anatomy of an expression

From now on, all the inference strategies that we will present unfold in two or three stages. For all of them, the first stage consists in estimating a selection of features of the law \(P_{0}\) of the experiment. Specifically, the features are chosen among \(Q_{0,W}\) (the marginal law of \(W\) under \(P_{0}\)), \(\Gbar_{0}\) (the conditional probability that \(A=1\) given \(W\) under \(P_{0}\)) and \(\Qbar_{0}\) (the conditional mean of \(Y\) given \(A\) and \(W\) under \(P_{0}\)).

In this context, because they are not the parameter of primary interest
(*i.e.*, they are not the real-valued feature \(\Psi(P_{0})\)), they are often
referred to as *nuisance parameters* of \(P_{0}\). The unflattering expression
conveys the notion that their estimation is merely an intermediate step along
our path towards an inference of the target parameter.

The features \(Q_{0,W}\), \(\Gbar_{0}\) and \(\Qbar_{0}\) are singled out because of the role they play in the definition of \(\Psi\) and of the efficient influence curve \(D^{*}(P_{0})\).

## 7.2 An algorithmic stance

In general, we can view an estimator of any feature \(f_0\) of \(P_{0}\) as the output of an algorithm \(\Algo\) that maps any element of

\[\begin{equation*} \calM^{\text{empirical}} \defq \left\{\frac{1}{m} \sum_{i=1}^{m} \Dirac(o_{i}) : m \geq 1, o_{1}, \ldots, o_{m} \in [0,1] \times \{0,1\} \times [0,1]\right\} \end{equation*}\]

to the set \(\calF\) where \(f_{0}\) is known to live. Here, \(\calM^{\text{empirical}}\) can be interpreted as the set of all possible empirical measures summarizing the outcomes of any number of replications of the experiment \(P_{0}\). In particular, \(P_{n}\) belongs to this set.
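
To make the algorithmic stance concrete, here is a toy illustration (hypothetical code, not part of `tlrider`): an algorithm is merely a function that maps a data set, standing for an element of \(\calM^{\text{empirical}}\), to an element of the set \(\calF\) where the feature lives, here the marginal mean of \(W\) under \(P_{0}\).

```
## a toy algorithm: it maps the data set summarized by an empirical measure
## (any data set with a column "W") to an estimate of the marginal mean of W
Algo_meanW <- function(dat) {
  mean(as.data.frame(dat)[["W"]])
}
## applied to the empirical measure based on the first 1000 observations of 'obs'
Algo_meanW(head(obs, 1e3))
```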

The `tlrider` package includes such template algorithms for the estimation of \(Q_{0,W}\), \(\Gbar_{0}\) and \(\Qbar_{0}\). We illustrate how they work and their use in the next sections.

## 7.3 `QW`

For instance, `estimate_QW` is an algorithm \(\Algo_{Q_{W}}\) for the estimation of the marginal law of \(W\) under \(P_{0}\) (to see its man page, simply run `?estimate_QW`). It is a map from \(\calM^{\text{empirical}}\) to the set of laws on \([0,1]\). The following chunk of code estimates \(Q_{0,W}\) based on the \(n = 1000\) first observations in `obs`:

```
QW_hat <- estimate_QW(head(obs, 1e3))
```

It is easy to sample independent observations from `QW_hat`. To do so, we create an object of class `LAW`, then set its marginal law of \(W\) to that described by `QW_hat` and specify its `sample_from` feature:

```
empirical_experiment <- LAW()
alter(empirical_experiment, QW = QW_hat)
alter(empirical_experiment, sample_from = function(n) {
  QW <- get_feature(empirical_experiment, "QW")
  W <- sample(pull(QW, "value"), n, prob = pull(QW, "weight"), replace = TRUE)
  cbind(W = W, A = NA, Y = NA)
})
W <- sample_from(empirical_experiment, 1e3) %>% as_tibble

W %>%
  ggplot() +
  geom_histogram(aes(x = W, y = stat(density)), bins = 40) +
  stat_function(fun = get_feature(experiment, "QW"), col = "red")
```

Note that all the \(W\)s sampled from `QW_hat` fall in the set \(\{W_{1}, \ldots, W_{n}\}\) of observed \(W\)s in `obs` (an obvious fact given the definition of the `sample_from` feature of `empirical_experiment`):

```
(length(intersect(pull(W, W), head(obs[, "W"], 1e3))))
#> [1] 1000
```

This is because `estimate_QW` estimates \(Q_{0,W}\) with its empirical counterpart, *i.e.*,

\[\begin{equation*}\frac{1}{n} \sum_{i=1}^{n} \Dirac(W_{i}).\end{equation*}\]
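
For intuition, the following sketch builds this empirical counterpart explicitly as a table of values and weights, mirroring the `value` and `weight` columns that the sampling code above pulls from `QW_hat` (hypothetical code, not the actual internals of `estimate_QW`):

```
## the empirical law of W based on the first 1000 observations: each observed
## value receives weight 1/n (weights of tied values add up)
W_obs <- head(obs[, "W"], 1e3)
QW_empirical <- tibble(value = W_obs) %>%
  count(value, name = "weight") %>%
  mutate(weight = weight / sum(weight))
## sampling from it, much like 'sample_from(empirical_experiment, .)' above
sample(pull(QW_empirical, "value"), 5,
       prob = pull(QW_empirical, "weight"), replace = TRUE)
```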

## 7.4 `Gbar`

Another template algorithm is built into `tlrider`: `estimate_Gbar` (to see its man page, simply run `?estimate_Gbar`). Unlike `estimate_QW`, `estimate_Gbar` needs further specification of the algorithm. The package also includes examples of such specifications.

There are two sorts of specifications, of which we say that they are either
*working model-based* or *machine learning-based*. We discuss the former sort
in the next subsection. The latter sort is discussed in Section
7.6.

### 7.4.1 Working model-based algorithms

Let us take a look at `working_model_G_one`, for instance:

```
working_model_G_one
#> $model
#> function (...)
#> {
#> trim_glm_fit(glm(family = binomial(), ...))
#> }
#> <environment: 0x56115606af08>
#>
#> $formula
#> A ~ I(W^0.5) + I(abs(W - 5/12)^0.5) + I(W^1) + I(abs(W - 5/12)^1) +
#> I(W^1.5) + I(abs(W - 5/12)^1.5)
#> <environment: 0x56115606af08>
#>
#> $type_of_preds
#> [1] "response"
#>
#> attr(,"ML")
#> [1] FALSE
```

and focus on its `model` and `formula` attributes. The former relies on the `glm` and `binomial` functions from base `R`, and on `trim_glm_fit` (which removes information that we do not need from the standard output of `glm`; simply run `?trim_glm_fit` to see the function’s man page). The latter is a `formula` that characterizes what we call a *working model* for \(\Gbar_{0}\).

In words, by using `working_model_G_one` we implicitly choose the so-called logistic loss function (the negative of the binomial log-likelihood) \(L_{a}\) given by

\[\begin{equation} \tag{7.1} -L_{a}(f)(A,W) \defq A \log f(W) + (1 - A) \log (1 - f(W)) \end{equation}\]

for any function \(f : [0,1] \to [0,1]\) paired with the working model \[\begin{equation*} \calF_{1} \defq \left\{f_{\theta} : \theta \in \bbR^{7}\right\} \end{equation*}\] where, for any \(\theta \in \bbR^{7}\), \[\begin{equation*}\logit f_{\theta} (W) \defq \theta_{0} + \sum_{j=1}^{3} \left(\theta_{j} W^{j/2} + \theta_{3+j} |W - 5/12|^{j/2}\right).\end{equation*}\]

We acted as oracles when we specified the working model: it is
*well-specified*, *i.e.*, it happens that \(\Gbar_{0}\) is the unique minimizer
of the risk entailed by \(L_{a}\) over \(\calF_{1}\): \[\begin{equation*}\Gbar_{0} =
\mathop{\arg\min}_{f_{\theta} \in \calF_{1}} \Exp_{P_{0}}
\left(L_{a}(f_{\theta})(A,W)\right).\end{equation*}\] Therefore, the estimator
\(\Gbar_{n}\) obtained by minimizing the empirical risk

\[\begin{equation*} \Exp_{P_{n}} \left(L_{a}(f_{\theta})(A,W)\right) = \frac{1}{n} \sum_{i=1}^{n} L_{a}(f_{\theta})(A_{i},W_{i}) \end{equation*}\]

over \(\calF_{1}\) estimates \(\Gbar_{0}\) consistently.
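
For concreteness, here is a minimal sketch of this empirical risk minimization carried out directly with `glm` (it mimics, but is not, what `estimate_Gbar` does internally; `obs` is the data set used throughout):

```
## fitting the working model F_1 by logistic regression, i.e., minimizing the
## empirical logistic risk (7.1) over F_1
dat <- as.data.frame(head(obs, 1e3))
fit <- glm(A ~ I(W^0.5) + I(abs(W - 5/12)^0.5) + I(W^1) + I(abs(W - 5/12)^1) +
             I(W^1.5) + I(abs(W - 5/12)^1.5),
           family = binomial(), data = dat)
## the resulting estimator of Gbar_0, evaluated on a small grid of w's
Gbar_n <- function(w) {
  predict(fit, newdata = data.frame(W = w), type = "response")
}
Gbar_n(seq(0, 1, length.out = 5))
```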

Of course, it is seldom certain in real life that the target feature, here
\(\Gbar_{0}\), belongs to the working model.^{11} Suppose for
instance that we choose a small finite-dimensional working model \(\calF_{2}\)
without acting as an oracle. Then consistency certainly fails to hold.
However, if \(\Gbar_{0}\) can nevertheless be *projected* unambiguously onto
\(\calF_{2}\) (an assumption that cannot be checked), then the estimator
converges to that projection rather than to \(\Gbar_{0}\).

### 7.4.2 Visualization

To illustrate the use of the algorithm \(\Algo_{\Gbar,1}\) obtained by combining `estimate_Gbar` and `working_model_G_one`, let us estimate \(\Gbar_{0}\) based on the first \(n = 1000\) observations in `obs`:

```
Gbar_hat <- estimate_Gbar(head(obs, 1e3), algorithm = working_model_G_one)
```

Using `compute_Gbar_hat_W`^{12} (simply run `?compute_Gbar_hat_W` to see its man page) makes it easy to compare visually the estimator \(\Gbar_{n} \defq \Algo_{\Gbar,1}(P_{n})\) with its target \(\Gbar_0\):

```
tibble(w = seq(0, 1, length.out = 1e3)) %>%
  mutate("truth" = Gbar(w),
         "estimated" = compute_Gbar_hat_W(w, Gbar_hat)) %>%
  pivot_longer(-w, names_to = "f", values_to = "value") %>%
  ggplot() +
  geom_line(aes(x = w, y = value, color = f), size = 1) +
  labs(y = "f(w)",
       title = bquote("Visualizing" ~ bar(G)[0] ~ "and" ~ hat(G)[n])) +
  ylim(NA, 1)
```

## 7.5 ⚙ `Qbar`, working model-based algorithms

A third template algorithm is built into `tlrider`: `estimate_Qbar` (to see its man page, simply run `?estimate_Qbar`). Like `estimate_Gbar`, `estimate_Qbar` needs further specification of the algorithm. The package also includes examples of such specifications, which can also be either working model-based (see Section 7.4) or machine learning-based (see Sections 7.6 and 7.7).

There are built-in specifications similar to `working_model_G_one`, *e.g.*,

```
working_model_Q_one
#> $model
#> function (...)
#> {
#> trim_glm_fit(glm(family = binomial(), ...))
#> }
#> <environment: 0x56115606af08>
#>
#> $formula
#> Y ~ A * (I(W^0.5) + I(W^1) + I(W^1.5))
#> <environment: 0x56115606af08>
#>
#> $type_of_preds
#> [1] "response"
#>
#> attr(,"ML")
#> [1] FALSE
#> attr(,"stratify")
#> [1] FALSE
```

- Drawing inspiration from Section 7.4, comment upon and use the algorithm \(\Algo_{\Qbar,1}\) obtained by combining `estimate_Qbar` and `working_model_Q_one` (a possible starting point is sketched below).
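
A possible starting point for the training step, assuming the analysis then proceeds along the lines of Section 7.4.2 (the commentary is left to you):

```
Qbar_hat <- estimate_Qbar(head(obs, 1e3), algorithm = working_model_Q_one)
```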

## 7.6 `Qbar`

### 7.6.1 `Qbar`, machine learning-based algorithms

We explained how algorithm \(\Algo_{\Gbar,1}\) is based on a working model (and
*you* did so for \(\Algo_{\Qbar,1}\)). Not all algorithms are based on working
models in this (admittedly rather narrow) sense. We propose to call those
algorithms that are not, unlike \(\Algo_{\Gbar,1}\), *machine learning-based*.

Typically, machine learning-based algorithms are more data-adaptive; they rely
on larger working models, and/or fine-tune parameters that must be calibrated,
*e.g.* by cross-validation. Furthermore,
they call for being stacked,
*i.e.*, combined by means of another outer algorithm (involving
cross-validation) into a more powerful machine learning-based
*meta-algorithm*. The super
learning methodology is a
popular stacking algorithm.

We will elaborate further on this important topic in another forthcoming part
and merely touch upon it in Section 7.8. Here, we simply
illustrate the concept with two specifications built into `tlrider`. Based on
the *\(k\)-nearest neighbors* non-parametric estimating methodology, the first
one is discussed in the next subsection. Based on *boosted trees*, another
non-parametric estimating methodology, the second one is used in the exercise
that follows the next subsection.

### 7.6.2 `Qbar`, kNN algorithm

Algorithm \(\Algo_{\Qbar,\text{kNN}}\) is obtained by combining `estimate_Qbar` and `kknn_algo`. The training of \(\Algo_{\Qbar,\text{kNN}}\) (*i.e.*, the making of the output \(\Algo_{\Qbar,\text{kNN}}(P_{n})\)) is implemented based on function `caret::train` of the `caret` (classification and regression training) package (to see its man page, simply run `?caret::train`). Some additional specifications are provided in `kknn_grid` and `kknn_control`.

In a nutshell, \(\Algo_{\Qbar,\text{kNN}}\) estimates \(\Qbar_{0}(1,\cdot)\) and \(\Qbar_{0}(0,\cdot)\) separately. Each of them is estimated by applying the \(k\)-nearest neighbors methodology as it is implemented in function `kknn::train.kknn` from the `kknn` package (to see its man page, simply run `?kknn::train.kknn`).^{13} The following chunk of code trains algorithm \(\Algo_{\Qbar,\text{kNN}}\) on \(P_{n}\):

```
Qbar_hat_kknn <- estimate_Qbar(head(obs, 1e3),
                               algorithm = kknn_algo,
                               trControl = kknn_control,
                               tuneGrid = kknn_grid)
```
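
For intuition, here is a rough sketch of what this specification does under the hood for \(\Qbar_{0}(1,\cdot)\), bypassing `caret` and calling `kknn::train.kknn` directly with the fine-tune values described in footnote 13 (hypothetical code, not the actual implementation):

```
## estimating Qbar_0(1, .) by k-nearest neighbors among the treated observations
library(kknn)
dat_1 <- as.data.frame(head(obs, 1e3))
dat_1 <- dat_1[dat_1$A == 1, ]
fit_1 <- train.kknn(Y ~ W, data = dat_1,
                    kmax = 5, distance = 2, kernel = "gaussian")
## the resulting estimator of Qbar_0(1, .), evaluated on a small grid of w's
predict(fit_1, newdata = data.frame(W = seq(0, 1, length.out = 5)))
```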

Using `compute_Qbar_hat_AW` (simply run `?compute_Qbar_hat_AW` to see its man page) makes it easy to compare visually the estimator \(\Qbar_{n,\text{kNN}} \defq \Algo_{\Qbar,\text{kNN}}(P_{n})\) with its target \(\Qbar_{0}\), see Figure 7.3.

```
fig <- tibble(w = seq(0, 1, length.out = 1e3),
              truth_1 = Qbar(cbind(A = 1, W = w)),
              truth_0 = Qbar(cbind(A = 0, W = w)),
              kNN_1 = compute_Qbar_hat_AW(1, w, Qbar_hat_kknn),
              kNN_0 = compute_Qbar_hat_AW(0, w, Qbar_hat_kknn))
```

### 7.6.3 `Qbar`, boosted trees algorithm

Algorithm \(\Algo_{\Qbar,\text{trees}}\) is obtained by combining `estimate_Qbar` and `bstTree_algo`. The training of \(\Algo_{\Qbar,\text{trees}}\) (*i.e.*, the making of the output \(\Algo_{\Qbar,\text{trees}}(P_{n})\)) is implemented based on function `caret::train` of the `caret` package. Some additional specifications are provided in `bstTree_grid` and `bstTree_control`.
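
As a sketch of what such additional specifications can look like, the grid of candidate fine-tune parameters detailed in footnote 14 and a 10-fold cross-validation control could be spelled out as follows with `caret` (the actual `bstTree_grid` and `bstTree_control` objects ship with `tlrider`; the names below are hypothetical):

```
## a grid of candidate fine-tune parameters and a cross-validation scheme
my_bstTree_grid <- expand.grid(mstop = c(10, 20, 30),
                               maxdepth = c(1, 2, 5),
                               nu = c(0.1, 0.2))
my_bstTree_control <- caret::trainControl(method = "cv", number = 10)
nrow(my_bstTree_grid)  # 18 candidate combinations
```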

In a nutshell, \(\Algo_{\Qbar,\text{trees}}\) estimates \(\Qbar_{0}(1,\cdot)\) and \(\Qbar_{0}(0,\cdot)\) separately. Each of them is estimated by boosted trees as implemented in function `bst::bst` from the `bst` (gradient boosting) package (to see its man page, simply run `?bst::bst`).^{14} The following chunk of code trains algorithm \(\Algo_{\Qbar,\text{trees}}\) on \(P_{n}\), and reveals the optimal fine-tune parameters for the estimation of \(\Qbar_{0}(1,\cdot)\) and \(\Qbar_{0}(0,\cdot)\):

```
Qbar_hat_trees <- estimate_Qbar(head(obs, 1e3),
                                algorithm = bstTree_algo,
                                trControl = bstTree_control,
                                tuneGrid = bstTree_grid)
Qbar_hat_trees %>% filter(a == "one") %>% pull(fit) %>%
  capture.output %>% tail(3) %>% str_wrap(width = 60) %>% cat
#> RMSE was used to select the optimal model using the smallest
#> value. The final values used for the model were mstop = 30,
#> maxdepth = 2 and nu = 0.2.
Qbar_hat_trees %>% filter(a == "zero") %>% pull(fit) %>%
  capture.output %>% tail(3) %>% str_wrap(width = 60) %>% cat
#> RMSE was used to select the optimal model using the smallest
#> value. The final values used for the model were mstop = 30,
#> maxdepth = 1 and nu = 0.1.
```

We can compare visually the estimators \(\Qbar_{n,\text{kNN}}\) and \(\Qbar_{n,\text{trees}} \defq \Algo_{\Qbar,\text{trees}}(P_{n})\) with their target \(\Qbar_0\), see Figure 7.3. In summary, \(\Qbar_{n,\text{kNN}}\) is rather good, though very variable in the vicinity of the break points. As for \(\Qbar_{n,\text{trees}}\), it does not seem to capture the shape of its target.

```
fig %>%
  mutate(trees_1 = compute_Qbar_hat_AW(1, w, Qbar_hat_trees),
         trees_0 = compute_Qbar_hat_AW(0, w, Qbar_hat_trees)) %>%
  pivot_longer(-w, names_to = "f", values_to = "value") %>%
  extract(f, c("f", "a"), "([^_]+)_([01]+)") %>%
  mutate(a = paste0("a=", a)) %>%
  ggplot() +
  geom_line(aes(x = w, y = value, color = f), size = 1) +
  labs(y = "f(w)",
       title = bquote("Visualizing" ~ bar(Q)[0] ~ "and its estimators")) +
  ylim(NA, 1) +
  facet_wrap(~ a)
```
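
To complement the visual inspection, one can also compute average squared deviations of each estimator from its target over the grid of \(w\)'s (a quick numerical sketch reusing the `fig` tibble and the estimators trained above; the averaging is with respect to the uniform grid, not the law of \(W\)):

```
fig %>%
  mutate(trees_1 = compute_Qbar_hat_AW(1, w, Qbar_hat_trees),
         trees_0 = compute_Qbar_hat_AW(0, w, Qbar_hat_trees)) %>%
  summarize(MSE_kNN_1   = mean((kNN_1 - truth_1)^2),
            MSE_kNN_0   = mean((kNN_0 - truth_0)^2),
            MSE_trees_1 = mean((trees_1 - truth_1)^2),
            MSE_trees_0 = mean((trees_0 - truth_0)^2))
```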

## 7.7 ⚙ ☡ `Qbar`, machine learning-based algorithms

- Using `estimate_Qbar`, make your own machine learning-based algorithm for the estimation of \(\Qbar_{0}\). Train your algorithm on the same data set as \(\Algo_{\Qbar,\text{kNN}}\) and \(\Algo_{\Qbar,\text{trees}}\). If, like \(\Algo_{\Qbar,\text{trees}}\), your algorithm includes a fine-tuning procedure, comment upon the optimal, data-driven specification.

- Plot your estimators of \(\Qbar_{0}(1,\cdot)\) and \(\Qbar_{0}(0,\cdot)\) on Figure 7.3.

## 7.8 Meta-learning/super learning

Without a great deal of previous experience or scientific expertise, it would
have likely been difficult for us to *a priori* postulate which of the two
above machine learning algorithms (or indeed the algorithm based on a working
model) would perform better for estimation of \(\bar{Q}_0\). Rather than
committing to one single algorithm, we may instead wish to resort to
meta-learning or super learning. In this approach, one specifies a *library*
of candidate algorithms for estimating a given nuisance parameter.
Cross-validation is used to determine an *ensemble* (*e.g.* convex
combination) of the algorithms that yields the best fit to the underlying
function. In this way, one can learn in real time which algorithms tend to
fit the data best and shift attention towards those algorithms.
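
To fix ideas, here is a toy, self-contained sketch of the simplest instance of this idea, a *discrete* super learner that merely selects the candidate with the smallest cross-validated risk (the data and the two candidate algorithms are made up for the illustration; a full super learner would also combine the candidates):

```
## toy data, mimicking the (W, A, Y) structure of the experiment
set.seed(1)
dat <- data.frame(W = runif(500))
dat$A <- rbinom(500, 1, plogis(1 - 2 * dat$W))
dat$Y <- plogis(-1 + dat$A + sin(4 * dat$W)) + rnorm(500, sd = 0.1)

## two candidate algorithms for estimating the conditional mean of Y given (A, W)
candidates <- list(
  lin  = function(train) lm(Y ~ A + W, data = train),
  flex = function(train) lm(Y ~ A * poly(W, 3), data = train)
)

## V-fold cross-validated mean squared error of each candidate
V <- 5
folds <- sample(rep(seq_len(V), length.out = nrow(dat)))
cv_risks <- sapply(names(candidates), function(name) {
  mean(sapply(seq_len(V), function(v) {
    fit  <- candidates[[name]](dat[folds != v, ])
    pred <- predict(fit, newdata = dat[folds == v, ])
    mean((dat$Y[folds == v] - pred)^2)  # held-out risk
  }))
})
cv_risks                    # cross-validated risks of the candidates
names(which.min(cv_risks))  # the discrete super learner's selection
```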

---

11. In fact, if one knows nothing about the feature, then it is *certain* that it does not belong to whichever small finite-dimensional working model we may come up with.

12. See also the companion function `compute_lGbar_hat_AW` (run `?compute_lGbar_hat_AW` to see its man page).

13. Specifically, argument `kmax` (maximum number of neighbors considered) is set to 5, argument `distance` (parameter of the Minkowski distance) is set to 2, and argument `kernel` is set to `gaussian`. The best value of \(k\) is chosen between 1 and `kmax` by leave-one-out. No outer cross-validation is needed.

14. Specifically, argument `mstop` (number of boosting iterations for prediction) is one among 10, 20 and 30; argument `nu` (stepsize of the shrinkage parameter) is one among 0.1 and 0.2; argument `maxdepth` (maximum depth of the base learner, a tree) is one among 1, 2 and 5. An outer 10-fold cross-validation is carried out to select the best combination of fine-tune parameters.