Section 3 Smoothness

3.1 Fluctuating smoothly

Within our view of the target parameter as a statistical mapping evaluated at the law of the experiment, it is natural to inquire about the properties this functional enjoys. For example, we may be interested in asking how the value of \(\Psi(P)\) changes as we consider laws that get nearer to \(P\) in \(\mathcal{M}\). If small deviations from \(P_{0}\) resulted in large changes in \(\Psi(P_{0})\), then we might hypothesize that it will be difficult to produce stable estimators of \(\psi_{0}\). Fortunately, this turns out not to be the case for the mapping \(\Psi\), and so we say that \(\Psi\) is a smooth statistical mapping.

To discuss how \(\Psi(P)\) changes for distributions that get nearer to \(P\) in the model, we require a more concrete notion of what it means to get near to a distribution in a model. The notion hinges on fluctuations (or fluctuating models).

3.1.1 The another_experiment fluctuation

In Section 2.3.3, we discussed the nature of the object called another_experiment that was created when we ran example(tlrider):

another_experiment
#> A law for (W,A,Y) in [0,1] x {0,1} x [0,1].
#> 
#> If the law is fully characterized, you can use method
#> 'sample_from' to sample from it.
#> 
#> If you built the law, or if you are an _oracle_, you can also
#> use methods 'reveal' to reveal its relevant features (QW, Gbar,
#> Qbar, qY -- see '?reveal'), and 'alter' to change some of them.
#> 
#> If all its relevant features are characterized, you can use
#> methods 'evaluate_psi' to obtain the value of 'Psi' at this law
#> (see '?evaluate_psi') and 'evaluate_eic' to obtain the efficient
#> influence curve of 'Psi' at this law (see '?evaluate_eic').

The message is a little misleading. Indeed, another_experiment is not a law but, rather, a collection of laws indexed by a real-valued parameter \(h\). This oracular statement (we built the object!) is evident when one looks again at the sample_from feature of another_experiment:

reveal(another_experiment)$sample_from
#> function(n, h) {
#>         ## preliminary
#>         n <- R.utils::Arguments$getInteger(n, c(1, Inf))
#>         h <- R.utils::Arguments$getNumeric(h)
#>         ## ## 'Gbar' and 'Qbar' factors
#>         Gbar <- another_experiment$.Gbar
#>         Qbar <- another_experiment$.Qbar
#>         ## sampling
#>         ## ## context
#>         params <- formals(another_experiment$.QW)
#>         W <- stats::runif(n, min = eval(params$min),
#>                    max = eval(params$max))
#>         ## ## action undertaken
#>         A <- stats::rbinom(n, size = 1, prob = Gbar(W))
#>         ## ## reward
#>         params <- formals(another_experiment$.qY)
#>         shape1 <- eval(params$shape1)
#>         QAW <- Qbar(cbind(A = A, W = W), h = h)
#>         Y <- stats::rbeta(n,
#>                           shape1 = shape1,
#>                           shape2 = shape1 * (1 - QAW) / QAW)
#>         ## ## observation
#>         obs <- cbind(W = W, A = A, Y = Y)
#>         return(obs)
#>       }
#> <bytecode: 0x56115cf79c38>
#> <environment: 0x56115878c6d8>
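
To make the indexing by \(h\) concrete, one can draw a few observations from two members of the collection, using the same calling convention as elsewhere in this chapter (the output, which depends on the random seed, is not shown here):

## two members of the collection encoded by 'another_experiment':
## the law at h = 0 and the law at h = 1/2
sample_from(another_experiment, 3, h = 0)
sample_from(another_experiment, 3, h = 1/2)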

Let us call \(\Pi_{h} \in \mathcal{M}\) the law encoded by another_experiment for a given \(h\) taken in \(]-1,1[\). Note that \(\mathcal{P} \equiv \{\Pi_{h} : h \in ]-1,1[\}\) defines a collection of laws, i.e., a statistical model.

We say that \(\mathcal{P}\) is a submodel of \(\mathcal{M}\) because \(\mathcal{P} \subset \mathcal{M}\). Moreover, we say that this submodel is through \(\Pi_{0}\) since \(\Pi_{h} = \Pi_{0}\) when \(h = 0\). We also say that \(\mathcal{P}\) is a fluctuation of \(\Pi_{0}\).

One could enumerate many possible submodels in \(\mathcal{M}\) through \(\Pi_{0}\). It turns out that all that matters for our purposes is the form of the submodel in a neighborhood of \(\Pi_{0}\). We informally say that this local behavior describes the direction of a submodel through \(\Pi_{0}\). We formalize this notion in Section 3.3.

We now have a notion of how to move through the model space along a submodel \(\mathcal{P} \subset \mathcal{M}\) and can study how the value of the parameter changes as we move away from a law \(P\). Above, we said that \(\Psi\) is a smooth parameter if it does not change “abruptly” as we move towards \(P\) in any particular direction. That is, we should hope that \(\Psi\) is differentiable along our submodel at \(P\). This idea too is formalized in Section 3.3. We now turn to illustrating this idea numerically.

3.1.2 Numerical illustration

The code below evaluates how the parameter changes for laws in \(\mathcal{P}\), and approximates the derivative of the parameter along the submodel \(\mathcal{P}\) at \(\Pi_{0}\). Recall that the numerical value of \(\Psi(\Pi_{0})\) has already been computed and is stored in object psi_Pi_zero.

## values of h at which Psi is evaluated along the fluctuation
approx <- seq(-1, 1, length.out = 1e2)
psi_Pi_h <- sapply(approx, function(t) {
  evaluate_psi(another_experiment, h = t)
})
## crude approximation of the derivative at h = 0, using the difference
## quotient at the smallest positive h on the grid
slope_approx <- (psi_Pi_h - psi_Pi_zero) / approx
slope_approx <- slope_approx[min(which(approx > 0))]
ggplot() +
  geom_point(data = data.frame(x = approx, y = psi_Pi_h), aes(x, y),
             color = "#CC6666") +
  geom_segment(aes(x = -1, y = psi_Pi_zero - slope_approx,
                   xend = 1, yend = psi_Pi_zero + slope_approx),
               arrow = arrow(length = unit(0.03, "npc")),
               color = "#9999CC") +
  geom_vline(xintercept = 0, color = "#66CC99") +
  geom_hline(yintercept = psi_Pi_zero, color = "#66CC99") +
  labs(x = "h", y = expression(Psi(Pi[h])))

Figure 3.1: Evolution of statistical mapping \(\Psi\) along fluctuation \(\{\Pi_{h} : h \in H\}\).

The red curve represents the function \(h \mapsto \Psi(\Pi_{h})\). The blue line represents the tangent to this curve at \(h = 0\); the function indeed appears to be differentiable at \(h = 0\). In Section 3.4, we derive a closed-form expression for the slope of the blue line.

3.2 ⚙ Yet another experiment

  1. Adapt the code from Problem 1 in Section 1.3 to visualize \(w \mapsto E_{\Pi_{h}}(Y \mid A = 1, W = w)\), \(w \mapsto E_{\Pi_{h}}(Y \mid A = 0, W = w)\), and \(w \mapsto E_{\Pi_{h}}(Y \mid A = 1, W = w) - E_{\Pi_{h}}(Y \mid A = 0, W = w)\), for \(h \in \{-1/2, 0, 1/2\}\).

  2. Run the following chunk of code.

yet_another_experiment <- copy(another_experiment)
alter(yet_another_experiment,
      Qbar = function(AW, h){
        A <- AW[, "A"]
        W <- AW[, "W"]
        ## fluctuate the conditional mean of Y given (A, W) on the logit scale
        expit( logit( A * W + (1 - A) * W^2 ) +
               h * (2 * A - 1) / ifelse(A == 1,
                                        sin((1 + W) * pi / 6),
                                        1 - sin((1 + W) * pi / 6)))
      })
  3. Justify that yet_another_experiment characterizes another fluctuation of \(\Pi_{0}\). Comment upon the similarities and differences between \(\{\Pi_{h} : h \in ]-1,1[\}\) and \(\{\Pi_{h}' : h \in ]-1,1[\}\), where \(\Pi_{h}'\) denotes the law encoded by yet_another_experiment for parameter \(h\).

  4. Repeat Problem 1 above with \(\Pi_{h}'\) substituted for \(\Pi_{h}\).

  5. Reproduce Figure 3.1 for the \(\{\Pi_{h}' : h \in ]-1,1[\}\) fluctuation. Comment on the similarities and differences between the resulting figure and Figure 3.1. In particular, how does the behavior of the target parameter around \(h = 0\) compare between laws \(\Pi_{0}\) and \(\Pi_{0}'\)?

3.3 ☡ More on fluctuations and smoothness

3.3.1 Fluctuations

Let us now formally define what it means for statistical mapping \(\Psi\) to be smooth at every \(P \in \mathcal{M}\). Let \(H\) be the interval \(]-1/M, 1/M[\). For every \(h \in H\), we can define a law \(P_{h} \in \mathcal{M}\) by setting \(P_{h} \ll P\)⁵ and
\[\frac{dP_{h}}{dP} \equiv 1 + h s, \tag{3.1}\]
where \(s : \mathcal{O} \to \mathbb{R}\) is a (measurable) function of \(O\) such that \(s(O)\) is not equal to zero \(P\)-almost surely, \(E_{P}(s(O)) = 0\), and \(s\) is bounded by \(M\). We make the observation that
\[\text{(i)} \quad P_{h} \big|_{h=0} = P, \qquad \text{(ii)} \quad \frac{d}{dh} \log \frac{dP_{h}}{dP}(O) \Big|_{h=0} = s(O). \tag{3.2}\]

Because of (i), \(\{P_{h} : h \in H\}\) is a submodel through \(P\), also referred to as a fluctuation of \(P\). The fluctuation is a one-dimensional submodel of \(\mathcal{M}\) with univariate parameter \(h \in H\). We note that (ii) indicates that the score of this submodel at \(h = 0\) is \(s\). Thus, we say that the fluctuation is in the direction of \(s\).

Fluctuations of \(P\) do not necessarily take the same form as in (3.1). No matter how the fluctuation is built, for our purposes the most important feature of the fluctuation is its direction.
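
To make (3.1) more tangible, here is a minimal numerical sketch (not part of tlrider) that fluctuates the uniform law on \([0,1]\) in the assumed direction \(s(w) = w - 1/2\). This direction is \(P\)-centered and bounded by \(M = 1/2\), so \(H = ]-2, 2[\) and every \(P_{h}\) is a bona fide law:

## direction of the fluctuation: mean zero under P = Unif(0,1), bounded by M = 1/2
s <- function(w) w - 1/2
## density of P_h with respect to the Lebesgue measure on [0,1] (here dP/dw = 1)
dens_Ph <- function(w, h) 1 + h * s(w)
h <- 1.5
integrate(dens_Ph, lower = 0, upper = 1, h = h)$value  # integrates to one
min(dens_Ph(seq(0, 1, by = 1e-3), h)) > 0              # and remains positive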

3.3.2 Smoothness and gradients

We are now prepared to provide a formal definition of smoothness of statistical mappings. We say that a statistical mapping \(\Psi\) is smooth at every \(P \in \mathcal{M}\) if, for each \(P \in \mathcal{M}\), there exists a (measurable) function \(D(P) : \mathcal{O} \to \mathbb{R}\) such that \(E_{P}(D(P)(O)) = 0\), \(\operatorname{Var}_{P}(D(P)(O)) < \infty\), and, for every fluctuation \(\{P_{h} : h \in H\}\) with score \(s\) at \(h = 0\), the real-valued mapping \(h \mapsto \Psi(P_{h})\) is differentiable at \(h = 0\), with a derivative equal to
\[E_{P}\left(D(P)(O) s(O)\right). \tag{3.3}\]
The object \(D(P)\) in (3.3) is called a gradient of \(\Psi\) at \(P\).⁶
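
To fix ideas, consider a simpler mapping than \(\Psi\), say \(\Psi'(P) \equiv E_{P}(O)\) for a real-valued \(O\) (a toy example, not the mapping studied in this text). Along any fluctuation of the form (3.1),
\[\Psi'(P_{h}) = \int o \, (1 + h s(o)) \, dP(o) = \Psi'(P) + h E_{P}(O s(O)) = \Psi'(P) + h E_{P}\big((O - E_{P}(O)) s(O)\big),\]
where the last equality uses \(E_{P}(s(O)) = 0\). Hence \(h \mapsto \Psi'(P_{h})\) is differentiable at \(h = 0\) and \(D(P)(O) \equiv O - E_{P}(O)\) is a gradient of \(\Psi'\) at \(P\).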

3.3.3 A Euclidean perspective

This terminology has a direct parallel to directional derivatives in the calculus of Euclidean geometry. Recall that if \(f\) is a differentiable mapping from \(\mathbb{R}^{p}\) to \(\mathbb{R}\), then the directional derivative of \(f\) at a point \(x\) (an element of \(\mathbb{R}^{p}\)) in direction \(u\) (a unit vector in \(\mathbb{R}^{p}\)) is the scalar product of the gradient of \(f\) at \(x\) and \(u\). In words, the directional derivative of \(f\) at \(x\) can be represented as a scalar product of the direction from which we approach \(x\) and the gradient, which describes the change of the function’s value at \(x\).

In the present problem, the law \(P\) is the point at which we evaluate the function \(\Psi\), the score \(s\) of the fluctuation is the direction in which we approach the point, and the gradient describes the change in the function’s value at the point.
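
For instance, the following small numerical check (an assumed toy example, unrelated to tlrider) illustrates the Euclidean identity with \(f(x) = \|x\|^{2}\), whose gradient at \(x\) is \(2x\):

f <- function(x) sum(x^2)
x <- c(1, -2, 0.5)
u <- c(1, 1, 1) / sqrt(3)                   # a unit vector of R^3
analytic <- sum(2 * x * u)                  # scalar product of gradient and u
numeric <- (f(x + 1e-6 * u) - f(x)) / 1e-6  # finite-difference approximation
c(analytic = analytic, numeric = numeric)   # essentially equal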

3.3.4 The canonical gradient

In general, it is possible for many gradients to exist.⁷ Yet, in the special case that the model is nonparametric, only a single gradient exists. The unique gradient is then referred to as the canonical gradient or, for reasons that will be clarified in Section 3.5, the efficient influence curve. In the more general setting, the canonical gradient may be defined as the minimizer of \(D \mapsto \operatorname{Var}_{P}(D(O))\) over the set of all gradients of \(\Psi\) at \(P\).

It is not difficult to check that the efficient influence curve of statistical mapping \(\Psi\) (2.6) at \(P \in \mathcal{M}\) can be written as
\[D^{*}(P) \equiv D_{1}^{*}(P) + D_{2}^{*}(P), \quad \text{where} \quad
\begin{aligned}
D_{1}^{*}(P)(O) &\equiv \bar{Q}(1, W) - \bar{Q}(0, W) - \Psi(P), \\
D_{2}^{*}(P)(O) &\equiv \frac{2A - 1}{\bar{G}(A, W)} \left(Y - \bar{Q}(A, W)\right).
\end{aligned} \tag{3.4}\]
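
For concreteness, here is a minimal sketch (an illustration under stated assumptions, not the tlrider implementation) of how (3.4) translates into code. The arguments Qbar, Gbar and psi are hypothetical placeholders for \(\bar{Q}\), the conditional probability that \(A = 1\) given \(W\) (as in the sample_from code shown earlier), and \(\Psi(P)\); we also assume that \(\bar{G}(A, W)\) denotes the conditional probability of the observed action \(A\) given \(W\). The actual tlrider method, evaluate_eic, is used below:

eic_sketch <- function(obs, Qbar, Gbar, psi) {
  W <- obs[, "W"]; A <- obs[, "A"]; Y <- obs[, "Y"]
  D1 <- Qbar(1, W) - Qbar(0, W) - psi            # first component of (3.4)
  GAW <- A * Gbar(W) + (1 - A) * (1 - Gbar(W))   # probability of the observed action
  D2 <- (2 * A - 1) / GAW * (Y - Qbar(A, W))     # second component of (3.4)
  D1 + D2
}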

A method from package tlrider evaluates the efficient influence curve at a law described by an object of class LAW. It is called evaluate_eic. For instance, the next chunk of code evaluates the efficient influence curve \(D^{*}(P_{0})\) of \(\Psi\) (2.6) at the \(P_{0} \in \mathcal{M}\) that is characterized by experiment:

eic_experiment <- evaluate_eic(experiment)

The efficient influence curve \(D^{*}(P_{0})\) is a function from \(\mathcal{O}\) to \(\mathbb{R}\). As such, it can be evaluated at the five independent observations drawn from \(P_{0}\) in Section 1.2.2. This is what the next chunk of code does:

(eic_experiment(five_obs))
#> [1] -0.0241 -0.0283  0.1829  0.0374 -0.0463

Finally, the efficient influence curve can be visualized as two images that represent \((w, y) \mapsto D^{*}(P_{0})(w, a, y)\) for \(a = 0, 1\), respectively:

crossing(w = seq(0, 1, length.out = 2e2),
         a = c(0, 1),
         y = seq(0, 1, length.out = 2e2)) %>%
  mutate(eic = eic_experiment(cbind(Y=y,A=a,W=w))) %>%
  ggplot(aes(x = w, y = y, fill = eic)) +
  geom_raster(interpolate = TRUE) +
  geom_contour(aes(z = eic), color = "white") +
  facet_wrap(~ a, nrow = 1,
             labeller = as_labeller(c(`0` = "a = 0", `1` = "a = 1"))) +
  labs(fill = expression(paste(D^"*", (P[0])(w,a,y))))

Figure 3.2: Visualizing the efficient influence curve \(D^{*}(P_{0})\) of \(\Psi\) (2.6) at \(P_{0}\), the law described by experiment.

3.4 A fresh look at another_experiment

We can now take a fresh look at Section 3.1.2.

3.4.1 Deriving the efficient influence curve

It is not difficult (though cumbersome) to verify that, up to a constant, \(\{\Pi_{h} : h \in ]-1,1[\}\) is a fluctuation of \(\Pi_{0}\) in the direction (in the sense of (3.1)) of

\[\sigma_{0}(O) \equiv 10 \sqrt{W} A \times \beta_{0}(A, W) \times \left(\log(1 - Y) + \sum_{k=0}^{3} \left(k + \beta_{0}(A, W)\right)^{-1}\right) + \text{constant}, \tag{3.5}\]
where \(\beta_{0}(A, W) \equiv \frac{1 - \bar{Q}_{\Pi_{0}}(A, W)}{\bar{Q}_{\Pi_{0}}(A, W)}\).

Consequently, the slope of the line in Figure 3.1 is equal to

\[E_{\Pi_{0}}\left(D^{*}(\Pi_{0})(O) \, \sigma_{0}(O)\right). \tag{3.6}\]

Since \(D^{*}(\Pi_{0})\) is centered under \(\Pi_{0}\), knowing \(\sigma_{0}\) only up to an additive constant is not problematic: the constant contributes nothing to (3.6).

3.4.2 Numerical validation

In the following code, we check the above fact numerically. When we ran example(tlrider), we created a function sigma0. The function implements \(\sigma_{0}\) defined in (3.5):

sigma0
#> function(obs, law = another_experiment) {
#>   ## preliminary
#>   Qbar <- get_feature(law, "Qbar", h = 0)
#>   QAW <- Qbar(obs[, c("A", "W")])
#>   params <- formals(get_feature(law, "qY", h = 0))
#>   shape1 <- eval(params$shape1)
#>   ## computations
#>   betaAW <- shape1 * (1 - QAW) / QAW
#>   out <- log(1 - obs[, "Y"])
#>   for (int in 1:shape1) {
#>     out <- out + 1/(int - 1 + betaAW)
#>   }
#>   out <- - out * shape1 * (1 - QAW) / QAW *
#>            10 * sqrt(obs[, "W"]) * obs[, "A"]
#>   ## no need to center given how we will use it
#>   return(out)
#>  }

The next chunk of code approximates (3.6) with a point estimate and a confidence interval of asymptotic level 95%:

## Monte Carlo approximation of (3.6) based on B independent draws from Pi_0
eic_another_experiment <- evaluate_eic(another_experiment, h = 0)
obs_another_experiment <- sample_from(another_experiment, B, h = 0)
vars <- eic_another_experiment(obs_another_experiment) *
  sigma0(obs_another_experiment)

sd_hat <- sd(vars)
(slope_hat <- mean(vars))
#> [1] 1.35
## confidence interval of asymptotic level 95% (alpha = 0.05)
(slope_CI <- slope_hat + c(-1, 1) *
   qnorm(1 - alpha / 2) * sd_hat / sqrt(B))
#> [1] 1.35 1.36

Equal to 1.349 (rounded to three decimal places; hereafter, all rounding is to three decimal places as well), the first numerical approximation slope_approx is not far off!

3.5 ☡ Asymptotic linearity and statistical efficiency

3.5.1 Asymptotic linearity

Suppose that \(O_{1}, \ldots, O_{n}\) are drawn independently from \(P \in \mathcal{M}\). If an estimator \(\psi_{n}\) of \(\Psi(P)\) can be written as

\[\psi_{n} = \Psi(P) + \frac{1}{n} \sum_{i=1}^{n} \mathrm{IC}(O_{i}) + o_{P}(1/\sqrt{n}) \tag{3.7}\]

for some function \(\mathrm{IC} : \mathcal{O} \to \mathbb{R}\) such that \(E_{P}(\mathrm{IC}(O)) = 0\) and \(\operatorname{Var}_{P}(\mathrm{IC}(O)) < \infty\), then we say that \(\psi_{n}\) is asymptotically linear with influence curve \(\mathrm{IC}\). Asymptotically linear estimators are weakly convergent. Specifically, if \(\psi_{n}\) is asymptotically linear with influence curve \(\mathrm{IC}\), then

\[\sqrt{n}\left(\psi_{n} - \Psi(P)\right) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \mathrm{IC}(O_{i}) + o_{P}(1) \tag{3.8}\]

and, by the central limit theorem (recall that \(O_{1}, \ldots, O_{n}\) are independent), \(\sqrt{n}(\psi_{n} - \Psi(P))\) converges in law to a centered Gaussian distribution with variance \(\operatorname{Var}_{P}(\mathrm{IC}(O))\).
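
As a simple illustration (an assumed toy example, independent of tlrider), the empirical mean of \(n\) independent draws from the uniform law on \([0,1]\) is asymptotically linear with influence curve \(\mathrm{IC}(O) = O - 1/2\), and a small simulation shows that \(\sqrt{n}(\psi_{n} - 1/2)\) is approximately centered Gaussian with variance \(\operatorname{Var}(\mathrm{IC}(O)) = 1/12\):

set.seed(1)
n <- 1e3
## sqrt(n) * (empirical mean - true mean), replicated many times
sims <- replicate(1e4, sqrt(n) * (mean(runif(n)) - 1/2))
c(mean = mean(sims), var = var(sims), var_IC = 1/12)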

3.5.2 Influence curves and gradients

As it happens, influence curves of regular⁸ estimators are intimately related to gradients. In fact, if \(\psi_{n}\) is a regular, asymptotically linear estimator of \(\Psi(P)\) with influence curve \(\mathrm{IC}\), then it must be true that \(\Psi\) is a smooth parameter at \(P\) and that \(\mathrm{IC}\) is a gradient of \(\Psi\) at \(P\).

3.5.3 Asymptotic efficiency

Now recall that, in Section 3.3.4, we defined the canonical gradient as the minimizer of \(D \mapsto \operatorname{Var}_{P}(D(O))\) over the set of all gradients. Therefore, if \(\psi_{n}\) is a regular, asymptotically linear estimator of \(\Psi(P)\) (built from \(n\) independent observations drawn from \(P\)), then the asymptotic variance of \(\sqrt{n}(\psi_{n} - \Psi(P))\) cannot be smaller than the variance of the canonical gradient of \(\Psi\) at \(P\), i.e.,

\[\operatorname{Var}_{P}\left(D^{*}(P)(O)\right). \tag{3.9}\]

In other words, (3.9) is the lower bound on the asymptotic variance of any regular, asymptotically linear estimator of \(\Psi(P)\). This bound is referred to as the Cramér-Rao bound. Any regular estimator that achieves this variance bound is said to be asymptotically efficient at \(P\). Because the canonical gradient is the influence curve of an asymptotically efficient estimator, it is often referred to as the efficient influence curve.

3.6 ⚙ Cramér-Rao bounds

  1. What does the following chunk do?
obs <- sample_from(experiment, B)
(cramer_rao_hat <- var(eic_experiment(obs)))
#> [1] 0.287
  2. Same question about this one.
obs_another_experiment <- sample_from(another_experiment, B, h = 0)
(cramer_rao_Pi_zero_hat <-
   var(eic_another_experiment(obs_another_experiment)))
#> [1] 0.0946
  3. With a large sample of independent observations drawn from \(P_{0}\) (or \(\Pi_{0}\)), is it possible to construct a regular estimator \(\psi_{n}\) of \(\Psi(P_{0})\) (or \(\Psi(\Pi_{0})\)) such that the asymptotic variance of \(\sqrt{n}\) times \(\psi_{n}\) minus its target is smaller than the Cramér-Rao bound?

  4. Is it easier to estimate \(\Psi(P_{0})\) or \(\Psi(\Pi_{0})\) (from independent observations drawn from either law)? In what sense? (Hint: you may want to compute a ratio.)


  5. That is, \(P_{h}\) is dominated by \(P\): if an event \(A\) satisfies \(P(A) = 0\), then necessarily \(P_{h}(A) = 0\) too. Because \(P_{h} \ll P\), the law \(P_{h}\) has a density with respect to \(P\), meaning that there exists a (measurable) function \(f\) such that \(P_{h}(A) = \int_{o \in A} f(o) dP(o)\) for any event \(A\). The function is often denoted \(dP_{h}/dP\).

  6. Interestingly, if a fluctuation \(\{P_{h} : h \in H\}\) satisfies (3.2) for a direction \(s\) such that \(s \neq 0\), \(E_{P}(s(O)) = 0\) and \(\operatorname{Var}_{P}(s(O)) < \infty\), then \(h \mapsto \Psi(P_{h})\) is still differentiable at \(h = 0\) with a derivative equal to (3.3), even beyond fluctuations of the form (3.1).

  7. This may be at first surprising given the parallel drawn in Section 3.3.3 to Euclidean geometry. However, it is important to remember that the model dictates which fluctuations of \(P\) are valid submodels with respect to the full model. In turn, this determines the possible directions from which we may approach \(P\). Thus, depending on the direction, (3.3) may hold with different choices of \(D\).

  8. We can view \(\psi_{n}\) as the by-product of an algorithm \(\hat{\Psi}\) trained on independent observations \(O_{1}, \ldots, O_{n}\) drawn from \(P\). We say that the estimator is regular at \(P\) if, for any direction \(s \neq 0\) such that \(E_{P}(s(O)) = 0\) and \(\operatorname{Var}_{P}(s(O)) < \infty\) and any fluctuation \(\{P_{h} : h \in H\}\) satisfying (3.2), the estimator \(\psi_{n, 1/\sqrt{n}}\) of \(\Psi(P_{1/\sqrt{n}})\) obtained by training \(\hat{\Psi}\) on independent observations \(O_{1}\), \(\ldots\), \(O_{n}\) drawn from \(P_{1/\sqrt{n}}\) is such that \(\sqrt{n}(\psi_{n, 1/\sqrt{n}} - \Psi(P_{1/\sqrt{n}}))\) converges in law to a limit that does not depend on \(s\).