Penalized log-spline density estimation: density_mpl

Model

Given observations $x_1,\dots,x_n$ with weights $f_i$ ( $F=\sum_i f_i$ ) on an interval $\Omega=[a,b]$ , and a basis $\phi(x)=(\phi_1(x),\dots,\phi_K(x))^\top$ , the density is parameterized by $c\in\mathbb{R}^K$ as $p(x;c)=\frac{\exp\!\bigl(\phi(x)^\top c\bigr)}{C(c)},\qquad C(c)=\int_\Omega \exp\!\bigl(\phi(x)^\top c\bigr)\,dx .$
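The parameterization can be sketched in a few lines of Rust. This is a minimal illustration under stated assumptions, not the package's implementation: `hat_basis` (piecewise-linear B-splines on equally spaced knots) stands in for the actual basis, a composite Simpson rule stands in for Romberg integration, and all function names are hypothetical.

```rust
/// Hat (order-2 B-spline) basis on k >= 2 equally spaced knots over [a, b].
/// Illustrative stand-in for the real basis; it sums to 1 on [a, b].
fn hat_basis(x: f64, a: f64, b: f64, k: usize) -> Vec<f64> {
    let h = (b - a) / (k as f64 - 1.0);
    (0..k)
        .map(|j| (1.0 - ((x - (a + j as f64 * h)) / h).abs()).max(0.0))
        .collect()
}

/// W(x) = phi(x)^T c, the log of the unnormalized density.
fn log_unnormalized(x: f64, c: &[f64], a: f64, b: f64) -> f64 {
    hat_basis(x, a, b, c.len())
        .iter()
        .zip(c)
        .map(|(&p, &ci)| p * ci)
        .sum()
}

/// Composite Simpson rule with n (even) subintervals.
/// Assumption: used here instead of Romberg integration for brevity.
fn simpson<G: Fn(f64) -> f64>(g: G, a: f64, b: f64, n: usize) -> f64 {
    let h = (b - a) / n as f64;
    let mut s = g(a) + g(b);
    for i in 1..n {
        let w = if i % 2 == 1 { 4.0 } else { 2.0 };
        s += w * g(a + i as f64 * h);
    }
    s * h / 3.0
}

/// C(c): integral over [a, b] of exp(phi(x)^T c).
fn normalizer(c: &[f64], a: f64, b: f64) -> f64 {
    simpson(|x| log_unnormalized(x, c, a, b).exp(), a, b, 2000)
}

/// p(x; c) = exp(phi(x)^T c) / C(c).
fn density(x: f64, c: &[f64], a: f64, b: f64) -> f64 {
    log_unnormalized(x, c, a, b).exp() / normalizer(c, a, b)
}
```

With $c=0$ the sketch returns the uniform density $1/(b-a)$, and for any $c$ the result integrates to one over $\Omega$ up to quadrature error.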

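The objective to be minimized is the penalized negative log-likelihood $F(c)=\underbrace{-\sum_i f_i\,\phi(x_i)^\top c+F\log C(c)}_{-\ell(c)}+\lambda\,c^\top K c, \qquad K=\int_\Omega (L\phi)(L\phi)^\top\,dx,$ where $L$ is the differential operator carried by WfdParobj$Lfd. Two symbols are overloaded: $F(c)$ with an argument is the objective while bare $F$ is the total weight, and $K$ denotes the penalty matrix here but the basis dimension in $c\in\mathbb{R}^K$ .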

Gradient and expected Hessian

Differentiating $\log C$ gives $\nabla_c\log C(c)=\mathbb{E}_p[\phi(X)],\qquad \nabla_c^2\log C(c)=\operatorname{Var}_p[\phi(X)].$ Hence $\nabla\ell(c)=\sum_i f_i\bigl(\phi(x_i)-\mathbb{E}_p[\phi]\bigr),\qquad -\nabla^2\ell(c)=F\operatorname{Var}_p[\phi].$ With the penalty, $g(c)=-\nabla\ell(c)+2\lambda K c,\qquad H(c)=F\operatorname{Var}_p[\phi]+2\lambda K .$ These correspond to loglfnden (gradient) and Varfnden $+\,2\lambda K$ (expected Hessian); the integrals $C$ , $\mathbb{E}_p[\phi]$ , $\mathbb{E}_p[\phi\phi^\top]$ are computed by Romberg integration in normden.phi, expectden.phi, expectden.phiphit.
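The formulas for $g$ and $H$ translate directly into code. The sketch below is illustrative only: it assumes a hat-function basis, replaces Romberg integration with a midpoint rule, takes the penalty matrix as a plain argument `pen`, and uses hypothetical names (`moments`, `grad_hess`) that are not the package's loglfnden or Varfnden.

```rust
type Mat = Vec<Vec<f64>>;

/// Hat-function basis on k >= 2 equally spaced knots over [a, b]
/// (illustrative; any partition-of-unity basis would do).
fn hat_basis(x: f64, a: f64, b: f64, k: usize) -> Vec<f64> {
    let h = (b - a) / (k as f64 - 1.0);
    (0..k)
        .map(|j| (1.0 - ((x - (a + j as f64 * h)) / h).abs()).max(0.0))
        .collect()
}

/// Midpoint-rule estimates of C(c), E_p[phi] and Var_p[phi] with n cells.
fn moments(c: &[f64], a: f64, b: f64, n: usize) -> (f64, Vec<f64>, Mat) {
    let k = c.len();
    let h = (b - a) / n as f64;
    let mut norm = 0.0_f64;
    let mut mean = vec![0.0_f64; k];
    let mut var = vec![vec![0.0_f64; k]; k];
    for i in 0..n {
        let phi = hat_basis(a + (i as f64 + 0.5) * h, a, b, k);
        let w_x: f64 = phi.iter().zip(c).map(|(&p, &ci)| p * ci).sum();
        let w = w_x.exp() * h;
        norm += w;
        for r in 0..k {
            mean[r] += w * phi[r];
            for s in 0..k {
                var[r][s] += w * phi[r] * phi[s];
            }
        }
    }
    for r in 0..k {
        mean[r] /= norm;
    }
    for r in 0..k {
        for s in 0..k {
            var[r][s] = var[r][s] / norm - mean[r] * mean[s];
        }
    }
    (norm, mean, var)
}

/// g = -grad l + 2*lambda*K*c and H = F*Var_p[phi] + 2*lambda*K.
fn grad_hess(
    c: &[f64], x: &[f64], f: &[f64], lambda: f64, pen: &Mat, a: f64, b: f64,
) -> (Vec<f64>, Mat) {
    let k = c.len();
    let (_, mean, var) = moments(c, a, b, 4000);
    let ftot: f64 = f.iter().sum(); // total weight F, not the number of rows
    let mut g = vec![0.0_f64; k];
    let mut hmat = vec![vec![0.0_f64; k]; k];
    for (&xi, &fi) in x.iter().zip(f) {
        let phi = hat_basis(xi, a, b, k);
        for r in 0..k {
            g[r] -= fi * (phi[r] - mean[r]);
        }
    }
    for r in 0..k {
        for s in 0..k {
            g[r] += 2.0 * lambda * pen[r][s] * c[s];
            hmat[r][s] = ftot * var[r][s] + 2.0 * lambda * pen[r][s];
        }
    }
    (g, hmat)
}
```

For a partition-of-unity basis and $\lambda=0$, the components of $g$ sum to zero and every row of $H$ sums to zero, which is a quick sanity check on any implementation.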

Identifiability

For a B-spline (or any partition-of-unity) basis, $\sum_j\phi_j\equiv 1$ , so shifting $c\mapsto c+\delta\mathbf{1}$ shifts $\phi^\top c$ by the constant $\delta$ , which is absorbed into $C$ . The likelihood is therefore invariant along the direction $\mathbf{1}$ . Let $Z\in\mathbb{R}^{K\times(K-1)}$ be an orthonormal basis of $\mathbf{1}^\perp$ (zerobasis(nbasis)). Reparametrize $c=Z\tilde c$ ; the reduced problem $\tilde g=Z^\top g,\qquad \tilde H=Z^\top H Z$ has $\tilde H\succ 0$ (since $\operatorname{Var}_p[\phi]$ is PD on $\mathbf{1}^\perp$ , and the $\lambda K$ term is PSD).
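One concrete way to build such a $Z$ is with Helmert contrasts, sketched below. This is an assumption made for illustration: zerobasis may construct a different orthonormal basis of $\mathbf{1}^\perp$, and any such basis gives an equivalent reduced problem. The function names are hypothetical.

```rust
/// Orthonormal basis of the complement of the all-ones vector, built from
/// Helmert contrasts. Returned as k rows by k-1 columns; requires k >= 2.
fn zero_sum_basis(k: usize) -> Vec<Vec<f64>> {
    let mut z = vec![vec![0.0_f64; k - 1]; k];
    for j in 0..k - 1 {
        let m = (j + 1) as f64;
        let scale = (m * (m + 1.0)).sqrt();
        // Column j: m equal entries followed by one balancing entry.
        for i in 0..=j {
            z[i][j] = 1.0 / scale;
        }
        z[j + 1][j] = -m / scale;
    }
    z
}

/// Reduced gradient Z^T g.
fn reduce_gradient(z: &[Vec<f64>], g: &[f64]) -> Vec<f64> {
    let cols = z[0].len();
    (0..cols)
        .map(|j| (0..g.len()).map(|i| z[i][j] * g[i]).sum::<f64>())
        .collect()
}

/// Reduced Hessian Z^T H Z.
fn reduce_hessian(z: &[Vec<f64>], h: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let (k, m) = (z.len(), z[0].len());
    let mut out = vec![vec![0.0_f64; m]; m];
    for p in 0..m {
        for q in 0..m {
            for r in 0..k {
                for s in 0..k {
                    out[p][q] += z[r][p] * h[r][s] * z[s][q];
                }
            }
        }
    }
    out
}

/// Lift a reduced step back to coefficient space: delta_c = Z * delta_c_tilde.
fn lift(z: &[Vec<f64>], step: &[f64]) -> Vec<f64> {
    z.iter()
        .map(|row| row.iter().zip(step).map(|(&a, &b)| a * b).sum::<f64>())
        .collect()
}
```

Because every column of $Z$ sums to zero, any lifted step leaves $\mathbf{1}^\top c$ unchanged, so the iteration never moves along the unidentified direction.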

Algorithm (Fisher scoring with line search)

1. Initialize $c\leftarrow c_0$ (from WfdParobj$fd).
2. Compute $g,H$ ; reduce to $\tilde g,\tilde H$ ; Newton step $\Delta\tilde c=-\tilde H^{-1}\tilde g$ , $\Delta c=Z\Delta\tilde c$ .
3. Backtracking/interpolating line search (stepit) on $\alpha\mapsto F(c+\alpha\Delta c)$ subject to box constraints $c\in[-50,400]^K$ (stepchk).
4. Update $c\leftarrow c+\alpha\Delta c$ ; stop when $|F^{\text{new}}-F^{\text{old}}|<\text{conv}$ or iterlim reached.
5. Return $W_c=\phi^\top c$ and $C=C(c)$ .
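The loop can be sketched end to end for the smallest non-trivial case. The code below is a toy under explicit assumptions, not the package's solver: it uses a two-function linear basis (so $\mathbf{1}^\perp$ is one-dimensional and the reduced Newton step is a scalar division), sets $\lambda=0$, replaces Romberg integration with a midpoint rule, uses plain Armijo halving instead of stepit, and omits the box constraints. All names are hypothetical.

```rust
/// Two-function linear basis on [a, b]; phi_1 + phi_2 = 1.
fn basis(x: f64, a: f64, b: f64) -> [f64; 2] {
    let t = (x - a) / (b - a);
    [1.0 - t, t]
}

/// Midpoint-rule estimates of (C, E_p[phi], Var_p[phi]).
fn moments(c: [f64; 2], a: f64, b: f64) -> (f64, [f64; 2], [[f64; 2]; 2]) {
    let n = 4000;
    let h = (b - a) / n as f64;
    let (mut norm, mut mean, mut var) = (0.0_f64, [0.0_f64; 2], [[0.0_f64; 2]; 2]);
    for i in 0..n {
        let phi = basis(a + (i as f64 + 0.5) * h, a, b);
        let w = (phi[0] * c[0] + phi[1] * c[1]).exp() * h;
        norm += w;
        for r in 0..2 {
            mean[r] += w * phi[r];
            for s in 0..2 {
                var[r][s] += w * phi[r] * phi[s];
            }
        }
    }
    for r in 0..2 {
        mean[r] /= norm;
    }
    for r in 0..2 {
        for s in 0..2 {
            var[r][s] = var[r][s] / norm - mean[r] * mean[s];
        }
    }
    (norm, mean, var)
}

/// Unpenalized objective: -sum_i f_i phi(x_i)^T c + F log C(c).
fn objective(c: [f64; 2], x: &[f64], f: &[f64], a: f64, b: f64) -> f64 {
    let ftot: f64 = f.iter().sum();
    let mut lin = 0.0_f64;
    for (&xi, &fi) in x.iter().zip(f) {
        let p = basis(xi, a, b);
        lin += fi * (p[0] * c[0] + p[1] * c[1]);
    }
    -lin + ftot * moments(c, a, b).0.ln()
}

/// Fisher scoring on the reduced (scalar) problem with Armijo halving.
fn fisher_scoring(x: &[f64], f: &[f64], a: f64, b: f64, iterlim: usize, conv: f64) -> [f64; 2] {
    let s = 0.5_f64.sqrt();
    let z = [s, -s]; // orthonormal basis of the complement of (1, 1)
    let ftot: f64 = f.iter().sum();
    let mut c = [0.0_f64, 0.0_f64];
    let mut f_old = objective(c, x, f, a, b);
    for _ in 0..iterlim {
        let (_, mean, var) = moments(c, a, b);
        let mut g = [0.0_f64; 2];
        for (&xi, &fi) in x.iter().zip(f) {
            let p = basis(xi, a, b);
            for r in 0..2 {
                g[r] -= fi * (p[r] - mean[r]);
            }
        }
        let g_red = z[0] * g[0] + z[1] * g[1];
        let mut h_red = 0.0_f64;
        for r in 0..2 {
            for q in 0..2 {
                h_red += z[r] * ftot * var[r][q] * z[q];
            }
        }
        let step = -g_red / h_red; // reduced Newton step
        let mut alpha: f64 = 1.0;
        let mut accepted = None;
        for _ in 0..30 {
            let trial = [c[0] + alpha * step * z[0], c[1] + alpha * step * z[1]];
            let f_new = objective(trial, x, f, a, b);
            if f_new <= f_old + 1e-4 * alpha * g_red * step {
                accepted = Some((trial, f_new));
                break;
            }
            alpha *= 0.5;
        }
        match accepted {
            None => break, // line search failed; caller should inspect the gradient
            Some((trial, f_new)) => {
                let done = (f_old - f_new).abs() < conv;
                c = trial;
                f_old = f_new;
                if done {
                    break;
                }
            }
        }
    }
    c
}
```

At the returned coefficients, $\mathbb{E}_p[\phi]$ matches the weighted empirical mean of $\phi(x_i)$, which is the stationarity condition $\nabla\ell=0$ for $\lambda=0$.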

Correctness

Convexity. $\log C(c)$ is the log-partition of an exponential family and is convex (its Hessian $\operatorname{Var}_p[\phi]$ is PSD). Therefore $-\ell$ is convex, and $F=-\ell+\lambda c^\top K c$ is convex (strictly, on $\mathbf{1}^\perp$ , when $K\succeq 0$ ). Any stationary point of the reduced problem in $\tilde c$ is the unique global minimizer.

Descent direction. Since $\tilde H\succ 0$ , $\tilde g^\top\Delta\tilde c=-\tilde g^\top\tilde H^{-1}\tilde g<0\quad\text{whenever }\tilde g\neq 0,$ so $\Delta c$ is a strict descent direction for $F$ (lines 227–238 check the slope and fall back to $-\tilde g$ if loss of numerical precision makes $\cos\angle<0$ , lines 340–343).

Monotone convergence. Provided the line search returns $\alpha>0$ with $F(c+\alpha\Delta c)<F(c)$ (Wolfe-style interpolation in stepit), the sequence $\{F(c^{(k)})\}$ is decreasing and bounded below (since $F$ is coercive on the identifiable subspace), hence converges. When each accepted step also satisfies a sufficient-decrease condition, strict convexity on $\mathbf{1}^\perp$ gives $c^{(k)}\to c^\star$ , the unique minimizer.

Normalization. At return, $C=\int_\Omega\exp(\phi^\top c^\star)\,dx$ is computed to tolerance $10^{-7}$ by Romberg extrapolation, so $\int_\Omega p(x;c^\star)\,dx=1$ by construction.

Output

density_mpl returns list(Wfdobj, C, Flist, iternum, iterhist) with $\widehat p(x)=\frac{\exp\!\bigl(W_{c^\star}(x)\bigr)}{C},\quad x\in\Omega,$ the unique maximizer of the penalized log-likelihood in the chosen log-spline family.

Bugs in the R reference

Porting the algorithm to Rust surfaced two issues in fda::density.fd (which is what dda::density_mpl_legacy ports verbatim) that the new density_mpl_rust backend corrects.

1. Hessian scaling for frequency-weighted input

Let $F=\sum_i f_i$ be the total weight. Differentiating $-\ell$ twice in $c$ : $-\partial_c^2\ell(c) = F\,\partial_c^2\log C(c) = F\operatorname{Var}_p[\phi(X)].$ The Fisher information is $F\operatorname{Var}_p[\phi]$ , not $N\operatorname{Var}_p[\phi]$ , where $N=$ nobs is the number of observations. In Varfnden (R/density-mpl-legacy.R) the code computes

Varphi <- nobs*(EDwDwt - outer(EDw,EDw))

i.e. it always scales by nobs = length(x), regardless of $F$ .

For a one-column input ( $f_i\equiv 1$ , $F=N=$ nobs), this matches the correct formula.
For a two-column input with frequencies normalized to $\sum_i f_i=1$ (see the m == 2 branch of density_mpl_legacy), we have $F=1$ but nobs $=N$ , so the $\operatorname{Var}_p[\phi]$ term of the Hessian is over-scaled by a factor of $N$ . The Newton step $-H^{-1}g$ is therefore under-sized by up to a factor of $N$ (exactly $N$ when $\lambda=0$ ).

In R, fda::stepit masks this by aggressively expanding the trial step (its line search is Wolfe-style with quadratic interpolation). A plain backtracking line search, as in the Rust port, never expands the step, so the descent stalls. The fix in density.rs::varfnden replaces nobs with $F=\sum_i f_i$ : $H(c)=F\operatorname{Var}_p[\phi]+2\lambda K .$
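As a worked check of the factor $N$ , consider the unpenalized case $\lambda=0$ with normalized frequencies ( $F=1$ ) and, purely for illustration, $N=100$ observations. Writing $\tilde V=Z^\top\operatorname{Var}_p[\phi]Z$ , the two reduced steps are $\Delta\tilde c_{\text{correct}}=-\tilde V^{-1}\tilde g,\qquad \Delta\tilde c_{\text{legacy}}=-(N\tilde V)^{-1}\tilde g=\tfrac{1}{N}\,\Delta\tilde c_{\text{correct}} .$ Under the local quadratic model of the objective along the Newton direction, $m(\alpha)=\alpha\,\tilde g^\top\Delta\tilde c_{\text{correct}}+\tfrac12\alpha^2\,\Delta\tilde c_{\text{correct}}^\top\tilde V\,\Delta\tilde c_{\text{correct}}=-q\bigl(\alpha-\tfrac12\alpha^2\bigr),\qquad q=\tilde g^\top\tilde V^{-1}\tilde g,$ the full Newton step ( $\alpha=1$ ) decreases the model by $q/2$ , while a unit legacy step ( $\alpha=1/N$ ) decreases it by $q\bigl(\tfrac1N-\tfrac1{2N^2}\bigr)$ . The ratio is $\tfrac2N-\tfrac1{N^2}=0.0199$ for $N=100$ , so an un-expanded legacy step achieves about 2% of the Newton decrease. This is a second-order approximation near the current iterate, not a statement about the iteration counts of either implementation.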

2. Early termination at non-stationary points

The R loop in density_mpl_legacy has two exit conditions:

if (abs(Flist$f - Foldstr$f) < conv) { ... return }
if (Flist$f >= Foldstr$f) break          # not a return: just exits

The second branch fires whenever the line search fails to find a step that strictly decreases $F$ . This is not the same as reaching a stationary point: stepit can fail purely from numerical limitations of its interpolation while $\|\tilde g\|$ is still on the order of $10^{-1}$ to $10^{-4}$ . The function then returns the current $c$ regardless.

For test problems where the true minimizer lies close to an active box constraint or where the Hessian is poorly conditioned along weakly identified directions (e.g. B-spline boundary coefficients with no data nearby), density_mpl_legacy terminates with $\|\tilde g\|$ far above any reasonable stationarity tolerance, and the returned $c$ is not the optimum of the penalized log-likelihood.
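One way to keep a failed line search from being mistaken for convergence is to test the reduced gradient norm before declaring success. The sketch below is hypothetical: the `Exit` enum, the `classify` function and the `grad_tol` parameter are illustrative names and are not taken from either implementation.

```rust
/// Possible outcomes after one line search (illustrative names).
#[derive(Debug, PartialEq)]
enum Exit {
    Continue,
    Converged,
    Stalled,
}

/// Classify the state after a line search from f_old to f_new, given the
/// norm of the reduced gradient at the new point.
fn classify(f_old: f64, f_new: f64, grad_norm: f64, conv: f64, grad_tol: f64) -> Exit {
    if f_new >= f_old {
        // No strict decrease: only a small gradient indicates a stationary point.
        if grad_norm < grad_tol {
            Exit::Converged
        } else {
            Exit::Stalled
        }
    } else if (f_old - f_new).abs() < conv && grad_norm < grad_tol {
        Exit::Converged
    } else {
        Exit::Continue
    }
}
```

Reporting `Stalled` separately lets a caller retry with a different direction or flag the fit as unreliable instead of returning it silently.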

The Rust port uses Armijo backtracking on the (correctly scaled) Newton direction and converges to $\|\tilde g\|<10^{-7}$ in our tests. Where the R and Rust solutions disagree, the Rust solution has the strictly smaller objective value.
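For reference, Armijo backtracking has the following general shape. This is a generic sketch, not the code in density.rs: the function name, the halving factor 0.5 and the sufficient-decrease constant $10^{-4}$ are conventional choices assumed here.

```rust
/// Armijo backtracking: starting from alpha = 1, halve alpha until
/// obj(c + alpha*d) <= obj(c) + 1e-4 * alpha * slope, where slope = g^T d.
/// Returns the accepted step length and point, or None on failure.
fn armijo<G: Fn(&[f64]) -> f64>(
    obj: G,
    c: &[f64],
    d: &[f64],
    slope: f64,
    max_halvings: usize,
) -> Option<(f64, Vec<f64>)> {
    if slope >= 0.0 {
        return None; // d is not a descent direction
    }
    let f0 = obj(c);
    let mut alpha: f64 = 1.0;
    for _ in 0..max_halvings {
        let trial: Vec<f64> = c
            .iter()
            .zip(d)
            .map(|(&ci, &di)| ci + alpha * di)
            .collect();
        if obj(trial.as_slice()) <= f0 + 1e-4 * alpha * slope {
            return Some((alpha, trial));
        }
        alpha *= 0.5;
    }
    None
}
```

Because the step is only ever shortened from $\alpha=1$, this search relies on the Newton direction being correctly scaled, which is why the Hessian fix in item 1 matters for it.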

Practical consequence

For unpenalized fits on data that does not cover the basis range (e.g. a $\mathcal N(0,1)$ sample with $\Omega=[-4,4]$ ), the two implementations can return densities that disagree by $O(1)$ outside the data support. The problem is ill-posed there: the boundary coefficients of the B-spline basis are weakly identified, so the objective is nearly flat in those directions and two near-optimal solutions can differ substantially. Inside the data support the densities agree to under 1% relative difference.
