Skip to contents

Model

Given observations x1,,xnx_1,\dots,x_n with weights fif_i (F=ifiF=\sum_i f_i) on an interval Ω=[a,b]\Omega=[a,b], and a basis ϕ(x)=(ϕ1(x),,ϕK(x))\phi(x)=(\phi_1(x),\dots,\phi_K(x))^\top, the density is parameterized by cKc\in\mathbb{R}^K as p(x;c)=exp(ϕ(x)c)C(c),C(c)=Ωexp(ϕ(x)c)dx. p(x;c)=\frac{\exp\!\bigl(\phi(x)^\top c\bigr)}{C(c)},\qquad C(c)=\int_\Omega \exp\!\bigl(\phi(x)^\top c\bigr)\,dx .

The penalized negative log-likelihood minimized is F(c)=ifiϕ(xi)c+FlogC(c)(c)+λcKc,K=Ω(Lϕ)(Lϕ)dx, F(c)=\underbrace{-\sum_i f_i\,\phi(x_i)^\top c+F\log C(c)}_{-\ell(c)}+\lambda\,c^\top K c, \qquad K=\int_\Omega (L\phi)(L\phi)^\top\,dx, where LL is the differential operator carried by WfdParobj$Lfd.

Gradient and expected Hessian

Differentiating logC\log C gives clogC(c)=𝔼p[ϕ(X)],c2logC(c)=Varp[ϕ(X)]. \nabla_c\log C(c)=\mathbb{E}_p[\phi(X)],\qquad \nabla_c^2\log C(c)=\operatorname{Var}_p[\phi(X)]. Hence (c)=ifi(ϕ(xi)𝔼p[ϕ]),2(c)=FVarp[ϕ]. \nabla\ell(c)=\sum_i f_i\bigl(\phi(x_i)-\mathbb{E}_p[\phi]\bigr),\qquad -\nabla^2\ell(c)=F\operatorname{Var}_p[\phi]. With the penalty, g(c)=(c)+2λKc,H(c)=FVarp[ϕ]+2λK. g(c)=-\nabla\ell(c)+2\lambda K c,\qquad H(c)=F\operatorname{Var}_p[\phi]+2\lambda K . These are exactly loglfnden (gradient) and Varfnden+2K2K (expected Hessian); the integrals CC, 𝔼p[ϕ]\mathbb{E}_p[\phi], 𝔼p[ϕϕ]\mathbb{E}_p[\phi\phi^\top] are computed by Romberg integration in normden.phi, expectden.phi, expectden.phiphit.

Identifiability

For a B-spline (or any partition-of-unity) basis, jϕj1\sum_j\phi_j\equiv 1, so shifting cc+a𝟏c\mapsto c+a\mathbf{1} shifts ϕc\phi^\top c by the constant aa, which is absorbed into CC. The likelihood is therefore invariant along the direction 𝟏\mathbf{1}. Let ZK×(K1)Z\in\mathbb{R}^{K\times(K-1)} be an orthonormal basis of 𝟏\mathbf{1}^\perp (zerobasis(nbasis)). Reparametrize c=Zc̃c=Z\tilde c; the reduced problem g̃=Zg,H̃=ZHZ \tilde g=Z^\top g,\qquad \tilde H=Z^\top H Z has H̃0\tilde H\succ 0 (since Varp[ϕ]\operatorname{Var}_p[\phi] is PD on 𝟏\mathbf{1}^\perp, and the λK\lambda K term is PSD).

  1. Initialize cc0c\leftarrow c_0 (from WfdParobj$fd).
  2. Compute g,Hg,H; reduce to g̃,H̃\tilde g,\tilde H; Newton step Δc̃=H̃1g̃\Delta\tilde c=-\tilde H^{-1}\tilde g, Δc=ZΔc̃\Delta c=Z\Delta\tilde c.
  3. Backtracking/interpolating line search (stepit) on αF(c+αΔc)\alpha\mapsto F(c+\alpha\Delta c) subject to box constraints c[50,400]Kc\in[-50,400]^K (stepchk).
  4. Update cc+αΔcc\leftarrow c+\alpha\Delta c; stop when |FnewFold|<conv|F^{\text{new}}-F^{\text{old}}|<\text{conv} or iterlim reached.
  5. Return Wc=ϕcW_c=\phi^\top c and C=C(c)C=C(c).

Correctness

Convexity. logC(c)\log C(c) is the log-partition of an exponential family and is convex (its Hessian Varp[ϕ]\operatorname{Var}_p[\phi] is PSD). Therefore -\ell is convex, and F=+λcKcF=-\ell+\lambda c^\top K c is convex (strictly, on 𝟏\mathbf{1}^\perp, when K0K\succeq 0). Any stationary point of the reduced problem in c̃\tilde c is the unique global minimizer.

Descent direction. Since H̃0\tilde H\succ 0, g̃Δc̃=g̃H̃1g̃<0whenever g̃0, \tilde g^\top\Delta\tilde c=-\tilde g^\top\tilde H^{-1}\tilde g<0\quad\text{whenever }\tilde g\neq 0, so Δc\Delta c is a strict descent direction for FF (lines 227–238 check the slope and fall back to g̃-\tilde g if numerical loss makes cos<0\cos\angle<0, line 340–343).

Monotone convergence. The line search returns α>0\alpha>0 with F(c+αΔc)<F(c)F(c+\alpha\Delta c)<F(c) (Wolfe-style interpolation in stepit); the sequence {F(c(k))}\{F(c^{(k)})\} is decreasing and bounded below (since FF is coercive on the identifiable subspace), hence converges. Combined with strict convexity on 𝟏\mathbf{1}^\perp, c(k)cc^{(k)}\to c^\star, the unique minimizer.

Normalization. At return, C=Ωexp(ϕc)dxC=\int_\Omega\exp(\phi^\top c^\star)\,dx is computed to tolerance 10710^{-7} by Romberg extrapolation, so Ωp(x;c)dx=1\int_\Omega p(x;c^\star)\,dx=1 by construction.

Output

density_mpl returns list(Wfdobj, C, Flist, iternum, iterhist) with p̂(x)=exp(Wc(x))C,xΩ, \widehat p(x)=\frac{\exp\!\bigl(W_{c^\star}(x)\bigr)}{C},\quad x\in\Omega, the unique maximizer of the penalized log-likelihood in the chosen log-spline family.

Bugs in the R reference

Porting the algorithm to Rust surfaced two issues in fda::density.fd (which is what dda::density_mpl_legacy ports verbatim) that the new density_mpl_rust backend corrects.

1. Hessian scaling for frequency-weighted input

Let F=ifiF=\sum_i f_i be the total weight. Differentiating -\ell twice in cc: c2(c)=Fc2logC(c)=FVarp[ϕ(X)]. -\partial_c^2\ell(c) = F\,\partial_c^2\log C(c) = F\operatorname{Var}_p[\phi(X)]. The Fisher information is FVarp[ϕ]F\operatorname{Var}_p[\phi], not NVarp[ϕ]N\operatorname{Var}_p[\phi]. In Varfnden (R/density-mpl-legacy.R) the code computes

Varphi <- nobs*(EDwDwt - outer(EDw,EDw))

i.e. it always scales by nobs = length(x), regardless of FF.

  • For a one-column input (fi1f_i\equiv 1, F=N=F=N=nobs), this matches the correct formula.
  • For a two-column input with frequencies normalized to ifi=1\sum_i f_i=1 (see the m == 2 branch of density_mpl_legacy), we have F=1F=1 but nobs=N=N, so the Hessian is over-scaled by a factor of NN. The Newton step H1g-H^{-1}g is therefore under-sized by a factor NN.

In R, fda::stepit masks this by aggressively expanding the trial step (its line search is Wolfe-style with quadratic interpolation). A plain backtracking line search — as in the Rust port — does not expand and the descent stalls. The fix in density.rs::varfnden replaces nobs with F=fiF=\sum f_i: H(c)=FVarp[ϕ]+2λK. H(c)=F\operatorname{Var}_p[\phi]+2\lambda K .

2. Early termination at non-stationary points

The R loop in density_mpl_legacy has two exit conditions:

if (abs(Flist$f - Foldstr$f) < conv) { ... return }
if (Flist$f >= Foldstr$f) break          # not a return: just exits

The second branch fires whenever the line search fails to find a step that strictly decreases FF. This is not the same as reaching a stationary point — stepit can fail purely from numerical limitations of its interpolation while g̃\|\tilde g\| is still on the order of 10110^{-1} to 10410^{-4}. The function then returns the current cc regardless.

For test problems where the true minimizer lies close to an active box constraint or where the Hessian is poorly conditioned along weakly identified directions (e.g. B-spline boundary coefficients with no data nearby), density_mpl_legacy terminates with g̃ϵ\|\tilde g\|\gg \epsilon and the returned cc is not the optimum of the penalized log-likelihood.

The Rust port uses Armijo backtracking on the (correctly scaled) Newton direction and converges to g̃<107\|\tilde g\|<10^{-7} in our tests. Where the R and Rust solutions disagree, the Rust solution has the strictly smaller objective value.

Practical consequence

For un-penalized fits on data that doesn’t cover the basis range (e.g. a 𝒩(0,1)\mathcal N(0,1) sample with Ω=[4,4]\Omega=[-4,4]), the two implementations can return densities that disagree by O(1)O(1) outside the data support — both are local optima of an ill-posed problem and the boundary coefficients of the B-spline basis are weakly identified. Inside the data support the densities agree to under 1% relative difference.