- ¹Department of Psychology, University of Minnesota, Minneapolis, MN, United States
- ²School of Statistics, University of Minnesota, Minneapolis, MN, United States
Many applied studies collect one or more ordered categorical predictors, which do not fit neatly within classic regression frameworks. In most cases, ordinal predictors are treated as either nominal (unordered) variables or metric (continuous) variables in regression models, which is theoretically and/or computationally undesirable. In this paper, we discuss the benefit of taking a smoothing spline approach to the modeling of ordinal predictors. The purpose of this paper is to provide theoretical insight into the ordinal smoothing spline, as well as examples revealing the potential of the ordinal smoothing spline for various types of applied research. Specifically, we (i) derive the analytical form of the ordinal smoothing spline reproducing kernel, (ii) propose an ordinal smoothing spline isotonic regression estimator, (iii) prove an asymptotic equivalence between the ordinal and linear smoothing spline reproducing kernel functions, (iv) develop large sample approximations for the ordinal smoothing spline, and (v) demonstrate the use of ordinal smoothing splines for isotonic regression and semiparametric regression with multiple predictors. Our results reveal that the ordinal smoothing spline offers a flexible approach for incorporating ordered predictors in regression models, and has the benefit of being invariant to any monotonic transformation of the predictor scores.
1. Introduction
1.1. Motivation
The General Linear Model (GLM) (see [1]) is one of the most widely applied statistical methods, with applications common in psychology [2], education [3], medicine [4], business [5], and several other disciplines. The GLM's popularity in applied research is likely due to a combination of the model's interpretability and flexibility, as well as its easy availability through R [6] and commercial statistical software (e.g., SAS, SPSS). The GLM and its generalized extension (GzLM; see [7]) are well-equipped for modeling relationships between variables of mixed types, i.e., unordered categorical and continuous variables can be simultaneously included as predictors in a regression model. However, many studies collect one (or more) ordered categorical variables, which do not fit neatly within the GLM framework.
For example, in finance it is typical to rate the risk of investments on an ordinal scale (very low risk, low risk, …, high risk, very high risk), and a typical goal is to model expected returns given an investment's risk. In medical studies, severity of symptoms (very low, low, …, high, very high) and adherence to treatment (never, rarely, …, almost always, always) are often measured on ordinal scales, and a typical goal is to study patient outcomes in response to different treatments after controlling for symptom severity and treatment adherence. Psychological attributes such as personality traits and intelligence are typically measured on an ordinal scale, e.g., using questionnaires consisting of Likert scale items (strongly disagree, disagree, …, agree, strongly agree), and many psychological studies hope to understand how individual and group differences in psychological traits relate to differences in observed behavioral outcomes.
The examples mentioned in the previous paragraph represent just a glimpse of the many ways in which ordinal variables are relevant to our day-to-day financial, physical, and mental health. When it comes to modeling ordinal outcome (response) variables, there are a multitude of potential methods discussed in the literature (see [8-12]). However, when it comes to including ordinal variables as predictors in a GLM (or GzLM), the choices are slim. In nearly all cases, ordinal predictors are treated as either nominal (unordered) or continuous variables in regression models, which can lead to convoluted and possibly misleading results.
Suppose X is an ordered categorical variable with K categories (levels) 1 < ⋯ < K, and suppose we want to include X as a predictor in a regression model. The naive method would be to include K − 1 dummy variables in the model design matrix, and let the intercept absorb the K-th level's effect. We refer to this method as naive for two reasons: (i) this approach ignores the fact that the levels of X are ordered and, instead, parameterizes X as K − 1 unrelated variables, and (ii) this approach makes it cumbersome (and possibly numerically unstable) to examine interaction effects involving ordinal predictors, which could be of interest in a variety of studies. Although the dummy coding approach will suffice for certain applications, this method is far from ideal. Furthermore, if the number of levels K is large, the dummy coding approach could be infeasible.
Another possibility for including an ordinal predictor X in a regression model is to simply treat X as a continuous variable. In some cases, researchers make an effort to code the levels of an ordinal predictor X such that the relationship between X and Y is approximately linear. This approach is parsimonious because the ordinal predictor with K categories (requiring K − 1 coefficients) will be reduced to a continuous predictor with a linear effect (requiring 1 coefficient). However, this approach is problematic for several reasons: (i) the slope coefficient in such a model has no clear meaning, (ii) different researchers could concoct different coding schemes for the same data, which would hinder research comparability and reproducibility, and (iii) the ordinal nature of the predictor X is ignored, which is undesirable.
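To make these two conventions concrete, the short R sketch below (using hypothetical simulated data) codes the same ordinal predictor as K − 1 dummy variables and as integer scores; neither coding respects both the ordering and the discreteness of the predictor.

```r
# Hypothetical example: x is an ordinal predictor with K = 5 levels, y is a response
set.seed(1)
x <- factor(sample(1:5, 100, replace = TRUE), levels = 1:5, ordered = TRUE)
y <- sqrt(as.integer(x)) + rnorm(100, sd = 0.5)

# (a) nominal treatment: K - 1 dummy variables (the ordering of the levels is ignored)
fit.dummy <- lm(y ~ factor(x, ordered = FALSE))

# (b) metric treatment: integer scores with a single linear slope
fit.metric <- lm(y ~ as.integer(x))

coef(fit.dummy)   # intercept plus K - 1 unrelated level effects
coef(fit.metric)  # a single slope whose value depends on the arbitrary scores
```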
Penalized regression provides a promising framework for including ordinal predictors in regression models [13], given that an appropriate penalty can simultaneously induce order information on the solution and stabilize the estimation. Gertheiss and Tutz [13] discuss how a binary design matrix in combination with a squared difference penalty can be used to fit regression models with ordinal predictors. This approach is implemented in the R package ordPens [14], which fits models containing additive effects of ordinal and metric (continuous) predictors. Adding a penalty to impose order and stabilize the solution is a promising approach, but this method still parameterizes X as K − 1 unrelated variables in the model design matrix. As a result, this approach (and the ordPens R package) offers no method for examining interaction effects between multiple ordinal predictors or interaction effects between ordinal and metric predictors.
1.2. Purpose
In this paper, we discuss the benefits of taking a smoothing spline approach [15, 16] to the modeling of ordinal predictors. This approach has been briefly mentioned as a possibility [15, p. 34], but a thorough treatment of the ordinal smoothing spline is lacking from the literature. Expanding the work of Gertheiss and Tutz [13] and Gu [15], we (Section 3.1) derive the analytical form of the reproducing kernel function corresponding to the ordinal smoothing spline, which makes it possible to efficiently compute ordinal smoothing splines. Our results reveal that the reproducing kernel function only depends on rank information, so the ordinal smoothing spline estimator is invariant to any monotonic transformation of the predictor scores. We also (Section 3.2) propose an ordinal smoothing spline isotonic regression estimator via factoring the reproducing kernel function into monotonic (step) functions. Furthermore, we (Section 3.3) prove a correspondence between the ordinal and linear smoothing spline for large values of K, and (Section 3.4) develop computationally scalable approximations for fitting ordinal smoothing splines with large values of K. Finally, we demonstrate the potential of the ordinal smoothing spline for applied research via a simulation study and two real data examples. Our simulation study (Section 4) reveals that the ordinal smoothing spline can outperform the linear smoothing spline and classic isotonic regression algorithms when analyzing monotonic functions with various degrees of smoothness. Our real data results (Section 5) demonstrate that the ordinal smoothing spline—in combination with the powerful smoothing spline ANOVA framework [15]—provides an appealing approach for including ordinal predictors in regression models.
2. Smoothing Spline Background
2.1. Reproducing Kernels
Let ℋ denote a Hilbert space of functions on domain 𝒳, and let 〈·, ·〉 denote the inner-product of ℋ. According to the Riesz representation theorem [17, 18], if L is a continuous linear functional on a Hilbert space ℋ, then there exists a unique function ζL ∈ ℋ such that Lη = 〈ζL, η〉 for all η ∈ ℋ. The function ζL is referred to as the representer of the functional L. The evaluation functional [x]η = η(x) evaluates a function η ∈ ℋ at an input x ∈ 𝒳. If the evaluation functional [x]η = η(x) is continuous in ℋ for all x ∈ 𝒳, then the space ℋ is referred to as a reproducing kernel Hilbert space (RKHS) (see [15, 16]). If ℋ is a RKHS with inner-product 〈·, ·〉, then there exists a unique function ρx ∈ ℋ that is the representer of the evaluation functional [x](·), such that [x]η = 〈ρx, η〉 = η(x) for all η ∈ ℋ and all x ∈ 𝒳. The symmetric and non-negative definite function
ρ(x, y) = 〈ρx, ρy〉    (1)
is called the reproducing kernel of ℋ because it satisfies the reproducing property 〈ρ(x, y), η(y)〉 = η(x) for all η ∈ ℋ and all x ∈ 𝒳, see Aronszajn [19]. If ℋ = ℋ0 ⊕ ℋ1 with ℋ0 ∩ ℋ1 = {0}, then
ρ = ρ0 + ρ1    (2)
where ρ0 and ρ1 are the reproducing kernels of ℋ0 and ℋ1, respectively, such that
〈ρk(x, y), ηk(y)〉k = ηk(x) for k ∈ {0, 1}    (3)
with 〈·, ·〉 = 〈·, ·〉0 + 〈·, ·〉1, η = η0 + η1, η0 ∈ ℋ0, and η1 ∈ ℋ1.
2.2. Smoothing Spline Definition
Assume a nonparametric regression model (see [15, 16, 20-23])
yi = η(xi) + ϵi    (4)
for i = 1, …, n, where yi ∈ ℝ is the real-valued response for the i-th observation, xi ∈ 𝒳 is the predictor for the i-th observation where 𝒳 is the predictor domain, η ∈ ℋ is an unknown smooth function where ℋ is a RKHS with inner-product 〈·, ·〉, and ϵi ~ N(0, σ²) is a Gaussian error term. A smoothing spline is the function ηλ that minimizes the penalized least squares functional
(1/n) Σi (yi − η(xi))² + λ J(η)    (5)
where J(·) is a quadratic penalty functional such that larger values correspond to less smoothness, and λ ≥ 0 is a smoothing parameter that balances the trade-off between fitting and smoothing the data. Let ℋ = ℋ0 ⊕ ℋ1 denote a tensor sum decomposition of ℋ, where ℋ0 is the null space of J and ℋ1 is the contrast space. Similarly, let 〈·, ·〉 = 〈·, ·〉0 + 〈·, ·〉1 denote the corresponding decomposition of the inner-product of ℋ. Note that, by definition, the quadratic penalty functional J is the inner-product of the contrast space ℋ1, i.e., J(η) = 〈η, η〉1. Given λ, the Kimeldorf-Wahba representer theorem [24] states that the ηλ minimizing Equation (5) has the form
ηλ(x) = Σv dv ϕv(x) + Σi ci ρ1(x, xi)    (6)
where the functions ϕ1, …, ϕm span the penalty's null space ℋ0, the function ρ1 is the reproducing kernel of the contrast space ℋ1, and d = {dv} and c = {ci} are the unknown coefficient vectors (see [15, 16, 25]). The reproducing property implies that the quadratic penalty functional J(ηλ) = 〈ηλ, ηλ〉1 has the form
J(ηλ) = Σi Σj ci cj ρ1(xi, xj)    (7)
which can be easily evaluated given ρ1, c, and x.
2.3. Types of Splines
The type of spline will depend on the forms of the RKHS ℋ and the inner-product 〈·, ·〉, which will depend on the predictor domain 𝒳. The essential components for the formation of a smoothing spline include: (i) form a tensor sum decomposition of the RKHS ℋ = ℋ0 ⊕ ℋ1, (ii) partition the inner-product of the RKHS 〈·, ·〉 = 〈·, ·〉0 + 〈·, ·〉1 such that 〈·, ·〉k defines an inner-product in ℋk for k ∈ {0, 1}, (iii) define the smoothing spline penalty as J(η) = 〈η, η〉1, and (iv) derive the reproducing kernels of ℋ0 and ℋ1. Table 1 provides the information needed to form three common smoothing splines (nominal, polynomial, and thin-plate), as well as the ordinal smoothing spline. See Gu [15] and Helwig and Ma [26] for more information about nominal and polynomial smoothing splines, and see Gu [15], Helwig and Ma [26], Duchon [27], Meinguet [28], and Wood [29] for more information about thin-plate splines. More information about the ordinal smoothing spline will be provided in Section 3.
2.4. Tensor Product Smoothers
Suppose we have a model with predictor vector X = (X1, …, Xd) where X ∈ 𝒳 = 𝒳1 × ⋯ × 𝒳d. If the predictors are all continuous with similar scale, a thin-plate spline can be used. However, if the predictors differ in type and/or scale (e.g., some nominal and some continuous), another approach is needed. Let x = (x1, …, xd) and y = (y1, …, yd) denote two realizations of X and let
ρ⟨j⟩(xj, yj)    (8)
denote the reproducing kernel function corresponding to ℋ⟨j⟩, which denotes a RKHS of functions on 𝒳j for j = 1, …, d. The symmetric and non-negative definite function
ρ(x, y) = Πj=1d ρ⟨j⟩(xj, yj)    (9)
defines the unique tensor product reproducing kernel function corresponding to the tensor product RKHS
ℋ = ℋ⟨1⟩ ⊗ ⋯ ⊗ ℋ⟨d⟩    (10)
where ℋ = ℋ0 ⊕ ℋ1 with ℋ0 denoting the tensor product null space and ℋ1 denoting the tensor product contrast space.
A tensor product smoothing spline solves the penalized least squares problem in Equation (5) in a tensor product RKHS ℋ. In this case, the contrast space can be decomposed such as ℋ1 = ℋ11 ⊕ ⋯ ⊕ ℋ1s, where the ℋ1k are orthogonal subspaces with inner-products 〈·, ·〉1k. The different ℋ1k may have different scales, so it is helpful to reweight the relative influence of each ℋ1k. Specifically, when solving the penalized least squares functional, we can define
ρ1(x, y) = Σk=1s θk ρ1k(x, y)    (11)
where ρ1k is the reproducing kernel of ℋ1k and θk > 0 are additional (non-negative) smoothing parameters that control the influence of each ℋ1k [see 15, 25, 26]. We can decompose any η ∈ ℋ such as
η(x) = η0(x) + Σk=1s η1k(x)    (12)
where η0 ∈ ℋ0 and η1k ∈ ℋ1k for k = 1, …, s. Note that the different functions η1k correspond to different nonparametric main and interaction effects between predictors, so different statistical models can be fit by removing different subspaces ℋ1k (and corresponding η1k) from the model.
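As a toy illustration of this construction, the R sketch below builds marginal contrast kernels for two nominal predictors (using the standard shrinkage kernel 1{x = y} − 1/K purely for illustration), forms the interaction kernel as their elementwise product, and combines the subspace kernels with θ weights; the variable names are hypothetical.

```r
# Toy sketch: marginal contrast kernels, their elementwise product (interaction),
# and a theta-weighted combination of the contrast subspace kernels.
rho.nominal <- function(x, y, K) (x == y) - 1 / K   # nominal shrinkage kernel (illustrative)

set.seed(1)
n <- 8
x1 <- sample(1:3, n, replace = TRUE)   # first nominal predictor (3 levels)
x2 <- sample(1:4, n, replace = TRUE)   # second nominal predictor (4 levels)

Q1  <- outer(x1, x1, rho.nominal, K = 3)   # kernel for the x1 main effect subspace
Q2  <- outer(x2, x2, rho.nominal, K = 4)   # kernel for the x2 main effect subspace
Q12 <- Q1 * Q2                             # kernel for the x1:x2 interaction subspace

theta <- c(1, 1, 1)                        # subspace-specific smoothing weights
Qcontrast <- theta[1] * Q1 + theta[2] * Q2 + theta[3] * Q12
```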
2.5. Smoothing Spline Computation
Applying the Kimeldorf-Wahba representer theorem, we can approximate Equation (5) as
(1/n) ‖y − Kd − Jc‖² + λ c′Qc    (13)
where y = (y1, …, yn)′ is the response vector, K = {ϕv(xi)}n×m is the null space basis function matrix, J = {ρ1(xi, x*j)}n×R denotes the contrast space basis function matrix with {x*j}j=1R denoting the selected knots, and Q = {ρ1(x*i, x*j)}R×R denotes the penalty matrix. The full solution uses all unique xi as knots, but using a subset of R ≪ n knots has been shown to perform well in practice, as long as enough knots are included to reasonably span 𝒳 [26, 30-32]. Given λ, the optimal coefficients are
(d̂′, ĉ′)′ = (Ψ′Ψ + nλ Q*)† Ψ′y    (14)
where Ψ = (K, J), Q* is the block-diagonal matrix with 0m×m and Q on the diagonal, and (·)† denotes the Moore-Penrose pseudoinverse [33, 34]. The fitted values can be written as
ŷ = Kd̂ + Jĉ = Sλ y    (15)
where
Sλ = Ψ (Ψ′Ψ + nλ Q*)† Ψ′    (16)
is the smoothing matrix, which depends on λ and any θk parameters embedded in ρ1. The smoothing parameters can be obtained by minimizing the Generalized Cross-Validation (GCV) criterion [35]
GCV(λ) = n ‖(In − Sλ)y‖² / [tr(In − Sλ)]²    (17)
where tr(Sλ) is often considered the effective degrees of freedom of a smoothing spline.
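A minimal R sketch of these computations is given below, assuming the basis matrices Kmat (n × m), Jmat (n × R), and the penalty matrix Qmat (R × R) have already been constructed from the chosen reproducing kernel and knots; the helper function fit.ss is hypothetical and uses a dense pseudoinverse purely for clarity rather than efficiency.

```r
# Minimal sketch of the penalized least squares computations in this subsection.
# Inputs: response y, null space basis Kmat (n x m), contrast basis Jmat (n x R),
# and penalty matrix Qmat (R x R); lambda is the global smoothing parameter.
fit.ss <- function(y, Kmat, Jmat, Qmat, lambda) {
  n <- length(y); m <- ncol(Kmat); R <- ncol(Qmat)
  Psi <- cbind(Kmat, Jmat)                              # combined basis matrix
  Pen <- matrix(0, m + R, m + R)
  Pen[(m + 1):(m + R), (m + 1):(m + R)] <- Qmat         # block-diagonal penalty
  Cmat <- MASS::ginv(crossprod(Psi) + n * lambda * Pen) # Moore-Penrose pseudoinverse
  coefs <- Cmat %*% crossprod(Psi, y)                   # coefficients, cf. Equation (14)
  Smat <- Psi %*% Cmat %*% t(Psi)                       # smoothing matrix, cf. Equation (16)
  yhat <- as.numeric(Smat %*% y)                        # fitted values, cf. Equation (15)
  gcv <- n * sum((y - yhat)^2) / (n - sum(diag(Smat)))^2  # GCV criterion
  list(coefs = coefs, fitted = yhat, Smat = Smat, edf = sum(diag(Smat)), gcv = gcv)
}

# smoothing parameter selection: minimize the GCV criterion over a grid of lambdas
# lambdas <- 10^seq(-8, 2, length.out = 50)
# gcvs <- sapply(lambdas, function(l) fit.ss(y, Kmat, Jmat, Qmat, l)$gcv)
```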
2.6. Bayesian Interpretation
The smoothing spline solution in Equation (15) can be interpreted as a Bayesian estimate of a Gaussian process [36-40]. Suppose that η = η0 + η1 where ηk ∈ ℋk for k ∈ {0, 1}. Furthermore, assume that (i) η0 = Σv dvϕv where the coefficients dv have independent N(0, τ²) priors, (ii) η1 is a mean-zero Gaussian process whose covariance function is proportional to the contrast space reproducing kernel ρ1, and (iii) η0 and η1 are independent of one another and of the error term ϵ. Applying Bayes's Theorem, the posterior distribution of β = (d′, c′)′ is multivariate normal, with mean and covariance determined by Ψ = (K, J), the prior variances, and the error variance σ². Letting τ² → ∞ corresponds to a diffuse prior on the null space coefficients, and setting the prior variance of the contrast space process proportional to σ²/(nλ) makes the posterior mean equivalent to the smoothing spline coefficient estimates from Equation (14). The corresponding Bayesian covariance matrix estimator is
Σβ = σ̂² (Ψ′Ψ + nλ Q*)†    (20)
where σ̂² is an estimate of the error variance. Using this Bayesian interpretation, one can form a 100(1 − α)% Bayesian “confidence interval” around the smoothing spline estimate
η̂λ(x) ± Zα/2 √(ψ′ Σβ ψ)    (21)
where ψ′ = (ϕ′, ρ′) and Zα/2 is a critical value from a standard normal distribution. These confidence intervals have a desirable across-the-function coverage property, such that the intervals are expected to contain about 100(1 − α)% of the function realizations, assuming that the GCV criterion has been used to select the smoothing parameters [37, 38, 40].
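For intuition, the sketch below computes one commonly used form of these intervals at the observed data points, where the pointwise posterior variance is approximated by σ̂² times the diagonal of the smoothing matrix; it reuses the hypothetical smoothing matrix from the sketch in Section 2.5 and is not the paper's exact coefficient-based formula.

```r
# Sketch of across-the-function Bayesian "confidence intervals" at the data points,
# using the pointwise variance approximation sigma2 * diag(Smat), where Smat is the
# smoothing matrix returned by the hypothetical fit.ss() sketch in Section 2.5.
bayes.ci <- function(y, Smat, alpha = 0.1) {
  n <- length(y)
  yhat <- as.numeric(Smat %*% y)
  edf <- sum(diag(Smat))
  sigma2 <- sum((y - yhat)^2) / (n - edf)   # error variance estimate
  se <- sqrt(sigma2 * diag(Smat))           # approximate pointwise posterior SD
  zval <- qnorm(1 - alpha / 2)
  cbind(fit = yhat, lwr = yhat - zval * se, upr = yhat + zval * se)
}
```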
3. Ordinal Smoothing Spline
3.1. Formulation
Real-valued functions on 𝒳 = {1, …, K} correspond to real-valued vectors of length K, where the function evaluation is equivalent to coordinate extraction, i.e., η(x) for x ∈ {1, …, K} can be viewed as extracting the x-th element of the vector η = [η(1), …, η(K)]′. Thus, equipped with an inner product, the space of functions on 𝒳 is a RKHS, denoted by ℋ, which can be decomposed into the tensor summation of two orthogonal subspaces: ℋ = ℋ0 ⊕ ℋ1. The null space ℋ0 consists of all constant functions (vectors of length K), and the contrast space ℋ1 consists of all functions (vectors of length K) that sum to zero across the K elements.
For ordinal predictors with 1 < ⋯ < K, we can use the penalty functional
J(η) = Σx=2K (η(x) − η(x − 1))²    (22)
which penalizes differences between adjacent values [13, 15]. Letting η = [η(1), …, η(K)]′, we can write the penalty functional as
J(η) = ‖Dη‖² = η′D′Dη    (23)
where D = {dij}(K−1)×K is a first order difference operator matrix with elements defined as
dij = −1 if j = i,  dij = 1 if j = i + 1,  and dij = 0 otherwise    (24)
As an example, with K = 5 the matrix D has the form
D = [ −1   1   0   0   0
       0  −1   1   0   0
       0   0  −1   1   0
       0   0   0  −1   1 ]    (25)
The penalty functional in Equation (22) corresponds to the inner-product
〈η, ν〉 = (1/K)(Σx=1K η(x))(Σx=1K ν(x)) + Σx=2K (η(x) − η(x − 1))(ν(x) − ν(x − 1))    (26)
where 〈η, ν〉0 = (1/K)(Σx η(x))(Σx ν(x)) is the inner product of the null space, and 〈η, ν〉1 = Σx=2K (η(x) − η(x − 1))(ν(x) − ν(x − 1)) is the inner product of the contrast space. Note that 〈η, ν〉 = 〈η, ν〉0 + 〈η, ν〉1 by definition. Furthermore, note that the ordinal smoothing spline estimates η within the same RKHS as the nominal smoothing spline (see Table 1); however, the ordinal smoothing spline uses a different definition of the inner-product for the contrast space, which induces a different reproducing kernel.
Theorem 3.1. Given the ordinal smoothing spline inner-product in Equation (26), the null space has reproducing kernel ρ0(x, y) = 1/K, and the contrast space has reproducing kernel
for x, y ∈ {1, …, K}, where 1{·} is an indicator function and x ∨ y = max(x, y) denotes the maximum of x and y.
See the Supplementary Material for a proof. Note that Theorem 3.1 provides the reproducing kernel needed to numerically evaluate the optimal ηλ from Equation (6) for the ordinal smoothing spline. Considering the model in Equation (4) with xi ∈ 𝒳 = {1, …, K} where 1 < ⋯ < K are elements of an ordered set, the penalized least squares problem has the form
(1/n) ‖y − d1n − GQc‖² + λ c′Qc    (27)
where y = (y1, …, yn)′ is the response vector, d is the unknown intercept, G = {gij}n×K is a selection matrix such that gij = 1 if xi = j for j ∈ {1, …, K} and gij = 0 otherwise, Q = {qij}K×K with qij = ρ1(i, j) for i, j = 1, …, K, and c = (c1, …, cK)′ are the unknown function coefficients. Given λ, the optimal coefficients and smoothing matrix can be obtained using the result in Equations (14) and (16) with K = 1n and J = GQ. When applying the Bayesian interpretation in Section 2.6, note that the pseudoinverse of Q has Toeplitz form for the internal rows and columns with deviations from Toeplitz form in cells (1,1) and (K, K). As an example, with K = 5 the pseudoinverse of Q is given by
Q† = [  1  −1   0   0   0
       −1   2  −1   0   0
        0  −1   2  −1   0
        0   0  −1   2  −1
        0   0   0  −1   1 ]    (28)
where c ~ N(0, [σ²/(nλ)]Q†) is the assumed prior distribution for the contrast space coefficients.
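The sketch below assembles these ordinal smoothing spline building blocks in R, taking the contrast space reproducing kernel matrix Q to be the Moore-Penrose pseudoinverse of D′D (consistent with the observation above that Q† is the tridiagonal difference penalty matrix) and reusing the hypothetical fit.ss helper from the sketch in Section 2.5.

```r
# Ordinal smoothing spline building blocks for K ordered levels. The contrast space
# reproducing kernel matrix Q is computed as the pseudoinverse of D'D, so that
# Q-dagger = D'D is Toeplitz in the interior with deviations in cells (1,1) and (K,K).
make.ordinal.Q <- function(K) {
  D <- diff(diag(K))              # (K-1) x K first order difference operator
  MASS::ginv(crossprod(D))        # K x K contrast space reproducing kernel matrix
}

# fitting sketch for data (x, y) with x in {1, ..., K}, reusing the hypothetical
# fit.ss() helper from Section 2.5 with Kmat = 1_n and Jmat = G Q
fit.ordspline <- function(x, y, K, lambda) {
  Q <- make.ordinal.Q(K)
  G <- outer(x, 1:K, "==") * 1    # n x K selection matrix
  fit.ss(y, Kmat = matrix(1, length(y), 1), Jmat = G %*% Q, Qmat = Q, lambda = lambda)
}
```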
3.2. Relation to Isotonic Regression
Using the reproducing kernel definition in the first line of Theorem 3.1, the contrast space reproducing kernel matrix can be factorized as Q = MM′, where the elements of the matrix M = {mij}K×(K−1) have the form
mij = −(K − j)/K if i ≤ j,  and mij = j/K if i > j    (29)
As an example, with K = 5 the matrix M has the form
M = (1/5) [ −4  −3  −2  −1
             1  −3  −2  −1
             1   2  −2  −1
             1   2   3  −1
             1   2   3   4 ]    (30)
Note that the columns of M sum to zero, which implies that Q0Q = 0K×K, where Q0 = (1/K)1K1K′ is the reproducing kernel matrix for the null space. To visualize the ordinal smoothing spline reproducing kernel function, Figure 1 plots the columns of M and the columns of the corresponding reproducing kernel matrix Q for the situation with K = 5. Note that by definition (i) the K − 1 columns of M are (linearly independent) monotonic increasing functions of x ∈ {1, …, K} and (ii) the K columns of Q are linearly dependent given that Q has rank K − 1.
Figure 1. Visualization of the ordinal smoothing spline reproducing kernel function with K = 5 levels of X. The left plot shows the columns of the factorized kernel matrix M, whereas the right plot shows the columns of the reproducing kernel matrix Q = MM′.
The factorization of the reproducing kernel matrix Q = MM′ implies that the ordinal smoothing spline can be reformulated to form an isotonic regression estimator. Specifically, we could reparameterize the ordinal smoothing spline problem as
(1/n) ‖y − d1n − GMcm‖² + λ cm′cm    (31)
where cm = M′c and Q = MM′ by definition. Note that the columns of M are monotonic increasing such that the k-th column contains two unique values with the “jump” occurring between rows k and k+1 for k ∈ {1, …, K−1}, see Figure 1. Thus, by constraining the elements of cm to be non-negative, we can constrain the ordinal smoothing spline to be a monotonic increasing function across the values 1 < ⋯ < K. Specifically, for a fixed λ we seek the solution to the problem
minβ (1/n) ‖y − Xβ‖² + λ β′Ĩβ  subject to  Aβ ≥ 0K−1    (32)
where β = (d, cm′)′ is the coefficient vector, X = [1n, GM] is the design matrix, Ĩ is a K × K identity matrix with cell (1,1) equal to zero, and A = (0K−1, IK−1) is the (K − 1) × K constraint matrix.
Theorem 3.2. The coefficient vector β minimizing the inequality constrained optimization problem in Equation (32) has the form
where β̂u denotes the unconstrained ordinal smoothing spline solution from Equation (14), and the A2 and B2 matrices depend on the active constraints.
See the Supplementary Material for the proof. Note that Theorem 3.2 reveals that the ordinal smoothing spline isotonic regression estimator is a linear transformation of the unconstrained estimator. Furthermore, Theorem 3.2 implies that the constrained fitted values satisfy ŷ = S̃λy, where the constrained smoothing matrix S̃λ depends on the active constraints. Thus, when the non-negativity constraints are active, tr(S̃λ) can be used as an estimate of the effective degrees of freedom for smoothing parameter selection.
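A sketch of the constrained estimator is given below: the step function matrix M is constructed to satisfy the properties described in this subsection (monotone increasing, zero-sum columns, MM′ equal to the pseudoinverse of D′D), and the quadratic program in Equation (32) is solved with the quadprog package; the helper functions are hypothetical and smoothing parameter selection is omitted.

```r
# Monotone (isotonic) ordinal smoothing spline via the step function factorization
# Q = M M'. The matrix M below has monotone increasing, zero-sum step function
# columns with M M' equal to the pseudoinverse of D'D, and the constrained problem
# in Equation (32) is solved as a quadratic program with quadprog::solve.QP().
library(quadprog)

make.M <- function(K) {
  M <- matrix(0, K, K - 1)
  for (j in 1:(K - 1)) M[, j] <- ifelse(1:K <= j, -(K - j) / K, j / K)
  M
}

fit.monotone <- function(x, y, K, lambda) {
  n <- length(y)
  X <- cbind(1, (outer(x, 1:K, "==") * 1) %*% make.M(K))   # design matrix [1_n, G M]
  Pen <- diag(K); Pen[1, 1] <- 0                           # identity with cell (1,1) zeroed
  Dmat <- 2 * (crossprod(X) / n + lambda * Pen)            # quadratic term of the QP
  dvec <- as.numeric(2 * crossprod(X, y) / n)              # linear term of the QP
  Amat <- t(cbind(0, diag(K - 1)))                         # constraints: cm >= 0
  beta <- solve.QP(Dmat, dvec, Amat, bvec = rep(0, K - 1))$solution
  list(coefs = beta, fitted = as.numeric(X %*% beta))
}
```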
3.3. Reproducing Kernel as K → ∞
We now provide a theorem on the behavior of the ordinal smoothing spline reproducing kernel as the number of levels K of the ordered set approaches infinity.
Theorem 3.3. As K → ∞, the ordinal smoothing spline reproducing kernel function is asymptotically equivalent to the linear smoothing spline reproducing kernel function, where the ordinal smoothing spline reproducing kernel function is defined for x, y ∈ {1, …, K} with x ∨ y = max(x, y) denoting the maximum of x and y, and the linear smoothing spline reproducing kernel function is defined for x̃, ỹ ∈ [0, 1] with x̃ = (x − 1)/(K − 1) and ỹ = (y − 1)/(K − 1) by definition.
See the Supplementary Material for a proof. Note that Theorem 3.3 implies that ordinal and linear smoothing splines will perform similarly as the number of levels of the ordered set K increases. In Figure 2, we plot the reproducing kernel function for the (standardized) ordinal and linear smoothing spline (with K = 5), along with the Frobenius norm between the (standardized) ordinal and linear smoothing spline reproducing kernel matrices for different values of K. This figure reveals that the two kernel functions are practically identical for K ≥ 20, and can look quite similar even for small values of K (e.g., K = 5).
Figure 2. (Top) Visualization of the (standardized) ordinal and linear smoothing spline reproducing kernel functions with K = 5 levels of X. (Bottom) Frobenius norm between the (standardized) ordinal and linear smoothing spline reproducing kernel matrices Q as a function of K.
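One way to examine this equivalence numerically is sketched below: the ordinal kernel matrix is computed as the pseudoinverse of D′D, the linear smoothing spline kernel is written in its standard Bernoulli polynomial form (an assumption about the exact form used in the paper), and each matrix is scaled to unit Frobenius norm before comparison (one plausible standardization).

```r
# Numerical check of the asymptotic equivalence: ordinal kernel (pseudoinverse of D'D)
# vs. linear smoothing spline kernel in Bernoulli polynomial form, each scaled to unit
# Frobenius norm before computing the Frobenius norm of their difference.
ord.kernel <- function(K) MASS::ginv(crossprod(diff(diag(K))))
lin.kernel <- function(K) {
  xt <- (0:(K - 1)) / (K - 1)                  # x-tilde = (x - 1)/(K - 1)
  k1 <- function(u) u - 0.5
  k2 <- function(u) 0.5 * (k1(u)^2 - 1 / 12)
  outer(xt, xt, function(a, b) k1(a) * k1(b) + k2(abs(a - b)))
}
frob.diff <- function(K) {
  Qo <- ord.kernel(K); Ql <- lin.kernel(K)
  Qo <- Qo / sqrt(sum(Qo^2)); Ql <- Ql / sqrt(sum(Ql^2))
  sqrt(sum((Qo - Ql)^2))
}
sapply(c(5, 10, 20, 50, 100), frob.diff)   # differences decrease as K grows
```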
3.4. Approximation for Large K
If the number of elements of the ordered set K is quite large, then using all K levels as knots would be computationally costly. In such cases, the unconstrained ordinal smoothing spline solution can be fit via the formulas in Equations (14) and (16) using a set of R ≪ K selected knots t1 < t2 < ⋯ < tR. When monotonicity constraints are needed, the ordinal smoothing spline would still be computationally costly for large K. This is because the factorization of the reproducing kernel into the outer product of monotonic functions, i.e., Q = MM′, depends on K. For a scalable approximation to the ordinal smoothing spline isotonic regression estimator, the penalty functional itself can be approximated such as
J̃(η) = Σj=2R (η(tj) − η(tj−1))²    (33)
where the knots t1 < t2 < ⋯ < tR are assumed to be ordered and unique. The modified RKHS consists of functions on {1, …, K} whose contrast space behavior is determined by the values at the knots, for x ∈ {1, …, K} and j ∈ {2, …, R}, and it has the modified inner product
〈η, ν〉 = (1/R)(s′η)(s′ν) + η′D̃′D̃ν    (35)
where s is a K × 1 vector that contains ones in the positions corresponding to the R knots and zeros elsewhere, and D̃ is a sparse (R − 1) × K first order difference operator matrix whose (j − 1)-th row contains a −1 in column tj−1, a 1 in column tj, and zeros elsewhere for j ∈ {2, …, R}. As an example, with K = 5 and R = 3 knots, D̃ is a 2 × 5 matrix in which each row contains a single −1 and a single 1 in the columns corresponding to adjacent knots, with zeros in the remaining columns.
Theorem 3.4. Given the ordinal smoothing spline inner-product in Equation (35), the null space has reproducing kernel ρ0(x, y) = 1/R, and the contrast space has reproducing kernel
for x, y ∈ {1, …, K}, where 1{·} is an indicator function and t1 < ⋯ < tR are the knots.
See the Supplementary Material for a proof. Note that Theorem 3.4 provides the reproducing kernel needed to compute the ordinal smoothing spline isotonic regression estimator using the knot-based approximation for large K. Specifically, Theorem 3.4 reveals that the contrast space basis function matrix and penalty matrix can both be factorized into products of monotonic (step) function matrices whose dimensions depend only on the number of knots R, with a selection matrix mapping each xi to its corresponding knot interval. Using the modified reproducing kernel, the reparameterized ordinal smoothing spline problem has the same form as Equation (32), with R − 1 monotonic step function coefficients in place of the K − 1 coefficients in cm, so analogs of the results in Section 3.2 can be applied.
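A simplified sketch of a knot-based monotone fit is given below, using a reduced basis of R − 1 monotone step functions defined by the knots together with a ridge penalty and non-negativity constraints; this illustrates the idea of the approximation, but the exact construction in the paper uses the modified reproducing kernel of Theorem 3.4.

```r
# Simplified knot-based monotone fit for large K: a reduced basis of R - 1 monotone
# step functions defined by the knots, with a ridge penalty and non-negativity
# constraints (solved as a quadratic program). Illustrative only; the paper's
# construction uses the modified reproducing kernel of Theorem 3.4.
fit.monotone.knots <- function(x, y, knots, lambda) {
  n <- length(y); R <- length(knots)
  H <- sapply(knots[-1], function(t) as.numeric(x >= t))   # n x (R-1) step functions
  X <- cbind(1, H)
  Pen <- diag(R); Pen[1, 1] <- 0
  Dmat <- 2 * (crossprod(X) / n + lambda * Pen)
  dvec <- as.numeric(2 * crossprod(X, y) / n)
  Amat <- t(cbind(0, diag(R - 1)))
  beta <- quadprog::solve.QP(Dmat, dvec, Amat, bvec = rep(0, R - 1))$solution
  list(coefs = beta, fitted = as.numeric(X %*% beta))
}
```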
4. Simulation Study
To investigate the performance of the ordinal smoothing spline, we designed a simulation study that manipulated two conditions: (i) the sample size (4 levels: n ∈ {50, 100, 200, 500}), and (ii) the data generating function η (2 forms, see Figure 3). To simulate the data, we defined xi to be an equidistant sequence of n points spanning the interval [0,1], and then defined yi = η(xi) + ϵi where the error terms were (independently) randomly sampled from a standard normal distribution. Note that each xi is unique, so K = n in this case. We compared four different methods as a part of the simulation: (a) linear smoothing spline, (b) ordinal smoothing spline, (c) monotonicity constrained ordinal smoothing spline, and (d) isotonic regression implemented through the isoreg function in R [6]. Methods (a)–(c) were fit using the bigsplines package [41] in R. For the smoothing spline methods, we used the same sequence of 20 points as knots to ensure that differences in the results are not confounded by differences in the knot locations. For each of the 8 (4 n × 2 η) cells of the simulation design, we generated 100 independent samples of data and fit each of the four models to each generated sample.
Figure 3. The data generating functions from the simulation study. In each case, the function is defined with respect to a sequence of equidistant knots spanning the interval [0,1].
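The sketch below illustrates one simulation replicate under a hypothetical monotone data generating function, comparing base R isotonic regression with the knot-approximated monotone fit from the sketch in Section 3.4; in the actual study the smoothing parameters were selected by GCV and the bigsplines package was used.

```r
# One simulation replicate: smooth monotone truth (hypothetical stand-in for Figure 3),
# standard normal errors, comparison of base R isotonic regression with the
# knot-approximated monotone ordinal smoothing spline sketch from Section 3.4.
set.seed(123)
n <- 100
x <- seq(0, 1, length.out = n)
eta0 <- 2 * pnorm(10 * (x - 0.5))      # hypothetical monotone data generating function
y <- eta0 + rnorm(n)

fit.iso <- isoreg(x, y)                # method (d): classic isotonic regression
rmse.iso <- sqrt(mean((fit.iso$yf - eta0)^2))

# method (c): monotone ordinal fit on the ranks of x (invariant to monotone transforms),
# with 20 knots; lambda is fixed here for brevity, but would be selected by GCV.
knots <- round(seq(1, n, length.out = 20))
fit.ord <- fit.monotone.knots(rank(x), y, knots = knots, lambda = 1e-3)
rmse.ord <- sqrt(mean((fit.ord$fitted - eta0)^2))
c(isotonic = rmse.iso, ordinal = rmse.ord)
```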
To evaluate the performance of each method, we calculated the root mean squared error (RMSE) between the truth η and the estimate . The RMSE for each method is displayed in Table 2 and Figure 4, which clearly reveal the benefit of the ordinal smoothing spline. Specifically, Figure 4 reveals that the unconstrained ordinal smoothing spline outperforms the linear smoothing spline for smaller values of n = K, and performs similarly to linear smoothing spline for large values of K. Figure 4 also reveals that the monotonicity constrained estimator (using the knot-approximated reproducing kernel function described in Section 3.4) performed slightly better than the unconstrained ordinal smoothing spline, particularly in the n = 50 condition. Furthermore, the simulation results in Figure 4 demonstrate that the ordinal smoothing spline systematically outperforms R's (default) isotonic regression routine. Thus, the ordinal smoothing spline offers an effective alternative to classic nonparametric and isotonic regression approaches, and—unlike the linear smoothing spline—the ordinal smoothing spline has the benefit of being invariant to any monotonic transformation of x.
5. Examples
5.1. Example 1: Income by Education and Sex
To demonstrate the power of the monotonic ordinal smoothing spline, we use publicly available data to examine the relationship between income and educational attainment. The data were collected as a part of the 2014 US Census and are freely available from the Integrated Public Use Microdata Series (IPUMS) USA website (https://usa.ipums.org/usa/), which is managed by the Minnesota Population Center at the University of Minnesota. Note that the original data file usa_00001.dat contains data from n* = 3,132,610 subjects. However, we restrict our analyses to individuals who were 18+ years old and earned a non-zero income as reported on their 2014 census, resulting in a sample size of n = 2,214,350 subjects. We fit the monotonic ordinal smoothing spline separately to the males' data (n = 1,093,949) and females' data (n = 1,120,401) using all 11 unique education levels as knots. Due to the positive skewness of the income data, we fit the model to y = log(income). Finally, to ensure our results were nationally representative, we fit the model via weighted penalized least squares using the person weights (PERWT) provided with the IPUMS data.
The fitted models explain about 15% of the variation in the male and female income data, and reveal noteworthy differences in expected incomes as a function of education and sex. The predicted mean incomes for each educational attainment level and sex are plotted in Figure 5, which reveals some striking trends. First, note that expected income is low (about $15,000 for males and $10,000 for females) for individuals without a high school education. Completing high school results in an approximate $7,000 expected increase in income for men, but only a $4,500 expected increase in income for women. Similarly, completing 1 year of college results in an expected $1,700 pay increase for men, but only an expected $1,000 increase for women. This disturbing trend continues to magnify at each education level, such that women receive a smaller expected return on their education. At the highest level of educational attainment, the gender pay gap is most pronounced, such that women tend to make about $28,000 less per year than men.
Figure 5. Monotonic ordinal smoothing spline solution showing the relationship between education and income for males (blue circle) and females (red triangle).
5.2. Example 2: Student Math Performance
As a second example, we use math performance data from Portuguese secondary school students. The data were collected by Paulo Cortez at the University of Minho in Portugal [42], and are available from the UCI Machine Learning Repository [43]. In this example, we use the math performance data (student-mat.csv), which contains math exam scores from n = 395 Portuguese students. We focus on predicting the students' scores on the first exam of the grading period (G1). Unlike Cortez and Silva [42], we model the first exam score (instead of the final grade) because we hope to identify factors that cause students to fall behind early in the semester. By discovering factors that relate to poor math performance on the first exam, it may be possible to create student-specific interventions (e.g., tutoring or more study time) with hopes of improving the final grade.
In Table 3, we describe the 15 predictor variables that we include in our model. To model the math exam scores, we fit a semiparametric regression model of the form
where mathi is the i-th student's score on the first math exam, β0 is an unknown regression intercept, (β1, …, β6) are unknown regression coefficients corresponding to the six binary predictors (school, sex, famsup, paid, activities, nursery), ηj is the j-th effect function for j = 1, …, 9, and ϵi is the i-th student's model error term. We used a cubic smoothing spline marginal reproducing kernel for the integer valued variables (age, failures, absences) because these variables are measured on a ratio scale. We used the ordinal smoothing spline reproducing kernel (see Theorem 3.1) for the other six nonparametric effects because these variables were measured on an ordinal scale (see Table 3). The full smoothing spline solution was fit, i.e., all n = 395 data points were used as knots.
The smoothing parameters were chosen by minimizing the GCV criterion [35]. To avoid the computational expense of simultaneously tuning the GCV criterion with respect to 9 smoothing parameters (one for each ηj), we used a version of Algorithm 3.2 described by Gu and Wahba [25] to obtain initial values of the smoothing parameters. Then we fit the model with the 9 local smoothing parameters fixed at the initialized values, and optimized the GCV criterion with respect to the global smoothing parameter λ. This approach has been shown to produce results that are essentially identical to the fully optimal solution (see [15, 26]).
The fitted model explains about 23% of the variation in the students' scores on the first math exam. The estimated regression coefficients and effect functions are plotted in Figure 6, along with 90% Bayesian confidence intervals [37, 38, 40] around the effect functions. Examining the top row of Figure 6, it is evident that only two of the six parametric effects are significant: sex and famsup. The signs and magnitudes of these significant coefficients reveal that (i) males tend to get higher math exam scores than females, and (ii) students who receive extra educational support from their families tend to get lower math exam scores. This second point may seem counter-intuitive, because one may think that extra educational support should lead to higher grades. However, it is likely that the students who receive extra support are receiving this extra support for a reason.
Figure 6. Estimated effects of the different predictor variables on expected math scores with 90% Bayesian confidence intervals (dashed lines) around the predictions.
Examination of the remaining subplots in Figure 6 reveals that the number of prior course failures has the largest (negative) effect on the expected math scores, which is not surprising. There is a slight trend such that older ages correspond to better grades, but the trend is not significantly different from zero according to the 90% Bayesian confidence intervals. Given the other effects in the model, the number of absences has no effect on the expected math scores, which is surprising. Having a mother who completed higher education increases a student's expected math exam score, whereas there were no significant differences between the other four lower levels of the mother's education. Studying for 5 or more hours per week increases a student's expected scores, whereas studying less than 2 hours per week decreases a student's expected scores. Going out with friends and/or drinking on the weekend with high or very high frequency significantly decreases a student's expected math exam scores. Given the other effects, travel time to school and a student's health did not have significant effects on the math exam scores.
6. Discussion
6.1. Summary of Findings
Our simulation and real data examples reveal the flexibility and practical potential of the ordinal smoothing spline. The simulation study investigated the potential of the ordinal smoothing spline for isotonic regression, as well as the relationship between the ordinal and linear smoothing spline reproducing kernel functions. The simulation results demonstrated that (i) the ordinal smoothing spline can outperform the linear smoothing spline at small samples, (ii) the ordinal smoothing spline performs similar to the linear smoothing spline for large K, and (iii) monotonic ordinal smoothing splines can outperform standard isotonic regression approaches. Thus, the simulation study illustrates the results in Theorems 3.1–3.3, and also reveals that the knot-approximated reproducing kernel proposed in Theorem 3.4 offers an effective approximation to the monotonic ordinal smoothing spline solution.
The first example (income by education and sex) offers a practical example of the potential of the ordinal smoothing spline for discovering monotonic trends in data. Unlike classic approaches to isotonic regression, the ordinal smoothing spline uses a reproducing kernel (and knot-based) approach, making it easily scalable to large survey samples such as the US Census data. Using a large and nationally representative sample of US citizens ages 18+, this example reveals a clear gender pay gap, such that women receive less return on their educational investments, i.e., less pay for the same level of education. Note also that the predictor could be monotonically transformed without changing the ordinal smoothing spline solution, which is one of the primary benefits of using the ordinal smoothing spline estimator for variables such as Educational Attainment, which has no clear unit of measurement.
The second example (math grades) demonstrates the effectiveness of using ordinal smoothing splines to include multiple ordinal predictors in a regression model along with other nominal and metric (continuous) predictors. Furthermore, the second example illustrates that many ordinal relationships do not have a clear linear increase or decrease across the K levels of the ordinal variable. This reveals that it could be difficult and/or misleading to code ordinal predictors as if they were continuous (e.g., linear) effects. So, despite this common practice in the social sciences, our results make it clear that there is no obvious scoring scheme for all ordinal variables, and that this approach could be particularly problematic when multiple ordinal variables are included in the same model.
6.2. Concluding Remarks
This paper discusses the ordinal smoothing spline, which can be used to model functional relationships between ordered categorical predictor variables and any exponential family response variable. This approach makes it straightforward to incorporate one (or more) ordinal predictors in a regression model, and also provides a theoretically sound method for examining interaction effects between ordinal and other predictors. In this paper, we present the ordinal smoothing spline reproducing kernel function (Theorem 3.1), which has the benefit of being invariant to any monotonic transformation of the predictor. We also discuss how the ordinal smoothing spline estimator can be adjusted to impose monotonicity constraints (Theorem 3.2). Furthermore, we reveal an interesting asymptotic relationship between ordinal and linear smoothing splines (Theorem 3.3), and develop large sample approximations for ordinal smoothing splines (Theorem 3.4). Finally, we have demonstrated the practical potential of ordinal smoothing splines using a simulation study (Section 4) and real data examples from two distinct disciplines (Section 5).
In nearly all applications of the GLM or GzLM, ordered categorical predictors are treated as either unordered categorical predictors or metric predictors for modeling purposes. In most cases, it is not obvious which approach one should choose. Neither approach is ideal for ordinal data, so one must ultimately decide which aspect of the data should be misrepresented in the model. Treating ordinal data as unordered (by definition) ignores the ordinal nature of the data, which is undesirable; likewise, treating ordinal data as metric (continuous) ignores the discrete nature of the data, which is also undesirable. In contrast, the ordinal smoothing spline provides a natural solution to this problem, because the ordinal smoothing spline (i) respects the ordinal nature of the predictor, and (ii) provides a theoretically sound framework for incorporating ordinal predictors in regression models.
Author Contributions
The sole author (NH) contributed to all aspects of this work.
Funding
The sole author (NH) was funded by start-up funds from the University of Minnesota.
Conflict of Interest Statement
The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Acknowledgments
The author acknowledges the two data sources: (i) the Minnesota Population Center at the University of Minnesota, and (ii) the Machine Learning Repository at University of California Irvine.
Supplementary Material
The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fams.2017.00015/full#supplementary-material
References
1. Christensen R. Plane Answers to Complex Questions: The Theory of Linear Models. 3rd Edn. New York, NY: Springer-Verlag (2002).
2. Chartier S, Faulkner A. General linear models: an integrated approach to statistics. Tutorial Quant Methods Psychol. (2008) 4:65–78. doi: 10.20982/tqmp.04.2.p065
3. Graham JM. The general linear model as structural equation modeling. J Educ Behav Stat. (2008) 33:485–506. doi: 10.3102/1076998607306151
4. Casals M, Girabent-Farrés M, Carrasco JL. Methodological quality and reporting of generalized linear mixed models in clinical medicine (2000–2012): a systematic review. PLoS ONE (2014) 9:e112653. doi: 10.1371/journal.pone.0112653
5. de Jong P, Heller GZ. Generalized Linear Models for Insurance Data. Cambridge: Cambridge University Press (2008).
6. R Core Team. R: A Language and Environment for Statistical Computing. Vienna (2017). Available online at: http://www.R-project.org/
8. McCullagh P. Regression models for ordinal data (with discussion). J R Stat Soc Ser B (1980) 42:109–42.
9. Armstrong B, Sloan M. Ordinal regression models for epidemiologic data. Am J Epidemiol. (1989) 129:191–204. doi: 10.1093/oxfordjournals.aje.a115109
10. Peterson B, Harrell FE Jr. Partial proportional odds models for ordinal response variables. J R Stat Soc Ser C (1990) 39:205–17. doi: 10.2307/2347760
11. Cox C. Location-scale cumulative odds models for ordinal data: a generalized non-linear model approach. Stat Med. (1995) 14:1191–203. doi: 10.1002/sim.4780141105
12. Liu I, Agresti A. The analysis of ordinal categorical data: an overview and a survey of recent developments. Test (2005) 14:1–73. doi: 10.1007/BF02595397
13. Gertheiss J, Tutz G. Penalized regression with ordinal predictors. Int Stat Rev. (2009) 77:345–65. doi: 10.1111/j.1751-5823.2009.00088.x
14. Gertheiss J. ordPens: Selection and/or Smoothing of Ordinal Predictors. R package version 0.3-1 (2015). Available online at: https://CRAN.R-project.org/package=ordPens
15. Gu C. Smoothing Spline ANOVA Models. 2nd Edn. New York, NY: Springer-Verlag (2013). doi: 10.1007/978-1-4614-5369-7
16. Wahba G. Spline Models for Observational Data. Philadelphia, PA: Society for Industrial and Applied Mathematics (1990).
17. Riesz F. Sur une espèce de géométrie analytique des systèmes de fonctions sommables. C R Acad Sci Paris (1907) 144:1409–11.
19. Aronszajn N. Theory of reproducing kernels. Trans Am Math Soc. (1950) 68:337–404. doi: 10.1090/S0002-9947-1950-0051437-7
20. Ruppert D, Wand MP, Carroll RJ. Semiparametric Regression. Cambridge: Cambridge University Press (2003).
23. Wood SN. Generalized Additive Models: An Introduction with R. Boca Raton, FL: Chapman & Hall (2006).
24. Kimeldorf G, Wahba G. Some results on Tchebycheffian spline functions. J Math Anal Appl. (1971) 33:82–95. doi: 10.1016/0022-247X(71)90184-3
25. Gu C, Wahba G. Minimizing GCV/GML scores with multiple smoothing parameters via the Newton method. SIAM J Sci Stat Comput. (1991) 12:383–98. doi: 10.1137/0912021
26. Helwig NE, Ma P. Fast and stable multiple smoothing parameter selection in smoothing spline analysis of variance models with large samples. J Comput Graph Stat. (2015) 24:715–32. doi: 10.1080/10618600.2014.926819
27. Duchon J. Splines minimizing rotation-invariant semi-norms in Sobolev spaces. In Schempp W, Zeller K editors. Constructive Theory of Functions of Several Variables. Lecture Notes in Mathematics, Vol. 571. Berlin; Heidelberg: Springer (1977). p. 85–100.
28. Meinguet J. Multivariate interpolation at arbitrary points made simple. J Appl Math Phys. (1979) 30:292–304. doi: 10.1007/BF01601941
29. Wood SN. Thin plate regression splines. J R Stat Soc Ser B (2003) 65:95–114. doi: 10.1111/1467-9868.00374
30. Gu C, Kim YJ. Penalized likelihood regression: general formulation and efficient approximation. Can J Stat. (2002) 30:619–28. doi: 10.2307/3316100
31. Kim YJ, Gu C. Smoothing spline Gaussian regression: more scalable computation via efficient approximation. J R Stat Soc Ser B (2004) 66:337–56. doi: 10.1046/j.1369-7412.2003.05316.x
32. Helwig NE, Ma P. Smoothing spline ANOVA for super-large samples: scalable computation via rounding parameters. Stat Its Interface (2016) 9:433–44. doi: 10.4310/SII.2016.v9.n4.a3
34. Penrose R. A generalized inverse for matrices. Math Proc Cambridge Philos Soc. (1955) 51:406–13. doi: 10.1017/S0305004100030401
35. Craven P, Wahba G. Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik (1979) 31:377–403. doi: 10.1007/BF01404567
36. Kimeldorf G, Wahba G. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Ann Math Stat. (1970) 41:495–502. doi: 10.1214/aoms/1177697089
37. Wahba G. Bayesian “confidence intervals” for the cross-validated smoothing spline. J R Stat Soc Ser B (1983) 45:133–50.
38. Nychka D. Bayesian confidence intervals for smoothing splines. J Am Stat Assoc. (1988) 83:1134–43. doi: 10.1080/01621459.1988.10478711
40. Gu C, Wahba G. Smoothing spline ANOVA with component-wise Bayesian “confidence intervals”. J Comput Graph Stat. (1993) 2:97–117.
41. Helwig NE. bigsplines: Smoothing Splines for Large Samples. R package version 1.1-0 (2017). Available online at: http://CRAN.R-project.org/package=bigsplines
42. Cortez P, Silva A. Using data mining to predict secondary school student performance. In Brito A, Teixeira J editors. Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008). Porto: EUROSIS (2008). p. 5–12.
43. Lichman M. UCI Machine Learning Repository. (2013). Available online at: http://archive.ics.uci.edu/ml
Keywords: isotonic regression, monotonic regression, nonparametric regression, ordinal data, smoothing spline, step function
Citation: Helwig NE (2017) Regression with Ordered Predictors via Ordinal Smoothing Splines. Front. Appl. Math. Stat. 3:15. doi: 10.3389/fams.2017.00015
Received: 02 May 2017; Accepted: 12 July 2017;
Published: 28 July 2017.
Edited by: Sou Cheng Choi, Illinois Institute of Technology, United States
Reviewed by: Katerina Hlavackova-Schindler, University of Vienna, Austria; Junhui Wang, City University of Hong Kong, Hong Kong
Copyright © 2017 Helwig. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Nathaniel E. Helwig, helwig@umn.edu