Review and Synthesis of Estimation Strategies to Meet Small Area Needs in Forest Inventory

Dettmann, Garret T.; Radtke, Philip J.; Coulston, John W.; Green, P. Corey; Wilson, Barry T.; Moisen, Gretchen G.

doi:10.3389/ffgc.2022.813569

REVIEW article

Front. For. Glob. Change , 16 March 2022

Sec. Forest Management

Volume 5 - 2022 | https://doi.org/10.3389/ffgc.2022.813569

This article is part of the Research Topic Small Area Estimation in Forest Inventories: New Needs, Methods, and Tools View all 15 articles

Review and Synthesis of Estimation Strategies to Meet Small Area Needs in Forest Inventory

$\r\nGarret T. Dettmann$ Garret T. Dettmann¹

¹Department of Forest Resources and Environmental Conservation, Virginia Tech, Blacksburg, VA, United States
²U.S. Forest Service, Blacksburg, VA, United States
³U.S. Forest Service, St. Paul, MN, United States
⁴U.S. Forest Service, Ogden, UT, United States

Small area estimation is a growing area of research for making inferences over geographic, demographic, or temporal domains smaller than those in which a particular survey data set was originally intended to be used. We aimed to review a body of literature to summarize the breadth and depth of small area estimation and related estimation strategies in forest inventory and management to-date, as well as the current state of terminology, methods, concerns, data sources, research findings, challenges, and opportunities for future work relevant to forestry and forest inventory research. Estimation methodologies explored include direct, indirect, and composite estimation within design-based and model-based inference bases. A variety of estimation methods in forestry have been applied to extensive multi-resource inventory systems like national forest inventories to increase the precision of estimates on small domains or subsets of the overall populations of interest. To avoid instability and large variances associated with small sample sizes when working with small area domains, forest inventory data are often supplemented with information from auxiliary sources, especially from remote sensing platforms and other geospatial, map-based products. Results from many studies show gains in precision compared to direct estimates based only on field inventory data. Gains in precision have been demonstrated in both project-level applications and national forest inventory systems. Potential gains are possible over varying geographic and temporal scales, with the degree of success in reducing variance also dependent on the types of auxiliary information, scale, strength of model relationships, and methodological alternatives, leaving considerable opportunity for future research and growth in small area applications for forest inventory.

1. Introduction

The frequency and sophistication of statistical methods in forest inventory have grown steadily since their earliest adoption by forest researchers, with an overall goal of providing information of sufficiently high quality to inform decision-making (Schumacher, 1945). One ongoing trend involves the use of data collected as a part of broad regional or national forest inventories to produce estimates for areas smaller than the surveys were originally designed to address (Magnussen et al., 2014). This trend reflects a situation described by W. A. Fuller in a plenary presentation delivered to a 1998 workshop on Environmental Monitoring Surveys Over Time hosted by the University of Washington, Seattle (Fuller, 1999):

“The client will always require more than is specified at the design stage. For example, the client will explain that they require estimates only at the regional or national level and then, when data are available, ask for county estimates.”

While the quotation cites a typical circumstance, many situations arise in forest resource assessment where stakeholders recognize that data from multiple sources and scales could be leveraged to give improved estimates for increasingly small subsets of survey populations—whether based on geographic areas, time periods, or demographic subsets of larger populations from which sample data were collected.

One means of addressing this need is through small area estimation (SAE), a set of statistical methods aimed at providing estimates of parameters of interest for population subsets known as small area domains, typically by linking information from auxiliary sources with sample observations gathered over a larger population that encompasses multiple small area domains. A small area in this case can be described as any geographic, temporal, or categorical domain for which an established approach for direct estimation does not provide adequate precision (Rao and Molina, 2015). Usual quantities of interest include finite population parameters such as area totals or means, especially where tolerances for estimator accuracy have been specified as a part of the sample design. Domains in SAE are typically subsets of larger survey populations, such as those found in regional or national economic, health, or agricultural surveys focused on multiple attributes of interest (Schreuder et al., 1993). These surveys may involve decades of repeated sampling and data collection targeting dozens of attributes, or be limited to a single attribute observed at just one point in time.

SAE methods seek to improve the precision of estimates for small area domains of interest (DOI) using data observed from other domains to increase the amount of sample information available, an approach often described as “borrowing strength.” The indirect (i.e., outside-of-domain) data are generally linked to small area DOI through one or more auxiliary variables and a model relationship that holds across multiple domains. In this sense, the domain (d) direct data (y ∈ d) are linked to the indirect data (y ∉ d) via a model relationship involving the auxiliary data (x) and corresponding observations of y. In the example Fuller (1999) described, counties of interest are the small area domains, with the sample data collected in a specific county serving as its source of direct information. Sample observations from surrounding areas—including other counties—are the source of indirect data. In a national forest inventory (NFI) application consistent with Fuller's example both y ∈ d and y ∉ d would be collected on observational units selected by statistical sampling. Auxiliary information to construct a model for y ~ x might come from remote sensing or other geospatial data sets separate from the NFI sample observations, or from other sources including surveys of forest landowners, agencies, or enterprises that keep records of timber harvesting, tree-planting, or other management-related activities. Other arrangements and sources of information are possible, but the overall pattern of direct, indirect, and auxiliary information, and a model y ~ x is present in most SAE applications (Rao, 2008).

Methods now widely associated with SAE largely appeared in technical literature beginning in the 1970s to address needs for increased accuracy when estimating for small area domains within population, social-economic, and public health surveys (Federal Committee on Statistical Methodology, 1993). Since then, SAE has been adopted in applications aimed at estimating incomes (Fay and Herriot, 1979), census groupings, and crop areas (Battese et al., 1988), among others. Methods for SAE have continued to develop over time as new statistical and computational tools have become available, together with widespread availability and cost-effectiveness of modern data sets. Remote sensing technologies such as satellite imagery or aerial laser scanning (ALS) have played a key role in accelerating the application of SAE methodologies to forest inventory settings (Pfeffermann, 2002, 2013; Sugasawa and Kubokawa, 2020; Coulston et al., 2021).

Interest in applying SAE across disciplines has grown over time but most applications in forest inventory began to appear in published work over the past several decades (Burk and Ek, 1982; Anderson and Breidenbach, 2007; Breidenbach et al., 2010; Goerndt et al., 2011). SAE reduces forest inventory estimator errors for small area domains, offering an efficient and cost-effective option for reducing uncertainty compared to increasing sampling intensity by installing additional forest-inventory field plots (Magnussen et al., 2014). SAE techniques have been used to enhance precision of NFI-derived estimates (Breidenbach and Astrup, 2012; Frank et al., 2020), in forest stand inventories (Ver Planck et al., 2018), and from surveys of wood processors or commercial landowner inventories (Green et al., 2020; Coulston et al., 2021), but are not limited to those uses (Affleck and Gregoire, 2015). The ability of SAE to increase estimator precision in small areas where data are otherwise too sparse to satisfy tolerance specifications makes it attractive for applications in forest inventory (Guldin, 2021). For example, the Norwegian NFI has employed national canopy height maps from aerial remote sensing as auxiliary data sources since about 2010 to address needs for better local information in producing municipal forest statistics and forest-management related inventories (Astrup et al., 2019; Breidenbach et al., 2020). SAE has been used with forest inventory data from the U.S. Department of Agriculture Forest Service Forest Inventory and Analysis (FIA) program to generate estimates of forest attributes in small areas such as biofuel supply areas around co-firing power plants (Goerndt et al., 2019). In forests where field plots can be precisely referenced to high-quality geospatial auxiliary data (e.g., ALS), SAE can provide increased precision of estimates for arbitrarily small spatial areas—accounting for spatial correlations in sample data when warranted—even where no direct sample data lie within some DOI (Babcock et al., 2018; Pascual et al., 2018). Current research aims to combine forest biomass data from field plots with canopy-height measurements from the NASA GEDI spaceborne lidar platform and other remotely-sensed auxiliary information in a SAE framework, signaling the first efforts to obtain global scale forest biomass estimates in an inferential framework (Patterson et al., 2019).

While the potential value of SAE for use in forest inventory and monitoring is high, the number of applications reported in technical literature to date is relatively small. A main objective of this work is to provide an overview of published findings of small area applications in forest inventory including relevant considerations for their implementation in other, perhaps novel, settings. An underlying goal is to provide a backdrop of SAE-related concepts and terminology, as the clear and consistent use of statistical language is important to wider adoption of these tools in future work. A further objective is to provide clarification, without a heavy emphasis on mathematical statistics, where ample terminology and notation can pose challenges to those interested in pursuing small area applications, possibly for the first time. We begin in Section 2 with background on relevant statistical paradigms of design- and model-based inference relevant to SAE, including an important extension of design-based inference known as model-assisted estimation. Section 3 presents terminology for direct and indirect estimators along with an overview of composite estimators used in a majority of SAE methods classified as either unit-level or area-level methods. In Section 4, we summarize key findings of published SAE research in forestry, with synthesis and comparison to design-based approaches including model-assisted estimation. Section 4 also includes some discussion of variance estimators in small-area inventory applications along with emerging topics in SAE. We conclude in Section 5 with some take-home findings of the work.

2. Background

2.1. Design-Based Estimators

The design-based framework for inference from statistical sampling is a pillar of many SAE procedures, especially area-level estimators that will be discussed in Section 3.3 below. Sampling is likely familiar to most forest inventory specialists as it provides the basis for establishing statistical properties of estimators in well-designed inventories (Shiver and Borders, 1996; Gregoire and Valentine, 2007; Thompson, 2012). The sample design assigns probabilities to population units (e.g., plots, trees, etc.) in the sampling frame for being selected in a particular random draw from a finite population, with the overall probability of any unit being included in a sample determined from its selection probability in the context of the sampling scheme. While the attributes of each unit—whether sampled or not—are treated as fixed quantities, randomness in design-based methods arises through the process of sampling. Gregoire (1998) explained this detail stating, “in the design-based framework, the population is regarded as fixed whereas the sample is regarded as a realization of a stochastic process.” Design-based methods rely on the randomization distribution of sampling and possible estimates that could be obtained by following the sample design and its implementation in the sampling scheme. A key consequence is the lack of reliance on mathematical assumptions of how elements in the population are distributed, or of model relationships assumed about how two or more variables in the population are related (Sterba, 2009).

The usual goal of design-based sampling and subsequent estimation is to obtain reliable estimates of finite population parameters in an inferential framework. In forest inventories, population totals, means, or proportions for various attributes of interest are typical subjects of estimation. No assumptions about the underlying structure or distributions of population units being surveyed are required for valid inference in the design-based framework. Design-based estimators compatible with their sample designs are design-unbiased, meaning the expected value of the estimator over all possible samples equals the true population parameter being estimated. It follows that design-based estimators are design consistent, such that, “both the design bias and the variance go to zero as the sample size increases” (Skinner and Wakefield, 2017).

The Horvitz-Thompson (H-T) estimator is a design-based estimator widely used in forest inventory and introduced in many texts starting with simple random sampling. Any finite population total for attribute y can be estimated using the H-T estimator

\begin{array}{l} {\hat{τ}}_{y} = \sum_{i ϵ s} \frac{y_{i}}{π_{i}} & (1) \end{array}

where ${\hat{τ}}_{y}$ is the estimated population total, s denotes the set of observed values in the sample, y_i is the observed value of attribute y on the i^th sample unit, and π_i is the probability that y_i is included in s. In this form design-based estimators rely entirely on observed sample data and sample design weights, i.e., the inclusion probabilities in the denominator of Equation (1), to estimate an attribute of interest. Standard errors of an estimate require pairwise joint inclusion probabilities P(y_i ∩ y_j) (i ≠ j), calculated as the product of π_i and π_j when random sampling is from non-overlapping population units. Thus, the inference base for H-T estimator makes use of properties of the sample design such as a sampling scheme that draws samples according to design weights, independence of sample observations, and sampling distribution properties including applicability of Student's t-distribution and the Central Limit Theorem (Sterba, 2009). Sampling methods relying on a design-based framework do so largely for its desirable properties of unbiasedness and asymptotic consistency, often sought in inventory settings. These methods, however, rely on direct data—values of y observed directly by sampling from the population of interest—which can be expensive to obtain. Any need for estimates on small subsets of the population will likely result in a need for increased sampling intensity and additional expense (Fuller, 1999).

2.2. Model-Assisted Estimators

Models can be used within a design-based framework to improve estimates in what are often called model-assisted estimators (Särndal et al., 1992; McConville et al., 2017, 2020). Like the H-T estimator (Equation 1), model-assisted estimators are direct estimators in that they rely on values of y only from population DOI. One example is post-stratification which has formed the basis of forest inventory estimation strategies in the U.S. for many decades (Bechtold and Patterson, 2005). Among the simplest model-assisted estimators are linear combinations of auxiliary variables x – either univariate or multivariate – available for all sampled units and having known population means (μ_x) or totals (τ_x). Using the H-T estimator (1), the model-assisted estimator can be written as in Stahl et al. (2016)

\begin{array}{l} {\hat{τ}}_{y}^{M A} = \sum_{i \in U} \hat{y_{i}} + \sum_{i \in s} \frac{y_{i} - \hat{y_{i}}}{π_{i}} & (2) \end{array}

where ${\hat{τ}}_{y}^{M A}$ is the estimated model-assisted population total for attribute y, U denotes the (universal) set of all population units, and predicted values are obtained from a model, i.e., $\hat{y_{i}} = m (x_{i})$ , oftentimes a linear regression model. The first term on the right-hand side of Equation (2) requires at least that population totals τ_x for all predictors are known, while the second summand is a H-T estimator of sample deviations from model-predicted values, which should be positive if the model underpredicts for y∈s and negative if the model overpredicts.

Särndal et al. (1992) presented an unbiased estimator for the variance of Equation (2) when the coefficients of a linear model m are known constants, an application known as the difference estimator. The difference estimator is design-based because both the estimator ${\hat{τ}}_{y}^{M A}$ and it's estimated variance are unbiased over all possible samples due to the sample design. Thus, although ${\hat{τ}}^{M A}$ is a design-based estimator, the phrase “model-assisted” is used to indicate that a model relationship is involved in the estimation of τ.

The regression estimator, or generalized regression estimator (GREG) is based on the same form as Equation (2) in cases where ŷ is predicted from a regression model. GREG has been applied in forest inventory solutions that include post-stratification, ratio estimators, LASSO, ridge, and elastic-net regressions (Stehman, 2009; McConville et al., 2020). Extensions using non-linear, semiparametric, and non-parametric predictive modeling techniques have also been demonstrated in forest inventory applications (Opsomer et al., 2007; Tipton et al., 2013; Kangas et al., 2016). Although taking the same form as the difference estimator, as distinguished by Särndal et al. (1992), the linear predictor m(x) in GREG consists of a regression model y = xβ fit to paired sample values of x and y. Some authors distinguish between the difference estimator and GREG as either external or internal, respectively, based on the sources of information from which their coefficients are derived (Kangas et al., 2016; Stahl et al., 2016). Still within the realm of direct, model-assisted estimators, the term modified GREG is used to distinguish models where coefficients are derived using data outside the population of interest (Rao and Molina, 2015). For GREG to be approximately design-unbiased, sample inclusion probabilities for x and y are used in a weighted-least-squares fit of the model to sample observations. Inclusion probabilities are also used in an approximate variance formulation for the GREG estimator detailed by Särndal et al. (1992, ch. 6). Confidence intervals can be reliably constructed for these estimators, but are usually used for major domains as their variance can be large or unstable in domains having small sample sizes (Särndal, 1984; Lehtonen and Veijanen, 2009). We note that design-based approaches including H-T and model-assisted estimation are sometimes categorized as SAE (see Figure 2.1 in Rahman and Harding, 2017; Hill et al., 2021); however, they are often used where interest lies in only a single population domain, such as in the methods demonstrated by McConville et al. (2020) for estimating forest attributes in a single county in Utah, USA.

2.3. Model-Based Estimators

A second pillar of many SAE procedures is the model-based framework for inference, which we introduce here as being distinct from a purely design-based framework, including model-assisted estimators described in Section 2.2. In model-based estimation, univariate or multivariate statistical models are formulated to establish the basis for assigning probabilities to observed data, for characterizing probabilistic relationships between variables, or for both. Unlike the fixed-quantity view of population units in design-based estimation, model-based approaches treat observable units in a population as realizations or instances of random variables that underlie the observable population. For this reason these conceptual models of population variables and their statistical distributions are sometimes referred to as “superpopulation models” but are just as often simply called models (Gregoire, 1998; Skinner and Wakefield, 2017).

Model-based estimators can be useful when random sampling is impractical, or when assigning inclusion probabilities to sample observations requires some assumption about the statistical or spatial distributions of population units (e.g, Radtke and Bolstad, 2001). Sterba (2009) emphasized the utility of model-based estimators in surveys where selection probabilities were unknown, particularly in some forms of non-random sampling. Perhaps most important in the context of this review, models serve an important purpose in providing a statistical link between sample observations of y and auxiliary data x, to make use of auxiliary information in ways that increase the precision of parameter estimates. The parameters of interest in forest inventory likely include population totals or means; additionally, parameters of the models themselves, such as regression parameters or ratios linking auxiliary and sample data may be of interest (Gregoire, 1998). Model-based methodology includes many tools to assist with identifying best models to fit data, to estimate parameters and standard errors for model predictors, and to adopt complex models and analyze more complex data structures than might be possible from sampling alone (Rao and Molina, 2015). While the complex and expansive nature of model-based methodologies leads to a wide array of SAE techniques that make use of models, they place a somewhat greater burden on analysts to adequately address model selection, goodness of fit, or checking of model assumptions—all model concepts that would not be required in design-based approaches.

3. Terminology and Methods

In Sections 2.1–2.3, estimation was presented as a means of obtaining reliable information about population parameters of interest—either for fixed finite populations or underlying model superpopulations—along with rationale for making statistical inferences under design-based and model-based paradigms. In moving to settings where the goal involves making reliable estimates for subsets or domains of interest within broader populations, the presentation will be expanded to include ideas and terminology better suited to the purposes and practice of SAE.

3.1. Direct, Indirect, and Composite Estimators

Apart from the design- and model-based modes of inference introduced above, estimators in general can be described as being either direct, indirect, or composite in nature, depending on the sources of information they make use of. Domain-direct estimates use observed values y_i only from sample units in a particular domain, i.e., i∈s_d. Domain-specific totals can be estimated directly as

\begin{array}{l} {\hat{τ}}_{d}^{D I R} = \sum_{i ϵ s_{d}} w_{i} y_{i} & (3) \end{array}

where DIR indicates that $\hat{τ}$ is a direct estimator, and d denotes the domain of interest. Substituting w_i = 1/π_i in Equation 3 shows how the H-T estimator (Equation 1) serves as the domain-direct estimator when data are restricted to sample units selected in domain d, i.e., when i ∈ s_d. The benefit of direct domain estimation is that no explicit model assumptions are required (cf. Sterba, 2009), and sampling weights can be used, allowing for design-unbiased estimates. A fundamental concern of SAE is that direct estimation often leads to unacceptably large or unstable standard errors for domains with small sample sizes (n_d). Additionally, no direct estimates are possible for domains having no sampled units, i.e., when n_d = 0.

Indirect domain estimation seeks to remedy the direct domain estimator's shortcoming of large variance when n_d is small by increasing the “effective sample size” using information outside of the domain of interest together with a statistical model (Rao and Molina, 2015, pg. 35). The indirect approach is manifest in what is often called a synthetic estimator, e.g.,

\begin{array}{l} {\hat{τ}}_{d}^{S Y N} = τ_{x d}^{'} \hat{β} & (4) \end{array}

which gives the indirect or synthetic estimate (Schaible, 1993) of the total for the d^th small area domain, with domain d auxiliary total τ_xd, and regression coefficients $\hat{β}$ estimated from (x, y) data sampled across the entire population to borrow strength for estimating on d. Benefits of the synthetic estimator are that it can allow for estimates to be made in non-sampled units, and likely has a smaller variance than direct domain estimates, especially where n_d is small. However, Equation (4) as given does not account for between-domain heterogeneity and thus can be biased for specific domains. Additionally the indirect estimator does not necessarily trend toward the unknown domain total τ_d as n_d increases.

Not all synthetic estimators are regression models, but the example in Equation (4) was chosen to illustrate some additional details. First, when the coefficients are estimated only from data sampled in the domain of interest, i.e., when $\hat{β}$ in Equation (4) is replaced by ${\hat{β}}_{d}$ , the estimator is considered to be a model-assisted (GREG) estimator having design-based properties for inference. The same is true when the regression model is fit using sample observations weighted by their design weights using weighted least squares (Särndal et al., 1992; McConville et al., 2020, Section 6.4). Second, a synthetic estimator's inference base may depend on assumptions about its model form being correct, and whether its parameters estimated from one set of domains are suitable for making predictions for other domains. As with any estimator, care should be taken to verify what conditions must be met to satisfy the inference base for GREG.

The synthetic estimator can be used in concert with the direct domain estimator to create a composite estimator that balances the strengths and weaknesses of direct and indirect estimators (Rao and Molina, 2015):

\begin{array}{l} {\hat{τ}}_{d}^{C O M P} = γ_{d} {\hat{τ}}_{d}^{D I R} + (1 - γ_{d}) {\hat{τ}}_{d}^{S Y N} & (5) \end{array}

The weighted average of the direct and indirect estimators comprises the composite estimator ${\hat{τ}}_{d}^{C O M P}$ , where ${\hat{τ}}_{d}^{D I R}$ is a direct estimator, e.g., (3), ${\hat{τ}}_{d}^{S Y N}$ is an indirect estimator, e.g., (4), and γ_d ∈ [0, 1] is a domain-specific weighting factor, also known as a shrinkage factor. An optimal solution for minimum $M S E ({\hat{τ}}_{d}^{C O M P})$ can be formulated as ${\hat{γ}}_{d} = M S E ({\hat{τ}}_{d}^{S Y N}) / (M S E ({\hat{τ}}_{d}^{S Y N}) + M S E ({\hat{τ}}_{d}^{D I R}))$ (Rao and Molina, 2015, Section 3.3.1). For the optimal solution, as n_d gets large or when $M S E ({\hat{τ}}_{d}^{D I R})$ is small, ${\hat{γ}}_{d}$ tends toward 1, moving the composite estimator toward ${\hat{τ}}_{d}^{D I R}$ and its favorable properties of unbiasedness and consistency. Similarly ${\hat{τ}}_{d}^{C O M P}$ tends toward the synthetic estimator when n_d is small or $M S E ({\hat{τ}}_{d}^{D I R})$ is large. Most SAE approaches favor the composite estimator for its ability to balance the unbiasedness and precision of direct and indirect estimators, respectively, while allowing for flexibility in the choice and formulation of direct estimators and synthetic models.

3.2. Unit Level SAE

Unit-level SAE employs synthetic models that operate at the scale of observational or sample units in the population, typically field plots in forest inventory applications. The synthetic model in a unit-level estimator is used to predict ŷ on all population units in a domain d, regardless of how many (or whether) sample data for y were observed in that domain. Predictors from auxiliary variables x and the model relationship y~x provide the means of generating predictions for ŷ∈d. Here x is either known for every unit in d or known in aggregate, as in a case where a domain-specific mean for x, denoted ${\bar{x}}_{d}$ , is known. In either case, paired values of (x, y) are observed on sampled units across the broader population, i.e., the indirect data, and used to fit or train a synthetic model. Errors in the synthetic model predictions are partitioned into two components (see Rao and Molina, 2015, model 4.3.1). A domain-specific error v_d is attributed to synthetic model variance that applies equally to all y∈d, and within-domain residual errors e_di—independent of v_d—that apply to individual sample units y_i; i ∈ d.

The unit-level composite estimator developed by Battese et al. (1988, hereafter BHF) is an excellent example from which to study unit-level SAE, as the data and models the authors used to demonstrate the application are integrated into the R “sae” package (Molina and Marhuenda, 2015). BHF used the nested-error linear regression model

y_{d i} = x_{d i}^{'} β + v_{d} + e_{d i}; d = 1, \dots, D; i = 1, \dots, n_{d} .

to estimate crop areas for corn and soybeans in Iowa, using USDA Statistical Reporting Service field survey data from 1978 as direct observations of y, and Landsat 2 multispectral scanner (57 x 79 m) raster imagery as auxiliary information for x. Unit-level approaches rely on matching individual sample unit observations to the auxiliary data; thus, the satellite image pixels in BHF were assigned to corresponding 250-ha (1 sq. mile) field survey units (BHF called the units “segments”)—about 555 raster cells per segment. The aggregated Landsat and field survey data formed (x, y) pairs for 37 sampled segments in a population of D = 12 counties, with each county treated as a small-area domain of interest. The BHF synthetic model used a response y = sample segment area (ha) growing corn, in a linear regression model with an intercept and two predictors x₁ = Landsat pixels classified as corn and x₂ = pixels classified as soybeans, with $\sum_{d \in D} n_{d} = 37$ observed segments.

Using their synthetic model, similar to Equation (4), BHF were able to predict domain means for corn area per segment in counties as

\begin{array}{l} {\hat{μ}}_{d} = {\bar{x}}_{d}^{'} \hat{β} + {\hat{v}}_{d} & (6) \end{array}

with ${\bar{x}}_{d}$ calculated from all segments in county d, using Landsat pixel class counts of a given crop in each segment (see Table 1, Battese et al., 1988). By estimating $\hat{β}$ as fixed effects and ${\hat{v}}_{d}$ as random effects using mixed linear regression modeling, BHF obtained the empirical best linear unbiased predictor (EBLUP) of domain-specific means ${\hat{μ}}_{d}^{E B L U P} = {\bar{x}}_{d}^{'} \hat{β} + {\hat{v}}_{d}$ , a key contribution of their work because of the favorable properties of the EBLUP.

Goerndt et al. (2013, Equation 5) and Costa et al. (2009, Equation 13) presented the BHF unit-level nested error EBLUP in the form of a composite estimator with domain-level, i.e., county-level estimates obtained from direct and synthetic regression estimates weighted as

\begin{array}{l} {\hat{μ}}_{d}^{E B L U P} = (\frac{{\hat{σ}}_{v}^{2}}{{\hat{σ}}_{v}^{2} + \frac{{\hat{σ}}_{e}^{2}}{n_{d}}}) {\hat{μ}}_{d}^{D I R} + (1 - \frac{{\hat{σ}}_{v}^{2}}{{\hat{σ}}_{v}^{2} + \frac{{\hat{σ}}_{e}^{2}}{n_{d}}}) {\bar{x}}_{d}^{'} \hat{β} & (7) \end{array}

where ${\hat{μ}}_{d}^{D I R}$ is a sample-direct estimate of the mean y for small area domain d and $\hat{β}$ is the vector of fixed effects regression coefficients obtained by linear mixed modeling with domain-specific random effects. The term $σ_{v}^{2}$ in Equation (7) denotes the variance among domain random effects $v_{d} ~ N (0, σ_{v}^{2})$ , and $σ_{e}^{2}$ as the residual variance of population units—the variance unaccounted for by the EBLUP—within domains, i.e., $e_{d i} ~ N (0, σ_{e}^{2})$ . This assumption assigns a single residual variance to all domains, with $σ_{e}^{2}$ estimated from the full set of sample residuals. As when using optimal shrinkage weights ${\hat{γ}}_{d}$ in Equation (5), the weights in Equation (7) ensure that as the domain-specific direct estimator variance gets small, such as when n_d is large, the EBLUP tends toward ${\hat{μ}}_{d}^{D I R}$ .

Unit-level paired (x, y) data typically provide an information-rich means of linking indirect data from broad and extensive populations to specific domains of interest. Further, when y is observed by non-random sampling, the model-based properties of the composite estimator support approximate and asymptotic inference bases where direct estimation alone would not. Efforts should be made to verify the veracity of the underlying statistical model and stochastic processes, especially when data collection employs stratification, clustering, or disproportionate sampling among some elements of the population (Sterba, 2009). In applications involving large data sets the computational requirements of unit-level analyses can be demanding, especially when objectives include the validation of variance estimates approximated by Taylor series linearization or when using resampling procedures to estimate variance components (Prasad and Rao, 1990; González-Manteiga and Morales, 2008). Computationally demanding synthetic models can also pose challenges for SAE, but software advances have made steady gains in providing tools to address such challenges (McRoberts et al., 2007). As with any model-based inference approach, customary steps involving model selection, goodness-of-fit, and checking other assumptions are important in unit-level SAE. In cases where influential points, heteroscedastic error variances, or non-independence of residuals present problems, developments have been made to help overcome limitations of the BHF approach (Babcock et al., 2015; Breidenbach et al., 2018).

3.3. Area Level SAE

Unlike unit-level approaches, area-level SAE employs synthetic models that operate at the scale of subpopulation domains, rather than individual sample or observational units. As a consequence, auxiliary data do not need to be paired one-to-one with individual sample observations. Instead, domain-direct estimates (e.g., ${\hat{τ}}_{d}^{D I R}$ or ${\hat{μ}}_{d}^{D I R}$ ) are paired with domain-specific observations, such as domain means or totals of x ∈ d, which we denote as x_d. The paired domain data $(ŷ_{d}^{D I R}, x_{d})$ are then used to develop regression models or train other types of synthetic models for use in SAE. Rao and Molina (2015, model 4.2.5) presented a regression-based area-level model

\begin{array}{l} {\hat{y}}_{d}^{D I R} = {x^{'}}_{d} β + b_{d} ψ_{d} + ϵ_{d} & (8) \end{array}

where ψ_d are domain-specific model errors and ϵ_d are errors due to sampling on the domain-direct estimates that appear on the left-hand side in Equation (8). In the basic area-level model, Rao and Molina (2015) explain that b_d are domain specific positive constants set to b_d = 1 in the model presented by Fay and Herriot (1979), which, for direct estimates of domain totals can be expressed as

\begin{array}{l} {\hat{τ}}_{d}^{F H} = {x^{'}}_{d} β + ψ_{d} + ϵ_{d} & (9) \end{array}

Fay-Herriot (F-H) models (Fay and Herriot, 1979) like the one shown in Equation (9) are often synonymous with area-level SAE, as they have seen considerable use in area-level applications since their introduction. An EBLUP derived from the F-H model, expressed as a composite estimator having a form similar to Equation (5) shows the roles of direct and synthetic estimators

\begin{array}{l} {\hat{τ}}_{d}^{F H} = ({\hat{γ}}_{d}) {\hat{τ}}_{d}^{D I R} + (1 - {\hat{γ}}_{d}) {x^{'}}_{d} \hat{β} & (10) \end{array}

with weights ${\hat{γ}}_{d} = {\hat{σ}}_{ψ_{d}}^{2} / ({\hat{σ}}_{ψ_{d}}^{2} + σ_{ϵ_{d}}^{2})$ composed as the model variance relative to the total variance. A difference in the shrinkage weights' formulations for Equations (7) and (10) is that $σ_{ϵ_{d}}^{2}$ in Equation (10) is assumed to be known, but typically specified as the variance estimated from sample observations y ∈ d. Wang and Fuller (2003) developed a modified area-level estimator that accounts for the empirical estimation of domain-direct variances, which has been demonstrated in forest inventory applications (Magnussen et al., 2017). Similar to the unit-level model, $\hat{β}$ in Equation (10) is a vector of fixed-effects coefficients from a linear mixed model having domain-specific random effects. As in Equation (7) the F-H EBLUP tends toward the direct estimator when the direct variance is low, and toward the synthetic estimator when the direct variance is high. Because the synthetic estimator in the F-H model operates on subpopulation domains, the same aggregated measures x_d from the auxiliary data that were used in estimating the regression coefficients are also used in predicting ${\hat{τ}}_{d}^{F H}$ in Equation (10). In contrast, the unit-level EBLUP uses an aggregated measure such as ${\bar{x}}_{d}$ in Equation (7), despite the model coefficients having been estimated using observed x (and y) values from individual sample units.

In applications framed in a geographic context, such as forest inventories, where field plot observations are often paired to auxiliary data from remote sensing, area level modeling obviates the need for highly accurate plot coordinates, and can be used when there is a degree of misalignment between plots and auxiliary data (Goerndt et al., 2011). This concern is highly relevant when the protection of confidential information prevents the release of exact coordinates of sampled locations to unauthorized personnel. Instead, only direct linkage to a specific domain is needed for each plot. This also facilitates using sample units that possess indistinct sampling boundaries, such as where linear transects or variable-radius plots are used in field sampling (Ver Planck et al., 2018). This method also tends to require less processing time which lends itself to analysis involving very large datasets.

3.4. Other SAE Methods

In addition to the aforementioned model forms, there are other methods which serve as alterations or variations of the above model types and as such are not mutually exclusive from them. Such methods include Bayes methods, and nearest neighbor models which are worthy of particular mention due to their use in forestry SAE (McRoberts, 2012; Babcock et al., 2018; Ver Planck et al., 2018). Bayesian methods, both empirical Bayes (EB) and hierarchical Bayes (HB) offer some advantages over their frequentest counterparts such as being able to model different target variable types such as binary or count data (He and Sun, 2000). HB also gives posterior distributions of the small area parameters, and can therefore avoid relying on unrealistic asymptotic assumptions (Pfeffermann, 2013). Bayesian methods offer flexibility in specifying spatial structures such as spatially correlated random effects (Wang et al., 2018).

Nearest neighbor techniques offer a similar set of benefits for SAE, for example, where estimators for categorical, binary, or count variables are involved. They are non-parametric in that no distributional assumptions regarding response or predictor variables are necessary, which has proven useful when multivariate auxiliary information is used to construct synthetic models. Nearest neighbor models can also accommodate correlated sample and auxiliary data that may arise in spatial or temporal domains (McRoberts et al., 2007; McRoberts, 2012).

4. Applications in Forest Inventory

We now turn to selected examples related to SAE in forest inventory from published research (Table 1). Note that in Table 1, we exclude a large body of literature employing post-stratification, a widely-used model-assisted estimation technique. Instead we aim to focus on estimators less-commonly used in existing forest inventory production processes. Post-stratification notwithstanding, model-assisted estimators including GREG were among the most widely-used and earliest-adopted methods for increasing estimator precision using models to link auxiliary information from remote-sensing with field sample data. As such, we elected to include a number of model-assisted applications in our example references, even though some of the selected examples do not involve “small area” domains as the term is often used in SAE (Table 1).

TABLE 1

Table 1. Table of selected studies using SAE-related methods in forest inventory with methods and data used.

Among the reasons authors have given for adopting model-assisted estimators was the need to ensure design-based inference in large, multi-resource sample designs (Reich and Aguirre-Bravo, 2009; Næsset et al., 2011). Others cited the need for precise estimation in small area domains nested within broader population surveys as a motivating factor (Goerndt et al., 2011; McRoberts, 2012; Magnussen et al., 2014). A few noted that consistency and additivity of estimates from small areas nested within larger domains were motivating concerns (Reich and Aguirre-Bravo, 2009; Nagle et al., 2019). Others noted the need for sample survey organizations to conduct generic inference, i.e., to make compatible estimates of all forest attributes simultaneously by using the same model to define survey weights (Opsomer et al., 2007; Johnson et al., 2008; McConville et al., 2020). Nearly all research referenced in Table 1 reported the potential for increased efficiency in estimating forest inventory attributes as a reason for pursuing the work, with the high cost of increasing field sample sizes frequently noted as an operational constraint.

Lidar and digital aerial photogrammetry (DAP) were major data sources used as auxiliary information, with some authors using lidar or DAP point cloud metrics, e.g., height percentiles or pulse return densities aggregated to unit or area levels, and others using canopy height models (CHM) processed from point cloud data (Steinmann et al., 2013; Babcock et al., 2015). Field plots in forest inventories were the primary source of directly-sampled observations, with more than half of the selected studies using NFI or other land-management agency field sample observations, and most others using plots installed for management or research purposes, such as on state, university, private, or public experimental and working forests (Anderson and Breidenbach, 2007; Mauro et al., 2017; Green et al., 2020).

Several studies aimed to examine and augment existing estimation frameworks in forest inventory settings to address potential violations in underlying assumptions. Examples include accounting for heteroscedasticity of variance in synthetic model residuals and spatial or temporal autocorrelation among measurements on observational units or areas of interest (Babcock et al., 2018; Breidenbach et al., 2018; Ver Planck et al., 2018; Mauro et al., 2019). Estimating the change of forest attributes over time with SAE methods was demonstrated successfully by Mauro et al. (2019) and by Coulston et al. (2021), both of which used repeated measurements from forest field plots in their work.

Multiple studies applying one or both unit-level and area-level EBLUP-based SAE appear in Table 1. The choice of adopting unit- or area-level SAE in forest inventory applications can depend on limitations of sample or auxiliary data sets, for instance, where georeferencing inaccuracies introduce significant errors in pairing field observations to remotely-sensed or other geospatial data sets, e.g., maps (Næsset et al., 2011; Green et al., 2020). Similar challenges in pairing field data to remote-sensing can occur where plot designs and raster layers are incompatible, such as when variable radius field plots (i.e., angle gauge sampling) are used, or when field plot sizes are small compared to the pixel size in available auxiliary data sources (Goerndt et al., 2011; Ver Planck et al., 2018; Temesgen et al., 2021). Although area-level estimators are flexible to accommodate situations where precise georeferencing is impractical, a trade-off may arise due to the loss of information from aggregating unit-level observations to domain- or area-level scales; further, aggregating sample observations also reduces effective numbers of observations available for synthetic model development (Magnussen et al., 2017).

When georeferencing enables pairing of auxiliary and sample data, sampled units can all serve as (x, y) observations for fitting or training synthetic models. The models can then be used to make predictions on raster cells or other spatial units that correspond to unobserved population units (Babcock et al., 2018; Frank et al., 2020). It follows that unit-level SAE estimators can be used to map predictions or estimates at finer spatial resolutions than area-level methods, the latter of which allow for mapping SAE predictions only at domain levels. Another potential advantage of the unit-level approach is the ability to define arbitrary spatial domains subsequent to sampling without necessarily losing substantial power of inference (Pascual et al., 2018). Gains in efficiency between unit-level and area-level SAE methods have been compared in multiple studies, generally confirming the potential for greater gains from unit-level approaches (Mauro et al., 2017; Breidenbach et al., 2018).

4.1. Gains Over Direct Estimation

4.1.1. Model-Assisted Estimation

Model-assisted estimators have proven to reduce estimator errors considerably, e.g., when measured by the relative efficiency of a model-assisted estimator compared to its simplest direct counterpart—often characterized as simple random sampling—cf. Equations (2) and (3)

\begin{array}{l} R E = \frac{V a r ({\hat{τ}}_{y}^{M A})}{V a r ({\hat{τ}}_{y}^{D I R})} & (11) \end{array}

McRoberts et al. (2013) demonstrated substantial variance reduction 100%(1−RE) = 84% using a non-linear regression-based model-assisted estimator of growing stock volume per unit area in a 1,300 km² study area in southeastern Norway. The gains corresponded to over sixfold increase in apparent sample size, calculated as RE⁻¹. Aerial lidar (0.7 pulses m^-2) served as the auxiliary data, with direct observations from n = 145 NFI-type (200 m²) field plots over the study area. They noted similar variance reduction (82%) using the same method to estimate growing stock volume over a one-half partition of the study area represented by n = 69 field plots, thus demonstrating how sample data from a larger area can be used to increase the precision of estimates in a smaller area (McRoberts et al., 2013).

Model-assisted estimation has been demonstrated to increasing precision in multiple population sub-domains including Breidenbach and Astrup (2012), who tested GREG as a model-assisted estimator in a similar sized study area in Norway divided into 14 municipalities, each having from 1 to 35 NFI sample plots of a total n = 145. They noted gains in precision were smaller and more variable than McRoberts et al. (2013), with RE ranging from 0.35 to 0.87 in eight municipalities having n_d≥ 6. They concluded that sample sizes n_d <6 in the other six municipalities gave unstable estimates, a finding similar to Næsset et al. (2011). Their data revealed that despite GREG's limitation in areas with small n_d, its estimates deviated from sample-direct design-based estimates by less than half as much as synthetic model estimates alone, a result consistent with the design-unbiased property of GREG in the model-assisted framework. Næsset et al. (2013) observed consistent improvement in precision of estimates of aboveground forest biomass using GREG with aerial lidar auxiliary data in model-assisted two-stage sampling. Their gains were greatest (RE = 0.125) for all cover classes combined (n = 632), and only slightly more modest 0.09 ≤ RE ≤ 0.20 for individual age and productivity classes, all of which had class-specific sample sizes n_c≥46.

Reduced RMSE of synthetic model estimates when compared to direct estimator variance has been demonstrated in a number of applications involving the pairing of ALS and NFI-type field sample data (Næsset et al., 2011; Järnstedt et al., 2012; Nord-Larsen and Schumacher, 2012; Kotivuori et al., 2016; Nilsson et al., 2017; Novo-Fernandez et al., 2019). A variety of synthetic modeling approaches have been tested including parametric and non-parametric regression modeling, Random Forests and other ensemble predictive models, and nearest-neighbor imputation, often with a goal of identifying suitable auxiliary data sources for estimating forest biophysical attributes (Latifi et al., 2010; Popescu et al., 2011; Bright et al., 2012; Rahlf et al., 2014). A common theme in these studies is the direct examination of synthetic model prediction errors (e.g., using cross validation) without formulating the models in a design-based or composite modeling framework to mitigate potential synthetic estimator bias (cf. McRoberts et al., 2013; Irulappa-Pillai-Vijayakumar et al., 2019; McConville et al., 2020). In model-assisted applications the usual goal is to increase the precision of population-level estimates, and less often to produce estimates for domains that divide a larger population into small areas where direct estimator instability can be a concern. Where investigated, model-assisted estimators were able to reduce small area uncertainties considerably, within limits of direct-data sampling intensity and the strength of model relationships involving indirect and auxiliary data (Breidenbach and Astrup, 2012).

4.1.2. Unit-Level SAE

Research comparing unit-level SAE to model-assisted estimators has shown that gains in precision are generally greater for unit-level EBLUPS than model-assisted estimates (e.g., GREG) primarily when direct-domain sample sizes are small. In directly comparing unit-level EBLUPs to model-assisted GREG estimates, Breidenbach and Astrup (2012) noted an average $\bar{R E} = 0.86$ , meaning unit-level SAE MSEs were lower than direct estimate variances by an additional 14%, on average, compared to GREG. Not all EBLUP MSEs were smaller than those computed for GREG. Comparisons showed greatest gains in five of six municipalities having 6 ≤ n_d ≤ 17, but smaller—even negative—gains ( $\bar{R E} = 0.93$ and 1.15) were noted in two municipalities having n_d≥29. Such findings indicate that unit-level EBLUPS may not have as clear of an advantage over GREG in reducing estimator variance when domain sample sizes are relatively large; nonetheless, even when n_d was relatively large, EBLUP performance was not much worse than model-assisted regression estimates (Breidenbach and Astrup, 2012). EBLUPS showed considerable stability across the range from 1 ≤ n_d ≤ 35 compared to GREG, with unit-level SAE relative errors ranging from just [7.0, 12.4] % compared to a range of [0.6, 25.4] % for GREG, even with three municipalities having n_d = 1 excluded for the GREG results since their errors could not be calculated (Breidenbach and Astrup, 2012). Magnussen et al. (2014) reported virtually identical gains for model-assisted and unit-level EBLUP estimates (both $\bar{R E} = 0.49$ ) when compared to direct volume per hectare estimates for forest districts in Switzerland. The near identical results may have been related to comparatively large average sample sizes ${\bar{n}}_{d} = 79$ across the D = 108 Swiss forest districts.

Of the studies listed in Table 1 that compared unit-level EBLUPs to domain-direct estimates, the largest variance reductions noted ( $\bar{R E} = 0.03$ ) were in a study of forest volume in Burgos Province, Spain, made up of D = 54 stands in an area covering about 13.65 km² (Pascual et al., 2018). The same authors reported more modest gains ( $\bar{R E} = 0.50$ ) in a 3 km² management area having D = 6 stands near Cercedilla, Spain. The authors attributed the greater gains at Burgos as being due to the lower sampling intensity there (about 0.5 %) compared to a 4 % sampling intensity at the Cercedilla site (Pascual et al., 2018). Large gains were reported for unit-level volume EBLUPs ( $\bar{R E} = 0.09$ ) tested by Mauro et al. (2017) in an area roughly 8 km². A high degree of positive skewness was evident in the distribution of n_d across domains, with roughly 30% of map unit domains having n_d ≤ 2 despite ${\bar{n}}_{d} = 10.3$ across D = 64 map units having any sample observations (Mauro et al., 2017). In estimating volume change in D = 24 stands experimentally manipulated for three levels of forest structural diversity, Mauro et al. (2019) reported unit-level gains ( $\bar{R E} = 0.71$ ) for 7-year volume change estimates. The sample design relied on 1 remeasured plot per 8 ha on a systematic grid, for a sparse but narrow range of domain-direct sample sizes, with 3 ≤ n_d ≤ 10 and ${\bar{n}}_{d} = 6.3$ (Mauro et al., 2019). Breidenbach et al. (2018) reported greater efficiencies $\bar{R E} = 0.43$ and $\bar{R E} = 0.28$ compared to direct estimate variances, with the latter result including a formulation that accounted for heteroskedasticity of variance in synthetic model residuals.

By artificially reducing sample sizes from the available n = 680 NFI plots across D = 12 county-sized domains for estimating Oregon Coast Range forest volumes, Goerndt et al. (2013) demonstrated diminishing efficiency gains with increasing n_d in unit-level EBLUPS from $\bar{R E} = 0.41$ (n = 136), to $\bar{R E} = 0.57$ (n = 204), to $\bar{R E} = 0.71$ (n = 272). They also found that alternative unit-level composite estimators calculated with smoothed variances performed well in terms of increased precision and low apparent biases in unit-level SAE (Costa et al., 2003; Goerndt et al., 2013). The alternative estimators employed multiple linear regression, as well as nearest neighbor and gradient-nearest-neighbor imputation in synthetic models to achieve balances between bias and precision of SAE (Ohmann and Gregory, 2002).

4.1.3. Area-Level SAE

A number of studies have shown area-level SAE precision gains for timber volume or biomass compared to sample-direct estimates. Breidenbach et al. (2018), for example, reported a reduction in overall standard error from 32.7 m³ha⁻¹ for direct volume estimates compared to F-H area-level RMSE of 23.1 m³ha⁻¹ ( $\bar{R E}$ = 0.50). A nearly identical gain in efficiency ( $\bar{R E}$ = 0.50) was reported by Goerndt et al. (2011) for volume estimates in forest stands treated as small area domains with intentionally-reduced sample sizes between 2 ≤ n_d ≤ 4. Relative gains tended to be less when the simulated small n_d were increased by factors of two and three, with no gains for larger sample size increases. Despite finding modest gains when n_d was large, area-level EBLUPS were demonstrably superior—in terms of RE and lack of apparent biases—to two synthetic estimators and two composite James-Stein type estimators tested (James and Stein, 1961; Goerndt et al., 2011). In testing area-level SAE with counties as small-area domains, a composite estimator similar to F-H showed $0.43 \leq \bar{R E} \leq 0.91$ over a 20-state region of the northeastern U.S. (Goerndt et al., 2019). In the same study a composite estimator based on a non-parametric nearest-neighbor (NN) synthetic model showed slightly less gain in efficiency than the F-H type approach. Despite this, Goerndt et al. (2019) indicated the NN approach may have lower model bias and they cautioned against potential biases in model-based estimators, pointing out the need for thorough checking of model assumptions to ensure validity of model-based inferences.

The pattern of decreasing gains in efficiency with increasing n_d was also noted by Mauro et al. (2017), who reported an average $\bar{R E}$ = 0.48 over D = 84 management areas (domains), while no gain ( $\bar{R E}$ = 1.13) was noted in 14 of the domains having n_d≥25 sample plots (cf. Goerndt et al., 2011, 2019). Green et al. (2020) noted average $\bar{R E}$ = 0.79 in F-H estimates of timber volume across D = 40 stands, with little or no efficiency gains in stands where direct estimates were already quite precise (relative standard errors <10%). Greater gains ( $\bar{R E} \approx 0.35$ ) were noted in stands having direct relative standard errors >25%. Findings such as these indicate that where domain-direct estimates are already quite precise, as in cases where n_d is large or variation within domains is inherently low, F-H type estimators may exhibit a limited ability to further increase precision over domain-direct estimates.

Magnussen et al. (2017) reported $\bar{R E}$ ranging from 0.44 to 0.77 in four study areas in Spain, Norway, Switzerland, and Germany, using a modification of F-H that treats domain-specific variances as estimates rather than known constants (Wang and Fuller, 2003). In the same study they found greater gains $0.28 \leq \bar{R E} \leq 0.34$ by including a non-stationary spatial correlation process in an area-level composite estimator which accounted for the spatial covariance structure in model residuals (Chandra et al., 2012, 2015). In a third approach Magnussen et al. (2017) used the HB approach of Datta and Mandal (2015) to obtain efficiency gains intermediate ( $0.27 \leq \bar{R E} \leq 0.81$ ) compared to their baseline—specifying empirically estimated variances—and non-stationary spatial F-H approaches. The Bayesian approach demonstrated several advantages related to estimated posterior distributions for specific domains, especially when the random-effect variance was largely attributable to a small number of domains from the larger population (Magnussen et al., 2017). Coulston et al. (2021) evaluated the performance of area-level F-H models including a simultaneous autoregressive (SAR) model of residual spatial correlation among domains in estimating forest removals, noting the spatial model improved efficiency of estimates at scales of individual counties, but not at the scale of larger survey regions encompassing groups of 12-20 counties each in the southeastearn United States. Ver Planck et al. (2018) compared F-H gains with no accounting for spatial autocorrelation ( $\bar{R E} = 0.23$ ) to a conditionally-autoregressive (CAR) F-H model that further reduced estimator variance ( $\bar{R E} = 0.19$ ), noting that the CAR model performance was generally greater in domains (stands) sharing boundaries with high numbers (> 10) of neighboring stands. The non-stationary spatial process (Chandra et al., 2012) may improve upon simpler CAR and SAR modeling approaches, as it employs a distance-weighted measure of correlation among domains rather than a simple binary model based on domain adjacency. Spatial models provide a promising tool for future applications of area-level SAE that account for non-trivial spatial correlations among small area domains likely to be realistic in many forest inventory applications (Finley et al., 2011; Magnussen et al., 2017).

4.1.4. Unit Level vs. Area Level SAE

Although area-level EBLUPS have shown clear gains in efficiency over direct estimates in forest inventory SAE applications, still greater gains have been demonstrated using unit-level approaches when suitable data are available and model assumptions are met (Breidenbach et al., 2016, 2018). In comparing both approaches to direct estimates of forest volume, Mauro et al. (2017) observed halving of variance ( $\bar{R E}$ = 0.48) and ten-fold reduction ( $\bar{R E}$ = 0.09) for area-level and unit-level estimates, respectively. A notable feature in the study was the large proportion—slightly more than half—of the 84 stand groupings defined as small area domains containing n_d ≤ 6 field plots, linking greatest gains in efficiency to domains having relatively small n_d. Their unit-level results achieved variance reductions among the largest of any reported in forest inventory literature (Mauro et al., 2017; Pascual et al., 2018). By comparison, Breidenbach et al. (2018) reported $\bar{R E} = 0.50$ (see Section 4.1.3) for area-level and $\bar{R E} = 0.28$ (see Section 4.1.2) for unit-level estimators applied to a common data set. Their domain-direct sample sizes were also small [4 ≤ n_d ≤ 7], so the source of differential gains between the two studies' unit-level estimators may be related to other factors including the strength of the synthetic model relationships, which we weren't able to compare between the two studies (Mauro et al., 2017; Breidenbach et al., 2018). A distinct barrier to achieving large gains in efficiency with unit-level SAE, especially compared to area-level approaches, is the ability to accurately georeference field plots and spatial auxiliary data sources. Green et al. (2020) noted this as a possible reason for the lack of gains in their unit-level models compared to area-level SAE.

4.2. Variance Estimation

Variances for model coefficients and estimates of finite-population parameters for design-based direct or model-assisted estimates (Sections 2.1 and 2.2) are in most cases calculable using commercially available statistical software packages (Molina and Marhuenda, 2015; Breidenbach, 2018; McConville et al., 2018; Hill et al., 2021). Other variance estimators are documented in research literature in sufficient detail to facilitate calculation with scientific programming software (McRoberts, 2012; Mandallaz et al., 2013; Babcock et al., 2015; Magnussen et al., 2017; Mauro et al., 2017; Frank et al., 2020). In design-based model-assisted estimators, variance calculations exist in closed form for some estimators such as GREG, or as approximations for others including ratio estimators (Särndal et al., 1992; Breidenbach and Astrup, 2012; Mandallaz, 2013; Magnussen et al., 2018). Because variance calculations often require accounting for non-independence of observations or heteroskedasticity of residuals, or where algorithmic synthetic models such as non-parametric or nearest-neighbor modeling are employed, iterative methods can be used to estimate approximate variances, even where closed-form solutions exist when model assumptions allow for them (McRoberts et al., 2007). Numerical approaches, such as leave-one-out cross validation or parametric bootstrap estimation have proven useful in variance estimation, although care must be taken to ensure that bootstrap data-generating mechanisms are aligned with error correlation structures and distributional assumptions.

Standard errors of domain-direct estimates are required inputs for FH and other area-based SAE, which can pose a problem when n_d is insufficient to give stable variance estimates (Särndal, 1984; Breidenbach and Astrup, 2012). A solution for such applications is the use of generalized variance estimators (Valliant, 1987; Wolter, 2007; Goerndt et al., 2013; Coulston et al., 2021). Generalized variance functions tend to give variance estimates that are highly dependent on n_d, e.g., Coulston et al. (2021), which may differ from direct variance estimates in domains having sufficient sampling intensity to produce stable standard errors.

Closed form solutions typically do not exist for variance estimation for SAE composite estimators, e.g., when domain-level estimates are obtained as EBLUPs (Fay and Herriot, 1979; Battese et al., 1988; Rao and Molina, 2015). Variance of EBLUPs is assessed by mean squared errors (MSE) rather than standard errors to distinguish EBLUPS from design-unbiased estimators. The MSEs are routinely calculated in SAE software as additive combinations of terms representing uncertainties associated with (a) prediction of random effects, (b) estimation of regression model coefficients, and (c) estimation of the random-effect variance, i.e., the variance among small-area domains (Prasad and Rao, 1990). Alternatives involving parametric bootstrap variance estimation are employed in some applications, as are variances determined from posterior distributions in Bayesian analyses (Prasad and Rao, 1990; Babcock et al., 2015; Molina and Marhuenda, 2015).

4.3. Emerging Applications

Bayesian methods have been applied to SAE across disciplines for several decades (Morris, 1983; Ghosh and Rao, 1994), with recent applications in forest inventory settings as well (e.g., Finley et al., 2011). Babcock et al. (2018) used HB to estimate aboveground biomass with coupled auxiliary data from Landsat and lidar, to resolve incomplete coverage of remote sensing data. Of the HB models they tested, one incorporating spatial random effects with lidar as auxiliary data led to $0.33 \leq \bar{R E} \leq 0.51$ across their 4 areas of interest. By incorporating coregionalization and adding tree cover derived from Landsat to complement incomplete lidar coverage, the range of $\bar{R E}$ decreased to $0.16 \leq \bar{R E} \leq 0.35$ . Ver Planck et al. (2018) also saw success in formulating the F-H approach in a HB framework to increase precision of aboveground biomass SAE over direct estimates ( $\bar{R E} = 0.23$ ). Further improvement was demonstrated by adding conditional autoregressive random effects to account for spatial correlation ( $\bar{R E} = 0.80$ ), with the autoregressive model giving greater gains in precision in domains that shared boundaries with larger numbers of neighbors (Ver Planck et al., 2018). Adding a spatial random effect did not always lead to greater predictive performance, however, as shown by Babcock et al. (2015), who attributed modest to negative gains ( $0.83 \leq \bar{R E} \leq 1.27$ ) to possible overfitting.

Estimating population parameters for multiple attributes of interest is an important concern in forest inventory applications ranging from local-scale stand assessments to regional and national multi-resource inventories (Babcock et al., 2013; Lochhead et al., 2018). Multivariate F-H and spatial F-H approaches have been developed and incorporated into statistical software, but their application to forest inventory has been limited (Molina and Marhuenda, 2015; Benavent and Morales, 2016). While multiple domain-specific estimates can be obtained using SAE in separate modeling procedures for each attribute, doing so fails to maintain logical consistencies among estimates, and overlooks potential gains in efficiency that may otherwise be realized by accounting for cross-attribute correlations (Mauro et al., 2017; Coulston et al., 2021). Generic inference in model-assisted design-based estimation affords consistency in multivariate estimates, but may come at a cost of increased standard errors in attributes uncorrelated or weakly correlated with selected auxiliary variables (McConville et al., 2020). Nearest-neighbor approaches have been used successfully in multivariate forest inventory settings, including multivariate model-assisted estimation to improve estimator efficiencies while preserving consistency among estimates (Chirici et al., 2016; McRoberts et al., 2017). Hierarchical Bayesian multivariate methods for SAE have been demonstrated for both unit-level and area-level settings suitable for forest inventory applications (Datta et al., 1998; Arima et al., 2017). Recent advances have also increased computational efficiencies for Bayesian analyses that can be applied to multivariate SAE involving very large data sets, expanding opportunities for further advances in this area (Finley et al., 2015, 2017; Datta et al., 2016; Babcock et al., 2018).

Another development in SAE applications for forest inventory aims at modeling sample and observational units at a level approaching that of individual trees rather than forest plots or larger subpopulation domains (Næsset, 2002; Mauro et al., 2016). Frank et al. (2020) tested a semi-individual tree (s-ITC) model approach—analogous to a unit-level approach—by segmenting tree crowns in lidar point clouds and delineating s-ITC units around them. Compared to a unit-level approach the s-ITC model showed a $\bar{R E} = 1.04$ for volume, $\bar{R E} = 0.71$ for basal area, $\bar{R E} = 1.38$ for stem density, and $\bar{R E} = 0.48$ for quadratic mean diameter. Despite attaining similar precision as plot-based unit-level EBLUPs for volume estimates at the population level, the s-ITC approach showed potential for its increased spatial resolution and ability to estimate population parameters more closely related to individual trees, such as mean diameter (Frank et al., 2020).

Applications of enhanced estimators such as model-assisted and SAE are not limited to cases where estimates per unit of land area are needed. Affleck and Gregoire (2015) compared estimation of crown biomass using randomized branch sampling to provide sample data for GREG. In their examination of a univariate estimator, model-assisted estimation led to improvement in the precision of estimates across a range of simulated sample sizes in randomized branch sampling (n = 5, 10, 20 and $\bar{R E} = 0.54$ , 0.71, and 0.97, respectively), although the authors cautioned against the possible trade-off between precision and design-unbiasedness (Affleck and Gregoire, 2015). Gains in precision were not limited to one sampling scheme. Increased precision was also seen for n = 5 based on other branch sampling methods: probability proportional to size sampling ( $\bar{R E} = 0.94$ ), simple random sampling ( $\bar{R E} = 0.28$ ), and stratified random sampling ( $\bar{R E} = 0.28$ ). The potential gains in biomass estimation at the tree level demonstrates how SAE may prove a useful tool in other contexts than estimating forest inventory attributes for small geographic areas.

5. Conclusions

Small area estimation (SAE) is a growing area of research in forest inventory owing largely to its ability to support model-based inference in small area domains lacking sufficient sample data to provide stable estimates using purely design-based estimation. A variety of modeling techniques can be employed in the SAE framework with linear mixed modeling among the most widely used to-date. A unifying requirement is the use of auxiliary data that allows estimators to both borrow strength from indirect sample data that co-vary with auxiliary observations and to provide auxiliary population parameters as predictors in synthetic models for composite estimation. The availability of large data sets like those collected in NFIs, along with increasingly available auxiliary data such as DAP, ALS, satellite remote-sensing, or other digital map products has made SAE of particular interest to forest inventory specialists. Increases in precision from SAE can provide efficient alternatives to sample intensification when insufficient sample data are available to meet needed tolerances for estimator error.

The examples summarized here demonstrate potential benefits of SAE, along with some limitations researchers have encountered in applying evolving model-assisted design-based estimators or composite estimators that lie within unit-level or area-level SAE frameworks. As with any model-based approaches considerable attention should be paid to model assumptions including distributional assumptions for residuals, correct model form (e.g., linear vs. non-linear), careful selection of model predictors, and accounting for correlation structures among synthetic model residuals. Challenges may arise from a lack of correspondence between sample and auxiliary data such as when precise pairing of (x, y) observations is impractical, or when auxiliary data are unavailable for specific areas or time periods of interest. Consideration of alternative approaches including design-based and model-assisted estimation, area-level, or unit-level (model-based) SAE is needed to ensure suitability of methods given inferential needs. Many topics for further research have been identified in the literature reviewed here, pointing out opportunities for future improvement in forest inventory applications of SAE. Although no tool can address needs in all circumstances, methods like those reviewed here provide flexible, efficient alternatives to reduce the need for sample intensification in many cases and to meet tolerance specifications in others at reduced cost by increasing estimator efficiency.

Author Contributions

GD, PR, and JC: conception, design, and outline. GD and PR: literature review and document drafts. JC, PG, BW, and GM: reviewed drafts, provided corrections and revisions, discussion of needed changes, additions, and redesign. All authors contributed to the article and approved the submitted version.

Funding

This work was supported by USDA Forest Service Southern Research Station and Virginia Tech Department of Forest Resources and Environmental Conservation, Joint Venture Agreement 20-JV-11330145-074. Sponsorship of research into improving estimation of timber damage in tropical storms in the southeastern United States. Funds from the Virginia Tech Library paid for open-access fees.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Affleck, D. L. R., and Gregoire, T. G. (2015). Generalized and synthetic regression estimators for randomized branch sampling. Forestry 88, 599–611. doi: 10.1093/forestry/cpv027

Review and Synthesis of Estimation Strategies to Meet Small Area Needs in Forest Inventory

1. Introduction

2. Background

2.1. Design-Based Estimators

2.2. Model-Assisted Estimators

2.3. Model-Based Estimators

3. Terminology and Methods

3.1. Direct, Indirect, and Composite Estimators

3.2. Unit Level SAE

3.3. Area Level SAE

3.4. Other SAE Methods

4. Applications in Forest Inventory

4.1. Gains Over Direct Estimation

4.1.1. Model-Assisted Estimation

4.1.2. Unit-Level SAE

4.1.3. Area-Level SAE

4.1.4. Unit Level vs. Area Level SAE

4.2. Variance Estimation

4.3. Emerging Applications

5. Conclusions

Author Contributions

Funding

Conflict of Interest

Publisher's Note

References

95% of researchers rate our articles as excellent or good