Data Collection, Weighting, and Modeling Techniques to Estimate Unbiased Population Parameters

Term Start:

July 1, 2025

Term End:

May 31, 2026

Budget:

$75,000

Keywords:

Endogenous Selection Modeling, Survey Sampling Bias, Weighting

Thrust Area(s):

Data Collection Mechanisms, Data Modeling and Analytic Tools

University Lead:

The University of Texas at Austin

Researcher(s):

Chandra Bhat

Empirical studies across many fields rely on data from large surveys. In doing so, they must address sampling-related issues such as nonresponse, missing data, unequal sampling probabilities, and other survey biases. Because most surveys are voluntary, the data in many empirical applications are not randomly selected from the population. Instead, researchers observe only the responses of those who choose to respond, which can result in sample selection bias. A variety of modeling approaches have been proposed to accommodate such bias. In particular, sampling weights have long been considered essential for descriptive statistical analysis (such as estimating population averages) on data with unequal sampling probabilities. For causal effects modeling, however, if individuals have nearly equal selection probabilities conditional on their values of the exogenous variables, then both weighted and unweighted estimators are consistent, and the unweighted estimator is preferred for its lower variance. But when the probability of selection differs significantly among individuals because the selection mechanism is endogenous (that is, the probability of selection cannot be fully explained by exogenous variables), sampling weights (the inverse of the probability of sample selection) can yield consistent estimates of population parameters, whereas unweighted estimators are generally inconsistent.
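The distinction between descriptive and causal estimation under exogenous selection can be illustrated with a small simulation. The sketch below is not the study's method; the outcome equation, the selection rule, and all function names are hypothetical choices made only for illustration. Selection depends solely on an observed variable x, so the unweighted regression slope should stay close to the true value, while the unweighted sample mean drifts away from the population mean and the inverse-probability-weighted mean pulls it back.

```python
import random


def simulate_exogenous_selection(n=100_000, seed=0):
    """Draw a population and keep units with a probability that depends only on x."""
    rng = random.Random(seed)
    pop_total = 0.0
    sample = []  # (x, y, weight) for each selected unit
    for _ in range(n):
        x = rng.random()                          # observed exogenous variable
        y = 1.0 + 2.0 * x + rng.gauss(0.0, 1.0)   # assumed outcome equation, slope = 2
        pop_total += y
        p_select = 0.1 + 0.8 * x                  # assumed selection rule, exogenous
        if rng.random() < p_select:
            sample.append((x, y, 1.0 / p_select))  # weight = inverse selection probability
    return pop_total / n, sample


def unweighted_mean(sample):
    return sum(y for _, y, _ in sample) / len(sample)


def weighted_mean(sample):
    """Inverse-probability-weighted mean of y."""
    return sum(w * y for _, y, w in sample) / sum(w for _, _, w in sample)


def ols_slope(sample):
    """Unweighted least-squares slope of y on x."""
    m = len(sample)
    mean_x = sum(x for x, _, _ in sample) / m
    mean_y = sum(y for _, y, _ in sample) / m
    sxy = sum((x - mean_x) * (y - mean_y) for x, y, _ in sample)
    sxx = sum((x - mean_x) ** 2 for x, _, _ in sample)
    return sxy / sxx


if __name__ == "__main__":
    pop_mean, sample = simulate_exogenous_selection()
    print("population mean of y:  ", round(pop_mean, 3))
    print("unweighted sample mean:", round(unweighted_mean(sample), 3))
    print("weighted sample mean:  ", round(weighted_mean(sample), 3))
    print("unweighted OLS slope:  ", round(ols_slope(sample), 3))
```

Because units with larger x are both more likely to be selected and have larger y on average, the unweighted mean overstates the population mean in this setup, whereas the slope is estimated from the conditional relationship between y and x and is not distorted by the selection rule.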

A critical issue, however, is that the true probability of selection is generally unknown when there is nonresponse. In such cases, weights cannot be based on the true probability of selection. Instead, they are estimated after data collection by comparing the sample with population statistics, so that the proportion of respondents in each demographic group matches that group's population proportion in an external, independent control. The basic idea is that such weighting essentially restores the case of an equal probability sample, which takes care of any selection bias. However, unobserved factors may also play a significant role in response decisions (and thus in sampling probabilities), and these factors may be correlated with the main outcome of interest. Post-collection weights cannot address such situations, because they rely on the assumption that selection is based solely on observed characteristics. Relatedly, descriptive statistics are often treated separately from model-based approaches, because computing them with standard formulas requires weights even when sampling is based only on exogenous variables; yet these same statistics can also be calculated using model-based approaches. Either approach yields an unbiased result when sampling is based only on exogenous variables, but only the model-based approach can accommodate selection on unobserved variables; the traditional weight-based approach cannot.
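The limitation of post-collection weights can be sketched with a second small simulation. Again, this is an illustrative assumption-laden example rather than the study's procedure: the two-group population, the logistic response rule, and the function names are all hypothetical. Response depends on an observed group indicator g and on an unobserved factor u that also drives the outcome. Post-stratification weights reproduce the known group shares exactly, yet the weighted mean remains biased because, within each group, respondents still have systematically higher u.

```python
import math
import random


def simulate_nonresponse(n=100_000, seed=1):
    """Population with an observed group g and an unobserved factor u driving response."""
    rng = random.Random(seed)
    pop_total = 0.0
    respondents = []  # (g, y) for units that respond
    for _ in range(n):
        g = 1 if rng.random() < 0.5 else 0   # observed demographic group, 50/50 split
        u = rng.gauss(0.0, 1.0)              # unobserved factor
        y = 1.0 + g + u                      # assumed outcome; u affects y directly
        pop_total += y
        # Assumed response rule: depends on g AND on the unobserved u (endogenous).
        p_respond = 1.0 / (1.0 + math.exp(-(-1.0 + g + u)))
        if rng.random() < p_respond:
            respondents.append((g, y))
    return pop_total / n, respondents


def poststratified_mean(respondents, pop_share):
    """Weight each respondent by (population share) / (sample share) of their group."""
    m = len(respondents)
    counts = {g: 0 for g in pop_share}
    for g, _ in respondents:
        counts[g] += 1
    weights = {g: pop_share[g] / (counts[g] / m) for g in pop_share}
    numerator = sum(weights[g] * y for g, y in respondents)
    denominator = sum(weights[g] for g, _ in respondents)
    return numerator / denominator


if __name__ == "__main__":
    pop_mean, respondents = simulate_nonresponse()
    ps_mean = poststratified_mean(respondents, {0: 0.5, 1: 0.5})
    print("population mean of y:   ", round(pop_mean, 3))
    print("post-stratified estimate:", round(ps_mean, 3))
```

In this setup the weights correct the imbalance in g but have no information about u, so the remaining gap between the two printed values reflects selection on the unobserved factor.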

Therefore, in this study, we consider how appropriate sampling strategies and modeling techniques can improve estimation results when collecting a representative sample is unnecessary or impractical. Using theoretical arguments and simulation evidence, we underscore the importance of adopting appropriate sampling and estimation methods when sampling is based (a) only on observed exogenous variables, or (b) also on unobserved variables. For exogenous sampling, we demonstrate that, to estimate individual-level causal relationships, survey designs should prioritize range variation in the exogenous variables, and not necessarily population representativeness. Further, we demonstrate that weighting approaches cannot accommodate endogenous selection, that is, sampling based on unobserved variables. Instead, we propose a joint modeling approach that can recover the true population parameters when the joint distribution of exogenous variables in the population is known. We also demonstrate that this method can improve upon existing methods that do not account for endogenous selection, even when only the population marginal distributions of the exogenous variables are known. This analysis should be of interest to all empirical researchers working in survey research and associated data modeling.
