# Procedures for Sampling Error Estimation in Design-based Analysis of the CPES Data

The CPES data set is the product of the merger of three probability samples of the
U.S. population and therefore shares the primary stage sample stratification and
clustering features of the component sample designs. The NCS-R, NSAL and NLAAS
sample designs were very similar in their basic structure to the multi-stage designs
used for major survey programs such as the U.S. Health Interview Survey (HIS), the
National Survey of Family Growth (NSFG) or other national scientific surveys.
The survey literature refers to these samples as complex designs, a loosely-used
term meant to denote the fact that the sample incorporates special design features
such as stratification, clustering and differential selection probabilities (i.e.,
weighting) that analysts must consider in computing sampling errors for sample
estimates of descriptive statistics and model parameters. Standard programs in
statistical analysis software packages assume simple random sampling (SRS) or
independence of observations in computing standard errors for sample estimates. In
general, the SRS assumption results in underestimation of variances of survey
estimates of descriptive statistics and model parameters. Confidence intervals based
on computed variances that assume independence of observations will be biased
(generally too narrow) and design-based inferences will be affected accordingly.
Likewise, test statistics (t, χ², F) computed in complex survey data
analysis using standard programs will tend to be biased upward and overstate the
significance of tests of effects.
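The size of this understatement is conventionally summarized by the design effect, the ratio of the design-based variance of an estimate to the variance that an SRS of the same size would give. The sketch below uses notation assumed here for illustration ($\mathit{deff}$, $\operatorname{se}_{\text{SRS}}$); the value 2.0 is a hypothetical example, not a CPES result.

```latex
\[
  \mathit{deff}(\hat{\theta})
    \;=\; \frac{\operatorname{var}_{\text{design}}(\hat{\theta})}
               {\operatorname{var}_{\text{SRS}}(\hat{\theta})},
  \qquad
  \operatorname{se}_{\text{design}}(\hat{\theta})
    \;=\; \sqrt{\mathit{deff}(\hat{\theta})}\;\operatorname{se}_{\text{SRS}}(\hat{\theta})
\]
% Illustrative arithmetic with an assumed deff = 2.0 (not a CPES result):
\[
  \sqrt{2.0} \approx 1.41
  \;\Longrightarrow\;
  \text{the SRS-based confidence interval is about } 1/1.41 \approx 71\%
  \text{ of the correct width.}
\]
```

In the same hypothetical case, a t statistic computed under the SRS assumption would be about 1.41 times too large.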

This section focuses on sampling error estimation and construction of confidence intervals for survey estimates of descriptive statistics such as means, proportions, ratios, and coefficients for linear and logistic regression models.

#### VII.A Sampling Error Computation Methods and Programs

Over the past 50 years, advances in survey sampling theory have guided the development of a number of methods for correctly estimating variances from complex sample data sets. Sampling error programs that implement these complex sample variance estimation methods are available to CPES data analysts. The two most common approaches (Rust, 1985) to the estimation of sampling error for complex sample data are through the use of a Taylor Series linearization of the estimator (and corresponding approximation to its variance) or through the use of resampling variance estimation procedures such as Balanced Repeated Replication (BRR) or Jackknife Repeated Replication (JRR).

#### VII.B Taylor Series linearization method:

When survey data are collected using a complex sample design with unequal size clusters, most statistics of interest will not be simple linear functions of the observed data. The linearization approach applies Taylor's method to derive an approximate form of the estimator that is linear in statistics for which variances and covariances can be directly and easily estimated. Stata Release 8 and 9, SAS V8.2/V9.0, SUDAAN Version 9, and the most recent releases of SPSS are commercially available statistical software packages that include procedures that apply the Taylor Series method to sampling error estimation and inference for complex sample data.
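To make the idea concrete, the following is a minimal sketch of the linearization calculation for a weighted mean, which is a ratio of two weighted totals. It assumes with-replacement selection of clusters within strata and at least two clusters per stratum. The function name and the toy data are hypothetical, and the packages listed here may differ in details such as finite population corrections.

```python
from collections import defaultdict
from math import sqrt


def linearized_mean_se(y, w, stratum, cluster):
    """Weighted mean and its Taylor-linearized standard error.

    Assumes clusters are selected with replacement within strata and that
    every stratum contains at least two clusters.
    """
    total_w = sum(w)
    mean = sum(wi * yi for wi, yi in zip(w, y)) / total_w

    # Linearized variable for the ratio mean: z_i = w_i * (y_i - mean) / sum(w).
    # Accumulate its total within each (stratum, cluster) combination.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for yi, wi, h, a in zip(y, w, stratum, cluster):
        psu_totals[h][a] += wi * (yi - mean) / total_w

    # Between-cluster variance of the linearized totals, summed over strata.
    variance = 0.0
    for h, clusters in psu_totals.items():
        n_h = len(clusters)
        if n_h < 2:
            raise ValueError(f"stratum {h} has fewer than two clusters")
        z = list(clusters.values())
        z_bar = sum(z) / n_h
        variance += n_h / (n_h - 1) * sum((v - z_bar) ** 2 for v in z)

    return mean, sqrt(variance)


# Toy data (hypothetical values, not CPES records).
y = [1, 3, 5, 2, 4, 6]
w = [1, 1, 2, 1, 1, 1]
sestrat = [1, 1, 1, 2, 2, 2]
seclustr = [1, 1, 2, 1, 2, 2]

mean, se = linearized_mean_se(y, w, sestrat, seclustr)
print(f"mean = {mean:.4f}, linearized SE = {se:.4f}")
```

The mean is nonlinear in the data because both its numerator and denominator are sample totals; the linearized variable replaces it with a sum whose between-cluster variance can be computed directly.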

Stata (StataCorp, 2005) is a more recent commercial entry to the available software for analysis of complex sample survey data and has a growing body of research users. Stata includes special versions of its standard analysis routines that are designed for the analysis of complex sample survey data. Special survey analysis programs are available for descriptive estimation of means (SVY MEAN), ratios (SVY RATIO), proportions (SVY TAB), and population totals (SVY TOTAL). Stata programs for multivariate analysis of survey data include linear regression (SVY REGRESS), logistic regression (SVY LOGIT) and probit regression (SVY PROBIT). Stata program offerings for survey data analysts are constantly being expanded. Information on the Stata analysis software system can be found on the Web at: http://www.stata.com.

Programs in SAS Version 9 (SAS, 2003; http://www.sas.com/) also use the Taylor Series method to estimate variances of means (PROC Surveymeans), proportions and cross-tabular analysis (PROC SurveyFreq), linear regression (PROC SurveyReg), and logistic regression (PROC SurveyLogistic).

SUDAAN (RTI, 2004) is a commercially available software system developed and marketed by the Research Triangle Institute of Research Triangle Park, North Carolina (USA). SUDAAN was developed as a stand-alone software system with capabilities for the more important methods for descriptive and multivariate analysis of survey data, including: estimation and inference for means, proportions, and rates (PROC DESCRIPT and PROC RATIO); contingency table analysis (PROC CROSSTAB); linear regression (PROC REGRESS); logistic regression (PROC LOGISTIC); log-linear models (PROC CATAN); and survival analysis (PROC SURVIVAL). SUDAAN V9.0 and earlier versions were designed to read directly from ASCII and SAS system data sets. The latest versions of SUDAAN permit procedures to be called directly from the SAS system. Information on SUDAAN is available at the following website address: http://www.rti.org/.

SPSS Version 14.0 (http://www.spss.com/) users can obtain the SPSS Complex Samples module, which supports Taylor Series linearization estimation of sampling errors for descriptive statistics (CSDESCRIPTIVES), cross-tabulated data (CSTABULATE), general linear models (CSGLM), and logistic regression (CSLOGISTIC).

#### VII.C Resampling methods:

BRR, JRR, and the bootstrap comprise a second class of nonparametric methods for conducting estimation and inference from complex sample data. As suggested by the generic label for this class of methods, BRR, JRR, and the bootstrap utilize replicated subsampling of the sample database to develop sampling variance estimates for linear and nonlinear statistics.

WesVar PC (Westat, Inc., 2000) is a software system for personal computers that employs replicated variance estimation methods to conduct the more common types of statistical analysis of complex sample survey data. WesVar PC was developed by Westat, Inc. and is distributed, along with documentation, to researchers through Westat's website: http://www.westat.com/westat/statistical_software/wesvar/index.cfm . WesVar PC includes a Windows-based application generator that enables the analyst to select the form of data input (SAS data file, SPSS for Windows database, ASCII data set) and the computation method (BRR or JRR methods). Analysis programs contained in WesVar PC provide the capability for basic descriptive (means, proportions, totals, cross tabulations) and regression (linear, logistic) analysis of complex sample survey data. WesVar also provides the best facility for estimating quantiles of continuous variables (e.g., the 95th percentile of a cognitive test score) from survey data. WesVar Complex Samples 4.0 is the latest version of WesVar. Researchers who wish to analyze the CPES data using WesVar PC should choose the BRR or JRR (JK2) replication option.

STATA V9 has introduced the option to use JRR or BRR calculation methods as an alternative to the Taylor Series method for all of its svy command options. SUDAAN V9.0 also allows analysts to select the JRR method for computing sampling variances of survey estimates.
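For a design with two sampling error clusters per stratum, the replicated variance estimators are commonly written in the forms below. This is a sketch of the standard textbook expressions; the symbols $H$ (number of strata), $R$ (number of half-sample replicates), $\hat{\theta}_{(h)}$ and $\hat{\theta}_{r}$ are notation assumed here, and individual programs may implement variants (for example, Fay's modification of BRR).

```latex
% JRR with paired clusters (JK2): replicate h deletes one cluster in stratum h
% and doubles the weights of the remaining cluster in that stratum.
\[
  v_{\text{JRR}}(\hat{\theta})
    \;=\; \sum_{h=1}^{H} \bigl(\hat{\theta}_{(h)} - \hat{\theta}\bigr)^{2}
\]
% BRR: replicate r retains one cluster per stratum (weights doubled) according
% to a balanced (Hadamard) pattern; R is the number of half-sample replicates.
\[
  v_{\text{BRR}}(\hat{\theta})
    \;=\; \frac{1}{R} \sum_{r=1}^{R} \bigl(\hat{\theta}_{r} - \hat{\theta}\bigr)^{2}
\]
```

In both cases $\hat{\theta}$ is the full-sample estimate and each replicate estimate is obtained by recomputing the same statistic with the replicate's adjusted weights.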

IVEWare is another software option for the JRR estimation of sampling errors for survey statistics. IVEWare has been developed by the Survey Methodology Program of the Survey Research Center and is available free of charge to users at: http://www.isr.umich.edu/src/smp/ive/ . IVEWare is based on SAS Macros and requires SAS Version 6.12 or higher. The system includes programs for multiple imputation of item missing data as well as programs for variance estimation in descriptive (means, proportions) and multivariate (regression, logistic regression, survival analysis) analysis of complex sample survey data.

These new and updated software packages include an expanded set of user-friendly, well-documented analysis procedures. Difficulties with sample design specification, data preparation, and data input in the earlier generations of survey analysis software created a barrier to use by analysts who were not survey design specialists. The new software enables the user to input data and output results in a variety of common formats, and the latest versions accommodate direct input of data files from the major analysis software systems.

#### VII.D Sampling Error Computation Models

Regardless of whether the linearization method or a resampling approach is used,
estimation of variances for complex sample survey estimates requires the
specification of a *sampling error computation model*. CPES data analysts who
are interested in performing sampling error computations should be aware that the
estimation programs identified in the preceding section assume a specific sampling
error computation model and will require special sampling error codes. Individual
records in the analysis data set must be assigned sampling error codes that identify
to the programs the complex structure of the sample (stratification, clustering) and
are compatible with the computation algorithms of the various programs. To
facilitate the computation of sampling error for statistics based on CPES data,
design-specific sampling error codes will be routinely included in all versions of
the data set. Although minor recoding may be required to conform to the input
requirements of the individual programs, the sampling error codes that are provided
should enable analysts to conduct either Taylor Series or replicated estimation of
sampling errors for survey statistics. In programs that use the Taylor Series
Linearization method, the sampling error codes (SESTRAT and SECLUSTR) will typically
be input as keyword statements (SAS V9.1, SUDAAN V9.0) or as global settings (Stata
V9) along with the analysis weight and will be used directly in the computational
algorithms. Programs that permit BRR or JRR computations will use the user-supplied
sampling error codes to construct the "replicate weights" that are required for
these approaches to variance estimation.
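The construction of replicate weights from the sampling error codes can be sketched as follows for the paired-cluster jackknife (JK2). This is an illustrative sketch, not the algorithm of any particular package: the function names and toy data are hypothetical, and the choice to drop the cluster coded 2 and double the cluster coded 1 in each replicate is an assumption made for the example.

```python
def jk2_replicate_weights(w, stratum, cluster):
    """Return one replicate weight vector per stratum (JK2 scheme).

    In the replicate for stratum h, cases in cluster 1 of that stratum get
    twice their weight and cases in cluster 2 get weight zero (an assumed
    convention); weights in all other strata are left unchanged.
    """
    replicates = []
    for h in sorted(set(stratum)):
        rep = []
        for wi, hi, ci in zip(w, stratum, cluster):
            if hi != h:
                rep.append(float(wi))
            elif ci == 1:
                rep.append(2.0 * wi)
            else:
                rep.append(0.0)
        replicates.append(rep)
    return replicates


def weighted_mean(y, w):
    """Weighted mean of y using weights w."""
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(w)


def jk2_variance(y, w, stratum, cluster, estimator=weighted_mean):
    """Full-sample estimate and its JK2 variance (sum of squared deviations)."""
    full = estimator(y, w)
    reps = jk2_replicate_weights(w, stratum, cluster)
    variance = sum((estimator(y, rw) - full) ** 2 for rw in reps)
    return full, variance


# Toy data (hypothetical values, not CPES records).
y = [1, 3, 5, 2, 4, 6]
w = [1, 1, 2, 1, 1, 1]
sestrat = [1, 1, 1, 2, 2, 2]
seclustr = [1, 1, 2, 1, 2, 2]

estimate, variance = jk2_variance(y, w, sestrat, seclustr)
print(f"mean = {estimate:.4f}, JK2 SE = {variance ** 0.5:.4f}")
```

Because the estimator is passed in as a function, the same replicate weights can be reused for any statistic that is computed from the analysis weight.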

Two sampling error code variables are defined for each case based on the sample design stratum and primary stage unit (PSU) cluster in which the sample respondent resided: Sampling Error Stratum Code (SESTRAT) and Sampling Error Cluster Code (SECLUSTR). The CPES SESTRAT codes were derived directly from a concatenation of the existing sampling error stratum codes for the NCS-R, NSAL and NLAAS sample designs. A total of 180 sampling error strata were defined. These were allocated to the individual contributing samples according to the coding scheme shown in Table 6.

Table 6. CPES Sampling Error Strata

| CPES Component Sample | CPES Sampling Error Strata |
|---|---|
| NCS-R | 1-42 |
| NSAL | 43-111 |
| NLAAS | 112-180 |

All original sampling error strata definitions for the NCS-R and NLAAS were preserved unchanged in the mapping to the CPES sampling error stratum code. In general, the assignment of NSAL cases to CPES sampling error strata also followed the original NSAL coding. The single exception involved an NSAL sampling error stratum that included multiple clusters. This stratum was divided into several pseudo-strata each with a pair of combined clusters. This minor change enables CPES analysts to use any of the sampling error calculation methods (Taylor, BRR or JRR) without having to perform additional recoding of the sampling error variables.

Likewise, with one exception, the values of SECLUSTR for CPES sampling error strata are identical to those in the original NCS-R, NSAL and NLAAS data sets. The exception was the cluster numbering for the one NSAL sampling error stratum with multiple clusters. Clusters in this stratum were randomly grouped into pairs and assigned to pseudo-strata as described in the preceding paragraph. The result is that the CPES SECLUSTR code takes a value of either 1 or 2 and exactly two sampling error clusters are assigned to each sampling error stratum.
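Before running a design-based analysis, an analyst may want to confirm that an extract of the data still follows this coding scheme. The sketch below is one hedged way to do so: it maps a SESTRAT value to its component sample using the ranges in Table 6 and lists any strata that do not contain exactly the cluster codes 1 and 2. The function and constant names are hypothetical and are not part of the CPES data set.

```python
from collections import defaultdict

# Stratum ranges taken from Table 6 (upper bounds are exclusive in range()).
COMPONENT_STRATA = {
    "NCS-R": range(1, 43),     # strata 1-42
    "NSAL": range(43, 112),    # strata 43-111
    "NLAAS": range(112, 181),  # strata 112-180
}


def component_for_stratum(sestrat):
    """Return the component sample that a CPES sampling error stratum belongs to."""
    for name, strata in COMPONENT_STRATA.items():
        if sestrat in strata:
            return name
    raise ValueError(f"SESTRAT {sestrat} is outside the range 1-180")


def unpaired_strata(sestrat, seclustr):
    """Return the sorted strata whose set of cluster codes is not exactly {1, 2}."""
    clusters = defaultdict(set)
    for h, c in zip(sestrat, seclustr):
        clusters[h].add(c)
    return sorted(h for h, codes in clusters.items() if codes != {1, 2})


# Toy check (hypothetical records): stratum 2 is missing its second cluster.
print(component_for_stratum(43))
print(unpaired_strata([1, 1, 2, 2], [1, 2, 1, 1]))
```

A stratum can appear unpaired after subsetting the file, for example when all cases in one of its clusters fall outside the selected records. In that situation it is generally safer to keep the full file and request a subpopulation analysis in the software than to delete cases.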

#### VII.E Syntax for CPES Design-based Variance Estimation Using STATA and SAS

The following two sections provide a short overview of the general syntax and command file structure for computing sampling errors using STATA and SAS programs that have been designed for the analysis of complex sample survey data. Analysts are referred to the user guides and the on-line help facilities of these two software systems for documentation of the individual programs.

*VII.E.1 Stata command syntax*

As described above, CPES data analysts who are familiar with the STATA software system can utilize STATA's "svy" commands for the analysis of complex sample survey data. STATA Version 9 syntax for some of the more commonly used analysis programs is illustrated below (shown for the Part 2 weight option):

`.svyset seclustr [pweight=cpeswtlg], strata(sestrat)`

This statement defines the sample design variables for the duration of the analysis session. SVY commands issued after this statement will automatically incorporate these design specifications.

To conduct analyses, the following STATA commands and syntax are used (please refer to the STATA V9 Reference Manual for specific command syntax and output options):

`.svy, vce(linearized): mean vars`

[estimates, standard errors, design effects for means]

`.svy, vce(linearized): tab v1 v2`

[estimates, standard errors for proportions of single variable categories, or crosstabulations of two variables with tests of independence]

`.svy, vce(linearized): regress dep x1 ...`

[simple linear regression model for a continuous dependent variable]

`.svy, vce(linearized): logit dep x1...`

[simple logistic regression model for a binary dependent variable]

To estimate descriptive statistics separately for subpopulations of the survey population in STATA, the following optional syntax is used (illustrated for the mean command):

`.svy, vce(linearized): mean vars, over(var)`

where var is a categorical variable that defines the subpopulations for which separate estimates are desired (e.g. gender). The `tab` and regression commands do not accept `over()`; for these, a subpopulation is specified through the `subpop()` option of the `svy` prefix.

*VII.E.2 SAS Version 9 Command Syntax*

SAS Version 9 includes four programs for the analysis of complex sample survey data: PROCS Surveymeans, SurveyFreq, SurveyReg and SurveyLogistic. The general syntax for specifying the CPES design structure in the SAS system is as follows:

`PROC SurveyXXXX data=libname.filename;`

`STRATUM SESTRAT;`

`CLUSTER SECLUSTR;`

`WEIGHT CPESWTLG;`

[program-specific statements here]

`RUN;`

Users are referred to the SAS/STAT(R) 9.1 User's Guide (SAS, 2004) for documentation on program-specific statements, keywords and options.