Accounting for the effects of complex sampling design (design effects) when analyzing DAWN data
The DAWN (Drug Abuse Warning Network) employs a two-stage (stratified cluster) sample design for the selection of hospital emergency department (ED) visits caused or contributed to by drugs. The DAWN public use file (PUF) includes the following complex design variables: variance estimation stratum (STRATA), primary sampling unit (PSU), secondary sampling unit (REPLICATE), PSU frame size (PSUFRAME), and analysis case weight (CASEWGT). The DAWN PUF has no missing values in the design variables STRATA, PSU, PSUFRAME, REPLICATE, and CASEWGT; however, analysis variables can have missing values.
The default method for estimating standard errors is Taylor series linearization for SAS, SPSS, SUDAAN, and Stata software, but SAS can only account the variance contribution from the first-stage. SAS is not currently able to fully and properly account for the DAWN sampling design.
Example code/syntax specific to each statistical software package given below uses the DAWN 2010 PUF. The examples estimate the proportions (or means) of alcohol-related ED visits-along with the standard errors (SE) and confidence intervals by racial group. In short, the statistical analysis plan (SAP) is to obtain the mean and its SE along with confidence intervals and other related statistics. Note that the variable ALCOHOL is coded as 0 and 1; thus, the mean of ALCOHOL is namely the proportion of ALCOHOL=1.
Users should read the help document (of her/his respective statistical package) regarding how missing values are being handled if any exist in the analysis variables.
The following SUDAAN stand-alone code uses all of the design variables provided in the PUF for appropriate calculation of the variance of means/proportions.
- nest STRATA PSU REPLICATE / MISSUNIT;
totcnt PSUFRAME _minus1_ _zero_;
- weight CASEWGT;
The SPSS specific code for the same analysis is:
- Get file='path\dawn2010puf.sav'.
/PLANVARS ANALYSISWEIGHT= casewgt
/DESIGN STRATA= strata CLUSTER= psu
/ESTIMATOR TYPE = EQUAL_WOR
/ESTIMATOR TYPE = WR.
/PLAN FILE= 'folder-path\dawn_stage2.csplan'
/SUMMARY VARIABLES= alcohol
/SUBPOP TABLE= race
/STATISTICS SE CIN(95)
/MISSING SCOPE=ANALYSIS CLASSMISSING=EXCLUDE.
Stata specific code for the same analysis is (recall that Stata commands are case sensitive):
- use drivename:\path\statadatafilename.dta
svyset psu[pweight=CASEWGT], strata(STRATA) fpc(PSUFRAME) || REPLICATE, strata(PSU)
save drivename:\path\mystatadata.dta, replace
- svy: mean ALCOHOL, over(RACE)
estat effects, deff
The SAS code for this analysis, ignoring the second stage variance, is:
(with replacement at the first stage of sampling; ignoring the finite population correction)
- proc surveymeans data=libname.dataSetName mean clm;
- domain race;
Note that the design variables (except the sampling weight variable) do not affect first-order moment statistics (such as mean, proportion, percent, totals and weighted counts). However, the design variables must be used to produce the SE estimates of descriptive and inferential statistics.
The target population of hospitals was divided into disjoint strata (denoted as STRATA) such that every hospital in the population belongs to exactly one stratum. Hospitals are the PSUs. In the first stage of sampling, within each stratum, hospitals were selected by the simple random sampling (SRS) without replacement for finite populations. Stratum size (denoted by PSUFRAME) is the population count of hospitals (i.e., PSUs) contained in the stratum. Depending upon the year and the size of the hospital, either all REPLICATEs were selected, which resulted in no contribution to the second stage variance for that hospital, or second-stage sampling at rates as small as 1/3 was performed using systematic random sampling. This resulted in sample files that averaged about 300,000 visits per year. Tabulation and analysis of these large sample files when many variables were involved proved to be computationally very expensive due to the amount of time necessary to calculate exact estimates of the second stage variance. In multi-stage sampling, the greatest portion of the variability occurs in the first stage. Thus, the second-stage contribution is usually a small fraction of the first-stage variance. Realizing that unacceptable amounts of time were required to calculate the estimates of a small fraction of total variance, an approximation method was adopted that made variance estimation for large tables with many variables practical. The sampled visits within each hospital were sorted into two groups, and the between group variance was calculated and substituted for the standard second-stage variance. DAWN research simulations showed that the above approximation is sufficient for DAWN precision and efficient for DAWN standard error calculation. Note that STRATA is the first stage and PSU is the second stage stratification variable in the sample. Proper use of the REPLICATE variable should account for within-hospital variation, i.e., the second stage variance. The final analysis weight variable, CASEWGT, was adjusted for unequal probability of selection, nonresponse of hospitals, and coverage bias of ED visits to the benchmark population totals of ED visits from American Hospital Association (AHA) database.
CASEWGT is a poststratification adjusted analysis weight variable. By using CASEWGT, the weighted estimates of counts or totals obtained for any of the survey variables are the estimates for the entire universe of DAWN-eligible hospitals in the United States. Poststratification adjustments were implemented within the design strata to offset the coverage bias on the estimates. That means post-strata were not constructed from any of the ED visit characteristic variables. Therefore, the software packages correctly take into account the design effects on the variance estimates of estimated counts or totals for variables related to DAWN visits. For further details on sample design and weighting adjustment, please see the DAWN Methodology Report.
For additional technical information on options and how the various statistical software handle issues like missing values and degrees of freedom, please see the FAQ: Technical issues of sampling design analysis of DAWN data.