# Technical issues of sampling design analysis of DAWN data

Primary sampling units (PSUs) are hospitals within strata, and secondary sampling units (SSUs) are records of emergency department (ED) visits within PSUs. Hospitals chosen with a probability equal to one in the first stage of sampling are "certainty hospitals"; in a stratum of certainty hospitals, all of the hospitals in the stratum are selected. The finite population correction factor (1-f_{h}) is therefore zero for strata with certainty hospitals, and those strata make no variance contribution at the first stage of sampling. (Sampling in the other strata was without replacement (WOR) from finite populations.) Here f_{h}=n_{h}/N_{h}, where n_{h} is the number of sampled hospitals in the *h*-th stratum and N_{h} is the corresponding population (frame) count given in the variable PSUFRAME. The ED visit records of certainty hospitals were themselves randomly sampled, i.e., the visits are not a complete enumeration. To account for the within-hospital variation in ED visits, the DAWN PUF provides an additional design variable, REPLICATE, for the second stage of sampling, which is required for correct statistical inference. In sum, each stratum has at least two hospitals (PSUs), each hospital has exactly two replicates (SSUs), and each replicate should have numerous ED visit records.
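The finite population correction described above can be illustrated with a minimal Python sketch. The function name and the hospital counts are invented for illustration and are not DAWN values; in practice N_{h} would come from PSUFRAME and n_{h} from counting the distinct hospitals sampled in each stratum.

```python
def finite_population_correction(sampled: int, frame: int) -> float:
    """Return 1 - f_h, where f_h = n_h / N_h for one stratum."""
    if frame <= 0 or not 0 < sampled <= frame:
        raise ValueError("need 0 < sampled <= frame")
    return 1.0 - sampled / frame


# Hypothetical strata: (n_h sampled hospitals, N_h hospitals on the frame).
strata = {
    "certainty": (4, 4),      # every hospital selected, so f_h = 1
    "noncertainty": (5, 40),  # WOR sample of 5 hospitals out of 40
}

fpc_by_stratum = {
    name: finite_population_correction(n_h, N_h)
    for name, (n_h, N_h) in strata.items()
}
```

In this sketch the certainty stratum has a correction factor of zero, so its first-stage variance term vanishes and only the second stage (REPLICATE) can contribute variance for it.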

Some issues with variance estimation by the Taylor method and with the calculation of degrees of freedom should be noted. The SAS, SPSS, Stata, and SUDAAN software packages calculate the variance contribution for each stage of the design using the deviations between each unit's value (i.e., total) and the mean of all units' values within the stage. (A unit is the PSU at the first sampling stage and the SSU at the second.) There are no single-unit (i.e., singleton) strata in the DAWN PUF data, but certain analyses may encounter a singleton stratum while calculating the variance for a domain (subclass, subgroup, or subpopulation). A singleton stratum occurs when a single unit (PSU or SSU) has at least one observation and the other units in that stratum have none. The software packages handle units with no observations differently when calculating the variance and degrees of freedom. The MISSUNIT option in SUDAAN and the singleunit(center) option in Stata calculate the variance contribution of a singleton stratum using the deviation of that unit's total from the grand mean of the sample. By default, SPSS assumes that the sample contained at least one other unit in that stage (if not, the stratum contributes null variance), so units with no observations (sampling zeros) have unit totals of 0 and do contribute to the stratum variance.

An analysis can also encounter strata with no observations at all (empty strata), typically in domain analysis. The question is how each software package handles empty strata when it computes the overall variance and the degrees of freedom. SUDAAN assumes that such strata are not actually empty in the population and that the missing cases are a result of sample selection; it treats the missing units as sampling zeros. Thus each empty stratum contributes zero variance to the overall variance and, in turn, still counts toward the degrees of freedom. Stata does the same by default, but certain Stata procedures have options, for instance singleunit(center), that depart from the assumption of sampling zeros and treat such empty strata as structural zeros. The logic is that a stratum with no cases at all should be regarded as not part of the sample for that domain, so it contributes nothing to the overall variance. In Stata, the degrees of freedom determined with the singleunit(center) option are smaller than those obtained by the default method for such domains. Note that the variance estimates from SUDAAN and Stata are always the same whether the assumption of sampling zeros is retained or dropped, but the two approaches produce different degrees of freedom.
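To make the role of sampling zeros concrete, here is a minimal Python sketch of the contribution of one stratum at one stage to the variance of an estimated total. It uses the standard textbook form (1-f_{h}) · n_{h}/(n_{h}-1) · Σ(z_{hi} - z̄_{h})², which is an assumption about the general shape of the computation and not code taken from any of the packages. The function name and the unit totals are hypothetical.

```python
def stratum_variance(unit_totals, sampling_fraction=0.0):
    """Variance contribution of one stratum at one sampling stage.

    unit_totals: weighted totals, one per sampled unit (PSU or SSU).
    sampling_fraction: f_h = n_h / N_h; 0.0 ignores the correction.
    """
    n = len(unit_totals)
    if n < 2:
        # A lone unit has no within-stratum deviation to measure.
        return 0.0
    mean = sum(unit_totals) / n
    squared_deviations = sum((z - mean) ** 2 for z in unit_totals)
    return (1.0 - sampling_fraction) * n / (n - 1) * squared_deviations


# Hypothetical stratum with two sampled PSUs; only the first has domain cases.
with_sampling_zero = stratum_variance([120.0, 0.0])  # empty PSU kept as a 0 total
without_empty_unit = stratum_variance([120.0])       # empty PSU dropped: singleton
certainty_stratum = stratum_variance([50.0, 70.0], sampling_fraction=1.0)
```

Keeping the empty PSU as a zero total gives the stratum a positive contribution, whereas dropping it leaves a singleton with nothing to deviate from. The centered options mentioned above address that case by measuring the deviation from the grand mean instead, which this sketch does not implement.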

The calculation of degrees of freedom (*df*) is crucial in all these software packages because it influences inferential statistics such as confidence intervals and p-values of test statistics. Conventionally, the *df* is calculated by the fixed-PSU method: the 'fixed' *df* is the number of PSUs minus the number of strata at the first stage of the sample design, regardless of the number of sampling stages (i.e., it is computed from the full data file). The SPSS software package always uses the fixed-PSU method. It is also the default in SUDAAN, but users can supply a predetermined *df* with the *DDF=* option. The fixed-PSU method is the default in Stata as well, but Stata has options that calculate an alternate *df* by the variable-PSU method. For example, Stata with the singleunit(center) option uses the variable-PSU method, in which the 'variable' *df* is the number of non-empty PSUs minus the number of non-empty strata. The number of non-empty PSUs is the number of PSUs in the sample minus the number of PSUs with no observations in all singleton strata. To compare inferential statistics across software packages, users can calculate the 'variable' *df* for a domain analysis manually and specify it in SUDAAN with the DDF=*df* option in the PROC statement, and/or specify the 'design' *df* in Stata with svy, dof(*df*):.
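For users who want to work out the *df* by hand before passing it to DDF= or dof(), the counting rules can be sketched in Python as follows. The function name, the input layout, and the example counts are hypothetical. The 'variable' rule follows the definition above for singleton strata; the text does not say how PSUs in wholly empty strata are counted, so the sketch assumes they are excluded as well. The 'sda' rule is the number of PSUs in non-empty strata minus the number of non-empty strata.

```python
def degrees_of_freedom(domain_counts):
    """domain_counts maps stratum -> list of domain observation counts,
    one entry per sampled PSU in that stratum."""
    total_psus = sum(len(psus) for psus in domain_counts.values())
    fixed_df = total_psus - len(domain_counts)

    # Strata in which at least one PSU has a domain observation.
    non_empty = {
        h: psus for h, psus in domain_counts.items() if any(c > 0 for c in psus)
    }
    sda_df = sum(len(psus) for psus in non_empty.values()) - len(non_empty)

    # Drop PSUs without observations in singleton strata and, by assumption,
    # all PSUs in wholly empty strata.
    dropped = 0
    for psus in domain_counts.values():
        with_obs = sum(1 for c in psus if c > 0)
        if with_obs <= 1:
            dropped += len(psus) - with_obs
    variable_df = (total_psus - dropped) - len(non_empty)

    return {"fixed": fixed_df, "sda": sda_df, "variable": variable_df}


# Hypothetical domain: observation counts per PSU within four strata.
example = {
    "A": [12, 7, 3],  # all PSUs have domain cases
    "B": [5, 0],      # singleton stratum
    "C": [0, 0],      # empty stratum
    "D": [4, 9, 0],   # one empty PSU, but not a singleton stratum
}
df = degrees_of_freedom(example)
```

With these made-up counts the fixed rule gives the largest value and the variable rule the smallest, which matches the ordering described above.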

SAMHDA's online data analysis system (called SDA) calculates a slightly different but appropriate *df*: the number of PSUs in non-empty strata minus the number of non-empty strata. SAS, SPSS, and SDA handle singleton strata in nearly the same way. Note that SAS and SDA can take into account only the first-stage sampling design effects. DAWN data in SDA use a modified (pseudo) single-stage stratified cluster sample that was prepared for compatibility with SDA's complex survey data analysis capability.

For related technical information, please see the FAQ: Accounting for the effects of complex sampling design (design effects) when analyzing DAWN data.