Standard Errors for Complex Samples
Consequently, there are three possible specifications: both a stratum variable and a cluster variable, a cluster variable only, or a stratum variable only.
Each of these three specifications results in standard errors being computed differently.
In assessing the completeness of the stratum and cluster information, cases with missing data on the variables used in the analysis are excluded. Users must therefore be careful not to create selection-filter variables that assign a missing-data code to many of the cases.
In order to create these pseudo-strata, each pair of adjacent clusters (in numeric order) is combined into a stratum. For example, clusters 1 and 2 would be paired as belonging to stratum 1, and clusters 3 and 4 would belong to stratum 2. If there is an odd number of clusters, the last cluster will be combined with the preceding two clusters, to form a final stratum with 3 clusters.
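The pairing rule described above can be sketched as follows (a minimal illustration; the function name is hypothetical and not part of SDA):

```python
def make_pseudo_strata(cluster_ids):
    """Pair adjacent clusters (in numeric order) into pseudo-strata.

    With an odd number of clusters, the last cluster is folded into
    the preceding pair, giving a final stratum of three clusters.
    Returns a dict mapping cluster id -> pseudo-stratum number.
    """
    clusters = sorted(set(cluster_ids))
    strata = {c: i // 2 + 1 for i, c in enumerate(clusters)}
    if len(clusters) % 2 == 1 and len(clusters) > 1:
        # Fold the leftover odd cluster into the previous stratum.
        strata[clusters[-1]] = (len(clusters) - 1) // 2
    return strata
```

For example, clusters 1-4 yield two pseudo-strata of two clusters each, while clusters 1-5 yield two pseudo-strata, the second containing clusters 3, 4, and 5.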
If there are expected to be substantial differences between the clusters, it may be preferable to create explicit pseudo-strata yourself (based on criteria that do not involve peeking at the data), rather than to let the program create strata automatically from the numeric order of the clusters.
It is important to understand that this automatic creation of strata is carried out only once -- for the sample as a whole. If a specific cell in a table (which is effectively a subclass of the sample) has no valid cases in a particular cluster, the pseudo-strata are not re-created. The calculation of variances can proceed with only one cluster in a stratum, so long as this happens in a subclass of the sample and not in the sample as a whole. See the discussion below on subclasses.
On the other hand, if a subclass has no valid cases in any cluster in a stratum, that whole stratum is dropped from the calculation, and the missing strata and clusters do not contribute to the calculation of the degrees of freedom for the statistic in that cell.
Once the clusters have been combined into pseudo-strata by the program, the Taylor series method is used to calculate standard errors, just as for the stratified cluster design.
This procedure has the same effect as creating a stratum variable that has the same value for all the cases in the sample (the number '1', for instance) and then defining that variable as the stratum variable in the HARC file. The computation then proceeds as for the stratified cluster design.
This procedure will sacrifice any potential gains that might result from the implicit stratification of the clusters (if they have been ordered by some relevant criterion). But it will also avoid the inflation of variance that could result from the pairing up of very different clusters.
After defining the subgroups, we will then calculate statistics like percentages or means, together with standard errors and confidence intervals, for each subgroup.
The problem is that there is no obvious way for SDA to determine in advance whether a particular subgroup of the sample is a sampling domain or is simply a subclass created for analysis.
SDA (beginning with version 3.4) therefore uses the following rule: If a particular subgroup of the sample has no cases at all in any of the clusters in a particular stratum, that subgroup is assumed to be a sampling domain, selected in a way that excludes the strata with no cases in that domain. In other words, a judgment is made that the lack of any cases at all in a sampling stratum must be the result of the sample design. Then the calculation of confidence intervals for that subgroup is carried out without using the strata with no valid cases.
Concretely, this affects the calculation of degrees of freedom used to create the confidence intervals. The degrees of freedom are calculated as the number of clusters minus the number of strata. In carrying out this calculation for a specific domain, the strata without valid cases, and the associated clusters that fall into those strata in the sample as a whole, are excluded from consideration. This means that the degrees of freedom for a particular domain will be fewer than the degrees of freedom for subclasses spread over the sample as a whole. The fewer the degrees of freedom, the greater must be the t-statistic for a particular confidence level. This reduction in the degrees of freedom means that the confidence interval for a domain will be a little wider (for a given standard error) than it would be for a subclass with more degrees of freedom.
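The degrees-of-freedom bookkeeping described above can be sketched as follows (a minimal illustration; the function name and input layout are hypothetical, not SDA's API):

```python
def domain_degrees_of_freedom(records):
    """Degrees of freedom for a domain: clusters used minus strata used.

    records: iterable of (stratum, cluster, in_domain) triples, one per
    valid case in the full sample. Strata with no domain cases are
    excluded entirely, together with the clusters that fall into those
    strata in the sample as a whole.
    """
    all_clusters = {}     # stratum -> set of clusters in the full sample
    domain_strata = set() # strata containing at least one domain case
    for stratum, cluster, in_domain in records:
        all_clusters.setdefault(stratum, set()).add(cluster)
        if in_domain:
            domain_strata.add(stratum)
    n_clusters = sum(len(all_clusters[h]) for h in domain_strata)
    return n_clusters - len(domain_strata)
```

With two strata of two clusters each, a subclass spread over the whole sample has 4 - 2 = 2 degrees of freedom, while a domain confined to one stratum has only 2 - 1 = 1, and so gets a larger t multiplier and a wider confidence interval for the same standard error.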
Since this exclusion of strata is done separately for each subgroup (or cell of a table), the confidence intervals in different cells of a table may be based on different numbers of strata. The optional table of diagnostic information (available in the MEANS program) reports how many strata and clusters were actually used to generate the degrees of freedom for calculating confidence intervals in each cell or for each comparison.
For stratified cluster designs, the sampling variances are calculated based on the differences in the percentages or in the mean values of the dependent variable between clusters within each stratum. This method of calculation is discussed in Kish, Survey Sampling, pp. 190-193. The actual formula is 6.4.4 on p. 192. The finite population correction (1-f) is ignored.
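Kish's formula treats a weighted mean as a ratio estimator r = y/x, where x is a sum of weights, and builds the variance from between-cluster differences within strata. The following is a minimal sketch of that Taylor-series calculation under those assumptions (the function name is hypothetical; the finite population correction is ignored, as stated above, and SDA's internal implementation may differ in detail):

```python
from collections import defaultdict

def stratified_cluster_variance(data):
    """Taylor-series variance of the weighted mean r = sum(w*y)/sum(w)
    for a stratified cluster design (cf. Kish 6.4.4; fpc ignored).

    data: list of (stratum, cluster, weight, y) tuples, one per case.
    Every stratum must contain at least two clusters.
    """
    # Per-cluster totals: x_ha = sum of weights, y_ha = weighted sum of y
    totals = defaultdict(lambda: [0.0, 0.0])
    for h, a, w, y in data:
        totals[(h, a)][0] += w
        totals[(h, a)][1] += w * y
    X = sum(x for x, _ in totals.values())
    Y = sum(y for _, y in totals.values())
    r = Y / X
    # Linearized cluster totals z_ha = y_ha - r * x_ha, grouped by stratum
    by_stratum = defaultdict(list)
    for (h, _), (x_ha, y_ha) in totals.items():
        by_stratum[h].append(y_ha - r * x_ha)
    var = 0.0
    for z in by_stratum.values():
        n_h = len(z)
        zbar = sum(z) / n_h
        var += n_h / (n_h - 1) * sum((zi - zbar) ** 2 for zi in z)
    return var / X ** 2
```

The standard error of the mean is the square root of this variance.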
Designs with only a cluster variable are basically converted to a stratified cluster design, as described above. Standard errors are then computed as for a stratified cluster sample.
Designs with only a stratum variable are equivalent to a stratified cluster sample with only one case in each cluster. In other words, each case is treated as a cluster of size one. The computation of standard errors for stratified element samples is a little simpler than for cluster samples, since there is no covariance between sampled elements within the strata. The actual formula used is 6.4.2 in Kish, Survey Sampling, p. 192. Once again, the finite population correction (1-f) is ignored.
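Since each case is a cluster of size one, the element-sample calculation is the cluster calculation with the covariance terms gone. A minimal standalone sketch under the same assumptions as above (hypothetical function name; fpc ignored):

```python
from collections import defaultdict

def stratified_element_variance(data):
    """Variance of the weighted mean r = sum(w*y)/sum(w) for a
    stratified element sample (cf. Kish 6.4.2; fpc ignored).

    data: list of (stratum, weight, y) tuples; each case is treated
    as a cluster of size one. Every stratum needs at least two cases.
    """
    X = sum(w for _, w, _ in data)
    Y = sum(w * y for _, w, y in data)
    r = Y / X
    # Linearized element values z_hi = w*y - r*w, grouped by stratum
    by_stratum = defaultdict(list)
    for h, w, y in data:
        by_stratum[h].append(w * y - r * w)
    var = 0.0
    for z in by_stratum.values():
        n_h = len(z)
        zbar = sum(z) / n_h
        var += n_h / (n_h - 1) * sum((zi - zbar) ** 2 for zi in z)
    return var / X ** 2
```

For a single stratum with equal weights this reduces to the familiar s^2/n for the mean.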
The standard error for each difference between two means is the square root of the sum of the two variances of the means minus twice the covariance, and it is calculated as: sqrt(VARIANCE1 + VARIANCE2 - 2*COVARIANCE12). The variance of each mean is the square of the corresponding standard error. Each standard error is calculated according to the sample design, as described in the sections above. The covariance term arises because of the complex design (in cluster samples): when both subgroups draw cases from the same clusters, their estimates are not independent.
The confidence interval for each difference is calculated as a multiple of the standard error that is added to, or subtracted from, the difference. This multiple is based on Student's t-statistic. The value of Student's t used for computing confidence intervals depends on the desired level of confidence (usually 95 percent) and the degrees of freedom (df) for the comparison. The smaller the df, the larger the required value of Student's t and, consequently, the wider the confidence intervals. As the df increase, the required value of Student's t decreases until it approaches the familiar constant for the normal distribution (which is 1.96, for the 95 percent confidence level).
In complex samples, the degrees of freedom are based on the number of clusters and strata used for the comparison. The optional diagnostic table reports those numbers for each difference shown.
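Putting the pieces together, a minimal sketch of the comparison (the function name and inputs are illustrative, not SDA's API; the variance of the difference is VARIANCE1 + VARIANCE2 minus twice the covariance, and the t multiplier would come from Student's t with df = clusters - strata):

```python
import math

def difference_ci(mean1, se1, mean2, se2, cov12, t_value):
    """Confidence interval for mean1 - mean2 in a complex sample.

    se1, se2: design-based standard errors of the two means.
    cov12: covariance of the two means (nonzero in cluster samples).
    t_value: Student's t for the desired confidence level and the
    df of the comparison (approaches 1.96 at 95% as df grows).
    Returns (lower, upper).
    """
    se_diff = math.sqrt(se1 ** 2 + se2 ** 2 - 2.0 * cov12)
    diff = mean1 - mean2
    return diff - t_value * se_diff, diff + t_value * se_diff
```

Note that a positive covariance (the usual case when both subgroups share clusters) shrinks the standard error of the difference.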
(The current version of SDA must generate these replicate weights internally. It is not currently possible to use this method on a dataset that contains replicate weights but not the stratum and cluster variables themselves.)
This method is relatively simple and can be used for many types of analysis. However, it does require more computation time than the Taylor series method used for TABLES and MEANS.
Logistic and probit regression, in particular, can require extra time. The LOGIT program, even without a complex sample design, uses multiple iterations to converge to a solution. For complex samples, each iteration through the data requires separate calculations for each set of replicate weights. For large datasets with many PSUs, therefore, the user should not expect the almost instantaneous results that SDA usually provides.
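SDA's internal construction of replicate weights is not spelled out here; as a hedged illustration of the general idea, the sketch below uses a generic stratified delete-one-PSU jackknife (the function name is hypothetical). Each replicate drops one cluster and reweights the remaining clusters in its stratum, the estimator (a mean, a logit coefficient, etc.) is recomputed on every replicate, and the variance is assembled from the spread of the replicate estimates -- which is why each iteration of a regression must be repeated once per replicate:

```python
from collections import defaultdict

def jackknife_variance(data, estimator):
    """Stratified delete-one-PSU jackknife variance of an estimator.

    data: list of (stratum, cluster, weight, y) tuples, one per case.
    estimator: any function of such a list returning a single number.
    For each cluster, a replicate drops that cluster and scales the
    weights of the other clusters in its stratum by k/(k-1).
    """
    clusters_in = defaultdict(set)
    for h, a, _, _ in data:
        clusters_in[h].add(a)
    full = estimator(data)
    var = 0.0
    for h, members in clusters_in.items():
        k = len(members)
        for drop in members:
            replicate = []
            for h2, a2, w, y in data:
                if h2 == h:
                    if a2 == drop:
                        continue          # this PSU is deleted
                    w = w * k / (k - 1)   # reweight its stratum-mates
                replicate.append((h2, a2, w, y))
            var += (k - 1) / k * (estimator(replicate) - full) ** 2
    return var
```

For a weighted mean on simple test data this reproduces the Taylor-series result exactly; for nonlinear estimators such as logit coefficients the two methods agree only approximately, which is the trade-off behind the extra computation time noted above.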