
Online Help for Analysis Programs-SDA 3.5

This file contains the online help that is also available from inside each SDA analysis program (by selecting the corresponding word highlighted on the form or screen for selecting options). In addition to the help specific to each program, this file includes information on features common to all analysis programs.

CONTENTS

Help for Specific Analysis Programs

Features Common to All Analysis Programs

SDA Frequencies and Crosstabulation Program

This program generates the univariate distribution of one variable or the crosstabulation of two variables. If a control variable is specified, a separate table will be produced for each category of the control variable. An explanation of each option can be obtained by selecting the corresponding word highlighted on the form.

Steps to take

Specify variables
To specify that a certain survey question or variable is to be included in a table, use the name for that variable as given in the documentation for this study. Aside from simply specifying the name of a variable, certain variable options are available to change or restrict the scope of a variable.

Select display options
After specifying the names of variables, select the display options you wish. These affect percentaging, text to display, and statistics to show.

Select an action
After specifying all variables and options, select the action to take.

REQUIRED variable name

Row variable(s)
Variable down the side of the table

OPTIONAL variable names

Column variable(s)
Variable along the top of the table

Control variable(s)
A separate table is produced for each category of a control variable. If charts are being generated, a separate chart is also produced for each category of the control variable.

If more than one row, column and/or control variable is specified, a separate table (and chart) will be generated for each combination of variables.

Selection filter variable(s)
Some cases are included in the analysis; others are excluded.

Weight variable
Cases are given different relative weights.

Table Display Options for Crosstabulation

Percentaging
Defines which way to make the percents add up to 100 percent:

  • Column: down each column
  • Row: across each row
  • Total: as a percent of the total number of cases in the table

You can request more than one type of percentaging in a table, but such tables are hard to read.
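
The three percentaging options differ only in the denominator used for each cell. The following is a minimal sketch in Python, not SDA's own code; the function name and the 2x2 table of counts are hypothetical and chosen only for illustration.

```python
def percentages(counts, kind="column"):
    """Return percentages for a table of counts given as a list of rows.

    kind: "column" (each column adds up to 100), "row" (each row adds
    up to 100), or "total" (all cells together add up to 100).
    """
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    grand_total = sum(row_totals)

    result = []
    for i, row in enumerate(counts):
        out_row = []
        for j, n in enumerate(row):
            if kind == "column":
                denom = col_totals[j]
            elif kind == "row":
                denom = row_totals[i]
            elif kind == "total":
                denom = grand_total
            else:
                raise ValueError(f"unknown percentaging type: {kind}")
            # An empty denominator has no defined percentage.
            out_row.append(100.0 * n / denom if denom else float("nan"))
        result.append(out_row)
    return result


# Hypothetical table of counts (weighted counts, if a weight is in use).
table = [[30, 10],
         [20, 40]]
col_pct = percentages(table, "column")  # first column: 60.0 and 40.0
row_pct = percentages(table, "row")     # first row: 75.0 and 25.0
tot_pct = percentages(table, "total")   # first cell: 30.0
```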

It is important to understand that if a weight variable has been specified, the percentages and the statistics are always computed using the weighted number of cases. If you want to calculate percentages and statistics using only the unweighted N's, do not specify a weight variable.

Confidence intervals
If this option is selected, an additional row of numbers is generated that contains the upper and lower bound of the confidence interval of the percentage (column, row, and/or total) in each cell. The confidence interval is the range of values within which the population value of the statistic is likely to fall. By default, the level of confidence is 95 percent, but the user can also select 99 percent or 90 percent.

The confidence interval is computed by converting the standard error of each percentage to a natural logarithm and then multiplying the log of the standard error by the value of Student's t appropriate to the level of confidence requested and to the number of degrees of freedom. The result is added to the log of the percentage to obtain the upper bound of the confidence interval, and it is subtracted from the log of the percentage to obtain the lower bound. The logs of the upper bound and of the lower bound are then converted back to percentages (by taking the antilogs) and displayed in the table cell.

This conversion back and forth to logarithms results in confidence intervals that are asymmetric -- they are a little wider in the direction of 50% than in the direction of 0% or 100%. This is the same procedure used by Stata to calculate confidence intervals of percentages. Notice that the calculation of confidence intervals for a proportion (or for any mean) by the Comparison of Means program does not use this log transformation. Therefore, the confidence intervals calculated by the Comparison of Means program will be a little different from the confidence intervals calculated by the Crosstabulation program for the same proportions. This is also the case for Stata.

Standard error of each percent
Standard errors for each type of percentage (column, row, or total) can be computed and displayed for each cell of the table. Standard errors are used to create confidence intervals for the percentages in each cell.

Simple random samples
If the sample is equivalent to a simple random sample of a population, the standard error of each percentage is computed using the familiar "pq/n" formula for the normal approximation to the standard error of a proportion. For each proportion p, the formula is:
sqrt(p * (1-p) / (n-1))
where n is the number of cases in the denominator of the percentage -- the total number of cases in that particular column, row, or total table, depending on the percentage being calculated. For this calculation, n is the actual number of cases, even if weights have been used to calculate the percentages.
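
As a worked illustration of the formula above, here is a short Python sketch. The function name and the example values (a proportion of 0.25 based on 401 cases) are assumptions made for the example, not values taken from SDA.

```python
import math


def srs_standard_error(p, n):
    """Standard error of a proportion p under simple random sampling.

    n is the actual (unweighted) number of cases in the denominator
    of the percentage, as described in the text.
    """
    if n < 2:
        raise ValueError("at least 2 cases are needed")
    return math.sqrt(p * (1.0 - p) / (n - 1))


# Hypothetical cell: 25 percent, based on 401 actual cases.
se = srs_standard_error(0.25, 401)  # sqrt(0.1875 / 400)
se_pct = 100.0 * se                 # same quantity on the percent scale
```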

Complex samples
If the sample for a particular study is more complex than a simple random sample, the appropriate standard errors can still be computed provided that the stratum and/or cluster variables were specified when the dataset was set up in the SDA Web archive. Otherwise, the standard errors calculated by assuming simple random sampling are probably too small.

For complex samples the appropriate standard errors are computed using the Taylor series method. If you want additional technical information, see the document on standard error calculation methods.

Note that the calculations for standard errors in cluster samples require that the coefficient of variation of the sample size of the denominator for each percentage, CV(x), be under 0.20; otherwise, the computed standard errors (and the confidence intervals) are probably too small, and they are flagged in the table with an asterisk. CV(x) and other diagnostic information are available for standard error calculations done by the SDA Comparison of Means program. That program and the SDA Crosstabulation program use the same information and methods to calculate standard errors.

Design effect (deft) for each percent
The design effect for each percentage based on a complex sample is the ratio of the standard error of that percent in a table cell to the standard error of the same percent in a simple random sample of the same size. For the calculation of standard errors, see the discussion of standard errors above. (The design effect for a percent based on a simple random sample is 1.)

The design effect for each percent in a cell is used to calculate the effective number of cases (N / deft-squared) on which the percent is based, for purposes of precision-based suppression.

The design effects for all of the total percents in a table are used to calculate the Rao-Scott adjustment to the chi-square statistic, if bivariate statistics have been requested for a complex sample.
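
The relationship between deft and the effective number of cases can be sketched in Python as follows. This is an illustration only: the complex-sample standard error is taken as a given input (it would come from the Taylor series calculation), and the function names and numbers are hypothetical.

```python
import math


def design_effect(se_complex, p, n):
    """deft: complex-sample SE divided by the SRS SE for the same p and n."""
    se_srs = math.sqrt(p * (1.0 - p) / (n - 1))
    return se_complex / se_srs


def effective_n(n, deft):
    """Effective number of cases: N / deft-squared."""
    return n / deft ** 2


# Hypothetical cell: p = 0.5, 401 cases, complex-sample SE of 0.03.
# The SRS standard error is sqrt(0.25 / 400) = 0.025, so deft is about 1.2.
deft = design_effect(se_complex=0.03, p=0.5, n=401)
n_eff = effective_n(401, deft)  # roughly 278 effective cases
```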

DF -- Degrees of freedom
The number of degrees of freedom (df) is used to compute the width of each confidence interval. For a simple random sample the df equal the number of cases in the denominator for each percentage for that cell, minus one.

For complex samples, the df equal the number of primary sampling units (clusters, for cluster samples; individual cases in the denominator, for unclustered samples) minus the number of strata (unstratified samples have a single stratum). Note that the number of strata and clusters used for this calculation is usually the number in the overall sample, and not in the subclass represented by a cell in a table. For a fuller discussion of this issue, see the treatment of domains and subclasses in the document on standard error methods.

The value of Student's t used for computing confidence intervals depends on the desired level of confidence (95 percent, by default) and the df. The fewer the df, the larger the required value of Student's t and, consequently, the larger the width of the confidence intervals. As the df increase, the size of the required Student's t value decreases until it approaches the familiar value for the normal distribution (which is 1.96, for the 95 percent confidence level).
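
The df rule described above reduces to a single subtraction. A minimal Python sketch follows; the function name and the sample sizes are hypothetical.

```python
def degrees_of_freedom(n_psu, n_strata=1):
    """df = number of primary sampling units minus number of strata.

    For an unclustered sample, n_psu is the number of cases in the
    denominator; an unstratified sample counts as a single stratum.
    """
    if n_strata < 1:
        raise ValueError("there is always at least one stratum")
    return n_psu - n_strata


df_srs = degrees_of_freedom(500)          # simple random sample of 500 cases
df_complex = degrees_of_freedom(120, 40)  # 120 clusters in 40 strata
```

With these inputs df_srs is 499 and df_complex is 80.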

Sample design
For complex samples, standard errors and confidence intervals are calculated that take the complex design into account. If bivariate statistics are requested, the Rao-Scott adjustments to the chi-square statistics are used to create F statistics. In this case, probability values are only calculated for the Rao-Scott-based F statistics, and not for the unadjusted chi-square statistics.

Nevertheless, you can specify that the standard errors, confidence intervals, and chi-square probability values should be calculated as if the sample were a simple random sample (SRS). One reason to request SRS calculations might be to compare the size of the SRS standard errors or confidence intervals with the corresponding statistics based on the complex sample design.

N of cases to display
By default, the number of cases used to calculate percentages is displayed in each cell. The box to display the weighted N is initially checked on the option form. If no weight variable was specified for the analysis, the unweighted N of cases is displayed in each cell, even if the box for weighted N was checked.

However, you can uncheck both boxes, and no N will be displayed. Or you can check both boxes, and both the unweighted and the weighted N of cases will be displayed (if a weight variable has been specified).

It is important to understand that if a weight variable has been specified, the percentages and the statistics are always computed using the weighted number of cases, regardless of which N is displayed in the table. If you want to calculate percentages and statistics using only the unweighted N's, do not specify a weight variable.

Summary statistics (Bivariate or Univariate)
Various numbers or statistics can be used to summarize the distributions of the variables. If you specify both a row and a column variable, a package of bivariate statistics is generated. If you specify a row variable only, a package of univariate statistics is generated. Consult any statistics textbook for more information on the meaning of these statistics.

Bivariate statistics

The bivariate statistics summarize the strength or the statistical significance of the observed relationship between the row and the column variables. Several of the most common statistics are displayed if you select this option.

  • Nominal-level statistics

    A nominal-level statistic does not take into account any ordering of the categories of the row and column variables. That is, you would get the same result even if the categories were put into another order.

    SDA displays two versions of the chi-square statistic, which is the most commonly used nominal-level statistic. For simple random samples (SRS) a probability level (p-value) is also calculated for each chi-square statistic.

    For complex samples a Rao-Scott adjustment to each chi-square is calculated. An F statistic is derived from the adjusted Rao-Scott statistics and is added to the statistics package. The p-values corresponding to those F statistics are displayed (instead of the p-values for the regular chi-square statistics, which do not take the sample design into account).

    If the p-value is low (about .05 or less), the chances that the observed relationship is only due to sampling error are correspondingly low, and in that case the relationship is said to be statistically significant. On the other hand, if the p-value is high, the chances are correspondingly high that the row and the column variables are not related to one another in the whole population from which the sample was drawn but are only related in the sample that happens to have been selected and that we are observing (analyzing).

    • The chi-square statistics
      Two versions of the chi-square statistic are displayed -- Pearson's Chi-square, displayed after 'Chisq-P(df)=', where df is the number of degrees of freedom; and the Likelihood-ratio Chi-square, displayed after 'Chisq-LR(df)='. For SRS, the p-value (probability statistic) corresponding to each chi-square statistic for the given df (degrees of freedom) is also displayed.

      Note that if the frequencies in the table are weighted, the chi-square statistic can be artificially inflated (or deflated). Consequently, if weights are used, the chi-square is adjusted by the factor: (Total unweighted N) / (Total weighted N).

    • Rao-Scott adjustment to chi-square (for complex samples)
      The probability associated with a regular Pearson or Likelihood ratio chi-square statistic assumes that the sample was a simple random sample. For complex samples, the probability associated with a given chi-square statistic is usually too small. This means that a particular relationship between two variables may appear to be statistically significant when it could really have arisen by chance.

      The Rao-Scott adjustment to the chi-square statistic takes the complex sample design into account. The probability associated with the Rao-Scott statistic is a more accurate indicator of the statistical significance of the relationship between the row and the column variables than the probability corresponding to a regular chi-square statistic.

      SDA displays the F statistic derived from each Rao-Scott statistic and the associated p-value of the F. This is done both for the Pearson chi-square, displayed after 'Rao-Scott-P:F(dfn, dfd)'; and for the Likelihood-ratio chi-square, displayed after 'Rao-Scott-LR:F(dfn, dfd)'. These are F-tests, where dfn is the number of numerator degrees of freedom and dfd is the number of denominator degrees of freedom.

      In generating these test statistics, SDA uses the first-order Rao-Scott approximation. The first step is to generate design effects for the estimated proportion of cases in each cell of the table and then to calculate a generalized design effect based on the cell design effects. The two chi-square statistics are divided by the generalized design effect, to obtain design-adjusted chi-square statistics. Then each design-adjusted chi-square statistic is divided by its numerator degrees of freedom to obtain F-statistics, which are then tested. The Rao-Scott adjustments to chi-square are explained in the following journal article: J.N.K. Rao and A.J. Scott, "On Chi-squared Tests for Multiway Contingency Tables with Cell Proportions Estimated from Survey Data," The Annals of Statistics, Vol. 12 (1984), No. 1, pp.46-60.

      Note that this use of the first-order Rao-Scott approximation is the same as in SAS. Stata uses a second-order approximation, which is a little different but should give the same substantive results.

  • Ordinal-level statistics

    Ordinal statistics take into account the order of the row and the column categories. However, there is no assumption made that the distance between successive rows or columns is of the same magnitude. Only the order is considered.

    Four ordinal statistics are given: Gamma, Tau (2 versions) and Somers' d (assuming the row variable to be the dependent variable).

    The ordinal statistics can be calculated either for numeric variables or for character variables (with the categories sorted into alphabetic order).

    These ordinal statistics are purely descriptive. No attempt is made to test them for sampling error.

  • Interval-level statistics

    Interval-level statistics take into account the ordering of the row and column categories (like ordinal statistics). And they also make the assumption that the distance between each successive category code is of equal importance.

    If interval-level statistics are reported for numeric variables that are ordered, note that they must be ordered in a way that approximates interval-level variables. This refers to variables coded like 1=Agree strongly; 2=Agree somewhat; 3=Disagree somewhat; 4=Disagree strongly. To report interval-level statistics for such variables, you must assume that the "distance" between 1 and 2 is of equal importance as the distance between 2 and 3, and between 3 and 4.

    Two interval-level statistics are given: R (the Pearson correlation coefficient), and Eta (the correlation ratio assuming the row variable to be the dependent variable).

    If the row variable is a character variable, Eta cannot be calculated. If either the row variable or the column variable is a character variable, the correlation coefficient cannot be calculated.

    These interval statistics are purely descriptive. No attempt is made to test them for sampling error. Use the regression program for tests of significance and confidence intervals for correlation statistics. The regression program can also handle complex sample designs.
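
The two chi-square statistics described under the nominal-level statistics, the adjustment for weights, and the final step of the first-order Rao-Scott calculation can be sketched in Python as follows. This is a hedged illustration, not SDA's implementation: the function names are hypothetical, all row and column totals are assumed to be nonzero, and the generalized design effect is taken as a given input because its calculation is not spelled out here.

```python
import math


def chi_square_stats(counts, unweighted_n=None):
    """Pearson and likelihood-ratio chi-square for a table of counts.

    counts: list of rows of (possibly weighted) cell frequencies.
    unweighted_n: actual number of cases; if given, both statistics are
    multiplied by (total unweighted N) / (total weighted N).
    """
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    total = sum(row_totals)

    pearson = 0.0
    lr = 0.0
    for i, row in enumerate(counts):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            pearson += (observed - expected) ** 2 / expected
            if observed > 0:  # empty cells add nothing to the LR statistic
                lr += 2.0 * observed * math.log(observed / expected)

    if unweighted_n is not None:
        factor = unweighted_n / total
        pearson *= factor
        lr *= factor

    df = (len(counts) - 1) * (len(col_totals) - 1)
    return pearson, lr, df


def rao_scott_f(chisq, generalized_deff, dfn):
    """Divide by the generalized design effect, then by the numerator df."""
    return chisq / generalized_deff / dfn


# Hypothetical unweighted 2x2 table.
pearson, lr, df = chi_square_stats([[30, 10], [20, 40]])

# Same table with every case weighted by 100: the adjustment for weights
# brings the statistic back to the unweighted scale.
pearson_w, lr_w, _ = chi_square_stats([[3000, 1000], [2000, 4000]],
                                      unweighted_n=100)

# Assumed generalized design effect of 2.0, for illustration only.
f_stat = rao_scott_f(pearson, generalized_deff=2.0, dfn=df)
```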

Univariate statistics

The univariate statistics package includes the mean, median, mode, standard deviation, variance, and the coefficient of variation (standard deviation divided by the mean) of the specified variable, plus a few other descriptive statistics. All of these statistics are calculated using the weight variable, if one is specified.

Note that the univariate statistics cannot be calculated for character variables. If a character variable is used as a row variable, the request for univariate statistics is ignored. Even for numeric variables, be aware that the univariate statistics will not be meaningful unless the code values of the row variable are ordered in a way that approximates interval-level data.

These univariate statistics are purely descriptive. No attempt is made to test them for sampling error. To get standard errors and confidence intervals for the mean of a variable, you can use the Comparison of Means program.

Other Display Options

Question text
The text of the question that produced each variable is generally available.

Suppress display of the table
Occasionally you may want to see the summary statistics for a table and/or the chart, without wishing to view the table itself, especially if the table is a very large one. If you select this option, the table is generated internally but is not displayed.

Color coding of the table cells
The table cells are color coded, in order to aid in detecting patterns. Cells with more cases than expected (based on the marginal percentages) become redder, the more they exceed the expected value. Cells with fewer cases than expected become bluer, the smaller they are, compared to the expected value.

The transition from a lighter shade of red or blue to a darker shade depends on the magnitude of the Z-statistic. The lightest shade corresponds to Z-statistics between 0 and 1. The medium shade corresponds to Z-statistics between 1 and 2. The darkest shade corresponds to Z-statistics greater than 2.

The color coding can be turned off, if you prefer. Color coding may not be helpful if you are using a black-and-white monitor or if you intend to print out the table on a black-and-white printer.

Show the Z-statistics
The Z-statistic controls the color coding of cells in the table. If you select this option, the statistic will be displayed in each cell.

The Z-statistic shows whether the frequencies in a cell are greater or fewer than expected (in the same sense as used for the chi-square statistic). It also takes into account the total number of cases in the table. If there are only a few cases in the table, the deviations from the expected values are not as significant as if there are many cases in the table.

The Z-statistics are standardized residuals. The residual for each cell is calculated as the ratio of two quantities:

  • The numerator is the difference between the observed and the expected number of cases in each cell. (The number "expected" is the same number used to calculate the chi-square statistic.)
  • The denominator is the following quantity:
    sqrt(expected_n * (1-row_proportion) * (1-column_proportion))
For a discussion of the standardized residuals, see Alan Agresti, An Introduction to Categorical Data Analysis, New York: John Wiley, 1996, p. 31.

Note that if the frequencies in the table are weighted, the Z-statistic can be artificially inflated (or deflated). Consequently, if weights are used, each Z-statistic is divided by the average size of the weights. The average size of the weights is just the ratio of the total number of weighted cases in the table, divided by the actual number of unweighted cases in the table. For example, if the table is based on 1,000 actual cases, but the weighted number of cases is 100,000, the average size of the weights is 100,000/1,000 = 100. (The chi-square statistics are adjusted in the same way, to compensate for weights whose average is different from 1.) Note also that the Z-statistic does not take into account the complex sample design, if the table is based on such a sample.
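
The standardized residual and the color thresholds described above can be sketched in Python as follows. This is an illustration under stated assumptions: the function names are hypothetical, the counts are treated as unweighted (the division by the average weight is not applied), and the handling of values exactly equal to 0, 1, or 2 is a guess, since the text does not say which shade those boundary values receive.

```python
import math


def z_statistics(counts):
    """Standardized residual for each cell of a table of counts."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    total = sum(row_totals)

    z = []
    for i, row in enumerate(counts):
        z_row = []
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            row_prop = row_totals[i] / total
            col_prop = col_totals[j] / total
            denom = math.sqrt(expected * (1 - row_prop) * (1 - col_prop))
            z_row.append((observed - expected) / denom)
        z.append(z_row)
    return z


def shade(z):
    """Map a Z-statistic to (color, shade) using the thresholds in the text."""
    color = "red" if z > 0 else "blue" if z < 0 else "none"
    size = abs(z)
    if size > 2:
        level = "dark"
    elif size > 1:
        level = "medium"
    else:
        level = "light"
    return color, level


# Hypothetical unweighted 2x2 table.
z = z_statistics([[30, 10], [20, 40]])
first_cell = shade(z[0][0])  # more cases than expected, so a shade of red
```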

Include missing-data values
With this option, the row, column, and control variables in the table will include ALL categories, including those defined as missing-data or out-of-range categories. The system-missing code will also appear in the table. Its category label will be the default "(No Data)" unless another label has been assigned to the system-missing code. Any range restrictions or temporary recode commands will be ignored, and every category will be shown.

If bivariate statistics are requested, nominal and ordinal statistics will be produced as usual, with the missing data codes sorted into order with the valid codes.

Interval-level statistics will also be computed if the included missing-data codes allow it. The Eta statistic will be calculated if the included missing data codes on the ROW variable are all numeric. The Pearson correlation coefficient can be calculated only if the included missing data codes are all numeric on BOTH the row and column variables.

If univariate statistics are requested, the row variable can only have numeric missing-data codes. Otherwise, no statistics can be generated, and the request is ignored.

Number of decimals to display
Each statistic displayed in the cells of the table has a default number of decimal places. If you want more or fewer decimal places, you can generally specify from 0 to 6 decimal places for most of the statistics displayed in each cell (with the exception of the unweighted number of cases). Note that the decimal place specifications for standard errors are RELATIVE to the number of decimal places in the percentages.

Chart Options for Crosstabulation

Type of chart to display
Select the type of chart you would like. A stacked bar chart is relatively compact and is suitable for most tables. Regular side-by-side bar charts, pie charts, and line charts are also available.

If you select column percentaging, the chart will include a separate set of bars (or a separate pie) describing the row variable, for each category of the column variable. For a line chart, there will be a separate line for each category of the row variable, plotted against the values of the column variable. The column variable is treated as the "break variable" in this layout.

If you select row percentaging, the chart will include a separate set of bars (or a separate pie) describing the column variable, for each category of the row variable. For a line chart, there will be a separate line for each category of the column variable, plotted against the values of the row variable. The row variable is treated as the "break variable" in this layout.

If you select total percentaging, a combination of row and column percentaging, or no percentaging at all, the effect is the same as selecting column percentaging only.

If there is only a row variable specified for the table, the chart will include one set of bars (or one pie, or one line) to show the distribution of that row variable.

Bar chart options
The appearance of bar charts (both stacked and side-by-side bar charts) can be modified in two ways:

  • Orientation (vertical or horizontal): The default orientation is vertical, but a horizontal orientation will sometimes offer a clearer picture, especially if there are many categories in the break variable. A horizontal orientation will also accommodate longer category labels for the break variable.

  • Visual effects (2-D or 3-D): The bars in the bar charts can be shaded, to give a 3-dimensional effect. The desirability of shading depends mostly on personal preference.

Show Percents
Each bar, pie slice, or point on a line will have its percent included on the chart, if you select this option.

Note that these percents may not always appear or may not be legible in all situations.

On stacked bar charts the percents may not have sufficient room to appear inside the area allocated to small categories.

On pie charts and line charts the percents for some slices or for some points on the lines may be almost overlaid and become illegible, if there are many categories or if the lines are very close together.

If you still want to show the percents in those situations, it will usually help if you increase the size of the charts. For stacked bar charts it can also help to change from a vertical to a horizontal orientation.

Palette
The charts are usually output in color. If you wish to print or copy the charts on a black-and-white printer or copier, you can select the grayscale palette for your charts. The charts will then be output in various shades of gray (instead of in various colors).

Size of chart
The width and height of the chart (expressed in the number of pixels) can be modified. If there is a large (or a very small) number of categories in either the row or the column variable, it may be helpful to increase (or decrease) one or both of the dimensions of the chart.

Pie charts in particular may require an increase in the dimensions of the chart if the number of category slices is large. Otherwise, the labels for each slice of the pies might overlay one another.

Stacked bar charts with only two or three break categories may look better if the chart is made narrower. But if there is a large number of break categories (like years of age), the best solution is often to combine a horizontal chart orientation with an increase in the height of the chart.

Side-by-side bar charts are best limited to tables with a relatively small number of categories in both the row and the column variables. If there are many categories in either or both of the variables, the proliferation of bars can be confusing, even if the chart dimensions are increased. In such cases it is probably better to use stacked bar charts instead of side-by-side bar charts.

Line charts may need to be enlarged if the lines are close to being overlaid. If percents are being shown, they also can become overlaid. In such cases it may help to increase the height of the chart.

SDA Comparison of Means Program

REQUIRED variable names

Dependent variable(s)
A numeric variable whose mean or average value is to be computed for each combination of the row and (optionally) column and control variables and displayed in a table.

Row variable(s)
Variable down the side of the table

OPTIONAL variable names

Column variable(s)
Variable along the top of the table

Control variable(s)
A separate table is produced for each category of a control variable.

If more than one dependent variable, row variable, column variable, and/or control variable is specified, a separate table will be generated for each combination of variables.

Selection filter variable(s)
Some cases are included in the analysis; others are excluded.

Weight variable
Cases are given different relative weights.

Display Options for Comparison of Means

Main statistic to display
Each cell of the table will usually contain the MEAN of the dependent variable for that particular combination of the row and (optionally) column and control variables.

Sometimes, however, it is more helpful to express each cell mean in another way:

  • DIFFERENCES from the overall mean.
    Select this option to have those differences calculated and put into each cell of the table.

  • DIFFERENCES from a ROW category.
    A specific ROW category is used as the base category. The other cells in the table are expressed as the difference between that cell and the base category cell in the same column.

    The rightmost column of the table usually shows the Row Totals. In some setups, however, the average of the differences is shown. This is the weighted average of the differences in that row. The weight is the number of cases (weighted number of cases if a weight was used) in that cell plus the number of cases in the corresponding base category.

  • DIFFERENCES from a COLUMN category.
    A specific COLUMN category is used as the base category. The other cells in the table are expressed as the difference between that cell and the base category cell in the same row.

    The bottom row of the table usually shows the Column Totals. In some setups, however, the average of the differences is shown. This is the weighted average of the differences in that column. The weight is the number of cases (weighted number of cases if a weight was used) in that cell plus the number of cases in the corresponding base category.

  • TOTALS for each cell.
    The total is the numerator of the ratio used to calculate the mean. (The denominator of the ratio is the number of cases in that cell.)

    The totals are usually of interest only when a weight is being used to expand the cell counts up to their estimated values in the population. For example, one may be interested in the total estimated NUMBER of persons in each cell who have some characteristic (e.g., who smoke, or drive cars), instead of the PROPORTION of persons who have that characteristic. This assumes that the dependent variable is coded '1' for a case which has the characteristic (smokes, for example) and '0' for a case which does not have the characteristic.
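
The option of expressing each cell as a difference from a ROW category can be sketched in Python as follows. The data layout (a dictionary of row categories, each holding a dictionary of column categories), the function name, and the cell means are all hypothetical.

```python
def differences_from_row(means, base_row):
    """Subtract the base-row cell in the same column from every cell.

    means: dict mapping row category -> dict of column category -> mean.
    """
    base = means[base_row]
    return {
        row: {col: value - base[col] for col, value in cells.items()}
        for row, cells in means.items()
    }


# Hypothetical cell means, with row category 1 as the base category.
means = {
    1: {"male": 2.5, "female": 3.0},
    2: {"male": 3.5, "female": 3.25},
}
diffs = differences_from_row(means, base_row=1)
# diffs[2]["male"] is 1.0; every cell in the base row is 0.0
```

Differences from a COLUMN category work the same way with the roles of rows and columns exchanged.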

Base row or column category
When the main statistic to display is a DIFFERENCE from a row or column category, it is necessary to specify which row or column category is the base category.

Enter the code value for the row or column category that you want to consider the base category.

Transformation of the dependent variable (for 0/1 dependent variables)
The mean of a dependent variable coded 0 or 1 is a proportion. The problem with analyzing a proportion is that the standard deviation and variance depend on the magnitude of the proportion.

The proportion in each cell of the table can be transformed into another statistic that has a more stable distribution. These options are provided for didactic purposes, so that students and researchers can readily compare the logit and probit transformations with the original proportions in a table. The following options are available:

  • Logit
    The proportion p is reexpressed as log(p/(1-p)).

    Proportions greater than .5 have a positive logit. Proportions less than .5 have a negative logit. A proportion of .5 has a logit value of 0.

    The logit has a constant standard deviation of 1.81 (pi / sqrt(3), to be exact).

    The standard error of each logit is 1.81 / sqrt(n) where n is the number of cases in that cell. (This assumes simple random sampling. For complex samples, it is necessary to use the Logit/Probit Regression program to calculate standard errors.)

  • Probit
    The proportion is reexpressed as the value corresponding to that specific probability on the cumulative distribution function of the normal distribution.

    Proportions greater than .5 have a positive probit. Proportions less than .5 have a negative probit. A proportion of .5 has a probit value of 0.

    The probit has a constant standard deviation of 1.0.

    The standard error of each probit is 1.0 / sqrt(n) where n is the number of cases in that cell. (This assumes simple random sampling. For complex samples, it is necessary to use the Logit/Probit Regression program with the probit regression option to calculate standard errors.)

  • Logit scaled as a probit
    The logit and probit distributions are very similar. They differ primarily in the tails of the distributions. However, because the two statistics are scaled differently, the similarity is not evident by simply examining the statistics.

    This option converts a proportion into a logit and then rescales it by making the standard deviation equal to 1.0 (like a probit) instead of 1.81 (the usual standard deviation of a logit).

    The standard error of each logit scaled as a probit is 1.0 / sqrt(n) where n is the number of cases in that cell. (This assumes simple random sampling. For complex samples, it is necessary to use the Logit/Probit Regression program to calculate standard errors.)

These transformations require that the dependent variable be coded as a value of 0 or 1. If the variable is not coded that way, SDA will create a temporary 0/1 variable by recoding the lowest value to 0 and all other values to 1.
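
The three transformations described above can be reproduced outside SDA for comparison. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the standard errors follow the simple-random-sampling formulas given above.

```python
import math
from statistics import NormalDist

# Standard deviation of the logit: pi / sqrt(3), about 1.81.
LOGIT_SD = math.pi / math.sqrt(3)

def logit(p):
    """Re-express a proportion p (0 < p < 1) as log(p / (1 - p))."""
    return math.log(p / (1 - p))

def probit(p):
    """Value at which the standard normal CDF equals p (0 < p < 1)."""
    return NormalDist().inv_cdf(p)

def logit_scaled_as_probit(p):
    """Logit rescaled so that its standard deviation is 1.0, like a probit."""
    return logit(p) / LOGIT_SD

def standard_error(sd, n):
    """Standard error sd / sqrt(n) for a cell with n cases (simple random sampling)."""
    return sd / math.sqrt(n)

# Hypothetical cell: 75 percent of 400 cases have the characteristic.
p, n = 0.75, 400
print("logit          ", logit(p), standard_error(LOGIT_SD, n))
print("probit         ", probit(p), standard_error(1.0, n))
print("logit as probit", logit_scaled_as_probit(p), standard_error(1.0, n))
```

A proportion of exactly 0 or 1 has no finite logit or probit, so both functions raise an error for those inputs.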

Features Common to All Analysis Programs

Options for specifying variables

Multiple variable names

More than one name may be entered for variables to be analyzed, such as for the row and the column variables. The names should be separated by a comma or blanks. Separate analyses for each combination of variables will be generated.

For example, the following specifications would generate six separate tables:

  • Row variables: spend spend2
  • Column variables: gender, education income
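
The count of six follows from pairing each of the two row variables with each of the three column variables. A short sketch of that pairing is shown here; the helper `split_names` is hypothetical and only mimics the rule that names are separated by commas or blanks.

```python
import re
from itertools import product

def split_names(spec):
    """Split a list of variable names separated by commas and/or blanks."""
    return [name for name in re.split(r"[,\s]+", spec.strip()) if name]

rows = split_names("spend spend2")
columns = split_names("gender, education income")

# One table per (row, column) combination.
tables = list(product(rows, columns))
for row, column in tables:
    print(f"{row} by {column}")
print(len(tables), "tables")
```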

Restricting the valid range

The name of each analysis variable can be followed, in parentheses, by a list of values to be included in the analysis.

Basic range restriction

A single value such as 'gender(2)' or a range of codes such as 'age(30-50)', will limit the analysis to cases having those codes.
Multiple ranges and codes may be specified.
For example: age(1-17, 25, 95-100)

Open-ended Ranges using '*' and '**'
In a range, one asterisk '*' can be used to signify the lowest or highest VALID value.
For example: age(*-25,75-*)
This would include all VALID values less than or equal to 25 and all VALID values greater than or equal to 75. However, any missing-data values within those ranges would still be excluded.

In a range, two asterisks '**' can be used to signify the lowest or highest NUMERIC value, regardless of whether or not the codes are defined as missing data.
For example: age(50-**)
This would include ALL numeric values greater than or equal to 50, including data values like 98 or 99, even if they had been defined as missing-data codes. Note that '**' cannot be used alone (without '-') as a range specification. If you want to include all NUMERIC codes, you can use the range '(**-**)'.

Temporarily Transforming a Variable

A numeric variable can be transformed temporarily, for purposes of running the current analysis. Two types of temporary transformations are described here:

Temporarily Recode a Variable

Temporary recodes are created by specifying groups of codes that are to be combined into a single category. This type of transformation can be very simple, but certain options can make it a little more complex. These are the possibilities:

Basic recoding
For example, to combine the categories of 'age' into three groups, you can specify the variable as:
age(r: 18-30; 31-50; 51-90)
Notice that the name of the variable ('age') is followed by parentheses, then the instruction 'r' (or 'R') followed by a colon (':'), and then the groupings of codes. Those groupings can consist of single code values, ranges, or a combination of many values and/or ranges. Each group is separated from the other by a semicolon (';'). Spaces are optional, but are added here for readability.

Using this basic method of recoding, the new groupings of codes are given the default code values 1, 2, 3, and so forth. The default label for each group is the range of original codes that constitute that group ("18-30", for example).

Any categories of 'age' not included in the specified groupings will become missing-data on the recoded version, and they will be excluded from the analysis in the table.

On the other hand, any original missing-data categories of 'age' that are explicitly mentioned in the recode will be included. For instance, if the value '90' for 'age' were flagged as a missing-data code, but included as in the example above, it would become part of the third recoded category. This is discussed in more detail in the section on "Treatment of missing data."

Assigning particular new code values
It is possible to assign new code values that are different from the default 1, 2, 3, and so forth. To do this, give the new code value, then an equal sign, then the grouping. (The new code value must be a whole number, and decimal places will be ignored. If you want the new code value to include decimal places, use the regular SDA RECODE program.)

For example, the variable 'age' can be recoded into the same three groups as above, but with the new code values 1, 5, and 10, by specifying the recode as follows:
age(r: 1 = 18-30; 5 = 31-50; 10 = 51-90)

For column, row, or control variables it will not usually matter what the new code values are. For variables on which statistics are computed, however, the new code values will affect the value of those statistics.

Assigning labels to the new code values
To assign your own label to a new grouping of code values, place the label in double quotes after the group codes, but before the semicolon. There is no set limit on the length of these labels; however, very long labels may distort the formatting of the tables.

For example, you can assign labels to the recoded categories of race by using the following specification:
race(r: 800-869 "White"; 870-934 "Black"; 600-652, 979-982 "Asian")

These labels will appear in the table, in place of the range of original codes that constitute that group. Nevertheless, the recode specifications will still be documented. A summary is always given at the bottom of the table.

Open ranges (with '*' or '**')
If you are not sure of the ranges of the variable to be recoded, you can specify an open range with an asterisk ('*'). A single asterisk matches the lowest or highest VALID code in the data for that variable.

For example, the 'age' recode could be specified as: age(r: *-30; 31-50; 51-*)
Using this method, all valid age values up to 30 would go into the first recoded group. And all valid age values of 51 or older would go into the third group.

If you want to use a range that includes NUMERIC codes that were defined as missing-data values, you can specify the range with two asterisks ('**') instead of one.

For example, the 'age' recode could be specified as: age(r: *-30; 31-50; 51-**)
Using this method, all valid age values up to 30 would go into the first recoded group. But every numeric value of 51 or greater would go into the third group, including codes like 99 that may have been defined as missing-data codes.

For more discussion about including codes that have been defined as missing-data codes, see the section on "Treatment of missing data."

Overlapping ranges
If the same original code value is mentioned in two or more groupings, it is recoded the FIRST time that the value is encountered.
For example, the following two specifications have the same effect:
age(r: 18-30; 30-50; 50-90), and
age(r: 18-30; 31-50; 51-90)

In both cases, the original 'age' value of 30 ends up in the first group, and the original 'age' value of 50 ends up in the second group.

Notice that order is important with overlapping ranges. The following specification will NOT have the same effect as the preceding two:
age(r: 3= 50-90; 2= 30-50; 1= 18-30)
In this example, the 'age' value of 50 will end up in the recode group with the value '3' (instead of in the second group), and the 'age' value of 30 will end up in the recode group with the value '2' (instead of in the first group).
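
The first-match rule for overlapping ranges can be illustrated with a small sketch. This is not SDA's implementation: the function `recode` and the way the groups are represented (a list of new codes with their ranges, in the order written) are assumptions made for the example.

```python
def recode(value, groups):
    """Return the new code of the FIRST group whose ranges contain value.

    groups is a list of (new_code, [(low, high), ...]) in the order written.
    A value that falls in no group returns None (it becomes missing data).
    """
    for new_code, ranges in groups:
        if any(low <= value <= high for low, high in ranges):
            return new_code
    return None

# age(r: 18-30; 30-50; 50-90) with the default codes 1, 2, 3
ascending = [(1, [(18, 30)]), (2, [(30, 50)]), (3, [(50, 90)])]

# age(r: 3= 50-90; 2= 30-50; 1= 18-30)
descending = [(3, [(50, 90)]), (2, [(30, 50)]), (1, [(18, 30)])]

for age in (30, 50):
    print(age, recode(age, ascending), recode(age, descending))
```

Because the groups are scanned in the order given, the boundary values 30 and 50 land in different groups under the two specifications.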

Multiple specifications for one recoded group
It may sometimes be useful to have more than one specification for a new recoded group. This can be done by specifying the desired outcome code more than once.
For example, to have race recoded into two categories, with the first category including everyone EXCEPT those originally coded as '2', you could use the following specification:
race(r: 1=1 "Non-black"; 2=2 "Black"; 1=3-20)

Treatment of missing data
NUMERIC codes that have been defined as missing data on the original variable can be included in one of the categories of the recoded variable in two ways.

The first method is to mention the code explicitly, either as a single value or as part of a range. For example, if the 'age' value of 99 has been defined as a missing-data code, it can still be included by either of the following specifications:
age(r: 18-30; 31-50; 51-90; 99), or
age(r: 18-30; 31-50; 51-100)

In the first case the code 99 will become its own fourth recode category. In the second case, it will be included as part of the third category.

A second method to include NUMERIC missing data codes is to use an open range with two asterisks ('**') instead of one. For example, the following specification will include all numeric codes above 50 as part of the third recoded group:
age(r: 18-30; 31-50; 51-**)

Note that at present there is no way to include in a temporary recode the system-missing value or a character missing-data value (like 'D' or 'R'). You must use the regular recode program to handle those special missing-data codes. (Your data archive may or may not have enabled that program to run on your current dataset.)

Temporarily Collapse a Variable into Fewer Categories

A simple way to recode a variable into fewer categories is to "collapse" the variable, using a fixed interval.

Collapse syntax
For example, to collapse the variable 'age' into 10-year categories, you can specify the variable as:
age(c: 10, 1)
Notice that the name of the variable ('age') is followed by parentheses, then the instruction 'c' (or 'C') followed by a colon (':'), and then the interval, a comma, and the starting point. Spaces are optional, but are added here for readability.

Using this simple method of collapsing, the new groupings of codes are given the code values 1, 2, 3, and so forth. The label for each group is the range of original codes that constitute that group ("21-30", for example).

Effect of the starting point
The specified starting point affects the range. If the starting point is '1', the age ranges will be: 1-10, 11-20, 21-30, etc. On the other hand, if the starting point is '0', the age ranges will be: 0-9, 10-19, 20-29, etc.

If the starting point is HIGHER than the lowest actual value in the data, the values lower than the starting point become missing-data. For example, with a starting point of '21', any lower values of 'age' (like 18, 19, and 20) would not be included in a range and would become missing-data.

If the starting point is LOWER than the actual minimum value in the data, the ending point of each range is not affected. However, the first range includes only the valid values in that range, if any. For example, if the starting point for collapsing 'age' is '1', with an interval of '10', but the lowest valid value in the data is '18', then the age ranges will be: 18-20, 21-30, 31-40, etc.

The highest range is affected by the highest valid value in the data. For example, if the highest valid value for 'age' is '97', and the starting point is '1' and the interval is '10', the highest intervals will be: 71-80, 81-90, 91-97.
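
The effect of the interval, the starting point, and the lowest and highest valid values can be worked through with a short sketch. The function `collapse_ranges` is hypothetical, and it assumes whole-number codes; it only reproduces the range boundaries described above.

```python
def collapse_ranges(interval, start, lowest_valid, highest_valid):
    """List the (low, high) ranges produced by collapsing with a fixed interval.

    Ranges begin at `start`; the first and last ranges are trimmed to the
    lowest and highest valid values. Values below `start` fall in no range.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    ranges = []
    low = start
    while low <= highest_valid:
        high = low + interval - 1
        if high >= lowest_valid:  # skip ranges that lie entirely below the data
            ranges.append((max(low, lowest_valid), min(high, highest_valid)))
        low += interval
    return ranges

# age(c: 10, 1) when the valid values run from 18 to 97
for low, high in collapse_ranges(10, 1, 18, 97):
    print(f"{low}-{high}")
```

With a starting point of 21 and the same data, the first range is 21-30, so the values 18, 19, and 20 fall in no range and become missing data.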

Treatment of missing-data in a collapse
The intervals created by the collapse procedure will exclude missing-data codes that are either above or below the valid codes. Character missing-data codes (like 'D' or 'R') will also be excluded.

A numeric missing-data code that happened to fall in between valid codes, however, would be included in the range that covers that code. For example, if '0' were defined as missing-data, but both '-1' and '+1' were actual valid codes, '0' would be included in one of the ranges.

Optional variables

Control variables (for table-generating programs)

A separate table is produced for each category of a control variable. If charts are being generated, a separate chart is also produced for each category of the control variable.

For example, if the control variable is gender, there will be one table for men alone and then one table for women alone. A table will also be produced for the total of all valid categories of the control variable (e.g., men and women combined).

Only one variable at a time can be used as a control variable. If more than one control variable is specified, a separate set of tables (and charts) will be generated for each control variable.

Selection filter variables

Selection filters are used in order to limit an analysis to a subset of the cases in the data file. This is done by specifying one or more variables as selection filters, and by indicating which codes of those variables to include.

Some filter variables may be set up ahead of time by the data archive. That type of filter variable is discussed below.

Note that it is also possible to limit the table to a subset of the cases by restricting the valid range of any of the other variables. But when the desired subset of cases is defined by a variable that is not one of the variables in the table or analysis, you must use filter variables.

Numeric variables as selection filters

Basic filter use
The name of each filter variable is followed, in parentheses, by a single value such as 'gender(2)' or a range of codes such as 'age(30-50)', to limit the analysis to cases having those codes.

Multiple ranges and codes may be specified.
For example: age(1-17, 25, 95-100)

Multiple filter variables
If you specify more than one filter variable, a case must satisfy ALL of the conditions in order to be included in the table.
For example: gender(1), age(30-50)
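
The rule that a case must satisfy ALL filter conditions can be sketched as follows. The data layout (each case as a dictionary, each filter as a list of closed ranges) and the function `passes_filters` are assumptions made for the illustration; open-ended ranges and missing-data codes are not handled here.

```python
def passes_filters(case, filters):
    """True only if the case satisfies EVERY filter variable.

    case    -- dict mapping variable name to its code for one case
    filters -- dict mapping variable name to a list of (low, high) ranges;
               a single code such as gender(1) is written as (1, 1)
    """
    for variable, ranges in filters.items():
        value = case.get(variable)
        if value is None or not any(low <= value <= high for low, high in ranges):
            return False
    return True

# gender(1), age(30-50)
filters = {"gender": [(1, 1)], "age": [(30, 50)]}

cases = [
    {"gender": 1, "age": 40},  # passes both conditions
    {"gender": 2, "age": 40},  # fails the gender condition
    {"gender": 1, "age": 51},  # fails the age condition
]
selected = [case for case in cases if passes_filters(case, filters)]
print(len(selected), "of", len(cases), "cases selected")
```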

Open-ended Ranges using '*' and '**'
A single asterisk, '*', can be used to specify that all cases with VALID codes for a variable will pass the filter.
For example: age(*) includes all cases with valid data on the variable 'age'.

In a range, the '*' can be used to signify the lowest or highest VALID value. For example: age(*-25,75-*). This filter would include all VALID values less than or equal to 25 and all VALID values greater than or equal to 75. However, any missing-data values within those ranges would still be excluded.

In a range, two asterisks '**' can be used to signify the lowest or highest numeric value, regardless of whether or not the codes are defined as missing data. For example: age(50-**) would include ALL numeric values greater than or equal to 50, including data values like 98 or 99, even if they had been defined as missing-data codes. However, any character missing-data values would still be excluded. Note that '**' cannot be used alone in a filter variable. It can only be used as part of a range.

Character variables as selection filters
The syntax for specifying character variable filters is similar to the syntax for numeric variables but with a few differences. Like numeric variable filters, character variable filters specify the variable name followed by the filter value(s) in parentheses.
For example: city( Atlanta )

Multiple filter values can be specified, separated by spaces or commas:
city( Chicago,Atlanta Seattle)

Character variable filters are case-insensitive. For example, the following filters are functionally identical:
city( Atlanta )
city( ATLANTA )
city( AtLAnta )

If a filter value contains internal spaces or commas, it must be enclosed in matching quotation marks (either single or double):
city( "New York" )
state("Cal, Calif")

A filter value containing a single quote (apostrophe) can be specified by enclosing it in double quotes:
city( "Knot's Landing" )

Or, conversely, a filter value containing double quotes can be specified by enclosing it in single quotes:
name( 'William "Bill" Smith' )

Leading and trailing spaces, and multiple internal spaces, are NOT significant. The following filters are all functionally equivalent:
city( "New York    " )
city( "New    York" )
city( "   New York    " )

Note that ranges, which are legal for numeric variables, are not allowed for character variables.
For example, the following syntax is NOT legal: city( Atlanta-Seattle )
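
The matching rules for character filters (case-insensitive, with leading, trailing, and repeated internal spaces ignored) can be imitated with a small sketch. The functions `normalize` and `matches` are hypothetical and say nothing about how SDA itself parses the filter specification.

```python
def normalize(text):
    """Lower-case the text and collapse runs of spaces; trim both ends."""
    return " ".join(text.split()).lower()

def matches(value, filter_values):
    """True if the data value equals any filter value after normalization."""
    return normalize(value) in {normalize(item) for item in filter_values}

print(matches("ATLANTA", ["Atlanta"]))                # case is ignored
print(matches("New York", ["   New    York    "]))    # extra spaces are ignored
print(matches("Seattle", ["Chicago", "Atlanta"]))     # not in the filter list
```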

Pre-set selection filters
One or more filter variables may be pre-set by the archive so that they appear automatically on the option screen for the various analysis programs. The user can then select the desired filter-variable categories from a drop-down menu.

For example, the variable 'gender' might be set up as a pre-set filter variable. The user could then choose 'Males' or 'Females' (or 'Both genders') from the drop-down list.

Pre-set filter variables are only a convenience for the user. The same result can be obtained by using the regular selection filter option to specify the filter variable(s) and the desired code categories to include in the analysis.

One possible difference between the pre-set filters and the regular user-defined selection filter specifications concerns cases with missing-data on the filter variable. A user-defined filter specification of 'gender(*)' would include all cases with a valid code on the variable 'gender', excluding any cases with missing-data on that variable, if there are any. On the other hand, selecting the '(Both genders)' option (or whatever the '##none' specification is labeled) for a pre-set filter would generally include cases with missing-data on the filter variable. (The '##none' specification has the same effect as not using that variable as a filter at all.)

To avoid any doubt about which cases are included or excluded, remember that the analysis output always reports which filter variables have been used and which code values have been included in the analysis. This is true both for pre-set selection filters and for user-defined filters.

Weight variable

Depending on the design and implementation of the study, it may be appropriate to give some of the cases more weight than other cases in computing frequency distributions and statistics. To do this, specify that a certain variable contains the relative weight for each case and is to be considered a weight variable. The documentation for the study should explain the reasons for using a weight variable, if there is one, and what its name is.

SDA studies can be set up with a weight variable specified ahead of time so that the weight variable is used automatically. Other studies may be set up with a drop-down list of choices to be presented to the user, who then selects one of the available weight variables (or no weight variable, if that option is included in the list). If no weight variables have been pre-specified, the user is free to enter the name of an appropriate variable to be used as a weight.

Question text

All of the descriptive text available for each variable included in the analysis will be appended to the bottom of the results, if you select this option.

The usual text available for a variable is the text of the question that produced the variable, provided that the text was included in the study documentation. Sometimes other explanatory text has been included.

If the variable was created by the 'recode' or the 'compute' program, the commands used to create the new variable are included in the descriptive text.

Title or label for this analysis

On the option screen for an analysis program, you can enter a title or a label for this analysis. If a title is specified, it will appear as the first line of the HTML output generated by the SDA program.

Actions to take

After you specify variables and select the options you want, go to the bottom section of the form, and select one of two actions:
Run the Table (or Run a specific type of analysis)
Select this when you have finished specifying the variables and options you want. The requested table (or other analysis) will then be generated by the server computer and displayed on your screen.

Clear Fields
Select this to delete all previously specified variables and options, so that you can start over.