# Imputations

## Longitudinal Cohort Study

At each wave, three primary caregiver-level variables have versions where missing values have been imputed by the Scientific Directors. The three variables pertain to education level (ordinal), salary level (ordinal), and SEI (socioeconomic index, continuous). Imputation calculations are done at the primary caregiver level: 1 record per unique value of FAM_ID in the PHDCN MASTER data file (ICPSR 13580).

The mean values of the three variables over all records with the same FAM_ID are calculated, and those mean values are used in the imputation calculations. Imputations are done only if at least one of the three variables is non-missing; otherwise, the imputed values are also missing. Imputed values for each variable are based on the results of regression models in which the other two variables are independent covariates. The principal component of the three variables also has a version that is imputed if any of the three component variables is missing.
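The family-level averaging and the "at least one non-missing" condition described above can be sketched as follows. This is a minimal illustration, not the Scientific Directors' code: the record layout (a list of dicts), the use of `None` for missing values, the function names, and the choice to skip missing values when averaging are all assumptions. The variable names are the Wave 1 ones.

```python
from collections import defaultdict
from statistics import mean

VARIABLES = ("EDUC_MAX", "SALARY", "SEIMAX")

def family_means(records, variables=VARIABLES):
    """Average each variable over all records sharing a FAM_ID.

    Missing values are represented as None and skipped (an assumption).
    """
    grouped = defaultdict(lambda: {v: [] for v in variables})
    for rec in records:
        bucket = grouped[rec["FAM_ID"]]  # create the family entry even if every value is missing
        for v in variables:
            if rec.get(v) is not None:
                bucket[v].append(rec[v])
    return {
        fam: {v: (mean(vals) if vals else None) for v, vals in bucket.items()}
        for fam, bucket in grouped.items()
    }

def can_impute(family_row):
    """Imputation proceeds only if at least one of the three variables is non-missing."""
    return any(value is not None for value in family_row.values())
```

A family whose three means are all `None` fails `can_impute`, so its imputed values stay missing.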

### Wave 1 imputations

The 3 variables for which imputed versions are derived are as follows:

EDUC_MAX (maximum of EDUC_PC [education level of PC] and EDUC_PR [education level of partner]); 5 levels; grade equivalent for level is used in regression models (range 8-16).

SALARY (household income); 7 levels; dollar equivalent of level is used in regression models (range 2500-55000).

SEIMAX (maximum of SEI for PC job and Partner job); continuous.

SESCOMP is the principal component of the 3 variables.

**Note**: EDUC_MAX is derived even if PC education level is missing (49 cases) and even if a partner is present and the partner's education level is missing (177 cases). Thus, for these 226 cases, the value of EDUC_MAX itself involves some imputation. The derivation of EDUC_MAX is justified for the following reasons:

- The Spearman correlation of the education levels of PC and Partner for the 2812 cases where both are non-missing is .604, and the mean grade difference (PC - Partner) for these cases is .36.
- The mean grade completed for 2989 PC's with a Partner is 11.89 and for 1226 PC's without a Partner is 12.09.
- For the 2989 PC's with a Partner, the 2812 with non-missing Partner education have a mean grade completed of 11.99 and the 177 with missing Partner education have a mean grade completed of 10.18.
- For the 49 cases where PC education is missing, the mean grade completed for the Partner is 9.39.

Three regressions are run:

- Dependent: EDUC_MAX. Independent: SALARY (0 if missing), a variable indicating whether SALARY is missing, SEIMAX (0 if missing), and a variable indicating whether SEIMAX is missing.
- Dependent: SALARY. Independent: EDUC_MAX (0 if missing), a variable indicating whether EDUC_MAX is missing, SEIMAX (0 if missing), and a variable indicating whether SEIMAX is missing.
- Dependent: SEIMAX. Independent: EDUC_MAX (0 if missing), a variable indicating whether EDUC_MAX is missing, SALARY (0 if missing), and a variable indicating whether SALARY is missing.
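The covariate coding used in these regressions (the value set to 0 when missing, plus a missing-value indicator) can be sketched with NumPy. This is an illustration only: the helper names, the use of `NaN` for missing values, the intercept column, and fitting by ordinary least squares on records where the dependent variable is observed are assumptions that the text above does not spell out.

```python
import numpy as np

def design_matrix(covariates):
    """Columns: intercept, then for each covariate its value (0 if missing)
    followed by a 0/1 indicator that the value is missing."""
    columns = [np.ones(len(covariates[0]))]
    for x in covariates:
        x = np.asarray(x, dtype=float)
        missing = np.isnan(x)
        columns.append(np.where(missing, 0.0, x))
        columns.append(missing.astype(float))
    return np.column_stack(columns)

def fit_regression(y, covariates):
    """Least-squares fit using only records where the dependent variable is observed.

    Returns the coefficients and the full design matrix, so predictions
    (X @ beta) are available for every record, including those to be imputed.
    """
    y = np.asarray(y, dtype=float)
    X = design_matrix(covariates)
    observed = ~np.isnan(y)
    beta, *_ = np.linalg.lstsq(X[observed], y[observed], rcond=None)
    return beta, X
```

For the first regression, `y` would hold the EDUC_MAX grade equivalents and `covariates` would be the SALARY dollar equivalents and SEIMAX, in that order.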

After each regression, the imputed value of the dependent variable is calculated: it is the predicted value plus a random value from a normal distribution multiplied by the MSE (mean squared error) for the regression. For SEIMAX and SALARY, negative imputed values are set to 0. Continuous imputed values for EDUC_MAX and SALARY are categorized.
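The calculation of a single imputed value can be sketched as below. The formula follows the description above (predicted value plus a normal draw multiplied by the MSE). Taking the draw from a standard normal, assigning the continuous value to the level with the nearest equivalent, and the grade equivalents in `HYPOTHETICAL_GRADES` are assumptions made for illustration; the actual level equivalents are not listed in this document beyond their range.

```python
import random

def impute_value(predicted, mse, rng, floor_at_zero=False):
    """Predicted value plus a standard-normal draw multiplied by the regression MSE."""
    value = predicted + rng.gauss(0.0, 1.0) * mse
    if floor_at_zero and value < 0:
        value = 0.0  # applies to SEIMAX and SALARY
    return value

def categorize(value, level_equivalents):
    """Return the 1-based level whose equivalent is closest to value (assumed rule)."""
    nearest = min(range(len(level_equivalents)),
                  key=lambda i: abs(level_equivalents[i] - value))
    return nearest + 1

# Hypothetical grade equivalents for the 5 EDUC_MAX levels (range 8-16).
HYPOTHETICAL_GRADES = [8, 10, 12, 14, 16]

rng = random.Random(13580)  # fixed seed so the sketch is reproducible
educ_continuous = impute_value(11.2, 0.0, rng)  # with MSE 0 the draw has no effect
educ_level = categorize(educ_continuous, HYPOTHETICAL_GRADES)
```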

If the value of the original variable (SP-level) is missing, the imputed value is assigned to the imputed version of the variable; otherwise, the value of the imputed version equals the value of the original version.

The imputed version of EDUC_MAX is IEDUCMAX; variable EDUCMAXI indicates whether the value of IEDUCMAX has been imputed. The imputed version of SALARY is ISALARY; variable SALARYI indicates whether the value of ISALARY has been imputed. The imputed version of SEIMAX is ISEIMAX; variable SEIMAXI indicates whether the value of ISEIMAX has been imputed.

A principal components analysis is run using the 3 imputed variables. If SESCOMP is non-missing, ISESCOMP is set equal to SESCOMP; otherwise, it equals the principal component from this new analysis. Variable SESCOMPI indicates whether the value of ISESCOMP was calculated using any imputed variables.
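The rule for filling the imputed version of a variable and its indicator can be sketched as a small helper. The function name, the use of `None` for missing, and the 1/0 coding of the indicator are assumptions; the document does not state how the indicator variables are coded.

```python
def imputed_version(original, imputed):
    """Return (value, flag): the original value when present, else the imputed one.

    flag is 1 when the returned value was imputed and 0 otherwise (assumed coding).
    """
    if original is None:
        return imputed, 1
    return original, 0

# EDUC_MAX is missing here, so IEDUCMAX takes the imputed level and EDUCMAXI is set.
ieducmax, educmaxi = imputed_version(None, 3)
# SALARY is observed here, so ISALARY keeps the original level and SALARYI is not set.
isalary, salaryi = imputed_version(5, 4)
```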

### Wave 2 imputations

The 3 variables for which imputed versions are derived are as follows:

EDUCMAX2 (maximum of EDUC_PC [education level of PC] and EDUC_PR [education level of partner at Wave 1]); 5 levels; grade equivalent for level is used in regression models (range 8-16).

SALARY2 (household income); 11 levels; dollar equivalent of level is used in regression models (range 2500-95000).

SEIMAX2 (maximum of SEI for PC job and Partner job); continuous.

SESCOMP2 is the principal component of the 3 variables.

**Note**: Partner education level was not measured in Wave 2. At Wave 2 the PC education level was inquired about only if the PC had attended school since Wave 1. For 66 cases it is unknown whether the Wave 2 PC is the same as the Wave 1 PC. Unfortunately, for another 177 cases the Wave 2 PC differs from the Wave 1 PC but PC education level was not inquired about in Wave 2. For 64 cases the PC education level at both waves is missing as is the Partner education level (if there is a partner). For these 307 cases, EDUCMAX2 is missing. For the remaining cases EDUCMAX2 is assigned as follows:

If the Wave 2 PC differs from the Wave 1 PC, EDUCMAX2 is set to the PC education level collected at Wave 2.

Else if PC has no partner at Wave 2, EDUCMAX2 is set to PC education level (from Wave 2 if collected, otherwise from Wave 1).

Else if PC has a partner at Wave 2 and partner is same as in Wave 1, EDUCMAX2 is set to the maximum of the PC education level (from Wave 2 if collected, otherwise from Wave 1) and the Wave 1 partner education.

Else if PC has a partner at Wave 2 and partner differs from Wave 1 partner, EDUCMAX2 is set to the PC education level (from Wave 2 if collected, otherwise from Wave 1).
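The assignment rules for EDUCMAX2 can be read as a single decision function. The sketch below is one interpretation, not the original derivation code: the function and parameter names are hypothetical, `None` stands for a missing or unknown value, and taking the maximum over whichever education levels are non-missing is an assumption based on the note above that EDUCMAX2 is missing only when all relevant education levels are missing.

```python
def educmax2(pc_changed, has_partner_w2, partner_same_as_w1,
             pc_educ_w1, pc_educ_w2, partner_educ_w1):
    """Derive EDUCMAX2 from Wave 1 and Wave 2 information (None means missing).

    pc_changed is True/False, or None when it is unknown whether the
    Wave 2 PC is the same person as the Wave 1 PC.
    """
    if pc_changed is None:
        return None  # unknown whether the PC is the same person
    if pc_changed:
        return pc_educ_w2  # missing if education was not asked at Wave 2

    # Same PC as Wave 1: prefer the Wave 2 report, fall back to Wave 1.
    pc_educ = pc_educ_w2 if pc_educ_w2 is not None else pc_educ_w1

    if has_partner_w2 and partner_same_as_w1:
        known = [v for v in (pc_educ, partner_educ_w1) if v is not None]
        return max(known) if known else None

    # No partner at Wave 2, or a partner who differs from the Wave 1 partner.
    return pc_educ
```

Only the third rule uses partner education, because partner education was measured at Wave 1 only and therefore applies only to a partner who was present at Wave 1.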

The remainder of the Wave 2 imputation algorithm is exactly comparable to the Wave 1 imputation algorithm.

The imputed version of EDUCMAX2 is IEDUMAX2; variable EDUMAXI2 indicates whether the value of IEDUMAX2 has been imputed.

The imputed version of SALARY2 is ISALARY2; variable SALARYI2 indicates whether the value of ISALARY2 has been imputed.

The imputed version of SEIMAX2 is ISEIMAX2; variable SEIMAXI2 indicates whether the value of ISEIMAX2 has been imputed.

The imputed version of SESCOMP2 is ISESCOMP2; variable SESCOMPI2 indicates whether any component of ISESCOMP2 has been imputed.