Los Angeles Family and Neighborhood Survey (L.A.FANS): Linking Data

QUESTION: How do I link data about the mother to data about the child?

RESPONSE:

Please read chapter 5 of the LAFANS-1 main codebook. It describes how to identify and link various types of individuals, including parents and children.

QUESTION: I have been trying to merge imputed family income onto the PCG dataset and keep losing records. Based on the documentation, it looks like I should merge on SAMPLEID and HHRF, but I tried doing so and lose 587 records. If I merge on SAMPLEID and PID, I lose the same records. Is there a better way to merge these files?

RESPONSE:

There are two ways to add imputed income file data to adult respondents, be they RSAs or PCGs.

One is to use SAMPLEID and HHFAM1 in the adult and PCG files and match them to SAMPLEID and PID in the imputed income file. HHFAM1 is the person ID of the household module respondent who goes with the given adult/PCG respondent. HHLD1 data link the same way: match SAMPLEID and PID in HHLD1 to SAMPLEID and HHFAM1 in the adult/PCG file. Ignore HHFAM2 in the adult/PCG files, as it’s blank for virtually all cases except the odd few where we collected two household modules for the same family. If you care about those cases, you can look at the two household modules and decide which one to use, or combine them.
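As a rough illustration of this first method, the sketch below uses plain Python dictionaries in place of the real files. The variable names SAMPLEID, PID, HHFAM1 and FAMINC come from the text above; every value (household IDs, person IDs, income amounts) is invented, and the real files may format these fields differently.

```python
# Toy stand-ins for the PCG file and the imputed income file.
# All values are invented for illustration only.
pcg = [
    {"SAMPLEID": "H001", "PID": "02", "HHFAM1": "01"},
    {"SAMPLEID": "H002", "PID": "01", "HHFAM1": "01"},
]
impinc = [
    {"SAMPLEID": "H001", "PID": "01", "FAMINC": 42000},
    {"SAMPLEID": "H002", "PID": "01", "FAMINC": 18000},
]

# Index income by (SAMPLEID, PID) of the household module respondent.
income_by_resp = {(r["SAMPLEID"], r["PID"]): r["FAMINC"] for r in impinc}

# Look up each PCG's family income through HHFAM1, not the PCG's own PID.
merged = [
    {**r, "FAMINC": income_by_resp.get((r["SAMPLEID"], r["HHFAM1"]))}
    for r in pcg
]

# Matching on the PCG's own PID would miss anyone who was not the
# household module respondent (here, person "02" in household "H001").
missed_with_own_pid = [
    r for r in pcg if (r["SAMPLEID"], r["PID"]) not in income_by_resp
]
```

In this toy data, the PCG in household H001 finds an income record only through HHFAM1, which is the kind of mismatch that can make records appear to be lost.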

The second method uses SAMPLEID and HHRF to merge imputed income data after one has added HHRF from the ROSTER1 file to the relevant adult respondent file (that merge is by SAMPLEID and PID).
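The two-step merge in this second method can be sketched as follows, again with plain Python dictionaries and invented values. It assumes the income records have already been reduced to one per SAMPLEID/HHRF combination; the HHRF codes shown ("1", "2") are placeholders, not the real coding.

```python
# Toy stand-ins for ROSTER1, an adult respondent file, and IMPINC1.
# All values are invented for illustration only.
roster = [
    {"SAMPLEID": "H001", "PID": "01", "HHRF": "1"},
    {"SAMPLEID": "H001", "PID": "02", "HHRF": "1"},
    {"SAMPLEID": "H001", "PID": "03", "HHRF": "2"},
]
adult = [{"SAMPLEID": "H001", "PID": "02"}]
impinc = [{"SAMPLEID": "H001", "HHRF": "1", "FAMINC": 42000}]

# Step 1: add HHRF to each adult record by SAMPLEID and PID.
hhrf_by_person = {(r["SAMPLEID"], r["PID"]): r["HHRF"] for r in roster}
adult_with_hhrf = [
    {**r, "HHRF": hhrf_by_person.get((r["SAMPLEID"], r["PID"]))}
    for r in adult
]

# Step 2: add family income by SAMPLEID and HHRF.
income_by_family = {(r["SAMPLEID"], r["HHRF"]): r["FAMINC"] for r in impinc}
merged = [
    {**r, "FAMINC": income_by_family.get((r["SAMPLEID"], r["HHRF"]))}
    for r in adult_with_hhrf
]
```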

Remember that in the IMPINC1 file there are 49 SAMPLEID/HHRF combinations that appear more than once, because FIs sometimes gave the household module to two people in the same family when only one was needed. You need to remove those duplicates from IMPINC1 first (they all have the same totals for FAMINC). Otherwise, when you merge the income data onto the PCG file, some PCGs will get two income records and the file size will grow unexpectedly.
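One possible way to drop the duplicate family records before merging is sketched below with invented values. The helper name dedupe_by_family is hypothetical; it keeps the first record for each SAMPLEID/HHRF combination and checks that any duplicate carries the same FAMINC total, as described above.

```python
# Toy stand-in for IMPINC1 with one duplicated family.
# All values are invented for illustration only.
impinc = [
    {"SAMPLEID": "H001", "HHRF": "1", "PID": "01", "FAMINC": 42000},
    {"SAMPLEID": "H001", "HHRF": "1", "PID": "02", "FAMINC": 42000},
    {"SAMPLEID": "H002", "HHRF": "1", "PID": "01", "FAMINC": 18000},
]


def dedupe_by_family(records):
    """Keep the first record for each (SAMPLEID, HHRF) combination."""
    kept = {}
    for rec in records:
        key = (rec["SAMPLEID"], rec["HHRF"])
        if key in kept:
            # Duplicates are expected to share the same FAMINC total;
            # stop if one does not, rather than silently dropping it.
            assert kept[key]["FAMINC"] == rec["FAMINC"], key
            continue
        kept[key] = rec
    return list(kept.values())


impinc_unique = dedupe_by_family(impinc)
```

With one record per family, a merge on SAMPLEID and HHRF can no longer add extra rows to the PCG file.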

Note that the HHRF variable is really only needed if you want to link IMPINC1 data to NON-W1 sampled respondents, or if you need to identify those people in the same family. For W1 respondents you don’t need to merge on HHRF as you can use HHFAM1, the pid of the household module respondent, to link to the person id (PID) of the respondent in the IMPINC1 file.

QUESTION: I am trying to merge the public imputed income file (IMPINC1.dta) into my child-based dataset. I noticed that HHID seems to be the only logical variable to merge by. However, I am ending up with 86 more cases than I should. Do you have an idea of what could be happening? I have merged all of the public files thus far with no problems (mostly using sampid_n).

RESPONSE:

If you check the documentation for the IMPINC1 file (imputed_income.doc or imputed_income.pdf), you’ll see that the income info is for FAMILIES within a household. Thus the identifiers to merge by are SAMPLEID and HHRF. As noted in that documentation, the HHRF variable can be found on the ROSTER1 file. You can merge it onto your child-level file using SAMPLEID and PID, regardless of whether PID is the roster ID number of the PCG or of the child, because the PCG and RSC/SIBs are from the same family.

SAMPLEID and SAMPID_N are the same thing: the former is a string variable, the latter its numeric counterpart. HHID can be used in place of SAMPLEID, as they both identify the household. However, we encourage the use of SAMPLEID/SAMPID_N since it is what we use in all our LAFANS work.

As discussed in the LAFANS documentation (L.A.FANS Codebook introduction.pdf), a household might have more than one household economy module if the RSA (or RSA spouse) was not the PCG of the RSC. In such cases a second household economy module was done to be sure that income information was collected for the family of the RSC. The income questions pertain only to the given respondent, the respondent’s spouse and the respondent’s children. Other household members are not included.

QUESTION: I have been working with the Adult, Parent, and Child modules. When I merge the Adult and Parent modules (using HHID and PID to merge), there are 66 primary caregivers who do not have observations in the adult module. While a couple of these are documented, most of them are not mentioned in the codebook.

Similarly, there are 36 children who completed the Child module but do not have matching completed Parent modules, and the majority of these are not documented. Is there a reason these files are not lining up well?

RESPONSE:

The main documentation codebook section on response rates (Section 2) shows that partial completes exist for different types of respondents and explains that partials are those respondents who did not complete all of their requisite modules. I guess we assumed that would be sufficient to alert users to the fact that some respondents won’t have all their expected modules.

In section 3 where we briefly discuss the MODSTAT1 file, we note that the final disposition variable, FIN_STAT, has codes of 496-499 for households where not everyone completed everything required of them.
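A small sketch of flagging such households from MODSTAT1 follows. It assumes one record per household, and the values are invented; in particular, code 100 is only a placeholder for "any other disposition code" and is not taken from the codebook.

```python
# Toy stand-in for MODSTAT1, assumed here to have one record per household.
# All values are invented; 100 is a placeholder for any other code.
modstat = [
    {"SAMPLEID": "H001", "FIN_STAT": 100},
    {"SAMPLEID": "H002", "FIN_STAT": 497},
    {"SAMPLEID": "H003", "FIN_STAT": 499},
]

# Codes 496 through 499 mark households with incomplete modules.
PARTIAL_CODES = range(496, 500)

partial_households = [
    r["SAMPLEID"] for r in modstat if r["FIN_STAT"] in PARTIAL_CODES
]
```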

It did not seem necessary to list the individual cases since users could identify problem households from the info in MODSTAT1. The individual module completion status variables in MODSTAT1 show what was done and not done by each type of respondent.

So, as you noticed, there are PCGs who did the Parent module but did not do the Adult module. There are PCGs who did the Adult module but not the Parent module. There are kids in the CHILD module whose mother/PCG did not do the Parent module. If a respondent is not in a module file he/she should be in, it is because the person did not start the module.
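To list the respondents who appear in one module file but not another, something like the following check may help. It uses invented values and assumes both files identify the respondent by SAMPLEID and PID.

```python
# Toy stand-ins for the Parent and Adult module files.
# All values are invented for illustration only.
parent = [
    {"SAMPLEID": "H001", "PID": "02"},
    {"SAMPLEID": "H002", "PID": "01"},
]
adult = [{"SAMPLEID": "H001", "PID": "02"}]

# Collect the respondent keys present in the Adult module file.
adult_keys = {(r["SAMPLEID"], r["PID"]) for r in adult}

# PCGs with a Parent module record but no Adult module record.
parent_only = [
    r for r in parent if (r["SAMPLEID"], r["PID"]) not in adult_keys
]
```

The resulting list can then be compared against the module completion status variables in MODSTAT1.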

The length of the LAFANS sometimes meant that not all modules could be administered in one visit, and on subsequent visits the FI could not connect with the respondent to finish the rest of the modules. Also, some respondents decided it was too much and refused to continue after having done a module or two. You’ll also see cases where a respondent started a module and quit shortly thereafter (e.g., in the Adult module). The completion flags in each module file identify those who finished vs those who did not finish the given module.

Because some respondents will not have all the modules required for them, there is obviously a missing data problem that users will have to address. How to handle missing data is up to individual users and their given analyses.

QUESTION: I am trying to merge in the actual census tract numbers from the private data into my public data file. I am using the data in restricted data version 2 called rstrv2_1.dta. It includes the following variables: sampleid pid tracth00 city tracth90. When I merge my datasets by sampleid and limit my sample to those children who were in the assess1.dta sample, I should get an N of 2,500. Instead, I get over 8 thousand. I am clearly doing something wrong, but I thought all people in the same household should have the same census tract number. Can you please offer any insight?

RESPONSE:

RSTRV2_1 is a PERSON-LEVEL file, not a household-level file. That’s why it has nearly 13,000 records while there are only a little over 3,000 households in the LAFANS data.

To merge on TRACTH90 from it to assess1.dta (which is also a person-level file), you need to merge by SAMPLEID and PID, keeping only those records in the assess1.dta file. By only using SAMPLEID you got not only those people in assess1.dta but everyone else in their households. That’s why you had over 8,000 records after the merge.
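The difference between the two merges can be seen in a toy example. The sketch below uses invented values (the tract label "T-A" is a placeholder) and plain Python in place of the real files: joining on SAMPLEID alone pairs each assessed child with every person in the household, while joining on SAMPLEID and PID keeps exactly the records in the assessment file.

```python
# Toy stand-ins for RSTRV2_1 (person-level) and assess1.dta (person-level).
# All values are invented for illustration only.
rstr = [
    {"SAMPLEID": "H001", "PID": "01", "TRACTH90": "T-A"},
    {"SAMPLEID": "H001", "PID": "02", "TRACTH90": "T-A"},
    {"SAMPLEID": "H001", "PID": "03", "TRACTH90": "T-A"},
]
assess = [{"SAMPLEID": "H001", "PID": "03"}]

# Join on SAMPLEID only: one output row per household member.
by_household = [
    (a["PID"], p["PID"])
    for a in assess
    for p in rstr
    if a["SAMPLEID"] == p["SAMPLEID"]
]

# Join on SAMPLEID and PID, keeping only records in the assessment file.
tract_by_person = {(p["SAMPLEID"], p["PID"]): p["TRACTH90"] for p in rstr}
merged = [
    {**a, "TRACTH90": tract_by_person.get((a["SAMPLEID"], a["PID"]))}
    for a in assess
]
```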

If you had used RSTHV2_1, a household-level file, and merged TRACTH90 onto assess1.dta by SAMPLEID, keeping only those records in the assess1.dta file, you would have had no problem.

A quick check of the LAFANS restricted Version 2 data documentation, which lists the contents of xxxxV2_1 files, would have shown that RSTHV2_1 is a household-level file because it only has SAMPLEID, and RSTRV2_1 is a person-level file because it has SAMPLEID and PID.