Index to Loans on Veterans Administration Guaranteed Mortgages, [United States], 1946-1954 (ICPSR 38906)

Version Date: Oct 9, 2023

Principal Investigator(s):
David Bleckley, Inter-university Consortium for Political and Social Research; Sara Lafia, Inter-university Consortium for Political and Social Research; J. Trent Alexander, Inter-university Consortium for Political and Social Research

https://doi.org/10.3886/ICPSR38906.v1

Version V1

Background

This study contains the digitized data originally stored on 3"x5" index cards, archived at the National Archives and Records Administration (NARA). Part of the Records of the Reconstruction Finance Corporation (NARA Record Group 234), the Index to Loans on Veterans Administration Guaranteed Mortgages, 1946-1954 is an index of loans made by the Reconstruction Finance Corporation Mortgage Company on Veterans Administration mortgages.

Digitizing and Parsing

The project team transformed the images into digital text through optical character recognition (OCR). After experimenting with multiple OCR engines, the team implemented two parallel workflows, each using Tesseract as its OCR engine: LayoutParser and Python-tesseract. The outputs of both workflows were parsed into tabular datasets using regular expressions. For more information on the digitization and parsing processes, please refer to the project team's article. The combined output of those processes is presented as Dataset 1. Users should note that, although the project team took steps to find the most accurate OCR processes for this study, OCR is not perfect: these data contain errors when compared against the original index cards.
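As a rough illustration of the regular-expression parsing step, the sketch below turns one card's OCR text into a tabular record. It assumes the OCR text has already been produced, and the two-line card layout, the pattern, and the function name parse_card are hypothetical; the team's actual patterns are described in their article and may differ substantially.

```python
import re

# Hypothetical card layout (an assumption, not the real card format):
#   LASTNAME, First M. & Second
#   City, State
CARD_PATTERN = re.compile(
    r"^(?P<last_name>[A-Z][A-Za-z'\-]+),\s*"      # mortgagor last name
    r"(?P<p1_name>[^&\n]+?)"                      # first mortgagor's given name(s)
    r"(?:\s*&\s*(?P<p2_name>[^\n]+?))?\s*\n"      # optional second mortgagor
    r"(?P<city>[A-Za-z .'\-]+),\s*"               # city
    r"(?P<state>[A-Za-z.]+)\s*$",                 # state, as written on the card
    re.MULTILINE,
)

FIELDS = ("last_name", "p1_name", "p2_name", "city", "state")


def parse_card(ocr_text, file_name):
    """Parse one card's OCR text into a record keyed by field name."""
    record = {"file": file_name}
    match = CARD_PATTERN.search(ocr_text.strip())
    for field in FIELDS:
        value = match.group(field) if match else None
        record[field] = value.strip() if value else None
    return record


if __name__ == "__main__":
    print(parse_card("SMITH, John A. & Mary\nAnn Arbor, Mich.", "card_0001.jpg"))
```

Cards whose text does not fit the pattern come back with empty fields rather than raising an error, which keeps one row per image in the output table.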

Cleaning and Geographic Standardization

The project team was most interested in the name, city, and state fields in the OCR output. With this in mind, the team created a working dataset composed of only those fields. The original images included scans of the reverse sides of index cards when pencil notations were present; these records were removed from the working dataset. The index also included blue-colored cards that referred to other cards; these reference cards were also removed. Removing these two types of records left 24,589 mortgage records in the dataset.

Several steps were then taken to prepare the name, city, and state fields for future analysis. The name fields were parsed to separate middle names/initials as well as suffixes (Jr., III, etc.) from first names. The state field was standardized to the two-letter United States Postal Service state codes. The two-letter codes were also translated to their corresponding two-digit Federal Information Processing Standards (FIPS) codes. Using the standardized states, the team attempted to standardize each record's city to the United States Census Bureau's list of places. The team first attempted to match city names deterministically to the Census Bureau's list, then applied probabilistic matching to the records that remained unmatched. Because probabilistic matching is inexact, some records may have been assigned the wrong place name or city FIPS code. The result of this cleaning and geographic standardization is presented in Dataset 2.
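The two-stage geographic standardization described above can be sketched as follows. This is a minimal illustration under stated assumptions: the lookup tables are tiny hypothetical stand-ins for the full state and Census place lists, the function names are invented, and difflib string similarity with a 0.8 cutoff is swapped in for whatever probabilistic matching method the team actually used.

```python
import difflib

# Tiny illustrative tables (assumptions); a real run would cover every state
# and the full Census Bureau place list.
STATE_TO_USPS = {"michigan": "MI", "mich": "MI", "mi": "MI",
                 "ohio": "OH", "oh": "OH"}
USPS_TO_FIPS = {"MI": "26", "OH": "39"}
PLACES = {"MI": ["Ann Arbor", "Detroit", "Grand Rapids"],
          "OH": ["Cleveland", "Toledo"]}


def standardize_state(raw_state):
    """Return (USPS code, two-digit state FIPS code), or (None, None)."""
    key = raw_state.strip().rstrip(".").lower()
    usps = STATE_TO_USPS.get(key)
    return usps, USPS_TO_FIPS.get(usps)


def match_city(raw_city, usps, cutoff=0.8):
    """Match a city within its state: exact match first, fuzzy match second."""
    lookup = {name.lower(): name for name in PLACES.get(usps, [])}
    key = " ".join(raw_city.split()).lower()
    if key in lookup:  # deterministic match
        return lookup[key], "deterministic"
    # Fallback: closest place name by string similarity (a stand-in for
    # probabilistic matching; it can return the wrong place).
    close = difflib.get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
    if close:
        return lookup[close[0]], "probabilistic"
    return None, "unmatched"


if __name__ == "__main__":
    usps, fips = standardize_state("Mich.")
    print(usps, fips, match_city("Ann Arbar", usps))
```

Recording which stage produced each match lets a user treat fuzzy matches with more caution than exact ones.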

The project team created a truth deck of 1,000 records, hand-keyed from the original images. Each truth record contains the last name of the mortgagor(s), the name (whatever combination of first, middle, and suffix might appear on the card) of the first mortgagor, the name of the second mortgagor if applicable, the city, and the state. These hand-keyed records were then further parsed and geographically standardized in the same manner described above. This truth dataset is presented in Dataset 3.

Dataset 4 is a combination of Datasets 2 and 3, with the truth records replacing the corresponding 1,000 records of Dataset 2.

Bleckley, David, Lafia, Sara, and Alexander, J. Trent. Index to Loans on Veterans Administration Guaranteed Mortgages, [United States], 1946-1954. Inter-university Consortium for Political and Social Research [distributor], 2023-10-09. https://doi.org/10.3886/ICPSR38906.v1

University of Michigan. Michigan Institute for Data Science, University of Michigan. Office of the Vice-President for Research

City

Inter-university Consortium for Political and Social Research

1946 -- 1954
2022-01 -- 2022-02

Loans made by the Reconstruction Finance Corporation Mortgage Company on Veterans Administration mortgages

Mortgages

Dataset 1 (OCR Output) is the output of the two methods of optical character recognition (OCR) used to digitize the scanned images. Each record contains information from one card from the National Archives and Records Administration, with a variable named "file" indicating the file name of each card's digitized image. Based on the accuracy of each method for digitizing each field (when compared to a set of hand-keyed truth data), one method's output was labeled as the primary version of that field, and the other method's output was labeled as the alternate (using the suffix "_alt"). The variable labels indicate which method of OCR was used for a given variable.
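The per-field choice between primary and alternate output can be sketched as below. This is only an illustration: it assumes records are keyed by the "file" variable, measures accuracy as the share of exact matches against the truth records, and breaks ties in favor of the first method. The source does not state how accuracy was scored, and the function names are hypothetical.

```python
def field_accuracy(ocr_records, truth_records, field):
    """Share of truth records whose field value the OCR output matches exactly."""
    hits = sum(
        1
        for file_name, truth in truth_records.items()
        if ocr_records.get(file_name, {}).get(field) == truth.get(field)
    )
    return hits / len(truth_records)


def label_primary(method_a, method_b, truth, fields):
    """Merge two OCR outputs, field by field, into primary and "_alt" columns."""
    merged = {}
    for field in fields:
        a_acc = field_accuracy(method_a, truth, field)
        b_acc = field_accuracy(method_b, truth, field)
        # The more accurate method supplies the primary value for this field.
        if a_acc >= b_acc:
            primary, alternate = method_a, method_b
        else:
            primary, alternate = method_b, method_a
        for file_name in set(method_a) | set(method_b):
            row = merged.setdefault(file_name, {"file": file_name})
            row[field] = primary.get(file_name, {}).get(field)
            row[field + "_alt"] = alternate.get(file_name, {}).get(field)
    return merged


if __name__ == "__main__":
    truth = {"c1": {"city": "Detroit", "state": "MI"}}
    a = {"c1": {"city": "Detroit", "state": "M1"}}
    b = {"c1": {"city": "Detro1t", "state": "MI"}}
    print(label_primary(a, b, truth, ["city", "state"]))
```

Because the choice is made per field rather than per record, a single row can mix values from both OCR workflows.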

Dataset 2 (Cleaned, geographically-standardized data with no truth data) is the OCR output from Dataset 1 with further cleaning, parsing, and geographic standardization. All reference records and back-of-card records have been removed; therefore, the ref_flag and backcard_flag variables have been deleted. The project team focused on the name, city, and state variables; all other variables have been excluded from this dataset. The p1_name, p1_name_alt, p2_name, and p2_name_alt variables have been parsed into first name, middle name/initial, and suffix variables. Although the team sought to parse names accurately, parsing errors may remain, and the geographic standardization processes have also introduced error in some cases; for both reasons, all original variables are retained. There were nine place-state combinations with two associated FIPS codes (e.g., a city and township with the same name). In those cases, the second FIPS code is presented as FIPS_alt.
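To show the kind of parsing applied to variables such as p1_name, here is a minimal sketch that splits a given-name string into first name, middle name/initial, and suffix. The suffix list, the output keys, and the function name are assumptions for illustration; the team's actual rules are not given here and are likely more thorough.

```python
# Hypothetical suffix list; the real list used by the project may be longer.
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def parse_given_name(raw_name):
    """Split e.g. "John A. Jr." into first name, middle name/initial, suffix."""
    tokens = raw_name.replace(",", " ").split()
    suffix = ""
    # A trailing token such as "Jr." or "III" is treated as the suffix.
    if tokens and tokens[-1].rstrip(".").lower() in SUFFIXES:
        suffix = tokens.pop()
    first = tokens[0] if tokens else ""
    middle = " ".join(tokens[1:])  # everything between first name and suffix
    return {"first": first, "middle": middle, "suffix": suffix}


if __name__ == "__main__":
    print(parse_given_name("John A. Jr."))
```

A rule this simple will misparse some names (for example, two-word first names), which is one reason keeping the original unparsed variable alongside the parsed ones is useful.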

Dataset 3 (Cleaned, geographically-standardized data with only truth data) contains the names, cities, and states of the first 1,000 records of the index (excluding card backs and reference records), which have been hand-keyed to create a truth dataset that was used to support the OCR process. Because these records were hand-keyed, the truth dataset has no alternate versions of OCR output fields. The same name parsing and geographic standardization of cities and states described for Dataset 2 were applied to the hand-keyed records. The records were not hand-parsed or hand-standardized, so, as in Dataset 2, errors may exist in the resulting variables. No FIPS_alt variable is included because none of the cities in these 1,000 records were in the list of place names with two FIPS codes.

Dataset 4 (Cleaned, geographically-standardized data combining truth and non-truth data) combines Datasets 2 and 3. The first 1,000 records of Dataset 2 have been replaced by Dataset 3. The manual_clean variable can be used to differentiate the hand-keyed vs. OCR output records.
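The replacement logic behind Dataset 4 can be sketched as follows. The sketch assumes that records in Datasets 2 and 3 share an identifying key (here the "file" variable, which the source documents only for Dataset 1) and that manual_clean is coded 1 for hand-keyed records and 0 otherwise; both are assumptions, so check the codebook for the actual variables and coding.

```python
def combine(dataset2, dataset3, key="file"):
    """Replace OCR-derived rows with their hand-keyed counterparts, flagging each row."""
    truth_by_key = {row[key]: row for row in dataset3}
    combined = []
    for row in dataset2:
        if row[key] in truth_by_key:
            # Hand-keyed truth record takes the place of the OCR record.
            combined.append({**truth_by_key[row[key]], "manual_clean": 1})
        else:
            combined.append({**row, "manual_clean": 0})
    return combined


if __name__ == "__main__":
    ds2 = [{"file": "a", "city": "Detro1t"}, {"file": "b", "city": "Toledo"}]
    ds3 = [{"file": "a", "city": "Detroit"}]
    for row in combine(ds2, ds3):
        print(row)
```

Filtering the combined rows on the flag then separates the hand-keyed records from the OCR output records.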

2023-10-09

2023-10-09 ICPSR data undergo a confidentiality review and are altered when necessary to limit the risk of disclosure. ICPSR also routinely creates ready-to-go data files along with setups in the major statistical software formats as well as standard codebooks to accompany the data. In addition to these procedures, ICPSR performed the following processing steps for this data collection:

  • Performed consistency checks.

Notes

  • The public-use data files in this collection are available for access by the general public. Access does not require affiliation with an ICPSR member institution.