Index to Loans on Veterans Administration Guaranteed Mortgages, [United States], 1946-1954 (ICPSR 38906)
Version Date: Oct 9, 2023
Principal Investigator(s):
David Bleckley, Inter-university Consortium for Political and Social Research;
Sara Lafia, Inter-university Consortium for Political and Social Research;
J. Trent Alexander, Inter-university Consortium for Political and Social Research
https://doi.org/10.3886/ICPSR38906.v1
Version V1
Summary
Background
This study contains the digitized data originally stored on 3"x5" index cards, archived at the National Archives and Records Administration (NARA). Part of the Records of the Reconstruction Finance Corporation (NARA Record Group 234), the Index to Loans on Veterans Administration Guaranteed Mortgages, 1946-1954 is an index of loans made by the Reconstruction Finance Corporation Mortgage Company on Veterans Administration mortgages.
Digitizing and Parsing
The project team transformed the images into digital text through optical character recognition (OCR). After experimentation with multiple OCR engines, the team implemented two parallel workflows, each using Tesseract as its OCR engine: LayoutParser and Python-tesseract. The outputs of both workflows were parsed into tabular datasets using regular expressions. For more information on the digitization and parsing processes, please refer to the project team's article. The combined output of those processes is presented as Dataset 1. Users should note that, although the project team took steps to find the most accurate OCR processes for this study, OCR is not perfect, and these data contain errors when compared against the original index cards.
Cleaning and Geographic Standardization
The project team was most interested in the name, city, and state fields in the OCR output. With this in mind, the team created a working dataset composed of only those fields. The original images included images of the reverse sides of index cards when pencil notations were present; these records were removed from the working dataset. The index also included blue-colored cards that referred to other cards; these reference cards were also removed from the working dataset. The removal of these two types of records left 24,589 mortgage records in the dataset.
Several steps were then taken to prepare the name, city, and state fields for future analysis. The name fields were parsed to separate middle names/initials as well as suffixes (Jr., III, etc.) from first names. The state field was standardized to the two-letter United States Postal Service state codes. The two-letter codes were also translated to their corresponding two-digit Federal Information Processing Standards (FIPS) codes. Using the standardized states, the team attempted to standardize each record's city to the United States Census Bureau's list of places. The team first attempted to match the city names to the Census Bureau's list deterministically. For unmatched records, probabilistic matching was used. Due to the inexact nature of probabilistic matching, the wrong place name or city FIPS code may have been assigned in some cases. The result of this cleaning and geographic standardization is presented in Dataset 2.
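The two-stage geographic standardization described above can be illustrated with a minimal sketch. It is not the project team's code: the lookup tables are tiny hypothetical subsets, the function names are invented for illustration, and fuzzy string similarity from Python's standard `difflib` module stands in for whatever probabilistic matching method the team actually used.

```python
import difflib

# Hypothetical, deliberately tiny lookup tables (assumptions for illustration).
STATE_TO_USPS = {"mich": "MI", "michigan": "MI", "mi": "MI",
                 "ohio": "OH", "oh": "OH"}
USPS_TO_FIPS = {"MI": "26", "OH": "39"}  # two-digit state FIPS codes
PLACES = {"MI": ["Detroit", "Grand Rapids", "Ann Arbor"],
          "OH": ["Cleveland", "Akron", "Toledo"]}


def standardize_state(raw):
    """Map a raw state string to a USPS code, or None if unknown."""
    key = raw.strip().rstrip(".").lower()
    return STATE_TO_USPS.get(key)


def match_city(raw_city, usps, cutoff=0.8):
    """Try an exact (case-insensitive) match first, then a fuzzy match.

    Returns (place_name, method), where method is "deterministic",
    "fuzzy", or "unmatched". The cutoff value is an arbitrary assumption.
    """
    by_key = {name.lower(): name for name in PLACES.get(usps, [])}
    key = " ".join(raw_city.lower().split())
    if key in by_key:
        return by_key[key], "deterministic"
    close = difflib.get_close_matches(key, list(by_key), n=1, cutoff=cutoff)
    if close:
        return by_key[close[0]], "fuzzy"
    return None, "unmatched"


state = standardize_state("Mich.")
print(state, USPS_TO_FIPS[state], match_city("Detriot", state))
```

Recording the match method alongside the matched place makes it easy to review the fuzzy matches separately, since those are the records where a wrong place may have been assigned.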
The project team created a truth deck of 1,000 records, hand-keyed from the original images. Each truth record contains the last name of the mortgagor(s), the name (whatever combination of first, middle, and suffix might appear on the card) of the first mortgagor, the name of the second mortgagor if applicable, the city, and the state. These hand-keyed records were then further parsed and geographically standardized in the same manner described above. This truth dataset is presented in Dataset 3.
Dataset 4 is a combination of Datasets 2 and 3, with the truth records replacing the corresponding 1,000 records of Dataset 2.
Citation
Funding
Subject Terms
Geographic Coverage
Smallest Geographic Unit
City
Distributor(s)
Time Period(s)
Date of Collection
Study Purpose
This study contains the digitized data originally stored on 3"x5" index cards, archived at the National Archives and Records Administration (NARA). Part of the Records of the Reconstruction Finance Corporation (NARA Record Group 234), the Index to Loans on Veterans Administration Guaranteed Mortgages, 1946-1954 is an index of loans made by the Reconstruction Finance Corporation Mortgage Company on Veterans Administration mortgages.
Study Design
The project team transformed the images into digital text through optical character recognition (OCR). After experimentation with multiple OCR engines, the team implemented two parallel workflows, each using Tesseract as its OCR engine: LayoutParser and Python-tesseract. The outputs of both workflows were parsed into tabular datasets using regular expressions.
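As a rough illustration of the regular-expression parsing step, the sketch below turns the raw text of one card into a record. The card layout (a "LAST, First & Second" line followed by a "City, State" line), the patterns, and the `last_name`, `city`, and `state` field names are assumptions made for this example; they are not taken from the project team's code. The OCR call itself is omitted, so the function takes text that an OCR engine has already produced.

```python
import re

# Hypothetical layout: line 1 is "LAST, First M. & Second", line 2 is "City, State".
NAME_LINE = re.compile(
    r"^\s*(?P<last_name>[A-Z][A-Za-z'\-]+),\s*"
    r"(?P<p1_name>[^&\n]+?)"
    r"(?:\s*&\s*(?P<p2_name>[^\n]+?))?\s*$"
)
PLACE_LINE = re.compile(
    r"^\s*(?P<city>[A-Za-z][A-Za-z .'\-]*?),\s*(?P<state>[A-Za-z][A-Za-z. ]+?)\s*$"
)


def parse_card(ocr_text):
    """Parse the OCR text of one card into a dict; unparsed fields stay None."""
    record = {"last_name": None, "p1_name": None, "p2_name": None,
              "city": None, "state": None}
    lines = [ln for ln in ocr_text.splitlines() if ln.strip()]
    if lines:
        m = NAME_LINE.match(lines[0])
        if m:
            record.update(m.groupdict())
    if len(lines) > 1:
        m = PLACE_LINE.match(lines[1])
        if m:
            record.update(m.groupdict())
    return record


print(parse_card("SMITH, John A. & Mary\nDetroit, Mich.\n"))
```

Leaving unmatched fields as `None` rather than guessing keeps OCR failures visible in the tabular output.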
Universe
Loans made by the Reconstruction Finance Corporation Mortgage Company on Veterans Administration mortgages
Unit(s) of Observation
Data Type(s)
Description of Variables
Dataset 1 (OCR Output) is the output of the two methods of optical character recognition (OCR) used to digitize the scanned images. Each record contains information from one card from the National Archives and Records Administration, with a variable named "file" indicating the file name of each card's digitized image. Based on the accuracy of each method for digitizing each field (when compared to a set of hand-keyed truth data), one method's output was labeled as the primary version of that field, and the other method's output was labeled as the alternate (using the suffix "_alt"). The variable labels indicate which method of OCR was used for a given variable.
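The primary/alternate labeling can be pictured with the following sketch. It assumes exact-match accuracy against the truth records as the comparison metric and breaks ties in favor of the first method; neither detail is stated in the study description, and the function names and sample values are hypothetical.

```python
def field_accuracy(ocr_rows, truth_rows, field):
    """Share of truth records whose field value the OCR output matches exactly."""
    if not truth_rows:
        return 0.0
    hits = sum(1 for o, t in zip(ocr_rows, truth_rows)
               if o.get(field) == t.get(field))
    return hits / len(truth_rows)


def choose_primary(outputs, truth, fields):
    """For each field, pick the more accurate of two methods as primary.

    `outputs` maps exactly two method names to lists of rows aligned with `truth`.
    """
    (name_a, rows_a), (name_b, rows_b) = outputs.items()
    choice = {}
    for f in fields:
        acc_a = field_accuracy(rows_a, truth, f)
        acc_b = field_accuracy(rows_b, truth, f)
        primary, alt = (name_a, name_b) if acc_a >= acc_b else (name_b, name_a)
        choice[f] = {"primary": primary, "alt": alt}
    return choice


def merge_row(row_by_method, choice):
    """Build one record with primary values and "_alt" alternates."""
    merged = {}
    for f, c in choice.items():
        merged[f] = row_by_method[c["primary"]].get(f)
        merged[f + "_alt"] = row_by_method[c["alt"]].get(f)
    return merged


# Made-up sample values for illustration only.
truth = [{"city": "Detroit", "state": "MI"}, {"city": "Akron", "state": "OH"}]
lp = [{"city": "Detroit", "state": "M1"}, {"city": "Akron", "state": "OH"}]
pt = [{"city": "Detriot", "state": "MI"}, {"city": "Akron", "state": "OH"}]
choice = choose_primary({"layoutparser": lp, "pytesseract": pt}, truth,
                        ["city", "state"])
print(merge_row({"layoutparser": lp[0], "pytesseract": pt[0]}, choice))
```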
Dataset 2 (Cleaned, geographically-standardized data with no truth data) is the OCR output from Dataset 1 with further cleaning, parsing, and geographic standardization. All reference records and back-of-card records have been removed; therefore, the ref_flag and backcard_flag variables have been deleted. The project team focused on the name, city, and state variables; all other variables have been excluded from this dataset. The p1_name, p1_name_alt, p2_name, and p2_name_alt variables have been parsed into first name, middle name/initial, and suffix variables. Although the team sought to parse names accurately, the parsing may contain errors, so the original variables are retained. The geographic standardization processes have also introduced errors in some cases; all original variables are retained. There were nine place-state combinations with two associated FIPS codes (e.g., a city and a township with the same name). In those cases, the second FIPS code is presented as FIPS_alt.
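A minimal sketch of the kind of name parsing described for the p1_name and p2_name variables follows. The suffix list, the output keys, and the rule that the first token is the first name are assumptions for illustration; the team's actual parsing rules are not given here.

```python
# Hypothetical suffix list; the real parser may recognize others.
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def parse_given_name(name):
    """Split a given-name string into first name, middle name/initial, and suffix."""
    tokens = name.replace(",", " ").split()
    suffix = ""
    # Treat a trailing token as a suffix if it appears in the suffix list.
    if tokens and tokens[-1].rstrip(".").lower() in SUFFIXES:
        suffix = tokens.pop()
    first = tokens[0] if tokens else ""
    middle = " ".join(tokens[1:])
    return {"first": first, "middle": middle, "suffix": suffix}


print(parse_given_name("John A. Jr."))
print(parse_given_name("William Henry III"))
```

A simple token rule like this one will misparse some names (for example, two-word first names), which is one reason retaining the original variables alongside the parsed ones is useful.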
Dataset 3 (Cleaned, geographically-standardized data with only truth data) contains the names, cities, and states of the first 1,000 records of the index (excluding card backs and reference records), which have been hand-keyed to create a truth dataset that was used to support the OCR process. Because these records were hand-keyed, the truth dataset has no alternate versions of OCR output fields. The same name parsing and geographic standardization of cities and states described for Dataset 2 were applied to the hand-keyed records. The records were not hand-parsed or hand-standardized, so, as with Dataset 2, errors may exist in the resulting variables. No FIPS_alt variable is included because none of the cities in these 1,000 records were in the list of place names with two FIPS codes.
Dataset 4 (Cleaned, geographically-standardized data combining truth and non-truth data) combines Datasets 2 and 3. The first 1,000 records of Dataset 2 have been replaced by Dataset 3. The manual_clean variable can be used to differentiate the hand-keyed vs. OCR output records.
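The replacement logic behind Dataset 4 might look like the sketch below. It assumes that records in Datasets 2 and 3 share a key named `file` and that manual_clean is coded 1 for hand-keyed records and 0 for OCR records; both are assumptions made for this example, so check the codebook for the actual key and coding.

```python
def combine(dataset2, dataset3, key="file"):
    """Replace Dataset 2 rows with their truth counterparts and flag the source."""
    truth_by_key = {row[key]: row for row in dataset3}
    combined = []
    for row in dataset2:
        if row[key] in truth_by_key:
            # Hand-keyed truth record replaces the OCR record.
            combined.append({**truth_by_key[row[key]], "manual_clean": 1})
        else:
            combined.append({**row, "manual_clean": 0})
    return combined


# Made-up sample rows for illustration only.
ocr = [{"file": "card_0001.jpg", "city": "Detriot"},
       {"file": "card_0002.jpg", "city": "Akron"}]
truth = [{"file": "card_0001.jpg", "city": "Detroit"}]
for row in combine(ocr, truth):
    print(row)

# Selecting only the OCR-derived records from the combined data:
ocr_only = [r for r in combine(ocr, truth) if r["manual_clean"] == 0]
```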
Original Release Date
2023-10-09
Version History
2023-10-09 ICPSR data undergo a confidentiality review and are altered when necessary to limit the risk of disclosure. ICPSR also routinely creates ready-to-go data files along with setups in the major statistical software formats as well as standard codebooks to accompany the data. In addition to these procedures, ICPSR performed the following processing steps for this data collection:
- Performed consistency checks.
Notes
The public-use data files in this collection are available for access by the general public. Access does not require affiliation with an ICPSR member institution.