Index to Loans on Veterans Administration Guaranteed Mortgages, [United States], 1946-1954

Name: Index to Loans on Veterans Administration Guaranteed Mortgages, [United States], 1946-1954
Published: 2023-10-09
License: https://www.icpsr.umich.edu/web/ICPSR/studies/38906/terms

Summary

Background

This study contains the digitized data originally stored on 3"x5" index cards, archived at the National Archives and Records Administration (NARA). Part of the Records of the Reconstruction Finance Corporation (NARA Record Group 234), the Index to Loans on Veterans Administration Guaranteed Mortgages, 1946-1954 is an index of loans made by the Reconstruction Finance Corporation Mortgage Company on Veterans Administration mortgages.

Digitizing and Parsing

The project team transformed the images into digital text through optical character recognition (OCR). After experimenting with multiple OCR engines, the team implemented two parallel workflows, each using Tesseract as its OCR engine: LayoutParser and Python-tesseract. The output of both workflows was parsed into tabular datasets using regular expressions. For more information on the digitization and parsing processes, please refer to the project team's article. The combined output of those processes is presented as Dataset 1. Users should note that, although the project team took steps to find the most accurate OCR processes for this study, OCR is not perfect: the data contain errors when compared against the original index cards.
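As a rough illustration of the regular-expression parsing step, the sketch below pulls a last name, given names, city, and state out of one card's OCR text. The card layout, the sample text, the pattern, and the function name `parse_card` are all assumptions made for illustration; the project team's actual patterns are documented in their article, not here.

```python
import re

# Hypothetical layout (an assumption, not the documented card format):
#   line 1: "LAST, Given names"
#   line 2: "City, State"
CARD_PATTERN = re.compile(
    r"^(?P<last_name>[A-Za-z'\- ]+),\s*(?P<given_names>[^\n]+)\n"
    r"(?P<city>[A-Za-z.'\- ]+),\s*(?P<state>[A-Za-z. ]+)$",
    re.MULTILINE,
)


def parse_card(ocr_text):
    """Return a dict of fields, or None if the text does not fit the assumed layout."""
    match = CARD_PATTERN.search(ocr_text.strip())
    if match is None:
        return None
    return {key: value.strip() for key, value in match.groupdict().items()}


# Made-up sample text standing in for one card's OCR output.
sample = "SMITH, John A. & Mary\nAnn Arbor, Mich."
record = parse_card(sample)
```

Returning `None` for text that does not fit the pattern keeps unparseable cards visible for review instead of silently producing partial records.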

Cleaning and Geographic Standardization

The project team was most interested in the name, city, and state fields in the OCR output. With this in mind, the team created a working dataset consisting of only those fields. The original scans included the reverse sides of index cards when pencil notations were present; these records were removed from the working dataset. The index also included blue-colored cards that referred to other cards; these reference cards were also removed. The removal of these two types of records left 24,589 mortgage records in the dataset.

Several steps were then taken to prepare the name, city, and state fields for future analysis. The name fields were parsed to separate middle names/initials and suffixes (Jr., III, etc.) from first names. The state field was standardized to the two-letter United States Postal Service state codes. The two-letter codes were also translated to their corresponding two-digit Federal Information Processing Standards (FIPS) codes. Using the standardized states, the team attempted to standardize each record's city to the United States Census Bureau's list of places. City names were first matched deterministically to the Census Bureau's list; for records that remained unmatched, probabilistic matching was used. Because probabilistic matching is inexact, the wrong place name or city FIPS code may have been assigned in some cases. The result of this cleaning and geographic standardization is presented in Dataset 2.
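The two-stage geographic standardization described above can be sketched as follows. This is a minimal sketch under stated assumptions: the lookup tables are tiny illustrative samples, the function names are hypothetical, and Python's `difflib` string similarity stands in for whichever probabilistic matching method the team actually used.

```python
import difflib

# Tiny illustrative lookups; a real run would load the full USPS, FIPS,
# and Census Bureau place lists.
STATE_TO_USPS = {"michigan": "MI", "mich": "MI", "mi": "MI",
                 "ohio": "OH", "oh": "OH"}
USPS_TO_FIPS = {"MI": "26", "OH": "39"}
PLACES_BY_STATE = {
    "MI": ["Ann Arbor", "Detroit", "Grand Rapids"],
    "OH": ["Cleveland", "Columbus", "Toledo"],
}


def standardize_state(raw_state):
    """Map a raw state string to a two-letter USPS code, or None if unknown."""
    key = raw_state.strip().rstrip(".").lower()
    return STATE_TO_USPS.get(key)


def match_city(raw_city, usps_code, cutoff=0.8):
    """Return (place_name, method), trying an exact match before a fuzzy one."""
    candidates = PLACES_BY_STATE.get(usps_code, [])
    by_lower = {name.lower(): name for name in candidates}
    cleaned = raw_city.strip().lower()
    if cleaned in by_lower:
        return by_lower[cleaned], "deterministic"
    # Fallback: similarity-based match (a stand-in for probabilistic matching).
    close = difflib.get_close_matches(cleaned, list(by_lower), n=1, cutoff=cutoff)
    if close:
        return by_lower[close[0]], "probabilistic"
    return None, "unmatched"


state = standardize_state("Mich.")
exact = match_city("ann arbor", state)
fuzzy = match_city("Detriot", state)
```

Recording which stage produced each match makes it possible to review the fuzzy matches separately, since those are where a wrong place name is most likely to be assigned.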

The project team created a truth deck of 1,000 records, hand-keyed from the original images. Each truth record contains the last name of the mortgagor(s), the name (whatever combination of first, middle, and suffix might appear on the card) of the first mortgagor, the name of the second mortgagor if applicable, the city, and the state. These hand-keyed records were then further parsed and geographically standardized in the same manner described above. This truth dataset is presented in Dataset 3.

Dataset 4 is a combination of Datasets 2 and 3, with the truth records replacing the corresponding 1,000 records of Dataset 2.

Citation

Bleckley, David, Lafia, Sara, and Alexander, J. Trent. Index to Loans on Veterans Administration Guaranteed Mortgages, [United States], 1946-1954. Inter-university Consortium for Political and Social Research [distributor], 2023-10-09. https://doi.org/10.3886/ICPSR38906.v1

Funding

University of Michigan. Michigan Institute for Data Science
University of Michigan. Office of the Vice-President for Research

Subject Terms

G.I. Bill, housing, mortgages, veterans, World War II

Geographic Coverage

United States

Smallest Geographic Unit

City

Distributor(s)

Inter-university Consortium for Political and Social Research

Time Period(s)

1946 -- 1954

Date of Collection

2022-01 -- 2022-02

Study Purpose

This study contains the digitized data originally stored on 3"x5" index cards, archived at the National Archives and Records Administration (NARA). Part of the Records of the Reconstruction Finance Corporation (NARA Record Group 234), the Index to Loans on Veterans Administration Guaranteed Mortgages, 1946-1954 is an index of loans made by the Reconstruction Finance Corporation Mortgage Company on Veterans Administration mortgages.

Study Design

The project team transformed the images into digital text through optical character recognition (OCR). After experimenting with multiple OCR engines, the team implemented two parallel workflows, each using Tesseract as its OCR engine: LayoutParser and Python-tesseract. The output of both workflows was parsed into tabular datasets using regular expressions.

Universe

Loans made by the Reconstruction Finance Corporation Mortgage Company on Veterans Administration mortgages

Unit(s) of Observation

Mortgages

Data Type(s)

administrative records data

Description of Variables

Dataset 1 (OCR Output) is the output of the two methods of optical character recognition (OCR) used to digitize the scanned images. Each record contains information from one card from the National Archives and Records Administration, with a variable named "file" indicating the file name of each card's digitized image. Based on the accuracy of each method for digitizing each field (when compared to a set of hand-keyed truth data), one method's output was labeled as the primary version of that field, and the other method's output was labeled as the alternate (using the suffix "_alt"). The variable labels indicate which method of OCR was used for a given variable.
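One way to picture the primary/alternate labeling is the sketch below, which scores two workflows' output for a field against hand-keyed truth values and assigns the "_alt" suffix to the less accurate one. Exact-match accuracy, the function names, the workflow labels, and the sample values are all assumptions for illustration; the source does not state which accuracy measure the team used.

```python
def field_accuracy(ocr_values, truth_values):
    """Share of records where the OCR value equals the hand-keyed value exactly."""
    matches = sum(1 for ocr, truth in zip(ocr_values, truth_values) if ocr == truth)
    return matches / len(truth_values)


def choose_primary(field, method_a, method_b, truth):
    """Map `field` and `field_alt` to the more and less accurate method names."""
    acc_a = field_accuracy(method_a["values"], truth)
    acc_b = field_accuracy(method_b["values"], truth)
    # Ties go to method_a; this tie-break rule is an arbitrary choice.
    primary, alternate = (method_a, method_b) if acc_a >= acc_b else (method_b, method_a)
    return {field: primary["name"], f"{field}_alt": alternate["name"]}


# Made-up values; they say nothing about which real workflow performed better.
truth_city = ["Detroit", "Toledo", "Ann Arbor"]
workflow_a = {"name": "workflow A", "values": ["Detroit", "Toledo", "Ann Arbor"]}
workflow_b = {"name": "workflow B", "values": ["Detriot", "Toledo", "Ann Arbor"]}

labels = choose_primary("city", workflow_a, workflow_b, truth_city)
```

Because the choice is made per field, the primary version of one variable and the primary version of another can come from different workflows.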

Dataset 2 (Cleaned, geographically-standardized data with no truth data) is the OCR output from Dataset 1 with further cleaning, parsing, and geographic standardization. All reference records and back-of-card records have been removed; therefore, the ref_flag and backcard_flag variables have been deleted. The project team focused on the name, city, and state variables; all other variables have been excluded from this dataset. The p1_name, p1_name_alt, p2_name, and p2_name_alt variables have been parsed into first name, middle name/initial, and suffix variables. The team sought to parse names accurately, but because this parsing may contain errors, the original variables are retained. The geographic standardization processes have also introduced error in some cases, so the original city and state variables are retained as well. There were nine place-state combinations with two associated FIPS codes (e.g., a city and township with the same name). In those cases, the second FIPS code is presented as FIPS_alt.
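The name parsing applied to variables such as p1_name can be illustrated with a minimal sketch. The suffix list, the output keys, and the function name `parse_given_name` are hypothetical, and the rule used here (the last token is a suffix if it appears in a short list, the first token is the first name, everything between is the middle name) is an assumption rather than the team's documented logic.

```python
# Illustrative suffix list; a fuller list would be needed for real data.
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}


def parse_given_name(raw_name):
    """Split a string such as 'John A. Jr.' into first, middle, and suffix parts."""
    tokens = raw_name.replace(",", " ").split()
    suffix = ""
    # Treat the final token as a suffix only if it is in the known list.
    if tokens and tokens[-1].rstrip(".").lower() in SUFFIXES:
        suffix = tokens.pop()
    first = tokens[0] if tokens else ""
    middle = " ".join(tokens[1:])
    return {"first": first, "middle": middle, "suffix": suffix}


parsed = parse_given_name("John A. Jr.")
```

A rule this simple will misparse some names (for example, two-word first names), which is one reason keeping the original variables alongside the parsed ones is useful.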

Dataset 3 (Cleaned, geographically-standardized data with only truth data) contains the names, cities, and states of the first 1,000 records of the index (excluding card backs and reference records), which have been hand-keyed to create a truth dataset that was used to support the OCR process. Because these records were hand-keyed, the truth dataset has no alternate versions of OCR output fields. The same name parsing and geographic standardization of cities and states described for Dataset 2 were applied to the hand-keyed records. The records were not hand-parsed or hand-standardized, so, as in Dataset 2, errors may exist in the resulting variables. No FIPS_alt variable is included because none of the cities in these 1,000 records were in the list of place names with two FIPS codes.

Dataset 4 (Cleaned, geographically-standardized data combining truth and non-truth data) combines Datasets 2 and 3. The first 1,000 records of Dataset 2 have been replaced by the records of Dataset 3. The manual_clean variable can be used to differentiate the hand-keyed records from the OCR output records.
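The combination step can be sketched as a keyed replacement. This sketch assumes that both datasets share an identifier (the "file" variable is used here, although the source only confirms it for Dataset 1) and that manual_clean is coded 1 for hand-keyed records and 0 otherwise; both are assumptions, as is the function name `combine`.

```python
def combine(dataset2, dataset3, key="file"):
    """Replace Dataset 2 rows with their hand-keyed counterparts where available."""
    truth_by_key = {row[key]: row for row in dataset3}
    combined = []
    for row in dataset2:
        if row[key] in truth_by_key:
            # Hand-keyed record takes precedence over the OCR record.
            combined.append({**truth_by_key[row[key]], "manual_clean": 1})
        else:
            combined.append({**row, "manual_clean": 0})
    return combined


# Made-up rows standing in for Datasets 2 and 3.
ocr_rows = [
    {"file": "card_0001.jpg", "city": "Detriot"},
    {"file": "card_0002.jpg", "city": "Toledo"},
]
truth_rows = [{"file": "card_0001.jpg", "city": "Detroit"}]

dataset4 = combine(ocr_rows, truth_rows)
hand_keyed_only = [row for row in dataset4 if row["manual_clean"] == 1]
```

Filtering on manual_clean, as in the last line, is how a user could restrict an analysis to the hand-keyed records or exclude them.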

Original Release Date

2023-10-09

Version History

2023-10-09 ICPSR data undergo a confidentiality review and are altered when necessary to limit the risk of disclosure. ICPSR also routinely creates ready-to-go data files along with setups in the major statistical software formats as well as standard codebooks to accompany the data. In addition to these procedures, ICPSR performed the following processing steps for this data collection:

Performed consistency checks.
