How do I use a Stata setup file to import ASCII data?
Setup files contain syntax or program code to read columnar ASCII data into a statistical package. The instructions below demonstrate how to use Stata setup files. Please note that while the examples and illustrations that follow depict Stata in a Windows environment, the steps and procedures are platform independent. Please see the appropriate Getting Started With Stata manual for operating system details.
Getting Ready to Use Stata
These instructions assume that you have already downloaded and decompressed the ASCII data and the Stata setup files from our Web site.
Be sure to make a note of the exact location of the uncompressed files extracted from the downloaded file you obtained from ICPSR as you will need to input that information into one of the setup files.
The Stata Setup Components
There are three Stata setup components:
A columnar ASCII data file
A dictionary file which defines the elements of the data file.
For more information about Stata dictionaries, please refer to Stata's online help or Reference Manual Set:
Stata Reference Manuals
[R] Infile (fixed format)
[R] Infix (fixed format)
A Stata do-file, which contains Stata's processing instructions to import and save the data in Stata's system format.
Figure 1 displays these three files for ICPSR 6399, Homicides in Chicago, 1965-1995:
Note that the files are located on the D drive in a folder titled "homicide". Elsewhere in this document, we refer to this address (D:\homicide) as the path.
Also note that the file extensions (.txt, .dct, and .do) are visible. If you are using Microsoft Windows, certain file extensions may be hidden. To set Windows so that the full filename, including extensions, is visible, select Tools and then Folder Options from the Windows directory menu. At the dialog box, click on the View tab and ensure that the toggle for "Hide extensions for known file types" is not checked, as shown in Figure 2.
The remainder of these instructions assumes that the necessary files were decompressed to D:\homicide.
Using the Stata Setup Files
The Stata setup files prepared by ICPSR are designed to import a columnar ASCII data file into a Stata system file and apply appropriate variable-level metadata, such as labels for variables and variable values. These setups are designed to work across platforms with any recent implementation of Stata.
Of the three files shown in Figure 1, only the do-file (06399-0002-Stata_setup.do) requires editing. To edit, open the file in a text editor that is capable of saving output in plain ASCII text format.
Please take note of the following caveats:
Stata is packaged with an editing utility, the do-file editor. While the editor is an adequate tool for small files, many of the Stata setup files will exceed the utility's size limitations. The do-file editor is limited to files that are 32k or smaller. It cannot open larger files.
An example of a do-file header is shown in Figure 3. Most setup files contain a header that describes the contents of the file. Once you have opened the setup file in your editor, read the header, if present, for important information about what is contained in the file.
Any text editor capable of working with and saving plain ASCII text files is sufficient.
Note that a text editor differs from a word processor. Word processors like Microsoft Word or Corel WordPerfect save files in proprietary formats. Stata cannot interpret files saved in those formats; Stata can only interpret ASCII text files.
For more information about common text editors that work well with Stata, please see the FAQ "Some notes on text editors for Stata users" maintained by Boston College or Wikipedia's Text Editors page.
If a setup file is too large for the do-file editor and an alternative text editor is unavailable, word processors can be used for editing. However, please be sure to set the output format to plain text. Figure 4 shows how to select plain text format.
Editing the Stata setup file
The setup file contains five distinct sections.
- Section 1 defines filenames and locations.
- Section 2 reads the raw data into memory in a Stata system format.
- Section 3 applies value labels to appropriate variable values.
- Section 4 recodes numeric missing value codes from numbers to Stata-recognized system missing values.
- Section 5 saves the dataset in a Stata system format.
Figure 5 shows the first two statements of the Stata do-file.
These statements define internal system settings:
set mem 9m: Assigns 9 megabytes of RAM to Stata to receive and store the data. Unlike SPSS or SAS, Stata stores the entire data array in the computer's RAM. If the amount of memory allocated to Stata is insufficient to read an entire file, Stata will terminate with an error, as shown in Figure 6.
In Figure 6, we tried to run 06399-0002-Stata_setup.do with only 1 megabyte of RAM allocated. Although there are 12,000 observations in the file, Stata was able to read only 6,392 into the memory space. Because it could not read the entire array into memory, the process terminated.
The default set mem allocation (in this case, 9 megabytes) was specified by ICPSR to be large enough to accommodate the corresponding data file. Therefore, this number need not be adjusted.
set more off: Some Stata setup files contain thousands of lines. As the do-file runs, Stata displays or echoes these lines to the screen. If more is not set to off, the system will pause each time the screen buffer fills. Setting more to off allows the do-file to run until completion.
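In text form, the two statements described above would look roughly like the following at the top of the do-file. This is a sketch based on the description here, not a copy of the actual file. As far as we know, Stata 12 and later manage memory automatically and accept but ignore set mem, so that line matters only for older releases.

```stata
* Allocate 9 megabytes of memory for the data (relevant to older Stata releases)
set mem 9m
* Do not pause the output each time the Results window fills
set more off
```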
Section 1 defines paths and filenames.
The setup files use Stata macros, a programming feature. A local macro acts as a temporary storage container, or alias, for a string of text characters. Once defined, the contents of this container can be recalled at any time within the do-file by the reference `macro' (where macro is a placeholder for the actual macro name).
Please note that the macro reference uses a left quote mark [`], sometimes called a back-tick, on the left and an apostrophe ['] on the right. Therefore, to be clear, `macro' is not the same as 'macro'.
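The difference between the two quoting styles can be seen in a small sketch. The macro name greeting and its contents are hypothetical and do not appear in the setup files.

```stata
* Define a local macro named greeting (hypothetical name)
local greeting "hello"
* Correct reference: back-tick on the left, apostrophe on the right
display "`greeting'"
* Incorrect reference: two apostrophes are not expanded, so the
* literal text 'greeting' is displayed instead of the contents
display "'greeting'"
```

Run together in one do-file, the first display statement shows the macro's contents and the second shows the unexpanded text.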
Next, it is necessary to declare the following three macros:
local raw_data -- the raw ASCII data file
If all the files are in the default directory, only the filename need be entered between the double quotes.
Example: local raw_data "06399-0002-Data.txt"
If files are not located in the default directory, then the path must also be specified.
Example: local raw_data "D:\homicide\06399-0002-Data.txt"
local dict -- the Stata dictionary file
Example without a path: local dict "06399-0002-Stata_dictionary.dct"
Example with a path: local dict "D:\homicide\06399-0002-Stata_dictionary.dct"
local outfile -- the filename you want to associate with the final Stata system file
Example without a path: local outfile "homicide.dta"
Example with a path: local outfile "D:\data\homicide.dta"
An example of Section 1: File Specifications correctly edited for files in the default directory is shown in Figure 7.
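Written out as text, an edited Section 1 for files in the default directory might look like the sketch below. It uses only the example filenames given above; the comment lines are illustrative and may not match the wording in the actual setup file.

```stata
* Section 1: File Specifications (all files in the default directory)
local raw_data "06399-0002-Data.txt"
local dict "06399-0002-Stata_dictionary.dct"
local outfile "homicide.dta"
```

If the files are elsewhere, one option is to make that folder Stata's working directory first, for example with cd "D:\homicide", on the assumption that "default directory" refers to the working directory. The other is to use the full-path forms of the examples.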
The infile command (see Figure 8) applies information stored in the dictionary (06399-0002-Stata_dictionary.dct) to the data stored in the data file (06399-0002-Data.txt) and stores the file in system memory in a format optimized for Stata.
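As a rough sketch, the infile statement takes a form like the one below. It assumes the dict and raw_data macros were defined as in Section 1; the exact statement in a given setup file may differ.

```stata
* Macros as defined in Section 1 (repeated so the sketch stands alone)
local raw_data "06399-0002-Data.txt"
local dict "06399-0002-Stata_dictionary.dct"
* Section 2: read the raw data as described by the dictionary;
* clear discards any dataset currently in memory
infile using `dict', using(`raw_data') clear
```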
The dictionary shown in Figure 9 defines the starting column locations, variable type, name, format, and label. The dictionary document should not need to be edited for any reason.
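For orientation only, a fixed-format dictionary generally has the shape sketched below. The variable names, column positions, formats, and labels are hypothetical and are not taken from the ICPSR 6399 dictionary. Each entry gives the starting column, storage type, variable name, input format, and variable label, in that order.

```stata
dictionary {
    _column(1)   int    V1   %4f   "Hypothetical identifier"
    _column(5)   byte   V2   %1f   "Hypothetical yes/no indicator"
}
```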
Section 3 defines value labels (if applicable) for numeric categorical variables (see Figure 10).
The command #delimit ; changes the command delimiter from a carriage return (the default delimiter) to a semi-colon. This allows for multi-line value label definitions. At the end of section 3, the delimiter can be reset to a carriage return with the command #delimit cr.
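A minimal sketch of a multi-line value label definition follows. The variable V2 and its codes are hypothetical, and the variable must already exist in memory for the label values line to run. Whether the actual setup file attaches labels in exactly this way is an assumption.

```stata
* Switch the delimiter so one definition can span several lines
#delimit ;
label define V2
    0 "No"
    1 "Yes" ;
label values V2 V2 ;
#delimit cr
* The delimiter is now back to the default: one command per line
```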
Section 4, shown in Figure 11, recodes values defined to represent missing information from numeric codes to Stata's system missing value (.). ICPSR's processing conventions use numeric values to represent such information in the ASCII data, which ensures that information is not lost across statistical packages. While Stata allows for up to 27 distinct missing values (the system missing value . and the extended missing values .a through .z), the do-file programmatically recodes all missing values to the single value (.), so distinctions among the original missing codes are lost. Accordingly, this section is commented out by default. To apply missing values, remove the comment delimiters (/* */) bracketing this section.
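The kind of statement this section contains can be sketched as follows. The variable V2 and the missing code 9 are hypothetical, and V2 must already exist in memory. The commented-out mvdecode line is not part of the setup files; it is one possible alternative for users who want to keep different missing codes distinct.

```stata
* Hypothetical: 9 is the numeric code for missing information in V2
replace V2 = . if (V2 == 9)
* Alternative (not in the setup file): map the code to an extended
* missing value so it stays distinguishable from other missing codes
* mvdecode V2, mv(9 = .a)
```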
Section 5, shown in Figure 12, is the final section and saves the data to disk in a Stata system format. If the local outfile macro was specified correctly in Section 1, this step will occur automatically.
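A minimal sketch of the save step follows, assuming the outfile macro from Section 1 and a dataset already in memory. The replace option is an assumption; it overwrites an existing file of the same name, and the actual setup file may or may not include it.

```stata
* Output filename as defined in Section 1 (repeated so the sketch stands alone)
local outfile "homicide.dta"
* Section 5: write the data in memory to disk as a Stata system file
save `outfile', replace
```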
Once a Stata system file has been saved, it can be used for subsequent analysis sessions. There is no need to run the setup file again.
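In a later session, the saved file can be opened directly, for example as sketched here. The path follows the outfile example with a path given earlier; substitute the location you actually used.

```stata
* Load the saved Stata system file, discarding any data in memory
use "D:\data\homicide.dta", clear
* List variable names, storage types, and labels to confirm the import
describe
```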
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.