What kind of data formats do you distribute?
We primarily distribute data files in eight data formats: three plain text formats (column-delimited ASCII, comma-delimited ASCII, and tab-delimited ASCII), two SAS formats (SAS XPORT and CPORT files), two SPSS formats (SPSS SAV and portable files), and the single Stata data format. Virtually every data file is available in a plain text format. We also supply many data files in more than one format.
Plain Text
Column-, comma-, and tab-delimited ASCII data files store data, including numeric values, as lines of plain text, with one or more lines per observation (or subject or case). In the plain text format, every character of text--each digit, letter, or other symbol--is encoded in a separate byte in the data file. Thus, the number 133.5 occupies five bytes, the number 8 just one byte, and the string "computer programmer" requires nineteen bytes. Many of ICPSR's plain text data files are encoded with the ASCII character encoding system. However, some use other encodings that are based on ASCII but support more characters than ASCII does. Most of these use the ISO 8859-1 or Windows-1252 encodings; a few use others, such as IBM PC code page 437.
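The byte counts above can be checked with a short Python sketch. Python is used here only for illustration and is not one of the packages for which ICPSR provides setups. The accented character is an assumption, chosen to show that the same character is stored as a different byte value under different encodings.

```python
# In ASCII, every character of plain text occupies exactly one byte.
samples = ["133.5", "8", "computer programmer"]
byte_counts = {text: len(text.encode("ascii")) for text in samples}
for text, count in byte_counts.items():
    print(f"{text!r} occupies {count} byte(s)")

# An illustrative non-ASCII character. It cannot be written in ASCII at all,
# and its single-byte value depends on which extended encoding is used.
accented = "é"
cp437_byte = accented.encode("cp437")        # IBM PC code page 437
latin1_byte = accented.encode("iso-8859-1")  # ISO 8859-1
cp1252_byte = accented.encode("cp1252")      # Windows-1252
print(cp437_byte, latin1_byte, cp1252_byte)
```

Reading a file with the wrong one of these encodings leaves plain ASCII characters intact but can turn accented characters into the wrong symbols.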
In all three types of plain text data files, the line(s) allocated to a given observation contain the observation's values for the file's variables. What sets the three types apart is the way the values are demarcated on the lines.
In a column-delimited ASCII data file, each variable occupies the same byte(s) on every observation. The bytes are usually called "columns," hence the name of this data format. For example, if a file with one line per observation has just three variables which occupy three bytes each, then the first variable would be located in columns 1-3, the second in columns 4-6, and the third in columns 7-9 on each line in the data file.
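The three-variable layout described above can be read with a few lines of Python. This is a minimal sketch, not an ICPSR setup: the variable names (var1, var2, var3), the parse_line helper, and the two data lines are all hypothetical.

```python
# Hypothetical layout matching the example: three variables, three bytes each.
# Column positions are 1-based and inclusive, as they would appear in a codebook.
LAYOUT = {"var1": (1, 3), "var2": (4, 6), "var3": (7, 9)}

def parse_line(line, layout=LAYOUT):
    """Slice one fixed-width line into a dict of variable name -> text value."""
    record = {}
    for name, (start, end) in layout.items():
        # Convert 1-based inclusive columns to Python's 0-based slice bounds.
        record[name] = line[start - 1:end].strip()
    return record

# Made-up data lines, one observation per line.
lines = ["001 42133", "002  7 58"]
records = [parse_line(line) for line in lines]
for record in records:
    print(record)
```

Because values are located by position rather than by a separator, a value narrower than its field is padded with blanks, which the sketch strips after slicing.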
To facilitate the use of the column-delimited ASCII data files, which require programming expertise to import into statistical packages for analysis, ICPSR usually provides programs, called "setups," to read them into SAS, SPSS, or Stata. The setups also assign variable labels and usually assign value labels and define missing values.
In a comma-delimited ASCII data file, the data values are separated with commas instead of being located in fixed column locations. Thus, in this format, the length of each line varies according to the number of characters in the line's data values. For example, the first two lines of a four-variable data file could look like this:
1,133.5,plumber,250778
2,44,librarian,20000
As with the column-delimited ASCII files, ICPSR usually provides setups to read the comma-delimited ASCII files into SAS, SPSS, or Stata.
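For readers working outside SAS, SPSS, or Stata, the two example observations can also be parsed with Python's standard csv module. This is a sketch only. The variable names (id, score, occupation, income) are hypothetical, since the example does not name its variables; a real file's names come from its codebook or setup.

```python
import csv
import io

# The two example observations, held in memory instead of read from a file.
raw = "1,133.5,plumber,250778\n2,44,librarian,20000\n"

# Hypothetical variable names, assumed for illustration.
names = ["id", "score", "occupation", "income"]

rows = [dict(zip(names, values)) for values in csv.reader(io.StringIO(raw))]

# Every value arrives as text, so numeric variables are converted explicitly.
for row in rows:
    row["id"] = int(row["id"])
    row["score"] = float(row["score"])
    row["income"] = int(row["income"])

print(rows)
```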
Tab-delimited ASCII data files are the same as comma-delimited ASCII files except that values are delimited with a special tab control character instead of a comma. Most of these files were created by ICPSR for use with spreadsheets, such as Excel, into which they can be easily imported. These files can also be read into statistical packages like SAS, SPSS, and Stata. However, ICPSR rarely provides setups for that purpose.
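Since setups are rarely provided for the tab-delimited files, a short sketch of reading one by hand may help. It uses Python's standard csv module with the tab character as the delimiter. The data values are made up, and the sketch assumes no header row.

```python
import csv
import io

# Made-up observations, delimited with the tab control character (\t).
raw = "1\t133.5\tplumber\t250778\n2\t44\tlibrarian\t20000\n"

# The only change from reading comma-delimited data is the delimiter argument.
reader = csv.reader(io.StringIO(raw), delimiter="\t")
rows = list(reader)
print(rows)
```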
R
R can be used as an alternative to traditional statistical packages such as SPSS, SAS, and Stata. It is an extensible, open-source language and computing environment for Windows, Macintosh, UNIX, and Linux platforms. R performs a wide variety of basic to advanced statistical and graphical techniques. Installation files for Windows, Mac, and Linux can be found at the Comprehensive R Archive Network (CRAN). The site also contains documentation for downloading and installing the software on different operating systems. There is no cost for downloading and using R.
How do I read ICPSR data into R?
We have a brief tutorial available on how to read data into R.
Can I use R without having to learn the details of the R language?
Yes, at least for the basics. A number of "front ends" have been constructed to make it easier for users to interact with the R statistical computing environment. For example, a graphical user interface (or "GUI") allows the analyst to carry out data analysis tasks by selecting items from menus and lists, rather than entering commands.
One such GUI is the R Commander, written by John Fox. The R Commander is accessed by installing and loading the Rcmdr package within R. The R Commander provides an easy-to-use, menu-based system for loading data into R, manipulating data values, performing statistical analyses, creating graphical displays, and carrying out diagnostic tests on statistical models. Documentation for the R Commander is available.
SAS
We distribute two SAS data formats: SAS transport files generated by the SAS CPORT procedure and SAS transport files written by the SAS XPORT engine. Both types of files contain specially formatted SAS data sets, which contain variable labels as well as data. Many of ICPSR's SAS CPORT files also include SAS format catalogs with value labels.
SAS CPORT files should be imported into SAS with the SAS CIMPORT procedure.
Since SAS has an engine that reads SAS XPORT files, they can be read by any SAS command that can read an ordinary SAS data set, such as the SAS set statement or the SAS FREQ procedure. SAS XPORT files can also be converted to standard SAS data sets with the SAS COPY procedure.
SPSS
We distribute two types of SPSS data files: SPSS SAV files written by the SPSS save command and SPSS portable files written by the SPSS export command. Both types of data files include variable labels and usually include value labels and missing value definitions.
To load SPSS SAV files into SPSS, use the SPSS get command.
To read SPSS portable files into SPSS, use the SPSS import command.
Stata
Like the SAS and SPSS formats, Stata's proprietary data file format, which is written by the Stata save command, is platform independent. Our Stata data files include variable labels and often include value labels.
Stata data files should be loaded into Stata with the Stata use command.