Human Development in Chicago Neighborhoods (ICPSR PHDCN)

Glossary of Social Science Terms

The following glossary is provided as a resource for data producers, data librarians, and data users. It is based on a glossary prepared by James Jacobs, formerly at the University of California, San Diego. Some terms were also added from XML Terms: Jargon terms in XML and what they mean.

To supplement this glossary with more terms related to computing, consult the Encyclopedia of Computer Science, Fourth Edition, edited by Anthony Ralston and Edwin D. Reilly.

A

access

The act of making information available. Digital preservation is a requirement for providing long-term access to digital content. Access is "the OAIS entity that contains the services and functions which make the archival information holdings and related services visible to Consumers." OAIS requires that an archive be able to find and deliver digital content to authorized users; delivery may be to an individual or to an access delivery system.

The Collection Delivery Unit is responsible for providing access services, and the digital preservation function preserves the capability to regenerate the DIPs (Dissemination Information Packages) as needed over time.

administration

"The OAIS entity that contains the services and functions needed to control the operation of the other OAIS functional entities on a day-to-day basis." The OAIS Reference Model identifies the policies and other documents that are the responsibility of Administration and are required by an OAIS.

The Administration function is currently provided by Computing and Network Services, which oversees the work of the Data Library staff, in conjunction with the Digital Preservation Officer, who develops requisite policies and guidance for digital preservation operations. Digital preservation policy development at ICPSR is informed by OAIS.

aggregate

(noun) A total created from smaller units. For instance, the population of a county is an aggregate of the populations of the cities, rural areas, etc., that comprise the county.

(verb) To total data from smaller units into a larger unit. Example: "The Census Bureau aggregates data to preserve the confidentiality of individuals."
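
As a rough illustration of aggregation, the following Python sketch totals made-up population counts for the places that make up one county. The place names, the figures, and the aggregate helper are hypothetical and are not drawn from any Census product.

```python
# Hypothetical population counts for the places that make up one county.
place_populations = {
    "City A": 52_000,
    "City B": 18_500,
    "Rural remainder": 7_250,
}


def aggregate(units):
    """Total the values of the smaller units into one larger-unit figure."""
    return sum(units.values())


county_population = aggregate(place_populations)
print(county_population)
```

The county total is aggregate data; the individual place counts it was built from are no longer visible in it.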

aggregate data

Data that have been aggregated. Contrast with microdata.

Archival Information Collection (AIC)

"An Archival Information Package whose Content Information is an aggregation of other Archival Information Packages."

Archival Information Package (AIP)

"An Information Package, consisting of the Content Information and the associated Preservation Description Information (PDI), which is preserved within an OAIS."

The AIP consists of the original files deposited, processed versions of data files and documentation, normalized files, and associated metadata.

Archival Storage

"The OAIS entity that contains the services and functions used for the storage and retrieval of Archival Information Packages."

The Archival Storage function provides onsite and offsite redundancy through online copies (and a tape copy as extra backup) of ICPSR's digital content, both the archival copies and the access copies. ICPSR preserves the ability to regenerate the Dissemination Information Package (DIP); we do not preserve the software-dependent files (e.g., SAS, SPSS, Stata) that are distributed. Archival storage contributes to ensuring business continuity for ICPSR and is a component of the disaster planning at ICPSR.

archive

(noun) A data archive is a site where machine-readable materials are stored, preserved, and possibly redistributed to individuals interested in using the materials. (verb) To place or store in an archive.

ASCII

A character-encoding scheme used by many computers. The ASCII standard uses 7 of the 8 bits in a byte to define the codes for 128 characters. Example: In ASCII, the number "7" is treated as a character and is encoded as: 00110111. Because a byte can have a total of 256 possible values, there are an additional 128 possible characters that can be encoded into a byte, but there is no formal ASCII standard for those additional 128 characters. Most IBM-compatible personal computers do use an IBM "extended" character set that includes international characters, line and box drawing characters, Greek letters, and mathematical symbols. (ASCII stands for American Standard Code for Information Interchange.) See also EBCDIC.

attributes (XML)

XML elements can have attributes that further describe them, such as the following:

<Price currency="Euro">25.43</Price>

In the example above, "currency" is an attribute of "Price", and the attribute's value is "Euro".
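
A minimal sketch of how software might read that attribute, using Python's standard xml.etree.ElementTree module. The variable name price is chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Parse the one-element document used in the example above.
price = ET.fromstring('<Price currency="Euro">25.43</Price>')

print(price.tag)                 # the element name
print(price.attrib["currency"])  # the value of the "currency" attribute
print(price.text)                # the element's text content
```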

B

binary format

Any file format in which information is encoded in some format other than a standard character-encoding scheme. A file written in binary format contains information that is not displayable as characters. Software capable of understanding the particular binary format method of encoding information must be used to interpret the information in a binary-formatted file. Binary formats are often used to store more information in less space than possible in a character format file. They can also be searched and analyzed more quickly by appropriate software. A file written in binary format could store the number "7" as a binary number (instead of as a character) in as little as 3 bits (i.e., 111), but would more typically use 4 bits (i.e., 0111). Binary formats are not normally portable, however. Software program files are written in binary format. Examples of numeric data files distributed in binary format include the IBM-binary versions of the Center for Research in Security Prices files and the U.S. Department of Commerce's National Trade Data Bank on CD-ROM. The International Monetary Fund distributes International Financial Statistics in a mixed-character format and binary (packed-decimal) format. SAS and SPSS store their system files in binary format.
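
The space saving can be sketched with Python's standard struct module. The number 1234567 and the 32-bit little-endian layout are arbitrary choices for illustration; real binary formats each define their own layouts.

```python
import struct

number = 1234567

# Character format: one byte per digit.
as_characters = str(number).encode("ascii")

# Binary format: a 32-bit little-endian signed integer (an assumed layout).
as_binary = struct.pack("<i", number)

print(len(as_characters), as_characters)
print(len(as_binary), as_binary.hex())

# Reading the binary form back requires knowing its layout ("<i" here),
# which is why binary files need software that understands the format.
(decoded,) = struct.unpack("<i", as_binary)
print(decoded)
```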

binary number

A number written using binary notation which only uses zeros and ones. Example: Decimal number 7 in binary notation is: 111.
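
A short Python sketch of converting between decimal and binary notation. The helper names to_binary and from_binary are illustrative only.

```python
def to_binary(n: int) -> str:
    """Write a non-negative integer in binary notation."""
    return format(n, "b")


def from_binary(bits: str) -> int:
    """Read a string of zeros and ones as a binary number."""
    return int(bits, 2)


print(to_binary(7))
print(from_binary("111"))
```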

bit

A bit is the smallest unit of information that a computer can work with. Each bit is either a "1" or a "0". Often computers work with groups of bits rather than one bit at a time; the smallest group of bits a computer usually works with is a byte, which is 8 bits.

Born-digital

A descriptor for information that is created in digital form, as opposed to digitized from analog sources.

The majority of deposits consist of born-digital content. There are some examples of hard copy and analog materials that might be made digital (digitized) by ICPSR. For example, the Data-PASS project is identifying older social science data that include documentation and other components in hard copy format, and there are some deposits that contain video in VHS format.

branching

(See skip pattern.)

Business Continuity

"Describes the processes and procedures an organization puts in place to ensure that essential functions can continue during and after a disaster." [SearchStorage.com] A note regarding preservation: "Backups vs Preservation: Disaster recovery strategies and backup systems are not sufficient to ensure survival and access to authentic digital resources over time. A backup is a short-term data recovery solution following loss or corruption and is fundamentally different to an electronic preservation archive." ["Continued access to authentic digital assets," JISC Digital Preservation Paper, Nov 26, 2006.]

We are addressing business continuity requirements by ensuring redundant backup of the preservation and access copies of ICPSR's digital content, by the establishment of a warm backup for the ICPSR Web server, by identifying our core functions for business continuity, by assessing the current backup and storage measures for our institutional records that support core functions to diminish the risk of loss in most emergency situations, by conducting a self-assessment of our information security program to comply with relevant standards, and by developing the requisite policies and procedures for business continuity.

byte

Eight bits. A byte is simply a chunk of 8 ones and zeros. For example: 01000001 is a byte. A computer often works with groups of bits rather than individual bits and the smallest group of bits that a computer usually works with is a byte. A byte is equal to one column in a file written in character format. Most data files distributed by ICPSR are in character format.
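
To make the example byte concrete, this small Python sketch interprets 01000001 first as a number and then as an ASCII character. The variable names are illustrative.

```python
bits = "01000001"

value = int(bits, 2)                       # the byte read as a binary number
character = bytes([value]).decode("ascii")  # the same byte read as ASCII text

print(len(bits), value, character)
```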

C

Canonical Formats

"In information technology, canonicalization is the process of making something [conform] with some specification... and is in an approved format. Canonicalization may sometimes mean generating canonical data from noncanonical data."[Clifford Lynch, "Canonicalization: A Fundamental Tool to Facilitate Preservation and Management of Digital Information," D-Lib Magazine, September 1999, volume 5, Number 9.]

Canonical formats are widely supported and considered to be optimal for long-term preservation.

card

(See card image.)

card image

(1) Eighty characters of data stored as a single physical record. (2) A file storage format of 80 characters or bytes per record. The card-image format is a remnant of the time when data were literally input on punch cards that had physical limits of 80 characters per card. Usually a case or all the data for a single respondent is stored on several 80-character "cards." Each "card" is numbered and stored in numerical sequence. Cards with the same sequence number (i.e., having a common format for the layout and contents of variables) are called a "deck"; thus cards are often referred to in documentation by their "deck number." Example: "The variable for age is stored in Deck 01 in columns 10-11 and the variable for race is stored in Deck 02 in column 10."
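
The deck-and-column lookup in the example can be sketched in Python as follows. The layout is an assumption made for illustration: case ID in columns 1-4 and deck number in columns 5-6 are hypothetical, as are the sample values; only the age and race locations come from the example above.

```python
CARD_WIDTH = 80


def columns(record: str, first: int, last: int) -> str:
    """Return columns first..last (1-based, inclusive) of a physical record."""
    return record[first - 1:last]


def make_card(case_id: str, deck: str, fields: dict) -> str:
    """Build one 80-column card; `fields` maps a starting column to its text."""
    card = [" "] * CARD_WIDTH
    placed = {1: case_id, 5: deck, **fields}
    for start, text in placed.items():
        card[start - 1:start - 1 + len(text)] = text
    return "".join(card)


cards = [
    make_card("0001", "01", {10: "34"}),  # Deck 01: age in columns 10-11
    make_card("0001", "02", {10: "2"}),   # Deck 02: race in column 10
    make_card("0002", "01", {10: "61"}),
    make_card("0002", "02", {10: "1"}),
]

# Gather the cards belonging to each case into one record per respondent.
cases = {}
for card in cards:
    case_id = columns(card, 1, 4)
    deck = columns(card, 5, 6)
    record = cases.setdefault(case_id, {})
    if deck == "01":
        record["age"] = int(columns(card, 10, 11))
    elif deck == "02":
        record["race"] = columns(card, 10, 10)

print(cases)
```

Note that column numbers in documentation start at 1, while Python slices start at 0, which is why the helper subtracts one from the first column.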

case

In survey research, an individual respondent. Contrast with unit of analysis.

CATI/CAPI

(See Computer-Assisted Telephone [Personal] Interviewing.)

CD-ROM

Compact Disc Read-Only Memory. A storage medium. Data are "stamped" onto the disc during the manufacturing process. The disc is read-only. A variant has appeared that is rewritable, but this variant is not in use for the dissemination of data.

character-encoding scheme

A method of encoding characters including alphabetic characters (A-Z, uppercase and lowercase), numbers 0-9, punctuation and other marks (e.g., comma, period, space, &, *), and various "control characters" (e.g., tab, carriage return, linefeed) using binary numbers. For a computer to print a capital "A" or a number "7" on the computer screen, for instance, we must have a way of telling the computer that a particular group of bits represents an "A" or a "7". There are standards, commonly called "character sets," that establish that a particular byte stands for an "A" and a different byte stands for a "7". The two most common standards for representing characters in bytes are ASCII and EBCDIC.

character format

Any file format in which information is encoded as characters using only a standard character-encoding scheme. A file written in "character format" contains only those bytes that are prescribed in the encoding scheme as corresponding to the characters in the scheme (e.g., alphabetic and numeric characters, punctuation marks, and spaces). A file written in the ASCII character format, for instance, would store the number "7" in eight bits (i.e., one byte): 00110111. A file written in EBCDIC would store the number "7" in eight bits as 11110111. Contrast with binary format.
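
The bit patterns that a character-encoding scheme assigns can be inspected with a short Python sketch. It assumes Python's built-in "cp037" codec (one common EBCDIC code page) as a stand-in for EBCDIC generally; other EBCDIC variants differ for some characters. The helper name bits is illustrative.

```python
def bits(text: str, encoding: str) -> str:
    """Return the encoded bytes of `text` as space-separated 8-bit strings."""
    return " ".join(format(byte, "08b") for byte in text.encode(encoding))


# The same character stored under two different encoding schemes.
print("ASCII :", bits("7", "ascii"))
print("EBCDIC:", bits("7", "cp037"))
```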

character sets

(See character-encoding scheme.)

cleaning

Process to check data for adherence to standards, internal consistency, referential integrity, valid domain, and to replace/repair incorrect data with correct data.

To "clean" a data file is to check for wild codes and inconsistent responses (see consistency check); to verify that the file has the correct and expected number of records, cases, and cards or records per case; and to correct errors found.

code

In most numeric data files, answers to questions are recorded with numbers rather than text, and often even numeric answers are recorded with numbers other than the actual response. The numbers used in the data file are called "codes." Thus, for instance, when a respondent identifies herself as a member of a particular religion, a code of "1" might be used for Catholic, a "2" for Jewish, etc. Likewise, a person's age of 18 might be coded as a 2 indicating "18 or over." The codes that are used and their correspondence to the actual responses are listed in a codebook.
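
A minimal Python sketch of coding and decoding responses. The religion codes and the "18 or over" code follow the examples above; the code 1 for "under 18", the dictionary names, and the code_age helper are assumptions added for illustration.

```python
# Codebook entries: numeric code -> meaning.
RELIGION_CODEBOOK = {1: "Catholic", 2: "Jewish"}
AGE_GROUP_CODEBOOK = {1: "under 18", 2: "18 or over"}  # code 1 is assumed


def code_age(age: int) -> int:
    """Record an actual age as an age-group code rather than the raw number."""
    return 2 if age >= 18 else 1


# What is stored in the data file for one respondent.
respondent = {"religion": 1, "age_group": code_age(18)}

# What an analyst recovers by looking the codes up in the codebook.
decoded = {
    "religion": RELIGION_CODEBOOK[respondent["religion"]],
    "age_group": AGE_GROUP_CODEBOOK[respondent["age_group"]],
}
print(respondent, decoded)
```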

codebook

Generically, any information on the structure, contents, and layout of a data file. Typically, a codebook includes: column locations and widths for each variable; definitions of different record types; response codes for each variable; codes used to indicate nonresponse and missing data; exact questions and skip patterns used in a survey; and other indications of the content of each variable. Many codebooks also include frequencies of response. Codebooks vary widely in quality and amount of information included.

Codec

"A codec is the means by which sound and video files are compressed for storage and transmission purposes. There are various forms of compression: 'lossy' and 'lossless', but most codecs perform lossy compression because of the much larger data reduction ratios that occur [with lossy compression]. Most codecs are software, although in some areas codecs are hardware components of image and sound systems. Codecs are necessary for playback, since they uncompress [or decompress] the moving image and sound files and allow them to be rendered."

ICPSR will have to specify which type of codec it would like to use in creating digital files of video materials. Preferred codecs can change as frequently as preferred file formats; it will be important to conduct current research to know which codecs are most appropriate.

column

In a data file, a single vertical position one byte wide. Fixed format data files are traditionally described as being arranged in lines and columns. In a fixed format file, column locations describe the locations of variables.

column location

The precise location in a data file of a variable expressed in column numbers, beginning with the first column in a physical record as column number 1.

Common Services

"The supporting services such as inter-process communication, name services, temporary storage allocation, exception handling, security, and directory services necessary to support the OAIS."

Computing and Network Services (CNS) provides or acquires requisite services to provide Common Services to meet the requirements of digital preservation.

compression

A method of reducing the size of computer files. There are several compression programs available, such as gzip and WinZip.

Compression ratio or reduction ratio

The ratio of the quantity of original data to the quantity of data after compression.
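
As a rough sketch, the ratio can be computed in Python with the standard gzip module. The sample data are made up and highly repetitive, so the ratio shown is far better than typical files would achieve.

```python
import gzip

# Hypothetical, highly repetitive sample data (12,000 bytes).
original = b"0001 34 2 1\n" * 1000

compressed = gzip.compress(original)
ratio = len(original) / len(compressed)

# Decompression restores the data exactly (gzip is lossless).
restored = gzip.decompress(compressed)

print(len(original), len(compressed), round(ratio, 1))
```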

Computer-Assisted Telephone Interviewing (CATI)/Personal Interviewing (CAPI)

A method of coding information from telephone or personal interviews directly into a computer during the interview. CATI/CAPI software usually has built-in consistency checks, will not allow wild codes to be entered, and automatically prompts the interviewer for correct skip pattern questions.

consistency check

A process of data cleaning that eliminates inappropriate responses to branched questions. For instance, one question might ask if the respondent attended church last week; a response of "no" should indicate that questions about church attendance should be coded as "inapplicable." If those questions were coded any other way than "inapplicable," this would be inconsistent with the skip patterns of the survey instrument.
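
The church-attendance check could be sketched in Python as follows. The numeric codes (2 for "no", 9 for "inapplicable"), the field names, and the sample cases are all hypothetical.

```python
NO = 2            # hypothetical code for "did not attend last week"
INAPPLICABLE = 9  # hypothetical code for "inapplicable"


def inconsistent_cases(cases):
    """Return IDs of cases that answered "no" to the screening question
    but whose follow-up question is coded as anything other than inapplicable."""
    return [
        case["id"]
        for case in cases
        if case["attended"] == NO and case["services_attended"] != INAPPLICABLE
    ]


cases = [
    {"id": 1, "attended": 1, "services_attended": 2},
    {"id": 2, "attended": 2, "services_attended": 9},
    {"id": 3, "attended": 2, "services_attended": 1},  # violates the skip pattern
]
print(inconsistent_cases(cases))
```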

Consumer

"The role played by those persons, or client systems, who interact with OAIS services to find preserved information of interest and to access that information in detail. This can include other OAISs, as well as internal OAIS persons or systems."

Member institutions and other users are the Consumers of ICPSR digital assets.

control cards

(See setup files.)

cross-sectional study

In survey research, a study in which data from particular subjects are obtained only once. Contrast with longitudinal studies, in which a panel of individuals is interviewed repeatedly over a period of time. Note that questions in a cross-sectional study can apply to previous time periods.

D

DAT

Digital Audio Tape. A high-density storage medium.

Data

For social science, data is generally numeric files originating from social research methodologies or administrative records, from which statistics are produced.

At ICPSR, the majority of digital content matches this definition of data. ICPSR's collections are expanding to include audio, video, geospatial, Web-based and other digital content that pertains to social science research.

data definition statements

(See setup files.)

Data Documentation Initiative (DDI)

An effort to develop a specification for documenting data files in XML. The DDI Alliance is the organization that created the specification, though "DDI" is often used to refer to the actual DTD created by the DDI Committee. More information can be found on the DDI website.

data entry

The process of converting verbal or written responses to electronic form.

Data Management

"The OAIS entity that contains the services and functions for populating, maintaining, and accessing a wide variety of information. Some examples of this information are catalogs and inventories on what may be retrieved from Archival Storage, processing algorithms that may be run on retrieved data, Consumer access statistics, Consumer billing, Event Based Orders, security controls, and OAIS schedules, policies, and procedures."

The pipeline incorporates a diagram and visualization of the Data Management function of OAIS for ICPSR. The increasingly comprehensive Oracle system provides Data Management services and content defined in OAIS, including information from the Deposit Form, the Study Tracking System, the metadata record, the current data library system, the growing preservation system, the turnover system, and other components of the lifecycle as they are automated. The process improvement initiative is reviewing and revising the lifecycle process at ICPSR.

Data Processing

Within the field of information technology, data processing typically means the processing of information by machines.

Data processing is defined by procedures designed to make a data collection easier to use, ensure its accuracy, enhance its utility, optimize its format, protect confidentiality, etc. For archival purposes, the process and results of data processing must be systematically and comprehensively captured so that the process applied to the data is transparent to users.

dataset

Or "data set." A collection of data records. In the SAS statistical software, a "SAS data set" is the internal representation of data.

DDI

(See Data Documentation Initiative.)

DDI instance

An XML document marked up according to the DDI DTD. In other words, a codebook or catalog record marked up in DDI-compliant XML.

deck

(See card image.)

decompression

The process of restoring data to uncompressed form after compression.

Designated Community

An OAIS concept describing the constituency for which the archived information should be relevant and understandable.

The Designated Community includes depositors (Producers) and users (Consumers) who are typically members of the social science research community or extensions of that community, e.g., data librarians, digital archivists.

dictionary file

A special form of machine-readable codebook that contains information about the structure of a data file and the locations and, often, the names of variables in the data file. Typically, a researcher uses a dictionary file and a data file together with statistical software; the statistical software uses the dictionary to specify variables by name, rather than specifying their locations in the file.

Digital Curation

"Digital curation is all about maintaining and adding value to a trusted body of digital information for future and current use; specifically, the active management and appraisal of data over the entire life cycle. Digital curation builds upon the underlying concepts of digital preservation whilst emphasizing opportunities for added value and knowledge through annotation and continuing resource management. Preservation is a curation activity, although both are concerned with managing digital resources with no significant (or only controlled) changes over time."

Digital curation is a fairly new term. Curation of social science research data has always been the mission and purpose of ICPSR, if not the term used to describe what we do. ICPSR is formalizing its data stewardship services at the University of Michigan and for member institutions.

Digital Preservation

A term that encompasses all of the activities required to ensure that the digital content designated for long-term preservation is maintained in usable formats, for as long as access to that content is needed or desired, and can be made available in meaningful ways to current and future users.

Digital preservation is a distributed function that includes the Digital Preservation Officer, who develops and promulgates requisite policies that reflect prevailing standards and practice in the digital preservation community, and Computing and Network Services, which oversees the archival storage function and the day-to-day operations of digital preservation and develops tools and procedures to perform digital preservation activities and meet archival requirements.

Digital Videotape Formats

"A related family of open bitstream encoding formats for recording digital video on physical media (tapes, hard disks) through digital video devices (digicams, camcorders)." Currently DV, DVCAM, and DVCPRO are the most widely used digital videotape formats.

These digital video formats are different and distinct from the digital video file formats that will comprise the main thrust of ICPSR's digital video preservation program. However, it is likely that many depositors will use these formats and ICPSR must be prepared to convert them. This will be somewhat challenging because it can be difficult to transcode these formats to data files.

Disclosure Limitation

Procedures undertaken to limit the risk of disclosure of individual identities in data files.

The techniques used for disclosure limitation include data masking, recoding, topcoding, swapping, and perturbation (see other ICPSR sources for definitions of these terms). Like data processing, the process and results of disclosure limitation need to be systematically, comprehensively, and transparently documented for users.

Dissemination Information Package (DIP)

"The Information Package, derived from one or more AIPs, received by the Consumer in response to a request to the OAIS." An archive works with Consumers over time to ensure that DIPs remain useful.

The DIPs are the access copies of files (data, documentation, supporting files, and related metadata) that are made available to users by download via the ICPSR website; by CD via the mail, for a subset of files that require a user agreement; or in the ICPSR data enclave onsite, for files that contain sensitive information and cannot otherwise be made available.

Document Type Definition (DTD)

A set of rules that applies SGML (Standard Generalized Markup Language) or XML (eXtensible Markup Language) to the markup of documents of a particular type. A DTD provides a list of the elements, attributes, comments, notes, and entities that may be used in the document, as well as their relationships to one another.

Documentation

Generically, any information on the structure, contents, and layout of a data file. Sometimes called "technical documentation" or "a codebook". Documentation may be considered a specialized form of metadata.

Documentation has arrived in a wide array of formats since the establishment of ICPSR in 1962. To meet preservation requirements, documentation must be complete, correct, comprehensive, current, and compliant (to content and preservation standards). ICPSR produces documentation that conforms with the Data Documentation Initiative (DDI). (See the DDI website for current information about the version and current status of DDI.) As an XML-based format, DDI provides a preferred preservation format for documentation.

downloading

Downloading is the transmission of a file from one computer system to another, usually to a smaller computer system. From the Internet user's point-of-view, to download a file is to request it from a Web page on another computer and to receive it.

DTD

(See Document Type Definition.)

E

EBCDIC

A character-encoding scheme used by IBM mainframe computers and some other computers. Unlike ASCII, the EBCDIC standard specifies use of the entire 8 bits of each byte. Example: In EBCDIC the number "7" is treated as a character and is encoded as: 11110111. (EBCDIC stands for Extended Binary Coded Decimal Interchange Code.)

element (XML)

An XML element is the central building block of any XML document. Example of XML elements:

<book>
  <chapter>
    <title>The Beginning</title>
    <intro>blah blah blah...</intro>
  </chapter>
</book>

In the example above, book, chapter, title, and intro are elements.
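
A brief Python sketch, using the standard xml.etree.ElementTree module, that lists the elements in that document in the order they appear. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

document = """<book> <chapter> <title>The Beginning</title> <intro>blah blah blah...</intro>
</chapter> </book>"""

root = ET.fromstring(document)

# Walk the tree in document order, starting with the root element.
element_names = [element.tag for element in root.iter()]
print(element_names)

# Elements nest: title is reached through chapter, which sits inside book.
title = root.find("chapter/title")
print(title.text)
```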

export file

(See portable file.)

extract

To select or copy a subset of data from one or various data sources, using filters and/or a selection of variables.

F

file

A collection of any form of data that is stored, usually on a computer disk or tape.

fixed format

A file structure consisting of physical records of a constant size within which the precise location of each variable is based on the column location and width of the variable. Most data from ICPSR are distributed in a fixed format, and codebooks are used to specify the column location and width of each variable. Contrast with free format.
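
Reading a fixed format record from a codebook-style layout might look like the following Python sketch. The variable names, column locations, widths, and sample records are hypothetical.

```python
# Hypothetical codebook layout: variable -> (starting column, width).
LAYOUT = {
    "case_id": (1, 4),
    "age": (5, 2),
    "sex": (7, 1),
}


def parse_fixed(record: str, layout: dict) -> dict:
    """Slice each variable out of a record by column location and width.
    Columns are numbered from 1, as in codebooks."""
    return {
        name: record[start - 1:start - 1 + width]
        for name, (start, width) in layout.items()
    }


records = ["0001342", "0002611"]
for record in records:
    print(parse_fixed(record, LAYOUT))
```

Because nothing in the file separates one variable from the next, the layout is the only way to tell where each variable begins and ends.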

flat file

(See rectangular file.)

free format

A physical file structure that specifies the order of variables in a file and delimits them from each other by a special character or characters (usually a blank or other white space). Free format files may have variable physical record lengths; when they do, they are typically delimited by a newline character at the end of each line. Contrast with fixed format.
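
By contrast with column-based reading, a free format record can be split on its delimiter. This Python sketch assumes whitespace-delimited records and a hypothetical variable order.

```python
# Hypothetical order of variables in each record.
VARIABLES = ["case_id", "age", "sex"]


def parse_free(line: str) -> dict:
    """Split a whitespace-delimited record and pair values with variable names."""
    return dict(zip(VARIABLES, line.split()))


# Records may differ in length; only the order of values matters.
lines = ["1 34 2", "2   61 1"]
for line in lines:
    print(parse_free(line))
```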

frequencies

Also called "marginals." In survey research, the number of respondents who responded to each of the possible answers to a question. Often codebooks list the frequency of response to each question.
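
Frequencies for a single question can be tallied with Python's collections.Counter. The response codes below are made up for illustration.

```python
from collections import Counter

# Hypothetical coded responses to one survey question.
responses = [1, 2, 2, 1, 3, 2, 9]

frequencies = Counter(responses)

# List each response code with the number of respondents who gave it.
for code in sorted(frequencies):
    print(code, frequencies[code])
```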

frequency file

A file that contains the frequencies for each question in a survey.

FTP

File Transfer Protocol. A reliable method of transferring files electronically over the Internet.

H

header record

A record that denotes the beginning of a series of records and describes the contents of the records that follow. For example, the International Financial Statistics has a header record that describes a time series, and the header record is followed by a number of records containing the actual time series. Header records are often used when the number of physical records needed to contain the data for a particular variable is not constant for all variables. For instance, in economic time series data files, one variable might be a time series that contains 20 years of data and fills 20 physical records while the next variable might contain a time series with only 5 years of data and fill only 5 physical records. The use of a file format that includes header records enables one to determine where the series begin and end.
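
The idea can be sketched in Python with an invented layout: a header record of the form "H name count" followed by that many "year value" records. This layout and the sample figures are hypothetical and are not the actual International Financial Statistics format.

```python
def read_series(lines):
    """Parse header-delimited series into {name: [(year, value), ...]}."""
    series = {}
    records = iter(lines)
    for header in records:
        tag, name, count = header.split()
        if tag != "H":
            raise ValueError(f"expected a header record, got: {header!r}")
        # The header says how many data records belong to this series.
        observations = []
        for _ in range(int(count)):
            year, value = next(records).split()
            observations.append((int(year), float(value)))
        series[name] = observations
    return series


lines = [
    "H GDP 3",
    "2001 10.5",
    "2002 11.0",
    "2003 11.8",
    "H CPI 2",
    "2002 1.9",
    "2003 2.3",
]
parsed = read_series(lines)
print(parsed)
```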

hierarchical file

A hierarchical file is one that contains information collected on multiple units of analysis in different record types. For example, the physical housing structure may be one unit, and individual persons within the structure are another. An example is the Current Population Survey: Annual Demographic File which has household, family, and person units of analysis. Studies that include data for different units of analysis often link those units to each other so that, for instance, one can analyze the persons as they group in a structure. Such studies are sometimes referred to as having a relational structure.

hierarchical file structure

A format for storing hierarchical files. Each unit of analysis has its own record structure or record type. Records for different units of analysis do not necessarily have the same number of bytes or characters. In order to give such a file a common physical record length, short logical records are typically "padded" with blanks so that they will all be the same physical record length. A hierarchical file can also be stored in a rectangular file. For instance, early data from the Survey of Income and Program Participation are distributed both ways; users can choose the format they prefer. Typically, the hierarchical file structure is more space-efficient but more difficult to use.
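
A small Python sketch of reading a hierarchical file and flattening it into rectangular form. The record types ("H" for household, "P" for person), the whitespace-delimited fields, and the rule that person records follow their household record are all assumptions made for illustration.

```python
def read_hierarchical(lines):
    """Group person records under the household record that precedes them."""
    households = []
    for line in lines:
        record_type, *fields = line.split()
        if record_type == "H":
            households.append({"id": fields[0], "area": fields[1], "persons": []})
        elif record_type == "P":
            # Assumes each person record follows its own household record.
            households[-1]["persons"].append({"age": int(fields[1])})
    return households


lines = [
    "H 001 urban",
    "P 001 34",
    "P 001 61",
    "H 002 rural",
    "P 002 25",
]
households = read_hierarchical(lines)

# Rectangular form: one row per person, household values repeated on each row.
rows = [(h["id"], h["area"], p["age"]) for h in households for p in h["persons"]]
print(rows)
```

The rectangular rows are easier to analyze but repeat the household values for every person, which is the space cost the definition mentions.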

HTML

HyperText Markup Language. A hypertext document format based on SGML used on the Web. Tags are embedded in the text to control display and presentation of a document.

I

Ingest

"The OAIS entity that contains the services and functions that accept Submission Information Packages from Producers, prepares Archival Information Packages for storage, and ensures that Archival Information Packages and their supporting Descriptive Information become established within the OAIS."

Ingest covers the lifecycle stages of selection and appraisal (based on the collection development policy and criteria), acquisition (with the Deposit Form serving as a Submission Agreement), and processing (quality control) followed by the generation of the AIP for Archival Storage.

J

Java servlet

Technology that provides Web developers with a simple, consistent mechanism for extending the functionality of a Web server and for accessing existing business systems. A servlet can be thought of as an applet that runs on the server side -- without a face.

L

line

Often used synonymously with physical record. Thus, it means the same as card in a card-image data file, and it means the same as logical record length in data files that have a logical record length format. In general, a "line" in data file terminology refers to a physical unit of data that the computer reads and processes, one at a time. In DOS and UNIX environments, most statistical software expects "lines" to end with a newline character, but most statistical software can be configured to read a specific number of bytes as a "line" regardless of the presence or absence of a newline character.

logical record

All the data for a given unit of analysis. It is distinguished from a physical record because it may take several physical records to store all the data for a given unit of analysis. For instance, in card image data, a "card" is a physical record and it usually takes several "cards" to store all the information for a single case or unit of analysis.

logical record length

Abbreviated "LRECL". (1) A file storage format in which the length of a logical record is equal to the length of a physical record (which is constant). Thus, when the data for each case or unit of analysis are stored in a single physical record, the file structure is called "logical record length." Typically, logical record length format is more space-efficient than card image. (2) The length, in bytes (i.e., columns), of a logical and physical record in a logical record length-formatted file.

longitudinal study

In survey research, a study in which the same group of individuals is interviewed at intervals over a period of time. See also panel study. Note that some cross-sectional studies are done regularly. For instance, the General Social Survey and the Current Population Survey: Annual Demographic File are conducted once a year, but different individuals are surveyed each time. Such a study is not a true longitudinal study. An example of a longitudinal study is the National Longitudinal Survey of Labor Market Experience, in which the same individuals have been followed over time.

M

Management

"The role played by those who set overall OAIS policy as one component in a broader policy domain."

The Director, the Digital Preservation Officer, and the Director's Group perform the role of Management in the OAIS context, with input from the ICPSR Advisory Council and its approval of the highest-level policies.

margin of error

A measurement of the accuracy of the results of a survey. Example: A margin of error of plus or minus 3.5 percentage points, at the conventional 95% confidence level, means that there is a 95% chance that the responses of the target population as a whole would fall somewhere between 3.5 points more and 3.5 points less than the responses of the sample (a 7-point spread).
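The glossary does not give a formula, so the following is only a sketch using the common normal-approximation formula for a proportion from a simple random sample. The function name, the default proportion of 0.5, and the sample size of 784 are assumptions for illustration.

```python
import math


def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Approximate margin of error for a proportion from a simple random sample.

    n: sample size; p: assumed proportion (0.5 is the most conservative);
    z: critical value (1.96 corresponds to 95% confidence).
    """
    return z * math.sqrt(p * (1 - p) / n)


# Under these assumptions, a sample of 784 respondents gives
# roughly plus or minus 3.5 percentage points.
moe = margin_of_error(784)
print(f"{moe:.3f}")
```

Surveys with complex sample designs generally need design-specific adjustments that this sketch ignores.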

marginals

(See frequencies.)

markup

The characters and codes that change a text document into an XML or other Markup Language document. This includes the < and > characters as well as the elements and attributes of a document.

metadata

A term that refers to structured data about data. Metadata is an old concept (e.g., card catalogs and indexes), and it is often essential for digital content to be useful and meaningful. Metadata can capture general or specific information about digital content that may define administrative, technical, or structural characteristics of the digital content. "Preservation metadata" is the term for a broader set of metadata that documents the lifecycle of digital content from creation through processing, storage, preservation, and use over time. Preservation metadata is required at the aggregate level (e.g., collection and study) and at the item level (e.g., file and variable). For example, all preservation actions that are applied to digital content over time should be captured in preservation metadata. The Preservation Metadata Implementation Strategies (PREMIS) data dictionary is a digital preservation community development that is moving towards being a standard. There are additional format-specific standards (e.g., the NISO Still Image data dictionary) and other standards that define additional metadata for preservation.

We prepare a metadata record for each data collection, and we present a searchable database of metadata records on our public website. ICPSR has defined a set of file-level metadata elements for preservation and intends to comply with PREMIS as it develops. The process improvement initiative at ICPSR includes the identification of metadata at each stage of the pipeline.

microdata

Microdata files are those that contain information on individuals rather than aggregate data. The U.S. Census Bureau's "Summary Files" contain aggregate data and consist of totals of individuals with various specified attributes in a particular geographic area. They are, in a sense, tables of totals. The Bureau's PUMS (Public Use Microdata Sample) files, however, contain the data from the original census survey instrument with certain information removed to protect the anonymity of the respondent.

missing data

Missing data values are assigned when the information being collected is missing, which can happen for several reasons -- among them, the respondent refused to answer or the question was inapplicable for that particular respondent because of previous responses.

N

newline character

One or two bytes that denote the end of a line. In UNIX, a newline character is one byte: a linefeed. In the classic Macintosh OS, a newline character is one byte: a carriage return. In DOS, a newline is two bytes: a carriage return followed by a linefeed.
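A minimal sketch of how a program might tell which convention a file uses: a linefeed, a carriage return, or the two-byte carriage return plus linefeed used by DOS and Windows. The function name and the labels it returns are illustrative assumptions.

```python
def detect_newline(data: bytes) -> str:
    """Report which end-of-line convention appears first in the raw bytes."""
    cr = data.find(b"\r")
    lf = data.find(b"\n")
    if cr == -1 and lf == -1:
        return "none"
    if cr != -1 and lf == cr + 1:
        return "CRLF (DOS/Windows)"
    if lf != -1 and (cr == -1 or lf < cr):
        return "LF (UNIX)"
    return "CR (classic Macintosh)"
```

The file must be read in binary mode for this to work, since text mode in many languages translates line endings before the program sees them.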

NFS

Network File System. A process for mounting magnetic disks on a network so that disks not physically attached to a computer can be accessed as if they were physically attached.

Normalization

In a preservation context, normalization refers to a preservation strategy that involves the imposition of standard formats and rules to create preservable file formats. Normalization has specific connotations within the database (e.g., normalized tables), the Web (e.g., normalized URLs), and other communities, but the essence of the term is to standardize for more effective processing and exchange of information.

We use normalization as a preservation strategy. We convert deposited files from their original format to an accepted preservation format as needed. Both the original file and the normalized file are retained.

O

OAIS

The Open Archival Information System (OAIS) Reference Model, an ISO standard that formally expresses the roles (producer, management, consumer, and implicitly archives), functions (common services, ingest, archival storage, data management, administration, preservation planning, and access), and content (submission information package, archival information collection, archival information package, and dissemination information package) of an archive. It was approved as an ISO standard in 2003. OAIS is undergoing a five-year review in 2007.

The digital preservation policies, program, system, and function are being developed in conformance with OAIS.

operating system

The special software required to make a computer work. It provides the link between the user and the hardware. Popular operating systems include DOS, MacOS, VMS, VM, MVS, UNIX, and OS/2. (Note that "Windows 3.x" is not an operating system as such, since it must have DOS to work, while Windows NT and Windows 98 are operating systems.)

OSIRIS

Statistical software similar to SPSS and SAS with strong data management features. In the past ICPSR distributed many studies in OSIRIS format with special machine-readable codebooks and dictionary files readable by the OSIRIS software. (Note: OSIRIS has been officially decommissioned by its sponsor, the Institute for Social Research, University of Michigan.)

OSIRIS codebook

A machine-readable codebook written in binary format for use with OSIRIS software.

OSIRIS dictionary

A machine-readable data dictionary usable with OSIRIS software. ICPSR distributes only "Type 1" OSIRIS dictionaries, which are in a binary format and must be written in EBCDIC. OSIRIS "Type 5" dictionaries are character format files.

P

packed decimal

A method of encoding two pieces of information in a single byte. For instance, instead of storing a digit in one byte and a sign in another byte using a traditional character encoding scheme, a packed decimal format might use a binary number to indicate the value of the digit in 4 bits of the byte and a code indicating whether the digit is positive or negative in the other 4 bits. The International Monetary Fund distributes data in packed decimal format.
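Packed decimal layouts vary. As a sketch only, the Python below decodes one widely used layout (the IBM mainframe style), in which each 4-bit half of a byte holds one digit and the final half-byte holds the sign. This layout is an assumption and may not match the files of any particular distributor.

```python
def unpack_decimal(data: bytes) -> int:
    """Decode an IBM-style packed decimal field into a Python int.

    Assumed layout: each 4-bit half of a byte holds one digit, except the
    last half-byte, which holds the sign. 0xD is treated as negative; any
    other sign code (such as 0xC or 0xF) is treated as positive here.
    """
    if not data:
        raise ValueError("empty field")
    nibbles = []
    for byte in data:
        nibbles.append(byte >> 4)      # high 4 bits
        nibbles.append(byte & 0x0F)    # low 4 bits
    *digits, sign = nibbles
    if any(d > 9 for d in digits):
        raise ValueError("invalid digit nibble")
    value = 0
    for d in digits:
        value = value * 10 + d
    return -value if sign == 0x0D else value
```

Under this layout the two bytes 0x12 0x3C hold the three digits 1, 2, 3 and a positive sign, so a three-digit signed number fits in two bytes rather than four characters.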

panel

A group of individuals who are interviewed more than once over time in a longitudinal survey.

panel study

A longitudinal study in which a panel of individuals is interviewed at intervals over a period of time. In general usage, the definitions of longitudinal study and panel study overlap. At least one author says that the term "panel study" is sometimes used for studies that are restricted to a short period of time or are limited to two or three interviews, and "longitudinal study" is used for studies that last longer or include more interviews; but there are significant examples where this distinction is not accurate. In general, longitudinal studies involve panels of respondents and panel studies are longitudinal studies. Examples of panel studies include the Survey of Income and Program Participation (SIPP) and the Panel Study of Income Dynamics (PSID).

parser

An algorithm or program to determine the syntactic structure of a sentence or string of symbols in some language. Essentially, a program that analyzes the structure of text, looking for particular patterns and extracting/editing based upon pre-established rules.

PDF

(See Portable Document Format.)

physical record

A segment of data that has a specified and constant size in bytes or that is clearly delimited from other records by a newline character or sector of a disk or other means identifiable to a computer program reading the file. For example, a card-image data file has physical records of 80 bytes each, by definition. In a file in logical record length structure, each physical record is the same number of bytes in length as the "logical record length." See also line.

PI

An abbreviation for principal investigator.

Pipeline

In computer science, pipeline processing is "a category of techniques that provide simultaneous, or parallel, processing within the computer. It refers to overlapping operations by moving data or instructions into a conceptual pipe with all stages of the pipe processing simultaneously. For example, while one instruction is being executed, the computer is decoding the next instruction." The term pipeline calls to mind the assembly line approach in manufacturing.

The pipeline refers to the flow of digital content from reception through processing to public release with embedded preservation milestones.

portable file

In computer usage, a file or program is "portable" if it can be used by a variety of software on a variety of hardware platforms. SPSS portable files can be produced using the "export" command.

Portable Document Format (PDF)

A universal file format that retains the page layout, typography, and graphics of the original document and can be viewed, printed, and searched with viewer software such as Adobe Acrobat.

Preservation Planning

The OAIS entity that "provides the services and functions for monitoring the environment of the OAIS and providing recommendations to ensure that the information stored in the OAIS remains accessible to the Designated User Community over the long term, even if the original computing environment becomes obsolete. Preservation Planning functions include evaluating the contents of the archive and periodically recommending archival information updates to migrate current archive holdings, developing recommendations for archive standards and policies, and monitoring changes in the technology environment and in the Designated Community's service requirements and Knowledge Base. Preservation Planning also designs IP templates and provides design assistance and review to specialize these templates into SIPs and AIPs for specific submissions. Preservation Planning also develops detailed Migration plans, software prototypes and test plans to enable implementation of Administration migration goals."

The Digital Preservation Officer is primarily responsible for Preservation Planning with the programming and technical infrastructure support of Computing and Network Services (CNS).

principal investigator

The person or organization responsible for a study; equivalent to "author" in bibliographic citations.

Producer

"The role played by those persons, or client systems, who provide the information to be preserved. This can include other OAISs or internal OAIS persons or systems."

The Producer includes principal investigators, project managers, federal agencies, other data archives, and others; it is anyone who authorizes (or requires) ICPSR to preserve digital content.

R

recode

Often a data analyst or data producer will produce new data values from raw data and include these in a data file; this process is called "recoding." For instance, an age variable might contain a respondent's actual age in years, but this information might be "recoded" to produce a new variable, "eligible voter," with a code of "1" for all those 18 and over and a code of "2" for all those under 18.
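The "eligible voter" recode described above can be sketched in a few lines of Python. The function and variable names are illustrative; statistical packages provide their own recode commands.

```python
def recode_eligible_voter(age: int) -> int:
    """Recode age in years: 1 = 18 and over, 2 = under 18."""
    return 1 if age >= 18 else 2


# Hypothetical raw ages for four respondents.
ages = [17, 18, 42, 5]
eligible = [recode_eligible_voter(a) for a in ages]
```

The original age variable is left untouched, so the recode can be revised later without returning to the raw data.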

record

Depending on the context, "record" may refer to a physical record or a logical record. See also line.

record length

Depending on the context, the length in bytes (i.e., columns) of a physical record or a logical record.

record type

A record that has a consistent logical structure. In files that include different units of analysis, for instance, different record types are needed to hold the different variables. For example, one record type might have a variable for income in one column and another record type might have a variable for household size in that same column. The codebook will describe these different structures and how to determine which is which so that the user can tell statistical software how to interpret that particular column as income or household size.
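As a sketch of how a program might handle record types, the Python below reads the same columns differently depending on a type code. The layout is hypothetical: column 1 holds "H" for a household record or "P" for a person record, and columns 2-7 hold household size or income respectively. A real codebook would define the actual codes and columns.

```python
def parse_record(line: str) -> dict:
    """Interpret columns 2-7 according to the record type in column 1."""
    record_type = line[0]
    value = int(line[1:7])
    if record_type == "H":
        return {"type": "household", "household_size": value}
    if record_type == "P":
        return {"type": "person", "income": value}
    raise ValueError(f"unknown record type: {record_type!r}")


# Hypothetical file: one household record followed by two person records.
rows = [parse_record(line) for line in ["H000004", "P052000", "P031500"]]
```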

rectangular file

A physical file structure. A rectangular file is one that contains the same number of card images or the same physical record length for each respondent or unit of analysis. Contrast with hierarchical files.

relational structure

A study that includes different units of analysis, particularly when those units are not arranged in a strict hierarchy as they are in a hierarchical file, has a relational structure. Note that the data could be arranged in several different physical structures to handle such a data structure. For instance, each unit of analysis might be stored in a separate rectangular file with identification numbers linking each case to the other units; or, the different units of analysis might be stored in one large file with a hierarchical file structure; or the different units could be stored in a special database structure used by a relational database management system such as INGRES. An example of a study with a relational structure is the Survey of Income and Program Participation, which has eight or more record types; these record types are related to each other but are not all members of a hierarchy of membership. For instance, there are record types for household, family, person, wage and salary job, and general income amounts.

respondent

In survey research, the person responding to the survey questions.

response rate

The ratio of returned questionnaires to the survey's designed universe, expressed as a percentage.

Response Rate = (Returns / Designed Universe) x 100

The denominator is the number of designed subjects, whether a sample or a population, that was approached by mail, telephone, or other channel of investigation. The numerator is the number of actual responses.
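The calculation is simple arithmetic; a minimal Python sketch follows. The function name and the sample figures (600 returns from 1,000 designed subjects) are assumptions for illustration.

```python
def response_rate(returns: int, designed_universe: int) -> float:
    """Response rate as a percentage of the designed universe."""
    if designed_universe <= 0:
        raise ValueError("designed universe must be positive")
    return 100 * returns / designed_universe


rate = response_rate(returns=600, designed_universe=1000)
```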

response codes

Typically responses to questions are "coded" by assigning numeric codes to each possible response. Thus, a "yes" might be coded "1" and a "no" "2"; female respondents might be indicated "1" and male respondents "2"; each state or county might be assigned a numeric code.

Restricted-Use Data

Data that contain sensitive information (usually about human subjects) that could permit the identification of individuals.

To obtain access to these data through ICPSR, a user must complete a legal contract or in some cases travel to where the data are stored. The presence of sensitive information in deposited digital content presents a management challenge for long-term preservation: for example, archival storage arrangements that achieve distributed redundancy must also address confidentiality requirements.

S

secondary analysis

The process of reexamining existing data to address new questions or use methods not previously employed.

setup files

A character format file written in a statistical software language (SAS, SPSS, Stata, etc.) describing a data file. These files are useful because they provide variable locations, names, and labels. Software-specific code must be added to perform analysis.

SGML

Standard Generalized Markup Language. A generic language for document representation. SGML is an international standard that describes the relationship between a document's content and its structure.

skip pattern

In survey research, the sequence of questions asked and skipped. For instance, if a respondent answers a question that indicates he did not vote in the last election, his data record should "skip" items regarding how he voted in the last election.

study

All the information collected at a single time or for a single purpose or by a single principal investigator. A study consists of one or more files. Examples: the General Social Survey; a Gallup Poll; the 1990 Census of Population and Housing STF 1A.

stylesheet

A document that explains the rules governing how another document (or group of documents) should be displayed. For HTML pages, stylesheets are written as Cascading Style Sheets (CSS); for XML files, we use XSLT.

Submission Information Package (SIP)

"An Information Package that is delivered by the Producer to the OAIS for use in the construction of one or more AIPs."

The SIP includes the original files and associated metadata and documentation, including information provided on the ICPSR Deposit Form.

system file

A generic term for the native or internal storage format used by statistical software. When statistical software reads a "raw" character format data file consisting of ASCII or EBCDIC characters, it must read each byte in sequence. It can be more efficient in its storage, retrieval, and calculations by storing a data file in a special binary format called a system file. Typically, a system file for one brand of software cannot be read by another brand of software or by the same brand on another hardware platform. Some software is capable of creating a portable file that can then be read by other software or on other platforms.

T

tag library

A collection of documents explaining the correct way to tag documents in XML for a particular DTD. A tag library goes beyond the basic rules of the DTD in that it provides pointers on what is considered "best practice."

tags

Fragments of text used to organize content, usually delimited in a set format. Example of XML tags:

<book>
  <chapter>
    <title>The Beginning</title>
    <intro>blah blah blah...</intro>
  </chapter>
</book>

In the example above, book, chapter, title, and intro are tags. They do not convey content, but rather the context of the content. The < and > are used to signify what is a tag and what is content.
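To illustrate the distinction between tags and content, the sketch below parses the same example with Python's standard xml.etree.ElementTree module and lists the tag names separately from the text they enclose. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

document = """
<book>
  <chapter>
    <title>The Beginning</title>
    <intro>blah blah blah...</intro>
  </chapter>
</book>
"""

root = ET.fromstring(document.strip())

# Tag names describe the context of the content...
tag_names = [element.tag for element in root.iter()]

# ...while the text inside an element is the content itself.
title_text = root.find("chapter/title").text
```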

text file

In computer usage, any file written in pure character format. Sometimes called a "plain text file."

time series

Observations of a variable made over time. Many economic studies, such as IMF's International Financial Statistics, are time series data files. Time series, of a sort, can also be constructed from a cross-sectional study if the same questions are asked more than once over time. See also longitudinal study.

U

undocumented code

(See wild code.)

unit of analysis

The basic observable entity being analyzed by a study and for which data are collected in the form of variables. Although a unit of analysis is sometimes referred to as the case or "observation," these are not always synonymous. For instance, in public opinion polls, the unit of analysis is usually a single person and the answers to the survey questions by one person constitute a "case." In a census, however, a "case" could be considered the household because all the data for one household are collected on one survey instrument; the household "case" may contain different variables for the different units of analysis: a physical housing structure, a family within the structure, a person within the family.

V

valid (XML)

An XML document that is verified correct against a DTD or schema. The process of checking to be sure that a document is valid is called validation. Note this is more stringent than simply verifying that the document is well-formed.

variable

In social science research, for each unit of analysis, each item of data (e.g., age of person, income of family, consumer price index) is called a variable.

W

wave

In a panel study, a wave is the interviewing period during which the entire panel is surveyed and asked the same questions. Typically, a panel study consists of several waves. Waves are important because each wave usually covers a different time period and, often, different topics.

weight

In survey research, a number associated with a case or unit of analysis; the weight is used as a measure of the relative contribution of the variables of that case when making estimates for the entire population. When a probability sample is used, there is often a chance that some elements of the population are under- or overrepresented in the sample. In order to allow more accurate estimates of a complete population, "weights" are assigned to each case and used to adjust the overall results to conform more closely to the total population.
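A minimal sketch of how weights change an estimate, using made-up data: each case is a pair of a yes/no response (1 or 0) and a weight. The function name and the sample values are assumptions for illustration, not a prescribed weighting method.

```python
def weighted_proportion(cases: list[tuple[int, float]]) -> float:
    """Weighted share of cases whose response is 1.

    Each case is (response, weight), where response is 1 or 0.
    """
    total_weight = sum(weight for _, weight in cases)
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    yes_weight = sum(weight for response, weight in cases if response == 1)
    return yes_weight / total_weight


# Hypothetical sample: the third case stands in for more of the population.
sample = [(1, 1.0), (1, 1.0), (0, 3.0), (0, 1.0)]
unweighted = sum(response for response, _ in sample) / len(sample)
weighted = weighted_proportion(sample)
```

Here the unweighted share of "yes" responses is one half, while the weighted share is one third, because the underrepresented case answered "no."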

well-formed (XML)

An XML document that follows the rules set forth by the XML specification: comments are correctly formed, all tags are closed, all attribute values are quoted, and the document has exactly one "container" (root) element. An XML declaration is conventional but optional in XML 1.0. Note this means that the XML is syntactically correct, but it does not necessarily follow the rules specified by a DTD. (See valid.)
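As a sketch, well-formedness can be checked by handing the text to an XML parser and seeing whether it is accepted. The example uses Python's standard xml.etree.ElementTree module, which checks well-formedness only and does not validate against a DTD; the function name is illustrative.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True
```

An unclosed tag, an unquoted attribute value, or a second top-level element each causes the parser to reject the document.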

wild code

In survey research, "wild" codes are codes that are not authorized for a particular question. For instance, if a question that records the sex of the respondent has documented codes of "1" for female and "2" for male and "9" for "missing data," a code of "3" would be a "wild" code, sometimes called an "undocumented code."
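A minimal sketch of a wild code check for the sex variable described above: any value not among the documented codes is counted. The names and the sample values are illustrative.

```python
from collections import Counter

# Documented codes from the example: 1 = female, 2 = male, 9 = missing data.
DOCUMENTED_CODES = {1: "female", 2: "male", 9: "missing data"}


def find_wild_codes(values: list[int]) -> Counter:
    """Count values that are not among the documented codes."""
    return Counter(v for v in values if v not in DOCUMENTED_CODES)


# Hypothetical column of responses containing two undocumented codes.
wild = find_wild_codes([1, 2, 2, 3, 9, 1, 3, 7])
```

Counting the wild codes, rather than only flagging them, helps show whether a value is an isolated keying error or a legitimate code missing from the documentation.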

X

XML

An initiative of the World Wide Web Consortium (W3C), the eXtensible Markup Language (XML) is a simple dialect of SGML. Its goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. XML was designed for ease of implementation and for interoperability with both SGML and HTML.

XML Schema

An XML Schema is essentially a DTD written in XML instead of in the special DTD syntax.

XSL/XSLT

XSLT, the Extensible Stylesheet Language for Transformations, is an official recommendation of the World Wide Web Consortium (W3C). XSLT is a language used for transforming XML into other formats, most commonly HTML, PDF, or different forms of XML. If XML is all about content, then XSLT is about display.