Data Stewardship Publications

ICPSR-published Documents and Staff-authored Publications about Data Stewardship

To address a broad range of subjects related to data stewardship, ICPSR staff and researchers have authored white papers and reports, as well as published articles. Some of these documents are the outcome of funded projects, while others outline ICPSR policy or training.

This collection is grouped into several main subject categories, each covering a range of data stewardship topics. These topics include data archiving, data repositories, government policy, community engagement, national and international data sharing issues, and training outreach.

Full-text links are provided for all ICPSR-published documents, as well as links to the abstract or full text of articles published elsewhere. When citing these works, please include the persistent identifiers listed with each.

Data Confidentiality

O'Rourke, Joanne McFarland, Stephen Roehrig, Steven G. Heeringa, Beth Glover Reed, William C. Birdsall, Margaret Overcashier, and Kelly Zidar.
Solving Problems of Disclosure Risk While Retaining Key Analytic Uses of Publicly Released Microdata. Journal of Empirical Research on Human Research Ethics 1, no. 3 (September 1, 2006): 63-83.
DOI: 10.1525/jer.2006.1.3.63

Abstract: Measures used to protect subjects in publicly distributed microdata files often have a significant negative impact on key analytic uses of the data. For example, it may be important to analyze subpopulations within a data file such as racial minorities, yet these subjects may present the greatest disclosure risk because their records tend to stand out or be unique. Files or records that are linkable create another type of disclosure risk: common elements between two files can be used to link files with sensitive data to externally available files that disclose identity. Examples of disclosure limitation methods used to address these types of issues include blanking out data, coarsening response categories, or withholding data altogether. However, the very detail that creates the greatest risk also provides insight into differences that are of greatest interest to analysts. Restricted-use agreements that provide unaltered versions of the data may not be available, or only selectively so. The public-use version of the data is very important because it is likely to be the only one to which most researchers, policy analysts, teaching faculty, and students will ever have access. Hence, it is the version from which much of the utility of the data is extracted and often it effectively becomes the historical record of the data collection. This underscores the importance that the disclosure review committee strikes a good balance between protection and utility. In this paper we describe our disclosure review committee's (DRC) analysis and resulting data protection plans for two national studies and one administrative data system. Three distinct disclosure limitation methods were employed, taking key uses of the data into consideration, to protect respondents while still providing statistically accurate and highly useful public-use data. The techniques include data swapping, microaggregation, and suppression of detailed geographic data. We describe the characteristics of the data sets that led to the selection of these methods, provide measures of the statistical impact, and give details of their implementations so that others may also utilize them. We briefly discuss the composition of our DRC, highlighting what we believe to be the important disciplines and experience represented by the group.

Richardson, Douglas B., Mei-Po Kwan, George Alter, and Jean E. McKendry.
Replication of Scientific Research: Addressing Geoprivacy, Confidentiality, and Data Sharing Challenges in Geospatial Research. Annals of GIS 21, no. 2 (April 3, 2015): 101-10.
DOI: 10.1080/19475683.2015.1027792

Abstract: The ability to replicate, or reproduce, research is fundamental to the scientific process. Research combining a variety of georeferenced data is spreading rapidly across scientific domains and international borders. This suggests a growing potential for the use and integration of new and existing data sets to create new multi-disciplinary scientific collaborations. Yet, the unique characteristics of georeferenced data present special challenges to such collaborations. These data are highly identifiable when presented in maps and other visualizations or when combined with sensor data or other related geospatial data sets. The potential opportunities of collaboration may thus be constrained by the need to protect the locational privacy (geoprivacy) and confidentiality of subjects in research using georeferenced data. This paper reviews the obstacles to and potential methods for sharing georeferenced data in order to support a growing and dynamic geospatial research community and build capacity for data-intensive research across the social and environmental sciences. The development and implementation of a geospatial virtual data enclave methodology is proposed as an innovative and viable solution to share and archive georeferenced data among researchers while protecting the geoprivacy of research subjects and the confidentiality of these data. The ability to share confidential geospatial data among researchers is crucial to ensuring replicability of scientific research, and to enable researchers to verify and build upon the research of others.

VanWey, Leah K., Ronald R. Rindfuss, Myron P. Gutmann, Barbara Entwisle, and Deborah L. Balk.
Confidentiality and Spatially Explicit Data: Concerns and Challenges. Proceedings of the National Academy of Sciences 102, no. 43 (October 25, 2005): 15337-42.
DOI: 10.1073/pnas.0507804102

Abstract: Recent theoretical, methodological, and technological advances in the spatial sciences create an opportunity for social scientists to address questions about the reciprocal relationship between context (spatial organization, environment, etc.) and individual behavior. This emerging research community has yet to adequately address the new threats to the confidentiality of respondent data in spatially explicit social survey or census data files, however. This paper presents four sometimes conflicting principles for the conduct of ethical and high-quality science using such data: protection of confidentiality, the social-spatial linkage, data sharing, and data preservation. The conflict among these four principles is particularly evident in the display of spatially explicit data through maps combined with the sharing of tabular data files. This paper reviews these two research activities and shows how current practices favor one of the principles over the others and do not satisfactorily resolve the conflict among them. Maps are indispensable for the display of results but also reveal information on the location of respondents and sampling clusters that can then be used in combination with shared data files to identify respondents. The current practice of sharing modified or incomplete data sets or using data enclaves is not ideal for either the advancement of science or the protection of confidentiality. Further basic research and open debate are needed to advance both understanding of and solutions to this dilemma.

Data Curation

Doty, Jennifer, Joel Herndon, Jared Lyle, and Libbie Stephenson.
Learning to Curate. Bulletin of the American Society for Information Science and Technology 40, no. 6 (August 1, 2014): 31-34.
DOI: 10.1002/bult.2014.1720400610

Abstract: Three data specialists reviewed their experiences learning about and applying the Inter-university Consortium for Political and Social Research (ICPSR) processes and tools for curating research data. A small virtual community discussed curation theories on data acquisitions, review, processing, metadata and dissemination and shared progress implementing the ICPSR workflows and tools. Curators at Duke University, dealing with data on political donors, found processing obstacles from incomplete and mismatched data and faced confidentiality questions. At Emory University, gaps in coded data on home schooling practices revealed problems arising from lack of preplanning for long-term archiving and research and the need to clarify data needs for later re-use. By applying ICPSR processes, curators at UCLA's Social Sciences Data Archive were able to improve their workflow and understand the work necessary for open archiving. All participants gained from the opportunity to practice ICPSR curation practices, realized the resource demands and saw the value librarians can provide by consulting with faculty on data management and preservation.

Gutmann, Myron, Kevin Schürer, Darrell Donakowski, and Hilary Beedham.
The Selection, Appraisal, and Retention of Social Science Data. Data Science Journal 3 (2004): 209-21.
DOI: 10.2481/dsj.3.209

Abstract: The number of data collections produced in the social sciences prohibits the archiving of every scientific study. It is therefore necessary to make decisions regarding what can be preserved and why it should be preserved. This paper reviews the processes used by two data archives, one from the United States and one from the United Kingdom, to illustrate how data are selected for archiving, how they are appraised, and what steps are required to retain the usefulness of the data for future use. It also presents new initiatives that seek to encourage an increase in the long-term preservation of digital resources.

Gutmann, Myron, Kristine Witkowski, Corey Colyer, JoAnne McFarland O'Rourke, and James McNally.
Providing Spatial Data for Secondary Analysis: Issues and Current Practices Relating to Confidentiality. Population Research and Policy Review 27, no. 6 (2008): 639-65.
DOI: 10.1007/s11113-008-9095-4

Abstract: Spatially explicit data pose a series of opportunities and challenges for all the actors involved in providing data for long-term preservation and secondary analysis -- the data producer, the data archive, and the data user. We report on opportunities and challenges for each of the three players, and then turn to a summary of current thinking about how best to prepare, archive, disseminate, and make use of social science data that have spatially explicit identification. The core issue that runs through the paper is the risk of the disclosure of the identity of respondents. If we know where they live, where they work, or where they own property, it is possible to find out who they are. Those involved in collecting, archiving, and using data need to be aware of the risks of disclosure and become familiar with best practices to avoid disclosures that will be harmful to respondents.

Inter-university Consortium for Political and Social Research (ICPSR). 
Guide to Archiving Social Science Data for Institutional Repositories (1st ed.). Ann Arbor, MI: Inter-university Consortium for Political and Social Research, 2012.
DOI: 10.3886/GuideForIRS

Brief description: This guide provides a detailed overview of what an institutional repository can do to appraise data, prepare them for storage, and ensure that the preserved data are independently understandable.

Inter-university Consortium for Political and Social Research (ICPSR). 
Guide to Social Science Data Preparation and Archiving: Best Practice Throughout the Data Life Cycle (5th ed.). Ann Arbor, MI: Inter-university Consortium for Political and Social Research, 2012.
DOI: 10.3886/GuideToSocialScienceDataPreparation

Brief description: This guide describes the key considerations germane to archiving at each step in the data creation process, from proposal development to data collection to data archiving.

Lyle, Jared, George Alter, and Ann Green.
Partnering to Curate and Archive Social Science Data. In Research Data Management: Practical Strategies for Information Professionals, edited by Joyce Ray, 203-21. Charleston Insights in Library, Archival, and Information Sciences. West Lafayette, IN: Purdue University Press, 2014.

Abstract: Improvements in data processing and storage technology are resulting in an increase in research data on a variety of social, economic, and political subjects. Many datasets could be profitably reanalyzed, but they are at danger of being lost since they are never properly archived. Institutional repositories (IRs) are charged to preserve the scholarly products of their faculty or institution and are increasingly tapped to add data services, but not all feel prepared to curate and archive data. In some cases, IRs work closely with local experts who have a history of supporting data on their campuses (for example, data specialists who serve as Inter-university Consortium for Political and Social Research (ICPSR) official representatives (ORs) or statistical computing consultants), but in many cases we have found that IR managers would welcome additional support for their data curation efforts. ICPSR, a research center in the Institute for Social Research at the University of Michigan and the world's largest archive of social science data, led an Institute of Museum and Library Services (IMLS)-funded project exploring how specialized domain repositories can partner with IRs to curate and preserve data. In this chapter, we describe why partnerships are important, explore guidelines we developed to help IRs and others partner to curate and archive data, and discuss the resulting services and tools we propose repositories will find most helpful when working with data. While our project was directed at IRs, we emphasize the relevance of the guidance and proposed tools and services to anyone who works with data, especially information professionals.

Data Preservation

Altman, Micah, Margaret Adams, Jonathan Crabtree, Darrell Donakowski, Marc Maynard, Amy Pienta, and Copeland Young.
Digital Preservation through Archival Collaboration: The Data Preservation Alliance for the Social Sciences. The American Archivist 72, no. 1 (April 2009): 170-84.
DOI: 10.17723/aarc.72.1.eu7252lhnrp7h188

Abstract: The Data Preservation Alliance for the Social Sciences (Data-PASS) is a partnership of five major U.S. institutions with a strong focus on archiving social science research. The Library of Congress supports the partnership through its National Digital Information Infrastructure and Preservation Program (NDIIPP). The goal of Data-PASS is to acquire and preserve data from opinion polls, voting records, large-scale surveys, and other social science studies at risk of being lost to the research community. This paper discusses the agreements, processes, and infrastructure that provide a foundation for the collaboration.

Gutmann, Myron P., Mark Abrahamson, Margaret O. Adams, Micah Altman, Caroline R. Arms, Kenneth Bollen, Michael Carlson, et al.
From Preserving the Past to Preserving the Future: The Data-PASS Project and the Challenges of Preserving Digital Social Science Data. Library Trends 57, no. 3 (Winter 2009): 315-37.

Abstract: Social science data are an unusual part of the past, present, and future of digital preservation. They are both an unqualified success, due to long-lived and sustainable archival organizations, and in need of further development because not all digital content is being preserved. This article is about the Data Preservation Alliance for the Social Sciences (Data-PASS), a project supported by the National Digital Information Infrastructure and Preservation Program (NDIIPP), which is a partnership of five major U.S. social science data archives. Broadly speaking, Data-PASS has the goal of ensuring that at-risk social science data are identified, acquired, and preserved, and that we have a future-oriented organization that could collaborate on those preservation tasks for the future. Throughout the life of the Data-PASS project we have worked to identify digital materials that have never been systematically archived, and to appraise and acquire them. As the project has progressed, however, it has increasingly turned its attention from identifying and acquiring legacy and at-risk social science data to identifying ongoing and future research projects that will produce data. This article is about the project's history, with an emphasis of the issues that underlay the transition from looking backward to looking forward.

Data Sharing

Kanous, Alex, and Elaine Brock.
Contractual Limitations on Data Sharing. Report prepared for ICPSR as part of the "Building Community Engagement for Open Access to Data" project. Ann Arbor, MI: Inter-university Consortium for Political and Social Research, 2015.
George Alter (Principal Investigator), Alfred P. Sloan Foundation Grant Number 2012-6-11.
DOI: 10.3886/ContractualLimitationsDataSharing

Purpose: This report was commissioned by the Inter-university Consortium for Political and Social Research at the University of Michigan to engage in an in-depth review of exemplar data sharing, data license, non-disclosure, and other forms of agreements under which data are made available for research use. It is part of a project on "Building Community Engagement for Open Access to Data" sponsored by the Alfred P. Sloan Foundation. The intent of the review was to identify common limitations imposed on the use and re-disclosure of data, variations on those common limitations, and the implications of such limitations on the researcher. Finally, cognizant of the varying reasons for imposing these conditions of use, such as proprietary or privacy concerns, the review sought to identify approaches to conditional data use that represent the data discloser's compelling concerns and the data user's need for latitude in use, in a standardized way in order to facilitate data transfer and reduce the administrative burden of tracking a multitude of varying data use limitations.

Kanous, Alex, and Elaine Brock.
Model Data Sharing Agreement. Customizable model created as part of the "Building Community Engagement for Open Access to Data" project. Ann Arbor, MI: Inter-university Consortium for Political and Social Research, 2015.
George Alter (Principal Investigator), Alfred P. Sloan Foundation Grant Number 2012-6-11. 
DOI: 10.3886/ModelDataSharingAgreement

Moss, Elizabeth, Christin Cave, and Jared Lyle.
Sharing and Citing Research Data: A Repository's Perspective. In Big Data, Big Challenges in Evidence-Based Policymaking, edited by H. Kumar Jayasuriya, with Kathryn Ritcheske as contributing editor. 2015.

Summary: Formal data citation is a key element of the growing data-sharing infrastructure, not only facilitating sharing, discovery, and proper use, but also enabling data impact tracking that allows researchers to receive credit for their contributions. Specialized data repositories, such as ICPSR, integrate data citations within study metadata to enhance access and encourage data sharing. National and international efforts are underway to encourage adoption of these types of practices. The eventual result should be that more data creators will benefit from citations by receiving credit for their work. More researchers will benefit by readily finding reproducible research. And more funding agencies will benefit by tracking supported projects' usage and gauging impact beyond the initial funding. ICPSR and its topical archives, like NACJD, provide an example of how data citation can encourage data archiving and secondary use. They support the growth of the ICPSR Bibliography of Data-related Literature and see the collection as evidence of new scientific findings for consideration in shaping public policy. The Bibliography's two-way linkages between data and data-associated publications have improved the discovery and the chances of good secondary use of ICPSR data. Due to inconsistent and inadequate data-citing practices in the scholarly literature, tracking data reuse is costly and labor-intensive. Despite this, ICPSR continues to value and invest in the collection of data-related publications, while promoting the creation and use of standards for citing and sharing research data according to best practices.

Pienta, Amy M., George C. Alter, and Jared A. Lyle.
The Enduring Value of Social Science Research: The Use and Reuse of Primary Research Data. Presented at the "BRICK, DIME, STRIKE Workshop," The Organisation, Economics, and Policy of Scientific Research, Turin, Italy, April 23-24, 2010.

Abstract: The goal of this paper is to examine the extent to which social science research data are shared and assess whether data sharing affects research productivity tied to the research data themselves. We construct a database from administrative records containing information about thousands of social science studies that have been conducted over the last 40 years. Included in the database are descriptions of social science data collections funded by the National Science Foundation and the National Institutes of Health. A survey of the principal investigators of a subset of these social science awards was also conducted. We report that very few social science data collections are preserved and disseminated by an archive or institutional repository. Informal sharing of data in the social sciences is much more common. The main analysis examines publication metrics that can be tied to the research data collected with NSF and NIH funding - total publications, primary publications (including PI), and secondary publications (non-research team). Multivariate models of count of publications suggest that data sharing, especially sharing data through an archive, leads to many more times the publications than not sharing data. This finding is robust even when the models are adjusted for PI characteristics, grant award features, and institutional characteristics.

Alter, George C., and Mary Vardigan.
Addressing Global Data Sharing Challenges. Journal of Empirical Research on Human Research Ethics 10, no. 3 (July 2015): 317-23.
DOI: 10.1177/1556264615591561

Abstract: This issue of the Journal of Empirical Research on Human Research Ethics highlights the ethical issues that arise when researchers conducting projects in low- and middle-income countries seek to share the data they produce. Although sharing data is considered a best practice, the barriers to doing so are considerable and there is a need for guidance and examples. To that end, the authors of this article reviewed the articles in this special issue to identify challenges common to the five countries and to offer some practical advice to assist researchers in navigating this "uncharted territory," as some termed it. Concerns around informed consent, data management, data dissemination, and validation of research contributions were cited frequently as particularly challenging areas, so the authors focused on these four topics with the goal of providing specific resources to consult as well as examples of successful projects attempting to solve many of the problems raised.

Vardigan, Mary, Darrell Donakowski, Pascal Heus, Sanda Ionescu, and Julia Rotondo.
Creating Rich, Structured Metadata: Lessons Learned in the Metadata Portal Project. IASSIST Quarterly 38, no. 3 (2014): 15-20.

Abstract: With support from the National Science Foundation, two long-running social science studies - the American National Election Study and the General Social Survey - partnered with the Inter-university Consortium for Political and Social Research (ICPSR) and NORC at the University of Chicago to improve their metadata and build demonstration tools to illustrate the value of structured, machine-actionable metadata. The partnership also involved evaluating the studies' data collection workflows to determine where in the data life cycle metadata could be captured at source to avoid metadata loss and costly procedures to recreate the metadata later. This article reports on the experience and knowledge gained over the course of the project and also includes recommendations for others undertaking similar work.

Vardigan, Mary, Pascal Heus, and Wendy Thomas.
Data Documentation Initiative: Toward a Standard for the Social Sciences. International Journal of Digital Curation 3, no. 1 (February 12, 2008): 107-13.
DOI: 10.2218/ijdc.v3i1.45

Abstract: The Data Documentation Initiative (DDI) is an emerging metadata standard for the social sciences. The DDI is in active use by many data specialists and archivists, but researchers themselves have been slow to recognize the benefits of the standards approach to metadata. This paper outlines how the DDI has evolved since its inception in 1995 and discusses ways to broaden its impact in the social science research community.

Vardigan, Mary, and Cole Whiteman.
ICPSR Meets OAIS: Applying the OAIS Reference Model to the Social Science Archive Context. Archival Science 7, no. 1 (March 2007): 73-87.
DOI: 10.1007/s10502-006-9037-z

Abstract: This paper reviews the archival process at the Inter-university Consortium for Political and Social Research (ICPSR), a repository of digital social science data, and maps ICPSR's Ingest and Access operations to the Open Archival Information System (OAIS) Reference Model. The paper also assesses ICPSR's conformance with the archival responsibilities of "trusted" OAIS repositories, with the proviso that audit criteria for archival certification are still under development. The ICPSR to OAIS mapping exercise has benefits for the larger social science archiving community because it provides an interpretation of the reference model in the quantitative social science environment and points to preservation-related issues that may be salient for other social science archives. Building on the archives' long tradition of shared norms and cooperation, we may ultimately be able to design a federated system of trusted social science repositories that provides access to the global heritage.

Research Transparency

Lupia, Arthur, and George Alter.
Data Access and Research Transparency in the Quantitative Tradition. PS: Political Science & Politics 47, no. 1 (January 2014): 54-59.
DOI: 10.1017/S1049096513001728

Abstract: The number of people conducting scientific analyses and the number of topics being studied are higher than ever. At the same time, there are questions about the public value of social scientific endeavors, particularly of federally funded quantitative research (Prewitt 2013). In this article, we contend that data access and research transparency are essential to the public value of the enterprise as a whole and to the credibility of the growing number of individuals who conduct such research.

Inter-university Consortium for Political and Social Research (ICPSR). 
Research Transparency, Data Access, and Data Citation: A Call to Action for Scholarly Publications. Position statement created at the "Data Citation and Research Transparency Standards for the Social Sciences Meeting," Ann Arbor, MI, June 13-14, 2013. Ann Arbor, MI: Inter-university Consortium for Political and Social Research, 2013.
DOI: 10.3886/ResearchTransparency

Brief description: This document summarizes the Data Citation and Research Transparency Standards for the Social Sciences Meeting held in Ann Arbor, MI, June 13-14, 2013. It describes the emergent consensus about research transparency and then challenges scholarly journals and publishers to play a leadership role in this movement. At the same time, journals cannot do this work alone. Success will require the coordinated efforts of all stakeholders including professional associations, funding agencies, universities and their constituent academic departments, data repositories, researcher training programs, and researchers themselves.

Sustaining Domain Repositories

Ember, Carol, and Robert Hanisch.
Sustaining Domain Repositories for Digital Data: A White Paper. Output of the workshop, "Sustaining Domain Repositories for Digital Data," Ann Arbor, MI, June 24-25, 2013. Ann Arbor, MI: Inter-university Consortium for Political and Social Research, 2013. 
DOI: 10.3886/SustainingDomainRepositoriesDigitalData

Executive summary: The last few years have seen a growing international movement to enhance research transparency, open access to data, and data sharing across the social and natural sciences. Meanwhile, new technologies and scientific innovations are vastly increasing the amount of data produced and the resultant potential for advancing knowledge. Domain repositories - data archives with ties to specific scientific communities - have an indispensable role to play in this changing data ecosystem. With both content-area and digital curation expertise, domain repositories are uniquely capable of ensuring that data and other research products are adequately preserved, enhanced, and made available for replication, collaboration, and cumulative knowledge building. However, the systems currently in place for funding repositories in the US are inadequate for these tasks. Effective and innovative funding models are needed to ensure that research data, so vital to the scientific enterprise, will be available for the future. Funding models also need to assure equal access to data preservation and curation services regardless of the researcher's institutional affiliation. Creating sustainable funding streams requires coordination amongst multiple stakeholders in the scientific, archival, academic, funding, and policy communities.

Inter-university Consortium for Political and Social Research (ICPSR). 
Response to RFI: 'Public Access to Digital Data Resulting From Federally Funded Scientific Research' Office of Science and Technology Policy. Response paper. Ann Arbor, MI: Inter-university Consortium for Political and Social Research, 2011.
DOI: 10.3886/ResponseTORFI

Brief description: In this response to the Office of Science and Technology Policy's 2011 request for information (RFI), ICPSR advocates federal policies to improve access to and preservation of scientific data, and provides a detailed list of recommendations.

Inter-university Consortium for Political and Social Research (ICPSR). 
Sustaining Domain Repositories for Digital Data: A Call for Change from an Interdisciplinary Working Group of Domain Repositories. Position statement. "Sustaining Domain Repositories Meeting," Ann Arbor, MI, June 24-25, 2013. Ann Arbor, MI: Inter-university Consortium for Political and Social Research, 2013.
DOI: 10.3886/SustainingDomainRepositoriesDigitalData

Brief description: In February 2013, the U.S. Government's Office of Science and Technology Policy (OSTP) issued a memorandum calling for all federal agencies funding data collection to create plans for public access to research projects. On June 24-25, 2013, representatives from 22 data repositories spanning the social and natural sciences met in Ann Arbor, MI. The meeting, organized by the Inter-university Consortium for Political and Social Research (ICPSR) and supported by the Alfred P. Sloan Foundation, created a space to discuss the challenges facing repositories across domains, and to strategize around issues of sustainability. Attendees endorsed a unified call for change, stating that domain repositories must be funded as the essential piece of the U.S. research infrastructure that they are.