Managing Data for Reproducible Results


Over the past few decades there has been a growing emphasis on open science and reproducible research. A key component of these efforts is the expectation that researchers distribute the data and computational scripts used to produce their results before a paper is reviewed. These expectations place new demands on how quantitative research is conducted so that others can replicate it. The workshop teaches methods for creating reproducible results and for efficiently handling complex data management and analysis. The class is ideal for students who are starting their dissertation, revising a master's thesis for publication, or preparing a paper for submission to a journal. Students who have completed their first year of graduate school will also find it valuable for developing skills to guide future research.

The workshop deals with the entire process of quantitative research, from planning your work through publication. Most classes in statistics focus on fitting and interpreting models, yet these activities often account for less than 10% of the total work. This workshop is about the other 90%: planning, documenting, and organizing your work; creating, labeling, naming, and verifying variables; systematically performing and presenting statistical analyses; preserving your work; and, critically, producing results that can be reproduced by others. Lectures show you how to develop a workflow that is guided by the demands of producing reproducible and accurate results while working as quickly and efficiently as possible. Topics to be covered include:

  1. General principles that guide your research: reproducibility, accuracy, and efficiency.
  2. Efficient methods for planning, organizing, documenting, executing, and preserving your work.
  3. A workflow for computing that facilitates reproducibility while maintaining the provenance of your results.
  4. Writing robust, effective programs for data analysis that use simple programming methods to increase accuracy and efficiency.
  5. Methods for preparing data for analysis: importing data; developing consistent names and labels; documenting the sample and variables; and cleaning the data.
  6. Techniques for conducting sophisticated data analyses that are reproducible and efficient.
  7. Methods for accurately and quickly incorporating statistical results into your writing while maintaining the provenance of the findings, facilitating later revisions of your work.
  8. Ways to prevent the catastrophic loss of files.

The class assumes that you are planning to do quantitative data analysis and that you have completed at least one graduate class in statistics. Students starting their dissertation have found this class to be a great way to get their work organized, plan new analyses, and conduct analyses that are efficient and replicable. Students who are earlier in their graduate career develop a workflow that they will grow into as they undertake larger research projects. To complete exercises in the class you will need access to a dataset that you want to work with. While Stata is used to illustrate some of the ideas, the strategies and concepts apply to any statistical package, and you are welcome to use programs such as SAS or R for your work. If you do, you will need to have that package installed on your laptop and already know how to use it, since the instructor might not.

SPECIAL FEE: Participants who attend the first Four-Week Session of the 2017 ICPSR Summer Program (June 26, 2017 to July 21, 2017) or those who have attended Four-Week Sessions of the ICPSR Summer Program in the past are eligible for a special discounted fee of $900 to attend this five-day short workshop. To receive this special discounted fee, please email the Summer Program at

Fee: Members = $1200; Non-members = $2200

Tags: workflow, reproducibility

Course Sections