How to Find a Dataset
About this Guide
This ICPSR guide outlines a five-step process to help students and researchers identify and evaluate datasets for assignments or projects. The steps are: define a topic, list your data requirements, search for datasets, assess each dataset against those requirements, and finalize the research question based on the available data.
A message from the author
If you are new to searching for secondary data or new to the topic you are studying, finding an existing dataset that fits your specific needs can be an overwhelming and time-consuming ordeal. Taking a strategic and creative approach will shorten your search so you can move forward with your research.
Shannon Merillat
Macalester College
Steps for Finding a Dataset for Your Assignment
Step 1. Identify Your Topic of Interest
Action Item:
- Ask yourself: What topic(s) interest me? For example, would you like to analyze data on sports, US presidential elections, climate change, or public health?
Helpful tips:
- Keep an open mind and keep the topic broad in this step. You’ll narrow it down in later steps.
- It can be tricky to find a dataset that includes the specific variables you are interested in! Even though you might have an idea for a research question at this point, wait to finalize your research question until after you’ve identified a dataset that meets your needs.
- Think about the kinds of topics you’ve discussed in class or in your readings. What has interested you the most? What makes you think, “Hmmm… I wonder if…?” If all else fails, ask your instructor or a librarian for help!
Step 2. List Your Data Requirements
Action Item:
- Create a checklist of data requirements based on your assignment and the topic chosen in Step 1. An example checklist appears in the Data Requirements Checklist section of this guide.
- Identify the key components of your research question. What are the independent and dependent variables or constructs? Are you interested in a particular group of people, type of setting, or time period? Do you need data collected at multiple points in time?
Helpful tips:
The checklist should include any attributes a dataset must have in order for you to complete your assignment. This includes things like:
- Minimum number of observations (rows). Example: As a rough rule of thumb, a regression analysis with 3 independent variables calls for at least 30 observations.
- Type of dependent (outcome) variable needed. Example: If you are going to perform a logistic regression, your dependent variable must be binary. If you are going to perform an ANOVA test, your dependent variable must be continuous.
- Type of independent variables needed. Example: If you are going to perform a linear regression, your independent variables can be nominal, ordinal, interval, or ratio.
- Your previous experience and skill with preparing data for analysis. Example: If you don’t have much experience, you may want to avoid datasets that will require you to merge or append multiple files, or to recode categorical variables as numeric variables.
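One way to keep these requirements in front of you while searching is to write them down in a small, structured form. The sketch below is only an illustration: the `DataRequirements` class, its field names, and the 10-observations-per-variable constant are assumptions made for this example (the constant simply reproduces the "3 variables, 30 observations" example above), not part of any ICPSR tool.

```python
from dataclasses import dataclass, field

# Assumption: roughly 10 observations per independent variable, chosen only
# because it matches the guide's example (3 variables -> 30 observations).
# Your assignment or instructor may set a different threshold.
OBSERVATIONS_PER_PREDICTOR = 10


@dataclass
class DataRequirements:
    """Hypothetical checklist of attributes a dataset must have."""
    dependent_type: str          # e.g. "binary" or "continuous"
    independent_types: list      # measurement levels you can work with
    n_independent: int           # number of independent variables planned
    allow_merging: bool = False  # False: avoid datasets that need merging
    flexible: set = field(default_factory=set)  # items you can compromise on

    @property
    def min_observations(self) -> int:
        """Minimum rows needed under the rule-of-thumb assumption above."""
        return OBSERVATIONS_PER_PREDICTOR * self.n_independent


# Example: a logistic regression assignment with 3 independent variables.
requirements = DataRequirements(
    dependent_type="binary",
    independent_types=["nominal", "ordinal", "interval", "ratio"],
    n_independent=3,
    flexible={"independent_types"},
)
print(requirements.min_observations)  # 30
```

Recording which items are flexible up front makes it easier to decide later whether a near-miss dataset is still worth a closer look.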
Step 3. Search for Possible Datasets in Repositories and Article References
Action Item: Look for a dataset that meets your data requirements from Step 2. You can do this in a couple of ways:
- Option 1. Search data repositories and databases. You will find sources for microdata and datasets on a variety of topics in our Social Sciences Data Research Guide and our Natural Sciences Research Guide.
- Option 2. Search for datasets cited in articles on your topic. Look in library databases for articles that analyze your topic and check which dataset each one used. This information is typically found in the methods section, appendix, and/or reference list.
Helpful tips:
- Some repositories require a subscription to access data. A librarian can help you identify the sources to which your school has subscribed.
ICPSR tips:
- Look for a dataset at icpsr.umich.edu
- We make Option 2 easy! The ICPSR Bibliography can be searched like a library database, and each entry links back to the data on which the research was based, so you can easily read more about the data.
Step 4. Evaluate Possible Datasets: Are Your Data Requirements Met?
Action Item: Determine if a dataset fulfills your requirements (from Step 2).
- Review the title, author, and why the dataset was created. Does the dataset fulfill your requirements? If yes, proceed. If no, discard the dataset and move to the next.
- Review the metadata, including the data dictionary. Does the dataset still fulfill your requirements? If yes, proceed. If no, discard the dataset and move to the next.
- Review the values and limitations of the dataset. Does the dataset still fulfill your requirements? If yes, proceed. If no, discard the dataset and move to the next.
- Download and explore the dataset to be sure.
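As a rough illustration of the "download and explore" step, the sketch below reads a delimited file with Python's standard `csv` module, counts the observations, and checks whether a candidate dependent variable is binary. The sample data, the column names (`voted`, `age`, `region`), and the helper functions are all made up for this example; a real download would be opened from disk instead of from a string.

```python
import csv
import io

# Made-up sample standing in for a downloaded delimited file.
SAMPLE_CSV = """id,voted,age,region
1,1,34,Midwest
2,0,51,South
3,1,29,West
4,0,62,South
"""


def summarize(csv_text):
    """Return the row count and the set of distinct values in each column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    distinct = {col: {row[col] for row in rows} for col in columns}
    return len(rows), distinct


def is_binary(values):
    """Treat a variable as binary if it has exactly two distinct non-empty values."""
    return len({v for v in values if v != ""}) == 2


n_rows, distinct = summarize(SAMPLE_CSV)
print("observations:", n_rows)
print("voted is binary:", is_binary(distinct["voted"]))
print("age is binary:", is_binary(distinct["age"]))
```

A quick check like this does not replace reading the codebook, but it can confirm that the file you downloaded actually matches what the documentation describes.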
Helpful tips:
- You will probably have to review many datasets before finding one that comes close to meeting your requirements.
- Having trouble finding a dataset that fits your data requirements? Consider using a different search strategy, looking in other sources (e.g., different repositories or articles), or revising some of your data requirements. You can also consult a librarian or your instructor.
ICPSR tips:
- Most of the information you need to evaluate your datasets (“studies” in ICPSR-speak) is right at your fingertips. Information on each study’s “homepage” tells you why and how the data were collected, what topics are included, who was in the sample, and other important details. Reading this information will give you a sense of the values and limitations of the data.
- Use the Variables tab on the study homepage to quickly see the variables contained within the data. Clicking on a variable’s name will take you to a frequency distribution. If there are no variables on the tab, do not fear! The same information is found in the codebook. Simply go to the Data and Documentation tab and download the documentation.
- The Data-related Publications tab lists all of the research outputs we’ve found using the data you’re evaluating. Browse through the titles to see if anything looks relevant – maybe someone has already figured out how to operationalize your concepts!
- Some datasets can even be analyzed online, so you don’t need to download the data to get a sense of whether they are right for your project.
- Finally, ICPSR makes most data available in a variety of file formats. Select the one matching the statistical package you will be using or the delimited file for Excel/Google Sheets.
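If a study has no online frequency distributions, you can build a similar table yourself once you have the data. The following is a minimal sketch using only Python's standard library; the `responses` list and the `frequency_table` helper are hypothetical stand-ins for one categorical variable read from a downloaded file.

```python
from collections import Counter

# Hypothetical responses for one categorical variable; "" marks a missing answer.
responses = ["South", "West", "South", "", "Midwest", "South", "West"]


def frequency_table(values):
    """Return (value, count, percent) rows sorted from most to least common."""
    counts = Counter(v if v != "" else "(missing)" for v in values)
    total = sum(counts.values())
    return [
        (value, n, round(100 * n / total, 1))
        for value, n in counts.most_common()
    ]


for value, n, pct in frequency_table(responses):
    print(f"{value:<10} {n:>3} {pct:>6}%")
```

Listing missing answers as their own category makes it easy to see how much of a variable is actually usable.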
Step 5. Finalize Your Research Question
Action Item:
- Now that you’ve chosen a dataset in Step 4, it’s time to finalize your research question!
Data Requirements Checklist
Creating a data requirements checklist before you start searching for datasets will help you search more efficiently. Because it can be difficult to find secondary data that will answer your research question, it’s important to build some flexibility into your checklist early on. Note what things you can and cannot compromise on.
Item | Description |
---|---|
Dependent (outcome) variable | You’ll need a dataset with a variable or set of variables that you can use as your dependent variable or use to create your dependent variable. You will want a dataset that includes a dependent variable that is: |
Independent variable(s) | You will also need variables that influence your dependent variable. You will want a dataset that includes independent variables that are: |
Number of observations (rows) | Determine the number of observations (i.e., rows or sample size) you need to conduct your analysis, and find a dataset that meets or exceeds that number. Make sure there are enough observations for all of the variables that you need to include in your analysis! |
Skill level | Consider your skill level and experience in data preparation. Some datasets require extensive cleaning before they are ready for any kind of analysis, while others may only require minimal preparation. If you do not have much experience in data preparation, you may want to stick to file formats that won’t require merging or conversion, and files with clean data. |
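The "enough observations for all of the variables" check in the table can be sketched in code: a row only helps your analysis if every variable you need is present in it. The example below is an assumption-laden illustration, with made-up data, made-up column names, and a hypothetical `count_complete_cases` helper, and it treats an empty cell as a missing value.

```python
import csv
import io

# Made-up sample; empty cells stand for missing values.
SAMPLE_CSV = """id,income,education,age
1,42000,12,34
2,,16,51
3,38000,,29
4,55000,18,62
5,61000,14,
"""


def count_complete_cases(csv_text, required):
    """Count rows where every required variable has a non-empty value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(all(row[var].strip() for var in required) for row in reader)


total_rows = SAMPLE_CSV.strip().count("\n")  # data rows, excluding the header
usable = count_complete_cases(SAMPLE_CSV, ["income", "education", "age"])
print(f"{usable} of {total_rows} rows are complete")
```

Real datasets often mark missing values with special codes rather than empty cells, so check the codebook before relying on a count like this.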