Looking Back and Thinking Ahead: The 20th Anniversary of the ICPSR Bibliography of Data-related Literature
To mark the 20th anniversary of ICPSR’s Bibliography of Data-related Literature, Bibliography manager Elizabeth Moss interviewed Mary Vardigan, ICPSR Archivist Emerita, who retired in 2015 as Assistant Director for Collection Delivery after 30 years at ICPSR. This is their conversation about the Bibliography’s history and purpose.

Elizabeth:
When you retired, you said that one of the most rewarding things about working at ICPSR was learning about how many people used the data shared at ICPSR for new science. How did you learn about the scope of data use?
Mary:
We saw, of course, the monthly and yearly statistics on data and documentation downloads, which kept going up, and we saw data use increasing around the world.
But it’s the ICPSR Bibliography of Data-related Literature, which you have done so much to develop, Elizabeth, that really sheds light on how profoundly ICPSR data have influenced the course of quantitative social and behavioral science. The Bibliography brings together citations to publications based on uses of ICPSR data, whether the author acquired the data from ICPSR or from elsewhere. This resource is so valuable – it can be used for literature reviews and also to determine the return on investment for any individual dataset. It shows that the analytic potential of a dataset is never really exhausted and there are still new findings and discoveries to be made through secondary analysis.
Elizabeth:
You were instrumental in getting funding for the Bibliography from the National Science Foundation back in 1999. You and ICPSR’s then director, Richard Rockwell, had a vision for it. What do you remember about developing the proposal?
Mary:
In those days, we were not taking advantage of the potential for building a comprehensive Bibliography. Study descriptions along with the related publications were online, but those publications were pretty much limited to what the principal investigator sent along with the data to be archived. This did not capture the huge amount of data reuse and secondary analysis going on. We wanted a comprehensive resource and we wanted people to be able to link to and discover data via the literature written about it. We also wanted a freely available resource for the entire research community and the public. Anyone, from students and researchers to funders, journalists, and policy-makers, can search the Bibliography on any topic and get to publications that match that search. But crucially, they also get linked to the data used in those publications.
It was also important that ICPSR hired professional librarians like you to guide this effort, which was a departure for the organization. This enabled us to get new types of expertise and strengthen ties with the library and information school community.
Elizabeth, can you say more about how you find those related publications and make those linkages to the underlying data?
Elizabeth:
As you noted earlier, when principal investigators deposit their data with ICPSR, they sometimes also provide citations to publications they have already written using the data. And some authors who downloaded data and wrote about them also send us citations, since we ask them to do so as part of the terms of use.
But we mainly find data-related publications by going out to major health and social science literature databases, where we create queries and set up email alerts that notify us whenever anything new matches a query. We have created so many of these queries over the years that we receive hundreds of emailed publications every month. If we determine that a given publication uses data distributed by ICPSR, then it gets added to the Bibliography. Then we have to figure out which study to link to that publication. It really calls for the staff members to become experts in the various ways that our studies are described in our study metadata, as well as how they are referred to in the literature.
Mary:
That’s one reason these linkages are so valuable. They are difficult to make because it can be hard to identify exactly which data were used in a publication. Can you say more about that?
Elizabeth:
Many authors still do not acknowledge data in a standard way, and they often do not use the machine-readable Digital Object Identifier, or DOI, that ICPSR asks authors to use as part of the data citation we provide for each study. ICPSR started distributing all studies with DOIs back in 2008, under one of your initiatives, Mary. As you know, the DOI is a registered unique identifier that is persistent over time, unlike, say, a plain URL. The key to collecting publications and linking them to the data they analyze lies with the authors. Whether they publish using data they created, or whether they reused someone else’s shared data, they should state clearly what data they used, where they found it, who created it, and provide a unique identifier that ensures long-term access to the data, whenever possible.
Mary:
In order to find those publications that use data, but do not cite it well, the Bibliography staff must resort to using what you call detective work, correct?
Elizabeth:
Yes! In the scholarly literature, we check for both explicitly cited data use and informal or incomplete references. When we find that an author has used one of ICPSR’s DOIs, we can be pretty sure that author used data.
But even though authors are getting more knowledgeable about citing data and using the DOIs we provide, as I mentioned, most do not. So we have to create the next best thing: imperfect queries based on the study names. I say “imperfect” because of the not-quite-unique query string you get when you use the name of a study. That query can bring back results that we don’t want. A name like the Detroit Area Study could have been used in the text as a passing reference to the study in the literature review. Or it could have been used as part of the title of some other publication in the references. What we are looking for is when authors use the study name when describing the data they analyzed. So, we have to skim the paper in order to sort the wheat from the chaff.
And as I mentioned before, we then have to figure out just what study they are describing. ICPSR houses many studies that are part of a series of studies conducted many times over the years. For instance, we have over 50 studies with “Detroit Area Study” in their titles. They have similar names and only differ in the year the data were collected, or in their subtitles. When we find a study mentioned in a paper’s Methods section, we hope that at least the author has provided basic clues, like the year or wave or sample size used. Then we know which study to associate with the paper, so its citation will appear with the correct study or studies on the ICPSR websites.
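The two-step check described here (look for a DOI first, then fall back to the study name plus clues such as the year) can be illustrated with a minimal Python sketch. It is not ICPSR's actual tooling. The DOI pattern assumes ICPSR DOIs take the form `10.3886/ICPSR<number>.v<version>`, and the catalog entries, study keys, and function names are hypothetical, made up only for illustration.

```python
import re

# Assumption: ICPSR DOIs look like 10.3886/ICPSR00000.v1.
ICPSR_DOI = re.compile(r"10\.3886/ICPSR\d+(?:\.v\d+)?", re.IGNORECASE)
# Four-digit years from 1900 to 2099.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")

# Hypothetical mini-catalog: study key -> (series name, collection year).
CATALOG = {
    "das-1956": ("Detroit Area Study", 1956),
    "das-1976": ("Detroit Area Study", 1976),
    "das-1995": ("Detroit Area Study", 1995),
}


def find_dois(text):
    """Return the distinct ICPSR-style DOIs mentioned in the text."""
    return sorted({m.group(0) for m in ICPSR_DOI.finditer(text)})


def candidate_studies(text, catalog=CATALOG):
    """Return study keys whose series name and year both appear in the text."""
    lowered = text.lower()
    years = {int(y) for y in YEAR.findall(text)}
    return sorted(
        key
        for key, (name, year) in catalog.items()
        if name.lower() in lowered and year in years
    )


def link_publication(text):
    """Prefer an explicit DOI; otherwise fall back to name-and-year matching."""
    dois = find_dois(text)
    if dois:
        return {"method": "doi", "matches": dois}
    return {"method": "name+year", "matches": candidate_studies(text)}


if __name__ == "__main__":
    print(link_publication("Data are available at doi:10.3886/ICPSR00000.v1."))
    print(link_publication("We analyzed the 1976 Detroit Area Study (N = 1,134)."))
```

Even in this toy form, the name-and-year fallback matches a passing mention in a literature review just as readily as a Methods section, which is why a person still has to skim the paper before a link is recorded.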
Luckily, the understanding of the importance of data citation is part of a broader movement that’s taken place in the past 20 years, toward data transparency and the standardization of data citation. Mary, you took part in ICPSR’s efforts to encourage authors to acknowledge and identify the data they used by citing it formally in the references of their publications, just like they would cite an article or book.
Mary:
Yes, that is so important. Researchers should get credit for the data they produce as it is a key research output along with the journal articles and other publications they author based on the data. ICPSR led the community in this regard by offering citations for its data holdings starting in 1989. I’m proud of that because hardly anyone was thinking about it at that point. I believe the Census Bureau and maybe the General Social Survey provided data citations, but that was it.
Elizabeth:
What do you think we could learn about the long tail of research if everyone cited data appropriately, so that data’s reuse could be better tracked, and its impact better measured?
Mary:
The Bibliography itself can be seen as a dataset to develop metrics of data reuse. To name a few things we could investigate, we could learn about the patterns of reuse – who reuses data, which data do they reuse, for what purpose, when do they archive their data, and when does the reuse occur through the overall data lifecycle. We could also learn which data are fruitful for analyses in other fields not envisioned by the original investigators. Related to that, we could learn more about which datasets are useful in combination with others, as combining and merging datasets is something more researchers are doing now. We could also learn about patterns of reuse for big data as opposed to survey data. The list goes on and on.
It seems as if the sponsors of the various projects at ICPSR have grown in their support of the Bibliography, correct?
Elizabeth:
I think in the beginning, only the general Membership archive and a few other topical archives funded Bibliography work. Nowadays, we have many more archives within ICPSR, and nearly all of them have realized the value of collecting the published findings using data they distribute to understand the impact of the data they provide. In the past few years, we’ve been able to fund a couple of full-time staff members mostly with topical archive support. It has really allowed us to add publications to an array of studies that have many, many publications written about them.
Mary:
How have things changed in the past 20 years, when it comes to what the Bibliography staff do to find publications and link them to the underlying data?
Elizabeth:
Well, within ICPSR, back in 1999 and 2000, the librarian you hired to design the search tool and populate the citations database was Jeri Schneider. She is the person who created the Bibliography with thoughtfulness and quality — and made the vision into a full-fledged, working tool with the help of our IT staff.
You may recall that when she started, her staff took many trips to the University of Michigan’s graduate library, where they had their own carrels (they were there that much!). They would have to go to the stacks to access bound journals and books and read through the hard copies without the help of digitized search. On their laptops, they would enter citations manually, instead of how it’s done today, by ingesting the formatted citation directly from the publisher’s webpage. That early work was to build the collection, finding many items that were published back in the 1960s when ICPSR first started sharing data. Gradually, we have moved on to searching only digitized peer-reviewed and grey literature sources that contain books, articles, reports, dissertations, and more. And we don’t have carrels at the library anymore!
Another big change is the size of the catalog of studies held at ICPSR. The number of studies has probably tripled since then, making the job of tracking all the related publications more challenging.
Mary:
Do you see similar efforts to build comprehensive bibliographies in fields outside the social sciences?
Elizabeth:
The Bibliography is unique in its breadth, especially since so much of it has been manually curated over 20 years. But we do see more and more that researchers, and those who fund their research, have become interested in seeing the impact of their data on the advancement of science. They, too, are trying to track publications that use their data. I just got an inquiry the other day from a researcher at CERN, in Switzerland, who wanted to learn more about how we track certain aspects so that they could track use of their data on particle physics. Like CERN, other stakeholder groups have come up with ways to text mine data citations.
Mary:
Is ICPSR working on new methods to enhance the Bibliography?
Elizabeth:
Yes — more informally cited data use is still difficult to track with computer scripts. I am working with a research project now, funded by the National Science Foundation, ironically the same government agency that funded the Bibliography two decades ago. In that project, we will be trying to “train” an algorithm to find those elusive informally cited instances of data use. Automating the finding and collecting of data-related publications will help ICPSR gather all the publications that should be in the Bibliography. And as you said, when that happens, we will get a better measure of the value of the shared data that ICPSR provides.