What Trusted Data Repositories Can Learn from ICPSR’s AI Explorations

By ...

May 29, 2026

At Dataverse Community Meeting 2026, held at the World Trade Center in Barcelona, May 12–15, 2026, ICPSR’s Aalap Doshi and Jeannette Jackson shared practical lessons for repositories navigating artificial intelligence.

Their presentation, “AI in Trusted Data Repositories: Reflections from Exploration and Experimentation at ICPSR,” focused on a question facing many repository teams: How can AI make repository work easier and more effective without weakening trust?

“For trusted repositories, the question is not whether to use AI but how it should be explored, designed, and governed in ways that align with long-standing principles of stewardship, transparency, and trust,” the presenters noted.

Practical Lessons for Repository Teams

Drawing on ICPSR’s more than 60 years of experience in data stewardship, Doshi and Jackson described AI experiments in deposit workflows, metadata enrichment, discovery, variable harmonization, disclosure review, and access decisions.

A key message for repository teams was to make AI part of the workflow — not another tool users have to find. “Build it in, don’t bolt it on,” the presentation emphasized.

Examples included TurboCurator, which suggested metadata improvements before curator review, and Bucketizer, which classified deposited files as data, documentation, or other materials. For staff, these approaches could reduce repetitive work and let curators focus their expertise where it matters most.

The presenters also stressed that AI should support human judgment, not replace it. “AI should not do the work for the researcher. It should guide search strategy,” they said, citing ICPSR user research with data librarians and archivists.

Another takeaway: not every problem needs AI. In one rescued dataset, 58 percent of files were duplicates, and they were identified through checksums rather than AI, a reminder that better processes or standards may be the right solution.

Doshi, ICPSR’s Director of Information Technology, and Jackson, ICPSR’s Director of Strategic Initiatives, also noted that bigger models are not always better. An internally hosted small model outperformed an OpenAI model on variable recognition and frequency generation while keeping data within ICPSR’s security boundaries, an important point for restricted data that cannot be sent to external APIs.
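For readers curious what a checksum-based duplicate check like the one mentioned above might look like, here is a minimal Python sketch. It is an illustration only, not ICPSR's actual process: the choice of SHA-256, the function names, and the way the duplicate share is counted (every copy beyond the first in a group of identical files) are all assumptions made for this example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def file_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path) -> Dict[str, List[Path]]:
    """Group files under root by checksum; keep groups with more than one file."""
    groups: Dict[str, List[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups[file_checksum(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}


def duplicate_share(root: Path) -> float:
    """Fraction of files that are redundant copies (assumed counting rule)."""
    total = sum(1 for path in root.rglob("*") if path.is_file())
    redundant = sum(len(paths) - 1 for paths in find_duplicates(root).values())
    return redundant / total if total else 0.0
```

Calling `find_duplicates(Path("some_deposit"))` on a hypothetical deposit folder would return each checksum shared by two or more files along with the matching paths. Because identical bytes always produce the same digest, no model or training data is needed for this step.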

“Experimenting is easy, operationalizing is hard,” the presenters noted, pointing to the costs of infrastructure, governance, security, and staff expertise.

Rethinking Reach in an AI Era

They closed by urging repositories to rethink impact in an AI-driven environment. “Reach = Human Visits + Machine Exposure,” the presentation stated. “When an AI answers a question using your dataset, that is usage — whether a visit is recorded or not.”

The bottom line: AI can improve discovery, curation, access, and measurement — but only when it solves real problems and reinforces the trust repositories are built to protect.