[DDI-SRG] Comparative Data Direction
I-Lin Kuo
ikuo at umich.edu
Thu May 12 05:24:47 EDT 2005
Lately, I've been thinking about the complex files, comparison-by-design, and
comparison-not-by-design issues -- not the technical aspects but rather the
motivation behind them. We have united the latter two under the umbrella issue
of comparative data, and we have also united the complex data and comparative
data issues. On second thought, I think it would be better to regroup these
issues, because the motivations behind them and their implications are very
different.
The questions that I've been asking myself are:
A) Who wants the comparison-by-design feature?
B) Who wants the comparison-not-by-design feature?
C) Who wants the complex files feature?
D) Who wants ISO11179?
There's an implicit assumption (at least, I had that assumption) that the
answer to A-D is "social science researchers." I now believe that assumption
to be completely false. The two social science researchers that I've casually
spoken to about this -- Amy Pienta of ICPSR and Zhen Zeng of the University of
Wisconsin, Madison -- share the attitude that C is nice to have and A is OK,
but they are adamantly opposed to B and D. As I understand it, their reasons
are as follows:
A and C are nice to have because they allow a researcher to conveniently join
datasets together. However, any serious researcher would be sufficiently
proficient with statistical packages that the join would be a simple
operation, so the feature would not offer much value. Furthermore, there are
so many other decisions involved in joining datasets that not much work would
be saved. A and C would be far more useful to a journalist or a public
policy-maker, as such a person would likely not have enough technical savvy to
perform a dataset join operation. A and C would also be of value to data
libraries whose major customers are non-technical.
B and D are adamantly opposed by the researchers I talked to because they feel
these features take the decision of what is comparable out of the researchers'
control and place it in the control of the data archives/libraries. This is
regarded as a bad thing, as the archives are not qualified to make this
decision, and to allow the archives to make it is a serious violation of the
scientific integrity of the data. To allow journalists or policy makers to use
software which utilizes DDI markup for making these dataset joins and getting
statistics on the resulting joined dataset would be an even more egregious
violation of integrity. Thus, even the existence of the markup itself is a bad
thing. The standards for scientific integrity are so stringent that, for
example, even in the case of a repeated variable in a longitudinal study -- a
clear case of comparison-by-design -- researchers reserve the right to exclude
certain waves of the study if they feel that sufficient changes have occurred
to make the waves noncomparable.
I think that the above reasons for opposition to B and D have great validity,
and that they should affect our DDI activity in the following way:
Currently, we've merged the Complex Data and Comparative Data concerns because
we felt that technically they were similar, in that they both involve the
joining of data files. The Complex Data Group has had a year-old proposal in
which they tackle the technical problem of how to specify complex file joins.
Also, I have written a short paper on how to think about the technical problem
of specifying comparative data file joins in comparison-by-intent. The
Comparative Data Group is thinking about first solving the technical problem
of comparison-by-intent, and then using that as a basis to attack the problem
of comparison-not-by-intent. The implication of this approach is that,
eventually, we'd like to merge data files based on the results of a
comparison-not-by-intent.
I propose that the comparison-not-by-intent problem be completely separated
from the comparison-by-intent problem and grouped together with ISO11179 into
a new concern, "Search Enhancement". What this does is move the
comparison-not-by-intent problem out of the Comparative Data concern (which
carries an implication of joining data files) and into a Search Enhancement
concern (which does not carry an implication of joining data files). Thus, the
comparison-not-by-intent problem would result in a specification which could
only be used to enhance searches, not to join datasets. I believe that this
would address the heart of the researchers' objections to placing the decision
to join datasets in the hands of those who have not been properly trained.
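As a rough illustration of this separation, here is a minimal sketch. It is
not part of any DDI proposal: the VariableLink record, the two kind labels,
and both function names are hypothetical and exist only for this example. It
shows links of either kind being followed to enhance a search, while only
comparison-by-intent links are ever offered as candidates for a join.

```python
from dataclasses import dataclass

# Hypothetical link kinds; these labels are illustrative, not DDI markup.
BY_INTENT = "comparison-by-intent"
NOT_BY_INTENT = "comparison-not-by-intent"


@dataclass(frozen=True)
class VariableLink:
    """A hypothetical record relating two variables from different studies."""
    left: str   # e.g. "studyA.age"
    right: str  # e.g. "studyC.resp_age"
    kind: str   # BY_INTENT or NOT_BY_INTENT


def related_variables(links, variable):
    """Search enhancement: follow links of either kind to suggest variables."""
    found = []
    for link in links:
        if link.left == variable:
            found.append(link.right)
        elif link.right == variable:
            found.append(link.left)
    return found


def joinable_pairs(links):
    """Joining: only comparison-by-intent links are ever eligible."""
    return [(link.left, link.right) for link in links if link.kind == BY_INTENT]
```

Under these assumptions, a not-by-intent link can surface a related variable
in a search, but it never appears in the output of joinable_pairs, so the
decision to join stays with the researcher.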
This is an issue of great concern to me, and I would appreciate any feedback
on it, especially from those of you who are also on the Comparative Data
Group.
Thank you.
--
I-Lin Kuo
Programmer/Analyst II
ICPSR