[DDI-SRG] [DDI-CDG] The dataset movie (fwd)
I-Lin Kuo
ikuo at umich.edu
Mon Jun 20 08:44:41 EDT 2005
Sorry to have taken so long to respond...
Quoting Reto Hadorn <reto.hadorn at sidos.unine.ch>:
<snip/>
>> 2. The linking concept for comparisons works well in the case that you
>> analyzed, the time and geography dimensions in the repeated cross-national
>> survey with a standard. Looking at it more closely, it works well for the
>> following reasons:
>> a. In the space dimension, there is a standard that all the studies are
>> compared to. Thus the pairwise linking concept works.
>> b. In the time dimension, there is a natural ordering via time, so no
>> standard is necessary in this case, unlike the geography. The natural time
>> ordering determines a natural direction for the linking.
>>
>> - Q: What happens in the space dimension when there is no standard? In that
>> case, what should the link source and targets be (and in what direction?),
>> or is a different comparison mechanism more appropriate, perhaps a
>> bidirectional link or some other comparison mechanism? Perhaps this is
>> Ingo's assignment?
>
> Your question refers to the so-called 'harmonization study', where
> uncoordinated studies are compared. Since there is no standard, you
> will not enter any, nor have information about the variations on the
> standard. Nor will you need them. Having defined
> - a 'harmonisation study' (corresponding to the 'program study' in the
> comparative by design case)
> - references from the single studies, which are candidates for the
> integration work, to the harmonization study (equivalent to the
> references from the country study descriptions to the program study
> description in the comparative by design case)
> - a harmonized dataset, for the case harmonization would work...
> (equivalent to the integrated dataset in the cross-national case)
> - a harmonized variable in the harmonized dataset - just a name and
> an identifier (the equivalent of a harmonized variable in the ...
> case)
> - references from the harmonized variables to the candidates for
> integration (analogue to the references used for integration in the
> ... case)
> the program should be able to confront you with all the information necessary
> - to decide on the comparability of the variables involved, on any
> level: variable, question, data collection method, sample design,
> non-response analysis, questionnaire, project summary etc. etc. and
> - to decide on the appropriate definition of the harmonized variable
> (now, you will have more than the name and the id)
> - to document fully the choices made for this harmonization operation.
>
> The rest will be done similarly as in the ... case (copy of the
> harmonized variable in the single datasets, computation on the single
> dataset level and integration into the harmonized dataset).
>
> ... Yes, you are right, you would not do that in the original
> datasets. At a minimum, the program should create in the
> harmonization study a replica of the original single datasets (id and
> reference to the original), which will store the information about
> the variables as used for integration, original or constructed. The
> construction will be stored in the reference between those variables
> and the original ones in the original datasets. In this manner, a
> harmonization study is constructed with minimal redundancy.
>
> Actually, the variables compared are not compared directly, there is
> no need for any link between them. They are compared through their
> reference to the same potentially harmonized variable.
If I understand correctly, the reply implies that when neither a standard nor
a harmonization exists for a collection of datasets (and when the relationship
between them is not that between successive waves of a longitudinal study),
there is no need to compare variables. Indeed, in the scheme which you have
outlined, that comparison cannot be constructed. In particular, that means
that for a loose collection of multi-national studies without a standard and
prior to harmonization, there is no way to document variable comparisons
between the studies. Is that correct?
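
To make my reading of the scheme concrete, here is a minimal sketch in Python.
All class and function names (Variable, HarmonizedVariable, comparable_pairs)
and the study and variable labels are hypothetical and are not taken from the
DDI. It assumes, as I read the reply, that two variables can only be related
through their references to the same harmonized variable.

```python
from dataclasses import dataclass, field

# Hypothetical names throughout; none of these come from the DDI itself.

@dataclass(frozen=True)
class Variable:
    study: str  # e.g. "StudyA"
    name: str   # e.g. "V1"

@dataclass
class HarmonizedVariable:
    name: str
    # References to the candidate variables considered for integration.
    candidates: list = field(default_factory=list)

def comparable_pairs(harmonized_vars):
    """Return the pairs of variables whose comparison can be documented.

    Two variables are related only when both are referenced as candidates
    by the same harmonized variable; there are no direct links.
    """
    pairs = set()
    for hv in harmonized_vars:
        for i, a in enumerate(hv.candidates):
            for b in hv.candidates[i + 1:]:
                pairs.add(frozenset((a, b)))
    return pairs

a = Variable("StudyA", "V1")
b = Variable("StudyB", "Q7")

# Prior to harmonization no harmonized variable exists, so no pair is found.
before = comparable_pairs([])

# Once a harmonized variable references both candidates, the pair appears.
after = comparable_pairs([HarmonizedVariable("SEX", [a, b])])
```

Under these assumptions, "before" is empty, which is the situation I am asking
about: without a harmonized variable there is nothing to hang a comparison on.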
>
>> - Q: You added additional coordinates "comp" and "integrated" on the time
>> and space dimensions. What happens when there is more than one integrated
>> or semi-integrated dataset (I think Eurobarometer produces some
>> intermediate integrated datasets before the final?). I don't think these
>> semi-integrated datasets should necessarily be given different time
>> coordinates, as then they lose their connection with the original one-time
>> datasets.
>
> If my understanding of your question and of the EB case is correct,
> this would be a case for several versions of the integrated dataset.
> If you don't think so, please define 'semi-integrated'.
Yes, you are right. I didn't think about versioning.
>
>> - Q: In slide 132 "The country datasets appear to grow like the branches of
>> a christmas tree" I understand that for each variable, there is at most one
>> link whose source is that variable. This does not seem sufficient to capture
>> all the comparison information that might be desired. For example, if there
>> is a variable V1 in T1/C3 which differs from the T1 standard but is
>> identical with the variables V1 in T2/C3, T3/C3, T4/C3, that information is
>> not captured in the implicit process which you have described, as adding
>> those links would then result in a graph which no longer resembles a
>> Christmas tree. And it seems to me you would want to capture that
>> information in order to properly synthesize a cumulative time
>> slice/longitudinal dataset for C3.
>
> In the construct I describe, the similarity between several varying
> country Q/V will appear as a similarity between the variations on the
> standard. The information is there to be discovered. This similarity
The "discovery" is the thing that I don't quite see. How do you capture or
discover the similarity between variations? Capturing, to me, would imply links
between the branches, in which case the graph is no longer a christmas tree.
Discovery, I think, is not possible, as I was trying to illustrate in the
example above.
The fundamental problem with discovery is that equality is a transitive
relation, while similarity (in the sense of comparative datasets) is not. If V1
in Country 5 at time T1 is identical to the V1 in Standard at time T1, which in
turn is identical to V1 in Standard at time T2, which in turn is identical to
V1 in Country 5 at time T2, then we may infer that V1 in Country 5 at time T1
is identical to V1 in Country 5 at time T2 by transitivity.
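
As a minimal sketch of that inference (Python; the function name
identical_closure and the variable labels are my own, purely illustrative),
"identical" links can be treated as undirected edges and the variables grouped
into connected classes. This only works because identity is transitive.

```python
from collections import defaultdict

def identical_closure(links):
    """Group variables into classes connected by 'identical' links."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    seen, classes = set(), []
    for start in graph:
        if start in seen:
            continue
        # Depth-first walk over everything reachable from `start`.
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        classes.append(group)
    return classes

# The chain described above: country -> standard -> standard -> country.
links = [
    ("V1/C5/T1", "V1/Standard/T1"),
    ("V1/Standard/T1", "V1/Standard/T2"),
    ("V1/Standard/T2", "V1/C5/T2"),
]
classes = identical_closure(links)

# True when both country variables fall into the same identity class.
same = any({"V1/C5/T1", "V1/C5/T2"} <= c for c in classes)
```

The same grouping would be wrong for links that only say "differs in the
categories", since nothing follows from two such links about their endpoints.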
However, consider the following two examples:
1) V1/C5/T1: What is your sex?
      Male Female Other
   V1/Standard/T1: What is your sex?
      Male Female
   V1/C5/T2: What is your sex?
      Male Female Other
   V1/Standard/T2: What is your sex?
      Male Female
2) V1/C5/T1: What is your sex?
      Male Female Other
   V1/Standard/T1: What is your sex?
      Male Female
   V1/C5/T2: What is your sex?
      Male Female homosexual bisexual
   V1/Standard/T2: What is your sex?
      Male Female
In both examples, the descriptions of the links in the christmas tree are the
same:
- V1/Standard/T1 and V1/Standard/T2 are identical
- V1/C5/T1 and V1/Standard/T1 are identical in question wording but different
in the categories.
- V1/C5/T2 and V1/Standard/T2 are identical in question wording but different
in the categories.
However, in example 1, V1/C5/T1 and V1/C5/T2 are identical, whereas in example
2 they differ in the categories. The identity of V1/C5/T1 and V1/C5/T2 in
example 1 cannot be inferred from the above links in the christmas tree and
would require some other descriptive information between V1/C5/T1 and V1/C5/T2,
such as a direct link.
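
The two examples can be checked mechanically. The following is only a sketch
in Python under one stated assumption: a link records which component differs
(wording, categories) but not how it differs. The names describe_link and
tree_links and the dictionary layout are hypothetical.

```python
def describe_link(source, target):
    """Describe a link by which components agree (assumed link content)."""
    return (
        "same wording" if source["question"] == target["question"]
        else "different wording",
        "same categories" if source["categories"] == target["categories"]
        else "different categories",
    )

def tree_links(example):
    """The only links present in the christmas tree for one country."""
    pairs = [
        ("Standard/T1", "Standard/T2"),
        ("C5/T1", "Standard/T1"),
        ("C5/T2", "Standard/T2"),
    ]
    return {(s, t): describe_link(example[s], example[t]) for s, t in pairs}

question = "What is your sex?"
standard = {"question": question, "categories": ("Male", "Female")}
with_other = {"question": question, "categories": ("Male", "Female", "Other")}
with_orientation = {
    "question": question,
    "categories": ("Male", "Female", "homosexual", "bisexual"),
}

example1 = {"Standard/T1": standard, "Standard/T2": standard,
            "C5/T1": with_other, "C5/T2": with_other}
example2 = {"Standard/T1": standard, "Standard/T2": standard,
            "C5/T1": with_other, "C5/T2": with_orientation}

# The tree cannot tell the two examples apart...
same_tree = tree_links(example1) == tree_links(example2)

# ...but a direct comparison between the two country variables can.
direct1 = describe_link(example1["C5/T1"], example1["C5/T2"])
direct2 = describe_link(example2["C5/T1"], example2["C5/T2"])
```

Here same_tree is True while direct1 and direct2 differ, so no procedure that
reads only the tree links can recover the C5/T1 to C5/T2 relationship.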
Another case in which this problem appears is in the comparison between the
repeated multi-national case (slide 134) and the simple longitudinal study
(slide 113). Ideally, if the comparison links are identified in a repeated
cross-national study, then if I extract a slice of that for a single country,
I should expect to be able to recreate the links between the datasets of that
slice just as if I had only started with a repeated study for a single country
and marked that up. But, in the same way as above, I don't think the links
between the countries can be constructed from the links in the christmas tree.
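
A small sketch of the slice extraction, again in Python and with assumptions
labeled: links are (source, target) pairs, time links point from the earlier
to the later dataset, and extract_slice is a hypothetical helper that keeps a
link only when both ends belong to the chosen country.

```python
def extract_slice(links, country):
    """Keep only links whose source and target both belong to `country`."""
    prefix = country + "/"
    return [(s, t) for s, t in links
            if s.startswith(prefix) and t.startswith(prefix)]

times = ["T1", "T2", "T3", "T4"]

# Christmas tree: each country dataset links to the standard of its time
# point, and the standards are linked along the time axis.
tree = [(f"C3/{t}", f"Standard/{t}") for t in times]
tree += [(f"Standard/{a}", f"Standard/{b}") for a, b in zip(times, times[1:])]

# What one would mark up for a repeated study of C3 alone.
longitudinal = [(f"C3/{a}", f"C3/{b}") for a, b in zip(times, times[1:])]

# No tree link has both ends in C3, so the extracted slice has no links.
slice_links = extract_slice(tree, "C3")
```

Under these assumptions the slice comes out with no links at all, whereas the
single-country markup has one link per consecutive pair of waves.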
<snip/>