From ikuoikuo at gmail.com  Sun Jul  3 13:01:17 2005
From: ikuoikuo at gmail.com (I-Lin Kuo)
Date: Mon Jul  4 04:37:04 2005
Subject: [DDI-SRG] [DDI-CDG] The dataset movie (fwd)
In-Reply-To: <6.0.3.0.2.20050628141432.01f374d8@webmail.unine.ch>
References: <20050611092232.iay4q1zusgow4w4g@icpsr.mail.umich.edu>
    <6.0.3.0.2.20050613144937.01ed5310@webmail.unine.ch>
    <20050620084441.ermzp0nk2gc0kwgw@icpsr.mail.umich.edu>
    <6.0.3.0.2.20050628141432.01f374d8@webmail.unine.ch>
Message-ID: <85ef306905070310011a29a532@mail.gmail.com>

On 6/28/05, Reto Hadorn wrote:
> At 20.06.2005, you wrote:
> > Sorry to have taken so long to respond...
> My turn...
>
> My impression is that you have the same understanding of the model
> under discussion as I, i.e. of its possibilities and its limits. Let
> me yet comment on 2-3 issues and then try to find out what separates
> us.
>
> 2. The linking concept for comparisons works well in the case that
> you analyzed, the time and geography dimensions in the repeated
> cross-national survey with a standard. Looking at it more closely, it
> works well for the following reasons:
> a. In the space dimension, there is a standard that all the studies
>    are compared to. Thus the pairwise linking concept works.
> b. In the time dimension, there is a natural ordering via time, so no
>    standard is necessary in this case, unlike the geography. The
>    natural time ordering determines a natural direction for the
>    linking.
>
> - Q: What happens in the space dimension when there is no standard?
> In that case, what should the link source and targets be (and in what
> direction?), or is a different comparison mechanism more appropriate,
> perhaps a bidirectional link or some other comparison mechanism?
> Perhaps this is Ingo's assignment?
>
> Your question refers to the so-called 'harmonization study', where
> uncoordinated studies are compared. Since there is no standard, you
> will not enter any nor have information about the variations on the
> standard. Nor will you need them. Having defined
> - a 'harmonization study' (corresponding to the 'program study' in
>   the comparative by design case)
> - references from the single studies, which are candidates for the
>   integration work, to the harmonization study (equivalent to the
>   references from the country study descriptions to the program study
>   description in the comparative by design case)
> - a harmonized dataset, for the case harmonization would work...
>   (equivalent to the integrated dataset in the cross-national case)
> - a harmonized variable in the harmonized dataset - just a name and
>   an identifier (the equivalent of a harmonized variable in the ...
>   case)
> - references from the harmonized variables to the candidates for
>   integration (analogue to the references used for integration in the
>   ... case)
> the program should be able to confront you with all the information
> necessary
> - to decide on the comparability of the variables involved, on any
>   level: variable, question, data collection method, sample design,
>   non-response analysis, questionnaire, project summary etc. etc. and
> - to decide on the appropriate definition of the harmonized variable
>   (now, you will have more than the name and the id)
> - to document fully the choices made for this harmonization
>   operation.
>
> The rest will be done similarly as in the ... case (copy of the
> harmonized variable in the single datasets, computation on the single
> dataset level and integration into the harmonized dataset).
>
> ... Yes, you are right, you would not do that in the original
> datasets. At a minimum, the program should create in the
> harmonization study a replica of the original single datasets (id and
> reference to the original), which will store the information about
> the variables as used for integration, original or constructed. The
> construction will be stored in the reference between those variables
> and the original ones in the original datasets. In this manner, a
> harmonization study is constructed with minimal redundancy.
>
> Actually, the variables compared are not compared directly, there is
> no need for any link between them. They are compared through their
> reference to the same potentially harmonized variable.
>
> If I understand correctly, the reply implies that when neither a
> standard nor a harmonization exists for a collection of datasets (and
> when the relationship between them is not that between successive
> waves of a longitudinal study), there is not a need for comparison of
> variables. Indeed, in the scheme which you have outlined, that
> comparison is not possible to construct. In particular, that means
> for a loose collection of multi-national studies without a standard
> and prior to harmonization, there is no way to document variable
> comparisons between the studies. Is that correct?
>
> Your understanding of the model is correct; usage of the model is
> what we can discuss further.

I am trying to understand the model, but I am also trying to
understand the comparison problem itself. As there is no definition of
what the comparison problem is and because I don't actually work with
the data, I have to try to understand the problem by analyzing your
proposed solution -- the model -- and then going back and asking
whether the limitations and potentialities of your model are the same
as the restrictions and intent of the original unstated comparison
problem. In addition, even if the model is a complete solution to the
comparison problem, there are still questions about how it fits into
the overall lifecycle that should be answered.

> You write: "... there is not a need for comparison of variables" : of
> course, there is a need, and you express it very well. The issue is:
> how?
>
> Two objects cannot be compared in abstracto. To compare two
> variables, you need a point of view. Depending on the point of view,
> you will not take the same decision on comparability.

This is a critical principle that was assumed but not stated until now
-- a comparison needs a point of view or context. Context-free
pairwise comparisons are not allowed. Is this something the DDI-CDG is
in agreement with? For example, Pascal had mentioned in Edinburgh a
ranking system for comparisons, wanting a hands-free computer system
for making comparable variable decisions. What is the point of view or
context in this case? What are the possible types of context allowed
in comparative data? Standardization context, harmonization, and time
have already been discussed.

> In the comparative study by design, you have two such points of view:
> the standard definition and the integrated dataset (actually: the
> intention behind the standard definition and the assumptions, which
> have governed the integration process).
>
> The so-called harmonization study is also such a point of view: you
> take the variables as they are, you drive them through your
> comparison machine (comparing all levels of metadata available), take
> measures (like harmonization) and make a decision on the resulting
> comparability. Comparability is not just a formal matter, it is the
> outcome of an intellectual process, intellectual work.
>
> Contrary to what you write: "comparison is not possible to
> construct", I underscore that comparison is a construct which
> deserves best documentation.
>
> When you write: "there is no way to document variable comparisons
> between the studies", I would write: comparison of two variables from
> distinct studies can be made in the frame of a (comparison) project
> (or study), where later users of the information find a statement
> about the scope of the comparison, which allows understanding the
> technical solutions used.

I should make myself clearer. What I meant to say is that within the
Christmas tree model which you proposed, there is no way to make a
direct comparison between variables of two separate waves of a single
country study. Both can only be compared with the standard for their
respective waves, and then the two standards compared with each other.
What I need to understand is whether this is a limitation of the model
you proposed or whether this is limited by intent. If it is the
former, then perhaps we should spend a little effort in improving the
model or finding an alternative. If it is the latter, then the
additional effort is not necessary. So far, from what I've gleaned
from the discussion, I'm slightly inclined to think it is the latter,
that within the context of a repeated cross-national study, there is
no need or no meaning to a direct comparison between variables which
does not involve comparison to a standard or an integrated dataset.

> I see a problem in your conception of variable comparison, which
> seems to be something abstract and formal, whereby the rationale for
> comparison, the point of view of the one who compares are scotomized.

You are correct in your observations about my line of questioning,
which does de-emphasize the context of the comparison. I recall that
Pascal had also emphasized that the authority behind the comparison
was important, and in your responses, you seem to indicate that the
authority/context is essential in a comparison. I take this approach
because in the back of my mind, I am thinking of other use cases and
other parts of the lifecycle which might make use of a comparison
mechanism, and I am wondering if perhaps the imposition of a context
unnecessarily constrains the potential of other uses in other parts of
the lifecycle.

Indeed, let me ask if even the requirement of a context might be too
much even in the case of a single repeated study which is not
cross-national. Allow me to explain. There does seem to me in your
model a different treatment of the time and geography dimensions. To
me this is somewhat problematic for the following reason.

First, for the repeated study in your model, a variable is to be
explicitly linked/compared to the latest previous variable to which it
is not identical, or to the original. In the example below, there is a
link from
  V1@W2    to V1@W1,
  V1@W3    to V1@W1,
  V1'@W4   to V1@W3,
  V1'@W5   to V1@W3,
  V1''@W6  to V1'@W4,
  V1'''@W7 to V1''@W6.

(Please view diagram below in a monospaced font such as Courier)

Wave 1   Wave 2   Wave 3   Wave 4   Wave 5   Wave 6   Wave 7
V1       V1       V1       V1'      V1'      V1''     V1'''

[The arrows of the original diagram were lost in the archive; they
correspond to the six links listed above.]

It is implicit in the model that if there are intermediate waves
between a variable in one wave and the variable it links to in a
previous wave then the corresponding variables in the intermediate
waves are identical to the latter variable. This is done in order to
reduce the total number of links.

The problem arises when such a collection of studies as the above
arises as a country slice out of a repeated cross-national study. This
is the scenario in your slides RCNS_Movie_ILINKUO.ppt and which you
also mention later in your message. In that case, the majority (4/6)
of the link markup above could not be "inherited" from the repeated
cross-national study. They would have to be built by the national
archive which houses the slice. Furthermore, once the link markup was
done, the information could not be integrated back into the metadata
of the original repeated cross-national study. So there is an
incompatibility between these aspects of the model. If there were a
cumulation constructed, I would like both the cumulation and the links
to the cumulation to be able to be imported back into the metadata of
the original.

On the other hand, if a time slice (i.e. a single wave) of the
cross-national repeated study were taken from the larger collection,
all the relevant variable linking could be "inherited".
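
As a minimal sketch of the seven-wave example above, the six explicit
links can be held as a source-to-target mapping and followed back to
Wave 1. The names LINKS and lineage are hypothetical, chosen only for
illustration; they are not part of the model or of DDI, and the sketch
assumes each variable carries at most one outgoing link.

```python
# Hypothetical illustration: the six explicit links of the example,
# stored as source -> target. Each key and value is a pair of
# (variable version, wave number).
LINKS = {
    ("V1", 2): ("V1", 1),
    ("V1", 3): ("V1", 1),
    ("V1'", 4): ("V1", 3),
    ("V1'", 5): ("V1", 3),
    ("V1''", 6): ("V1'", 4),
    ("V1'''", 7): ("V1''", 6),
}


def lineage(variable, links=LINKS):
    """Follow explicit links from a variable back to one with no link."""
    chain = [variable]
    seen = {variable}
    while chain[-1] in links:
        target = links[chain[-1]]
        if target in seen:
            # Guard against malformed markup that loops back on itself.
            raise ValueError(f"cyclic link at {target}")
        chain.append(target)
        seen.add(target)
    return chain


if __name__ == "__main__":
    # Print the chain of links starting from the Wave 7 variable.
    for version, wave in lineage(("V1'''", 7)):
        print(f"{version}@W{wave}")
```

Following the links from V1'''@W7 passes through Waves 6, 4, 3 and 1.
Waves 2 and 5 are not on the chain, since no later variable links to
them.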
Similarly, if such a subcollection were enriched with additional
metadata regarding comparisons, the metadata could be easily
re-integrated back into the original. And it seems to me that the
inequality between time and geography/nation as dimensions comes from
the requirement of comparison context -- the repeated cross-national
study and the single wave national study have compatible contexts,
while the repeated cross-national study and the repeated single nation
study do not. Thus, it raises the question of whether a more abstract
approach to linking, one which "scotomizes" the point of view, might
not have these problems.

Also, let me note that in your model there are no links between
variables within a single dataset. This kind of linking might be
desired for a single nation multi-language study, which would have
only one dataset. One might want to document and link the different
language variants of a variable question.

> Here again, the difference between us seems to be in the role granted
> to the constraints of work organization and funding. In abstracto,
> you are fully right and I am pleased to tell you that I planned to
> build all possible relationships into the database.
>
> In a second stage, I thought about who considers which kind of
> variable series and which comparison work has a chance to be funded.
> This is where I came to that famous fir (well... we have to choose
> between being understood, using Christmas tree, or culturally less
> marked). I think about how the coordinators build their series of
> standards over time (the stem) and coordinate the data collection for
> each wave (the branches). The integration can actually be seen as the
> sap turning back from the branches to the stem and down to the
> roots... Will they construct the series of variant variables over
> time for each participating country? NO. They will be very happy if
> they find the resources to build some cumulative file based on the
> integrated datasets. So actually, you are just re-activating my first
> dream but, sorry for that, I got deadly realistic.

While funding realism may force a limitation on how much markup we may
currently do, it should not limit what markup we may do in the future
should the funding somehow arise. For a standards body, the latter
concern trumps the former. My reservations about the model are that it
limits what is possible, and I would like to spend a little time
examining alternatives without this limitation. I'd also like to see
the way in which you would plan to build the relationships into the
database.

> Realism forces me yet to consider that some country may find an
> interest to build that time-compound dataset and even the cumulated
> file for its much varying variables. I tend to consider this as a
> specific study, which is independent from the overall comparative
> program (yet refers to it). Please look at the slide in the
> attachment (numbers change...). In addition, this cumulation will
> probably be made on another system than the integration and
> cumulation of the standard/harmonized data, so we will have some
> tricky problems to solve about the identifiers of the database
> elements, if we want the work done in country C3 to be added at some
> time to the overall comparative program.

From the point of view of the DDI standard, this is a very important
question, and should be addressed earlier rather than later. Choosing
one alternative without addressing this question may render a later
solution impossible.

> In the construct I describe, the similarity between several varying
> country Q/V will appear as a similarity between the variations on the
> standard. The information is there to be discovered. This similarity
> may play a role
> a) while defining the candidate harmonized variables
> b) while defining the computations on dataset level - maybe the code
>    defined for one can be used for the others of the group?

The group concept itself may need to undergo some revision in order to
allow this. That is open.

> The question to answer first concerns the most appropriate way to
> define that special kind of group:
> - ... is it really necessary to store that commonality in the DB?
> Let's suppose the answer is yes:
> - should this similarity be stored as links between those variables
>   (exponential growth of the number of links)

The growth factor might possibly be mitigated through a more clever
linking mechanism. For example, in the repeated single nation study of
your model, there is the implicit similarity linking of variables in
the intermediate waves between the source and target wave which I had
noted earlier. Such mechanisms mean that not all links need to be
explicit.

> - should this be stored as a kind of group?
> - should the group be defined on the Q/V themselves or on the links
>   to the standard, which show the similarity?
> The answer depends upon what helps best for a and b above. More
> analysis is needed here.

Agreed.

>> The "discovery" is the thing that I don't quite see. How do you
>> capture or discover the similarity between variations? Capturing, to
>> me, would imply links between the branches, in which case the graph
>> is no longer a Christmas tree. Discovery, I think, is not possible,
>> as I was trying to illustrate in the example above.
> Since you have entered in the database information on the type of
> variation on the standard, you will be able to identify all variables
> with the same type of variation and request for them a report on the
> varying part, so you will soon discover which ones are really
> identical. Would you then just compare that subgroup of countries
> because the value structure (for example) is strictly identical? I
> don't think that analysis follows such ways. You

If I understand it correctly, ISO11179 seems to suggest such
comparisons are desirable.
While I don't agree, ISO11179 nonetheless seems to be a loud voice of
funding and cannot be completely ignored.

> have yet the possibility of making some analyses while selecting just
> those countries. In my view, this kind of similarity does not have to
> be further documented. I consider it a bit of a fantasy to have a
> database which would tell you all that is directly comparable.
>
> I fully agree with you on this and the example you give. As I
> discovered this property of the model (a model built on a work
> methodology, constructing the country metadata on the standard
> metadata), I was disappointed, too. But, there again I am not sure
> that this is a problem. It is only a problem if you consider that a
> user should instantly know anything of everything, which could be
> compared across the database. This is not the kind of comparative
> research I would personally encourage.

Again, I think ISO11179 may have a different conception. Judging by
what Pascal said at the meeting, I think he has a point of view closer
to ISO11179 than yours. On the other hand, I think that Roger Jowell
would agree with you for the reason that this kind of comparative
research is extremely lacking in rigor.

> You are fully right, and this is why you find a specific study for
> that on the slides attached to this e-mail (the green frame). You
> actually cannot deduce everything from anything... ;-)

From reto.hadorn at sidos.unine.ch  Wed Jul  6 13:10:19 2005
From: reto.hadorn at sidos.unine.ch (Reto Hadorn)
Date: Wed Jul  6 13:11:48 2005
Subject: [DDI-SRG] [DDI-CDG] The dataset movie (fwd)
Message-ID: <6.1.2.0.2.20050706190930.020a5ad8@webmail.unine.ch>

Skipped content of type multipart/alternative
-------------- next part --------------
A non-text attachment was scrubbed...
Name: RCNS_Text_050706.doc
Type: application/msword
Size: 387584 bytes
Desc: not available
Url : http://www.icpsr.umich.edu/pipermail/ddi-cdg/attachments/20050706/8867f3a7/RCNS_Text_050706-0001.doc
-------------- next part --------------
A non-text attachment was scrubbed...
Name: RCNS_Text_050706_050411.doc
Type: application/msword
Size: 458752 bytes
Desc: not available
Url : http://www.icpsr.umich.edu/pipermail/ddi-cdg/attachments/20050706/8867f3a7/RCNS_Text_050706_050411-0001.doc