[DDI-SRG] inline NCube: open issues and improvements

Joachim Wackerow joachim.wackerow at gesis.org
Mon Jan 21 13:58:34 EST 2008


Wendy,

The approach is interesting. I am trying to understand it by writing some XML fragments:

<l:CategoryScheme>
   <l:Identifier>
     <r:ID>ISCO_plus_total</r:ID>
   </l:Identifier>
   <l:Category>
     <l:Identifier>
       <r:ID>ISCO_total</r:ID>
     </l:Identifier>
     <r:Label>[Total]</r:Label>
   </l:Category>
</l:CategoryScheme>

<l:CodeScheme>
   ...
   <l:CategorySchemeReference>
     <r:ID>ISCO_plus_total</r:ID>
   </l:CategorySchemeReference>
   <l:HierarchyType>Regular</l:HierarchyType>
   <l:Level levelNumber="1">
     <l:Name>Total over all ISCO codes</l:Name>
     <l:RelationshipType>Nominal</l:RelationshipType>
   </l:Level>
   <l:Level levelNumber="2">
     <l:Name>Major group of ISCO</l:Name>
     <l:RelationshipType>Nominal</l:RelationshipType>
   </l:Level>
   <l:Level levelNumber="3">
     <l:Name>Sub-major group of ISCO</l:Name>
     <l:RelationshipType>Nominal</l:RelationshipType>
   </l:Level>
   <l:Level levelNumber="4">
     <l:Name>Minor group of ISCO</l:Name>
     <l:RelationshipType>Nominal</l:RelationshipType>
   </l:Level>
   <l:Level levelNumber="5">
     <l:Name>Unit group of ISCO</l:Name>
     <l:RelationshipType>Nominal</l:RelationshipType>
   </l:Level>
   ...
   <l:Code levelNumber="1" isDiscrete="false">
     <l:CategoryReference>
       <r:ID>ISCO_total</r:ID>
     </l:CategoryReference>
     <l:Value>10000</l:Value> <!-- does not exist in ISCO -->
   </l:Code>
   ...
   <!-- Repetition of all ISCO codes from the original levels 1-4, with
        references to the original ISCO category scheme but with changed
        level numbers (for example 2 instead of 1) -->
</l:CodeScheme>

Did you mean to go this way? Where is inclusion possible? Neither the
repetition of the ISCO codes nor the changed level numbers are nice.
Ideally it should be possible to include one coding scheme in another and
have it integrated into the hierarchy automatically, but it isn't. Did I
miss something?
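To illustrate the kind of inclusion meant here, a minimal Python sketch follows. It is only an illustration of the idea, not something the schema offers: the class Code, the function include_under_total and the category IDs ISCO_1, ISCO_11 and ISCO_111 are hypothetical names. The included scheme is left untouched; the level numbers are shifted automatically when it is placed below the new total.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Code:
    value: str                    # code value, e.g. "11"
    level: int                    # level number within the hierarchy
    category_id: str              # reference to a category (hypothetical IDs)
    parent: Optional[str] = None  # value of the parent code, None at the top


def include_under_total(codes: List[Code], total_value: str,
                        total_category_id: str) -> List[Code]:
    """Put one total code on level 1 and push every included code one level down."""
    total = Code(total_value, 1, total_category_id)
    shifted = [
        Code(c.value, c.level + 1, c.category_id,
             # Former top-level codes become children of the total.
             c.parent if c.parent is not None else total_value)
        for c in codes
    ]
    return [total] + shifted


# Three codes standing in for the original scheme (levels 1 to 3).
isco = [
    Code("1", 1, "ISCO_1"),
    Code("11", 2, "ISCO_11", parent="1"),
    Code("111", 3, "ISCO_111", parent="11"),
]
combined = include_under_total(isco, "10000", "ISCO_total")
for c in combined:
    print(c.level, c.value, c.parent)
```

With such a mechanism neither the codes nor their levels would have to be repeated by hand in the new CodeScheme.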

Further question below.


Wendy Thomas wrote:
> re: Code Schemes with added totals
> You can create a new CodeScheme that uses the ISCO Category Scheme plus 
> additional categories from outside the ISCO structure. These can be 
> appropriately identified by level and inclusion.
> 
> Wendy
> 
> 
> On Fri, 18 Jan 2008, Wendy Thomas wrote:
> 
>> Achim
>>
>> In reference to Add... I was talking my way through a process and
>> discovered that the use of this element was only for local overrides in
>> inherited information. In terms of CodeScheme I can reference an existing
>> CodeScheme (say, ISCO maintained by ISCO) and add additional locally
>> described categories (additional totals, etc.). The problem here is that I
>> cannot create the nesting because I can't change the original ISCO
>> structure (I can easily add missing value categories or something that
>> does not affect the hierarchical structure, but not totals). In this case
>> totals would end up being described in a separate NCube with a generation
>> description.
>>
>> As for adding an "ad-hoc" binary to the codeScheme you once again run into
>> the problem that this is a data dependent rather than a code dependent
>> designation. You may generate it from cell contents, but I can take that
>> same table structure and fill it with collected information.
I'm not sure I understand this correctly, but the main question
remains: how to differentiate between codes that are provided with the
data or table and codes that are just invented to describe cells for
which no codes are provided?

>>
>> You may want to explore attributes as a means of making this
>> identification.
Yes, I'll do that; perhaps this is a solution.

Achim

>>
>> Wendy
>>
>>
>> On Fri, 18 Jan 2008, Joachim Wackerow wrote:
>>
>>> Wendy,
>>>
>>> This is regarding codes which are just invented for the coding scheme
>>> (like total), but do not exist in a table.
>>>
>>> On Wednesday you mentioned the element 'Add' in identification. I think
>>> this really has different semantics, and I'm not sure it would be
>>> appropriate to use. In any case, with CR2 it wouldn't be possible on the
>>> code level because codes are not identifiable in CR2.
>>>
>>> The use case would be a table where all categories of the table
>>> dimensions have category labels and codes, but no codes exist for the
>>> margin sums, the totals. For the markup the total must have a code. For
>>> some applications it would be useful to know whether the code already
>>> existed or was just invented. With the current solution it is not
>>> possible to tell the difference. For an (ad-hoc coded) total such an
>>> indicator would be necessary on the code level, not on the code scheme
>>> level.
>>>
>>> I think the issue is not really solvable by changing the ownership or
>>> version of a code scheme (we talked about that related to ISCO).
>>>
>>> I would suggest an attribute for 'Code' which indicates whether the code
>>> is just invented for the markup or already exists for the data/table.
>>> A good term must be found for that (ad-hoc="true" doesn't seem to be a
>>> choice).
>>>
>>> What do you think?
>>>
>>> Achim
>>>
>>> Joachim Wackerow wrote:
>>>> Wendy,
>>>>
>>>> In terms of better machine-actionability, here are my notes on the open
>>>> issues.
>>>>
>>>> Regarding the weighted measure, should we decide to use the approach in
>>>> the example below?
>>>>
>>>> Joachim Wackerow wrote:
>>>>> Currently WeightVariableReference is not repeatable. So two weights are
>>>>> possible, the standard one and an additional one, but not more.
>>>>> Therefore WeightVariableReference should be repeatable.
>>>>>
>>>>> WeightVariableReference should also be added to Measure in
>>>>> NCubeLogicalProduct, but only as 0..1.
>>>>>
>>>>> Example
>>>>>
>>>>>         <!-- unweighted frequency -->
>>>>>         <l:Measure>
>>>>>           <l:Identifier>
>>>>>             <r:ID>Measure1</r:ID>
>>>>>           </l:Identifier>
>>>>>           <l:VariableReference>
>>>>>             <r:ID>Frequency</r:ID>
>>>>>           </l:VariableReference>
>>>>>         </l:Measure>
>>>>>         <!-- weighted frequency by w1 -->
>>>>>         <l:Measure>
>>>>>           <l:Identifier>
>>>>>             <r:ID>Measure2</r:ID>
>>>>>           </l:Identifier>
>>>>>           <l:VariableReference>
>>>>>             <r:ID>Frequency</r:ID>
>>>>>           </l:VariableReference>
>>>>>           <WeightVariableReference>
>>>>>             <r:ID>Frequency_w1</r:ID>
>>>>>           </WeightVariableReference>
>>>>>         </l:Measure>
>>>>>         <!-- weighted frequency by w2 -->
>>>>>         <l:Measure>
>>>>>           <l:Identifier>
>>>>>             <r:ID>Measure3</r:ID>
>>>>>           </l:Identifier>
>>>>>           <l:VariableReference>
>>>>>             <r:ID>Frequency</r:ID>
>>>>>           </l:VariableReference>
>>>>>           <WeightVariableReference>
>>>>>             <r:ID>Frequency_w2</r:ID>
>>>>>           </WeightVariableReference>
>>>>>         </l:Measure>
>>>>>         <!-- combined weighted frequency by w1 and w2 -->
>>>>>         <l:Measure>
>>>>>           <l:Identifier>
>>>>>             <r:ID>Measure4</r:ID>
>>>>>           </l:Identifier>
>>>>>           <l:VariableReference>
>>>>>             <r:ID>Frequency</r:ID>
>>>>>           </l:VariableReference>
>>>>>           <WeightVariableReference>
>>>>>             <r:ID>Frequency_w1</r:ID>
>>>>>           </WeightVariableReference>
>>>>>           <WeightVariableReference>
>>>>>             <r:ID>Frequency_w2</r:ID>
>>>>>           </WeightVariableReference>
>>>>>         </l:Measure>
>>>>>
>>>>> The weight variables Frequency_w1 and Frequency_w2 may or may not be
>>>>> listed at the variable Frequency. As weights can usually be applied to a
>>>>> large range of variables, they are sometimes not listed at each variable.
>>>>>
>>>>> Weights
>>>>> When weighting is done by a derived variable, an application cannot
>>>>> clearly recognize the relationship between the variable Frequency and
>>>>> the weight. With a solution like in the example above it would be clear.
>>>>>
>>>>
>>>> Regarding aggregated measure I would suggest introducing in Measure a
>>>> reference to the dimension which represents the dependent variable for
>>>> the aggregation method (DependentDimensionRank or
>>>> RankOfDependentDimension).
>>>>
>>>> in NCubeLogicalProduct/NCube
>>>>
>>>> <l:Measure>
>>>>    <l:Identifier>
>>>>      <r:ID>Measure1</r:ID>
>>>>    </l:Identifier>
>>>>    <l:VariableReference>
>>>>      <r:ID>Measure_Frequency</r:ID>
>>>>    </l:VariableReference>
>>>>    <l:DependentDimensionRank>1</l:DependentDimensionRank>
>>>> </l:Measure>
>>>>
>>>> The measure expresses aggregation values over all codes of the dependent
>>>> variable by all codes of the other dimension (two-dimensional case) or
>>>> by all combinations of all codes of the other dimensions
>>>> (multi-dimensional case).
>>>>
>>>>> Aggregated measure
>>>>> Same problem with the aggregation method in representation. The
>>>>> aggregation method can be stated, but not the dependent variable over
>>>>> which the aggregation took place. This is only possible in some variable
>>>>> definition, but not as a machine-actionable field. That was the reason I
>>>>> was thinking about an indicator for the dimension (dependent="true").
>>>>> The use case is:
>>>>> two dimensions: satisfaction by salary (1..5), age group (1..4)
>>>>> measure: mean
>>>>> Is the measure the mean over all categories of salary (dependent
>>>>> variable) for every category of age group, or vice versa? The
>>>>> application cannot recognize it.
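The ambiguity can be made concrete with a small Python sketch. It assumes a two-dimensional table of cell counts and a rank (1 or 2) naming the dependent dimension, in the spirit of the proposed DependentDimensionRank; the function name mean_of_dependent and the counts are made up for illustration. The same table yields different means depending on which dimension is treated as dependent.

```python
def mean_of_dependent(counts, dependent_rank):
    """Mean code value of the dependent dimension for each code of the other one.

    counts[i][j] is the frequency of code i+1 of dimension 1 combined with
    code j+1 of dimension 2. dependent_rank (1 or 2) names the dimension
    whose code values are averaged.
    """
    if dependent_rank not in (1, 2):
        raise ValueError("dependent_rank must be 1 or 2")
    # Arrange the table so that each row belongs to one code of the other
    # dimension and the columns run over the codes of the dependent dimension.
    rows = [list(col) for col in zip(*counts)] if dependent_rank == 1 else counts
    means = []
    for row in rows:
        n = sum(row)
        weighted_sum = sum((k + 1) * f for k, f in enumerate(row))
        means.append(weighted_sum / n if n else None)  # None for an empty row
    return means


# Made-up 2 x 2 counts: dimension 1 has codes 1..2, dimension 2 has codes 1..2.
counts = [[2, 0],
          [2, 4]]
print(mean_of_dependent(counts, 1))  # dimension 1 is the dependent variable
print(mean_of_dependent(counts, 2))  # dimension 2 is the dependent variable
```

Without a machine-actionable statement of the dependent dimension, an application has no way of choosing between the two results.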
>>>>
>>>>
>>>> Regarding total I think it works out well if two things exist:
>>>> 1. According to the discussion in the conference call an indicator will
>>>> be introduced in physical which states whether the ncube data is clean
>>>> or not. Clean means every cell of the logical definition of the ncube
>>>> has a value.
>>>>
>>>> 2. It would be necessary to indicate in a hierarchical CodeScheme whether
>>>> a Code on a higher level comprises all codes on the level below or not.
>>>> If yes, that would be "clean" in the relationship between the upper-level
>>>> code and the included codes on the level below. I would suggest an
>>>> attribute for Code: comprehensive=" false | true (Default) ". This way
>>>> the relationships can be described at a detailed granularity. Or would
>>>> it be sufficient to have this indicator at the description of Level?
>>>>
>>>> The already existing HierarchyType ( Regular | Irregular ) seems to be
>>>> another thing:
>>>> "Identifies the type of hierarchy used in the nesting of categories
>>>> within the code scheme. Possible values are Regular and Irregular. A
>>>> regular nesting indicates that the category hierarchy is consistent to
>>>> the lowest levels of the hierarchy, i.e. the lowest levels of the
>>>> hierarchy are at the same level for every branch on the hierarchy."
>>>>
>>>>> Total
>>>>> Same problem exists at the code definition for total. For an application
>>>>> it doesn't seem to be possible to recognize that the total is really a
>>>>> total.
>>>>>
>>>>> Looking at the furniture example, the program can only learn that this is
>>>>> a non-discrete value on the first level. Then an assumption can be made:
>>>>> if no other non-discrete value exists on the first level, it should
>>>>> be a total. This seems to be quite complicated. That was the reason I
>>>>> would prefer an indicator for total at the code level. Totals are very
>>>>> special codes.
>>>>>
>>>>> An alternative with the existing possibilities would be to have a
>>>>> hierarchical code scheme with just one top-level code. This must be the
>>>>> total. The assumption for this to work correctly would be that all
>>>>> existing codes of the data are really mentioned in the coding scheme
>>>>> below the total. Is this the solution?
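The inference described above can be sketched in a few lines of Python. This only shows what an application would have to guess today; the function name find_total and the code values are hypothetical, and each code is reduced to a tuple (value, level number, isDiscrete).

```python
def find_total(codes):
    """Return the value of the presumed total, or None if it cannot be inferred.

    codes is an iterable of (value, level_number, is_discrete) tuples.
    The guess: a code is the total only if it is the single non-discrete
    code on level 1.
    """
    candidates = [value for value, level, is_discrete in codes
                  if level == 1 and not is_discrete]
    return candidates[0] if len(candidates) == 1 else None


# Made-up scheme with one non-discrete top-level code.
scheme = [("0", 1, False), ("1", 2, True), ("2", 2, True)]
print(find_total(scheme))

# With a second non-discrete code on level 1 the guess is no longer possible.
ambiguous = scheme + [("9", 1, False)]
print(find_total(ambiguous))
```

An explicit indicator on the code would make this guesswork unnecessary.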
>>>> ...
>>>>
>>>>
>>>>>> On Wed, 16 Jan 2008, Joachim Wackerow wrote:
>>>>>>
>>>>>>> Wendy,
>>>>>>>
>>>>>>> I would propose to make WeightVariableReference repeatable (in
>>>>>>> /LogicalProduct/VariableScheme/Variable/Representation).
>>>>>>> This way multiple weights can be associated with one variable.
>>>>>>> Applying weights to a variable depends on the research
>>>>>>> question. For example a study can have multiple weights to adjust the
>>>>>>> number of respondents regarding education. This can have several
>>>>>>> different reasons: education system, household types, geographical
>>>>>>> region. Usually there is indeed a default weight, which can be a single
>>>>>>> weight or a combined weight. Sometimes the researcher would like to
>>>>>>> use a specific weight which is appropriate for the research question,
>>>>>>> or not to weight at all.
>>>>>>>
>>>>>>> Regarding ncubes, two measures must be defined for the case with an
>>>>>>> unweighted count of one dimension and a weighted count of the same
>>>>>>> dimension. The weights which are mentioned in the variable
>>>>>>> description can be seen as optional weights, not as mandatory ones.
>>>>>>> So an additional WeightVariableReference must be added to the measure
>>>>>>> definition.
>>>>>>>
>>>>>>> Then the question arises whether that should be repeatable or not.
>>>>>>> Usually just one weight is applied to one measure, but it would also be
>>>>>>> possible to apply several weights. By definition a weight is combined
>>>>>>> with the unweighted measure count by multiplication. Multiple weights
>>>>>>> get combined the same way, by multiplication. For measures other than
>>>>>>> count, the weighted count gets computed first and then the aggregation
>>>>>>> method, like mean or standard deviation, is applied. Multiple
>>>>>>> weights can be defined by a repeatable WeightVariableReference. This
>>>>>>> would be straightforward, like in Variable/Representation. Another way
>>>>>>> would be to define a virtual derived variable which represents the
>>>>>>> combination of the weights. But that would be a longer way. I would
>>>>>>> opt for the repeatable WeightVariableReference in measure and a clear
>>>>>>> documentation of what it means.
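The combination rule described above can be written down as a short Python sketch. It assumes case-level records held as dictionaries; the keys value, w1 and w2 and the function names are made up for illustration. Several weights are multiplied per record, the weighted count is the sum of these products, and the mean is computed on top of the weighted count.

```python
from math import prod


def cell_weight(record, weight_names):
    """Product of the named weights for one record; 1.0 when no weight is given."""
    return prod((record[name] for name in weight_names), start=1.0)


def weighted_count(records, weight_names=()):
    """Weighted number of cases: the sum of the combined weights."""
    return sum(cell_weight(r, weight_names) for r in records)


def weighted_mean(records, value_name, weight_names=()):
    """Mean of value_name, each case counted with its combined weight."""
    total = weighted_count(records, weight_names)
    return sum(r[value_name] * cell_weight(r, weight_names) for r in records) / total


# Two made-up cases with two weights each.
records = [
    {"value": 2, "w1": 0.5, "w2": 2.0},
    {"value": 4, "w1": 1.5, "w2": 1.0},
]
print(weighted_count(records))                       # unweighted
print(weighted_count(records, ["w1"]))               # weighted by w1
print(weighted_count(records, ["w1", "w2"]))         # weighted by w1 and w2
print(weighted_mean(records, "value", ["w1", "w2"]))
```

Each call corresponds to one measure definition: the list of weight names plays the role of the repeated WeightVariableReference.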
>>>>>>>
>>>>>>> Achim
>>>>>>>
>>>>>>> Joachim Wackerow wrote:
>>>>>>> snip
>>>>>>>
>>>>>>>> 5. Weighted measures
>>>>>>>>
>>>>>>>> Measures can be weighted. The measure definition references a variable
>>>>>>>> definition. In the variable definition one related weight variable can
>>>>>>>> be mentioned. This doesn't seem to be sufficient. A measure variable
>>>>>>>> can be weighted or not. It can be weighted by different variables. This
>>>>>>>> should probably be stated in the measure definition with an optional
>>>>>>>> element "WeightVariableReference" as in the variable description.
>>>>>>> snip
>>>>>>>
>>>>
>>>> _______________________________________________
>>>> DDI-SRG mailing list
>>>> DDI-SRG at icpsr.umich.edu
>>>> http://www.icpsr.umich.edu/mailman/listinfo/ddi-srg
>>>
>>
>> Wendy L. Thomas                          Phone: +1 612.624.4389
>> Data Access Core Director         Fax:   +1 612.626.8375
>> Minnesota Population Center              Email: wlt at pop.umn.edu
>> University of Minnesota
>> 50 Willey Hall
>> 225 19th Avenue South
>> Minneapolis, MN 55455
>>
> 


-- 
GESIS - German Social Science Infrastructure Services
http://www.gesis.org/en/

