[DDI-SRG] inline NCube: open issues and improvements

Joachim Wackerow joachim.wackerow at gesis.org
Fri Jan 18 12:16:13 EST 2008


Wendy,

This concerns codes that are invented only for the coding scheme 
(such as a total) but do not exist in the table.

On Wednesday you mentioned the element 'Add' in identification. I think 
it really has different semantics, and I'm not sure it would be 
appropriate to use here. In any case, with CR2 it wouldn't be possible at 
the code level, because codes are not identifiable in CR2.

The use case would be a table where all categories of the table 
dimensions have category labels and codes, but no codes exist for the 
margin sums, the totals. For the markup, the total must have a code. For 
some applications it would be useful to know whether the code already 
existed or was just invented. With the current solution it is not 
possible to tell the difference. For an (ad-hoc coded) total, such an 
indicator would be necessary at the code level, not at the code scheme 
level.

I think the issue is not really solvable by changing the ownership or 
version of a code scheme (we talked about that related to ISCO).

I would suggest an attribute for 'Code' which indicates whether the code 
was just invented for the markup or already exists for the data/table. 
A good term must still be found for that (ad-hoc="true" doesn't seem to 
be a good choice).
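
To make the idea concrete, here is a rough sketch of how such a flag 
could look on a code. The attribute name 'markupOnly' is only a 
placeholder (the term is still open), and the Code structure and the IDs 
are simplified assumptions, not taken from the schema:

```xml
<!-- Sketch only. 'markupOnly' is a placeholder attribute name;
     the Code structure and all IDs are simplified assumptions. -->

<!-- code that already exists in the table -->
<l:Code>
  <l:CategoryReference>
    <r:ID>Category_1</r:ID>
  </l:CategoryReference>
  <l:Value>1</l:Value>
</l:Code>

<!-- total: code invented only for the markup -->
<l:Code markupOnly="true">
  <l:CategoryReference>
    <r:ID>Category_Total</r:ID>
  </l:CategoryReference>
  <l:Value>9</l:Value>
</l:Code>
```

An application could then treat every code without the flag as 
originating from the data/table.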

What do you think?

Achim

Joachim Wackerow wrote:
> Wendy,
> 
> In terms of better machine-actionability, here are my notes on the open issues.
> 
> Regarding the weighted measure, should we decide to use the approach in 
> the example below?
> 
> Joachim Wackerow wrote:
>> Currently WeightVariableReference is not repeatable. So two weights are
>> possible, the standard one and an additional one, but not more.
>> Therefore WeightVariableReference should be repeatable.
>>
>> For Measure in NCubeLogicalProduct, WeightVariableReference should 
>> also be added, but only with cardinality 0..1.
>>
>> Example
>>
>>         <!-- unweighted frequency -->
>>         <l:Measure>
>>           <l:Identifier>
>>             <r:ID>Measure1</r:ID>
>>           </l:Identifier>
>>           <l:VariableReference>
>>             <r:ID>Frequency</r:ID>
>>           </l:VariableReference>
>>         </l:Measure>
>>         <!-- weighted frequency by w1 -->
>>         <l:Measure>
>>           <l:Identifier>
>>             <r:ID>Measure2</r:ID>
>>           </l:Identifier>
>>           <l:VariableReference>
>>             <r:ID>Frequency</r:ID>
>>           </l:VariableReference>
>>           <WeightVariableReference>
>>             <r:ID>Frequency_w1</r:ID>
>>           </WeightVariableReference>
>>         </l:Measure>
>>         <!-- weighted frequency by w2 -->
>>         <l:Measure>
>>           <l:Identifier>
>>             <r:ID>Measure3</r:ID>
>>           </l:Identifier>
>>           <l:VariableReference>
>>             <r:ID>Frequency</r:ID>
>>           </l:VariableReference>
>>           <WeightVariableReference>
>>             <r:ID>Frequency_w2</r:ID>
>>           </WeightVariableReference>
>>         </l:Measure>
>>         <!-- combined weighted frequency by w1 and w2 -->
>>         <l:Measure>
>>           <l:Identifier>
>>             <r:ID>Measure4</r:ID>
>>           </l:Identifier>
>>           <l:VariableReference>
>>             <r:ID>Frequency</r:ID>
>>           </l:VariableReference>
>>           <WeightVariableReference>
>>             <r:ID>Frequency_w1</r:ID>
>>           </WeightVariableReference>
>>           <WeightVariableReference>
>>             <r:ID>Frequency_w2</r:ID>
>>           </WeightVariableReference>
>>         </l:Measure>
>>
>> The weight variables Frequency_w1 and Frequency_w2 may or may not be 
>> mentioned at the variable Frequency. As weights can usually be applied 
>> to a large range of variables, they are sometimes not mentioned at 
>> each variable.
>>
>> Weights
>> When weighting is done by a derived variable, the relationship between 
>> the variable Frequency and the weight cannot be clearly recognized by 
>> an application. With a solution like the example above it would be clear.
>>
> 
> Regarding aggregated measures, I would suggest introducing in Measure a 
> reference to the dimension which represents the dependent variable for 
> the aggregation method (DependentDimensionRank or RankOfDependentDimension).
> 
> In NCubeLogicalProduct/NCube:
> 
> <l:Measure>
>    <l:Identifier>
>      <r:ID>Measure1</r:ID>
>    </l:Identifier>
>    <l:VariableReference>
>      <r:ID>Measure_Frequency</r:ID>
>    </l:VariableReference>
>    <l:DependentDimensionRank>1</l:DependentDimensionRank>
> </l:Measure>
> 
> The measure expresses aggregation values over all codes of the dependent 
> variable by all codes of the other dimension (two-dimensional case) or 
> by all combinations of all codes of the other dimensions 
> (multi-dimensional).
> 
>> Aggregated measure
>> Same problem with the aggregation method in representation. The 
>> aggregation method can be stated, but not the dependent variable over 
>> which the aggregation took place. This is only possible in some 
>> variable definition, but not as a machine-actionable field. That was 
>> the reason I was thinking about an indicator for the dimension 
>> (dependent="true"). 
>> The use case is:
>> two dimensions: satisfaction by salary (1..5), age group (1..4)
>> measure: mean
>> Is the measure the mean over all categories of salary (dependent 
>> variable) for every category of age group, or vice versa? The 
>> application cannot recognize it.
> 
> 
> Regarding totals, I think it works out well if two things exist:
> 1. According to the discussion in the conference call, an indicator will 
> be introduced in physical which states whether the ncube data is clean 
> or not. Clean means every cell of the logical definition of the ncube 
> has a value.
> 
> 2. It would be necessary to indicate in a hierarchical CodeScheme whether 
> a Code on a higher level comprises all codes on the level below or not. 
> If yes, that would be "clean" in the relationship between the upper-level 
> code and the included codes on the level below. I would suggest an 
> attribute for Code: comprehensive=" false | true (Default) ". This way 
> the relationships can be described at a detailed granularity. Or would 
> it be sufficient to have this indicator at the description of Level?
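> 
> A rough sketch of how the attribute could look in a hierarchical 
> CodeScheme. Only 'comprehensive' is the attribute suggested above; the 
> nesting of Code elements and all IDs and values are simplified 
> assumptions for illustration:
> 
> ```xml
> <!-- Sketch only. The upper-level code comprises all codes below it. -->
> <l:Code comprehensive="true">
>   <l:CategoryReference>
>     <r:ID>Category_Total</r:ID>
>   </l:CategoryReference>
>   <l:Value>0</l:Value>
>   <l:Code>
>     <l:CategoryReference>
>       <r:ID>Category_1</r:ID>
>     </l:CategoryReference>
>     <l:Value>1</l:Value>
>   </l:Code>
>   <l:Code>
>     <l:CategoryReference>
>       <r:ID>Category_2</r:ID>
>     </l:CategoryReference>
>     <l:Value>2</l:Value>
>   </l:Code>
> </l:Code>
> ```
> 
> With comprehensive="false" on the upper-level code, an application could 
> not assume that the listed codes below add up to it.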
> 
> The already existing HierarchyType ( Regular | Irregular ) seems to be 
> another thing:
> "Identifies the type of hierarchy used in the nesting of categories 
> within the code scheme. Possible values are Regular and Irregular. A 
> regular nesting indicates that the category hierarchy is consistent to 
> the lowest levels of the hierarchy, i.e. the lowest levels of the 
> hierarchy are at the same level for every branch on the hierarchy."
> 
>> Total
>> The same problem exists at the code definition for total. For an 
>> application it doesn't seem to be possible to recognize that the total 
>> is really a total.
>>
>> Looking at the furniture example, the program can only learn that this 
>> is a non-discrete value on the first level. Then an assumption can be 
>> made: if no other non-discrete value exists on the first level, it 
>> should be a total. This seems to be quite complicated. That was the 
>> reason I would prefer an indicator for total at the code level. Totals 
>> are very special codes.
>>
>> An alternative with the existing possibilities would be to have a 
>> hierarchical code scheme with just one top-level code. This must be the 
>> total. For this to work correctly, the assumption would be that all 
>> existing codes of the data are really mentioned in the coding scheme 
>> below the total. Is this the solution?
> ...
> 
> 
>>> On Wed, 16 Jan 2008, Joachim Wackerow wrote:
>>>
>>>> Wendy,
>>>>
>>>> I would propose to make WeightVariableReference repeatable (in 
>>>> /LogicalProduct/VariableScheme/Variable/Representation).
>>>> This way multiple weights can be associated with one variable. 
>>>> Applying weights to a variable depends on the research question. 
>>>> For example, a study can have multiple weights to adjust the 
>>>> number of respondents regarding education. This can have several 
>>>> different reasons: education system, household types, geographical 
>>>> region. Usually there is indeed a default weight, which can be a 
>>>> single weight or a combined weight. Sometimes the researcher would 
>>>> like to use a specific weight which is appropriate for the research 
>>>> question, or not to weight at all.
>>>>
>>>> Regarding ncubes, two measures must be defined for the case with an 
>>>> unweighted count of one dimension and a weighted count of the same 
>>>> dimension. The weights which are mentioned in the variable 
>>>> description can be seen as optional weights, not as mandatory ones. 
>>>> So an additional WeightVariableReference must be added to the measure 
>>>> definition.
>>>>
>>>> Then the question arises whether that should be repeatable or not. 
>>>> Usually just one weight is applied to one measure, but it would also 
>>>> be possible to apply several weights. By definition a weight is 
>>>> combined with the unweighted measure count by multiplication. 
>>>> Multiple weights get combined the same way, by multiplication. For 
>>>> measures other than count, the weighted count gets computed first 
>>>> and then the aggregation method, like mean or standard deviation, is 
>>>> applied. Multiple weights can be defined by a repeatable 
>>>> WeightVariableReference. This would be straightforward, like in 
>>>> Variable/Representation. Another way would be to define a virtual 
>>>> derived variable which represents the combination of the weights. 
>>>> But that would be a longer way. I would opt for the repeatable 
>>>> WeightVariableReference in measure and clear documentation of what 
>>>> it means.
>>>>
>>>> Achim
>>>>
>>>> Joachim Wackerow wrote:
>>>> snip
>>>>
>>>>> 5. Weighted measures
>>>>>
>>>>> Measures can be weighted. The measure definition references a variable
>>>>> definition. In the variable definition one related weight variable can
>>>>> be mentioned. This doesn't seem to be sufficient. A measure variable
>>>>> can be weighted or not. It can be weighted by different variables. This
>>>>> should probably be stated in the measure definition with an optional
>>>>> element "WeightVariableReference" as in the variable description.
>>>> snip
>>>>
> 
> _______________________________________________
> DDI-SRG mailing list
> DDI-SRG at icpsr.umich.edu
> http://www.icpsr.umich.edu/mailman/listinfo/ddi-srg


