[DDI-SRG] ncube discussion?

Joachim Wackerow joachim.wackerow at gesis.org
Fri Dec 14 16:43:20 EST 2007


Wendy,

Thanks for the example; it is now much clearer. That is exactly what I 
proposed for the sparse matrices. I assumed that the intention of the 
current XML Schema design was NOT to be explicit about the 
dimensions, in order to limit the size of the XML instance.

The problem was that the XML Schema didn't allow anything else, so I 
tried something. I didn't know what the intention was :).

This now explains the misunderstanding in yesterday's discussion, when 
I was talking about implicit order.
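
To make the implicit-order point concrete, here is a minimal Python sketch. 
It assumes the twelve counts from the explicit example quoted further below 
and a 1 x 3 x 4 layout for that excerpt; the function name decode and its 
last_fastest flag are hypothetical and not part of DDI. It only shows that 
the same flat list lands in different cells depending on which dimension 
is taken to change fastest.

```python
from itertools import product

# Counts in the order listed in the implicit example (first sex only).
values = [670, 1442, 1238, 769, 696, 950, 370, 205, 572, 753, 378, 206]
# Assumed number of categories per dimension for this excerpt.
sizes = (1, 3, 4)

def decode(values, sizes, last_fastest=True):
    """Attach 1-based coordinates to a flat list of cell values."""
    ranges = [range(1, n + 1) for n in sizes]
    if last_fastest:
        # Highest-rank dimension changes fastest.
        coords = product(*ranges)
    else:
        # Rank-1 dimension changes fastest: iterate reversed, flip back.
        coords = (c[::-1] for c in product(*reversed(ranges)))
    return dict(zip(coords, values))

last = decode(values, sizes, last_fastest=True)
first = decode(values, sizes, last_fastest=False)
print(last[(1, 2, 1)], first[(1, 2, 1)])
```

Under the first reading, cell 1,2,1 holds 696 as in the explicit list; 
under the second reading the same cell would receive 1442.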

With this design sparse matrices would be no problem. Cells which are 
not mentioned have a value of zero.
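
A minimal Python sketch of that reading, assuming each cell is stored 
under its explicit coordinates. The helper names coordinate and cell_value 
are hypothetical; the single cell is the one from Wendy's rank/value 
example quoted below.

```python
def coordinate(pairs):
    """Build a coordinate tuple from (rank, value) pairs in any order."""
    return tuple(value for rank, value in sorted(pairs))

cells = {}
# rank 3 = 4, rank 1 = 2, rank 2 = 1, deliberately listed out of order
cells[coordinate([(3, 4), (1, 2), (2, 1)])] = 1889

def cell_value(cells, coord):
    """Cells that are not mentioned are read as zero."""
    return cells.get(coord, 0)

print(cell_value(cells, (2, 1, 4)))
print(cell_value(cells, (1, 1, 1)))
```

Only cells that actually occur need to be written, and their order in 
the instance does not matter.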

The marginal sums can also be added, with a value of NULL for the 
coordinate number where the rollup takes place. This would be nicely in 
line with OLAP and database systems. The question then arises of how the 
NULL value should be defined in the XML, perhaps as an optional attribute 
of Value like 'null="true"' and not just as content.
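
A minimal Python sketch of such a rollup, using None in place of NULL. 
The counts are the twelve values from the explicit example quoted further 
below; the function name with_rollups is hypothetical, and this is only 
an illustration of the idea, not a proposal for the XML representation.

```python
from itertools import product

counts = {
    (1, 1, 1): 670, (1, 1, 2): 1442, (1, 1, 3): 1238, (1, 1, 4): 769,
    (1, 2, 1): 696, (1, 2, 2): 950,  (1, 2, 3): 370,  (1, 2, 4): 205,
    (1, 3, 1): 572, (1, 3, 2): 753,  (1, 3, 3): 378,  (1, 3, 4): 206,
}

def with_rollups(cells):
    """Add marginal sums, with None marking each rolled-up coordinate."""
    out = {}
    for coord, value in cells.items():
        # Every way of replacing a subset of the coordinates by None;
        # the all-False mask keeps the original cell itself.
        for mask in product((False, True), repeat=len(coord)):
            key = tuple(None if m else c for c, m in zip(coord, mask))
            out[key] = out.get(key, 0) + value
    return out

cube = with_rollups(counts)
print(cube[(1, None, 1)], cube[(1, None, None)])
```

With the coordinate order of that list, rolling up the second coordinate 
for the first category of the other two gives 1938, and rolling up the 
second and third coordinates gives 8249, which are the two sums cited 
from the table in the quoted message.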

I looked at some cells in the example and noticed several issues. For 
example, the coordinates 2,2,1 have the measure 572 but should have 950 
or, the other way around, the measure 572 should have the 
coordinates 1,3,1.
Is this a typo, or do I understand this differently?

Finally, I see a naming issue. In the definition of the dimensions the 
term is Dimension, but in the DataItem the term is Coordinate. 
Perhaps the same term, Dimension, should be used in both places.

Achim

Wendy Thomas wrote:
> Ok...the problem is that you have put all measures into a single data 
> item. There is one data item per cell and it can contain multiple 
> measures of different types.
> 
> So what you really SHOULD have is attached. However, DataItem needs to 
> be repeatable which it is currently NOT. That's the bug. I've filed it.
> 
> Wendy
> 
> 
> On Thu, 13 Dec 2007, Joachim Wackerow wrote:
> 
>> Wendy,
>>
>> I think this is not really the case in the current XML Schema. See an 
>> excerpt of the attached XML sample:
>>
>> <nci:DataItem>
>>          <nci:NCubeInstanceReference>
>>            <r:ID>NCube</r:ID>
>>          </nci:NCubeInstanceReference>
>>          <!-- - - - - - -->
>>          <nci:MeasureValue>
>>            <nci:MeasureReference>
>>              <r:ID>Measure</r:ID>
>>            </nci:MeasureReference>
>>            <nci:Value>670</nci:Value>
>>          </nci:MeasureValue>
>>          <!-- - - - - - -->
>>          <nci:MeasureValue>
>>            <nci:MeasureReference>
>>              <r:ID>Measure</r:ID>
>>            </nci:MeasureReference>
>>            <nci:Value>1442</nci:Value>
>>          </nci:MeasureValue>
>>
>> If I understand your description correctly it is exactly the same as 
>> my proposal for the improvements for sparse matrices. But again this 
>> is not possible in the current schema.
>>
>> With the totals and subtotals I'm looking forward to your examples.
>>
>> Achim
>>
>> Wendy Thomas wrote:
>>> Achim
>>>
>>> The notation for identifying the coordinate value is to state both 
>>> its rank and value. So in essence it doesn't matter to the system 
>>> what is changing the fastest:
>>>
>>> dimension rank 1   SEX (2 categories)
>>> dimension rank 2   AGE (3 categories)
>>> dimension rank 3   MARITAL STATUS (4 categories)
>>>
>>>
>>> I could list my values in any order in the storage structure and you 
>>> could still place the value in its appropriate cell
>>>
>>> rank 1 dmnsValue 2
>>> rank 2 dmnsValue 1
>>> rank 3 dmnsValue 4
>>> cell value 1889
>>> measure count
>>> measure unit persons
>>>
>>> Would be 1889 persons of SEX=2, AGE=1, MARITAL STATUS=4
>>>
>>> There is no assumption about the order following a "standard matrix 
>>> order". If one made that assumption then the order would be
>>>
>>>
>>> 1 1 1
>>> 1 1 2
>>> 1 1 3
>>> 1 1 4
>>> 1 2 1
>>> 1 2 2
>>> 1 2 3
>>> 1 2 4
>>> 2 1 1
>>> etc
>>>
>>> While that order is convenient for automating the creation of 
>>> the old locMap (it's easier to restructure the data than to hand-create 
>>> the metadata), it was never an assumption. I don't believe we made 
>>> that assumption in 3.0 (but I will verify this).
>>>
>>> We talked in 2.1 about using the NULL as a special case for 
>>> hierarchical categories. However, in many cases we had predetermined 
>>> values for the codes (like occupation codes) where the occupation 
>>> code was stored as a field and needed to be referenced to determine 
>>> the value of that particular dimension. In this case there were both 
>>> marginal totals and intermediate totals. With the hierarchy levels 
>>> and description of relationships of those levels in the code scheme 
>>> the subtotals and totals are clear.
>>>
>>> Also you can define the sub-region of the cube (for example column 
>>> totals, row totals, column/row total) and assign them an individual 
>>> measurement unit (since this is described by a variable, you can also 
>>> provide the derivation or calculation process for obtaining it).
>>>
>>> Wendy
>>>
>>>
>>>
>>> On Thu, 13 Dec 2007, Joachim Wackerow wrote:
>>>
>>>> Wendy,
>>>>
>>>> OK, then I'm looking forward to your examples and we continue the 
>>>> discussion in January.
>>>>
>>>> The example tables are available at:
>>>> Counts
>>>> http://exanda.zuma-mannheim.de/Study_Wohlfahrtssurvey/Independent_SCHULAB/Dependent_ALTER1/Group_GESCHL/Weight_Weight/OutputType_Value/FileType_HTML/Trivariate.html 
>>>> Row percentage
>>>> http://exanda.zuma-mannheim.de/Study_Wohlfahrtssurvey/Independent_SCHULAB/Dependent_ALTER1/Group_GESCHL/Weight_Weight/OutputType_RowPercent/FileType_HTML/Trivariate.html 
>>>>
>>>> I'm assuming that count and row percentage would be two different 
>>>> measures and weighted counts a third one.
>>>>
>>>> I still have difficulty understanding how the marginal sums can be 
>>>> represented. I don't think it is a matter of the category scheme. A 
>>>> sum like "1938" in the first table for "18 bis 30 Jahre" (1) and 
>>>> "männlich" (1) depends on the other two variables. Actually 
>>>> it is a cell of a two-dimensional cube "Geschlecht" by "Alter". 
>>>> Perhaps these marginal sums should be represented as an additional 
>>>> two-dimensional ncube?
>>>>
>>>> In OLAP cubes generated by database systems this sum would be 
>>>> represented by an additional record with the dimensions 1,1,NULL. 
>>>> NULL as a null value with the special meaning in an ncube, that this 
>>>> represents a marginal sum. So the overall count of men in the upper 
>>>> table "8249" would be represented as 1,NULL,NULL.
>>>>
>>>> Perhaps we can use this approach also in DDI? NULL or something 
>>>> similar would have to be defined with this special meaning in an 
>>>> ncube. This is different from a missing value. In a specific cell a 
>>>> value can be missing; that would be a missing value in the measure 
>>>> (which would also be represented as a NULL in a database system). A 
>>>> NULL value in the sense above is limited to the dimensions. With 
>>>> this approach a rollup (OLAP talk) in every dimension would be 
>>>> possible.
>>>>
>>>> Attribute is just a string, so anything can be put there. It gives no 
>>>> machine-actionable description of whether the table is weighted or 
>>>> not, and no reference to a weight variable can be made (including 
>>>> a weight variable can make sense in a structured ncube/table).
>>>>
>>>> The type of the measure is described by the label of the measure 
>>>> variable in logicalproduct or by the related 'ConceptReference'. I 
>>>> think it would make sense to describe it by a controlled vocabulary, 
>>>> as for category statistics.
>>>>
>>>> The order of the measure values should be documented well. Defining 
>>>> "Geschlecht" as dimension 1, "Schulabschluss" as dimension 2, 
>>>> "Alter" as dimension 3 is not enough. The documentation should say 
>>>> that dimension 1 changes the slowest; otherwise the order is 
>>>> not defined clearly enough for an application.
>>>>
>>>> Explicit example from the first table:
>>>> 1,1,1,670
>>>> 1,1,2,1442
>>>> 1,1,3,1238
>>>> 1,1,4,769
>>>> 1,2,1,696
>>>> 1,2,2,950
>>>> 1,2,3,370
>>>> 1,2,4,205
>>>> 1,3,1,572
>>>> 1,3,2,753
>>>> 1,3,3,378
>>>> 1,3,4,206
>>>>
>>>> Implicit example, how it would be represented in DDI, assuming the 
>>>> dimension with the highest rank is changing the quickest.
>>>> 670
>>>> 1442
>>>> 1238
>>>> 769
>>>> 696
>>>> 950
>>>> 370
>>>> 205
>>>> 572
>>>> 753
>>>> 378
>>>> 206
>>>>
>>>> Assuming the dimension with rank 1 is changing the quickest, the 
>>>> above representation would result in nonsense.
>>>>
>>>> Have a good time until January. I'm going with Ingrid to the Alps 
>>>> starting on Saturday until the end of the year.
>>>>
>>>> Best wishes, Achim
>>>>
>>>> Wendy Thomas wrote:
>>>>> Achim
>>>>>
>>>>> I think it would be more useful after I've done the examples. Right 
>>>>> now we're going on my memory. Sorry I cut out before you got a 
>>>>> chance to say hang on. I have a cold and really needed to blow my 
>>>>> nose!
>>>>>
>>>>> I've got a stack of ddi xml I'm working on and trying to wedge in 
>>>>> some time at work to do this. It's a bit hectic there right now 
>>>>> (finals) but seems to be slacking off so there is hope.
>>>>>
>>>>> Wendy
>>>>>
>>>>> On Thu, 13 Dec 2007, Joachim Wackerow wrote:
>>>>>
>>>>>> Wendy,
>>>>>>
>>>>>> you dropped off? I tried to discuss further the ncube stuff. I 
>>>>>> think it is possible without Arofan.
>>>>>>
>>>>>> Should we phone or should we continue in January?
>>>>>>
>>>>>> Achim
>>>>>>
>>>>>
>>>>> Wendy L. Thomas                          Phone: +1 612.624.4389
>>>>> Data Access Core Director         Fax:   +1 612.626.8375
>>>>> Minnesota Population Center              Email: wlt at pop.umn.edu
>>>>> University of Minnesota
>>>>> 50 Willey Hall
>>>>> 225 19th Avenue South
>>>>> Minneapolis, MN 55455
>>>>
>>>
>>
>>
>> -- 
>> GESIS - German Social Science Infrastructure Services
>> http://www.gesis.org/en/
>>
> 


