[DDI-SRG] ncube discussion?

Wendy Thomas wlt at pop.umn.edu
Fri Dec 14 21:16:31 EST 2007


Achim

I still think the roll-ups are handled by the codeScheme. I'll make some 
examples. As for the relationship of the coordinate values to the content, 
I didn't look at the example, just put in coordinates in a standard matrix 
order. So these can easily be changed to match the content of your 
example.

The difference between the terms Dimension and Coordinate is that in the NCube
the Dimension identifies the Variable that describes the dimension. In the
physical data product the Coordinate gives the value of the coordinate for
each dimension. I think they are named appropriately, and they also match
the names from DDI 2.0.
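
To make the distinction concrete, here is a minimal Python sketch (not DDI
syntax). The class names echo the schema terms, but the fields and the
describe() helper are assumptions made for illustration only. The SEX / AGE /
MARITAL STATUS dimensions and the 1889 cell are taken from the example quoted
further down.

```python
from dataclasses import dataclass


@dataclass
class Dimension:
    """Logical side: ties a rank to the variable describing the dimension."""
    rank: int
    variable: str
    categories: int  # number of categories of that variable


@dataclass
class DataItem:
    """Physical side: one cell, addressed by a coordinate value per rank."""
    coordinates: dict  # rank -> coordinate value (hypothetical layout)
    measures: dict     # measure name -> value


DIMENSIONS = [
    Dimension(rank=1, variable="SEX", categories=2),
    Dimension(rank=2, variable="AGE", categories=3),
    Dimension(rank=3, variable="MARITAL STATUS", categories=4),
]


def describe(item, dimensions):
    """Resolve each coordinate against the dimension with the same rank."""
    by_rank = {d.rank: d for d in dimensions}
    parts = []
    for rank in sorted(item.coordinates):
        dim = by_rank[rank]
        value = item.coordinates[rank]
        if not 1 <= value <= dim.categories:
            raise ValueError(f"coordinate {value} out of range for {dim.variable}")
        parts.append(f"{dim.variable}={value}")
    return ", ".join(parts)


cell = DataItem(coordinates={1: 2, 2: 1, 3: 4}, measures={"count": 1889})
print(cell.measures["count"], "persons of", describe(cell, DIMENSIONS))
```

The Dimension objects are stated once for the cube; each DataItem only
carries coordinate values, which are looked up by rank.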

Wendy


On Fri, 14 Dec 2007, Joachim Wackerow wrote:

> Wendy,
>
> Thanks for the example, that is now much clearer. That is exactly what I
> proposed for the sparse matrices. I assumed that the intention of the
> current XML Schema design was NOT to be explicit in terms of the
> dimensions, because of the size of the XML instance.
>
> It was a problem that the XML Schema didn't allow anything else, so I
> tried something. I didn't know what the intention was :).
>
> This explains now the misunderstanding in the discussion yesterday, when
> I was talking about implicit order.
>
> With this design sparse matrices would be no problem. Cells which are
> not mentioned have a value of zero.
>
> The marginal sums can also be added with a value of NULL for the
> coordinate number where the rollup takes place. This would be nicely in
> line with OLAP and database systems. The question then arises of how the
> NULL value should be defined in the XML, perhaps as an optional attribute
> for Value like 'null="true"' and not just as content.
>
> I looked at some cells in the example. I noticed several issues, for
> example, that the coordinates 2,2,1 have the measure 572 but should
> have 950; or, the other way around, the measure 572 should have the
> coordinates 1,3,1.
> Is this a typo, or do I understand this differently?
>
> Finally I see a naming issue. In the definition of the dimensions the
> term is Dimension; in the DataItem the term is Coordinate.
> Perhaps it should be Dimension in both places.
>
> Achim
>
> Wendy Thomas wrote:
>> Ok...the problem is that you have put all measures into a single data
>> item. There is one data item per cell and it can contain multiple
>> measures of different types.
>>
>> So what you really SHOULD have is attached. However, DataItem needs to
>> be repeatable which it is currently NOT. That's the bug. I've filed it.
>>
>> Wendy
>>
>>
>> On Thu, 13 Dec 2007, Joachim Wackerow wrote:
>>
>>> Wendy,
>>>
>>> I think, this is not really the case in the current XML Schema. See an
>>> excerpt of the attached XML sample:
>>>
>>> <nci:DataItem>
>>>          <nci:NCubeInstanceReference>
>>>            <r:ID>NCube</r:ID>
>>>          </nci:NCubeInstanceReference>
>>>          <!-- - - - - - -->
>>>          <nci:MeasureValue>
>>>            <nci:MeasureReference>
>>>              <r:ID>Measure</r:ID>
>>>            </nci:MeasureReference>
>>>            <nci:Value>670</nci:Value>
>>>          </nci:MeasureValue>
>>>          <!-- - - - - - -->
>>>          <nci:MeasureValue>
>>>            <nci:MeasureReference>
>>>              <r:ID>Measure</r:ID>
>>>            </nci:MeasureReference>
>>>            <nci:Value>1442</nci:Value>
>>>          </nci:MeasureValue>
>>>
>>> If I understand your description correctly, it is exactly the same as
>>> my proposal for the improvements for sparse matrices. But again, this
>>> is not possible in the current schema.
>>>
>>> With the totals and subtotals I'm looking forward to your examples.
>>>
>>> Achim
>>>
>>> Wendy Thomas wrote:
>>>> Achim
>>>>
>>>> The notation for identifying the coordinate value is to state both
>>>> its rank and value. So in essence it doesn't matter to the system
>>>> what is changing the fastest:
>>>>
>>>> dimension rank 1   SEX (2 categories)
>>>> dimension rank 2   AGE (3 categories)
>>>> dimension rank 3   MARITAL STATUS (4 categories)
>>>>
>>>>
>>>> I could list my values in any order in the storage structure and you
>>>> could still place the value in its appropriate cell
>>>>
>>>> rank 1 dmnsValue 2
>>>> rank 2 dmnsValue 1
>>>> rank 3 dmnsValue 4
>>>> cell value 1889
>>>> measure count
>>>> measure unit persons
>>>>
>>>> Would be 1889 persons of SEX=2, AGE=1, MARITAL STATUS=4
>>>>
>>>> There is no assumption about the order following a "standard matrix
>>>> order". If one made that assumption, then the order would be
>>>>
>>>>
>>>> 1 1 1
>>>> 1 1 2
>>>> 1 1 3
>>>> 1 1 4
>>>> 1 2 1
>>>> 1 2 2
>>>> 1 2 3
>>>> 1 2 4
>>>> 1 3 1
>>>> 1 3 2
>>>> 1 3 3
>>>> 1 3 4
>>>> 2 1 1
>>>> etc
>>>>
>>>> While that assumption is convenient for automating the creation of
>>>> the old locMap (it's easier to restructure the data than to hand-create
>>>> the metadata), it was never actually assumed. I don't believe we made
>>>> that assumption in 3.0 (but I will verify this).
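
As an illustration of the two points above, the following Python sketch
enumerates the "standard matrix order" for the SEX (2) x AGE (3) x MARITAL
STATUS (4) example and shows that explicitly addressed cells land in the same
place whatever order they are listed in. The function names and the
placeholder cell values are assumptions made for this sketch, not part of DDI.

```python
import itertools

SIZES = (2, 3, 4)  # categories of SEX, AGE, MARITAL STATUS (ranks 1, 2, 3)


def standard_matrix_order(sizes):
    """List 1-based coordinates with the highest rank changing fastest."""
    return list(itertools.product(*(range(1, n + 1) for n in sizes)))


def place(items):
    """Build a cube from (coordinates, value) pairs given in any order."""
    return {coords: value for coords, value in items}


order = standard_matrix_order(SIZES)

# Placeholder values (NOT from the thread): each cell just gets its index.
items = [(coords, index) for index, coords in enumerate(order)]
reordered = list(reversed(items))

# Explicit coordinates make the storage order irrelevant.
assert place(items) == place(reordered)

for coords in order[:6]:
    print(*coords)
```

Because every value carries its own rank/value pairs, only the implicit
(coordinate-free) layout needs an agreed ordering.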
>>>>
>>>> We talked in 2.1 about using the NULL as a special case for
>>>> hierarchical categories. However, in many cases we had predetermined
>>>> values for the codes (like occupation codes) where the occupation
>>>> code was stored as a field and needed to be referenced to determine
>>>> the value of that particular dimension. In this case there were both
>>>> marginal totals and intermediate totals. With the hierarchy levels
>>>> and description of relationships of those levels in the code scheme
>>>> the subtotals and totals are clear.
>>>>
>>>> Also you can define sub-regions of the cube (for example column
>>>> totals, row totals, column/row totals) and assign them an individual
>>>> measurement unit (since this is described by a variable you can also
>>>> provide the derivation or calculation process for obtaining it).
>>>>
>>>> Wendy
>>>>
>>>>
>>>>
>>>> On Thu, 13 Dec 2007, Joachim Wackerow wrote:
>>>>
>>>>> Wendy,
>>>>>
>>>>> OK, then I'm looking forward to your examples and we continue the
>>>>> discussion in January.
>>>>>
>>>>> The example tables are available at:
>>>>> Counts
>>>>> http://exanda.zuma-mannheim.de/Study_Wohlfahrtssurvey/Independent_SCHULAB/Dependent_ALTER1/Group_GESCHL/Weight_Weight/OutputType_Value/FileType_HTML/Trivariate.html
>>>>> Row percentage
>>>>> http://exanda.zuma-mannheim.de/Study_Wohlfahrtssurvey/Independent_SCHULAB/Dependent_ALTER1/Group_GESCHL/Weight_Weight/OutputType_RowPercent/FileType_HTML/Trivariate.html
>>>>>
>>>>> I'm assuming that count and row percentage would be two different
>>>>> measures and weighted counts a third one.
>>>>>
>>>>> I still have difficulty understanding how the marginal sums can be
>>>>> represented. I don't think it is a matter of the category scheme. A
>>>>> sum like "1938" in the first table for "18 bis 30 Jahre" (1) and
>>>>> "männlich" (1) depends on the other two variables. Actually
>>>>> it is a cell of a two-dimensional cube "Geschlecht" by "Alter".
>>>>> Perhaps these marginal sums should be represented as an additional
>>>>> two-dimensional ncube?
>>>>>
>>>>> In OLAP cubes generated by database systems this sum would be
>>>>> represented by an additional record with the dimensions 1,1,NULL.
>>>>> Here NULL is a null value with the special meaning, in an ncube, that
>>>>> the record represents a marginal sum. So the overall count of men in
>>>>> the upper table, "8249", would be represented as 1,NULL,NULL.
>>>>>
>>>>> Perhaps we can use this approach also in DDI? NULL or something
>>>>> similar would have to be defined with this special meaning in an
>>>>> ncube. This is different from a missing value. In a specific cell a
>>>>> value can be missing; that would be a missing value in the measure
>>>>> (which would also be represented in a database system as a NULL). A
>>>>> NULL value in the sense above is limited to the dimensions. With
>>>>> this approach a rollup (OLAP talk) in every dimension would be
>>>>> possible.
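
A minimal Python sketch of this rollup idea, using None in the role of NULL.
The counts are the twelve Geschlecht=1 cells from the explicit listing later
in this message, keyed in that listing's order (Geschlecht, Schulabschluss,
Alter); with that order the "1938" sum is the record 1,NULL,1. The function
name and the dictionary layout are assumptions made for illustration.

```python
import itertools

# (Geschlecht, Schulabschluss, Alter) -> count, Geschlecht=1 only.
CELLS = {
    (1, 1, 1): 670, (1, 1, 2): 1442, (1, 1, 3): 1238, (1, 1, 4): 769,
    (1, 2, 1): 696, (1, 2, 2): 950, (1, 2, 3): 370, (1, 2, 4): 205,
    (1, 3, 1): 572, (1, 3, 2): 753, (1, 3, 3): 378, (1, 3, 4): 206,
}


def with_rollups(cells):
    """Add marginal sums, with None replacing each rolled-up coordinate."""
    result = {}
    for coords, value in cells.items():
        # Every cell contributes to each combination of kept/rolled-up dims;
        # the all-False mask reproduces the original cell.
        for mask in itertools.product((False, True), repeat=len(coords)):
            key = tuple(None if rolled else c for c, rolled in zip(coords, mask))
            result[key] = result.get(key, 0) + value
    return result


cube = with_rollups(CELLS)
print(cube[(1, None, 1)])     # men aged "18 bis 30 Jahre": 1938
print(cube[(1, None, None)])  # all men: 8249
# Rollups over Geschlecht are incomplete here: only Geschlecht=1 is listed.
```

A missing measure would be a separate concern: it would show up as a missing
value in a cell, not as None in a coordinate.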
>>>>>
>>>>> Attribute is just a string, so anything can be put there. It is not a
>>>>> machine-actionable description of whether the table is weighted or
>>>>> not, and no reference to a weight variable can be made (the inclusion
>>>>> of a weight variable can make sense in a structured ncube/table).
>>>>>
>>>>> The type of the measure is described by the label of the measure
>>>>> variable in logicalproduct or by the related 'ConceptReference'. I
>>>>> think it would make sense to describe it by a controlled vocabulary,
>>>>> as is done for category statistics.
>>>>>
>>>>> The order of the measure values should be documented well. Defining
>>>>> "Geschlecht" as dimension 1, "Schulabschluss" as dimension 2,
>>>>> "Alter" as dimension 3 is not enough. The documentation should say
>>>>> that dimension 1 changes slowest; otherwise the order is not
>>>>> defined clearly enough for an application.
>>>>>
>>>>> Explicit example from the first table:
>>>>> 1,1,1,670
>>>>> 1,1,2,1442
>>>>> 1,1,3,1238
>>>>> 1,1,4,769
>>>>> 1,2,1,696
>>>>> 1,2,2,950
>>>>> 1,2,3,370
>>>>> 1,2,4,205
>>>>> 1,3,1,572
>>>>> 1,3,2,753
>>>>> 1,3,3,378
>>>>> 1,3,4,206
>>>>>
>>>>> Implicit example, how it would be represented in DDI, assuming the
>>>>> dimension with the highest rank is changing the quickest.
>>>>> 670
>>>>> 1442
>>>>> 1238
>>>>> 769
>>>>> 696
>>>>> 950
>>>>> 370
>>>>> 205
>>>>> 572
>>>>> 753
>>>>> 378
>>>>> 206
>>>>>
>>>>> Assuming the dimension with rank 1 is changing the quickest, the
>>>>> above representation would result in nonsense.
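
The effect of the ordering assumption can be sketched in Python. The decode()
function below is a hypothetical helper, not anything defined by DDI; it
assigns coordinates to the twelve implicit values above, assuming 2 x 3 x 4
categories for (Geschlecht, Schulabschluss, Alter).

```python
VALUES = [670, 1442, 1238, 769, 696, 950, 370, 205, 572, 753, 378, 206]
SIZES = (2, 3, 4)  # assumed categories: Geschlecht, Schulabschluss, Alter


def decode(values, sizes, fastest="last"):
    """Map a flat value list to 1-based coordinates.

    fastest="last":  the highest-rank dimension changes quickest.
    fastest="first": the rank 1 dimension changes quickest.
    """
    dims = tuple(reversed(sizes)) if fastest == "last" else tuple(sizes)
    cells = {}
    for index, value in enumerate(values):
        coords = []
        remainder = index
        for n in dims:
            coords.append(remainder % n + 1)
            remainder //= n
        if fastest == "last":
            coords.reverse()
        cells[tuple(coords)] = value
    return cells


last = decode(VALUES, SIZES, fastest="last")
first = decode(VALUES, SIZES, fastest="first")

print(last[(1, 1, 2)])   # 1442, matching the explicit listing
print(first[(1, 1, 2)])  # 370, a different cell's count
```

The same twelve numbers end up in different cells depending on the stated
convention, which is why the convention has to be documented.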
>>>>>
>>>>> Have a good time until January. I'm going with Ingrid to the Alps
>>>>> starting on Saturday until the end of the year.
>>>>>
>>>>> Best wishes, Achim
>>>>>
>>>>> Wendy Thomas wrote:
>>>>>> Achim
>>>>>>
>>>>>> I think it would be more useful after I've done the examples. Right
>>>>>> now we're going on my memory. Sorry I cut out before you got a
>>>>>> chance to say hang on. I have a cold and really needed to blow my
>>>>>> nose!
>>>>>>
>>>>>> I've got a stack of ddi xml I'm working on and trying to wedge in
>>>>>> some time at work to do this. It's a bit hectic there right now
>>>>>> (finals) but seems to be slacking off so there is hope.
>>>>>>
>>>>>> Wendy
>>>>>>
>>>>>> On Thu, 13 Dec 2007, Joachim Wackerow wrote:
>>>>>>
>>>>>>> Wendy,
>>>>>>>
>>>>>>> you dropped off? I tried to discuss the ncube stuff further. I
>>>>>>> think it is possible without Arofan.
>>>>>>>
>>>>>>> Should we phone or should we continue in January?
>>>>>>>
>>>>>>> Achim
>>>>>>>
>>>>>>
>>>>>> Wendy L. Thomas                          Phone: +1 612.624.4389
>>>>>> Data Access Core Director         Fax:   +1 612.626.8375
>>>>>> Minnesota Population Center              Email: wlt at pop.umn.edu
>>>>>> University of Minnesota
>>>>>> 50 Willey Hall
>>>>>> 225 19th Avenue South
>>>>>> Minneapolis, MN 55455
>>>>>
>>>>
>>>
>>>
>>> --
>>> GESIS - German Social Science Infrastructure Services
>>> http://www.gesis.org/en/
>>>
>>
>
>

Wendy L. Thomas                          Phone: +1 612.624.4389
Data Access Core Director		 Fax:   +1 612.626.8375
Minnesota Population Center              Email: wlt at pop.umn.edu
University of Minnesota
50 Willey Hall
225 19th Avenue South
Minneapolis, MN 55455

