[DDI-SRG] inline NCube: open issues and improvements

Joachim Wackerow joachim.wackerow at gesis.org
Wed Jan 9 07:23:03 EST 2008


Wendy,

We should discuss this tomorrow when everybody had the chance to read 
this. Otherwise I have the impression we are going in circles. I think 
at some issues we look both from different angles on it. Both angles 
seem to be possible and the DDI ncubes should work for different 
purposes. So we need to discuss and make a solution.

Anyway, below are some notes.

Wendy Thomas wrote:
> a few response notes:
> 
> 1)There NCubeIntance is repeatable in physicaldataproduct/ncube/normal 
> because the physical file being described can contain multiple NCubes 
> (example of all the NHGIS data).
> 
> Inline NCube assumes a single NCube for each inlinencube described 
> (inline itself is repeatable as a BaseRecordLayout).
> 
OK, but WHY should it be not repeatable. Inline is just another storage 
of the data. All NCube modules should be consistent. When a deviation 
from an overall concept is necessary, there should be an argument for that.

> Tabular makes the same assumption I think, although in this case it may 
> be inaccurate as a display table may contain multiple ncubes with one or 
> more common dimensions.
So it should be repeatable for tabular.

> 
> 2) I still think this is handled by nested code schemes. For example: 
> the NAICS (North American Industrial Classification Scheme) has such 
> roll ups already in it
> 00    All Industries             LEVEL 1 11    Agriculture, forestry, 
> fishing and hunting  LEVEL 2
> 111   Crop Products LEVEL 3
> 1111  Oilseed and Grain Farming   LEVEL 4
> 11111 Soybean Farming  LEVEL 5
> 11112 Oilseed (except Soybean Farming) LEVEL 5
> ...
> 1112  Vegetable and Melon Farming LEVEL 4
> 
> etc.
> 
> You need to know the level information and structure for identifying 
> appropriate rollups and what they are rollups of. As long as I know that 
> a codescheme includes a "total" and that it contains all the other 
> levels then it doesn't matter if its code is "0" or "fred". Also saying 
> something is a total in a multi dimensional NCube is different than the 
> "total" of a single dimension (variable).
I think I don't understand the meaning of the last sentence. Shouldn't 
be the definition the same?

> 
> Sex (3) total=0 male=1 female=2
> Age (3) total=0 under 18=1 18 plus=2
> Race (3) total=0 white=1 other=2
> 
> all with a "total" designated as "0"  I could have.
> 
> 0,0,0
> 0,0,1
> 0,0,2
> 0,1,0
> 0,1,1
> 0,1,2
> 0,2,0
> 0,2,1
> 0,2,2
>  etc down to 0,0,0 as long as one of the coordinates was zero.
All the above is fine with me for tables and variables with a given code 
for totals/subtotals.

Another case is ISCO code. It has NO code for the overall total only for 
the major groups. I think this is pretty common with some hierarchic codes.

ISCO-88 defines four levels of aggregation, consisting of:
10 major groups
28 sub-major groups (subdivisions of major groups)
116 minor groups (subdivisions of sub-major groups)
390 unit groups (subdivisions of minor groups)

> 
> I guess I'm just trying to see what you are gaining by adding 
> information about the content description in the physical description 
> and noting that you really can't constrain the coding systems to using 
> '0' as a total identifier.
Actually there is nothing added in the physical description, just that 
dimension is omitted which doesn't play a role for the description of 
this cell (see below).

I was thinking on OLAP-like systems which can probably deal easier with 
this kind of approach.

I see the point that all metadata should stay in logicalproduct and no 
metadata in physical.

> 
> I may be misunderstanding what you are unable to do if the logical 
> product documentation is writen correctly so please tell me what I'm 
> missing.
> 
> Also, since all cells in an NCube need to be described when would you 
> have a table not containing predetermined codes for subtotals? no code, 
> no cell. The variable and NCube state if something is addative which is 
> a matter of the structure, the measure, and suppression.
The approach above with the hierarchy in the codescheme in 
logicalproduct is not possible for tables WITHOUT codes for 
totals/subtotals but the totals/subtotals do exist as margin 
columns/rows. I agree basically to "no code, no cell". But there are no 
codes for the dimension where the total is for. Actually this dimension 
is just not there. These cells are defined by the codes of the other 
remaining dimensions.

 From my view this kind of tables with included totals/subtotals without 
specific codes are quite common, see again at my example:
http://exanda.zuma-mannheim.de/Study_Wohlfahrtssurvey/Independent_SCHULAB/Dependent_ALTER1/Group_GESCHL/Weight_Weight/OutputType_Value/FileType_HTML/Trivariate.html

The bold rows in this table are the subtotals for GESCHL by ALTER1 
without a code in SCHULAB. It is indeed a two-dimensional table (here 
without missing data for SCHULAB) which is related to the 
three-dimensional table regarding the common variables GESCHL and ALTER1.

The missing data can be a problem in any case, the totals/subtotals are 
computed here over all valid codes in SCHULAB (a "case-wise" deletion 
happens for any combination of GESCHL and ALTER1 where SCHULAB has a 
missing value). The totals with a given code in the logicalproduct are 
totals for all underlying codes in the hierarchy dependent if a missing 
code is included or not. BTW it is not clear to me how a total is 
specified with included missing code and another without missing codes. 
By repeating the hierarchy:

total1 (excluding missing)
   code1
   code2

total2 (including missing)
   code1
   code2
   c3 missing

or by inventing another level:

total2 (including missing)
   total1 (excluding missing)
     code1
     code2
   c3 missing


When an existing total cell (but without related code) should be 
described in the CodeScheme, an indicator can be invented for this 
purpose like an attribute for the code definition:
<Code/Value total="true"/>
The content would be empty, because no code does exist. An application 
can then use an appropriate symbol or text for that.

> 
> "Currently an application can't indentify a subtotal code in NCube 
> Logical Product as a subtotal." Clearly in the above example of NAICS 
> Level 1 contains level 2 etc and everything is appropriately nested in 
> its parent. Irregular hierarchies (where not all level items have 
> sublevels) are identified with "descrete". all those not descrete items 
> are sub-totals.
I'm probably wrong with this. If the hierarchy is correctly built an 
application can identify a subtotal code as a subtotal for the 
underlying other codes (in the hierarchy).

A total code for all (no other codes allowed on this main level) is only 
identifiable as the total by the information that it is the highest in 
the hierarchy AND it is just a single code. That is not really nice but 
it is possible. My point is: would an indicator (like above: 
total="true") for total make the processing not much easier?

> 
> 3) need to clarify in both field level and in the high level "how to 
> apply" sections.
> 
> 4) do we match this or match how 2.0 does the location information. I 
> guess I've always looked at the cell location inforamtion in a matrix as 
> it coordiante value.
> 
> in Mathematical matrices you have
> Vector (our dimension)
> Coordinate are numbers which describe the location of the points in a a 
> plan or space so you have a coordinate value for each vector.
> 
> The element in phyiscal location is providing the coordinate point on 
> the dimension and tells you its rank or order in the defined array.
The main point is to use the same term in different places. As I said I 
think these terms have the same meaning in different domains: dimension, 
vector, coordinate, axis, index (in programming arrays). As the NCubes 
in a clean form are actually OLAP cubes I would suggest to use dimension 
in logicalproduct AND physical.

> 
> 
> ok....i have to go back to my real job now :)
> 
> wendy
> 
> On Tue, 8 Jan 2008, Joachim Wackerow wrote:
> 
>> Please read this in time. It is on the agenda for Thursday.
>>
>>
>> Based on the discussion on ncube issues in December Wendy identified the
>> bug that DataItem is NOT repeatable in RecordLayout of the modules
>> physicaldataproduct/ncube/inline and physicaldataproduct/ncube/tabular.
>> See bug 100 in mantis:
>> http://mantis.ddialliance.org/view.php?id=100
>>
>>
>> 1.
>> Now I noticed that the element NCubeInstance is repeatable in
>> physicaldataproduct/ncube/normal, but not in 'inline' and not in
>> 'tabular'. There seems to be an inconsistency.
>> I'm not sure about the meaning and relationship of NCubeInstance and
>> related elements.
>> The documentation says:
>>
>> RecordLayoutType/NCubeInstance
>> Container to group data from common NCubes - this allows for NCube level
>> attributes to be defined.
>> (why common NCubes, not just NCubes?)
>>
>> RecordLayoutType/NCubeInstance/NCubeReference
>> Reference to the logical NCube description.
>>
>> RecordLayoutType/DataItem/NCubeInstanceReference
>> References the NCube description in the NCubeLogicalProduct.
>> (should probably read: references the related NCubeInstance.)
>>
>>
>> I looked again at the inline ncube (not at the other ncube modules):
>>
>> 2. Totals/subtotals (or margin sums or rollups)
>>
>> They should be represented in a way which is in line with the approach
>> in OLAP systems. This can make sense especially when the table does not
>> contain predetermined codes for subtotals. An application needs some
>> machine-actionable information that the subtotal is a subtotal not just
>> another code.
>>
>> One approach would be to use the approach of SQL/OLAP systems to
>> represent totals/subtotals as cells with a NULL value for the dimension
>> (or combination of dimensions) which the subtotals represents. Then 
>> the Coordinate/Value should have the value NULL for a subtotal/total.
>> This can be represented by an attribute like 'null="true"' or probably
>> better 'total="true"'. I would suggest to go one step further and just
>> to omit the dimensions with the value NULL. This way on the one hand the
>> cell is described clear enough and on the other hand we have no
>> unnecessary XML in the instance. This way subregions can be represented
>> in the data representation without defining the subtotal codes in 
>> NCubeLogicalProduct. That would have be described very clear in the
>> documentation.
>>
>> Predetermined codes for subtotals, which can be described in
>> NCubeLogicalProduct, can be recognized this way in a 
>> machine-actionable way. Currently an application can't identify a 
>> subtotal code in NCubeLogicalProduct as a subtotal. Perhaps this 
>> should additionally be possible in NCubeLogicalProduct. Then an 
>> indicator would be necessary for the related code.
>>
>> Subtotals in the NCubeLogicalProduct seem to need further research. The
>> examples which Wendy planned to make will perhaps shed some light on 
>> this issue.
>>
>>
>> SQL-related background
>>
>> Description of the SQL GROUP operators CUBE/ROLLUP
>>
>> CUBE generates a result set that shows aggregates for all combinations
>> of values in the selected columns.
>>
>> ROLLUP generates a result set that shows aggregates for a hierarchy of
>> values in the selected columns.
>>
>> Example:
>> SELECT Item, Color, SUM(Quantity) AS QtySum
>> FROM Inventory
>> GROUP BY Item, Color WITH CUBE
>>
>> Sample representation:
>> Item                 Color                QtySum
>> -------------------- -------------------- --------------------------
>> Chair                Blue                 101.00
>> Chair                Red                  210.00
>> Chair                (null)               311.00
>> Table                Blue                 124.00
>> Table                Red                  223.00
>> Table                (null)               347.00
>> (null)               (null)               658.00
>> (null)               Blue                 225.00
>> (null)               Red                  433.00
>>
>> see for example at the documentation of MS SQL Server:
>> http://technet.microsoft.com/en-us/library/aa213224(SQL.80).aspx
>>
>>
>> 3. empty cells / sparse matrices
>>
>> Empty cells (especially in the case of an huge amount of empty cells in
>> sparse matrices) can be just omitted as a DataItem. Because each 
>> existing data cell is described by its dimension coordinates this 
>> approach is not error prone. That special meaning of omitted data 
>> cells should be described in the documentation very clear.
>>
>>
>> 4. Naming issue
>>
>> For consistency reasons and easy understanding I would opt for the same
>> terms in the inline ncube module as in the NCubeLogicalProduct. I 
>> suggest a
>> name change from 'Coordinate' to 'Dimension' and from "CoordinateNumber"
>> to "DimensionRank" or just "Rank". This way it is clear that coordinate
>> is not invented for some new meaning but just the same as in the
>> dimension definition in NCubeLogicalProduct, same with rank. "Value" 
>> is fine
>> with OLAP talk, "value of dimension" seems to be an accepted term.
>>
>> Dimension seems to be the most wide spread term in the field of cubes.
>> Other terms are vector or coordinate.
>>
>> I'm not sure that it is really important to use the same terms as in DDI
>> 2.0 (it was "CubeCoord").
>>
>> "MeasureValue" should be perhaps only be "Measure".
>>
>> Example with suggested terms:
>> NCubeLogicalProduct
>>   <l:Dimension rank="1">
>>     <l:VariableReference>
>>       <r:ID>Variable_ALTER1</r:ID>
>>     </l:VariableReference>
>>   </l:Dimension>
>>
>> inline ncube
>>   <nci:Dimension>
>>     <nci:Rank>1</nci:Rank>
>>     <nci:Value>3</nci:Value>
>>   </nci:Dimension>
>>   <nci:Measure>
>>     <nci:MeasureReference>
>>       <r:ID>Measure</r:ID>
>>     </nci:MeasureReference>
>>     <nci:Value>696</nci:Value>
>>   </nci:Measure>
>>
>>
>> 5. Weighted measures
>>
>> Measures can be weighted. The measure definition references a variable
>> definition. In the variable definition one related weight variable can
>> be mentioned. This doesn't seems to be sufficient. A measure variable
>> can be weighted or not. It can be weighted by different variables. This
>> should be probably stated in the measure definition with an optional
>> element "WeightVariableReference" as in the variable description.
>>
>>
>> 6. Type of measure
>>
>> The type of measure can be described in the related variable
>> description. Additionally it would be perhaps reasonable to have a
>> machine-actionable field in the measure definition with a related
>> controlled vocabulary (similar as in CategoryStatisticType).
>>
>> A common type of measure are counts for any combination of the values of
>> the dimensions. Additionally a measure can represent for example
>> percentage, means, standard deviations for a dependent variable within
>> groups defined by one or more control (independent) variables. A common
>> type is for example row percentages in a two-dimensional table (could be
>> part of a multi-dimensional table).
>>
>> For this type of measure the additional indication of the dependent
>> dimension would be necessary like 'dependent="true"'. This needs
>> probably more research. I'm not sure if it is really sufficient to 
>> describe this type of measure completely (row percentage in each 
>> two-dimensional table is pretty clear - that is percentage for a 
>> dependent variable with the categories in columns. But percentage for 
>> a dependent variable for a dependent variable across two-dimensional 
>> tables? Does this make sense?).
>>
>>
>> 7. GrossRecordStructure
>>
>> This is required with LogicalRecordReference and PhysicalRecordSegment.
>> I still don't see the sense of that. Is this really necessary in the
>> case of a inline ncube? A use case would be helpful.
>>
>>
>> 8. Grand total
>>
>> How is it possible to document the grand total of a cube? The cell count
>> is stored in l:NCube/@cellCount (can be seen as the record count of the
>> cube in vector format). Perhaps an attribute 'grandTotal' for the
>> element nci:NCubeInstance would make sense. 'cellCount' and 
>> 'grandTotal' (at least when measure is count) are nice for consistency 
>> checks.
>>
>> The grand total in a OLAP cube is stored in a record where all
>> dimensions have the value NULL. The example in number 2 has a grand
>> total of 658 (where item and color is NULL). But according the proposal
>> in number 2 these kind of dimensions would be omitted, so it would be
>> not possible to store the grand total this way.
>>
>>
>> 8. Attribute
>>
>> Used in NCubeInstance, DataItem, and CoordinateGroup.
>>
>> Attribute/AttributeReference is a reference to an attribute 
>> description. Where does the description stay?
>>
>> Attribute/Value is just a string field. Is this machine-actionable?
>>
>> I'm not sure but Attribute seems to be some metadata in the data 
>> section. If yes, this should stay in NCubeLogicalProduct.
>>
>> An use case or example would be helpful for clarification, also with 
>> CoordinateGroup.
>>
>>
>> 9. Defaults for NCubeInstanceReference and MeasureReference
>>
>> For a single NCube with one measure the elements NCubeInstanceReference
>> and MeasureReference are repeated for each cell (DataItem) with the same
>> content. Perhaps defaults would make sense for these elements. These
>> defaults would have to stay just before the first DataItem.
>>
>>
>> Example
>> (is attached, valid against a fixed inline_ncube_recordlayout.xsd CR2
>> with repeatable DataItem)
>>
>> The example is according these two three-dimensional tables:
>> http://exanda.zuma-mannheim.de/Study_Wohlfahrtssurvey/Independent_SCHULAB/Dependent_ALTER1/Group_GESCHL/Weight_Weight/OutputType_Value/FileType_HTML/Trivariate.html 
>>
>> http://exanda.zuma-mannheim.de/Study_Wohlfahrtssurvey/Independent_SCHULAB/Dependent_ALTER1/Group_GESCHL/Weight_Weight/OutputType_RowPercent/FileType_HTML/Trivariate.html 
>>
>>
>> The cube has two measures (one of each table): count and row percentage
>> (percentage for the dependent variable ALTER1 for each two-dimensional
>> table of GESCHL). The subtotals (or margin sums) are included 
>> according the proposal in no. 2 from above.
>>
>>
>> Any comments or thoughts?
>>
>> Achim
>>
>>
> 
> Wendy L. Thomas                          Phone: +1 612.624.4389
> Data Access Core Director         Fax:   +1 612.626.8375
> Minnesota Population Center              Email: wlt at pop.umn.edu
> University of Minnesota
> 50 Willey Hall
> 225 19th Avenue South
> Minneapolis, MN 55455



More information about the DDI-SRG mailing list