[DDI-SRG] inline NCube: open issues and improvements
Wendy Thomas
wlt at pop.umn.edu
Tue Jan 8 14:12:17 EST 2008
a few response notes:
1)There NCubeIntance is repeatable in physicaldataproduct/ncube/normal
because the physical file being described can contain multiple NCubes
(example of all the NHGIS data).
Inline NCube assumes a single NCube for each inlinencube described (inline
itself is repeatable as a BaseRecordLayout).
Tabular makes the same assumption I think, although in this case it may be
inaccurate as a display table may contain multiple ncubes with one or more
common dimensions.
2) I still think this is handled by nested code schemes. For example: the
NAICS (North American Industrial Classification Scheme) has such roll ups
already in it
00 All Industries LEVEL 1
11 Agriculture, forestry, fishing and hunting LEVEL 2
111 Crop Products LEVEL 3
1111 Oilseed and Grain Farming LEVEL 4
11111 Soybean Farming LEVEL 5
11112 Oilseed (except Soybean Farming) LEVEL 5
...
1112 Vegetable and Melon Farming LEVEL 4
etc.
You need to know the level information and structure for identifying
appropriate rollups and what they are rollups of. As long as I know that a
codescheme includes a "total" and that it contains all the other levels
then it doesn't matter if its code is "0" or "fred". Also saying something
is a total in a multi dimensional NCube is different than the "total" of a
single dimension (variable).
Sex (3) total=0 male=1 female=2
Age (3) total=0 under 18=1 18 plus=2
Race (3) total=0 white=1 other=2
all with a "total" designated as "0" I could have.
0,0,0
0,0,1
0,0,2
0,1,0
0,1,1
0,1,2
0,2,0
0,2,1
0,2,2
etc down to 0,0,0 as long as one of the coordinates was zero.
I guess I'm just trying to see what you are gaining by adding information
about the content description in the physical description and noting that
you really can't constrain the coding systems to using '0' as a total
identifier.
I may be misunderstanding what you are unable to do if the logical product
documentation is writen correctly so please tell me what I'm missing.
Also, since all cells in an NCube need to be described when would you have
a table not containing predetermined codes for subtotals? no code, no
cell. The variable and NCube state if something is addative which is a
matter of the structure, the measure, and suppression.
"Currently an application can't indentify a subtotal code in NCube Logical
Product as a subtotal." Clearly in the above example of NAICS Level 1
contains level 2 etc and everything is appropriately nested in its parent.
Irregular hierarchies (where not all level items have sublevels) are
identified with "descrete". all those not descrete items are sub-totals.
3) need to clarify in both field level and in the high level "how to
apply" sections.
4) do we match this or match how 2.0 does the location information. I
guess I've always looked at the cell location inforamtion in a matrix as
it coordiante value.
in Mathematical matrices you have
Vector (our dimension)
Coordinate are numbers which describe the location of the points in a a
plan or space so you have a coordinate value for each vector.
The element in phyiscal location is providing the coordinate point on the
dimension and tells you its rank or order in the defined array.
ok....i have to go back to my real job now :)
wendy
On Tue, 8 Jan 2008, Joachim Wackerow wrote:
> Please read this in time. It is on the agenda for Thursday.
>
>
> Based on the discussion on ncube issues in December Wendy identified the
> bug that DataItem is NOT repeatable in RecordLayout of the modules
> physicaldataproduct/ncube/inline and physicaldataproduct/ncube/tabular.
> See bug 100 in mantis:
> http://mantis.ddialliance.org/view.php?id=100
>
>
> 1.
> Now I noticed that the element NCubeInstance is repeatable in
> physicaldataproduct/ncube/normal, but not in 'inline' and not in
> 'tabular'. There seems to be an inconsistency.
> I'm not sure about the meaning and relationship of NCubeInstance and
> related elements.
> The documentation says:
>
> RecordLayoutType/NCubeInstance
> Container to group data from common NCubes - this allows for NCube level
> attributes to be defined.
> (why common NCubes, not just NCubes?)
>
> RecordLayoutType/NCubeInstance/NCubeReference
> Reference to the logical NCube description.
>
> RecordLayoutType/DataItem/NCubeInstanceReference
> References the NCube description in the NCubeLogicalProduct.
> (should probably read: references the related NCubeInstance.)
>
>
> I looked again at the inline ncube (not at the other ncube modules):
>
> 2. Totals/subtotals (or margin sums or rollups)
>
> They should be represented in a way which is in line with the approach
> in OLAP systems. This can make sense especially when the table does not
> contain predetermined codes for subtotals. An application needs some
> machine-actionable information that the subtotal is a subtotal not just
> another code.
>
> One approach would be to use the approach of SQL/OLAP systems to
> represent totals/subtotals as cells with a NULL value for the dimension
> (or combination of dimensions) which the subtotals represents. Then the
> Coordinate/Value should have the value NULL for a subtotal/total.
> This can be represented by an attribute like 'null="true"' or probably
> better 'total="true"'. I would suggest to go one step further and just
> to omit the dimensions with the value NULL. This way on the one hand the
> cell is described clear enough and on the other hand we have no
> unnecessary XML in the instance. This way subregions can be represented
> in the data representation without defining the subtotal codes in
> NCubeLogicalProduct. That would have be described very clear in the
> documentation.
>
> Predetermined codes for subtotals, which can be described in
> NCubeLogicalProduct, can be recognized this way in a machine-actionable way.
> Currently an application can't identify a subtotal code in
> NCubeLogicalProduct as a subtotal. Perhaps this should additionally be
> possible in NCubeLogicalProduct. Then an indicator would be necessary for the
> related code.
>
> Subtotals in the NCubeLogicalProduct seem to need further research. The
> examples which Wendy planned to make will perhaps shed some light on this
> issue.
>
>
> SQL-related background
>
> Description of the SQL GROUP operators CUBE/ROLLUP
>
> CUBE generates a result set that shows aggregates for all combinations
> of values in the selected columns.
>
> ROLLUP generates a result set that shows aggregates for a hierarchy of
> values in the selected columns.
>
> Example:
> SELECT Item, Color, SUM(Quantity) AS QtySum
> FROM Inventory
> GROUP BY Item, Color WITH CUBE
>
> Sample representation:
> Item Color QtySum
> -------------------- -------------------- --------------------------
> Chair Blue 101.00
> Chair Red 210.00
> Chair (null) 311.00
> Table Blue 124.00
> Table Red 223.00
> Table (null) 347.00
> (null) (null) 658.00
> (null) Blue 225.00
> (null) Red 433.00
>
> see for example at the documentation of MS SQL Server:
> http://technet.microsoft.com/en-us/library/aa213224(SQL.80).aspx
>
>
> 3. empty cells / sparse matrices
>
> Empty cells (especially in the case of an huge amount of empty cells in
> sparse matrices) can be just omitted as a DataItem. Because each existing
> data cell is described by its dimension coordinates this approach is not
> error prone. That special meaning of omitted data cells should be described
> in the documentation very clear.
>
>
> 4. Naming issue
>
> For consistency reasons and easy understanding I would opt for the same
> terms in the inline ncube module as in the NCubeLogicalProduct. I suggest a
> name change from 'Coordinate' to 'Dimension' and from "CoordinateNumber"
> to "DimensionRank" or just "Rank". This way it is clear that coordinate
> is not invented for some new meaning but just the same as in the
> dimension definition in NCubeLogicalProduct, same with rank. "Value" is fine
> with OLAP talk, "value of dimension" seems to be an accepted term.
>
> Dimension seems to be the most wide spread term in the field of cubes.
> Other terms are vector or coordinate.
>
> I'm not sure that it is really important to use the same terms as in DDI
> 2.0 (it was "CubeCoord").
>
> "MeasureValue" should be perhaps only be "Measure".
>
> Example with suggested terms:
> NCubeLogicalProduct
> <l:Dimension rank="1">
> <l:VariableReference>
> <r:ID>Variable_ALTER1</r:ID>
> </l:VariableReference>
> </l:Dimension>
>
> inline ncube
> <nci:Dimension>
> <nci:Rank>1</nci:Rank>
> <nci:Value>3</nci:Value>
> </nci:Dimension>
> <nci:Measure>
> <nci:MeasureReference>
> <r:ID>Measure</r:ID>
> </nci:MeasureReference>
> <nci:Value>696</nci:Value>
> </nci:Measure>
>
>
> 5. Weighted measures
>
> Measures can be weighted. The measure definition references a variable
> definition. In the variable definition one related weight variable can
> be mentioned. This doesn't seems to be sufficient. A measure variable
> can be weighted or not. It can be weighted by different variables. This
> should be probably stated in the measure definition with an optional
> element "WeightVariableReference" as in the variable description.
>
>
> 6. Type of measure
>
> The type of measure can be described in the related variable
> description. Additionally it would be perhaps reasonable to have a
> machine-actionable field in the measure definition with a related
> controlled vocabulary (similar as in CategoryStatisticType).
>
> A common type of measure are counts for any combination of the values of
> the dimensions. Additionally a measure can represent for example
> percentage, means, standard deviations for a dependent variable within
> groups defined by one or more control (independent) variables. A common
> type is for example row percentages in a two-dimensional table (could be
> part of a multi-dimensional table).
>
> For this type of measure the additional indication of the dependent
> dimension would be necessary like 'dependent="true"'. This needs
> probably more research. I'm not sure if it is really sufficient to describe
> this type of measure completely (row percentage in each two-dimensional table
> is pretty clear - that is percentage for a dependent variable with the
> categories in columns. But percentage for a dependent variable for a
> dependent variable across two-dimensional tables? Does this make sense?).
>
>
> 7. GrossRecordStructure
>
> This is required with LogicalRecordReference and PhysicalRecordSegment.
> I still don't see the sense of that. Is this really necessary in the
> case of a inline ncube? A use case would be helpful.
>
>
> 8. Grand total
>
> How is it possible to document the grand total of a cube? The cell count
> is stored in l:NCube/@cellCount (can be seen as the record count of the
> cube in vector format). Perhaps an attribute 'grandTotal' for the
> element nci:NCubeInstance would make sense. 'cellCount' and 'grandTotal' (at
> least when measure is count) are nice for consistency checks.
>
> The grand total in a OLAP cube is stored in a record where all
> dimensions have the value NULL. The example in number 2 has a grand
> total of 658 (where item and color is NULL). But according the proposal
> in number 2 these kind of dimensions would be omitted, so it would be
> not possible to store the grand total this way.
>
>
> 8. Attribute
>
> Used in NCubeInstance, DataItem, and CoordinateGroup.
>
> Attribute/AttributeReference is a reference to an attribute description.
> Where does the description stay?
>
> Attribute/Value is just a string field. Is this machine-actionable?
>
> I'm not sure but Attribute seems to be some metadata in the data section. If
> yes, this should stay in NCubeLogicalProduct.
>
> An use case or example would be helpful for clarification, also with
> CoordinateGroup.
>
>
> 9. Defaults for NCubeInstanceReference and MeasureReference
>
> For a single NCube with one measure the elements NCubeInstanceReference
> and MeasureReference are repeated for each cell (DataItem) with the same
> content. Perhaps defaults would make sense for these elements. These
> defaults would have to stay just before the first DataItem.
>
>
> Example
> (is attached, valid against a fixed inline_ncube_recordlayout.xsd CR2
> with repeatable DataItem)
>
> The example is according these two three-dimensional tables:
> http://exanda.zuma-mannheim.de/Study_Wohlfahrtssurvey/Independent_SCHULAB/Dependent_ALTER1/Group_GESCHL/Weight_Weight/OutputType_Value/FileType_HTML/Trivariate.html
> http://exanda.zuma-mannheim.de/Study_Wohlfahrtssurvey/Independent_SCHULAB/Dependent_ALTER1/Group_GESCHL/Weight_Weight/OutputType_RowPercent/FileType_HTML/Trivariate.html
>
> The cube has two measures (one of each table): count and row percentage
> (percentage for the dependent variable ALTER1 for each two-dimensional
> table of GESCHL). The subtotals (or margin sums) are included according the
> proposal in no. 2 from above.
>
>
> Any comments or thoughts?
>
> Achim
>
>
Wendy L. Thomas Phone: +1 612.624.4389
Data Access Core Director Fax: +1 612.626.8375
Minnesota Population Center Email: wlt at pop.umn.edu
University of Minnesota
50 Willey Hall
225 19th Avenue South
Minneapolis, MN 55455
More information about the DDI-SRG
mailing list