[DDI-SRG] inline NCube: open issues and improvements
Joachim Wackerow
joachim.wackerow at gesis.org
Tue Jan 8 13:03:11 EST 2008
Please read this in time. It is on the agenda for Thursday.
Based on the discussion on ncube issues in December Wendy identified the
bug that DataItem is NOT repeatable in RecordLayout of the modules
physicaldataproduct/ncube/inline and physicaldataproduct/ncube/tabular.
See bug 100 in mantis:
http://mantis.ddialliance.org/view.php?id=100
1.
Now I noticed that the element NCubeInstance is repeatable in
physicaldataproduct/ncube/normal, but not in 'inline' and not in
'tabular'. There seems to be an inconsistency.
I'm not sure about the meaning and relationship of NCubeInstance and
related elements.
The documentation says:
RecordLayoutType/NCubeInstance
Container to group data from common NCubes - this allows for NCube level
attributes to be defined.
(why common NCubes, not just NCubes?)
RecordLayoutType/NCubeInstance/NCubeReference
Reference to the logical NCube description.
RecordLayoutType/DataItem/NCubeInstanceReference
References the NCube description in the NCubeLogicalProduct.
(should probably read: references the related NCubeInstance.)
I looked again at the inline ncube (not at the other ncube modules):
2. Totals/subtotals (or margin sums or rollups)
They should be represented in a way which is in line with the approach
in OLAP systems. This can make sense especially when the table does not
contain predetermined codes for subtotals. An application needs some
machine-actionable information that the subtotal is a subtotal not just
another code.
One approach would be to use the approach of SQL/OLAP systems to
represent totals/subtotals as cells with a NULL value for the dimension
(or combination of dimensions) which the subtotals represents. Then the
Coordinate/Value should have the value NULL for a subtotal/total.
This can be represented by an attribute like 'null="true"' or probably
better 'total="true"'. I would suggest to go one step further and just
to omit the dimensions with the value NULL. This way on the one hand the
cell is described clear enough and on the other hand we have no
unnecessary XML in the instance. This way subregions can be represented
in the data representation without defining the subtotal codes in
NCubeLogicalProduct. That would have be described very clear in the
documentation.
Predetermined codes for subtotals, which can be described in
NCubeLogicalProduct, can be recognized this way in a machine-actionable
way. Currently an application can't identify a subtotal code in
NCubeLogicalProduct as a subtotal. Perhaps this should additionally be
possible in NCubeLogicalProduct. Then an indicator would be necessary
for the related code.
Subtotals in the NCubeLogicalProduct seem to need further research. The
examples which Wendy planned to make will perhaps shed some light on
this issue.
SQL-related background
Description of the SQL GROUP operators CUBE/ROLLUP
CUBE generates a result set that shows aggregates for all combinations
of values in the selected columns.
ROLLUP generates a result set that shows aggregates for a hierarchy of
values in the selected columns.
Example:
SELECT Item, Color, SUM(Quantity) AS QtySum
FROM Inventory
GROUP BY Item, Color WITH CUBE
Sample representation:
Item Color QtySum
-------------------- -------------------- --------------------------
Chair Blue 101.00
Chair Red 210.00
Chair (null) 311.00
Table Blue 124.00
Table Red 223.00
Table (null) 347.00
(null) (null) 658.00
(null) Blue 225.00
(null) Red 433.00
see for example at the documentation of MS SQL Server:
http://technet.microsoft.com/en-us/library/aa213224(SQL.80).aspx
3. empty cells / sparse matrices
Empty cells (especially in the case of an huge amount of empty cells in
sparse matrices) can be just omitted as a DataItem. Because each
existing data cell is described by its dimension coordinates this
approach is not error prone. That special meaning of omitted data cells
should be described in the documentation very clear.
4. Naming issue
For consistency reasons and easy understanding I would opt for the same
terms in the inline ncube module as in the NCubeLogicalProduct. I suggest a
name change from 'Coordinate' to 'Dimension' and from "CoordinateNumber"
to "DimensionRank" or just "Rank". This way it is clear that coordinate
is not invented for some new meaning but just the same as in the
dimension definition in NCubeLogicalProduct, same with rank. "Value" is
fine
with OLAP talk, "value of dimension" seems to be an accepted term.
Dimension seems to be the most wide spread term in the field of cubes.
Other terms are vector or coordinate.
I'm not sure that it is really important to use the same terms as in DDI
2.0 (it was "CubeCoord").
"MeasureValue" should be perhaps only be "Measure".
Example with suggested terms:
NCubeLogicalProduct
<l:Dimension rank="1">
<l:VariableReference>
<r:ID>Variable_ALTER1</r:ID>
</l:VariableReference>
</l:Dimension>
inline ncube
<nci:Dimension>
<nci:Rank>1</nci:Rank>
<nci:Value>3</nci:Value>
</nci:Dimension>
<nci:Measure>
<nci:MeasureReference>
<r:ID>Measure</r:ID>
</nci:MeasureReference>
<nci:Value>696</nci:Value>
</nci:Measure>
5. Weighted measures
Measures can be weighted. The measure definition references a variable
definition. In the variable definition one related weight variable can
be mentioned. This doesn't seems to be sufficient. A measure variable
can be weighted or not. It can be weighted by different variables. This
should be probably stated in the measure definition with an optional
element "WeightVariableReference" as in the variable description.
6. Type of measure
The type of measure can be described in the related variable
description. Additionally it would be perhaps reasonable to have a
machine-actionable field in the measure definition with a related
controlled vocabulary (similar as in CategoryStatisticType).
A common type of measure are counts for any combination of the values of
the dimensions. Additionally a measure can represent for example
percentage, means, standard deviations for a dependent variable within
groups defined by one or more control (independent) variables. A common
type is for example row percentages in a two-dimensional table (could be
part of a multi-dimensional table).
For this type of measure the additional indication of the dependent
dimension would be necessary like 'dependent="true"'. This needs
probably more research. I'm not sure if it is really sufficient to
describe this type of measure completely (row percentage in each
two-dimensional table is pretty clear - that is percentage for a
dependent variable with the categories in columns. But percentage for a
dependent variable for a dependent variable across two-dimensional
tables? Does this make sense?).
7. GrossRecordStructure
This is required with LogicalRecordReference and PhysicalRecordSegment.
I still don't see the sense of that. Is this really necessary in the
case of a inline ncube? A use case would be helpful.
8. Grand total
How is it possible to document the grand total of a cube? The cell count
is stored in l:NCube/@cellCount (can be seen as the record count of the
cube in vector format). Perhaps an attribute 'grandTotal' for the
element nci:NCubeInstance would make sense. 'cellCount' and 'grandTotal'
(at least when measure is count) are nice for consistency checks.
The grand total in a OLAP cube is stored in a record where all
dimensions have the value NULL. The example in number 2 has a grand
total of 658 (where item and color is NULL). But according the proposal
in number 2 these kind of dimensions would be omitted, so it would be
not possible to store the grand total this way.
8. Attribute
Used in NCubeInstance, DataItem, and CoordinateGroup.
Attribute/AttributeReference is a reference to an attribute description.
Where does the description stay?
Attribute/Value is just a string field. Is this machine-actionable?
I'm not sure but Attribute seems to be some metadata in the data
section. If yes, this should stay in NCubeLogicalProduct.
An use case or example would be helpful for clarification, also with
CoordinateGroup.
9. Defaults for NCubeInstanceReference and MeasureReference
For a single NCube with one measure the elements NCubeInstanceReference
and MeasureReference are repeated for each cell (DataItem) with the same
content. Perhaps defaults would make sense for these elements. These
defaults would have to stay just before the first DataItem.
Example
(is attached, valid against a fixed inline_ncube_recordlayout.xsd CR2
with repeatable DataItem)
The example is according these two three-dimensional tables:
http://exanda.zuma-mannheim.de/Study_Wohlfahrtssurvey/Independent_SCHULAB/Dependent_ALTER1/Group_GESCHL/Weight_Weight/OutputType_Value/FileType_HTML/Trivariate.html
http://exanda.zuma-mannheim.de/Study_Wohlfahrtssurvey/Independent_SCHULAB/Dependent_ALTER1/Group_GESCHL/Weight_Weight/OutputType_RowPercent/FileType_HTML/Trivariate.html
The cube has two measures (one of each table): count and row percentage
(percentage for the dependent variable ALTER1 for each two-dimensional
table of GESCHL). The subtotals (or margin sums) are included according
the proposal in no. 2 from above.
Any comments or thoughts?
Achim
-------------- next part --------------
A non-text attachment was scrubbed...
Name: inline_ncube_2008-01-07.xml
Type: text/xml
Size: 52230 bytes
Desc: not available
Url : http://www.icpsr.umich.edu/pipermail/ddi-srg/attachments/20080108/8a1366b5/attachment-0001.xml
More information about the DDI-SRG
mailing list