[DDI-ADG] More on aggregate data.

J Gager j.b.gager at gmail.com
Fri Aug 26 16:25:15 EDT 2005


OK Everyone, 

I am trying to wrap all of my responses to the last two days of emails
into this one message.  If I miss any specific question, please ask
again.

I would like to expand and comment on what Sanda has outlined below:

Module 1).  I think we are on the right track here, assuming we make the
additions we discussed this week (also described in the eMail from me
with the subject "Aggregate Data Notes").  The changes I am suggesting
are to make it SDMX compatible.  Also, as I responded to an earlier
email from Jostein, I agree we don't want to take SDMX's TimeSeries
centric approach - and that is not an issue.  In SDMX, time *is* a
dimension; they just treat it specially for organizing the data.  As long
as we allow one to designate a variable as time, which as we
discussed today is already there, we will be able to interact nicely
with SDMX.  One last point about compatibility that I would like to make
is the concept of Groups in SDMX.  I am going to talk with Arofan a bit
more about this, so there may be another change - I just wanted to get
it out there - I will do the work to determine if we really need to do
anything.
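
To illustrate the point about time, here is a minimal Python sketch.
It assumes a hypothetical "is_time" flag on a dimension and made-up
observations (neither is taken from the 3.0 spreadsheet or from SDMX).
It only shows that once one dimension is designated as time, a
series-by-series view of the same cube can be derived mechanically.

```python
from collections import defaultdict

# Hypothetical dimensions; "is_time" is an assumed flag, not a DDI element name.
dimensions = [
    {"id": "region", "is_time": False},
    {"id": "sex", "is_time": False},
    {"id": "year", "is_time": True},
]

# Made-up observations: (dimension values, measure value).
observations = [
    ({"region": "North", "sex": "M", "year": "2003"}, 120),
    ({"region": "North", "sex": "M", "year": "2004"}, 125),
    ({"region": "South", "sex": "F", "year": "2003"}, 98),
]


def to_time_series(dimensions, observations):
    """Group observations into one series per combination of non-time values."""
    time_ids = [d["id"] for d in dimensions if d["is_time"]]
    if len(time_ids) != 1:
        raise ValueError("expected exactly one dimension flagged as time")
    time_id = time_ids[0]
    series = defaultdict(dict)
    for coords, value in observations:
        # The series key is every dimension value except the time value.
        key = tuple(
            (d["id"], coords[d["id"]]) for d in dimensions if not d["is_time"]
        )
        series[key][coords[time_id]] = value
    return dict(series)


series = to_time_series(dimensions, observations)
```

Nothing here requires the cube itself to be organized around time; the
grouping is just one view computed from the flagged dimension.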

Module 2 & 3).  I am wondering if we may want to roll all of this into
one flexible module, to disambiguate the situation.
If I am understanding all of the previous discussions correctly, we are
basically suggesting 3 options for describing the actual physical data
(I am ignoring any attribute discussion for the sake of simplicity):

1.  The existing 2d way of saying that for dimension 1 = value X,
dimension 2 = value Y, the measure is located some place in a table.
2.  A new way to say that for an nCube, the data is structured like this:
dimension 1 value is at location X, dimension 2 value is at location Y,
and the value for this observation is at location Z, where the location
could be given by text delimitation or cells.
3.  An SDMX-type approach where we follow approach 1 for supplying
dimension values, and put the actual value of the observation in the
tag.

I would suggest we have something like this (and PLEASE keep in mind
this is just a rough sample to open discussion, and again I am ignoring
any new structure we would add for attributes).

DataItem
    Type                  1..1  Controlled list describing which of the
                                3 cases above applies: FlatDataFile
                                (case 1), CompleteCubeTable (case 2),
                                or Inline (case 3)
    CubeCoord             0..N
        CubeID            1..1  nCube ID
        DimID             1..1  Dimension ID
        DimValue          1..1  Value of the dimension
            Choice        1..1  Either actual value or location in file
                Loc       1..1  Location of dimension value in file
                Value     1..1  Actual value of dimension
            End Choice
    MeasureValue          1..1  Value of the measured phenomenon
        Choice            1..1  Either pointing to value in file or
                                providing actual value
            Loc           1..1  Location of value in file
            Value         1..1  Actual value of measure
        End Choice
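
For anyone who finds it easier to read as code, the outline above could
be sketched roughly as the Python below.  The class and field names are
my own transliteration of the outline (LocOrValue is a made-up name for
the Choice), and Loc is treated as an opaque string since its details
are not settled.  This is a rough sketch, not a proposed schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class DataItemType(Enum):
    FLAT_DATA_FILE = "FlatDataFile"            # case 1
    COMPLETE_CUBE_TABLE = "CompleteCubeTable"  # case 2
    INLINE = "Inline"                          # case 3


@dataclass
class LocOrValue:
    """The Choice in the outline: exactly one of loc or value is set."""
    loc: Optional[str] = None    # location in file (format still undecided)
    value: Optional[str] = None  # actual value

    def __post_init__(self):
        if (self.loc is None) == (self.value is None):
            raise ValueError("exactly one of loc or value is required")


@dataclass
class CubeCoord:
    cube_id: str           # CubeID 1..1
    dim_id: str            # DimID 1..1
    dim_value: LocOrValue  # DimValue 1..1


@dataclass
class DataItem:
    type: DataItemType                                          # Type 1..1
    measure_value: LocOrValue                                   # MeasureValue 1..1
    cube_coords: List[CubeCoord] = field(default_factory=list)  # CubeCoord 0..N
```

The only rule enforced here is the Choice itself: a DimValue or
MeasureValue carries either a Loc or a Value, never both and never
neither.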

So for case 1, it would pretty much be as it exists today.  For case 2,
you would really only have 1 data item, which would describe every row
of data in the file (no need to repeat for each set of dimension
values).  For case 3, it would look about the same as case 1, only the
values would be inline (similar to SDMX).
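
A small Python sketch of how the three cases would differ in practice.
The file contents, the dictionary keys, and the idea that Loc is a
zero-based column number are all assumptions made for the example; the
real Loc would need more than this (for case 1 it would have to
pinpoint a cell, so here the caller simply supplies the row).

```python
import csv
import io

# Hypothetical delimited file with columns: sex, region, count.
DATA = "M,North,120\nF,North,135\nM,South,98\n"
rows = list(csv.reader(io.StringIO(DATA)))


def resolve(ref, row):
    """Return an inline Value, or read the column that Loc points at."""
    if "value" in ref:
        return ref["value"]
    return row[ref["loc"]]


def observation(item, row=None):
    """Turn one DataItem (plus a file row, if needed) into (coords, measure)."""
    coords = {c["dim_id"]: resolve(c["dim_value"], row) for c in item["coords"]}
    return coords, resolve(item["measure"], row)


# Case 1 (FlatDataFile): dimension values are stated, the measure is a Loc.
case1 = {
    "type": "FlatDataFile",
    "coords": [
        {"cube_id": "C1", "dim_id": "sex", "dim_value": {"value": "M"}},
        {"cube_id": "C1", "dim_id": "region", "dim_value": {"value": "North"}},
    ],
    "measure": {"loc": 2},
}

# Case 2 (CompleteCubeTable): one DataItem, all Locs, reused for every row.
case2 = {
    "type": "CompleteCubeTable",
    "coords": [
        {"cube_id": "C1", "dim_id": "sex", "dim_value": {"loc": 0}},
        {"cube_id": "C1", "dim_id": "region", "dim_value": {"loc": 1}},
    ],
    "measure": {"loc": 2},
}

# Case 3 (Inline): one DataItem per observation, all Values, no file at all.
case3 = {
    "type": "Inline",
    "coords": [
        {"cube_id": "C1", "dim_id": "sex", "dim_value": {"value": "M"}},
        {"cube_id": "C1", "dim_id": "region", "dim_value": {"value": "North"}},
    ],
    "measure": {"value": "120"},
}

from_case1 = observation(case1, rows[0])
from_case2 = [observation(case2, row) for row in rows]
from_case3 = observation(case3)
```

All three describe the same first observation; case 2 is the only one
where a single DataItem covers the whole file.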

Obviously we will need to work out the details of Loc and Value (for
instance including multi-layered files in Loc, or whether to state the
value by reference in Value or use actual values), but the basic concept
is what is important right now.

Another similar approach would be to take the same basic concept, but
move it up a level to the FileDescription - pretty much the same as
Sanda is suggesting, except that we separate cases 1 and 2.  That is,
you can have either:
	The "classic" file description (with some added bonuses new to
3.0 such as attributes) - case 1
	A self-describing file (Jostein's use case with the dimension
values in the file) - case 2
	No file, values inline - SDMX-like behavior - case 3
Within these 3 types, we could have stricter control over making sure
the structures are properly used - for example, in case 2 you no longer
have a choice between Loc and Value, you only have Loc.  In a way, this
may be even better.
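
Roughly, in Python terms, the stricter variant could look like the
sketch below.  The three class names and their fields are invented for
the example and are not proposed element names; Loc is again just a
string.  The point is only that each type carries exactly the fields
that make sense for its case.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class ClassicFile:
    """Case 1: dimension values are stated, the measure is located in a file."""
    dim_values: Dict[str, str]  # DimID -> actual value
    measure_loc: str            # location of the measure in the file


@dataclass(frozen=True)
class SelfDescribingFile:
    """Case 2: everything is a location; there is no Value to choose."""
    dim_locs: Dict[str, str]    # DimID -> location in the file
    measure_loc: str


@dataclass(frozen=True)
class InlineData:
    """Case 3: no file, all values are given inline."""
    dim_values: Dict[str, str]
    measure_value: str


FileDescription = Union[ClassicFile, SelfDescribingFile, InlineData]


def needs_external_file(desc: FileDescription) -> bool:
    """Only the inline case can live entirely inside the instance."""
    return not isinstance(desc, InlineData)
```

With this shape the Loc/Value choice disappears from the individual
items, because the type of the description already settles it.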

So those are my thoughts.  Please digest and comment.

J

-----Original Message-----
From: ddi-adg-bounces at icpsr.umich.edu
[mailto:ddi-adg-bounces at icpsr.umich.edu] On Behalf Of Sanda Ionescu
Sent: Friday, August 26, 2005 3:16 PM
To: Jostein.Ryssevik at nsd.uib.no; DDI-ADG
Subject: [DDI-ADG] More on aggregate data.


Hi, all.

First of all, thank you, Jostein, for your messages - I think they're
really helpful in moving us along.
While we talk about aggregate data, I think it is important to keep in
mind the modular structure we envisage for Version 3.0.
A desirable scenario for covering aggregate data might be to end up with
three different modules:
1) a module documenting the logical structure of the data (provided for
in the Logical Product- nCube package in the V 3.0 spreadsheet) and
including dimensions and cubes descriptions.
2) a module primarily designed for data exchange, containing both data
and some metadata, modeled after the SDMX specification - I particularly
liked the example I sent yesterday, extracted from the Generic Sample,
although I am not sure what level of validation we would actually need.
3) finally, a module describing the physical structure of an external
data file, that we (the archive) might choose to describe and distribute
in a legacy format (like Census data, etc.)
(this would be an (improved?) version of the Phys. Rec. Structure
Package in the V 3.0 spreadsheet).
Obviously, there will be links (cross-references) between the modules,
particularly between 1) and 3) and 1) and 2).

With these three modules, producers or distributors would have the
flexibility to use any combination of data and metadata they would find
suitable to their purposes, and the data could sit either within or
outside the DDI instance.

Some questions:

Module 1) -- what do we need to add to make it more functional, (and
SDMX-compatible) ? J ??
also, while I'm looking at the above-mentioned spreadsheet, I notice
that variables are described twice, once as "variables" and once as
"variable dimensions". I think that's probably a mistake -- in this
module we only need to describe "dimensions."

Module 2) -- I fully agree with Jostein's remark that time variables
need to be accounted for as "dimensions". Other than that, what other
changes/adjustments do we need? And, I'm sure others will agree, even if
we adopt a structure similar to SDMX, we might want tag names that are
more suggestive of their contents. (J, I'm afraid we might need to rely
on you to provide an outline of this section, when we agree on what goes
in.)

Module 3) -- Right now the LocMap only provides for identifying cells in
a flat delimited file. Do we want to add anything here?

Sanda.



Sanda Ionescu,
Research Associate
Inter-university Consortium for Political and Social Research (ICPSR)
The University of Michigan P.O. Box 1248 Ann Arbor, MI 48106

Phone: (734) 615-7890
Fax: (734) 615-7890
        (734) 647-8200
