[DDI-ADG] More on aggregate data.
Katherine McNeill-Harman
mcneillh at MIT.EDU
Mon Aug 29 12:57:11 EDT 2005
A couple of comments/questions in advance of tomorrow. More on the bigger
picture than the specific tags/elements used (w/which others are more
familiar):
1) Just to make sure I understand, Sanda had originally (8/25) proposed a
format like the SDMX generic, but as we discussed last week, that has the
downside of requiring a separate file describing the structure. However,
I believe Jostein's response (8/26) included the structure info in the
same file. I think it's important to keep the two together. It seems like
we're all on the same page w/that, but I just wanted to reaffirm it.
2) Another thing just to double-check. I know that we're not treating time
in the same way SDMX does (per recent discussion), but we will still be
able to accommodate time series, correct? I believe the answer is yes, but
I'm double-checking.
3) Question about the modules. I'm not sure about J's suggestion of
combining modules 2/3. From Sanda's description, (8/26), if I understand
it correctly, module 2 is used only when the data/metadata are in the same
file, whereas module 3 will be used for "describing the physical structure
of an external data file." In my mind these are mutually exclusive; one
would use either one or the other, which argues for keeping them
separate. But this may also relate to my question below.
4) A question which might be most easily answered over the phone. I'm not
quite understanding what the CompleteCubeTable (J's case 2) is. I can wrap
my mind around a separate flat file and the data included w/in, but don't
understand the difference between cases 2 and 3 (what makes the
"self-describing" file different?). However, if everyone else is crystal
clear on this, we don't need to spend a lot of time at this late date
explaining it just to me.
I'll send separately later any notes I have down that we might want to
include in the notes column of the ultimate aggregate data spreadsheet.
Kate
At 04:25 PM 8/26/2005 -0400, J Gager wrote:
>OK Everyone,
>
>I am trying to wrap all of my responses to the last two days of emails
>into this one message. If I miss any specific question, please ask
>again.
>
>I would like to expand and comment on what Sanda has outlined below:
>
>Module 1). I think we are on the right track here, assuming we make the
>additions we discussed this week (also described in the eMail from me
>with the subject "Aggregate Data Notes"). The changes I am suggesting
>are to make it SDMX compatible. Also, as I responded to an earlier
>email from Jostein, I agree we don't want to take SDMX's TimeSeries
>centric approach - and that is not an issue. In SDMX, the time *is* a
>dimension, they just treat it special for organization of data. As long
>as we allow for one to designate a variable as time, which as we
>discussed today is already there, we will be able to interact nicely
>with SDMX. One last point about compatibility that I would like to make
>is the concept of Groups in SDMX. I am going to talk with Arofan a bit
>more about this, so there may be another change - I just wanted to get
>it out there - I will do the work to determine if we really need to do
>anything.
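
To illustrate the point above that time can be an ordinary dimension that is merely designated as time, here is a minimal Python sketch. It is not DDI or SDMX syntax: the `Dimension` class, the `is_time` flag, and `to_time_series` are hypothetical names made up for this example. It only shows that once one dimension is flagged as time, flat observations can be regrouped into time series on demand.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    dim_id: str
    is_time: bool = False  # hypothetical flag marking the time variable


def to_time_series(dimensions, observations):
    """Group flat observations into series keyed by the non-time dimensions.

    observations: iterable of (coords, value), where coords maps dim_id -> value.
    Returns {series_key: [(time, value), ...]}, each series sorted by time.
    """
    time_ids = [d.dim_id for d in dimensions if d.is_time]
    if len(time_ids) != 1:
        raise ValueError("exactly one dimension must be designated as time")
    time_id = time_ids[0]
    other_ids = [d.dim_id for d in dimensions if not d.is_time]

    series = defaultdict(list)
    for coords, value in observations:
        key = tuple(coords[dim_id] for dim_id in other_ids)
        series[key].append((coords[time_id], value))
    return {key: sorted(points) for key, points in series.items()}


# Made-up sample data, for illustration only.
dims = [Dimension("region"), Dimension("sex"), Dimension("year", is_time=True)]
obs = [
    ({"region": "North", "sex": "F", "year": 2001}, 12),
    ({"region": "North", "sex": "F", "year": 2000}, 10),
    ({"region": "South", "sex": "M", "year": 2000}, 7),
]
series = to_time_series(dims, obs)
```

The cube itself stays time-neutral; the series view is derived only when a consumer asks for it.
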
>
>Module 2 & 3). I am wondering if we may want to roll all of this into
>one flexible module, to disambiguate the situation.
>If I am understanding all of the previous discussions correctly, we are
>basically suggesting 3 options for describing the actual physical data
>(I am ignoring any attribute discussion for the sake of simplicity):
>
>1. The existing 2d way of saying for dimension 1 = value X, dimension 2
>= value Y, the measure is located some place in a table.
>2. A new way to say that for an nCube, the data is structured as this,
>dimension 1 value is at location X, dimension 2 value is at location Y,
>and the value for this observation is at location Z, where the location
>could be text delimitation or cells.
>3. SDMX type approach where we follow approach 1. for supplying
>dimension values, and put the actual value of the observation in the
>tag.
>
>I would suggest we have something like this (and PLEASE keep in mind
>this is just a rough sample to open discussion, and again I am ignoring
>any new structure we would add for attributes).
>
>DataItem
>  Type              1..1  Controlled list describing which of the 3
>                          cases above applies: FlatDataFile (case 1),
>                          CompleteCubeTable (case 2), or Inline (case 3)
>  CubeCoord         0..N
>    CubeID          1..1  nCube ID
>    DimID           1..1  Dimension ID
>    DimValue        1..1  Value of the dimension
>      Choice        1..1  Choice of either actual value or location
>                          in file
>        Loc         1..1  Location of dimension value in file
>        Value       1..1  Actual value of dimension
>      End Choice
>  MeasureValue      1..1  Value of the measured phenomenon
>    Choice          1..1  Choice between pointing to value in file
>                          or providing actual value
>      Loc           1..1  Location of value in file
>      Value         1..1  Actual value of measure
>    End Choice
>
>So for case 1, it would pretty much be as it exists today. For case 2,
>you would really only have 1 data item, which would describe every row
>of data in the file (no need to repeat for each set of dimension
>values). For case 3, it would look about the same as case 1, only the
>values would be inline (similar to SDMX).
>
>Obviously we will need to work out the details of the Loc and Value (for
>instance including multi-layered files in Loc, or whether to state the
>value by reference in Value or use actual values) but the basic concept
>is what is important right now.
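
One way to read the DataItem outline above is as a small set of nested records. The Python sketch below is only an illustration of that reading, not a proposed schema. It assumes MeasureValue is a sibling of CubeCoord directly under DataItem, and it treats Loc and Value as plain strings because their formats are still open. The class names and the sample locations ("column 1" and so on) are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class DataItemType(Enum):
    FLAT_DATA_FILE = "FlatDataFile"            # case 1
    COMPLETE_CUBE_TABLE = "CompleteCubeTable"  # case 2
    INLINE = "Inline"                          # case 3


@dataclass
class LocOrValue:
    """The Choice element: exactly one of loc or value must be set."""
    loc: Optional[str] = None    # location in the file (format is an assumption)
    value: Optional[str] = None  # actual value

    def __post_init__(self):
        if (self.loc is None) == (self.value is None):
            raise ValueError("set exactly one of loc or value")


@dataclass
class CubeCoord:
    cube_id: str           # CubeID
    dim_id: str            # DimID
    dim_value: LocOrValue  # DimValue with its Choice


@dataclass
class DataItem:
    type: DataItemType
    measure_value: LocOrValue                                    # MeasureValue 1..1
    cube_coords: List[CubeCoord] = field(default_factory=list)   # CubeCoord 0..N


# Case 3 (Inline): dimension values and the measure are given directly.
inline_item = DataItem(
    type=DataItemType.INLINE,
    measure_value=LocOrValue(value="10"),
    cube_coords=[
        CubeCoord("cube1", "region", LocOrValue(value="North")),
        CubeCoord("cube1", "year", LocOrValue(value="2000")),
    ],
)

# Case 2 (CompleteCubeTable): one item describes every row by location.
table_item = DataItem(
    type=DataItemType.COMPLETE_CUBE_TABLE,
    measure_value=LocOrValue(loc="column 3"),
    cube_coords=[
        CubeCoord("cube1", "region", LocOrValue(loc="column 1")),
        CubeCoord("cube1", "year", LocOrValue(loc="column 2")),
    ],
)
```

In this reading the only difference between the cases is which side of the Choice gets filled in, which is why a single flexible element can cover all three.
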
>
>Another similar approach would be to take the same basic concept, but
>move it up a level to the FileDescription - pretty much the same as
>Sanda is suggesting, but that we separate case 1 and 2. That is you can
>have either:
> The "classic" file description (with some added bonuses new to
>3.0 such as attributes) - case 1
> A self-describing file (Jostein's use case with the dimension
>values in the file) - case 2
> No File, values in line - SDMX-like behavior - case 3
>Within these 3 types, we could have stricter control over making sure the
>structures are properly used - for example in case 2, you no longer have
>a choice between Loc and Value, you only have Loc. In a way, this may
>be even better.
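
The "stricter control" idea above can be sketched as a per-case rule table. Only the case 2 rule (Loc only) is stated in the message. The rules shown for cases 1 and 3 are my reading of the three-option list (case 1: dimension values given, measure located in the file; case 3: everything inline) and should be treated as assumptions. `ALLOWED` and `check_data_item` are hypothetical names, written in Python purely for illustration.

```python
# Which side of the Loc/Value choice each case permits.
# Case 2 comes from the message; cases 1 and 3 are assumptions.
ALLOWED = {
    "FlatDataFile":      {"dimension": {"Value"}, "measure": {"Loc"}},    # case 1
    "CompleteCubeTable": {"dimension": {"Loc"},   "measure": {"Loc"}},    # case 2
    "Inline":            {"dimension": {"Value"}, "measure": {"Value"}},  # case 3
}


def check_data_item(item_type, dimension_choices, measure_choice):
    """Return a list of problems; an empty list means the item fits its case.

    dimension_choices: one "Loc" or "Value" string per dimension value.
    measure_choice: "Loc" or "Value" for the measure value.
    """
    rules = ALLOWED[item_type]
    problems = []
    for i, choice in enumerate(dimension_choices, start=1):
        if choice not in rules["dimension"]:
            problems.append(f"dimension {i}: {choice} not allowed for {item_type}")
    if measure_choice not in rules["measure"]:
        problems.append(f"measure: {measure_choice} not allowed for {item_type}")
    return problems


ok = check_data_item("CompleteCubeTable", ["Loc", "Loc"], "Loc")
bad = check_data_item("CompleteCubeTable", ["Loc", "Value"], "Value")
```

With separate structures per case, a schema could enforce these rules directly instead of relying on a check like this after the fact.
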
>
>So those are my thoughts. Please digest and comment.
>
>J
___________________________________________
Katherine McNeill-Harman
Data Services Librarian
Dewey Library for Management and Social Sciences
Massachusetts Institute of Technology
77 Massachusetts Avenue, E53-100
Cambridge, MA 02139
mcneillh at mit.edu
617-253-0787