[DDI-ADG] More on aggregate data.

Katherine McNeill-Harman mcneillh at MIT.EDU
Mon Aug 29 12:57:11 EDT 2005


A couple of comments/questions in advance of tomorrow.  More on the bigger 
picture than the specific tags/elements used (w/which others are more 
familiar):

1) Just to make sure I understand, Sanda had originally (8/25) proposed a 
format like the SDMX generic, but as we discussed last week, that has the 
downside of requiring a separate file describing the structure.  However, 
I believe Jostein's response (8/26) included the structure info in the same 
file.  I think it's important to include them together.  It seems like we're 
all on the same page w/that, but I just wanted to reaffirm it.

2) Another thing just to double-check.  I know that we're not treating time 
in the same way as SDMX does (per recent discussion), but we will still be 
able to accommodate time series, correct?  I believe the answer is yes, but 
I wanted to double-check.

3) Question about the modules.  I'm not sure about J's suggestion of 
combining modules 2/3.  From Sanda's description, (8/26), if I understand 
it correctly, module 2 is used only when the data/metadata are in the same 
file, whereas module 3 will be used for "describing the physical structure 
of an external data file."  In my mind these are mutually exclusive; one 
would use either one or the other, which argues for keeping them 
separate.  But this may also relate to my question below.

4) A question which might be most easily answered over the phone.  I'm not 
quite understanding what the CompleteCubeTable (J's case 2) is.  I can wrap 
my mind around a separate flat file and the data included w/in, but don't 
understand the difference between cases 2 and 3 (what makes the 
"self-describing" file different?).  However, if everyone else is crystal 
clear on this, we don't need to spend a lot of time at this late date 
explaining it just to me.

I'll send separately later any notes I have down that we might want to 
include in the notes column of the ultimate aggregate data spreadsheet.

Kate

At 04:25 PM 8/26/2005 -0400, J Gager wrote:
>OK Everyone,
>
>I am trying to wrap all of my responses to the last two days of emails
>into this one message.  If I miss any specific question, please ask
>again.
>
>I would like to expand and comment on what Sanda has outlined below:
>
>Module 1).  I think we are on the right track here, assuming we make the
>additions we discussed this week (also described in the eMail from me
>with the subject "Aggregate Data Notes").  The changes I am suggesting
>are to make it SDMX compatible.  Also, as I responded to an earlier
>email from Jostein, I agree we don't want to take SDMX's TimeSeries
>centric approach - and that is not an issue.  In SDMX, time *is* a
>dimension; they just treat it specially for organizing the data.  As long
>as we allow for one to designate a variable as time, which as we
>discussed today is already there, we will be able to interact nicely
>with SDMX.  One last point about compatibility that I would like to make
>is the concept of Groups in SDMX.  I am going to talk with Arofan a bit
>more about this, so there may be another change - I just wanted to get
>it out there - I will do the work to determine if we really need to do
>anything.
>
>Module 2 & 3).  I am wondering if we may want to roll all of this into
>one flexible module, to disambiguate the situation.
>If I am understanding all of the previous discussions correctly, we are
>basically suggesting 3 options for describing the actual physical data
>(I am ignoring any attribute discussion for the sake of simplicity):
>
>1.  The existing 2d way of saying that for dimension 1 = value X, dimension 2
>= value Y, the measure is located some place in a table.
>2.  A new way to say that for an nCube, the data is structured like this:
>dimension 1 value is at location X, dimension 2 value is at location Y,
>and the value for this observation is at location Z, where the location
>could be text delimitation or cells.
>3.  SDMX type approach where we follow approach 1. for supplying
>dimension values, and put the actual value of the observation in the
>tag.
>
>I would suggest we have something like this (and PLEASE keep in mind
>this is just a rough sample to open discussion, and again I am ignoring
>any new structure we would add for attributes).
>
>DataItem
>    Type                  1..1  Controlled list naming which of the 3
>                                cases above is being described:
>                                FlatDataFile (case 1), CompleteCubeTable
>                                (case 2), or Inline (case 3)
>    CubeCoord             0..N
>        CubeID            1..1  nCube ID
>        DimID             1..1  Dimension ID
>        DimValue          1..1  Value of the dimension
>            Choice        1..1  Choice of either actual value or
>                                location in file
>                Loc       1..1  Location of dimension value in file
>                Value     1..1  Actual value of dimension
>            End Choice
>    MeasureValue          1..1  Value of the measured phenomenon
>        Choice            1..1  Choice between pointing to value in
>                                file or providing actual value
>            Loc           1..1  Location of value in file
>            Value         1..1  Actual value of measure
>        End Choice
>
>So for case 1, it would pretty much be as it exists today.  For case 2,
>you would really only have 1 data item, which would describe every row
>of data in the file (no need to repeat for each set of dimension
>values).  For case 3, it would look about the same as case 1, only the
>values would be inline (similar to SDMX).
>
>Obviously we will need to work out the details of Loc and Value (for
>instance including multi-layered files in Loc, or whether to state the
>value by reference in Value or use actual values) but the basic concept
>is what is important right now.
>
>Another similar approach would be to take the same basic concept, but
>move it up a level to the FileDescription - pretty much the same as
>Sanda is suggesting, but that we separate case 1 and 2.  That is you can
>have either:
>         The "classic" file description (with some added bonuses new to
>3.0 such as attributes) - case 1
>         A self-describing file (Jostein's use case with the dimension
>values in the file) - case 2
>         No file, values inline - SDMX-like behavior - case 3
>Within these 3 types, we could have stricter control over making sure the
>structures are properly used - for example, in case 2 you no longer have
>a choice between Loc and Value, you only have Loc.  In a way, this may
>be even better.
>
>So those are my thoughts.  Please digest and comment.
>
>J
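
For reference, the DataItem outline quoted above could be modeled roughly as follows. This is only a sketch to make the difference between cases 2 and 3 concrete, not a proposed schema. The class names, the `follows_strict_rules` helper, the location strings ("column 1") and the sample values are all made up for illustration. Only the "case 2 uses Loc only" rule is stated in the thread; the rules for cases 1 and 3 are one reading of the three options listed.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class DataItemType(Enum):
    FLAT_DATA_FILE = "FlatDataFile"            # case 1
    COMPLETE_CUBE_TABLE = "CompleteCubeTable"  # case 2
    INLINE = "Inline"                          # case 3


@dataclass
class LocOrValue:
    """The Choice in the outline: exactly one of loc or value is set."""
    loc: Optional[str] = None    # location in the file (string format is assumed)
    value: Optional[str] = None  # actual value

    def __post_init__(self) -> None:
        if (self.loc is None) == (self.value is None):
            raise ValueError("set exactly one of loc or value")


@dataclass
class CubeCoord:
    cube_id: str
    dim_id: str
    dim_value: LocOrValue


@dataclass
class DataItem:
    item_type: DataItemType
    measure_value: LocOrValue
    cube_coords: List[CubeCoord] = field(default_factory=list)  # 0..N


def follows_strict_rules(item: DataItem) -> bool:
    """Hypothetical per-case check; only the case 2 rule comes from the thread."""
    dims_use_loc = [c.dim_value.loc is not None for c in item.cube_coords]
    measure_uses_loc = item.measure_value.loc is not None
    if item.item_type is DataItemType.COMPLETE_CUBE_TABLE:
        return all(dims_use_loc) and measure_uses_loc      # Loc only
    if item.item_type is DataItemType.FLAT_DATA_FILE:
        return not any(dims_use_loc) and measure_uses_loc  # values name the cell
    return not any(dims_use_loc) and not measure_uses_loc  # everything inline


# Case 2: one item describes every row; all entries point at columns.
cube_table = DataItem(
    item_type=DataItemType.COMPLETE_CUBE_TABLE,
    cube_coords=[
        CubeCoord("cube1", "sex", LocOrValue(loc="column 1")),
        CubeCoord("cube1", "year", LocOrValue(loc="column 2")),
    ],
    measure_value=LocOrValue(loc="column 3"),
)

# Case 3: one item per observation; dimension values and measure are inline.
inline_obs = DataItem(
    item_type=DataItemType.INLINE,
    cube_coords=[
        CubeCoord("cube1", "sex", LocOrValue(value="F")),
        CubeCoord("cube1", "year", LocOrValue(value="2000")),
    ],
    measure_value=LocOrValue(value="1234"),
)

print(follows_strict_rules(cube_table), follows_strict_rules(inline_obs))
```

Under this reading, case 2 needs a single DataItem for the whole file because every row has the same layout, while case 3 repeats one DataItem per observation and needs no data file at all.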

___________________________________________
Katherine McNeill-Harman
Data Services Librarian
Dewey Library for Management and Social Sciences
Massachusetts Institute of Technology
77 Massachusetts Avenue, E53-100
Cambridge, MA 02139
mcneillh at mit.edu
617-253-0787 


