[DDI-ADG] progress on aggregate data?
Mary Vardigan
maryv at icpsr.umich.edu
Thu Aug 25 13:30:16 EDT 2005
Jostein,
This sounds very promising. One question: When you say "2) links the
coordinates of the cell to the relevant variable and category descriptions
in the DDI-file," how does one determine the coordinates of the cell?
J, could you maybe provide XML of Jostein's example?
Mary
At 09:58 AM 8/25/2005, Jostein Ryssevik wrote:
>Hi,
>
>Regarding a physical storage/exchange format for aggregated data files I
>am not totally convinced that using the locMap element represents the
>ideal solution. A fairly standard way to exchange this type of data is a
>delimited file with one record per table cell. The first columns in this
>file provide the location of the cell on all the table dimensions (one
>column per dimension). The last column holds the cell value (that is,
>the value of the measure variable located in that cell). For
>multi-measure cubes (they do exist, believe me), you will simply have more
>than one cell-value column.
>
>So a simple example with a geography, time, and gender dimension (and a
>population measure) might look something like this:
>
>101, 2000, 1, 123456
>101, 2000, 2, 154327
>101, 2001, 1, 365427
>etc.
>
>The first record holds the population number for geography 101, year 2000,
>and gender 1, and so on (the entries are category values on the dimension
>variables).
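
A minimal sketch of how software might read a file laid out this way, assuming comma delimiters, integer cell values, and a known number of dimension columns. The function name read_cells and the n_dimensions parameter are illustrative only and do not come from DDI or SDMX.

```python
import csv
import io


def read_cells(text, n_dimensions):
    """Split each record into (coordinates, values).

    The first n_dimensions columns locate the cell on the table
    dimensions; every remaining column is treated as a cell value,
    which also covers multi-measure cubes.
    """
    cells = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:
            continue  # ignore blank lines
        fields = [field.strip() for field in row]
        coordinates = tuple(fields[:n_dimensions])
        values = tuple(int(v) for v in fields[n_dimensions:])
        cells.append((coordinates, values))
    return cells


# The three sample records from the message above.
SAMPLE = """101, 2000, 1, 123456
101, 2000, 2, 154327
101, 2001, 1, 365427
"""

cells = read_cells(SAMPLE, n_dimensions=3)
for coordinates, values in cells:
    print(coordinates, values)
```

Keeping the coordinates as strings is a deliberate choice here: they are category codes rather than quantities, so they are compared, not computed with.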
>
>To provide an xml-structure that:
>
>1) links the datafile to a specific ncube description in a specific DDI
>instance
>2) links the coordinates of the cell to the relevant variable and category
>descriptions in the DDI-file
>3) links the cell values to the relevant measure variables in the DDI-file,
>
>should all be fairly simple and straightforward.
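
The three links listed above could be pictured roughly as follows. This is only a sketch of the information such a structure would need to carry, written as a Python data class rather than XML; the class name CubeDataLink, its field names, and the identifiers "example-ddi.xml", "NC1", "V_GEO", "V_YEAR", "V_GENDER", and "V_POP" are all hypothetical and are not DDI or SDMX element names.

```python
from dataclasses import dataclass


@dataclass
class CubeDataLink:
    """Hypothetical description tying a delimited data file to a DDI nCube."""

    ddi_instance: str     # (1) which DDI instance describes the data
    ncube_id: str         # (1) which nCube description inside that instance
    dimension_vars: list  # (2) variable IDs, one per coordinate column, in order
    measure_vars: list    # (3) variable IDs, one per cell-value column, in order

    def resolve(self, record):
        """Map one record onto the variable IDs it refers to."""
        n = len(self.dimension_vars)
        if len(record) != n + len(self.measure_vars):
            raise ValueError("record width does not match the link description")
        coords = dict(zip(self.dimension_vars, record[:n]))
        measures = dict(zip(self.measure_vars, record[n:]))
        return coords, measures


link = CubeDataLink(
    ddi_instance="example-ddi.xml",
    ncube_id="NC1",
    dimension_vars=["V_GEO", "V_YEAR", "V_GENDER"],
    measure_vars=["V_POP"],
)

# First record of the sample file, already split into fields.
coords, measures = link.resolve(["101", "2000", "1", "123456"])
print(coords)
print(measures)
```

Because the link refers to the data file by column order only, the data file itself can stay a plain delimited file outside the DDI instance.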
>
>The structure will allow validation of data against the more detailed
>descriptions in the DDI-file.
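
As a rough illustration of that kind of validation, the sketch below checks each coordinate against the set of category codes allowed for its dimension. It assumes the category codes have already been extracted from the DDI-file into plain Python sets; the function name validate_cells, the extra geography code "102", and the deliberately bad record are made up for the example.

```python
def validate_cells(records, categories):
    """Report coordinates that are not valid category codes.

    records    -- list of records, each a list of string fields
    categories -- list of sets of allowed codes, one per dimension column

    Returns a list of (record_number, column_index, bad_value) tuples,
    with records numbered from 1 and columns from 0.
    """
    errors = []
    for rec_no, record in enumerate(records, start=1):
        for col, allowed in enumerate(categories):
            if record[col] not in allowed:
                errors.append((rec_no, col, record[col]))
    return errors


# Hypothetical category codes for geography, year, and gender.
categories = [{"101", "102"}, {"2000", "2001"}, {"1", "2"}]

records = [
    ["101", "2000", "1", "123456"],
    ["101", "2000", "2", "154327"],
    ["101", "2001", "3", "365427"],  # gender code 3 is deliberately invalid
]

errors = validate_cells(records, categories)
print(errors)
```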
>It could potentially also support hinging of tables across ncubes (at
>least for ncubes enclosed within a single DDI-file).
>It will also allow the data-file to live outside the DDI-file. I do not
>think that we should provide a structure that forces the data to be
>located within the DDI-instance (as would be the case with a solution
>based on locMap).
>By doing something like this we are also structuring the data in a fairly
>conventional way that could easily be understood by a variety of software
>that needs to read this type of data.
>As far as I can see this approach is also more or less in line with the
>SDMX data exchange format.
>
>Jostein
>
>At 08:45 25.08.2005 -0400, Mary Vardigan wrote:
>>Thanks much, Kate. Responses below. --Mary and Sanda
>>
>>At 10:29 AM 8/24/2005, Katherine McNeill-Harman wrote:
>>>At the end of yesterday's phone conversation, I mentioned to J that I
>>>thought that we had the least documentation on what changes/structures
>>>to suggest for aggregate data (i.e. no single spreadsheet from which we
>>>were working). So he's going to try to compile something but, as he
>>>said in his other email, we need to pool our thoughts on this. So I'm
>>>starting the ball rolling.
>>>
>>>Following are the goals for aggregate data we'd outlined in Edinburgh;
>>>what changes have we agreed upon that will accomplish these?
>>>- Accommodate self-documenting data files, i.e. formats w/ integrated
>>>data and metadata (e.g. Excel files).
>>
>>We agree that this is a worthy goal and suggest that data values be
>>incorporated into the cell description, which is Data Item within LocMap,
>>just where Wendy placed the physical table cell description. This sets up
>>two options: (1) describing the position of the data value in a separate
>>rectangular file (already in the spec), and (2) the position of the row
>>and column, plus the data value, to describe a self-documenting table.
>>Wendy, you mentioned that defining row and column is not enough to
>>specify a physical table. Can you and J and Jostein perhaps come up with
>>all the elements needed?
>>
>>>- Evaluate broad utility of nCubes
>>
>>Consensus seemed to be that the logical description of the aggregate data
>>nCubes is fine. What we need to improve is the physical description,
>>which we addressed above.
>>
>>>- Need ability to describe method of aggregation
>>
>>We have this -- an existing attribute called aggMeth.
>>
>>>- Need of additional tags to describe aggregate data (not nCubes)
>>
>>Not sure what this means. In SDMX there are attributes that J mentioned
>>in the last call (information about the measurement, like source and
>>observation status -- projection, actual count, etc.). Adding this
>>information could help to make the ncubes more robust and make the DDI
>>more compatible with SDMX.
>>
>>>- Review tag names
>>
>>This is important and we haven't done it yet. Sanda and I will try to
>>take a stab at this next week and we can all review. For example,
>>"cohort" is not exactly the right term for how it is used.
>>
>>>- Role of modules for different kinds of data
>>
>>Not sure what is meant by this. Have we covered the two main
>>possibilities through 1 and 2 above?
>>
>>>- Align to SDMX
>>
>>J, we need your help in determining what is needed here. If we include
>>the values and add the attributes you mentioned, are there other things
>>we need to do?
>>
>>
>>>I've been looking over my notes on our changes/proposals; here is what's
>>>been said in principle, but again, we'll want to document this and
>>>consider how to accomplish it (I give the discussion date where I have it):
>>>
>>>- DDI is missing a way to describe the physical structure of a
>>>spreadsheet; need physical description for rows/columns/layers to say
>>>how they relate to each other; this would enable machine-actionable
>>>collapsing of rows or columns (e.g. collapsing of age groups) and
>>>creation of subtotals and totals; it should also accommodate various
>>>levels and irregular nested categories, and be able to identify the
>>>lowest level (8/9)
>>
>>Not sure what layers are.
>>
>>>- (8/9)
>>>- need to be able to mark up and represent existing tables (e.g. from
>>>print volumes) (8/16)
>>>- enable creation of a single file containing both data and metadata;
>>>this format would be optional and could be applied when appropriate (8/16)
>>>- unlike SDMX, enable a single file that contains both the data and the
>>>structure (8/23)
>>>- be able to apply attribute information at all levels (from cells on
>>>up); could add to n-cube in the measure element a sub-element that
>>>defines attributes that can be attached to any level; provide a
>>>structure by which authors could define these. However, it's not the
>>>case as with other 3.0 features that items at a lower level override
>>>things at a higher level; therefore, the structure will need to be such
>>>that it's clear that attributes can be defined only at one chosen level
>>>(i.e. can't have conflicting attributes at different levels). (8/23)
>>
>>J, please help us determine the level at which the SDMX type attributes
>>should apply.
>>
>>>- ability to locate the desired cell within the cube (8/23)
>>>- hinging is important, yet may be addressed by comparative data group;
>>>SRG liaison will check (8/23)
>>
>>Do layers relate to hinging? Hinging is possible now but only within one
>>DDI instance. Between two instances, we have to establish comparability
>>at the variable level.
>>
>>One other point, related to Wendy's earlier message: We are still
>>confused when the discussion moves to 2- versus 3-dimensional structures.
>>We need examples. Is Census data format 2 or 3 dimensional? How about an
>>Excel spreadsheet? What are the differences in dimensionality?
>>
>>
>>
>>>I'd ask others to help add to and clarify these. In addition, many of
>>>the above I just have articulated as goals, so I'm not clear if we've
>>>yet figured out how to accomplish these.
>>>
>>>Kate
>>>
>>>___________________________________________
>>>Katherine McNeill-Harman
>>>Data Services Librarian
>>>Dewey Library for Management and Social Sciences
>>>Massachusetts Institute of Technology
>>>77 Massachusetts Avenue, E53-100
>>>Cambridge, MA 02139
>>>mcneillh at mit.edu
>>>617-253-0787
>>>_______________________________________________
>>>DDI-ADG mailing list
>>>DDI-ADG at icpsr.umich.edu
>>>http://www.icpsr.umich.edu/mailman/listinfo/ddi-adg
>>
>>Mary Vardigan
>>Assistant Director
>>Inter-university Consortium for Political and Social Research (ICPSR)
>>University of Michigan
>>P.O. Box 1248, Ann Arbor, MI 48106-1248
>>Phone: 734-615-7908
>>Fax: 734-647-8200
>>www.icpsr.umich.edu
>
>
Mary Vardigan
Assistant Director
Inter-university Consortium for Political and Social Research (ICPSR)
University of Michigan
P.O. Box 1248, Ann Arbor, MI 48106-1248
Phone: 734-615-7908
Fax: 734-647-8200
www.icpsr.umich.edu