[DDI-ADG] progress on aggregate data?
Katherine McNeill-Harman
mcneillh at MIT.EDU
Thu Aug 25 17:08:53 EDT 2005
Jostein and others,
One point based on my rough understanding of Jostein's proposal. Jostein,
regrettably, something in the communication gave you the impression that we
were trying to "force the data to be located within the
DDI-instance." This is not the case. Rather, we would like to provide the
option to do so if the data lends itself to this structure. As Mary said:
"We agree that this is a worthy goal and suggest that data values be [able
to be] incorporated into the cell description, which is Data Item within
LocMap, just where Wendy placed the physical table cell description. This
sets up two options: (1) describing the position of the data value in a
separate rectangular file (already in the spec), and (2) the position of
the row and column, plus the data value, to describe a self-documenting
table." Again, two options, allowing the choice by the author.
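
To make the two options concrete, here is a minimal Python sketch. The class and field names (`DataItem`, `file_position`, `value`, `resolve`) are assumptions invented for illustration and are not DDI element or attribute names. It only shows that a single cell description could either point into a separate file or carry the value itself:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataItem:
    """One table cell (hypothetical model, not the DDI schema)."""
    row: int
    column: int
    # Option 1: position of the value in a separate rectangular file.
    file_position: Optional[int] = None
    # Option 2: the value stored alongside the cell description.
    value: Optional[float] = None

    def resolve(self, external_values=None):
        """Return the cell value from whichever option the author chose."""
        if self.value is not None:
            return self.value
        if self.file_position is not None and external_values is not None:
            return external_values[self.file_position]
        raise ValueError("cell has neither an inline value nor a usable file position")


# Option 1: data kept in a separate file, stood in for here by a list.
external = [123456, 154327]
separate = DataItem(row=1, column=1, file_position=0)

# Option 2: self-documenting cell that carries its own value.
inline = DataItem(row=1, column=1, value=123456)

print(separate.resolve(external), inline.resolve())
```

Both cells resolve to the same value; the only difference is where the author chose to keep it.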
I'm concerned from your description that your method only allows for the
first option, for documenting/storing the data separately. Can you
describe how your proposal could accommodate storage of them in a single
file? I saw Sanda's XML markup, but am still unclear on this point. If
not, I don't believe it's a feasible option (hate to be blunt, but we're
getting down to the wire in needing to make decisions!).
Also, comments on a couple of Mary's questions:
1) "Need of additional tags to describe aggregate data (not nCubes): Not
sure what this means."
I recall that this was part of the brainstorming when people were wondering
if nCubes could accommodate all likely types of aggregate data (e.g. option
2 above). As you mentioned, it seems like it does, so I think we can
ignore this point (I didn't comment on them, just transcribed!).
2) "Role of modules for different kinds of data: Not sure what is meant by
this." The idea (getting to some of what I understand to be broader goals
of 3.0) is that, given the fact that the DDI will allow for many different
types of files/scenarios, all represented in various elements, we could
group elements that would be needed in like scenarios so that they could be
used by people in that case and easily ignored by others (rather than
people having to examine each element to see if it applies to them).
For example, let's take options 1 and 2 above. The author of a dataset
would likely have an early sense as to whether it should be stored with or
separate from the documentation. That known, if they wanted to store them
separately, it would be easier for them if they started with (my
imagination here) some sort of template for files of type "store
documentation separate from data" that would have a base structure designed
for that and leave out placeholders for storage of data that they wouldn't
need. There might be other file types that could be modularized as
well. Not sure how this plays out in practicality or in our proposal, but
I believe that to be our goal.
Kate
At 01:30 PM 8/25/2005 -0400, Mary Vardigan wrote:
>Jostein,
>
>This sounds very promising. One question: When you say "2) links the
>coordinates of the cell to the relevant variable and category descriptions
>in the DDI-file," how does one determine the coordinates of the cell?
>
>J, could you maybe provide XML of Jostein's example?
>
>Mary
>
>At 09:58 AM 8/25/2005, Jostein Ryssevik wrote:
>>Hi,
>>
>>Regarding a physical storage/exchange format for aggregated data files I
>>am not totally convinced that using the locMap element represents the
>>ideal solution. A fairly standard way to exchange this type of data is a
>>delimited file with one record per table cell. The first columns in this
>>file provide the location of the cell on all the table dimensions (one
>>column per dimension). The last column holds the cell value (that is
>>the value of the measure variable located in that cell). For
>>multi-measure cubes (they do exist, believe me), you will simply have
>>more than one cell-value column.
>>
>>So a simple example with a geography, time, and gender dimension (and a
>>population measure) might look something like this:
>>
>>101, 2000, 1, 123456
>>101, 2000, 2, 154327
>>101, 2001, 1, 365427
>>etc.
>>
>>First record holds the population number for geography 101, year 2000 and
>>gender 1 etc. (entries being category values on the dimension variables)
>>
>>To provide an xml-structure that:
>>
>>1) links the datafile to a specific ncube description in a specific DDI
>>instance
>>2) links the coordinates of the cell to the relevant variable and
>>category descriptions in the DDI-file
>>3) links the cell values to the relevant measure variables in the DDI-file,
>>
>>should all be fairly simple and straightforward.
>>
>>The structure will allow validation of data against the more detailed
>>descriptions in the DDI-file.
>>It could potentially also support hinging of tables across ncubes (at
>>least for ncubes enclosed within a single DDI-file)
>>It will also allow the data-file to live outside the DDI-file. I do not
>>think that we should provide a structure that forces the data to be
>>located within the DDI-instance (as would be the case with a solution
>>based on locmap).
>>By doing something like this we are also structuring the data in a fairly
>>conventional way that easily could be understood by a variety of software
>>that need to read this type of data.
>>As far as I can see this approach is also more or less in line with the
>>SDMX data exchange format.
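
The cell-per-record layout and the validation idea above can be sketched in Python as follows. The dimension names, the category codes, and the `load_cells` helper are assumptions made up for this illustration; in practice the valid codes would come from the nCube description in the DDI instance. The sketch handles a single measure column only.

```python
import csv
import io

# Hypothetical category codes for each dimension (assumed, for illustration).
DIMENSIONS = {
    "geography": {"101"},
    "year": {"2000", "2001"},
    "gender": {"1", "2"},
}

# The three sample records from the message above.
SAMPLE = """101, 2000, 1, 123456
101, 2000, 2, 154327
101, 2001, 1, 365427
"""


def load_cells(text, dimensions):
    """Read one record per cell: dimension codes first, measure value last."""
    names = list(dimensions)
    cells = {}
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue
        *coords, measure = [field.strip() for field in record]
        if len(coords) != len(names):
            raise ValueError(f"expected {len(names)} dimensions, got {len(coords)}")
        # Validate each coordinate against the described categories.
        for name, code in zip(names, coords):
            if code not in dimensions[name]:
                raise ValueError(f"unknown {name} code: {code}")
        cells[tuple(coords)] = int(measure)
    return cells


cells = load_cells(SAMPLE, DIMENSIONS)
print(cells[("101", "2000", "1")])
```

Keying the cells by their coordinate tuple is one simple way to locate a cell within the cube; a multi-measure cube would need the last column generalised to several value columns.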
>>
>>Jostein
>>
>>
>>
>>
>>
>>At 08:45 25.08.2005 -0400, Mary Vardigan wrote:
>>>Thanks much, Kate. Responses below. --Mary and Sanda
>>>
>>>At 10:29 AM 8/24/2005, Katherine McNeill-Harman wrote:
>>>>At the end of yesterday's phone conversation, I mentioned to J that I
>>>>thought that we had the least documentation on what changes/structures
>>>>to suggest for aggregate data (i.e. no single spreadsheet from which we
>>>>were working). So he's going to try to compile something but, as he
>>>>said in his other email, we need to pool our thoughts on this. So I'm
>>>>starting the ball rolling.
>>>>
>>>>Following are the goals for aggregate data we'd outlined in Edinburgh;
>>>>what changes have we agreed upon that will accomplish these?
>>>>- Accommodate data files in formats w/ integrated data and metadata
>>>>(e.g. Excel files) self documenting.
>>>
>>>We agree that this is a worthy goal and suggest that data values be
>>>incorporated into the cell description, which is Data Item within
>>>LocMap, just where Wendy placed the physical table cell description.
>>>This sets up two options: (1) describing the position of the data value
>>>in a separate rectangular file (already in the spec), and (2) the
>>>position of the row and column, plus the data value, to describe a
>>>self-documenting table. Wendy, you mentioned that defining row and
>>>column is not enough to specify a physical table. Can you and J and
>>>Jostein perhaps come up with all the elements needed?
>>>
>>>>- Evaluate broad utility of nCubes
>>>
>>>Consensus seemed to be that the logical description of the aggregate
>>>data nCubes is fine. What we need to improve is the physical
>>>description, which we addressed above.
>>>
>>>>- Need ability to describe method of aggregation
>>>
>>>We have this -- an existing attribute called aggMeth.
>>>
>>>>- Need of additional tags to describe aggregate data (not nCubes)
>>>
>>>Not sure what this means. In SDMX there are attributes that J mentioned
>>>in the last call (information about the measurement, like source and
>>>observation status -- projection, actual count, etc.). Adding this
>>>information could help to make the ncubes more robust and make the DDI
>>>more compatible with SDMX.
>>>
>>>>- Review tag names
>>>
>>>This is important and we haven't done it yet. Sanda and I will try to
>>>make a stab at this next week and we can all review. For example,
>>>"cohort" is not exactly the right term for how it is used.
>>>
>>>>- Role of modules for different kinds of data
>>>
>>>Not sure what is meant by this. Have we covered the two main
>>>possibilities through 1 and 2 above?
>>>
>>>>- Align to SDMX
>>>
>>>J, we need your help in determining what is needed here. If we include
>>>the values and add the attributes you mentioned, are there other things
>>>we need to do?
>>>
>>>
>>>>I've been looking over my notes on our changes/proposals; here is
>>>>what's been said in principle, but again, we'll want to document this
>>>>and consider how to accomplish it (I note the date of discussion where I have it):
>>>>
>>>>- DDI is missing a way to describe the physical structure of a
>>>>spreadsheet; need physical description for rows/columns/layers to say
>>>>how they relate to each other; this would enable machine-actionable
>>>>collapsing of rows or columns (e.g. collapsing of age groups) and
>>>>creation of subtotals and totals; it should also accommodate various
>>>>levels and irregular nested categories, and be able to identify the
>>>>lowest level (8/9)
>>>
>>>Not sure what layers are.
>>>
>>>>- (8/9)
>>>>- need to be able to mark up and represent existing tables (e.g. from
>>>>print volumes) (8/16)
>>>>- enable creation of a single file containing both data and metadata;
>>>>this format would be optional and could be applied when appropriate (8/16)
>>>>- unlike SDMX, enable a single file that contains both the data and the
>>>>structure (8/23)
>>>>- be able to apply attribute information at all levels (from cells on
>>>>up); could add to n-cube in the measure element a sub-element that
>>>>defines attributes that can be attached to any level; provide a
>>>>structure by which authors could define these. However, it's not the
>>>>case as with other 3.0 features that items at a lower level override
>>>>things at a higher level; therefore, the structure will need to be such
>>>>that it's clear that attributes can be defined only at one chosen level
>>>>(i.e. can't have conflicting attributes at different levels). (8/23)
>>>
>>>J, please help us determine the level at which the SDMX type attributes
>>>should apply.
>>>
>>>>- ability to locate the desired cell within the cube (8/23)
>>>>- hinging is important, yet may be addressed by comparative data group;
>>>>SRG liaison will check (8/23)
>>>
>>>Do layers relate to hinging? Hinging is possible now but only within one
>>>DDI instance. Between two instances, we have to establish comparability
>>>at the variable level.
>>>
>>>One other point, related to Wendy's earlier message: We are still
>>>confused when the discussion moves to 2- versus 3-dimensional
>>>structures. We need examples. Is Census data format 2 or 3 dimensional?
>>>How about an Excel spreadsheet? What are the differences in dimensionality?
>>>
>>>
>>>
>>>>I'd ask others to help add to and clarify these. In addition, many of
>>>>the above I just have articulated as goals, so I'm not clear if we've
>>>>yet figured out how to accomplish these.
>>>>
>>>>Kate
>>>>
>>>>___________________________________________
>>>>Katherine McNeill-Harman
>>>>Data Services Librarian
>>>>Dewey Library for Management and Social Sciences
>>>>Massachusetts Institute of Technology
>>>>77 Massachusetts Avenue, E53-100
>>>>Cambridge, MA 02139
>>>>mcneillh at mit.edu
>>>>617-253-0787
>>>>_______________________________________________
>>>>DDI-ADG mailing list
>>>>DDI-ADG at icpsr.umich.edu
>>>>http://www.icpsr.umich.edu/mailman/listinfo/ddi-adg
>>>
>>>Mary Vardigan
>>>Assistant Director
>>>Inter-university Consortium for Political and Social Research (ICPSR)
>>>University of Michigan
>>>P.O. Box 1248, Ann Arbor, MI 48106-1248
>>>Phone: 734-615-7908
>>>Fax: 734-647-8200
>>>www.icpsr.umich.edu
>>
>
___________________________________________
Katherine McNeill-Harman
Data Services Librarian
Dewey Library for Management and Social Sciences
Massachusetts Institute of Technology
77 Massachusetts Avenue, E53-100
Cambridge, MA 02139
mcneillh at mit.edu
617-253-0787