[DDI-ADG] progress on aggregate data?

Sanda Ionescu sandai at icpsr.umich.edu
Thu Aug 25 15:47:54 EDT 2005


Hi, all.
I think Jostein's example could be described with an adaptation of SDMX 
generic sample markup, as sent by J:
       <generic:Series>
          <generic:SeriesKey>
             <generic:Value concept="FREQ" value="A"/>
             <generic:Value concept="REGION_ETH" value="SOM"/>
             <generic:Value concept="GENDER" value="M"/>
             <generic:Value concept="AGE_GRP" value="0_4"/>
          </generic:SeriesKey>
          <generic:Attributes>
             <generic:Value concept="TIME_FORMAT" value="P1Y"/>
             <generic:Value concept="UNIT_MULT" value="6"/>
          </generic:Attributes>
          <generic:Obs>
             <generic:Time>2005</generic:Time>
             <generic:ObsValue value="2.314327"/>
             <generic:Attributes>
                <generic:Value concept="OBS_STATUS" value="A"/>
             </generic:Attributes>
          </generic:Obs>
       </generic:Series>

--- It seems to me this provides for including dimension names, and values, 
for each cell and the actual data (measure), as well as additional 
information about the measure.
(I hope I'm not misreading anything :-)
It is less clear to me how or where we would fit in the cell coordinates, 
or how we would link to them.
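
To make that reading concrete, here is a minimal Python sketch that pulls the series key, the series-level attributes, and each observation out of the sample. The namespace URI is a placeholder (the sample declares none), and the "TIME" coordinate name and the read_series/concept_values helpers are hypothetical names chosen for illustration. It treats the series key plus the observation time as the cell's coordinates, which is only one possible interpretation.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI: the sample in this message declares none, so this
# value is an assumption made only so that the fragment parses.
NS = {"generic": "urn:example:sdmx-generic"}

SAMPLE = """
<generic:Series xmlns:generic="urn:example:sdmx-generic">
  <generic:SeriesKey>
    <generic:Value concept="FREQ" value="A"/>
    <generic:Value concept="REGION_ETH" value="SOM"/>
    <generic:Value concept="GENDER" value="M"/>
    <generic:Value concept="AGE_GRP" value="0_4"/>
  </generic:SeriesKey>
  <generic:Attributes>
    <generic:Value concept="TIME_FORMAT" value="P1Y"/>
    <generic:Value concept="UNIT_MULT" value="6"/>
  </generic:Attributes>
  <generic:Obs>
    <generic:Time>2005</generic:Time>
    <generic:ObsValue value="2.314327"/>
    <generic:Attributes>
      <generic:Value concept="OBS_STATUS" value="A"/>
    </generic:Attributes>
  </generic:Obs>
</generic:Series>
"""


def concept_values(parent):
    """Collect concept -> value pairs from the generic:Value children."""
    if parent is None:
        return {}
    return {v.get("concept"): v.get("value")
            for v in parent.findall("generic:Value", NS)}


def read_series(xml_text):
    """Return (series key, series attributes, list of cells)."""
    series = ET.fromstring(xml_text.strip())
    key = concept_values(series.find("generic:SeriesKey", NS))
    series_attrs = concept_values(series.find("generic:Attributes", NS))
    cells = []
    for obs in series.findall("generic:Obs", NS):
        # Dimension values from the key, plus the time period of this observation.
        coords = dict(key)
        coords["TIME"] = obs.findtext("generic:Time", namespaces=NS)
        cells.append({
            "coords": coords,
            "value": float(obs.find("generic:ObsValue", NS).get("value")),
            "attributes": concept_values(obs.find("generic:Attributes", NS)),
        })
    return key, series_attrs, cells


key, series_attrs, cells = read_series(SAMPLE)
print(cells[0]["coords"], cells[0]["value"], cells[0]["attributes"])
```
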
Sanda.


At 09:58 AM 8/25/2005, Jostein Ryssevik wrote:
>Hi,
>
>Regarding a physical storage/exchange format for aggregated data files I 
>am not totally convinced that using the locMap element represents the 
>ideal solution. A fairly standard way to exchange this type of data is a 
>delimited file with one record per table cell. The first columns in this 
>file provide the location of the cell on all the table dimensions (one 
>column per dimension). The last column holds the cell value (that is, 
>the value of the measure variable located in that cell). For 
>multi-measure cubes (they do exist, believe me), you will simply have more 
>than one cell-value column.
>
>So a simple example with a geography, time, and gender dimension (and a 
>population measure) might look something like this:
>
>101, 2000, 1, 123456
>101, 2000, 2, 154327
>101, 2001, 1, 365427
>etc.
>
>The first record holds the population number for geography 101, year 2000, 
>and gender 1, and so on (the entries are category values of the dimension 
>variables).
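
A minimal Python sketch of reading such a file, assuming the column layout of the example above (three dimension columns followed by one measure column). The names DIMENSIONS, MEASURES, and read_cells are hypothetical. A multi-measure cube would simply list more than one name in the measures argument.

```python
import csv
import io

# Assumed layout, taken from the example: one column per dimension, then the
# cell-value column(s). These names are illustrative only.
DIMENSIONS = ["geography", "time", "gender"]
MEASURES = ["population"]

SAMPLE = """101, 2000, 1, 123456
101, 2000, 2, 154327
101, 2001, 1, 365427
"""


def read_cells(text, dimensions=DIMENSIONS, measures=MEASURES):
    """Map each cell's coordinates (a tuple of codes) to its measure values."""
    expected = len(dimensions) + len(measures)
    cells = {}
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        if len(row) != expected:
            raise ValueError(f"expected {expected} fields, got {len(row)}: {row}")
        coords = tuple(field.strip() for field in row[:len(dimensions)])
        values = [float(field) for field in row[len(dimensions):]]
        cells[coords] = dict(zip(measures, values))
    return cells


cells = read_cells(SAMPLE)
print(cells[("101", "2000", "1")])
```
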
>
>To provide an xml-structure that:
>
>1) links the datafile to a specific ncube description in a specific DDI 
>instance
>2) links the coordinates of the cell to the relevant variable and category 
>descriptions in the DDI-file
>3) links the cell values to the relevant measure variables in the DDI-file,
>
>should all be fairly simple and straightforward.
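
The three links listed above could be captured in a small descriptor. The Python sketch below is only an illustration of the idea: CubeDataLink, its field names, the file names, and the variable IDs are all hypothetical and do not correspond to existing DDI elements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CubeDataLink:
    """Hypothetical descriptor tying an external data file to an nCube."""
    ddi_instance: str      # (1) the DDI instance ...
    ncube_id: str          # ... and the nCube within it that the file belongs to
    data_file: str         # location of the external delimited file
    dimension_vars: tuple  # (2) variable ID for each coordinate column, in order
    measure_vars: tuple    # (3) variable ID for each cell-value column, in order


def resolve(record, link):
    """Split one record into {variable ID: code} and {variable ID: value}."""
    n_dims = len(link.dimension_vars)
    expected = n_dims + len(link.measure_vars)
    if len(record) != expected:
        raise ValueError(f"expected {expected} fields, got {len(record)}")
    coords = dict(zip(link.dimension_vars, record[:n_dims]))
    measures = dict(zip(link.measure_vars, record[n_dims:]))
    return coords, measures


# Illustrative values only.
link = CubeDataLink(
    ddi_instance="example-study.xml",
    ncube_id="NC1",
    data_file="nc1_cells.csv",
    dimension_vars=("V_GEO", "V_YEAR", "V_GENDER"),
    measure_vars=("V_POP",),
)

print(resolve(("101", "2000", "1", "123456"), link))
```
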
>
>The structure will allow validation of data against the more detailed 
>descriptions in the DDI-file.
>It could potentially also support hinging of tables across ncubes (at 
>least for ncubes enclosed within a single DDI-file).
>It will also allow the data-file to live outside the DDI-file. I do not 
>think that we should provide a structure that forces the data to be 
>located within the DDI-instance (as would be the case with a solution 
>based on locmap).
>By doing something like this we are also structuring the data in a fairly 
>conventional way that could easily be understood by a variety of software 
>that needs to read this type of data.
>As far as I can see this approach is also more or less in line with the 
>SDMX data exchange format.
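
As a rough illustration of the validation mentioned above, the Python sketch below checks each record's coordinates against a set of known category codes. The CATEGORIES table is hypothetical and stands in for the category descriptions that would be read from the DDI-file. The duplicate-cell and numeric checks are additional assumptions about what validation might cover.

```python
# Hypothetical category codes, standing in for the category descriptions
# that would come from the DDI-file.
CATEGORIES = {
    "geography": {"101", "102"},
    "time": {"2000", "2001"},
    "gender": {"1", "2"},
}
DIMENSIONS = ["geography", "time", "gender"]


def validate(records, dimensions=DIMENSIONS, categories=CATEGORIES):
    """Return a list of (record number, message) for every problem found."""
    problems = []
    seen = set()
    for number, record in enumerate(records, start=1):
        coords = tuple(record[:len(dimensions)])
        # Every coordinate must be a known category of its dimension.
        for name, code in zip(dimensions, coords):
            if code not in categories[name]:
                problems.append((number, f"unknown {name} code {code!r}"))
        # Each cell should appear only once in the file.
        if coords in seen:
            problems.append((number, f"duplicate cell {coords}"))
        seen.add(coords)
        # Every remaining column is treated as a numeric cell value.
        for raw in record[len(dimensions):]:
            try:
                float(raw)
            except ValueError:
                problems.append((number, f"non-numeric cell value {raw!r}"))
    return problems


records = [
    ("101", "2000", "1", "123456"),
    ("101", "2000", "3", "154327"),  # gender code 3 is not described
    ("101", "2000", "1", "n/a"),     # repeats the first cell, bad value
]
for number, message in validate(records):
    print(number, message)
```
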
>
>Jostein
>
>
>
>
>
>At 08:45 25.08.2005 -0400, Mary Vardigan wrote:
>>Thanks much, Kate. Responses below. --Mary and Sanda
>>
>>At 10:29 AM 8/24/2005, Katherine McNeill-Harman wrote:
>>>At the end of yesterday's phone conversation, I mentioned to J that I 
>>>thought that we had the least documentation on what changes/structures 
>>>to suggest for aggregate data (i.e. no single spreadsheet from which we 
>>>were working).  So he's going to try to compile something but, as he 
>>>said in his other email, we need to pool our thoughts on this.  So I'm 
>>>starting the ball rolling.
>>>
>>>Following are the goals for aggregate data we'd outlined in Edinburgh; 
>>>what changes have we agreed upon that will accomplish these?
>>>- Accommodate data files in self-documenting formats w/ integrated data 
>>>and metadata (e.g. Excel files).
>>
>>We agree that this is a worthy goal and suggest that data values be 
>>incorporated into the cell description, which is Data Item within LocMap, 
>>just where Wendy placed the physical table cell description. This sets up 
>>two options: (1) describing the position of the data value in a separate 
>>rectangular file (already in the spec), and (2) the position of the row 
>>and column, plus the data value, to describe a self-documenting table. 
>>Wendy, you mentioned that defining row and column is not enough to 
>>specify a physical table. Can you and J and Jostein perhaps come up with 
>>all the elements needed?
>>
>>>- Evaluate broad utility of nCubes
>>
>>Consensus seemed to be that the logical description of the aggregate data 
>>nCubes is fine. What we need to improve is the physical description, 
>>which we addressed above.
>>
>>>- Need ability to describe method of aggregation
>>
>>We have this -- an existing attribute called aggMeth.
>>
>>>- Need of additional tags to describe aggregate data (not nCubes)
>>
>>Not sure what this means. In SDMX there are attributes that J mentioned 
>>in the last call (information about the measurement, like source and 
>>observation status -- projection, actual count, etc.). Adding this 
>>information could help to make the ncubes more robust and make the DDI 
>>more compatible with SDMX.
>>
>>>- Review tag names
>>
>>This is important and we haven't done it yet. Sanda and I will try to 
>>take a stab at this next week and we can all review. For example, 
>>"cohort" is not exactly the right term for how it is used.
>>
>>>- Role of modules for different kinds of data
>>
>>Not sure what is meant by this. Have we covered the two main 
>>possibilities through 1 and 2 above?
>>
>>>- Align to SDMX
>>
>>J, we need your help in determining what is needed here. If we include 
>>the values and add the attributes you mentioned, are there other things 
>>we need to do?
>>
>>
>>>I've been looking over my notes on our changes/proposals; here is what's 
>>>been said in principle, but again, we'll want to document this and 
>>>consider how to accomplish it (I've added the date of discussion where I 
>>>have it):
>>>
>>>- DDI is missing a way to describe the physical structure of a 
>>>spreadsheet; need physical description for rows/columns/layers to say 
>>>how they relate to each other; this would enable machine-actionable 
>>>collapsing of rows or columns (e.g. collapsing of age groups) and 
>>>creation of subtotals and totals; it should also accommodate various 
>>>levels and irregular nested categories, and be able to identify the 
>>>lowest level (8/9)
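
A small Python sketch of what machine-actionable collapsing and subtotals could look like once the nesting of categories is described. The cell values and the AGE_PARENT nesting are made up for illustration, and the sketch assumes an additive measure (simple counts) so that summing is valid.

```python
from collections import defaultdict

# Illustrative cells only: (geography, age group) -> count.
CELLS = {
    ("101", "0_4"): 10,
    ("101", "5_9"): 12,
    ("101", "10_14"): 8,
    ("102", "0_4"): 5,
    ("102", "5_9"): 7,
    ("102", "10_14"): 9,
}

# Hypothetical nesting: each lowest-level age group and its parent group.
AGE_PARENT = {"0_4": "0_9", "5_9": "0_9", "10_14": "10_14"}


def collapse(cells, axis, parent_of):
    """Sum cells after replacing the code on one dimension with its parent."""
    out = defaultdict(int)
    for coords, value in cells.items():
        new_coords = list(coords)
        new_coords[axis] = parent_of[coords[axis]]
        out[tuple(new_coords)] += value
    return dict(out)


def subtotal(cells, axis, label="TOTAL"):
    """Sum over one whole dimension, replacing its codes with a single label."""
    everything = {coords[axis]: label for coords in cells}
    return collapse(cells, axis, everything)


print(collapse(CELLS, axis=1, parent_of=AGE_PARENT))
print(subtotal(CELLS, axis=1))
```
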
>>
>>Not sure what layers are.
>>
>>>- (8/9)
>>>- need to be able to mark up and represent existing tables (e.g. from 
>>>print volumes) (8/16)
>>>- enable creation of a single file containing both data and metadata; 
>>>this format would be optional and could be applied when appropriate (8/16)
>>>- unlike SDMX, enable a single file that contains both the data and the 
>>>structure (8/23)
>>>- be able to apply attribute information at all levels (from cells on 
>>>up); could add to n-cube in the measure element a sub-element that 
>>>defines attributes that can be attached to any level; provide a 
>>>structure by which authors could define these.  However, unlike other 
>>>3.0 features, items at a lower level do not override things at a higher 
>>>level; therefore, the structure will need to make clear that attributes 
>>>can be defined at only one chosen level (i.e. can't have conflicting 
>>>attributes at different levels). (8/23)
>>
>>J, please help us determine the level at which the SDMX type attributes 
>>should apply.
>>
>>>- ability to locate the desired cell within the cube (8/23)
>>>- hinging is important, yet may be addressed by comparative data group; 
>>>SRG liaison will check (8/23)
>>
>>Do layers relate to hinging? Hinging is possible now but only within one 
>>DDI instance. Between two instances, we have to establish comparability 
>>at the variable level.
>>
>>One other point, related to Wendy's earlier message: We are still 
>>confused when the discussion moves to 2- versus 3-dimensional structures. 
>>We need examples. Is Census data format 2 or 3 dimensional? How about an 
>>Excel spreadsheet? What are the differences in dimensionality?
>>
>>
>>
>>>I'd ask others to help add to and clarify these.  In addition, many of 
>>>the above I just have articulated as goals, so I'm not clear if we've 
>>>yet figured out how to accomplish these.
>>>
>>>Kate
>>>
>>>___________________________________________
>>>Katherine McNeill-Harman
>>>Data Services Librarian
>>>Dewey Library for Management and Social Sciences
>>>Massachusetts Institute of Technology
>>>77 Massachusetts Avenue, E53-100
>>>Cambridge, MA 02139
>>>mcneillh at mit.edu
>>>617-253-0787
>>>_______________________________________________
>>>DDI-ADG mailing list
>>>DDI-ADG at icpsr.umich.edu
>>>http://www.icpsr.umich.edu/mailman/listinfo/ddi-adg
>>
>>Mary Vardigan
>>Assistant Director
>>Inter-university Consortium for Political and Social Research (ICPSR)
>>University of Michigan
>>P.O. Box 1248, Ann Arbor, MI 48106-1248
>>Phone: 734-615-7908
>>Fax: 734-647-8200
>>www.icpsr.umich.edu
>
>



Sanda Ionescu,
Research Associate
Inter-university Consortium for Political and Social Research (ICPSR)
The University of Michigan
P.O. Box 1248
Ann Arbor, MI 48106

Phone: (734) 615-7890
Fax: (734) 615-7890
        (734) 647-8200


