[DDI-ADG] progress on aggregate data?

Jostein Ryssevik Jostein.Ryssevik at nsd.uib.no
Fri Aug 26 07:47:00 EDT 2005


Hi all

Thanks Sanda. Yes, your example could probably work. However, one problem I 
have with SDMX is that it basically derives its logic from time series. Time 
is thus not treated like any other dimension but has its own tag sets, as 
illustrated in your example. I would prefer that we do not adopt this 
specific feature of SDMX, but treat any dimension as a dimension.

Another very simple way of doing this is listed below. I believe that 
this example could also easily be mapped to SDMX:

First a standard data description that describes the logic of the cube. 
Nothing new here:

<dataDscr>
<var ID="V1" name="Country" files="C1" dcml="0" intrvl="discrete">
<labl>Country</labl>
<catgry><catValu>1</catValu><labl>Australia</labl></catgry>
<catgry><catValu>2</catValu><labl>Austria</labl></catgry>
<catgry><catValu>3</catValu><labl>Belgium</labl></catgry>
<catgry><catValu>4</catValu><labl>Canada</labl></catgry>
</var>

<var ID="V2" name="Year" files="C1" dcml="0" intrvl="discrete">
<labl>Year</labl>
<catgry><catValu>1960</catValu><labl>1960</labl></catgry>
<catgry><catValu>1970</catValu><labl>1970</labl></catgry>
<catgry><catValu>1980</catValu><labl>1980</labl></catgry>
<catgry><catValu>1990</catValu><labl>1990</labl></catgry>
</var>

<var ID="V3" name="Sex" files="C1" dcml="0" intrvl="discrete">
<labl>Sex</labl>
<catgry><catValu>1</catValu><labl>Females at birth</labl></catgry>
<catgry><catValu>2</catValu><labl>Males at birth</labl></catgry>
</var>

<var ID="V4" name="Lifeexpectancy" files="C1" dcml="0" intrvl="contin">
<labl>Life expectancy</labl>
</var>

<nCube ID="C1" name="OECD10" dmnsQnty="3">
<labl>Life expectancy at birth</labl>
<dmns varRef="V1"/>
<dmns varRef="V2"/>
<dmns varRef="V3"/>
<measure varRef="V4"/>
</nCube>
</dataDscr>
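
As a rough illustration of how software might read this description, here is a small Python sketch. It is not part of any DDI tooling: the function name `check_cube` and the trimmed sample string are assumptions made for the example. It parses a `dataDscr` fragment with the standard library and checks that every `dmns` and `measure` reference in the nCube points to a declared variable, and that `dmnsQnty` matches the number of dimensions.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the description above (one or two categories per dimension).
DATA_DSCR = """
<dataDscr>
  <var ID="V1" name="Country" files="C1" dcml="0" intrvl="discrete">
    <labl>Country</labl>
    <catgry><catValu>1</catValu><labl>Australia</labl></catgry>
    <catgry><catValu>2</catValu><labl>Austria</labl></catgry>
  </var>
  <var ID="V2" name="Year" files="C1" dcml="0" intrvl="discrete">
    <labl>Year</labl>
    <catgry><catValu>1960</catValu><labl>1960</labl></catgry>
  </var>
  <var ID="V3" name="Sex" files="C1" dcml="0" intrvl="discrete">
    <labl>Sex</labl>
    <catgry><catValu>1</catValu><labl>Females at birth</labl></catgry>
    <catgry><catValu>2</catValu><labl>Males at birth</labl></catgry>
  </var>
  <var ID="V4" name="Lifeexpectancy" files="C1" dcml="0" intrvl="contin">
    <labl>Life expectancy</labl>
  </var>
  <nCube ID="C1" name="OECD10" dmnsQnty="3">
    <labl>Life expectancy at birth</labl>
    <dmns varRef="V1"/>
    <dmns varRef="V2"/>
    <dmns varRef="V3"/>
    <measure varRef="V4"/>
  </nCube>
</dataDscr>
"""


def check_cube(data_dscr_xml, cube_id):
    """Return a list of problems found in the nCube declaration (empty if none)."""
    root = ET.fromstring(data_dscr_xml)
    declared = {var.get("ID") for var in root.iter("var")}
    cube = next((c for c in root.iter("nCube") if c.get("ID") == cube_id), None)
    if cube is None:
        return [f"nCube {cube_id} not found"]

    problems = []
    dims = [d.get("varRef") for d in cube.findall("dmns")]
    measures = [m.get("varRef") for m in cube.findall("measure")]

    # dmnsQnty should equal the number of dmns children.
    if cube.get("dmnsQnty") != str(len(dims)):
        problems.append(
            f"dmnsQnty is {cube.get('dmnsQnty')} but {len(dims)} dmns elements found"
        )
    # Every dimension and measure must refer to a declared variable.
    for ref in dims + measures:
        if ref not in declared:
            problems.append(f"varRef {ref} is not a declared variable")
    return problems


problems = check_cube(DATA_DSCR, "C1")
print(problems or "cube description is consistent")
```

A check like this only covers the logical description; it says nothing yet about where the cell values are stored.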

Then the new section that holds the data, in this example embedded in the 
DDI instance. With a few additional tags it could also live in a separate file.

<ddidata>
<data format="multidim" cubeid="C1">

<dataitem>
<dim varRef="V1" catRef="1"/>
<dim varRef="V2" catRef="1960"/>
<dim varRef="V3" catRef="1"/>
<obs varRef="V4">72</obs>
</dataitem>

<dataitem>
<dim varRef="V1" catRef="1"/>
<dim varRef="V2" catRef="1960"/>
<dim varRef="V3" catRef="2"/>
<obs varRef="V4">75</obs>
</dataitem>

<dataitem>
<dim varRef="V1" catRef="1"/>
<dim varRef="V2" catRef="1970"/>
<dim varRef="V3" catRef="1"/>
<obs varRef="V4">72</obs>
</dataitem>

<dataitem>
<dim varRef="V1" catRef="1"/>
<dim varRef="V2" catRef="1970"/>
<dim varRef="V3" catRef="2"/>
<obs varRef="V4">72</obs>
</dataitem>

<dataitem>
<dim varRef="V1" catRef="1"/>
<dim varRef="V2" catRef="1980"/>
<dim varRef="V3" catRef="1"/>
<obs varRef="V4">72</obs>
</dataitem>

etc.

Note that for multi-measure cubes, we will simply repeat the obs tag nested 
within the dataitem tag. A dataitem tag points to a single cell in the 
multidimensional space, but this cell can contain more than one measure.

</data>

You can have several datafiles embedded within a ddidata-section.

</ddidata>

Remember, this is just a very rough and simplified example. It does, however, 
illustrate how the logical description and the DDI storage are connected.
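
To make that connection concrete, here is a minimal Python sketch of how a reader might resolve each dataitem against the data description. The element names (`ddidata`, `data`, `dataitem`, `dim`, `obs`) are the ones proposed in this message, not existing DDI elements. The helper names `category_labels` and `resolve_cells` are hypothetical, and the sample strings are trimmed versions of the example above.

```python
import xml.etree.ElementTree as ET

# Trimmed description: only the categories used by the sample data below.
DATA_DSCR = """<dataDscr>
  <var ID="V1" name="Country"><labl>Country</labl>
    <catgry><catValu>1</catValu><labl>Australia</labl></catgry>
  </var>
  <var ID="V2" name="Year"><labl>Year</labl>
    <catgry><catValu>1960</catValu><labl>1960</labl></catgry>
  </var>
  <var ID="V3" name="Sex"><labl>Sex</labl>
    <catgry><catValu>1</catValu><labl>Females at birth</labl></catgry>
    <catgry><catValu>2</catValu><labl>Males at birth</labl></catgry>
  </var>
  <var ID="V4" name="Lifeexpectancy"><labl>Life expectancy</labl></var>
</dataDscr>"""

# Two cells in the proposed (hypothetical) storage format.
DDI_DATA = """<ddidata>
  <data format="multidim" cubeid="C1">
    <dataitem>
      <dim varRef="V1" catRef="1"/>
      <dim varRef="V2" catRef="1960"/>
      <dim varRef="V3" catRef="1"/>
      <obs varRef="V4">72</obs>
    </dataitem>
    <dataitem>
      <dim varRef="V1" catRef="1"/>
      <dim varRef="V2" catRef="1960"/>
      <dim varRef="V3" catRef="2"/>
      <obs varRef="V4">75</obs>
    </dataitem>
  </data>
</ddidata>"""


def category_labels(data_dscr_xml):
    """Map each variable ID to (variable label, {catValu: category label})."""
    result = {}
    for var in ET.fromstring(data_dscr_xml).iter("var"):
        cats = {c.findtext("catValu"): c.findtext("labl") for c in var.findall("catgry")}
        result[var.get("ID")] = (var.findtext("labl"), cats)
    return result


def resolve_cells(ddi_data_xml, labels):
    """Turn every dataitem into (coordinates, measures).

    A KeyError is raised when a varRef or catRef is not found in the
    description, which is the validation step in its simplest form.
    """
    cells = []
    for item in ET.fromstring(ddi_data_xml).iter("dataitem"):
        coords = {}
        for dim in item.findall("dim"):
            var_label, cats = labels[dim.get("varRef")]
            coords[var_label] = cats[dim.get("catRef")]
        # Repeated obs elements (multi-measure cubes) each become one entry.
        measures = {labels[o.get("varRef")][0]: float(o.text) for o in item.findall("obs")}
        cells.append((coords, measures))
    return cells


for coords, measures in resolve_cells(DDI_DATA, category_labels(DATA_DSCR)):
    print(coords, measures)
```

Because each cell carries explicit varRef/catRef pairs, no dimension (time included) needs special handling in the reader.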

Jostein


At 15:47 25.08.2005 -0400, Sanda Ionescu wrote:
>Hi, all.
>I think Jostein's example could be described with an adaptation of SDMX 
>generic sample markup, as sent by J:
><generic:Series>
>          <generic:SeriesKey>
>             <generic:Value concept="FREQ" value="A"/>
>             <generic:Value concept="REGION_ETH" value="SOM"/>
>             <generic:Value concept="GENDER" value="M"/>
>             <generic:Value concept="AGE_GRP" value="0_4"/>
>          </generic:SeriesKey>
>          <generic:Attributes>
>             <generic:Value concept="TIME_FORMAT" value="P1Y"/>
>             <generic:Value concept="UNIT_MULT" value="6"/>
>          </generic:Attributes>
>          <generic:Obs>
>             <generic:Time>2005</generic:Time>
>             <generic:ObsValue value="2.314327"/>
>             <generic:Attributes>
>                <generic:Value concept="OBS_STATUS" value="A"/>
>             </generic:Attributes>
>          </generic:Obs>
>       </generic:Series>
>
>--- It seems to me this provides for including dimension names, and 
>values, for each cell and the actual data (measure), as well as additional 
>information about the measure.
>(I hope I'm not misreading anything :-)
>It is less clear to me how/where we fit in the cell coordinates, or how we 
>link to them?
>Sanda.
>
>
>At 09:58 AM 8/25/2005, Jostein Ryssevik wrote:
>>Hi,
>>
>>Regarding a physical storage/exchange format for aggregated data files I 
>>am not totally convinced that using the locMap element represents the 
>>ideal solution. A fairly standard way to exchange this type of data is a 
>>delimited file with one record per table cell. The first columns in this 
>>file provide the location of the cell on all the table dimensions (one 
>>column per dimension). The last column holds the cell value (that is, 
>>the value of the measure variable located in that cell). For 
>>multi-measure cubes (they do exist, believe me), you will simply have 
>>more than one cell-value column.
>>
>>So a simple example with a geography, time, and gender dimension (and a 
>>population measure) might look something like this:
>>
>>101, 2000, 1, 123456
>>101, 2000, 2, 154327
>>101, 2001, 1, 365427
>>etc.
>>
>>First record holds the population number for geography 101, year 2000 and 
>>gender 1 etc. (entries being category values on the dimension variables)
>>
>>To provide an xml-structure that:
>>
>>1) links the datafile to a specific ncube description in a specific DDI 
>>instance
>>2) links the coordinates of the cell to the relevant variable and 
>>category descriptions in the DDI-file
>>3) links the cell values to the relevant measure variables in the DDI-file,
>>
>>should all be fairly simple and straight forward.
>>
>>The structure will allow validation of data against the more detailed 
>>descriptions in the DDI-file.
>>It could potentially also support hinging of tables across ncubes (at 
>>least for ncubes enclosed within a single DDI-file)
>>It will also allow the data-file to live outside the DDI-file. I do not 
>>think that we should provide a structure that forces the data to be 
>>located within the DDI-instance (as would be the case with a solution 
>>based on locmap).
>>By doing something like this we are also structuring the data in a fairly 
>>conventional way that easily could be understood by a variety of software 
>>that need to read this type of data.
>>As far as I can see this approach is also more or less in line with the 
>>SDMX data exchange format.
>>
>>Jostein
>>
>>
>>
>>
>>
>>At 08:45 25.08.2005 -0400, Mary Vardigan wrote:
>>>Thanks much, Kate. Responses below. --Mary and Sanda
>>>
>>>At 10:29 AM 8/24/2005, Katherine McNeill-Harman wrote:
>>>>At the end of yesterday's phone conversation, I mentioned to J that I 
>>>>thought that we had the least documentation on what changes/structures 
>>>>to suggest for aggregate data (i.e. no single spreadsheet from which we 
>>>>were working).  So he's going to try to compile something but, as he 
>>>>said in his other email, we need to pool our thoughts on this.  So I'm 
>>>>starting the ball rolling.
>>>>
>>>>Following are the goals for aggregate data we'd outlined in Edinburgh; 
>>>>what changes have we agreed upon that will accomplish these?
>>>>- Accommodate data files in formats w/ integrated data and metadata 
>>>>(e.g. Excel files) self documenting.
>>>
>>>We agree that this is a worthy goal and suggest that data values be 
>>>incorporated into the cell description, which is Data Item within 
>>>LocMap, just where Wendy placed the physical table cell description. 
>>>This sets up two options: (1) describing the position of the data value 
>>>in a separate rectangular file (already in the spec), and (2) the 
>>>position of the row and column, plus the data value, to describe a 
>>>self-documenting table. Wendy, you mentioned that defining row and 
>>>column is not enough to specify a physical table. Can you and J and 
>>>Jostein perhaps come up with all the elements needed?
>>>
>>>>- Evaluate broad utility of nCubes
>>>
>>>Consensus seemed to be that the logical description of the aggregate 
>>>data nCubes is fine. What we need to improve is the physical 
>>>description, which we addressed above.
>>>
>>>>- Need ability to describe method of aggregation
>>>
>>>We have this -- an existing attribute called aggMeth.
>>>
>>>>- Need of additional tags to describe aggregate data (not nCubes)
>>>
>>>Not sure what this means. In SDMX there are attributes that J mentioned 
>>>in the last call (information about the measurement, like source and 
>>>observation status -- projection, actual count, etc.). Adding this 
>>>information could help to make the ncubes more robust and make the DDI 
>>>more compatible with SDMX.
>>>
>>>>- Review tag names
>>>
>>>This is important and we haven't done it yet. Sanda and I will try to 
>>>make a stab at this next week and we can all review. For example, 
>>>"cohort" is not exactly the right term for how it is used.
>>>
>>>>- Role of modules for different kinds of data
>>>
>>>Not sure what is meant by this. Have we covered the two main 
>>>possibilities through 1 and 2 above?
>>>
>>>>- Align to SDMX
>>>
>>>J, we need your help in determining what is needed here. If we include 
>>>the values and add the attributes you mentioned, are there other things 
>>>we need to do?
>>>
>>>
>>>>I've been looking over my notes on our changes/proposals; here is 
>>>>what's been said in principle, but again, we'll want to document this 
>>>>and consider how to accomplish it (I put the date when I have it discussed):
>>>>
>>>>- DDI is missing a way to describe the physical structure of a 
>>>>spreadsheet; need physical description for rows/columns/layers to say 
>>>>how they relate to each other; this would enable machine-actionable 
>>>>collapsing of rows or columns (e.g. collapsing of age groups) and 
>>>>creation of subtotals and totals; it should also accommodate various 
>>>>levels and irregular nested categories, and be able to identify the 
>>>>lowest level (8/9)
>>>
>>>Not sure what layers are.
>>>
>>>>- (8/9)
>>>>- need to be able to mark up and represent existing tables (e.g. from 
>>>>print volumes) (8/16)
>>>>- enable creation of a single file containing both data and metadata; 
>>>>this format would be optional and could be applied when appropriate (8/16)
>>>>- unlike SDMX, enable a single file that contains both the data and the 
>>>>structure (8/23)
>>>>- be able to apply attribute information at all levels (from cells on 
>>>>up); could add to n-cube in the measure element a sub-element that 
>>>>defines attributes that can be attached to any level; provide a 
>>>>structure by which authors could define these.  However, it's not the 
>>>>case as with other 3.0 features that items at a lower level override 
>>>>things at a higher level; therefore, the structure will need to be such 
>>>>that it's clear that attributes can be defined only at one chosen level 
>>>>(i.e. can't have conflicting attributes at different levels). (8/23)
>>>
>>>J, please help us determine the level at which the SDMX type attributes 
>>>should apply.
>>>
>>>>- ability to locate the desired cell within the cube (8/23)
>>>>- hinging is important, yet may be addressed by comparative data group; 
>>>>SRG liaison will check (8/23)
>>>
>>>Do layers relate to hinging? Hinging is possible now but only within one 
>>>DDI instance. Between two instances, we have to establish comparability 
>>>at the variable level.
>>>
>>>One other point, related to Wendy's earlier message: We are still 
>>>confused when the discussion moves to 2- versus 3-dimensional 
>>>structures. We need examples. Is Census data format 2 or 3 dimensional? 
>>>How about an Excel spreadsheet? What are the differences in dimensionality?
>>>
>>>
>>>
>>>>I'd ask others to help add to and clarify these.  In addition, many of 
>>>>the above I just have articulated as goals, so I'm not clear if we've 
>>>>yet figured out how to accomplish these.
>>>>
>>>>Kate
>>>>
>>>>___________________________________________
>>>>Katherine McNeill-Harman
>>>>Data Services Librarian
>>>>Dewey Library for Management and Social Sciences
>>>>Massachusetts Institute of Technology
>>>>77 Massachusetts Avenue, E53-100
>>>>Cambridge, MA 02139
>>>>mcneillh at mit.edu
>>>>617-253-0787
>>>>_______________________________________________
>>>>DDI-ADG mailing list
>>>>DDI-ADG at icpsr.umich.edu
>>>>http://www.icpsr.umich.edu/mailman/listinfo/ddi-adg
>>>
>>>Mary Vardigan
>>>Assistant Director
>>>Inter-university Consortium for Political and Social Research (ICPSR)
>>>University of Michigan
>>>P.O. Box 1248, Ann Arbor, MI 48106-1248
>>>Phone: 734-615-7908
>>>Fax: 734-647-8200
>>>www.icpsr.umich.edu
>>
>>
>
>
>
>Sanda Ionescu,
>Research Associate
>Inter-university Consortium for Political and Social Research (ICPSR)
>The University of Michigan
>P.O. Box 1248
>Ann Arbor, MI 48106
>
>Phone: (734) 615-7890
>Fax: (734) 615-7890
>        (734) 647-8200
>



