[DDI-ADG] progress on aggregate data?
Jostein Ryssevik
Jostein.Ryssevik at nsd.uib.no
Fri Aug 26 07:47:00 EDT 2005
Hi all
Thanks Sanda. Yes, your example could probably work. However, one problem I
have with SDMX is that it basically derives its logic from time series. Time
is thus not treated like any other dimension but has its own specific tag sets,
as illustrated in your example. I would prefer that we do not adopt this
specific feature of SDMX, but treat every dimension simply as a dimension.
Another very simple way of doing this is listed below. I believe that this
example, too, could easily be mapped to SDMX:
First, a standard data description that describes the logic of the cube.
Nothing new here:
<dataDscr>
<var ID="V1" name="Country" files="C1" dcml="0" intrvl="discrete">
<labl>Country</labl>
<catgry><catValu>1</catValu><labl>Australia</labl></catgry>
<catgry><catValu>2</catValu><labl>Austria</labl></catgry>
<catgry><catValu>3</catValu><labl>Belgium</labl></catgry>
<catgry><catValu>4</catValu><labl>Canada</labl></catgry>
</var>
<var ID="V2" name="Year" files="C1" dcml="0" intrvl="discrete">
<labl>Year</labl>
<catgry><catValu>1960</catValu><labl>1960</labl></catgry>
<catgry><catValu>1970</catValu><labl>1970</labl></catgry>
<catgry><catValu>1980</catValu><labl>1980</labl></catgry>
<catgry><catValu>1990</catValu><labl>1990</labl></catgry>
</var>
<var ID="V3" name="Sex" files="C1" dcml="0" intrvl="discrete">
<labl>Sex</labl>
<catgry><catValu>1</catValu><labl>Females at birth</labl></catgry>
<catgry><catValu>2</catValu><labl>Males at birth</labl></catgry>
</var>
<var ID="V4" name="Lifeexpectancy" files="C1" dcml="0" intrvl="contin">
<labl>Life expectancy</labl>
</var>
<nCube ID="C1" name="OECD10" dmnsQnty="3">
<labl>Life expectancy at birth</labl>
<dmns varRef="V1"/>
<dmns varRef="V2"/>
<dmns varRef="V3"/>
<measure varRef="V4"/>
</nCube>
</dataDscr>
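As a rough illustration of how software might read this logical description, here is a minimal Python sketch using only the standard library. It assumes the dataDscr fragment above is available as a well-formed string; the function name read_cube and the consistency check against dmnsQnty are illustrative assumptions, not part of any DDI specification.

```python
import xml.etree.ElementTree as ET

# The logical cube description from the example, as a well-formed fragment.
DATA_DSCR = """
<dataDscr>
  <var ID="V1" name="Country" files="C1" dcml="0" intrvl="discrete">
    <labl>Country</labl>
    <catgry><catValu>1</catValu><labl>Australia</labl></catgry>
    <catgry><catValu>2</catValu><labl>Austria</labl></catgry>
    <catgry><catValu>3</catValu><labl>Belgium</labl></catgry>
    <catgry><catValu>4</catValu><labl>Canada</labl></catgry>
  </var>
  <var ID="V2" name="Year" files="C1" dcml="0" intrvl="discrete">
    <labl>Year</labl>
    <catgry><catValu>1960</catValu><labl>1960</labl></catgry>
    <catgry><catValu>1970</catValu><labl>1970</labl></catgry>
    <catgry><catValu>1980</catValu><labl>1980</labl></catgry>
    <catgry><catValu>1990</catValu><labl>1990</labl></catgry>
  </var>
  <var ID="V3" name="Sex" files="C1" dcml="0" intrvl="discrete">
    <labl>Sex</labl>
    <catgry><catValu>1</catValu><labl>Females at birth</labl></catgry>
    <catgry><catValu>2</catValu><labl>Males at birth</labl></catgry>
  </var>
  <var ID="V4" name="Lifeexpectancy" files="C1" dcml="0" intrvl="contin">
    <labl>Life expectancy</labl>
  </var>
  <nCube ID="C1" name="OECD10" dmnsQnty="3">
    <labl>Life expectancy at birth</labl>
    <dmns varRef="V1"/>
    <dmns varRef="V2"/>
    <dmns varRef="V3"/>
    <measure varRef="V4"/>
  </nCube>
</dataDscr>
"""


def read_cube(xml_text, cube_id):
    """Return (categories, dims, measures) for one nCube (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # Variable ID -> {category value: label}
    categories = {
        var.get("ID"): {
            cat.findtext("catValu"): cat.findtext("labl")
            for cat in var.findall("catgry")
        }
        for var in root.findall("var")
    }
    cube = next(c for c in root.findall("nCube") if c.get("ID") == cube_id)
    dims = [d.get("varRef") for d in cube.findall("dmns")]
    measures = [m.get("varRef") for m in cube.findall("measure")]
    # Assumed sanity check: the declared dimension count matches the dmns list.
    if len(dims) != int(cube.get("dmnsQnty")):
        raise ValueError("dmnsQnty does not match the number of dmns elements")
    return categories, dims, measures


categories, dims, measures = read_cube(DATA_DSCR, "C1")

# Every dimension is treated alike: the cube size is the product of the
# category counts of all dimensions, with no special case for time.
n_cells = 1
for var_id in dims:
    n_cells *= len(categories[var_id])
print(dims, measures, n_cells)
```

Year is handled here exactly like Country and Sex, which is the point of treating time as an ordinary dimension.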
Then the new section that holds the data, in this example embedded in the
DDI instance. With a few additional tags it could also live in a separate file.
<ddidata>
<data format="multidim" cubeid="C1">
<dataitem>
<dim varRef="V1" catRef="1"/>
<dim varRef="V2" catRef="1960"/>
<dim varRef="V3" catRef="1"/>
<obs varRef="V4">72</obs>
</dataitem>
<dataitem>
<dim varRef="V1" catRef="1"/>
<dim varRef="V2" catRef="1960"/>
<dim varRef="V3" catRef="2"/>
<obs varRef="V4">75</obs>
</dataitem>
<dataitem>
<dim varRef="V1" catRef="1"/>
<dim varRef="V2" catRef="1970"/>
<dim varRef="V3" catRef="1"/>
<obs varRef="V4">72</obs>
</dataitem>
<dataitem>
<dim varRef="V1" catRef="1"/>
<dim varRef="V2" catRef="1970"/>
<dim varRef="V3" catRef="2"/>
<obs varRef="V4">72</obs>
</dataitem>
<dataitem>
<dim varRef="V1" catRef="1"/>
<dim varRef="V2" catRef="1980"/>
<dim varRef="V3" catRef="1"/>
<obs varRef="V4">72</obs>
</dataitem>
<!-- etc. -->
<!-- Note that for multi-measure cubes, we simply repeat the obs tag nested
within the dataitem tag. A dataitem tag points to a single cell in the
multidimensional space, but this cell can contain more than one measure. -->
</data>
<!-- You can have several data files embedded within a ddidata section. -->
</ddidata>
Remember, this is just a very rough and simplified example. It illustrates,
however, how the logical description and the DDI data storage are connected.
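To make that connection concrete, here is a minimal Python sketch that checks each dataitem against the category values declared for its dimensions and then flattens the cells into one record per cell, with the coordinates first and the measure last. The element and attribute names (ddidata, data, dataitem, dim, obs, cubeid, catRef) follow the rough proposal in this message only; the function name flatten and the hard-coded CATEGORIES, DIMS and MEASURES tables are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Category values per dimension variable, copied from the data description.
CATEGORIES = {
    "V1": {"1", "2", "3", "4"},
    "V2": {"1960", "1970", "1980", "1990"},
    "V3": {"1", "2"},
}
DIMS = ["V1", "V2", "V3"]   # dimension variables of cube C1, in order
MEASURES = ["V4"]           # measure variables of cube C1

# Two cells from the example, as a well-formed fragment.
DDI_DATA = """
<ddidata>
  <data format="multidim" cubeid="C1">
    <dataitem>
      <dim varRef="V1" catRef="1"/>
      <dim varRef="V2" catRef="1960"/>
      <dim varRef="V3" catRef="1"/>
      <obs varRef="V4">72</obs>
    </dataitem>
    <dataitem>
      <dim varRef="V1" catRef="1"/>
      <dim varRef="V2" catRef="1960"/>
      <dim varRef="V3" catRef="2"/>
      <obs varRef="V4">75</obs>
    </dataitem>
  </data>
</ddidata>
"""


def flatten(xml_text, cube_id):
    """Validate the cells of one cube and return them as flat records."""
    root = ET.fromstring(xml_text)
    data = next(d for d in root.findall("data") if d.get("cubeid") == cube_id)
    rows = []
    for item in data.findall("dataitem"):
        coords = {d.get("varRef"): d.get("catRef") for d in item.findall("dim")}
        # Each coordinate must be a category declared for that dimension.
        for var_id in DIMS:
            if coords.get(var_id) not in CATEGORIES[var_id]:
                raise ValueError(
                    f"unknown category {coords.get(var_id)!r} for {var_id}"
                )
        # A multi-measure cube would simply carry several obs elements.
        obs = {o.get("varRef"): o.text for o in item.findall("obs")}
        rows.append([coords[v] for v in DIMS] + [obs[m] for m in MEASURES])
    return rows


rows = flatten(DDI_DATA, "C1")
for row in rows:
    print(", ".join(row))
```

Each printed line is one cell in the delimited one-record-per-cell layout, so the embedded form and an external delimited file carry the same information.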
Jostein
At 15:47 25.08.2005 -0400, Sanda Ionescu wrote:
>Hi, all.
>I think Jostein's example could be described with an adaptation of SDMX
>generic sample markup, as sent by J:
><generic:Series>
> <generic:SeriesKey>
> <generic:Value concept="FREQ" value="A"/>
> <generic:Value concept="REGION_ETH" value="SOM"/>
> <generic:Value concept="GENDER" value="M"/>
> <generic:Value concept="AGE_GRP" value="0_4"/>
> </generic:SeriesKey>
> <generic:Attributes>
> <generic:Value concept="TIME_FORMAT" value="P1Y"/>
> <generic:Value concept="UNIT_MULT" value="6"/>
> </generic:Attributes>
> <generic:Obs>
> <generic:Time>2005</generic:Time>
> <generic:ObsValue value="2.314327"/>
> <generic:Attributes>
> <generic:Value concept="OBS_STATUS" value="A"/>
> </generic:Attributes>
> </generic:Obs>
> </generic:Series>
>
>--- It seems to me this provides for including dimension names, and
>values, for each cell and the actual data (measure), as well as additional
>information about the measure.
>(I hope I'm not misreading anything :-)
>It is less clear to me how/where we fit in the cell coordinates, or how we
>link to them?
>Sanda.
>
>
>At 09:58 AM 8/25/2005, Jostein Ryssevik wrote:
>>Hi,
>>
>>Regarding a physical storage/exchange format for aggregated data files I
>>am not totally convinced that using the locMap element represents the
>>ideal solution. A fairly standard way to exchange this type of data is a
>>delimited file with one record per table cell. The first columns in this
>>file provide the location of the cell on all the table dimensions (one
>>column per dimension). The last column holds the cell value (that is,
>>the value of the measure variable located in that cell). For
>>multi-measure cubes (they do exist, believe me), you will simply have
>>more than one cell-value column.
>>
>>So a simple example with a geography, time, and gender dimension (and a
>>population measure) might look something like this:
>>
>>101, 2000, 1, 123456
>>101, 2000, 2, 154327
>>101, 2001, 1, 365427
>>etc.
>>
>>First record holds the population number for geography 101, year 2000 and
>>gender 1 etc. (entries being category values on the dimension variables)
>>
>>To provide an xml-structure that:
>>
>>1) links the datafile to a specific ncube description in a specific DDI
>>instance
>>2) links the coordinates of the cell to the relevant variable and
>>category descriptions in the DDI-file
>>3) links the cell values to the relevant measure variables in the DDI-file,
>>
>>should all be fairly simple and straightforward.
>>
>>The structure will allow validation of data against the more detailed
>>descriptions in the DDI-file.
>>It could potentially also support hinging of tables across ncubes (at
>>least for ncubes enclosed within a single DDI-file)
>>It will also allow the data-file to live outside the DDI-file. I do not
>>think that we should provide a structure that forces the data to be
>>located within the DDI-instance (as would be the case with a solution
>>based on locmap).
>>By doing something like this we are also structuring the data in a fairly
>>conventional way that easily could be understood by a variety of software
>>that need to read this type of data.
>>As far as I can see this approach is also more or less in line with the
>>SDMX data exchange format.
>>
>>Jostein
>>
>>
>>
>>
>>
>>At 08:45 25.08.2005 -0400, Mary Vardigan wrote:
>>>Thanks much, Kate. Responses below. --Mary and Sanda
>>>
>>>At 10:29 AM 8/24/2005, Katherine McNeill-Harman wrote:
>>>>At the end of yesterday's phone conversation, I mentioned to J that I
>>>>thought that we had the least documentation on what changes/structures
>>>>to suggest for aggregate data (i.e. no single spreadsheet from which we
>>>>were working). So he's going to try to compile something but, as he
>>>>said in his other email, we need to pool our thoughts on this. So I'm
>>>>starting the ball rolling.
>>>>
>>>>Following are the goals for aggregate data we'd outlined in Edinburgh;
>>>>what changes have we agreed upon that will accomplish these?
>>>>- Accommodate data files in formats w/ integrated data and metadata
>>>>(e.g. Excel files) self documenting.
>>>
>>>We agree that this is a worthy goal and suggest that data values be
>>>incorporated into the cell description, which is Data Item within
>>>LocMap, just where Wendy placed the physical table cell description.
>>>This sets up two options: (1) describing the position of the data value
>>>in a separate rectangular file (already in the spec), and (2) the
>>>position of the row and column, plus the data value, to describe a
>>>self-documenting table. Wendy, you mentioned that defining row and
>>>column is not enough to specify a physical table. Can you and J and
>>>Jostein perhaps come up with all the elements needed?
>>>
>>>>- Evaluate broad utility of nCubes
>>>
>>>Consensus seemed to be that the logical description of the aggregate
>>>data nCubes is fine. What we need to improve is the physical
>>>description, which we addressed above.
>>>
>>>>- Need ability to describe method of aggregation
>>>
>>>We have this -- an existing attribute called aggMeth.
>>>
>>>>- Need of additional tags to describe aggregate data (not nCubes)
>>>
>>>Not sure what this means. In SDMX there are attributes that J mentioned
>>>in the last call (information about the measurement, like source and
>>>observation status -- projection, actual count, etc.). Adding this
>>>information could help to make the ncubes more robust and make the DDI
>>>more compatible with SDMX.
>>>
>>>>- Review tag names
>>>
>>>This is important and we haven't done it yet. Sanda and I will try to
>>>make a stab at this next week and we can all review. For example,
>>>"cohort" is not exactly the right term for how it is used.
>>>
>>>>- Role of modules for different kinds of data
>>>
>>>Not sure what is meant by this. Have we covered the two main
>>>possibilities through 1 and 2 above?
>>>
>>>>- Align to SDMX
>>>
>>>J, we need your help in determining what is needed here. If we include
>>>the values and add the attributes you mentioned, are there other things
>>>we need to do?
>>>
>>>
>>>>I've been looking over my notes on our changes/proposals; here is
>>>>what's been said in principle, but again, we'll want to document this
>>>>and consider how to accomplish it (I put the date when I have it discussed):
>>>>
>>>>- DDI is missing a way to describe the physical structure of a
>>>>spreadsheet; need physical description for rows/columns/layers to say
>>>>how they relate to each other; this would enable machine-actionable
>>>>collapsing of rows or columns (e.g. collapsing of age groups) and
>>>>creation of subtotals and totals; it should also accommodate various
>>>>levels and irregular nested categories, and be able to identify the
>>>>lowest level (8/9)
>>>
>>>Not sure what layers are.
>>>
>>>>- (8/9)
>>>>- need to be able to mark up and represent existing tables (e.g. from
>>>>print volumes) (8/16)
>>>>- enable creation of a single file containing both data and metadata;
>>>>this format would be optional and could be applied when appropriate (8/16)
>>>>- unlike SDMX, enable a single file that contains both the data and the
>>>>structure (8/23)
>>>>- be able to apply attribute information at all levels (from cells on
>>>>up); could add to n-cube in the measure element a sub-element that
>>>>defines attributes that can be attached to any level; provide a
>>>>structure by which authors could define these. However, it's not the
>>>>case as with other 3.0 features that items at a lower level override
>>>>things at a higher level; therefore, the structure will need to be such
>>>>that it's clear that attributes can be defined only at one chosen level
>>>>(i.e. can't have conflicting attributes at different levels). (8/23)
>>>
>>>J, please help us determine the level at which the SDMX type attributes
>>>should apply.
>>>
>>>>- ability to locate the desired cell within the cube (8/23)
>>>>- hinging is important, yet may be addressed by comparative data group;
>>>>SRG liaison will check (8/23)
>>>
>>>Do layers relate to hinging? Hinging is possible now but only within one
>>>DDI instance. Between two instances, we have to establish comparability
>>>at the variable level.
>>>
>>>One other point, related to Wendy's earlier message: We are still
>>>confused when the discussion moves to 2- versus 3-dimensional
>>>structures. We need examples. Is Census data format 2 or 3 dimensional?
>>>How about an Excel spreadsheet? What are the differences in dimensionality?
>>>
>>>
>>>
>>>>I'd ask others to help add to and clarify these. In addition, many of
>>>>the above I just have articulated as goals, so I'm not clear if we've
>>>>yet figured out how to accomplish these.
>>>>
>>>>Kate
>>>>
>>>>___________________________________________
>>>>Katherine McNeill-Harman
>>>>Data Services Librarian
>>>>Dewey Library for Management and Social Sciences
>>>>Massachusetts Institute of Technology
>>>>77 Massachusetts Avenue, E53-100
>>>>Cambridge, MA 02139
>>>>mcneillh at mit.edu
>>>>617-253-0787
>>>>_______________________________________________
>>>>DDI-ADG mailing list
>>>>DDI-ADG at icpsr.umich.edu
>>>>http://www.icpsr.umich.edu/mailman/listinfo/ddi-adg
>>>
>>>Mary Vardigan
>>>Assistant Director
>>>Inter-university Consortium for Political and Social Research (ICPSR)
>>>University of Michigan
>>>P.O. Box 1248, Ann Arbor, MI 48106-1248
>>>Phone: 734-615-7908
>>>Fax: 734-647-8200
>>>www.icpsr.umich.edu
>>
>>
>
>
>
>Sanda Ionescu,
>Research Associate
>Inter-university Consortium for Political and Social Research (ICPSR)
>The University of Michigan
>P.O. Box 1248
>Ann Arbor, MI 48106
>
>Phone: (734) 615-7890
>Fax: (734) 615-7890
> (734) 647-8200
>