[DDI-ADG] progress on aggregate data?
J Gager
j.b.gager at gmail.com
Fri Aug 26 15:35:36 EDT 2005
I agree that we should not adopt the specific Time Series usage of SDMX,
and should treat the TimeDimension like any other dimension in any data
description. However, when noting the structure of an nCube, it would be
important to note that a particular dimension is time. Note that I am
distinguishing between defining the logical structure of an nCube and the
representation of its data. By noting in the logical description that a
dimension is a time dimension, we can easily map to an SDMX time series
data instance.
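
To make the mapping concrete, here is a minimal Python sketch, assuming the
logical description carries a flag on the time dimension. The "is_time"
flag, the dictionary layout and the function name are hypothetical, chosen
only for illustration; none of them is an existing DDI or SDMX construct.
Cells are grouped into series keyed by the non-time dimensions, and each
series holds its observations ordered by period.

```python
from collections import defaultdict

# Hypothetical logical description of the cube. The "is_time" flag is an
# assumption made for this sketch, not an existing DDI attribute.
DIMENSIONS = [
    {"var": "V1", "name": "Country", "is_time": False},
    {"var": "V2", "name": "Year", "is_time": True},
    {"var": "V3", "name": "Sex", "is_time": False},
]

# Cells as (coordinates, value); coordinates map variable ID to category value.
CELLS = [
    ({"V1": "1", "V2": "1970", "V3": "1"}, 72),
    ({"V1": "1", "V2": "1960", "V3": "1"}, 72),
    ({"V1": "1", "V2": "1960", "V3": "2"}, 75),
]


def to_time_series(dimensions, cells):
    """Group cells into series keyed by the non-time dimensions."""
    time_vars = [d["var"] for d in dimensions if d["is_time"]]
    if len(time_vars) != 1:
        raise ValueError("expected exactly one time dimension")
    time_var = time_vars[0]
    key_vars = [d["var"] for d in dimensions if not d["is_time"]]

    series = defaultdict(list)
    for coords, value in cells:
        # The series key is the cell's position on every non-time dimension.
        key = tuple((var, coords[var]) for var in key_vars)
        series[key].append((coords[time_var], value))
    # Order the observations of each series by time period.
    return {key: sorted(obs) for key, obs in series.items()}


SERIES = to_time_series(DIMENSIONS, CELLS)
```

Without the flag, a reader of the cube has no way to know which of the three
dimensions should become the series' time axis; with it, the grouping is
mechanical.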
-----Original Message-----
From: Jostein Ryssevik [mailto:Jostein.Ryssevik at nsd.uib.no]
Sent: Friday, August 26, 2005 7:47 AM
To: Sanda Ionescu; Mary Vardigan; Katherine McNeill-Harman;
jgager at umich.edu; ddi-adg at icpsr.umich.edu
Subject: Re: [DDI-ADG] progress on aggregate data?
Hi all
Thanks Sanda. Yes, your example could probably work. However, one
problem I have with SDMX is that it basically derives its logic from
time series. Time is thus not treated like any other dimension but has
specific tag-sets, as illustrated in your example. I would prefer that we
do not adopt this specific feature of SDMX, but treat any dimension as a
dimension.
Another very simple way of doing this is listed below. I believe that
this example could also easily be mapped to SDMX:
First, a standard data description that describes the logic of the cube.
Nothing new here:
<dataDscr>
<var ID="V1" name="Country" files="C1" dcml="0" intrvl="discrete">
<labl>Country</labl>
<catgry><catValu>1</catValu><labl>Australia</labl></catgry>
<catgry><catValu>2</catValu><labl>Austria</labl></catgry>
<catgry><catValu>3</catValu><labl>Belgium</labl></catgry>
<catgry><catValu>4</catValu><labl>Canada</labl></catgry>
</var>
<var ID="V2" name="Year" files="C1" dcml="0" intrvl="discrete">
<labl>Year</labl>
<catgry><catValu>1960</catValu><labl>1960</labl></catgry>
<catgry><catValu>1970</catValu><labl>1970</labl></catgry>
<catgry><catValu>1980</catValu><labl>1980</labl></catgry>
<catgry><catValu>1990</catValu><labl>1990</labl></catgry>
</var>
<var ID="V3" name="Sex" files="C1" dcml="0" intrvl="discrete">
<labl>Sex</labl>
<catgry><catValu>1</catValu><labl>Females at birth</labl></catgry>
<catgry><catValu>2</catValu><labl>Males at birth</labl></catgry>
</var>
<var ID="V4" name="Lifeexpectancy" files="C1" dcml="0" intrvl="contin">
<labl>Life expectancy</labl>
</var>
<nCube ID="C1" name="OECD10" dmnsQnty="3">
<labl>Life expectancy at birth</labl>
<dmns varRef="V1"/>
<dmns varRef="V2"/>
<dmns varRef="V3"/>
<measure varRef="V4"/>
</nCube>
</dataDscr>
Then the new section that holds the data, in this example embedded in
the DDI instance. With a few additional tags it could also live in a
separate file.
<ddidata>
<data format="multidim" cubeid="C1">
<dataitem>
<dim varRef="V1" catRef="1"/>
<dim varRef="V2" catRef="1960"/>
<dim varRef="V3" catRef="1"/>
<obs varRef="V4">72</obs>
</dataitem>
<dataitem>
<dim varRef="V1" catRef="1"/>
<dim varRef="V2" catRef="1960"/>
<dim varRef="V3" catRef="2"/>
<obs varRef="V4">75</obs>
</dataitem>
<dataitem>
<dim varRef="V1" catRef="1"/>
<dim varRef="V2" catRef="1970"/>
<dim varRef="V3" catRef="1"/>
<obs varRef="V4">72</obs>
</dataitem>
<dataitem>
<dim varRef="V1" catRef="1"/>
<dim varRef="V2" catRef="1970"/>
<dim varRef="V3" catRef="2"/>
<obs varRef="V4">72</obs>
</dataitem>
<dataitem>
<dim varRef="V1" catRef="1"/>
<dim varRef="V2" catRef="1980"/>
<dim varRef="V3" catRef="1"/>
<obs varRef="V4">72</obs>
</dataitem>
etc.
Note that for multi-measure cubes, we will simply repeat the obs tag
nested within the dataitem tag. A dataitem tag points to a single cell in
the multidimensional space, but this cell can contain more than one
measure.
</data>
You can have several datafiles embedded within a ddidata-section.
</ddidata>
Remember, this is just a very rough and simplified example. It
illustrates, however, how the logic and the DDI storage are connected.
Jostein
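
As a rough illustration of how the embedded data could be checked against
the logical description, here is a minimal Python sketch using the standard
xml.etree.ElementTree module. It assumes the element and attribute names
from the example above (ddidata, data, dataitem, dim, obs, cubeid, varRef,
catRef), which are a proposal in this thread and not part of any published
DDI schema. The validate function and the trimmed sample documents are
hypothetical, and the second dataitem deliberately uses a year (1985) that
is not declared as a category.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the logical description (labels and most categories omitted).
DATA_DSCR = """
<dataDscr>
  <var ID="V1" name="Country"><catgry><catValu>1</catValu></catgry></var>
  <var ID="V2" name="Year"><catgry><catValu>1960</catValu></catgry>
    <catgry><catValu>1970</catValu></catgry></var>
  <var ID="V3" name="Sex"><catgry><catValu>1</catValu></catgry>
    <catgry><catValu>2</catValu></catgry></var>
  <var ID="V4" name="Lifeexpectancy"/>
  <nCube ID="C1" dmnsQnty="3">
    <dmns varRef="V1"/><dmns varRef="V2"/><dmns varRef="V3"/>
    <measure varRef="V4"/>
  </nCube>
</dataDscr>
"""

# Proposed (hypothetical) data section; 1985 is an intentional error.
DDI_DATA = """
<ddidata>
  <data format="multidim" cubeid="C1">
    <dataitem>
      <dim varRef="V1" catRef="1"/><dim varRef="V2" catRef="1960"/>
      <dim varRef="V3" catRef="1"/><obs varRef="V4">72</obs>
    </dataitem>
    <dataitem>
      <dim varRef="V1" catRef="1"/><dim varRef="V2" catRef="1985"/>
      <dim varRef="V3" catRef="2"/><obs varRef="V4">75</obs>
    </dataitem>
  </data>
</ddidata>
"""


def validate(data_dscr_xml, ddi_data_xml):
    """Return a list of messages for cells that do not fit the cube logic."""
    dscr = ET.fromstring(data_dscr_xml)
    # Declared category values per variable ID.
    categories = {
        var.get("ID"): {c.findtext("catValu") for c in var.findall("catgry")}
        for var in dscr.findall("var")
    }
    # Dimension and measure variables per cube ID.
    cubes = {
        cube.get("ID"): (
            [d.get("varRef") for d in cube.findall("dmns")],
            {m.get("varRef") for m in cube.findall("measure")},
        )
        for cube in dscr.findall("nCube")
    }
    errors = []
    for data in ET.fromstring(ddi_data_xml).findall("data"):
        cube_id = data.get("cubeid")
        if cube_id not in cubes:
            errors.append(f"unknown cube {cube_id}")
            continue
        dims, measures = cubes[cube_id]
        for n, item in enumerate(data.findall("dataitem"), start=1):
            coords = {d.get("varRef"): d.get("catRef") for d in item.findall("dim")}
            if sorted(coords) != sorted(dims):
                errors.append(f"dataitem {n}: dimensions do not match cube {cube_id}")
            for var, cat in coords.items():
                if cat not in categories.get(var, set()):
                    errors.append(f"dataitem {n}: {var} has no category {cat}")
            for obs in item.findall("obs"):
                if obs.get("varRef") not in measures:
                    errors.append(f"dataitem {n}: {obs.get('varRef')} is not a measure")
    return errors


ERRORS = validate(DATA_DSCR, DDI_DATA)
```

The same checks would apply if the data lived in a separate file, as long
as that file can name the cube and the DDI instance it belongs to.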
At 15:47 25.08.2005 -0400, Sanda Ionescu wrote:
>Hi, all.
>I think Jostein's example could be described with an adaptation of SDMX
>generic sample markup, as sent by J:
><generic:Series>
> <generic:SeriesKey>
> <generic:Value concept="FREQ" value="A"/>
> <generic:Value concept="REGION_ETH" value="SOM"/>
> <generic:Value concept="GENDER" value="M"/>
> <generic:Value concept="AGE_GRP" value="0_4"/>
> </generic:SeriesKey>
> <generic:Attributes>
> <generic:Value concept="TIME_FORMAT" value="P1Y"/>
> <generic:Value concept="UNIT_MULT" value="6"/>
> </generic:Attributes>
> <generic:Obs>
> <generic:Time>2005</generic:Time>
> <generic:ObsValue value="2.314327"/>
> <generic:Attributes>
> <generic:Value concept="OBS_STATUS" value="A"/>
> </generic:Attributes>
> </generic:Obs>
> </generic:Series>
>
>--- It seems to me this provides for including dimension names and
>values for each cell, and the actual data (measure), as well as
>additional information about the measure.
>(I hope I'm not misreading anything :-)
>It is less clear to me how/where we fit in the cell coordinates, or how
>we link to them?
>Sanda.
>
>
>At 09:58 AM 8/25/2005, Jostein Ryssevik wrote:
>>Hi,
>>
>>Regarding a physical storage/exchange format for aggregated data files,
>>I am not totally convinced that using the locMap element represents the
>>ideal solution. A fairly standard way to exchange this type of data is a
>>delimited file with one record per table cell. The first columns in this
>>file provide the location of the cell on all the table dimensions (one
>>column per dimension). The last column holds the cell value (that is,
>>the value of the measure variable located in that cell). For
>>multi-measure cubes (they do exist, believe me), you will simply have
>>more than one cell-value column.
>>
>>So a simple example with a geography, time, and gender dimension (and a
>>population measure) might look something like this:
>>
>>101, 2000, 1, 123456
>>101, 2000, 2, 154327
>>101, 2001, 1, 365427
>>etc.
>>
>>The first record holds the population number for geography 101, year
>>2000, and gender 1, etc. (entries being category values on the dimension
>>variables).
>>
>>To provide an XML structure that:
>>
>>1) links the data file to a specific nCube description in a specific DDI
>>instance
>>2) links the coordinates of the cell to the relevant variable and
>>category descriptions in the DDI file
>>3) links the cell values to the relevant measure variables in the
>>DDI file,
>>
>>should all be fairly simple and straightforward.
>>
>>The structure will allow validation of data against the more detailed
>>descriptions in the DDI file.
>>It could potentially also support hinging of tables across nCubes (at
>>least for nCubes enclosed within a single DDI file).
>>It will also allow the data file to live outside the DDI file. I do not
>>think that we should provide a structure that forces the data to be
>>located within the DDI instance (as would be the case with a solution
>>based on locMap).
>>By doing something like this we are also structuring the data in a fairly
>>conventional way that could easily be understood by a variety of software
>>that needs to read this type of data.
>>As far as I can see, this approach is also more or less in line with the
>>SDMX data exchange format.
>>
>>Jostein
>>
>>
>>
>>
>>
>>At 08:45 25.08.2005 -0400, Mary Vardigan wrote:
>>>Thanks much, Kate. Responses below. --Mary and Sanda
>>>
>>>At 10:29 AM 8/24/2005, Katherine McNeill-Harman wrote:
>>>>At the end of yesterday's phone conversation, I mentioned to J that I
>>>>thought that we had the least documentation on what changes/structures
>>>>to suggest for aggregate data (i.e. no single spreadsheet from which we
>>>>were working). So he's going to try to compile something but, as he
>>>>said in his other email, we need to pool our thoughts on this. So I'm
>>>>starting the ball rolling.
>>>>
>>>>Following are the goals for aggregate data we'd outlined in Edinburgh;
>>>>what changes have we agreed upon that will accomplish these?
>>>>- Accommodate data files in formats w/ integrated data and metadata
>>>>(e.g. Excel files) self documenting.
>>>
>>>We agree that this is a worthy goal and suggest that data values be
>>>incorporated into the cell description, which is Data Item within
>>>LocMap, just where Wendy placed the physical table cell description.
>>>This sets up two options: (1) describing the position of the data
>>>value in a separate rectangular file (already in the spec), and (2) the
>>>position of the row and column, plus the data value, to describe a
>>>self-documenting table. Wendy, you mentioned that defining row and
>>>column is not enough to specify a physical table. Can you and J and
>>>Jostein perhaps come up with all the elements needed?
>>>
>>>>- Evaluate broad utility of nCubes
>>>
>>>Consensus seemed to be that the logical description of the aggregate
>>>data nCubes is fine. What we need to improve is the physical
>>>description, which we addressed above.
>>>
>>>>- Need ability to describe method of aggregation
>>>
>>>We have this -- an existing attribute called aggMeth.
>>>
>>>>- Need of additional tags to describe aggregate data (not nCubes)
>>>
>>>Not sure what this means. In SDMX there are attributes that J mentioned
>>>in the last call (information about the measurement, like source and
>>>observation status -- projection, actual count, etc.). Adding this
>>>information could help to make the nCubes more robust and make the DDI
>>>more compatible with SDMX.
>>>
>>>>- Review tag names
>>>
>>>This is important and we haven't done it yet. Sanda and I will try to
>>>make a stab at this next week and we can all review. For example,
>>>"cohort" is not exactly the right term for how it is used.
>>>
>>>>- Role of modules for different kinds of data
>>>
>>>Not sure what is meant by this. Have we covered the two main
>>>possibilities through 1 and 2 above?
>>>
>>>>- Align to SDMX
>>>
>>>J, we need your help in determining what is needed here. If we include
>>>the values and add the attributes you mentioned, are there other things
>>>we need to do?
>>>
>>>
>>>>I've been looking over my notes on our changes/proposals; here is
>>>>what's been said in principle, but again, we'll want to document this
>>>>and consider how to accomplish it (I put the date when I have it
>>>>discussed):
>>>>
>>>>- DDI is missing a way to describe the physical structure of a
>>>>spreadsheet; need physical description for rows/columns/layers to say
>>>>how they relate to each other; this would enable machine-actionable
>>>>collapsing of rows or columns (e.g. collapsing of age groups) and
>>>>creation of subtotals and totals; it should also accommodate various
>>>>levels and irregular nested categories, and be able to identify the
>>>>lowest level (8/9)
>>>
>>>Not sure what layers are.
>>>
>>>>- (8/9)
>>>>- need to be able to mark up and represent existing tables (e.g. from
>>>>print volumes) (8/16)
>>>>- enable creation of a single file containing both data and metadata;
>>>>this format would be optional and could be applied when appropriate
>>>>(8/16)
>>>>- unlike SDMX, enable a single file that contains both the data and the
>>>>structure (8/23)
>>>>- be able to apply attribute information at all levels (from cells on
>>>>up); could add to n-cube in the measure element a sub-element that
>>>>defines attributes that can be attached to any level; provide a
>>>>structure by which authors could define these. However, it's not the
>>>>case as with other 3.0 features that items at a lower level override
>>>>things at a higher level; therefore, the structure will need to be such
>>>>that it's clear that attributes can be defined only at one chosen level
>>>>(i.e. can't have conflicting attributes at different levels). (8/23)
>>>
>>>J, please help us determine the level at which the SDMX type attributes
>>>should apply.
>>>
>>>>- ability to locate the desired cell within the cube (8/23)
>>>>- hinging is important, yet may be addressed by comparative data group;
>>>>SRG liaison will check (8/23)
>>>
>>>Do layers relate to hinging? Hinging is possible now but only within one
>>>DDI instance. Between two instances, we have to establish comparability
>>>at the variable level.
>>>
>>>One other point, related to Wendy's earlier message: We are still
>>>confused when the discussion moves to 2- versus 3-dimensional
>>>structures. We need examples. Is Census data format 2 or 3 dimensional?
>>>How about an Excel spreadsheet? What are the differences in
>>>dimensionality?
>>>
>>>
>>>
>>>>I'd ask others to help add to and clarify these. In addition, many of
>>>>the above I just have articulated as goals, so I'm not clear if we've
>>>>yet figured out how to accomplish these.
>>>>
>>>>Kate
>>>>
>>>>___________________________________________
>>>>Katherine McNeill-Harman
>>>>Data Services Librarian
>>>>Dewey Library for Management and Social Sciences Massachusetts
>>>>Institute of Technology 77 Massachusetts Avenue, E53-100
>>>>Cambridge, MA 02139
>>>>mcneillh at mit.edu
>>>>617-253-0787
>>>>_______________________________________________
>>>>DDI-ADG mailing list
>>>>DDI-ADG at icpsr.umich.edu
>>>>http://www.icpsr.umich.edu/mailman/listinfo/ddi-adg
>>>
>>>Mary Vardigan
>>>Assistant Director
>>>Inter-university Consortium for Political and Social Research (ICPSR)
>>>University of Michigan P.O. Box 1248, Ann Arbor, MI 48106-1248
>>>Phone: 734-615-7908
>>>Fax: 734-647-8200
>>>www.icpsr.umich.edu
>>
>>
>
>
>
>Sanda Ionescu,
>Research Associate
>Inter-university Consortium for Political and Social Research (ICPSR)
>The University of Michigan P.O. Box 1248
>Ann Arbor, MI 48106
>
>Phone: (734) 615-7890
>Fax: (734) 615-7890
> (734) 647-8200
>