[DDI-ADG] Fwd: Use Case No.1
Ron Nakao
ronbo at stanford.edu
Mon Jun 27 23:43:42 EDT 2005
sanda, mary,
thanks for getting the ball rolling. this concrete example does a great job
at testing the appropriateness and flexibility of the geographic
specification. based on this example, my thought is that the "tags" must
define the "largest geographic extent," as well as the option of presenting
different geographic hierarchies within that largest extent (i.e., a tree
with nodes and branches). (my apologies if you discussed and rejected this
idea at last week's CC that i missed.)
each specified node of any specified branch of the hierarchical tree would
have an associated "format/geotype" (required) (e.g., numeric, polygon,
lat/long, text, other?), authority reference (optional; specified only at the
furthest node of a branch and would include all nodes from the root node), and
"textnote" (optional). the smallest geographic unit for any branch would be
the last node specified. most data would probably have one branch with one
or two nodes, but more geographically complex data could be accommodated.
for example:
north america => country
=> province => ...
=> ICPSR region => state
=> county => tract => ...
=> 5-digit ZIP => 9-digit ZIP
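a rough python sketch of the tree idea, just to make it concrete. the nesting below is only one possible reading of the example above, and the names (GeoNode, branches, smallest_units) and the geotype given to each node are made up for illustration; none of this is part of the DDI spec:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GeoNode:
    name: str
    geotype: str                     # required "format/geotype"
    authority: Optional[str] = None  # optional, only meaningful on the last node
    textnote: Optional[str] = None   # optional
    children: List["GeoNode"] = field(default_factory=list)


def branches(node, prefix=()):
    """Yield each root-to-leaf path as a tuple of node names."""
    path = prefix + (node.name,)
    if not node.children:
        yield path
    for child in node.children:
        yield from branches(child, path)


def smallest_units(root):
    """The smallest geographic unit of a branch is its last node."""
    return [path[-1] for path in branches(root)]


# one possible reading of the example; the geotype values are guesses
tree = GeoNode("north america", "text", children=[
    GeoNode("country", "text", children=[
        GeoNode("province", "text"),
        GeoNode("ICPSR region", "numeric", children=[
            GeoNode("state", "numeric", children=[
                GeoNode("county", "numeric", children=[
                    GeoNode("tract", "numeric"),
                ]),
                GeoNode("5-digit ZIP", "numeric", children=[
                    GeoNode("9-digit ZIP", "numeric"),
                ]),
            ]),
        ]),
    ]),
])

print(smallest_units(tree))
```

under this reading there are three branches, and the last node of each one is the smallest unit available on that branch.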
should the polygon(s) be handled as a separate "branch" from text/numeric
branches, or should they be mixed? for example, if the polygon is a
spatial representation of a "textual" geography (e.g., state boundaries),
should that state polygon be connected to the "textual state" node
as an optional "polygon" somehow?
a proposed geo-hierarchy specification:
geographicCover (1..1)
  => format (1..1) <numeric, text, polygon, lat/long, ??>
  => text (0..1)
  => Polygon (0..1)
       => polyName (1..1)
       => polyCode (0..1)
       => Point (4..n)
            => gringLat (1..1)
            => gringLong (1..1)
  => subGeography (0..n)
       => format (1..1) <numeric, text, polygon, lat/long, ??>
       => text (0..1)
       => Polygon (0..1)
            => polyName (1..1)
            => polyCode (0..1)
            => Point (4..n)
                 => gringLat (1..1)
                 => gringLong (1..1)
       ...........> <nested subGeographies continue, if needed.>
       ............> <last node on a branch>
       => subGeography (0..n)
            => format (1..1) <numeric, text, polygon, lat/long, ??>
            => authority (0..1)
            => text (0..1)
            => Polygon (0..1)
                 => polyName (1..1)
                 => polyCode (0..1)
                 => Point (4..n)
                      => gringLat (1..1)
                      => gringLong (1..1)
geoBndBox (0..1)
  => westBL (1..1)
  => eastBL (1..1)
  => southBL (1..1)
  => northBL (1..1)
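to check that these cardinalities hang together, here is a minimal python sketch of the same structure. the class and function names are my own invention, one class stands in for both geographicCover and subGeography, and the "??" format is left out. the 4..n rule for Point and the "authority only on the last node of a branch" rule come from the description above. it is an illustration, not a real DDI schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

FORMATS = {"numeric", "text", "polygon", "lat/long"}


@dataclass
class Point:
    gring_lat: float   # gringLat (1..1)
    gring_long: float  # gringLong (1..1)


@dataclass
class Polygon:
    poly_name: str                   # polyName (1..1)
    points: List[Point]              # Point (4..n)
    poly_code: Optional[str] = None  # polyCode (0..1)


@dataclass
class Geography:
    """Stands in for both geographicCover and subGeography."""
    format: str                        # format (1..1)
    text: Optional[str] = None         # text (0..1)
    polygon: Optional[Polygon] = None  # Polygon (0..1)
    authority: Optional[str] = None    # authority (0..1), last node only
    sub_geographies: List["Geography"] = field(default_factory=list)  # (0..n)


def problems(node, path="geographicCover"):
    """Return a list of human-readable cardinality violations."""
    found = []
    if node.format not in FORMATS:
        found.append(f"{path}: unknown format {node.format!r}")
    if node.polygon is not None and len(node.polygon.points) < 4:
        found.append(f"{path}: Polygon needs at least 4 Points")
    if node.authority is not None and node.sub_geographies:
        found.append(f"{path}: authority allowed only on the last node of a branch")
    for i, sub in enumerate(node.sub_geographies):
        found.extend(problems(sub, f"{path}/subGeography[{i}]"))
    return found


# country => state => county, with a placeholder authority on the last node
county = Geography("numeric", text="county", authority="hypothetical authority")
state = Geography("numeric", text="state", sub_geographies=[county])
cover = Geography("text", text="United States", sub_geographies=[state])
print(problems(cover))
```

because the check recurses through sub_geographies, the nesting can go as deep as the data needs, and a text node can carry an optional Polygon alongside it.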
two major questions:
being unfamiliar with the polygon specs, did i screw up their specification,
or can they be "inter-mixed" as i proposed?
can you specify an n-deep nested hierarchy? if not, then as far as all of the
above... "never mind"
ron
Quoting Katherine McNeill-Harman <mcneillh at mit.edu>:
> I think you make all very good points (most of your suggestions I agree
> with) and will just add a couple of comments before tomorrow's meeting (and
> given this, I didn't pull together a different specific study for us to
> discuss). Just a couple of comments within:
>
> At 03:34 PM 6/24/2005 -0400, Sanda Ionescu wrote:
>
>
>
> >>Hi, all.
> >>
> >>Mary and I thought that looking at some concrete examples might give us a
> >>better insight into the use of available geographic tags, and also reveal
> >>any gaps that we might want to try to fill in the specification.
> >>
> >>Starting from Kate's first set of questions
> >>(1. Searching for variable information within a specific geographic
> >>coverage area at a specific level of geographic unit (yet still
> >>aggregated), e.g.
> >>- data on voter registration by precinct within the United States
> >>- unemployment rate by county within the United States)
> >>
> >>
> >>we picked up a codebook that documents just this type of data. It's ICPSR
> >>study no. 9405 and I'm sending it as an attachment to this message.
> >>
> >>At variable level:
> >>As you will see, there are 3 geographic variables in this dataset - state
> >>code (V4) county name (V5) and county code (V6).
> >>Although there is no region variable, analysis by region is also
> >>possible, because the code for region is embedded in the state code (see
> >>V4). This information is included in a free text note to the variable -
> >>should we attempt to capture this kind of information in such a way that
> >>it would become machine-readable? if yes, how do we go about it?
> >>Same question for V6 - in the DDI, we would probably include just the
> >>county code in the category values tags, because that's what we have in
> >>the data. However, these codes can only be used in conjunction with the
> >>codes in V4. Do we want to code this information in any way that will
> >>assist "machine actionability"?
i think this is a bit too much value-added to the mark-up. however, if
someone wants to parse these new variables (recodes) from the existing
variables, then they can do it themselves.
one problem with this study is that the values for county are not unique. one
would have to enter the value labels by assigning them to a new recoded
r_county that concatenated state with county...
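for instance, a recode along these lines would do it. the code values and the 3-digit county width are invented for illustration (they are not the actual codes in study 9405), and r_county is just the hypothetical name used above:

```python
def r_county(state_code, county_code, county_width=3):
    """Concatenate state and county codes into one value unique across states.

    The county code is zero-padded so that, e.g., state 2 / county 11 and
    state 21 / county 1 cannot collapse into the same string.
    """
    return f"{state_code}{county_code:0{county_width}d}"


# two made-up records that share county code 1 but sit in different states
records = [(21, 1), (43, 1)]
recoded = [r_county(state, county) for state, county in records]
print(recoded)
```

the value labels for county would then be attached to the recoded values rather than to the original county codes.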
> >>
> >>At study level (more closely related to search and discovery, and Kate's
> >>questions):
> >>Trying to mark up the info with the tags provided in the V 3.0 spreadsheet -
> >>1. from what I see here, geographicCover becomes a "container element"
> >>with no PCDATA allowed (?) and the actual textual information is now
> >>supposed to go in its child (?) <text>, that replaces the <geogCover>
> >>from V 2.0?
> >>That's okay, I suppose. So, then, we would have:
> >><geographicCover><text>United States</text></geographicCover>.
> >>Farther down, however, it gets more confusing (at least to me it does).
> >>Looking at the way elements are structured, it appears that countryCode,
> >>subCountryCode, geographicUnit, geographicKind and geoBndBox are all
> >>children of geographicCover on a par with <text>. If I'm reading the
> >>spreadsheet correctly, this kind of structure is counterintuitive to the
> >>new definition of geographicCover, which is "largest geographic extent".
> >>The elements' structure only provides for a list of various levels of
> >>coverage without indicating a hierarchy. If, for instance, we find
> >>geoKind defined as line, or point, or polygon, how will we know whether
> >>this refers to the geographicUnit (lowest level of coverage) or <text>
> >>(largest extent) or <countryCode>, or what?
> >>
> >><countryCode> - will it be an element spelling out the name of the
> >>country(ies) covered with an attribute containing the ISO code? this is
> >>not clear in the spreadsheet. We probably need both name and code (If the
> >><text> under geogCover were "Europe" we would need to list all countries
> >>included both by name and code).
> >>Going back to our example above, where our largest coverage is one
> >>country, would we repeat the name of the country, and then add its ISO code?
> >><countryCode ISOcode="us">United States</countryCode>
> >>
> >><subCountryCode> - this seems to be intended for lower-than-country
> >>levels of coverage. How do we actually use it? If we have more than one
> >>subcountry level, do we just repeat the element for each level? - again,
> >>this does not allow for establishing a hierarchy among the said levels
> >>(maybe we should enable some kind of a nesting structure here, going down
> >>from largest extent to smallest unit (geographicUnit)?)
> >>In the attached study, we have three subcountry levels - region, state,
> >>and county. Do we mention each within its own <subCountryCode> element?
> >
> >><subCountryCode>region</subCountryCode>
> >><subCountryCode>state</subCountryCode>
> >><subCountryCode>county</subCountryCode>
> >
> >>The presence of the word "code" in the name of the element suggests that
> >>we would have actual codes here - but that seems unlikely? as the codes
> >>will be listed in the variables' description or in an external document.
> >>So from here we would probably only have a link to either the geographic
> >>variable(s), or the said document. If this seems right, we should enable
> >>such a link. And maybe rename this element as <subCountryLevel>?
>
> This makes sense to me, b/c one wouldn't indicate here the name/code of the
> county covered, but just the fact that information is available by county.
>
> >>Not sure if "authority" is meant to be an attribute - probably yes. In
> >>our example, it would read "ICPSR". But if we're only listing levels
> >>here, it should accompany the codes, wherever they are - or qualify the
> >>link to the codes.
>
> For authority, I thought that was the body that developed and maintained
> the codes that were used in this system (not the author of the study) (I
> assume you were using ICPSR as an author?)
>
> >>It also seems appropriate to point from the levels listed here to the
> >>variables that cover them - region and state to V4, county to V5 and V6.
> >>This is not the same thing as linking to the codes; the codes may not be
> >>embedded in the variable description, and there is not always a
> >>one-to-one match between levels, codes, and geographic variables.
> >>
> >>Finally, in our example, <geographicUnit> would be "county".
>
> I think I'm a little uncertain about the relationship of <subCountryCode>
> to <geographicUnit>. 2.0 defines <geographicUnit> as "lowest level of
> geographic aggregation covered by the data," which you have done with
> the example. I agree with your point that we should provide a set of
> elements that can identify the different levels of geography at which info.
> is available (and have them be hierarchical in a machine-readable manner,
> as you suggest). And I like your suggestion of <subCountryLevel>. That
> being said, do we need a separate <geographicUnit> element? If we were to
> adopt some sort of more structured <subCountryLevel> set of elements,
> presumably there would be one for county; could we somehow flag it
> to indicate that it's the lowest level available? It's not a lot of
> repetition, and I'm not sure if also having it be separate would be easier
> for systems to interpret it.
>
>
> >>I won't go into the actual mapping with this example.
> >>
> >>I will raise one last question that's not related to our example. Some
> >>studies may cover regions that are above country level (like Eastern
> >>Europe, for instance) but smaller than total coverage (or largest
> >>extent). So far, we don't seem to have a way of accounting for this kind
> >>of coverage.
>
> Yes, and one thing that's come up in conversations is how to describe
> coverage of an area when not all sections in an area are covered (e.g. if
> it's info on Eastern Europe, but some countries in that region were not
> covered).
>
>
> >>One last word about the exercise above: it obviously raises more
> >>questions than it answers, and that's precisely why we hope it will
> >>be a good starting point for our next discussion(s).
> >>
> >>Sanda and Mary
> >>
> >>
> >>
> >>Sanda Ionescu,
> >>Research Associate
> >>Inter-university Consortium for Political and Social Research (ICPSR)
> >>The University of Michigan
> >>P.O. Box 1248
> >>Ann Arbor, MI 48106
> >>
> >>Phone: (734) 615-7890
> >>Fax: (734) 615-7890
> >> (734) 647-8200
> >_______________________________________________
> >DDI-ADG mailing list
> >DDI-ADG at icpsr.umich.edu
> >http://www.icpsr.umich.edu/mailman/listinfo/ddi-adg
>
> ___________________________________________
> Katherine McNeill-Harman
> Data Services Librarian
> Dewey Library for Management and Social Sciences
> Massachusetts Institute of Technology
> 77 Massachusetts Avenue, E53-100
> Cambridge, MA 02139
> mcneillh at mit.edu
> 617-253-0787
>
Ron Nakao
Social Science Data and Software (SSDS)
Social Sciences Resource Center
Green Library
Stanford University
Stanford, CA 94305-6067
(650) 725-1062
http://ssds.stanford.edu/