[DDI-SRG] Data format summary
Joachim Wackerow
joachim.wackerow at gesis.org
Wed Jan 9 16:49:43 EST 2008
This is a summary of several emails and some email discussion (see the
links at the bottom). It concerns input formats for reading data from
storage such as files.
In December we agreed that Arofan would make a first attempt in XML
Schema based on this information and the discussion in the conference
call. I will then look into it again and we can discuss further steps.
No general open solution for dealing with formats exists (for details
see my email from 2007-12-06), so we must build our own.
A pragmatic and evolutionary approach would be to include a core set of
common data formats and data types in DDI lists (DataTypeCode,
DataFormatCode). In addition domain-specific lists can be built (like
the SAS list for data formats, ANSI-SQL list for data types, etc.). Then
a mapping mechanism would be necessary to map formats/types from
different domains. Ideally the mapping would be done (like the ISO/IEC
11404 approach) from a domain-specific definition to a general-purpose
definition (like the DDI lists). By learning from the domain-specific
lists, the general-purpose DDI lists can be expanded in future DDI
versions.
The advantage of this approach is that only the basic lists and the
mapping mechanism must be provided now. Additional domain-specific lists
and related mappings can be produced later. (The disadvantage is that
data definitions in DDI instances can become too domain-specific, which
makes general-purpose applications more complex.)
The mapping can be done with the approach used in the comparison module:
a domain-specific format item is mapped to the general-purpose DDI
format item.
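A minimal sketch of such a mapping, in Python, might look like the
following. The general-purpose codes on the right-hand side are
hypothetical and are not defined by DDI; the SAS names on the left are
only a small sample of a domain-specific list.

```python
# Hypothetical domain-specific list (a few SAS informat names) mapped to
# hypothetical general-purpose format codes. The codes are illustrative
# assumptions, not part of any DDI list.
SAS_TO_GENERAL = {
    "COMMA": "number-with-digit-grouping",
    "$CHAR": "character",
    "DATE": "date",
}

# One mapping table per domain; further domains can be added later.
MAPPINGS = {"SAS": SAS_TO_GENERAL}


def to_general_format(domain, name):
    """Map a domain-specific format name to a general-purpose code.

    Returns None when no mapping is known, so the caller can decide
    how to handle an unmapped format.
    """
    return MAPPINGS.get(domain, {}).get(name.upper())
```

Keeping one table per domain means new domain-specific lists can be
added without touching the general-purpose list.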
The input format should be defined in
/PhysicalDataProduct/RecordLayout/DataItem/PhysicalLocation
The DataType is the "recommended" data type for processing the data
item. It is optional and only necessary in specific cases. (A use case
would be defining a recommended data type to optimize processing in
terms of memory use. This would be based not only on the input format
but also on the data actually present.) It should use a restricted set
of the XML Schema data types. See:
http://www.w3.org/TR/xmlschema-2/#built-in-datatypes
Any type below anySimpleType is a candidate. I think it would make
sense to omit the following:
QName, NOTATION, any type below string
The list should be implemented as a controlled vocabulary. This way it
could be extended with very specific types in another list if necessary.
Delimiter, StartPosition, Width, DecimalPositions, DecimalSeparator,
GroupingSeparator are fine.
PROPOSAL: Remove EndPosition. The end position can be computed from
StartPosition and Width. If both Width and EndPosition are allowed,
they can contradict each other, which is error prone.
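The computation is simple enough to sketch. The Python below assumes
that positions are 1-based and inclusive (the first column of a record
is position 1); that convention is an assumption and would have to be
fixed in the documentation.

```python
def end_position(start_position, width):
    """Last column of a field, assuming 1-based, inclusive positions."""
    if start_position < 1 or width < 1:
        raise ValueError("start_position and width must be >= 1")
    return start_position + width - 1


def extract_field(record, start_position, width):
    """Cut a fixed-width field out of one record line."""
    begin = start_position - 1  # convert to Python's 0-based index
    return record[begin:begin + width]
```

For example, a field with StartPosition 5 and Width 3 occupies columns
5 to 7 under this convention.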
PROPOSAL: Rename GroupingSeparator to DigitGroupSeparator, as this term
is more self-explanatory.
We can learn from how SAS defines the informats.
SAS informats have the following form:
<$>informat<w>.<d>
where
$ - indicates a character informat; its absence indicates a numeric
informat.
informat - names the informat.
w - specifies the informat width, which for most informats is the number
of columns in the input data.
d - specifies an optional decimal scaling factor in the numeric
informats. SAS divides the input data by 10 to the power of d.
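To make the structure concrete, here is a rough Python sketch that
splits an informat string into these four parts. The regular expression
is my own approximation of the form shown above, not an official SAS
grammar, and the names Informat and parse_informat are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Informat(NamedTuple):
    is_character: bool       # True when the leading "$" is present
    name: str                # informat name, "" for the plain w.d form
    width: Optional[int]     # w
    decimals: Optional[int]  # d


# Approximation of <$>informat<w>.<d>. The name must end in a letter or
# underscore so that trailing digits are read as the width.
_PATTERN = re.compile(
    r"(?P<char>\$)?"
    r"(?P<name>[A-Za-z_](?:[A-Za-z0-9_]*[A-Za-z_])?)?"
    r"(?P<w>\d+)?\.(?P<d>\d+)?"
)


def parse_informat(text):
    """Split an informat string such as 'COMMA10.2' into its parts."""
    match = _PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a recognizable informat: {text!r}")
    return Informat(
        is_character=match.group("char") is not None,
        name=(match.group("name") or "").upper(),
        width=int(match.group("w")) if match.group("w") else None,
        decimals=int(match.group("d")) if match.group("d") else None,
    )
```

The four fields line up with the elements discussed here: the "$"
indicator, the format name, Width and DecimalPositions.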
Based on this structure we would need, in addition to Width (w) and
DecimalPositions (d), a name for the informat and a type for the input
format. The name can be represented by a new element DataFormat.
DataFormat should take its values from a controlled list of formats:
either the list of DDI formats or a list of domain-specific formats.
OPEN ISSUE: DDI core list of data formats.
Is there a general approach to date/time representation besides ISO
(and Unix date)?
PROPOSAL: Start with most common formats for integer, real, character
data (in character representation).
The indicator for the type ($) of the input format in SAS is actually a
boolean. It can be represented by a new element FormatType
(numeric|character). I'm not sure whether it is really necessary,
because this information is already included in the input format
definition.
DISCUSSION item: What is a use case where a distinction between numeric
and character data is important without using the formats?
Regarding the language-specific portion of dates in data files (not in
the DDI instance), like 2007-Dec-13, a language attribute should be
introduced for data formats. This attribute shouldn't be "xml:lang".
"xml:lang" has another meaning: it describes the language of XML
content, not the language of a data item in a data file which is
described by a DDI instance.
PROPOSAL: attribute "languageOfData" for DataFormat; allowed values are
the ISO two-letter codes.
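As an illustration of why the language of the data matters, the
following Python sketch parses a date like 2007-Dec-13 with a month
table chosen by a two-letter language code. The tables and the function
name are hypothetical, and the German abbreviations are only one
plausible spelling.

```python
import datetime

# Hypothetical month-abbreviation tables keyed by two-letter language
# code. Real data may use other abbreviations for the same language.
MONTHS = {
    "en": ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
    "de": ["Jan", "Feb", "Mär", "Apr", "Mai", "Jun",
           "Jul", "Aug", "Sep", "Okt", "Nov", "Dez"],
}


def parse_date(value, language_of_data):
    """Parse 'YYYY-Mon-DD' using the month names of the given language."""
    year, month_name, day = value.split("-")
    month = MONTHS[language_of_data].index(month_name) + 1
    return datetime.date(int(year), month, int(day))
```

Without the language code, "2007-Dez-13" cannot be read with the
English table; index() raises ValueError in that case.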
Regarding the locale-specific formats of data, like decimal and
thousands separators, currency symbols, etc., a locale attribute could
be introduced. But the legitimate argument against it is that this kind
of description is not exact enough.
Use case 1: Which currency is defined by a locale of "de"? Deutsche Mark
or Euro?
Use case 2: Which decimal separator is meant by a locale of "de"?
The common format in Germany is "3 423,67" (the equivalent in the USA
is "3,423.67"). But in some publications (like scientific ones) the US
format is used. Software like Excel writes CSV files with a decimal
separator that depends on the locale in use; but is the user aware of
this?
The locale definition seems to be error prone and not exact enough to
describe data where common local rules are often, but not always, used.
The decimal separator, for example, should be described explicitly. It
is not a good idea to rely on common rules that depend on a locale.
Only in a controlled environment can it make sense to use the locale as
a reliable indicator of the format used.
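A small Python sketch shows what reading a number looks like when the
separators are stated explicitly instead of being derived from a
locale. The function name and its parameters are hypothetical; they
simply mirror the DecimalSeparator and grouping separator elements.

```python
from decimal import Decimal


def parse_number(text, decimal_separator, digit_group_separator=None):
    """Read a number using explicitly declared separators."""
    cleaned = text.strip()
    if digit_group_separator:
        # Grouping characters carry no value; drop them first.
        cleaned = cleaned.replace(digit_group_separator, "")
    if decimal_separator != ".":
        cleaned = cleaned.replace(decimal_separator, ".")
    return Decimal(cleaned)
```

Both "3 423,67" with separators "," and " " and "3,423.67" with
separators "." and "," yield the same value, with no locale involved.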
DISCUSSION: Does it make sense to introduce the locale attribute for
controlled conditions? What is the use case?
If it is introduced, the documentation must describe the risks
thoroughly.
More details can be found in these emails:
attachment of
http://www.icpsr.umich.edu/pipermail/ddi-srg/2007-December/002653.html
references to common approaches and SPSS approach
http://www.icpsr.umich.edu/pipermail/ddi-srg/2007-December/002658.html
language issues
http://www.icpsr.umich.edu/pipermail/ddi-srg/2007-December/002677.html
Achim