[DDI-SRG] Data format summary

Joachim Wackerow joachim.wackerow at gesis.org
Wed Jan 9 16:49:43 EST 2008


This is a summary of several emails and some discussion by email (see 
the links at the bottom). It concerns input formats for reading data 
from storage such as files.

In December we agreed that Arofan would produce a first draft in XML 
Schema based on this information and the discussion in the conference 
call. I will then look into it again and we can discuss further steps.


There is no general open solution for dealing with formats (for details 
see my email from 2007-12-06). So we must build our own.

A pragmatic and evolutionary approach would be to include a core set of 
common data formats and data types in DDI lists (DataTypeCode, 
DataFormatCode). In addition domain-specific lists can be built (like 
the SAS list for data formats, ANSI-SQL list for data types, etc.). Then 
a mapping mechanism would be necessary to map formats/types from 
different domains. Ideally the mapping would be done (like the ISO/IEC 
11404 approach) from a domain-specific definition to a general-purpose 
definition (like the DDI lists). Learning from the domain-specific lists 
the general-purpose DDI lists can be expanded in future DDI versions.

The advantage of this approach is that only the basic lists and the 
mapping mechanism must be provided now. Additional domain-specific lists 
and related mappings can be produced later. (The disadvantage is that 
data definitions in DDI instances can become too domain-specific, which 
makes general-purpose applications more complex.)

The mapping can be done with the approach used in the comparison module: 
a domain-specific format item is mapped to the general-purpose DDI 
format item.
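
As a minimal sketch of this mapping idea (not the comparison module 
itself), a lookup table can map a domain-specific item to a 
general-purpose one. The DDI codes below and the chosen targets are 
hypothetical placeholders, and formats and types are mixed in one table 
only to keep the example short.

```python
# Hypothetical general-purpose DDI codes (NOT an official DDI list).
DDI_ITEMS = {"Integer", "Real", "Character", "Date"}

# (domain, domain-specific item) -> general-purpose DDI item.
# The targets are illustrative assumptions, not agreed mappings.
FORMAT_MAP = {
    ("SAS", "$CHAR"): "Character",
    ("SAS", "COMMA"): "Real",
    ("SAS", "YYMMDD"): "Date",
    ("ANSI-SQL", "INTEGER"): "Integer",
    ("ANSI-SQL", "DECIMAL"): "Real",
    ("ANSI-SQL", "VARCHAR"): "Character",
}


def to_ddi_item(domain, item):
    """Return the general-purpose DDI item, or None if nothing is mapped."""
    return FORMAT_MAP.get((domain, item.upper()))
```

An unmapped item returns None, so an application can fall back to the 
domain-specific definition or report the gap.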


The input format should be defined in
/PhysicalDataProduct/RecordLayout/DataItem/PhysicalLocation


The DataType is the "recommended" data type for processing the data 
item. It is optional and only necessary in specific cases. (One use case 
would be defining a recommended data type to optimize processing in 
terms of memory use. This would be based not only on the input format 
but also on the data actually present.) It should use a restricted set 
of the XML Schema data types. See:
http://www.w3.org/TR/xmlschema-2/#built-in-datatypes
Any type below anySimpleType is a candidate. I think the following 
exceptions would make sense, i.e. omitting:
QName, NOTATION, any type below string
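
To make the resulting list concrete, here is a small sketch that derives 
it from the built-in type hierarchy of XML Schema Part 2 (version 1.0). 
It assumes that string itself stays in the list and only the types 
derived from it are dropped, and it treats the list types NMTOKENS, 
IDREFS and ENTITIES as lying below their item types. The names PARENT, 
is_below and RECOMMENDED are made up for this example.

```python
# Primitive built-in types: direct children of anySimpleType.
PRIMITIVES = [
    "string", "boolean", "decimal", "float", "double", "duration",
    "dateTime", "time", "date", "gYearMonth", "gYear", "gMonthDay",
    "gDay", "gMonth", "hexBinary", "base64Binary", "anyURI",
    "QName", "NOTATION",
]

# Parent of every built-in type in the derivation hierarchy.
PARENT = {name: "anySimpleType" for name in PRIMITIVES}
PARENT.update({
    # types derived (directly or indirectly) from string
    "normalizedString": "string", "token": "normalizedString",
    "language": "token", "NMTOKEN": "token", "NMTOKENS": "NMTOKEN",
    "Name": "token", "NCName": "Name", "ID": "NCName",
    "IDREF": "NCName", "IDREFS": "IDREF",
    "ENTITY": "NCName", "ENTITIES": "ENTITY",
    # types derived (directly or indirectly) from decimal
    "integer": "decimal", "nonPositiveInteger": "integer",
    "negativeInteger": "nonPositiveInteger", "long": "integer",
    "int": "long", "short": "int", "byte": "short",
    "nonNegativeInteger": "integer",
    "unsignedLong": "nonNegativeInteger", "unsignedInt": "unsignedLong",
    "unsignedShort": "unsignedInt", "unsignedByte": "unsignedShort",
    "positiveInteger": "nonNegativeInteger",
})


def is_below(name, ancestor):
    """True if `name` is derived, at any depth, from `ancestor`."""
    parent = PARENT.get(name)
    while parent is not None:
        if parent == ancestor:
            return True
        parent = PARENT.get(parent)
    return False


# Everything below anySimpleType except QName, NOTATION and the
# descendants of string (assumption: string itself is kept).
RECOMMENDED = sorted(
    name for name in PARENT
    if name not in {"QName", "NOTATION"} and not is_below(name, "string")
)
```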

The list should be implemented as a controlled vocabulary. This way it 
can be extended by very specific types in another list if necessary.


Delimiter, StartPosition, Width, DecimalPositions, DecimalSeparator, 
GroupingSeparator are fine.

PROPOSAL: Remove EndPosition. The end position can be computed from 
StartPosition and Width. If EndPosition is kept, both Width and 
EndPosition can be specified and may contradict each other, which is 
error-prone.
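
The computation is a one-liner. The sketch below assumes that 
StartPosition is 1-based and that the end position is inclusive; neither 
convention is fixed in this summary, and the function names are 
illustrative only.

```python
def end_position(start_position, width):
    """Last column of a field, assuming 1-based, inclusive positions."""
    if start_position < 1 or width < 1:
        raise ValueError("start_position and width must be >= 1")
    return start_position + width - 1


def extract(record, start_position, width):
    """Cut one fixed-width field out of a record line."""
    # Python slices are 0-based with an exclusive end, hence the "- 1".
    return record[start_position - 1:end_position(start_position, width)]
```

For example, a field with StartPosition 3 and Width 5 ends at column 7.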

PROPOSAL: Rename GroupingSeparator to DigitGroupSeparator, as this term 
is more self-explanatory.


We can learn from how SAS defines the informats.

SAS informats have the following form:
<$>informat<w>.<d>
where

$ - indicates a character informat; its absence indicates a numeric 
informat.

informat - names the informat.

w - specifies the informat width, which for most informats is the number 
of columns in the input data.

d - specifies an optional decimal scaling factor in the numeric 
informats. SAS divides the input data by 10 to the power of d.
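
The structure above can be sketched as a small parser. This is an 
illustration of the pattern only, not a complete implementation of SAS 
informat syntax; the names Informat, parse_informat and apply_scaling 
are made up for this example. To my understanding SAS ignores d when the 
input already contains a decimal point, and the sketch follows that 
behaviour as an assumption.

```python
import re
from decimal import Decimal
from typing import NamedTuple, Optional


class Informat(NamedTuple):
    is_character: bool        # True if the spec starts with "$"
    name: str                 # informat name, "" for the plain w.d form
    width: Optional[int]      # w
    decimals: Optional[int]   # d


# <$>informat<w>.<d>  (the name is matched lazily so that trailing
# digits before the period are read as the width)
_INFORMAT = re.compile(r"^(\$)?([A-Za-z_]\w*?)?(\d+)?\.(\d+)?$")


def parse_informat(spec):
    match = _INFORMAT.match(spec.strip())
    if match is None:
        raise ValueError(f"not an informat specification: {spec!r}")
    dollar, name, width, decimals = match.groups()
    return Informat(
        is_character=dollar is not None,
        name=(name or "").upper(),
        width=int(width) if width else None,
        decimals=int(decimals) if decimals else None,
    )


def apply_scaling(raw, decimals):
    """Divide by 10**d, unless the raw text already has a decimal point
    (assumption about SAS behaviour, see the lead-in)."""
    value = Decimal(raw.strip())
    if decimals and "." not in raw:
        value = value / (Decimal(10) ** decimals)
    return value
```

With this sketch, "COMMA10.2" yields the name COMMA, width 10 and 
2 decimal positions, and "$CHAR10." is recognized as a character 
informat of width 10.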


Based on this structure we would need, in addition to Width (w) and 
DecimalPositions (d), a name for the informat and a type for the input 
format. The name can be represented by a new element DataFormat. 
DataFormat should take its value from a controlled list of formats: 
either the list of DDI formats or a list of domain-specific formats.

OPEN ISSUE: DDI core list of data formats.
Is there a general approach for date/time representation besides ISO 
(and Unix date)?

PROPOSAL: Start with most common formats for integer, real, character 
data (in character representation).


The indicator for the type ($) of the input format in SAS is actually a 
boolean. It can be represented by a new element FormatType 
(numeric|character). I am not sure whether it is really necessary, 
because this information is already included in the input format 
definition.

DISCUSSION item: What is a use case where a distinction between numeric 
and character data is important without using the formats?


Regarding the language-specific portion of dates in data files (not in 
the DDI instance), like 2007-Dec-13, a language attribute should be 
introduced for data formats. This attribute shouldn't be "xml:lang". 
"xml:lang" has another meaning: it describes the language of XML 
content, not the language of a data item in a data file which is 
described by a DDI instance.

PROPOSAL: attribute "languageOfData" for DataFormat, allowed values are 
the ISO two-letter language codes.
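
A short sketch of why the language of the data matters when reading such 
dates. The month tables, the function name and the fixed 
year-month-day layout are assumptions for this example; a real reader 
would take the layout from the DataFormat.

```python
from datetime import date

# Month abbreviations per two-letter language code (illustrative only).
MONTH_ABBREVIATIONS = {
    "en": ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
    "de": ["Jan", "Feb", "Mär", "Apr", "Mai", "Jun",
           "Jul", "Aug", "Sep", "Okt", "Nov", "Dez"],
}


def parse_date(text, language_of_data):
    """Parse a date like 2007-Dec-13 using the language of the data."""
    year, month_name, day = text.strip().split("-")
    months = MONTH_ABBREVIATIONS[language_of_data]
    month = months.index(month_name) + 1   # ValueError if not a month name
    return date(int(year), month, int(day))
```

The same date is written 2007-Dec-13 in English data and 2007-Dez-13 in 
German data; reading either one with the wrong language code fails 
instead of silently producing a wrong value.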


Regarding the locale-specific formats of data, like decimal and 
thousands separators, currency symbols etc., a locale attribute could 
be introduced. But the legitimate argument against it is that this kind 
of description is not exact enough.
Use case 1: Which currency is defined by a locale of "de"? Deutsche Mark
or Euro?
Use case 2: Which decimal separator is meant by a locale of "de"?
The common definition in Germany is a format like "3 423,67" (the 
equivalent in the USA is "3,423.67"). But in some publications (like 
scientific ones) the US format is used. Software like Excel writes CSV 
files with a decimal separator that depends on the locale in use; but 
is the user aware of this?
The locale definition seems to be error-prone and not exact enough to
describe data where common local rules are often used, but not always.
For example, the decimal separator should be described explicitly. It is 
not a good idea to rely on common rules that depend on a locale.
Only in a controlled environment can it make sense to use the locale as 
a reliable indicator of the format used.
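
A minimal sketch of reading a number from explicitly described 
separators instead of a locale. The function and parameter names are 
illustrative; they mirror DecimalSeparator and the digit group separator 
but are not part of any DDI schema.

```python
from decimal import Decimal


def parse_number(text, decimal_separator, digit_group_separator=""):
    """Parse a number using explicitly declared separators."""
    cleaned = text.strip()
    if digit_group_separator:
        # Remove the grouping characters first ...
        cleaned = cleaned.replace(digit_group_separator, "")
    # ... then normalize the decimal separator to ".".
    cleaned = cleaned.replace(decimal_separator, ".")
    return Decimal(cleaned)
```

The string "3,423" shows the risk of guessing: with "," declared as 
decimal separator it is 3.423, with "," declared as digit group 
separator it is 3423.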

DISCUSSION: Does it make sense to introduce the locale attribute for 
controlled conditions? What is the use case?
If it is introduced, the documentation needs a thorough description of 
the risks.


More details can be found in these emails:

attachment of
http://www.icpsr.umich.edu/pipermail/ddi-srg/2007-December/002653.html

references to common approaches and SPSS approach
http://www.icpsr.umich.edu/pipermail/ddi-srg/2007-December/002658.html

language issues
http://www.icpsr.umich.edu/pipermail/ddi-srg/2007-December/002677.html

Achim

