[DDI-SRG] Data format summary
Joachim Wackerow
joachim.wackerow at gesis.org
Mon Jan 14 14:48:45 EST 2008
The following is a summary based on the discussion in the conference
call last week. I hope I covered the important issues. Please look it
over and let me know if anything is missing.
StorageFormat (1..1)
This is the format in which the data is stored in a file (usually ASCII).
It should stay in physical because it depends on the physical
representation of the data. Different representations of the data
require a specific storage format.
It should be below
/PhysicalDataProduct/RecordLayout/DataItem/PhysicalLocation.
On a more general level this can be seen as input format.
The storage format should be represented by a code list (controlled
vocabulary). A general DDI storage format list would cover some core
formats. Additional formats can be represented by other format lists.
Learning from these other format lists, the DDI format list can be
expanded in the future.
The content of storage format is just the name of the format. Other
optional characteristics of the format are (already existing):
Delimiter, StartPosition, EndPosition, Width, DecimalPositions,
DecimalSeparator, GroupingSeparator (now DigitGroupSeparator).
BTW (Ceterum censeo EndPosition'em (originally Carthaginem :) ) esse
delendam. )
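To make the EndPosition point concrete, here is a minimal Python sketch. It assumes 1-based, inclusive column positions; the function names and the sample record are made up for illustration and are not part of any DDI schema. It shows that the end position follows from StartPosition and Width, and that a redundant EndPosition can only agree with that value or contradict it.

```python
from typing import Optional

def end_position(start_position: int, width: int) -> int:
    """Last column of a field, assuming 1-based inclusive positions."""
    return start_position + width - 1

def check_location(start_position: int, width: int,
                   end: Optional[int] = None) -> int:
    """Return the end position; raise if a redundant EndPosition disagrees."""
    computed = end_position(start_position, width)
    if end is not None and end != computed:
        raise ValueError(
            f"EndPosition {end} contradicts StartPosition {start_position} "
            f"and Width {width} (expected {computed})"
        )
    return computed

# Hypothetical fixed-width record: the field starts in column 3, 5 columns wide.
record = "0012345ABC"
start, width = 3, 5
field = record[start - 1:end_position(start, width)]
print(field)
```

Storing only StartPosition and Width leaves no room for the inconsistent case that check_location has to reject.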
One additional characteristic is necessary: LanguageOfData. This is the
language of the data items in the file. This is important for text and
dates (with text) as values for the variables in the file.
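A small Python sketch of why the language of the data matters for dates with text. The month tables, the function name parse_text_date and the German sample value are assumptions for illustration only; the pattern year, month abbreviation, day follows the 2007-Dec-13 example from the earlier discussion quoted below.

```python
from datetime import date

# Illustrative month abbreviations per language code (assumed, not a DDI list).
MONTHS = {
    "en": ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
    "de": ["Jan", "Feb", "Mär", "Apr", "Mai", "Jun",
           "Jul", "Aug", "Sep", "Okt", "Nov", "Dez"],
}

def parse_text_date(value: str, language_of_data: str) -> date:
    """Parse 'YYYY-Mon-DD' using the month names of the given data language."""
    year, month_name, day = value.split("-")
    # list.index raises ValueError if the month name is unknown in this language.
    month = MONTHS[language_of_data].index(month_name) + 1
    return date(int(year), month, int(day))

print(parse_text_date("2007-Dec-13", "en"))
print(parse_text_date("2007-Dez-13", "de"))
```

Reading the German value with the English table fails with a ValueError, which is the situation LanguageOfData is meant to prevent.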
Another additional characteristic is still in discussion: LocaleOfData.
Text entries can have local language flavors, the locale, which can be
important for automatic coding systems. The questions are: Is this a real
use case? If the element exists, can it do harm or be misused?
An optional StorageFormatDefault on a higher level makes sense as well
as defaults for all the other characteristics.
A mapping (or perhaps better an association) mechanism on the level of
the format lists (not on the level of StorageFormat, which is not
repeatable) would provide flexibility. It is independent of the
actual DDI instance. It should include all the characteristics from
above that are needed to describe one format.
DataType (0..1)
This is the recommended processing data type (in memory) for an
application. It depends on the variable. The characteristics of a
variable that matter for a data type are:
- numeric or character
- if numeric, real or integer
- range of codes in coding scheme
- range of codes in data based on frequency distribution
The last item is only important if codes beyond a given coding scheme
are allowed or if the data must be checked for consistency with the
coding scheme.
Based on that information (the four items above), an application can
determine an optimized processing type for its own purposes.
In addition, the recommended processing data type can be
described by DataType. This is application-independent.
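As a rough illustration of how an application might derive an optimized type from the four characteristics, here is a Python sketch. The function name, its parameters and the selection policy (smallest signed XML Schema integer type that covers both ranges) are my assumptions, not something agreed in the call.

```python
from typing import Optional, Tuple

Range = Tuple[int, int]  # (lowest code, highest code)

def recommend_type(is_numeric: bool, is_integer: bool,
                   scheme_range: Optional[Range],
                   data_range: Optional[Range]) -> str:
    """Pick an XML Schema type name wide enough for scheme and observed data."""
    if not is_numeric:
        return "string"
    if not is_integer:
        return "double"
    ranges = [r for r in (scheme_range, data_range) if r is not None]
    if not ranges:
        return "integer"  # unbounded integer when no range is known
    low = min(r[0] for r in ranges)
    high = max(r[1] for r in ranges)
    # Value ranges of the signed XML Schema integer types.
    for name, bits in (("byte", 8), ("short", 16), ("int", 32), ("long", 64)):
        if -(2 ** (bits - 1)) <= low and high <= 2 ** (bits - 1) - 1:
            return name
    return "integer"

# Coding scheme 1..5, but the data also contains the code 999.
print(recommend_type(True, True, (1, 5), (1, 999)))
```

The example shows why the range of codes in the data matters: the coding scheme alone would suggest a smaller type than the data needs.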
Data type should stay in logical because the data type is bound to the
variable (not to some physical representation formats), probably below
Representation in Variable.
An indicator of whether the variable is numeric or character can make
sense. But this is probably covered already by the various
representation types. Other thoughts?
Data type should use a restricted set of the data types of XML Schema. See:
http://www.w3.org/TR/xmlschema-2/#built-in-datatypes
Any type below anySimpleType can be used. I think the following
exceptions would make sense, omitting:
QName, NOTATION, and any type below string
The list should be realized by a controlled vocabulary. This way it
would be expandable by very specific types in another list if necessary.
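The restricted list can be derived mechanically from the type hierarchy. The following Python sketch encodes the built-in datatypes of XML Schema Part 2 (version 1.0) as I read the hierarchy diagram and removes the proposed exceptions. The variable names are hypothetical, and the table should be checked against the W3C document before anyone relies on it.

```python
# The 19 primitive built-in datatypes.
PRIMITIVES = [
    "string", "boolean", "decimal", "float", "double", "duration",
    "dateTime", "time", "date", "gYearMonth", "gYear", "gMonthDay",
    "gDay", "gMonth", "hexBinary", "base64Binary", "anyURI",
    "QName", "NOTATION",
]

# Parent of each derived built-in datatype. The list types NMTOKENS,
# IDREFS and ENTITIES are placed under their item types.
PARENT = {
    "normalizedString": "string", "token": "normalizedString",
    "language": "token", "NMTOKEN": "token", "Name": "token",
    "NMTOKENS": "NMTOKEN", "NCName": "Name",
    "ID": "NCName", "IDREF": "NCName", "ENTITY": "NCName",
    "IDREFS": "IDREF", "ENTITIES": "ENTITY",
    "integer": "decimal",
    "nonPositiveInteger": "integer", "long": "integer",
    "nonNegativeInteger": "integer",
    "negativeInteger": "nonPositiveInteger",
    "int": "long", "short": "int", "byte": "short",
    "unsignedLong": "nonNegativeInteger",
    "positiveInteger": "nonNegativeInteger",
    "unsignedInt": "unsignedLong", "unsignedShort": "unsignedInt",
    "unsignedByte": "unsignedShort",
}

def ancestors(name):
    """Yield every type the given type is derived from."""
    while name in PARENT:
        name = PARENT[name]
        yield name

EXCLUDED = {"QName", "NOTATION"}
allowed = sorted(
    t for t in PRIMITIVES + list(PARENT)
    if t not in EXCLUDED and "string" not in ancestors(t)
)
print(len(allowed), allowed)
```

Note that string itself stays in the list; only the types derived from it (normalizedString, token and everything below) drop out.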
An optional DataTypeDefault on a higher level makes sense.
OutputFormat (0..1)
This is the recommended output format (display format) for a variable.
In addition, an application can determine an optimized format based on
the four items listed above under DataType.
It should stay in logical. It is application-independent and bound to
the variable (not to some physical representation formats). It should be
below Representation.
The output format should be represented by a code list (the general DDI
format list, as for the storage format).
An optional OutputFormatDefault on a higher level makes sense.
Use cases with the information from above
1. ASCII file (storage) -> DDI -> Report
The storage format in the ASCII file is described in
physical/StorageFormat. The output format in logical/OutputFormat (if
available) can be used by the report application.
2. ASCII file (storage) -> DDI -> ASCII file (optimized storage)
The data in the ASCII file is read by an application according to the
storage format described in physical/StorageFormat. Internally, the
application uses a corresponding data type based on the recommended data
type in logical/DataType (or, if that is not available, based on the
characteristics of the variables). The application writes a new ASCII
file with formats based on the OutputFormat or on a determined optimized
output format.
Use cases which require additional information items:
3. SPSS system file (storage) -> DDI -> SPSS system file.
The metadata inside of the SPSS file is read by an application.
Available are the data type and an output format. The output format can
be understood as a recommended SPSS output format. It depends on how
the SPSS system file was generated and/or configured. The SPSS data type
is stored in proprietaryPhysical/DataType (1..1). An optional
DataTypeDefault on a higher level makes sense.
The output format is a proprietary format which is bound to this storage
type. Therefore it should stay in physical, not in logical, in an
additional element like proprietaryPhysical/OutputFormat (1..1). An
optional OutputFormatDefault on a higher level makes sense.
Based on the information in these two items in proprietaryPhysical (per
variable), a round trip is possible that regenerates an identical SPSS
system file or a related SPSS command setup file.
4. SPSS system file (storage) -> DDI -> SAS system file.
As in use case 3 above, the SPSS data type is described in
proprietaryPhysical/DataType and the recommended SPSS output format in
proprietaryPhysical/OutputFormat.
Based on the information of these two items in proprietaryPhysical (per
variable) an application can generate a similar SAS system file or a
related SAS command setup file.
The application requires the information regarding the mapping from the
SPSS data type to the SAS data type and the mapping from the SPSS output
format to the SAS output format. The mapping information can be tied to
the application (independently of DDI). Alternatively, the mapping
mechanism for the format list items (used by StorageFormat) can be used
to map (or associate) one proprietary format to another proprietary one
or to the general or recommended OutputFormat (in logical).
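A minimal Python sketch of the association idea, going from one proprietary format list to another through the general list. The table entries and the general names "real" and "character" are assumptions for illustration; a real association would also have to carry width, decimal positions and the other characteristics.

```python
# Hypothetical association table:
# (format list, format name) -> name in the general DDI format list.
TO_GENERAL = {
    ("SPSS", "F"): "real",        # SPSS numeric format Fw.d
    ("SPSS", "A"): "character",   # SPSS string format Aw
    ("SAS", "w.d"): "real",       # SAS standard numeric format
    ("SAS", "$w."): "character",  # SAS standard character format
}

def associate(source_list: str, source_name: str, target_list: str) -> list:
    """Find formats in target_list associated with the same general format."""
    general = TO_GENERAL[(source_list, source_name)]
    return [
        name
        for (format_list, name), general_name in TO_GENERAL.items()
        if format_list == target_list and general_name == general
    ]

print(associate("SPSS", "F", "SAS"))
print(associate("SPSS", "A", "SAS"))
```

Because every proprietary list is only associated with the general list, adding a new list (Stata, for example) needs one set of entries, not one per pair of systems.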
RECAPITULATION
- StorageFormat is in the regular physical for the description of files
(usually in ASCII).
- OutputFormat is in logical to describe an optional recommended output
format.
- DataType is in logical to describe an optional recommended processing
data type.
- For proprietary storage systems the last two items also exist in
proprietary physical: OutputFormat as the recommended output format of
the proprietary system, and DataType as the internal data type, which
depends on the proprietary system.
- A mapping (or association) system exists on a general level (not on
the DDI instance level) to describe equal or similar formats of
different format lists, which are used in StorageFormat and OutputFormat
(proprietary and general).
Currencies
Currency should be described in logical/Representation/@measurementUnit.
As suggested by the CVG, a role for the measurement unit would be
necessary. Then a controlled vocabulary for the role of the measurement
unit would be possible and, depending on the role, a controlled
vocabulary for the measurement unit itself. So currency would be an item
in the controlled vocabulary for role, and the list of currencies
(according to ISO 4217 currency names and code elements) would be the
controlled vocabulary for the measurement unit itself. See the ISO list at:
http://www.iso.org/iso/support/faqs/faqs_widely_used_standards/widely_used_standards_other/currency_codes/currency_codes_list-1.htm
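A short Python sketch of the two-level vocabulary: the role selects which controlled vocabulary applies to the measurement unit. The dictionary, the function name and the "length" role are hypothetical; the currency entries are only a small subset of the ISO 4217 alphabetic codes.

```python
# Hypothetical vocabularies: the role decides which list the unit is checked against.
UNIT_VOCABULARIES = {
    # Subset of ISO 4217 codes; DEM (Deutsche Mark) is a historic code.
    "currency": {"EUR", "USD", "DEM"},
    # Illustrative second role to show that the vocabulary depends on the role.
    "length": {"m", "km"},
}

def is_valid_unit(role: str, unit: str) -> bool:
    """Check a measurement unit against the vocabulary selected by its role."""
    return unit in UNIT_VOCABULARIES.get(role, set())

print(is_valid_unit("currency", "EUR"))
print(is_valid_unit("currency", "km"))
```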
Cheers, Achim
Pascal Heus wrote:
> Achim:
> Thanks for this summary. A few comments/questions/ideas.
>
> (1) It is my understanding that:
> - datatype is defined in the variable representation in the variable's
> logical product
> - format is declared in the physical data product's data item
> Is this correct?
>
> (2) Is the Format element repeatable to allow for the capture of DDI
> (generic) formatting and software specific formatting?
>
> (3) Do we also need to have format available in the variable
> representation? This would basically allow me to provide the
> "recommended" visual representation (print format in SPSS terminology)
> before I create any file. This could be a default with a possible
> override in the physical data product. A use case for this is a question
> bank or questionnaire where you want to describe your variables without
> creating any file.
>
> (4) Do we differentiate between store/write format (the way it is
> represented in a file) and visual/print format (the way it is shown to
> the user)? This would be very useful, for example, to specify that a date
> in an ASCII file is stored in ISO format but should be presented to the
> user as dd/mm/yyyy.
>
> (5) Regarding date formats, I was able to map all the SPSS formats into
> Stata (have not done SAS yet). Stata is, I think, also a good syntax to
> consider for formatting.
>
> (6) For currency formatting, we should have the ability to use the ISO
> 4217 (http://en.wikipedia.org/wiki/ISO_4217)
>
> (7) For DataType, do we need to allow for an "other" value? I'm always a
> bit sceptical that a controlled vocabulary covers all possible cases.
> What would I use, for example, if my variable is an image or a number in
> octal base (I don't know of such a use case, but it is not impossible)?
> We should also exclude from the W3C data type list the values below "token".
>
> (8) Based on our suggestion, we should also have a LanguageOfDataDefault
> in the PhysicalDataProduct
>
> best
> Pascal
>
>
> Joachim Wackerow wrote:
>> This is a summary of several emails and some discussion by email (see
>> at the bottom). It is related to input formats to read data in storage
>> like files etc.
>>
>> In December we agreed that Arofan would make a first attempt in XML
>> Schema based on this information and the discussion in the conference
>> call. Then I will look into it again and we can discuss further steps.
>>
>>
>> There is no general open solution for dealing with formats (for details
>> see my email from 2007-12-06). So we must build our own.
>>
>> A pragmatic and evolutionary approach would be to include a core set
>> of common data formats and data types in DDI lists (DataTypeCode,
>> DataFormatCode). In addition domain-specific lists can be built (like
>> the SAS list for data formats, ANSI-SQL list for data types, etc.).
>> Then a mapping mechanism would be necessary to map formats/types from
>> different domains. Ideally the mapping would be done (like the ISO/IEC
>> 11404 approach) from a domain-specific definition to a general-purpose
>> definition (like the DDI lists). Learning from the domain-specific
>> lists the general-purpose DDI lists can be expanded in future DDI
>> versions.
>>
>> The advantage of this approach is that only the basic lists and the
>> mapping mechanism must be provided now. Additional domain-specific
>> lists and related mappings can be produced later. (The disadvantage
>> can be data definitions in DDI instances that are too domain-specific.
>> General-purpose applications become more complex this way.)
>>
>> The mapping can be done using the approach in the comparison module:
>> a domain-specific format item is mapped to the general-purpose DDI
>> format item.
>>
>>
>> The input format should be defined in
>> /PhysicalDataProduct/RecordLayout/DataItem/PhysicalLocation
>>
>>
>> The DataType is the "recommended" data type for processing the data
>> item. This is optional and only necessary in specific cases. (A use
>> case would be, defining a recommended data type to optimize processing
>> in terms of memory used. This would be based not only on the
>> input format but also on the data actually used.) It should use a
>> restricted set of the data types of XML Schema. See:
>> http://www.w3.org/TR/xmlschema-2/#built-in-datatypes
>> Any type below anySimpleType can be used. I think the following
>> exceptions would make sense, omitting:
>> QName, NOTATION, any type below string
>>
>> The list should be realized by a controlled vocabulary. This way it
>> would be expandable by very specific types in another list if necessary.
>>
>>
>> Delimiter, StartPosition, Width, DecimalPositions, DecimalSeparator,
>> GroupingSeparator are fine.
>>
>> PROPOSAL: Remove EndPosition. On the basis of StartPosition together
>> with Width the end position can be computed. Otherwise both the Width
>> and EndPosition can be used and this is error prone.
>>
>> PROPOSAL: Rename GroupingSeparator into DigitGroupSeparator as this
>> term is more self-explanatory.
>>
>>
>> We can learn from how SAS defines the informats.
>>
>> SAS informats have the following form:
>> <$>informat<w>.<d>
>> where
>>
>> $ - indicates a character informat; its absence indicates a numeric
>> informat.
>>
>> informat - names the informat.
>>
>> w - specifies the informat width, which for most informats is the
>> number of columns in the input data.
>>
>> d - specifies an optional decimal scaling factor in the numeric
>> informats. SAS divides the input data by 10 to the power of d.
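To illustrate the informat structure described above, here is a Python sketch that splits an informat into its four parts. The regular expression is a simplification I made up for this purpose (it assumes the name ends in a letter or underscore so that trailing digits are read as the width) and is not taken from the SAS documentation. The scaling helper applies the division by 10 to the power of d only when the raw value has no explicit decimal point.

```python
import re
from typing import NamedTuple, Optional

# Simplified pattern for <$>informat<w>.<d>
INFORMAT = re.compile(
    r"^(?P<char>\$)?"
    r"(?P<name>[A-Za-z_](?:[A-Za-z0-9_]*[A-Za-z_])?)?"
    r"(?P<w>\d+)?\.(?P<d>\d+)?$"
)

class Informat(NamedTuple):
    is_character: bool       # True if the leading $ is present
    name: str                # empty for the standard w.d informat
    width: Optional[int]
    decimals: Optional[int]

def parse_informat(text: str) -> Informat:
    """Split an informat specification into type flag, name, w and d."""
    m = INFORMAT.match(text)
    if m is None:
        raise ValueError(f"not an informat: {text!r}")
    return Informat(
        m["char"] is not None,
        m["name"] or "",
        int(m["w"]) if m["w"] else None,
        int(m["d"]) if m["d"] else None,
    )

def scale(raw: str, decimals: int) -> float:
    """Divide by 10**d unless the raw value already has a decimal point."""
    return float(raw) if "." in raw else int(raw) / 10 ** decimals

print(parse_informat("COMMA10.2"))
print(parse_informat("$CHAR8."))
print(scale("12345", 2))
```

The four fields of the result correspond to the pieces discussed next: a type indicator, a format name, Width and DecimalPositions.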
>>
>>
>> Based on this structure we would need in addition to Width (w) and
>> DecimalPositions (d) a name for the informat and a type for the input
>> format. The name can be represented by a new element DataFormat.
>> DataFormat should be a controlled list of formats, the list for DDI
>> formats or lists of domain-specific formats.
>>
>> OPEN ISSUE: DDI core list of data formats.
>> Is there a general approach for date/time representation besides ISO
>> (and Unix date)?
>>
>> PROPOSAL: Start with most common formats for integer, real, character
>> data (in character representation).
>>
>>
>> The indicator for the type ($) of the input format in SAS is actually
>> a boolean. It can be represented by a new element FormatType
>> (numeric|character). I'm not really sure if it is really necessary
>> because this information is already included in the input format
>> definition.
>>
>> DISCUSSION item: What is a use case where a distinction between
>> numeric and character data is important without using the formats?
>>
>>
>> Regarding the language-specific portion of dates in data files (not in
>> the DDI instance) like 2007-Dec-13 a language
>> attribute should be invented for data formats. This attribute shouldn't
>> be "xml:lang". "xml:lang" has another meaning: it describes the language
>> of an XML content, not the language of a data item in a data file which
>> is described by a DDI instance.
>>
>> PROPOSAL: attribute "languageOfData" for DataFormat, allowed values are
>> the ISO two-letter codes.
>>
>>
>> Regarding the locale-specific formats of data, like decimal and
>> thousands separators, currency symbols etc., a locale attribute can be
>> invented. But the legitimate argument against it is that this kind of
>> description is not exact enough.
>> Use case 1: Which currency is defined by a locale of "de"? Deutsche Mark
>> or Euro?
>> Use case 2: Which decimal separator is meant by a locale of "de"?
>> The common format in Germany is like "3 423,67" (the equivalent in
>> the USA is "3,423.67"). But in some publications (like scientific ones)
>> the US format is used. Software like Excel writes CSV files with a
>> decimal separator that depends on the locale in use; but is the user
>> aware of this?
>> The locale definition seems to be error prone and not exact enough to
>> describe data where common local rules are often used, but not always.
>> For example the decimal separator should be described. It is not a good
>> idea to rely on common rules dependent from a locale.
>> Only in a controlled environment can it make sense to use locale as a
>> reliable indicator for the used format.
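A minimal Python sketch of the alternative argued for above: describe the separators explicitly and parse with them, without consulting any locale. The function name and parameter names are assumptions; they only mirror DecimalSeparator and DigitGroupSeparator.

```python
from decimal import Decimal

def parse_number(text: str, decimal_separator: str,
                 digit_group_separator: str = "") -> Decimal:
    """Parse a number using explicitly documented separators, not a locale."""
    # Remove the grouping separator first, then normalize the decimal separator.
    if digit_group_separator:
        text = text.replace(digit_group_separator, "")
    if decimal_separator != ".":
        text = text.replace(decimal_separator, ".")
    return Decimal(text)

print(parse_number("3 423,67", ",", " "))   # the German-style example above
print(parse_number("3,423.67", ".", ","))   # the US-style example above
```

Both calls yield the same value, whatever locale the reading machine is set to.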
>>
>> DISCUSSION: Does it make sense to invent the locale attribute for
>> controlled conditions? What is the use case?
>> If it is invented, a thorough description of the risks is necessary
>> in the documentation.
>>
>>
>> More details can be found in these emails:
>>
>> attachment of
>> http://www.icpsr.umich.edu/pipermail/ddi-srg/2007-December/002653.html
>>
>> references to common approaches and SPSS approach
>> http://www.icpsr.umich.edu/pipermail/ddi-srg/2007-December/002658.html
>>
>> language issues
>> http://www.icpsr.umich.edu/pipermail/ddi-srg/2007-December/002677.html
>>
>> Achim