[DDI-SRG] Data format summary

Pascal Heus pascal.heus at gmail.com
Thu Jan 10 08:48:26 EST 2008


Achim:
Thanks for this summary. A few comments/questions/ideas.

(1) It is my understanding that:
- datatype is defined in the variable representation in the variable's 
logical product
- format is declared in the physical data product's data item
Is this correct?

(2) Is the Format element repeatable to allow for the capture of DDI 
(generic) formatting and software specific formatting?

(3) Do we also need to have format available in the variable 
representation? This would basically allow me to provide the 
"recommended" visual representation (print format in SPSS terminology) 
before I create any file. This could be a default with a possible 
override in the physical data product. A use case for this is a question 
bank or questionnaire where you want to describe your variables without 
creating any file.

(4) Do we differentiate between store/write format (the way it is 
represented in a file) and visual/print format (the way it is shown to 
the user)? This would be very useful, for example, to specify that a date 
in an ASCII file is stored in ISO format but should be presented to the 
user as dd/mm/yyyy.
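
To illustrate the distinction, here is a minimal Python sketch. The class 
DateItemFormat and its two fields are hypothetical names, not DDI 
elements, and the patterns use Python's strptime syntax purely as a 
stand-in for whatever format syntax DDI adopts.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class DateItemFormat:
    """Hypothetical pairing of a storage format with a display format."""
    store_format: str    # how the value is written in the data file
    display_format: str  # how the value should be shown to the user

    def display(self, raw: str) -> str:
        # Parse with the storage format, then render with the display format.
        parsed = datetime.strptime(raw, self.store_format)
        return parsed.strftime(self.display_format)


# Stored as an ISO date, presented as dd/mm/yyyy.
iso_to_dmy = DateItemFormat(store_format="%Y-%m-%d", display_format="%d/%m/%Y")
print(iso_to_dmy.display("2008-01-10"))
```

The file content never changes here; only the presentation differs, which 
is why the two formats would need to be declared separately.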

(5) Regarding date formats, I was able to map all the SPSS formats into 
Stata (I have not done SAS yet). I think the Stata syntax is also a good 
one to consider for formatting.

(6) For currency formatting, we should have the ability to use the ISO 
4217 codes (http://en.wikipedia.org/wiki/ISO_4217).

(7) For DataType, do we need to allow for an "other" value? I'm always a 
bit sceptical that a controlled vocabulary covers all possible cases. 
What would I use, for example, if my variable is an image or a number in 
octal (I don't know of such a use case, but it's not impossible)? We 
should also exclude from the W3C data type list the values below "token".

(8) Based on our suggestion, we should also have a LanguageOfDataDefault 
in the PhysicalDataProduct

best
Pascal


Joachim Wackerow wrote:
> This is a summary of several emails and some discussion by email (see at 
> the bottom). It is related to input formats to read data in storage like 
> files etc.
>
> In December we agreed that Arofan would make a first attempt in XML 
> Schema based on this information and the discussion in the conference 
> call. Then I will look into it again and we can discuss further steps.
>
>
> There is no general open solution for dealing with formats (for 
> details, see my email from 2007-12-06), so we must build our own.
>
> A pragmatic and evolutionary approach would be to include a core set of 
> common data formats and data types in DDI lists (DataTypeCode, 
> DataFormatCode). In addition domain-specific lists can be built (like 
> the SAS list for data formats, ANSI-SQL list for data types, etc.). Then 
> a mapping mechanism would be necessary to map formats/types from 
> different domains. Ideally the mapping would be done (like the ISO/IEC 
> 11404 approach) from a domain-specific definition to a general-purpose 
> definition (like the DDI lists). Learning from the domain-specific lists 
> the general-purpose DDI lists can be expanded in future DDI versions.
>
> The advantage of this approach is that only the basic lists and the 
> mapping mechanism must be provided now. Additional domain-specific lists 
> and related mappings can be produced later. (The disadvantage is that 
> data definitions in DDI instances can become too domain-specific, which 
> makes general-purpose applications more complex.)
>
> The mapping can be done with the approach used in the comparison module: 
> a domain-specific format item is mapped to the general-purpose DDI 
> format item.
>
>
> The input format should be defined in
> /PhysicalDataProduct/RecordLayout/DataItem/PhysicalLocation
>
>
> The DataType is the "recommended" data type for processing the data 
> item. It is optional and only necessary in specific cases. (A use case 
> would be defining a recommended data type to optimize processing in 
> terms of memory use. This would be based not only on the input format 
> but also on the data actually present.) It should use a restricted set 
> of the data types of XML Schema. See:
> http://www.w3.org/TR/xmlschema-2/#built-in-datatypes
> Any type below anySimpleType qualifies. I think it would make sense to 
> omit the following:
> QName, NOTATION, any type below string
>
> The list should be realized by a controlled vocabulary. This way it 
> would be expandable by very specific types in another list if necessary.
>
>
> Delimiter, StartPosition, Width, DecimalPositions, DecimalSeparator, 
> GroupingSeparator are fine.
>
> PROPOSAL: Remove EndPosition. The end position can be computed from 
> StartPosition together with Width. Otherwise both Width and EndPosition 
> can be given, which is error-prone because the two can contradict each 
> other.
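
A small Python sketch of that computation follows. It assumes 1-based, 
inclusive column numbering, which the summary does not state; the 
function names and the sample record are made up for illustration.

```python
def end_position(start_position: int, width: int) -> int:
    """Last column occupied by the item (assumes 1-based, inclusive columns)."""
    return start_position + width - 1


def read_item(record: str, start_position: int, width: int) -> str:
    """Cut one data item out of a fixed-width record."""
    begin = start_position - 1  # convert the 1-based column to a 0-based index
    return record[begin:begin + width]


# Hypothetical record: a 4-column id, a 10-column name, a 4-column year.
record = "0042" + "Smith".ljust(10) + "1975"
print(end_position(5, 10))
print(read_item(record, 15, 4))
```

Since the end position is fully determined by the other two values, 
storing it separately only adds a way for the description to become 
inconsistent.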
>
> PROPOSAL: Rename GroupingSeparator to DigitGroupSeparator, as this term 
> is more self-explanatory.
>
>
> We can learn from how SAS defines the informats.
>
> SAS informats have the following form:
> <$>informat<w>.<d>
> where
>
> $ - indicates a character informat; its absence indicates a numeric 
> informat.
>
> informat - names the informat.
>
> w - specifies the informat width, which for most informats is the number 
> of columns in the input data.
>
> d - specifies an optional decimal scaling factor in the numeric 
> informats. SAS divides the input data by 10 to the power of d.
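
The structure described above can be sketched in Python as follows. The 
names Informat, parse_informat and scale are hypothetical, and the 
regular expression only approximates the quoted form; it is not SAS's 
full grammar. The scaling follows the description above and assumes the 
raw value contains digits only, with no explicit decimal point.

```python
import re
from typing import NamedTuple, Optional

# Approximation of <$>informat<w>.<d>; the name may contain digits but is
# assumed not to end in one, so that a trailing number is read as the width.
INFORMAT = re.compile(
    r"(?P<char>\$)?"
    r"(?P<name>[A-Za-z_](?:[A-Za-z0-9_]*[A-Za-z_])?)?"
    r"(?P<w>\d+)?\.(?P<d>\d+)?"
)


class Informat(NamedTuple):
    is_character: bool       # True when the spec starts with "$"
    name: str                # empty for the plain w.d form
    width: Optional[int]
    decimals: Optional[int]


def parse_informat(spec: str) -> Informat:
    m = INFORMAT.fullmatch(spec.strip())
    if m is None:
        raise ValueError(f"not a recognised informat: {spec!r}")
    return Informat(
        is_character=m["char"] is not None,
        name=(m["name"] or "").upper(),
        width=int(m["w"]) if m["w"] else None,
        decimals=int(m["d"]) if m["d"] else None,
    )


def scale(raw: str, decimals: int) -> float:
    """Divide the raw digits by 10 to the power of d, as described above."""
    return int(raw) / 10 ** decimals


print(parse_informat("COMMA10.2"))
print(parse_informat("$CHAR8."))
print(scale("12345", 2))
```

The four parsed fields correspond to the indicator ($), the informat 
name, w and d in the description above.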
>
>
> Based on this structure we would need in addition to Width (w) and 
> DecimalPositions (d) a name for the informat and a type for the input 
> format. The name can be represented by a new element DataFormat. 
> DataFormat should be a controlled list of formats, the list for DDI 
> formats or lists of domain-specific formats.
>
> OPEN ISSUE: DDI core list of data formats.
> Is there a general approach for date/time representation besides ISO 
> (and Unix date)?
>
> PROPOSAL: Start with most common formats for integer, real, character 
> data (in character representation).
>
>
> The indicator for the type ($) of the input format in SAS is actually a 
> boolean. It can be represented by a new element FormatType 
> (numeric|character). I'm not really sure if it is really necessary 
> because this information is already included in the input format definition.
>
> DISCUSSION item: What is a use case where a distinction between numeric 
> and character data is important without using the formats?
>
>
> Regarding the language-specific portion of dates in data files (not in 
> the DDI instance), like 2007-Dec-13, a language attribute should be 
> introduced for data formats. This attribute shouldn't be "xml:lang", 
> which has another meaning: it describes the language of XML content, 
> not the language of a data item in a data file that is described by a 
> DDI instance.
>
> PROPOSAL: attribute "languageOfData" for DataFormat; allowed values are 
> the ISO two-letter codes.
>
>
> Regarding locale-specific formats of data, like decimal and thousands
> separators, currency symbols etc., a locale attribute could be
> introduced. But the legitimate argument against it is that this kind
> of description is not exact enough.
> Use case 1: Which currency is defined by a locale of "de"? Deutsche Mark
> or Euro?
> Use case 2: Which decimal separator is meant by a locale of "de"?
> The common format in Germany is "3 423,67" (the equivalent in
> the USA is "3,423.67"). But in some publications (like scientific ones)
> the US format is used. Software like Excel writes CSV files with a
> decimal separator that depends on the locale in use; but is the user
> aware of this?
> The locale definition seems error-prone and not exact enough to
> describe data where common local rules are often, but not always, used.
> The decimal separator, for example, should be described explicitly. It
> is not a good idea to rely on common rules that depend on a locale.
> Only in a controlled environment can it make sense to use locale as a
> reliable indicator of the format used.
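
As a rough Python illustration of describing the separators explicitly 
instead of inferring them from a locale: the function parse_number is 
hypothetical, and its two parameters simply mirror the DecimalSeparator 
and DigitGroupSeparator elements discussed above.

```python
from decimal import Decimal


def parse_number(text: str, decimal_separator: str,
                 digit_group_separator: str = "") -> Decimal:
    """Parse a number using explicitly declared separators, not a locale."""
    cleaned = text.strip()
    if digit_group_separator:
        # Drop the grouping characters first; they carry no value.
        cleaned = cleaned.replace(digit_group_separator, "")
    if decimal_separator != ".":
        # Normalise the declared decimal separator to the "." Decimal expects.
        cleaned = cleaned.replace(decimal_separator, ".")
    return Decimal(cleaned)


# The two notations from use case 2 yield the same value.
print(parse_number("3 423,67", decimal_separator=",", digit_group_separator=" "))
print(parse_number("3,423.67", decimal_separator=".", digit_group_separator=","))
```

With the separators declared per data item, a reader does not have to 
guess which convention a "de" file actually follows.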
>
> DISCUSSION: Does it make sense to introduce the locale attribute for 
> controlled conditions? What is the use case?
> If it is introduced, the documentation needs a thorough description of 
> the risks.
>
>
> More details can be found in these emails:
>
> attachment of
> http://www.icpsr.umich.edu/pipermail/ddi-srg/2007-December/002653.html
>
> references to common approaches and SPSS approach
> http://www.icpsr.umich.edu/pipermail/ddi-srg/2007-December/002658.html
>
> language issues
> http://www.icpsr.umich.edu/pipermail/ddi-srg/2007-December/002677.html
>
> Achim