[DDI-SRG] Data format summary
Pascal Heus
pascal.heus at gmail.com
Tue Jan 15 19:22:49 EST 2008
Achim:
Thanks for putting this together. A few comments/ideas:
- For StorageFormat, we probably need more than just a controlled
vocabulary to be able to deal with complex formats such as date or
datetime. I thought this was going to be based on the SAS Informat syntax.
Is this correct?
- For LanguageOfData, I would then also include a DefaultLanguageOfData
under the PhysicalDataProduct
- For the locale issue, why don't we use a syntax similar to Java's, like
"fr" for French, "fr_BE" for Belgian French?
The component that seems to be missing in your proposal is the
ProprietaryOutputFormat that can be used to capture software specific
formatting. We cannot use OutputFormat as it is restricted to the DDI
generic syntax and we cannot create/maintain mappings to all possible
software packages. We suggested for this to be available at both the
Logical and Physical product levels as repeatable elements
(not-repeatable in proprietary record layout). This is where we can
store for example SPSS or SAS specific formatting.
Use case #1: SPSS file (storage)
- We need a placeholder in the ProprietaryRecordLayout's DataItem to
capture the format of the variable as described in the original SPSS
data dictionary.
Use case #2: ASCII file exported from SPSS (storage)
- When exporting an ASCII file from SPSS, we need to be able to carry
over the formatting present in the source SPSS file. In this case, the
StorageFormat from DDI is not specific enough.
Use case #3: recommended proprietary formats at the LogicalProduct level
- The ProprietaryOutputFormat can be used at the LogicalProduct/Variable
level to provide recommended output formats for various statistical
packages. The value specified here can be used as default value for the
Physical Data Product Data Item. I have several use cases for this:
(a) as a data producer, I may want to describe the recommended way to
output a variable in a specific software package (before a file is
actually created)
(b) when importing a file from a package (like SPSS) into DDI 3, we can
capture the original format at the Logical level as well (in case I
lose the connection to the master physical data product and as a default
for ASCII exports)
(c) when importing a file from a statistical package (like SPSS), we can
already create mappings to other packages (like SAS, Stata) so these
mappings do not need to be recreated by other applications that may not
know how to do this (this is what I do in DExT).
Suggestion:
- Define a reusable ProprietaryOutputFormat complex type as an xs:string
along with a software attribute and optional attributes to capture the
proprietary package.
- Add ProprietaryOutputFormat to
PhysicalDataProduct/RecordLayout/DataItem/PhysicalLocation (0..n)
- Add ProprietaryOutputFormat to the ProprietaryRecordLayout's DataItem
(0..1)
- Add ProprietaryOutputFormat to LogicalProduct/Representation (0..n)
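
To make the suggestion a bit more concrete, here is a minimal Python sketch of how an application might hold ProprietaryOutputFormat values and use the logical-level value as a default for the physical data item (use case #3). The class names Representation and PhysicalLocation mirror the element names above, but resolve_format and the sample format strings are illustrative assumptions and not part of any DDI schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class ProprietaryOutputFormat:
    """A software-specific format string plus the package it belongs to."""
    software: str  # e.g. "SPSS", "SAS", "Stata" (free text in this sketch)
    value: str     # the format exactly as the package writes it


@dataclass
class Representation:
    """Logical level: recommended proprietary formats, repeatable (0..n)."""
    proprietary_formats: List[ProprietaryOutputFormat] = field(default_factory=list)


@dataclass
class PhysicalLocation:
    """Physical level: proprietary formats of one data item (0..n)."""
    proprietary_formats: List[ProprietaryOutputFormat] = field(default_factory=list)


def resolve_format(software: str,
                   physical: PhysicalLocation,
                   logical: Representation) -> Optional[str]:
    """Return the physical-level format for `software` if present,
    otherwise the logical-level default, otherwise None."""
    wanted = software.lower()
    for level in (physical, logical):
        for fmt in level.proprietary_formats:
            if fmt.software.lower() == wanted:
                return fmt.value
    return None
```

With a logical Representation carrying an SPSS and a Stata format and a PhysicalLocation carrying only an SPSS format, resolve_format returns the physical SPSS value, falls back to the logical Stata value, and returns None for a package that nobody described.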
Let me know if you have any questions.
thanks
*P
Joachim Wackerow wrote:
> The following is a summary based on the discussion in the conference
> call last week. I hope I covered the important issues. Please look over
> it if anything is missing.
>
>
> StorageFormat (1..1)
>
> This is the format in which the data is stored in a file (usually ASCII). It
> should stay in physical because it depends on the physical
> representation of the data. Different representations of the data
> require a specific storage format.
>
> It should be below
> /PhysicalDataProduct/RecordLayout/DataItem/PhysicalLocation.
> On a more general level this can be seen as input format.
>
> The storage format should be represented by a code list (controlled
> vocabulary). A general DDI storage format would cover some core formats.
> Additional other formats can be represented by other format lists.
> Learning from these other format lists the DDI format list can be
> expanded in future.
>
> The content of storage format is just the name of the format. Other
> optional characteristics of the format are (already existing):
> Delimiter, StartPosition, EndPosition, Width, DecimalPositions,
> DecimalSeparator, GroupingSeparator (now DigitGroupSeparator).
>
> BTW (Ceterum censeo EndPosition'em (originally Carthaginem :) ) esse
> delendam. )
>
> One additional characteristic is necessary: LanguageOfData. This is the
> language of the data items in the file. This is important for text and
> dates (with text) as values for the variables in the file.
>
> Another additional characteristic is still in discussion: LocaleOfData.
> Text entries can have local language flavors, the locale, which can be
> important for automatic coding systems. The questions are: Is this a real
> use case? When the element exists, can it do some harm, can it be misused?
>
> An optional StorageFormatDefault on a higher level makes sense as well
> as defaults for all the other characteristics.
>
> A mapping (or perhaps better an association) mechanism on the level of
> the format lists (not on the level of StorageFormat, which is not
> repeatable) would provide flexibility. It is independent of the
> actual DDI instance. It should include all the characteristics from
> above to describe one format.
>
>
> DataType (0..1)
>
> This is the recommended processing data type (in memory) for an
> application. It depends on the variable. The characteristics of a
> variable that matter for a data type are:
> - numeric or character
> - if numeric real or integer
> - range of codes in coding scheme
> - range of codes in data based on frequency distribution
>
> The latter item is only important if codes beyond a given coding scheme
> are allowed or if it is important to check the data for consistency
> with the coding scheme.
>
> Based on that information (four items above) an optimized processing
> type for a specific application can be determined (by an application).
> In addition, the recommended processing data type can be
> described by DataType. This is application-independent.
>
> Data type should stay in logical because the data type is bound to the
> variable (not to some physical representation formats), probably below
> Representation in Variable.
>
> An indicator of whether the variable is numeric or character can make
> sense. But this is probably already covered by the various
> representation types. Other thoughts?
>
> Data type should use a restricted set of the data types of XML Schema. See:
> http://www.w3.org/TR/xmlschema-2/#built-in-datatypes
> Any types below anySimpleType. I think the following exceptions would
> make sense, omitting of:
> QName, NOTATION, and any type below string
>
> The list should be realized by a controlled vocabulary. This way it
> would be expandable by very specific types in another list if necessary.
>
> An optional DataTypeDefault on a higher level makes sense.
>
>
> OutputFormat (0..1)
>
> This is the recommended output format (display format) for a variable.
> An application can additionally determine an optimized format based on
> the four items listed above under DataType.
>
> It should stay in logical. It is application-independent and bound to
> the variable (not to some physical representation formats). It should be
> below Representation.
>
> The output format should be represented by a code list (like the
> general DDI format list for storage format).
>
> An optional OutputFormatDefault on a higher level makes sense.
>
>
> Use cases with the information from above
>
> 1. ASCII file (storage) -> DDI -> Report
>
> The storage format in the ASCII file is described in
> physical/StorageFormat. The output format in logical/OutputFormat (if
> available) can be used by the report application.
>
>
> 2. ASCII file (storage) -> DDI -> ASCII file (optimized storage)
>
> The data in the ASCII file is read by an application according to the
> storage format described in physical/StorageFormat. The application uses
> internally a corresponding data type based on the recommended data type
> in logical/DataType (if not available based on the characteristics of
> the variables). The application writes a new ASCII file with formats
> based on the OutputFormat or based on a determined optimized output format.
>
>
> Use cases which require additional information items:
>
> 3. SPSS system file (storage) -> DDI -> SPSS system file.
>
> The metadata inside of the SPSS file is read by an application.
> The data type and an output format are available. The output format can
> be understood as a recommended SPSS output format. It depends on how
> the SPSS system file is generated and/or configured. The SPSS data type
> is stored in proprietaryPhysical/DataType (1..1). An optional
> DataTypeDefault on a higher level makes sense.
>
>
> The output format is a proprietary format which is bound to this storage
> type. Therefore it should stay in physical not in logical, in an
> additional element like proprietaryPhysical/OutputFormat (1..1). An
> optional OutputFormatDefault on a higher level makes sense.
>
> Based on the information of these two items in proprietaryPhysical (per
> variable) a roundtrip is possible to generate again an identical SPSS
> system file or a related SPSS command setup file.
>
>
> 4. SPSS system file (storage) -> DDI -> SAS system file.
>
> As in use case 3 above, the SPSS data type is described in
> proprietaryPhysical/DataType and the recommended SPSS output format in
> proprietaryPhysical/OutputFormat.
>
> Based on the information of these two items in proprietaryPhysical (per
> variable) an application can generate a similar SAS system file or a
> related SAS command setup file.
>
> The application requires the information regarding the mapping from the
> SPSS data type to the SAS data type and the mapping from the SPSS output
> format to the SAS output format. The mapping information can be tied to
> the application (independently from DDI). Alternatively the mapping
> mechanism for the format list items (used by StorageFormat) can be used
> to map (or associate) one proprietary format to another proprietary one
> or to the general or recommended OutputFormat (in logical).
>
>
> RECAPITULATION
>
> - StorageFormat is in the regular physical for the description of files
> (usually in ASCII).
> - OutputFormat is in logical to describe an optional recommended output
> format.
> - DataType is in logical to describe an optional recommended processing
> data type.
> - For proprietary storage systems the last two items exist in
> proprietary physical: OutputFormat as the recommended output format of the
> proprietary system, and DataType as the internal data type that depends on
> the proprietary system.
> - A mapping (or association) system exists on a general level (not on
> the DDI instance level) to describe equal or similar formats of
> different format lists, which are used in StorageFormat and OutputFormat
> (proprietary and general).
>
>
>
> Currencies
>
> Currency should be described in logical/Representation/@measurementUnit.
> As suggested by the CVG a role for measurement unit would be
> necessary. Then a controlled vocabulary for the role or the measurement
> unit would be possible and, depending on the role, a controlled
> vocabulary for the measurement unit itself. So currency would be an item
> in the controlled vocabulary for role, and the list of currencies
> (according to ISO 4217 currency names and code elements) would be the
> controlled vocabulary for the measurement unit itself. See the ISO list at:
> http://www.iso.org/iso/support/faqs/faqs_widely_used_standards/widely_used_standards_other/currency_codes/currency_codes_list-1.htm
>
>
> Cheers, Achim
>
>
> Pascal Heus wrote:
>
>> Achim:
>> Thanks for this summary. A few comments/questions/ideas.
>>
>> (1) It is my understanding that:
>> - datatype is defined in the variable representation in the variable's
>> logical product
>> - format is declared in the physical data product's data item
>> Is this correct?
>>
>> (2) Is the Format element repeatable to allow for the capture of DDI
>> (generic) formatting and software specific formatting?
>>
>> (3) Do we also need to have format available in the variable
>> representation? This would basically allow me to provide the
>> "recommended" visual representation (print format in SPSS terminology)
>> before I create any file. This could be a default with a possible
>> override in the physical data product. A use case for this is a question
>> bank or questionnaire where you want to describe your variables without
>> creating any file.
>>
>> (4) Do we differentiate between store/write format (the way it is
>> represented in a file) and visual/print format (the way it is shown to
>> the user)? This would be very useful, for example, to specify that a date in
>> an ASCII file is stored in ISO format but should be presented to the user
>> as dd/mm/yyyy.
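
The store/display distinction in point (4) can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name display_date is hypothetical, and the strftime pattern "%d/%m/%Y" stands in for whatever syntax a display format element would eventually use.

```python
from datetime import date


def display_date(stored: str, display_pattern: str = "%d/%m/%Y") -> str:
    """Read a date stored in ISO 8601 form (YYYY-MM-DD) and render it
    with a separate display pattern (dd/mm/yyyy by default)."""
    parsed = date.fromisoformat(stored)      # storage format: ISO
    return parsed.strftime(display_pattern)  # display format: chosen by metadata


print(display_date("2008-01-15"))  # prints 15/01/2008
```

The stored value never changes; only the pattern applied at presentation time does, which is why the two formats can be described independently.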
>>
>> (5) Regarding date format, I was able to map all the SPSS formats into
>> Stata (have not done SAS yet). I think Stata is also a good syntax to
>> consider for formatting.
>>
>> (6) For currency formatting, we should have the ability to use the ISO
>> 4217 (http://en.wikipedia.org/wiki/ISO_4217)
>>
>> (7) For DataType, do we need to allow for a "other" value? I'm always a
>> bit sceptical that a controlled vocabulary covers all possible cases.
>> What would I use, for example, if my variable is an image or a number in
>> octal base (I don't know of such a use case, but it is not impossible)? We
>> should also exclude from the W3C data type list the values below "token"
>>
>> (8) Based on our suggestion, we should also have a LanguageOfDataDefault
>> in the PhysicalDataProduct
>>
>> best
>> Pascal
>>
>> Joachim Wackerow wrote:
>>
>>> This is a summary of several emails and some discussion by email (see
>>> at the bottom). It is related to input formats to read data in storage
>>> like files etc.
>>>
>>> In December we agreed that Arofan would produce a first draft in XML Schema
>>> based on this information and the discussion in the conference call.
>>> Then I will look into it again and we can discuss further steps.
>>>
>>>
>>> There is no general open solution for dealing with formats (for details
>>> see my email from 2007-12-06). So we must build our own.
>>>
>>> A pragmatic and evolutionary approach would be to include a core set
>>> of common data formats and data types in DDI lists (DataTypeCode,
>>> DataFormatCode). In addition domain-specific lists can be built (like
>>> the SAS list for data formats, ANSI-SQL list for data types, etc.).
>>> Then a mapping mechanism would be necessary to map formats/types from
>>> different domains. Ideally the mapping would be done (like the ISO/IEC
>>> 11404 approach) from a domain-specific definition to a general-purpose
>>> definition (like the DDI lists). Learning from the domain-specific
>>> lists the general-purpose DDI lists can be expanded in future DDI
>>> versions.
>>>
>>> The advantage of this approach is that only the basic lists and the
>>> mapping mechanism must be provided now. Additional domain-specific
>>> lists and related mappings can be produced later. (The disadvantage
>>> is that data definitions in DDI instances can become too domain-specific,
>>> which makes general-purpose applications more complex.)
>>>
>>> The mapping can be done using the approach in the comparison module: a
>>> domain-specific format item is mapped to the general-purpose DDI
>>> format item.
>>>
>>>
>>> The input format should be defined in
>>> /PhysicalDataProduct/RecordLayout/DataItem/PhysicalLocation
>>>
>>>
>>> The DataType is the "recommended" data type for processing the data
>>> item. This is optional and only necessary in specific cases. (A use
>>> case would be defining a recommended data type to optimize processing
>>> in terms of memory use. This would be based not only on the
>>> input format but also on the data actually present.) It should use a
>>> restricted set of the data types of XML Schema. See:
>>> http://www.w3.org/TR/xmlschema-2/#built-in-datatypes
>>> Any types below anySimpleType. I think the following exceptions would
>>> make sense, omitting of:
>>> QName, NOTATION, any type below string
>>>
>>> The list should be realized by a controlled vocabulary. This way it
>>> would be expandable by very specific types in another list if necessary.
>>>
>>>
>>> Delimiter, StartPosition, Width, DecimalPositions, DecimalSeparator,
>>> GroupingSeparator are fine.
>>>
>>> PROPOSAL: Remove EndPosition. On the basis of StartPosition together
>>> with Width the end position can be computed. Otherwise both the Width
>>> and EndPosition can be used and this is error prone.
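
The redundancy behind this proposal is easy to show. The following Python sketch assumes 1-based column positions, as is usual for fixed-width record layouts; the function names are illustrative and not taken from DDI.

```python
def end_position(start_position: int, width: int) -> int:
    """Compute the last column of a field from its first column and width
    (1-based, inclusive), so EndPosition never has to be stored."""
    if start_position < 1 or width < 1:
        raise ValueError("start_position and width must be >= 1")
    return start_position + width - 1


def is_consistent(start_position: int, width: int, end: int) -> bool:
    """Check a layout that stores both Width and EndPosition; the two can
    disagree, which is the error-prone case the proposal removes."""
    return end_position(start_position, width) == end


def slice_field(record: str, start_position: int, width: int) -> str:
    """Extract a fixed-width field from one record line."""
    begin = start_position - 1  # convert to Python's 0-based index
    return record[begin:begin + width]
```

For a field starting at column 5 with width 3, end_position gives 7, and a stored EndPosition of 8 would be flagged by is_consistent as contradictory.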
>>>
>>> PROPOSAL: Rename GroupingSeparator into DigitGroupSeparator as this
>>> term is more self-explanatory.
>>>
>>>
>>> We can learn from how SAS defines the informats.
>>>
>>> SAS informats have the following form:
>>> <$>informat<w>.<d>
>>> where
>>>
>>> $ - indicates a character informat; its absence indicates a numeric
>>> informat.
>>>
>>> informat - names the informat.
>>>
>>> w - specifies the informat width, which for most informats is the
>>> number of columns in the input data.
>>>
>>> d - specifies an optional decimal scaling factor in the numeric
>>> informats. SAS divides the input data by 10 to the power of d.
>>>
>>>
>>> Based on this structure we would need in addition to Width (w) and
>>> DecimalPositions (d) a name for the informat and a type for the input
>>> format. The name can be represented by a new element DataFormat.
>>> DataFormat should be a controlled list of formats, the list for DDI
>>> formats or lists of domain-specific formats.
>>>
>>> OPEN ISSUE: DDI core list of data formats.
>>> Is there a general approach for date/time representation besides ISO
>>> (and Unix date)?
>>>
>>> PROPOSAL: Start with most common formats for integer, real, character
>>> data (in character representation).
>>>
>>>
>>> The indicator for the type ($) of the input format in SAS is actually
>>> a boolean. It can be represented by a new element FormatType
>>> (numeric|character). I'm not really sure if it is really necessary
>>> because this information is already included in the input format
>>> definition.
>>>
>>> DISCUSSION item: What is a use case where a distinction between
>>> numeric and character data is important without using the formats?
>>>
>>>
>>> Regarding the language-specific portion of dates in data files (not in
>>> the DDI instance) like 2007-Dec-13, a language attribute should be
>>> introduced for data formats. This attribute shouldn't
>>> be "xml:lang". "xml:lang" has another meaning: it describes the language
>>> of an XML content, not the language of a data item in a data file which
>>> is described by a DDI instance.
>>>
>>> PROPOSAL: attribute "languageOfData" for DataFormat, allowed values are
>>> the ISO two-letter codes.
>>>
>>>
>>> Regarding the locale-specific formats of data, like decimal and thousands
>>> separators, currency symbols, etc., a locale attribute can be
>>> introduced. But the legitimate argument against it is that this kind of
>>> description is not exact enough.
>>> Use case 1: Which currency is defined by a locale of "de"? Deutsche Mark
>>> or Euro?
>>> Use case 2: Which decimal separator is meant by a locale of "de"?
>>> Common definition in Germany is a format like "3 423,67" (equivalent in
>>> the USA is "3,423.67"). But in some publications (like scientific ones)
>>> the US format is used. Software like Excel writes CSV files with a
>>> decimal separator that depends on the locale in use; but is the user aware
>>> of this?
>>> The locale definition seems to be error prone and not exact enough to
>>> describe data where common local rules are often used, but not always.
>>> For example, the decimal separator should be described explicitly. It is
>>> not a good idea to rely on common rules that depend on a locale.
>>> Only in a controlled environment can it make sense to use locale as a
>>> reliable indicator for the format used.
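
The argument for describing separators explicitly rather than inferring them from a locale can be sketched in Python as follows. The function name parse_number and its parameters are hypothetical; the parameters are meant to play the role of DecimalSeparator and DigitGroupSeparator.

```python
from decimal import Decimal


def parse_number(text: str,
                 decimal_separator: str,
                 digit_group_separator: str = "") -> Decimal:
    """Parse a number using separators stated in the metadata instead of
    guessing them from a locale."""
    if decimal_separator == digit_group_separator:
        raise ValueError("decimal and digit group separators must differ")
    cleaned = text.strip()
    if digit_group_separator:
        cleaned = cleaned.replace(digit_group_separator, "")
    if decimal_separator != ".":
        if "." in cleaned:
            raise ValueError(f"unexpected '.' in {text!r}")
        cleaned = cleaned.replace(decimal_separator, ".")
    return Decimal(cleaned)


# The same characters mean different numbers under different rules,
# which is exactly what a bare locale such as "de" cannot settle.
print(parse_number("3,423", ","))       # prints 3.423
print(parse_number("3,423", ".", ","))  # prints 3423
```

Both "3 423,67" (decimal separator ",", group separator " ") and "3,423.67" (decimal separator ".", group separator ",") parse to the same value once the separators are given explicitly.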
>>>
>>> DISCUSSION: Does it make sense to introduce the locale attribute for
>>> controlled conditions? What is the use case?
>>> If it is introduced, a thorough description of the risks is needed in
>>> the documentation.
>>>
>>>
>>> More details can be found in these emails:
>>>
>>> attachment of
>>> http://www.icpsr.umich.edu/pipermail/ddi-srg/2007-December/002653.html
>>>
>>> references to common approaches and SPSS approach
>>> http://www.icpsr.umich.edu/pipermail/ddi-srg/2007-December/002658.html
>>>
>>> language issues
>>> http://www.icpsr.umich.edu/pipermail/ddi-srg/2007-December/002677.html
>>>
>>> Achim
>>> _______________________________________________
>>> DDI-SRG mailing list
>>> DDI-SRG at icpsr.umich.edu
>>> http://www.icpsr.umich.edu/mailman/listinfo/ddi-srg
>>>
>>>
>>>