[DDI-SRG] Data types / data formats
Wendy Thomas
wlt at pop.umn.edu
Fri Aug 3 09:42:40 EDT 2007
I'm going to add a lot of the information from Achim's email to Mantis. In
trying to do the descriptions for the NumericTypeCodeType attributes I was
finding things that were not in the W3C page that I-Lin sent. For example,
BigInteger seems to be a Java type that is the equivalent of
nonNegativeInteger, negativeInteger, nonPositiveInteger, and integer. Java
also uses BigDecimal for decimal.
http://www.w3.org/2001/sw/Europe/200301/x2r/ht/a1/standardBuiltins.xml
This is certainly not closed. If we are using W3C (or whatever we decide
to use) we should use W3C and provide the mapping. Mixing type lists just
makes a mess, as some type lists are less explicit for decimals and strings
(like Java) and others may be more explicit. We need to sort out all the
data types at various levels and make sure we are consistent.
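
To illustrate why mixing lists loses information, here is a minimal sketch that inverts a type mapping. The dictionary holds only the equivalences named in this message; the variable names are made up for illustration, and this is not a complete or official mapping.

```python
from collections import defaultdict

# XML Schema type -> Java type, limited to the pairs mentioned in this message.
XSD_TO_JAVA = {
    "integer": "BigInteger",
    "nonNegativeInteger": "BigInteger",
    "negativeInteger": "BigInteger",
    "nonPositiveInteger": "BigInteger",
    "decimal": "BigDecimal",
}

# Invert the mapping to see how many explicit types collapse into one.
java_to_xsd = defaultdict(list)
for xsd_type, java_type in XSD_TO_JAVA.items():
    java_to_xsd[java_type].append(xsd_type)

for java_type, xsd_types in java_to_xsd.items():
    print(java_type, "<-", ", ".join(xsd_types))
```

Four distinct XML Schema types collapse into one Java type here, so the mapping can be followed from the more explicit list to the less explicit one, but not back.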
Achim, thanks for continuing to pursue this. I'll try to sort something
out over the weekend and send it out for everyone to consider. If I'm
wrong in thinking we should stick to a single type tree at least in the
generalized sections (basically everything but the yet-to-be-created
software-specific physical specifications), LET ME KNOW ASAP.
Wendy
On Fri, 3 Aug 2007, Joachim Wackerow wrote:
> After talking briefly about data types yesterday, I have the impression
> that some confusion still exists. I will try to describe the difference
> between data types and data formats (with the background of social
> science data and statistical packages), and their usage in DDI.
>
>
> Data Types
>
> Data types are types for data processing and computation, which
> (primarily) take place in memory. The data types used for data processing
> in statistical packages are mostly double precision and string (and date
> types). For performance reasons these data types are stored without
> conversion in system-specific files.
>
> double precision is equivalent to:
> - the common 8-byte real in programming languages
> - the IEEE double-precision 64-bit floating-point type [IEEE 754-1985]
> - double in XML Schema data types
>
> string with defined length is equivalent to:
> - string in XML Schema data types
> - string in programming languages. But the internal representation
> varies: explicit length definition, object representation, or a special
> end-of-string character, with more variation under Unicode.
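
A small sketch of the two equivalences listed above, using only the Python standard library. It shows that a double occupies 8 bytes in the IEEE 754 layout, and that the same two-character string takes a different number of bytes depending on the encoding. The variable names are illustrative only.

```python
import struct

value = 34.6

# "<d" is the standard little-endian IEEE 754 binary64 layout: always 8 bytes.
packed = struct.pack("<d", value)
restored = struct.unpack("<d", packed)[0]
print(len(packed), restored == value)

# The in-memory or on-disk size of a string depends on its representation.
text = "DE"
ascii_size = len(text.encode("ascii"))
utf16_size = len(text.encode("utf-16-le"))
print(ascii_size, utf16_size)
```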
>
>
> Data Formats
>
> Data formats are used to describe data which is stored in files
> (external to programs, NOT in-memory). Data formats are used for
> conversion to data types for data processing. Data formats are NOT used
> for data processing such as computation. Independently of the data format,
> the files are represented in common encodings such as ASCII, EBCDIC, or Unicode.
>
>
> Statistical packages and data types
>
> SPSS and SAS know only the data types double precision and string; they
> do not use, for example, integer. Stata has a greater variety of data
> types for performance reasons because, in contrast to SPSS and SAS, Stata
> holds all of the data in memory for computation. Other, very specific
> packages such as TDA have a bit data type to process dichotomous
> variables efficiently. Date formats are converted to date/time data types
> (often again double precision) for date/time computations. All the
> packages can store their data types without conversion in system-specific
> files.
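
As a rough illustration of a date ending up as double precision: SAS, for instance, stores a date as the number of days since 1 January 1960. The helper name below is hypothetical and the sketch covers dates only, not times.

```python
from datetime import date

SAS_EPOCH = date(1960, 1, 1)

def to_sas_date(d):
    """Return the day count since the SAS epoch as a float (a double)."""
    return float((d - SAS_EPOCH).days)

print(to_sas_date(date(1960, 1, 2)))
print(to_sas_date(date(2007, 8, 3)))
```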
>
>
> DDI 3.0
>
> The data description in PhysicalDataProduct is intended for files. It is
> primarily a description of data formats (not data types for data
> processing/computation). An application can derive the appropriate data
> type (of the application) from the description of the data format and
> the minimum and maximum values.
>
> Minimum and maximum values as a property of the real data are described in:
> /PhysicalInstance/Statistics/VariableStatistics/SummaryStatistic/Value
> specified by SummaryStatisticTypeCoded: Minimum, Maximum
> (Question: should there also be a description of the minimum and maximum
> possible values as a property of the variable definition, independent of
> the real data?)
>
> The format is described in:
> /PhysicalDataProduct/GrossRecordStructure/DataItem/PhysicalLocation/ValueLocation
> or general properties like 'DecimalSeparator' in:
> /PhysicalDataProduct
>
> This assumes a data value in a rectangular file where the variables are
> organized as columns and the cases as rows. The data can be arranged in
> fixed format or in free format with a delimiter character ('Delimiter').
> The data item itself is described mainly by format ('Format') and width
> ('Width'). In the case of a real number with no explicit decimal separator,
> the decimal positions ('DecimalPositions') must be specified. The format
> specification depends on the format scheme ('FormatScheme').
> Three alternatives are described.
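
The role of width and decimal positions in a fixed-format record can be sketched as follows. The function and parameter names are hypothetical and only loosely mirror 'Width', 'DecimalPositions' and 'DecimalSeparator'; the rule that an explicit separator in the data takes precedence over the implied decimal positions is an assumption of this sketch, not something DDI prescribes.

```python
from decimal import Decimal

def read_field(record, start, width, decimal_positions=0, decimal_separator="."):
    """Convert one fixed-width field (0-based start) to a Decimal."""
    raw = record[start:start + width].strip()
    if decimal_separator in raw:
        # Explicit separator present: normalize it to "." and convert directly.
        return Decimal(raw.replace(decimal_separator, "."))
    # No separator: shift the decimal point left by the implied positions.
    return Decimal(raw).scaleb(-decimal_positions)

# One record holding the values 67, 34.6 and 346 (one implied decimal).
record = "6734.6346"
print(read_field(record, 0, 2))
print(read_field(record, 2, 4))
print(read_field(record, 6, 3, decimal_positions=1))
```

The second and third fields yield the same number even though their stored forms differ, which is exactly the information the format description has to carry.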
>
>
> Examples:
>
> two-digit integer number: 67
> Format: Integer (value from an assumed very general format scheme)
> Format: 2. (SAS format scheme)
> Format: integer with the constraint totalDigits:2 (XML Schema format scheme)
> Width: 2
> DecimalPositions: 0
>
> 4-digit real number with explicit decimal separator: 34.6
> Format: Real (value from an assumed very general format scheme)
> Format: 4. (SAS format scheme)
> Format: decimal with the constraints totalDigits:4, fractionDigits:1
> (XML Schema format scheme)
> Width: 4
> DecimalSeparator: Dot
> With decimal (XML Schema) it does not seem possible to describe the
> same number with the German decimal separator (34,6).
> With this XML Schema description a value of "2.75" would not be allowed,
> but with the other formats it is a legal value.
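
The two observations about XML Schema decimal can be checked with a small sketch. It is a simplified reading of the totalDigits and fractionDigits facets, not a schema validator, and the function name is made up. Note that totalDigits counts digits only, so it is not the same thing as 'Width', which also counts the separator.

```python
import re
from decimal import Decimal

# Lexical form of xs:decimal: optional sign, digits, "." as the only separator.
XSD_DECIMAL = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)")

def satisfies_facets(lexical, total_digits, fraction_digits):
    """Check a literal against totalDigits and fractionDigits (simplified)."""
    if not XSD_DECIMAL.fullmatch(lexical):
        return False
    value = Decimal(lexical).normalize()          # drops trailing zeros
    needed = max(0, -value.as_tuple().exponent)   # fraction digits actually needed
    unscaled = abs(value.scaleb(needed))          # the value as a whole number
    return needed <= fraction_digits and unscaled < Decimal(10) ** total_digits

for literal in ("34.6", "2.75", "34,6"):
    print(literal, satisfies_facets(literal, total_digits=4, fraction_digits=1))
```

Under these facets "34.6" is accepted, "2.75" is rejected for having two fraction digits, and "34,6" is rejected because the comma is not part of the lexical form at all.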
>
> 3-digit real number without explicit decimal separator: 346
> Format: Real (value from an assumed very general format scheme)
> Format: 3.1 (SAS format scheme)
> Format: decimal with the constraints totalDigits:3, fractionDigits:1 ??
> (XML Schema format scheme)
> Width: 3
> DecimalPositions: 1
> With decimal (XML Schema) it does not seem possible to describe the
> format appropriately.
>
> string with length 2: DE
> Format: String (value from an assumed very general format scheme)
> Format: $CHAR2. (SAS format scheme)
> Format: string with the constraint length:2 (XML Schema format scheme)
> Width: 2
>
> Depending on the format and the width, an application can derive the
> appropriate data type. Further optimization for numeric data types is
> possible on the basis of the description of the value (code) range. This
> can be based on the definition of the range in a variable description
> (not available yet?) or on the definition of the range of the actual data.
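
A sketch of how an application might derive its own data type from format, width and value range. The type names (int8, int16, ...) are generic placeholders rather than the types of any particular package, the format values are the "very general" ones from the examples, and the function name and fallback rules are assumptions.

```python
def derive_type(fmt, width, decimal_positions=0, minimum=None, maximum=None):
    """Pick a storage type from a format description (illustrative only)."""
    if fmt == "String":
        return f"string[{width}]"
    if fmt == "Real" or decimal_positions > 0:
        return "double"
    if minimum is None or maximum is None:
        # No range given: assume the widest values that fit in `width`
        # characters, with one character reserved for a minus sign.
        maximum = 10 ** width - 1
        minimum = -(10 ** (width - 1) - 1)
    for name, bits in (("int8", 8), ("int16", 16), ("int32", 32), ("int64", 64)):
        if -(2 ** (bits - 1)) <= minimum and maximum <= 2 ** (bits - 1) - 1:
            return name
    return "double"  # too large for int64; exactness is lost in this fallback

print(derive_type("Integer", 2))
print(derive_type("Integer", 5, minimum=0, maximum=40000))
print(derive_type("Real", 4))
print(derive_type("String", 2))
```

With a known range the choice can be tighter than the width alone allows, which is the optimization described above.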
>
> In addition to the data format a data type can be described
> ('DataType'), which can be necessary in special cases; in general it
> seems to be more of a convenience. This field can also be used to describe
> system-specific data types, on the basis of which system-specific
> storage applications can be built. (I'm not sure whether replacing
> PhysicalDataProduct with a system-specific DDI module is a suitable
> alternative.)
>
>
> Note on DecimalSeparator (same on GroupingSeparator)
>
> From field-level documentation: "The decimal separator definition only
> makes sense with some XML Schema primitives."
> This is confusing. The decimal separator only makes sense with numeric
> data (real values) that are represented as strings (XSD string) with an
> explicit decimal separator.
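
A minimal sketch of the point above: the separators matter only while the number is still a string. The function and parameter names are hypothetical and simply echo 'DecimalSeparator' and 'GroupingSeparator'.

```python
from decimal import Decimal

def parse_number(text, decimal_separator=".", grouping_separator=None):
    """Turn a stored numeric string into a Decimal using the given separators."""
    text = text.strip()
    if grouping_separator:
        text = text.replace(grouping_separator, "")
    return Decimal(text.replace(decimal_separator, "."))

print(parse_number("34.6"))
print(parse_number("34,6", decimal_separator=","))
print(parse_number("1.234,6", decimal_separator=",", grouping_separator="."))
```

Once converted, the value carries no separator information at all, so the separator is a property of the stored format rather than of the data type.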
>
>
> SDMX data types
>
> If I remember correctly, these data types are aimed at data
> processing and computation. No legacy data formats are included.
>
> Arofan: please send the document on SDMX data types again; I couldn't
> find the file.
>
>
> Conclusion
>
> XML Schema data types seem to make no distinction between the usual
> representations of numeric decimal data with and without a decimal
> separator (number 34.6 represented as 34.6 or 346). They also do not
> cover binary legacy data types, which are important for the archives.
>
> It seems to be difficult to achieve a general format scheme which covers
> all possibilities. The SAS scheme is a good candidate, but would be
> vendor-dependent. Perhaps this list can be abstracted further and thereby
> made vendor-independent.
>
>
> Comments, other views?
>
> Achim
>
>
> SAS format scheme
>
> SAS informats have the following form:
> <$>informat<w>.<d> where
> - $ indicates a character informat; its absence indicates a numeric
> informat.
> - informat names the informat
> - w specifies the informat width, which for most informats is the number
> of columns in the input data.
> - d specifies an optional decimal scaling factor in the numeric informats
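
The form above can be taken apart with a short parser. This is a simplified reading of the pattern as quoted here, not the full SAS grammar; the class and function names are made up, and the assumption that an informat name does not end in a digit is what lets the name be separated from the width.

```python
import re
from typing import NamedTuple, Optional

class Informat(NamedTuple):
    character: bool          # True when the leading $ is present
    name: str                # empty when no informat name is given
    width: Optional[int]     # w
    decimals: Optional[int]  # d

_PATTERN = re.compile(
    r"(\$)?"                                    # $ marks a character informat
    r"([A-Za-z_](?:[A-Za-z0-9_]*[A-Za-z_])?)?"  # name, assumed not to end in a digit
    r"([0-9]+)?"                                # w
    r"\."                                       # the period
    r"([0-9]+)?"                                # d
)

def parse_informat(spec):
    match = _PATTERN.fullmatch(spec)
    if match is None:
        raise ValueError(f"not of the form <$>informat<w>.<d>: {spec!r}")
    dollar, name, w, d = match.groups()
    return Informat(dollar is not None, name or "",
                    int(w) if w else None, int(d) if d else None)

for spec in ("2.", "3.1", "$CHAR2."):
    print(spec, parse_informat(spec))
```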
>
Wendy L. Thomas Phone: +1 612.624.4389
Data Access Core Director Fax: +1 612.626.8375
Minnesota Population Center Email: wlt at pop.umn.edu
University of Minnesota
50 Willey Hall
225 19th Avenue South
Minneapolis, MN 55455