[DDI-SRG] Data types / data formats
Joachim Wackerow
joachim.wackerow at gesis.org
Fri Aug 3 07:37:41 EDT 2007
After our brief discussion of data types yesterday, I have the impression
that some confusion remains. I will try to describe the difference
between data types and data formats (against the background of social
science data and statistical packages), and their usage in DDI.
Data Types
Data types are types for data processing and computation, which
(primarily) take place in memory. The data types used for data processing
in statistical packages are mostly double precision and string (plus date
types). For performance reasons, these data types are stored without
conversion in system-specific files.
double precision is equivalent to:
- the common 8-byte real in programming languages
- IEEE double-precision 64-bit floating point type [IEEE 754-1985]
- double in XML Schema data types
string with defined length is equivalent to:
- string in XML Schema data types
- string in programming languages. But the internal representation varies:
explicit length definition, object representation, or a special
end-of-string character, with more variation under Unicode.
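To make the in-memory side concrete, here is a minimal Python sketch. It assumes the standard struct module's "d" code, which packs an IEEE 754 64-bit double, and it pads strings with ASCII blanks; the helper names are hypothetical and not part of DDI or of any statistical package.

```python
import struct


def double_to_bytes(value: float) -> bytes:
    """Pack a float as a little-endian IEEE 754 double (always 8 bytes)."""
    return struct.pack("<d", value)


def bytes_to_double(raw: bytes) -> float:
    """Inverse of double_to_bytes: no text conversion is involved."""
    return struct.unpack("<d", raw)[0]


def fixed_length_string(text: str, length: int) -> bytes:
    """Encode text as ASCII in exactly `length` bytes, padded with blanks.

    Blank padding is only one of the possible internal representations
    (assumption made for this sketch).
    """
    encoded = text.encode("ascii")
    if len(encoded) > length:
        raise ValueError(f"{text!r} does not fit into {length} bytes")
    return encoded.ljust(length, b" ")


raw = double_to_bytes(34.6)
print(len(raw))                       # 8
print(bytes_to_double(raw) == 34.6)   # True
print(fixed_length_string("DE", 4))   # b'DE  '
```

Writing such bytes directly to a file is roughly what "stored without conversion" means for a system-specific file.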
Data Formats
Data formats are used to describe data which is stored in files
(external to programs, NOT in-memory). Data formats are used for
conversion to data types for data processing. Data formats are NOT used
for data processing such as computation. Independently of the data format,
the files are represented in common character encodings such as ASCII,
EBCDIC, or Unicode.
Statistical packages and data types
SPSS and SAS know only the data types double precision and string; they
do not use, for example, integer. Stata has a greater variety of data types
for performance reasons because, in contrast to SPSS and SAS, Stata
holds the whole data set in memory for computation. Other, very specific
packages like TDA have a bit data type to process dichotomous variables
efficiently. Date formats are converted to date/time data types (often
again double precision) for date/time computations. All the packages can
store their data types without conversion in system-specific files.
DDI 3.0
The data description in PhysicalDataProduct is intended for files. It is
primarily a description of data formats (not data types for data
processing/computation). An application can derive the appropriate data
type (of the application) from the description of the data format and
the minimum and maximum values.
Minimum and maximum values as a property of the actual data are described in:
/PhysicalInstance/Statistics/VariableStatistics/SummaryStatistic/Value
specified by SummaryStatisticTypeCoded: Minimum, Maximum
(Question: should it also be possible to describe minimum and maximum
as a property of the variable definition, independent of the
actual data?)
The format is described in:
/PhysicalDataProduct/GrossRecordStructure/DataItem/PhysicalLocation/ValueLocation
or general properties like 'DecimalSeparator' in:
/PhysicalDataProduct
This assumes a data value in a rectangular file where the variables are
organized as columns and the cases as rows. The data can be arranged in
fixed format or in free format with a delimiter character ('Delimiter').
The data item itself is described mainly by format ('Format') and width
('Width'). In the case of a real number with no explicit decimal separator,
the decimal positions ('DecimalPositions') must be specified. The format
specification depends on the format scheme ('FormatScheme').
Three alternative schemes are shown in the examples.
Examples:
two-digit integer number: 67
Format: Integer (value from an assumed very general format scheme)
Format: 2. (SAS format scheme)
Format: integer with the constraint totalDigits:2 (XML Schema format scheme)
Width: 2
DecimalPositions: 0
4-digit real number with explicit decimal separator: 34.6
Format: Real (value from an assumed very general format scheme)
Format: 4. (SAS format scheme)
Format: decimal with the constraints totalDigits:4, fractionDigits:1
(XML Schema format scheme)
Width: 4
DecimalSeparator: Dot
With decimal (XML Schema) it does not seem possible to describe the
same number with the German decimal separator (34,6).
With this XML Schema description a value of "2.75" would not be allowed,
but with the other formats it is a legal value.
3-digit real number without explicit decimal separator: 346
Format: Real (value from an assumed very general format scheme)
Format: 3.1 (SAS format scheme)
Format: decimal with the constraints totalDigits:3, fractionDigits:1 ??
(XML Schema format scheme)
Width: 3
DecimalPositions: 1
With decimal (XML Schema) it does not seem possible to describe the
format appropriately.
string with length 2: DE
Format: String (value from an assumed very general format scheme)
Format: $CHAR2. (SAS format scheme)
Format: string with the constraint length:2 (XML Schema format scheme)
Width: 2
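As an illustration of how an application might use these properties, the following Python sketch reads the four example values from one fixed-format record. It is only a sketch under assumptions: the Format values "Integer", "Real" and "String" come from the assumed general format scheme, the separator name "Comma" and the field "start" (a 0-based column offset) are hypothetical, and the class is not a DDI structure.

```python
from dataclasses import dataclass
from typing import Optional, Union

# 'Dot' appears in the examples; 'Comma' is an assumed name for the
# German decimal separator.
SEPARATORS = {"Dot": ".", "Comma": ","}


@dataclass
class DataItem:
    start: int                                # hypothetical 0-based column offset
    width: int                                # 'Width'
    fmt: str                                  # 'Format' (assumed general scheme)
    decimal_positions: int = 0                # 'DecimalPositions'
    decimal_separator: Optional[str] = None   # 'DecimalSeparator'


def read_item(record: str, item: DataItem) -> Union[int, float, str]:
    """Convert one field of a fixed-format record to an in-memory value."""
    raw = record[item.start:item.start + item.width]
    if item.fmt == "String":
        return raw
    if item.fmt == "Integer":
        return int(raw)
    if item.fmt == "Real":
        if item.decimal_separator is not None:
            # Explicit separator in the file: normalise it to a dot.
            sep = SEPARATORS[item.decimal_separator]
            return float(raw.replace(sep, "."))
        # Implied decimal point: scale by the number of decimal positions.
        return int(raw) / 10 ** item.decimal_positions
    raise ValueError(f"unknown format: {item.fmt!r}")


record = "6734.6346DE"
items = [
    DataItem(start=0, width=2, fmt="Integer"),
    DataItem(start=2, width=4, fmt="Real", decimal_separator="Dot"),
    DataItem(start=6, width=3, fmt="Real", decimal_positions=1),
    DataItem(start=9, width=2, fmt="String"),
]
print([read_item(record, item) for item in items])
# [67, 34.6, 34.6, 'DE']
```

Both representations of the real number (34.6 and 346) end up as the same in-memory value; only the description of the file differs.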
Depending on the format and the width, an application can derive the
appropriate data type. Further optimization for numeric data types is
possible on the basis of the description of the value (code) range. This
can be based on the definition of the range in a variable description
(not available yet?) or on the definition of the range of the actual data.
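A rough Python sketch of such a derivation follows. The type names (int8 to int64, double, string[n]) are generic placeholders rather than the types of any particular package, and the rule of reserving one column for a sign when no range is known is an assumption of this sketch.

```python
from typing import Optional


def smallest_integer_type(minimum: int, maximum: int) -> str:
    """Return the narrowest signed integer type that holds the range."""
    for name, bits in (("int8", 8), ("int16", 16), ("int32", 32), ("int64", 64)):
        low, high = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
        if low <= minimum and maximum <= high:
            return name
    return "double"  # fallback for very wide fields; may lose precision


def derive_type(fmt: str, width: int,
                minimum: Optional[int] = None,
                maximum: Optional[int] = None) -> str:
    """Derive an application data type from format, width and value range."""
    if fmt == "String":
        return f"string[{width}]"
    if fmt == "Real":
        return "double"
    if fmt == "Integer":
        if minimum is None or maximum is None:
            # No range known: use the widest values the columns can hold,
            # reserving one column for a minus sign (assumption).
            maximum = 10 ** width - 1
            minimum = -(10 ** (width - 1) - 1)
        return smallest_integer_type(minimum, maximum)
    raise ValueError(f"unknown format: {fmt!r}")


print(derive_type("Integer", 2))             # int8
print(derive_type("Integer", 5))             # int32
print(derive_type("Integer", 5, 0, 100))     # int8
print(derive_type("String", 2))              # string[2]
```

The third call shows the optimization: with a known range of 0 to 100, a five-column integer field fits into a much smaller type than the width alone would suggest.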
In addition to the data format, a data type can be described
('DataType'), which can be necessary in special cases; in general it
seems to be more of a convenience. This field can also be used to describe
system-specific data types, on the basis of which system-specific
storage applications can be built. (I'm not sure if the replacement of
PhysicalDataProduct with a system-specific DDI module is a suitable
alternative.)
Note on DecimalSeparator (same on GroupingSeparator)
From field-level documentation: "The decimal separator definition only
makes sense with some XML Schema primitives."
This is confusing. The decimal separator only makes sense with numeric
data (real values) which are represented as strings (XSD string) with an
explicit decimal separator.
SDMX data types
If I remember correctly, these data types are aimed at data
processing and computation. No legacy data formats are included.
Arofan: please send the document on SDMX data types again; I could not
find the file.
Conclusion
XML Schema data types seem to make no distinction between the usual
representations of numeric decimal data with and without a decimal
separator (number 34.6 represented as 34.6 or 346). They also do not
cover binary legacy data types, which are important for the archives.
It seems difficult to achieve a general format scheme which covers
all possibilities. The SAS scheme is a good candidate, but would be
vendor-dependent. Perhaps this list can be abstracted further and in
this way made vendor-independent.
Comments, other views?
Achim
SAS format scheme
SAS informats have the following form:
<$>informat<w>.<d> where
- $ indicates a character informat; its absence indicates a numeric
informat.
- informat names the informat
- w specifies the informat width, which for most informats is the number
of columns in the input data.
- d specifies an optional decimal scaling factor in the numeric informats
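The form <$>informat<w>.<d> can be split into its parts with a small regular expression. The Python sketch below is an illustration only: it assumes informat names consist of letters and underscores (real SAS names can be more varied), and the class and function names are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

# <$>informat<w>.<d>: optional '$', optional name, optional width,
# a mandatory period, optional decimal scaling factor.
_INFORMAT = re.compile(r"^(\$?)([A-Za-z_]*)(\d*)\.(\d*)$")


@dataclass
class Informat:
    is_character: bool        # True if the specification starts with '$'
    name: str                 # informat name, empty for the plain w.d form
    width: Optional[int]      # w
    decimals: Optional[int]   # d


def parse_informat(spec: str) -> Informat:
    """Split a specification of the form <$>informat<w>.<d> into parts."""
    match = _INFORMAT.match(spec.strip())
    if match is None:
        raise ValueError(f"not of the form <$>informat<w>.<d>: {spec!r}")
    dollar, name, width, decimals = match.groups()
    return Informat(
        is_character=(dollar == "$"),
        name=name.upper(),
        width=int(width) if width else None,
        decimals=int(decimals) if decimals else None,
    )


print(parse_informat("3.1"))
# Informat(is_character=False, name='', width=3, decimals=1)
print(parse_informat("$CHAR2."))
# Informat(is_character=True, name='CHAR', width=2, decimals=None)
```

A vendor-independent scheme could keep the same four pieces of information (character or numeric, kind, width, decimals) while replacing the SAS-specific names.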