[DDI-SRG] Data types / data formats

Joachim Wackerow joachim.wackerow at gesis.org
Fri Aug 3 07:37:41 EDT 2007


After yesterday's brief discussion of data types I have the impression 
that some confusion still exists. I will try to describe the difference 
between data types and data formats (against the background of social 
science data and statistical packages), and their usage in DDI.


Data Types

Data types are types for data processing and computation, which 
(primarily) take place in memory. The data types used for data 
processing in statistical packages are mostly double precision and 
string (plus date types). For performance reasons these data types are 
stored without conversion in system-specific files.

double precision is equivalent to:
- common 8-byte real in programming languages
- IEEE double-precision 64-bit floating point type [IEEE 754-1985]
- double in XML Schema data types
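
As a small illustration of the 8-byte equivalence, the following Python 
sketch packs a value as an IEEE 754 double with the standard struct 
module. The sample value is arbitrary, and the byte order chosen here 
(little-endian) is only an assumption for the example.

```python
import struct

value = 34.6  # arbitrary sample value

# "<d" = little-endian IEEE 754 double precision (binary64)
raw = struct.pack("<d", value)

print(len(raw))                               # bytes used by one double
print(struct.unpack("<d", raw)[0] == value)   # round trip without conversion loss
```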

string with defined length is equivalent to:
- string in XML Schema data types
- string in programming languages. The internal representation varies, 
however: explicit length definition, object representation or a special 
end-of-string character, with more variation under Unicode.


Data Formats

Data formats describe data which is stored in files (external to 
programs, NOT in memory). Data formats are used for conversion to data 
types for data processing. Data formats are NOT used for data 
processing such as computation. Independently of the data format, the 
files are represented in common codings like ASCII, EBCDIC, Unicode.
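
The separation between coding, data format and data type can be sketched 
in Python as follows. The same external data item is stored in two 
codings and converted to the same in-memory double. The helper name 
read_real is hypothetical, and cp037 is just one of several EBCDIC code 
pages, chosen here as an assumption.

```python
# The same external data item (a field of four characters) in two codings.
text_field = "34.6"
ascii_bytes = text_field.encode("ascii")
ebcdic_bytes = text_field.encode("cp037")  # one common EBCDIC code page

def read_real(raw: bytes, coding: str) -> float:
    """Decode the stored characters, then convert them to an in-memory double."""
    return float(raw.decode(coding))

# The stored bytes differ, the resulting data type value does not.
print(ascii_bytes != ebcdic_bytes)
print(read_real(ascii_bytes, "ascii") == read_real(ebcdic_bytes, "cp037"))
```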


Statistical packages and data types

SPSS and SAS know only the data types double precision and string; they 
do not use, for example, integer. Stata has a greater variety of data 
types for performance reasons because, in contrast to SPSS and SAS, 
Stata holds the whole data set in memory for computation. Other, very 
specific packages like TDA know a bit data type to process dichotomous 
variables efficiently. Date formats are converted to date/time data 
types (often again double precision) for date/time computations. All 
the packages can store their data types without conversion in 
system-specific files.
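
A sketch of how a date format ends up as a double precision value. The 
epoch and unit used here (days since 1 January 1960) follow the SAS 
convention for date values; other packages use other epochs and units. 
The function name and the input pattern are hypothetical.

```python
from datetime import date, datetime

SAS_EPOCH = date(1960, 1, 1)  # SAS date values count days from this day

def date_string_to_number(text: str, pattern: str = "%Y-%m-%d") -> float:
    """Convert a date in an external format to days since the epoch, as a double."""
    parsed = datetime.strptime(text, pattern).date()
    return float((parsed - SAS_EPOCH).days)

start = date_string_to_number("2007-08-02")
end = date_string_to_number("2007-08-03")

# Date computation is now ordinary numeric computation.
print(end - start)
```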


DDI 3.0

The data description in PhysicalDataProduct is intended for files. It is 
primarily a description of data formats (not data types for data 
processing/computation). An application can derive the appropriate data 
type (of the application) from the description of the data format and 
the minimum and maximum values.

Minimum and maximum values as a property of the real data are described in:
/PhysicalInstance/Statistics/VariableStatistics/SummaryStatistic/Value
specified by SummaryStatisticTypeCoded: Minimum, Maximum
(Question: should there be, in addition, a description of the minimum 
and maximum possible values as a property of the variable definition, 
independent of the real data?)

The format is described in:
/PhysicalDataProduct/GrossRecordStructure/DataItem/PhysicalLocation/ValueLocation
or general properties like 'DecimalSeparator' in:
/PhysicalDataProduct

This assumes a data value in a rectangular file where the variables are 
organized as columns and the cases as rows. The data can be arranged in 
fixed format or in free format with a delimiter character ('Delimiter'). 
The data item itself is described mainly by format ('Format') and width 
('Width'). In the case of a real number with no explicit decimal 
separator the decimal positions ('DecimalPositions') must be specified. 
The format specification depends on the format scheme ('FormatScheme'). 
Three alternative schemes are described.
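
To make the two arrangements concrete, here is a minimal Python sketch 
that cuts data items out of a fixed-format record by position and width, 
and out of a free-format record by delimiter. The layout, the variable 
names and the sample records are invented for illustration and are not 
taken from DDI.

```python
# Hypothetical layout: (variable name, start column counted from 0, width)
LAYOUT = [("age", 0, 2), ("score", 2, 3), ("country", 5, 2)]

def read_fixed(record: str) -> dict:
    """Fixed format: each data item is located by start position and width."""
    return {name: record[start:start + width] for name, start, width in LAYOUT}

def read_delimited(record: str, delimiter: str = ";") -> dict:
    """Free format: data items are separated by a delimiter character."""
    names = [name for name, _, _ in LAYOUT]
    return dict(zip(names, record.split(delimiter)))

# One case (row) with three variables (columns); values are still raw strings.
print(read_fixed("67346DE"))
print(read_delimited("67;34.6;DE"))
```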


Examples:

two-digit integer number: 67
Format: Integer (value from an assumed very general format scheme)
Format: 2. (SAS format scheme)
Format: integer with the constraint totalDigits:2 (XML Schema format scheme)
Width: 2
DecimalPositions: 0

real number of width 4 with explicit decimal separator: 34.6
Format: Real (value from an assumed very general format scheme)
Format: 4. (SAS format scheme)
Format: decimal with the constraints totalDigits:4, fractionDigits:1
  (XML Schema format scheme)
Width: 4
DecimalSeparator: Dot
With decimal (XML Schema) it does not seem possible to describe the 
same number with the German decimal separator (34,6).
With this XML Schema description a value of "2.75" would not be allowed, 
but with the other formats it is a legal value.
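
The two remarks on decimal (XML Schema) can be checked with a small 
Python sketch. It is a simplified reading of the totalDigits and 
fractionDigits facets, not a full XML Schema validator; the function 
name is hypothetical.

```python
import re
from decimal import Decimal

# Lexical form of xs:decimal: optional sign, digits, "." as the only separator
XSD_DECIMAL = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)")

def satisfies_facets(lexical: str, total_digits: int, fraction_digits: int) -> bool:
    """Check a literal against totalDigits and fractionDigits (simplified)."""
    if not XSD_DECIMAL.fullmatch(lexical):
        return False  # e.g. a comma is not part of the lexical form
    _, digits, exponent = Decimal(lexical).normalize().as_tuple()
    fraction = max(0, -exponent)
    total = max(len(digits) + max(0, exponent), fraction)
    return total <= total_digits and fraction <= fraction_digits

print(satisfies_facets("34.6", 4, 1))   # the example value
print(satisfies_facets("2.75", 4, 1))   # two fraction digits
print(satisfies_facets("34,6", 4, 1))   # German decimal separator
```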

3-digit real number without explicit decimal separator: 346
Format: Real (value from an assumed very general format scheme)
Format: 3.1 (SAS format scheme)
Format: decimal with the constraints totalDigits:3, fractionDigits:1 ??
  (XML Schema format scheme)
Width: 3
DecimalPositions: 1
With decimal (XML Schema) it does not seem possible to describe this 
format appropriately.
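
How an application might apply 'DecimalPositions' to such a value can be 
sketched in Python. The function name is hypothetical; the decimal 
module is used so that the scaling is exact.

```python
from decimal import Decimal

def apply_decimal_positions(raw: str, decimal_positions: int) -> Decimal:
    """Interpret a digit string that has no explicit decimal separator."""
    # Shift the implied decimal point to the left by decimal_positions.
    return Decimal(raw.strip()).scaleb(-decimal_positions)

print(apply_decimal_positions("346", 1))
```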

string with length 2: DE
Format: String (value from an assumed very general format scheme)
Format: $CHAR2. (SAS format scheme)
Format: string with the constraint length:2 (XML Schema format scheme)
Width: 2

Depending on the format and the width, an application can derive the 
appropriate data type. Further optimization for numeric data types is 
possible on the basis of the description of the value (code) range. This 
can be based on the definition of the range in a variable description 
(not available yet?) or on the definition of the range of the actual data.
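
A minimal sketch of such a derivation in Python. The format values are 
those of the assumed general format scheme used in the examples; the 
returned type names are hypothetical labels for application data types, 
and the selection rules are only one possible choice.

```python
def derive_type(fmt: str, width: int, minimum=None, maximum=None) -> str:
    """Pick an application data type from format, width and optional value range."""
    if fmt == "String":
        return f"str{width}"
    if fmt == "Real":
        return "double"
    if fmt == "Integer":
        if minimum is None or maximum is None:
            # No range described: fall back on what the width can hold
            # (one position may be taken by a minus sign).
            minimum, maximum = -(10 ** (width - 1)) + 1, 10 ** width - 1
        for name, bits in (("int8", 8), ("int16", 16), ("int32", 32)):
            if -(2 ** (bits - 1)) <= minimum and maximum <= 2 ** (bits - 1) - 1:
                return name
        return "double"
    raise ValueError(f"unknown format: {fmt}")

print(derive_type("Integer", 3))           # derived from the width only
print(derive_type("Integer", 3, 0, 100))   # optimized using the value range
```

With the range of the actual data (0 to 100) a smaller integer type is 
sufficient than the width of 3 alone would suggest.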

In addition to the data format a data type can be described 
('DataType'), which can be necessary in special cases; in general it 
seems to be more of a convenience. This field can also be used to 
describe system-specific data types, on the basis of which 
system-specific storage applications can be built. (I'm not sure 
whether the replacement of PhysicalDataProduct with a system-specific 
DDI module is a suitable alternative.)


Note on DecimalSeparator (the same applies to GroupingSeparator)

From the field-level documentation: "The decimal separator definition 
only makes sense with some XML Schema primitives."
This is confusing. The decimal separator only makes sense with numeric 
data (real values) which are represented as strings (XSD string) with an 
explicit decimal separator.
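
In other words, the separators matter at the point where a string is 
converted to a numeric data type. A Python sketch, assuming the 
separator values 'Dot' and 'Comma' ('Dot' appears in the example above, 
'Comma' is my assumption) and a hypothetical function name:

```python
SEPARATORS = {"Dot": ".", "Comma": ","}

def parse_real(text, decimal_separator="Dot", grouping_separator=None):
    """Convert a string with explicit separators to an in-memory double."""
    if grouping_separator is not None:
        # Grouping characters carry no numeric information; drop them.
        text = text.replace(SEPARATORS[grouping_separator], "")
    # Normalize the decimal separator to the one float() expects.
    text = text.replace(SEPARATORS[decimal_separator], ".")
    return float(text)

print(parse_real("34.6"))
print(parse_real("34,6", decimal_separator="Comma"))
print(parse_real("1.234,5", decimal_separator="Comma", grouping_separator="Dot"))
```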


SDMX data types

If I remember correctly, these data types are aimed at data 
processing and computation. No legacy data formats are included.

Arofan: please send the document on SDMX data types again; I couldn't 
find the file.


Conclusion

XML Schema data types seem to make no distinction between the usual 
representations of numeric decimal data with and without a decimal 
separator (number 34.6 represented as 34.6 or 346). They also do not 
cover binary legacy data types, which are important for the archives.

It seems difficult to achieve a general format scheme which covers 
all possibilities. The SAS scheme is a good candidate, but would be 
vendor-dependent. Perhaps this list can be abstracted further and in 
this way made vendor-independent.


Comments, other views?

Achim


SAS format scheme

SAS informats have the following form:
<$>informat<w>.<d> where
- $ indicates a character informat; its absence indicates a numeric 
informat.
- informat names the informat
- w specifies the informat width, which for most informats is the number 
of columns in the input data.
- d specifies an optional decimal scaling factor in the numeric informats
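
The form above can be taken apart with a regular expression. The 
following Python sketch is a simplified parser written for this note; it 
does not check the name against the informats SAS actually provides, and 
the function and field names are hypothetical.

```python
import re

# <$>informat<w>.<d>
INFORMAT = re.compile(
    r"^(?P<char>\$)?"                          # $ marks a character informat
    r"(?P<name>[A-Za-z_][A-Za-z0-9_]*?)?"      # informat name, may be absent
    r"(?P<w>[0-9]+)?"                          # width
    r"\."                                      # the period is always present
    r"(?P<d>[0-9]+)?$"                         # optional decimal scaling factor
)

def parse_informat(spec: str) -> dict:
    """Split an informat specification into its parts."""
    match = INFORMAT.match(spec)
    if match is None:
        raise ValueError(f"not an informat specification: {spec!r}")
    return {
        "character": match.group("char") is not None,
        "name": match.group("name") or "",
        "width": int(match.group("w")) if match.group("w") else None,
        "decimals": int(match.group("d")) if match.group("d") else None,
    }

print(parse_informat("3.1"))
print(parse_informat("$CHAR2."))
```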


