[DDI-SRG] Data types / data formats
Joachim Wackerow
joachim.wackerow at gesis.org
Fri Aug 3 10:54:36 EDT 2007
Wendy,
Here are two pointers on C and Java data formats (for converting
external representations to program-internal data types):
C function scanf:
http://www.openbsd.org/cgi-bin/man.cgi?query=scanf&sektion=3&arch=&apropos=0&manpath=OpenBSD+Current
Java Scanner class:
http://java.sun.com/javase/6/docs/api/java/util/Scanner.html
This is from my email in January:
http://www.icpsr.umich.edu/pipermail/ddi-srg/2007-January/001976.html
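To illustrate what "conversion from external representations to program
internal data types" means in practice, here is a minimal sketch in
Python. The function name parse_field and its parameters are
hypothetical and chosen only for this illustration; the format names
(Integer, Real, String), the decimal positions and the decimal separator
follow the properties discussed in the quoted message below. It is not
taken from scanf, Scanner, or any DDI tool.

```python
from decimal import Decimal


def parse_field(text, fmt, decimal_positions=0, decimal_separator="."):
    """Convert one external field (text read from a file) to an internal value.

    Hypothetical helper: 'fmt' is one of "Integer", "Real", "String".
    """
    if fmt == "String":
        # Strings are kept as they are stored in the file.
        return text
    raw = text.strip()
    if fmt == "Integer":
        return int(raw)
    if fmt == "Real":
        if decimal_separator in raw:
            # Explicit separator, e.g. "34.6" or "34,6".
            return Decimal(raw.replace(decimal_separator, "."))
        # No separator in the file: shift by the implied decimal positions,
        # e.g. "346" with one decimal position becomes 34.6.
        return Decimal(int(raw)).scaleb(-decimal_positions)
    raise ValueError("unknown format: " + fmt)


print(parse_field("67", "Integer"))
print(parse_field("34,6", "Real", decimal_separator=","))
print(parse_field("346", "Real", decimal_positions=1))
```

The same stored characters ("346") yield different internal values
depending on the format description, which is why the description has to
travel with the file.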
Just to clarify: Java BigInteger etc. are data types for
arbitrary-precision mathematical computation; they are very seldom used
in data storage. A Java int (32 bit) already has a large range: a
minimum value of -2,147,483,648 and a maximum value of 2,147,483,647.
http://java.sun.com/docs/books/tutorial/java/nutsandbolts/datatypes.html
http://java.sun.com/javase/6/docs/api/java/math/package-summary.html
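As a rough sketch of how an application could pick an integer type from
the minimum and maximum values of a variable: the function below checks
the ranges of the signed Java integer types (8, 16, 32 and 64 bit) and
falls back to BigInteger only when none of them fits. The function name
and the table are assumptions made for this example.

```python
# Signed Java integer types and their widths in bits.
INT_TYPES = [("byte", 8), ("short", 16), ("int", 32), ("long", 64)]


def smallest_integer_type(minimum, maximum):
    """Return the name of the smallest Java integer type holding the range."""
    for name, bits in INT_TYPES:
        low = -(2 ** (bits - 1))
        high = 2 ** (bits - 1) - 1
        if low <= minimum and maximum <= high:
            return name
    # Only values beyond 64 bit need an arbitrary-precision type.
    return "BigInteger"


print(smallest_integer_type(0, 99))
print(smallest_integer_type(-2147483648, 2147483647))
print(smallest_integer_type(0, 2147483648))
```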
In general I think we cannot cover every conceivable data format and
data type. What matters is to have a common list of data formats that
are really used, and a structure that can be extended with exotic
formats and types: data formats depending on a FormatScheme, data types
depending on a type scheme (which is currently not available).
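One way to picture "data format depending on a FormatScheme" is a small
registry in which each scheme contributes its own reader for format
strings, so that an exotic scheme can be added without touching the
others. This is only a sketch under my own assumptions: the registry,
the function names and the neutral (kind, width, decimals) result are
invented for illustration, and the SAS reader handles only the simple
<$>informat<w>.<d> form with informat names made of letters.

```python
import re

# Hypothetical registry: scheme name -> reader for that scheme's format strings.
FORMAT_SCHEMES = {}


def register_scheme(name):
    """Decorator that adds a reader function to the registry."""
    def decorator(func):
        FORMAT_SCHEMES[name] = func
        return func
    return decorator


@register_scheme("General")
def read_general(fmt, width=None, decimals=0):
    # A very general scheme: the format name itself is the kind.
    return (fmt.lower(), width, decimals)


@register_scheme("SAS")
def read_sas(fmt, width=None, decimals=0):
    # Simplified <$>informat<w>.<d>; names containing digits are not handled.
    match = re.fullmatch(r"(\$?)([A-Za-z]*)(\d*)\.(\d*)", fmt)
    if match is None:
        raise ValueError("not a recognised informat: " + fmt)
    dollar, _name, w, d = match.groups()
    kind = "string" if dollar else "numeric"
    return (kind, int(w) if w else width, int(d) if d else decimals)


def describe(scheme, fmt, **properties):
    """Look up the scheme and return a neutral (kind, width, decimals) tuple."""
    return FORMAT_SCHEMES[scheme](fmt, **properties)


print(describe("SAS", "3.1"))
print(describe("SAS", "$CHAR2."))
print(describe("General", "Integer", width=2))
```

Supporting a further scheme would then amount to registering one more
reader function under a new scheme name.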
If you have anything regarding this subject I'm happy to look over it.
Achim
Wendy Thomas wrote:
> I'm going to add a lot of the information from Achim's email to Mantis.
> In trying to do the descriptions for the NumericTypeCodeType attributes
> I was finding things that were not in the w3c page that I-Lin sent. For
> example BigInteger seems to be the Java type equivalent to
> nonNegativeInteger, negativeInteger, nonPositiveInteger, and integer.
> Java also uses BigDecimal for decimal.
>
> http://www.w3.org/2001/sw/Europe/200301/x2r/ht/a1/standardBuiltins.xml
>
> This is certainly not closed. If we are using w3c (or whatever we decide
> to use) we should use w3c and provide the mapping. Mixing type lists
> just makes a mess, as some type lists are less explicit for decimals and
> strings (like Java) and others may be more explicit. We need to sort out
> all the data types at various levels and make sure we are consistent.
>
> Achim, thanks for continuing to pursue this. I'll try to sort something
> out over the weekend and send it out for everyone to consider. If I'm
> wrong in thinking we should stick to a single type tree at least in the
> generalized sections (basically everything but the yet-to-be-created
> software-specific physical specifications), LET ME KNOW ASAP.
>
> Wendy
>
>
>
>
> On Fri, 3 Aug 2007, Joachim Wackerow wrote:
>
>> After talking briefly about data types yesterday, I have the impression
>> that some confusion still exists. I will try to describe the difference
>> between data types and data formats (against the background of social
>> science data and statistical packages), and their usage in DDI.
>>
>>
>> Data Types
>>
>> Data types are types for data processing and computation, which
>> (primarily) take place in memory. The data types used for processing in
>> statistical packages are mostly double precision and string (and date
>> types). For performance reasons these data types are stored without
>> conversion in system-specific files.
>>
>> double precision is equivalent to:
>> - the common 8-byte real in programming languages
>> - the IEEE double-precision 64-bit floating point type [IEEE 754-1985]
>> - double in XML Schema data types
>>
>> string with defined length is equivalent to:
>> - string in XML Schema data types
>> - string in programming languages. But the internal representation
>> varies: explicit length definition, object representation, or a
>> special end-of-string character, with more variation under Unicode.
>>
>>
>> Data Formats
>>
>> Data formats are used to describe data which is stored in files
>> (external to programs, NOT in-memory). Data formats are used for
>> conversion to data types for data processing. Data formats are NOT used
>> for data processing such as computation. Independently of the data
>> format, the files are represented in common encodings like ASCII,
>> EBCDIC, or Unicode.
>>
>>
>> Statistical packages and data types
>>
>> SPSS and SAS know only the data types double precision and string; they
>> don't use, for example, integer. Stata has a greater variety of data
>> types for performance reasons because, in contrast to SPSS and SAS, Stata
>> holds the whole data set in memory for computation. Other, very specific
>> packages like TDA know a bit data type to process dichotomous variables
>> efficiently. Date formats are converted to date/time data types (often
>> again double precision) for date/time computations. All the packages can
>> store their data types without conversion in system-specific files.
>>
>>
>> DDI 3.0
>>
>> The data description in PhysicalDataProduct is intended for files. It is
>> primarily a description of data formats (not data types for data
>> processing/computation). An application can derive the appropriate data
>> type (of the application) from the description of the data format and
>> the minimum and maximum values.
>>
>> Minimum and maximum values as a property of the real data are
>> described in:
>> /PhysicalInstance/Statistics/VariableStatistics/SummaryStatistic/Value
>> specified by SummaryStatisticTypeCoded: Minimum, Maximum
>> (Question: should there also be a description of the minimum and maximum
>> possible values as a property of the variable definition, independent
>> of the actual data?)
>>
>> The format is described in:
>> /PhysicalDataProduct/GrossRecordStructure/DataItem/PhysicalLocation/ValueLocation
>>
>> or general properties like 'DecimalSeparator' in:
>> /PhysicalDataProduct
>>
>> This assumes a data value in a rectangular file where the variables are
>> organized as columns and the cases as rows. The data can be arranged in
>> fixed format or in free format with a delimiter character ('Delimiter').
>> The data item itself is described mainly by format ('Format') and width
>> ('Width'). In the case of a real number with no explicit decimal separator
>> the decimal positions ('DecimalPositions') must be specified. The format
>> specification depends on the format scheme ('FormatScheme'). Three
>> alternative format schemes are shown in the examples.
>>
>>
>> Examples:
>>
>> two-digit integer number: 67
>> Format: Integer (value from an assumed very general format scheme)
>> Format: 2. (SAS format scheme; no $ prefix, since $ marks a character informat)
>> Format: integer with the constraint totalDigits:2 (XML Schema format
>> scheme)
>> Width: 2
>> DecimalPositions: 0
>>
>> real number of width 4 with explicit decimal separator: 34.6
>> Format: Real (value from an assumed very general format scheme)
>> Format: 4. (SAS format scheme)
>> Format: decimal with the constraints totalDigits:4, fractionDigits:1
>> (XML Schema format scheme)
>> Width: 4
>> DecimalSeparator: Dot
>> With decimal (XML Schema) it does not seem possible to describe the
>> same number with the German decimal separator (34,6).
>> With this XML Schema description a value of "2.75" would not be allowed,
>> but with the other formats it is a legal value.
>>
>> 3-digit real number without explicit decimal separator: 346
>> Format: Real (value from an assumed very general format scheme)
>> Format: 3.1 (SAS format scheme)
>> Format: decimal with the constraints totalDigits:3, fractionDigits:1 ??
>> (XML Schema format scheme)
>> Width: 3
>> DecimalPositions: 1
>> With decimal (XML Schema) it does not seem possible to describe the
>> format appropriately.
>>
>> string with length 2: DE
>> Format: String (value from an assumed very general format scheme)
>> Format: $CHAR2. (SAS format scheme)
>> Format: string with the constraint length:2 (XML Schema format scheme)
>> Width: 2
>>
>> Depending on the format and the width, an application can derive the
>> appropriate data type. Further optimization for numeric data types is
>> possible on the basis of the description of the value (code) range. This
>> can be based on the definition of the range in a variable description
>> (not available yet?) or on the definition of the range of the actual
>> data.
>>
>> In addition to the data format a data type can be described
>> ('DataType'), which can be necessary in special cases; in general it
>> seems to be more of a convenience. This field can also be used to
>> describe system-specific data types, on the basis of which
>> system-specific storage applications can be built. (I'm not sure whether
>> replacing PhysicalDataProduct with a system-specific DDI module is a
>> suitable alternative.)
>>
>>
>> Note on DecimalSeparator (same on GroupingSeparator)
>>
>> From field-level documentation: "The decimal separator definition only
>> makes sense with some XML Schema primitives."
>> This is confusing. The decimal separator only makes sense with numeric
>> data (real values) that are represented as strings (XSD string) with an
>> explicit decimal separator.
>>
>>
>> SDMX data types
>>
>> If I remember correctly, these data types are aimed at data
>> processing and computation. No legacy data formats are included.
>>
>> Arofan: please send the document on SDMX data types again; I couldn't
>> find the file.
>>
>>
>> Conclusion
>>
>> XML Schema data types seem to make no distinction between the usual
>> representations of numeric decimal data with and without a decimal
>> separator (number 34.6 represented as 34.6 or 346). They also don't
>> cover binary legacy data types, which are important for the archives.
>>
>> It seems difficult to arrive at a general format scheme which covers
>> all possibilities. The SAS scheme is a good candidate, but would be
>> vendor-dependent. Perhaps this list can be made more abstract and in
>> this way vendor-independent.
>>
>>
>> Comments, other views?
>>
>> Achim
>>
>>
>> SAS format scheme
>>
>> SAS informats have the following form:
>> <$>informat<w>.<d> where
>> - $ indicates a character informat; its absence indicates a numeric
>> informat.
>> - informat names the informat
>> - w specifies the informat width, which for most informats is the number
>> of columns in the input data.
>> - d specifies an optional decimal scaling factor in the numeric informats
>>
>>
>
> Wendy L. Thomas Phone: +1 612.624.4389
> Data Access Core Director Fax: +1 612.626.8375
> Minnesota Population Center Email: wlt at pop.umn.edu
> University of Minnesota
> 50 Willey Hall
> 225 19th Avenue South
> Minneapolis, MN 55455
--
GESIS - German Social Science Infrastructure Services
http://www.gesis.org/en/