[DDI-SRG] Data formats: locale and language

Joachim Wackerow joachim.wackerow at gesis.org
Wed Dec 12 10:33:19 EST 2007


I-Lin,

As Wendy pointed out, the requirement is to describe data in data files 
where we have no control over the representation used; see my email on 
data types/data format from last week.

I think it is not a question of going one way (detailed description) or 
the other (NLS support for a generic description). My suggestion is to 
provide both ways.

Regarding the small example, there is probably a misunderstanding:
10.12.2007 is understood in Germany as the date that corresponds to the 
ISO format 2007-12-10.
10.12.2007 can be understood in the USA as the date that corresponds to 
the ISO format 2007-10-12, because dates there are often written in the 
order month/day/year.
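
To make the ambiguity concrete, here is a minimal sketch in Python (chosen 
only for illustration; nothing here is part of DDI). The pattern strings 
and the helper name to_iso are my assumptions. The point is only that the 
same string yields two different ISO dates depending on the field order 
that the locale implies.

```python
from datetime import datetime

# Hypothetical patterns for illustration: the same raw string read with
# day-first order (common in Germany) and month-first order (common in the USA).
DAY_FIRST = "%d.%m.%Y"
MONTH_FIRST = "%m.%d.%Y"


def to_iso(raw: str, pattern: str) -> str:
    """Parse raw with the given strptime pattern and return an ISO 8601 date."""
    return datetime.strptime(raw, pattern).date().isoformat()


raw = "10.12.2007"
german_reading = to_iso(raw, DAY_FIRST)   # 10 December 2007
us_reading = to_iso(raw, MONTH_FIRST)     # 12 October 2007
print(german_reading, us_reading)
```

Without the locale (or an explicit pattern) stored next to the format, a 
reader of the data file cannot tell which of the two readings is meant.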

Indeed, these are probably very specific cases, but we want to cover 
them as well.

Wendy:
I'm not sure if things like SUN MON should be described on the logical 
level. You are right, SUN can be understood as a code, but to be able to 
make computations with it, this code must be converted by an 
appropriate date format into a numeric representation, which is done 
in statistical packages or database systems. The example with a 
representation for only the weekday is probably poor. Imagine a 
representation year-week-weekday; this can be converted into a numeric 
date for computation. I'm not sure if the definition at the logical 
level would be sufficient for that.
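
A rough sketch of such a conversion, again in Python and only under stated 
assumptions: the layout "year-week-weekday" with hyphens, ISO week 
numbering, the lookup tables, and the function name are all hypothetical. 
It shows why the language has to be known before the weekday string can be 
turned into something computable.

```python
from datetime import date

# Hypothetical lookup tables: weekday abbreviations per language, mapped to
# ISO weekday numbers (Monday = 1 ... Sunday = 7).
WEEKDAYS = {
    "en": {"MON": 1, "TUE": 2, "WED": 3, "THU": 4, "FRI": 5, "SAT": 6, "SUN": 7},
    "de": {"MO": 1, "DI": 2, "MI": 3, "DO": 4, "FR": 5, "SA": 6, "SO": 7},
}


def parse_year_week_weekday(raw: str, language: str) -> date:
    """Convert a string such as '2007-50-WED' into a calendar date.

    Assumes ISO week numbering; date.fromisocalendar needs Python 3.8+.
    """
    year, week, day = raw.split("-")
    weekday = WEEKDAYS[language][day.upper()]
    return date.fromisocalendar(int(year), int(week), weekday)


english = parse_year_week_weekday("2007-50-WED", "en")
german = parse_year_week_weekday("2007-50-MI", "de")
# toordinal() gives a plain day count that can be used for date arithmetic.
print(english.isoformat(), german.isoformat(), english.toordinal())
```

A code list alone would tell us that WED and MI are categories; it would 
not tell an application that both map to the same numeric date.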



In general, I have the impression that data formats should be describable 
by their characteristics and also by NLS attributes. I didn't mention 
this point last week.
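
As an illustration of offering both ways, here is a small Python sketch. 
The locale table, its two entries, and the function parse_number are 
assumptions made up for this example; a real application would take the 
separators from an NLS library or from the DDI description itself.

```python
from decimal import Decimal

# Hypothetical locale table for illustration only.
LOCALE_SEPARATORS = {
    "de_DE": {"grouping": ".", "decimal": ","},
    "en_US": {"grouping": ",", "decimal": "."},
}


def parse_number(raw, grouping=None, decimal=None, locale=None):
    """Parse a numeric string described either by explicit separators
    (the characteristics) or by a locale (the NLS attribute)."""
    if locale is not None:
        separators = LOCALE_SEPARATORS[locale]
        # Explicit separators, if given, win over the locale defaults.
        grouping = separators["grouping"] if grouping is None else grouping
        decimal = separators["decimal"] if decimal is None else decimal
    if decimal is None:
        raise ValueError("need a decimal separator or a locale")
    cleaned = raw.replace(grouping, "") if grouping else raw
    return Decimal(cleaned.replace(decimal, "."))


by_locale = parse_number("1.234,56", locale="de_DE")
by_characteristics = parse_number("1,234.56", grouping=",", decimal=".")
print(by_locale, by_characteristics)
```

Both calls describe the same value; the first keeps the information that 
the format is locale dependent, the second spells the format out.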

Which date formats are common in legacy or current data files? Is the 
use of date/time variables with specific representations common?

The examples approach is reasonable. I can try some, but I'm not sure 
whether this will happen this week.

Achim

Wendy Thomas wrote:
> I-Lin
> 
> I think Achim is referring to describing the data as it is contained in 
> the data file (over which we have no entry control). While the cases he 
> discusses are rare, they are definitely a problem with both legacy data 
> and with data created by others than large-scale organizations 
> (creativity reigns supreme).
> 
> Some of these seem to be storage-related issues, but some, like the SUN 
> MON type of information, seem more related to the Variable description 
> in logical. The question there is: is Sunday, Monday, etc. a 
> DateTimeRepresentation or a CodeRepresentation? Does this get converted 
> as it goes into a specific storage format, regardless of the user's 
> creation of, say, an alternate variable with coding based on the original 
> variable?
> 
> I think our use of dates within a DDI instance is covered. The question 
> is: are the requested changes for an expansion or change of the 
> DateTimeRepresentation (making sure something is stamped as a specific 
> representation type rather than a generic category, code, numeric, or 
> text response domain)? Or is the request to expand or change the 
> representation of specifics to a physical store?
> 
> Can we get some walk-through examples of where the problem lies?
> 
> Wendy
> 
> 
> 
> 
> 
> On Wed, 12 Dec 2007, I-Lin Kuo wrote:
> 
>> Hi Joachim,
>>
>> While I understand the intent, I'm not sure that localization is
>> sufficient or the right solution.
>>
>> First, the ISO date + locale example is not correct. ISO 8601 is locale
>> neutral, and time elements are arranged in descending order. If 12
>> represents the month, then the US example 2007-10-12 is not an ISO date.
>> Secondly, the DateFormatStandardName + locale (=ISO + US) scheme of
>> identifying formats is not expressive enough to cover 10-DEC-2007 and all
>> the other possible variations on date that might occur, unless we greatly
>> expand the set of allowed identifying formats (ORA-US). If we do allow
>> nonstandard formats, do the formats then mean anything? ORA-US to me
>> means Oracle date format, US, but might not mean that to someone else. I
>> would vote for YYYY-MM-DD for date specifications rather than a name.
>>
>> In general, I favor specific markup for specific cases rather than a
>> general approach of localization. For money and currency, I would simply
>> prefer @unitOfCurrency, @decimalDelimiter, and @thousandsDelimiter to
>> solve the problem rather than a more general localization approach. This
>> may be for no other reason than that the country of currency is no longer
>> sufficient to specify whether the currency is in marks or euros.
>>
>> The other reason I don't favor the localization approach is that for the
>> data format concern, I see date, number, and currency as the only issues.
>> The other items on the list at
>> http://en.wikipedia.org/wiki/Internationalization_and_localization are
>> all already covered.
>>
>> On Dec 12, 2007 7:42 AM, Joachim Wackerow <joachim.wackerow at gesis.org>
>> wrote:
>>
>>> Looking at the SAS formats I realized that we would need additional
>>> information like locale and/or language for specific formats.
>>>
>>> For example, some date formats are a string representation of "day in
>>> the week". Assume strings like "SUN" or "MON" in the data file. This
>>> can be represented by a generic format, but additionally a definition of
>>> the language used would be necessary.
>>>
>>> It is similar with dates like 10.12.2007 (in Germany in ISO format
>>> 2007-12-10, in the USA in ISO format 2007-10-12); when using a generic
>>> format, additional information about the locale would be necessary. The
>>> alternative would be to have a specific format definition for each
>>> variation. But then the information that the format is locale dependent
>>> is lost.
>>>
>>> Reading numeric or monetary values with embedded grouping (or thousands)
>>> separator and decimal separator is another candidate for localization.
>>> We already have explicit elements for decimal and grouping separators.
>>> But an alternative way would be to use a generic numeric format with a
>>> locale.
>>>
>>> The locale and language information should stay at the same place where
>>> the data format is defined. Both can be seen as attributes of the data
>>> format.
>>>
>>> In general I think both ways can make sense: definition of a specific
>>> format by a name (for a related type) and definition of a generic format
>>> with attributes like decimal separator.
>>>
>>> SPSS has no NLS support; SAS has NLS support but also old-style fixed
>>> definitions; SQL also has both. When both ways of definition are
>>> available, the work of describing the formats seems to be easier. The
>>> mapping table and the applications using the mapping table become
>>> more complicated. But doing formats without NLS seems to be a bad choice.
>>>
>>> Achim
>>> _______________________________________________
>>> DDI-SRG mailing list
>>> DDI-SRG at icpsr.umich.edu
>>> http://www.icpsr.umich.edu/mailman/listinfo/ddi-srg
>>>
>>
>>
>>
>> -- 
>> I-Lin Kuo
>>
> 
> Wendy L. Thomas                          Phone: +1 612.624.4389
> Data Access Core Director         Fax:   +1 612.626.8375
> Minnesota Population Center              Email: wlt at pop.umn.edu
> University of Minnesota
> 50 Willey Hall
> 225 19th Avenue South
> Minneapolis, MN 55455


-- 
GESIS - German Social Science Infrastructure Services
http://www.gesis.org/en/

