[DDI-SRG] Data formats: locale and language, PROPOSAL

Pascal Heus pascal.heus at gmail.com
Thu Dec 13 21:01:45 EST 2007


Achim:
I think the separator issue is already covered in PhysicalDataProduct, as 
we have DefaultDecimalSeparator and DecimalSeparator. We could add 
ThousandSeparator, but we may want a more generic name since, for example, in 
Japan or China digits are grouped at ten-thousand (4 digits). On 
Wikipedia, this is referred to as DigitGroupSeparator:
http://en.wikipedia.org/wiki/Decimal_separator#Thousands_separator
Note that we could also infer that if the DecimalSeparator is a dot, the 
DigitGroupSeparator is a comma (and vice versa), but I'm not sure if this 
is a universal rule.
It also seems that the use of comma/dot may have changed over time...
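
A minimal sketch of why declaring both separators explicitly may be safer than inferring one from the other. This is Python for illustration only; the function name parse_number and its parameters are hypothetical, loosely mirror DecimalSeparator and DigitGroupSeparator, and are not part of DDI:

```python
from decimal import Decimal


def parse_number(text, decimal_separator=".", digit_group_separator=None):
    """Parse a numeric string using explicitly declared separators.

    Both separators are supplied by the caller (hypothetical parameters);
    nothing is inferred from a locale.
    """
    if digit_group_separator == decimal_separator:
        raise ValueError("decimal and digit group separators must differ")
    cleaned = text.strip()
    # Drop the grouping character wherever it occurs, so 3-digit and
    # 4-digit grouping are handled the same way.
    if digit_group_separator:
        cleaned = cleaned.replace(digit_group_separator, "")
    # Normalize the declared decimal separator to the dot Decimal expects.
    if decimal_separator != ".":
        cleaned = cleaned.replace(decimal_separator, ".")
    return Decimal(cleaned)
```

With this, "3,423.67" (dot decimal, comma grouping) and "3 423,67" (comma decimal, space grouping, as in Achim's example) yield the same value. The second case also shows that a comma decimal separator does not necessarily imply a dot as the group separator.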
As far as the language attribute is concerned, I would use lang (not 
xml:lang). As suggested, another approach would be to create a mechanism 
to describe enumerations for days and months.
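
To illustrate the enumeration idea, here is a rough sketch of how an application might read a DD-MMM-YYYY value once the month codes are declared per language. It is Python for illustration only; MONTH_CODES and parse_dd_mmm_yyyy are hypothetical names, and the tables are deliberately partial (a real declaration would list all twelve months):

```python
from datetime import date

# Hypothetical, partial enumerations keyed by a "lang" value.
# A real declaration would cover all twelve months.
MONTH_CODES = {
    "FR": {"JAN": 1, "FEV": 2, "MAR": 3, "AVR": 4},
    "EN": {"JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "DEC": 12},
}


def parse_dd_mmm_yyyy(text, lang):
    """Convert a DD-MMM-YYYY string to a date using the declared month codes."""
    day, month_code, year = text.strip().split("-")
    months = MONTH_CODES[lang.upper()]
    # A KeyError here means the code is not declared for that language.
    return date(int(year), months[month_code.upper()], int(day))
```

For example, "10-FEV-2007" with lang "FR" resolves to 2007-02-10, while the same string with lang "EN" fails because FEV is not a declared English month code. That is the kind of information the format string alone does not carry.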
*P

Joachim Wackerow wrote:
> Here is my proposal regarding these issues:
>
> Regarding the language-specific portion of dates (or similar) in the 
> data file (not in the DDI instance) like 2007-Dec-13 a language 
> attribute should be invented for data formats. This attribute shouldn't 
> be "xml:lang". "xml:lang" has another meaning: it describes the language 
> of an XML content, not the language of a data item in a data file which 
> is described by a DDI instance.
>
> PROPOSAL: new attribute "language" or "languageOfData" for DataFormat
>
>
> Regarding the locale-specific formats of data, like decimal and thousands 
> separators, currency symbols etc., a locale attribute could be 
> invented. But the legitimate argument against it is that this kind of 
> description is not exact enough.
> Use case 1: Which currency is defined by a locale of "de"? Deutsche Mark 
> or Euro?
> Use case 2: Which decimal separator is meant by a locale of "de"?
> The common format in Germany is "3 423,67" (the equivalent in 
> the USA is "3,423.67"). But in some publications (like scientific ones) 
> the US format is used. Software like Excel writes CSV files with a 
> decimal separator that depends on the locale in use; but is the user aware 
> of this?
> The locale definition seems to be error-prone and not exact enough to 
> describe data where common local rules are often used, but not always. 
> For example, the decimal separator should be described explicitly. It is 
> not a good idea to rely on common rules that depend on a locale.
> Only in a controlled environment can it make sense to use locale as a 
> reliable indicator for the format used.
>
> PROPOSAL: Further discussion, if it makes sense to invent the locale 
> attribute for controlled conditions. If yes, invent it only with a 
> thorough description in the documentation about the risks.
>
> Achim
>
> Wendy Thomas wrote:
>   
>> there is a measurementUnit in Variable
>>
>> On Wed, 12 Dec 2007, I-Lin Kuo wrote:
>>
>>     
>>> So if separators are already in PhysicalDataProduct (I haven't checked to
>>> make sure), then I think we're covered for number and currency, since the
>>> lira/euro/marks/dollars unit of currency should be someplace in
>>> LogicalDataProduct as a unit of measure. This should work, right? Unless we
>>> have a use case where a variable contains switchable currency units, e.g.
>>> sometimes pounds, sometimes euros. In that case, the data file has problems
>>> anyway...
>>>
>>> Of the three use cases, that leaves only Date unresolved.
>>>
>>> On Dec 12, 2007 11:15 AM, Pascal Heus <pascal.heus at gmail.com> wrote:
>>>
>>>       
>>>>  Wendy:
>>>> I assume this will all take place at the format level in the
>>>> PhysicalDataProduct. I agree that we may have a language issue when it
>>>> comes to alphanumeric days or months (like JAN,FEV,MAR,AVR or
>>>> lundi,mardi,mercredi in French...). We could have a Format = something like
>>>> DD-MMM-YYYY with an extra xml:lang="FR"?
>>>> A more formal option would be to have a mechanism to declare enumerations
>>>> for the date type (and possibly currencies). Maybe something like:
>>>> <DateFormat expression="MMM" type="month" lang="FR">
>>>>     <Jan>JAN</Jan><Feb>FEV</Feb>....
>>>> </DateFormat>
>>>> This could be reusable, though, so it could also be stored elsewhere and
>>>> referenced (in a translation section?).
>>>> For separators (comma/dot), I think we already have that in the
>>>> PhysicalDataProduct.
>>>> later,
>>>> *P
>>>>
>>>>
>>>> Wendy Thomas wrote:
>>>>
>>>> Thanks Achim
>>>>
>>>> The case you are stating is much clearer now. I think the examples will
>>>> help clarify what changes are needed and where they are needed.
>>>>
>>>> Wendy
>>>>
>>>> On Wed, 12 Dec 2007, Joachim Wackerow wrote:
>>>>
>>>>
>>>>
>>>>  I-Lin,
>>>>
>>>> As Wendy pointed out the requirement is to describe data in data files,
>>>> where we have no control over the used representation, see my email on
>>>> data types/data format from last week.
>>>>
>>>> I think it is not a question of going one way (detailed description) or
>>>> another (or NLS support for generic description). My suggestion is to
>>>> provide both ways.
>>>>
>>>> Regarding the small example there is probably a misunderstanding:
>>>> 10.12.2007 is understood in Germany as a date corresponding to the ISO
>>>> format 2007-12-10.
>>>> 10.12.2007 can be understood in USA as a date corresponding to the ISO
>>>> format 2007-10-12, because dates are often written in this order
>>>> month/day/year.
>>>>
>>>> Indeed these are probably very specific cases, but we want to cover also
>>>> these.
>>>>
>>>> Wendy:
>>>> I'm not sure if things like SUN MON should be described on the logical
>>>> level. You are right, SUN can be understood as a code, but to be able to
>>>> make computations with it, this code must be converted by an
>>>> appropriate date format into a numeric representation, which is done
>>>> in statistical packages or database systems. The example with a
>>>> representation for only the weekday is probably poor. Imagine a
>>>> representation year-week-weekday; this can be converted into a numeric
>>>> date for computation. I'm not sure if the definition at the logical
>>>> level would be sufficient for that.
>>>>
>>>>
>>>>
>>>> In general I have the impression that data formats should be describable
>>>> by their characteristics but also by NLS attributes. This point I didn't
>>>> mention last week.
>>>>
>>>> Which date formats are common in legacy or current data files? Is the
>>>> use of date/time variables with specific representations common?
>>>>
>>>> The examples approach is reasonable. I can try some, but I'm not sure,
>>>> if this will happen this week.
>>>>
>>>> Achim
>>>>
>>>> Wendy Thomas wrote:
>>>>
>>>>
>>>>  I-Lin
>>>>
>>>> I think Achim is referring to describing the data as it is contained in
>>>> the data file (over which we have no entry control). While the cases he
>>>> discusses are rare, they are definitely a problem with both legacy data
>>>> and with data created by others than large-scale organizations
>>>> (creativity reigns supreme).
>>>>
>>>> Some of these seem to be storage-related issues, but some, like the SUN
>>>> MON type of information, seem more related to the Variable description
>>>> in logical. The question there is: is Sunday, Monday etc. a
>>>> DateTimeRepresentation or a CodeRepresentation? Does this get converted
>>>> as it goes into a specific storage format regardless of the user's
>>>> creation of, say, an alternate variable with coding based on the original
>>>> variable?
>>>>
>>>> I think our use of dates within a DDI instance is covered. The question
>>>> is: are the requested changes for an expansion or change of the
>>>> DateTimeRepresentation (making sure something is stamped as a specific
>>>> representation type rather than a generic category, code, numeric, or
>>>> text response domain)? Or is the request to expand or change the
>>>> representation of specifics to a physical store?
>>>>
>>>> Can we get some walk through examples of where the problem lies?
>>>>
>>>> Wendy
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, 12 Dec 2007, I-Lin Kuo wrote:
>>>>
>>>>
>>>>
>>>>  Hi Joachim,
>>>>
>>>> While I understand the intent, I'm not sure that localization is
>>>> sufficient or the right solution.
>>>>
>>>> First, the ISO date + locale example is not correct. ISO 8601 is locale
>>>> neutral, and time elements are arranged in descending order. If 12
>>>> represents the month, then the US example 2007-10-12 is not an ISO date.
>>>> Secondly, the DateFormatStandardName + locale (=ISO + US) scheme of
>>>> identifying formats is not expressive enough to cover 10-DEC-2007 and all
>>>> the other possible variations on date that might occur, unless we greatly
>>>> expand the set of allowed identifying formats (ORA-US). If we do allow
>>>> nonstandard formats, do the formats then mean anything? ORA-US to me means
>>>> Oracle date format, US, but might not mean that to someone else. I would
>>>> vote for YYYY-MM-DD for date specifications rather than a name.
>>>>
>>>> In general, I favor specific markup for specific cases rather than a
>>>> general approach of localization. For money and currency, I would simply
>>>> prefer @unitOfCurrency, @decimalDelimiter and @thousandsDelimiter to solve
>>>> the problem rather than a more general localization approach. This may be
>>>> for no other reason than that the country of currency is no longer
>>>> sufficient to specify whether the currency is in marks or euros.
>>>>
>>>> The other reason I don't favor the localization approach is that for the
>>>> data format concern, I see date, number, and currency as the only issues.
>>>> The other items on the list at
>>>> http://en.wikipedia.org/wiki/Internationalization_and_localization
>>>> are all already covered.
>>>>
>>>> On Dec 12, 2007 7:42 AM, Joachim Wackerow <joachim.wackerow at gesis.org> wrote:
>>>>
>>>>
>>>>
>>>>  Looking at the SAS formats I realized that we would need additional
>>>> information like locale and/or language for specific formats.
>>>>
>>>> For example some date formats like a string representation of "day in
>>>> the week". Assuming strings like "SUN" or "MON" in the data file. This
>>>> can be represented by a generic format, but additionally a definition of
>>>> the used language would be necessary.
>>>>
>>>> Similar with dates like 10.12.2007 (in Germany in ISO format 2007-12-10,
>>>> in USA in ISO format 2007-10-12); using a generic format, additional
>>>> information about the locale would be necessary. The alternative would
>>>> be to have a specific format definition for each variation. But then the
>>>> information that the format is locale-dependent is lost.
>>>>
>>>> Reading numeric or monetary values with embedded grouping (or thousands)
>>>> separator and decimal separator is another candidate for localization.
>>>> We already have explicit elements for decimal and grouping separators.
>>>> But an alternative way would be to use a generic numeric format with a
>>>> locale.
>>>>
>>>> The locale and language information should stay at the same place where
>>>> the data format is defined. Both can be seen as attributes of data
>>>> format.
>>>>
>>>> In general I think both ways can make sense: definition of a specific
>>>> format by a name (for a related type) and definition of a generic format
>>>> with attributes like decimal separator.
>>>>
>>>> SPSS has no NLS support; SAS has NLS support, but also old-style fixed
>>>> definitions; SQL also has both. When both ways of definition are
>>>> available, the work of describing the formats seems to be easier. The
>>>> mapping table and the applications using the mapping table are getting
>>>> more complicated. But doing formats without NLS seems to be a bad choice.
>>>>
>>>> Achim
>>>> _______________________________________________
>>>> DDI-SRG mailing list
>>>> DDI-SRG at icpsr.umich.edu
>>>> http://www.icpsr.umich.edu/mailman/listinfo/ddi-srg
>>>>
>>>>            --
>>>> I-Lin Kuo
>>>>
>>>>
>>>>
>>>>  Wendy L. Thomas                          Phone: +1 612.624.4389
>>>> Data Access Core Director         Fax:   +1 612.626.8375
>>>> Minnesota Population Center              Email: wlt at pop.umn.edu
>>>> University of Minnesota
>>>> 50 Willey Hall
>>>> 225 19th Avenue South
>>>> Minneapolis, MN 55455
>>>>
>>>>
>>>>  --
>>>> GESIS - German Social Science Infrastructure Services
>>>> http://www.gesis.org/en/
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>         
>>> -- 
>>> I-Lin Kuo
>>>
>>>       
>> Wendy L. Thomas                          Phone: +1 612.624.4389
>> Data Access Core Director		 Fax:   +1 612.626.8375
>> Minnesota Population Center              Email: wlt at pop.umn.edu
>> University of Minnesota
>> 50 Willey Hall
>> 225 19th Avenue South
>> Minneapolis, MN 55455
>>     
>
>
>   
