[DDI-SRG] Data formats: locale and language, PROPOSAL

Wendy Thomas wlt at pop.umn.edu
Fri Dec 14 09:27:35 EST 2007


There is a DefaultGroupingSeparator
<xs:documentation>The character used to separate groups of digits (if an 
explicit separator is used in the data). Allowed values are: None 
(default), Dot, Comma, Other. The decimal separator definition makes only 
sense with some XML Schema primitives. This is a default which may be 
overridden in specific cases.</xs:documentation>

It should be available at the DataItem level as an override. If not, this 
is a bug.
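
To illustrate the default-plus-override idea, here is a minimal sketch in Python. It assumes the allowed values None, Dot and Comma map to the obvious characters; the function name, its parameters and the item-level override argument are hypothetical and not taken from the schema, and the value Other is not handled.

```python
from decimal import Decimal

# Assumed mapping from the allowed schema values to characters ("Other" is left out).
SEPARATORS = {"None": "", "Dot": ".", "Comma": ","}

def parse_number(text, decimal_sep="Dot", grouping_sep="None", item_grouping_sep=None):
    """Parse text using a default grouping separator, optionally overridden per item."""
    # The item-level value (hypothetical override) wins over the default when present.
    grouping = SEPARATORS[item_grouping_sep or grouping_sep]
    decimal = SEPARATORS[decimal_sep]
    if grouping and grouping == decimal:
        raise ValueError("grouping and decimal separators must differ")
    if grouping:
        text = text.replace(grouping, "")
    if decimal and decimal != ".":
        text = text.replace(decimal, ".")
    return Decimal(text)

# Default is no grouping separator; one item overrides it with a comma.
print(parse_number("3423.67"))
print(parse_number("3,423.67", item_grouping_sep="Comma"))
```

Both calls yield the same value; only the second needs the override to be read correctly.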

Wendy


On Thu, 13 Dec 2007, Pascal Heus wrote:

> Achim:
> I think the separator issue is already covered in PhysicalDataProduct as we 
> have DefaultDecimalSeparator and DecimalSeparator, we could add 
> ThousandSeparator (but we may want a more generic name as for example in 
> Japan or China the separator is at ten-thousand or 4-digits). On Wikipedia, 
> this is referred to as DigitGroupSeparator:
> http://en.wikipedia.org/wiki/Decimal_separator#Thousands_separator
> Note that we could also infer that if the DecimalSeparator is a dot, the 
> DigitGroupSeparator is a comma (and vice-versa) but I'm not sure if this is a 
> universal rule.
> It also seems that the use of comma/dot may have changed over time...
> As far as the language attribute is concerned, I would use lang (not 
> xml:lang). As suggested, another approach would be to create a mechanism to 
> describe enumerations for days and months.
> *P
>
> Joachim Wackerow wrote:
>> Here is my proposal regarding these issues:
>> 
>> Regarding the language-specific portion of dates (or similar) in the data 
>> file (not in the DDI instance) like 2007-Dec-13 a language attribute should 
>> be invented for data formats. This attribute shouldn't be "xml:lang". 
>> "xml:lang" has another meaning: it describes the language of an XML 
>> content, not the language of a data item in a data file which is described 
>> by a DDI instance.
>> 
>> PROPOSAL: new attribute "language" or "languageOfData" for DataFormat
>> 
>> 
>> Regarding the locale-specific formats of data, like decimal and thousands 
>> separators, currency symbols, etc., a locale attribute could be invented. 
>> But the legitimate argument against it is that this kind of description is 
>> not exact enough.
>> Use case 1: Which currency is defined by a locale of "de"? Deutsche Mark or 
>> Euro?
>> Use case 2: Which decimal separator is meant by a locale of "de"?
>> The common format in Germany is like "3 423,67" (the equivalent in the 
>> USA is "3,423.67"). But in some publications (like scientific ones) the US 
>> format is used. Software like Excel writes CSV files with a decimal 
>> separator that depends on the locale in use; but is the user aware of this?
>> The locale definition seems to be error-prone and not exact enough to 
>> describe data where common local rules are often used, but not always. For 
>> example, the decimal separator should be described explicitly. It is not a 
>> good idea to rely on common rules that depend on a locale.
>> Only in a controlled environment can it make sense to use locale as a 
>> reliable indicator of the format used.
>> 
>> PROPOSAL: Further discussion on whether it makes sense to invent the locale 
>> attribute for controlled conditions. If yes, invent it only with a thorough 
>> description in the documentation of the risks.
>> 
>> Achim
>> 
>> Wendy Thomas wrote:
>> 
>>> there is a measurementUnit in Variable
>>> 
>>> On Wed, 12 Dec 2007, I-Lin Kuo wrote:
>>>
>>> 
>>>> So if separators are already in PhysicalDataProduct (I haven't checked to
>>>> make sure), then I think we're covered for number and currency, since
>>>> the lira/euro/marks/dollars unit of currency should be someplace in
>>>> LogicalDataProduct as a unit of measure. This should work, right? Unless
>>>> we have a use case where a variable contains switchable currency units,
>>>> e.g. sometimes pounds, sometimes euros. In that case, the data file has
>>>> problems anyway...
>>>> 
>>>> Of the three use cases, that leaves only Date unresolved.
>>>> 
>>>> On Dec 12, 2007 11:15 AM, Pascal Heus <pascal.heus at gmail.com> wrote:
>>>>
>>>>
>>>>>  Wendy:
>>>>> I assume this will all take place at the format level in the
>>>>> PhysicalDataProduct. I agree that we may have a language issue when it
>>>>> comes to alphanumeric days or months (like JAN,FEV,MAR,AVR or
>>>>> lundi,mardi,mercredi in French...). Could we have a Format = something like
>>>>> DD-MMM-YYYY with an extra xml:lang="FR"?
>>>>> A more formal option would be to have a mechanism to declare enumerations
>>>>> for the date type (and possibly currencies). Maybe something like:
>>>>> <DateFormat expression="MMM" type="month" lang="FR">
>>>>>     <Jan>JAN</Jan><Feb>FEV</Feb>....
>>>>> </DateFormat>
>>>>> This could be reusable, though, so it could also be stored elsewhere and
>>>>> referenced (in a translation section?).
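
The enumeration idea can be sketched in Python as a lookup table that is consulted while reading a DD-MMM-YYYY value. This is only an illustration: the table holds just the four French abbreviations quoted above, and the function and variable names are hypothetical.

```python
from datetime import date

# Only the four abbreviations quoted in the thread; the other months are not filled in.
MONTHS_FR = {"JAN": 1, "FEV": 2, "MAR": 3, "AVR": 4}

def parse_dd_mmm_yyyy(text, months):
    """Read a DD-MMM-YYYY value using an explicit month enumeration."""
    day, month_label, year = text.split("-")
    return date(int(year), months[month_label.upper()], int(day))

print(parse_dd_mmm_yyyy("10-FEV-2007", MONTHS_FR).isoformat())  # 2007-02-10
```

Because the labels come from the declared enumeration, the result does not depend on the locale of the machine reading the file.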
>>>>> For separators (comma/dot), I think we already have that in the
>>>>> PhysicalDataProduct.
>>>>> later,
>>>>> *P
>>>>> 
>>>>> 
>>>>> Wendy Thomas wrote:
>>>>> 
>>>>> Thanks Achim
>>>>> 
>>>>> The case you are stating is much clearer now. I think the examples will
>>>>> help clarify what changes are needed and where they are needed.
>>>>> 
>>>>> Wendy
>>>>> 
>>>>> On Wed, 12 Dec 2007, Joachim Wackerow wrote:
>>>>> 
>>>>> 
>>>>>
>>>>>  I-Lin,
>>>>> 
>>>>> As Wendy pointed out the requirement is to describe data in data files,
>>>>> where we have no control over the used representation, see my email on
>>>>> data types/data format from last week.
>>>>> 
>>>>> I think it is not a question of going one way (detailed description) or
>>>>> another (NLS support for generic description). My suggestion is to
>>>>> provide both ways.
>>>>> 
>>>>> Regarding the small example there is probably a misunderstanding:
>>>>> 10.12.2007 is understood in Germany as a date corresponding to the
>>>>> ISO format 2007-12-10.
>>>>> 10.12.2007 can be understood in the USA as a date corresponding to
>>>>> the ISO format 2007-10-12, because dates are often written in the
>>>>> order month/day/year.
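
A short Python sketch of this ambiguity: the same string is read with two explicit format patterns, one day-first and one month-first. The patterns are assumptions chosen to match the two readings described above.

```python
from datetime import datetime

raw = "10.12.2007"

# Day-first reading (as commonly used in Germany).
german = datetime.strptime(raw, "%d.%m.%Y").date()
# Month-first reading (as often used in the USA).
us = datetime.strptime(raw, "%m.%d.%Y").date()

print(german.isoformat())  # 2007-12-10
print(us.isoformat())      # 2007-10-12
```

The ISO output is unambiguous in both cases; the ambiguity lies only in how the source string is read.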
>>>>> 
>>>>> Indeed these are probably very specific cases, but we also want to cover
>>>>> these.
>>>>> 
>>>>> Wendy:
>>>>> I'm not sure if things like SUN MON should be described on the logical
>>>>> level. You are right, SUN can be understood as a code, but to be able to
>>>>> make computations with it, this code must be converted by an
>>>>> appropriate date format into a numeric representation, which is done
>>>>> in statistical packages or database systems. The example with a
>>>>> representation for only the weekday is probably poor. Imagine a
>>>>> representation year-week-weekday; this can be converted into a numeric
>>>>> date for computation. I'm not sure if the definition at the logical
>>>>> level would be sufficient for that.
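
As a sketch of the year-week-weekday conversion, the following Python example assumes ISO 8601 week numbering (weekday 1 = Monday) and a hypothetical "YYYY-WW-D" layout; a real data file may number weeks differently. It needs Python 3.8 or later for date.fromisocalendar.

```python
from datetime import date

def from_year_week_weekday(text):
    """Convert 'YYYY-WW-D' (assumed ISO week, weekday 1=Monday) to a date."""
    year, week, weekday = (int(part) for part in text.split("-"))
    return date.fromisocalendar(year, week, weekday)

d = from_year_week_weekday("2007-50-1")
print(d.isoformat())   # 2007-12-10
print(d.toordinal())   # proleptic Gregorian ordinal, usable for date arithmetic
```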
>>>>> 
>>>>> 
>>>>> 
>>>>> In general I have the impression that data formats should be describable
>>>>> by their characteristics but also by NLS attributes. This point I didn't
>>>>> mention last week.
>>>>> 
>>>>> Which date formats are common in legacy or current data files? Is the
>>>>> use of date/time variables with specific representations common?
>>>>> 
>>>>> The examples approach is reasonable. I can try some, but I'm not sure,
>>>>> if this will happen this week.
>>>>> 
>>>>> Achim
>>>>> 
>>>>> Wendy Thomas wrote:
>>>>> 
>>>>>
>>>>>  I-Lin
>>>>> 
>>>>> I think Achim is referring to describing the data as it is contained in
>>>>> the data file (over which we have no entry control). While the cases he
>>>>> discusses are rare, they are definitely a problem with both legacy data
>>>>> and with data created by others than large scale organizations
>>>>> (creativity reigns supreme).
>>>>> 
>>>>> Some of these seem to be storage related issues, but some, like the SUN
>>>>> MON type of information, seem more related to the Variable description
>>>>> in logical. The question there is, is Sunday Monday etc a
>>>>> DateTimeRepresentation or a CodeRepresentation? Does this get converted
>>>>> as it goes into a specific storage format regardless of the user's
>>>>> creation of, say, an alternate variable with coding based on the original
>>>>> variable?
>>>>> 
>>>>> I think our use of dates within a DDI instance is covered. The question
>>>>> is, are the requested changes for an expansion or change of the
>>>>> DateTimeRepresentation (making sure something is stamped as a specific
>>>>> representation type rather than a generic category, code, numeric, or
>>>>> text response domain)? Or is the request to expand or change the
>>>>> representation of specifics to a physical store?
>>>>> 
>>>>> Can we get some walk through examples of where the problem lies?
>>>>> 
>>>>> Wendy
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Wed, 12 Dec 2007, I-Lin Kuo wrote:
>>>>> 
>>>>> 
>>>>>
>>>>>  Hi Joachim,
>>>>> 
>>>>> While I understand the intent, I'm not sure that localization is
>>>>> sufficient or the right solution.
>>>>> 
>>>>> First, the ISO date + locale example is not correct. ISO 8601 is locale
>>>>> neutral, and time elements are arranged in descending order. If 12
>>>>> represents the month, then the US example 2007-10-12 is not an ISO date.
>>>>> Secondly, the DateFormatStandardName + locale (=ISO + US) scheme of
>>>>> identifying formats is not expressive enough to cover 10-DEC-2007 and
>>>>> all the other possible variations on date that might occur, unless we
>>>>> greatly expand the set of allowed identifying formats (ORA-US). If we do
>>>>> allow nonstandard formats, do the formats then mean anything? ORA-US to
>>>>> me means Oracle date format, US, but might not mean that to someone
>>>>> else. I would vote for YYYY-MM-DD for date specifications rather than a
>>>>> name.
>>>>> 
>>>>> In general, I favor specific markup for specific cases rather than a
>>>>> general approach of localization. For money and currency, I would simply
>>>>> prefer @unitOfCurrency, @decimalDelimiter and @thousandsDelimiter to
>>>>> solve the problem rather than a more general localization approach. This
>>>>> may be for no other reason than that the country of currency is no
>>>>> longer sufficient to specify whether the currency is in marks or euros.
>>>>> 
>>>>> The other reason I don't favor the localization approach is that for the
>>>>> data format concern, I see date, number, and currency as the only
>>>>> issues. The other items on the list at
>>>>> http://en.wikipedia.org/wiki/Internationalization_and_localization are
>>>>> all already covered.
>>>>> 
>>>>> On Dec 12, 2007 7:42 AM, Joachim Wackerow <joachim.wackerow at gesis.org>
>>>>> wrote:
>>>>> 
>>>>> 
>>>>>
>>>>>  Looking at the SAS formats I realized that we would need additional
>>>>> information like locale and/or language for specific formats.
>>>>> 
>>>>> For example, some date formats use a string representation of "day in
>>>>> the week". Assume strings like "SUN" or "MON" in the data file. This
>>>>> can be represented by a generic format, but additionally a definition of
>>>>> the language used would be necessary.
>>>>> 
>>>>> Similarly with dates like 10.12.2007 (in Germany in ISO format 2007-12-10,
>>>>> in USA in ISO format 2007-10-12); using a generic format, additional
>>>>> information about the locale would be necessary. The alternative would
>>>>> be to have a specific format definition for each variation. But then the
>>>>> information that the format is locale dependent is lost.
>>>>> 
>>>>> Reading numeric or monetary values with embedded grouping (or thousands)
>>>>> separator and decimal separator is another candidate for localization.
>>>>> We already have explicit elements for decimal and grouping separators.
>>>>> But an alternative way would be to use a generic numeric format with a
>>>>> locale.
>>>>> 
>>>>> The locale and language information should stay at the same place where
>>>>> the data format is defined. Both can be seen as attributes of data
>>>>> format.
>>>>> 
>>>>> In general I think both ways can make sense: definition of a specific
>>>>> format by a name (for a related type) and definition of a generic format
>>>>> with attributes like decimal separator.
>>>>> 
>>>>> SPSS has no NLS support; SAS has NLS support but also old-style fixed
>>>>> definitions; SQL also has both. When both ways of definition are
>>>>> available, the work of describing the formats seems to be easier. The
>>>>> mapping table and the applications using the mapping table get
>>>>> more complicated. But doing formats without NLS seems to be a bad choice.
>>>>> 
>>>>> Achim
>>>>> _______________________________________________
>>>>> DDI-SRG mailing list
>>>>> DDI-SRG at icpsr.umich.edu
>>>>> http://www.icpsr.umich.edu/mailman/listinfo/ddi-srg
>>>>>
>>>>>            --
>>>>> I-Lin Kuo
>>>>> 
>>>>> 
>>>>>
>>>>>  Wendy L. Thomas                          Phone: +1 612.624.4389
>>>>> Data Access Core Director         Fax:   +1 612.626.8375
>>>>> Minnesota Population Center              Email: wlt at pop.umn.edu
>>>>> University of Minnesota
>>>>> 50 Willey Hall
>>>>> 225 19th Avenue South
>>>>> Minneapolis, MN 55455
>>>>> 
>>>>>
>>>>>  --
>>>>> GESIS - German Social Science Infrastructure Services
>>>>> http://www.gesis.org/en/
>>>>> 
>>>>>
>>>>> 
>>>> -- 
>>>> I-Lin Kuo
>>>>
>>>> 
>>> 
>> 
>>
>> 
>
>

Wendy L. Thomas                          Phone: +1 612.624.4389
Data Access Core Director		 Fax:   +1 612.626.8375
Minnesota Population Center              Email: wlt at pop.umn.edu
University of Minnesota
50 Willey Hall
225 19th Avenue South
Minneapolis, MN 55455