[DDI-SRG] Data formats: locale and language, PROPOSAL
Joachim Wackerow
joachim.wackerow at gesis.org
Thu Dec 13 14:23:49 EST 2007
Here is my proposal regarding these issues:
Regarding the language-specific portion of dates (or similar items) in the
data file (not in the DDI instance), like 2007-Dec-13: a language
attribute should be introduced for data formats. This attribute shouldn't
be "xml:lang". "xml:lang" has another meaning: it describes the language
of XML content, not the language of a data item in a data file that
is described by a DDI instance.
PROPOSAL: new attribute "language" or "languageOfData" for DataFormat
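To illustrate how such an attribute might be used when reading a data
file, here is a minimal sketch in Python. The month tables and the
function name parse_dated_item are assumptions made only for this
example; they are not part of DDI, and the abbreviations are just one
plausible spelling per language. The language code selects the table, so
the result does not depend on the locale of the machine doing the parsing.

```python
from datetime import date

# Hypothetical month-abbreviation tables keyed by a language code.
# Real data files may use other abbreviations; these are illustrative only.
MONTHS = {
    "en": ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
    "de": ["Jan", "Feb", "Mrz", "Apr", "Mai", "Jun",
           "Jul", "Aug", "Sep", "Okt", "Nov", "Dez"],
}

def parse_dated_item(text, language):
    """Parse a YYYY-MMM-DD data item whose month name is in `language`."""
    year, month_name, day = text.split("-")
    # Map each lower-cased abbreviation to its month number (1..12).
    lookup = {name.lower(): i + 1 for i, name in enumerate(MONTHS[language])}
    return date(int(year), lookup[month_name.lower()], int(day))

print(parse_dated_item("2007-Dec-13", "en"))  # 2007-12-13
print(parse_dated_item("2007-Dez-13", "de"))  # 2007-12-13
```

Without the language information, "Dez" cannot be resolved from the
English table at all (the lookup raises a KeyError), which is the gap the
proposed attribute is meant to close.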
Regarding the locale-specific formats of data, like decimal and thousands
separators, currency symbols etc., a locale attribute could be
introduced. But the legitimate argument against it is that this kind of
description is not exact enough.
Use case 1: Which currency is defined by a locale of "de"? Deutsche Mark
or Euro?
Use case 2: Which decimal separator is meant by a locale of "de"?
The common convention in Germany is a format like "3 423,67" (the equivalent
in the USA is "3,423.67"). But in some publications (like scientific ones)
the US format is used. Software like Excel writes CSV files with a
decimal separator that depends on the locale in use; but is the user aware
of this?
The locale definition seems to be error-prone and not exact enough to
describe data where common local rules are often used, but not always.
For example, the decimal separator should be described explicitly. It is
not a good idea to rely on common rules that depend on a locale.
Only in a controlled environment can it make sense to use the locale as a
reliable indicator of the format in use.
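A minimal sketch in Python of what describing the separators explicitly
could look like on the reading side. The function parse_number and its
parameter names are assumptions for this example only; they do not
correspond to any existing DDI element or attribute names.

```python
from decimal import Decimal

def parse_number(text, decimal_sep, grouping_sep=""):
    """Parse a number from a data file using explicitly described separators."""
    # Remove the grouping (thousands) separator first, so that it cannot be
    # confused with the decimal separator afterwards.
    if grouping_sep:
        text = text.replace(grouping_sep, "")
    # Normalize the decimal separator to the "." that Decimal expects.
    if decimal_sep != ".":
        text = text.replace(decimal_sep, ".")
    return Decimal(text)

print(parse_number("3 423,67", decimal_sep=",", grouping_sep=" "))  # 3423.67
print(parse_number("3,423.67", decimal_sep=".", grouping_sep=","))  # 3423.67
```

The same string gives different values under different descriptions:
"3,423" is 3.423 with a decimal comma and 3423 with a grouping comma. A
locale alone does not say which of the two was actually written to the file.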
PROPOSAL: Further discussion of whether it makes sense to introduce the
locale attribute for controlled conditions. If yes, introduce it only with a
thorough description of the risks in the documentation.
Achim
Wendy Thomas wrote:
> there is a measurementUnit in Variable
>
> On Wed, 12 Dec 2007, I-Lin Kuo wrote:
>
>> So if separators are already in PhysicalDataProduct (I haven't checked to
>> make sure), then I think we're covered for number and currency, since
>> the lira/euro/marks/dollars unit of currency should be someplace in
>> LogicalDataProduct as a unit of measure. This should work, right? Unless we
>> have a use case where a variable contains switchable currency units, e.g.
>> sometimes pounds, sometimes euros. In that case, the data file has problems
>> anyway...
>>
>> Of the three use cases, that leaves only Date unresolved.
>>
>> On Dec 12, 2007 11:15 AM, Pascal Heus <pascal.heus at gmail.com> wrote:
>>
>>> Wendy:
>>> I assume this will all take place at the format level in the
>>> PhysicalDataProduct. I agree that we may have a language issue when it
>>> comes to alphanumeric days or months (like JAN,FEV,MAR,AVR or
>>> lundi,mardi,mercredi in French...). Could have a Format = something like
>>> DD-MMM-YYYY with an extra xml:lang="FR"?
>>> A more formal option would be to have a mechanism to declare enumerations
>>> for the date type (and possibly currencies). Maybe something like:
>>> <DateFormat expression="MMM" type="month" lang="FR">
>>> <Jan>JAN</Jan><Feb>FEV</Feb>....
>>> </DateFormat>
>>> This could be reusable though so could also be stored elsewhere and
>>> referenced (in a translation section?).
>>> For separators (comma/dot), I think we already have that in the
>>> PhysicalDataProduct.
>>> later,
>>> *P
>>>
>>>
>>> Wendy Thomas wrote:
>>>
>>> Thanks Achim
>>>
>>> The case you are stating is much clearer now. I think the examples will
>>> help clarify what changes are needed and where they are needed.
>>>
>>> Wendy
>>>
>>> On Wed, 12 Dec 2007, Joachim Wackerow wrote:
>>>
>>>
>>>
>>> I-Lin,
>>>
>>> As Wendy pointed out the requirement is to describe data in data files,
>>> where we have no control over the used representation, see my email on
>>> data types/data format from last week.
>>>
>>> I think it is not a question of going one way (detailed description) or
>>> another (or NLS support for generic description). My suggestion is to
>>> provide both ways.
>>>
>>> Regarding the small example there is probably a misunderstanding:
>>> 10.12.2007 is understood in Germany as a date corresponding to the ISO
>>> format 2007-12-10.
>>> 10.12.2007 can be understood in the USA as a date corresponding to the ISO
>>> format 2007-10-12, because dates are often written there in the order
>>> month/day/year.
>>>
>>> Indeed these are probably very specific cases, but we want to cover also
>>> these.
>>>
>>> Wendy:
>>> I'm not sure if things like SUN MON should be described on the logical
>>> level. You are right, SUN can be understood as a code, but to be able to
>>> make computations with it, this code must be converted by an
>>> appropriate date format into a numeric representation, which is done
>>> in statistical packages or database systems. The example with a
>>> representation for only the week day is probably poor. Imagine a
>>> representation year-week-weekday, this can be converted in a numeric
>>> date for computation. I'm not sure if the definition at the logical
>>> level would be sufficient for that.
>>>
>>>
>>>
>>> In general I have the impression that data formats should be describable
>>> by their characteristics but also by NLS attributes. This point I didn't
>>> mention last week.
>>>
>>> Which date formats are common in legacy or current data files? Is the
>>> use of date/time variables with specific representations common?
>>>
>>> The examples approach is reasonable. I can try some, but I'm not sure,
>>> if this will happen this week.
>>>
>>> Achim
>>>
>>> Wendy Thomas wrote:
>>>
>>>
>>> I-Lin
>>>
>>> I think Achim is referring to describing the data as it is contained in
>>> the data file (over which we have no entry control). While the cases he
>>> discusses are rare, they are definitely a problem with both legacy data
>>> and with data created by others than large-scale organizations
>>> (creativity reigns supreme).
>>>
>>> Some of these seem to be storage-related issues, but some, like the SUN
>>> MON type of information, seem more related to the Variable description
>>> in logical. The question there is: is Sunday, Monday etc. a
>>> DateTimeRepresentation or a CodeRepresentation? Does this get converted
>>> as it goes into a specific storage format, regardless of the user's
>>> creation of, say, an alternate variable with coding based on the original
>>> variable?
>>>
>>> I think our use of dates within a DDI instance is covered. The question
>>> is: are the requested changes for an expansion or change of the
>>> DateTimeRepresentation (making sure something is stamped as a specific
>>> representation type rather than a generic category, code, numeric, or
>>> text response domain)? Or is the request to expand or change the
>>> representation of specifics to a physical store?
>>>
>>> Can we get some walk through examples of where the problem lies?
>>>
>>> Wendy
>>>
>>>
>>>
>>>
>>>
>>> On Wed, 12 Dec 2007, I-Lin Kuo wrote:
>>>
>>>
>>>
>>> Hi Joachim,
>>>
>>> While I understand the intent, I'm not sure that localization is
>>> sufficient or the right solution.
>>>
>>> First, the ISO date + locale example is not correct. ISO 8601 is locale
>>> neutral, and time elements are arranged in descending order. If 12
>>> represents the month, then the US example 2007-10-12 is not an ISO date.
>>> Secondly, the DateFormatStandardName + locale (=ISO + US) scheme of
>>> identifying formats is not expressive enough to cover 10-DEC-2007 and all
>>> the other possible variations on date that might occur, unless we greatly
>>> expand the set of allowed identifying formats (ORA-US). If we do allow
>>> nonstandard formats, do the formats then mean anything? ORA-US to me
>>> means Oracle date format, US, but might not mean that to someone else.
>>> I would vote for YYYY-MM-DD for date specifications rather than a name.
>>>
>>> In general, I favor specific markup for specific cases rather than a
>>> general approach of localization. For money and currency, I would simply
>>> prefer @unitOfCurrency and @decimalDelimiter @thousandsDelimiter to solve
>>> the problem rather than a more general localization approach. This may be
>>> for no other reason than that the country of currency is no longer
>>> sufficient to specify whether the currency is in marks or euros.
>>>
>>> The other reason I don't favor the localization approach is that for the
>>> data format concern, I see date, number, and currency as the only issues.
>>> The other items on the list at
>>> http://en.wikipedia.org/wiki/Internationalization_and_localization
>>> are all already covered.
>>>
>>> On Dec 12, 2007 7:42 AM, Joachim Wackerow <joachim.wackerow at gesis.org>
>>> wrote:
>>>
>>>
>>>
>>> Looking at the SAS formats I realized that we would need additional
>>> information like locale and/or language for specific formats.
>>>
>>> For example, some date formats use a string representation of the "day of
>>> the week". Assume strings like "SUN" or "MON" in the data file. This
>>> can be represented by a generic format, but additionally a definition of
>>> the language used would be necessary.
>>>
>>> The same holds for dates like 10.12.2007 (in Germany in ISO format
>>> 2007-12-10, in the USA in ISO format 2007-10-12); with a generic format,
>>> additional information about the locale would be necessary. The
>>> alternative would be to have a specific format definition for each
>>> variation. But then the information that the format is locale-dependent
>>> is lost.
>>>
>>> Reading numeric or monetary values with embedded grouping (or thousands)
>>> separator and decimal separator is another candidate for localization.
>>> We already have explicit elements for decimal and grouping separators.
>>> But an alternative way would be to use a generic numeric format with a
>>> locale.
>>>
>>> The locale and language information should stay at the same place where
>>> the data format is defined. Both can be seen as attributes of data
>>> format.
>>>
>>> In general I think both ways can make sense: definition of a specific
>>> format by a name (for a related type) and definition of a generic format
>>> with attributes like decimal separator.
>>>
>>> SPSS has no NLS support; SAS has NLS support, but also old-style fixed
>>> definitions; SQL also has both. When both ways of definition are
>>> available, the work of describing the formats seems to be easier. The
>>> mapping table and the applications using the mapping table get
>>> more complicated. But doing formats without NLS seems to be a bad choice.
>>>
>>> Achim
>>> _______________________________________________
>>> DDI-SRG mailing list
>>> DDI-SRG at icpsr.umich.edu
>>> http://www.icpsr.umich.edu/mailman/listinfo/ddi-srg
>>>
>>> --
>>> I-Lin Kuo
>>>
>>>
>>>
>>> Wendy L. Thomas Phone: +1 612.624.4389
>>> Data Access Core Director Fax: +1 612.626.8375
>>> Minnesota Population Center Email: wlt at pop.umn.edu
>>> University of Minnesota
>>> 50 Willey Hall
>>> 225 19th Avenue South
>>> Minneapolis, MN 55455
>>>
>>>
>>> --
>>> GESIS - German Social Science Infrastructure Services
>>> http://www.gesis.org/en/
>>
>> --
>> I-Lin Kuo
>>
>
--
GESIS - German Social Science Infrastructure Services
http://www.gesis.org/en/