[DDI-SRG] Don't remember if I sent this

Wendy Thomas wlt at pop.umn.edu
Thu Mar 19 10:56:39 EDT 2009


This covers the remaining 9 bugs not yet discussed. It contains the Mantis 
content plus suggested solutions for some of the bugs.  wlt



152	Data Formats
 	r:GenericOutputFormat (CodeValueType) is not enough to describe a 
format.
 	When a generic output format is to be described in full detail, 
all the properties of p:PhysicalLocationType are necessary. When only 
some core properties of a generic output format are to be described, the 
name of the format, the width, and optional DecimalPositions are 
necessary.
 	This should be DISCUSSED. I would opt for the first alternative 
"full detail".

 	The absolute minimal requirement to describe a data format is the 
existence of the name of the format and the width. These elements should 
therefore be required.

 	Any default structure for GenericOutputFormat in LogicalProduct or 
VariableScheme is missing (see my email on "Data Format Summary" from 
2008-01-18).
Suggestion (alternative "full detail"):
put p:PhysicalLocation in reusable and rename it to r:DataFormat
rename new r:StorageFormat to r:DataFormatName (CodeValueType)
invent p:StorageFormat which uses new r:DataFormat
r:GenericOutputFormat uses new r:DataFormat

Suggestion (alternative "core properties")
invent a r:DataFormat in reusable
r:DataFormat
   r:DataFormatName (was r:StorageFormat or r:GenericOutputFormat)
   r:Width
   r:DecimalPositions
changed p:StorageFormat uses r:DataFormat
r:GenericOutputFormat uses r:DataFormat

Definition of GenericOutputFormat "This field provides a recommended 
generic treatment of the data for display by an application. The value 
should come from a controlled vocabulary. "

The second suggestion results in having the Width in reusable and 
StartPosition in physicaldataproduct.

Any change in p:PhysicalLocation results in changes to the defaults in 
p:PhysicalStructure. The defaults are now not comprehensive. An element 
p:PhysicalStructure/p:DefaultStorageFormat can use r:DataFormat (of 
alternative "full detail" from above). This way the whole format structure 
can be used as defaults on an upper level.

Invent default structure for GenericOutputFormat. Possible solution 
according to the alternative "full detail" from above.
invent DefaultGenericOutputFormat which uses r:DataFormat

This new element can stay just below LogicalProduct (has effect for all 
variables in all VariableSchemes) and below VariableScheme (has effect for 
all variables in this VariableScheme).

**********
A type for the data format can make sense (character|numeric). I think we 
already had a discussion, but I don't remember the outcome.

***********
Addition to suggestion

r:TextRepresentationType, r:DateTimeRepresentationType, and 
r:NumericRepresentationType

r:GenericOutputFormat (according to the suggestion alternative "full 
detail") has type r:DataFormat and is used by TextRepresentationType, 
DateTimeRepresentationType, and NumericRepresentationType. Not all details 
of r:DataFormat make sense for every RepresentationType.

NumericRepresentationType: reasonable subset of r:DataFormat

DataFormatName
Delimiter
StartPosition
EndPosition
Width
DecimalPositions
DecimalSeparator
DigitGroupSeparator

TextRepresentationType and DateTimeRepresentationType: reasonable subset 
of r:DataFormat

DataFormatName
StartPosition
EndPosition
Width
LanguageOfData
LocaleOfData

--------------------------------

ProprietaryRecordLayout

CodedDataAsNumeric, CodedDataAsText

Shouldn't these elements use the same structure as in p:PhysicalLocation 
(or the suggested new r:DataFormat)? These elements are using attributes, 
p:PhysicalLocation is using elements for the same purpose. Apparently a 
r:DataFormat with all details could be reused (sometimes as restricted 
version) at three locations: p:PhysicalLocation, ProprietaryRecordLayout, 
r:RepresentationType used in l:ValueRepresentation. This is a consistency 
and reuse issue.

--------------------------------

Mapping / Association between different data format schemes

Is this cancelled or just not realized or did I miss something in the XML 
schemas?

************
After thinking again about date formats and formats for specialized 
purposes, I think that an additional field must be added to a reusable 
data format. It should hold a format pattern string whose syntax depends 
on the pattern language used.

The main purpose of this element is the various date formats, which often 
depend on a locale. In cases where the common approach to describing a 
format (format name, width, decimal positions in case of a numeric format) 
is not sufficient, this format pattern can be used.

The format pattern string depends on the pattern language used. It doesn't 
make sense to invent a new pattern system. It should be strongly 
recommended to use the format patterns of Java. The Java format strings 
build on experience with other common format strings like FORTRAN, C, 
POSIX.

The essential information on Java formats (with examples) is available at 
Sun's web site.

Class SimpleDateFormat
http://java.sun.com/javase/6/docs/api/java/text/SimpleDateFormat.html

Class DecimalFormat (numeric)
http://java.sun.com/javase/6/docs/api/java/text/DecimalFormat.html

PROPOSED SOLUTION

ProprietaryOutputFormat (in ProprietaryRecordLayout), 
r:RepresentationType, p:PhysicalLocation should use a common data format 
structure (in reusable) which can be restricted to the related purpose.

DataFormatType
   FormatName r:CodeValueType [1..1] CHANGED NAME, CHANGED CARDINALITY
   FormatPattern xs:string [0..1] NEW ELEMENT
     @language r:CodeValueType
   Delimiter xs:string [0..1]
   SEQUENCE
     StartPosition xs:integer [0..1]
     ArrayPosition xs:integer [0..1]
     EndPosition xs:integer [0..1]
     Width xs:integer [0..1]
   END SEQUENCE
   DecimalPositions xs:integer [0..1]
   DecimalSeparator r:OneCharStringType [0..1]
   DigitGroupSeparator r:OneCharStringType [0..1]
   LanguageOfData xs:string [0..1]
   LocaleOfData xs:string [0..1]


Suggestion for basic controlled vocabularies:

FormatName - Integer, Real, String, Date
FormatPattern/@language - Java, C, SAS, SPSS, Stata
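
To make the proposed cardinalities and vocabularies concrete, here is a 
minimal Python sketch of the DataFormatType listed above. It is not part of 
any DDI schema: the class name DataFormat, the field names, and the 
problems() helper are assumptions made for illustration, and the two 
vocabularies are just the basic lists suggested above.

```python
from dataclasses import dataclass
from typing import List, Optional

# Basic controlled vocabularies as suggested in the proposal (assumed closed here).
FORMAT_NAMES = {"Integer", "Real", "String", "Date"}
PATTERN_LANGUAGES = {"Java", "C", "SAS", "SPSS", "Stata"}


@dataclass
class DataFormat:
    """Hypothetical in-memory mirror of the proposed DataFormatType."""
    format_name: str                             # FormatName [1..1]
    format_pattern: Optional[str] = None         # FormatPattern [0..1]
    pattern_language: Optional[str] = None       # FormatPattern/@language
    delimiter: Optional[str] = None              # Delimiter [0..1]
    start_position: Optional[int] = None         # StartPosition [0..1]
    array_position: Optional[int] = None         # ArrayPosition [0..1]
    end_position: Optional[int] = None           # EndPosition [0..1]
    width: Optional[int] = None                  # Width [0..1]
    decimal_positions: Optional[int] = None      # DecimalPositions [0..1]
    decimal_separator: Optional[str] = None      # r:OneCharStringType
    digit_group_separator: Optional[str] = None  # r:OneCharStringType
    language_of_data: Optional[str] = None       # LanguageOfData [0..1]
    locale_of_data: Optional[str] = None         # LocaleOfData [0..1]

    def problems(self) -> List[str]:
        """Return human-readable violations of the sketched constraints."""
        found = []
        if self.format_name not in FORMAT_NAMES:
            found.append("FormatName not in suggested vocabulary")
        if self.pattern_language is not None and self.format_pattern is None:
            found.append("@language given without a FormatPattern")
        if (self.pattern_language is not None
                and self.pattern_language not in PATTERN_LANGUAGES):
            found.append("FormatPattern/@language not in suggested vocabulary")
        for label, value in (("DecimalSeparator", self.decimal_separator),
                             ("DigitGroupSeparator", self.digit_group_separator)):
            if value is not None and len(value) != 1:
                found.append(label + " must be exactly one character")
        return found
```

For example, DataFormat("Date", format_pattern="yyyy-MM-dd", 
pattern_language="Java") passes all of these checks; "yyyy-MM-dd" is a 
SimpleDateFormat pattern.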

*************
Discussion in TIC resulted in the decision to make this a 3.0 bug. The 
basic elements are available and this is a discussion regarding the level 
of detail that needs to be provided for GenericOutputFormat. The 
discussion should take place and the issue be resolved for the next minor 
revision following publication of 3.0.

TIC Discussion:

A major problem with putting all this information in GenericOutputFormat 
is that it moves all of this physical information into the logical 
content. We seem to be repacking things back into the variable. If we're 
going to get into default display, we should provide default display style 
sheets, not pack it into the logical description.

SOLUTION:


+++++++++++++++++++++++++++++++++++++++++

209	Weighted element in Summary Statistics / Category Statistics
The Weight element in SummaryStatistic / CategoryStatistic is currently 
mandatory. This only applies to specific datasets and should be optional 
with a default value of false. This could also be an attribute.

Make Weighted an optional attribute of SummaryStatistic / 
CategoryStatistic (instead of mandatory child)

SOLUTION:  The issue is that VariableStatisticsType contains an xs:choice 
of WeightUsedReference 0..1 and WeightVariableReference 0..1.
In some cases no weighting factor is used, but this option is not 
available.

CHANGE xs:choice to minOccurs="0", allowing for no weight, a standard 
weight OR a weight variable.
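
The effect of an optional xs:choice can be sketched in Python as follows. 
This is only an illustration of the three permitted cases, not a schema 
validator: the function name weight_mode and the wrapper element 
VariableStatistics are assumptions, and namespaces are ignored.

```python
import xml.etree.ElementTree as ET


def weight_mode(stats_xml: str) -> str:
    """Classify which weighting option a statistics fragment uses.

    With xs:choice minOccurs="0", zero or one of the two references
    may appear, never both.
    """
    root = ET.fromstring(stats_xml)
    # Compare local names only, so any namespace prefix is ignored.
    names = [child.tag.rsplit("}", 1)[-1] for child in root]
    used = names.count("WeightUsedReference")
    variable = names.count("WeightVariableReference")
    if used + variable > 1:
        raise ValueError("xs:choice allows at most one weight reference")
    if used:
        return "standard weight"
    if variable:
        return "weight variable"
    return "unweighted"
```

Under the current mandatory choice, the "unweighted" case would be 
rejected; making the choice optional is what admits it.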

+++++++++++++++++++++++++++++++++++++++++

222	Some measures of statistical dispersion are not correctly listed 
in SummaryStatisticTypeCodedType. Following measures are affected: 
quartiles, quintiles, deciles.

Definition: A percentile is the value of a variable below which a certain 
percent of observations fall.

The 50th percentile is equivalent to the median or second quartile. So the 
median divides the dispersion in two parts where one half is below the 
limit and one half above. The quartiles divide the dispersion in four 
parts by three limits. The quintiles divide the dispersion in five parts 
by four limits etc. The deciles divide the dispersion in 10 parts by 9 
limits. The fifth decile is equivalent to the median.
Remove following entries from the enumeration list in 
SummaryStatisticTypeCodedType:

FourthQuartile
FifthQuintile
FifthDecile
TenthDecile

Change the field documentation of median to:
median, equivalent to second quartile and fifth decile
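
The counting argument above (k parts are separated by k-1 limits, and the 
middle limit coincides with the median) can be checked with a short Python 
sketch. The helper name dispersion_limits and the sample values are 
assumptions for illustration; statistics.quantiles is the standard library 
function and uses its default interpolation method here.

```python
import statistics


def dispersion_limits(data, parts):
    """Return the limits that divide the data into `parts` parts."""
    return statistics.quantiles(data, n=parts)


# Hypothetical sample of 11 observations.
sample = [2, 4, 4, 5, 7, 9, 10, 12, 15, 18, 21]

quartiles = dispersion_limits(sample, 4)    # three limits
quintiles = dispersion_limits(sample, 5)    # four limits
deciles = dispersion_limits(sample, 10)     # nine limits

# The second quartile and the fifth decile are both the median, so a
# "FourthQuartile", "FifthQuintile" or "TenthDecile" limit does not exist.
median = statistics.median(sample)
```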


SOLUTION: While the assessment is accurate, the codes are not inaccurate, 
just repetitive in terms of definition. Since there is a bug filed to do an 
extensive review of hard-coded code value lists in DDI, we should address 
this in that context. Create a link in Mantis and change status to 
assigned.

+++++++++++++++++++++++++++++++++++++++++

227 CategoryStatisticTypeCodedType is a list of values, which represent a 
statistic for categories. Some values don't make sense:

- CrossTabulation (probably copied from DDI 2, now crosstabulation can be 
better represented by FilterCategoryStatistics)
The following are summary statistics
- ValidCases
- InvalidCases
- Minimum
- Maximum
- StandardDeviation

Remove from the enumeration of CategoryStatisticTypeCodedType :
- CrossTabulation
- ValidCases
- InvalidCases
- Minimum
- Maximum
- StandardDeviation

SOLUTION: Table and address during comprehensive review. Link to review 
bug and change status to assigned. Note that, in particular, filtered 
cases may have differing valid and invalid case counts, minimum, maximum, 
and standard deviation.

+++++++++++++++++++++++++++++++++++++++++

228	Totals in code categories
Currently totals (the sum of codes below a specific code/category) can 
only be described by the label and definition of a category. In a 
machine-actionable sense this is poor. Programs can recognize such 
categories as totals only by a used convention. Totals and sub-totals are 
important for hierarchical coding schemes and for ncubes.
Invention of a boolean attribute "total" at "Category". This way
categories can be defined for valid total and missing total in
combination with the existing attribute "missing".

SOLUTION: This seems to be a feature more of a coding scheme than a 
category scheme. Code currently has an isComprehensive attribute "Used in 
hierarchical structures at upper level values to indicate whether or not 
the subelements of the code are comprehensive in coverage. Not applicable 
if attribute isDiscrete is set to "true"."
If @isComprehensive is "true" then this is a total and subvalues can be 
used to generate the total IF no suppression is applied to lower levels. If 
@isComprehensive is "false" the content is a total, but not all the 
subvalues are listed, so addition cannot be used to generate the value. If 
@isComprehensive is "Unknown" no assumptions should be made concerning the 
ability to create the total.
NO ACTION NEEDED an attribute "total" would be repetitive.
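
The three cases for @isComprehensive can be written out as a small decision 
function. This is a sketch of the reading given above, not DDI-defined 
behavior: the function name total_derivable and the suppressed flag are 
assumptions for illustration.

```python
from typing import Optional


def total_derivable(is_comprehensive: str, suppressed: bool = False) -> Optional[bool]:
    """Can the total be generated by adding up the listed subvalues?

    Returns True or False when this can be decided, and None when no
    assumption should be made.
    """
    value = is_comprehensive.strip().lower()
    if value == "true":
        # All subvalues are listed; addition works unless some are suppressed.
        return not suppressed
    if value == "false":
        # A total, but not every subvalue is listed.
        return False
    if value == "unknown":
        return None
    raise ValueError("unexpected isComprehensive value: " + is_comprehensive)
```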

+++++++++++++++++++++++++++++++++++++++++

207 	Repetition of CodeSchemeReference
> Problem: [as reported by Jannik]
> If the IncludedCodeReference is referring to a CodeType, we have an error
> here, CodeType not being an AbstractIdentifiableType:
> <xs:complexType name="IncludedCodeReferenceType">
> <xs:annotation>
> <xs:documentation>Allows selection of specific codes not based on levels.</xs:documentation>
> </xs:annotation>
> <xs:complexContent>
> <xs:extension base="r:ReferenceType">
> <xs:attribute name="coordinateValue" type="xs:integer" use="optional"/>
> </xs:extension>
> </xs:complexContent>
> </xs:complexType>
>
> Fix:
> CodeType made AbstractIdentifiable
>

INITIAL RESPONSE:
CodeSchemeReference occurs in 3 locations as a CodeSchemeReferenceType 
which is an extension of r:Reference (as opposed to a r:SchemeReference) 
with the extension CodeSubsetInfo

CodeSubsetInfo allows you to include specific Codes by Level, most 
discrete only, or by identifying specific Code Values. I see there is no 
field level documentation for this. IncludedCodeReference is an extension 
of r:Reference with an attribute extension of coordinateValue which 
provides the value of the specific Code. Not that any of this is clear, 
given the lack of documentation AND the weird name (Arofan must have been 
thinking in NCubes at the time). This is also as clear as mud in the User 
Manual. Both should be fixed.

The reason Code has no ID is that its value by definition already provides 
it with unique identification.

RESPONSE:
> Thanks for laying this out for me. As I read your answer, the
> CodeSchemeReference is a reference to the CodeScheme. The
> IncludedCodeReference is holding a code value to be included in the
> referred CodeScheme.
>
> Now if the Code can not be referred, because its id is its value and the
> attribute extension of IncludedCodeReference is providing the value I
> have the impression the base extension of Reference for
> IncludedCodeReference is referring void, and therefore could be deleted
> in an update of the LogicalProduct schema.

COMMENT:
Yes, that second reference is definitely redundant. I wouldn't say it's a
"void" as it must reference something. Also, since value is an attribute,
they need to "hang" it on an element. So I can see that it might be
removed in the future. This would probably result in the attribute
becoming an element unless they had an empty element with just an
attribute...which would be an oddity in DDI 3. However, since that would be
a non-backward compatible change it would result in a major revision and
lots of xml editing. Could this mean it becomes a "feature" rather than a
"bug"? I'll make sure this point is brought up in TIC and the discussion
is entered in Mantis.

I guess the short answer is, that in terms of programming you should
probably stick to its parent CodeSchemeReference (the one under
CodeRepresentation). After all they have to match. Then your programming
will be safe either way we go.

*******************************
We need to review this situation. It looks like a fix would require 
changing an attribute to an element. Right now the documentation is 
missing, but the content is not conflicting. If a change would be 
non-backward compatible, is this something that should be "fixed" or does 
it become a feature?

REGARDLESS OF DECISION ON SCHEMA STRUCTURE, we must fix the field level 
documentation and explain this in the User Manual

SOLUTION:

+++++++++++++++++++++++++++++++++++++++

235	Anchored Scales
Anchored scales can currently be described by providing a CodeScheme that 
references a Category with content for the upper and lower bounds and a 
blank label category for the intervening values. We handle Likert scales 
(each code is labeled), but frequently surveys use a multipoint scale 
with label anchors only:

strongly disagree 1 2 3 4 5 6 7 8 9 10 strongly agree

DDI does not handle these well. The purpose of these scales is to suggest 
to the interviewee an interval level scale (Likert scales are ordinal).

****************************
In essence this is a type of CodeScheme, so it may be appropriate to 
address this in the category description by having a means of describing 
the intermittent points.

[add notes from Jeremy]
[validate that code can be a negative number]

Note that problem is not just variable representation but representation 
in a questionnaire where the scale may need to be laid out horizontally or 
include a line with or without demarcations.
However it does seem to be a specialized form of categories unless there 
is a continuous numeric with labeled anchors. Note that scales with 
intermediate labels are forms of code schemes and can be ordinal or 
interval in nature.


SOLUTION:

+++++++++++++++++++++++++++++++++++++++


208	Code Element in Control Construct
203	Investigate Program language

[pull together other notes]



Wendy L. Thomas                          Phone: +1 612.624.4389
Data Access Core Director		 Fax:   +1 612.626.8375
Minnesota Population Center              Email: wlt at pop.umn.edu
University of Minnesota
50 Willey Hall
225 19th Avenue South
Minneapolis, MN 55455

