[DDI-SRG] Don't remember if I sent this
Wendy Thomas
wlt at pop.umn.edu
Thu Mar 19 10:56:39 EDT 2009
This covers the remaining 9 bugs not yet discussed. It contains the Mantis
content plus suggested solutions for some of the bugs. wlt
152 Data Formats
r:GenericOutputFormat (CodeValueType) is not enough to describe a
format.
To describe a generic output format in full detail, all the properties of
p:PhysicalLocationType are necessary. To describe only the core properties
of a generic output format, the name of the format, the width, and optional
DecimalPositions are necessary.
This should be DISCUSSED. I would opt for the first alternative
"full detail".
The absolute minimal requirement to describe a data format is the
existence of the name of the format and the width. These elements should
therefore be required.
There is no default structure for GenericOutputFormat in LogicalProduct or
VariableScheme (see my email on "Data Format Summary" from 2008-01-18).
Suggestion (alternative "full detail"):
put p:PhysicalLocation in reusable and rename it to r:DataFormat
rename new r:StorageFormat to r:DataFormatName (CodeValueType)
invent p:StorageFormat which uses new r:DataFormat
r:GenericOutputFormat uses new r:DataFormat
Suggestion (alternative "core properties")
invent an r:DataFormat in reusable
r:DataFormat
r:DataFormatName (was r:StorageFormat or r:GenericOutputFormat)
r:Width
r:DecimalPositions
changed p:StorageFormat uses r:DataFormat
r:GenericOutputFormat uses r:DataFormat
Definition of GenericOutputFormat "This field provides a recommended
generic treatment of the data for display by an application. The value
should come from a controlled vocabulary. "
The second suggestion results in having the Width in reusable and
StartPosition in physicaldataproduct.
Any change in p:PhysicalLocation results in changes to the defaults in
p:PhysicalStructure. The defaults are now not comprehensive. An element
p:PhysicalStructure/p:DefaultStorageFormat can use r:DataFormat (of
alternative "full detail" from above). This way the whole format structure
can be used as defaults on an upper level.
Invent default structure for GenericOutputFormat. Possible solution
according to the alternative "full detail" from above.
invent DefaultGenericOutputFormat which uses r:DataFormat
This new element can stay just below LogicalProduct (has effect for all
variables in all VariableSchemes) and below VariableScheme (has effect for
all variables in this VariableScheme).
**********
A type for the data format (character|numeric) could make sense. I think we
already had a discussion about this, but I don't remember the outcome.
***********
Addition to suggestion
r:TextRepresentationType, r:DateTimeRepresentationType, and
r:NumericRepresentationType
r:GenericOutputFormat (according to the suggestion alternative "full
detail") has type r:DataFormat and is used by TextRepresentationType,
DateTimeRepresentationType, and NumericRepresentationType. Not all details
of r:DataFormat make sense for every RepresentationType.
NumericRepresentationType: reasonable subset of r:DataFormat
DataFormatName
Delimiter
StartPosition
EndPosition
Width
DecimalPositions
DecimalSeparator
DigitGroupSeparator
TextRepresentationType and DateTimeRepresentationType: reasonable subset
of r:DataFormat
DataFormatName
StartPosition
EndPosition
Width
LanguageOfData
LocaleOfData
--------------------------------
ProprietaryRecordLayout
CodedDataAsNumeric, CodedDataAsText
Shouldn't these elements use the same structure as in p:PhysicalLocation
(or the suggested new r:DataFormat)? These elements are using attributes;
p:PhysicalLocation is using elements for the same purpose. Apparently an
r:DataFormat with all details could be reused (sometimes as a restricted
version) at three locations: p:PhysicalLocation, ProprietaryRecordLayout,
and r:RepresentationType used in l:ValueRepresentation. This is a
consistency and reuse issue.
--------------------------------
Mapping / Association between different data format schemes
Is this cancelled or just not realized or did I miss something in the XML
schemas?
************
After thinking again about date formats and formats for specialized purposes,
I think that an additional field must be added to a reusable data format.
It should hold a format pattern string whose syntax depends on the pattern
language used. The main purpose of this element is to capture the various
date formats, which are often locale-dependent. In cases where the common
approach to describing a format (format name, width, decimal positions in
the case of a numeric format) is not sufficient, this format pattern can be
used.
The format pattern string depends on the pattern language used. It doesn't
make sense to invent a new pattern system. It should be strongly
recommended to use the format patterns of Java. The Java format strings
build on experience with other common format strings like FORTRAN, C, and
POSIX.
The essential information on Java formats (with examples) is available at
Sun's web site.
Class SimpleDateFormat
http://java.sun.com/javase/6/docs/api/java/text/SimpleDateFormat.html
Class DecimalFormat (numeric)
http://java.sun.com/javase/6/docs/api/java/text/DecimalFormat.html
PROPOSED SOLUTION
ProprietaryOutputFormat (in ProprietaryRecordLayout),
r:RepresentationType, p:PhysicalLocation should use a common data format
structure (in reusable) which can be restricted to the related purpose.
DataFormatType
  FormatName           r:CodeValueType      [1..1]  CHANGED NAME, CHANGED CARDINALITY
  FormatPattern        xs:string            [0..1]  NEW ELEMENT
    @language          r:CodeValueType
  Delimiter            xs:string            [0..1]
  SEQUENCE
    StartPosition      xs:integer           [0..1]
    ArrayPosition      xs:integer           [0..1]
    EndPosition        xs:integer           [0..1]
    Width              xs:integer           [0..1]
  END SEQUENCE
  DecimalPositions     xs:integer           [0..1]
  DecimalSeparator     r:OneCharStringType  [0..1]
  DigitGroupSeparator  r:OneCharStringType  [0..1]
  LanguageOfData       xs:string            [0..1]
  LocaleOfData         xs:string            [0..1]
Suggestion for basic controlled vocabularies:
FormatName - Integer, Real, String, Date
FormatPattern/@language - Java, C, SAS, SPSS, Stata
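To make the role of FormatPattern and its language attribute more concrete, here is a minimal sketch in Python. The class and field names are hypothetical mirrors of the proposed elements, only a few of the optional elements are modelled, and only patterns whose language is "C" are rendered (via strftime). The Java equivalents in the comments are shown for comparison only and are not executed.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DataFormat:
    """Hypothetical stand-in for a subset of the proposed DataFormatType."""
    format_name: str                         # FormatName [1..1]
    format_pattern: Optional[str] = None     # FormatPattern [0..1]
    pattern_language: Optional[str] = None   # FormatPattern/@language
    width: Optional[int] = None              # Width [0..1]
    locale_of_data: Optional[str] = None     # LocaleOfData [0..1]


def render_date(value: date, fmt: DataFormat) -> str:
    """Render a date using a C-style pattern; fall back to ISO 8601 if no pattern is given."""
    if fmt.format_pattern is None:
        return value.isoformat()
    if fmt.pattern_language != "C":
        # Java, SAS, SPSS, and Stata patterns are outside the scope of this sketch.
        raise NotImplementedError(f"unsupported pattern language: {fmt.pattern_language}")
    return value.strftime(fmt.format_pattern)


# Two descriptions of the same logical "Date" format with different patterns.
iso_style = DataFormat("Date", "%Y-%m-%d", "C", width=10)        # Java: yyyy-MM-dd
german_style = DataFormat("Date", "%d.%m.%Y", "C", width=10,
                          locale_of_data="de_DE")               # Java: dd.MM.yyyy

sent = date(2009, 3, 19)
print(render_date(sent, iso_style))
print(render_date(sent, german_style))
```

The same date comes out as 2009-03-19 under the first pattern and 19.03.2009 under the second, which is the kind of locale dependence that a format name and width alone cannot express.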
*************
Discussion in TIC resulted in the decision to make this a 3.0 bug. The
basic elements are available and this is a discussion regarding the level
of detail that needs to be provided for GenericOutputFormat. The
discussion should take place and the issue be resolved for the next minor
revision following publication of 3.0.
TIC Discussion:
There is a major problem with putting all this information in
GenericOutputFormat: it moves all of this physical information into the
logical content. We seem to be repacking things back into the variable. If
we're going to get into default display, we should provide default display
style sheets, not pack it into the logical description.
SOLUTION:
+++++++++++++++++++++++++++++++++++++++++
209 Weighted element in Summary Statistics / Category Statistics
The Weight element in SummaryStatistic / CategoryStatistic is currently
mandatory. This only applies to specific datasets and should be optional
with a default value of false. This could also be an attribute.
Make Weighted an optional attribute of SummaryStatistic /
CategoryStatistic (instead of a mandatory child).
SOLUTION: The issue is that VariableStatisticsType contains an xs:choice
of WeightUsedReference 0..1 and WeightVariableReference 0..1.
In some cases no weighting factor is used, but this option is not available.
CHANGE xs:choice to minOccurs="0", allowing for no weight, a standard
weight, OR a weight variable.
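The intended effect of this change can be sketched as a small validity check. The function below is a hypothetical Python illustration, not schema-validation code: it treats the two references as optional values and accepts either none or exactly one of them, which is what the optional xs:choice is meant to permit.

```python
from typing import Optional


def weight_choice_is_valid(weight_used_reference: Optional[str],
                           weight_variable_reference: Optional[str]) -> bool:
    """Accept no weight, a standard weight, or a weight variable, but never both."""
    present = [ref for ref in (weight_used_reference, weight_variable_reference)
               if ref is not None]
    return len(present) <= 1


print(weight_choice_is_valid(None, None))         # unweighted statistics
print(weight_choice_is_valid("W1", None))         # standard weight only
print(weight_choice_is_valid(None, "V_WEIGHT"))   # weight variable only
print(weight_choice_is_valid("W1", "V_WEIGHT"))   # both: not allowed by a choice
```

The identifiers "W1" and "V_WEIGHT" are made-up placeholders for reference targets.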
+++++++++++++++++++++++++++++++++++++++++
222 Some measures of statistical dispersion are not correctly listed
in SummaryStatisticTypeCodedType. The following measures are affected:
quartiles, quintiles, deciles.
Definition: A percentile is the value of a variable below which a certain
percent of observations fall.
The 50th percentile is equivalent to the median or second quartile. So the
median divides the distribution into two parts, where one half is below the
limit and one half above. The quartiles divide the distribution into four
parts by three limits. The quintiles divide the distribution into five parts
by four limits, etc. The deciles divide the distribution into 10 parts by 9
limits. The fifth decile is equivalent to the median.
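The counting argument can be checked numerically. The following is a minimal sketch using Python's standard statistics module; the sample values are made up for illustration and are not taken from any DDI instance.

```python
import statistics

# Hypothetical sample of 11 observations (illustrative only).
values = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 20]

# quantiles(n=k) returns the k-1 limits that split the data into k parts.
quartiles = statistics.quantiles(values, n=4)
quintiles = statistics.quantiles(values, n=5)
deciles = statistics.quantiles(values, n=10)

median = statistics.median(values)

# Number of limits for quartiles, quintiles, and deciles.
print(len(quartiles), len(quintiles), len(deciles))

# Median, second quartile, and fifth decile.
print(median, quartiles[1], deciles[4])
```

There are three quartile limits, four quintile limits, and nine decile limits, so no fourth quartile, fifth quintile, or tenth decile exists as a limit. For this sample the second quartile and the fifth decile coincide with the median.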
Remove the following entries from the enumeration list in
SummaryStatisticTypeCodedType:
FourthQuartile
FifthQuintile
FifthDecile
TenthDecile
Change the field documentation of median to:
median, equivalent to second quartile and fifth decile
SOLUTION: While the assessment is accurate, the codes are not inaccurate,
just repetitive in terms of definition. Since there is a bug filed to do an
extensive review of hard-coded code value lists in DDI, we should address
this in that context. Create a link in Mantis and change status to
assigned.
+++++++++++++++++++++++++++++++++++++++++
227 CategoryStatisticTypeCodedType is a list of values, which represent a
statistic for categories. Some values don't make sense:
- CrossTabulation (probably copied from DDI 2, now crosstabulation can be
better represented by FilterCategoryStatistics)
The following are summary statistics
- ValidCases
- InvalidCases
- Minimum
- Maximum
- StandardDeviation
Remove from the enumeration of CategoryStatisticTypeCodedType :
- CrossTabulation
- ValidCases
- InvalidCases
- Minimum
- Maximum
- StandardDeviation
SOLUTION: Table and address during comprehensive review. Link to review
bug and change status to assigned. Note that in particular filtered cases
may have differing valid case and invalid case counts, minimum and
maximum, and standard deviation.
+++++++++++++++++++++++++++++++++++++++++
228 Totals in code categories
Currently totals (the sum of codes below a specific code/category) can be
only described by label and definition of a category. In a
machine-actionable sense this is poor. Programs can recognize such
categories as totals only by a used convention. Totals and sub-totals are
important for hierarchical coding schemes and for ncubes.
Invention of a boolean attribute "total" at "Category". This way
categories can be defined for valid total and missing total in
combination with the existing attribute "missing".
SOLUTION: This seems to be a feature more of a coding scheme than a
category scheme. Code currently has an isComprehensive attribute: "Used in
hierarchical structures at upper level values to indicate whether or not
the subelements of the code are comprehensive in coverage. Not applicable
if attribute isDiscrete is set to "true"."
If @isComprehensive is "true" then this is a total and subvalues can be
used to generate the total IF no suppression is applied to lower levels. If
@isComprehensive is "false" the content is a total but not all the
subvalues are listed, so addition cannot be used to generate the value. If
@isComprehensive is "Unknown" no assumptions should be made concerning the
ability to create the total.
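The three cases amount to a small decision rule, sketched here in Python. The function name can_sum_subvalues and the suppression_applied flag are hypothetical names introduced for illustration; they are not part of the DDI schema.

```python
from typing import Optional


def can_sum_subvalues(is_comprehensive: str, suppression_applied: bool) -> Optional[bool]:
    """Decide whether a total can be generated by adding the listed subvalues.

    Returns True or False when the attribute value allows a decision,
    and None when no assumption should be made.
    """
    flag = is_comprehensive.strip().lower()
    if flag == "true":
        # All subvalues are listed; addition works unless lower levels are suppressed.
        return not suppression_applied
    if flag == "false":
        # The code is a total, but the listed subvalues are incomplete.
        return False
    if flag == "unknown":
        return None
    raise ValueError(f"unexpected isComprehensive value: {is_comprehensive!r}")


print(can_sum_subvalues("true", suppression_applied=False))
print(can_sum_subvalues("false", suppression_applied=False))
print(can_sum_subvalues("Unknown", suppression_applied=False))
```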
NO ACTION NEEDED an attribute "total" would be repetitive.
+++++++++++++++++++++++++++++++++++++++++
207 Repetition of CodeSchemeReference
> Problem: [as reported by Jannik]
> If the IncludedCodeReference is referring to a CodeType, we have an error
> here: CodeType is not an AbstractIdentifiableType.
> <xs:complexType name="IncludedCodeReferenceType">
>   <xs:annotation>
>     <xs:documentation>Allows selection of specific codes not based on levels.</xs:documentation>
>   </xs:annotation>
>   <xs:complexContent>
>     <xs:extension base="r:ReferenceType">
>       <xs:attribute name="coordinateValue" type="xs:integer" use="optional"/>
>     </xs:extension>
>   </xs:complexContent>
> </xs:complexType>
>
> Fix:
> CodeType made AbstractIdentifiable
>
INITIAL RESPONSE:
CodeSchemeReference occurs in 3 locations as a CodeSchemeReferenceType,
which is an extension of r:Reference (as opposed to an r:SchemeReference)
with the extension CodeSubsetInfo.
CodeSubsetInfo allows you to include specific Codes by Level, most
discrete only, or by identifying specific Code Values. I see there is no
field level documentation for this. IncludedCodeReference is an extension
of r:Reference with an attribute extension of coordinateValue, which
provides the value of the specific Code. Not that any of this is clear,
given the lack of documentation AND the weird name (Arofan must have been
thinking in NCubes at the time). This is also as clear as mud in the User
Manual. Both should be fixed.
The reason Code has no ID is that its value by definition already provides
it with unique identification.
RESPONSE:
> Thanks for laying this out for me. As I read your answer, the
> CodeSchemeReference is a reference to the CodeScheme. The
> IncludedCodeReference is holding a code value to be included in the
> referred CodeScheme.
>
> Now if the Code can not be referred, because its id is its value and the
> attribute extension of IncludedCodeReference is providing the value I
> have the impression the base extension of Reference for
> IncludedCodeReference is referring void, and therefore could be deleted
> in an update of the LogicalProduct schema.
COMMENT:
Yes, that second reference is definitely redundant. I wouldn't say it's a
"void" as it must reference something. Also, since value is an attribute,
they need to "hang" it on an element. So I can see that it might be
removed in the future. This would probably result in the attribute
becoming an element unless they had an empty element with just an
attribute...which would be an oddity in DDI 3. However, since that would be
a non-backward compatible change it would result in a major revision and
lots of xml editing. Could this mean it becomes a "feature" rather than a
"bug"? I'll make sure this point is brought up in TIC and the discussion
is entered in Mantis.
I guess the short answer is, that in terms of programming you should
probably stick to its parent CodeSchemeReference (the one under
CodeRepresentation). After all they have to match. Then your programming
will be safe either way we go.
*******************************
We need to review this situation. It looks like a fix would require
changing an attribute to an element. Right now the documentation is
missing, but the content is not conflicting. If a change would be
non-backward compatible, is this something that should be "fixed" or does
it become a feature?
REGARDLESS OF DECISION ON SCHEMA STRUCTURE, we must fix the field level
documentation and explain this in the User Manual.
SOLUTION:
+++++++++++++++++++++++++++++++++++++++
235 Anchored Scales
Anchored scales can currently be described by providing a CodeScheme that
references a Category with content for the upper and lower bounds and a
blank label category for the intervening values. We handle Likert scales
(each code is labeled), but frequently surveys use a multipoint scale
with label anchors only:
strongly disagree 1 2 3 4 5 6 7 8 9 10 strongly agree
DDI does not handle these well. The purpose of these scales is to suggest
to the interviewee an interval level scale (Likert scales are ordinal).
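The current workaround can be pictured as follows. This is a hypothetical Python sketch, not DDI markup: the function name is invented here, and it simply pairs each code of a ten-point scale with a label that is blank except at the two anchors.

```python
from typing import Dict


def anchored_scale(low: int, high: int, low_label: str, high_label: str) -> Dict[int, str]:
    """Map every code from low to high to a label; only the two anchors are labeled."""
    if low >= high:
        raise ValueError("low must be smaller than high")
    labels = {code: "" for code in range(low, high + 1)}
    labels[low] = low_label
    labels[high] = high_label
    return labels


scale = anchored_scale(1, 10, "strongly disagree", "strongly agree")
for code, label in scale.items():
    print(code, label)
```

Eight of the ten codes carry an empty label, which shows why this representation is awkward: the blank entries say nothing about the scale being intended as interval level.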
****************************
In essence this is a type of CodeScheme, so it may be appropriate to
address this in the category description by having a means of describing
the intermediate points.
[add notes from Jeremy]
[validate that code can be a negative number]
Note that problem is not just variable representation but representation
in a questionnaire where the scale may need to be laid out horizontally or
include a line with or without demarcations.
However it does seem to be a specialized form of categories unless there
is a continuous numeric with labeled anchors. Note that scales with
intermediate labels are forms of code schemes and can be ordinal or
interval in nature.
SOLUTION:
+++++++++++++++++++++++++++++++++++++++
208 Code Element in Control Construct
203 Investigate Program language
[pull together other notes]
Wendy L. Thomas Phone: +1 612.624.4389
Data Access Core Director Fax: +1 612.626.8375
Minnesota Population Center Email: wlt at pop.umn.edu
University of Minnesota
50 Willey Hall
225 19th Avenue South
Minneapolis, MN 55455