[DDI-CDG] Thoughts on relation of files / overlap of CDG and CF
Joachim Wackerow
wackerow at zuma-mannheim.de
Wed May 19 04:07:58 EDT 2004
Dear all,
I would like to share some quick thoughts in preparation for the
discussion of the complex file (CF) group's proposal. As Tom Piazza
already mentioned months ago, an important overlap exists between the CDG
and the CF group (unfortunately, I have not had enough time to attempt an
integrated model).
The CDG group's primary task is to deal with the combination of data
files at the variable level, with the goal of comparing groups of cases.
This is done in the vertical direction by harmonizing variables with the
same meaning.
The CF group's primary task is to deal with the combination of data files
at the case level. This is done mainly in the horizontal direction, with
more or less parallel data files, by matching common ID variables.
(A data file in this sense is a logical rectangular file, which could
consist of several physical files.)
I would suggest the following goals for describing the relation of files:
- Clear means of description that are easy for authors to use.
- Use of commonly accepted formal language fragments with known semantics.
- Precise formulation of the relation, so that an application
could generate the combined data file.
The second point means that we should build on common formal
expressions of relations, such as SQL (join and union clauses) or SPSS
(match files, add files).
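To make the two directions concrete, here is a minimal sketch using SQL through Python's built-in sqlite3 module. The table names, variable names, and values (study_a, study_b, income, sex, gender) are invented for illustration and do not come from any proposal. UNION ALL plays the role of SPSS "add files" (vertical, with one variable renamed to harmonize it), and JOIN plays the role of "match files" (horizontal, on a common ID).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two hypothetical studies holding the same concept under different names.
cur.execute("CREATE TABLE study_a (id INTEGER, sex INTEGER)")
cur.execute("CREATE TABLE study_b (id INTEGER, gender INTEGER)")
cur.executemany("INSERT INTO study_a VALUES (?, ?)", [(1, 1), (2, 2)])
cur.executemany("INSERT INTO study_b VALUES (?, ?)", [(1, 2), (2, 1)])

# Vertical combination: stack the cases and harmonize gender -> sex.
vertical = cur.execute(
    "SELECT 'A' AS study, id, sex FROM study_a "
    "UNION ALL "
    "SELECT 'B' AS study, id, gender AS sex FROM study_b "
    "ORDER BY study, id"
).fetchall()

# A hypothetical parallel file for study A, sharing the ID variable.
cur.execute("CREATE TABLE income (id INTEGER, income INTEGER)")
cur.executemany("INSERT INTO income VALUES (?, ?)", [(1, 1500), (2, 2300)])

# Horizontal combination: match cases on the common ID variable.
horizontal = cur.execute(
    "SELECT a.id, a.sex, i.income "
    "FROM study_a AS a JOIN income AS i ON a.id = i.id "
    "ORDER BY a.id"
).fetchall()

print(vertical)
print(horizontal)
```

A relation description precise enough to be translated into statements like these is what would allow an application to generate the combined file.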
From my perspective, I see the following central questions:
- Should a codebook describe only a single study (what is a study)?
- Should a codebook describe the relation of several studies?
In addition:
- Description of the outcome of the relation?
- Integration of all the details of the included studies (e.g. with
XInclude)?
This raises the question of whether we should differentiate clearly between
a simple codebook for one study and a "super codebook" for the relation
of several studies.
It is not clear to me how these approaches could be integrated into the
structure of the current DTD. Against this background, I have trouble
understanding why the current main parts (stdyDscr, dataDscr) are
repeatable elements.
I would prefer a structure for a "super codebook" like this:
<set>
  <relation type="">
    description of relation (optional reference to other codebooks)
    ...
  </relation>
  (in case of integration of description of every included study)
  <codeBook> Codebook 1
    ...
  </codeBook>
  <codeBook> Codebook 2
    ...
  </codeBook>
</set>
The element "set" (or collection) could also be expressed as "codeBook
type='set'".
For harmonizing variables (for comparison purposes) and building new ID
variables (for matching cases), I see the need for a means to
describe the construction of new variables. Perhaps we could use MathML
for this purpose instead of inventing a language of our own.
Referencing other files
If other "files", such as codebooks or data files in general, are to be
referenced, we should think in terms of URIs, not only local files. For
the naming of URI attributes we should use the W3C standard XLink, e.g.
the simple link attribute "xlink:href".
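A minimal sketch of how such references might look and be read by an application follows. The element name codeBookRef, the relation type value "join", and both URIs are hypothetical and chosen only for illustration; the XLink namespace and the xlink:type / xlink:href attributes are those of the W3C XLink recommendation.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"

# Hypothetical "super codebook" that points to two codebooks by URI.
DOC = f"""
<set xmlns:xlink="{XLINK_NS}">
  <relation type="join">
    <codeBookRef xlink:type="simple" xlink:href="http://example.org/study1.xml"/>
    <codeBookRef xlink:type="simple" xlink:href="study2.xml"/>
  </relation>
</set>
"""

root = ET.fromstring(DOC)

# ElementTree exposes namespaced attributes as '{namespace-uri}name'.
hrefs = [el.get(f"{{{XLINK_NS}}}href") for el in root.iter("codeBookRef")]
print(hrefs)
```

Because xlink:href holds a URI reference, the same attribute covers both a remote codebook and a local file given as a relative reference.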
Kind regards, Achim