[DDI-CDG] Thoughts on relation of files / overlap of CDG and CF

Joachim Wackerow wackerow at zuma-mannheim.de
Wed May 19 04:07:58 EDT 2004


Dear all,

I would like to share some quick thoughts in preparation for the 
discussion of the proposal of the complex file (CF) group. As Tom Piazza 
already mentioned months ago, an important overlap exists between the CDG 
and the CF group (unfortunately I did not have enough time to try an 
integrated model).

The CDG group's primary task is to deal with the combination of data 
files at the variable level, with the goal of comparing groups of cases. 
This is done in the vertical direction by harmonizing variables with the 
same meaning.

The CF group's primary task is to deal with the combination of data files 
at the case level. This is done mainly in the horizontal direction, with 
more or less parallel data files, by matching common ID variables.

(A data file in this sense is a logical rectangular file, which could 
consist of several physical files.)

I would suggest the following goals for describing the relation of files:

- Clear description possibilities, easy to use for authors.
- Use of commonly accepted formal language fragments with known semantics.
- Precise formulation of the relation, with the intention that an
   application could generate the combined data file.

The second point means that we should build on common formal 
expressions of relations, such as SQL (JOIN and UNION clauses) or, e.g., 
SPSS (MATCH FILES, ADD FILES).
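
To illustrate the two directions, here is a minimal sketch in Python 
using the built-in sqlite3 module. All table names, column names and 
values are hypothetical and serve only to show how UNION corresponds to 
the vertical combination (comparable to SPSS ADD FILES) and JOIN to the 
horizontal combination (comparable to SPSS MATCH FILES).

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hypothetical data: two files with the same variables but different cases.
cur.execute("CREATE TABLE wave1 (id INTEGER, age INTEGER)")
cur.execute("CREATE TABLE wave2 (id INTEGER, age INTEGER)")
cur.executemany("INSERT INTO wave1 VALUES (?, ?)", [(1, 34), (2, 51)])
cur.executemany("INSERT INTO wave2 VALUES (?, ?)", [(3, 28)])

# Hypothetical parallel file: same cases as wave1, other variables.
cur.execute("CREATE TABLE income (id INTEGER, income INTEGER)")
cur.executemany("INSERT INTO income VALUES (?, ?)", [(1, 2500), (2, 3100)])

# Vertical combination: cases are appended, variables stay the same.
stacked = cur.execute(
    "SELECT id, age FROM wave1 "
    "UNION ALL "
    "SELECT id, age FROM wave2 "
    "ORDER BY id"
).fetchall()

# Horizontal combination: variables are added by matching a common ID.
matched = cur.execute(
    "SELECT w.id, w.age, i.income "
    "FROM wave1 AS w JOIN income AS i ON w.id = i.id "
    "ORDER BY w.id"
).fetchall()

print(stacked)
print(matched)
```

A description precise enough to generate such statements would meet 
the third goal above.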

From my perspective, I see the following central questions:

- Should a codebook describe only a single study (what is a study)?
- Should a codebook describe the relation of several studies?
   In addition:
   - Description of the outcome of the relation?
   - Integration of all the details of the included studies (e.g. with
     XInclude)?

This raises the question of whether we should differentiate clearly 
between a simple codebook for one study and a "super codebook" for the 
relation of several studies.

It is not clear to me how these approaches could be integrated into the 
structure of the current DTD. Against this background, I have difficulty 
understanding why the current main parts (stdyDscr, dataDscr) are 
repeatable elements.

I would prefer a structure for a "super codebook" like this:

<set>
   <relation type="">
   description of relation (optional reference to other codebooks)
   ...
   (in case of integration of description of every included study)
   <codeBook>  Codebook 1
   ...
   <codeBook>  Codebook 2
   ...

The element "set" (or collection) could also be expressed as "codeBook 
type='set'".


For harmonizing variables (for comparison purposes) and building new ID 
variables (for case-matching purposes), I see the need for a means to 
describe the construction of new variables. Perhaps we could use MathML 
instead of a language of our own for this purpose.
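
As an illustration of what such construction rules have to capture, 
here is a small sketch in Python (not MathML). The recode tables, the 
category labels and the ID scheme are all hypothetical; the point is 
only that both harmonization and ID building are simple functions of 
existing variables.

```python
# Hypothetical recode rules mapping study-specific codes to one
# harmonized education variable.
EDUCATION_STUDY_A = {1: "low", 2: "low", 3: "medium", 4: "high"}
EDUCATION_STUDY_B = {"primary": "low", "secondary": "medium", "tertiary": "high"}


def harmonize(value, rule):
    """Return the harmonized category, or None if the code is not mapped."""
    return rule.get(value)


def build_case_id(country, household, person):
    """Build a new ID variable from existing variables (hypothetical scheme)."""
    return f"{country}-{household:05d}-{person:02d}"


print(harmonize(3, EDUCATION_STUDY_A))
print(harmonize("tertiary", EDUCATION_STUDY_B))
print(build_case_id("DE", 42, 1))
```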


Referencing other files

If other "files", such as codebooks or data files in general, are to be 
referenced, we should think in terms of URIs, not only local files. For 
the naming of URI attributes we should use the W3C standard XLink, e.g. 
the simple link attribute "xlink:href".
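
A minimal sketch of such references, written with Python's standard 
xml.etree.ElementTree module. The element name "codeBookRef", the 
relation type "union" and the example URIs are assumptions for 
illustration only; the XLink namespace and the attributes xlink:type 
and xlink:href are taken from the W3C standard.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK_NS)

# Hypothetical relation element that points to two codebooks by URI,
# one remote and one local.
relation = ET.Element("relation", {"type": "union"})
for uri in ("http://example.org/study1.xml", "study2.xml"):
    ET.SubElement(
        relation,
        "codeBookRef",  # hypothetical element name
        {
            f"{{{XLINK_NS}}}type": "simple",
            f"{{{XLINK_NS}}}href": uri,
        },
    )

xml_text = ET.tostring(relation, encoding="unicode")
print(xml_text)

# An application can read the references back without caring whether
# they are local files or remote resources.
hrefs = [ref.get(f"{{{XLINK_NS}}}href") for ref in ET.fromstring(xml_text)]
print(hrefs)
```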

Kind regards, Achim
