[DDI-SRG] Single child Reference element in references

I-Lin Kuo ikuoikuo at gmail.com
Thu Sep 20 17:00:12 EDT 2007


Achim,

The performance difference between the XPaths
/*[URN|ID][contains(local-name(), 'Reference')] and /*[Reference] may or may
not be significant, depending on what you're doing. If it's

  <xsl:apply-templates select="/*[URN|ID][contains(local-name(), 'Reference')]"/>
  <xsl:apply-templates select="/*[Reference]"/>

I doubt it's that much. (At most, it should be the difference in evaluation
time of the two XPath expressions divided by the total execution time of the
code executed by apply-templates.)

If, however, you're worried about lookup cost in resolving the value of the
reference, that again is not that much once you use keys -- you only pay the
cost in the construction of the key index, which is very, very fast.
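
To illustrate how the predicate cost is paid once, here is a minimal sketch
that moves the combined predicate into a key's match pattern. The key name
"reference-by-name", the r namespace URI, and the choice of keying on the
element's local name are assumptions made for illustration; they are not
taken from the schema or from any existing stylesheet.

```xslt
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:r="urn:example:reusable"> <!-- placeholder namespace URI (assumption) -->

  <xsl:output method="text"/>

  <!-- The predicates are evaluated while the index is built,
       not on every later lookup. -->
  <xsl:key name="reference-by-name"
      match="*[r:URN | r:ID][contains(local-name(), 'Reference')]"
      use="local-name()"/>

  <xsl:template match="/">
    <!-- Fetch every UniverseReference element from the index. -->
    <xsl:for-each select="key('reference-by-name', 'UniverseReference')">
      <!-- Prints whichever of URN or ID comes first in document order. -->
      <xsl:value-of select="r:URN | r:ID"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Whether this beats a plain select in practice would still have to be
measured with the processor and documents in question.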

There is a thread from 5/4/06 with the subject "[DDI-SRG] Size and the
Ethiopia Example" in which a stylesheet's performance was measured on the
277k Ethiopia example:
  24 min - XMLSpy with original stylesheet
  just over 1 min - Saxon with original stylesheet

Significant improvement in performance was obtained by simply adding two
keys to the stylesheet:

  <xsl:key name="CategoryGroup-key" match="l:CategoryGroup"
      use="concat(r:Identification/r:ID, '--',
                  r:Identification/r:IdentifyingAgency, '--',
                  r:Identification/r:VersionNumber)"/>
  <xsl:key name="Category-key" match="l:Category"
      use="concat(r:Identification/r:ID, '--',
                  r:Identification/r:IdentifyingAgency, '--',
                  r:Identification/r:VersionNumber)"/>

  2 min - XMLSpy with new stylesheet
  <5sec - Saxon with new stylesheet
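
A lookup against Category-key could be sketched as follows. This is not the
stylesheet from that thread: the namespace URIs, the l:CategoryReference
element, and the layout of its r:ID, r:IdentifyingAgency and r:VersionNumber
children are assumptions made only to show how key() is called.

```xslt
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:l="urn:example:logicalproduct"
    xmlns:r="urn:example:reusable"> <!-- placeholder namespace URIs (assumption) -->

  <!-- Index built once per document: one entry per l:Category,
       keyed on ID, agency and version joined by '--'. -->
  <xsl:key name="Category-key" match="l:Category"
      use="concat(r:Identification/r:ID, '--',
                  r:Identification/r:IdentifyingAgency, '--',
                  r:Identification/r:VersionNumber)"/>

  <!-- Hypothetical referencing element; its child names are assumed. -->
  <xsl:template match="l:CategoryReference">
    <!-- Build the lookup value in the same format as the key's use expression. -->
    <xsl:variable name="ref"
        select="concat(r:ID, '--', r:IdentifyingAgency, '--', r:VersionNumber)"/>
    <!-- key() consults the index instead of rescanning the tree. -->
    <xsl:apply-templates select="key('Category-key', $ref)"/>
  </xsl:template>

</xsl:stylesheet>
```

The lookup value has to be concatenated exactly as in the key's use
attribute, separators included, or key() returns an empty node-set.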

So, what I'm trying to say is that, in my opinion, the "performance
difference" between the XPaths /*[URN|ID][contains(local-name(),
'Reference')] and /*[Reference] is not a problem that I think really exists,
and so doesn't need a solution.


On 9/20/07, Joachim Wackerow <joachim.wackerow at gesis.org> wrote:
>
> Hi I-Lin,
>
> Sorry for the confusion. I meant XPath predicates like
> [URN|ID][contains(local-name(), 'Reference')], which filter a node-set.
> I didn't mean to avoid XSLT at all.
>
> The reasoning is that predicates are slower processed by the XSLT
> processor than simply selecting elements by their names (using XSLT
> templates without predicates).
>
> Hope this clarifies the issue.
>
> Achim
>
> I-Lin Kuo wrote:
> > Hi Achim,
> >
> > I'm confused by "Nevertheless it should be generally avoided to be
> > dependent heavily from
> >   XSLT filters at central places of processing"
> >
> > What are XSLT filters? Do you mean this:
> >
> > http://www.caucho.com/resin-3.0/xml/xslt-filter.xtp
> >
> > Or do you mean we should avoid XSLT processing in general? Also, while I
> > understand the statement, I don't understand the reasoning, the why's
> > behind it. To me, it's similar to saying "Nevertheless it should be
> > generally avoided to be dependent heavily upon Windows servers at
> > central places of processing" which, though it may be true, requires
> > some justification. Is the reason because of XSLT's performance, or
> > because XSLT skills are not widely available, etc.?
> >
> > On 9/20/07, Joachim Wackerow <joachim.wackerow at gesis.org> wrote:
> >
> >     Hi I-Lin,
> >
> >     I agree, that the key construction for the identification will be
> >     complicated and is indeed a very good candidate for optimization.
> >
> >     The approach with the contains function seems to be not bad.
> >
> >     Nevertheless it should be generally avoided to be dependent heavily from
> >     XSLT filters at central places of processing. If the design of the
> >     schemas would allow it to choose a more adapted alternative in this
> >     sense, I would opt for that.
> >
> >     Regarding StAX: I think this can be a good approach for special
> >     applications. I'm not familiar enough with it, but I'm doubting if this
> >     can be a general approach.
> >
> >     Achim
> >
> >     I-Lin Kuo wrote:
> >      > Hi Achim,
> >      >
> >      > I'm inclined to think this is premature optimization. A general
> >      > heuristic about optimization is that you profile first before you
> >      > optimize, so that you spend your efforts on the bottlenecks rather than
> >      > every little thing. Not having profiled the key construction, however,
> >      > I'm going to make a guess on where the bottleneck is going to be with
> >      > XSLT key construction.
> >      >
> >      > First, while you're right about ends-with and //*[URN|ID], you're
> >      > fixating on the "Reference-ness" of the element and missing the trees
> >      > for the forest. How about combining the two conditions into
> >      > //*[URN|ID][contains(local-name(), 'Reference')]? That should work, I
> >      > think.
> >      >
> >      > From what I can see, the computational bottleneck isn't going to be the
> >      > identification of Reference elements (this requires only two
> >      > conditions, see previous paragraph), but the construction of the
> >      > Reference value for the key. That involves selecting either the URN
> >      > value or the ID value. Selecting the URN is not difficult if no parsing
> >      > of the URN is necessary and right now I don't think it is. Selecting the
> >      > ID involves handling inheritance of agency code and version number up
> >      > the ancestor:: axis from only those appropriate ancestral types
> >      > (Maintainable, etc.) and is a complicated pain in the @r5e. That, to me,
> >      > is more likely to be a bottleneck. I remember looking at DexTris's XSLT
> >      > stylesheets over 6 months to see how they handled it and was impressed
> >      > to see that it handled it reasonably. However, I seem to recall that it
> >      > didn't handle agency or version. The addition of agency and version
> >      > complicates things greatly. In addition, we still don't have a document
> >      > detailing the construction of URNs and the algorithm translating a URN
> >      > reference to an ID reference and back. So... I would suggest targeting
> >      > the reference/URN system if you're looking for a performance bottleneck.
> >      >
> >      > On 9/13/07, Joachim Wackerow <joachim.wackerow at gesis.org> wrote:
> >      >
> >      >     Pascal and others,
> >      >
> >      >     What are now the reasons for removing the ReferenceType again?
> >      >     We should collect the pro/cons and record them in Mantis. So a
> >      >     decision is later comprehensible.
> >      >     I have the impression, that the decision regarding this issue is
> >      >     dependent which persons are attending the meeting :) .
> >      >
> >      >
> >      >     106 elements (22 in reusable, 84 in others) in the current Schema
> >      >     are reference elements, i.e. they are now using ReferenceType.
> >      >
> >      >     Some thoughts regarding the processing of references:
> >      >     (I have the impression that some of these thoughts have open ends)
> >      >
> >      >     References are heavily used in DDI 3.0. Therefore the processing of
> >      >     references should be easy and clear, and should have a good
> >      >     performance.
> >      >     Every detail will affect the complexity of the XML document, the
> >      >     complexity of an XSLT stylesheet, and the processing performance as
> >      >     well.
> >      >
> >      >     As the references and identification will be used heavily in a DDI 3
> >      >     document, XSLT keys seem to be very important for processing both. A
> >      >     construction of a key based on a defined element like ReferenceType or
> >      >     IdentifiableID seems to be straightforward (the construction of the URN
> >      >     is complicated anyway). A construction of a key for references which are
> >      >     represented by 106 different elements is a lot of work. I don't see an
> >      >     easy way. I'm asking myself for what this can be really useful. It is
> >      >     not as necessary as for identifications. But without this possibility I
> >      >     have a bad feeling. I think references are in DDI 3 so important.
> >      >     Probably there will be an application requirement to identify easily
> >      >     references. So the suggested attribute "isReference" with a fixed value
> >      >     "true" should be at least realized.
> >      >
> >      >     Then a key must be constructed with "//*[@isReference='true']". This is
> >      >     an XPath expression which makes a filter test on every element. It is
> >      >     very costly when processing large documents to go over EVERY element.
> >      >     Furthermore filter tests are not really quickly processed. DDI 3
> >      >     documents will be large. This approach can make only sense with keys,
> >      >     then this process is necessary only once.
> >      >
> >      >     When ReferenceType is not available every referencing element needs an
> >      >     own template, which can call a general template for processing the
> >      >     reference details.
> >      >
> >      >     What is really the problem with the current explicit reference
> >      >     solution?
> >      >     It can be only the size of the document, the complexity is not really
> >      >     larger for applications.
> >      >
> >      >     I was wondering if the complexity of an XSLT stylesheet can be reduced
> >      >     in using a generic approach for references, so not for each of the 106
> >      >     referencing elements an own template will be necessary. This can make
> >      >     sense in a generic reporting tool using a data-driven approach (push
> >      >     approach). This approach could make also use of the systematic names of
> >      >     the referencing elements and/or a look-up table with the field-level
> >      >     documentation.
> >      >
> >      >     Thinking this one step further a general reference container can make
> >      >     sense, where the referenced subject is defined in an attribute or a
> >      >     child element. For example instead of UniverseReference,
> >      >     Reference/ReferencedSubject=Universe.
> >      >
> >      >
> >      >     BTW "//*[URN|ID]" will not work. This would catch also the
> >      >     identification. Something would be necessary like:
> >      >     //*
> >      >     [ local-name() != 'MaintainableID' ]
> >      >     [ local-name() != 'VersionableID' ]
> >      >     [ local-name() != 'IdentifiableID' ]
> >      >     [ r:URN | r:ID ]
> >      >     This would result in five tests on every element.
> >      >
> >      >     The function 'ends-with' is only available in XSLT/XPath 2.0. XSLT 1.0
> >      >     cannot be used again. Only one processor (Saxon) does exist for XSLT
> >      >     2.0. This seems to be a limiting approach.
> >      >
> >      >     This is a general note:
> >      >     I would prefer data-driven processing of the complex DDI documents
> >      >     and let do the XSLT processor the work, i.e. element-specific templates,
> >      >     not a declarative programming style. The schema should be constructed
> >      >     accordingly when possible.
> >      >
> >      >     Any response welcome.
> >      >
> >      >     I'll be available again starting at the meeting September 20.
> >      >
> >      >       Achim
> >      >
> >      >     I-Lin Kuo wrote:
> >      >      > I would even go so far as to say that @isReference is redundant.
> >      >      >
> >      >      > I vaguely recollect that the reason given for the Reference was to
> >      >      > identify elements of ReferenceType via a test of //*[Reference]. It
> >      >      > didn't convince me at the time, as I think that with the elimination of
> >      >      > the extra Reference element, //*[ends-with(local-name(),'Reference')]
> >      >      > would work, or //*[URN|ID].
> >      >      >
> >      >      > On 9/7/07, Pascal Heus <pascal.heus at gmail.com> wrote:
> >      >      >
> >      >      >     Achim, I-Lin:
> >      >      >     we reviewed yesterday bug #2
> >      >      >     (http://mantis.ddialliance.org/view.php?id=2) related to the extra
> >      >      >     */Reference element that appears under every referencing type in the
> >      >      >     current schema. We had a general agreement that this is not necessary
> >      >      >     and that it should be removed or possibly replaced with a fixed
> >      >      >     @isReference attribute. Since I believe you initially requested this
> >      >      >     change, we would like to have your perspective on the issue before
> >      >      >     making a final decision.
> >      >      >     Would appreciate your prompt input as this change impacts most of the
> >      >      >     ongoing tool development and the earlier we can make it happen, the
> >      >      >     better.
> >      >      >     many thanks
> >      >      >     Pascal
> >      >      >
> >      >      >     _______________________________________________
> >      >      >     DDI-SRG mailing list
> >      >      >     DDI-SRG at icpsr.umich.edu
> >      >      >     http://www.icpsr.umich.edu/mailman/listinfo/ddi-srg
> >      >      >
> >      >      >
> >      >      >
> >      >      >
> >      >      > --
> >      >      > I-Lin Kuo
> >      >      >
> >      >      >
> >      >      >
> >      >
> >      >
> >      >     --
> >      >     GESIS - German Social Science Infrastructure Services
> >      >     http://www.gesis.org/en/
> >      >
> >      >
> >      >
> >      >
> >      >
> >      > --
> >      > I-Lin Kuo
> >
> >
> >     --
> >     GESIS - German Social Science Infrastructure Services
> >     http://www.gesis.org/en/
> >
> >
> >
> >
> > --
> > I-Lin Kuo
>
>
> --
> GESIS - German Social Science Infrastructure Services
> http://www.gesis.org/en/
>



-- 
I-Lin Kuo