[DDI-SRG] Single child Reference element in references
I-Lin Kuo
ikuoikuo at gmail.com
Thu Sep 20 17:00:12 EDT 2007
Achim,
The performance difference between the XPaths
/*[URN|ID][contains(local-name(), 'Reference')] and /*[Reference] may or may
not be significant, depending on what you're doing. If it's a choice between
<apply-templates select="/*[URN|ID][contains(local-name(), 'Reference')]"/>
and
<apply-templates select="/*[Reference]"/>
I doubt it's that much. At most, the relative cost should be
(difference in evaluation time of the XPath expression) /
(total execution time of the code executed by apply-templates).
If, however, you're worried about lookup cost in resolving the value of the
reference, that again is not much once you use keys -- you pay only for the
construction of the key index, which happens once and is very fast.
There is a thread from 5/4/06 with the subject "[DDI-SRG] Size and the
Ethiopia Example" in which the performance of a stylesheet was measured on
the 277k Ethiopia example:
24 min - XMLSpy with original stylesheet
just over 1 min - Saxon with original stylesheet
Significant improvement in performance was obtained by simply adding two
keys to the stylesheet:
<xsl:key name="CategoryGroup-key" match="l:CategoryGroup"
use="concat(r:Identification/r:ID,'--',r:Identification/r:IdentifyingAgency,'--',
r:Identification/r:VersionNumber)" />
<xsl:key name="Category-key" match="l:Category"
use="concat(r:Identification/r:ID,'--',r:Identification/r:IdentifyingAgency,'--',
r:Identification/r:VersionNumber)" />
2 min - XMLSpy with new stylesheet
<5sec - Saxon with new stylesheet
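For anyone who hasn't used keys this way, a lookup against the Category-key above might look roughly like the following. The r:CategoryReference element, the shape of its children, and the namespace URIs are assumptions for illustration only; I haven't checked them against the schema or the original stylesheet.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- The namespace URIs below are placeholders (assumption); substitute the
     real DDI logical product and reusable namespaces. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:l="urn:example:logicalproduct"
    xmlns:r="urn:example:reusable">

  <!-- Same key definition as in the thread, written on one line. -->
  <xsl:key name="Category-key" match="l:Category"
      use="concat(r:Identification/r:ID,'--',r:Identification/r:IdentifyingAgency,'--',r:Identification/r:VersionNumber)"/>

  <!-- Hypothetical reference element carrying ID, agency and version. -->
  <xsl:template match="r:CategoryReference">
    <!-- Build the same composite string the key uses, then probe the index. -->
    <xsl:variable name="lookup"
        select="concat(r:ID,'--',r:IdentifyingAgency,'--',r:VersionNumber)"/>
    <xsl:apply-templates select="key('Category-key', $lookup)"/>
  </xsl:template>

</xsl:stylesheet>
```

The point is that the composite string is computed once per Category when the index is built, and each reference then costs one string concatenation plus an index lookup.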
So, what I'm trying to say is that, in my opinion, the "performance
difference" between the XPaths /*[URN|ID][contains(local-name(),
'Reference')] and /*[Reference] is not a real problem, and so doesn't need a
solution.
On 9/20/07, Joachim Wackerow <joachim.wackerow at gesis.org> wrote:
>
> Hi I-Lin,
>
> Sorry for the confusion. I meant XPath predicates like
> [URN|ID][contains(local-name(), 'Reference')], which filter a node-set.
> I didn't mean to avoid XSLT at all.
>
> The reasoning is that the XSLT processor handles predicates more slowly
> than it handles simply selecting elements by their names (using XSLT
> templates without predicates).
>
> Hope this clarifies the issue.
>
> Achim
>
> I-Lin Kuo wrote:
> > Hi Achim,
> >
> > I'm confused by "Nevertheless it should be generally avoided to be
> > dependent heavily from
> > XSLT filters at central places of processing"
> >
> > What are XSLT filters? Do you mean this:
> >
> > http://www.caucho.com/resin-3.0/xml/xslt-filter.xtp
> >
> > Or do you mean we should avoid XSLT processing in general? Also, while I
> > understand the statement, I don't understand the reasoning, the why's
> > behind it. To me, it's similar to saying "Nevertheless it should be
> > generally avoided to be dependent heavily upon Windows servers at
> > central places of processing" which, though it may be true, requires
> > some justification. Is the reason because of XSLT's performance, or
> > because XSLT skills are not widely available, etc.?
> >
> > On 9/20/07, *Joachim Wackerow* <joachim.wackerow at gesis.org> wrote:
> >
> > Hi I-Lin,
> >
> > I agree, that the key construction for the identification will be
> > complicated and is indeed a very good candidate for optimization.
> >
> > The approach with the contains function seems to be not bad.
> >
> > Nevertheless it should be generally avoided to be dependent heavily from
> > XSLT filters at central places of processing. If the design of the
> > schemas would allow it to choose a more adapted alternative in this
> > sense, I would opt for that.
> >
> > Regarding StAX: I think this can be a good approach for special
> > applications. I'm not familiar enough with it, but I'm doubting if this
> > can be a general approach.
> >
> > Achim
> >
> > I-Lin Kuo wrote:
> > > Hi Achim,
> > >
> > > I'm inclined to think this is premature optimization. A general
> > > heuristic about optimization is that you profile first before you
> > > optimize, so that you spend your efforts on the bottlenecks rather than
> > > every little thing. Not having profiled the key construction, however,
> > > I'm going to make a guess on where the bottleneck is going to be with
> > > XSLT key construction.
> > >
> > > First, while you're right about ends-with and //*[URN|ID], you're
> > > fixating on the "Reference-ness" of the element and missing the trees
> > > for the forest. How about combining the two conditions into
> > > //*[URN|ID][contains(local-name(), 'Reference')]? That should work, I
> > > think.
> > >
> > > From what I can see, the computational bottleneck isn't going to be the
> > > identification of Reference elements (this requires only two
> > > conditions, see previous paragraph), but the construction of the
> > > Reference value for the key. That involves selecting either the URN
> > > value or the ID value. Selecting the URN is not difficult if no parsing
> > > of the URN is necessary, and right now I don't think it is. Selecting
> > > the ID involves handling inheritance of agency code and version number
> > > up the ancestor:: axis from only those appropriate ancestral types
> > > (Maintainable, etc.) and is a complicated pain in the @r5e. That, to me,
> > > is more likely to be a bottleneck. I remember looking at DexTris's XSLT
> > > stylesheets over 6 months ago to see how they handled it and was
> > > impressed to see that they handled it reasonably. However, I seem to
> > > recall that they didn't handle agency or version. The addition of agency
> > > and version complicates things greatly. In addition, we still don't have
> > > a document detailing the construction of URNs and the algorithm
> > > translating a URN reference to an ID reference and back. So... I would
> > > suggest targeting the reference/URN system if you're looking for a
> > > performance bottleneck.
> > >
> > > On 9/13/07, *Joachim Wackerow* <joachim.wackerow at gesis.org> wrote:
> > >
> > > Pascal and others,
> > >
> > > What are now the reasons for removing the ReferenceType again?
> > > We should collect the pros/cons and record them in Mantis, so that a
> > > decision is comprehensible later.
> > > I have the impression that the decision regarding this issue depends
> > > on which persons are attending the meeting :) .
> > >
> > >
> > > 106 elements (22 in reusable, 84 in others) in the current Schema are
> > > reference elements, i.e. they are now using ReferenceType.
> > >
> > > Some thoughts regarding the processing of references:
> > > (I have the impression that some of these thoughts have open ends)
> > >
> > > References are heavily used in DDI 3.0. Therefore the processing of
> > > references should be easy and clear, and should have a good
> > > performance.
> > > Every detail will affect the complexity of the XML document, the
> > > complexity of an XSLT stylesheet, and the processing performance as
> > > well.
> > >
> > > As the references and identification will be used heavily in a DDI 3
> > > document, XSLT keys seem to be very important for processing both. A
> > > construction of a key based on a defined element like ReferenceType or
> > > IdentifiableID seems to be straightforward (the construction of the URN
> > > is complicated anyway). A construction of a key for references which are
> > > represented by 106 different elements is a lot of work. I don't see an
> > > easy way. I'm asking myself for what this can be really useful. It is
> > > not as necessary as for identifications. But without this possibility I
> > > have a bad feeling. I think references are so important in DDI 3.
> > > Probably there will be an application requirement to identify
> > > references easily. So the suggested attribute "isReference" with a
> > > fixed value "true" should at least be realized.
> > >
> > > Then a key must be constructed with "//*[@isReference='true']". This is
> > > an XPath expression which makes a filter test on every element. It is
> > > very costly when processing large documents to go over EVERY element.
> > > Furthermore filter tests are not really quickly processed. DDI 3
> > > documents will be large. This approach can only make sense with keys,
> > > then this process is necessary only once.
> > >
> > > When ReferenceType is not available every referencing element needs its
> > > own template, which can call a general template for processing the
> > > reference details.
> > >
> > > What is really the problem with the current explicit reference
> > > solution?
> > > It can be only the size of the document, the complexity is not really
> > > larger for applications.
> > >
> > > I was wondering if the complexity of an XSLT stylesheet can be reduced
> > > by using a generic approach for references, so that not every one of
> > > the 106 referencing elements needs its own template. This can make
> > > sense in a generic reporting tool using a data-driven approach (push
> > > approach). This approach could also make use of the systematic names of
> > > the referencing elements and/or a look-up table with the field-level
> > > documentation.
> > >
> > > Thinking this one step further a general reference container can make
> > > sense, where the referenced subject is defined in an attribute or a
> > > child element. For example instead of UniverseReference,
> > > Reference/ReferencedSubject=Universe.
> > >
> > >
> > > BTW "//*[URN|ID]" will not work. This would catch also the
> > > identification. Something would be necessary like:
> > > //*
> > > [ local-name() != 'MaintainableID' ]
> > > [ local-name() != 'VersionableID' ]
> > > [ local-name() != 'IdentifiableID' ]
> > > [ r:URN | r:ID ]
> > > This would result in five tests on every element.
> > >
> > > The function 'ends-with' is only available in XSLT/XPath 2.0. XSLT 1.0
> > > cannot be used again. Only one processor (Saxon) does exist for XSLT
> > > 2.0. This seems to be a limiting approach.
> > >
> > > This is a general note:
> > > I would prefer data-driven processing of the complex DDI documents
> > > and let the XSLT processor do the work, i.e. element-specific templates,
> > > not a declarative programming style. The schema should be constructed
> > > accordingly when possible.
> > >
> > > Any response welcome.
> > >
> > > I'll be available again starting at the meeting September 20.
> > >
> > > Achim
> > >
> > > I-Lin Kuo wrote:
> > > > I would even go so far as to say that @isReference is redundant.
> > > >
> > > > I vaguely recollect that the reason given for the Reference was to
> > > > identify elements of ReferenceType via a test of //*[Reference]. It
> > > > didn't convince me at the time, as I think that with the elimination
> > > > of the extra Reference element,
> > > > //*[ends-with(local-name(),'Reference')] would work, or //*[URN|ID].
> > > >
> > > > On 9/7/07, *Pascal Heus* <pascal.heus at gmail.com> wrote:
> > > >
> > > > Achim, I-Lin:
> > > > > we reviewed yesterday bug #2
> > > > > (http://mantis.ddialliance.org/view.php?id=2) related to the extra
> > > > > */Reference element that appears under every referencing type in
> > > > > the current schema. We had a general agreement that this is not
> > > > > necessary and that it should be removed or possibly replaced with a
> > > > > fixed @isReference attribute. Since I believe you initially
> > > > > requested this change, we would like to have your perspective on
> > > > > the issue before making a final decision.
> > > > > Would appreciate your prompt input as this change impacts most of
> > > > > the ongoing tool development and the earlier we can make it happen,
> > > > > the better.
> > > > many thanks
> > > > Pascal
> > > >
> > > > _______________________________________________
> > > > DDI-SRG mailing list
> > > > > DDI-SRG at icpsr.umich.edu
> > > > http://www.icpsr.umich.edu/mailman/listinfo/ddi-srg
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > I-Lin Kuo
> > > >
> > > >
> > > >
> > >
> > >
> > > --
> > > GESIS - German Social Science Infrastructure Services
> > > http://www.gesis.org/en/
> > >
> > >
> > >
> > >
> > >
> > > --
> > > I-Lin Kuo
> >
> >
> > --
> > GESIS - German Social Science Infrastructure Services
> > http://www.gesis.org/en/
> >
> >
> >
> >
> > --
> > I-Lin Kuo
>
>
> --
> GESIS - German Social Science Infrastructure Services
> http://www.gesis.org/en/
>
--
I-Lin Kuo