XSLTC Compiler Design
- Compiler Overview
- Building the Abstract Syntax Tree
- Type-check and Cast Expressions
- JVM byte-code generation
Compiler overview
The main component of the XSLTC compiler is the class
-
org.apache.xalan.xsltc.compiler.XSLTC
This class uses three parsers to consume the input stylesheet(s):
-
javax.xml.parsers.SAXParser
is used to parse the stylesheet document and pass its contents to the compiler as basic SAX2 events.
-
com.sun.xslt.compiler.XPathParser
is a parser used to parse XPath expressions and patterns. This parser is generated using JavaCUP and JavaLEX from Princeton University.
-
com.sun.xslt.compiler.Parser
is a wrapper for the other two parsers. This parser is responsible for using the other two parsers to build the compiler's abstract syntax tree (which is described in more detail in the next section of this document).
Building an Abstract Syntax Tree
An abstract syntax tree (AST) is a data-structure commonly used by compilers to separate the parse-phase from the later phases of the compilation. The AST has one node for each parsed token from the stylesheet and can easily be parsed at the stages of type-checking and bytecode generation.
- Mapping stylesheet elements to AST nodes
- Building the AST from AST nodes
- Mapping XPath expressions and patterns to additional AST nodes
The SAX parser passes the contents of the stylesheet to XSLTC's main parser. The SAX events represent a decomposition of the XML document that contains the stylesheet. The main parser needs to create one AST node from each node that it receives from the SAX parser. It also needs to use the XPath parser to decompose attributes that contain XPath expressions and patterns. Remember that XSLT is in effect two languages: XML and XPath, and one parser is needed for each of these languages. The SAX parser breaks down the stylesheet document, the XPath parser breaks down XPath expressions and patterns, and the main parser maps the decomposed elements into nodes in the abstract syntax tree.
Mapping stylesheets elements to AST nodes
Every element that is defined in the XSLT 1.0 spec is represented by a
a class in the org.apache.xalan.xsltc.compiler
package. The
main parser class contains a Hashtable
that that maps XSL
elements into Java classes that make up the nodes in the AST. These Java
classes all reside in the org.apache.xalan.xsltc.compiler
package and extend either the TopLevelElement
or the
Instruction
classes. (Both these classes extend the
SyntaxTreeNode
class.)
The mapping from XSL element names to Java classes/AST nodes is set up
in the initClasses()
method of the main parser:
private void initStdClasses() { try { initStdClass("template", "Template"); initStdClass("param", "Param"); initStdClass("with-param", "WithParam"); initStdClass("variable", "Variable"); initStdClass("output", "Output"); : : : } } private void initClass(String elementName, String className) throws ClassNotFoundException { _classes.put(elementName, Class.forName(COMPILER_PACKAGE + '.' + className)); }
Building the AST from AST nodes
The parser builds an AST from the various syntax tree nodes. Each node contains a reference to its parent node, a vector containing references to all child nodes and a structure containing all attribute nodes:
protected SyntaxTreeNode _parent; // Parent node private Vector _contents; // Child nodes protected Attributes _attributes; // Attributes of this element
These variables should be accessed using these methods:
protected final SyntaxTreeNode getParent(); protected final Vector getContents(); protected String getAttribute(String qname); protected Attributes getAttributes();
At this time the AST only contains nodes that represent the XSL elements from the stylesheet. A SAX parse is generic and can only handle XML files and will not break up and identify XPath patterns/expressions (these are stored as attributes to the various nodes in the tree). Each XSL instruction gets its own node in the AST, and the XPath patterns/expressions are stored as attributes of these nodes. A stylesheet looking like this:
<xsl:stylesheet .......> <xsl:template match="chapter"> <xsl:text>Chapter</xsl:text> <xsl:value-of select="."> </xsl:template> </xsl>stylesheet>
will be stored in the AST as indicated in the following picture:
Figure 1: The AST in its first stage
All objects that make up the nodes in the initial AST have a
parseContents()
method. This method is responsible for:
- parsing the values of those attributes that contain XPath expressions or patterns, breaking each expression/pattern into AST nodes and inserting them into the tree.
- reading/checking all other required attributes
- propagate the
parseContents()
call down the tree
Mapping XPath expressions and patterns to additional AST nodes
The nodes that represent the XPath expressions and patterns extend
either the Expression
or Pattern
class
respectively. These nodes are not appended to the _contents
vectory of each node, but rather stored as individual references in each
AST element node. One example is the ForEach
class that
represents the <xsl:for-each>
element. This class has
a variable that contains a reference to the AST sub-tree that represents
its select
attribute:
private Expression _select;
There is no standard way of storing these XPath expressions and each AST node that contains one or more XPath expression/pattern must handle that itself. This handling basically involves passing the attribute's value to the XPath parser and receiving back an AST sub-tree.
With all XPath expressions/patterns expanded, the AST will look somewhat like this:
Fiugre 2: The AST in its second stage
Type-check and Cast Expressions
In many cases we will need to typecast the top node in the expression
sub-tree to suit the expected result-type of the expression, or to typecast
child nodes to suit the allowed types for the various operators in the
expression. This is done by calling 'typeCheck()' on the root-node in the
XSL tree. Each SyntaxTreeNode node is responsible for inserting type-cast
nodes between itself and its child nodes or XPath nodes. These type-cast
nodes will convert the output-types of the child/XPath nodes to the expected
input-type of the parent node. Let look at our AST again and the node that
represents the <xsl:value-of>
element. This element
expects to receive a string from its select
XPath expression,
but the Step
expression will return either a node-set or a
single node. An extra node is inserted into the AST to perform the
necessary type conversions:
Figure 3: XPath expression type cast
The typeCheck()
method of each SyntaxTreeNode object will
call typeCheck()
on each of its XPath expressions. This method
will return the native type returned by the expression. The AST node will
insert an additional type-conversion node if the return-type does not match
the expected data-type. Each possible return type is represented by a class
in the org.apache.xalan.xsltc.compiler.util
package. These
classes all contain methods that will generate bytecodes needed to perform
the actual type conversions (at runtime). The type-cast nodes in the AST
mainly consist of calls to these methods.
JVM byte-code generation
- Compiling the stylesheet
- Compiling top-level elements
- Compiling template code
- Compiling instructions, functions expressions and patterns
Evey node in the AST extends the SyntaxTreeNode
base class
and implements the translate()
method. This method is
responsible for outputting the actual bytecodes that make up the
functionality required for each element, function, expression or pattern.
Compiling the stylesheet
Some nodes in the AST require more complex code than others. The best
example is the <xsl:stylesheet>
element. The code that
represents this element has to tie together the code that is generated by
all the other elements and generate the actual class definition for the main
translet class. The Stylesheet
class generates the translet's
constructor and methods that handle all top-level elements.
Compiling top-level elements
The bytecode that handles top-level elements must be generated before any
other code. The 'translate()
' method in these classes are
mainly called from these methods in the Stylesheet class:
private String compileBuildKeys(ClassGenerator); private String compileTopLevel(ClassGenerator, Enumeration); private void compileConstructor(ClassGenerator, Output);
These methods handle most top-level elements, such as global variables
and parameters, <xsl:output>
and
<xsl:decimal-format>
instructions.
Compiling template code
All XPath patterns in <xsl:apply-template>
instructions are converted into numeric values (known as the pattern's
kernel 'type'). All templates with identical pattern kernel types are
grouped together and inserted into a table known as a test sequence.
(The table of test sequences is found in the Mode class in the compiler
package. There will be one such table for each mode that is used in the
stylesheet). This table is used to build a big switch()
statement in the translet's applyTemplates()
method. This
method is initially called with the root node of the input document.
The applyTemplates()
method determines the node's type and
passes this type to the switch()
statement to look up the
matching template. The test sequence code (the TestSeq
class)
is responsible for inserting bytecodes to find one matching template
in cases where more than one template matches the current node type.
There may be several templates that share the same pattern kernel type. Here are a few examples of templates with patterns that all have the same kernel type:
<xsl:template match="A/C"> <xsl:template match="A/B/C"> <xsl:template match="A | C">
All these templates will be grouped under the type for
<C>
and will all get the same kernel type (the type for
"C"
). The last template will be grouped both under
"C"
and "A"
, since it matches either element.
If the type identifier for "C"
in this case is 8, all these
templates will be put under case 8:
in
applyTemplates()
's big switch()
statement. The
TestSeq
class will insert some code under the
case 8:
statement (similar to if's and then's) in order to
determine which of the three templates to trigger.
Compiling instructions, functions, expressions and patterns
The template code is generated by calling translate()
on
each Template
object in the abstract syntax tree. This call
will be propagated down the abstract syntax tree and every element will
output the bytecodes necessary to complete its task.
The Java Virtual Machine is stack-based, which goes hand-in-hand with
the tree structure of a stylesheet and the AST. A node in the AST will
call translate()
on its child nodes and any XPath nodes before
it generates its own bytecodes. In that way the correct sequence of JVM
instructions is generated. Each one of the child nodes is responsible of
creating code that leaves the node's output value (if any) on the stack.
The typical procedure for the parent node is to create JVM code that
consumes these values off the stack and then leave its own output on the
stack (for its parent).
The tree-structure of the stylesheet is in this way closely tied with the stack-based JVM. The design does not offer any obvious way of extending the compiler to output code for other non-stack-based VMs or processors.