Chaperon - XML grammar format

XML Grammar Format
Structure

The root structure consists of four parts. The first part declares all required tokens, and the second declares the tokens that may be ignored, e.g. whitespace.

The third part declares the productions, which combine tokens into larger structures.

The last part specifies the start symbol, which is analogous to the root element of the generated XML document.

<grammar uri="[Namespace of the generating XML documents]">
 <tokens>
  [token definitions]
 </tokens>

 <ignorabletokens>
  [token definition, which could be ignored]
 </ignorabletokens>

 <productions>
  [definitions of the productions]
 </productions>

 <ssymbol ntsymbol="[Name of start symbol]"/>
</grammar>
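
As a minimal sketch of how the four parts fit together, the grammar below reads a line of words separated by blanks. The namespace URI and the symbol names (WORD, WHITESPACE, line) are invented for illustration. It is an assumption that ignorable tokens are declared with the same "token" element as ordinary tokens, and the upper bounds of 32 are arbitrary values chosen for the sketch.

<grammar uri="http://example.org/words">
 <tokens>
  <!-- a word: one or more letters -->
  <token tsymbol="WORD">
   <cc minOccurs="1" maxOccurs="32">
    <ci min="a" max="z"/>
    <ci min="A" max="Z"/>
   </cc>
  </token>
 </tokens>

 <ignorabletokens>
  <!-- blanks between words are skipped (assumed syntax) -->
  <token tsymbol="WHITESPACE">
   <cc minOccurs="1" maxOccurs="32">
    <cs content=" "/>
   </cc>
  </token>
 </ignorabletokens>

 <productions>
  <production ntsymbol="line">
   <ntsymbol name="line"/><tsymbol name="WORD"/>
  </production>

  <production ntsymbol="line">
   <tsymbol name="WORD"/>
  </production>
 </productions>

 <ssymbol ntsymbol="line"/>
</grammar>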
Lexical tokens

Every token has an entry in the tokens section. Each token is mapped to a terminal symbol. A terminal symbol is one that cannot be broken down into smaller structures.

<tokens>
 <token tsymbol="Name of the symbol">
  [definition of the token]
 </token>
</tokens>

To define tokens, Chaperon uses a structure similar to regular expressions. It provides alternations, concatenations, character classes, etc.

Every element can carry the attributes "minOccurs" and "maxOccurs".
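
The following sketch shows where these attributes would be placed. It assumes that they bound the number of repetitions of an element, by analogy with the attributes of the same name in XML Schema. The symbol name SHORTNUMBER is hypothetical, and the character class elements it uses are described further below.

<token tsymbol="SHORTNUMBER">
 <!-- assumed meaning: between one and three digits -->
 <cc minOccurs="1" maxOccurs="3">
  <ci min="0" max="9"/>
 </cc>
</token>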

Alternations

Alternation means that one of the contained elements must match.

<token tsymbol="Name of the symbol">
 <alt>
  [element 1]
  [element 2]
  [element 3]
 </alt>
</token>
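
As an illustration, a token that matches either of two keywords could be sketched as follows. The symbol name BOOLEAN is hypothetical, and the "string" element is described further below.

<token tsymbol="BOOLEAN">
 <alt>
  <!-- either literal may match -->
  <string content="true"/>
  <string content="false"/>
 </alt>
</token>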
Concatenations

Concatenation means that all elements in a sequence must match.

<token tsymbol="Name of the symbol">
 <concat>
  [element 1]
  [element 2]
  [element 3]
 </concat>
</token>
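
A sketch of a concatenation: a fixed prefix followed by a run of hexadecimal digits. The symbol name HEXNUMBER and the bounds are hypothetical, and the occurrence attributes are assumed to limit repetitions.

<token tsymbol="HEXNUMBER">
 <concat>
  <!-- the prefix must come first -->
  <string content="0x"/>
  <!-- then one to eight hexadecimal digits (assumed semantics) -->
  <cc minOccurs="1" maxOccurs="8">
   <ci min="0" max="9"/>
   <ci min="a" max="f"/>
  </cc>
 </concat>
</token>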
Character classes

A character class matches a character against the set of characters the class contains. There are two kinds: the character class and the negated character class. A negated character class matches only a character that is not one of the characters in the class.

<token tsymbol="Name of the symbol">
 <cc>
  [Characters, which should match]
 </cc>

 <ncc>
  [Characters, which shouldn't match]
 </ncc>
</token>

The character class can contain two elements:

  • Character sets
  • Character intervals
<token tsymbol="Name of the symbol">
 <cc>
  <cs content="abcd"/>
  <ci min="e" max="z"/>
 </cc>
</token>
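
A negated class would presumably be filled in the same way. The sketch below assumes that "ncc" accepts the same "cs" and "ci" children as "cc"; the symbol name NONDIGIT is hypothetical.

<token tsymbol="NONDIGIT">
 <!-- matches any single character that is not a digit -->
 <ncc>
  <ci min="0" max="9"/>
 </ncc>
</token>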
Strings

A string matches its characters exactly, in the given sequence.

<token tsymbol="Name of the symbol">
 <string content="Sequence of characters"/>
</token>
Universal character

This element matches any character except carriage return and line feed.

<token tsymbol="Name of the symbol">
 <dot/>
</token>
Begin of line

This symbol matches the beginning of a line.

<token tsymbol="Name of the symbol">
 <bol/>
</token>
End of line

This symbol matches the end of a line.

<token tsymbol="Name of the symbol">
 <eol/>
</token>
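
These elements can be combined with the ones above. The following sketch describes a comment that occupies a whole line. It assumes that "bol" and "eol" may appear inside a concatenation and that "dot" accepts the occurrence attributes; the symbol name COMMENT and the bound of 80 are hypothetical.

<token tsymbol="COMMENT">
 <concat>
  <bol/>
  <string content="#"/>
  <!-- up to 80 arbitrary characters, none of them a line break -->
  <dot minOccurs="0" maxOccurs="80"/>
  <eol/>
 </concat>
</token>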
Productions

A production arranges tokens in a structure. It is defined by a sequence of symbols.

<productions>
 <production ntsymbol="Name of the production">
  <ntsymbol name="symbol1"/><tsymbol name="symbol2"/><ntsymbol name="symbol3"/>
 </production>
</productions>

"tsymbol" refers to a terminal symbol of name specified by the name attribute. Similarly "ntsymbol" refers to a nonterminal symbol.

Here is an example which reads a line of words.

<production ntsymbol="line">
 <ntsymbol name="line"/><tsymbol name="WORD"/>
</production>

<production ntsymbol="line">
 <tsymbol name="WORD"/>
</production>
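
Several productions with the same "ntsymbol" act as alternatives for that nonterminal, as the two "line" productions show. A further sketch with invented symbol names (assignment, value, IDENTIFIER, EQUALS, NUMBER) mixes terminal and nonterminal symbols in one sequence:

<!-- an assignment is an identifier, an equals sign, and a value -->
<production ntsymbol="assignment">
 <tsymbol name="IDENTIFIER"/><tsymbol name="EQUALS"/><ntsymbol name="value"/>
</production>

<!-- a value is either a number or another identifier -->
<production ntsymbol="value">
 <tsymbol name="NUMBER"/>
</production>

<production ntsymbol="value">
 <tsymbol name="IDENTIFIER"/>
</production>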

A production accepts two further attributes.

Precedence

The first attribute is "prec". It defines the precedence of a production: the production receives the same priority as the terminal symbol it names.

<production ntsymbol="line" prec="WORD">
 <ntsymbol name="line"/><tsymbol name="WORD"/>
</production>

The second attribute is "reducetype". This attribute is used by the tree builder.

Reduce type : NORMAL
<production ntsymbol="line" reducetype="normal">
 <ntsymbol name="line"/><tsymbol name="WORD"/>
</production>

This example will produce the following XML document.

<line>
 <line>
  <line>
   <line>
    <WORD>This</WORD>
   </line>
   <WORD>is</WORD>
  </line>
  <WORD>an</WORD>
 </line>
 <WORD>example</WORD>
</line>
Reduce type : APPEND
<production ntsymbol="line" reducetype="append">
 <ntsymbol name="line"/><tsymbol name="WORD"/>
</production>

"append" means that a production will be resolved, if the parent has the same name.

<line>
 <WORD>This</WORD>
 <WORD>is</WORD>
 <WORD>an</WORD>
 <WORD>example</WORD>
</line>
Reduce type : RESOLVE
<production ntsymbol="line" reducetype="resolve">
 <ntsymbol name="line"/><tsymbol name="WORD"/>
</production>

"resolve" means that a production will resolved.

<WORD>This</WORD>
<WORD>is</WORD>
<WORD>an</WORD>
<WORD>example</WORD>
Reduce type : NEGLECT

And finally "neglect" means that a production will not appear in the generated XML document.
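
By analogy with the other reduce types, such a production would presumably be declared as follows. The attribute value "neglect" is inferred from the naming pattern of "normal", "append", and "resolve" above.

<!-- assumed syntax: this production leaves no element in the output -->
<production ntsymbol="line" reducetype="neglect">
 <ntsymbol name="line"/><tsymbol name="WORD"/>
</production>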

Copyright © 1999-2002 The Apache Software Foundation. All Rights Reserved.