XML Grammar Format
Structure
The root structure consists of four parts. The first part declares all tokens the
grammar needs, and the second declares tokens that may be ignored, e.g. whitespace.
The third part declares the productions, which combine tokens into larger
structures.
The last part must specify the start symbol, which corresponds to the root element
of the generated XML document.

<grammar uri="[Namespace of the generated XML documents]">
  <tokens>
    [token definitions]
  </tokens>
  <ignorabletokens>
    [definitions of tokens that may be ignored]
  </ignorabletokens>
  <productions>
    [definitions of the productions]
  </productions>
  <ssymbol ntsymbol="[Name of start symbol]"/>
</grammar>

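As a rough illustration of how the four parts fit together, the sketch below fills in the skeleton with a grammar for a line of words. The namespace URI, the symbol names WORD, SPACE and line, the use of token elements inside ignorabletokens, and the maxOccurs="*" notation for unbounded repetition are assumptions made for this example; none of them is confirmed by the text above. The individual elements are described in the sections that follow.

```xml
<!-- Illustrative sketch only: URI, symbol names and maxOccurs="*" are assumed. -->
<grammar uri="http://example.org/line">
  <tokens>
    <!-- WORD: one or more lowercase letters -->
    <token tsymbol="WORD">
      <cc minOccurs="1" maxOccurs="*">
        <ci min="a" max="z"/>
      </cc>
    </token>
  </tokens>
  <ignorabletokens>
    <!-- SPACE: blanks between words are skipped -->
    <token tsymbol="SPACE">
      <cc minOccurs="1" maxOccurs="*">
        <cs content=" "/>
      </cc>
    </token>
  </ignorabletokens>
  <productions>
    <!-- line := line WORD -->
    <production ntsymbol="line">
      <ntsymbol name="line"/><tsymbol name="WORD"/>
    </production>
    <!-- line := WORD -->
    <production ntsymbol="line">
      <tsymbol name="WORD"/>
    </production>
  </productions>
  <!-- parsing starts at "line" -->
  <ssymbol ntsymbol="line"/>
</grammar>
```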
Lexical tokens
Every token has an entry in the tokens section. Each token is mapped to a
terminal symbol, that is, a symbol that cannot be broken
down into smaller structures.

<tokens>
  <token tsymbol="Name of the symbol">
    [definition of the token]
  </token>
</tokens>

To define tokens, Chaperon uses a structure similar to regular expressions. It provides alternations,
concatenations, character classes, etc.
Every element can carry the attributes "minOccurs" and "maxOccurs".
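A small sketch of the two attributes, assuming they bound how often the element may repeat, as their names suggest. The token name TWODIGITS is hypothetical, and the character class elements used here are described further below.

```xml
<!-- Hypothetical token: exactly two decimal digits, e.g. "07" or "42" -->
<token tsymbol="TWODIGITS">
  <cc minOccurs="2" maxOccurs="2">
    <ci min="0" max="9"/>
  </cc>
</token>
```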
Alternations
Alternation means that one of the contained elements must match.

<token tsymbol="Name of the symbol">
  <alt>
    [element 1]
    [element 2]
    [element 3]
  </alt>
</token>

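For instance, a token for a boolean literal could list its two spellings as alternatives. The token name BOOLEAN is made up for this sketch; the string element is described in a later section.

```xml
<!-- Hypothetical token: matches either "true" or "false" -->
<token tsymbol="BOOLEAN">
  <alt>
    <string content="true"/>
    <string content="false"/>
  </alt>
</token>
```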
Concatenations
Concatenation means that all elements in a sequence must match.

<token tsymbol="Name of the symbol">
  <concat>
    [element 1]
    [element 2]
    [element 3]
  </concat>
</token>

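A concatenation might, for example, describe a simple version number such as "1.0": a digit, a dot, and another digit. The token name VERSION is hypothetical, and the occurrence attributes are written out explicitly because the defaults are not stated here.

```xml
<!-- Hypothetical token: digit, ".", digit (e.g. "1.0") -->
<token tsymbol="VERSION">
  <concat>
    <cc minOccurs="1" maxOccurs="1">
      <ci min="0" max="9"/>
    </cc>
    <string content="."/>
    <cc minOccurs="1" maxOccurs="1">
      <ci min="0" max="9"/>
    </cc>
  </concat>
</token>
```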
Character classes
A character class compares a single character against the characters
the class contains. There are two kinds of
character class: the plain character class (cc) and the negated character class (ncc).
The plain class matches any character it contains, while the negated class
matches any character it does not contain.

<token tsymbol="Name of the symbol">
  <cc>
    [Characters, which should match]
  </cc>
  <ncc>
    [Characters, which shouldn't match]
  </ncc>
</token>

A character class can contain two kinds of elements:
- Character sets
- Character intervals

<token tsymbol="Name of the symbol">
  <cc>
    <cs content="abcd"/>
    <ci min="e" max="z"/>
  </cc>
</token>

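The following sketch combines sets and intervals in one class and shows a negated class next to it. The token names HEXDIGIT and NONDIGIT are hypothetical, and it is assumed that ncc accepts the same child elements as cc.

```xml
<tokens>
  <!-- Hypothetical token: one hexadecimal digit -->
  <token tsymbol="HEXDIGIT">
    <cc minOccurs="1" maxOccurs="1">
      <ci min="0" max="9"/>
      <cs content="abcdefABCDEF"/>
    </cc>
  </token>
  <!-- Hypothetical token: one character that is not a decimal digit -->
  <token tsymbol="NONDIGIT">
    <ncc minOccurs="1" maxOccurs="1">
      <ci min="0" max="9"/>
    </ncc>
  </token>
</tokens>
```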
Strings
A string matches only if each of its characters matches, in sequence.

<token tsymbol="Name of the symbol">
  <string content="Sequence of characters"/>
</token>

Universal character
This character matches all characters except carriage return and line feed.

<token tsymbol="Name of the symbol">
  <dot/>
</token>

Begin of line
This symbol matches the beginning of a line.

<token tsymbol="Name of the symbol">
  <bol/>
</token>

End of line
This symbol matches the end of a line.

<token tsymbol="Name of the symbol">
  <eol/>
</token>

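These elements can presumably be combined with the others inside a concatenation. The sketch below describes a comment line: a "#" at the beginning of a line followed by any characters up to the line break. The token name COMMENT, the use of bol inside concat, and the maxOccurs="*" notation for unbounded repetition are assumptions; the text above does not confirm them.

```xml
<!-- Hypothetical token: "#" at the start of a line, then the rest of the line -->
<token tsymbol="COMMENT">
  <concat>
    <bol/>
    <string content="#"/>
    <!-- assumed notation: "*" for an unbounded number of repetitions -->
    <dot minOccurs="0" maxOccurs="*"/>
  </concat>
</token>
```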
Productions
A production arranges tokens in a structure. It is defined by a sequence of symbols.

<productions>
  <production ntsymbol="Name of the production">
    <ntsymbol name="symbol1"/><tsymbol name="symbol2"/><ntsymbol name="symbol3"/>
  </production>
</productions>

"tsymbol" refers to the terminal symbol whose name is given by the name attribute.
Similarly, "ntsymbol" refers to a nonterminal symbol.
Here is an example which reads a line of words.

<production ntsymbol="line">
  <ntsymbol name="line"/><tsymbol name="WORD"/>
</production>
<production ntsymbol="line">
  <tsymbol name="WORD"/>
</production>

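The same left-recursive pattern can describe other repeated structures. As a sketch, the productions below read a comma-separated list of words; the symbol names list, COMMA and WORD are hypothetical, and the comments give each production in an informal BNF notation.

```xml
<productions>
  <!-- list := list COMMA WORD -->
  <production ntsymbol="list">
    <ntsymbol name="list"/><tsymbol name="COMMA"/><tsymbol name="WORD"/>
  </production>
  <!-- list := WORD -->
  <production ntsymbol="list">
    <tsymbol name="WORD"/>
  </production>
</productions>
```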
A production has two more attributes.
Precedence
The first attribute is "prec". It
is used to define precedences: the production gets the same
priority as the terminal symbol named in the attribute.

<production ntsymbol="line" prec="WORD">
  <ntsymbol name="line"/><tsymbol name="WORD"/>
</production>

The second attribute is "reducetype". This attribute is used by the tree builder.
Reduce type : NORMAL

<production ntsymbol="line" reducetype="normal">
  <ntsymbol name="line"/><tsymbol name="WORD"/>
</production>

This example will produce the following XML document.

<line>
  <line>
    <line>
      <line>
        <WORD>This</WORD>
      </line>
      <WORD>is</WORD>
    </line>
    <WORD>an</WORD>
  </line>
  <WORD>example</WORD>
</line>

Reduce type : APPEND

<production ntsymbol="line" reducetype="append">
  <ntsymbol name="line"/><tsymbol name="WORD"/>
</production>

"append" means that a production is resolved if its parent has the same name, so its children are attached directly to the parent.

<line>
  <WORD>This</WORD>
  <WORD>is</WORD>
  <WORD>an</WORD>
  <WORD>example</WORD>
</line>

Reduce type : RESOLVE

<production ntsymbol="line" reducetype="resolve">
  <ntsymbol name="line"/><tsymbol name="WORD"/>
</production>

"resolve" means that a production is always resolved, so only its children appear in the output.

<WORD>This</WORD>
<WORD>is</WORD>
<WORD>an</WORD>
<WORD>example</WORD>

Reduce type : NEGLECT
And finally "neglect" means that a production will not appear in the generated XML document.
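By analogy with the other reduce types, such a production would presumably be declared as follows. The attribute value "neglect" is taken from the sentence above, but this documentation gives no example of its own, so treat the fragment as a sketch.

```xml
<!-- Assumed by analogy with "normal", "append" and "resolve" -->
<production ntsymbol="line" reducetype="neglect">
  <ntsymbol name="line"/><tsymbol name="WORD"/>
</production>
```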