-
Notifications
You must be signed in to change notification settings - Fork 4
Guidelines: Authoring Grammars
The specification for ABNF syntax specifications does not provide any advice on how to write a good syntax specification. That is not surprising: defining a formal syntax specification is not an exact science.
This page contains guidelines for authors of syntax specifications. The advice that is given is based on real experiences.
Modularity is an important concept, even for the authors of a syntax specification. This is because the implementation will closely match the specification. A grammar that is not modular can never result in a modular program.
There is a one-to-one relation between the number of grammar rules and the number of rule parsers. This is important to understand. When you put too much complexity into a single rule, you make it difficult for the programmer to write a module for that rule. Instead of doing that, split the rule into smaller, manageable chunks. You are doing the programmer a huge favor.
Example of a rule that is easy to parse: DIGIT
, taken from the core ABNF rules in RFC 5234.
The DIGIT
rule represents a choice of 10 elements.
; RFC 5234
DIGIT = "0" / "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" / "9"
Writing code for the DIGIT
rule is pretty straightforward.
Example of a rule that is difficult to parse: IPv6address
, taken from RFC 3986.
The IPv6address
rule represents a choice of 9 alternatives.
; RFC 3986
IPv6address = 6( h16 ":" ) ls32
/ "::" 5( h16 ":" ) ls32
/ [ h16 ] "::" 4( h16 ":" ) ls32
/ [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
/ [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
/ [ *3( h16 ":" ) h16 ] "::" h16 ":" ls32
/ [ *4( h16 ":" ) h16 ] "::" ls32
/ [ *5( h16 ":" ) h16 ] "::" h16
/ [ *6( h16 ":" ) h16 ] "::"
Writing code for the IPv6address
is difficult due to its size and complexity.
- The first alternative is a sequence of two elements.
- The first element in the sequence is a repetition of exactly 6 sequences of two elements.
- The first element in the sequence is a non-terminal
- The second element in the sequence is a terminal
- The second element in the sequence is a non-terminal
- The first element in the sequence is a repetition of exactly 6 sequences of two elements.
- The second alternative is a sequence of three elements
- The third alternative is a sequence of four elements
- and so on
Furthermore, the rule naming is also a bit unfortunate. It is not obvious what h16
and ls32
are just from looking at the IPv6address
rule specification.
In the sections that follow, I will lay out a few techniques that you can use to break up some of that complexity.
For whatever reason, the names of core ABNF rules are capitalized. Rule names are actually case-insensitive. To differentiate custom rules from core rules, define custom rules in lowerCamelCase.
For example: leastSignificant32bits
instead of ls32
An alias rule is a rule that points to another rule without extending it.
num = DIGIT
Rules can get complicated, fast. By separating individual pieces out to sub-rules, the individual chunks become easier to manage.
- An element in a sequence may be a terminal.
- An element in a sequence may be a non-terminal.
- An element in a sequence must not be an option.
- An element in a sequence must not be a repetition.
- An element in a sequence must not be a group of alternatives.
- An element in a sequence must not be a nested sequence group.
count = ( "one" / "1" ) ( "two" / "2" ) ( "three" / "3" )
The count
sequence is a sequence of three elements that violate the guidelines: an element in a sequence must not be a group of alternatives.
Let's refactor:
count = one two three
one = "one" / "1"
two = "two" / "2"
three = "three" / "3"
- A repeating element may be a terminal.
- A repeating element may be a non-terminal.
- A repeating element must not be an option.
- A repeating element must not be a nested repetition.
- A repeating element must not be a group of alternatives.
- A repeating element must not be a sequence group.
The same rules apply to optional elements, because they are also a form of repetition.
linearWhiteSpace = 1*( WSP / ( CRLF WSP ) )
WSP = SP / HTAB
The linearWhiteSpace
repetition is a repeating element that violates the guidelines: a repeating element must not be a group of alternatives.
Let's refactor:
linearWhiteSpace = 1*( multiLineWSP )
multiLineWSP = WSP / newLineWSP
newLineWSP = CRLF WSP
WSP = SP / HTAB
Even the simplest grammar rules can usually be written in more than one way. To demonstrate this: all of the following rules match the same input text.
rule1 = [ "Example" ]
rule2 = *1( "Example" )
rule3 = "Example" / ""
rule4 = ("E" "x" "a" "m" "p" "l" "e") / ""
These rules all match the same syntax. The difference is in what they represent.
- Rule 1 represents an optional element.
- Rule 2 represents a collection of elements with 0 or 1 occurrences.
- Rule 3 Represents a choice of two elements.
- Rule 4 represents a choice of two element sequences.
Define rules in a way that closely matches what is being represented.