Skip to content
This repository has been archived by the owner on Jun 24, 2020. It is now read-only.

Guidelines: Authoring Grammars

Steven Liekens edited this page Mar 20, 2015 · 3 revisions

Authoring Grammar Rules

The specification for ABNF syntax specifications does not provide any advice on how to write a good syntax specification. That is not surprising: defining a formal syntax specification is not an exact science.

This page contains guidelines for authors of syntax specifications. The advice that is given is based on real experiences.

Modularity

Modularity is an important concept, even for the authors of a syntax specification. This is because the implementation will closely match the specification. A grammar that is not modular can never result in a modular program.

There is a one-to-one relation between the number of grammar rules and the number of rule parsers. This is important to understand. When you put too much complexity into a single rule, you make it difficult for the programmer to write a module for that rule. Instead of doing that, split the rule into smaller, manageable chunks. You are doing the programmer a huge favor.

Exhibit A - DIGIT

Example of a rule that is easy to parse: DIGIT, taken from the core ABNF rules in RFC 5234.

The DIGIT rule represents a choice of 10 elements.

; RFC 5234
DIGIT = "0" / "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" / "9"

Writing code for the DIGIT rule is pretty straightforward.

Exhibit B - IPv6address

Example of a rule that is difficult to parse: IPv6address, taken from RFC 3986.

The IPv6address rule represents a choice of 9 alternatives.

; RFC 3986
IPv6address   =                             6( h16 ":" ) ls32
                    /                       "::" 5( h16 ":" ) ls32
                    / [               h16 ] "::" 4( h16 ":" ) ls32
                    / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
                    / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
                    / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
                    / [ *4( h16 ":" ) h16 ] "::"              ls32
                    / [ *5( h16 ":" ) h16 ] "::"              h16
                    / [ *6( h16 ":" ) h16 ] "::"

Writing code for the IPv6address is difficult due to its size and complexity.

  • The first alternative is a sequence of two elements.
    • The first element in the sequence is a repetition of exactly 6 sequences of two elements.
      • The first element in the sequence is a non-terminal
      • The second element in the sequence is a terminal
    • The second element in the sequence is a non-terminal
  • The second alternative is a sequence of three elements
  • The third alternative is a sequence of four elements
  • and so on

Furthermore, the rule naming is also a bit unfortunate. It is not obvious what h16 and ls32 are just from looking at the IPv6address rule specification.

In the sections that follow, I will lay out a few techniques that you can use to break up some of that complexity.

Rule Naming

Use lowerCamelCase

For whatever reason, the names of core ABNF rules are capitalized. Rule names are actually case-insensitive. To differentiate custom rules from core rules, define custom rules in lowerCamelCase.

Prefer long, descriptive rule names

For example: leastSignificant32bits instead of ls32

Avoid alias rules

An alias rule is a rule that points to another rule without extending it.

num = DIGIT

Complexity

Rules can get complicated, fast. By separating individual pieces out to sub-rules, the individual chunks become easier to manage.

A sequence must only contain terminals and non-terminals

  • An element in a sequence may be a terminal.
  • An element in a sequence may be a non-terminal.
  • An element in a sequence must not be an option.
  • An element in a sequence must not be a repetition.
  • An element in a sequence must not be a group of alternatives.
  • An element in a sequence must not be a nested sequence group.

Example

count = ( "one" / "1" ) ( "two" / "2" ) ( "three" / "3" )

The count sequence is a sequence of three elements that violate the guidelines: an element in a sequence must not be a group of alternatives.

Let's refactor:

count = one two three
one   = "one" / "1"
two   = "two" / "2"
three = "three" / "3"

A repetition must be a repeating terminal or a repeating non-terminal

  • A repeating element may be a terminal.
  • A repeating element may be a non-terminal.
  • A repeating element must not be an option.
  • A repeating element must not be a nested repetition.
  • A repeating element must not be a group of alternatives.
  • A repeating element must not be a sequence group.

The same rules apply to optional elements, because they are also a form of repetition.

Example

linearWhiteSpace = 1*( WSP / ( CRLF WSP ) )
WSP              = SP / HTAB

The linearWhiteSpace repetition is a repeating element that violates the guidelines: a repeating element must not be a group of alternatives.

Let's refactor:

linearWhiteSpace = 1*( multiLineWSP )
multiLineWSP     = WSP / newLineWSP
newLineWSP       = CRLF WSP
WSP              = SP / HTAB

Semantics

Even the simplest grammar rules can usually be written in more than one way. To demonstrate this: all of the following rules match the same input text.

rule1 = [ "Example" ]
rule2 = *1( "Example" )
rule3 = "Example" / ""
rule4 = ("E" "x" "a" "m" "p" "l" "e") / ""

These rules all match the same syntax. The difference is in what they represent.

  • Rule 1 represents an optional element.
  • Rule 2 represents a collection of elements with 0 or 1 occurrences.
  • Rule 3 Represents a choice of two elements.
  • Rule 4 represents a choice of two element sequences.

Define rules in a way that closely matches what is being represented.