Chapter 3
Chomsky Grammars
3.1
Formal Grammars
Formal grammars are a special form of a more general concept called a rewriting system. There are many general rewriting systems; some are Markov algorithms, Post productions, and Thue and semi-Thue systems. These are not covered here, but the reader should be aware that the formal grammars presented here are those studied by Noam Chomsky and can be viewed as restricted semi-Thue systems. Before showing the classes of grammars and the corresponding languages generated by these Chomsky grammars, several definitions along with notation are presented.
In the context of studying programming languages, we are interested in using these grammars not only for describing what is and is not in a given programming language, but also for writing translators. The grammar-related concepts of parsing, parse trees, ambiguity, and the relationship to translation and compiler building are presented. It will be shown that two restricted forms of Chomsky grammars, namely Context-Free and Regular, are the most useful for writing translators.
3.1.1
Generating Languages with Grammars
The definition given next is one for a general Unrestricted Grammar, also known as a Phrase-Structured
grammar.
Definition 3.1.1 Grammar:
A grammar is defined by four items represented as a quadruple, < VN , VT , P, S >, where:
• VN is a nonempty, finite set of symbols called nonterminals.
• VT is a nonempty, finite set of symbols called terminals such that VN ∩ VT = ∅.
• P ⊂ V + × V ∗ is a non-empty, finite set of production rules, where V = VN ∪ VT .
• S ∈ VN is the one and only designated symbol called the start symbol.
Definition 3.1.2 Immediate (Direct) Derivation: If α → β is a production and ωαϕ is a string, where
ω, ϕ ∈ V ∗ and α ∈ V + , then ωαϕ ⇒ ωβϕ is an immediate (or direct) derivation.
Definition 3.1.3 Derives and Reduces Relations: A sequence of immediate derivations α1 ⇒ α2 , α2 ⇒ α3 , . . . , αn−1 ⇒ αn , usually written α1 ⇒ α2 ⇒ . . . ⇒ αn−1 ⇒ αn , is a derivation of length n. We say that α1 derives αn and αn reduces to α1 ; similarly, αn is derivable from α1 and α1 is reducible from αn . We represent a derivation of one or more steps as α1 ⇒+ αn (n > 1), and a derivation of zero or more steps as α1 ⇒∗ αn (n ≥ 1).
Definition 3.1.4 Sentential Forms and Sentences: A string α derivable from the start symbol, S, of a
grammar, G =< VN , VT , P, S >, is called a sentential form, i.e. S ⇒∗ α, where α ∈ V ∗ = (VN ∪ VT )∗ . A
string α, is a sentence iff α is a sentential form and α ∈ VT∗ .
Definition 3.1.5 Grammar-Generated L(G): Language, L(G), generated by grammar G, is defined as the
set of sentences:
L(G) = {α ∈ VT∗ | S ⇒+ α}
Generally, for definitions and abstract examples, we will continue to use the metasymbols ’→’ and ’|’. In addition, when the nonterminal and terminal sets are not explicitly listed, the following conventions will be used:
• Nonterminal symbols: strings starting with upper-case Latin letters (e.g. A, B, C, Expr, . . . )
• Terminal symbols: strings starting with lower-case Latin letters, and strings of printable non-alpha characters (excluding ’|’ and ’→’) (e.g. a, b, c, id, *, (, ), . . . )
• Lower-case Greek letters to denote strings in V ∗ (e.g. α, β, γ, . . . ).
3.1.2
Examples of Grammars
Example 3.1 Binary Number Language:
Let G =< {N, A, B}, {., 0, 1}, P, N > where P :
N → A | A.A
A → B | AB
B → 0 | 1
Here, we have:
V = VN ∪ VT = {N, A, B, ., 0, 1}
N ⇒ A.A ⇒ AB.A ⇒ A1.A

Copyright 2024
A1.A is a sentential form, but it is not a sentence.
N ⇒ A.A ⇒ AB.A ⇒ A1.A ⇒ B1.A ⇒ 11.A ⇒ 11.B ⇒ 11.0
N ⇒∗ 11.0 so 11.0 is a sentential form and is also a sentence. After trying some derivations of sentences
you should see (intuitively) that L(G) is the language of binary numbers.
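Since every sentence of a CFG has a leftmost derivation, the intuition about L(G) can be probed mechanically. The sketch below is ours, not the text's (the length bound max_len is an assumption used to keep the search finite): it enumerates all sentences of the grammar up to that bound by always expanding the leftmost nonterminal.

```python
# Enumerate short sentences of Example 3.1's grammar by breadth-first
# search over sentential forms, always expanding the leftmost nonterminal.
from collections import deque

P = {"N": ["A", "A.A"], "A": ["B", "AB"], "B": ["0", "1"]}

def sentences(start="N", max_len=4):
    seen, out, queue = set(), set(), deque([start])
    while queue:
        form = queue.popleft()
        i = next((j for j, s in enumerate(form) if s in P), None)
        if i is None:               # no nonterminals left: a sentence
            out.add(form)
            continue
        for rhs in P[form[i]]:      # replace the leftmost nonterminal
            new = form[:i] + rhs + form[i + 1:]
            if len(new) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return out

print(sorted(sentences()))   # "0", "1", "10", "1.0", "11.0", ... all appear
```

Running it produces exactly the binary numerals of bounded length, with an optional fractional part, matching the intuition above.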
Example 3.2 Expression Grammar: E, T, F
Let G =< {E, T, F }, {id, (, ), ∗, +}, P, E > where P:
E → E + T | T
T → T ∗ F | F
F → (E) | id
This grammar is used to describe arithmetic expressions in many programming languages. E, T, and F
stand for Expression, Term, and Factor, respectively. Some derivations of sentences are shown below:
1. E ⇒ T ⇒ F ⇒ id

2. E ⇒ E + T ⇒ E + T ∗ F ⇒ E + F ∗ F ⇒ E + id ∗ F ⇒ E + id ∗ id ⇒ T + id ∗ id ⇒ F + id ∗ id ⇒ id + id ∗ id
3. E ⇒ E + T ⇒ E + F ⇒ E + (E) ⇒ E + (T ) ⇒ E + (T ∗ F ) ⇒ E + (T ∗ id) ⇒ E + (F ∗ id) ⇒ E + (id ∗ id) ⇒ T + (id ∗ id) ⇒ F + (id ∗ id) ⇒ id + (id ∗ id)
4. E ⇒ E + T ⇒ T + T ⇒ F + T ⇒ id + T ⇒ id + F ⇒ id + (E) ⇒ id + (T ) ⇒ id + (T ∗ F ) ⇒ id + (F ∗ F ) ⇒ id + (id ∗ F ) ⇒ id + (id ∗ id)
Note that the last two derivations produced the same sentence. Derivations 3 and 4 are called rightmost and
leftmost, respectively. In the right(left)most derivations the right(left)most nonterminal is replaced. The
formal definition for a rightmost derivation is as follows:
Let G =< VN , VT , P, S > be a grammar. The derivation α0 ⇒ α1 ⇒ · · · ⇒ αn−1 (n > 1) is a rightmost derivation if and only if each direct derivation αi−1 ⇒ αi (1 ≤ i ≤ n − 1) is of the form αi−1 = ϕAω and αi = ϕβω, where A → β is a production, ω ∈ VT∗ , A ∈ VN , and ϕ ∈ V ∗ . Remember V = (VN ∪ VT ). A leftmost derivation is defined similarly; just swap the roles of ϕ and ω (so that ϕ ∈ VT∗ and ω ∈ V ∗ ).
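The leftmost discipline is easy to mechanize: the position being rewritten is forced, so a leftmost derivation is determined entirely by which alternative is chosen at each step. A small Python sketch (ours, not from the text; "i" abbreviates id):

```python
# Replay a leftmost derivation in the E, T, F grammar. At each step the
# leftmost nonterminal is located and the chosen alternative is substituted.
PRODS = {"E": ["E+T", "T"], "T": ["T*F", "F"], "F": ["(E)", "i"]}

def leftmost_step(form, choice):
    for i, sym in enumerate(form):
        if sym in PRODS:                       # leftmost nonterminal
            return form[:i] + PRODS[sym][choice] + form[i + 1:]
    raise ValueError("no nonterminal to replace")

# Alternative choices reproducing the leftmost derivation of id + id * id:
form = "E"
for choice in [0, 1, 1, 1, 0, 1, 1, 1]:
    form = leftmost_step(form, choice)
print(form)   # i+i*i
```

Replaying a rightmost derivation works the same way, except the loop scans for the rightmost nonterminal instead.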
3.2
The Chomsky Hierarchy
Up to this point the reader may have observed that the productions of all the examples have been restricted forms of that given in the definition of a grammar: each left-hand side of the productions has consisted only of a single nonterminal. Grammars with these types of productions are called context-free grammars (CFGs). Computer scientists have found that CFGs are very useful in describing many artificial languages used in computer science, especially programming languages. The other Chomsky grammars are also very important to computer scientists and other researchers whose research intersects the general areas of the theory of computation and linguistics; there are too many specific areas to begin listing them.
In this section we will examine four types of grammars that make up the Chomsky hierarchy and mention
a few of the relationships between them.
3.2.1
Definitions of the Grammar Classes
We begin by considering the grammar given in the definition of grammars as the most general type and proceed by imposing more restrictions on the production forms. Note that the type-0 grammar is identical to the original definition of a grammar given in Definition 3.1.1. For the more restrictive grammar types, 1, 2, and 3, we add the condition that the only production allowed to have ϵ on its right-hand side is one with the start symbol alone on the left-hand side; no other production can have ϵ on the right-hand side.
There are four classes of Chomsky grammars, each adding more constraints on the production rule form.
Definition 3.2.1 Phrase-Structured Grammar (type 0): A phrase-structured grammar (PSG) has productions of the form:
α→β
where α ∈ V + and β ∈ V ∗ . This type of grammar is like a semi-Thue system with the designated string restricted to the start symbol; this type 0 grammar form is also called unrestricted in many computer science and mathematics journals and textbooks.
If G is a PSG, then L(G) is a phrase-structured language (PSL).
If G is a PSG, the L(G) is a phrase-structured language (PSL).
Definition 3.2.2 Context-Sensitive Grammar (type 1): A context-sensitive grammar (CSG) has productions of the form:
α→β
where α ∈ V + , α ̸= β, and |β| ≥ |α|.
Note that since β must be at least as long as α, obviously β ∈ V + . This length restriction classifies this type of grammar as non-contracting.
The reason CSGs are called “context-sensitive” is that each CSG production set can be transformed into productions that generate the same language but have the following form:
ϕ1 Aϕ2 → ϕ1 ωϕ2
where ϕ1 , ϕ2 ∈ V ∗ and ω ∈ V + . In other words nonterminal A can only be replaced in the “context” of ϕ1
and ϕ2 . We will not prove that all CSGs can be written in this form.
If G is a CSG, then L(G) is a context-sensitive language (CSL).
Definition 3.2.3 Context-Free Grammar (type 2): A context-free grammar (CFG) has productions of the form:
A→β
where A ∈ VN and β ∈ V + .
Note that the productions of CFGs are special cases of CSG productions: simply let ϕ1 = ϵ and ϕ2 = ϵ. Also note that all of the examples presented in the preceding sections are CFGs.
If G is a CFG, then L(G) is a context-free language (CFL).
Definition 3.2.4 Regular (or linear) Grammar (type 3): A regular grammar (RG) is either a left or right linear grammar. A left linear grammar is a restricted CFG that has productions only of the form:
A → Ba
or
A → a
where A, B ∈ VN and a ∈ VT .
A right linear grammar has productions only of the form:
A → aB
or
A → a
where A, B ∈ VN and a ∈ VT .
If G is an RG, L(G) is a regular language (RL).
We need to discuss languages that contain the empty string, ϵ. For these languages we may derive it simply as S ⇒ ϵ, where S ∈ VN is the start symbol, no matter what type of grammar. For describing some languages, it may be a matter of convenience to allow some productions of the form A → ϵ, where A ∈ VN and A is not the start symbol. However, it can be shown that if grammar G1 has n > 0 ϵ-productions, where ϵ may or may not be in L(G1 ), then there exists a grammar G2 with no ϵ-productions, except possibly S → ϵ, such that L(G2 ) = L(G1 ). If ϵ ∈ L(G1 ), then one must include in G2 the production S → ϵ, where S is the start symbol.
For the remainder of this textbook, we will assume that ϵ-productions are allowed for grammars. Just
remember that the remaining CSG productions may not contract.
3.2.2
Examples of Classifying Grammars
When classifying a grammar within the context of the Chomsky hierarchy, always begin by trying to see
if every one of the grammar’s production rules has the form of a regular grammar. If it does not, check
whether it is a CFG. If not, then check whether it is a CSG. Remember that every Chomsky grammar is a
PSG and that RGs are the smallest, most restrictive class; always start with the smallest class, RGs, and
“work your way up” to the most general class.
Note that if there is more than one symbol, or if there is a terminal, on the left-hand side of any production, then the grammar cannot be an RG or CFG. To help distinguish between CSGs and PSGs, remember that PSGs can have contracting productions α → β with |β| < |α|; i.e. PSGs may be contracting grammars while CSGs are non-contracting. This can help reduce the amount of work in classifying a Chomsky grammar.
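The "work your way up" procedure can be sketched directly in code. The Python below is our illustration, not the text's: it assumes a toy encoding in which productions are (left, right) string pairs, single uppercase letters are the nonterminals, and only the right linear form of RGs is checked, for brevity.

```python
# Classify a grammar by testing production forms from the most restrictive
# class upward: regular (right linear only, for brevity), then context-free,
# then context-sensitive; anything else is phrase-structured.
def is_nt(s):
    return len(s) == 1 and s.isupper()         # toy convention: A..Z are nonterminals

def classify(prods):
    def rg(l, r):                              # A -> a  or  A -> aB
        return is_nt(l) and (
            (len(r) == 1 and r.islower()) or
            (len(r) == 2 and r[0].islower() and is_nt(r[1])))
    def cfg(l, r):                             # A -> beta with |beta| >= 1
        return is_nt(l) and len(r) >= 1
    def csg(l, r):                             # non-contracting
        return len(r) >= len(l)
    for name, fits in [("regular", rg), ("context-free", cfg),
                       ("context-sensitive", csg)]:
        if all(fits(l, r) for l, r in prods):
            return name
    return "phrase-structured"

G2 = [("S", "cE"), ("S", "cS"), ("S", "dS"), ("S", "b"),
      ("cE", "cdS"), ("cE", "cd")]
print(classify(G2))   # context-sensitive
```

Note how the rule cE → cdS alone pushes the classification past RG and CFG, just as the hand analysis in Example 3.3 does.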
Example 3.3 Context-Sensitive Grammar
A context-sensitive grammar, G2:
S → cE | cS | dS | b
cE → cdS | cd
The last two productions do not fit the form of CFG production rules. Thus, we must check to see if they fit the CSG form. Recalling the definition of CSGs, all CSG productions are non-contracting and can be transformed into the following form: ϕ1 Aϕ2 → ϕ1 ωϕ2 where ϕ1 , ϕ2 ∈ V ∗ and ω ∈ V + . Grammar G2 is non-contracting so it is a CSG. Moreover, it fits the second form also. Consider the production cE → cdS. We see that if ϕ1 = c, A = E, ϕ2 = ϵ, and ω = dS, the production fits the form. Similarly, the last production also fits the form if the same assignments are made with the exception of ω = d.
Example 3.4 Context-Free Grammar
A context-free grammar, G3:
S → bDc | Db | b
D → bbDc | b | ϵ
Grammar G3 is not an RG because the first and fourth productions do not fit the form of an RG. Since we
are allowing ϵ-productions in our CFG production forms all productions in G3 do fit the form of a CFG. Try
describing the language L(G3).
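The earlier claim that ϵ-productions can always be eliminated (keeping at most S → ϵ) can be made concrete on G3. The Python below is our illustrative sketch, not the text's algorithm: it computes the nullable nonterminals, then emits every variant of each rule with nullable symbols optionally dropped.

```python
# Epsilon-production elimination sketch: find nullable nonterminals, then
# for each rule add copies with every subset of nullable symbols removed.
from itertools import combinations

def eliminate_eps(prods, start="S"):
    nullable = {l for l, r in prods if r == ""}
    changed = True
    while changed:                     # closure: A is nullable if some rule
        changed = False                # rewrites it to only nullable symbols
        for l, r in prods:
            if l not in nullable and r and all(s in nullable for s in r):
                nullable.add(l)
                changed = True
    new = set()
    for l, r in prods:
        idxs = [i for i, s in enumerate(r) if s in nullable]
        for k in range(len(idxs) + 1):
            for drop in combinations(idxs, k):
                r2 = "".join(s for i, s in enumerate(r) if i not in drop)
                if r2:                 # never emit an epsilon right-hand side
                    new.add((l, r2))
    if start in nullable:              # except possibly S -> epsilon
        new.add((start, ""))
    return sorted(new)

G3 = [("S", "bDc"), ("S", "Db"), ("S", "b"),
      ("D", "bbDc"), ("D", "b"), ("D", "")]
print(eliminate_eps(G3))
```

Here S is not nullable, so no S → ϵ is added and the resulting grammar generates exactly L(G3) with no ϵ-productions.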
Example 3.5 Regular Grammar
A regular grammar, G4:
S → aU | bV
U → bS | b
V → aS | a
The grammar G4 is a right linear grammar. An RG is either right linear or left linear and does not mix
productions of the two types.
3.2.3
Relationships of the Grammar and Language Classes
As discussed and illustrated earlier, the four types of Chomsky grammars are identified by the form of their productions. Regular grammar productions are special cases of context-free productions, and context-free productions are special cases of context-sensitive productions. All productions are phrase structured. Each class of grammars has a corresponding class of languages. Given a language L, one is interested not only in constructing a grammar G such that L = L(G), but also in constructing a grammar of the most restrictive type possible. This raises the questions: are there languages that no regular grammar can generate? Are there languages that no context-free grammar can generate? The answers to these questions are all yes. To discuss this we introduce some definitions and notation and then look at some examples.
Let Li be the class of languages that can be generated by a grammar of type i, i ∈ {0, 1, 2, 3}.
It is beyond the scope of this textbook, but it can be shown that:
L3 ⊂ L2 ⊂ L1 ⊂ L0
Example 3.6 Three Languages
Shown below are three languages and their respective types and sample grammars.

Language L(G)                 Type                 Grammar G
{a^n b^m | n > 0, m > 0}      Regular              S → aA
                                                   A → aA | bB | b
                                                   B → bB | b
{a^n b^n | n > 0}             Context Free         S → aSb | ab
{a^n b^n c^n | n > 0}         Context Sensitive    S → aSBC | aBC
                                                   CB → BC
                                                   aB → ab
                                                   bB → bb
                                                   bC → bc
                                                   cC → cc
Suppose that a software engineer is given a language L and must find a grammar G such that L = L(G). How does the software engineer know that they have found the most restrictive grammar type that can generate language L? Computer scientists have a collection of theorems and lemmas that can identify the smallest class of languages to which a language belongs. If the smallest language type is known, grammars of a particular form can be pursued and properties can be exploited. Once the language type is known, the appropriate known algorithms and models of computation can be utilized in solving the language membership decision problem: “α ∈ L(G)?” Some of these algorithms try to construct a derivation of the string being tested for membership in the usual sense by starting with the start symbol and reading one symbol at a time in one direction (usually left-to-right), possibly looking k ≥ 0 symbols ahead. Alternatively, one can re-construct the derivation “backward” through a series of reductions while reading the string in one direction, again possibly looking ahead.
Another approach to testing membership is to utilize a model of computation, such as a finite state
automaton or push-down automaton. Later we will see that finite state machines can be used for testing
string membership in regular languages. Push-down automata can be used for testing string membership
in context-free languages. It should be pointed out that linear-bounded Turing machines can be used for
testing string membership in context-sensitive languages and Turing machines for testing membership of any
language. We will see that these machines can be represented as formal rewrite rules and also have direct
relationships with the corresponding grammars in that there are algorithms for translating between the two
types of formalisms.
These membership algorithms are referred to as parsers. An important underlying abstraction used
implicitly and, sometimes explicitly, by these parsers is a parse tree.
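The correspondence between regular grammars and finite state machines is direct for a right linear grammar: treat each nonterminal as a state, each A → aB as a transition on a to state B, and each A → a as a transition on a to an accepting state. A Python sketch of this (ours, not from the text; encoding acceptance as None is an assumption of this sketch), using grammar G4 of Example 3.5:

```python
# Simulate the nondeterministic finite automaton read off from the
# right linear grammar G4: S -> aU | bV, U -> bS | b, V -> aS | a.
G4 = {"S": [("a", "U"), ("b", "V")],
      "U": [("b", "S"), ("b", None)],   # None marks an accepting move
      "V": [("a", "S"), ("a", None)]}

def accepts(s, start="S"):
    states = {start}
    for ch in s:
        nxt = set()
        for q in states:
            if q is None:
                continue               # the accepting state has no outgoing moves
            for sym, tgt in G4[q]:
                if sym == ch:
                    nxt.add(tgt)
        states = nxt
    return None in states

print(accepts("ab"), accepts("abba"), accepts("aa"))   # True True False
```

Tracing a few strings suggests L(G4) is the set of nonempty concatenations of the blocks ab and ba, exactly the sentences the grammar derives.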
3.3
Parse Trees
As mentioned earlier, to solve the problem of testing α ∈ L(G), one tries to construct a derivation of the string α using productions of G. The construction of a derivation is called parsing. A parse of some sentence (or, more generally, a sentential form) shows how the sentence (or sentential form) was derived. A parser is simply the name for the procedure that performs the construction of the derivation. Each parse has a corresponding parse tree. We define a parse tree below.
3.3.1
Terminology
A directed acyclic graph is a directed graph with no cycles. A tree is a directed acyclic graph with a unique path from a designated node, called the root, to each of the other nodes. A node n1 is a descendant of another node n2 if there is a directed path from n2 to n1 . Leaf nodes have no descendants and the root node is not a descendant of any node. A node that is neither a root node nor a leaf node is an interior node. A node n1 is a child of node n2 if (n2 , n1 ) is an edge (branch) in the tree. An ordered tree has a linear order on the children of every node.
Figure 3.1: A → X1 X2 . . . Xm
Definition 3.3.1 Parse Tree: Given a CFG, G =< VN , VT , P, S > an ordered tree is a parse tree if the
following conditions are met:
1. The root is labeled by the start symbol.
2. A leaf node is labeled by a terminal or ϵ.
3. Each interior node is labeled by a nonterminal.
4. If A is an interior node and X1 , X2 , . . . , Xm are the children of the node, then A → X1 X2 . . . Xm is a production of G, where Xi ∈ V ∪ {ϵ}. See Figure 3.1.
Note that we have defined parse trees for CFGs. Since the usage of one production rule in a particular context might modify the context for applying another production rule, trees are not usually used as a means of studying context-sensitive grammars. The order of nonterminal substitution in a CFG derivation does not matter, whereas for CSGs it does matter in some cases.
3.3.2
A Parse Tree Example
Example 3.7 Parse Tree for id + id ∗ id using the EFT-Grammar
Consider the expression grammar:
E → E + T | T
T → T ∗ F | F
F → (E) | id
The parse tree for the derivation of the sentence id + id ∗ id is shown in Figure 3.2. The derivation of id + id ∗ id is:
E ⇒ E + T ⇒ E + T ∗ F ⇒ E + F ∗ F ⇒ E + id ∗ F ⇒ E + id ∗ id ⇒ T + id ∗ id ⇒ F + id ∗ id ⇒ id + id ∗ id
Try doing a rightmost derivation then a leftmost derivation of the same sentence: id + id ∗ id. Draw and
compare the parse trees for each of the derivations.
Figure 3.2: Parse tree for id + id ∗ id
3.3.3
Ambiguity
In the previous Example (3.7) you should have discovered that each derivation corresponded to the same
parse tree. Also, you should note that there is only one left- and rightmost derivation for that sentence.
Some grammars, however, may allow more than one parse tree for a sentence or more than one left- or
rightmost derivation. This is referred to as an ambiguous grammar. We define ambiguity formally as:
Definition 3.3.2 Ambiguous Sentences, Grammars, and Languages: An ambiguous sentence in L(G) is
a sentence that has two or more distinct leftmost derivations with respect to grammar G. An ambiguous
grammar is a grammar that generates at least one ambiguous sentence. A language is (inherently) ambiguous
if and only if there are no unambiguous grammars that generate the language.
NOTE: We could have used “rightmost derivations” or “parse trees” in place of “leftmost derivations.”
Each definition would be equivalent. Note that a sentence may still be considered ambiguous even if the two
corresponding unlabeled parse trees have the same structure, i.e. a different labeling of the interior nodes
indicates ambiguity.
Example 3.8 Structural Ambiguity
Consider the following grammar:
G =< {E}, {∗, +, id}, P, E > where P is:
E → E + E | E ∗ E | (E) | id
This grammar generates the same language as the grammar of Example 3.7. One of the differences between the two grammars is that this grammar is ambiguous and the one in Example 3.7 is not. For example, using grammar G in this example, there are at least two leftmost derivations of the sentence id + id ∗ id:
1. E ⇒ E + E ⇒ id + E ⇒ id + E ∗ E ⇒ id + id ∗ E ⇒ id + id ∗ id

2. E ⇒ E ∗ E ⇒ E + E ∗ E ⇒ id + E ∗ E ⇒ id + id ∗ E ⇒ id + id ∗ id

Figure 3.3: Structural ambiguity: (a) the tree for derivation 1, with ∗ nested under +; (b) the tree for derivation 2, with + nested under ∗
The parse trees in Figures 3.3(a) and (b) correspond to derivations 1 and 2, respectively. This example
illustrates a sentence with two structurally distinct parse trees. Compare these trees to the one in Figure 3.2.
The parse tree in Figure 3.3(a) is the structure that corresponds to the semantics, or evaluation order, used in mathematics. When defining the semantics or meaning of programs, one must keep the intended semantics in mind when designing a grammar for the syntax. Many translators build parse trees when performing the syntax analysis and attach semantic attributes to the nodes for use by other components of the compiler, such as the code generator.
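Definition 3.3.2 suggests a brute-force ambiguity test for small grammars: count the distinct leftmost derivations of a sentence. A Python sketch (ours, not from the text; "i" abbreviates id, and the pruning relies on the fact that replacing every E by id gives the shortest sentence reachable from a form):

```python
# Count distinct leftmost derivations of a sentence in the ambiguous grammar
# E -> E+E | E*E | (E) | id ("i" stands for id). A count of 2 or more
# demonstrates that the sentence, and hence the grammar, is ambiguous.
ALTS = ["E+E", "E*E", "(E)", "i"]

def count_leftmost(form, target):
    if "E" not in form:
        return 1 if form == target else 0
    if len(form.replace("E", "i")) > len(target):   # form can only grow
        return 0
    i = form.index("E")                             # leftmost nonterminal
    return sum(count_leftmost(form[:i] + rhs + form[i + 1:], target)
               for rhs in ALTS)

print(count_leftmost("E", "i+i*i"))   # 2
```

The two derivations counted are exactly derivations 1 and 2 of Example 3.8; running the same counter on the unambiguous E, T, F grammar of Example 3.7 would yield 1 for every sentence.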
Example 3.9 Labeling Ambiguity
Consider the following grammar:
G =< {S, A}, {a, b}, P, S >
where P is:
S → Sa | Aa | b
A → Aa | b
The sentence baa can be shown to be ambiguous by looking at its two parse trees shown in Figure 3.4. This illustrates labeling ambiguity.
3.4
MetaLanguages for Presenting Grammars
The prefix, “meta-”, in this context is used for symbols or syntactic structures that are used to describe,
express and define grammars, that, in turn, express the syntactic structure of languages. In this section,
several metanotations that allow designers, implementors, and programmers to communicate the syntactic
structure of programming languages are shown. The material covered in this section provides the background
needed for designing and implementing parsers for programming languages.
Figure 3.4: Labeling ambiguity
3.4.1
BNF Notation
Backus-Naur Form (BNF) is a metasyntactic notation for expressing grammars. It was originally introduced by two computer language and compiler designers, John Backus and Peter Naur. Emil Post’s notation for his production-system model of computation provided a starting point for John Backus, who utilized an early version of BNF for describing a language, called IAL, that would later evolve into Algol 58. Later, Peter Naur utilized it for describing Algol 60. Although John Backus is best known for his leadership of the design team for the first versions of Fortran in the mid 1950s, he and Peter Naur were both on the design team for Algol 60. For details and the history of the development of BNF, the interested reader should consult [1] and [9].
A variation of BNF has been utilized earlier in the text for presenting grammar-related and language-related concepts. Thus far, for the purposes of definitions, theorems, and examples illustrating the definitions, the metasymbols → and | have been utilized. Uppercase and lowercase letters have been used to represent nonterminals and terminals, respectively. For “real” programming languages, more meaningful names, such as parameter-list as opposed to A, need to be used by language designers and implementors when writing and reading grammars. In order to distinguish between terminals and nonterminal symbols, the strings representing nonterminals are simply surrounded with a pair of angle brackets, < >. Several metanotations are used for terminals. With the ability to distinguish between nonterminals and terminals, one need not explicitly list the elements of the nonterminal and terminal sets, VN and VT , respectively. One can express the four items that comprise a grammar, < VN , VT , P, S >, by simply listing the production rule set in BNF and putting the production rule(s) with the start symbol as the left-hand side symbol first. A summary of the BNF metasymbols and their meanings is shown in Table 3.1.
MetaSymbol                              Meaning
::= or →                                delimits left-hand side from right-hand side
|                                       “or”
< >                                     surrounds nonterminal names
Unaltered strings not between < >       terminals
Underlined string                       terminals
Bold-faced string                       terminals
Single quotes ’string’                  terminals
Double quotes ”string”                  terminals

Table 3.1: BNF Metasymbols

When BNF was first introduced in the late 1950s, there were not many fonts and typefaces available on most printers used within general-purpose computer systems; even if one could interchange the font and typeface, the ability to mix fonts and typefaces while printing a single file was almost non-existent at that time. Word processors with bold and underline functions were not widely available. Usually one font over a single character set, such as ASCII, was used. Many of the notational conventions above, such as quoting terminals or treating strings not surrounded by angle brackets as terminals, were adopted because of this. For example, ::=1 was adopted instead of →, since ASCII does not include → but does contain the characters of ::=.
Note that when metasymbols are part of the language generated by a grammar expressed in BNF, single or double quotes must be used for those symbols. If single or double quotes are part of the language, then two consecutive single or double quotes, respectively, are often used when expressing a grammar in BNF for that language. For example, two consecutive double quotes, ””, used when presenting a grammar in EBNF specify a single double-quote terminal symbol, ”. This doubling convention also appears in some programming languages as a way of embedding a quote character inside a quoted literal.
For some purposes it is also convenient to mix notations for terminals in a set of productions. For
example, one might put single quotes around single character terminals, underline reserved words and/or
leave operator symbols unquoted in the same EBNF-specified production rule set.
As one can see, there are many metanotations for even simple BNF. One must keep in mind that notational consistency provides the best readability and understandability of the grammar and, in turn, of the language generated from that grammar. If you are designing a language and/or a grammar for a language, it is always best to supply a legend for your readers, the programmers and implementors.
1 A personal speculation: because = and := were terminals in Algol-like languages at the time, ::= seemed to be the most natural metasymbol.
Example 3.10 Expression Grammar in BNF
In this example a grammar representing the syntactic structure of simple expressions used in many imperative languages is shown. What the types of the expressions are, or what the operator symbols actually represent, is not known; this only expresses the syntax.

< Expr > ::= < Expr > + < Term > | < Expr > − < Term > | < Term >
< Term > ::= < Term > ∗ < Factor > | < Term > / < Factor > | < Factor >
< Factor > ::= < Primary > | + < Primary > | − < Primary > | (< Expr >)
< Primary > ::= identifier | number

From the BNF notation the sets of terminals and nonterminals can be inferred:

VT = {+, −, ∗, /, (, ), identifier, number}
VN = {Expr, Term, Factor, Primary}

3.4.2
EBNF
BNF can be extended by introducing additional metasymbols so that production rules can be combined and reduced. There are many different extensions; a common set of symbols along with their meanings is summarized in Table 3.2. The variable α represents any legal right-hand side of a BNF or EBNF rule.

BNF + Additional MetaSymbols = EBNF

MetaSymbols Added to BNF    Meaning (α is an EBNF expression)
[α]                         α is optional
{α}                         α can be repeated 0 or more times
{α}+                        α can be repeated 1 or more times
(α)                         groups α into an EBNF subexpression

Table 3.2: EBNF’s Additional Metasymbols

One can view the right-hand side of a production rule as an expression with these additional metasymbols as operators. The precedence rules for these additional metasymbol operators are described with the metagrammar, expressed in EBNF, in Figure 3.5. From this grammar one can see that the metasymbol | has lower precedence than concatenation, which is implicit in the juxtaposition of EBNF expressions. The metasymbol pairs (), [], and {} are used to form subexpressions, with the parentheses having no operational meaning of their own; the parentheses, (), simply alter the application of the precedence rules.
< EBNF > → < A > { ’|’ < A > }
< A > → < B > { < B > }
< B > → ’[’ < C > ’]’ | ’{’ < C > ’}’ | ’(’ < EBNF > ’)’ | < terminal > | < nonterminal >
< terminal > → string | ””string”” | ’’string’’ | string
< nonterminal > → ’<’ string ’>’

Figure 3.5: Metagrammar for EBNF written in EBNF
Example 3.11 Expression Grammar in EBNF
In this example we utilize EBNF notation and rewrite Example 3.10.

< Expr > ::= < Term > {(+ | −) < Term >}
< Term > ::= < Factor > {(∗ | /) < Factor >}
< Factor > ::= [+ | −] < Primary > | ’(’ < Expr > ’)’
< Primary > ::= identifier | number

Note that the parentheses are quoted since they are metasymbols in EBNF. As with the BNF used in Example 3.10, the EBNF allows one to distinguish nonterminal symbols from terminals:

VT = {+, −, ∗, /, (, ), identifier, number}
VN = {Expr, Term, Factor, Primary}

3.4.3
Syntax Diagrams
Just as finite state diagrams give a two-dimensional representation of a finite state machine, context-free grammars can be depicted visually by utilizing a set of diagrams. Both types of diagrams are intended to aid programmers with understanding the syntactic structure of a given language and to aid compiler implementors with designing parsers, especially top-down parsers, of a given language. Although all context-free grammars can be represented with syntax diagrams, traditionally syntax diagrams have been utilized for designing a particular subset of top-down parsers, called recursive-descent parsers. Before discussing this, it is shown how to represent a grammar G as a set of syntax diagrams.
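The recursive-descent idea can be previewed on the EBNF grammar of Example 3.11: each rule becomes one procedure, {. . .} becomes a loop, and [. . .] a conditional. The Python below is our illustrative sketch, not the text's parser (single characters stand in for identifier and number tokens, and the class and method names are ours):

```python
# Minimal recursive-descent recognizer for Example 3.11's EBNF grammar.
# Letters play the role of identifier tokens; digits play number tokens.
class Parser:
    def __init__(self, text):
        self.toks, self.pos = list(text), 0
    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None
    def eat(self, t):
        if self.peek() != t:
            raise SyntaxError(f"expected {t!r} at {self.pos}")
        self.pos += 1
    def expr(self):            # <Expr> ::= <Term> {(+|-) <Term>}
        self.term()
        while self.peek() in ("+", "-"):
            self.pos += 1
            self.term()
    def term(self):            # <Term> ::= <Factor> {(*|/) <Factor>}
        self.factor()
        while self.peek() in ("*", "/"):
            self.pos += 1
            self.factor()
    def factor(self):          # <Factor> ::= [+|-] <Primary> | '(' <Expr> ')'
        if self.peek() in ("+", "-"):
            self.pos += 1
            self.primary()
        elif self.peek() == "(":
            self.eat("(")
            self.expr()
            self.eat(")")
        else:
            self.primary()
    def primary(self):         # <Primary> ::= identifier | number
        t = self.peek()
        if t is not None and (t.isalpha() or t.isdigit()):
            self.pos += 1
        else:
            raise SyntaxError(f"unexpected {t!r}")

def accepts(s):
    p = Parser(s)
    try:
        p.expr()
        return p.pos == len(p.toks)
    except SyntaxError:
        return False

print(accepts("a+(b*3)"))   # True
```

The call graph mirrors the syntax diagrams introduced next: expr calls term, term calls factor, and factor may re-enter expr through parentheses, which is exactly the "jump to another graph" traversal described below.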
Each syntax diagram represents a group of related production rules, namely ones with the same left-hand side nonterminal. Each group is called a syntactic category. Using the BNF (or EBNF) metasymbol, |, we combine rules that are not already combined as follows. Production rules

A → α1
A → α2
. . .
A → αn

are grouped into:

A → α1 | α2 | . . . | αn
This is done for each nonterminal. Each syntactic category, for example A, is represented by a syntax diagram. The diagram is named by the name of the syntactic category it represents.
Toward a more formal characterization, we note that each syntax diagram is a node-labeled, directed graph. We will see that a finite state diagram is a node-labeled, edge-labeled, directed graph. Whereas a finite state machine has one corresponding diagram (graph), a grammar is represented by possibly many syntax diagrams (graphs). In a finite state diagram the nodes are labeled with state names and represented as circles; transitions are directed edges, each labeled by one symbol from the terminal set. In a syntax diagram there are three types of nodes: junction nodes, terminal nodes, and nonterminal nodes. Conventionally, junction nodes are implicit in the drawing, a terminal node is represented by an ellipse and labeled with a terminal (token) symbol, and a nonterminal node is represented by a rectangle and labeled with a nonterminal symbol. All edges in a syntax graph are unlabeled, directed edges.
When using a finite state diagram, one can trace a path through the state diagram by reading the input string and following the labels of the transition edges. Parsing the input string requires following a directed path through a syntax diagram and may require jumping between several syntax graphs, each of which corresponds to a group of production rules. Informally, when traversing the syntax graph, encountering a terminal node requires matching the next input symbol; this means moving the input pointer to the next input symbol and continuing down a path in the graph. Upon encountering a nonterminal rectangle, one jumps to the syntax graph named by that nonterminal (maybe even the same graph in the case of a directly recursive rule); after traversing that graph and, possibly, several others, one returns and continues on the edge emanating from the rectangle. One can convert each syntactic category into a syntax graph by putting each right-hand side as a branch at a junction node. Each right-hand side is a sequence of EBNF elements: terminals, nonterminals, optional elements, iterative elements, and nested alternative elements. The junction nodes represent forks in the road where decisions have to be made, or they represent the merging of two or more paths.
Example 3.12 Syntax Graphs of a Simple Grammar
Consider the grammar G with start symbol S, written in EBNF:

< S > → (a | b) < A > {b} | c < B > d
< A > → c < A > d | ef [g]
< B > → {cd}

The corresponding syntax graphs are:
[Syntax graphs for < S >, < A >, and < B >: each diagram branches at a junction node over the alternatives of its rule.]
To trace a path of edges in a syntax graph is equivalent to parsing an input string. For example, the string ccdcdd may