Table of Contents

NAME

yacc++ - Object-oriented parser generator

SYNOPSIS

/usr/local/cosc/greg/bin/yacc++ [-dvnk] filename

DESCRIPTION

Yacc++ is a parser generator that generates object-oriented parsers. The grammar specifications it accepts are an extended and modified form of those accepted by yacc. The output of yacc++ is a C++ source file which can be compiled with a C++ compiler to yield a parser for the specified grammar.

A parser generated by yacc++ builds and returns a parse tree constructed from C++ objects. Methods can be attached to the nodes of this parse tree to perform any desired computation. For instance, methods could be defined to perform the declaration analysis, type checking and code generation phases of a compiler.

The yacc++ input file also subsumes the definitions of the lexical tokens, so that there is no need for a separate lexical analyser definition file.

OPTIONS

-d
Turn on debugging in the generated yacc parser.

-v
Verbose. Inform the user of intermediate processing steps being carried out.

-n
Do not insert #line directives in the generated C++ code.

-k
Keep intermediate files. Without this option, all temporary files created during intermediate processing are deleted.

OUTPUT

If the input file name is filename.y++, two output files are generated: filename.H containing declarations of the parsing function and parse tree node classes, which can be included in any source file that wishes to use the parser; and filename.C containing the parser itself.

INPUT

The grammar specification accepted by yacc++ conforms to the following BNF description, in which [...] denotes optional items and ... denotes repetition one or more times.

<grammar> ::=
<header> [<pre-code>]
[<lexical-definition>...]
[<node-class-definition>]
[<class-definition>...]
[<rule>...]
[<post-code>]

<header> ::=
"parser" <parser-name> ";"

<pre-code> ::=
"{" <C++-code> "}"

<lexical-definition> ::=
<token-definition> | <whitespace-definition>

<token-definition> ::=
"token" [<token-type>] <token-name> "=" <pattern> ";"

<token-type> ::=
"char" "*" | "int" | "float" | "double"

<whitespace-definition> ::=
"ignore" ["nested"] <pattern> [".." <pattern>] ";"

<pattern> ::=
<quoted-string> | "("<regular-expression>")"

<node-class-definition> ::=
"node" <class-code> ";"

<class-definition> ::=
[<superclass-name>] "class" <class-name> <class-code> ";"

<rule> ::=
[<superclass-name>] "rule" <nonterminal> [<class-code>] ":" [<alternative>...] ";"

<alternative> ::=
[<item>...] [<class-code>]

<item> ::=
<nonterminal> | <terminal>

<terminal> ::= <pattern> | <token-name>

<class-code> ::=
"{" <C++-code> "}"

<post-code> ::=
"{" <C++-code> "}"

Parse Tree Classes
The nodes from which the parse tree is built are all subclasses of yynode, which yacc++ predefines as follows:

struct yynode {

char *yysource;
// Source file name
int *yyline;
// Line number };

The fields yysource and yyline hold the source file name and line number at the point where the node was created, so that they are readily available for the purpose of generating error messages later on.

For each nonterminal symbol in the grammar, yacc++ defines a subclass of yynode whose name is the same as the name of the nonterminal. Then for each alternative expansion of the nonterminal, a further subclass of the nonterminal class is defined. When the parser recognises an instance of the nonterminal during parsing, an instance of one of these subclasses is created, depending on which alternative was encountered. (The names of the subclasses are internally generated, and it is not necessary to know them.)

For each nonterminal appearing in one of the right hand sides of the rule, there is a field in the subclass for that RHS, pointing to the corresponding subtree. Methods defined in the yacc++ file can refer to these fields as $1, $2, $3, etc. (However, they are actually named yyval1, etc., and if you want to refer to them from code outside the yacc++ file, you will need to use these names.)

As an example, consider the following yacc++ rule:

rule TERM : "-" TERM
| FACTOR "+" TERM
| FACTOR ;

This rule results in the following class definitions:

struct TERM: yynode {
};

struct TERM_1: TERM {
TERM *yyval2;
};

struct TERM_2: TERM {
FACTOR *yyval1;
TERM *yyval3;
}; struct TERM_3: TERM {
FACTOR *yyval1;
};

(The class names TERM_1, etc. are for illustration purposes only, and are not necessarily the actual names generated by yacc++.)

This rule as it stands is almost useless, because there is no way of distinguishing between the different subclasses of TERM that a TERM * can point to. The next section describes how to make the parse tree nodes more useful by adding methods to their class definitions.

Method Definitions
Both the left and right hand sides of a rule can be augmented by C++ declarations enclosed in braces. Anything that can appear inside a C++ struct or class declaration can be placed between the braces, and they will be added to the appropriate class definition. For instance, we might augment the above rule as follows:

rule TERM {
virtual int Eval() = 0;
}

: "-" TERM {
int Eval() {return - $2->Eval();} }

| FACTOR "+" TERM {
int Eval() {return $1->Eval() + $3->Eval();} }

| FACTOR {
int Eval() {return $1->Eval();} }

;

Lexical Tokens
Terminal symbols, or lexical tokens, can appear anywhere in the right hand side of a rule. A lexical token is one of:

(a)
A string of characters enclosed in double quotes.

(b)
A regular expression enclosed in parentheses.

A given lexical token can be used in as many places as needed; there is no requirement to have a unique definition of each token. However, a token can be given a symbolic name using a token definition, e.g.

token OPERATOR = ([+-*/%]) ;

A token definition can also be used to assign a type to a token, e.g.

token int INTEGER = ([+-]?[0-9]+) ; token char *IDENT = ([A-Za-z][A-Za-z0-9]*) ;

The available token types are char *, int, float and double. The value of a char * token is the text of the token. The value of an int, float or double token is obtained by converting its text value using atoi() or atof() as appropriate.

When the name of a typed token appears in the right hand side of a rule, a field of the appropriate type is placed in the node and may be referred to using the $n notation. For example, we might extend our expression grammar with:

rule FACTOR {
virtual int Eval() = 0;
}

: INTEGER {
int Eval() {return $1;}
}

;

White Space
By default, the generated parser knows nothing about white space. Usually you will want some set of characters, such as blanks, tabs and newlines, to be treated as token separators and otherwise ignored. To specify this, you use the ignore construct:

ignore ([ \t\n]) ;

Multiple ignore statements can be given. To treat everything from "//" to the end of a line as a comment:

ignore ("//".*$) ;

A second form allows text between a pair of patterns to be ignored. To ignore C-style comments:

ignore "/*" .. "*/" ;

This form disallows nested occurrences of the pattern. A third form is used to specify that nesting is allowed:

ignore nested "(*" .. "*)" ;

You should always use the second or third form of ignore to absorb text which can span more than one line. Attempting to match a large amount of text using a single-pattern ignore can cause the lexical analyser's token buffer to overflow.

Custom Node Classes
Sometimes it is desirable for all the nodes in the parse tree to have certain properties in common. For instance, the nodes TERM and FACTOR above both had an Eval() method, and it was necessary to declare this method on the left hand side of both rules, since the two classes were independently derived from yynode.

An easier way is to use the node construct. This construct declares a class from which all nodes are to be derived:

node ExprNode {
virtual void Eval() = 0;
}

This declares a subclass of yynode called ExprNode. The classes of all left hand sides are now derived from ExprNode instead of directly from yynode. As a result, there is no need to redeclare Eval() on the left hand side of any rule, since the single declaration in ExprNode serves for all.

In some cases, a single superclass for all nodes is not sufficient. For instance, a programming language might have several rules for different kinds of STATEMENT, as well as rules for different kinds of EXPR. All the STATEMENT nodes need to be similar, as do all the EXPR nodes, but they are quite different from each other.

To provide a solution to this, the class construct can be used to declare as many classes as desired, and individual rules may be declared as being derived from a particular class. For example,

class StatNode {
/* Things that all Statements are to have */ };

class ExprNode {
/* Things that all Expressions are to have */ };

StatNode rule STATEMENT : IFSTAT | WHILESTAT | ... /* etc */ StatNode rule IFSTAT : ...
StatNode rule WHILESTAT : ...

ExprNode rule EXPR : TERM | EXPR "+" TERM | ... ExprNode rule TERM : FACTOR | TERM "*" FACTOR | ...

By default, a class declared with class is derived from the class declared with node, or if none, from yynode. A specific superclass can also be specified:

ExprNode class ConstExpr {
/* Things that constant expressions are to have */ }

Note: Every class that is used as the superclass of a rule must ultimately inherit from yynode. If you only use the names of classes defined using the node or class statements of yacc++ as rule superclasses, this requirement is automatically satisfied.

Pre-code and post-code
The pre-code and post-code sections of the input file provide places to put arbitrary pieces of C++ code. The precode is inserted at the beginning of the generated yacc file between %{ and %}, and is typically used for #include statements and declarations used in later pieces of C++ code. The post-code is appended to the end of the yacc file after %%. The definitions of utility functions used elsewhere in the input file can be placed here.

Using the Parser
The header at the beginning of the input file establishes the name of the generated parsing function. The first rule in the grammar defines the start symbol, and its class determines the return type of the parsing function. For instance, if the header is

parser ParseExpr;

and the first rule begins

ExprNode rule EXPR : ...

then the parsing function will have the following prototype:

ExprNode *ParseExpr(class istream &input, yyresult &result,
char *source);

If the start rule has no superclass specified, the parsing function return type will be a pointer to the class declared using the node statement, or if there is no node statement, it will be yynode *.

When the parsing function is called, it reads input from the stream input. If parsing is succesful, it builds a parse tree and returns a pointer to the root node. The source parameter is assumed to contain the filename corresponding to the input stream, and it becomes the default value assigned to the yysource field of each parse tree node.

If a parse error occurs, the parsing function returns 0 with information about the error in result structure:

struct yyresult {

char *message;
// String describing the error
char *source;
// Name of the source file
int line;
// Line number in the source file
char *token;
// Token near where error occured };

As an example, here is a short main program that invokes our expression parser and evaluates the expression:

#include <iostream.h>

#include "ParseExpr.H"
// The header file generated by yacc++

main() {
yyresult result;
ExprNode *tree = ParseExpr(cin, result, "standard input"); if (tree)
cout << tree->Eval();
else
cerr << result.source << " line " << result.line << " near " << result.token << ": " << result.message << "0;
}

NOTES

The word "rule" is a completely reserved word and cannot be used as an identifier in C++ code in the input file. This is to ensure that errors involving mismatched braces are detected in a timely fashion.

All of the possible expansions for a given nonterminal must be specified in the same rule statement. Unlike yacc, you cannot have two rules for the same nonterminal.

Yacc++ uses bison(1) and flex(1) to process the generated yacc and lex files. Your PATH needs to include these programs.

FILES

filename.y++
Grammar specification input file

filename.H
Parser interface output file

filename.C
Parser implementation output file

BUGS

Certain kinds of error result in error messages that are referenced to the generated yacc file rather than the original input file.

AUTHOR

G. C. Ewing

SEE ALSO

yacc(1), lex(1), bison(1), flex(1)


Table of Contents