lex & yacc
`by John R. Levine, Tony Mason and Doug Brown
`
Copyright © 1990, 1992 O'Reilly & Associates, Inc. All rights reserved.
`Printed in the United States of America.
`
`Editor: Dale Dougherty
`
Printing History:

May 1990:      First Edition.
January 1991:  Minor corrections.
October 1992:  Second Edition. Major revisions.
February 1995: Minor corrections.
`
`Nutshell Handbook, the Nutshell Handbook logo, and the O'Reilly logo are
`registered trademarks and The Java Series is a trademark of O'Reilly & Associates,
`Inc. Many of the designations used by manufacturers and sellers to distinguish their
`products are claimed as trademarks. Where those designations appear in this book,
`and O'Reilly and Associates, Inc. was aware of a trademark claim, the designations
`have been printed in caps or initial caps.
`
`While every precaution has been taken in the preparation of this book, the publisher
`assumes no responsibility for errors or omissions, or for damages resulting from the
`use of the information contained herein.
`
`ISBN: 1-56592-000-7
`
`
`
`
Availability of Lex and Yacc
Lex and yacc were both developed at Bell Laboratories in the 1970s. Yacc
was the first of the two, developed by Stephen C. Johnson. Lex was
designed by Mike Lesk and Eric Schmidt to work with yacc. Both lex and
yacc have been standard UNIX utilities since 7th Edition UNIX. System V
and older versions of BSD use the original AT&T versions, while the newest
version of BSD uses flex (see below) and Berkeley yacc. The articles written
by the developers remain the primary source of information on lex and
yacc.
`
The GNU Project of the Free Software Foundation distributes bison, a yacc
replacement; bison was written by Robert Corbett and Richard Stallman.
The bison manual, written by Charles Donnelly and Richard Stallman, is
excellent, especially for referencing specific features. Appendix D
discusses bison.
`
BSD and GNU Project also distribute flex (Fast Lexical Analyzer Generator),
"a rewrite of lex intended to right some of that tool's deficiencies," according
to its reference page. Flex was originally written by Jef Poskanzer; Vern
Paxson and Van Jacobson have considerably improved it and Vern currently
maintains it. Appendix E covers topics specific to flex.
`
There are at least two versions of lex and yacc available for MS-DOS and
OS/2 machines. MKS (Mortice Kern Systems Inc.), publishers of the MKS
Toolkit, offers lex and yacc as a separate product that supports many PC C
compilers. MKS lex and yacc comes with a very good manual. Appendix F
covers MKS lex and yacc. Abraxas Software publishes PCYACC, a version of
lex and yacc which comes with sample parsers for a dozen widely used
programming languages. Appendix G covers Abraxas' version of lex and
yacc.
`
`Sample Programs
The programs in this book are available free from UUNET (that is, free
except for UUNET's usual connect-time charges). If you have access to
UUNET, you can retrieve the source code using UUCP or FTP. For UUCP, find
a machine with direct access to UUNET, and type the following command:

uucp uunet\!~/nutshell/lexyacc/progs.tar.Z yourhost\!~/yourname/

The backslashes can be omitted if you use the Bourne shell (sh) instead of
the C shell (csh). The file should appear some time later (up to a day or
more) in the directory /usr/spool/uucppublic/yourname. If you don't have
`an account but would like one so that you can get electronic mail, then
`contact UUNET at 703-204-8000.
`
To use ftp, find a machine with direct access to the Internet. Here is a
sample session, with commands in boldface.

% ftp ftp.oreilly.com
Connected to ftp.oreilly.com.
220 FTP server (Version 5.99 Wed May 23 14:40:19 EDT 1990) ready.
Name (ftp.oreilly.com:yourname): anonymous
331 Guest login ok, send ident as password.
Password: ambar@ora.com (use your user name and host here)
230 Guest login ok, access restrictions apply.
ftp> cd published/oreilly/nutshell/lexyacc
250 CWD command successful.
ftp> binary (you must specify binary transfer for compressed files)
200 Type set to I.
ftp> get progs.tar.Z
200 PORT command successful.
150 Opening BINARY mode data connection for progs.tar.Z.
226 Transfer complete.
ftp> quit
221 Goodbye.
%
`
The file is a compressed tar archive. To extract files once you have
retrieved the archive, type:

% zcat progs.tar.Z | tar xf -

System V systems require the following tar command instead:

% zcat progs.tar.Z | tar xof -
`
`Conventions Used in This Handbook
`The following conventions are used in this book:
`
Bold
     is used for statements and functions, identifiers, and program
     names.

Italic
     is used for file, directory, and command names when they
     appear in the body of a paragraph as well as for data types
     and to emphasize new terms and concepts when they are
     introduced.

Constant Width
     is used in examples to show the contents of files or the
     output from commands.

Constant Bold
     is used in examples to show command lines and options that
     you type literally.

Quotes
     are used to identify a code fragment in explanatory text. System
     messages, signs, and symbols are quoted as well.

%
     is the Shell prompt.

[]
     surround optional elements in a description of program syntax.
     (Don't type the brackets themselves.)
`Acknowledgments
The first edition of this book began with Tony Mason's MGL and SGL compilers.
Tony developed most of the material in this book, working with
Dale Dougherty to make it a "Nutshell." Doug Brown contributed Chapter
8, Yacc Ambiguities and Conflicts. Dale also edited and revised portions of
the book. Tim O'Reilly made it a better book by withholding his editorial
blessing until he found what he was looking for in the book. Thanks to
Butch Anton, Ed Engler, and Mike Loukides for their comments on technical
content. Thanks also to John W. Lockhart for reading a draft with an eye
for stylistic issues. And thanks to Chris Reilley for his work on the graphics.
Finally, Ruth Terry brought the book into print with her usual diligence and
her sharp eye for every editorial detail. Though she was trying to work odd
hours to also care for her family, it seemed she was caring for this book all
hours of the day.
`
For the second edition, Tony rewrote Chapters 1 and 2, and Doug updated
Chapter 8. John Levine wrote Chapters 3, 5, 6, 7, and most of the appendices,
and edited the rest of the text. Thanks to the technical reviewers, Bill
Burke, Warren Carithers, Jon Mauney, Gary Merrill, Eugene Miya, Andy
Oram, Bill Torcaso, and particularly Vern Paxson, whose detailed
page-by-page suggestions made the fine points much clearer. Margaret
Levine Young's blue pencil (which was actually pink) tightened up the text
and gave the book editorial consistency. She also compiled most of the
index. Chris Reilley again did the graphics, and Donna Woonteiler did the
final editing and shepherded the book through the production process.
`
In this chapter:
• The Simplest Lex Program
• Recognizing Words with Lex
• Grammars
• Running Lex and Yacc
• Lex vs. Hand-written Lexers
• Exercises
`
`Lex and Yacc
`
`Lex and yacc help you write programs that transform structured input. This
`includes an enormous range of applications-anything from a simple text
`search program that looks for patterns in its input file to a C compiler that
`transforms a source program into optimized object code.
`
In programs with structured input, two tasks that occur over and over are
dividing the input into meaningful units, and then discovering the relationship
among the units. For a text search program, the units would probably
be lines of text, with a distinction between lines that contain a match of the
target string and lines that don't. For a C program, the units are variable
names, constants, strings, operators, punctuation, and so forth. This division
into units (which are usually called tokens) is known as lexical analysis,
or lexing for short. Lex helps you by taking a set of descriptions of possible
tokens and producing a C routine, which we call a lexical analyzer, or
a lexer, or a scanner for short, that can identify those tokens. The set of
descriptions you give to lex is called a lex specification.
`
The token descriptions that lex uses are known as regular expressions,
extended versions of the familiar patterns used by the grep and egrep
commands. Lex turns these regular expressions into a form that the lexer can
use to scan the input text extremely fast, independent of the number of
expressions that it is trying to match. A lex lexer is almost always faster
than a lexer that you might write in C by hand.
`
`As the input is divided into tokens, a program often needs to establish the
`relationship among the tokens. A C compiler needs to find the expressions,
`statements, declarations, blocks, and procedures in the program. This task
`is known as parsing and the list of rules that define the relationships that
`the program understands is a grammar. Yacc takes a concise description
`of a grammar and produces a C routine that can parse that grammar, a
`parser. The yacc parser automatically detects whenever a sequence of
`input tokens matches one of the rules in the grammar and also detects a
`syntax error whenever its input doesn't match any of the rules. A yacc
`parser is generally not as fast as a parser you could write by hand, but the
`ease in writing and modifying the parser is invariably worth any speed loss.
`The amount of time a program spends in a parser is rarely enough to be an
`issue anyway.
`
`When a task involves dividing the input into units and establishing some
`relationship among those units, you should think of lex and yacc. (A
`search program is so simple that it doesn't need to do any parsing so it uses
`lex but doesn't need yacc. We'll see this again in Chapter 2, where we
`build several applications using lex but not yacc.)
`
`By now, we hope we've whetted your appetite for more details. We do not
`intend for this chapter to be a complete tutorial on lex and yacc, but rather
`a gentle introduction to their use.
`
`The Simplest Lex Program
`This lex program copies its standard input to its standard output:
`
%%
.|\n    ECHO;
%%
`
`It acts very much like the UNIX cat command run with no arguments.
`
`Lex automatically generates the actual C program code needed to handle
`reading the input file and sometimes, as in this case, writing the output as
`well.
`
Whether you use lex and yacc to build parts of your program or to build
tools to aid you in programming, once you master them they will prove
their worth many times over by simplifying difficult input handling problems,
providing a more easily maintainable code base, and allowing for easier
"tinkering" to get the right semantics for your program.
`
`
`Recognizing Words with Lex
`Let's build a simple program that recognizes different types of English
`words. We start by identifying parts of speech (noun, verb, etc.) and will
`later extend it to handle multiword sentences that conform to a simple
`English grammar.
`
`We start by listing a set of verbs to recognize:
`
is      am      are     were
was     be      being   been
do      does    did     will
would   should  can     could
has     have    had     go
`
`Example 1-1 shows a simple lex specification to recognize these verbs.
`
Example 1-1: Word recognizer ch1-02.l
`
%{
/*
 * This sample demonstrates (very) simple recognition:
 * a verb/not a verb.
 */
%}

%%

[\t ]+          /* ignore whitespace */ ;

is |
am |
are |
were |
was |
be |
being |
been |
do |
does |
did |
will |
would |
should |
can |
could |
has |
have |
had |
go              { printf("%s: is a verb\n", yytext); }
`
`
Example 1-1: Word recognizer ch1-02.l (continued)
`
[a-zA-Z]+       { printf("%s: is not a verb\n", yytext); }

.|\n            { ECHO; /* normal default anyway */ }

%%

main()
{
    yylex();
}
`
Here's what happens when we compile and run this program. What we
type is in bold.

% example1
did I have fun?
did: is a verb
I: is not a verb
have: is a verb
fun: is not a verb
?
^D
%
`
`To explain what's going on, let's start with the first section:
`
%{
/*
 * This sample demonstrates (very) simple recognition:
 * a verb/not a verb.
 */
`
This first section, the definition section, introduces any initial C program
code we want copied into the final program. This is especially important if,
for example, we have header files that must be included for code later in
the file to work. We surround the C code with the special delimiters "%{"
and "%}." Lex copies the material between "%{" and "%}" directly to the
generated C file, so you may write any valid C code here.
`
In this example, the only thing in the definition section is some C comments.
You might wonder whether we could have included the comments
without the delimiters. Outside of "%{" and "%}", comments must be
indented with whitespace for lex to recognize them correctly. We've seen
some amazing bugs when people forgot to indent their comments and lex
interpreted them as something else.
`
`
`The %% marks the end of this section.
`
The next section is the rules section. Each rule is made up of two parts: a
pattern and an action, separated by whitespace. The lexer that lex generates
will execute the action when it recognizes the pattern. These patterns
are UNIX-style regular expressions, a slightly extended version of the same
expressions used by tools such as grep, sed, and ed. Chapter 6 describes all
the rules for regular expressions. The first rule in our example is the
following:
`
[\t ]+          /* ignore whitespace */ ;
`
The square brackets, "[ ]", indicate that any one of the characters within the
brackets matches the pattern. For our example, we accept either "\t" (a tab
character) or " " (a space). The "+" means that the pattern matches one or
more consecutive copies of the subpattern that precedes the plus. Thus,
this pattern describes whitespace (any combination of tabs and spaces).
The second part of the rule, the action, is simply a semicolon, a do-nothing
C statement. Its effect is to ignore the input.
`
The next set of rules uses the "|" (vertical bar) action. This is a special
action that means to use the same action as the next pattern, so all of the
verbs use the action specified for the last one.*
`
Our first set of patterns is:

is |
am |
are |
were |
was |
be |
being |
been |
do |
does |
did |
will |
would |
should |
can |
could |
has |
have |
had |
go              { printf("%s: is a verb\n", yytext); }
`
*You can also use a vertical bar within a pattern, e.g., foo|bar is a pattern that matches either
the string "foo" or the string "bar." We leave some space between the pattern and the vertical
bar to indicate that the bar is the action, not part of the pattern.
`
`
Our patterns match any of the verbs in the list. Once we recognize a verb,
we execute the action, a C printf statement. The array yytext contains the
text that matched the pattern. This action will print the recognized verb
followed by the string ": is a verb\n".
`
The last two rules are:

[a-zA-Z]+       { printf("%s: is not a verb\n", yytext); }

.|\n            { ECHO; /* normal default anyway */ }
`
The pattern "[a-zA-Z]+" is a common one: it indicates any alphabetic string
with at least one character. The "-" character has a special meaning when
used inside square brackets: it denotes a range of characters beginning
with the character to the left of the "-" and ending with the character to its
right. Our action when we see one of these patterns is to print the matched
token and the string ": is not a verb\n".
`
It doesn't take long to realize that any word that matches any of the verbs
listed in the earlier rules will match this rule as well. You might then
wonder why it won't execute both actions when it sees a verb in the list.
And would both actions be executed when it sees the word "island," since
"island" starts with "is"? The answer is that lex has a set of simple
disambiguating rules. The two that make our lexer work are:

1. Lex patterns only match a given input character or string once.

2. Lex executes the action for the longest possible match for the current
   input. Thus, lex would see "island" as matching our all-inclusive rule
   because that was a longer match than "is."
`
`If you think about how a lex lexer matches patterns, you should be able to
`see how our example matches only the verbs listed.
`
The last line is the default case. The special character "." (period) matches
any single character other than a newline, and "\n" matches a newline
character. The special action ECHO prints the matched pattern on the output,
copying any punctuation or other characters. We explicitly list this
case although it is the default behavior. We have seen some complex
lexers that worked incorrectly because of this very feature, producing
occasional strange output when the default pattern matched unanticipated input
characters. (Even though there is a default action for unmatched input
characters, well-written lexers invariably have explicit rules to match all
possible input.)
`
`
The end of the rules section is delimited by another %%.
`
`The final section is the user subroutines section, which can consist of any
`legal C code. Lex copies it to the C file after the end of the lex generated
`code. We have included a main() program.
%%

main()
{
    yylex();
}
`
`The lexer produced by lex is a C routine called yylex(), so we call it.*
`Unless the actions contain explicit return statements, yylex() won't return
`until it has processed the entire input.
`
We placed our original example in a file called ch1-02.l since it is our second
example. To create an executable program on our UNIX system we
enter these commands:

% lex ch1-02.l
% cc lex.yy.c -o first -ll

Lex translates the lex specification into a C source file called lex.yy.c which
we compiled and linked with the lex library -ll. We then execute the
resulting program to check that it works as we expect, as we saw earlier in
this section. Try it to convince yourself that this simple description really
does recognize exactly the verbs we decided to recognize.
`
`Now that we've tackled our first example, let's "spruce it up." Our second
`example, Example 1-2, extends the lexer to recognize different parts of
`speech.
`
Example 1-2: Lex example with multiple parts of speech ch1-03.l
`
%{
/*
 * We expand upon the first example by adding recognition of some other
 * parts of speech.
 */
%}

%%

[\t ]+          /* ignore whitespace */ ;
`
`* Actually, we could have left out the main program because the lex library contains a default
`main routine just like this one.
`
`
Example 1-2: Lex example with multiple parts of speech ch1-03.l (continued)
`
is |
am |
are |
were |
was |
be |
being |
been |
do |
does |
did |
will |
would |
should |
can |
could |
has |
have |
had |
go              { printf("%s: is a verb\n", yytext); }

very |
simply |
gently |
quietly |
calmly |
angrily         { printf("%s: is an adverb\n", yytext); }

to |
from |
behind |
above |
below |
between         { printf("%s: is a preposition\n", yytext); }

if |
then |
and |
but |
or              { printf("%s: is a conjunction\n", yytext); }

their |
my |
your |
his |
her |
its             { printf("%s: is an adjective\n", yytext); }

I |
you |
he |
she |
we |
they            { printf("%s: is a pronoun\n", yytext); }

[a-zA-Z]+       {
                  printf("%s: don't recognize, might be a noun\n", yytext);
                }

.|\n            { ECHO; /* normal default anyway */ }

%%

main()
{
    yylex();
}
`
`Symbol Tables
`Our second example isn't really very different. We list more words than we
`did before, and in principle we could extend this example to as many
`words as we want. It would be more convenient, though, if we could build
`a table of words as the lexer is running, so we can add new words without
`modifying and recompiling the lex program. In our next example, we do
`just that-allow for the dynamic declaration of parts of speech as the lexer
`is running, reading the words to declare from the input file. Declaration
`lines start with the name of a part of speech followed by the words to
`declare. These lines, for example, declare four nouns and three verbs:
`noun dog cat horse cow
`verb chew eat lick
`
The table of words is a simple symbol table, a common structure in lex and
yacc applications. A C compiler, for example, stores the variable and structure
names, labels, enumeration tags, and all other names used in the program
in its symbol table. Each name is stored along with information
describing the name. In a C compiler the information is the type of symbol,
declaration scope, variable type, etc. In our current example, the information
is the part of speech.
`
`Adding a symbol table changes the lexer quite substantially. Rather than
`putting separate patterns in the lexer for each word to match, we have a
`single pattern that matches any word and we consult the symbol table to
`decide which part of speech we've found. The names of parts of speech
`
(noun, verb, etc.) are now "reserved words" since they introduce a declaration
line. We still have a separate lex pattern for each reserved word. We
also have to add symbol table maintenance routines, in this case
add_word(), which puts a new word into the symbol table, and
lookup_word(), which looks up a word which should already be entered.
`
In the program's code, we declare a variable state that keeps track of
whether we're looking up words, state LOOKUP, or declaring them, in
which case state remembers what kind of words we're declaring. Whenever
we see a line starting with the name of a part of speech, we set the
state to declare that kind of word; each time we see a \n we switch back to
the normal lookup state.
`
`Example 1-3 shows the definition section.
`
Example 1-3: Lexer with symbol table (part 1 of 3) ch1-04.l
`
%{
/*
 * Word recognizer with a symbol table.
 */

enum {
    LOOKUP = 0, /* default - looking rather than defining. */
    VERB,
    ADJ,
    ADV,
    NOUN,
    PREP,
    PRON,
    CONJ
};

int state;

int add_word(int type, char *word);
int lookup_word(char *word);
%}
`
We define an enum to record the types of individual words in our table,
and we declare a variable state. We use this enumerated type both in the
state variable to track what we're defining and in the symbol table to
record what type each defined word is. We also declare our symbol table
routines.
`
`Example 1-4 shows the rules section.
`
`
Example 1-4: Lexer with symbol table (part 2 of 3) ch1-04.l
`
%%
\n      { state = LOOKUP; }  /* end of line, return to default state */

        /* whenever a line starts with a reserved part of speech name */
        /* start defining words of that type */
^verb   { state = VERB; }
^adj    { state = ADJ; }
^adv    { state = ADV; }
^noun   { state = NOUN; }
^prep   { state = PREP; }
^pron   { state = PRON; }
^conj   { state = CONJ; }

[a-zA-Z]+ {
            /* a normal word, define it or look it up */
            if(state != LOOKUP) {
                /* define the current word */
                add_word(state, yytext);
            } else {
                switch(lookup_word(yytext)) {
                case VERB: printf("%s: verb\n", yytext); break;
                case ADJ:  printf("%s: adjective\n", yytext); break;
                case ADV:  printf("%s: adverb\n", yytext); break;
                case NOUN: printf("%s: noun\n", yytext); break;
                case PREP: printf("%s: preposition\n", yytext); break;
                case PRON: printf("%s: pronoun\n", yytext); break;
                case CONJ: printf("%s: conjunction\n", yytext); break;
                default:
                    printf("%s: don't recognize\n", yytext);
                    break;
                }
            }
          }

.       /* ignore anything else */ ;

%%
`
For declaring words, the first group of rules sets the state to the type corresponding
to the part of speech being declared. (The caret, "^", at the beginning
of the pattern makes the pattern match only at the beginning of an
input line.) We reset the state to LOOKUP at the beginning of each line so
that after we add new words interactively we can test our table of words to
determine if it is working correctly. If the state is LOOKUP when the pattern
"[a-zA-Z]+" matches, we look up the word, using lookup_word(), and if
found print out its type. If we're in any other state, we define the word
with add_word().
`
`
`The user subroutines section in Example 1-5 contains the same skeletal
`main() routine and our two supporting functions.
`
Example 1-5: Lexer with symbol table (part 3 of 3) ch1-04.l
`
main()
{
    yylex();
}

/* define a linked list of words and types */
struct word {
    char *word_name;
    int word_type;
    struct word *next;
};

struct word *word_list; /* first element in word list */

extern void *malloc();

int
add_word(int type, char *word)
{
    struct word *wp;

    if(lookup_word(word) != LOOKUP) {
        printf("!!! warning: word %s already defined \n", word);
        return 0;
    }

    /* word not there, allocate a new entry and link it on the list */
    wp = (struct word *) malloc(sizeof(struct word));

    wp->next = word_list;

    /* have to copy the word itself as well */
    wp->word_name = (char *) malloc(strlen(word)+1);
    strcpy(wp->word_name, word);
    wp->word_type = type;
    word_list = wp;
    return 1;   /* it worked */
}

int
lookup_word(char *word)
{
    struct word *wp = word_list;

    /* search down the list looking for the word */
    for(; wp; wp = wp->next) {
        if(strcmp(wp->word_name, word) == 0)
            return wp->word_type;
    }

    return LOOKUP;  /* not found */
}
`
`These last two functions create and search a linked list of words. If there
`are a lot of words, the functions will be slow since, for each word, they
`might have to search through the entire list. In a production environment
`we would use a faster but more complex scheme, probably using a hash
`table. Our simple example does the job, albeit slowly.
`
`Here is an example of a session we had with our last example:
`verb is am are was were be being been do
`is
`is: verb
`noun dog cat horse cow
`verb chew eat lick
`verb run stand sleep
`dog run
`dog: noun
`run: verb
`chew eat sleep cow horse
`chew: verb
`eat: verb
`sleep: verb
`cow: noun
`horse: noun
`verb talk
`talk
`talk: verb
`
`We strongly encourage you to play with this example until you are satisfied
`you understand it.
`
`Grammars
For some applications, the simple kind of word recognition we've already
done may be more than adequate; others need to recognize specific
sequences of tokens and perform appropriate actions. Traditionally, a
description of such a set of actions is known as a grammar. It seems
especially appropriate for our example. Suppose that we wished to recognize
common sentences. Here is a list of simple sentence types:
`
`noun verb.
`noun verb noun.
`
At this point, it seems convenient to introduce some notation for describing
grammars. We use the right-facing arrow, "→", to mean that a particular set
of tokens can be replaced by a new symbol.* For instance:

subject → noun | pronoun
`
would indicate that the new symbol subject is either a noun or a pronoun.
We haven't changed the meaning of the underlying symbols; rather we
have built our new symbol from the more fundamental symbols we've
already defined. As an added example we could define an object as follows:
`
object → noun
`
While not strictly correct as English grammar, we can now define a sentence:

sentence → subject verb object
`
`Indeed, we could expand this definition of sentence to fit a much wider
`variety of sentences. However, at this stage we would like to build a yacc
`grammar so we can test our ideas out interactively. Before we introduce
`our yacc grammar, we must modify our lexical analyzer in order to return
`values useful to our new parser.
`
`Parser-Lexer Communication
When you use a lex scanner and a yacc parser together, the parser is the
higher level routine. It calls the lexer yylex() whenever it needs a token
from the input. The lexer then scans through the input recognizing tokens.
As soon as it finds a token of interest to the parser, it returns to the parser,
returning the token's code as the value of yylex().
`
Not all tokens are of interest to the parser-in most programming languages
the parser doesn't want to hear about comments and whitespace,
for example. For these ignored tokens, the lexer doesn't return so that it
can continue on to the next token without bothering the parser.

*We say symbol rather than token here, because we reserve the name "token" for symbols returned
from the lexer, and the symbol to the left of the arrow did not come from the lexer. All
tokens are symbols, but not all symbols are tokens.
`
The lexer and the parser have to agree what the token codes are. We solve
this problem by letting yacc define the token codes. The tokens in our
grammar are the parts of speech: NOUN, PRONOUN, VERB, ADVERB,
ADJECTIVE, PREPOSITION, and CONJUNCTION. Yacc defines each of these as a
small integer using a preprocessor #define. Here are the definitions it used
in this example:

# define NOUN 257
# define PRONOUN 258
# define VERB 259
# define ADVERB 260
# define ADJECTIVE 261
# define PREPOSITION 262
# define CONJUNCTION 263
`
`Token code zero is always returned for the logical end of the input. Yacc
`doesn't define a symbol for it, but you can yourself if you want.
`
Yacc can optionally write a C header file containing all of the token definitions.
You include this file, called y.tab.h on UNIX systems and ytab.h or
yytab.h on MS-DOS, in the lexer and use the preprocessor symbols in your
lexer action code.
`
`The Parts of Speech Lexer
`Example 1-6 shows the declarations and rules sections of the new lexer.
`
Example 1-6: Lexer to be called from the parser ch1-05.l
`
%{
/*
 * We now build a lexical analyzer to be used by a higher-level parser.
 */

#include "y.tab.h"  /* token codes from the parser */

#define LOOKUP 0    /* default - not a defined word type. */

int state;

%}

%%

\n      { state = LOOKUP; }

\.\n    { state = LOOKUP;
          return 0; /* end of sentence */
        }

^verb   { state = VERB; }
^adj    { state = ADJECTIVE; }
^adv    { state = ADVERB; }
^noun   { state = NOUN; }
^prep   { state = PREPOSITION; }
^pron   { state = PRONOUN; }
^conj   { state = CONJUNCTION; }

[a-zA-Z]+ {
            if(state != LOOKUP) {
                add_word(state, yytext);
            } else {
                switch(lookup_word(yytext)) {
                case VERB:
                    return(VERB);
                case ADJECTIVE:
                    return(ADJECTIVE);
                case ADVERB:
                    return(ADVERB);
                case NOUN:
                    return(NOUN);
                case PREPOSITION:
                    return(PREPOSITION);
                case PRONOUN:
                    return(PRONOUN);
                case CONJUNCTION:
                    return(CONJUNCTION);
                default:
                    printf("%s: don't recognize\n", yytext);
                    /* don't return, just ignore it */
                }
            }
          }

%%

... same add_word() and lookup_word() as before ...
`
There are several important differences here. We've changed the part of
speech names used in the lexer to agree with the token names in the
parser. We have also added return statements to pass to the parser the
token codes for the words that it recognizes. There aren't any return statements
for the tokens that define new words to the lexer, since the parser
doesn't care about them.
`
`
`These return statements show that yylex() acts like a coroutine. Each time
`the parser calls it, it takes up processing at the exact point it left off. This
`allows us to examine and operate upon the input stream incrementally.
`Our first programs didn't need to take advantage of this, but it becomes
`more useful as we use th