Notes before getting started

To compile the example located on GitHub you'll need a few things. First, you'll need the Flex utility. Some systems ship it with their developer or build tools, but make sure it is a relatively recent version (e.g., 2.5.37). One issue that might arise for Apple users is an incompatible version of Flex: the default Apple-provided version is 2.5.35, and the header file that comes with it won't work with this example. One solution is to change the example, but that would make it incompatible with any recent version of Flex. A better solution is simply to upgrade Flex to a newer version. For more information on this issue for Apple users, a quick web search will turn up the details.

The Flex Scanner

How the Flex scanner itself works is beyond the scope of this tutorial. There are plenty of references on how to get started with a scanner, especially in C. There are lots of other examples out there, but I found that most require quite a bit of work to get going and are far from complete. I based the example on some other Flex/Bison examples that create something similar to the Unix word count utility (wc). The language I used differs slightly from wc in the following respects: words consist only of alphabetic characters, digits are simply counted as characters, and I also count upper and lower case characters.
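To make those counting rules concrete, here is a small hand-written sketch in plain C++ with no Flex or Bison involved. The names Counts and count_text are made up for illustration, and the assumption that every byte (including whitespace) counts as a character is mine, not something taken from the example code.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical result type; the real example keeps similar counters in its driver.
struct Counts {
    std::size_t chars = 0;
    std::size_t words = 0;
    std::size_t lines = 0;
    std::size_t upper = 0;
    std::size_t lower = 0;
};

// Count characters, words, lines, and upper/lower case letters.
// A "word" is a maximal run of alphabetic characters; digits only add to chars.
Counts count_text(const std::string &text)
{
    Counts c;
    bool in_word = false;
    for (unsigned char ch : text) {
        ++c.chars;                      // assumption: every byte is a character
        if (ch == '\n') {
            ++c.lines;
        }
        if (std::isalpha(ch)) {
            if (std::isupper(ch)) {
                ++c.upper;
            } else {
                ++c.lower;
            }
            if (!in_word) {             // first letter of a new word
                ++c.words;
                in_word = true;
            }
        } else {
            in_word = false;            // digits, spaces, punctuation end a word
        }
    }
    return c;
}
```

Under these rules "abc123def" is two words, since the digits break the run of letters, whereas wc would report one.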

The Flex Scanner Code

A brief explanation of the scanner file

On line 3 the C++ string header file is included since we use it within this code. The scanner header file is included on line 6; it declares the scanner class itself, which extends yyFlexLexer found in FlexLexer.h, and we'll talk about it next. Next you'll see a using statement, included simply to make the namespacing a bit shorter to type when returning tokens. In previous versions of this tutorial I'd added a macro to create actual string objects from the char string yytext. Bison now defines a template function to cast/create the proper objects for you, so this version of the tutorial uses this more modern construct. If you're interested in the details, see the Bison documentation. In the past you had to define your own union manually. Now that is largely done for you based on the token types, which is much nicer. On line 7, notice that YY_DECL is #defined to the signature of the yylex function that is declared in the mc_scanner.hpp header file. The signature in the #define must match that declaration (minus virtual or other class header keywords). This is extremely important; otherwise you will hit the default section within the generated code and get the default yylex() function (discussed in detail later).
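To see why the signatures have to agree, it helps to remember that YY_DECL is just a macro that the generated code drops in front of the scanner body as the function header. Here is a stand-alone sketch of that idea. DemoScanner and DEMO_DECL are hypothetical stand-ins, not the names used in the example or in Flex itself.

```cpp
#include <string>

// Hypothetical stand-in for the scanner class declared in the scanner header.
class DemoScanner {
public:
    virtual ~DemoScanner() = default;
    // The in-class declaration. The macro below must repeat this signature,
    // minus keywords such as `virtual` that only belong inside the class.
    virtual int yylex(std::string *lval);
};

// Plays the role of YY_DECL: a macro that expands to the function header.
#define DEMO_DECL int DemoScanner::yylex(std::string *lval)

// Plays the role of the generated scanner body, which is written as
// "<macro> { ... }" so that the user controls the signature.
DEMO_DECL
{
    *lval = "hello";   // hand a semantic value back to the caller
    return 1;          // hypothetical token code
}
```

If the macro and the class declaration disagreed, the definition above would not match any member of DemoScanner and the compiler would reject it.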

On line 14 you'll notice that I've defined yyterminate. I would like it to return a token type instead of NULL, so we define it before the code below is read (from the generated file). This way yyterminate() is already defined by the time the compiler reaches that point in the generated code, so our version ends up in the compiled code and not the default one.
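The mechanism at work is the ordinary "define it first, and the guarded default is skipped" preprocessor pattern. A minimal sketch of that pattern follows, with made-up names (demo_terminate, DemoToken) and a made-up token value, so the override is easy to see. It is not the actual generated code.

```cpp
// Hypothetical token codes; in the real example they come from the
// Bison-generated header.
enum DemoToken { DEMO_WORD = 1, DEMO_END = 7 };

// User section: this appears before the "generated" part of the file.
#define demo_terminate() return DEMO_END

// "Generated" section: supplies a default only if nothing was defined yet.
#ifndef demo_terminate
#define demo_terminate() return 0
#endif

// The scanner body uses whichever definition won.
int next_token(bool at_end_of_input)
{
    if (at_end_of_input) {
        demo_terminate();   // expands to "return DEMO_END"
    }
    return DEMO_WORD;
}
```

Because the user's definition is seen first, the #ifndef guard is false and the default "return 0" is never compiled in.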

Next you'll see that there are several options selected. Most of these are self-explanatory or are explained within the Flex documentation; however, the ones you don't want to miss are the nodefault, yyclass=, noyywrap and c++ options. The yyclass option indicates what the scanner class is actually called. This is the same class whose header is included on line 6.

The Scanner Class Definition

Line 5 simply includes the FlexLexer.h header that defines yyFlexLexer. We then include the Bison-generated header file, which includes the token definitions. The constructor is relatively straightforward: we simply call the yyFlexLexer constructor, then initialize the private yylval pointer to nullptr. The only function that we need to define in this simple example is yylex, whose signature we defined as YY_DECL earlier. One thing to note is that yylex() is now declared as virtual in the base class, so the compiler can complain when you define your own function with a different signature. To get around this, we've added a using statement to ensure the compiler knows that we meant to define a new yylex on line 24. Just to show why we undefined YY_DECL, here's a code snippet from the generated lex.yy.cc:

By defining YY_DECL ourselves, we get rid of the native one. That pretty much concludes the scanner portion; on to the parser.
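The using statement mentioned in the class definition deserves a quick illustration. When a derived class declares a function with the same name as a base-class virtual but different parameters, the base version is hidden, and some compilers warn about it (clang and GCC have an overloaded-virtual warning for this case). Here is a stand-alone sketch with hypothetical class names. BaseLexer stands in for yyFlexLexer and is not the real class.

```cpp
// Hypothetical stand-in for yyFlexLexer, which declares a virtual yylex().
class BaseLexer {
public:
    virtual ~BaseLexer() = default;
    virtual int yylex() { return -1; }
};

// Hypothetical stand-in for the tutorial's scanner class.
class DemoScanner : public BaseLexer {
public:
    // Without this line the yylex(int *) overload below would hide
    // BaseLexer::yylex(); the using-declaration keeps both visible.
    using BaseLexer::yylex;

    // A new overload with the extra parameter the parser will pass in.
    virtual int yylex(int *lval)
    {
        *lval = 42;   // hypothetical semantic value
        return 1;     // hypothetical token code
    }
};
```

With the using-declaration in place, both s.yylex() and s.yylex(&value) compile on a DemoScanner s. Remove it and the zero-argument call no longer resolves.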

The Bison C++ Parser File

Exactly how to write a language and express it such that the parser can parse it is definitely beyond the scope of this tutorial; there are plenty of books on how to do this. Here is just a simple description of how to define the C++ parser. The first line declares that we want to use the lalr1.cc skeleton file (if you want to learn what types of skeletons are available, the documentation does a fairly good job going through them). We then declare the required version of Bison (for this example version 3.0+, however you can download an example that works with version 2.5 through 2.7 here: link). There's a bit of new syntax, which is the reason for the version differentiation. The debug option is set on line 3. The namespace that we want this parser to use is defined on line 5, along with the parser class on line 6 (Note: it's usually a good idea to have a unique namespace and parser name, so that when you have multiple parsers they can be kept very distinct). On line 8 the classes that are used within the parser are declared; think of this as a forward declaration. Lines 14 through 21 are included because of a bug (perhaps a feature) that removes this definition when %locations isn't used. Lines 25 and 26 are important, as they define what will be given to the parser when it's instantiated and upon calls to yylex. There are quite a few more options that can be given to Bison; some of those for version 3.0 are listed in the Bison docs. Here's a snippet from the generated parser header, mc_parser.tab.hh, to show our parser constructor:

References to both the passed-in driver and scanner are kept as private members of the parser class. Within the code section we have all the rest of the information that our code will need to compile, including the requisite C++ headers and the driver class that we'll get to shortly. We also declare a static yylex function that will be called within the Bison parse function and that takes the parameters that were defined on lines 25-26. This function calls the scanner's yylex function; any other behavior you need can be defined here. Next we have the simple language, which differs from the way the actual wc functions in the manner mentioned in the intro. Last, the error function and the static yylex function are defined.
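The shape of that arrangement can be sketched in plain C++ without Bison: a parser object that stores references to a scanner and a driver, plus a file-local yylex that simply forwards to the scanner. Everything here (DemoScanner, DemoDriver, DemoParser, the token codes) is a hypothetical stand-in for the generated code, meant only to show how the pieces are wired together.

```cpp
#include <string>

// Hypothetical scanner that hands out a fixed number of "word" tokens.
class DemoScanner {
public:
    explicit DemoScanner(int words) : remaining(words) {}
    int yylex(std::string *lval)
    {
        if (remaining == 0) {
            return 0;          // hypothetical end-of-input token
        }
        --remaining;
        *lval = "word";
        return 1;              // hypothetical WORD token
    }
private:
    int remaining;
};

// Hypothetical driver that collects the results.
struct DemoDriver {
    int words = 0;
    void add_word(const std::string &) { ++words; }
};

// File-local yylex with the extra parameters the parser was told to pass.
// It only forwards to the scanner; extra behavior could be added here.
static int yylex(std::string *lval, DemoScanner &scanner, DemoDriver &)
{
    return scanner.yylex(lval);
}

// Stand-in for the generated parser: it keeps references to both objects.
class DemoParser {
public:
    DemoParser(DemoScanner &s, DemoDriver &d) : scanner(s), driver(d) {}
    int parse()
    {
        std::string value;
        while (yylex(&value, scanner, driver) != 0) {
            driver.add_word(value);   // the "grammar action"
        }
        return 0;                     // 0 means success, as with Bison's parse()
    }
private:
    DemoScanner &scanner;
    DemoDriver &driver;
};
```

Since the parser only holds references, the scanner and driver must outlive it. Presumably that is why the driver ends up owning both in the example.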

The Driver Header File

One thing you'll notice is that there are two parse functions. One is to take input from a file, the other for input from a C++ stream. The main function defines paths to take input from a pipe and also via file. This is implemented via these parser functions. The cool thing about having a stream input is that you can also use this to parse command line input if you'd like.
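The two-entry-point design can be sketched as a pair of overloads that both funnel into one helper working on a std::istream. This is a simplified stand-in: DemoDriver and parse_helper are hypothetical names, and the helper just counts whitespace-separated words instead of running a scanner and parser.

```cpp
#include <cstddef>
#include <fstream>
#include <istream>
#include <sstream>   // only needed by callers that parse from a string
#include <string>

class DemoDriver {
public:
    // Parse from a file on disk.
    bool parse(const char *filename)
    {
        std::ifstream in(filename);
        if (!in.good()) {
            return false;             // could not open the file
        }
        return parse_helper(in);
    }

    // Parse from any C++ input stream (std::cin, a string stream, ...).
    bool parse(std::istream &stream)
    {
        if (!stream.good()) {
            return false;
        }
        return parse_helper(stream);
    }

    std::size_t words() const { return word_count; }

private:
    // Both public overloads end up here; in a real driver this is where
    // the scanner and parser would be created and run.
    bool parse_helper(std::istream &stream)
    {
        std::string w;
        while (stream >> w) {
            ++word_count;
        }
        return true;
    }

    std::size_t word_count = 0;
};
```

Command line text can then be parsed by wrapping it in a std::istringstream and passing that to the stream overload.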

The Driver Implementation

All that's left is to define a main function to instantiate everything and a Makefile to run the compilation from the command line (invoked by typing make in the source code directory). Our main function is perhaps a bit more complex than it has to be; however, I wanted to include the stream example and the file example all in one, so there is extra code to parse the two flags I added to make this happen. It should be relatively self-explanatory for anyone familiar with parsing the argv array.
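For readers less familiar with argv handling, here is one way such flag parsing could look. The flag names "-o" (read from standard input) and "-f <file>" (read from a file) are assumptions made for this sketch, as are the Options type and the parse_args function; check the actual main file for the real flags.

```cpp
#include <cstring>
#include <string>

// Where the input should come from (hypothetical names).
enum class InputMode { Stdin, File, Usage };

struct Options {
    InputMode mode;
    std::string path;   // only meaningful when mode == InputMode::File
};

// Decide what to do based on the command line.
// Assumed flags: "-o" reads from standard input, "-f <file>" reads a file.
Options parse_args(int argc, const char *const argv[])
{
    if (argc >= 2 && std::strcmp(argv[1], "-o") == 0) {
        return {InputMode::Stdin, ""};
    }
    if (argc >= 3 && std::strcmp(argv[1], "-f") == 0) {
        return {InputMode::File, argv[2]};
    }
    return {InputMode::Usage, ""};   // anything else: print a usage message
}
```

A main function would call parse_args(argc, argv) and then hand either std::cin or the file name to the driver, depending on the mode.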

The Main Function

The Makefile

Be aware that some compilers still use the c++0x flag for the standard in lieu of the c++11 flag. If you are not using clang as your compiler, you should change the appropriate lines within the Makefile so that make will know which compiler to call. The full code is available for download from my GitHub page here: https://github.com/jonathan-beard/simple_wc_example.git (also available as a Zip download). If you find some errors, feel free to hop on GitHub and correct them! I'll update the pieces here accordingly. Thanks to all who have contributed to making this as up to date as possible.