Traditional regular expression syntax
# # Example 1 # from Plex import *
lexicon = Lexicon([ (Str("Python"), "my_favourite_language"), (Str("Perl"), "the_other_language"), (Str("rocks"), "is_excellent"), (Str("sucks"), "is_differently_good"), (Rep1(Any(" \t\n")), IGNORE) ])
Associated with each pattern is an action to be performed when that pattern is recognised. The simplest form of action is just a value to be returned. There are also some special actions, such as IGNORE, which causes the text which matched the pattern to be ignored. So the last item above means that any sequence of one or more blanks, tabs or newlines is to be ignored.
To read tokens from the input stream, you call the read() method of the Scanner. Each time read() is called, it returns a tuple
(value, text)where value is the value returned by the token's action, and text is the text from the input stream that matched the token.
We can test out the Lexicon we defined earlier like this:
If we feed it the following input:# # Example 2 # filename = "my_file.txt" f = open(filename, "r") scanner = Scanner(lexicon, f, filename) while 1: token = scanner.read() print token if token[0] is None: break
we get the following sequence of tuples returned by successive read() calls:Python rocks
Note that a token with a value of None is returned when the scanner hits the end of the input.('my_favourite_language', 'Python') ('is_excellent', 'rocks') (None, '')
(filename, line_number, char_number)The line_number is 1-based and char_number is the 0-based position within the line. The filename is whatever string you supplied as the filename parameter, if any, when you created the Scanner. The filename parameter to the Scanner is optional, and is purely for your convenience -- its only use is to be returned by position().
This example introduces some more features:# # Example 3 # letter = Range("AZaz") digit = Range("09") name = letter + Rep(letter | digit) number = Rep1(digit) space = Any(" \t\n") comment = Str("{") + Rep(AnyBut("}")) + Str("}") resword = Str("if", "then", "else", "end") lex = Lexicon([ (name, 'ident'), (number, 'int'), (resword, TEXT), (Any("+-*/=<>"), TEXT), (space | comment, IGNORE) ])
If you wanted, you could also writeresword = NoCase(Str("if", "then", "else", "end"))
However, note that the scanned text returned when an identifier is matched will still be as it was in the input file, so you'll have to do case-conversion on it yourself.letter = NoCase(Range("AZ"))
There is also an inverse operation Case(). Whether a pattern is matched case-sensitively or case-insensitively is determined by the closest enclosing NoCase() or Case(). So, the pattern
will match the words "Mary" and "little" case-insensitively, and "had", "a" and "lambda" case-sensitively.NoCase(Str("Mary") + Case(Str("had a") + NoCase(Str("little")) + Str("lambda")))
Here is an example which uses action procedures to recognise Modula-style nested comments by keeping track of the nesting depth. It uses some of the pattern definitions from the last example.
When an action procedure is called, it is passed the scanner which has just recognised the token, and the text which was matched. If the procedure returns anything other than None, it is returned as the value of the token. If it returns None, scanning continues as if the IGNORE action hadbeen specified.# # Example 4 # def begin_comment(scanner, text): scanner.nesting_level = scanner.nesting_level + 1 def end_comment(scanner, text): scanner.nesting_level = scanner.nesting_level - 1 def maybe_a_name(scanner, text): if scanner.nesting_level == 0: return 'ident' lex = Lexicon([ (Str("(*"), begin_comment), (Str("*)"), end_comment), (name, maybe_a_name), (space, IGNORE) ]) scn = Scanner(lex,...) scn.nesting_level = 0
Here, the procedures begin_comment() and end_comment() maintain a count of the comment nesting level in an extra instance attribute attached to the Scanner. When something that might be an identifier is recognised, the maybe_a_name() procedure checks whether it's inside a comment. If so, 'ident' is returned as the value of the token; otherwise the default value of None is returned, and scanning continues.
At this point you're probably thinking that the last bit is rather clumsy, and you're right. If we wanted to recognise more tokens than just identifiers, we'd have to have maybe_an_operator(), maybe_a_number(), maybe_a_keyword(), etc. -- rather unwieldy! In the next example, we'll see a much better way.
Here is an example which uses scanner states to skip over comments. For simplicity, this one only handles non-nested comments; we'll extend it to nested comments in a later example.
The State() constructor introduces a new scanner state. It is used in place of a token in the token list, and takes two arguments: the nameof the state, and another list of tokens. (In case you're wondering, that second token list can only contain tokens, not States. States can't be nested.)# # Example 5 # lex = Lexicon([ (name, 'ident'), (number, 'int'), (space, IGNORE), (Str("(*"), Begin('comment')), State('comment', [ (Str("*)"), Begin('')), (AnyChar, IGNORE) ]) ])
The way it works is this. Initially, a newly-created Scanner is in the default state, called ''. In this state, only the first four tokens will match. When the beginning of a comment is recognised, the action Begin('comment') is invoked. This is another special action, whose effect is to change the current state of the scanner. Now the scanner is in the state called 'comment', and will only recognise the two tokens belonging to that state. Their effect is to ignore everything up to the next end-comment marker, whereupon the default state is re-entered and normal scanning continues. (AnyChar, as its name suggests, matches any single character.)
Note that the patterns which recognise the insides of the comment rely on the longest match feature of Plex: if more than one token matches at a given input position, and they match strings of different lengths, the longest one takes precedence.
The scanner-state technique makes recognising comments rather easy. Since we're not trying to handle nesting, we could have recognised the whole comment using a single pattern, but it would be a rather tricky and complicated one -- something like Str("(*") + Rep(AnyBut("*") | (Str("*") + AnyBut(")"))) + Str("*)"). By using a separate state for comments, we avoid any such nightmare.
There is another advantage as well. When Plex is scanning a token, it has to buffer up all the characters read until the whole token is recognised, in case you're interested in the matched text. In the case of a comment, you're not, but Plex has no way of knowing that, so if you try to recognise the whole comment as a single pattern, Plex happily buffers it all up before throwing it away. In contrast, the above example skips over the comment while never having to keep more than 2 characters. If someone feeds you a comment a few megabytes long one day, you might be thankful for that.
Here's an extended version which handles the full range of Pascal-style comments:
The comment2 state in this example relies on the priority rule: if two tokens match strings of the samelength, the one that occurs first in the token list takes precedence.# # Example 6 # lex = Lexicon([ (name, 'ident'), (number, 'int'), (space, IGNORE), (Str("(*"), Begin('comment1')), (Str("{"), Begin('comment2')), State('comment1', [ (Str("*)"), Begin('')), (AnyChar, IGNORE) ]), State('comment2', [ (Str("}"), Begin('')), (AnyChar, IGNORE) ]) ])
Now let's extend Example 5 to handle nested comments. While we're at it, we'll illustrate a technique for packaging up action procedures with a scanner in a neater way than we did in Example 4.
The trick here is that we construct the Lexicon inside the class scope of MyScanner, and plug in methods of MyScanner as action procedures. The Lexicon becomes a class attribute which is passed on to the Scanner constructor by the MyScanner constructor. The result is a completely self-contained subclass of Scanner that comes with its own Lexicon and knows how to initialise itself. To use it, all we need to do is# # Example 7 # class MyScanner(Scanner): def begin_comment(self, text): if self.nesting_level == 0: self.begin('comment') self.nesting_level = self.nesting_level + 1 def end_comment(self, text): self.nesting_level = self.nesting_level - 1 if self.nesting_level == 0: self.begin('') lexicon = Lexicon([ (name, 'ident'), (number, 'int'), (space, IGNORE), (Str("(*"), begin_comment), State('comment', [ (Str("(*"), begin_comment), (Str("*)"), end_comment), (AnyChar, IGNORE) ]) ]) def __init__(self, file, name): Scanner.__init__(self, self.lexicon, file, name) self.nesting_level = 0
and away we go.scanner = MyScanner(file, filename)
Note the use of the begin() method of the Scanner in the action procedures. This has the same effect as the Begin() special action.
These patterns work by matching imaginary characters that Plex inserts into the input stream. For example, given an input file consisting offrom = Bol + "From:" to = Bol + "To:" subject = Bol + "Subject:"
Plex sees it asH e l l o \n w o r l d
These imaginary characters are never matched by any of the ordinary patterns, but they are skipped over if necessary in order to get to something which does match. So if you're not interested in them, you can write your patterns as though they didn't exist and everything will work as you expect.<bol> H e l l o <eol> \n <bol> w o r l d <eol> <eof>
Something to keep in mind is that the Bol, Eol and Eof patterns will only match once at any given input position, so, for instance, a pattern such as Bol + Bol + Str("Hello") will never match. It would obviously be silly to write such a pattern directly, but it could arise accidentally when combining separately defined patterns if you're not careful.
A more subtle problem can occur if you have two tokens in your Lexicon such as
where the intention of the second token is to match the end of any line, even if the file does not end with a newline. The problem is that when the first token matches it absorbs the <eol> marker, so the second token fails to match. One way to avoid the problem in this case is to write the second pattern as Str("\n") | Eof.(Str("Hello") + Eol, 'hello'), (Eol + Opt(Str("\n")), 'eol')
We need to keep track of brackets, because indentation and newlines should be ignored inside brackets. We'll just treat all brackets the same and leave the parser to check that they match up properly.# # Example - Python scanner # class PythonScanner(Scanner):
We'll keep a stack of indentation levels as a list of integers. The top item on this stack will be at the end of the list.def open_bracket_action(self, text): self.bracket_nesting_level = self.bracket_nesting_level + 1 return text def close_bracket_action(self, text): self.bracket_nesting_level = self.bracket_nesting_level - 1 return text
We'll be using a separate scanner state to deal with indentation, because this will make it easier to keep track of the various conditions under which leading whitespace should be regarded as indentation. The trigger for entering this state will be encountering a newline, so we'll define an action for that.def current_level(self): return self.indentation_stack[-1]
Note that we only recognise a newline as terminating a statement if we're not inside brackets, otherwise we ignore it.def newline_action(self, text): if self.bracket_nesting_level == 0: self.begin('indent') return 'newline'
Here comes the action we'll call when we recognise some leading space that should be treated as indentation. For simplicity, we'll assume that the file is indented using all tabs or all spaces, so that the indentation level is just the length of the string. The version in the examples directory contains some code to check these assumptions; we omit it here for clarity. Note that we don't bother checking whether we're inside brackets, because the 'indent' state will never be entered in that case.
When we've finished, we change the scanner back to the default state.
Calling the yield() method of the Scanner object has the same effect as returning a value from the action procedure. You can also pass an optional second argument, which is substituted for the matched text in the returned tuple. We're not interested in the matched text here (it's just the indentation string) so we pass an empty string as the second argument.def indentation_action(self, text): current_level = self.current_level() new_level = len(text) if new_level > current_level: self.indent_to(new_level) elif new_level < current_level: self.dedent_to(new_level) self.begin('') def indent_to(self, new_level): self.indentation_stack.append(new_level) self.yield('INDENT', '') def dedent_to(self, new_level): while new_level < self.current_level(): self.indentation_stack.pop() self.yield('DEDENT', '')
The reason we're using yield() instead of simply returning a value is that it can be called more than once before returning from the action procedure, and the values get queued up and returned one at a time by subsequent calls to read(). This behaviour is crucial for when we dedent more than one level at a time, because we have to generate multiple "dedent" tokens from a single pattern match.
The version in the examples directory also contains another test here to make sure that the dedent matches some previous indent level.
One more thing. When we get to the end of the file, we need to put out enough "dedent" tokens to get back to ground zero. To do this, we'll override the eof() method of the Scanner. This method is called when the scanner reaches the end of the file, just before returning the end-of-file token.
Okay, that's the end of the action procedures. Now we need a bunch of patterns, most of which are straightforward but tedious -- see the examples directory if you're interested. We'll just look at the ones related to whitespace handling.def eof(self): self.dedent_to(0)
This pattern matches an indentation string. The fact that it has to occur at the beginning of a line will be taken care of by the scanner states.indentation = Rep(Str(" ")) | Rep(Str("\t"))
This pattern matches the end of a line. It's written this way instead of just Str("\n") because we want it to match even at the end of the last line of a file that doesn't end with a newline.lineterm = Str("\n") | Eof
We'll use this one to soak up a newline preceded by a backslash before lineterm gets hold of it.escaped_newline = Str("\\\n")
This pattern matches everything from a comment character up to, but not including, the end of the line.comment = Str("#") + Rep(AnyBut("\n"))
We'll use this pattern to skip over lines which are completely blank or contain only a comment, so that they won't generate spurious indentation or newline tokens.blank_line = indentation + Opt(comment) + lineterm
Now we're ready for the lexicon definition. Having defined all our patterns and action procedures, it's fairly straightforward.
Finally, we need to initialise the indentation stack and bracket counter, and set the initial state to 'indent' before starting.lexicon = Lexicon([ (name, 'name'), (number, 'number'), (stringlit, 'string'), (punctuation, TEXT), (opening_bracket, open_bracket_action), (closing_bracket, close_bracket_action), (lineterm, newline_action), (comment, IGNORE), (spaces, IGNORE), (escaped_newline, IGNORE), State('indent', [ (blank_line, IGNORE), (indentation, indentation_action), ]), ])
And we're done. That wasn't too hard now, was it?def __init__(self, file): Scanner.__init__(self, self.lexicon, file) self.indentation_stack = [0] self.bracket_nesting_level = 0 self.begin('indent')
See the Reference section for the precise details of the syntax.from Plex.Traditional import re ident = re("[A-Za-z_][A-Za-z0-9_]*")