Input: <b>Hello 1
Can use string.find(), look for the space.
"hello"[1:3] = "el"
"hello"[1:] = "ello"
"Jane Eyre".split() = ["Jane", "Eyre"]
import re
re.findall(r"[0-9]", "1+2==3") = ["1", "2", "3"]
r"[a-c][1-2]"
"a1", "a2", "b1", …
import re
r = r"[a-z]+|[0-9]+"
re.findall(r, "Goethe 1749") = ["oethe", "1749"]
[0-2] = "0|1|2"
Question mark, ? : zero or one copies, i.e. optional.
import re
r = r"-?[0-9]+"
re.findall(r, "1861-1941 R. Tagore") = ["1861", "-1941"]
Star, * : zero or more copies.
a+ === aa*
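A quick sanity check of that identity (example strings are mine):
import re
print(re.findall(r"a+", "caaab"))   # ['aaa']
print(re.findall(r"aa*", "caaab"))  # ['aaa'], the same language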
Escape special characters using \.
r = r"[a-z]+-?[a-z]+"
. : any character except newline.
[^ab] : any character that isn't a or b.
(?:xyz)+ : grouped, so matches:
xyz
xyzxyz
…
e.g. want to match any number of any permutation of do, re, mi.
r = r"do+|re+|mi+"    # wrong: repeats the last letter
r = r"(?:do|re|mi)+"  # right: repeats whole syllables
regexp = r'"(?:(?:\\.)*|[^\\])*"'
FSM representation: edges maps (state, input) pairs to next states, e.g.
edges[(1, 'a')] = 2
and accepting lists the accepting states, e.g. accepting = [3].
The FSM for r"a+1+":
edges = {(1, 'a') : 2,
         (2, 'a') : 2,
         (2, '1') : 3,
         (3, '1') : 3}
accepting = [3]
def fsmsim(string, current, edges, accepting):
    if len(string) == 0:
        # out of input: accept iff we ended in an accepting state
        return current in accepting
    letter = string[0]
    # follow the edge for this letter, if any
    next_state = edges.get((current, letter), None)
    if next_state is None:
        return False
    return fsmsim(string[1:], next_state, edges, accepting)
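A minimal check of fsmsim on the a+1+ machine above (test strings are mine):
edges = {(1, 'a'): 2, (2, 'a'): 2, (2, '1'): 3, (3, '1'): 3}
accepting = [3]
print(fsmsim("aaa111", 1, edges, accepting))  # True
print(fsmsim("a1a", 1, edges, accepting))     # False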
What are edges and accepting for r"q*"?
edges = {(1, 'q'): 1}
accepting = [1]
What are edges and accepting for the RE r"[a-b][c-d]?"?
edges = {(1, 'a'): 2,
(1, 'b'): 2,
(2, 'c'): 3,
(2, 'd'): 3}
accepting = [2, 3]
regexp = r'[0-9]+(?:[0-9]|-[0-9])*'  # digits with single dashes, never leading or trailing
The fsmsim function can handle DFAs. For problems solved with re.findall(), there are equivalent problems that do not use re.findall(), only fsmsim(). fsmsim() gives you re.match(), i.e. only one search. To get re.findall() behaviour:
s1 = "12+34"
Use fsmsim() for '[0-9]+':
call fsmsim("1"): it matches.
call fsmsim("12"): it matches.
call fsmsim("12+"): it doesn't match. Hence one 'token' is '12'; advance the input to '3'.
call fsmsim("3"): it matches.
call fsmsim("34"): it matches.
end of string.
result is ["12", "34"].
Given this HTML fragment:
Wollstonecraft</a> wrote
we want the following output:
word Wollstonecraft
start of closing tag </
word a
end of closing tag >
word wrote
e.g.
LANGLE <
LANGLESLASH </
RANGLE >
EQUAL =
STRING "google.com"
WORD Welcome!
Token names are arbitrary, but by convention we make them uppercase.
def t_RANGLE(token):
r'>' # I am a regexp!
return token # return text unchanged, but can transform it.
def t_LANGLESLASH(token):
    r'</'
    return token
def t_NUMBER(token):
r'[0-9]+'
token.value = int(token.value)
return token
def t_STRING(token):
r'"[^"]*"'
return token
def t_WHITESPACE(token):
    r' '
    pass  # returning nothing discards the token
And if we define a word as any number of characters except <, >, or space, leaving the value unchanged:
def t_WORD(token):
r'[^<> ]+'
return token
def t_STRING(token):
r'"[^"]*"'
token.value = token.value[1:-1]
return token
Making a lexer
import ply.lex as lex
tokens = (
'LANGLE', # <
'LANGLESLASH', # </
'RANGLE', # >
'EQUAL', # =
'STRING', # ".."
'WORD' # dada
)
t_ignore = ' ' # shortcut for whitespace
# note this is before t_LANGLE, want it to win
def t_LANGLESLASH(token):
r'</'
return token
def t_LANGLE(token):
r'<'
return token
def t_RANGLE(token):
r'>'
return token
def t_EQUAL(token):
r'='
return token
def t_STRING(token):
r'"[^"]*"'
token.value = token.value[1:-1]
return token
def t_WORD(token):
r'[^ <>]+'
return token
webpage = "This is <b>my</b> webpage!"
htmllexer = lex.lex()
htmllexer.input(webpage)
while True:
tok = htmllexer.token()
if not tok: break
print tok
Tokens print as LexToken(TYPE, value, line, position), indicating the line and the character position of the token.
To keep the line count accurate, handle newlines:
def t_newline(token):
    r'\n'
    token.lexer.lineno += 1
    pass
Remember to exclude \n from t_WORD, just like we're currently ignoring spaces.
HTML comments start with <!-- and end with -->.
How to add them to the lexer:
states = (
('htmlcomment', 'exclusive'),
)
If we are in the state htmlcomment
we cannot be doing anything else at the same time, like looking for strings or words.
def t_htmlcomment(token):
r'<!--'
token.lexer.begin('htmlcomment')
def t_htmlcomment_end(token):
r'-->'
token.lexer.lineno += token.value.count('\n')
token.lexer.begin('INITIAL')
def t_htmlcomment_error(token):
token.lexer.skip(1)
INITIAL just means whatever state you were in before coming into this state, i.e. before htmlcomment.
Why skip(1) rather than pass? We're actually gathering up all the characters that resulted in the error, so that we can subsequently count their newlines.
Identifiers and numbers in JavaScript:
def t_identifier(token):
    r'[A-Za-z][A-Za-z0-9_]*'  # * not +: allow one-letter identifiers like x
return token
def t_NUMBER(token):
r'-?[0-9]+(?:\.[0-9]*)?'
token.value = float(token.value)
return token
Comments to the end of the line in JavaScript.
def t_eolcomment(token):
r'//[^\n]*'
pass
We never call these t_ functions ourselves; the ply library uses reflection to find them.
Grammars and derivations, e.g.
sentence
-> subject verb
-> students verb
-> students think
Can perform multiple derivations, e.g.
sentence
-> subject verb
-> subject write
-> teachers write
Adding just one recursive rule gives phenomenal power!
Sentence -> Subject Verb
Subject -> students
Subject -> teachers
Subject -> Subject and Subject
Verb -> think
Verb -> write
Formally, the number of strings is countably infinite.
Arithmetic grammar example:
Exp -> Exp + Exp
Exp -> Exp - Exp
Exp -> number
e.g. number number is not valid; number + number - number is valid.
Valid in grammar == is in the language of the grammar.
Word rules + sentence rules = creativity!
Exp -> Exp + Exp
Exp -> Exp - Exp
Exp -> Number
and
def t_NUMBER(token):
r'[0-9]+'
token.value = int(token.value)
return token
Can now check for valid sequence of tokens.
1 + 2, good
7 + 2 - 2, good
- - 2, bad.
Stmt -> identifier = Exp
Exp -> Exp + Exp
Exp -> Exp - Exp
Exp -> number
lata = 1, good
lata = lata + 1, bad (Exp cannot produce an identifier)
Can specify two rewrite rules for the same non-terminal, where one of them goes to epsilon, i.e. the empty string.
Sentence -> OptionalAdjective Subject Verb
Subject -> william
Subject -> tell
OptionalAdjective -> accurate
OptionalAdjective -> \epsilon
Verb -> shoots
Verb -> bows
8 possible utterances!
Grammars can encode regular languages.
number - r'[0-9]+'
Number -> Digit MoreDigits
MoreDigits -> Digit MoreDigits
MoreDigits -> \epsilon
Digit -> 0
Digit -> 1
…
Digit -> 9
Number
-> Digit MoreDigits
-> Digit Digit MoreDigits
-> Digit Digit \epsilon
-> Digit 2
-> 42
Grammar >= Regexp
regexp = r'p+i?' # e.g. p, pp, pi, ppi
Regexp -> Pplus Iopt
Pplus -> p Pplus
Pplus -> p
Iopt -> i
Iopt -> \epsilon
Context-free grammars describe context-free languages.
"Context-free" means a rule A -> B applies regardless of the surrounding context:
xyzAxyz -> xyzBxyz
Here are three different regular expression forms, and equivalent context-free grammars.
r'ab'  => G -> ab
r'a*'  => G -> \epsilon
          G -> aG
r'a|b' => G -> a
          G -> b
But regular languages != context-free languages.
Consider:
P -> ( P )
P -> \epsilon
Let’s try:
r'\(*\)*'
But it doesn't balance the parentheses :(.
We want:
(^N )^N
But all we can write is:
(* )*
Ambiguity, e.g. "I saw Jane Austen using binoculars".
1 - 2 + 3
could be 2 or -4!
A grammar is ambiguous if there is at least one string in the grammar that has more than one different parse tree.
Parentheses can come to the ( Rescue ).
exp -> exp + exp
exp -> exp - exp
exp -> number
exp -> ( exp )
<b>Welcome to <i>my</i> webpage!</b>
Html -> Element Html
Html -> \epsilon
Element -> word
Element -> TagOpen Html TagClose
TagOpen -> < word >
TagClose -> </ word >
def absval(x):
if x < 0:
return 0 - x
else:
return x
function absval(x) {
    if (x < 0) {
        return 0 - x;
    } else {
        return x;
    }
}
JavaScript uses braces to signify lexical scope. Python uses indentation.
In Python
print "hello" + "!"
In JavaScript:
document.write("hello" + "!")
or
write("hello" + "!")
All JavaScript function calls require parentheses.
Partial grammar for JavaScript:
Exp -> identifier
Exp -> TRUE
Exp -> FALSE
Exp -> number
Exp -> string
Exp -> Exp + Exp
Exp -> Exp - Exp
Exp -> Exp * Exp
Exp -> Exp / Exp
Exp -> Exp < Exp
Exp -> Exp == Exp
Exp -> Exp && Exp
Exp -> Exp || Exp
Statements ~= sentences, e.g. j = 3;
Statement -> identifier = Exp
Statement -> return Exp
Statement -> if Exp CompoundStatement
Statement -> if Exp CompoundStatement else CompoundStatement
CompoundStatement -> { Statements }
Statements -> Statement; Statements
Statements -> \epsilon
Could CompoundStatement also be a single Statement without braces, i.e. could we allow either CompoundStatement or Statement inside a Statement?
A JavaScript program has the same shape as our HTML grammar!
Js -> Element Js
Js -> \epsilon
Element -> function identifier ( OptParams ) CompoundStatement // function definition
Element -> Statement;
OptParams -> Params
OptParams -> \epsilon
Params -> identifier, Params
Params -> identifier
This is a cute property of Context-Free Grammars.
Exp -> … // as before
Exp -> identifier( OptArgs ) // function call
OptArgs -> Args
OptArgs -> \epsilon
Args -> Exp, Args
Args -> Exp
Note the grammar happily lets you define sin(x) but then call it as sin(50, 60); checking argument counts comes later.
(1 + (2 + 3)) is in the language; 1 + + + ) 3 is not.
Lambda (make me a function; an anonymous function):
def addtwo(x): return x+2
addtwo(2) # = 4
mystery = lambda x: x+2
mystery(3) # = 5
pele = mystery
pele(4) # = 6
def mysquare(x): return x*x
map(mysquare, [1,2,3,4,5]) # = [1,4,9,16,25]
map(lambda x: x*x, [1,2,3,4,5]) # same!
[x*x for x in [1,2,3,4,5]] # same!
def odds_only(numbers):
for n in numbers:
if n % 2 == 1:
yield n
yield: not return! This makes a generator.
[x for x in [1,2,3,4,5] if x % 2 == 1] # the same, as a list comprehension
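Quick check (my own example):
print(list(odds_only([1, 2, 3, 4, 5])))            # [1, 3, 5]
print([x for x in [1, 2, 3, 4, 5] if x % 2 == 1])  # [1, 3, 5], same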
Python program to check a string is in a grammar.
exp -> exp + exp
exp -> exp - exp
exp -> ( exp )
exp -> num
grammar = [
    ("exp", ["exp", "+", "exp"]),
    ("exp", ["exp", "-", "exp"]),
    ("exp", ["(", "exp", ")"]),
    ("exp", ["num"]),
]
Given e.g. print exp; we encode the utterance as
utterance = ["print", "exp", ";"]
and want to rewrite it, applying ("exp", ["exp", "-", "exp"]) at pos = 1, into:
["print", "exp", "-", "exp", ";"]
result = utterance[0:pos] + rule[1] + utterance[pos+1:]
e.g.
start with "a exp"
with depth 1, get:
"a exp + exp"
"a exp - exp"
"a (exp)"
"a num"
Let’s code it up:
grammar = … (as above)
def expand(tokens, grammar):
for i, token in enumerate(tokens):
for (rule_lhs, rule_rhs) in grammar:
if token == rule_lhs:
result = tokens[0:i] + rule_rhs + tokens[i+1:]
yield result
depth = 2
utterances = [["exp"]]
for x in xrange(depth):
for sentence in utterances:
utterances = utterances + [ i for i in expand(sentence, grammar)]
for sentence in utterances:
print sentence
For grammars with countably infinite languages, this enumeration is pretty useless as a membership test!
S -> (S)
S -> \epsilon
Is '(()' in grammar?
def memofibo(n, chart = None):
if chart is None:
chart = {}
if n <= 2:
chart[n] = 1
if n not in chart:
chart[n] = memofibo(n-1, chart) + memofibo(n-2, chart)
return chart[n]
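Quick check:
print(memofibo(10))   # 55
print(memofibo(100))  # fast, thanks to the chart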
Parsing will use the same memoization idea, and will need more than one finger to keep its place!
S -> E
E -> E + E
E -> E - E
E -> 1
E -> 2
input = 1 + 2
After seeing 1 +, where am I? This is an example of a parsing state.
If the red dot ends up on the right of the start symbol’s rule, you’ve parsed the string! i.e.
S -> E <dot>
A parsing state is a rewrite rule from the grammar augmented with one red dot on the right-hand side of the rule.
Input: 1 +
State: E -> 1 + <dot> E? No! E -> 1 + E is not a rule from the grammar.
Given parse([t_1, t_2, …, t_n, …, t_last]):
chart[n] = all parse states we could be in after seeing t_1, t_2, …, t_n only!
e.g.
E -> E + E
E -> int
Input = int + int
chart[0] =
[E -> <dot> E + E,
E -> <dot> int]
chart[1] =
[E -> int <dot>,
E -> E <dot> + E]
chart[2] =
[E -> E + <dot> E]
We’ll need to keep track of one extra piece of information: how many tokens we’ve seen so far.
E -> E + E
E -> int
Input = int + int
chart[0] =
 [E -> <dot> E + E, from 0
  E -> <dot> int, from 0]
chart[1] =
 [E -> int <dot>, from 0
  E -> E <dot> + E, from 0]
chart[2] =
 [E -> <dot> int, from 2
  E -> E + <dot> E, from 0]
Why? Because we want to parse, and parsing is the inverse of producing strings.
int + int
E + int # apply E -> int
E + E # apply E -> int
E # apply E -> E + E
Reading down the list is parsing; generating is going up.
If you build the chart, you have solved parsing!
S -> E
E -> …
S -> E <dot> - starting at 0 => we've parsed it.
# We want to be in this state!
If the input is T tokens long:
S -> E <dot>, from 0, in chart[T]
If we can build the chart, and the above is true, then the string is in the language of the CFG.
Start:
chart[0], S -> <dot> E from 0
End:
chart[T], S -> E <dot> from 0.
Suppose:
S -> E + <dot> E, from j, in chart[i] (seen i tokens)
Need to find all rules that go to E and “bring them in”.
Let’s say:
chart[i] has X -> ab <dot> cd, from j.
For all grammar rules:
c -> pqr
We add:
c -> <dot> pqr, from i
to chart[i].
Suppose:
E -> E - E
E -> (F)
E -> int
F -> string
Input: int - int
Seen 2 tokens so far
chart[2] has E -> E - <dot> E, from 0
Then the result of computing the closure:
E -> <dot> int from 2
E -> <dot> (F) from 2
E -> <dot> E - E from 2
The following are not in the result:
E -> <dot> E - E from 0 # wrong from
F -> <dot> string from 2 # wrong LHS
Shifting, aka consuming the input, is another method.
Recall parsing state:
X -> ab <dot> cd, from j in chart [i]
If c is terminal, => shift (i.e. consume the terminal).
X -> abc <dot> d from j into chart [i+1]
This works when we have seen i tokens and the (i+1)th token is c, a terminal. We are not updating from, because that records where we've come from.
Given x -> ab <dot> cd:
if c is a non-terminal => closure
if c is a terminal => shift
if cd is \epsilon, i.e. nothing => reduce
Reduction: apply rewrite rules / productions in reverse.
E -> E + E
E -> int
<dot> int + int + int
int <dot> + int + int
# magical reduction!
E <dot> + int + int
E + <dot> int + int
E + int <dot> + int
# magical reduction!
E + E <dot> + int
# magical reduction!
E <dot> + int
E + <dot> int
...
But how to apply reductions?
E -> E + E <dot> from B in chart [A]
We'd seen inputs up to B, and have since matched E + E:
input_1 input_2 … input_B | E + E
It’s as if we saw the LHS at this point:
input_1 input_2 … input_B | E
Where did we come from? Suppose chart[B] has:
E -> E - <dot> E from C
So add:
E -> E - E <dot> from C to chart[A]
Example!
T -> aBc
B -> bb
input: abbc
N = 0
chart[0]
T -> <dot>aBc, from 0
N = 1, a
chart[1]
# shift
T -> a <dot>Bc, from 0
# and we see a non-terminal, so bring in closure
B -> <dot>bb, from 1
N = 2, ab
# shift
B -> b<dot>b, from 1
N = 3, abb
# shift
B -> bb<dot>, from 1
# - red dot at end of rule, so reduce.
# - came from state 1.
# - Does anyone in state 1 want to see B?
# - Yes! T -> a<dot>Bc is looking for one.
# - So transplant that rule here
T -> aB<dot>c, from 0
Adding state to chart:
def addtochart(chart, index, state):
if not state in chart[index]:
chart[index] = [state] + chart[index]
return True
else:
return False
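A quick check of the change detection (the state tuple shape is defined just below):
chart = {0: []}
state = ("S", [], ["P"], 0)
print(addtochart(chart, 0, state))  # True: newly added
print(addtochart(chart, 0, state))  # False: already present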
Grammar:
S -> P
P -> (P)
P ->
In Python:
grammar = [
("S", ["P"]),
("P", ["(", "P", ")"]),
("P", []),
]
Parser state:
X -> ab<dot>cd from j
In Python:
state = ("x", ["a", "b"], ["c", "d"], j)
Seed the start state into chart[0]. At the end, check chart[n] for n tokens to see if we reached the final state.
Closure: in chart[i], we see x -> ab<dot>cd from j. We'll call:
next_states = closure(grammar, i, x, ab, cd, j)
for next_state in next_states:
    any_changes = addtochart(chart, i, next_state) or any_changes
What is closure()?
def closure(grammar, i, x, ab, cd, j):
next_states = [
(rule[0], [], rule[1], i)
for rule in grammar
if len(cd) > 0 and
rule[0] == cd[0]
]
return next_states
Shift: in chart[i], we see x -> ab<dot>cd from j, and we have the input tokens. We'll call:
next_state = shift(tokens, i, x, ab, cd, j)
if next_state is not None:
    any_changes = addtochart(chart, i+1, next_state) or any_changes
What is shift()?
def shift(tokens, i, x, ab, cd, j):
    if len(cd) > 0 and tokens[i] == cd[0]:
        return (x, ab + [cd[0]], cd[1:], j)
    else:
        return None
Reductions: in chart[i], we see x -> ab<dot>cd from j. We'll call:
next_states = reductions(chart, i, x, ab, cd, j)
for next_state in next_states:
    any_changes = addtochart(chart, i, next_state) or any_changes
def reductions(chart, i, x, ab, cd, j):
    # x -> ab<dot> from j
    # chart[j] has y -> ... <dot>x ... from k
    return [
        (jstate[0],
         jstate[1] + [x],
         jstate[2][1:],
         jstate[3])
        for jstate in chart[j]
        if len(cd) == 0 and        # reduce only when the dot is at the end
           len(jstate[2]) > 0 and
           jstate[2][0] == x
    ]
# see notes/src/programming_languages/ps4_parser.py
# above has closure, shift, and reductions in-lined.
def parse(tokens, grammar):
tokens = tokens + ["end_of_input_marker"]
chart = {}
start_rule = grammar[0]
for i in xrange(len(tokens) + 1):
chart[i] = []
start_state = (start_rule[0], [], start_rule[1], 0)
chart[0] = [start_state]
for i in xrange(len(tokens)):
while True:
changes = False
for state in chart[i]:
# State === x -> ab<dot>cd, j
(x, ab, cd, j) = state
# Current state == x -> ab<dot>cd, j
# Option 1: For each grammar rule
# c -> pqr (where the c's match)
# make a next state:
#
# c -> <dot>pqr, i
#
# English: We're about to start
# parsing a "c", but "c" may be
# something like "exp" with its
# own production rules. We'll bring
# those production rules in.
next_states = closure(grammar, i, x, ab, cd, j)
for next_state in next_states:
changes = addtochart(chart, i, next_state) or changes
# Current State == x -> ab<dot>cd, j
# Option 2: If tokens[i] == c,
# make a next state:
#
# x -> abc<dot>d, j
#
                # English: We're looking for a parse
# token c next and the current token
# is exactly c! Aren't we lucky!
# So we can parse over it and move
# to j+1.
                next_state = shift(tokens, i, x, ab, cd, j)
                if next_state is not None:
                    changes = addtochart(chart, i+1, next_state) or changes
# Current state == x -> ab<dot>cd, j
# Option 3: if cd is [], the state is
# just x -> ab<dot>, j
# For each p -> q<dot>xr, l in chart[j]
# Make a new state:
#
# p -> qx<dot>r, l
#
# in chart[i].
#
# English: We've just finished parsing
# an "x" with this token, but that
# may have been a sub-step (like
# matching "exp->2" in "2+3"). We
# should update the higher-level
# rules as well.
next_states = reductions(chart, i, x, ab, cd, j)
for next_state in next_states:
changes = addtochart(chart, i, next_state) or changes
if not changes:
break
accepting_state = (start_rule[0], start_rule[1], [], 0)
return accepting_state in chart[len(tokens)-1]
result = parse(tokens, grammar)
print result
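For example, with the parenthesis grammar above (test tokens are mine; this assumes the closure, shift, and reductions defined earlier):
tokens = ["(", "(", ")", ")"]
print(parse(tokens, grammar))           # True
print(parse(["(", ")", ")"], grammar))  # False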
# tokens
def t_STRING(t):
r'"[^"]*"'
t.value = t.value[1:-1]
return t
# parsing rules
def p_exp_number(p):
'exp : NUMBER' # exp -> NUMBER
p[0] = ("number", p[1])
# p[0] is returned parse tree
# p[0] refers to exp
# p[1] refers to NUMBER.
def p_exp_not(p):
'exp : NOT exp' # exp -> NOT exp
p[0] = ("not", p[2])
# p[0] refers to exp
# p[1] refers to NOT
# p[2] refers to exp
p: the list of parse trees.
def p_html(p):
'html : elt html'
p[0] = [p[1]] + p[2]
def p_html_empty(p):
'html : '
p[0] = []
def p_elt_word(p):
'elt : WORD'
p[0] = ("word-element", p[1])
def p_elt_tag(p):
# <span color="red">Text!</span>:
'elt : LANGLE WORD tag_args RANGLE html LANGLESLASH WORD RANGLE'
p[0] = ("tag-element", p[2], p[3], p[5], p[7])
def p_exp_binop(p):
"""exp : exp PLUS exp
| exp MINUS exp
| exp TIMES exp"""
p[0] = ("binop", p[1], p[2], p[3])
1 - 3 - 5 is ambiguous:
(1-3)-5 = -7
1-(3-5) = 3
def p_exp_call(p):
'exp : IDENTIFIER LPAREN optargs RPAREN'
p[0] = ("call", p[1], p[3])
def p_exp_number(p):
'exp : NUMBER'
p[0] = ("number", p[1])
precedence = (
# lower precedence at the top
('left', 'PLUS', 'MINUS'),
('left', 'TIMES', 'DIVIDE'),
# higher precedence at the bottom
)
def p_exp_call(p):
'exp : IDENTIFIER LPAREN optargs RPAREN'
p[0] = ("call", p[1], p[3])
def p_exp_number(p):
'exp : NUMBER'
p[0] = ("number", p[1])
def p_optargs(p):
"""optargs : exp COMMA optargs
| exp
| """
if len(p) == 1:
p[0] = []
elif len(p) == 2:
p[0] = [p[1]]
else:
p[0] = [p[1]] + p[3]
# or can separate out parsing rules in OR statement
# into its own function. separate rules give better
# performance, as the parser has done all of your
# len() work for you.
What is in chart[2], given:
S -> id(OPTARGS)
OPTARGS ->
OPTARGS -> ARGS
ARGS -> exp,ARGS
ARGS -> exp
input: id(exp,exp)
chart[0]
S -> <dot>id(OPTARGS)$, from 0
chart[1]
# shift
S -> id<dot>(OPTARGS)$, from 0
chart[2]
# shift
S -> id(<dot>OPTARGS)$, from 0
# OPTARGS could be epsilon, hence
# in one world:
S -> id(OPTARGS<dot>)$, from 0
# In another world we see OPTARGS
# and it isn't epsilon, so we closure.
OPTARGS -> <dot>ARGS, from 2
OPTARGS -> <dot>, from 2
# !!AI I think by recursion we apply closure to ARGS; reminiscent of epsilon-closure during NFA->DFA conversion.
ARGS -> <dot>exp,ARGS from 2
ARGS -> <dot>exp from 2
Programming examples
1 + 2 # = 3
"hello" + " world" # = "hello world"
1 + "hello" # ???
len works on both strings and lists; + works on numbers, strings, and lists.
We interpret by walking the parse tree.
("word-element", "Hello")
("tag-element", "b", ..., "b")
("javascript-element", "function fibo(N) { ...")
# Embedded JavaScript in HTML.
We use re for regexps, ply for lexing and parsing, and timeit for benchmarking.
The graphics API:
graphics.word(string)
# draw on screen
graphics.begintag(string, dictionary)
# doesn't draw, just makes a note. like changing pen colours.
# dictionary passes in attributes, e.g. href.
graphics.endtag()
# closes the most recent tag.
graphics.warning(string)
# debugging, in bold red color.
Example:
Nelson Mandela <b>was elected</b> democratically.
# how this calls into graphics API
graphs.word("Nelson")
graphics.word("Mandela")
graphics.begintag("b", {})
graphics.word("was")
graphics.word("elected")
graphics.endtag("b")
graphics.word("democratically.")
Interpret code.
import graphics
def interpret(trees): # Hello, friend
for tree in trees: # Hello,
# ("word-element","Hello")
nodetype=tree[0] # "word-element"
if nodetype == "word-element":
graphics.word(tree[1])
elif nodetype == "tag-element":
# <b>Strong text</b>
tagname = tree[1] # b
tagargs = tree[2] # []
subtrees = tree[3] # ...Strong Text!...
closetagname = tree[4] # b
# QUIZ: (1) check that the tags match
# if not use graphics.warning()
if tagname != closetagname:
graphics.warning("Mismatched tag. start: '%s', end: '%s'" % (tagname, closetagname))
else:
# (2): Interpret the subtree
# HINT: Call interpret recursively
graphics.begintag(tagname, {})
interpret(subtrees)
graphics.endtag()
word-element: done.
tag-element: done.
javascript-element: not done yet.
Evaluating arithmetic, e.g. input: (1*2) + (3*4), is done by eval_exp.
Code:
def eval_exp(tree):
# ("number" , "5")
# ("binop" , ... , "+", ... )
nodetype = tree[0]
if nodetype == "number":
return int(tree[1])
elif nodetype == "binop":
left_child = tree[1]
operator = tree[2]
right_child = tree[3]
# QUIZ: (1) evaluate left and right child
left_value = eval_exp(left_child)
right_value = eval_exp(right_child)
# (2) perform "operator"'s work
assert(operator in ["+", "-"])
if operator == "+":
return left_value + right_value
elif operator == "-":
return left_value - right_value
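Quick check on a hand-built tree for 1 + 2:
print(eval_exp(("binop", ("number", "1"), "+", ("number", "2"))))  # 3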
def env_lookup(environment, variable_name):
...
def eval_exp(tree, environment):
nodetype = tree[0]
if nodetype == "number":
return int(tree[1])
elif nodetype == "binop":
# ...
elif nodetype == "identifier":
# ("binop", ("identifier","x"), "+", ("number","2"))
# QUIZ: (1) find the identifier name
# (2) look it up in the environment and return it
return env_lookup(environment, tree[1])
if, while, and return change the flow of control.
Expressions like 2+3 or x+1 are handled by eval_exp; statements are handled by eval_stmts:
def eval_stmts(tree, environment):
stmttype = tree[0]
if stmttype == "assign":
# ("assign", "x", ("binop", ..., "+", ...)) <=== x = ... + ...
variable_name = tree[1]
right_child = tree[2]
new_value = eval_exp(right_child, environment)
env_update(environment, variable_name, new_value)
elif stmttype == "if-then-else": # if x < 5 then A;B; else C;D;
conditional_exp = tree[1] # x < 5
then_stmts = tree[2] # A;B;
else_stmts = tree[3] # C;D;
# QUIZ: Complete this code
# Assume "eval_stmts(stmts, environment)" exists
if eval_exp(conditional_exp, environment):
return eval_stmts(then_stmts, environment)
else:
return eval_stmts(else_stmts, environment)
Python:
x = 0
print x + 1
JavaScript:
var x = 0
write(x+1)
Can have multiple values in different contexts.
x = "outside"
def myfun(x):
print x
myfun("inside")
# get "inside"
def env_lookup(var_name, env):
# env = (parent, dictionary)
if var_name in env[1]:
# do we have it?
return (env[1])[var_name]
elif env[0] is None:
# am global?
return None
else:
# ask parents
return env_lookup(var_name, env[0])
def env_update(var_name, value, env):
if var_name in env[1]:
# do we have it?
(env[1])[var_name] = value
elif not (env[0] is None):
# if not global, ask parents.
env_update(var_name, value, env[0])
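A quick check with hand-built frames (example values are mine):
global_env = (None, {"x": "outside"})
local_env = (global_env, {"x": "inside"})
print(env_lookup("x", local_env))   # inside
print(env_lookup("x", global_env))  # outside
env_update("x", "changed", local_env)
print(env_lookup("x", local_env))   # changed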
def mean(x):
    return x
    print "one thousand and one nights"  # unreachable: return exits first
We implement return using try and except.
def eval_stmt(tree, environment):
stmttype = tree[0]
if stmttype == "return":
return_exp = tree[1] # return 1 + 2
retval = eval_exp(return_exp, environment)
raise Exception(retval)
def eval_stmt(tree,environment):
stmttype = tree[0]
if stmttype == "call": # ("call", "sqrt", [("number","2")])
fname = tree[1] # "sqrt"
args = tree[2] # [ ("number", "2") ]
fvalue = env_lookup(fname, environment)
if fvalue[0] == "function":
# We'll make a promise to ourselves:
# ("function", params, body, env)
fparams = fvalue[1] # ["x"]
fbody = fvalue[2]
fenv = fvalue[3]
            if len(fparams) != len(args):
                print "ERROR: wrong number of args"
else:
#QUIZ: Make a new environment frame
newfenv = (fenv, {})
for param, value in zip(fparams, args):
newfenv[1][param] = None
eval_value = eval_exp(value, environment)
env_update(param, eval_value, newfenv)
try:
# QUIZ : Evaluate the body
eval_stmts(fbody, newfenv)
return None
                except Exception as return_value:
                    return return_value.args[0]  # unwrap the value raised by "return"
else:
print "ERROR: call to non-function"
elif stmttype == "return":
retval = eval_exp(tree[1],environment)
raise Exception(retval)
elif stmttype == "exp":
eval_exp(tree[1],environment)
In Python and JavaScript functions can be values. Hence we must represent function values.
def myfun(x):
return x+1
function myfun(x) {
return x+1;
}
("function", fparams, fbody, fenv)
fenv
.Code:
def eval_elt(tree, env):
elttype = tree[0]
if elttype == "function":
fname = tree[1]
fparams = tree[2]
fbody = tree[3]
fvalue = ("function", fparams, fbody, env)
add_to_env(env, fname, fvalue)
Can use JavaScript to simulate any Python program.
x = 0
while True:
x = x + 1
print x
Suppose we had halts(), which takes a procedure as an argument and returns True if that procedure halts and False if it loops forever. Then:
def tsif():
    if halts(tsif):
        x = 0
        while True:
            x = x + 1
    else:
        return 0
If tsif halts, then it loops forever. If tsif loops forever, then it halts. Contradiction, hence halts() cannot exist.
explode(), aka Python's string.split(), assigns to local variables and trusts the user to be friendly. Nope!
Our JavaScript provides write(). We lex and parse write() calls, gather the write() output, and call the graphics library.
5<7 or a>b is valid JavaScript, but would confuse an HTML lexer, so we use lexer states again:
def t_javascript(token):
r'\<script\ type=\"text\/javascript\"\>'
token.lexer.code_start = token.lexer.lexpos
token.lexer.begin('javascript')
# note that lexpos is such that we've already
# stripped off the initial text/javascript part.
def t_javascript_end(token):
r'\<\/script\>' # </script>
token.value = token.lexer.lexdata[token.lexer.code_start:token.lexer.lexpos-9]
token.type = 'JAVASCRIPT'
token.lexer.lineno += token.value.count('\n')
token.lexer.begin('INITIAL')
return token
# note that lexdata is such that we need to
# manually strip off </script>
def p_element_word(p):
'element : WORD'
p[0] = ("word-element", p[1])
# p[0] is the parse tree
# p[1] is the child parse tree
def p_element_javascript(p):
'element : JAVASCRIPT'
p[0] = ("javascript-element", p[1])
JAVASCRIPT in the parser is the same as token.type in the lexer. This is intentional: it is the link between the lexer and the parser.
HTML input:
hello my
<script type="text/javascript">document.write(99);</script>
luftballons
Parse tree:
[("word-element", "hello"),
("word-element", "my"),
("javascript-element", "document.write(99)"),
("word-element", "luftballons")]
def interpret(trees):
for tree in trees:
treetype = tree[0]
if treetype == "word-element":
            graphics.word(tree[1])
        # covered HTML tags in another quiz...
        elif treetype == "javascript-element":
jstext = tree[1] # "document.write(55);"
# jstokens is an external module
jslexer = lex.lex(module=jstokens)
# jsgrammar is another external module
jsparser = yacc.yacc(module=jsgrammar)
# jstree is a parse tree for JavaScript
jstree = jsparser.parse(jstext, lexer=jslexer)
# We want to call the interpreter on our AST
result = jsinterp.interpret(jstree)
graphics.word(result)
A page may call document.write() more than once, but we still want to return only one string from the jsinterp.interpret() call. So write appends to the special "javascript output" variable in the global environment.
def interpret(trees):
# recall env = (parent, dictionary), and as this is the global environment the parent pointer is None
global_env = (None, {"javascript output": ""})
for elt in trees:
eval_elt(elt, global_env)
return (global_env[1])["javascript output"]
"javascript output"
, in particular the space, is important; the user is not allowed to use a space in an identifier so they can’t ever collide with this.def eval_exp(tree, env):
exptype = tree[0]
if exptype == "call":
fname = tree[1] # myfun in myfun(a,3+4)
fargs = tree[2] # [a,3+4] in myfun(a,3+4)
    fvalue = env_lookup(fname, env) # None for "write"; built-in
if fname == "write":
argval = eval_exp(fargs[0],env)
output_sofar = env_lookup("javascript output",env)
env_update("javascript output", \
output_sofar + str(argval), env)
return None
function factorial(n) {
if (n == 0) {
return 1;
}
return n * factorial(n-1);
}
document.write(1260 + factorial(6));
We call factorial 7 times (n = 6 down to 0), so 7 environments are created, each with its own n (see "def eval_stmt(tree, environment)" above).
Software maintenance (i.e. testing, debugging, refactoring) carries a huge cost.
When comparing a function's output to the expected output, read the code and see which features of our interpreter it exercises. If you're not using a feature, you're not testing it!
def env_lookup(vname, env):
    # env = (parent-pointer, {"x": 22, "y": 33})
    if vname in env[1]:
        return (env[1])[vname]
    else: # BUG
        return None # BUG: should recurse into env[0], the parent
var a = 1;
function mistletoe(baldr) {
baldr = baldr + 1;
a = a + 2;
baldr = baldr + a;
return baldr;
}
write(mistletoe(5));
Looking up baldr, a local variable, works; it is the global a that exposes the bug.
Nested functions in Python and JavaScript:
greeting = "hola"
def makegreeter(greeting):
def greeter(person):
print greeting + " " + person
return greeter
sayhello = makegreeter("hello")
sayhello("gracie")
var greeting = "hola";
function makegreeter(greeting) {
var greeter = function(person) {
write(greeting + " " + person);
}
return greeter;
}
var sayhello = makegreeter("hello");
def eval_exp(tree,env):
exptype = tree[0]
# function(x,y) { return x+y }
if exptype == "function":
# ("function", ["x","y"], [ ("return", ("binop", ...) ])
fparams = tree[1]
fbody = tree[2]
return ("function", fparams, fbody, env)
# "env" allows local functions to see local variables
# can see variables that were in scope *when the function was defined*
return ("function", fparams, fbody, global_env)
function factorial(n) {
if (n == 0) { return 1; }
return 1 * n * factorial(n-1);
}
We write 1 * n a lot, and 1 * n can be replaced by n.
Smaller AST, faster recursive walk, fewer multiplications.
Think of optimizations as transformations of the parse tree:
x * 1 == x
x + 0 == x
But x/x can't be optimized to 1, because if x=0 it raises an exception, and we want to keep the same semantics after optimization.
def optimize(tree):
etype = tree[0]
if etype == "binop": # a * 1 = a
a = tree[1]
op = tree[2]
b = tree[3]
if op == "*" and b == ("number","1"):
return a
elif op == "*" and b == ("number","0"):
return ("number","0")
elif op == "+" and b == ("number","0"):
return a
return tree
i.e. this:
("binop",
("number", "5"),
("*"),
("number", "1")
)
becomes:
("number", "5")
But this doesn't fully simplify a * 1 * 1; we need to recurse.
def optimize(tree): # Expression trees only
etype = tree[0]
if etype == "binop":
# Fix this code so that it handles a + ( 5 * 0 )
# recursively! QUIZ!
a = optimize(tree[1])
op = tree[2]
b = optimize(tree[3])
if op == "*" and b == ("number","1"):
return a
elif op == "*" and b == ("number","0"):
return ("number","0")
elif op == "+" and b == ("number","0"):
return a
return ("binop", a, op, b) # return optimized tree, not original
Other candidate optimizations involve +, -, / and len().
x * 1 === x
x / x !== 1
Also beware side effects, e.g. document.write().
A regular expression can capture a^N: it's a+.
A regular expression cannot capture a^N b^N, because this involves memory / context / counting. Same as balancing parentheses. A context-free grammar (CFG) can capture a^N b^N. It looks like this:
S -> aSb
S ->
S -> aSb
S -> \epsilon
S -> c
Input: acb
What parsing states are in chart[2]?
S -> <dot> a S b, from 0: it is in chart[0], so no.
S -> a <dot> S b, from 0: it is in chart[1], so no.
S -> <dot>, from 1: it is in chart[1], so no.
S -> c <dot>, from 1: it is in chart[2], so yes.
S -> a S <dot> b, from 0: it is in chart[2], so yes.
Note the from here is 0, not 1: from 1 would mean there is one hidden input not spanned by this rule, whereas this rule spans all the input we can see.
Consider optimization_OK(f, g), which compares a function before and after optimization and tells you if the optimization is safe.
You can write an optimization_OK that returns a safe answer in all cases: just never optimize!
You cannot write an optimization_OK that works precisely in all cases; that is undecidable, like the Halting Problem.
If we could decide optimization_OK in all cases, then we could compare any function to an infinite loop def loops(), and hence we'd have solved the Halting Problem.
optimization_OK is commutative: O(f, g) == O(g, f).
e.g. comparing tan() and tanh().