Natural Language Processing

Columbia University, via Coursera

Creative Commons License
Natural Language Processing notes by Asim Ihsan is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.

Readings policy

There are excellent readings assigned to the class. They're inlined directly into the respective lecture's notes, to avoid writing things out twice.

Other readings (papers, textbooks, other courses) are explicitly inlined as well.

Rendering

In order to use pandoc, run the following (the header include pulls in custom LaTeX packages needed for some symbols):

    pandoc \[course\]\ natural\ language\ processing.md \
    -o pdf/nlp.pdf --include-in-header=latex.template

or, for Markdown + LaTeX to HTML + MathJax output:

    pandoc \[course\]\ natural\ language\ processing.md \
    -o html/nlp.html \
    --include-in-header=html/_header.html \
    --mathjax -s --toc --smart -c _pandoc.css

and, for the ultimate experience, after `pip install watchdog`:

    watchmedo shell-command --patterns="*.md" \
    --ignore-directories --recursive \
    --command='<command above>' .

Week 1 - Introduction to Natural Language Processing

Introduction (Part 1)

Tasks

Basic NLP problems

Introduction (Part 2)

Why is NLP hard?

What will this course be about

Syllabus

Week 1 - The Language Modeling Problem

Introduction to the Language Modeling Problem (Part 1)

\[V = \{the, a, man, telescope, Beckham, two, ...\}\]

\[V^+ = \{"the\:STOP", "a\:STOP", "the\:fan\:STOP", ...\}\]

\[\sum_{x \in V^+} p(x) = 1, \quad p(x) \ge 0 \; \forall \; x \in V^+\]

Introduction to the Language Modeling Problem (Part 2)

\[p(x_1, \ldots, x_n) = \frac{c(x_1, \ldots, x_n)}{N}\]

Markov Processes (Part 1)

\[P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)\]

\[P(A,B) = P(A) \times P(B|A)\] \[P(A,B,C) = P(A) \times P(B|A) \times P(C|A,B)\]

\[P(X_1 = x_1, X_2 = x_2) = P(X_1 = x_1) \times P(X_2 = x_2 | X_1 = x_1)\] \[P(X_1 = x_1, X_2 = x_2, X_3 = x_3) = P(X_1 = x_1) \times P(X_2 = x_2 | X_1 = x_1) \times P(X_3 = x_3 | X_1 = x_1, X_2 = x_2)\]

\[P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)\] \[=P(X_1 = x_1) \prod_{i=2}^{n} P(X_i = x_i\;|\;X_1 = x_1, \dots, X_{i-1} = x_{i-1})\]

\[= P(X_1 = x_1) \prod_{i=2}^{n} P(X_i = x_i\;|\; X_{i-1} = x_{i-1})\]

\[P(X_i=x_i|X_1=x_1, \ldots, X_{i-1} = x_{i-1}) = P(X_i=x_i | X_{i-1} = x_{i-1})\]

 Markov Processes (Part 2)

\[P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)\] \[=P(X_1 = x_1) P(X_2 = x_2 | X_1 = x_1) \prod_{i=3}^{n} P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})\]

\[= \prod_{i=1}^{n} P(X_i = x_i | X_{i-2} = x_{i-2}, X_{i-1} = x_{i-1})\]

Modelling Variable Length Sequences

Trigram Language Models

\[p(x_1, \dots, x_n) = \prod_{i=1}^{n}q(x_i\;|\;x_{i-2},x_{i-1})\]

An example. For the sentence

    the dog barks STOP

we could have

\(p(\textrm{the dog barks STOP}) =\)
\(q(\textrm{the | *, *})\)
\(\times q(\textrm{dog | *, the})\)
\(\times q(\textrm{barks | the, dog})\)
\(\times q(\textrm{STOP | dog, barks})\)


\[ \begin{align} &\begin{aligned} q(\textrm{the | *, *}) & = 1 \\ q(\textrm{dog | *, the}) & = 0.5 \\ q(\textrm{STOP | *, the}) & = 0.5 \\ q(\textrm{runs | the, dog}) & = 0.5 \\ q(\textrm{STOP | the, dog}) & = 0.5 \\ q(\textrm{STOP | dog, runs}) & = 1 \end{aligned} \end{align} \]


The Trigram Estimation Problem

\[q(w_i\;|\;w_{i-2},w_{i-1}) = \frac{\textrm{Count}(w_{i-2},w_{i-1},w_{i})}{\textrm{Count}(w_{i-2},w_{i-1})}\]

\[q(\textrm{laughs | the, dog}) = \frac{\textrm{Count(the, dog, laughs)}}{\textrm{Count(the, dog)}}\]
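
As a minimal sketch of this estimation in Python (counts kept in plain dictionaries; the helper names are illustrative, not from the course):

    from collections import defaultdict

    def count_ngrams(sentences):
        """Count bigrams and trigrams, padding each sentence with *, * and STOP."""
        bigram, trigram = defaultdict(int), defaultdict(int)
        for words in sentences:
            padded = ["*", "*"] + words + ["STOP"]
            for i in range(2, len(padded)):
                bigram[(padded[i - 2], padded[i - 1])] += 1
                trigram[(padded[i - 2], padded[i - 1], padded[i])] += 1
        return bigram, trigram

    def q_ml(w, u, v, bigram, trigram):
        """Maximum-likelihood estimate q(w | u, v) = Count(u, v, w) / Count(u, v)."""
        if bigram[(u, v)] == 0:
            return 0.0
        return trigram[(u, v, w)] / bigram[(u, v)]

    # For example, q(dog | *, the) estimated from a tiny two-sentence corpus:
    bi, tri = count_ngrams([["the", "dog", "barks"], ["the", "cat", "walks"]])
    print(q_ml("dog", "*", "the", bi, tri))  # 0.5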


Correct:

\(q_{ML}({\textrm{walks | *, dog}})\)
\(q_{ML}({\textrm{dog | walks, the}})\)
\(q_{ML}({\textrm{walks | the, dog}})\)

Incorrect:

\(q_{ML}({\textrm{walks | dog, the}})\)
\(q_{ML}({\textrm{fast | dog, the}})\)
\(q_{ML}({\textrm{STOP | walks, dog}})\)


Sparse Data problems

Evaluating Language Models: Perplexity

\[\textrm{log}\;\prod_{i=1}^{m} p(s_i) = \sum_{i=1}^{m} \textrm{log}\;p(s_i)\]

\[p(s_i) = q(\textrm{the | *, *}) \times q(\textrm{dog | *, the}) \times \ldots\]

\[\textrm{Perplexity} = 2^{-l},\;\textrm{where}\] \[l = \frac{1}{M} \sum_{i=1}^{m} \textrm{log}\;p(s_i)\]
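
A minimal sketch of the computation; the helper below takes per-sentence log2 probabilities, which any of the models above could supply (names are illustrative):

    import math

    def perplexity(sentence_log2_probs, total_words):
        """Perplexity = 2^(-l), with l = (1/M) * sum_i log2 p(s_i)."""
        l = sum(sentence_log2_probs) / total_words
        return 2.0 ** (-l)

    # Three test sentences, each with probability 0.5, M = 12 words
    # (the numbers from the quiz further down): perplexity = 2^(1/4).
    print(round(perplexity([math.log2(0.5)] * 3, 12), 3))  # 1.189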

Some Intuition about Perplexity

\[q(w|u,v) = \frac{1}{N},\;\forall\;w \in V \cup \{\textrm{STOP}\},\;\forall\;u,v \in V \cup \{\textrm{*}\}\].

\[\textrm{Perplexity} = 2^{-l},\;\textrm{where}\;l=\textrm{log}\;\frac{1}{N}\] \[\implies\; \textrm{Perplexity} = N\]


Quiz: define a trigram language model with the following parameters:

Now consider a test corpus with the following sentences:

Note that the number of words in this corpus, M, is 12.

What is the perplexity of the language model, to 3dp?

\[P = 2^{-l}\] \[l = \frac{1}{M} \sum \textrm{log}_2\{p(s_i)\}\]

\(p(\textrm{the dog runs STOP}) = q(\textrm{the | *, *}) \times q(\textrm{dog | *, the}) \times q(\textrm{runs | the, dog}) \times q(\textrm{STOP | dog, runs})\)
\(= 1 \times 0.5 \times 1 \times 1 = 0.5\)

\(p(\textrm{the cat walks STOP}) = q(\textrm{the | *, *}) \times q(\textrm{cat | *, the}) \times q(\textrm{walks | the, cat}) \times q(\textrm{STOP | cat walks})\)
\(= 1 \times 0.5 \times 1 \times 1 = 0.5\)

\(l = \frac{1}{12} \{ 3 \times \textrm{log}_2(0.5) \}\) \(=\frac{1}{12}(-3) = \frac{-1}{4}\)
\(p=2^{\frac{1}{4}} = \sqrt[4]{2} = 1.189\;\textrm{(3dp)}\)


 Typical values of perplexity (Goodman)

Some history

Week 1 - Parameter Estimation in Language Models

Linear Interpolation (Part 1)

The Bias-Variance Trade-Off

\[q_{ML}(w_i\;|\;w_{i-2},w_{i-1}) = \frac{\textrm{Count}(w_{i-2},w_{i-1},w_i)}{\textrm{Count}(w_{i-2},w_{i-1})}\]

\[q_{ML}(w_i\;|\;w_{i-1}) = \frac{\textrm{Count}(w_{i-1},w_i)}{\textrm{Count}(w_{i-1})}\]

\[q_{ML}(w_i) = \frac{\textrm{Count}(w_i)}{\textrm{Count}()}\]

 Linear Interpolation (Part 2)

 Linear Interpolation

\(q(w_i\;|\;w_{i-2},w_{i-1})\)
\(= \lambda_1 \times q_{ML}(w_i\;|\;w_{i-2},w_{i-1})\)
\(+ \lambda_2 \times q_{ML}(w_i\;|\;w_{i-1})\)
\(+ \lambda_3 \times q_{ML}(w_i)\)

\(q(\textrm{laughs | the, dog})\)
\(= \frac{1}{3} \times q_{ML}(\textrm{laughs | the, dog})\)
\(+ \frac{1}{3} \times q_{ML}(\textrm{laughs | dog})\)
\(+ \frac{1}{3} \times q_{ML}(\textrm{laughs})\)
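
A sketch of the interpolated estimate, with counts again kept in dictionaries (helper names are illustrative; the count values below are the ones used in the quiz that follows):

    def q_li(w, u, v, unigram, bigram, trigram, total, lambdas=(1/3, 1/3, 1/3)):
        """lambda1 * q_ML(w|u,v) + lambda2 * q_ML(w|v) + lambda3 * q_ML(w)."""
        l1, l2, l3 = lambdas
        tri = trigram.get((u, v, w), 0) / bigram[(u, v)] if bigram.get((u, v)) else 0.0
        bi = bigram.get((v, w), 0) / unigram[v] if unigram.get(v) else 0.0
        uni = unigram.get(w, 0) / total
        return l1 * tri + l2 * bi + l3 * uni

    # Counts consistent with the quiz below: Count(the, green) = 1, Count(the, green, book) = 1,
    # Count(green) = 2, Count(green, book) = 1, Count(book) = 3, Count() = 14.
    unigram = {"green": 2, "book": 3}
    bigram = {("the", "green"): 1, ("green", "book"): 1}
    trigram = {("the", "green", "book"): 1}
    print(round(q_li("book", "the", "green", unigram, bigram, trigram, total=14), 3))  # 0.571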


Quiz: we are given the following corpus:

Assume we compute a language model based on this corpus using linear interpolation with \(\lambda_i = \frac{1}{3}\;\forall\;i \in \{1,2,3\}\).

What is the value of the parameter \(q_{LI}(\textrm{book | the, green})\) in this model to 3dp? (Note: please include STOP words in your unigram model).

\(q_{LI}(\textrm{book | the, green})\)
\(= \frac{1}{3} \times q_{ML}(\textrm{book | the, green})\)
\(+ \frac{1}{3} \times q_{ML}(\textrm{book | green})\)
\(+ \frac{1}{3} \times q_{ML}(\textrm{book})\)

\(= \frac{1}{3} \times \frac{\textrm{Count(the, green, book)}}{\textrm{Count(the, green)}}\)
\(+ \frac{1}{3} \times \frac{\textrm{Count(green, book)}}{\textrm{Count(green)}}\)
\(+ \frac{1}{3} \times \frac{\textrm{Count(book)}}{\textrm{Count()}}\)

\(= \frac{ \frac{1}{3}(1) }{(1)} + \frac{ \frac{1}{3}(1) }{(2)} + \frac{ \frac{1}{3}(3) }{(14)}\)
\(= 0.571\;\textrm{(3dp)}\)


Our estimate correctly defines a distribution. Define \(V^{'} = V \cup \{STOP\}.\)

\(\sum_{w \in V^{'}} q(w|u,v)\)
\(=\sum_{w \in V^{'}} [\lambda_1 \times q_{ML}(w|u,v) + \lambda_2 \times q_{ML}(w|v) + \lambda_3 \times q_{ML}(w)]\)

move out the constant lambdas:

\(=\lambda_1 \sum_w q_{ML}(w|u,v) + \lambda_2 \sum_w q_{ML}(w|v) + \lambda_3 \sum_w q_{ML}(w)\)

By definition, the maximum likelihood estimates of a trigram, bigram, or unigram model each form a distribution: for a fixed conditioning context, the estimated probabilities of all possible next words sum to 1.

\(= \lambda_1 + \lambda_2 + \lambda_3 = 1\)

(Can also show that \(q(w|u,v) \ge 0\;\forall\;w \in V^{'}\)).


Quiz: say we have \(\lambda_1 = -0.5, \lambda_2 = 0.5, \lambda_3 = 1.0\). Note that these satisfy the constraint \(\sum_i \lambda_i = 1\), but violate the constraint that \(\lambda_i \ge 0\).

(Credit to Philip M. Hession for the explanations).

Recalling our definition of \(q\) above, it is true that there might be a trigram \(u,v,w\) such that \(q(w|u,v) \lt 0\):

\[q(\text{barks}|\text{the,dog})=-\frac{1}{2}\frac{c(\text{the,dog,barks})}{c(\text{the,dog})}+\frac{1}{2}\frac{c(\text{dog,barks})}{c(\text{dog})}+1\cdot\frac{c(\text{barks})}{c()}\]

then \(q(\text{barks}|\text{the,dog})=-\frac{1}{2}(\sim 1)+\frac{1}{2}(\ll 1)+1(\ll 1) \lt0\)

and there might be a trigram \(u,v,w\) such that \(q(w|u,v) \gt 1\):

\[q(\text{barks}|\text{the,dog})=-\frac{1}{2}\frac{c(\text{the,dog,barks})}{c(\text{the,dog})}+\frac{1}{2}\frac{c(\text{dog,barks})}{c(\text{dog})}+1\cdot\frac{c(\text{barks})}{c()}\]

then \[q(\text{barks}|\text{the,dog})=-\frac{1}{2}(\ll 1)+\frac{1}{2}(\sim 1)+1(\sim 1) \gt 1\]

It is not true that we may have a bigram \(u,v\) such that \(\sum_{w \in V} q(w|u,v) \neq 1\):

\[\sum_{w}q(w|u,v) = -\frac{1}{2}\frac{\sum_w c(u,v,w)}{c(u,v)}+\frac{1}{2}\frac{\sum_w c(v,w)}{c(v)}+1\cdot\frac{\sum_w c(w)}{c()} = -\frac{1}{2}(1)+\frac{1}{2}(1)+1(1)=1\]

since \(\sum_w c(u,v,w)=c(u,v)\), \(\sum_w c(v,w)=c(v)\), and \(\sum_w c(w)=c()\).


How to estimate the \(\lambda\) values?

\[L(\lambda_1,\lambda_2,\lambda_3) = \sum_{w_1,w_2,w_3} c^{'}(w_1,w_2,w_3)\;\textrm{log}\;q(w_3|w_1,w_2)\]

such that \(\lambda_1 + \lambda_2 + \lambda_3 = 1\) and \(\lambda_i \ge 0\;\forall\;i\) and where:

\(q(w_i|w_{i-2},w_{i-1}) =\)
\(\lambda_1 \times q_{ML}(w_i|w_{i-2},w_{i-1})\)
\(+\lambda_2 \times q_{ML}(w_i|w_{i-1})\)
\(+\lambda_3 \times q_{ML}(w_i)\)

Allowing the \(\lambda\)’s to vary

\[ \begin{equation} \Pi(w_{i-2},w_{i-1}) = \begin{cases} 1, & \textrm{If Count}(w_{i-2},w_{i-1}) = 0\\ 2, & \textrm{If 1} \le \textrm{Count}(w_{i-2},w_{i-1}) \le 2\\ 3, & \textrm{If 3} \le \textrm{Count}(w_{i-2},w_{i-1}) \lt 5\\ 4, & \textrm{Otherwise} \end{cases} \end{equation} \]

\[ \begin{align} &\begin{aligned} q(w_i\;|\;w_{i-2},w_{i-1}) & = \lambda_1^{\Pi(w_{i-2},w_{i-1})} \times q_{ML}(w_i\;|\;w_{i-2},w_{i-1}) \\ &\; + \lambda_2^{\Pi(w_{i-2},w_{i-1})} \times q_{ML}(w_i\;|\;w_{i-1}) \\ &\; + \lambda_3^{\Pi(w_{i-2},w_{i-1})} \times q_{ML}(w_i) \end{aligned} \end{align} \]

Discounting Methods (Part 1)

| x              | Count(x) | \(q_{ML}(w_i\;\vert\;w_{i-1})\) |
|----------------|----------|----------------------------------|
| the            | 48       |                                  |
| the, dog       | 15       | \(^{15}/_{48}\)                  |
| the, woman     | 11       | \(^{11}/_{48}\)                  |
| the, man       | 10       | \(^{10}/_{48}\)                  |
| the, park      | 5        | \(^{5}/_{48}\)                   |
| the, job       | 2        | \(^{2}/_{48}\)                   |
| the, telescope | 1        | \(^{1}/_{48}\)                   |
| the, manual    | 1        | \(^{1}/_{48}\)                   |
| the, afternoon | 1        | \(^{1}/_{48}\)                   |
| the, country   | 1        | \(^{1}/_{48}\)                   |
| the, street    | 1        | \(^{1}/_{48}\)                   |

With discounted counts \(\textrm{Count}^{*}(x) = \textrm{Count}(x) - 0.5\):

| x              | Count(x) | Count*(x) | \(\frac{\textrm{Count*(x)}}{\textrm{Count(the)}}\) |
|----------------|----------|-----------|-----------------------------------------------------|
| the            | 48       |           |                   |
| the, dog       | 15       | 14.5      | \(^{14.5}/_{48}\) |
| the, woman     | 11       | 10.5      | \(^{10.5}/_{48}\) |
| the, man       | 10       | 9.5       | \(^{9.5}/_{48}\)  |
| the, park      | 5        | 4.5       | \(^{4.5}/_{48}\)  |
| the, job       | 2        | 1.5       | \(^{1.5}/_{48}\)  |
| the, telescope | 1        | 0.5       | \(^{0.5}/_{48}\)  |
| the, manual    | 1        | 0.5       | \(^{0.5}/_{48}\)  |
| the, afternoon | 1        | 0.5       | \(^{0.5}/_{48}\)  |
| the, country   | 1        | 0.5       | \(^{0.5}/_{48}\)  |
| the, street    | 1        | 0.5       | \(^{0.5}/_{48}\)  |

\[\alpha(w_{i-1}) = 1 - \sum_{w} \frac{\textrm{Count}^{*}(w_{i-1},w)}{\textrm{Count}(w_{i-1})}\]
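
A sketch of the missing-mass computation with a fixed discount of 0.5, matching the tables above (illustrative names):

    def missing_mass(u, unigram, bigram, discount=0.5):
        """alpha(u) = 1 - sum_w Count*(u, w) / Count(u), with Count*(u, w) = Count(u, w) - discount."""
        followers = [w for (v, w) in bigram if v == u]
        discounted = sum(bigram[(u, w)] - discount for w in followers)
        return 1.0 - discounted / unigram[u]

    # The ten bigrams starting with "the" from the table above, Count(the) = 48:
    unigram = {"the": 48}
    counts = [("dog", 15), ("woman", 11), ("man", 10), ("park", 5), ("job", 2),
              ("telescope", 1), ("manual", 1), ("afternoon", 1), ("country", 1), ("street", 1)]
    bigram = {("the", w): c for w, c in counts}
    print(missing_mass("the", unigram, bigram))  # 10 * 0.5 / 48 ~= 0.104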


Quiz: assume that we are given a corpus with the following properties:

Furthermore assume that the discounted counts are defined as \(c^{*}(\textrm{the,w}) = c(\textrm{the,w}) - 0.3\). Under this corpus, what is the missing probability mass \(\alpha(\textrm{the})\) to 3dp?

\[ \begin{align} &\begin{aligned} \alpha(\textrm{the}) & = 1 - \sum_{w} \frac{\textrm{Count}^{*}(\textrm{the, w})}{\textrm{Count(the)}} \\ & = \frac{\textrm{Count(the)} - \sum_{w} \textrm{Count}^{*}\textrm{(the, w)}}{\textrm{Count(the)}} \\ & = \frac{\textrm{Count(the)} - \sum_{w} \left\{ \textrm{Count(the, w)} - 0.3\right\}}{\textrm{Count(the)}} \\ & = \frac{\textrm{Count(the)} + 0.3 \times |\{w : \textrm{Count(the, w)} \gt 0\}| - \textrm{Count(the)}}{\textrm{Count(the)}} \\ & = \frac{(0.3)(15)}{70} = 0.064\;\textrm{(3 dp)} \end{aligned} \end{align} \]


Katz Back-Off Models (Bigrams)

\[ \begin{align} &\begin{aligned} A(w_{i-1}) & = \left\{w : \textrm{Count}(w_{i-1},w) \gt 0\right\} \\ B(w_{i-1}) & = \left\{w : \textrm{Count}(w_{i-1},w) = 0\right\} \end{aligned} \end{align} \]

\[\alpha(w_{i-1}) = 1 - \sum_{w \in A(w_{i-1})} \frac{\textrm{Count}^{*}(w_{i-1},w)}{\textrm{Count}(w_{i-1})}\]

\[\textrm{Count}^{*}(w_{i-1},w_i) = \textrm{Count}(w_{i-1},w_i) - \gamma\\ \textrm{where $\gamma$ is a constant}\]

\[ \begin{equation} q_{BO}(w_i\;|\;w_{i-1}) = \begin{cases} \frac{\textrm{Count}^{*}(w_{i-1},w_i)}{\textrm{Count}(w_{i-1})}, & \textrm{If } w_i \in A(w_{i-1})\\ \alpha(w_{i-1})\frac{q_{ML}(w_i)}{\sum_{w \in B(w_{i-1})} q_{ML}(w)}, & \textrm{If } w_i \in B(w_{i-1}) \end{cases} \end{equation} \]
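
A sketch of the bigram back-off estimate built from the pieces above (a real implementation would precompute \(\alpha\) and the normaliser rather than recompute them per query; names are illustrative):

    def q_bo(w, v, unigram, bigram, total, discount=0.5):
        """Katz back-off bigram estimate q_BO(w | v)."""
        if bigram.get((v, w), 0) > 0:
            # w in A(v): use the discounted count directly.
            return (bigram[(v, w)] - discount) / unigram[v]
        # w in B(v): share the missing mass alpha(v) in proportion to q_ML(w).
        seen = {x for (u, x) in bigram if u == v}
        alpha = 1.0 - sum(bigram[(v, x)] - discount for x in seen) / unigram[v]
        norm = sum(c for word, c in unigram.items() if word not in seen) / total
        return alpha * (unigram[w] / total) / norm

    # Counts consistent with the quiz below (only the bigram starting with "his" matters here):
    unigram = {"his": 1, "house": 1, "the": 1, "book": 1, "STOP": 2}
    bigram = {("his", "house"): 1}
    print(round(q_bo("book", "his", unigram, bigram, total=6), 3))  # 0.1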


Quiz: Let’s return to a smaller version of our corpus:

This time we compute a bigram language model using Katz back-off with \(c^{*}(v,w) = c(v,w) - 0.5\).

What is the value of \(q_{BO}(\textrm{book | his})\) estimated from this corpus?

\[w_i = \textrm{book}, w_{i-1} = \textrm{his}\]

\[ \begin{align} &\begin{aligned} A(\textrm{his}) & = \{\textrm{house}\} \\ B(\textrm{his}) & = \{\textrm{his, the, book, STOP}\} \end{aligned} \end{align} \]

Draw a table for \(w_{i-1}\) and all words that follow it, in order to determine \(\alpha(w_{i-1})\)

| x          | Count(x) | Count*(x) |
|------------|----------|-----------|
| his        | 1        |           |
| his, house | 1        | 0.5       |

\[\alpha(\textrm{his}) = 1 - (0.5)/(1) = 0.5\]

Since \(\textrm{book} \in B(\textrm{his})\), i.e. since “book” never follows “his” in the corpus:

\[ \begin{align} &\begin{aligned} \sum_{w \in B(w_{i-1})} q_{ML}(w) & = q_{ML}(\textrm{his}) + q_{ML}(\textrm{the}) + q_{ML}(\textrm{book}) + q_{ML}(\textrm{STOP}) \\ & = (1/6) + (1/6) + (1/6) + (2/6) \\ & = 5/6 \end{aligned} \end{align} \]

\[ \begin{align} &\begin{aligned} q_{BO}(\textrm{book | his}) & = \alpha(w_{i-1})\frac{q_{ML}(w_i)}{\sum_{w \in B(w_{i-1})} q_{ML}(w)} \\ & = (0.5) \times \frac{(1/6)}{(5/6)} \\ & = 0.1 \end{aligned} \end{align} \]


Discounting Methods (Part 2)

Katz Back-Off Models (Trigrams)

\[ \begin{align} &\begin{aligned} A(w_{i-2},w_{i-1}) & = \left\{w : \textrm{Count}(w_{i-2},w_{i-1},w) \gt 0\right\} \\ B(w_{i-2},w_{i-1}) & = \left\{w : \textrm{Count}(w_{i-2},w_{i-1},w) = 0\right\} \end{aligned} \end{align} \]

\[ \begin{equation} q_{BO}(w_i\;|\;w_{i-2},w_{i-1}) = \begin{cases} \frac{\textrm{Count}^{*}(w_{i-2},w_{i-1},w_i)}{\textrm{Count}(w_{i-2},w_{i-1})}, & \textrm{If } w_i \in A(w_{i-2},w_{i-1})\\ \alpha(w_{i-2},w_{i-1})\frac{q_{BO}(w_i|w_{i-1})}{\sum_{w \in B(w_{i-2},w_{i-1})} q_{BO}(w|w_{i-1})}, & \textrm{If } w_i \in B(w_{i-2},w_{i-1}) \end{cases} \end{equation} \]

where

\[\alpha(w_{i-2},w_{i-1}) = 1 - \sum_{w \in A(w_{i-2},w_{i-1})} \frac{\textrm{Count}^{*}(w_{i-2},w_{i-1},w)}{\textrm{Count}(w_{i-2},w_{i-1})}\]

 Summary

 Week 2 - Tagging Problems and Hidden Markov Models

The Tagging Problem

 Part-of-Speech Tagging

    Profits soared at Boeing Co., easily topping forecasts on Wall
    Street, as their CEO Alan Mulally announced first quarter
    results.

    N   =   Noun
    V   =   Verb
    P   =   Preposition
    Adv =   Adverb
    Adj =   Adjective
    ...

    Profits/N soared/V at/P Boeing/N Co./N ,/, easily/ADV
    topping/V forecasts/N on/P Wall/N Street/N ,/, as/P
    their/POSS CEO/N Alan/N Mulally/N announced/V first/ADJ
    quarter/N results/N ./.

 Named Entity Recognition

    Profits soared at [Company: Boeing Co.], easily ...
    [Location: Wall Street], ..., [Person: Alan Mulally]

 Named Entity Extraction as Tagging

    NA  =   No entity
    SC  =   Start Company
    CC  =   Continue Company
    SL  =   Start Location
    CL  =   Continue Location
    ...

    Profits/NA soared/NA at/NA Boeing/SC Co./CC ,/NA easily/NA
    topping/NA ... Wall/SL Street/CL ,/NA ... CEO/NA Alan/SP
    Mulally/CP ...

Quiz: given sentence: Profits are topping all estimates.

We also know:

How many tag sequences are possible?

\[= 2 \times 1 \times 3 \times 3 \times 2 = 36\]


Two Types of Constraints

    Influential/JJ members/NNS of/IN ... bailout/NN agency/NN
    can/MD raise/VB capital/NN ./.

    The trash can is in the garage.

Generative Models for Supervised Learning

Supervised Learning Problems

\[ \begin{align} &\begin{aligned} & x^{(1)} = \textrm{the dog laughs}, & y^{(1)} = \textrm{DT NN VB} \\ & x^{(2)} = \textrm{the dog barks}, & y^{(2)} = \textrm{DT NN VB} \\ & \ldots & \ldots \end{aligned} \end{align} \]

 Generative Models

\[p(y|x) = \frac{p(y)p(x|y)}{p(x)}\]

\[p(x) = \sum_y p(y)p(x|y)\]

\[ \begin{align} &\begin{aligned} f(x) & = \textrm{argmax}_{y}\;p(y|x) \\ & = \textrm{argmax}_{y}\;\frac{p(y)p(x|y)}{p(x)} \\ & = \textrm{argmax}_{y}\;p(y)p(x|y) \end{aligned} \end{align} \]

Hidden Markov Models

\[p(x_1, x_2, \ldots, x_n, y_1, y_2, \ldots, y_n)\]

\[\textrm{arg}\underset{y_1 \ldots y_n}{\textrm{max}} p(x_1, x_2, \ldots, x_n, y_1, y_2, \ldots, y_n)\]

Trigram Hidden Markov Models (Trigram HMMs)

\[p(x_1, x_2, \ldots, x_n, y_1, y_2, \ldots, y_{n+1}) = \prod_{i=1}^{n+1} q(y_i|y_{i-2},y_{i-1}) \prod_{i=1}^{n} e(x_i|y_i)\]
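
A sketch of evaluating this joint probability, with the transition and emission parameters stored as tuple-keyed dictionaries (the example parameter values are made up for illustration):

    def hmm_joint_prob(words, tags, q, e):
        """p(x_1..x_n, y_1..y_{n+1}); `tags` must end with STOP, so len(tags) == len(words) + 1."""
        padded = ["*", "*"] + tags
        prob = 1.0
        for i in range(2, len(padded)):
            prob *= q.get((padded[i - 2], padded[i - 1], padded[i]), 0.0)
        for x, y in zip(words, tags):
            prob *= e.get((x, y), 0.0)
        return prob

    # Made-up parameters for "the dog laughs" tagged D N V:
    q = {("*", "*", "D"): 1.0, ("*", "D", "N"): 1.0, ("D", "N", "V"): 1.0, ("N", "V", "STOP"): 1.0}
    e = {("the", "D"): 0.9, ("dog", "N"): 0.8, ("laughs", "V"): 0.7}
    print(round(hmm_joint_prob(["the", "dog", "laughs"], ["D", "N", "V", "STOP"], q, e), 3))  # 0.504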


Quiz: Given tagset \(S = \{\textrm{D, N}\}\), a vocabulary \(V = \{\textrm{the, dog}\}\), and a HMM with transition parameters:

and emission parameters:

Under this model how many pairs of sequences \(x_1, x_2, \ldots, x_n, y_1, y_2, \ldots, y_{n+1}\) satisfy \(p(x_1, x_2, \ldots, x_n, y_1, y_2, \ldots, y_{n+1}) \gt 0\)?

First: how many non-zero-probability tag sequences are there? Enumerate them by drawing a graph of nodes and edges, where a node is a tag and an edge is labelled with the transition probability to another tag. Then follow all paths from the start symbols to the STOP symbol whose product of probabilities is \(\gt\) 0.

D, N, STOP

There’s only one! OK. Refer back to your tag sequence graph and copy it for each possible word that a given tag (i.e. node) may “generate”. If e.g. N could generate two words, not one, we would have four possible sentences.

the dog
dog dog

There are only two! OK. Hence the answer is two: each of these sentences pairs with the single valid tag sequence, giving two sequence pairs with non-zero probability.


An example

If we have:

Then:

\[ \begin{align} &\begin{aligned} & p(x_1, x_2, \ldots, x_n, y_1, y_2, \ldots, y_{n+1}) \\ = & q(\textrm{D | *, *}) \times q(\textrm{N | *, D}) \times q(\textrm{V | D, N}) \times q(\textrm{STOP | N, V}) \times \\ & e(\textrm{the | D}) \times e(\textrm{dog | N}) \times e(\textrm{laughs | V}) \end{aligned} \end{align} \]


Quiz: given set \(S = \{\textrm{D, N, V}\}\), and vocabulary \(V = \{\textrm{the, cat, drinks, milk, dog}\}\), and an HMM model:

What is the value, under this model, of:

\[p(\textrm{the, cat, drinks, milk, D, N, V, N, STOP})\]

\[ \begin{align} &\begin{aligned} & p(\textrm{the, cat, drinks, milk, D, N, V, N, STOP}) \\ = & \prod_{i=1}^{n+1} q(y_i|y_{i-2},y_{i-1}) \prod_{i=1}^{n} e(x_i|y_i) \\ = & \{ q(\textrm{D | *, *}) \times q(\textrm{N | *, D}) \times q(\textrm{V | D, N}) \times q(\textrm{N | N, V}) \times q(\textrm{STOP | V, N}) \} \times \\ & e(\textrm{the | D}) \times e(\textrm{cat | N}) \times e(\textrm{drinks | V}) \times e(\textrm{milk | N}) \\ = & \left(\frac{1}{4}\right)^5 \times \left(\frac{1}{5}\right)^4 \end{aligned} \end{align} \]


Why the Name?


Quiz: for a bigram HMM:

\[p(x_1, x_2, \ldots, x_n, y_1, y_2, \ldots, y_{n+1}) = \prod_{i=1}^{n+1} q(y_i|y_{i-1}) \prod_{i=1}^{n} e(x_i|y_i)\]


Parameter Estimation in HMMs

Smoothed Estimation

e.g.

\[ \begin{align} &\begin{aligned} q(\textrm{Vt | DT, JJ}) & = \lambda_1 \times \frac{\textrm{Count(DT, JJ, Vt)}}{\textrm{Count(DT, JJ)}} \\ & + \lambda_2 \times \frac{\textrm{Count(JJ, Vt)}}{\textrm{Count(JJ)}} \\ & + \lambda_3 \times \frac{\textrm{Count(Vt)}}{\textrm{Count()}} \end{aligned} \end{align} \]

\[\lambda_1 + \lambda_2 + \lambda_3 = 1\] \[\forall\;i, \lambda_i \ge 0\]

\[e(\textrm{base | Vt}) = \frac{\textrm{Count(Vt, base)}}{\textrm{Count(Vt)}}\]


Quiz: Given the following corpus:

Assume we’ve calculated MLEs of a trigram HMM from this data. What is the value of the emission parameter \(e(\textrm{cat | N})\) from this HMM?

\[ \begin{align} &\begin{aligned} e(\textrm{cat | N}) = & \frac{\textrm{Count(N, cat)}}{\textrm{Count(N)}} \\ = & \frac{(1)}{(2)} \end{aligned} \end{align} \]

Say we estimate the transition parameters for a trigram HMM using linear interpolation, such that \(\lambda_i = \frac{1}{3}\) for \(i = \{1, 2, 3\}\). What is the value of the transition parameter \(q(\textrm{STOP | N, V})\) under this model?

\[ \begin{align} &\begin{aligned} q(\textrm{STOP | N, V}) = & \lambda_1 \times \frac{\textrm{Count(N, V, STOP)}}{\textrm{Count(N, V)}} \\ + & \lambda_2 \times \frac{\textrm{Count(V, STOP)}}{\textrm{Count(V)}} \\ + & \lambda_3 \times \frac{\textrm{Count(STOP)}}{\textrm{Count()}} \\ = & \left(\frac{1}{3} \times \frac{(2)}{(2)}\right) \\ + & \left(\frac{1}{3} \times \frac{(2)}{(2)}\right) \\ + & \left(\frac{1}{3} \times \frac{(2)}{(8)}\right) \\ = & 0.75 \end{aligned} \end{align} \]


Dealing with Low-Frequency Words: An Example

Profits soared at Boeing Co., easily topping ...
CEO Alan Mulally.

| Word class             | Example   | Intuition                            |
|------------------------|-----------|--------------------------------------|
| twoDigitNum            | 90        | Two digit year                       |
| fourDigitNum           | 1990      | Four digit year                      |
| containsDigitAndAlpha  | A8956-67  | Product code                         |
| containsDigitAndDash   | 09-96     | Date                                 |
| containsDigitAndSlash  | 11/9/89   | Date                                 |
| containsDigitAndComma  | 23,000.00 | Monetary amount                      |
| containsDigitAndPeriod | 1.00      | Monetary, financial                  |
| othernum               | 456789    | Other                                |
| allCaps                | BBN       | Organization                         |
| capsPeriod             | M.        | Initial                              |
| firstWord              | first     | No useful capitalisation information |
| initCap                | Sally     | Capitalized word                     |
| lowercase              | can       | Uncapitalized word                   |
| other                  | ,         | Punctuation, other words             |
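
A sketch of mapping rare words to a subset of these classes (simplified: several of the digit classes are collapsed into one, and the regexes are illustrative):

    import re

    def word_class(word, is_first_word=False):
        """Map a low-frequency word to a coarse class, roughly following the table above."""
        if re.fullmatch(r"\d{2}", word):
            return "twoDigitNum"
        if re.fullmatch(r"\d{4}", word):
            return "fourDigitNum"
        if re.search(r"\d", word) and not word.isalpha():
            return "othernum"          # collapses the containsDigitAnd... classes
        if word.isupper():
            return "allCaps"
        if re.fullmatch(r"[A-Z]\.", word):
            return "capsPeriod"
        if is_first_word:
            return "firstWord"
        if word[:1].isupper():
            return "initCap"
        if word.islower():
            return "lowercase"
        return "other"

    print(word_class("1990"), word_class("BBN"), word_class("Mulally"))  # fourDigitNum allCaps initCap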

Return to an old example. Before transformation:

    Profits/NA soared/NA at/NA Boeing/SC Co./CC easily/NA
    topping/NA forecasts/NA on/NA Wall/SL Street/CL ,/NA their/NA
    CEO/NA Alan/SP Mulally/CP announced/NA first/NA quarter/NA
    results/NA ./NA

After transformation:

    firstword/NA soared/NA at/NA initCap/SC Co./CC ,/NA easily/NA
    lowercase/NA forecasts/NA on/NA initCap/SL Street/CL ,/NA as/NA
    their/NA CEO/NA Alan/SP initCap/CP announced/NA first/NA
    quarter/NA results/NA ./NA

The Viterbi Algorithm for HMMs

Problem

\[\textrm{arg}\underset{y_1 \dots y_{n+1}}{\textrm{max}} p(x_1 \ldots x_n, y_1 \ldots y_{n+1})\]

\[p(x_1 \ldots x_n, y_1 \ldots y_{n+1}) = \prod_{i=1}^{n+1} q(y_i | y_{i-2}, y_{i-1}) \prod_{i=1}^{n} e(x_i | y_i)\]

\[\textrm{arg}\underset{y_1 \dots y_{n+1}}{\textrm{max}} p(x_1 \ldots x_n, y_1 \ldots y_{n+1}) = \textrm{arg}\underset{y_1 \dots y_{n+1}}{\textrm{max}} \textrm{log} \left\{ p(x_1 \ldots x_n, y_1 \ldots y_{n+1}) \right\}\]

\[ \begin{align} &\begin{aligned} p(x_1 \ldots x_n, y_1 \ldots y_{n+1}) & = \prod_{i=1}^{n+1} q(y_i | y_{i-2}, y_{i-1}) \prod_{i=1}^{n} e(x_i | y_i) \\ \textrm{log} \left\{ p(x_1 \ldots x_n, y_1 \ldots y_{n+1}) \right\} & = \textrm{log} \left\{ \prod_{i=1}^{n+1} q(y_i | y_{i-2}, y_{i-1}) \prod_{i=1}^{n} e(x_i | y_i) \right\} \\ & = \textrm{log} \left\{ \prod_{i=1}^{n+1} q(y_i | y_{i-2}, y_{i-1}) \right\} + \textrm{log} \left\{ \prod_{i=1}^{n} e(x_i | y_i) \right\} \end{aligned} \end{align} \]

\[ \begin{align} &\begin{aligned} & = \textrm{log} \left\{ q(y_1|y_{-1},y_0) \times q(y_2|y_0,y_1) \times \ldots \times q(y_{n+1}|y_{n-1},y_{n}) \right\} \\ & + \textrm{log} \left\{ e(x_1|y_1) \times e(x_2|y_2) \times \ldots \times e(x_n|y_n) \right\} \\ & = \textrm{log} \left\{ q(y_1|y_{-1},y_0) \right\} + \textrm{log} \left\{ q(y_2|y_0,y_1) \right\} + \ldots + \textrm{log} \left\{ q(y_{n+1}|y_{n-1},y_{n}) \right\} \\ & + \textrm{log} \left\{ e(x_1|y_1) \right\} + \textrm{log} \left\{ e(x_2|y_2) \right\} + \ldots + \textrm{log} \left\{ e(x_n|y_n) \right\} \end{aligned} \end{align} \]

Brute Force Search is Hopelessly Inefficient

The Viterbi Algorithm

\[S_{-1} = S_0 = \{\textrm{*}\}\] \[S_k = S\;\textrm{for}\;k \in \{1, 2, \ldots n\}\]

\[r(y_{-1}, y_0, y_1, \ldots, y_k) = \prod_{i=1}^{k} q(y_i | y_{i-2}, y_{i-1}) \prod_{i=1}^{k} e(x_i|y_i)\]

\[\pi(k,u,v) = \textrm{maximum probability of a tag sequence ending in tags}\;u, v\;\textrm{at position}\;k\]

i.e.

\[\pi(k,u,v) = max_{(y_{-1},y_0,y_{1},\ldots,y_k):y_{k-1}=u,\;y_k=v} r(y_{-1},y_0,y_1,\ldots,y_k)\]

An Example

\[\underset{-1}{\textrm{*}}\;\underset{0}{\textrm{*}}\;\underset{1}{\textrm{The}}\;\underset{2}{\textrm{man}}\;\underset{3}{\textrm{saw}}\;\underset{4}{\textrm{the}}\;\underset{5}{\textrm{dog}}\;\underset{6}{\textrm{with}}\;\underset{7}{\textrm{the}}\;\underset{8}{\textrm{telescope}}\;\]


Quiz: We have a trigram HMM model with the following transition parameters:

and emission parameters:

Say we have the sentence:

the dog barks

What is the value of \(\pi(3, \textrm{N}, \textrm{V})\)?

\[\underset{-1}{\textrm{*}}\;\underset{0}{\textrm{*}}\;\underset{1}{\textrm{the}}\;\underset{2}{\textrm{dog}}\;\underset{3}{\textrm{barks}}\]

\[ \begin{align} &\begin{aligned} r(y_{-1},y_0,y_1,\ldots,y_n) & = \prod_{i=1}^{k} q(y_i|y_{i-2},y_{i-1}) \prod_{i=1}^{k} e(x_i|y_i) \\ r(\textrm{*, *, D, N, V}) & = \left\{ q(\textrm{D | *, *}) \times q(\textrm{N | *, D}) \times q(\textrm{V | D, N}) \right\} \times \\ & \left\{ e(\textrm{the | D}) \times e(\textrm{dog | N}) \times e(\textrm{barks | V}) \right\} \\ & = \left\{ 1 \times 1 \times 1\right\} \times \left\{0.8 \times 0.8 \times 1.0 \right\} \\ & = 0.64 \end{aligned} \end{align} \]


A Recursive Definition

\[\pi(k,u,v) = \underset{w \in S_{k-2}}{\textrm{max}} \left( \pi(k-1,w,u) \times q(v|w,u) \times e(x_k|v) \right)\]

Justification for the Recursive Definition

(part 2)

\[\underset{-1}{\textrm{*}}\;\underset{0}{\textrm{*}}\;\underset{1}{\textrm{The}}\;\underset{2}{\textrm{man}}\;\underset{3}{\textrm{saw}}\;\underset{4}{\textrm{the}}\;\underset{5}{\textrm{dog}}\;\underset{6}{\textrm{with}}\;\underset{7}{\textrm{the}}\;\underset{8}{\textrm{telescope}}\;\]

What is \(\pi(7, P, D)\)?

\[ \begin{align} &\begin{aligned} \pi(7, \textrm{P}, \textrm{D}) = & \underset{w \in \{\textrm{D,N,V,P}\}}{\textrm{max}} \left\{ \pi(6, w, \textrm{P}) \times q(\textrm{D} | w, \textrm{P}) \times e(\textrm{the} | \textrm{D}) \right\} \end{aligned} \end{align} \]


Quiz: assume \(S = \{\textrm{D, N, V, P}\}\) and a trigram HMM with parameters:

We are also given the sentence:

Ella walks to the red house

Say the dynamic programming table for this sentence has the following entries:

What is the value of \(\pi(\textrm{4, P, D})\)?

\[ \begin{align} &\begin{aligned} \pi(k,u,v) = & \underset{w \in S_{k-2}}{\textrm{max}} \left( \pi(k-1,w,u) \times q(v|w,u) \times e(x_k|v) \right) \\ \pi(4, \textrm{P}, \textrm{D}) = & \underset{w \in \{\textrm{D, N, V, P}\}}{\textrm{max}} \left\{ \pi(3, w, \textrm{P}) \times q(\textrm{D} | w, \textrm{P}) \times e(\textrm{the | D}) \right\} \\ = & \textrm{max} \left\{ 0.1 \times 0 \times 0.6, 0.2 \times 0.4 \times 0.6, 0.01 \times 0 \times 0.6, 0.5 \times 0 \times 0.6 \right\} \\ = & 0.048 \end{aligned} \end{align} \]


The Viterbi Algorithm

The Viterbi Algorithm with Backpointers

We want ‘argmax’, not ‘max’, i.e. the actual most-likely tag sequence.
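
A sketch of the algorithm with backpointers, with q and e as tuple-keyed dictionaries as in the earlier HMM sketch (missing entries are treated as probability 0; names are illustrative):

    def viterbi(words, tagset, q, e):
        """Most likely tag sequence y_1..y_n for `words` under a trigram HMM."""
        n = len(words)
        def S(k):  # S_{-1} = S_0 = {*}; S_k = the full tagset for k >= 1
            return ["*"] if k <= 0 else tagset
        pi, bp = {(0, "*", "*"): 1.0}, {}
        for k in range(1, n + 1):
            for u in S(k - 1):
                for v in S(k):
                    best, arg = 0.0, None
                    for w in S(k - 2):
                        score = (pi.get((k - 1, w, u), 0.0) * q.get((w, u, v), 0.0)
                                 * e.get((words[k - 1], v), 0.0))
                        if score > best:
                            best, arg = score, w
                    pi[(k, u, v)], bp[(k, u, v)] = best, arg
        # Best final tag pair, including the STOP transition, then follow the backpointers.
        _, u, v = max(((pi.get((n, u, v), 0.0) * q.get((u, v, "STOP"), 0.0), u, v)
                       for u in S(n - 1) for v in S(n)), key=lambda t: t[0])
        tags = [u, v]
        for k in range(n, 2, -1):
            tags.insert(0, bp[(k, tags[0], tags[1])])
        return tags if n > 1 else tags[1:]

    # The emission values from the quiz above, with assumed transition probabilities of 1:
    q = {("*", "*", "D"): 1.0, ("*", "D", "N"): 1.0, ("D", "N", "V"): 1.0, ("N", "V", "STOP"): 1.0}
    e = {("the", "D"): 0.8, ("dog", "N"): 0.8, ("barks", "V"): 1.0}
    print(viterbi(["the", "dog", "barks"], ["D", "N", "V"], q, e))  # ['D', 'N', 'V']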

Summary

Week 3 - Parsing, and Context-Free Grammars (CFGs)

Parsing (Syntactic Structure)

Syntactic Formalisms

The Information Conveyed By Parse Trees

An Example Application: Machine Translation

Context-Free Grammars

Example CFG for English

Left-Most Derivations

 An Example

| Derivation     | Rules Used   |
|----------------|--------------|
| S              | S -> NP VP   |
| NP VP          | NP -> DT N   |
| DT N VP        | DT -> the    |
| the N VP       | N -> dog     |
| the dog VP     | VP -> VB     |
| the dog VB     | VB -> laughs |
| the dog laughs |              |

 Properties of CFGs

The Problem with Parsing: Ambiguity


Quiz: given “Jon saw Bill in Paris in June”, and grammar:

How many parse trees are there for this sentence?

There are three places to attach the two PPs: the verb, the first noun, the second noun. The five valid derivations are (1) verb, verb, (2) first noun, verb, (3) first noun, second noun, (4) first noun, first noun, (5) verb, second noun.

A brief sketch of the syntax of English

 A Fragment of a Noun Phrase Grammar


An example

and

and



Quiz: given “the fast car mechanic”, there are three parse trees.

  1. [ NP [ D the ] [ \(\bar{N}\) [ JJ fast ] [ \(\bar{N}\) [ NN car ] [ \(\bar{N}\) [ NN mechanic ] ] ] ] ]
  2. [ NP [ D the ] [ \(\bar{N}\) [ \(\bar{N}\) [ JJ fast ] [ \(\bar{N}\) [ NN car ] ] ] [ \(\bar{N}\) [ NN mechanic ] ] ] ]
  3. [ NP [ D the ] [ \(\bar{N}\) [ JJ fast ] [ \(\bar{N}\) [ \(\bar{N}\) [ NN car ] ] [ \(\bar{N}\) [ NN mechanic ] ] ] ] ]

(draw them out to see).

Prepositions and Prepositional Phrases

Verbs, Verb Phrases, and Sentences

PPs Modifying Verb Phrases

Complementizers, and SBARs

More Verbs

Coordination

 We’ve only scratched the surface…

Sources of Ambiguity




Week 3 - Probabilistic Context-Free Grammars (PCFGs)

A Probabilistic Context-Free Grammar (PCFG)

\[\alpha_1 \rightarrow \beta_1, \alpha_2 \rightarrow \beta_2, \ldots, \alpha_n \rightarrow \beta_n\]


| Derivation     | Rules used                | Probability |
|----------------|---------------------------|-------------|
| S              | S \(\rightarrow\) NP VP   | 1.0         |
| NP VP          | NP \(\rightarrow\) DT NN  | 0.3         |
| DT NN VP       | DT \(\rightarrow\) the    | 1.0         |
| the NN VP      | NN \(\rightarrow\) dog    | 0.1         |
| the dog VP     | VP \(\rightarrow\) Vi     | 0.4         |
| the dog Vi     | Vi \(\rightarrow\) laughs | 0.5         |
| the dog laughs |                           |             |

Quiz: consider the following PCFG:

And parse tree:

(S, (NP, (D, the), (N, cat)), (VP, (V, sings)))

Probability:

| Derivation    | Rules used | Probability |
|---------------|------------|-------------|
| S             | S -> NP VP | 0.9         |
| NP VP         | NP -> D N  | 1           |
| D N VP        | D -> the   | 0.8         |
| the N VP      | N -> cat   | 0.5         |
| the cat VP    | VP -> V    | 1           |
| the cat V     | V -> sings | 1           |
| the cat sings |            |             |

Total probability = \(0.9 \times 1 \times 0.8 \times 0.5 \times 1 \times 1 = 0.36\)


Properties of CFGs


Quiz: consider the following PCFG:

Given the sentence “Ted saw Jill in town”, what is the highest probability for any parse tree under this PCFG?

Right now there’s no clever way, just brute force. Also note that just because a PCFG parse tree has a non-zero probability doesn’t mean it’s a valid left-most derivation.

I could only find two valid left-most derivations; intuitively they are:

  1. Ted “saw” “Jill in town”, i.e. Jill was in town when Ted saw her.
  2. Ted “saw Jill” “in town”, i.e. Ted was in town and saw Jill.

Strictly speaking the parse trees and corresponding probabilities are:

(S, (NP, N, Ted), (VP, (V, saw), (NP, (NP, N, Jill), (PP, (P, in), (NP, N, town))))), = 0.00015

(S, (NP, N, Ted), (VP, (VP, (V, saw), (NP, N, Jill)), (PP, (P, in), (NP, N, town)))) = 0.00027

Note that these probabilities do not add up to 1; not all possible PCFG parse-trees are valid left-most derivations.

Data for Parsing Experiments: Treebanks

Deriving a PCFG from a Treebank

\[q_{ML}(\alpha \rightarrow \beta) = \frac{\textrm{Count}(\alpha \rightarrow \beta)}{\textrm{Count}(\alpha)}\]

\[q_{ML}(\textrm{VP} \rightarrow \textrm{Vt NP}) = \frac{\textrm{Count}(\textrm{VP} \rightarrow \textrm{Vt NP})}{\textrm{Count}(\textrm{VP})}\]
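
A sketch of the counting, with trees written as nested tuples in the same style as the parse-tree quiz above (illustrative representation, not the Treebank format itself):

    from collections import defaultdict

    def count_rules(tree, rule_counts, nt_counts):
        """Recursively count rules alpha -> beta and non-terminal occurrences in one tree."""
        label, children = tree[0], tree[1:]
        if len(children) == 1 and isinstance(children[0], str):
            rhs = (children[0],)                      # terminal rule, e.g. D -> the
        else:
            rhs = tuple(child[0] for child in children)
            for child in children:
                count_rules(child, rule_counts, nt_counts)
        nt_counts[label] += 1
        rule_counts[(label, rhs)] += 1

    def q_ml(rule_counts, nt_counts):
        """q_ML(alpha -> beta) = Count(alpha -> beta) / Count(alpha)."""
        return {rule: c / nt_counts[rule[0]] for rule, c in rule_counts.items()}

    rules, nts = defaultdict(int), defaultdict(int)
    count_rules(("S", ("NP", ("D", "the"), ("N", "cat")), ("VP", ("V", "sings"))), rules, nts)
    print(q_ml(rules, nts)[("NP", ("D", "N"))])  # 1.0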


Quiz: given these three parse trees, what’s \(q_{ML}(\textrm{NP} \rightarrow \textrm{NP PP})\)? It’s 2/7.


PCFGs

  1. The rule probabilities define conditional distributions over the different ways of rewriting each non-terminal.
  2. A technical condition on the rule probabilities ensuring that the probability of the derivation terminating in a finite number of steps is 1. (This condition is not really a practical concern).

Parsing with a PCFG

\[\textrm{arg} \underset{t \in T(s)}{\textrm{max}} p(t)\]

Chomsky Normal Form


A Dynamic Programming Algorithm

\[\underset{t \in T(s)}{\textrm{max}} p(t)\]



 An Example

\[\underset{1}{\textrm{the}} \enspace \underset{2}{\textrm{dog}} \enspace \underset{3}{\textrm{saw}} \enspace \underset{4}{\textrm{the}} \enspace \underset{5}{\textrm{man}} \enspace \underset{6}{\textrm{with}} \enspace \underset{7}{\textrm{the}} \enspace \underset{8}{\textrm{telescope}}\]

\(\pi(3, 8, \textrm{VP})\) is all VPs that span words 3 to 8 inclusive.


 A Dynamic Programming Algorithm

\[\pi(i,i,X) = q(X \rightarrow x_i)\]

\[\pi(i,j,X) = \underset{\underset{s \in i \ldots (j-1)}{X \rightarrow Y \enspace Z \in R}}{max} (q(X \rightarrow Y \enspace Z) \times \pi(i,s,Y) \times \pi(s+1,j,Z))\]

An Example

\[\underset{1}{\textrm{the}} \enspace \underset{2}{\textrm{dog}} \enspace \underset{3}{\textrm{saw}} \enspace \underset{4}{\textrm{the}} \enspace \underset{5}{\textrm{man}} \enspace \underset{6}{\textrm{with}} \enspace \underset{7}{\textrm{the}} \enspace \underset{8}{\textrm{telescope}}\]

\(\pi(3,8,\textrm{VP})\)

Suppose we have:

We’re searching over two possible things:

\[ \begin{align} &\begin{aligned} & q(\textrm{VP} \rightarrow \textrm{Vt NP}) \times \pi(3,3,\textrm{Vt}) \times \pi(4,8,\textrm{NP}) \\ & q(\textrm{VP} \rightarrow \textrm{Vt NP}) \times \pi(3,4,\textrm{Vt}) \times \pi(5,8,\textrm{NP}) \\ & \vdots \\ & q(\textrm{VP} \rightarrow \textrm{Vt NP}) \times \pi(3,7,\textrm{Vt}) \times \pi(8,8,\textrm{NP}) \\ & q(\textrm{VP} \rightarrow \textrm{VP PP}) \times \pi(3,3,\textrm{VP}) \times \pi(4,8,\textrm{PP}) \\ & q(\textrm{VP} \rightarrow \textrm{VP PP}) \times \pi(3,4,\textrm{VP}) \times \pi(5,8,\textrm{PP}) \\ & \vdots \\ & q(\textrm{VP} \rightarrow \textrm{VP PP}) \times \pi(3,7,\textrm{VP}) \times \pi(8,8,\textrm{PP}) \end{aligned} \end{align} \]

Justification

 The Full Dynamic Programming Algorithm

Input: a sentence \(s = x_1 \ldots x_n\), a PCFG in Chomsky Normal Form (CNF) \(G = (N, \Sigma, S, R, q)\).

Initialization:

\(\forall \enspace i \in \{1, \ldots, n\}, \forall \enspace X \in N,\)

\[ \begin{equation} \pi(i,i,X) = \begin{cases} q(X \rightarrow x_i), & \textrm{if} \enspace X \rightarrow x_i \in R \\ 0, & \textrm{Otherwise} \end{cases} \end{equation} \]

Algorithm
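
The full pseudocode isn't reproduced here; as a stand-in, a Python sketch of the CKY recursion above, with unary (pre-terminal) rules in `q1[(X, word)]` and binary rules in `q2[(X, Y, Z)]` (illustrative names):

    def cky(words, nonterminals, q1, q2, start="S"):
        """Return (max_{t in T(s)} p(t), best tree), or (0.0, None) if there is no parse."""
        n = len(words)
        pi, bp = {}, {}
        # Initialization: pi(i, i, X) = q(X -> x_i) if that rule is in the grammar, else 0.
        for i in range(1, n + 1):
            for X in nonterminals:
                pi[(i, i, X)] = q1.get((X, words[i - 1]), 0.0)
        # Fill the chart in order of increasing span length, maximising over rules and split points.
        for length in range(1, n):
            for i in range(1, n - length + 1):
                j = i + length
                for X in nonterminals:
                    best, arg = 0.0, None
                    for (A, Y, Z), prob in q2.items():
                        if A != X:
                            continue
                        for s in range(i, j):
                            score = prob * pi.get((i, s, Y), 0.0) * pi.get((s + 1, j, Z), 0.0)
                            if score > best:
                                best, arg = score, (s, Y, Z)
                    pi[(i, j, X)], bp[(i, j, X)] = best, arg

        def build(i, j, X):
            if i == j:
                return (X, words[i - 1])
            s, Y, Z = bp[(i, j, X)]
            return (X, build(i, s, Y), build(s + 1, j, Z))

        best = pi.get((1, n, start), 0.0)
        return (best, build(1, n, start)) if best > 0 else (0.0, None)

    # A toy CNF grammar for "the dog barks" (all rule probabilities 1 for brevity):
    q1 = {("D", "the"): 1.0, ("N", "dog"): 1.0, ("VP", "barks"): 1.0}
    q2 = {("S", "NP", "VP"): 1.0, ("NP", "D", "N"): 1.0}
    print(cky(["the", "dog", "barks"], ["S", "NP", "VP", "D", "N"], q1, q2))
    # (1.0, ('S', ('NP', ('D', 'the'), ('N', 'dog')), ('VP', 'barks')))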



 Summary

 Week 4 - Weaknesses of PCFGs

Lack of sensitivity to lexical information

Lack of sensitivity to structural frequencies


Week 4 - Lexicalized PCFGs

Heads in Context-Free Rules



More about Heads

Rules which Recover Heads: An Example for NPs


e.g.

    NP -> DT NNP NN
    NP -> DT NN NNP*
    NP -> NP PP
    NP -> DT JJ
    NP -> DT

Rules which Recover Heads: An Example for VPs


e.g.

    VP -> Vt NP
    VP -> VP PP

Adding Headwords to Trees

Adding Headwords to Trees (Continued)

Chomsky Normal Form

 Lexicalized Context-Free Grammars in Chomsky Normal Form


 An Example

S(saw) \(\rightarrow_2\) NP(man) VP(saw)
VP(saw) \(\rightarrow_1\) Vt(saw) NP(dog)
NP(man) \(\rightarrow_2\) DT(the) NN(man)
NP(dog) \(\rightarrow_2\) DT(the) NN(dog)
Vt(saw) \(\rightarrow\) saw
DT(the) \(\rightarrow\) the
NN(man) \(\rightarrow\) man
NN(dog) \(\rightarrow\) dog

Parameters in a Lexicalized PCFG

\[q(\textrm{S} \rightarrow \textrm{NP VP})\]

\[q(\textrm{S(saw)} \rightarrow_2 \textrm{NP(man) VP(saw)})\]

Parsing with Lexicalized CFGs

A Model from Charniak (1997)

\[q(\textrm{S(saw)} \rightarrow_2 \textrm{NP(man) VP(saw)})\]

\[ \begin{align} &\begin{aligned} & q(\textrm{S(saw)} \rightarrow_2 \textrm{NP(man) VP(saw)}) \\ = & q(\textrm{S} \rightarrow_2 \textrm{NP VP|S, saw}) \times q(\textrm{man|S} \rightarrow_2 \textrm{NP VP, saw}) \end{aligned} \end{align} \]

\[ \begin{align} &\begin{aligned} & q(\textrm{S} \rightarrow_2 \textrm{NP VP | S, saw}) \\ = & \lambda_1 \times q_{ML}(\textrm{S} \rightarrow_2 \textrm{NP VP | S, saw}) + \lambda_2 \times q_{ML} (S \rightarrow_2 \textrm{NP VP | S}) \\ \\ & q(\textrm{man | S} \rightarrow_2 \textrm{NP VP, saw}) \\ = & \lambda_3 \times q_{ML} (\textrm{man | S} \rightarrow_2 \textrm{NP VP, saw}) + \lambda_4 \times q_{ML}(\textrm{man | S} \rightarrow_2 \textrm{NP VP}) + \lambda_5 \times q_{ML} (\textrm{man | NP}) \end{aligned} \end{align} \]

Other Important Details

Evaluation: Representing Trees as Constituents

| Label | Start Point | End Point |
|-------|-------------|-----------|
| NP    | 1           | 2         |
| NP    | 4           | 5         |
| VP    | 3           | 5         |
| S     | 1           | 5         |

 Precision and Recall
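
A sketch of labelled precision and recall over constituents represented as (label, start, end) triples, as in the table above (the predicted set here is made up for illustration):

    def precision_recall(predicted, gold):
        """A constituent is correct if its (label, start, end) triple appears in both trees."""
        predicted, gold = set(predicted), set(gold)
        correct = len(predicted & gold)
        precision = correct / len(predicted) if predicted else 0.0
        recall = correct / len(gold) if gold else 0.0
        return precision, recall

    gold = {("NP", 1, 2), ("NP", 4, 5), ("VP", 3, 5), ("S", 1, 5)}
    predicted = {("NP", 1, 2), ("NP", 4, 5), ("PP", 3, 5), ("S", 1, 5)}  # hypothetical parser output
    print(precision_recall(predicted, gold))  # (0.75, 0.75)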

Results

Evaluation: Dependencies

 Strengths and Weaknesses of Modern Parsers

(Numbers taken from Collins (2003))

Summary

Dependency Accuracies

Readings

Speech and Language Processing, Chapter 3 (Words and Transducers)

3.9: Word and Sentence Tokenization

Speech and Language Processing, Chapter 4 (n-gram models)


p111: Good-Turing Discounting

\[N_c = \sum_{x\;:\;\textrm{Count(x)} = c} 1\]

\[c^* = (c+1)\frac{N_{c+1}}{N_c}\]

\[P_{GT}^{*}\;\textrm{(things with frequency zero in training)} = \frac{N_1}{N}\]
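
A sketch of these formulas (no smoothing of the \(N_c\) values, which a practical implementation needs once the counts become large and sparse):

    from collections import Counter

    def good_turing(counts):
        """Return the discounted counts c* = (c + 1) * N_{c+1} / N_c and the
        probability mass N_1 / N reserved for unseen events."""
        n_c = Counter(counts.values())      # N_c = number of types seen exactly c times
        total = sum(counts.values())        # N = total number of tokens
        c_star = {x: (c + 1) * n_c.get(c + 1, 0) / n_c[c] for x, c in counts.items()}
        return c_star, n_c.get(1, 0) / total

    # Items seen 1, 1, 2 and 3 times; note the most frequent item gets c* = 0 here,
    # which is exactly why N_c is smoothed in practice.
    print(good_turing({"a": 1, "b": 1, "c": 2, "d": 3}))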



\[p_1 \times p_2 \times p_3 \times p_4 = \exp(\log p_1 + \log p_2 + \log p_3 + \log p_4)\]
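
In practice this matters because the product of many small probabilities underflows in floating point long before the sum of their logs does; a tiny illustration:

    import math

    probs = [1e-5] * 100
    naive = math.prod(probs)                      # underflows to 0.0
    log_total = sum(math.log(p) for p in probs)   # about -1151.3, perfectly representable
    print(naive, log_total)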


ARPA language model (LM) file format

Example:

    \data\
    ngram 1=19979
    ngram 2=4987955
    ngram 3=6136155

    \1-grams:
    -1.6682  A      -2.2371
    -5.5975  A'S    -0.2818
    -2.8755  A.     -1.1409
    -4.3297  A.'S   -0.5886
    -5.1432  A.S    -0.4862
    ...

    \2-grams:
    -3.4627  A  BABY    -0.2884
    -4.8091  A  BABY'S  -0.1659
    -5.4763  A  BACH    -0.4722
    -3.6622  A  BACK    -0.8814
    ...

    \3-grams:
    -4.3813  !SENT_START    A       CAMBRIDGE
    -4.4782  !SENT_START    A       CAMEL
    -4.0196  !SENT_START    A       CAMERA
    -4.9004  !SENT_START    A       CAMP
    -3.4319  !SENT_START    A       CAMPAIGN
    ...
    \end\
\end\