Michael Collins' NLP Class Note 1
This article is a collection of notes for Michael Collins’ NLP Class on Coursera.
Week1
SECTION1 Intro
1.1 The Definition of Natural Language Processing (NLP)
Computers using natural language as input and/or output.
- Natural Language Understanding (NLU)
- Natural Language Generation (NLG)
1.2 Applications of Natural Language Processing
- machine translation
- information extraction
- text summarization
- dialogue systems
1.3 Major Problems of NLP
- Tagging problems (part-of-speech tagging, named-entity recognition)
- Parsing problems
1.4 Sources of Ambiguity (where the difficulties lie)
- Acoustic
- Syntactic
- Semantic
- Discourse
SECTION2 Language Model
2.1 Formal Definition of Vocabulary and Sentence
- Vocabulary set $\mathcal{V}$: all words in the language, finite.
- Sentence: a sequence of words $x_1x_2…x_n$, where $n$ is an integer with $n\ge1$, $x_i\in\mathcal{V}$ for $i \in \{1,…,n-1\}$, and $x_n$ is a special symbol, STOP.
- Sentence set $\mathcal{V}^\dagger$: all sentences with the vocabulary $\mathcal{V}$, infinite.
(KEY) 2.2 Defining Language Model
- Two major parts: a finite set $\mathcal{V}$ and a function $p(x_1,x_2,…,x_n)$
- $\forall\langle x_1,…,x_n\rangle\in\mathcal{V}^\dagger$, $p(x_1,x_2,…,x_n)\ge0$, and $\sum_{\langle x_1,…,x_n\rangle\in\mathcal{V}^\dagger}p(x_1,x_2,…,x_n)=1$.
Intuitively, a language model tries to capture how words and sentences are distributed in a training corpus.
- A naive way to obtain this distribution is maximum-likelihood estimation over whole sentences: $p(x_1,…,x_n)=\frac{c(x_1,…,x_n)}{N}$, where $c(x_1,…,x_n)$ is the number of times the sentence appears in the training corpus and $N$ is the total number of training sentences. This assigns probability zero to any unseen sentence, which motivates the models below.
2.3 Markov Processes
- Chain rule for the joint probability: $p(x_1,x_2,…,x_n)=\prod_{i=1}^{n}p(x_i\mid x_1,…,x_{i-1})$
- First-order Markov assumption: $p(x_i\mid x_1,…,x_{i-1})=p(x_i\mid x_{i-1})$. Hence, $p(x_1,…,x_n)=\prod_{i=1}^{n}p(x_i\mid x_{i-1})$
- Second-order Markov assumption: $p(x_i\mid x_1,…,x_{i-1})=p(x_i\mid x_{i-2},x_{i-1})$. Hence, $p(x_1,…,x_n)=\prod_{i=1}^{n}p(x_i\mid x_{i-2},x_{i-1})$
- General (second-order) form: $p(x_1,…,x_n)=\prod_{i=1}^{n}p(x_i\mid x_{i-2},x_{i-1})$, where $x_0=x_{-1}=*$, a special “start” symbol.
- Extending to variable-length sequences: the length $n$ itself becomes a random variable, which is handled by requiring $x_n=\mbox{STOP}$.
(KEY) 2.4 Trigram Language Models
- Parameters: $q(w\mid u,v)$ for each trigram $(u,v,w)$ such that $w\in\mathcal{V}\cup\{\mbox{STOP}\}$ and $u,v\in\mathcal{V}\cup\{*\}$.
- The probability of a sentence (worked example below): $p(x_1,…,x_n)=\prod_{i=1}^{n}q(x_i\mid x_{i-2},x_{i-1})$, where $x_0=x_{-1}=*$ and $x_n=\mbox{STOP}$.
- Key problem: how to estimate the parameters $q(w\mid u,v)$.
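As a quick worked example (the sentence “the dog barks” is chosen only for illustration), applying the trigram model definition to a three-word sentence gives:

$$
p(\mbox{the dog barks STOP}) = q(\mbox{the}\mid *,*)\times q(\mbox{dog}\mid *,\mbox{the})\times q(\mbox{barks}\mid \mbox{the},\mbox{dog})\times q(\mbox{STOP}\mid \mbox{dog},\mbox{barks})
$$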
2.5 Using the Maximum-Likelihood Method to Estimate the Parameters
- Estimating formula (a counting sketch follows this list): $q(w\mid u,v)=\dfrac{\mbox{Count}(u,v,w)}{\mbox{Count}(u,v)}$
- Problems:
  - the number of parameters is very large (with $N=|\mathcal{V}|$, there are roughly $N^3$ trigram parameters)
  - the numerator $\mbox{Count}(u,v,w)$ is often zero, giving probability zero to many reasonable trigrams
  - the denominator $\mbox{Count}(u,v)$ can be zero, leaving the estimate undefined
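A minimal Python sketch of this maximum-likelihood estimate, assuming the training corpus is given as a list of tokenized sentences (the function name and data format are illustrative assumptions, not from the course):

```python
from collections import defaultdict

def train_trigram_mle(sentences):
    """Estimate q(w | u, v) = Count(u, v, w) / Count(u, v) from tokenized sentences."""
    trigram_counts = defaultdict(int)   # Count(u, v, w)
    bigram_counts = defaultdict(int)    # Count(u, v)
    for sentence in sentences:
        # Pad with the start symbol * and terminate with STOP.
        tokens = ["*", "*"] + sentence + ["STOP"]
        for u, v, w in zip(tokens, tokens[1:], tokens[2:]):
            trigram_counts[(u, v, w)] += 1
            bigram_counts[(u, v)] += 1

    def q(w, u, v):
        if bigram_counts[(u, v)] == 0:
            return 0.0  # undefined in the pure MLE; smoothing (Section 3) addresses this
        return trigram_counts[(u, v, w)] / bigram_counts[(u, v)]

    return q

# Usage: q = train_trigram_mle([["the", "dog", "barks"]]); q("dog", "*", "the") -> 1.0
```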
2.6 Evaluation Metric: Perplexity
- Use held-out data to test the model’s ability to deal with unseen data.
- What perplexity measures: the model’s effective vocabulary size, i.e. the average branching factor at each word. The smaller it is, the better the model performs.
For example, if the test corpus consists of a single sentence $x^{(1)}$ containing just one word (so $M=1$) with $p(x^{(1)})=0.1$, the perplexity is $2^{-\log_2 0.1}=10$: on average the model is as uncertain as if it were choosing uniformly among ten words at each position, so its effective vocabulary size is ten.
- Definition of perplexity (a small computation sketch follows): $\mbox{perplexity}=2^{-l}$, where $l=\dfrac{1}{M}\sum_{i=1}^{m}\log_2 p(x^{(i)})$, $m$ is the number of sentences in the test corpus, and $M$ is the total number of words in it.
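A small Python sketch of the perplexity computation, assuming a `sentence_prob` function such as the trigram model above; whether STOP is counted in $M$ is a convention, and here it is (names are illustrative):

```python
import math

def perplexity(sentences, sentence_prob):
    """Compute 2^{-l}, where l is the average per-word log2 probability on held-out data."""
    total_log_prob = 0.0
    total_words = 0  # M: total number of words in the test corpus (counting STOP)
    for sentence in sentences:
        p = sentence_prob(sentence)
        if p == 0.0:
            return float("inf")  # any zero-probability sentence gives infinite perplexity
        total_log_prob += math.log2(p)
        total_words += len(sentence) + 1  # +1 for the STOP symbol
    l = total_log_prob / total_words
    return 2.0 ** (-l)
```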
SECTION3 Smoothed Estimation of Trigram Models
3.1 Linear Interpolation
- Core idea: using a linear combination of probabilities to estimate trigram probability. where
- Define the loss function:
- To choose the parameter $\lambda_1,\lambda_2,\lambda_3$:
- A practical way of choosing these $\lambda$s (bucketing):
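A minimal Python sketch of linear interpolation, assuming the three maximum-likelihood estimators are already available as functions (all names and the example $\lambda$ values are illustrative assumptions):

```python
def interpolated_q(w, u, v, q_ml_trigram, q_ml_bigram, q_ml_unigram,
                   lambdas=(0.5, 0.3, 0.2)):
    """Smoothed trigram estimate as a convex combination of ML estimates."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9 and min(lambdas) >= 0.0
    return (l1 * q_ml_trigram(w, u, v)
            + l2 * q_ml_bigram(w, v)
            + l3 * q_ml_unigram(w))
```

In practice the $\lambda$s are not fixed constants but are chosen on development data, e.g. with the bucketing scheme described above.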
3.2 Discounting Methods
- Discounted counts:
- Missing Probability:
- Two sets:
- Bigram:
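A hedged Python sketch of the discounted bigram (Katz back-off) estimate, assuming bigram and unigram count dictionaries built from training data; edge cases such as unseen contexts are ignored (names are illustrative):

```python
def katz_bigram_q(w, v, bigram_counts, unigram_counts, total_words, beta=0.5):
    """Katz back-off estimate q_BO(w | v) with absolute discounting."""
    count_v = sum(c for (u, _x), c in bigram_counts.items() if u == v)  # Count(v)
    seen = {x for (u, x) in bigram_counts if u == v}                    # A(v): words seen after v

    def q_ml(x):
        return unigram_counts.get(x, 0) / total_words                   # unigram ML estimate

    if w in seen:
        # Discounted count divided by Count(v).
        return (bigram_counts[(v, w)] - beta) / count_v

    # Missing probability mass alpha(v), redistributed over unseen words in proportion to q_ML.
    alpha_v = 1.0 - sum((bigram_counts[(v, x)] - beta) / count_v for x in seen)
    unseen_mass = sum(q_ml(x) for x in unigram_counts if x not in seen)  # B(v), restricted to known words
    return alpha_v * q_ml(w) / unseen_mass
```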
Week2
Applying Generative Models to Tagging Problems
The three sub-problems and their solutions (originally shown as a flowchart):
- Probabilities -> Trigram HMMs
- Estimation -> Maximum likelihood
- Maximization -> The Viterbi Algorithm
SECTION4 Elaborating Tagging Problems
4.1 Definition of Tagging Problems
From a set of training examples $(x^{(i)},y^{(i)})$ for $i = 1,…,m$, where each $x^{(i)}$ is a sentence $x^{(i)}_1…x^{(i)}_{n_i}$ and each $y^{(i)}$ is a tag sequence $y^{(i)}_1…y^{(i)}_{n_i}$ (the length $n_i$ may differ from example to example), the task is to learn a function (more formally, a mapping) from sentences to tag sequences.
4.2 Two major classes of Tagging Problems
- Part-of-Speech Tagging
- Named-Entity Recognition
4.3 Two Sources of Constraints
- Local: for a given word, some tags are inherently much more probable than others.
- Contextual: given the surrounding words and tags, some tags become much more probable than others.
- Sometimes the two sources of constraints conflict.
SECTION5 Generative Models
5.1 Introducing the Supervised Learning
(omitted)
5.2 Discriminative Models
- Estimate $p(y \mid x)$ directly.
- Learn a conditional model $p(y \mid x)$ from the training examples.
- Use the following function $f(x)$ to predict: $f(x)=\arg\max_{y}p(y\mid x)$.
5.3 Generative Models (Noisy Channel Model)
- Estimate the joint probability $p(x,y)$ instead.
- Learn a joint model from the training examples, typically decomposed as $p(x,y)=p(y)\,p(x\mid y)$.
- Bayes’ rule: $p(y\mid x)=\dfrac{p(y)\,p(x\mid y)}{p(x)}$, where $p(x)=\sum_{y}p(y)\,p(x\mid y)$.
- Use the following function $f(x)$ to predict: $f(x)=\arg\max_{y}p(y\mid x)=\arg\max_{y}p(y)\,p(x\mid y)$, since $p(x)$ does not depend on $y$.
(KEY) 5.4 Generative Tagging Models
- Learning Process: learn a probability distribution $p$:
- $\forall \ s=\langle x_1…x_n,y_1…y_n \rangle\in\mathcal{S},p(s)\geq0$
- $\sum_{s\in\mathcal{S}}p(s)=1$
- Decision process: use the following function $f$ to make a decision: $f(x_1…x_n)=\arg\max_{y_1…y_n}p(x_1…x_n,y_1…y_n)$.
- Notation: the $\arg\max$ ranges over all possible tag sequences $y_1…y_n$ for the given sentence.
5.5 Major Problems with Generative Models
- How to model the joint probability $p(x_1…x_n,y_1…y_n)$?
- How to estimate the parameters of the model from training examples?
- How to efficiently find the maximum in the decision process?
SECTION6 Solving the Problems in Generative Models
6.1 Trigram Hidden Markov Models -> Probability
- A trigram HMM contains a finite set $\mathcal{V}$ of all possible words and a finite set $\mathcal{K}$ of all possible tags. Define $\mathcal{S}$ to be the set of all sentence/tag-sequence pairs $s=\langle x_1…x_n,y_1…y_{n+1} \rangle$, and note that $y_{n+1}=\mbox{STOP}$.
- Parameter 1 (transition): $q(s\mid u,v)$ for any trigram $(u,v,s)$ such that $s\in\mathcal{K}\cup\{\mbox{STOP}\}$ and $u,v\in\mathcal{K}\cup\{*\}$.
- Parameter 2 (emission): $e(x\mid s)$ for any $x\in\mathcal{V}$ and $s\in\mathcal{K}$.
- Formula: for any $s=\langle x_1…x_n,y_1…y_{n+1} \rangle\in\mathcal{S}$, $p(x_1…x_n,y_1…y_{n+1})=\prod_{i=1}^{n+1}q(y_i\mid y_{i-2},y_{i-1})\prod_{i=1}^{n}e(x_i\mid y_i)$, where $y_0=y_{-1}=*$ (see the sketch after this list).
- The derivation of the formula is essentially the same as for the noisy-channel model shown in Section 5.3, plus two independence assumptions (Equation 2.7 in the note): each tag depends only on the two preceding tags, and each word depends only on its own tag, ignoring the preceding words and all other tags.
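A minimal Python sketch of the joint probability above, assuming `q` and `e` are dictionaries of transition and emission parameters keyed by tuples; the data structures are illustrative assumptions:

```python
def hmm_joint_prob(words, tags, q, e):
    """p(x_1..x_n, y_1..y_{n+1}) for a trigram HMM; tags must end with 'STOP'."""
    assert tags[-1] == "STOP" and len(tags) == len(words) + 1
    padded = ["*", "*"] + list(tags)                 # y_{-1} = y_0 = *
    prob = 1.0
    for i in range(2, len(padded)):                  # transition terms q(y_i | y_{i-2}, y_{i-1})
        prob *= q.get((padded[i - 2], padded[i - 1], padded[i]), 0.0)
    for word, tag in zip(words, tags):               # emission terms e(x_i | y_i)
        prob *= e.get((tag, word), 0.0)
    return prob
```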
6.2 Maximum Likelihood Method -> Estimating Parameters
Just count occurrences: $q(s\mid u,v)=\dfrac{\mbox{Count}(u,v,s)}{\mbox{Count}(u,v)}$ and $e(x\mid s)=\dfrac{\mbox{Count}(s\rightsquigarrow x)}{\mbox{Count}(s)}$, where $\mbox{Count}(s\rightsquigarrow x)$ is the number of times word $x$ appears tagged with $s$ in the training data.
The smoothing techniques shown above in Section 3 can be applied here as well (see the counting sketch below).
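A short Python sketch of these counts, assuming the training data is a list of `(words, tags)` pairs as in Section 4.1 (the data format is an illustrative assumption):

```python
from collections import defaultdict

def train_hmm_mle(tagged_sentences):
    """Estimate q(s | u, v) and e(x | s) by counting over (words, tags) training pairs."""
    tag_trigrams, tag_bigrams = defaultdict(int), defaultdict(int)
    emissions, tag_counts = defaultdict(int), defaultdict(int)
    for words, tags in tagged_sentences:
        padded = ["*", "*"] + list(tags) + ["STOP"]
        for u, v, s in zip(padded, padded[1:], padded[2:]):
            tag_trigrams[(u, v, s)] += 1
            tag_bigrams[(u, v)] += 1
        for x, s in zip(words, tags):
            emissions[(s, x)] += 1
            tag_counts[s] += 1
    q = {k: c / tag_bigrams[k[:2]] for k, c in tag_trigrams.items()}  # q(s | u, v)
    e = {k: c / tag_counts[k[0]] for k, c in emissions.items()}       # e(x | s)
    return q, e
```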
6.3 The Viterbi Algorithm -> Finding the Maximum
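The note leaves this section without details; below is a hedged sketch of the standard Viterbi dynamic program for trigram HMMs, reusing the `q` and `e` dictionaries produced by the counting sketch above (structures and names are illustrative, and zero-probability edge cases are ignored):

```python
def viterbi(words, tags_set, q, e):
    """Return the most likely tag sequence for `words` under a trigram HMM."""
    n = len(words)

    def K(k):                       # possible tags at position k (k <= 0 is the start symbol)
        return {"*"} if k <= 0 else tags_set

    pi = {(0, "*", "*"): 1.0}       # pi[k, u, v]: best prob of a tag sequence ending in (u, v) at k
    bp = {}                         # back-pointers: the best y_{k-2} for each (k, u, v)
    for k in range(1, n + 1):
        for u in K(k - 1):
            for v in K(k):
                best_w, best_p = None, 0.0
                for w in K(k - 2):
                    p = (pi.get((k - 1, w, u), 0.0)
                         * q.get((w, u, v), 0.0)
                         * e.get((v, words[k - 1]), 0.0))
                    if p > best_p:
                        best_w, best_p = w, p
                pi[(k, u, v)] = best_p
                bp[(k, u, v)] = best_w

    # Termination: include the q(STOP | u, v) factor and recover the best final tag bigram.
    best_uv, best_p = None, 0.0
    for u in K(n - 1):
        for v in K(n):
            p = pi.get((n, u, v), 0.0) * q.get((u, v, "STOP"), 0.0)
            if p > best_p:
                best_uv, best_p = (u, v), p

    # Follow back-pointers from position n down to 1.
    tags = [None] * (n + 1)
    tags[n - 1], tags[n] = best_uv
    for k in range(n - 2, 0, -1):
        tags[k] = bp[(k + 2, tags[k + 1], tags[k + 2])]
    return tags[1:]
```

Each entry $\pi(k,u,v)$ stores the best probability of any tag sequence whose last two tags at position $k$ are $(u,v)$, so the runtime is $O(n\,|\mathcal{K}|^3)$ rather than exponential in $n$.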
LAST UPDATED ON Feb 17, 2017