Michael Collins' NLP Class Note 1
This article is a collection of notes for Michael Collins’ NLP Class on Coursera.
Week1
SECTION1 Intro
1.1 The Definition of Natural Language Processing (NLP)
Computers using natural language as input and/or output.
- Natural Language Understanding (NLU)
- Natural Language Generation (NLG)
1.2 Applications of Natural Language Processing
- machine translation
- information extraction
- text summarization
- dialogue systems
1.3 Major Problems of NLP
- Tagging problems (part-of-speech tagging, named-entity recognition)
- Parsing problems
1.4 Sources of Ambiguity (where the difficulties lie)
- Acoustic
- Syntactic
- Semantic
- Discourse
SECTION2 Language Model
2.1 Formal Definition of Vocabulary and Sentence
- Vocabulary set $\mathcal{V}$: all words in the language, finite.
- Sentence: a sequence of words $x_1x_2…x_n$, where $n$ is an integer with $n\ge1$, $x_i\in\mathcal{V}$ for $i \in \{1,…,n-1\}$, and $x_n$ is a special symbol, STOP.
- Sentence set $\mathcal{V}^\dagger$: all sentences with the vocabulary $\mathcal{V}$, infinite.
(KEY) 2.2 Defining Language Model
- Two major parts: a finite set $\mathcal{V}$ and a function $p(x_1,x_2,…,x_n)$
- $\forall\langle x_1,…,x_n\rangle\in\mathcal{V}^\dagger$, $p(x_1,x_2,…,x_n)\ge0$, and $\sum_{\langle x_1,…,x_n\rangle\in\mathcal{V}^\dagger}p(x_1,x_2,…,x_n)=1$.
Intuitively, a language model tries to capture how words and sentences are distributed in a training corpus.
- A naive way to obtain this distribution is maximum-likelihood estimation over whole sentences: $p(x_1,…,x_n)=\frac{c(x_1,…,x_n)}{N}$, where $c(x_1,…,x_n)$ is the number of times the sentence appears in the training corpus and $N$ is the total number of training sentences. This assigns probability zero to any unseen sentence, which motivates the models below.
2.3 Markov Processes
- Chain rule for the joint probability: $p(x_1,x_2,…,x_n)=\prod_{i=1}^{n}p(x_i\mid x_1,…,x_{i-1})$
- First-order Markov assumption: $p(x_i\mid x_1,…,x_{i-1})=p(x_i\mid x_{i-1})$. Hence, $p(x_1,…,x_n)=\prod_{i=1}^{n}p(x_i\mid x_{i-1})$
- Second-order Markov assumption: $p(x_i\mid x_1,…,x_{i-1})=p(x_i\mid x_{i-2},x_{i-1})$. Hence, $p(x_1,…,x_n)=\prod_{i=1}^{n}p(x_i\mid x_{i-2},x_{i-1})$
- General (second-order) form: $p(x_1,…,x_n)=\prod_{i=1}^{n}p(x_i\mid x_{i-2},x_{i-1})$, where $x_0=x_{-1}=*$, a special “start” symbol.
- Extending to variable-length sequences: the length $n$ itself becomes a random variable, which is handled by requiring $x_n=\mbox{STOP}$.
(KEY) 2.4 Trigram Language Models
- Parameters: $q(w\mid u,v)$ for each trigram $(u,v,w)$ such that $w\in\mathcal{V}\cup\{\mbox{STOP}\}$ and $u,v\in\mathcal{V}\cup\{*\}$.
- The probability of a sentence (worked example below): $p(x_1,…,x_n)=\prod_{i=1}^{n}q(x_i\mid x_{i-2},x_{i-1})$, where $x_0=x_{-1}=*$ and $x_n=\mbox{STOP}$.
- Key problem: how to estimate the parameters $q(w\mid u,v)$.
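As a quick worked example (the sentence “the dog barks” is chosen only for illustration), applying the trigram model definition to a three-word sentence gives:

$$
p(\mbox{the dog barks STOP}) = q(\mbox{the}\mid *,*)\times q(\mbox{dog}\mid *,\mbox{the})\times q(\mbox{barks}\mid \mbox{the},\mbox{dog})\times q(\mbox{STOP}\mid \mbox{dog},\mbox{barks})
$$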
2.5 Using the Maximum-Likelihood Method to Estimate the Parameters
- Estimating formula (a counting sketch follows this list): $q(w\mid u,v)=\dfrac{\mbox{Count}(u,v,w)}{\mbox{Count}(u,v)}$
- Problems:
  - the number of parameters is very large (with $N=|\mathcal{V}|$, there are roughly $N^3$ trigram parameters)
  - the numerator $\mbox{Count}(u,v,w)$ is often zero, giving probability zero to many reasonable trigrams
  - the denominator $\mbox{Count}(u,v)$ can be zero, leaving the estimate undefined
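A minimal Python sketch of this maximum-likelihood estimate, assuming the training corpus is given as a list of tokenized sentences (the function name and data format are illustrative assumptions, not from the course):

```python
from collections import defaultdict

def train_trigram_mle(sentences):
    """Estimate q(w | u, v) = Count(u, v, w) / Count(u, v) from tokenized sentences."""
    trigram_counts = defaultdict(int)   # Count(u, v, w)
    bigram_counts = defaultdict(int)    # Count(u, v)
    for sentence in sentences:
        # Pad with the start symbol * and terminate with STOP.
        tokens = ["*", "*"] + sentence + ["STOP"]
        for u, v, w in zip(tokens, tokens[1:], tokens[2:]):
            trigram_counts[(u, v, w)] += 1
            bigram_counts[(u, v)] += 1

    def q(w, u, v):
        if bigram_counts[(u, v)] == 0:
            return 0.0  # undefined in the pure MLE; smoothing (Section 3) addresses this
        return trigram_counts[(u, v, w)] / bigram_counts[(u, v)]

    return q

# Usage: q = train_trigram_mle([["the", "dog", "barks"]]); q("dog", "*", "the") -> 1.0
```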
2.6 Evaluation Metric: Perplexity
- Use held-out data to test the model’s ability to deal with unseen data.
- What perplexity measures: the model’s effective vocabulary size, i.e. the average branching factor at each word. The smaller it is, the better the model performs.
For example, if the test corpus consists of a single sentence $x^{(1)}$ containing just one word (so $M=1$) with $p(x^{(1)})=0.1$, the perplexity is $2^{-\log_2 0.1}=10$: on average the model is as uncertain as if it were choosing uniformly among ten words at each position, so its effective vocabulary size is ten.
- Definition of perplexity (a small computation sketch follows): $\mbox{perplexity}=2^{-l}$, where $l=\dfrac{1}{M}\sum_{i=1}^{m}\log_2 p(x^{(i)})$, $m$ is the number of sentences in the test corpus, and $M$ is the total number of words in it.
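A small Python sketch of the perplexity computation, assuming a `sentence_prob` function such as the trigram model above; whether STOP is counted in $M$ is a convention, and here it is (names are illustrative):

```python
import math

def perplexity(sentences, sentence_prob):
    """Compute 2^{-l}, where l is the average per-word log2 probability on held-out data."""
    total_log_prob = 0.0
    total_words = 0  # M: total number of words in the test corpus (counting STOP)
    for sentence in sentences:
        p = sentence_prob(sentence)
        if p == 0.0:
            return float("inf")  # any zero-probability sentence gives infinite perplexity
        total_log_prob += math.log2(p)
        total_words += len(sentence) + 1  # +1 for the STOP symbol
    l = total_log_prob / total_words
    return 2.0 ** (-l)
```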
SECTION3 Smoothed Estimation of Trigram Models
3.1 Linear Interpolation
- Core idea: using a linear combination of probabilities to estimate trigram probability. where
- Define the loss function:
- To choose the parameter $\lambda_1,\lambda_2,\lambda_3$:
- A practical way of choosing these $\lambda$s (bucketing):
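A minimal Python sketch of linear interpolation, assuming the three maximum-likelihood estimators are already available as functions (all names and the example $\lambda$ values are illustrative assumptions):

```python
def interpolated_q(w, u, v, q_ml_trigram, q_ml_bigram, q_ml_unigram,
                   lambdas=(0.5, 0.3, 0.2)):
    """Smoothed trigram estimate as a convex combination of ML estimates."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9 and min(lambdas) >= 0.0
    return (l1 * q_ml_trigram(w, u, v)
            + l2 * q_ml_bigram(w, v)
            + l3 * q_ml_unigram(w))
```

In practice the $\lambda$s are not fixed constants but are chosen on development data, e.g. with the bucketing scheme described above.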
3.2 Discounting Methods
- Discounted counts:
- Missing Probability:
- Two sets:
- Bigram:
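A hedged Python sketch of the discounted bigram (Katz back-off) estimate, assuming bigram and unigram count dictionaries built from training data; edge cases such as unseen contexts are ignored (names are illustrative):

```python
def katz_bigram_q(w, v, bigram_counts, unigram_counts, total_words, beta=0.5):
    """Katz back-off estimate q_BO(w | v) with absolute discounting."""
    count_v = sum(c for (u, _x), c in bigram_counts.items() if u == v)  # Count(v)
    seen = {x for (u, x) in bigram_counts if u == v}                    # A(v): words seen after v

    def q_ml(x):
        return unigram_counts.get(x, 0) / total_words                   # unigram ML estimate

    if w in seen:
        # Discounted count divided by Count(v).
        return (bigram_counts[(v, w)] - beta) / count_v

    # Missing probability mass alpha(v), redistributed over unseen words in proportion to q_ML.
    alpha_v = 1.0 - sum((bigram_counts[(v, x)] - beta) / count_v for x in seen)
    unseen_mass = sum(q_ml(x) for x in unigram_counts if x not in seen)  # B(v), restricted to known words
    return alpha_v * q_ml(w) / unseen_mass
```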
Week2
Applying Generative Models to Tagging Problems
The three sub-problems and their solutions (originally shown as a flowchart):
- Probabilities -> Trigram HMMs
- Estimation -> Maximum likelihood
- Maximization -> The Viterbi Algorithm
SECTION4 Elaborating Tagging Problems
4.1 Definition of Tagging Problems
From a set of training examples $(x^{(i)},y^{(i)})$ for $i = 1,…,m$, where each $x^{(i)}$ is a sentence $x^{(i)}_1…x^{(i)}_{n_i}$ and each $y^{(i)}$ is a tag sequence $y^{(i)}_1…y^{(i)}_{n_i}$ (the length $n_i$ may differ from example to example), the task is to learn a function (more formally, a mapping) from sentences to tag sequences.
4.2 Two major classes of Tagging Problems
- Part-of-Speech Tagging
- Named-Entity Recognition
4.3 Two Sources of Constraints
- Local: for a given word, some tags are inherently much more probable than others.
- Contextual: given the surrounding words and tags, some tags become much more probable than others.
- Sometimes the two sources of constraints conflict.
SECTION5 Generative Models
5.1 Introducing the Supervised Learning
(omitted)
5.2 Discriminative Models
- Estimate $p(y \mid x)$ directly.
- Learn a conditional model $p(y \mid x)$ from the training examples.
- Use the following function $f(x)$ to predict: $f(x)=\arg\max_{y}p(y\mid x)$.
5.3 Generative Models (Noisy Channel Model)
- Estimate the joint probability $p(x,y)$ instead.
- Learn a joint model from the training examples, typically decomposed as $p(x,y)=p(y)\,p(x\mid y)$.
- Bayes’ rule: $p(y\mid x)=\dfrac{p(y)\,p(x\mid y)}{p(x)}$, where $p(x)=\sum_{y}p(y)\,p(x\mid y)$.
- Use the following function $f(x)$ to predict: $f(x)=\arg\max_{y}p(y\mid x)=\arg\max_{y}p(y)\,p(x\mid y)$, since $p(x)$ does not depend on $y$.
(KEY) 5.4 Generative Tagging Models
- Learning Process: learn a probability distribution $p$:
- $\forall \ s=\langle x_1…x_n,y_1…y_n \rangle\in\mathcal{S},p(s)\geq0$
- $\sum_{s\in\mathcal{S}}p(s)=1$
- Decision process: use the following function $f$ to make a decision: $f(x_1…x_n)=\arg\max_{y_1…y_n}p(x_1…x_n,y_1…y_n)$.
- Notation: the $\arg\max$ ranges over all possible tag sequences $y_1…y_n$ for the given sentence.
5.5 Major Problems with Generative Models
- How to model the joint probability $p(x_1…x_n,y_1…y_n)$?
- How to estimate the parameters of the model from training examples?
- How to efficiently find the maximum in the decision process?
SECTION6 Solving the Problems in Generative Models
6.1 Trigram Hidden Markov Models -> Probability
- A trigram HMM contains a finite set $\mathcal{V}$ of all possible words and a finite set $\mathcal{K}$ of all possible tags. Define $\mathcal{S}$ to be the set of all sentence/tag-sequence pairs $s=\langle x_1…x_n,y_1…y_{n+1} \rangle$, and note that $y_{n+1}=\mbox{STOP}$.
- Parameter 1 (transition): $q(s\mid u,v)$ for any trigram $(u,v,s)$ such that $s\in\mathcal{K}\cup\{\mbox{STOP}\}$ and $u,v\in\mathcal{K}\cup\{*\}$.
- Parameter 2 (emission): $e(x\mid s)$ for any $x\in\mathcal{V}$ and $s\in\mathcal{K}$.
- Formula: for any $s=\langle x_1…x_n,y_1…y_{n+1} \rangle\in\mathcal{S}$, $p(x_1…x_n,y_1…y_{n+1})=\prod_{i=1}^{n+1}q(y_i\mid y_{i-2},y_{i-1})\prod_{i=1}^{n}e(x_i\mid y_i)$, where $y_0=y_{-1}=*$ (see the sketch after this list).
- The derivation of the formula is essentially the same as for the noisy-channel model shown in Section 5.3, plus two independence assumptions (Equation 2.7 in the note): each tag depends only on the two preceding tags, and each word depends only on its own tag, ignoring the preceding words and all other tags.
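A minimal Python sketch of the joint probability above, assuming `q` and `e` are dictionaries of transition and emission parameters keyed by tuples; the data structures are illustrative assumptions:

```python
def hmm_joint_prob(words, tags, q, e):
    """p(x_1..x_n, y_1..y_{n+1}) for a trigram HMM; tags must end with 'STOP'."""
    assert tags[-1] == "STOP" and len(tags) == len(words) + 1
    padded = ["*", "*"] + list(tags)                 # y_{-1} = y_0 = *
    prob = 1.0
    for i in range(2, len(padded)):                  # transition terms q(y_i | y_{i-2}, y_{i-1})
        prob *= q.get((padded[i - 2], padded[i - 1], padded[i]), 0.0)
    for word, tag in zip(words, tags):               # emission terms e(x_i | y_i)
        prob *= e.get((tag, word), 0.0)
    return prob
```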
6.2 Maximum Likelihood Method -> Estimating Parameters
Just count occurrences: $q(s\mid u,v)=\dfrac{\mbox{Count}(u,v,s)}{\mbox{Count}(u,v)}$ and $e(x\mid s)=\dfrac{\mbox{Count}(s\rightsquigarrow x)}{\mbox{Count}(s)}$, where $\mbox{Count}(s\rightsquigarrow x)$ is the number of times word $x$ appears tagged with $s$ in the training data.
The smoothing techniques shown above in Section 3 can be applied here as well (see the counting sketch below).
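A short Python sketch of these counts, assuming the training data is a list of `(words, tags)` pairs as in Section 4.1 (the data format is an illustrative assumption):

```python
from collections import defaultdict

def train_hmm_mle(tagged_sentences):
    """Estimate q(s | u, v) and e(x | s) by counting over (words, tags) training pairs."""
    tag_trigrams, tag_bigrams = defaultdict(int), defaultdict(int)
    emissions, tag_counts = defaultdict(int), defaultdict(int)
    for words, tags in tagged_sentences:
        padded = ["*", "*"] + list(tags) + ["STOP"]
        for u, v, s in zip(padded, padded[1:], padded[2:]):
            tag_trigrams[(u, v, s)] += 1
            tag_bigrams[(u, v)] += 1
        for x, s in zip(words, tags):
            emissions[(s, x)] += 1
            tag_counts[s] += 1
    q = {k: c / tag_bigrams[k[:2]] for k, c in tag_trigrams.items()}  # q(s | u, v)
    e = {k: c / tag_counts[k[0]] for k, c in emissions.items()}       # e(x | s)
    return q, e
```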
6.3 The Viterbi Algorithm -> Finding the Maximum
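The note leaves this section without details; below is a hedged sketch of the standard Viterbi dynamic program for trigram HMMs, reusing the `q` and `e` dictionaries produced by the counting sketch above (structures and names are illustrative, and zero-probability edge cases are ignored):

```python
def viterbi(words, tags_set, q, e):
    """Return the most likely tag sequence for `words` under a trigram HMM."""
    n = len(words)

    def K(k):                       # possible tags at position k (k <= 0 is the start symbol)
        return {"*"} if k <= 0 else tags_set

    pi = {(0, "*", "*"): 1.0}       # pi[k, u, v]: best prob of a tag sequence ending in (u, v) at k
    bp = {}                         # back-pointers: the best y_{k-2} for each (k, u, v)
    for k in range(1, n + 1):
        for u in K(k - 1):
            for v in K(k):
                best_w, best_p = None, 0.0
                for w in K(k - 2):
                    p = (pi.get((k - 1, w, u), 0.0)
                         * q.get((w, u, v), 0.0)
                         * e.get((v, words[k - 1]), 0.0))
                    if p > best_p:
                        best_w, best_p = w, p
                pi[(k, u, v)] = best_p
                bp[(k, u, v)] = best_w

    # Termination: include the q(STOP | u, v) factor and recover the best final tag bigram.
    best_uv, best_p = None, 0.0
    for u in K(n - 1):
        for v in K(n):
            p = pi.get((n, u, v), 0.0) * q.get((u, v, "STOP"), 0.0)
            if p > best_p:
                best_uv, best_p = (u, v), p

    # Follow back-pointers from position n down to 1.
    tags = [None] * (n + 1)
    tags[n - 1], tags[n] = best_uv
    for k in range(n - 2, 0, -1):
        tags[k] = bp[(k + 2, tags[k + 1], tags[k + 2])]
    return tags[1:]
```

Each entry $\pi(k,u,v)$ stores the best probability of any tag sequence whose last two tags at position $k$ are $(u,v)$, so the runtime is $O(n\,|\mathcal{K}|^3)$ rather than exponential in $n$.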
LAST UPDATED ON Feb 17, 2017