Natural language processing is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human languages, in particular how to program computers to process and analyze large amounts of natural language data. This article collects the main topics in this area and gives a brief explanation of each. The original outline was provided by Prof. Daniel Bauer in his NLP class; I edited it and enriched the content for further study.
- General Linguistics Concepts
- Text Processing
- Text Classification
- Probability Background
- Language Models
- Sequence Labeling (POS tagging)
- Parsing with Context Free Grammars
- Dependency parsing
- Machine Learning
- Two Approaches to Language Meaning
- Formal Lexical Semantics
- Distributional (Vector-based) Lexical Semantics
- Recurrent Neural Nets
- Transformer Based Models
- Semantic Role Labeling
- Statistical MT
General Linguistics Concepts
(Parts of J&M 2nd ed. Ch 1 but mostly split over different chapters in J&M. Ch.1 not yet available in 3rd ed.)
- Levels of linguistic representation:
- phonetics/phonology: sounds and sound patterns of language
- morphology: structure and formation of words
- syntax: word order
- semantics: word and sentence meaning (the inherent meaning of words/sentences)
- pragmatics: influence of context and situation (the purpose behind using words/sentences)
- Ambiguity, know some examples in syntax and semantics, including PP attachment, noun-noun compounds.
- semantics: Stolen Painting Found by Tree
- syntax/Prepositional Phrase (PP) attachment: Enraged cow injures farmer with axe
- noun-noun compounds: country song platinum album
- Garden-path sentences
The horse raced past the barn fell
- raced can be a past tense verb or a past participle (indicating passive voice).
- The verb interpretation is more likely before fell is read.
- Once fell is read, the verb interpretation is impossible.
- Type/Token distinction
- Type: number of distinct words in a corpus (vocabulary size). doesn’t count duplicates
- Token: Total number of word occurrences. count duplicates
- Know the following terms: word form, stem, lemma, lexeme
- Word Form: the inflected form as it actually appears in the corpus. “produced”
- Word Stem: The part of the word that never changes between morphological variations. “produc”
- Lemma: an abstract base form, shared by word forms having the same stem, part of speech, and word sense – stands for the class of words with that stem. “produce”
- Lexeme: a pairing of a word form with its sense (a synset).
- Parts of speech:
- know the 9 traditional POS and some of the Penn Treebank tags (but no need to remember the precise tagset).
- noun, pronoun, determiner, adjective, verb, adverb, preposition, conjunction, interjection
- Types of Linguistic Theories
- Prescriptive: “This is how people ought to talk.”
- Descriptive: provide a formal account of how people talk. <- NLP focuses on the descriptive part.
- Explanatory: explain why people talk a certain way (identify underlying cognitive, or neural mechanism)
- Syntax: Study of structure of language
- determine how words are arranged in a sentence and the relationship between them.
- judge if a sentence is grammatical or not
- Constituency and Recursion. Constituency tests (*).
- A constituent is a group of words that behave as a single unit (within a hierarchical structure)
- Constituency Tests: Move the phrase around to test, but cannot tear it apart
- Category of Constituent: XP, where X is the part-of-speech of the head, like NP, VP, AdjP, AdvP, DetP
- Dependency
- obtain a dependency structure by projecting the heads up the tree
- A percolation table allows converting a constituency structure into a dependency structure
- Grammatical Relations
- edges can be labeled with grammatical relations between words (typed dependencies)
- Subcategorization / Valency (and relationship to semantic roles).
- Verbs may take a different number of arguments of different syntactic types in different positions (applies to other types of words as well).
- Long-distance dependencies.
- Syntactic heads (connection between dependency and constituency structure).
- Center embeddings.
- Dependency syntax:
- Head, dependent
- Head: the word an edge points from
- Dependent: the word an edge points to
- dependency relations and labels
- subj(likes-02, girl-01)
- like: head
- girl: dependent
- subj: the relation
- number: position of word in sentence
- projectivity: If dependency edges cross, the dependency structure is non-projective
Text Processing
(Split over different chapters in J&M. Parts of J&M 3rd ed. Ch. 6)
- Text normalization: Every NLP task needs to do some text normalization.
- Segmenting / tokenizing words in running text
- Normalizing word forms (lemmatization or stemming, possibly replacing named entities: person names -> PER).
- Sentence splitting
- Linguistic Terminology
- Sentence: Unit of written language.
- Utterance: Unit of spoken language.
- Word Form: the inflected form as it actually appears in the corpus. “produced”
- Word Stem: The part of the word that never changes between morphological variations. “produc”
- Lemma: an abstract base form, shared by word forms having the same stem, part of speech, and word sense – stands for the class of words with that stem. “produce”
- Type: number of distinct words in a corpus (vocabulary size). doesn’t count duplicates
- Token: Total number of word occurrences. count duplicates
- Tokenization (word segmentation)
- The process of segmenting text (a sequence of characters) into a sequence of tokens (words).
- A weak baseline: separate off punctuation, then split on whitespace.
- Sentence splitting.
- Lemmatization
- Converting word forms into their lemma (base form).
- Know why these are useful and challenging
Text Classification
- Task definition and applications:
- Definition: Given a representation of some document $d$, identify which class $c \in C$ the document belongs to.
- Applications:
- Spam detection
- Mood / Sentiment detection
- Author identification
- Identifying political affiliation
- Word Sense Disambiguation
- Machine learning problem: Supervised ML with a fixed set of classes $C$. Train a classifier from a set of labeled <document, class> pairs.
- Discriminative vs. Generative models.
- Document representation:
- Set-of-words
- Bag-of-words: multi-set
- Vector space model: Each word corresponds to one dimension in vector space, with entries:
- Binary (Word appears / does not appear)
- Raw or normalized frequency counts
- Weighted frequency counts
- Probabilities
- Naive Bayes and its independence assumptions in text classification (a code sketch follows at the end of this section).
- Naive assumptions: all the attributes are independent of each other, given the label
- Without the independence assumption we would have to estimate $P(X_1,…,X_d|Label)$
- There would be many combinations of $X_1,…,X_d$ that are never seen (sparse data).
- The independence assumption allows us to estimate each $P(X_i|Label)$ independently.
- In reality, the words are not truly independent of each other
- $P(Label|X_1,…,X_d) = \alpha [P(Label) \prod_i P(X_i|Label) ]$
- Goal: estimate $P(Label)$ and $P(X_i|Label)$ from the training data.
- Estimate the prior and posterior probabilities using Maximum Likelihood Estimates (MLE):
- Prior: $P(y)=\frac{Count(y)}{\sum_{y’ \in Y}Count(y’)}$
- Posterior: $P(x_i|y)=\frac{Count(x_i,y)}{Count(y)}$
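The estimation and prediction steps above fit in a few lines of Python. This is a minimal sketch, assuming tokenized documents; it uses log probabilities and add-one smoothing so unseen words do not zero out a class, and the toy corpus and class names are purely illustrative.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label) pairs. Returns priors, per-class word counts, vocabulary."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)                  # word_counts[label][word]
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    priors = {y: label_counts[y] / len(docs) for y in label_counts}
    return priors, word_counts, vocab

def predict_nb(tokens, priors, word_counts, vocab):
    """argmax_y  log P(y) + sum_i log P(x_i|y), with add-one smoothing."""
    best, best_score = None, float("-inf")
    for y, prior in priors.items():
        total = sum(word_counts[y].values())
        score = math.log(prior)
        for w in tokens:
            score += math.log((word_counts[y][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = y, score
    return best

# hypothetical toy corpus
train = [("buy cheap meds now".split(), "spam"),
         ("meeting agenda attached".split(), "ham")]
priors, counts, vocab = train_nb(train)
print(predict_nb("cheap meds".split(), priors, counts, vocab))
```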
Probability Background
- Prior vs. conditional probability
- Bayesian Statistics: Probabilities express the degree of belief we have in the possible interpretations.
- Prior probabilities: Probability of an interpretation prior to seeing any evidence.
- Conditional (Posterior) probability: Probability of an interpretation after taking evidence into account.
- Sample space, basic outcomes
- Each $\omega \in \Omega$ is a possible basic outcome / “possible world” (e.g. the 6 possible rolls of a die).
- Probability distribution
- A probability distribution assigns a probability to each basic outcome. (Sum of all outcomes equal to 1)
- Events
- An event $A$ is any subset of $\Omega$
- Random variables
- A random variable is a function from basic outcomes to some range, e.g. real numbers or booleans.
- A distribution $P$ induces a probability distribution for any random variable.
- Product rule
- $P(A,B)=P(B) \times P(A|B) = P(A) \times P(B|A)$
- Chain rule
- $P(A_n,…,A_1)=P(A_n|A_{n-1},…,A_1) \times P(A_{n-1},…,A_1)$
- Bayes’ rule
- $P(A|B)=\frac{P(B|A) \times P(A)}{P(B)}$
- Conditional independence
- $P(B,C|A) = P(B|A) \times P(C|A)$
- $P(B|A,C) = P(B|A)$ and $P(C|B,A)=P(C|A)$
- Discriminative vs. generative models
- Discriminative algorithms: learn $P(y|x)$ directly
- Model conditional distribution of the label given the data $P(y|x)$
- Learns decision boundaries that separate instances of the different classes
- To predict a new example, check on which side of the decision boundary it falls.
- Generative algorithms: use Bayes rule $P(y|x)=\frac{P(x|y) \times P(y)}{P(x)}$ (estimate $P(x|y)$ and $P(y)$)
- Assume the observed data is being “generated” by a “hidden” class label.
- Build a different model for each class (Different conditional distribution for each class)
- To predict a new example, check it under each of the models and see which one matches best (With the highest probability)
- Noisy channel model
- The noisy channel model assumes that an observation (for example, a sequence of words) has been generated probabilistically by some underlying hidden state sequence (for example a sequence of P.O.S tags).
- Goal: “Decoding” is the task of finding the most likely hidden input sequence given the observed output using Bayesian inference.
- Calculating with probabilities in log space
- Probabilities can become very small, often work with log probabilities in practice.
Language Models
(J&M 2nd ed. Ch 4.1-4.8, J&M 3rd ed. Ch 4)
- Task definition and applications
- Task definition: predict the next word given the context, and also can be used to describe the probability of an entire sentence, not just the last word.
- Applications: speech recognition, handwritten character recognition, spelling correction, text entry UI, machine translation
- Probability of the next word and probability of a sentence
- Idea: We do not need to model domain, syntactic, and lexical knowledge perfectly.
- Instead, we can rely on the notion of probability of a sequence (letters, words…)
- predict next word based on the given sequence $P(w_n|w_1,w_2,w_3,…,w_{n-1})$
- Markov independence assumption
- simple independence assumption (Markov assumption): The probability to see $w_n$ depends only on the previous $k-1$ words (in an n-gram model, $k=n$, i.e. condition on the previous $n-1$ words)
- n-gram language models
- Naive Bayes (each word is independent of the others)
- Role of the START and END markers
- use a special marker STOP/END that indicates the end of a sentence.
- sentence with START and STOP markers to provide the appropriate context. START i want to eat Chinese food END
- bigram: P(i|START)·P(want|i)·P(to|want)·P(eat|to)·P(Chinese|eat)·P(food|Chinese)·P(END|food)
- trigram: P(i|START,START)·P(want|START,i)·P(to|i,want)·P(eat|want,to)·P(Chinese|to,eat)·P(food|eat,Chinese)·P(END|Chinese,food)
- Estimating n-gram probabilities from a corpus:
- Maximum Likelihood Estimates
- bigram: $P(w|u)=\frac{count(u,w)}{count(u)}$ <- u first, w next in $count(u,w)$, order matters
- trigram: $P(w|u,v)=\frac{count(u,v,w)}{count(u,v)}$
- Dealing with unseen Tokens
- Start with a specific lexicon (dictionary) of known tokens.
- Replace all tokens in the training and testing corpus that are not in the lexicon with an UNK token. P(cat, dog) -> P(cat, UNK)
- Practical approach:
- Lexicon contains all words that appear more than k times in the training corpus (words that appear only a few times are too rare to include in the lexicon)
- Replace all other tokens with UNK.
- Dealing with unseen contexts: Smoothing and Back-off
- Smoothing / Discounting: Move some probability mass from seen trigrams to unseen trigrams (the additive and interpolated variants are sketched in code at the end of this block).
- Back-off: Use n-1-…, n-2-… grams to compute n-gram probability.
- Additive Smoothing:
- Classic approach: Laplacian, a.k.a. additive smoothing
- unigram: $P(w_i)=\frac{count(w_i)+1}{N+V}$
- bigram: $P(w|u)=\frac{count(u,w)+1}{count(u)+V}$
- $N$ is the number of tokens, $V$ is the number of types (i.e.size of the vocabulary)
- Linear Interpolation:
- Use denser distributions of shorter n-grams to “fill in” sparse n-gram distributions
- $P(w|u,v)= \lambda_1 P_{MLE}(w|u,v) + \lambda_2 P_{MLE}(w|v) + \lambda_3 P_{MLE}(w)$
- $\lambda_1, \lambda_2, \lambda_3 >0$ and $\lambda_1+\lambda_2+\lambda_3=1$
- Parameters $\lambda$ can be estimated on development data (for example, using Expectation Maximization).
- Discounting:
- Idea: set aside some probability mass, then fill in the missing mass using back-off.
- $count^*(v,w)=count(v,w)-\beta$ where $0<\beta<1$
- then for all seen bigrams: $P(w|v)=\frac{count^*(v,w)}{count(v)}$
- for each context $v$ the missing probability mass is: $\alpha(v) = 1 - \sum_{w:count(v,w)>0} \frac{count^*(v,w)}{count(v)}$
- then divide this held-out mass between the unseen words (evenly or using back-off)
- Katz’ Backoff:
- Divide the held-out probability mass $\alpha(v)$ proportionally to the unigram probability of the unseen words in context $v$
- Additive smoothing gives the same probability to all unseen words; back-off fixes this: a word that is rare as a unigram should also receive little probability mass in unseen contexts.
- For trigrams: recursively compute backoff-probability for unseen bigrams. Then distribute the held-out probability mass proportionally to that bigram backoff-probability.
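A minimal sketch of the bigram estimators above (MLE, add-one, and linear interpolation), assuming sentences already carry START/END markers; the toy corpus and the λ values are illustrative and would normally be tuned on development data.

```python
from collections import Counter

corpus = [["START", "i", "want", "to", "eat", "END"],
          ["START", "i", "want", "food", "END"]]          # hypothetical toy corpus

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((u, w) for sent in corpus for u, w in zip(sent, sent[1:]))
N = sum(unigrams.values())        # number of tokens
V = len(unigrams)                 # number of types (vocabulary size)

def p_mle_unigram(w):
    return unigrams[w] / N

def p_mle_bigram(u, w):
    return bigrams[(u, w)] / unigrams[u] if unigrams[u] else 0.0

def p_addone_bigram(u, w):
    # P(w|u) = (count(u,w) + 1) / (count(u) + V)
    return (bigrams[(u, w)] + 1) / (unigrams[u] + V)

def p_interpolated(u, w, lam1=0.7, lam2=0.3):
    # lambda weights would normally be estimated on development data
    return lam1 * p_mle_bigram(u, w) + lam2 * p_mle_unigram(w)

print(p_addone_bigram("i", "want"), p_interpolated("i", "want"))
```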
- Perplexity (*): a measure of surprise (a small computation sketch appears at the end of this section)
- Perplexity (per word) measures how well the n-gram model predicts the sample
- Perplexity is defined as $2^{-l}$, where $l=\frac{1}{M}\sum_{i=1}^{m} \log_2 P(s_i)$
- M: the total number of word tokens (including STOP)
- m: the total number of sentences
- Lower perplexity = better model. Intuition:
- Assume we are predicting one word at a time.
- With a uniform distribution, all successor words are equally likely; perplexity is then equal to the vocabulary size (the next word could be any word in the vocabulary with equal probability).
- Perplexity can be thought of as an “effective vocabulary size” (the lower the perplexity, the more precisely the model narrows down the next word, and the better the model).
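A small sketch of the perplexity definition above. `sentence_logprob` is an assumed callable returning log₂ P(s) under whatever n-gram model is being evaluated.

```python
import math

def perplexity(sentences, sentence_logprob):
    """sentences: list of token lists (each including the END/STOP token).
    sentence_logprob: callable returning log2 P(s) for one sentence."""
    M = sum(len(s) for s in sentences)                 # total word tokens, incl. STOP
    l = sum(sentence_logprob(s) for s in sentences) / M
    return 2 ** (-l)                                   # lower = better model
```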
Sequence Labeling (POS tagging)
(J&M 2nd ed Ch 5.1-5.5, J&M 3rd ed. Ch 10.1-10.4)
- Linguistic tests for part of speech (*).
- P.O.S. tag-set should contain morphological and maybe syntactic information.
- Part-of-Speech Tagging
- Goal: Assign a part-of-speech label to each word in a sentence.
- An example of a sequence labeling task (a translation from a sequence of words to a sequence of tags)
- correct POS depends on context
- Infer the most likely sequence of hidden variables given the sequence of observed words
- Noisy Channel Model:
- Goal: figure out what the original input to the channel was.
- Use Bayes’ rule: $\text{argmax}_{tags} P(tags|words)=\text{argmax}_{tags}\frac{P(tags) P(words|tags)}{P(words)}$
- Hidden Markov Model:
- Generative (Bayesian) probability model
- Observations (sequence of tokens/words)
- Hidden states (sequence of part of speech tags)
- Hidden sequence is generated by an n-gram language model (typically a bigram model) $P(t_1,…,t_n)=\prod_{i=1}^{n}P(t_i|t_{i-1})$
- Transition probabilities, emission probabilities
- transition probabilities: given by Markov chain
- emission probabilities: generate the certain word given the underlying tag/state
- $P(t_1,…,t_n,w_1,…,w_n)=\prod_{i=1}^{n}P(t_i|t_{i-1})P(w_i|t_i)$
- Markov chain
- A Markov chain is a sequence of random variables $X_1,X_2,…$
- The domain of these variables is a set of states
- Markov assumption: Next state depends only on current state.
- Three tasks on HMMS: Decoding, Evaluation, Training
- Decoding: Find the most likely sequence of tags:
- Viterbi algorithm (dynamic programming, know the algorithm and data structures involved)
- data structures: Create a table $\pi$, such that each entry $\pi [k,t]$ contains the score of the highest-probability sequence ending in tag $t$ at time $k$.
- $\pi[k,t] \leftarrow \max_s \pi[k-1,s] \times P(t|s) \times P(w_k|t)$
- overall running time: linear in the sentence length, $O(n \cdot |T|^2)$ where $|T|$ is the size of the tagset (the outer loop runs for k=1 to n); a code sketch follows this block
- Evaluation: Given a sequence of words, find the total probability for this word sequence given an HMM.
- Can view the HMM as another type of language model
- Spurious ambiguity: multiple hidden sequences lead to the same observation.
- Forward algorithm (difference to Viterbi).
$$P(w_1,…,w_n)=\sum_{t_1,…,t_n}P(t_1,…,t_n,w_1,…,w_n)=\sum_{t_1,…,t_n}[\prod_{i=1}^{n}P(t_i|t_{i-1})P(w_i|t_i)]$$
- data structures: Create a table $\pi$, such that each entry $\pi [k,t]$ contains the sum of the probabilities of all tag/word sequences ending in tag $t$ at time $k$.
- $\pi[k,t] \leftarrow \sum_s \pi[k-1,s] \times P(t|s) \times P(w_k|t)$
- overall running time: linear in the sentence length, $O(n \cdot |T|^2)$
- the advantage over a plain word n-gram model: HMM can capture a limited amount of syntactic information
- Training: Estimate emission and transition probabilities from training data. We only discussed maximum likelihood estimates.
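A sketch of the Viterbi recurrence for a bigram HMM, following the table π and backpointers described above. `trans[s][t]` ≈ P(t|s) and `emit[t][w]` ≈ P(w|t) are assumed dictionaries (including a START entry) estimated beforehand; in practice one would work in log space. Replacing `max` with a sum over previous tags turns this into the forward algorithm.

```python
def viterbi(words, tags, trans, emit, start="START"):
    """Return the most likely tag sequence for `words` under a bigram HMM."""
    pi = [{}]                      # pi[k][t] = best score of a tag sequence ending in t at position k
    bp = [{}]                      # backpointers
    for t in tags:
        pi[0][t] = trans[start].get(t, 0.0) * emit[t].get(words[0], 0.0)
        bp[0][t] = start
    for k in range(1, len(words)):
        pi.append({})
        bp.append({})
        for t in tags:
            best_s = max(tags, key=lambda s: pi[k - 1][s] * trans[s].get(t, 0.0))
            pi[k][t] = pi[k - 1][best_s] * trans[best_s].get(t, 0.0) * emit[t].get(words[k], 0.0)
            bp[k][t] = best_s
    # follow backpointers from the best final tag
    last = max(tags, key=lambda t: pi[-1][t])
    seq = [last]
    for k in range(len(words) - 1, 0, -1):
        seq.append(bp[k][seq[-1]])
    return list(reversed(seq))
```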
- Extending HMMs to trigrams
- Instead of using a unigram context $P(t_i|t_{i-1})$, use a bigram context $P(t_i|t_{i-2}t_{i-1})$
- The HMM probability for a given tag and word sequence is: $\prod_{i=1}^{n}P(t_i|t_{i-2}t_{i-1})P(w_i|t_i)$
- Need to handle data sparseness when estimating transition probabilities (for example using backoff or linear interpolation)
- Applying HMMs to other sequence labeling tasks, for example Named Entity Recognition
- B.I.O. tags for NER.
- O - outside of named entity
- I - inside named entity
- B - first word (beginning) of named entity
- Two named entities can appear immediately adjacent to each other; the B tag makes it possible to identify where each named entity starts
Parsing with Context Free Grammars
(J&M 2nd ed Ch. 12 and 13.1-13.4 and 14.1-14.4 and Ch. 16, J&M 3rd ed. Ch. 11.1-11.5 and Ch. 12.1-12.2 [Earley not covered in 3rd. ed] and Ch 13.1-13.4, [complexity classes not covered in 3rd ed.] )
- Tree representation of constituency structure. Phrase labels.
- CFG definition:
- CFGs are context free: applicability of a rule depends only on the nonterminal symbol, not on its context.
- terminals: $\Sigma$
- nonterminals: $N$
- start symbol: $S \in N$
- productions: set $R$ of productions of the form $A \rightarrow \beta$ where $A \in N$ and $\beta \in (\Sigma \cup N)^*$
- derivations and language of a CFG
- The language of a given CFG $G=(N,\Sigma,R,S)$ is defined as the set of all terminal strings that can be derived from the start symbol
- derivation trees (vs. derived string).
- CFG is a string rewriting formalism, so the derived objects are strings.
- A derivation is a sequence of rewriting steps
- the order in which multiple non-terminals in a partially derived string are replaced does not matter
- can represent identical derivations in a derivation tree
- The derivation tree implies a parse tree
- Regular grammars (and know that these are equivalent to finite state machines - you don’t need to know FSAs in detail for the final) (*)
- Regular Grammars and FSAs describe the same set of languages. For any regular grammar, one can construct an FSA recognizing the same language. The automaton can be used directly to recognize sentences, while the grammar needs a separate parsing algorithm. The terminal states are just chosen by the developer (or by the algorithm that creates the FSA from a regular grammar or from a regular expression).
- Complexity classes
- Regular languages. (regular grammars, regular expressions, finite state automata)
- Context-free languages. (context-free string grammars, pushdown automata).
- “Mildly” context sensitive languages.
- Context-sensitive languages.
- Recursively enumerable languages (Turing machines).
- “Chomsky hierarchy”: Regular languages $\subset$ Context-free languages $\subset$ “Mildly” context sensitive languages $\subset$ Context-sensitive languages $\subset$ Recursively enumerable languages
- Center embeddings as an example of a non-regular phenomenon.
- Regular grammars cannot capture long-distance dependencies
- It’s a phenomenon that exists in natural languages (like English) that cannot be described using regular grammars (for formal reasons), proving that natural languages are not regular
- Cross-serial dependencies as an example of a non-context-free phenomenon.
- CFGs cannot describe crossing dependencies
- Probabilitistic context free grammars (PCFG):
- Probability of a tree vs. probability of a sentence.
- Maximum likelihood estimates from treebanks.
- Computing the most likely parse tree using CKY (a code sketch appears at the end of this block).
- Recognition (membership checking) problem vs. parsing
- Top-down vs. bottom-up parsing.
- Bottom-up: Start at the words (terminal symbols) and see which subtrees you can build. Then combine these subtrees into larger trees. (Driven by the input sentence.)
- Top-down: Start at the start symbol (S), try to apply production rules that are compatible with the input. (Driven by the grammar)
- CKY parser:
- bottom-up approach.
- Chomsky normal form.
- A CFG $G=(N,\Sigma,R,S)$ is in Chomsky Normal Form (CNF) if the rules take one of the following forms:
- $A \rightarrow BC$, where $A \in N$, $B \in N$, $C \in N$
- $A \rightarrow b$, where $A \in N$, $b \in \Sigma$
- Any CFG can be converted to an equivalent grammar in CNF that expresses the same language
- Dynamic programming algorithm. (know the algorithm and required data structure: CKY parse table). Split position.
- data structure: $\pi[i,j]$ be the set of nonterminals that cover $[i,j]$, use “parse table” to represent $\pi[i,j]$
- Split position: To compute $\pi[i,j]$, try all possible split points $k$, such that $i<k<j$. For each $k$, check if the nonterminals in $\pi[i,k]$ and $\pi[k,j]$ match any of the rules in the grammar.
- overall runtime: $O(n^3)$
- Backpointers
- retrieve the parse trees
- store a list of instantiated rules and backpointers
- start at the [0,n] entry and recursively follow the backpointers. Return a set of subtrees from the recursion.
- retrieve the k highest-scoring trees in polynomial time (PCFGs)
- Retrieving ANY single parse tree takes $O(n^2)$
- Parsing with PCFGs
- assigns a probability to each parse tree to select the most probable parse tree compatible with an input sentence (generative model)
- A PCFG consists of a Context Free Grammar $G=(N,\Sigma,R,S)$ and a probability $P(A\rightarrow \beta)$ for each production $A\rightarrow \beta \in R$
- The probabilities for all rules with the same left-hand- side sum up to 1
- The probability of a parse tree is $P(t)=\prod_{i=1}^nP(A_i \rightarrow \beta_i)$ <- joint probability
- The probability of a parse tree given the sentence is $P(t)/P(S)$ <- conditional probability
- Estimating PCFG probabilities with MLE: $P(A \rightarrow \beta) = count(A \rightarrow \beta)/count(A)$
- data structure: $\pi[i,j,X]$ be the probability of the highest scoring parse tree for $s[i,j]$ starting in nonterminal $X$
- $\pi[i,j,X]=\max_{k=i+1,…,j-1;\ X \rightarrow BC \in R} P(X \rightarrow BC) \times \pi[i,k,B] \times \pi[k,j,C]$
- retrieve the highest-scoring trees in polynomial time
- Probability of a sentence:
- Spurious ambiguity. Need to sum the probabilities of all parse trees for the sentence.
- $\pi[i,j,X]=\sum_{k=i+1,…,j-1;\ X \rightarrow BC \in R} P(X \rightarrow BC) \times \pi[i,k,B] \times \pi[k,j,C]$
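A compact sketch of probabilistic CKY over a grammar in CNF, filling π[i, j, X] bottom-up with the max rule above. `lex_rules` (word → list of (A, prob)) and `bin_rules` (list of (A, B, C, prob) for A → B C) are hypothetical data structures for this illustration; backpointers would be stored alongside the scores to recover the best tree.

```python
def pcky(words, lex_rules, bin_rules, start="S"):
    """Return the probability of the best parse, i.e. pi[0, n, start]."""
    n = len(words)
    pi = {}                                            # pi[(i, j, X)] = best probability for span [i, j)
    for i, w in enumerate(words):                      # width-1 spans: lexical rules A -> w
        for X, p in lex_rules.get(w, []):
            pi[(i, i + 1, X)] = max(pi.get((i, i + 1, X), 0.0), p)
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            for k in range(i + 1, j):                  # split position
                for A, B, C, p in bin_rules:           # binary rules A -> B C
                    score = p * pi.get((i, k, B), 0.0) * pi.get((k, j, C), 0.0)
                    if score > pi.get((i, j, A), 0.0):
                        pi[(i, j, A)] = score          # a backpointer (k, B, C) would be stored here
    return pi.get((0, n, start), 0.0)
```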
- Earley parser:
- Top-down approach.
- Does not require CNF
- The Earley parser instead starts at the start symbol and tries to “guess” derivations top-down.
- It discards derivations that are incompatible with the sentence.
- The Earley parser sweeps through the sentence left-to-right only once. It keeps partial derivations in a table (“chart”).
- Parser state definition. Initial state, goal states.
- State represent hypotheses about constituent structure based on the grammar, taking into account the input (represented as dotted rules with spans).
- Three operations: Scan, Predict, Complete
- Predict: Predict new subtrees top-down. The predict operation can only apply when the state currently being considered has the dot immediately to the left of a non-terminal symbol.
- Scan: Scan input terminals to verify the prediction.
- Complete: Complete with passive states. And shift the dot.
- Dynamic programming algorithm. (know the algorithm and required data structure: parse “Chart” organized by end-position).
- data structure: Keep track of parser states in a table (“chart”). $Chart[k]$ contains a list of all parser states that end in position $k$.
- shift(state): $A \rightarrow \alpha •s[i] \beta[k,i]$ $\Rightarrow$ $A \rightarrow \alpha s[i] •\beta[k,i+1]$ and add this new state to $Chart[i+1]$
- predict(state): $A \rightarrow \alpha •B \beta[k,i]$ for each rule $B \rightarrow \gamma$, then add a new state $B \rightarrow •\gamma[i,i]$ to $Chart[i]$
- complete(state): $A \rightarrow \alpha •[k,j]$ for each state $B \rightarrow \beta •A \gamma[i,k]$ add a new state $B \rightarrow \beta A •\gamma[i,j]$ to $Chart[j]$
- overall runtime: $O(n^3)$
- Ambiguity
- Multiple ways to complete the same state.
- Could hash the states so no duplicates are actually added into the chart entries.
- Recover parse trees: Keep back-pointers in the parser state objects.
- Make the algorithm work with PCFG: Easy to compute probabilities on Complete. Follow back pointer with max probability.
Dependency parsing
(Not in J&M 2nd ed., J&M 3rd ed. Ch 14.1-14.5, Supplementary material: Kübler, McDonald, and Nivre (2009): Dependency Parsing, Ch.3 to 4.2)
- Goal:
- Find a set of labeled, directed edges between the nodes, such that the resulting graph forms a correct dependency tree.
- Grammar based vs. data based dependency parsing.
- Grammar based: a grammar defines which sentences are well-formed; parsing finds analyses licensed by the grammar
- Data based: no notion of grammatically valid vs. invalid sentences; the parser is learned from annotated data
- Data based approaches:
- Graph algorithms. vs. transition based (*) (we did not discuss graph-based approaches)
- Transition based dependency parsing:
- Idea:
- Start with an initial configuration and find a sequence of transitions to the terminal state.
- Uses a greedy approach to find the best sequence of transitions.
- Uses a discriminative model (classifier) to select the next transition.
- States (configuration): Stack, Buffer, Partial dependency tree
- stack: Keep track of nodes that can become dependents of a left-arc
- buffer: Contains words that can become dependents of a right-arc. Keep unseen words.
- Transitions (Arc-standard system): Shift, Left-Arc, Right-Arc (a code sketch appears at the end of this section)
- shift: Move next word from the buffer to the stack
- left-arc: Build an edge from the next word on the buffer to the top word on the stack, and remove the top word on the stack
- right-arc: Build an edge from the top word on the stack to the next word on the buffer, then remove the top word on the buffer and move the top word on the stack to the top of the buffer
- Properties of the transition system:
- The time required to parse $w_1,…, w_m$ with an oracle is $O(m)$
- A node must collect all of its children before it is attached to its parent: as soon as a word has a parent (incoming edge), it disappears from both the stack and the buffer and can no longer attach children of its own.
- Can only produce projective trees, because the transitions can never create crossing dependency edges.
- Predicting the next transition using discriminative classifiers.
- Which features should the classifier use? Local features from each state, but ideally we would want to model the entire history of transitions leading to the state.
- Need to define a feature function that maps states to feature vectors.
- Feature definition (address + attribute)
- an address in the state description: (identifies a specific word in the configuration, for example “top of stack”).
- an attribute of the word in that address: (for example POS, word form, lemma, word embedding, …)
- Training the parser from a treebank:
- Oracle transitions from the annotated dependency tree. Train the model on these transitions.
- Arc-eager system (*) (did not discuss this in class)
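A sketch of the transition system described above, applying a given transition sequence (e.g. from an oracle or a trained classifier) to a stack, a buffer, and a growing set of edges. The exact left-arc/right-arc conventions mirror the notes rather than any particular library.

```python
def parse(words, transitions):
    """words: sentence tokens (index 0 is the artificial ROOT).
    transitions: sequence of 'shift' / 'left_arc' / 'right_arc' strings."""
    stack = [0]                                  # 0 = ROOT
    buffer = list(range(1, len(words) + 1))
    edges = []                                   # (head, dependent) pairs
    for action in transitions:
        if action == "shift":                    # move next word from buffer to stack
            stack.append(buffer.pop(0))
        elif action == "left_arc":               # edge: buffer front -> stack top; pop stack
            edges.append((buffer[0], stack.pop()))
        elif action == "right_arc":              # edge: stack top -> buffer front;
            dep = buffer.pop(0)                  # remove buffer front, move stack top back to buffer
            edges.append((stack[-1], dep))
            buffer.insert(0, stack.pop())
    return edges

# "the cat sleeps": produces edges (2,1) cat->the, (3,2) sleeps->cat, (0,3) ROOT->sleeps
print(parse(["the", "cat", "sleeps"],
            ["shift", "left_arc", "shift", "left_arc", "right_arc"]))
```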
Machine Learning
(Some textbook references below)
Generative vs. discriminative algorithms
Supervised learning. Classification vs. regression problems, multi-class classification
- Classification:predict a label from a set of labels. Learn a classifier function $h:\mathbb R^d \rightarrow {-1,+1}$
- Regression: predict a numeric value. Learn a regressor function $h:\mathbb R^d \rightarrow \mathbb R$
Loss functions:
- Least squares error: $loss(y_i,h(x_i))=(y_i-h(x_i))^2$
- Classification error:
$$loss(y_i,h(x_i))= \begin{cases} 1, & \text{if } sign(h(x_i)) \neq sign(y_i) \\ 0, & \text{otherwise} \end{cases}$$
Training vs. testing error. Overfitting and how to prevent it.
- Training aims to minimize training error. We hope that minimizing training error also minimizes test error.
- Minimizing empirical risk can lead to overfitting. This happens when a model works well on the training data, but it does not generalize to testing data.
- Prevent overfitting:
- Reduce the number of features (feature selection).
- Model selection: Select the simplest one
- Regularization: Adding a term that penalizes the complexity of the model to the empirical risk.
- Cross validation: Train on n-1 partitions of original training set
Linear Models.
- activation function
- Binary classification model: $f(x_i)=\text{sign}(\mathbf{w} \cdot \mathbf{x}_i)$
- perceptron learning algorithm (the threshold function is not differentiable, so gradient methods do not apply directly; a sketch follows this block):
- Start with arbitrary hyperplane
- Adjust it using the training data
- Update rule: $w_j \leftarrow w_j + (y-h_w(x)) \times x_j$
- Perceptron Convergence Theorem states that any linear function can be learned using this algorithm in a finite number of iterations. “convergence” means that the weights don’t change for one entire iteration through the training data.
- linear separability and the XOR problem.
- One trick with data that is not linearly separable is to project it into a higher-dimensional feature space, where it may become linearly separable.
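A sketch of the perceptron learning algorithm with the update rule above, for labels in {-1, +1}; the toy data (with a constant bias feature) is hypothetical. Convergence here means a full pass over the training data without any weight change.

```python
def train_perceptron(data, dims, epochs=10):
    """data: list of (x, y) with x a list of floats and y in {-1, +1}."""
    w = [0.0] * dims
    for _ in range(epochs):
        changed = False
        for x, y in data:
            h = 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1
            if h != y:                            # update rule: w_j += (y - h) * x_j
                w = [wj + (y - h) * xj for wj, xj in zip(w, x)]
                changed = True
        if not changed:                           # converged: no updates in a full pass
            break
    return w

# hypothetical linearly separable data, last dimension is a bias feature
data = [([2.0, 1.0, 1.0], 1), ([0.5, 0.2, 1.0], -1)]
print(train_perceptron(data, dims=3))
```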
Feature functions: multi-class decisions
- feature function $\phi (x,y): \mathcal X \times \mathcal Y \rightarrow \mathbb R^d$ d-dimensional vectors.
- In NLP applications, the feature functions are often indicator functions (value is 0 or 1)
- Multiclass Perceptron: predict the one with the highest score $f(x)=\text{argmax} _{y \in Y} \sum_{j=1}^d w_j • \phi_j(x,y)$
Log-linear / maximum entropy models (J&M 3rd. ed. ch. 7, J&M 2nd ed. ch. 6.6)
- Log-likelihood of the model on the training data.
- Define the conditional probability $P(y|x;w)=\frac{\exp (w • \phi(x,y))}{\sum_{y’ \in Y} \exp (w•\phi(x,y’))}$
- Define the log-likelihood of some model $w$ on the training data: $LL(w)=\sum_{i=1}^n \log P(y_i|x_i;w)$
- Compute the maximum likelihood estimate: $w^*=\text{argmax}_w \sum_{i=1}^n \log P(y_i|x_i;w)$ (one gradient-ascent step is sketched after this block)
- Simple gradient ascent.
- for each $w_j$ in $w$: $w_j \leftarrow w_j + \alpha \frac{\partial}{\partial w_j}LL(w)$
- $\frac{\partial}{\partial w_j}LL(w) = \sum_{i=1}^n \phi_j(x_i,y_i) - \sum_{i=1}^n \sum_{y \in Y} P(y|x_i;w) \phi_j(x_i,y)$
- Regularization:
- Include a regularization term to prevent overfitting, for example an L2 regularizer: $LL(w)=\sum_{i=1}^n \log P(y_i|x_i;w)-\frac{\lambda}{2}||w||^2$.
- a trade-off between fit and model ‘complexity’: $\frac{\partial}{\partial w_j}LL(w) = \sum_{i=1}^n \phi_j(x_i,y_i) - \sum_{i=1}^n \sum_{y \in Y} P(y|x_i;w) \phi_j(x_i,y) - \lambda w_j$
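One gradient-ascent step for the log-linear model above, including the L2 penalty term; `phi(x, y)` is an assumed feature function returning a list of d feature values, and the learning rate and λ are illustrative.

```python
import math

def gradient_step(data, labels, w, phi, alpha=0.1, lam=0.01):
    """data: list of (x, y) training pairs; labels: the label set Y.
    One update  w_j <- w_j + alpha * d/dw_j [ LL(w) - (lam/2)||w||^2 ]."""
    d = len(w)
    grad = [0.0] * d
    for x, y in data:
        # observed feature counts
        for j, f in enumerate(phi(x, y)):
            grad[j] += f
        # expected feature counts under the model P(y'|x; w)
        scores = {yp: math.exp(sum(wj * fj for wj, fj in zip(w, phi(x, yp)))) for yp in labels}
        Z = sum(scores.values())
        for yp, s in scores.items():
            p = s / Z
            for j, f in enumerate(phi(x, yp)):
                grad[j] -= p * f
    # L2 regularization adds -lam * w_j to each partial derivative
    return [wj + alpha * (gj - lam * wj) for wj, gj in zip(w, grad)]
```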
- MEMM (Maximum entropy markov models):
- POS tagging with MEMMs
- Make an independence assumption (similar to HMM): the probability of a tag depends only on the previous tag and the words: $P(t_1,…,t_m|w_1,…,w_m)=\prod_{i=1}^m P(t_i|t_{i-1},w_1,…,w_m)$
- Model each term using a log-linear model $P(t_i|t_{i-1},w_1,…,w_m)=\frac{\exp (w•\phi(w_1,…,w_m,i,t_{i-1},t_i))}{\sum_{t’} \exp (w•\phi(w_1,…,w_m,i,t_{i-1},t’))}$
- Need to find $\text{argmax} P(t_1,…,t_n|w_1,…w_n) = \text{argmax} P(t_1,…,t_n, w_1,…w_n)$
Feed-forward neural nets
(J&M 3rd. ed. ch. 8, also take a look at Yoav Goldberg’s book “Neural Network Methods for Natural Language Processing”)
- Multilayer neural nets (input layer, hidden layer, output layer)
- Basic idea: represent any (non-linear) function as a composition of soft-threshold functions. This is a form of non-linear regression.
- Different differentiable activation functions (sigmoid, ReLU, tanh)
- sigmoid: $g(z)=1/(1+e^{-z})$
- tanh: $tanh(z)=\frac{e^z-e^{-z}}{e^z+e^{-z}}$
- ReLU: $ReLu(z)=\max (z,0)$
- Softmax
- Normalize activation of each output unit by the sum of all output activations (as in log-linear models)
- $softmax(z_i)=\frac{\exp (z_i)}{\sum_{j=1}^k \exp(z_j)}$
- Backpropagation (*) (know the idea of propagating back partial derivatives of the loss with respect to each weight – I won’t ask tricky questions about backprop).
- gradient descent to update weights and minimize loss.
- apply the chain rule of calculus
Two Approaches to Language Meaning
- Formal Semantics
- Use formal Meaning Representations for words /sentences / discourse.
- Based on lexical resources. Dictionary
- History in cognitive science
- Distributional / distributed approaches
- Usage based. Characterize the meaning of a word by the contexts in which it typically appears.
- Dense Vector representations (embeddings)
- Core concept: Semantic similarity
Formal Lexical Semantics
(J&M 3rd ed. ch 17, J&M 2nd ed. ch 19.1-19.3 and 20.1-20.4 )
- Word senses, lexemes, homonymy, polysemy, metonymy, zeugma.
- Word senses: Different word senses of the same word can denote different (more or less related) concepts
- Lexeme: a pairing of a particular word form with its sense. (bank, river) (bank, financial institution)
- Homonymy: (same form and sound, unrelated meanings) Multiple unrelated concepts correspond to the same word form.
- Polysemy: (same form and sound, related meanings) Multiple semantically related concepts correspond to the same word form.
- Metonymy: A subtype of polysemy. One aspect of a concept is used to refer to other aspects of the concept (or the concept itself).
- ANIMAL <-> MEAT (the chicken was overcooked, the chicken eats a worm)
- Zeugma: when a single word is used with two other parts of a sentence but must be understood differently (different word sense) in relation to each.
- He lost his gloves and his temper.
- WordNet, synsets
- Wordnet: WordNet is a lexical database containing English word senses and their relations.
- Represents word sense as synsets, sets of lemmas that have synonymous lexemes in one context.
- Lexical/Semantic relations (in WordNet)
- synonym, antonym
- synonym: Two lexemes refer to the same concept. (synonymy is not a relationship between words, but between lexemes)
- antonym: Senses are opposites with respect to one specific feature of their meaning.
- hypernym, hyponym
- hyponymy: One sense is a hyponym (or subordinate) of another sense if the first sense is more specific, denoting a subclass of the other.
- hypernymy: Inverse relation of hyponymy. (Furniture is a hypernym (or superordinate) of desk.)
- The set of things denoted by the hyponym is a subset of the set of things denoted by the hypernym.
- meronym: A meronym is a part of another concept. (leg is a meronym of chair.)
- The inverse relation is holonymy. Car is a holonym of wheel.
- Word-sense disambiguation: Given a word token in context, identify its correct word sense (from a list of possible word-senses).
- Supervised learning approach
- Train a classifier for each word
- Requires hand-labeled training data (sense annotations)
- Dictionary-Based Methods:
- No training data. Instead use a sense dictionary (like WordNet).
- to obtain candidate senses
- to obtain information that allows us to identify which of the candidate senses is correct.
- Feature Definition:
- Collocational features: Take the position of each context word into account.
- Bag-of-word features: Simply keep the set of all context words.
- Using word embeddings
- Lesk algorithm (J&M 3rd ed. ch. 17.6); a WordNet-based sketch follows this block
- Simplified Lesk Algorithm
- Use dictionary glosses for each sense.
- Choose the sense that has the highest word overlap between gloss and context (ignore function words).
- Extensions to Lesk Algorithm
- Use embedded representations for each word in the definition. Choose the sense with highest average vector similarity to the context.
- “Corpus-Lesk”: Use a sense-tagged-example corpus, add context from example sentences.
- Extended Gloss Overlap. Add glosses from related words (hypernyms, meronyms, …)
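A sketch of the simplified Lesk algorithm using NLTK’s WordNet interface (this assumes `nltk` plus its WordNet data are installed; the tiny stop-word set is only illustrative, a real implementation would use a proper list).

```python
from nltk.corpus import wordnet as wn

def simplified_lesk(word, context_tokens, stopwords=frozenset({"the", "a", "of", "in", "on"})):
    """Pick the synset whose dictionary gloss overlaps most with the context."""
    context = {t.lower() for t in context_tokens} - stopwords
    best, best_overlap = None, -1
    for synset in wn.synsets(word):
        gloss = {t.lower() for t in synset.definition().split()} - stopwords
        overlap = len(gloss & context)            # word overlap between gloss and context
        if overlap > best_overlap:
            best, best_overlap = synset, overlap
    return best

print(simplified_lesk("bank", "he sat on the bank of the river".split()))
```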
- Lexical substitution task
- find a substitute for the target word, such that the meaning is preserved.
Distributional (Vector-based) Lexical Semantics
(J&M 3rd ed. ch 15 & 16, not in 2nd ed.)
- Dense Word Representations:
- Distributional Semantics (“count-based” approaches)
- Based on linguistic theory (“distributional hypothesis”)
- Context is symbolic, dimensions are interpretable.
- Distributed Semantics
- Based on connectionist approaches.
- Dimensions are not necessarily interpretable
- Application focused
- Distributional hypothesis
- Two words should be similar if they have similar typical word contexts.
- Co-occurrence matrix
- Numbers are co-occurrence counts (how often the symbols appear together in context)
- Verb-Object counts
- Distance/Similarity metrics (Euclidean distance, cosine similarity; a numpy sketch follows this block)
- Can be seen as coordinates in n-dimensional Euclidean space.
- Similarity:
- Spatial distance between words. (lower distance = higher similarity)
- Cosine of the angle between two word vectors
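The two measures above as a short numpy sketch; the co-occurrence rows are hypothetical.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two vectors (1 = same direction, 0 = orthogonal)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def euclidean(u, v):
    """Spatial distance between two vectors (lower distance = higher similarity)."""
    return np.linalg.norm(u - v)

# hypothetical co-occurrence rows for two words
cat = np.array([5.0, 0.0, 3.0])
dog = np.array([4.0, 1.0, 2.0])
print(cosine(cat, dog), euclidean(cat, dog))
```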
- Distributional Semantic Models
- A Distributional Semantic Model (DSM) is any matrix M such that each row represents the distribution of a term x across contexts, together with a similarity measurement.
- Parameters/Dimensions of Distributional Semantic Models
- Preprocessing, term definition: word form, lemmas, POS, …
- Context definition:
- Type of context (word, syntactic dependents (with or without relation labels), remove stop-words, etc.)
- Size of context window.
- Feature scaling / term weighting
- Not all context terms are equally relevant to characterize the meaning of a word. -> One solution: TF*IDF (term frequency * inverse-document frequency)
- Normalization of rows / columns
- Dimensionality reduction
- Similarity measure
- Semantic similarity and relatedness (paradigmatic vs. syntagmatic relatedness)
- Semantic Relatedness (syntagmatic relatedness)
- Words that occur nearby each other, but are not necessarily similar.
- Semantic Similarity (paradigmatic relatedness)
- Can typically substitute one word for another
- Effect of context size on type of relatedness.
- define distributional context to capture each type of similarity
- Sparse vs. Dense Vectors
- One-hot representation (of words, word-senses, features)
- Word Embeddings
- Word embeddings are representations of words in a low-dimensional, dense vector space.
- Two approaches:
- Use matrix decomposition on the co-occurrence matrix, for example Singular Value Decomposition (SVD)
- Learn distributed embeddings using neural networks. Minimal feature-engineering required.
- Word2Vec embeddings using a neural network (the network should capture the relationship between a word and its context); a gensim usage sketch follows this block
- Skip-gram model: Input is a single word. Predict a probability for each context word.
- Continuous bag-of-words (CBOW): Input is a representation of the context window. Predict a probability for each target word.
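A hedged usage sketch for training skip-gram (and, with `sg=0`, CBOW) embeddings with the gensim library; parameter names follow gensim 4.x and the toy corpus is illustrative.

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "chased", "the", "cat"]]          # hypothetical toy corpus

# sg=1 selects the skip-gram objective; sg=0 would train CBOW instead
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["cat"][:5])                   # dense embedding vector for "cat"
print(model.wv.most_similar("cat", topn=3))  # nearest neighbours by cosine similarity
```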
Recurrent Neural Nets
- Neural language model (without RNNs)
- Use a feed-forward neural network with softmax activation to model the probabilities.
- learn embeddings and Language Model at the same time
- add a new input layer that is a concatenation of one-hot vectors.
- all context words share a single weight matrix E
- Recurrent neural nets: Many NLP tasks require models of sequences. RNN take entire history into account.
- Basic RNN concept, hidden layer as state representation (a numpy sketch appears at the end of this block).
- Network computes two functions:
- R maps input vector x to hidden representation s.
- O maps hidden representation s to output vector y.
- Hidden layer represents a state. State representation is fed back into the function R.
- The output at step i depends on all previous steps.
- Common usage patterns:
- Acceptor / Encoder
- Encoder: RNN used to compute sequence encoding (for example, to compute a sentence representation). This representation is then used in some other task.
- Transducer: One output for each input (time step). During training, loss for each time step is combined.
- Transducer as generator: Typically trained like a regular transducer on the target output in each step. Previous output is next input.
- Conditioned transduction (encoder-decoder): Concatenate the context (encoder representation) with the input at each step.
- Sequence to Sequence: Acceptor + Conditioned transduction
- Backpropagation through time (BPTT) (*)
- For each input, treat the unfolded network (copies of all units for each time step) as one big feed-forward network. But with shared weights across time steps.
- During the forward pass, create copies of the RNN unit for each time step, but share the same weight vector.
- LSTMs (*)
- Split the state vector into two halves: a “memory cell” c and working memory h.
- “gates” decide:
- how much of the input to add to the memory
- how much of current memory to forget
- Bidirectional RNNs
- computes two hidden states (forward and backward) for each input.
- Stacked/Layered/Deep RNNs
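A minimal numpy sketch of the basic recurrence described above: R combines the current input with the previous state, O maps the state to an output; all weight shapes here are illustrative.

```python
import numpy as np

def rnn_forward(xs, W_x, W_s, b, W_o, c):
    """xs: list of input vectors. Returns one output vector per time step."""
    s = np.zeros(W_s.shape[0])                 # initial hidden state
    outputs = []
    for x in xs:
        s = np.tanh(W_x @ x + W_s @ s + b)     # R: new state from input and previous state
        y = W_o @ s + c                        # O: output from the state
        outputs.append(y)                      # output at step i depends on all previous steps
    return outputs

# illustrative dimensions: 4-dim inputs, 3-dim state, 2-dim outputs
rng = np.random.default_rng(0)
xs = [rng.normal(size=4) for _ in range(5)]
outs = rnn_forward(xs, rng.normal(size=(3, 4)), rng.normal(size=(3, 3)),
                   np.zeros(3), rng.normal(size=(2, 3)), np.zeros(2))
print(len(outs), outs[0].shape)
```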
- ELMo
- Conventional pre-trained embeddings (e.g. word2vec) do not take context into account and cannot model polysemy
- contextualized word representations: compute word representations as a combination of the stacked layer outputs. (pre-train a stacked biRNN language model)
- model architecture, bi-Language Model training objective
- pre-training and applying ELMo vectors
- Encoder/Decoder and Sequence to Sequence
- Attention mechanism
- Context vector
- alignment score and attention weights
- attention as a “lookup” operation (queries/keys/values): Attention functions can be described as mapping a query, and a set of key-value pairs to an output.
- query: $s_{j-1}$ in $score(s_{j-1},h_i)$ - previous hidden state of the decoder
- keys: $h_i$ in $score(s_{j-1},h_i)$ - the input representation of the encoder at time $i$
- values: $h_i$ in $c_j=\sum \alpha_{j,i}h_i$
- $\alpha_{j,i}$ a compatibility score between Query and each Key, select how much the corresponding Value contributes to the output.
Transformer Based Models
(not in textbook)
- Problem with the RNNs for MT:
- Fixed-length encoded representation becomes information bottleneck.
- Not everything in the input sequence is equally important to predict each word in the decoder
- Integrate the idea of “alignments” into the seq2seq approach (alignments: weight each input word differently)
- Each output word corresponds to one (or a few) words in the input.
- Long distance dependencies between input and output.
- Attention Mechanism
- Maintain a sequence of hidden state vectors in the source sequence.
- For each time step during decoding, select which positions in the source sentence contain the most relevant information.
- Compute a different context vector $c_j$ for each target word $y_j$, as a weighted sum of the representations of the input words.
- Compute alignment weights using softmax of $score(s_{j-1},h_i)$
- Encoder/Decoder/Attention weights trained jointly.
- Transformer architecture (contrast to RNNs: a non-recurrent architecture for sequence modeling and transduction that is based exclusively on attention.)
- self attention
- to compute representations that depend on other representations in the same layer
- scaled dot product attention function
- to compute the $score(q,k)$ (a numpy sketch follows this block)
- multi-head attention
- Compute h different linear projections of V, K, Q (“attention heads”), compute scaled dot-product attention for each, then concatenate the outputs.
- Encoder layers: Encode the input sequence using a stack of N encoders.
- Decoder layers: Generate one output word at a time, based on the previous outputs and the encoded input.
- Masked multi-head self-attention (*): masking prevents positions from attending to output positions that have not been generated yet
- Positional encoding (*)
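A numpy sketch of scaled dot-product attention, softmax(QKᵀ/√d_k)·V, the core operation reused by every attention head; an optional boolean mask (True = may attend) illustrates the masking used in the decoder.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v). Returns (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # compatibility of each query with each key
    if mask is not None:                       # e.g. block attention to future positions
        scores = np.where(mask, scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over the keys
    return weights @ V                                   # weighted sum of the values

Q = np.random.rand(2, 8); K = np.random.rand(5, 8); V = np.random.rand(5, 16)
print(scaled_dot_product_attention(Q, K, V).shape)       # (2, 16)
```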
- How Attention is Used:
- Self attention:
- Q, K, and V are all computed from the output representations of the previous layer (using linear transformations).
- Encoder/Decoder attention:
- K, V are computed from the output of the encoder. Q is computed from the output of the previous decoder layer.
- Unsupervised Pre-Training
- Problem: Most NLP tasks depend on annotated training data, which is scarce and expensive to create.
- Idea: Train a model on some unsupervised task (e.g. language modeling) on large amounts of data, and then transfer the model parameters to another task.
- GPT (Generative PreTraining)
- uses transformer decoder layers
- Pre-trained using language modeling objective.
- Application of GPT to classification problems
- Representing various single/two sentence classification problems as input for GPT.
- Fine tuning (weighted secondary objective)
- helps to keep training the pre-trained transformer weights on the new task during fine-tuning
- BERT (Bidirectional Encoder Representations from Transformers)
- uses stacked transformer encoder layers
- no masked attention, so representations can attend to the entire input sequence
- input representation
- significance of the [CLS] token
- First position is a special classification token [CLS], used to compute a single representation of the input for classification tasks.
- masked language model training objective
- GPT’s LM objective only takes left context into account (masked attention in transformer decoder layers)
- BERT uses a “masked language model” objective for pre-training, which allows the use of bidirectional self-attention (i.e. transformer encoder layers)
- Randomly replace 15% of the input tokens in each sequence with a [MASK] symbol.
- Maximize the probability of the actual tokens that were seen in the input sequence at all the masked positions
- next sentence prediction training objective as the secondary objective
- Add a secondary objective (trained jointly with MLM) to obtain representations that depend on the other sentence.
- applying BERT to sequence labeling and prediction tasks (and fine tuning)
- For sequence tasks: Use output representations to make per-token predictions.
- For classification tasks: Use the [CLS] representation to make a prediction (see the usage sketch below).
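A hedged usage sketch with the Hugging Face `transformers` library, showing where the [CLS] vector and the per-token vectors come from; treat the model name and surrounding setup as assumptions about a typical workflow rather than part of the course material.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# the tokenizer adds the special [CLS] and [SEP] tokens automatically
inputs = tokenizer("the movie was surprisingly good", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

cls_vector = outputs.last_hidden_state[:, 0]   # [CLS] representation, used for classification
token_vectors = outputs.last_hidden_state      # per-token representations, used for sequence labeling
print(cls_vector.shape, token_vectors.shape)
```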
Semantic Role Labeling
(J&M Ch. 22)
- Frame Semantics
- Frame, Frame Elements (specific to each frame)
- Frame Elements: Frames describe the interaction/relation between a set of frame-specific semantic roles called Frame Elements (FEs)
- i.e.: Donor, Theme, Recipient
- Valence Patterns
- FrameNet: (Frame Semantic Parsing)
- Central interest: mapping from Grammatical Function (Subj, Obj, …) to Frame Elements.
- Frame definitions: A frame represents a situation, object, event providing background needed to understand a word
- Lexical Units: A pair of a word and a frame is called a lexical unit
- Example annotations to illustrate valence patterns.
- Example annotations illustrate how frame elements are realized linguistically.
- Valence patterns (derived from annotated sentences) specify different ways grammatical roles (subject, object, …) can be mapped to frame elements for a given lexical unit.
- Frame-to-frame relations and frame-element relations (*) (you do not need to remember an exhaustive list of these relations).
- PropBank: (Semantic Role Labeling)
- Differences from FrameNet: only verbs, with numbered arguments (semantic roles) rather than frame-specific roles
- Semantic Roles (ARGx)
- Arg0:PROTO-AGENT - Causes an event or change of state in another participant
- Arg1:PROTO-PATIENT - Undergoes change of state. Causally affected by another participant
- Arg2: benefactive, instrument, attribute, or end state
- Arg3: start point, benefactive, instrument, or attribute
- Arg4: the end point
- predicate-argument structure.
- framesets
- Different framesets correspond to different senses
- Semantic Role Labeling:
- Goal: automatically produce PropBank or FrameNet-style annotations (“frame-semantic parsing”).
- Steps of the general syntax-based approach
- Parse the sentence (dependency or constituency parse)
- Detect all potential targets (predicates / frame evoking elements)
- For each predicate:
- For each node in the parse tree use supervised ML classifiers to:
- identify if it is an argument.
- label the argument with a role.
- Features for semantic role/FE identification and labeling. (*) (you do not need to remember an exhaustive list, but have a general sense of which features are important).
- Selectional restrictions and preferences.
- Different semantic roles might have restrictions on the semantic type of arguments they can take.
- Food FE (or ARG1) needs to be edible
- Parse-tree path features
Statistical MT
(J&M 2nd ed. Ch, also see Michael Collins’ notes on IBM M2)
- Challenges for MT
- Sentence Word order
- Word order in phrases (adjective modifiers)
- Prepositions and case marking:
- Lexical divergence
- Vauquois triangle.
- Faithfulness vs. Fluency
- Faithfulness: Target sentence should have the same content as the source text.
- Fluency: Target text should be grammatical / natural / fluent in the target language.
- often a trade-off between these two factors.
- Parallel Corpora
- The Rosetta Stone: Identical text written in three scripts (Hieroglyphics, Demotic, and Greek)
- Noisy Channel Model for MT
- Goal: Given an observation in the source language (F), figure out what was said in the target language (E).
- Fluency is modeled by P(E).
- Faithfulness is modeled by P(F|E).
- Apply Bayes’ rule: P(E|F) $\propto$ P(E) P(F|E)
- Word alignments:
- First step: Figure out word-to-word translations by aligning words between parallel sentences (bitext) in the corpus.
- IBM Model 2:
- Alignment variables and model definition.
- Simplifying assumption: alignments are one-to-many, i.e. each f word originates from exactly one e word (not many-to-one, many-to-many, …)
- Use alignment variables $a_1…a_m$ to indicate the position that each word in f is aligned to.
- Mini translation model: $P(F|E)=\sum_A P(F,A|E)$
- Decoding problem: find $\text{argmax}_E P(E)P(F|E)$
- Model the conditional probability: $P(F,A|E)=P(f_1,…,f_m,a_1,…,a_m|e_1,…,e_l,m)$
- $t(f|e)$ is the probability of generating word f from e.
- $q(j|i,l,m)$ is the probability that alignment variable i takes value j (given sentence length m and l)
- Define the conditional probability for the target sentence and alignments as: $P(f_1,…,f_m,a_1,…,a_m|e_1,…,e_l,m)=\prod_{i=1}^m q(a_i|i,l,m) t(f_i|e_{a_i})$
- Independence assumption: Alignment variables are independent of each other
- Computing Alignments (a code sketch follows this block):
- Compute the optimal alignments given a sentence pair: $\text{argmax}_{a_1,…,a_m} P(a_1,…,a_m|f_1,…,f_m,e_1,…,e_l,m)$
- All $a_i$ are independent: $a_i=\text{argmax}_{j \in {0,…,l}} q(j|i,l,m)t(f_i|e_j)$ for $i=1,…,m$
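A sketch of the independent per-position argmax above. The tables `t[(f, e)]` ≈ t(f|e) and `q[(j, i, l, m)]` are hypothetical dictionaries that would be estimated with EM; `e_words[0]` plays the role of the special NULL word.

```python
def best_alignment(f_words, e_words, t, q):
    """Return a_1..a_m: for each source word f_i, the index j of the e word it aligns to.
    t[(f, e)] and q[(j, i, l, m)] are probability tables estimated beforehand (e.g. with EM)."""
    m, l = len(f_words), len(e_words) - 1        # exclude the NULL word from l
    alignment = []
    for i, f in enumerate(f_words, start=1):
        # each a_i is chosen independently: argmax_j q(j|i,l,m) * t(f_i|e_j)
        best_j = max(range(0, l + 1),
                     key=lambda j: q.get((j, i, l, m), 0.0) * t.get((f, e_words[j]), 0.0))
        alignment.append(best_j)
    return alignment
```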
- EM training for Model 2 (*)
- Phrase-based MT (*)
- MT Evaluation:
- BLEU Score (a simplified computation is sketched below)
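A simplified sentence-level BLEU sketch (clipped n-gram precisions combined with a brevity penalty), assuming a single reference; real evaluations use corpus-level statistics and a standard implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against a single reference (illustrative only)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(cand[g], ref[g]) for g in cand)      # clipped n-gram counts
        precisions.append(overlap / max(sum(cand.values()), 1))
    if min(precisions) == 0:                                   # no smoothing: any zero precision -> 0
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n     # geometric mean of precisions
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))   # brevity penalty
    return bp * math.exp(log_avg)

cand = "the cat is on the mat".split()
ref = "the cat sat on the mat".split()
print(bleu(cand, ref, max_n=2))      # using up to bigrams so the toy example is non-zero
```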