Natural Language Processing
Books:
- Jurafsky D. and Martin J. H., Speech and Language Processing, 2e,
Pearson Education
References:
- Allen J., Natural Language Understanding, 2e, Pearson Education
- Bharati A., Sangal R., Chaitanya V., Natural Language Processing: A
Paninian Perspective, PHI
- Siddiqui T., Tiwary U. S., Natural Language Processing and Information
Retrieval, OUP
1. Introduction
Language
- A protocol for representation of information/knowledge for
communication or storage.
- Comprises a set of elementary symbols, and rules for combining these
symbols.
- Forms- sound-based, action-based, or written (pictorial or
alphabet-based).
Natural Language is a language used by human beings in spoken form and,
optionally, in written form too. It may also be called human language.
The study of languages is called linguistics.
We shall use the term linguistic expression or simply, expression
to denote instances of use of a language to represent information.
From a linguistic perspective, use of a language comprises two broad tasks-
understanding and generation.
From a technical perspective, there are additional aspects, such as rendering,
encoding, storage, etc.
Using computers to deal with natural languages is gaining momentum. The use
of computers in linguistics is referred to as computational linguistics.
Natural Language Processing (NLP) broadly denotes the use of computers in
applications that require knowledge of language(s). NLP covers
computational linguistics, as well as techniques required for encoding,
rendering, and storage of linguistic expressions.
Some NLP applications are- spelling correction, grammar checking, information
retrieval, summarization, machine translation, etc.
Ambiguity
....
Language processing capability of a computer is closely related to the wider
subject of artificial intelligence. Alan Turing proposed the Turing
Test, a game in which a computer's use of language would form the basis
for determining whether it could think. If the machine wins, it is judged
intelligent.
Brief History-
1940s and 1950s- Work on automata and probabilistic (or information-theoretic)
models. The automata idea is related to the field of formal language
theory. The probabilistic models grew out of Shannon's concepts of the
noisy channel and decoding, and of entropy.
1960s- Two paradigms- symbolic and stochastic. The symbolic paradigm is based
on formal language theory and generative syntax.
1970-1983 - Four paradigms- stochastic, logic-based, natural language
understanding, discourse modelling.
1983-1993 - Empiricism and Finite State Models.
1994-1999 - Combination of different paradigms.
Different NLP applications require different levels of knowledge of the
language. These are-
- phonetics and phonology,
- morphology,
- syntax
- semantics
- pragmatics
- discourse
- world knowledge
Formal Models for NLP
- State machines- deterministic and non-deterministic FSA, finite-state
transducers, weighted automata, Markov models, HMM.
- Formal rule systems- regular grammars, context-free grammars,
feature-augmented grammars, and their probabilistic variants.
- Logic- first order logic (aka, predicate calculus), feature structures,
semantic networks, conceptual dependency.
- Probability theory- the above models can be augmented with probabilities,
mostly to resolve ambiguity issues. Probabilistic models are related to
machine learning models.
Text representation in computers, encoding schemes
Word Recognition
Given an alphabet (a set of elementary symbols used in the written form of a
language), not all combinations/sequences of those symbols are valid in
the language. Valid sequences/strings of letters are called words. In terms
of an alphabet, a word is a string of non-space letters from the
alphabet that carries a meaning in the language.
- Valid words can be specified, for the purpose of recognition, using
regular expressions (REs); see the sketch after this list.
- A regular expression is a formula in a special language for specifying
simple classes of strings. It is an algebraic notation for characterizing
a set of strings.
- Regular expressions can be used to specify search strings as well as to
define a language in a formal way.
- In the context of searching, a regular expression specifies a pattern to be
searched for in a corpus of texts.
- A Formal language is a set of strings, each string composed of
symbols from a finite set called an alphabet.
- A model which can both generate and recognise all and only the strings
of a formal language acts as a definition of the formal language.
- A regular expression is one way of characterizing a particular kind of
formal language called a regular language.
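As a small illustration (the word list and the pattern below are invented for
this sketch), a Perl regular expression can recognise all and only the strings
of the finite language { cat, cats, dog, dogs }:
#!/usr/bin/perl
# Sketch: a regular expression recognising the toy language
# { cat, cats, dog, dogs }.  The word list is invented for illustration.
foreach my $w ("cat", "cats", "dog", "dogs", "cta", "catss") {
    if ($w =~ /^(cat|dog)s?$/) {
        print "$w : accepted\n";
    } else {
        print "$w : rejected\n";
    }
}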
Finite State Automata
[Refer to books on FLA.]
- Formally, an FSA is a 5-tuple (Q, Σ, q0, F, δ(q, i)), where Q is a finite
set of states, Σ is a finite input alphabet, q0 is the start state, F ⊆ Q is
the set of final states, and δ(q, i) is the transition function.
- Any regular expression can be implemented as a finite state automaton.
Symmetrically, any finite state automaton can be described with a regular
expression.
- Given an FSA m, L(m) is the formal language characterised
by m.
- An FSA can be thought of as reading, from left to right, the cells of a
tape that contain symbols, making a state transition at each step based on
the current state and the tape symbol read. Certain states may be
designated as final states or accepting states.
- Formal languages are not the same as natural languages. But we often use
a formal language to model part of a natural language, such as parts of
phonology, morphology, or syntax. In linguistics, the term generative
grammar is sometimes used to mean a grammar of a formal language.
- The operation of an FSA can be represented in the form of a state
transition table (see the sketch after this list).
- An FSA can be one of two forms- Deterministic FSA (DFA) or
Non-deterministic FSA (NFA). Every NFA can be represented by an
equivalent DFA.
- In an NFA one may have to choose among several possible transitions. Three
standard solutions for this are- backup, look-ahead, parallelism.
- Recognition by an NFA is a kind of state-space search. The search is
performed either as depth-first search (LIFO), or breadth-first search
(FIFO).
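A minimal sketch of a DFA recogniser driven by a state transition table. The
automaton below, which accepts strings of the form b a a ... a !, is invented
purely for illustration:
#!/usr/bin/perl
# Sketch: simulate a DFA given as a state transition table.
# The automaton accepts strings matching /^baa+!$/ and is illustrative only.
my %delta = (
    '0 b' => 1,
    '1 a' => 2,
    '2 a' => 3,
    '3 a' => 3,
    '3 !' => 4,
);
my %final = (4 => 1);

sub recognise {
    my ($tape) = @_;
    my $q = 0;                              # start state q0
    foreach my $sym (split //, $tape) {     # read the tape left to right
        return 0 unless exists $delta{"$q $sym"};
        $q = $delta{"$q $sym"};             # make the transition
    }
    return exists $final{$q};               # accept iff we end in a final state
}

foreach my $s ("baa!", "baaaa!", "ba!", "abaa!") {
    print "$s : ", (recognise($s) ? "accepted" : "rejected"), "\n";
}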
Regular expressions in Perl- [Examples].
#!/usr/bin/perl
# Prompts for a file name and prints all those lines of the file that
# contain the vowels a, e, i, o, u in order.
print "Please give me a file name>>> ";
$filename = <STDIN>;
chomp $filename;
open(IN, $filename) or die "Cannot open $filename: $!";
while (<IN>) {
    if ($_ =~ m/a.*e.*i.*o.*u/) {
        print;
    }
}
close(IN);
Morphology
- Study of the structure of words.
In natural languages words are often formed by some modification of other
words or sub-words.
- Spelling rules: fly -> flies.
- Morphological rules: fish is both singular and plural, cats is the
plural of cat. (A small sketch illustrating such rules follows this list.)
Identifying the root components of a word and the attributes of the surface
form of the word is called morphological parsing.
- Stemming
- Morphemes- the minimal meaning-bearing units in a language.
- Affixes- prefix, suffix, infix, circumfix.
- Productive- applies to (almost) all elements of a class.
- Concatenative morphology.
- Non-concatenative morphology - infix, circumfix, templatic (or
root-and-pattern) morphology.
- Agglutinative languages- languages that tend to string affixes (morphemes)
together.
- Inflection.
- Derivation.
- Regular and irregular words.
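A toy sketch of the spelling and morphological rules mentioned above, in the
generation direction. The rule set and the small irregular list are invented
for this sketch and are not a full analysis of English:
#!/usr/bin/perl
# Toy generator applying a few English pluralisation rules.
# The rules and the irregular list are illustrative only.
my %irregular = (fish => "fish", child => "children", mouse => "mice");

sub pluralise {
    my ($noun) = @_;
    return $irregular{$noun} if exists $irregular{$noun};    # irregular forms
    if ($noun =~ /[^aeiou]y$/) {                             # fly -> flies
        (my $p = $noun) =~ s/y$/ies/;
        return $p;
    }
    return $noun . "es" if $noun =~ /(s|sh|ch|x|z)$/;        # fox -> foxes
    return $noun . "s";                                      # cat -> cats
}

foreach my $n ("cat", "fly", "fox", "fish", "child") {
    print "$n -> ", pluralise($n), "\n";
}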
Finite-state morphological parsing
Input- word in a surface form.
Output- stem of the word plus the morphological features.
Building a morphological parser requires-
- Lexicon.
- Morphotactics - the model of the morpheme ordering, usually in terms of
classes of morphemes rather than in terms of actual words.
- Orthographic rules- spelling rules.
Finite state automata can be used to perform morphological
recognition. But in NLP we often want to figure out attributes of a word
that its morphology implies. This exercise is morphological parsing.
A finite state transducer (FST) can be used for morphological parsing.
A transducer maps between one set of symbols and another. An FST does this
via a finite automaton.
Formal definition of an FST- [Based on the Mealy machine.] A 5-tuple
(Q, Σ, q0, F, δ(q, i:o)), where
Σ is a finite alphabet of complex symbols. Each complex symbol is an
input-output pair i:o, where i is a symbol from an input alphabet I and
o is a symbol from an output alphabet O.
δ is a relation from Q × Σ to Q.
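A minimal sketch of morphological parsing with a transducer. The tiny machine
below, which handles only the invented lexicon { cat, dog } with an s-plural,
simulates a non-deterministic FST whose arcs carry input:output pairs ("" is
used for epsilon):
#!/usr/bin/perl
# Sketch: non-deterministic simulation of a small FST for morphological
# parsing.  Each arc is [from-state, input, output, to-state]; the machine
# and its toy lexicon { cat, dog } are invented for illustration.
my @arcs = (
    [0,'c','c',1], [1,'a','a',2], [2,'t','t',3],      # stem c-a-t
    [0,'d','d',4], [4,'o','o',5], [5,'g','g',6],      # stem d-o-g
    [3,'','+N',7], [6,'','+N',7],                     # end of noun stem
    [7,'','+SG',9], [7,'s','+PL',9],                  # number morpheme
);
my %final = (9 => 1);

sub parse {
    my ($surface) = @_;
    my @results;
    my @agenda = ([0, 0, ""]);                        # (state, position, output)
    while (@agenda) {
        my ($q, $pos, $out) = @{ shift @agenda };
        push @results, $out
            if $final{$q} && $pos == length($surface);
        foreach my $arc (@arcs) {
            my ($from, $in, $o, $to) = @$arc;
            next unless $from == $q;
            if ($in eq "") {                          # epsilon-input arc
                push @agenda, [$to, $pos, $out . $o];
            } elsif ($pos < length($surface)
                     && substr($surface, $pos, 1) eq $in) {
                push @agenda, [$to, $pos + 1, $out . $o];
            }
        }
    }
    return @results;
}

foreach my $w ("cat", "cats", "dogs") {
    print "$w -> ", join(", ", parse($w)), "\n";
}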
Porter's Stemmer
Acquisition of Morphology
- Supervised Acquisition
- Un-supervised Acquisition
- Goldsmith's method - based on the concept of Minimum Description Length
(MDL).
- Gaussier's method - Occurrence of a pseudo-suffix pair with more
than one base.
- Method developed at Tezpur University for a highly inflectional
language.
N-grams
- Word Prediction - useful in speech recognition, handwriting recognition,
augmentative communication for the disabled, and spelling error
detection/correction.
- Guessing the next word is related to computing the probability of a
sequence of words. Knowing the probability of whole sentences or strings
of words is useful in part-of-speech tagging, word-sense disambiguation,
and probabilistic parsing.
- An n-gram model uses the previous n-1 words to predict the next one. It
is a kind of language model, or grammar.
Counting Words in a Corpus
- Spoken corpora and text corpora have certain distinct characteristics.
- Word forms- words as they appear in a text, possibly in an inflected
form. E.g., cat and cats are distinct word forms.
- Lemma - a set of lexical forms having the same stem, the same major
part-of-speech, and the same word-sense. E.g., cat and cats belong to the
same lemma- cat.
- Tokens - the number of words in a corpus/text, including repetitions.
- Types - the number of distinct words in a corpus/text.
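A small sketch of token and type counting over a text file named on the
command line. Lower-casing and the crude split on non-letters are simplifying
assumptions of this sketch:
#!/usr/bin/perl
# Sketch: count tokens (running words) and types (distinct word forms)
# in the file(s) given on the command line.
my %seen;
my $tokens = 0;
while (my $line = <>) {
    foreach my $w (split /[^a-z]+/, lc $line) {
        next if $w eq "";
        $tokens++;          # every running word is a token
        $seen{$w}++;        # remember the word form (type)
    }
}
my $types = scalar keys %seen;
print "Tokens: $tokens\nTypes:  $types\n";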
Unsmoothed n-grams
- Given the set of words in a language, not all arbitrary sequences of
n words are equally likely. Some sequences are outright invalid;
many are unlikely.
- The probability of a particular word occurring after a given sequence of
n-1 words is estimated from the actual number of occurrences of
the n-word sequence in a corpus.
- The probability of an n-word sequence P(w_1, w_2, ..., w_{n-1}, w_n),
written compactly as P(w_1^n), can be decomposed by the chain rule as
P(w_1^n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1^2) ... P(w_n | w_1^{n-1})
- The bigram model approximates the probability of a word given all the
previous words by the conditional probability given the preceding word
alone, i.e. P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-1}). This assumption is the
Markov assumption.
- In general, the N-gram approximation is
P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-N+1}^{n-1})
- N-gram models can be trained by counting and normalising. In other words,
the probability P(w_n | w_{n-N+1}^{n-1}) can be determined by counting the
number of occurrences of that N-gram in a corpus and dividing it by the
number of occurrences of all N-grams that have the same leading sequence
of N-1 words. That is,
P(w_n | w_{n-N+1}^{n-1}) = C(w_{n-N+1}^{n-1} w_n) / C(w_{n-N+1}^{n-1})
N-gram models are sensitive to the training corpus.
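A minimal sketch of counting and normalising for the bigram case (N = 2),
i.e. P(w2 | w1) = C(w1 w2) / C(w1). The tiny in-line corpus and the <s>
sentence marker are assumptions of this sketch:
#!/usr/bin/perl
# Sketch: unsmoothed (maximum likelihood) bigram model estimated by
# counting and normalising.  The corpus below is invented.
my @corpus = ("<s> i am sam",
              "<s> sam i am",
              "<s> i do not like green eggs and ham");

my (%unigram, %bigram);
foreach my $sent (@corpus) {
    my @w = split /\s+/, $sent;
    for my $i (0 .. $#w) {
        $unigram{$w[$i]}++;                           # C(w1)
        $bigram{"$w[$i] $w[$i+1]"}++ if $i < $#w;     # C(w1 w2)
    }
}

# P(w2 | w1) = C(w1 w2) / C(w1)
sub p_bigram {
    my ($w1, $w2) = @_;
    return 0 unless $unigram{$w1};
    return ($bigram{"$w1 $w2"} // 0) / $unigram{$w1};
}

printf "P(i | <s>)  = %.3f\n", p_bigram("<s>", "i");   # 2/3
printf "P(am | i)   = %.3f\n", p_bigram("i", "am");    # 2/3
printf "P(sam | am) = %.3f\n", p_bigram("am", "sam");  # 1/2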
Smoothing
Smoothing is required since the relative frequencies of N-grams not present
in the training corpus turn out to be zero even for linguistically
valid N-grams.
Add-One Smoothing-
- Add one to the counts of all N-grams, both those that have occurred in the
corpus and those that have not but are possible with the given vocabulary.
- To obtain the smoothed relative frequencies from the modified (one-added)
counts, normalise them by dividing each by the total number of tokens
having the same (N-1)-gram prefix, plus the size of the vocabulary.
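A short sketch of the add-one estimate for bigrams,
P(w2 | w1) = ( C(w1 w2) + 1 ) / ( C(w1) + V ),
where V is the vocabulary size. The toy counts and vocabulary size below are
invented for illustration:
#!/usr/bin/perl
# Sketch: add-one (Laplace) smoothed bigram probabilities from toy counts.
my %unigram = ("i" => 3, "am" => 2, "sam" => 2);
my %bigram  = ("i am" => 2, "am sam" => 1);
my $V = 8;                          # assumed vocabulary size

sub p_addone {
    my ($w1, $w2) = @_;
    my $c_bigram  = $bigram{"$w1 $w2"} // 0;
    my $c_unigram = $unigram{$w1}      // 0;
    return ($c_bigram + 1) / ($c_unigram + $V);
}

printf "P(am | i)    = %.3f\n", p_addone("i", "am");     # (2+1)/(3+8)
printf "P(ham | sam) = %.3f\n", p_addone("sam", "ham");  # unseen: (0+1)/(2+8)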
Witten-Bell Discounting-
- Based on the intuition that the probability of seeing a zero-frequency
N-gram can be modelled by the probability of seeing an N-gram for the
first time.
- Probability of seeing an N-gram for the first time can be estimated by
counting the number of times an N-gram is seen for the first time. This
is simple since it is exactly the number of distinct N-grams that occur
in the training corpus.
- Total probability mass assigned to all zero-frequency N-grams:
Σ_{i: c_i = 0} p_i* = T / (N + T)
where T is the number of distinct N-gram types observed and N is the total
number of N-gram tokens in the training corpus.
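A toy unigram-level sketch of this discounting scheme: seen items keep
c_i / (N + T), and the mass T / (N + T) is spread evenly over the Z unseen
items. The counts and the list of unseen items are invented:
#!/usr/bin/perl
# Sketch: Witten-Bell style discounting over a toy unigram distribution.
# T = observed types, N = observed tokens, Z = unseen types.
my %count  = (the => 4, cat => 2, sat => 1);           # observed counts
my @unseen = ("dog", "mat");                           # items never observed

my $N = 0; $N += $_ for values %count;                 # total tokens
my $T = scalar keys %count;                            # observed types
my $Z = scalar @unseen;                                # unseen types

my %p;
$p{$_} = $count{$_} / ($N + $T) for keys %count;       # discounted seen items
$p{$_} = $T / ($Z * ($N + $T))  for @unseen;           # share of unseen mass

printf "%-4s %.4f\n", $_, $p{$_} for sort keys %p;
my $total = 0; $total += $_ for values %p;
printf "sum  %.4f\n", $total;                          # should be 1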
Good-Turing Discounting-
- Re-estimate the amount of probability mass to assign to N-grams with
zero or low counts by looking at the number of N-grams with higher counts.
With N_c = Σ_{b: c(b) = c} 1, i.e. the number of N-grams that occur exactly
c times, the re-estimated (discounted) count is c* = (c + 1) N_{c+1} / N_c.
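A small numeric sketch of the re-estimated counts, using an invented
count-of-counts table N_c:
#!/usr/bin/perl
# Sketch: Good-Turing re-estimated counts  c* = (c+1) * N_{c+1} / N_c,
# where N_c is the number of N-gram types seen exactly c times.
# The count-of-counts table below is invented for illustration.
my %Nc = (1 => 10, 2 => 5, 3 => 3, 4 => 2, 5 => 1);

sub gt_count {
    my ($c) = @_;
    return undef unless $Nc{$c} && $Nc{$c + 1};   # needs both N_c and N_{c+1}
    return ($c + 1) * $Nc{$c + 1} / $Nc{$c};
}

foreach my $c (1 .. 4) {
    printf "c = %d  c* = %.3f\n", $c, gt_count($c);
}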
Backoff-
- Build an N-gram model based on an (N-1)-gram model.
Backoff can be combined with discounting- use discounting to compute the
probability mass available for unseen events (N-grams); then use backoff to
distribute this mass among the unseen N-grams according to their
(N-1)-gram probabilities.
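A schematic sketch of discounting plus backoff in the bigram case, roughly in
the style of Katz backoff: if a bigram was seen, use its discounted
probability P*; otherwise back off to the unigram probability, scaled by a
weight alpha(w1) chosen so that the probabilities for that context still sum
to one. The discounted bigram probabilities and the unigram probabilities
below are simply assumed to be given (e.g. from Good-Turing or Witten-Bell)
and are invented numbers:
#!/usr/bin/perl
# Sketch: Katz-style backoff for bigrams over a toy 4-word vocabulary.
my %pstar = (                       # discounted P*(w2 | w1) for seen bigrams
    "i am" => 0.55,
    "i do" => 0.25,
);
my %p1 = (am => 0.2, do => 0.1, sam => 0.3, ham => 0.4);   # unigram P(w2)

# alpha(w1): left-over mass for context w1, divided by the unigram mass of
# the words NOT seen after w1, so that the distribution sums to one.
sub alpha {
    my ($w1) = @_;
    my ($seen_pstar, $seen_p1) = (0, 0);
    foreach my $w2 (keys %p1) {
        if (exists $pstar{"$w1 $w2"}) {
            $seen_pstar += $pstar{"$w1 $w2"};
            $seen_p1    += $p1{$w2};
        }
    }
    return (1 - $seen_pstar) / (1 - $seen_p1);
}

sub p_backoff {
    my ($w1, $w2) = @_;
    return $pstar{"$w1 $w2"} if exists $pstar{"$w1 $w2"};
    return alpha($w1) * $p1{$w2};
}

printf "P(am  | i) = %.3f\n", p_backoff("i", "am");
printf "P(sam | i) = %.3f\n", p_backoff("i", "sam");
my $sum = 0; $sum += p_backoff("i", $_) for keys %p1;
printf "sum over vocabulary = %.3f\n", $sum;               # should be 1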
Entropy
Perplexity-
If H is the entropy, then 2^H is the perplexity of
the system.
Entropy of a sequence-
H(L) = lim_{n -> ∞} (1/n) H(w_1, w_2, ..., w_n)
Cross-entropy-
Measure of the distance between two probability distributions.
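A small numeric sketch of these quantities, using two invented unigram
distributions p (taken as the reference) and q (an approximating model) over
the same four-word vocabulary: entropy H(p) = -Σ p(x) log2 p(x), perplexity
2^H(p), and cross-entropy H(p, q) = -Σ p(x) log2 q(x):
#!/usr/bin/perl
# Sketch: entropy, perplexity and cross-entropy of toy unigram
# distributions.  The distributions p and q are invented for illustration.
my %p = (the => 0.5,  cat => 0.25, sat => 0.125, mat => 0.125);
my %q = (the => 0.25, cat => 0.25, sat => 0.25,  mat => 0.25);

sub log2 { return log($_[0]) / log(2); }

my $H = 0;                                     # entropy H(p)
$H -= $p{$_} * log2($p{$_}) for keys %p;

my $Hpq = 0;                                   # cross-entropy H(p, q)
$Hpq -= $p{$_} * log2($q{$_}) for keys %p;

printf "H(p)       = %.3f bits\n", $H;         # 1.75
printf "perplexity = %.3f\n", 2 ** $H;         # 2^1.75
printf "H(p, q)    = %.3f bits\n", $Hpq;       # 2.0, never less than H(p)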