Natural Language Processing
Books:
- Jurafsky D. and Martin J. H., Speech and Language Processing, 2e,
Pearson Education
References:
- Allen J., Natural Language Understanding, 2e, Pearson Education
- Bharati A., Sangal R., Chaitanya V., Natural Language Processing: A
Paninian Perspective, PHI
- Siddiqui T., Tiwary U. S., Natural Language Processing and Information
Retrieval, OUP
1. Introduction
Language
- A protocol for representation of information/knowledge for
communication or storage.
- Comprises a set of elementary symbols, and rules for combining these
symbols.
- Forms- sound-based, action-based, or written (pictorial or
alphabet-based).
Natural Language is a language used by human beings in spoken form and,
optionally, in written form too. It may also be called human language.
The study of languages is called linguistics.
We shall use the term linguistic expression or simply, expression
to denote instances of use of a language to represent information.
From a linguistic perspective, use of a language comprises two broad tasks-
understanding and generation.
From a technical perspective, there are additional aspects, such as rendering,
encoding, storage, etc.
Using computers to deal with natural languages is gaining momentum. The use
of computers in linguistics is referred to as computational linguistics.
Natural Language Processing (NLP) broadly denotes the use of computers in
applications that require knowledge of language(s). NLP covers
computational linguistics, as well as techniques required for encoding,
rendering, and storage of linguistic expressions.
Some NLP applications are- spelling correction, grammar checking, information
retrieval, summarization, machine translation, etc.
Ambiguity
....
Language processing capability of a computer is closely related to the wider
subject of artificial intelligence. Alan Turing proposed the Turing
Test, a game in which a computer's use of language would form the basis
for determining whether it could think. If the machine wins, it is judged
intelligent.
Brief History-
1940s and 1950s- Work on automata and probabilistic (or information-theoretic)
models. The automata idea is related to the field of formal language
theory. The probabilistic models grew out of Shannon's concepts of the
noisy channel and decoding, and of entropy.
1960s- Two paradigms- symbolic and stochastic. The symbolic paradigm is based
on formal language theory and generative syntax.
1970-1983 - Four paradigms- stochastic, logic-based, natural language
understanding, discourse modelling.
1983-1993 - Empiricism and Finite State Models.
1994-1999 - Combination of different paradigms.
Different NLP applications require different levels of knowledge of the
language. These are-
- phonetics and phonology,
- morphology,
- syntax
- semantics
- pragmatics
- discourse
- world knowledge
Formal Models for NLP
- State machines- deterministic and non-deterministic FSA, finite-state
transducers, weighted automata, Markov models, HMM.
- Formal rule systems- regular grammars, context-free grammars,
feature-augmented grammars, and their probabilistic variants.
- Logic- first order logic (aka, predicate calculus), feature structures,
semantic networks, conceptual dependency.
- Probability theory- the above models can be augmented with probabilities,
mostly to resolve ambiguity issues. Probabilistic models are related to
machine learning models.
Text representation in computers, encoding schemes
Word Recognition
Given an alphabet (a set of elementary symbols used in the written form of a
language), not all combinations/sequences of those symbols are valid in
the language. Valid sequences/strings of letters are called words. In terms
of an alphabet, a word is a string of non-space letters from the
alphabet that carries a meaning in the language.
- Valid words can be specified, for the purpose of recognition, using
regular expressions (REs); see the sketch after this list.
- A regular expression is a formula in a special language for specifying
simple classes of strings. It is an algebraic notation for characterizing
a set of strings.
- Regular expressions can be used to specify search strings as well as to
define a language in a formal way.
- In the context of searching, a regular expression specifies a pattern to be
searched for in a corpus of texts.
- A Formal language is a set of strings, each string composed of
symbols from a finite set called an alphabet.
- A model which can both generate and recognise all and only the strings
of a formal language acts as a definition of the formal language.
- A regular expression is one way of characterizing a particular kind of
formal language called a regular language.
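As a small illustration (the word list and the pattern below are invented for
this sketch), a Perl regular expression can recognise all and only the strings
of the finite language { cat, cats, dog, dogs }:
#!/usr/bin/perl
# Sketch: a regular expression recognising the toy language
# { cat, cats, dog, dogs }.  The word list is invented for illustration.
foreach my $w ("cat", "cats", "dog", "dogs", "cta", "catss") {
    if ($w =~ /^(cat|dog)s?$/) {
        print "$w : accepted\n";
    } else {
        print "$w : rejected\n";
    }
}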
Finite State Automata
[Refer to books on FLA.]
- Formally, an FSA is a 5-tuple (Q, Σ, q0, F, δ(q, i)), where Q is a finite
set of states, Σ is a finite input alphabet, q0 is the start state, F ⊆ Q is
the set of final states, and δ(q, i) is the transition function.
- Any regular expression can be implemented as a finite state automaton.
Symmetrically, any finite state automaton can be described with a regular
expression.
- Given an FSA m, L(m) is the formal language characterised
by m.
- An FSA can be thought of as reading, from left to right, the cells of a
tape that contain symbols, making a state transition at each step based on
the current state and the tape symbol read. Certain states may be
designated as final states or accepting states.
- Formal languages are not the same as natural languages. But we often use
a formal language to model part of a natural language, such as parts of
phonology, morphology, or syntax. In linguistics, the term generative
grammar is sometimes used to mean a grammar of a formal language.
- The operation of an FSA can be represented in the form of a state
transition table (see the sketch after this list).
- An FSA can be one of two forms- Deterministic FSA (DFA) or
Non-deterministic FSA (NFA). Every NFA can be represented by an
equivalent DFA.
- In an NFA one may have to choose among several possible transitions. Three
standard solutions for this are- backup, look-ahead, parallelism.
- Recognition by an NFA is a kind of state-space search. The search is
performed either as depth-first search (LIFO), or breadth-first search
(FIFO).
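A minimal sketch of a DFA recogniser driven by a state transition table. The
automaton below, which accepts strings of the form b a a ... a !, is invented
purely for illustration:
#!/usr/bin/perl
# Sketch: simulate a DFA given as a state transition table.
# The automaton accepts strings matching /^baa+!$/ and is illustrative only.
my %delta = (
    '0 b' => 1,
    '1 a' => 2,
    '2 a' => 3,
    '3 a' => 3,
    '3 !' => 4,
);
my %final = (4 => 1);

sub recognise {
    my ($tape) = @_;
    my $q = 0;                              # start state q0
    foreach my $sym (split //, $tape) {     # read the tape left to right
        return 0 unless exists $delta{"$q $sym"};
        $q = $delta{"$q $sym"};             # make the transition
    }
    return exists $final{$q};               # accept iff we end in a final state
}

foreach my $s ("baa!", "baaaa!", "ba!", "abaa!") {
    print "$s : ", (recognise($s) ? "accepted" : "rejected"), "\n";
}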
Regular expressions in Perl- [Examples].
#!/usr/bin/perl
# Prompts for a file name and prints all those lines of the file that
# contain the vowels a, e, i, o, u in order.
print "Please give me a file name>>> ";
$filename = <STDIN>;
chomp $filename;
open(IN, $filename) or die "Cannot open $filename: $!";
while (<IN>) {
    if ($_ =~ m/a.*e.*i.*o.*u/) {
        print;
    }
}
close(IN);
Morphology
- Study of the structure of words.
In natural languages words are often formed by some modification of other
words or sub-words.
- Spelling rules: fly -> flies.
- Morphological rules: fish is both singular and plural, cats is the
plural of cat. (A small sketch illustrating such rules follows this list.)
Identifying the root components of a word and the attributes of the surface
form of the word is called morphological parsing.
- Stemming
- Morphemes- the minimal meaning-bearing units in a language.
- Affixes- prefix, suffix, infix, circumfix.
- Productive- applies to (almost) all elements of a class.
- Concatenative morphology.
- Non-concatenative morphology - infix, circumfix, templatic (or
root-and-pattern) morphology.
- Agglutinative languages- languages that tend to string affixes (morphemes)
together.
- Inflection.
- Derivation.
- Regular and irregular words.
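A toy sketch of the spelling and morphological rules mentioned above, in the
generation direction. The rule set and the small irregular list are invented
for this sketch and are not a full analysis of English:
#!/usr/bin/perl
# Toy generator applying a few English pluralisation rules.
# The rules and the irregular list are illustrative only.
my %irregular = (fish => "fish", child => "children", mouse => "mice");

sub pluralise {
    my ($noun) = @_;
    return $irregular{$noun} if exists $irregular{$noun};    # irregular forms
    if ($noun =~ /[^aeiou]y$/) {                             # fly -> flies
        (my $p = $noun) =~ s/y$/ies/;
        return $p;
    }
    return $noun . "es" if $noun =~ /(s|sh|ch|x|z)$/;        # fox -> foxes
    return $noun . "s";                                      # cat -> cats
}

foreach my $n ("cat", "fly", "fox", "fish", "child") {
    print "$n -> ", pluralise($n), "\n";
}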
Finite-state morphological parsing
Input- word in a surface form.
Output- stem of the word plus the morphological features.
Building a morphological parser requires-
- Lexicon.
- Morphotactics - the model of the morpheme ordering, usually in terms of
classes of morphemes rather than in terms of actual words.
- Orthographic rules- spelling rules.
Finite state automata can be used to perform morphological
recognition. But in NLP we often want to figure out attributes of a word
that its morphology implies. This exercise is morphological parsing.
A finite state transducer (FST) can be used for morphological parsing.
A transducer maps between one set of symbols and another. An FST does this
via a finite automaton.
Formal definition of an FST- [Based on the Mealy machine.] A 5-tuple
(Q, Σ, q0, F, δ(q, i:o)), where
Σ is a finite alphabet of complex symbols. Each complex symbol is an
input-output pair i:o, where i is a symbol from an input alphabet I and
o is a symbol from an output alphabet O.
δ is a relation from Q × Σ to Q.
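A minimal sketch of morphological parsing with a transducer. The tiny machine
below, which handles only the invented lexicon { cat, dog } with an s-plural,
simulates a non-deterministic FST whose arcs carry input:output pairs ("" is
used for epsilon):
#!/usr/bin/perl
# Sketch: non-deterministic simulation of a small FST for morphological
# parsing.  Each arc is [from-state, input, output, to-state]; the machine
# and its toy lexicon { cat, dog } are invented for illustration.
my @arcs = (
    [0,'c','c',1], [1,'a','a',2], [2,'t','t',3],      # stem c-a-t
    [0,'d','d',4], [4,'o','o',5], [5,'g','g',6],      # stem d-o-g
    [3,'','+N',7], [6,'','+N',7],                     # end of noun stem
    [7,'','+SG',9], [7,'s','+PL',9],                  # number morpheme
);
my %final = (9 => 1);

sub parse {
    my ($surface) = @_;
    my @results;
    my @agenda = ([0, 0, ""]);                        # (state, position, output)
    while (@agenda) {
        my ($q, $pos, $out) = @{ shift @agenda };
        push @results, $out
            if $final{$q} && $pos == length($surface);
        foreach my $arc (@arcs) {
            my ($from, $in, $o, $to) = @$arc;
            next unless $from == $q;
            if ($in eq "") {                          # epsilon-input arc
                push @agenda, [$to, $pos, $out . $o];
            } elsif ($pos < length($surface)
                     && substr($surface, $pos, 1) eq $in) {
                push @agenda, [$to, $pos + 1, $out . $o];
            }
        }
    }
    return @results;
}

foreach my $w ("cat", "cats", "dogs") {
    print "$w -> ", join(", ", parse($w)), "\n";
}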
Porter's Stemmer
Acquisition of Morphology
- Supervised Acquisition
- Un-supervised Acquisition
- Goldsmith's method - based on the concept of Minimum Description Length
(MDL).
- Gaussier's method - Occurrence of a pseudo-suffix pair with more
than one base.
- Method developed at Tezpur University for a highly inflectional
language.
N-grams
- Word Prediction - useful in speech recognition, handwriting recognition,
augmentative communication for the disabled, and spelling error
detection/correction.
- Guessing the next word is related to computing the probability of a
sequence of words. Knowing the probability of whole sentences or strings
of words is useful in part-of-speech tagging, word-sense disambiguation,
and probabilistic parsing.
- An n-gram model uses the previous n-1 words to predict the next one. It
is a kind of language model, or grammar.
Counting Words in a Corpus
- Spoken corpora and text corpora have certain distinct characteristics.
- Word forms- words as they appear in a text, possibly in an inflected
form. E.g., cat and cats are distinct word forms.
- Lemma - a set of lexical forms having the same stem, the same major
part-of-speech, and the same word-sense. E.g., cat and cats belong to the
same lemma- cat.
- Tokens - the number of words in a corpus/text, including repetitions.
- Types - the number of distinct words in a corpus/text.
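A small sketch of token and type counting over a text file named on the
command line. Lower-casing and the crude split on non-letters are simplifying
assumptions of this sketch:
#!/usr/bin/perl
# Sketch: count tokens (running words) and types (distinct word forms)
# in the file(s) given on the command line.
my %seen;
my $tokens = 0;
while (my $line = <>) {
    foreach my $w (split /[^a-z]+/, lc $line) {
        next if $w eq "";
        $tokens++;          # every running word is a token
        $seen{$w}++;        # remember the word form (type)
    }
}
my $types = scalar keys %seen;
print "Tokens: $tokens\nTypes:  $types\n";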
Unsmoothed n-grams
- Given the set of words in a language, not all arbitrary sequences of
n words are equally likely. Some sequences are outright invalid;
many are unlikely.
- The probability of a particular word occurring after a given sequence of
n-1 words is estimated from the actual number of occurrences of
the n-word sequence in a corpus.
- The probability of an n-word sequence P(w_1, w_2, ..., w_{n-1}, w_n),
written compactly as P(w_1^n), can be decomposed by the chain rule as
P(w_1^n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1^2) ... P(w_n | w_1^{n-1})
- The bigram model approximates the probability of a word given all the
previous words by the conditional probability given the preceding word
alone, i.e. P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-1}). This assumption is the
Markov assumption.
- In general, the N-gram approximation is
P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-N+1}^{n-1})
- N-gram models can be trained by counting and normalising. In other words,
the probability P(w_n | w_{n-N+1}^{n-1}) can be determined by counting the
number of occurrences of that N-gram in a corpus and dividing it by the
number of occurrences of all N-grams that have the same leading sequence
of N-1 words. That is,
P(w_n | w_{n-N+1}^{n-1}) = C(w_{n-N+1}^{n-1} w_n) / C(w_{n-N+1}^{n-1})
N-gram models are sensitive to the training corpus.
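A minimal sketch of counting and normalising for the bigram case (N = 2),
i.e. P(w2 | w1) = C(w1 w2) / C(w1). The tiny in-line corpus and the <s>
sentence marker are assumptions of this sketch:
#!/usr/bin/perl
# Sketch: unsmoothed (maximum likelihood) bigram model estimated by
# counting and normalising.  The corpus below is invented.
my @corpus = ("<s> i am sam",
              "<s> sam i am",
              "<s> i do not like green eggs and ham");

my (%unigram, %bigram);
foreach my $sent (@corpus) {
    my @w = split /\s+/, $sent;
    for my $i (0 .. $#w) {
        $unigram{$w[$i]}++;                           # C(w1)
        $bigram{"$w[$i] $w[$i+1]"}++ if $i < $#w;     # C(w1 w2)
    }
}

# P(w2 | w1) = C(w1 w2) / C(w1)
sub p_bigram {
    my ($w1, $w2) = @_;
    return 0 unless $unigram{$w1};
    return ($bigram{"$w1 $w2"} // 0) / $unigram{$w1};
}

printf "P(i | <s>)  = %.3f\n", p_bigram("<s>", "i");   # 2/3
printf "P(am | i)   = %.3f\n", p_bigram("i", "am");    # 2/3
printf "P(sam | am) = %.3f\n", p_bigram("am", "sam");  # 1/2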
Smoothing
Smoothing is required since the relative frequencies of N-grams not present
in the training corpus turn out to be zero even for linguistically
valid N-grams.
Add-One Smoothing-
- Add one to the counts of all N-grams, both those that have occurred in the
corpus and those that have not but are possible with the given vocabulary.
- To obtain the smoothed relative frequencies from the modified (one-added)
counts, normalise them by dividing each by the total number of tokens
having the same (N-1)-gram prefix, plus the size of the vocabulary.
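A short sketch of the add-one estimate for bigrams,
P(w2 | w1) = ( C(w1 w2) + 1 ) / ( C(w1) + V ),
where V is the vocabulary size. The toy counts and vocabulary size below are
invented for illustration:
#!/usr/bin/perl
# Sketch: add-one (Laplace) smoothed bigram probabilities from toy counts.
my %unigram = ("i" => 3, "am" => 2, "sam" => 2);
my %bigram  = ("i am" => 2, "am sam" => 1);
my $V = 8;                          # assumed vocabulary size

sub p_addone {
    my ($w1, $w2) = @_;
    my $c_bigram  = $bigram{"$w1 $w2"} // 0;
    my $c_unigram = $unigram{$w1}      // 0;
    return ($c_bigram + 1) / ($c_unigram + $V);
}

printf "P(am | i)    = %.3f\n", p_addone("i", "am");     # (2+1)/(3+8)
printf "P(ham | sam) = %.3f\n", p_addone("sam", "ham");  # unseen: (0+1)/(2+8)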
Witten-Bell Discounting-
- Based on the intuition that the probability of seeing a zero-frequency
N-gram can be modelled by the probability of seeing an N-gram for the
first time.
- Probability of seeing an N-gram for the first time can be estimated by
counting the number of times an N-gram is seen for the first time. This
is simple since it is exactly the number of distinct N-grams that occur
in the training corpus.
- Total probability mass assigned to all zero-frequency N-grams:
Σ_{i: c_i = 0} p_i* = T / (N + T)
where T is the number of distinct N-gram types observed and N is the total
number of N-gram tokens in the training corpus.
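A toy unigram-level sketch of this discounting scheme: seen items keep
c_i / (N + T), and the mass T / (N + T) is spread evenly over the Z unseen
items. The counts and the list of unseen items are invented:
#!/usr/bin/perl
# Sketch: Witten-Bell style discounting over a toy unigram distribution.
# T = observed types, N = observed tokens, Z = unseen types.
my %count  = (the => 4, cat => 2, sat => 1);           # observed counts
my @unseen = ("dog", "mat");                           # items never observed

my $N = 0; $N += $_ for values %count;                 # total tokens
my $T = scalar keys %count;                            # observed types
my $Z = scalar @unseen;                                # unseen types

my %p;
$p{$_} = $count{$_} / ($N + $T) for keys %count;       # discounted seen items
$p{$_} = $T / ($Z * ($N + $T))  for @unseen;           # share of unseen mass

printf "%-4s %.4f\n", $_, $p{$_} for sort keys %p;
my $total = 0; $total += $_ for values %p;
printf "sum  %.4f\n", $total;                          # should be 1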
Good-Turing Discounting-
- Re-estimate the amount of probability mass to assign to N-grams with
zero or low counts by looking at the number of N-grams with higher counts.
With N_c = Σ_{b: c(b) = c} 1, i.e. the number of N-grams that occur exactly
c times, the re-estimated (discounted) count is c* = (c + 1) N_{c+1} / N_c.
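A small numeric sketch of the re-estimated counts, using an invented
count-of-counts table N_c:
#!/usr/bin/perl
# Sketch: Good-Turing re-estimated counts  c* = (c+1) * N_{c+1} / N_c,
# where N_c is the number of N-gram types seen exactly c times.
# The count-of-counts table below is invented for illustration.
my %Nc = (1 => 10, 2 => 5, 3 => 3, 4 => 2, 5 => 1);

sub gt_count {
    my ($c) = @_;
    return undef unless $Nc{$c} && $Nc{$c + 1};   # needs both N_c and N_{c+1}
    return ($c + 1) * $Nc{$c + 1} / $Nc{$c};
}

foreach my $c (1 .. 4) {
    printf "c = %d  c* = %.3f\n", $c, gt_count($c);
}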
Backoff-
- Build an N-gram model based on an (N-1)-gram model.
Backoff can be combined with discounting- use discounting to compute the
probability mass available for unseen events (N-grams); then use backoff to
distribute this mass among the unseen N-grams according to their
(N-1)-gram probabilities.
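A schematic sketch of discounting plus backoff in the bigram case, roughly in
the style of Katz backoff: if a bigram was seen, use its discounted
probability P*; otherwise back off to the unigram probability, scaled by a
weight alpha(w1) chosen so that the probabilities for that context still sum
to one. The discounted bigram probabilities and the unigram probabilities
below are simply assumed to be given (e.g. from Good-Turing or Witten-Bell)
and are invented numbers:
#!/usr/bin/perl
# Sketch: Katz-style backoff for bigrams over a toy 4-word vocabulary.
my %pstar = (                       # discounted P*(w2 | w1) for seen bigrams
    "i am" => 0.55,
    "i do" => 0.25,
);
my %p1 = (am => 0.2, do => 0.1, sam => 0.3, ham => 0.4);   # unigram P(w2)

# alpha(w1): left-over mass for context w1, divided by the unigram mass of
# the words NOT seen after w1, so that the distribution sums to one.
sub alpha {
    my ($w1) = @_;
    my ($seen_pstar, $seen_p1) = (0, 0);
    foreach my $w2 (keys %p1) {
        if (exists $pstar{"$w1 $w2"}) {
            $seen_pstar += $pstar{"$w1 $w2"};
            $seen_p1    += $p1{$w2};
        }
    }
    return (1 - $seen_pstar) / (1 - $seen_p1);
}

sub p_backoff {
    my ($w1, $w2) = @_;
    return $pstar{"$w1 $w2"} if exists $pstar{"$w1 $w2"};
    return alpha($w1) * $p1{$w2};
}

printf "P(am  | i) = %.3f\n", p_backoff("i", "am");
printf "P(sam | i) = %.3f\n", p_backoff("i", "sam");
my $sum = 0; $sum += p_backoff("i", $_) for keys %p1;
printf "sum over vocabulary = %.3f\n", $sum;               # should be 1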
Entropy
Perplexity-
If H is the entropy, then 2^H is the perplexity of
the system.
Entropy of a sequence-
H(L) = lim_{n -> ∞} (1/n) H(w_1, w_2, ..., w_n)
Cross-entropy-
Measure of the distance between two probability distributions.
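A small numeric sketch of these quantities, using two invented unigram
distributions p (taken as the reference) and q (an approximating model) over
the same four-word vocabulary: entropy H(p) = -Σ p(x) log2 p(x), perplexity
2^H(p), and cross-entropy H(p, q) = -Σ p(x) log2 q(x):
#!/usr/bin/perl
# Sketch: entropy, perplexity and cross-entropy of toy unigram
# distributions.  The distributions p and q are invented for illustration.
my %p = (the => 0.5,  cat => 0.25, sat => 0.125, mat => 0.125);
my %q = (the => 0.25, cat => 0.25, sat => 0.25,  mat => 0.25);

sub log2 { return log($_[0]) / log(2); }

my $H = 0;                                     # entropy H(p)
$H -= $p{$_} * log2($p{$_}) for keys %p;

my $Hpq = 0;                                   # cross-entropy H(p, q)
$Hpq -= $p{$_} * log2($q{$_}) for keys %p;

printf "H(p)       = %.3f bits\n", $H;         # 1.75
printf "perplexity = %.3f\n", 2 ** $H;         # 2^1.75
printf "H(p, q)    = %.3f bits\n", $Hpq;       # 2.0, never less than H(p)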