Natural Language Processing

Books:
  1. Jurafsky D. and Martin J. H., Speech and Language Processing, 2e, Pearson Education

References:
  2. Allen J., Natural Language Understanding, 2e, Pearson Education
  3. Bharati A., Sangal R., and Chaitanya V., Natural Language Processing: A Paninian Perspective, PHI
  4. Siddiqui T. and Tiwary U. S., Natural Language Processing and Information Retrieval, OUP

1. Introduction

Language

A natural language is a language used by human beings in spoken form and, optionally, in written form too. It may also be called human language.

The study of languages is called linguistics.

We shall use the term linguistic expression or simply expression to denote instances of the use of a language to represent information.

From a linguistic perspective, use of a language comprises two broad tasks- understanding and generation.
From a technical perspective, there are additional aspects, such as rendering, encoding, storage, etc.

Using computers to deal with natural languages is gaining momentum. The use of computers in linguistics is referred to as computational linguistics. Natural Language Processing (NLP) broadly denotes the use of computers in applications that require knowledge of language(s). NLP covers computational linguistics, as well as the techniques required for encoding, rendering, and storage of linguistic expressions. Some NLP applications are- spelling correction, grammar checking, information retrieval, summarization, machine translation, etc.

Ambiguity
....

The language processing capability of a computer is closely related to the wider subject of artificial intelligence. Alan Turing proposed the Turing Test, a game in which a computer's use of language would form the basis of determining whether it could think. If the machine wins, it is judged intelligent.

Brief History-
1940s and 1950s- Work on automata and probabilistic (or information-theoretic) models. The automaton work is related to the field of formal language theory. The probabilistic models started from Shannon's concepts of the noisy channel, decoding, and entropy.
1960s- Two paradigms- symbolic and stochastic. The symbolic paradigm is based on the formal language theory and generative syntax.
1970-1983 - Four paradigms- stochastic, logic-based, natural language understanding, discourse modelling.
1983-1993 - Empiricism and Finite State Models.
1994-1999 - Combination of different paradigms.

Different NLP applications require different levels of knowledge of the language. These are- phonetics and phonology, morphology, syntax, semantics, pragmatics, and discourse.

Formal Models for NLP

Text representation in computers, encoding schemes

Word Recognition

Given an alphabet (a set of elementary symbols used in the written form of a language), not all combinations/sequences of its symbols are valid in the language. The valid sequences are the words: in terms of an alphabet, a word is a string of non-space letters from the alphabet that carries a meaning in the language.

Finite State Automata

[Refer to books on FLA.]

Regular expressions in Perl- [Examples].
    #!/usr/bin/perl
    # Prompts for a file name and prints all those lines that contain
    # the vowels a, e, i, o, u in order.

    print "Please give me a file name>>> ";
    $filename = <STDIN>;        # read the file name from standard input
    chomp $filename;
    open(IN, '<', $filename) or die "Cannot open $filename: $!";

    while (<IN>) {
        if ($_ =~ m/a.*e.*i.*o.*u/) {    # the five vowels, in order
            print;
        }
    }
    close(IN);
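
For instance, a line containing the word facetious would be printed, since the vowels a, e, i, o, u appear in it in exactly that order.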

Morphology

- Study of the structure of words.

In natural languages words are often formed by some modification of other words or sub-words.

Identifying the root components of a word and the attributes of its surface form is called morphological parsing. For example, the surface form cats parses to the stem cat plus the features +N (noun) and +PL (plural).

Finite-state morphological parsing
Input- a word in its surface form.
Output- the stem of the word plus its morphological features.

To build a morphological parser requires-

  1. Lexicon- the list of stems and affixes, together with basic information about them.
  2. Morphotactics- the model of morpheme ordering, usually stated in terms of classes of morphemes rather than in terms of actual words.
  3. Orthographic rules- spelling rules that model the changes that occur when morphemes combine (e.g., city + -s becomes cities).

Finite state automata can be used to perform morphological recognition. But in NLP we often want to figure out the attributes of a word that its morphology implies. This exercise is morphological parsing.

A finite state transducer (FST) can be used for morphological parsing. A transducer maps between one set of symbols and another. An FST does this via a finite automaton.

Formal definition of FST- [Based on the Mealy machine.] A 5-tuple:

(Q, Σ, q0, F, δ)

Q is a finite set of states;
Σ is a finite alphabet of complex symbols; each complex symbol is an input-output pair i:o, where i is a symbol from an input alphabet I and o is a symbol from an output alphabet O;
q0 ∈ Q is the start state;
F ⊆ Q is the set of final states;
δ is a relation from Q × Σ to Q; δ(q, i:o) gives the set of possible next states when the machine is in state q and reads the complex symbol i:o.
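
As a concrete illustration, here is a minimal sketch of a Mealy-style transducer in Perl. The two-word lexicon, the state numbering, and the feature labels (+N, +SG, +PL) are hypothetical, chosen only to show the mechanics:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Toy FST that transduces the surface forms cat, cats, dog, dogs
    # into lexical forms such as cat+N+PL.  Each transition maps
    # (state, input char) to [next state, output string].
    my %delta = (
        '0,c' => [1, 'c'], '1,a' => [2, 'a'], '2,t' => [3, 't'],
        '0,d' => [5, 'd'], '5,o' => [6, 'o'], '6,g' => [3, 'g'],
        '3,s' => [4, ''],            # plural 's' itself outputs nothing
    );
    my %final = (3 => '+N+SG', 4 => '+N+PL');  # feature emitted on acceptance

    sub transduce {
        my ($surface) = @_;
        my ($state, $out) = (0, '');
        for my $ch (split //, $surface) {
            my $move = $delta{"$state,$ch"} or return undef;   # reject
            ($state, $out) = ($move->[0], $out . $move->[1]);
        }
        return defined $final{$state} ? $out . $final{$state} : undef;
    }

    print transduce('cats'), "\n";   # prints: cat+N+PL
    print transduce('dog'), "\n";    # prints: dog+N+SG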

Porter's Stemmer
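
Porter's stemmer is a rule-based, suffix-stripping stemming algorithm for English; it applies an ordered series of rewrite rules in several steps. As a sketch, step 1a of the algorithm rendered as Perl substitutions (only the first matching rule fires):

    # Step 1a of Porter's algorithm: plural-related suffixes.
    sub porter_step_1a {
        my ($word) = @_;
        for ($word) {                  # alias $_ to $word
            s/sses$/ss/ and last;      # caresses -> caress
            s/ies$/i/   and last;      # ponies   -> poni
            /ss$/       and last;      # caress   -> caress (unchanged)
            s/s$//      and last;      # cats     -> cat
        }
        return $word;
    }

    print porter_step_1a('ponies'), "\n";   # prints: poni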

Acquisition of Morphology

N-grams

Counting Words in a Corpus

Unsmoothed N-grams

N-gram models are sensitive to the training corpus.
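
For a bigram model, the unsmoothed maximum-likelihood estimate is P(wn | wn-1) = C(wn-1 wn) / C(wn-1). A minimal sketch in Perl, using a hypothetical toy corpus in place of real data:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Count unigrams and bigrams in a toy corpus and estimate
    # P(w2|w1) = C(w1 w2) / C(w1) with no smoothing.
    my @corpus = qw(the cat sat on the mat the cat ran);
    my (%unigram, %bigram);
    for my $i (0 .. $#corpus) {
        $unigram{ $corpus[$i] }++;
        $bigram{ "$corpus[$i] $corpus[$i+1]" }++ if $i < $#corpus;
    }

    sub p_bigram {
        my ($w1, $w2) = @_;
        return 0 unless $unigram{$w1};
        return ($bigram{"$w1 $w2"} // 0) / $unigram{$w1};
    }

    printf "P(cat|the) = %.3f\n", p_bigram('the', 'cat');   # 2/3 = 0.667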

Smoothing

Smoothing is required because the relative frequency of any N-gram not present in the training corpus turns out to be zero, even for linguistically valid N-grams.

Add-One Smoothing-
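
For bigrams, add-one (Laplace) smoothing adds one to every count: P(wn | wn-1) = (C(wn-1 wn) + 1) / (C(wn-1) + V), where V is the vocabulary size. A minimal sketch, reusing %unigram and %bigram from the counting example above:

    # Add-one smoothed bigram probability.  Every possible bigram gets
    # a pseudo-count of 1, so unseen bigrams no longer have zero mass.
    sub p_add_one {
        my ($w1, $w2) = @_;
        my $V = scalar keys %unigram;   # vocabulary size
        return (($bigram{"$w1 $w2"} // 0) + 1) / (($unigram{$w1} // 0) + $V);
    }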

Witten-Bell Discounting-

Good-Turing Discounting-
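
The Good-Turing re-estimate replaces a raw count c by c* = (c + 1) N_{c+1} / N_c, where N_c is the number of distinct N-grams that occur exactly c times in the training corpus; the total probability mass assigned to unseen N-grams is then N_1 / N.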

Backoff-
- If an N-gram has not been seen, estimate its probability from the corresponding (N-1)-gram; i.e., the N-gram model is built on top of an (N-1)-gram model.

Backoff can be combined with discounting- use discounting to compute the probability mass reserved for unseen events (N-grams); then use backoff to distribute this mass among the different unseen N-grams, based on their (N-1)-gram probabilities.
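
A minimal sketch of the deliberately simplified "stupid backoff" variant (not Katz backoff; the weight 0.4 is a conventional but essentially arbitrary constant), again reusing %unigram and %bigram from the counting example:

    # Simplified backoff: use the bigram estimate when the bigram has
    # been seen; otherwise back off to a weighted unigram estimate.
    sub p_backoff {
        my ($w1, $w2) = @_;
        my $N = 0;
        $N += $_ for values %unigram;      # total token count
        return $bigram{"$w1 $w2"} / $unigram{$w1}
            if ($bigram{"$w1 $w2"} // 0) > 0;
        my $alpha = 0.4;                   # hypothetical backoff weight
        return $alpha * (($unigram{$w2} // 0) / $N);
    }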

Entropy
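
For a random variable X with probability mass function p, entropy is defined as H(X) = -Σ_x p(x) log2 p(x), measured in bits when the logarithm is taken to base 2.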

Perplexity-
If H is the entropy, then 2^H is the perplexity of the system. For example, a model with entropy H = 2 bits has perplexity 2^2 = 4; on average, it is as uncertain as a uniform choice among four alternatives.

Entropy of a sequence-
H(L) = lim_{n→∞} (1/n) H(w1, w2, ..., wn)

Cross-entropy-
Measure of the distance between two probability distributions.
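
For a model m of data actually drawn from a distribution p, the cross entropy can be computed (for a stationary ergodic source) as H(p, m) = lim_{n→∞} -(1/n) log m(w1 w2 ... wn). Since H(p) ≤ H(p, m), a lower cross entropy indicates a better model.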