`DOI 10.3233/FI-2014-1023
`IOS Press
`
`425
`
`Natural Language Understanding and Prediction:
`from Formal Grammars to Large Scale Machine Learning
`
`Nicolae Duta
`New England Research and Development Center
`Microsoft, Cambridge, USA
`niduta@microsoft.com
`
`Abstract. Scientists have long dreamed of creating machines humans could interact with by voice.
Although Turing’s prophecy that machines will be able to converse like humans is no longer expected
to come true in the near future, real progress has been made in voice- and text-based human-machine
interaction. This paper is a light introduction to and survey of some deployed natural language systems
`and technologies and their historical evolution. We review two fundamental problems involving
`natural language: the language prediction problem and the language understanding problem. While
`describing in detail all these technologies is beyond our scope, we do comment on some aspects
`less discussed in the literature such as language prediction using huge models and semantic labeling
`using Marcus contextual grammars.
`
`Keywords: Natural language understanding, language modeling, language prediction
`
`1.
`
`Introduction
`
Scientists have long dreamed of creating machines humans could interact with by voice. In his most cited
paper, Computing machinery and intelligence, published in 1950, Turing predicted that “at the end of the
century the use of words and general educated opinion will have altered so much that one will be able
to speak of machines thinking without expecting to be contradicted” [46]. “Thinking machines” involve
multiple capabilities: recognizing the words which are said, understanding their meaning and being able
to produce a meaningful reaction (e.g., answer a question, which may imply reasoning in addition to
simply querying a fact database, perform an action/transaction, etc.).
Although, after several decades of research, Turing’s prophecy that machines will be able to converse
like humans is no longer expected to come true in the near future, real progress has been made in voice- and
`
`Address for correspondence: Microsoft NERD Center, One Memorial Drive, Cambridge, MA, 02142, USA
`
`
`text-based human-machine interaction. From a theoretical viewpoint, modern computational linguistics
`started in the late 1950s when Noam Chomsky introduced the theory of generative grammars which
`aimed at producing a set of rules that correctly predict which combinations of words form grammatical
`sentences [15]. The first practical attempts at natural language understanding by a computer were at
`MIT and Stanford in the 1960s: Daniel Bobrow’s STUDENT system which used natural language input
`to solve algebra word problems and Joseph Weizenbaum’s ELIZA system that could carry a dialog on
`many topics (although it did not have a real understanding of the language)1.
However, the early systems used only written text input; it would take two more decades of re-
search until Automatic Speech Recognition (ASR) could allow for voice input. Throughout the 1990s and 2000s,
the Defense Advanced Research Projects Agency (DARPA) in the United States conducted several pro-
grams to advance the state of the art in ASR, spoken dialog and information extraction from automatically
recognized text (ATIS 1990-1994, Hub4 1995-1999, Communicator 1999-2002, EARS 2002-2005, GALE
2005-2010) [52]. These scientific advances have also been sustained by the introduction and exponential
growth of the World Wide Web and by the huge increase in computing power and miniaturization that led
to today’s proliferation of smartphones.
`This paper is a light introduction and survey of some of the deployed natural language systems
`and technologies and their historical evolution. We review two fundamental problems involving natural
`language: the language prediction problem and the language understanding problem. While describing
`in detail all these technologies is beyond our scope, we do comment on some aspects less discussed in the
`literature such as language prediction using huge models and semantic labeling using Marcus contextual
`grammars.
`
`2. Natural Language Prediction
`
Language prediction is defined as the ability to predict which words naturally follow a given word se-
quence. It is generally assumed that natural languages are governed by a probability distribution on word
sequences, and language prediction models (the field is actually called Statistical Language Modeling) try
to derive a good estimate of this distribution [4].
Language modeling/prediction started as part of the Automatic Speech Recognition research
effort and is now extensively used in most systems which convert some form of signal into text using a
Bayesian approach:
`
• Automatic Speech Recognition (ASR) for acoustic-to-text mappings [25]
`
`• Optical Character Recognition (OCR) and Handwriting recognition which map document images
`into text [32][38]
`
`• Automated Machine Translation (AMT) which maps text written in one language into text written
`in a different language [28]
`
`• Spelling correction systems which map incorrectly spelled text into the correct form [6]
`
• Word completion and prediction systems, which predict the following letters in a word or the following
words in an SMS/email message, considering context and previous user behavior [9]
`
1 A detailed historical perspective can be found in [45].
`
`
`
`
These systems all try to find the text sentence which maximizes the posterior probability
P(Sentence|Signal), which according to Bayes’ rule can be written as

P(Sentence|Signal) = P(Signal|Sentence) × P(Sentence) / P(Signal)    (1)

P(Signal|Sentence) is the underlying signal model based on acoustic, visual, translational, etc. cues,
while P(Sentence) describes the likelihood of a given sentence in a language.
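To make the decision rule concrete, here is a toy Python sketch of the arg max in Eq. (1) over a handful of candidate sentences; the candidates and their probabilities are invented for illustration and do not come from any real recognizer.

# Toy illustration of the noisy-channel decision rule in Eq. (1):
# pick the sentence maximizing P(Signal|Sentence) * P(Sentence).
# All candidate sentences and probabilities below are made-up values.
candidates = {
    "buy movie kings speech": {"p_signal_given_sentence": 0.20, "p_sentence": 1e-9},
    "by movie uh kings speech": {"p_signal_given_sentence": 0.25, "p_sentence": 1e-12},
    "buy movie king's speech": {"p_signal_given_sentence": 0.18, "p_sentence": 5e-9},
}

def decode(candidates):
    # P(Signal) is constant across candidates, so it can be ignored in the arg max.
    return max(candidates,
               key=lambda s: candidates[s]["p_signal_given_sentence"] * candidates[s]["p_sentence"])

print(decode(candidates))  # -> "buy movie king's speech"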
Since a natural language like English has a lexicon on the order of 10^6 words, it is not possible
to directly estimate P(Sentence) for all sentences. Early ASR systems restricted their language
models to the set of sentences appearing in a training set. The drawback was that the system could only
output one of the sentences it had seen in training, no matter what the user said. Though the system had
the ability to reject an input (not produce a text output if P(Sentence|Signal) was too low), that was
not very helpful in a practical system that had to deal with unconstrained speech.
Several techniques have been proposed for estimating P(S) for every sentence S = {w1, w2, ..., wm}
(wi are the sentence words in the order they are spoken) [4][40]; currently the most widely used models
are based on decomposing P(w1, w2, ..., wm) into a product of conditional probabilities

P(w1, w2, ..., wm) = P(wm|wm−1, ..., w1) × P(wm−1|wm−2, ..., w1) × ... × P(w1).    (2)
`
`Since it was still impractical to directly estimate P (w1, w2, ..., wm) for long word histories, one
`assumed that words far away in the history of a target word do not have a large influence. That is, the
`word sequence w1, w2, ..., wm behaves like a Markov chain of some order n. Therefore one only needs
`to estimate the statistical distributions of n consecutive word sequences called n-grams. According to
`these models, the probability of a sentence can be decomposed into a product of conditional n-gram
`probabilities. Although counterintuitive, n-gram models take no advantage of the syntactic or semantic
`structure of the sentences they model.
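As a small illustration of the trigram case, the following Python sketch estimates n-gram probabilities by relative frequency and multiplies them along a sentence as in Eq. (2); the tiny corpus is invented and a real system would of course train on far more data.

from collections import defaultdict

# Illustrative corpus; <s> and </s> mark sentence boundaries.
corpus = [
    "<s> <s> i need a flight to boston </s>".split(),
    "<s> <s> i need a flight to miami </s>".split(),
    "<s> <s> show me a flight to boston </s>".split(),
]

trigram_counts = defaultdict(int)
bigram_counts = defaultdict(int)
for sent in corpus:
    for i in range(2, len(sent)):
        trigram_counts[(sent[i-2], sent[i-1], sent[i])] += 1
        bigram_counts[(sent[i-2], sent[i-1])] += 1

def p_ml(w, w1, w2):
    """Maximum-likelihood trigram estimate of P(w | w2 w1) by relative frequency."""
    if bigram_counts[(w2, w1)] == 0:
        return 0.0
    return trigram_counts[(w2, w1, w)] / bigram_counts[(w2, w1)]

def sentence_probability(words):
    """Probability of a sentence under the trigram decomposition of Eq. (2)."""
    words = ["<s>", "<s>"] + words + ["</s>"]
    p = 1.0
    for i in range(2, len(words)):
        p *= p_ml(words[i], words[i-1], words[i-2])
    return p

print(sentence_probability("i need a flight to boston".split()))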
However, if we are using the Maximum Likelihood (ML) estimate

P_ML(wn|wn−1, ..., w1) = Count(wn, ..., w1) / Count(wn−1, ..., w1)    (3)
`
we are facing the issue of assigning a null probability to those n-grams not seen in the training data. Even
with a training corpus in excess of a few billion words (that is about the size of all newspaper text pub-
lished in the US in the 1990s) there are still 10-20% of valid 3-grams which have not been seen before (last
row of Table 1). To properly handle them, one applies a technique called interpolated discounting (also
called smoothing): set aside a part of the probability mass to account for unseen events and recursively
interpolate longer history probabilities with shorter history probabilities:
`
P(wi|wi−1, wi−2) = P_ML(wi|wi−1, wi−2) × α(wi, wi−1, wi−2) + β(wi−1, wi−2) × P(wi|wi−1)    (4)

where α and β are called smoothing functions and model the amount of the probability mass that is left
aside for unseen n-grams. To maintain a probability model we need it to sum to 1 over wi:

β(wi−1, wi−2) = 1 − Σ_{wi} P_ML(wi|wi−1, wi−2) × α(wi, wi−1, wi−2)    (5)
`
`
`
`428
`
`N. Duta / Natural Language Understanding and Prediction: from Formal Grammars to Large Scale...
`
`A large body of language modeling research in the 1990s has focused on finding suitable values for
`α and β. Two popular choices are called Witten-Bell [51] and Kneser-Ney [27] discounting:
`
`(6)
`
`(7)
`
`(8)
`
`W itten − Bell discounting
`
`Kneser − N ey discounting
`
`Count(.|wi−1, wi−2)
`U niq(.|wi−1, wi−2) + Count(.|wi−1, wi−2)
`
`1 −
`
`D(Count(wi|wi−1, wi−2))
`Count(wi|wi−1, wi−2)
`
`D(Count(wi|wi−1, wi−2))
`
`Pw
`
`i
`
`Count(wi|wi−1, wi−2)
`
`U niq(.|wi−1, wi−2)
`U niq(.|wi−1, wi−2) + Count(.|wi−1, wi−2)
`
`α =
`
`β =
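The following toy Python sketch shows how the interpolation of Eq. (4) can be carried out with Witten-Bell style smoothing functions (Eqs. (6)-(7)); the counts, the closed vocabulary and the uniform back-off at the unigram level are all simplifying assumptions made for illustration.

# Toy Witten-Bell interpolation for a trigram model, in the spirit of Eqs. (4), (6), (7).
# The counts are made up; a real LM would gather them from a large corpus.
trigrams = {("a", "flight", "to"): 3, ("a", "flight", "from"): 1}
bigrams  = {("flight", "to"): 4, ("flight", "from"): 1}
unigrams = {"to": 5, "from": 2, "flight": 5, "a": 5, "boston": 3}
V = len(unigrams)  # assumed closed vocabulary

def witten_bell(followers):
    """alpha, beta for one history: alpha = C/(U+C), beta = U/(U+C), where C is the
    total count of the history and U the number of distinct words seen after it."""
    C = sum(followers.values())
    U = len(followers)
    return C / (U + C), U / (U + C)

def p_interp(w, w1, w2):
    """Interpolated P(w | w2 w1), recursing from trigram to bigram to unigram."""
    # Unigram level: interpolate the ML unigram with a uniform distribution.
    p_uni = 0.5 * unigrams.get(w, 0) / sum(unigrams.values()) + 0.5 / V
    # Bigram level.
    followers = {k[1]: c for k, c in bigrams.items() if k[0] == w1}
    p_big = p_uni
    if followers:
        a, b = witten_bell(followers)
        p_big = a * followers.get(w, 0) / sum(followers.values()) + b * p_uni
    # Trigram level.
    followers = {k[2]: c for k, c in trigrams.items() if k[:2] == (w2, w1)}
    if followers:
        a, b = witten_bell(followers)
        return a * followers.get(w, 0) / sum(followers.values()) + b * p_big
    return p_big

print(p_interp("to", "flight", "a"))      # seen trigram: most mass from the trigram level
print(p_interp("boston", "flight", "a"))  # unseen trigram: still gets non-zero probability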
`
`Starting in the early 2000s, the proliferation of documents posted on the internet generated a poten-
`tially huge LM training set. However, internet scraped text could not be directly used for LM training
`since: (i) Almost all of it was out of domain for the systems built at the time2. (ii) The computational
`resources (memory and computing speed) were not sufficient to accommodate the huge number of result-
`ing n-grams (a 5 billion-word newspaper corpus generates about 0.8B unique 3-grams and 1.5B unique
`4-grams).
`
Multiple directions of research started to address these issues. One of them was LM pruning: some
n-grams considered not too informative were discarded (though only after being used to compute global
statistics of the data). The simplest pruning technique is to discard the least frequent, higher-order n-
grams, which one may assume are not statistically significant. A more sophisticated technique is entropy
pruning, which considers the relative entropy between the original and the pruned model [44]. However,
there appears to be a complex interaction between the pruning method/parameters and the type of dis-
counting used in training the model, which can impact the speech recognition accuracy by as much as
10% [12].
A second research direction was to redesign the LM estimation toolkits and speech recognition
pipelines to accommodate all n-grams seen in the data. It is important to know the difference between
n-grams that are unobserved because they are rare and those that are impossible3. As shown in Ta-
ble 1, keeping one billion 3- and 4-grams in the LM reduces the Word Error Rate (WER) by about 6% in the
Broadcast News recognition domain [20]. Although received with skepticism in academia [11], this
direction (along with a distributed data processing framework like Map-Reduce [16]) largely contributed
to the recent success of the Google ASR system [13].
`
For many speech recognition applications (e.g. conversational speech) sufficient in-domain language
data has not always been available, and a solution was found in the use of additional out-of-domain data
(especially internet-scraped text). Unfortunately, a simple mix of two corpora different in nature does not
usually result in a better LM, and a successful mixing strategy is often regarded as an art. Therefore, a
third area of research has focused on combining in-domain with out-of-domain data or even bootstrap-
ping an in-domain LM using only out-of-domain data [8][19].
While this is still an active research area, we would like to point out two interesting phenomena.

2 n-gram models are very sensitive to changes in the style, topic or genre of the text on which they are trained (called in-domain
data) [40].
3 An analysis of the text currently used in SMS messages and Twitter postings shows that almost everything is now possible due
to word misspelling, abbreviation and lack of syntactic structure.
`
`
`
`
Table 1. The effects of LM pruning on the English broadcast news task [20]

LM Order    LM size [4-grams, 3-grams]    Hit Rates [4-grams, 3-grams]    WER
3           [0, 36M]                      [0, 76%]                        12.6%
3           [0, 305M]                     [0, 84%]                        12.1%
4           [40M, 36M]                    [49%, 76%]                      12.1%
4           [710M, 305M]                  [61%, 84%]                      11.8%
`
The first phenomenon is that the colloquial forms of some languages like Arabic and their literary
counterparts (e.g., the Modern Standard Arabic (MSA) used in newspaper articles and TV broadcasts),
although sharing the same word lexicon, have very few of the higher order n-grams in common (see
Table 2). That means that published texts and TV transcripts are not effective for training a conversational
LM [26].
`
Table 2. Vocabulary coverage and 3-gram hit rates for LMs based on the Arabic Conversational (150K words),
Broadcast News (300M words) and Conversational + BN data

LM training data         Vocabulary coverage    3-gram Hit Rate
Conversational alone     90.6%                  20%
Broadcast News           89.5%                  4%
Conversational + News    96.6%                  21%
`
`The second phenomenon is that even though there may still be a significant accuracy gap between
`speech recognition using a fully in-domain LM and that using a bootstrapped LM, the semantics of the
`recognized sentence may be far less impacted. That is, one can still figure out the semantic intent of a
`sentence even when some of the words are misrecognized [19][47].
`
Finally, we would like to mention the latest trends in Language Modeling. Discriminative language
models (DLMs) [14] aim at directly optimizing word error rate by rewarding features that appear in
low-error hypotheses and penalizing features in misrecognized hypotheses. Since the estimation of dis-
criminative LMs is computationally more intensive than that of regular n-gram LMs, one has to use distributed
learning algorithms and supporting parallel computing infrastructure [16]. Neural network language
models (NNLMs) embed words in a continuous space in which probability estimation is performed using neural
networks (feed-forward or recurrent; very recent work is based on multiple hidden layer networks called
deep networks [2]). The expectation is that, with proper training of the word embedding, words that
are semantically or grammatically related will be mapped to similar locations in the continuous space.
Because the probability estimates are smooth functions of the continuous word representations, a small
change in the features results in a small change in the probability estimation, and NNLMs may achieve
better generalization for unseen n-grams.
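To give a feel for the NNLM computation, here is a toy forward pass in Python/numpy: two context words are embedded, concatenated, passed through a tanh hidden layer and a softmax over the vocabulary. The weights are random, so this only illustrates the shape of the computation, not a trained model.

import numpy as np

rng = np.random.default_rng(0)

vocab = ["<s>", "i", "need", "a", "flight", "to", "boston", "</s>"]
word_to_id = {w: i for i, w in enumerate(vocab)}
V, d, h, context = len(vocab), 16, 32, 2  # vocab size, embedding dim, hidden dim, history length

# Randomly initialized parameters (a real NNLM would learn these by backpropagation).
E = rng.normal(scale=0.1, size=(V, d))             # word embeddings
W1 = rng.normal(scale=0.1, size=(context * d, h))  # input-to-hidden weights
b1 = np.zeros(h)
W2 = rng.normal(scale=0.1, size=(h, V))            # hidden-to-output weights
b2 = np.zeros(V)

def next_word_distribution(history):
    """P(w | history) for the last `context` words, as a softmax over the vocabulary."""
    ids = [word_to_id[w] for w in history[-context:]]
    x = np.concatenate([E[i] for i in ids])         # concatenated embeddings
    hidden = np.tanh(x @ W1 + b1)
    logits = hidden @ W2 + b2
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

p = next_word_distribution(["a", "flight"])
print(vocab[int(np.argmax(p))], float(p.max()))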
`
`
`
`430
`
`N. Duta / Natural Language Understanding and Prediction: from Formal Grammars to Large Scale...
`
`3. Natural Language Understanding
`
`3.1. Brief history
`
During the last couple of decades there has been a tremendous growth in deployed voice-driven language
understanding systems, though they have mostly been designed for limited domains. At first, these systems were able
to recognize and interpret (through fixed grammars) some predetermined phrases and named entities like
locations or business names. Most popular were the Directory Assistance (DA) systems built by TellMe,
Phonetic Systems/Nuance, BBN, Jingle, Google, etc.
Later on, ASR technology started to support constrained digit sequences (dates, phone numbers,
credit card and bank account numbers), and form-filling directed dialog systems were designed for tasks
like flight reservation. In such systems, users are asked to provide answers to what the system has asked
for, which often consists of a single piece of semantic information. Directed dialog systems evolved
into mixed-initiative systems where both users and the system can control the dialog flow and which
allowed users to provide more semantic information in a single utterance and in any sequence they
choose. The language understanding task became higher resolution, with more semantic entities that
need to be identified, segmented and normalized.
`The DARPA Airline Travel Information System (ATIS) project [3] was initiated in the 1990s for
`the flight information domain. Users provide some flight attributes like departure and destination cities,
`dates, etc. However there were no constraints on how the information could be expressed. That is, users
`could say either “I need a flight reservation from Boston to Miami leaving tomorrow and returning in two
`weeks” or “Please show me the flight to Miami departing Boston tomorrow”. One can notice that beyond
`this freedom of expression there is a clear semantic structure with crisp and unambiguous semantic
`entities like Departure/Arrival Cities/Date/Times. These entities, known as “semantic slots” or “frame
`elements” are considered to be part of a set of templates (semantic frames) which represent the structure
`of the semantic space. The language understanding component in a frame-based system has to choose the
`correct semantic frame for an utterance and to segment and normalize the associated semantic slots. For
`example, the “Departure Date” slot expressed as the word “tomorrow” has to be normalized to something
like “03/11/2013” in order to be useful for searching a flight database. Most ATIS systems employed
either a statistical classification approach (those coming from the speech processing community) such as
AT&T’s CHRONUS [37] and BBN’s hidden understanding models [30], or a knowledge-based approach
(mostly from the computational linguistics community) such as MIT’s TINA [42], CMU’s Phoenix
[50] and SRI’s Gemini [17].
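As a small illustration of the slot normalization step mentioned above, the following Python sketch maps a few relative date expressions to calendar dates; the phrase inventory is a tiny, invented subset of what a deployed normalizer would handle.

from datetime import date, timedelta

def normalize_date(phrase, today=None):
    """Map a few relative date expressions to MM/DD/YYYY strings.
    The phrase list is an illustrative subset of what a real normalizer handles."""
    today = today or date.today()
    phrase = phrase.lower().strip()
    offsets = {
        "today": 0,
        "tomorrow": 1,
        "in two weeks": 14,
    }
    if phrase in offsets:
        return (today + timedelta(days=offsets[phrase])).strftime("%m/%d/%Y")
    return None  # unknown expression: leave for a more complete normalizer

print(normalize_date("tomorrow", today=date(2013, 3, 10)))      # -> 03/11/2013
print(normalize_date("in two weeks", today=date(2013, 3, 10)))  # -> 03/24/2013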
`TINA [42] is basically a context-free grammar converted to a probabilistic network and implements
`a seamless interface between syntax and semantics. The initially bootstrapped context-free grammar is
`built from a set of training sentences where each sentence is translated by hand into a list of the rules
`invoked to parse it. The rule set is converted to a form that merges common elements on the right-hand
`side (RHS) of all rules sharing the same left-hand side (LHS). Elements on the LHS become parent nodes
`in a family tree. Through example sentences, they acquire knowledge of who their children are and how
`they can interconnect. The injection of domain-dependent semantics is done by replacing the low-level
`syntactic non-terminals with semantic non-terminals. For example, the syntactic rule-based derivation
`SUBJECT => NOUN PHRASE => ARTICLE NOUN => the Hyatt
`is replaced by the semantic derivation [48]
`SUBJECT => ARTICLE PLACE => ARTICLE HOTEL => the Hyatt
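The toy Python sketch below expands the same start symbol with a hand-written syntactic rule table and with a table in which the low-level non-terminals have been replaced by semantic ones; the rules and the naive left-to-right expansion are simplifications and are not TINA’s actual mechanism.

# Toy top-down derivation: the same start symbol expanded with syntactic rules
# versus rules whose low-level non-terminals were replaced by semantic ones.
# The rule tables are illustrative, not TINA's actual grammar.
syntactic_rules = {
    "SUBJECT": ["NOUN_PHRASE"],
    "NOUN_PHRASE": ["ARTICLE", "NOUN"],
    "ARTICLE": ["the"],
    "NOUN": ["Hyatt"],
}

semantic_rules = {
    "SUBJECT": ["ARTICLE", "PLACE"],
    "PLACE": ["HOTEL"],
    "ARTICLE": ["the"],
    "HOTEL": ["Hyatt"],
}

def derive(symbol, rules):
    """Expand a symbol left-to-right until only terminal words remain."""
    if symbol not in rules:
        return [symbol]  # terminal word
    words = []
    for child in rules[symbol]:
        words.extend(derive(child, rules))
    return words

print(derive("SUBJECT", syntactic_rules))  # ['the', 'Hyatt']
print(derive("SUBJECT", semantic_rules))   # ['the', 'Hyatt']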
`
`
`
`
`A main limitation of the knowledge-based systems is that the grammar design process is tedious, slow
`and requires a lot of expertise. The semantic space partition into semantic frames may be subjective and
`the set of slots for a frame are imposed in a top-down fashion rather than extracted from data. Therefore
`some natural language sentences may not be well modeled in this framework.
`At the other end of the semantic spectrum are systems which only need to extract the sentence
`intent without other semantic entities. An example of such systems are the Call Routers whose goal is
`to automatically route a telephone query from a customer to the appropriate set of agents based on a
`brief spoken description of the problem. Call routers are nowadays deployed in most of the large call
`centers because they reduce queue time and call duration, thus saving money and improving customer
`satisfaction by promptly connecting the customer to the right service representative. These systems
`remove the constraints on what a user can say but at the expense of limiting the target semantic space.
`That is, call routers are specifically built for business verticals (e.g.
`telecommunication, government,
`utility companies) and are only designed to detect the kinds of semantic intents specific to that vertical
`(e.g. a telecommunication provider may allow a customer to perform one of several actions: canceling
`some service, resolving a billing issue, paying a bill, adding a new telephone line, etc).
`Well known call routing systems are the AT&T How may I help you? (HMIHY) [22] and the BBN
`Call director [33]. The users are greeted by an open-ended prompt like How May I Help You?, which
`encourages them to speak naturally. To find the meaning of a human utterance in a call routing system,
`the caller’s speech is first translated into a text string by an ASR system and the text is then fed into a
`NLU component called Router. The NLU task is modeled as a statistical classification problem: the text
`corresponding to an utterance is assigned to one or more of a set of predefined user intents (routes).
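A minimal sketch of the router as a statistical text classifier is shown below, here using bag-of-words features and logistic regression from scikit-learn (an assumed, off-the-shelf stand-in for the classifiers used in deployed routers); the utterances and route labels are invented.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up training set: (utterance text, route/intent label).
utterances = [
    "i want to cancel my service",
    "please cancel my subscription",
    "there is a mistake on my bill",
    "i have a question about my bill",
    "i would like to add a new phone line",
    "can i add another line to my account",
]
routes = ["Cancel", "Cancel", "Billing", "Billing", "AddLine", "AddLine"]

router = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
router.fit(utterances, routes)

print(router.predict(["my bill looks wrong this month"]))  # likely ['Billing']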
`
`3.2. Current NLU architecture
`
`The explosion of mobile computing power that came with the smartphones allowed the development of
`more sophisticated NLU systems that could handle combinations of many user intents along with the
`associated named entity extraction. There is now a proliferation of more complex, dialog-based voice
`search systems and mobile personal assistants that are configured to understand and perform several tasks
`[18]. Each task may have different sets of semantic entities that can be formulated and uttered differently
`by different users. The NLU goal in such systems is to also identify which task the user would like to
`perform (usually called user intent).
`A modern client-server voice-based transactional system including dialog is depicted in Fig. 1 (see
`also [49], [23]). A user opens a client application on his phone and utters a sentence (e.g. query or
`command). The client sends the acoustic signal to the server system where it is first converted into text
`by the ASR module. Next, a NLU module extracts the semantic information from this text. A popular
`approach is top-down hierarchical meaning extraction. A semantic domain classifier can be used to
`determine which part of the semantic space a query belongs to. For example, the query “I need a table
`for two at the closest Bertucci’s restaurant for tomorrow” belongs to the “Restaurant” domain. Then,
`using domain dependent models, a second classifier finds the query intent (what the user asks for). In our
`example, the intent is “Restaurant reservation”. Finally using domain and intent dependent models, one
`segments the semantic slots (basic semantic entities) associated with the given domain and intent which
`have been specified by the user. In our case, the following slots appear and can be extracted from the
`sentence: (i) Restaurant name = “Bertucci’s” (ii) Reservation date = “tomorrow”, (iii) Party size = “two”
`and (iv) Restaurant location = “closest”. After that, a normalizer translates each slot value into a form that
`
`
`
`432
`
`N. Duta / Natural Language Understanding and Prediction: from Formal Grammars to Large Scale...
`
`Figure 1. The architecture of a client-server voice-based transactional system including dialog.
`
`can be used to query a knowledge-database. For our query, we could get the following slot normalized
`values: (i) Restaurant name = “Bertucci’s pizzeria” (ii) Reservation date = “03/15/2013” and (iii) Party
`size = “2”. The query domain, intent and normalized slot values are further sent out to a dialog manager
`which has detailed information about the domain and determines the next system action. In our case,
`the dialog manager asks the client application to provide the current user location4, then it interrogates a
`business database to find the closest Bertucci’s restaurant and finally detects that a restaurant reservation
`also requires time information in order to be fulfilled. Therefore, it will issue back to the user a question
regarding the reservation time. The dialog manager produces the question as text, which is then fed into a
speech synthesizer and turned into an audio signal that is played back to the user. Let’s assume the
user answers “Hum let’s say six in the evening”. The NLU system now detects a single semantic slot
Time = “six in the evening” which is normalized as Time = “6 pm” and sent to the dialog manager along
with “Unknown” domain and intent. The dialog manager, which also keeps track of the dialog states
(possibly using a stack), knows that this is the missing piece of information from a previous query and
it can now take action on the query. Some systems send back to the user the parsed information asking
`for confirmation: “Ok, I’ll make a reservation for two at Bertucci’s on Main street for March 15th 2013
`at 6pm. Is that correct?” If the user agrees, the Execution unit sends all the information to a restaurant
`reservation service/web site which performs the actual reservation.
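To make the top-down cascade concrete, the following Python sketch runs the restaurant example through hypothetical domain, intent and slot extraction steps; keyword rules and gazetteer lookups stand in for the statistical classifiers a real system would use.

import re
from datetime import date, timedelta

RESTAURANTS = {"bertucci's": "Bertucci's pizzeria", "mcdonald's": "McDonald's"}

def classify_domain(text):
    # Stand-in for a statistical domain classifier.
    return "Restaurant" if any(name in text.lower() for name in RESTAURANTS) else "Unknown"

def classify_intent(text, domain):
    # Stand-in for a domain-dependent intent classifier.
    if domain == "Restaurant" and re.search(r"\b(table|reservation|reserve)\b", text.lower()):
        return "Restaurant reservation"
    return "Unknown"

def extract_slots(text):
    # Gazetteer and pattern lookups standing in for statistical slot models.
    slots = {}
    lower = text.lower()
    for name, canonical in RESTAURANTS.items():
        if name in lower:
            slots["Restaurant name"] = canonical
    m = re.search(r"table for (\w+)", lower)
    if m:
        slots["Party size"] = {"two": "2", "three": "3"}.get(m.group(1), m.group(1))
    if "tomorrow" in lower:
        slots["Reservation date"] = (date.today() + timedelta(days=1)).strftime("%m/%d/%Y")
    if "closest" in lower:
        slots["Restaurant location"] = "closest"
    return slots

query = "I need a table for two at the closest Bertucci's restaurant for tomorrow"
domain = classify_domain(query)
intent = classify_intent(query, domain)
print(domain, "|", intent, "|", extract_slots(query))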
One can easily notice that the dialog system architecture in Fig. 1 generalizes all systems built in the
past. The DA systems had no client application (at the time users were using landlines), no dialog manager
`
4 either from the internal GPS or from the wireless provider signal triangulation.
`
`
`
`
(there was a single-shot query which was automatically routed to a human agent if the system returned a
low confidence) and no domain or intent classifiers (the system’s goal was only to return the phone number
of a certain business or individual). They only had a primitive slot extractor (either business name or
location, though these were often asked for separately) and normalizer. The directed dialog systems added
a dialog manager and a small number of fixed intents often specified as a single piece of information5.
On the other hand, the call routers added an intent classifier (with a number of intents ranging from a few
tens to a few hundreds) and a very small number of slots.
From a linguistic viewpoint, these systems could be characterized by the following four criteria
(see Table 3 and [49]): the naturalness and the size of the space of input sentences, and the resolution
and the size of the target semantic space. The systems have evolved from low naturalness, a small input
space, low semantic resolution and a small semantic space (directed dialog) to medium-high naturalness,
a large input space, high semantic resolution and a large semantic space (today’s voice transaction
systems).
`
Table 3. Comparison of several NLU systems with respect to the characteristics of the input utterance space and
the output semantic space (adapted from [49])

                                     User input utterances           Target semantic representation
NLU system                           Naturalness     Input space     Resolution     Semantic space
Directed dialog                      Low             Small           Low            Small
DA                                   Low             Large           Low            Small
Mixed initiative                     Low-medium      Small           High           Small
Call routing                         High            Medium          Low            Small
Voice search & personal assistant    Medium-high     Large           High           Large
`
One important phenomenon is that the text which modern systems are attempting to understand obeys
the syntax rules of the language less and less. Spoken language often contains disfluencies, restarts and
recognition errors, while written text may be highly abbreviated and/or truncated. For example, an SMS
line may be “he like u” while a speech-recognized sentence may be “by movie uh kings speech” (the
spoken query was “buy movie king’s speech”).
`
`3.3. Semantic data annotation
`
`As previously mentioned, modern NLU systems often consist of sets of text classifiers which extract
`various types of semantic information: query domain, query intent, semantic slots and/or other attributes
`of the domain (facets) which sometimes may only be mentioned implicitly (e.g. the fragment “which
`make me laugh” in the query “find movies which make me laugh” should be interpreted as a movie genre
`and one may need some sort of logic reasoning for extracting these mappings [10]).
`These classifiers need to be trained on large amounts of data in which the semantic entities of interest
are manually annotated. As shown at the top of Fig. 1, several sets of manual annotations are necessary: (i)
`
5 The system could have asked: “What transaction would you like to perform: Flight information, reservation, cancellation,
other?”
`
`
`
`434
`
`N. Duta / Natural Language Understanding and Prediction: from Formal Grammars to Large Scale...
`
speech transcriptions (a textual form of the user’s spoken utterances), (ii) semantic domain and/or intent
annotations and (iii) semantic slot annotations. Although one tries to carefully annotate the data, the
references produced by different human annotators are not identical. Sometimes that is due to annotator
fatigue, but most of the time there is a subjective component, especially for the semantic annotations. The
inter-annotator disagreement may be 6% for speech transcription [21], but it can get much higher when
semantics is involved.
Therefore one may try to automate parts of the data annotation process. Semantic slot annotation
could be done using Marcus contextual grammars [29][36], which have been theoretically studied for
a long time (a brief introduction is given in the Appendix). We will show here an example of how to
construct and use such a grammar. Let’s assume the vocabulary V is the set of English words and the
starting language A over V is the set of sentences a human might use for interacting with an NLU system
as described in Section 3.2. The selectors correspond to the semantic entities we would like to label
and the contexts contain the English words we would like to label them with. Let’s say S1 is the set
of restaurant names, S2 is the set of location names and C1, C2 are their corresponding semantic labels:

S1 = {McDonald’s, Boston Market, ...}, C1 = {(<Restaurant>, </Restaurant>)},
S2 = {Boston, Cambridge, ...}, C2 = {(<Location>, </Location>)},

and so on. A possible derivation in this grammar is
`
`Find me a McDonald’s in Boston =>
`Find me a <Restaurant>McDonald’s</Restaurant> in Boston =>
`Find me a <Restaurant>McDonald’s</Restaurant> in <Location>Boston</Location>
`
`In order to generate correct annotations, we require the derivations to be in maximal global mode.
`That is, at each derivation step, the word selector x is maximal with respect to all selectors S1, ..., Sn.
`That is enforced if we always label first the longest semantic entity that could be labeled. The resulting
`annotated sentence obeys Occam’s razor6 (annotates as many words as possible with as few labels as
`possible) and is most of the time correct. A simple example is
`
`Find me a Boston Market in Cambridge =>
`Find me a <Restaurant>Boston Market</Restaurant> in Cambridge =>
`Find me a <Restaurant>Boston Market</Restaurant> in <Location>Cambridge</Location>
`
If the derivation is not in the maximal global mode, one could get:
`
`Find me a Boston Market in Cambridge =>
`Find me a <Location>Boston</Location> Market in Cambridge =>
`Find me a <Location>Boston</Location> Market in <Location>Cambridge</Location>
`
`which is obviously incorrect.
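A simple Python sketch of this longest-match-first (“maximal global mode”) labeling with gazetteer selectors is given below; the gazetteers contain just enough entries to reproduce the Boston Market example.

# Longest-match-first annotation: always label the longest entity that matches,
# which is how the maximal global derivation mode is enforced in practice.
# Gazetteers are illustrative subsets of S1 (restaurants) and S2 (locations).
SELECTORS = {
    "Restaurant": ["McDonald's", "Boston Market"],
    "Location": ["Boston", "Cambridge"],
}

def annotate(sentence):
    # Collect all (length, start, end, label, surface) candidate matches.
    candidates = []
    for label, names in SELECTORS.items():
        for name in names:
            start = sentence.find(name)
            while start != -1:
                candidates.append((len(name), start, start + len(name), label, name))
                start = sentence.find(name, start + 1)
    # Longest candidates first; skip any that overlap an already labeled span.
    taken = []
    for length, start, end, label, name in sorted(candidates, key=lambda c: -c[0]):
        if all(end <= s or start >= e for _, s, e, _, _ in taken):
            taken.append((length, start, end, label, name))
    # Apply the surviving labels from right to left so offsets stay valid.
    for _, start, end, label, name in sorted(taken, key=lambda c: -c[1]):
        sentence = sentence[:start] + f"<{label}>{name}</{label}>" + sentence[end:]
    return sentence

print(annotate("Find me a Boston Market in Cambridge"))
# Find me a <Restaurant>Boston Market</Restaurant> in <Location>Cambridge</Location>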
In [29], the finite and regular families of selectors are investigated. Although in practical systems
the vocabulary, starting axioms, selectors and contexts are all finite, the case where the selectors are
generated by a context-sensitive mechanism is of high interest. That is because one named entity may
belong to multiple semantic classes. For example, the word “Eagles” belongs to the MusicBand, SportsTeam
and Bird classes. For such ambiguous cases, the set of selectors must also contain some contextual
words to disambiguate the semantic class. As such, the word “Eagles” should appear in fragments like
6 For a mathematically formalized version of Occam’s razor see Ray Solomonoff’s theory of universal inductive inference [43].
`
`
`
`
“Eagles songs” in MusicBand, “Eagles scores” in SportsTeam and “Eagle food” in Bird. The problem
becomes even harder when the sets of context-sensitive selectors have to be automatically extracted from
unannotated data.
`An easy way to implement semantic annotation with a Marcus contextual grammar is by using Finite
`State Transducer technology [3] [31]. In the Xerox FST toolkit language [3], the grammar shown above
`can be written as:
`
`define Location [{Boston}|{Cambridge}] EndTag(Location); # Selector Location
`define Restaurant [{McDonald’s}|{Boston Market}] EndTag(Restaurant); # Selector Restaurant
`regex Location | Restaurant; # Grammar definition with implicit vocabulary
`
`and the annotated output produced by the toolkit is:
`
`fst[1]: pmatch Find me a Boston Market in Cambridge
`Find me a <Restaurant>Boston Market</Restaurant> in <Location>Cambridge</Location>
`