Fundamenta Informaticae 131 (2014) 425–440
DOI 10.3233/FI-2014-1023
IOS Press

Natural Language Understanding and Prediction:
from Formal Grammars to Large Scale Machine Learning

Nicolae Duta
New England Research and Development Center
Microsoft, Cambridge, USA
niduta@microsoft.com

Address for correspondence: Microsoft NERD Center, One Memorial Drive, Cambridge, MA, 02142, USA

Abstract. Scientists have long dreamed of creating machines humans could interact with by voice. Although Turing's prophecy that machines would be able to converse like humans in the near future is no longer widely believed, real progress has been made in voice- and text-based human-machine interaction. This paper is a light introduction to and survey of some deployed natural language systems and technologies and their historical evolution. We review two fundamental problems involving natural language: the language prediction problem and the language understanding problem. While describing all these technologies in detail is beyond our scope, we do comment on some aspects less discussed in the literature, such as language prediction using huge models and semantic labeling using Marcus contextual grammars.

Keywords: Natural language understanding, language modeling, language prediction

1. Introduction

Scientists have long dreamed of creating machines humans could interact with by voice. In his most cited paper, "Computing Machinery and Intelligence," published in 1950, Turing predicted that "at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted" [46]. "Thinking machines" involve multiple capabilities: recognizing the words which are said, understanding their meaning and being able to produce a meaningful reaction (e.g., answering a question, which may imply reasoning in addition to simply querying a fact database, performing an action/transaction, etc.).
Although, after several decades of research, Turing's prophecy that machines will be able to converse like humans in the near future is no longer widely believed, real progress has been made in the voice and
text-based human-machine interaction. From a theoretical viewpoint, modern computational linguistics started in the late 1950s when Noam Chomsky introduced the theory of generative grammars, which aimed at producing a set of rules that correctly predicts which combinations of words form grammatical sentences [15]. The first practical attempts at natural language understanding by a computer were made at MIT and Stanford in the 1960s: Daniel Bobrow's STUDENT system, which used natural language input to solve algebra word problems, and Joseph Weizenbaum's ELIZA system, which could carry a dialog on many topics (although it did not have a real understanding of the language)1.
However, the early systems only used written text input; it would take two more decades of research until Automatic Speech Recognition (ASR) could allow for voice input. Throughout the 1990s and 2000s, the Defense Advanced Research Projects Agency (DARPA) in the United States conducted several programs to advance the state of the art in ASR, spoken dialog and information extraction from automatically recognized text (ATIS 1990-1994, Hub4 1995-1999, Communicator 1999-2002, EARS 2002-2005, GALE 2005-2010) [52]. These scientific advances have also been sustained by the introduction and exponential growth of the World Wide Web and by the huge increase in computing power and miniaturization that led to today's proliferation of smartphones.
This paper is a light introduction to and survey of some of the deployed natural language systems and technologies and their historical evolution. We review two fundamental problems involving natural language: the language prediction problem and the language understanding problem. While describing all these technologies in detail is beyond our scope, we do comment on some aspects less discussed in the literature, such as language prediction using huge models and semantic labeling using Marcus contextual grammars.

2. Natural Language Prediction

Language prediction is defined as the ability to predict which words naturally follow a given word sequence. It is generally assumed that natural languages are governed by a probability distribution on word sequences, and language prediction models (the field is actually called Statistical Language Modeling) try to derive a good estimate of this distribution [4].
Language modeling/prediction started as part of the Automated Speech Recognition research effort and is now extensively used in most systems which convert some form of signal into text using a Bayesian approach:

• Automated Speech Recognition (ASR) for acoustic-to-text mappings [25]

• Optical Character Recognition (OCR) and handwriting recognition, which map document images into text [32][38]

• Automated Machine Translation (AMT), which maps text written in one language into text written in a different language [28]

• Spelling correction systems, which map incorrectly spelled text into the correct form [6]

• Word completion and prediction systems (predict the following letters in a word, or words in an SMS/email message, considering context and previous user behavior) [9]

1 A detailed historical perspective can be found in [45].
These systems all try to find the text sentence which maximizes the posterior probability P(Sentence|Signal), which according to Bayes' rule can be written as

P(Sentence|Signal) = P(Signal|Sentence) × P(Sentence) / P(Signal)    (1)

P(Signal|Sentence) is the underlying signal model based on acoustic, visual, translational, etc. cues, while P(Sentence) describes the likelihood of a given sentence in a language.
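
In log space, equation (1) turns into a sum: one maximizes log P(Signal|Sentence) + log P(Sentence), and P(Signal), being constant across candidate sentences, can be ignored. A minimal Python sketch of this decision rule over a candidate list (the scoring functions signal_logprob and lm_logprob are hypothetical placeholders, not part of any system described here):

def decode(signal, candidates, signal_logprob, lm_logprob):
    # Pick the sentence maximizing P(Signal|Sentence) x P(Sentence).
    # P(Signal) is the same for all candidates, so it is dropped.
    return max(candidates,
               key=lambda s: signal_logprob(signal, s) + lm_logprob(s))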
Since a natural language like English has a lexicon on the order of 10^6 words, it is not possible to directly estimate P(Sentence) for all sentences. Early ASR systems restricted their language models to the set of sentences appearing in a training set. The drawback was that the system could only output one of the sentences it had seen in training, no matter what the user said. Though the system had the ability to reject an input (not produce a text output if P(Sentence|Signal) was too low), that was not very helpful in a practical system that had to deal with unconstrained speech.
Several techniques have been proposed for estimating P(S) for every sentence S = {w1, w2, ..., wm} (wi are the sentence words in the order they are spoken) [4][40]; currently the most widely used models are based on decomposing P(w1, w2, ..., wm) into a product of conditional probabilities

P(w1, w2, ..., wm) = P(wm|wm−1, ..., w1) × P(wm−1|wm−2, ..., w1) × ... × P(w1).    (2)

Since it was still impractical to directly estimate P(w1, w2, ..., wm) for long word histories, one assumed that words far away in the history of a target word do not have a large influence. That is, the word sequence w1, w2, ..., wm behaves like a Markov chain of some order n. Therefore one only needs to estimate the statistical distributions of sequences of n consecutive words, called n-grams. According to these models, the probability of a sentence can be decomposed into a product of conditional n-gram probabilities. Although counterintuitive, n-gram models take no advantage of the syntactic or semantic structure of the sentences they model.
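
For concreteness, a minimal sketch of the order-2 Markov (trigram) decomposition, assuming a probability table mapping a two-word history to a distribution over next words is already available (the lm dictionary below is a hypothetical stand-in for such a table):

import math

def sentence_logprob(words, lm):
    # log P(w1...wm) under a trigram model: each word is conditioned
    # only on the two preceding words (history padded with <s>).
    padded = ["<s>", "<s>"] + words
    return sum(math.log(lm[(padded[i - 2], padded[i - 1])][padded[i]])
               for i in range(2, len(padded)))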
However, if we are using the Maximum Likelihood (ML) estimate

P_ML(wn|wn−1, ..., w1) = Count(wn, ..., w1) / Count(wn−1, ..., w1)    (3)

we are facing the issue of assigning a null probability to those n-grams not seen in the training data. Even with a training corpus in excess of a few billion words (about the size of all newspaper text published in the US in the 1990s), 10-20% of valid 3-grams have still not been seen before (last row of Table 1). To properly handle them, one applies a technique called interpolated discounting (also called smoothing): set aside a part of the probability mass to account for unseen events and recursively interpolate longer-history probabilities with shorter-history probabilities:

P(wi|wi−1, wi−2) = P_ML(wi|wi−1, wi−2) × α(wi, wi−1, wi−2) + β(wi−1, wi−2) × P(wi|wi−1)    (4)

where α and β are called smoothing functions and model the amount of the probability mass that is left aside for unseen n-grams. To maintain a probability model we need it to sum to 1 over wi:

β(wi−1, wi−2) = 1 − Σ_{wi} P_ML(wi|wi−1, wi−2) × α(wi, wi−1, wi−2)    (5)

A large body of language modeling research in the 1990s focused on finding suitable values for α and β. Two popular choices are Witten-Bell [51] and Kneser-Ney [27] discounting:

Witten-Bell discounting:

α = Count(.|wi−1, wi−2) / (Uniq(.|wi−1, wi−2) + Count(.|wi−1, wi−2))    (6)
β = Uniq(.|wi−1, wi−2) / (Uniq(.|wi−1, wi−2) + Count(.|wi−1, wi−2))    (7)

Kneser-Ney discounting:

α = 1 − D(Count(wi|wi−1, wi−2)) / Count(wi|wi−1, wi−2),
β = Σ_{wi} D(Count(wi|wi−1, wi−2)) / Count(.|wi−1, wi−2)    (8)

where Count(.|wi−1, wi−2) is the total count of the history wi−1, wi−2, Uniq(.|wi−1, wi−2) is the number of distinct words observed after that history, and D is a discount function.

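To make equations (4), (6) and (7) concrete, here is a minimal sketch of a Witten-Bell interpolated trigram model (the data structures are our own illustration, not those of any cited toolkit; note that for Witten-Bell the α of equation (4) does not depend on wi):

from collections import Counter

class WittenBellTrigramLM:
    def __init__(self, sentences):
        self.tri, self.bi, self.uni = Counter(), Counter(), Counter()
        for s in sentences:
            w = ["<s>", "<s>"] + s + ["</s>"]
            for i in range(2, len(w)):
                self.tri[(w[i-2], w[i-1], w[i])] += 1
                self.bi[(w[i-1], w[i])] += 1
                self.uni[w[i]] += 1
        self.total = sum(self.uni.values())
        # Per-history totals Count(.|h) and distinct continuations Uniq(.|h)
        self.tri_count, self.tri_uniq = Counter(), Counter()
        for (w2, w1, _), c in self.tri.items():
            self.tri_count[(w2, w1)] += c
            self.tri_uniq[(w2, w1)] += 1
        self.bi_count, self.bi_uniq = Counter(), Counter()
        for (w1, _), c in self.bi.items():
            self.bi_count[w1] += c
            self.bi_uniq[w1] += 1

    def prob(self, w, w2, w1):
        # Equation (4) with Witten-Bell alpha (6) and beta (7).
        c, u = self.tri_count[(w2, w1)], self.tri_uniq[(w2, w1)]
        lower = self.prob_bigram(w, w1)
        if c == 0:
            return lower
        alpha = c / (u + c)          # equation (6)
        beta = u / (u + c)           # equation (7)
        return alpha * (self.tri[(w2, w1, w)] / c) + beta * lower

    def prob_bigram(self, w, w1):
        # Same interpolation scheme, one order lower.
        c, u = self.bi_count[w1], self.bi_uniq[w1]
        unigram = self.uni[w] / self.total
        if c == 0:
            return unigram
        alpha, beta = c / (u + c), u / (u + c)
        return alpha * (self.bi[(w1, w)] / c) + beta * unigram
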
Starting in the early 2000s, the proliferation of documents posted on the internet generated a potentially huge LM training set. However, internet-scraped text could not be directly used for LM training since: (i) almost all of it was out of domain for the systems built at the time2; (ii) the computational resources (memory and computing speed) were not sufficient to accommodate the huge number of resulting n-grams (a 5 billion-word newspaper corpus generates about 0.8B unique 3-grams and 1.5B unique 4-grams).

Multiple directions of research started to address these issues. One of them was LM pruning: n-grams considered not too informative were discarded (although only after being used in computing global statistics of the data). The simplest pruning technique is to discard the least frequent higher-order n-grams, which one may assume are not statistically significant. A more sophisticated technique is entropy pruning, which considers the relative entropy between the original and the pruned model [44]. However, there appears to be a complex interaction between the pruning method/parameters and the type of discounting used in training the model, which can impact speech recognition accuracy by as much as 10% [12].
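
A minimal sketch of the count-cutoff variant (entropy pruning [44] is more elaborate and not shown; the threshold value is an arbitrary illustration):

def count_cutoff_prune(ngram_counts, min_count=2):
    # Discard n-grams seen fewer than min_count times. As noted above,
    # global statistics should be computed before the n-grams are dropped.
    return {ng: c for ng, c in ngram_counts.items() if c >= min_count}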
A second research direction was to redesign the LM estimation toolkits and speech recognition pipelines to accommodate all n-grams seen in the data. It is important to know the difference between n-grams that are unobserved because they are rare and those that are impossible3. As shown in Table 1, keeping one billion 3- and 4-grams in the LM reduces the Word Error Rate (WER) by about 6% in the Broadcast News recognition domain [20]. Although received with skepticism in academia [11], this direction (along with a distributed data processing framework like Map-Reduce [16]) largely contributed to the recent success of the Google ASR system [13].

For many speech recognition applications (e.g. conversational speech) sufficient in-domain language data has not always been available, and a solution was found to be the use of additional out-of-domain data (especially internet-scraped). Unfortunately, a simple mix of two corpora different in nature does not usually result in a better LM, and a successful mixing strategy is often regarded as an art. Therefore, a third area of research has focused on combining in-domain with out-of-domain data, or even bootstrapping an in-domain LM using only out-of-domain data [8][19].
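
One common combination strategy, shown here only as a hedged illustration since the paper does not commit to one, is linear interpolation of the two models, with the mixing weight tuned on held-out in-domain data:

def interpolate_lms(p_in, p_out, lam=0.7):
    # p_in, p_out: functions mapping (word, history) to a probability.
    # lam would be chosen to minimize perplexity on in-domain held-out text.
    return lambda w, h: lam * p_in(w, h) + (1.0 - lam) * p_out(w, h)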

2 n-gram models are very sensitive to changes in the style, topic or genre of the text on which they are trained (called in-domain data) [40].
3 An analysis of the text currently used in SMS messages and Twitter postings shows that almost everything is now possible due to word misspelling, abbreviation and lack of syntactic structure.

Table 1. The effects of LM pruning on the English Broadcast News task [20]

LM order | LM size [4-grams, 3-grams] | Hit rates [4-grams, 3-grams] | WER
3        | [0, 36M]                   | [0, 76%]                     | 12.6%
3        | [0, 305M]                  | [0, 84%]                     | 12.1%
4        | [40M, 36M]                 | [49%, 76%]                   | 12.1%
4        | [710M, 305M]               | [61%, 84%]                   | 11.8%

While this is still an active research area, we would like to point out two interesting phenomena. The first is that the colloquial forms of some languages like Arabic and their literary counterparts (e.g. the Modern Standard Arabic (MSA) used in newspaper articles and TV broadcasts), although they share the same word lexicon, have very few of the higher-order n-grams in common (see Table 2). That means that published texts and TV transcripts are not effective for training a conversational LM [26].

Table 2. Vocabulary coverage and 3-gram hit rates for LMs based on the Arabic Conversational (150K words), Broadcast News (300M words) and Conversational + BN data

LM training data      | Vocabulary coverage | 3-gram hit rate
Conversational alone  | 90.6%               | 20%
Broadcast News        | 89.5%               | 4%
Conversational + News | 96.6%               | 21%

The second phenomenon is that even though there may still be a significant accuracy gap between speech recognition using a fully in-domain LM and speech recognition using a bootstrapped LM, the semantics of the recognized sentence may be far less impacted. That is, one can still figure out the semantic intent of a sentence even when some of its words are misrecognized [19][47].

Finally, we would like to mention the latest trends in language modeling. Discriminative language models (DLMs) [14] aim at directly optimizing the word error rate by rewarding features that appear in low-error hypotheses and penalizing features in misrecognized hypotheses. Since the estimation of discriminative LMs is computationally more intensive than that of regular n-gram LMs, one has to use distributed learning algorithms and a supporting parallel computing infrastructure [16]. Neural network language models (NNLMs) embed words in a continuous space in which probability estimation is performed using neural networks (feed-forward or recurrent; very recent work is based on networks with multiple hidden layers called deep networks [2]). The expectation is that, with proper training of the word embedding, words that are semantically or grammatically related will be mapped to similar locations in the continuous space. Because the probability estimates are smooth functions of the continuous word representations, a small change in the features results in a small change in the probability estimation, and NNLMs may achieve better generalization for unseen n-grams.
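
A minimal numpy sketch of the forward pass of a feed-forward NNLM (all dimensions and the random initialization are arbitrary assumptions; real systems add training, more hidden layers or recurrence):

import numpy as np

rng = np.random.default_rng(0)
V, d, h, n = 10000, 128, 256, 3          # vocabulary, embedding, hidden, context

E = rng.normal(0, 0.1, (V, d))           # word embedding matrix
W1 = rng.normal(0, 0.1, (n * d, h))      # context -> hidden weights
W2 = rng.normal(0, 0.1, (h, V))          # hidden -> vocabulary scores

def next_word_distribution(context_ids):
    # Embed the n context words in the continuous space, apply a tanh
    # hidden layer, then a softmax over the whole vocabulary.
    x = E[context_ids].reshape(-1)        # concatenated embeddings, shape (n*d,)
    z = np.tanh(x @ W1)
    scores = z @ W2
    scores -= scores.max()                # numerical stability
    p = np.exp(scores)
    return p / p.sum()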
3. Natural Language Understanding

3.1. Brief history

During the last couple of decades there has been tremendous growth in deployed voice-driven language understanding systems, albeit mostly designed for limited domains. At first, these systems were able to recognize and interpret (through fixed grammars) some predetermined phrases and named entities like locations or business names. Most popular were the Directory Assistance (DA) systems built by TellMe, Phonetic Systems/Nuance, BBN, Jingle, Google, etc.
Later on, ASR technology started to support constrained digit sequences (dates, phone numbers, credit card and bank account numbers), and form-filling directed-dialog systems were designed for tasks like flight reservation. In such systems, users are asked to provide answers to what the system has asked for, which often consists of a single piece of semantic information. Directed dialog systems evolved into mixed-initiative systems where both users and the system can control the dialog flow and which allowed users to provide more semantic information in a single utterance and in any sequence they choose. The language understanding task became higher-resolution, with more semantic entities that need to be identified, segmented and normalized.
The DARPA Airline Travel Information System (ATIS) project [3] was initiated in the 1990s for the flight information domain. Users provide some flight attributes like departure and destination cities, dates, etc. However, there were no constraints on how the information could be expressed. That is, users could say either "I need a flight reservation from Boston to Miami leaving tomorrow and returning in two weeks" or "Please show me the flight to Miami departing Boston tomorrow". One can notice that beyond this freedom of expression there is a clear semantic structure with crisp and unambiguous semantic entities like Departure/Arrival City/Date/Time. These entities, known as "semantic slots" or "frame elements", are considered part of a set of templates (semantic frames) which represent the structure of the semantic space. The language understanding component in a frame-based system has to choose the correct semantic frame for an utterance and to segment and normalize the associated semantic slots. For example, the "Departure Date" slot expressed as the word "tomorrow" has to be normalized to something like "03/11/2013" in order to be useful for searching a flight database. Most ATIS systems employed either a statistical classification approach (those coming from the speech processing community) such as AT&T's CHRONUS [37] and BBN's hidden understanding models [30], or a knowledge-based approach (mostly from the computational linguistics community) such as MIT's TINA [42], CMU's Phoenix [50], and SRI's Gemini [17].
TINA [42] is basically a context-free grammar converted to a probabilistic network and implements a seamless interface between syntax and semantics. The initially bootstrapped context-free grammar is built from a set of training sentences, where each sentence is translated by hand into the list of rules invoked to parse it. The rule set is converted to a form that merges common elements on the right-hand side (RHS) of all rules sharing the same left-hand side (LHS). Elements on the LHS become parent nodes in a family tree. Through example sentences, they acquire knowledge of who their children are and how they can interconnect. The injection of domain-dependent semantics is done by replacing the low-level syntactic non-terminals with semantic non-terminals. For example, the syntactic rule-based derivation

SUBJECT => NOUN PHRASE => ARTICLE NOUN => the Hyatt

is replaced by the semantic derivation [48]

SUBJECT => ARTICLE PLACE => ARTICLE HOTEL => the Hyatt
A main limitation of knowledge-based systems is that the grammar design process is tedious, slow and requires a lot of expertise. The partition of the semantic space into semantic frames may be subjective, and the set of slots for a frame is imposed in a top-down fashion rather than extracted from data. Therefore some natural language sentences may not be well modeled in this framework.
At the other end of the semantic spectrum are systems which only need to extract the sentence intent, without other semantic entities. An example of such systems are the call routers, whose goal is to automatically route a telephone query from a customer to the appropriate set of agents based on a brief spoken description of the problem. Call routers are nowadays deployed in most large call centers because they reduce queue time and call duration, thus saving money and improving customer satisfaction by promptly connecting the customer to the right service representative. These systems remove the constraints on what a user can say, but at the expense of limiting the target semantic space. That is, call routers are specifically built for business verticals (e.g. telecommunication, government, utility companies) and are only designed to detect the kinds of semantic intents specific to that vertical (e.g. a telecommunication provider may allow a customer to perform one of several actions: canceling some service, resolving a billing issue, paying a bill, adding a new telephone line, etc).
Well known call routing systems are the AT&T How may I help you? (HMIHY) [22] and the BBN Call Director [33]. Users are greeted by an open-ended prompt like How may I help you?, which encourages them to speak naturally. To find the meaning of a human utterance in a call routing system, the caller's speech is first translated into a text string by an ASR system, and the text is then fed into an NLU component called the router. The NLU task is modeled as a statistical classification problem: the text corresponding to an utterance is assigned to one or more of a set of predefined user intents (routes).
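
A sketch of such a statistical router using scikit-learn (our choice of library, and the toy utterances and routes, are illustrative assumptions, not what [22] or [33] actually used):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: recognized utterance text -> route label
texts = ["i want to pay my bill", "my invoice looks wrong",
         "cancel my service please", "add a new phone line"]
routes = ["Billing-Pay", "Billing-Dispute", "Cancel", "NewLine"]

router = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                       LogisticRegression(max_iter=1000))
router.fit(texts, routes)
print(router.predict(["there is a mistake on my invoice"])[0])  # a route label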

3.2. Current NLU architecture

The explosion of mobile computing power that came with smartphones allowed the development of more sophisticated NLU systems that could handle combinations of many user intents along with the associated named entity extraction. There is now a proliferation of more complex, dialog-based voice search systems and mobile personal assistants that are configured to understand and perform several tasks [18]. Each task may have different sets of semantic entities that can be formulated and uttered differently by different users. The NLU goal in such systems is also to identify which task the user would like to perform (usually called user intent).
Figure 1. The architecture of a client-server voice-based transactional system including dialog.

A modern client-server voice-based transactional system including dialog is depicted in Fig. 1 (see also [49], [23]). A user opens a client application on his phone and utters a sentence (e.g. a query or command). The client sends the acoustic signal to the server system, where it is first converted into text by the ASR module. Next, an NLU module extracts the semantic information from this text. A popular approach is top-down hierarchical meaning extraction. A semantic domain classifier can be used to determine which part of the semantic space a query belongs to. For example, the query "I need a table for two at the closest Bertucci's restaurant for tomorrow" belongs to the "Restaurant" domain. Then, using domain-dependent models, a second classifier finds the query intent (what the user asks for). In our example, the intent is "Restaurant reservation". Finally, using domain- and intent-dependent models, one segments the semantic slots (basic semantic entities) associated with the given domain and intent which have been specified by the user. In our case, the following slots appear and can be extracted from the sentence: (i) Restaurant name = "Bertucci's", (ii) Reservation date = "tomorrow", (iii) Party size = "two" and (iv) Restaurant location = "closest". After that, a normalizer translates each slot value into a form that
can be used to query a knowledge database. For our query, we could get the following normalized slot values: (i) Restaurant name = "Bertucci's pizzeria", (ii) Reservation date = "03/15/2013" and (iii) Party size = "2". The query domain, intent and normalized slot values are then sent to a dialog manager which has detailed information about the domain and determines the next system action. In our case, the dialog manager asks the client application to provide the current user location4, then it interrogates a business database to find the closest Bertucci's restaurant, and finally detects that a restaurant reservation also requires time information in order to be fulfilled. Therefore, it will issue back to the user a question regarding the reservation time. The dialog manager produces the question as text, which is fed into a speech synthesizer and turned into an audio signal that is played back to the user. Let's assume the user answers "Hum, let's say six in the evening". The NLU system now detects a single semantic slot Time = "six in the evening", which is normalized as Time = "6 pm" and sent to the dialog manager along with "Unknown" domain and intent. The dialog manager, which also keeps track of the dialog states (possibly using a stack), knows that this is the missing piece of information from a previous query, and it can now take action on the query. Some systems send back to the user the parsed information asking for confirmation: "Ok, I'll make a reservation for two at Bertucci's on Main Street for March 15th 2013 at 6 pm. Is that correct?" If the user agrees, the Execution unit sends all the information to a restaurant reservation service/web site which performs the actual reservation.
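
A minimal sketch of the top-down domain -> intent -> slot pipeline just described (the classifier and tagger callables are hypothetical placeholders; in a deployed system they would be statistical models):

from dataclasses import dataclass, field

@dataclass
class SemanticFrame:
    domain: str
    intent: str
    slots: dict = field(default_factory=dict)

def understand(text, domain_clf, intent_clfs, slot_taggers, normalizers):
    # Top-down hierarchical meaning extraction: first the domain, then the
    # intent with domain-dependent models, then the slots with domain- and
    # intent-dependent models, finally slot normalization.
    domain = domain_clf(text)                      # e.g. "Restaurant"
    intent = intent_clfs[domain](text)             # e.g. "Restaurant reservation"
    raw = slot_taggers[(domain, intent)](text)     # e.g. {"Reservation date": "tomorrow"}
    slots = {name: normalizers[name](value)        # e.g. "tomorrow" -> "03/15/2013"
             for name, value in raw.items()}
    return SemanticFrame(domain, intent, slots)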
One can easily notice that the dialog system architecture in Fig. 1 generalizes all the systems built in the past. The DA systems had no client application (at the time users were using landlines), no dialog manager
(there was a single-shot query which was automatically routed to a human agent if the system returned a low confidence), and no domain or intent classifiers (the system's goal was only to return the phone number of a certain business or individual). They only had a primitive slot extractor (either business name or location, though often these were asked for separately) and a normalizer. The directed dialog systems added a dialog manager and a small number of fixed intents, often specified as a single piece of information5. The call routers, on the other hand, added an intent classifier (with a number of intents ranging from a few tens to a few hundreds) and a very small number of slots.

4 Either from the internal GPS or from the wireless provider's signal triangulation.
From a linguistic viewpoint, these systems can be characterized by the following four criteria (see Table 3 and [49]): the naturalness and the size of the space of input sentences, and the resolution and the size of the target semantic representation. The systems have evolved from low naturalness, input space, semantic resolution and space size (directed dialog) to medium-high naturalness, large input space, and high semantic resolution and space size (today's voice transaction systems).

Table 3. Comparison of several NLU systems with respect to the characteristics of the input utterance space and the output semantic space (adapted from [49]). Naturalness and Input space characterize the user input utterances; Resolution and Semantic space characterize the target semantic representation.

NLU system                        | Naturalness | Input space | Resolution | Semantic space
Directed dialog                   | Low         | Small       | Low        | Small
DA                                | Low         | Large       | Low        | Small
Mixed initiative                  | Low-medium  | Small       | High       | Small
Call routing                      | High        | Medium      | Low        | Small
Voice search & personal assistant | Medium-high | Large       | High       | Large

One important phenomenon is that the text which modern systems attempt to understand obeys the syntax rules of the language less and less. Spoken language often contains disfluencies, restarts and recognition errors, while written text may be highly abbreviated and/or truncated. For example, an SMS line may be "he like u", while a speech-recognized sentence may be "by movie uh kings speech" (the spoken query was "buy movie king's speech").

3.3. Semantic data annotation

As previously mentioned, modern NLU systems often consist of sets of text classifiers which extract various types of semantic information: query domain, query intent, semantic slots and/or other attributes of the domain (facets), which sometimes may only be mentioned implicitly (e.g. the fragment "which make me laugh" in the query "find movies which make me laugh" should be interpreted as a movie genre, and one may need some sort of logical reasoning to extract these mappings [10]).
These classifiers need to be trained on large amounts of data in which the semantic entities of interest are manually annotated.

5 The system could have asked: "What transaction would you like to perform: flight information, reservation, cancellation, other?"
As shown on top of Fig. 1, several sets of manual annotations are necessary: (i) speech transcriptions (a textual form of the user's spoken utterances), (ii) semantic domain and/or intent annotations and (iii) semantic slot annotations. Although one tries to carefully annotate the data, the references produced by different human annotators are not identical. Sometimes that is due to annotator fatigue, but most of the time there is a subjective component, especially for the semantic annotations. The inter-annotator disagreement may be 6% for speech transcription [21], but it can get much higher when semantics is involved.
Therefore one may try to automate parts of the data annotation process. Semantic slot annotation could be done using Marcus contextual grammars [29][36], which have been studied theoretically for a long time (a brief introduction is given in the Appendix). We will show here an example of how to construct and use such a grammar. Let's assume the vocabulary V is the set of English words and the starting language A over V is the set of sentences a human might use for interacting with an NLU system as described in Section 3.2. The selectors correspond to the semantic entities we would like to label, and the contexts contain the English words we would like to label them with. Let's say S1 is the set of restaurant names, S2 is the set of location names and C1, C2 are their corresponding semantic labels:

S1 = {McDonald's, Boston Market, ...}, C1 = {(<Restaurant>, </Restaurant>)},
S2 = {Boston, Cambridge, ...}, C2 = {(<Location>, </Location>)},

and so on. A possible derivation in this grammar is:

Find me a McDonald's in Boston =>
Find me a <Restaurant>McDonald's</Restaurant> in Boston =>
Find me a <Restaurant>McDonald's</Restaurant> in <Location>Boston</Location>

In order to generate correct annotations, we require the derivations to be in maximal global mode. That is, at each derivation step, the selector word x is maximal with respect to all selectors S1, ..., Sn. That is enforced if we always label first the longest semantic entity that could be labeled. The resulting annotated sentence obeys Occam's razor6 (it annotates as many words as possible with as few labels as possible) and is correct most of the time. A simple example is

Find me a Boston Market in Cambridge =>
Find me a <Restaurant>Boston Market</Restaurant> in Cambridge =>
Find me a <Restaurant>Boston Market</Restaurant> in <Location>Cambridge</Location>

If the derivation is not in the maximal global mode one could get:

Find me a Boston Market in Cambridge =>
Find me a <Location>Boston</Location> Market in Cambridge =>
Find me a <Location>Boston</Location> Market in <Location>Cambridge</Location>

which is obviously incorrect.
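
A minimal sketch of an annotator operating in this maximal global mode: candidate selector matches are claimed longest-first, so "Boston Market" is labeled as a Restaurant before "Boston" can be labeled as a Location (the selector sets are the toy ones above):

import re

SELECTORS = {
    "Restaurant": ["McDonald's", "Boston Market"],
    "Location": ["Boston", "Cambridge"],
}

def annotate(sentence):
    # Sort all (phrase, label) pairs by decreasing phrase length; claiming
    # spans longest-first enforces the maximal global derivation mode.
    entities = sorted(((p, lab) for lab, ps in SELECTORS.items() for p in ps),
                      key=lambda e: -len(e[0]))
    spans = []                        # non-overlapping (start, end, label)
    for phrase, label in entities:
        for m in re.finditer(re.escape(phrase), sentence):
            s, e = m.span()
            if all(e <= s2 or s >= e2 for s2, e2, _ in spans):
                spans.append((s, e, label))
    out, pos = [], 0
    for s, e, label in sorted(spans):
        out.append(sentence[pos:s])
        out.append("<%s>%s</%s>" % (label, sentence[s:e], label))
        pos = e
    out.append(sentence[pos:])
    return "".join(out)

print(annotate("Find me a Boston Market in Cambridge"))
# Find me a <Restaurant>Boston Market</Restaurant> in <Location>Cambridge</Location>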
In [29], the finite and regular families of selectors are investigated. Although in practical systems the vocabulary, starting axioms, selectors and contexts are all finite, the case where the selectors are generated by a context-sensitive mechanism is of high interest. That is because one named entity may belong to multiple semantic classes. For example, the word "Eagles" belongs to the MusicBand, SportsTeam and Bird classes. For such ambiguous cases, the set of selectors must also contain some contextual words that disambiguate the semantic class. As such, the word "Eagles" should appear in fragments like
"Eagles songs" in MusicBand, "Eagles scores" in SportsTeam and "Eagle food" in Bird. The problem becomes even harder when the sets of context-sensitive selectors have to be automatically extracted from un-annotated data.

6 For a mathematically formalized version of Occam's razor, see Ray Solomonoff's theory of universal inductive inference [43].

An easy way to implement semantic annotation with a Marcus contextual grammar is by using Finite State Transducer technology [3][31]. In the Xerox FST toolkit language [3], the grammar shown above can be written as:

define Location [{Boston}|{Cambridge}] EndTag(Location); # Selector Location
define Restaurant [{McDonald's}|{Boston Market}] EndTag(Restaurant); # Selector Restaurant
regex Location | Restaurant; # Grammar definition with implicit vocabulary

and the annotated output produced by the toolkit is:

fst[1]: pmatch Find me a Boston Market in Cambridge
Find me a <Restaurant>Boston Market</Restaurant> in <Location>Cambridge</Location>