(12) United States Patent
Halverson et al.

(10) Patent No.: US 6,742,021 B1
(45) Date of Patent: May 25, 2004
`(54)
`
`(75)
`
`NAVIGATING NETWORK-BASED
`ELECTRONIC INFORMATION USING
`SPOKEN INPUT WITH MULTIMODAL
`ERROR FEEDBACK
`
`Inventors: Christine Halverson, San Jose, CA
`(US); Luc Julia, Menlo Park, CA (US);
`Dimitris Voutsas, Thessaloniki (GR);
`Aden J. Cheyer, Palo Alto, CA (US)
`
`(73)
`
`Assignee: SRI International, Inc., Menlo Park,
`CA (US)
`
`(*)
`
`Notice:
`
`Subject to any disclaimer, the term ofthis
`patent is extended or adjusted under 35
`US.C. 154(b) by 0 days.
`
`(21)
`
`(22)
`
`(63)
`
`(60)
`
`(61)
`(52)
`
`(58)
`
`(56)
`
`Appl. No.: 09/524,095
`
`Filed:
`
`Mar. 13, 2000
`
`Related U.S. Application Data
`
`Continuation-in-part of application No. 09/225,198, filed on
`Jan. 5, 1999,
`Provisional application No. 60/124,718, filed on Mar. 17,
`1999, provisional application No. 60/124,720, filed on Mar.
`17, 1999, and provisional application No. 60/124,719,filed
`on Mar. 17, 1999,
`
`Tint. C17 oes cceseeeeseseseseseees GO06F 15/16
`US. C1. cccccccccccterteeeeees 709/218; 707/5; 707/4;
`707/102
`Field of Search ............00.cccceee 709/218; 707/5,
`707/4, 102; 704/257, 231
`
`References Cited
`
U.S. PATENT DOCUMENTS

5,197,005 A     3/1993   Schwartz et al. .......... 364/419
5,386,556 A     1/1995   Hedin et al. ............. 395/600

(List continued on next page.)

FOREIGN PATENT DOCUMENTS

EP    0 803 826 A2    10/1997
WO    WO 00/11869      3/2000
OTHER PUBLICATIONS

http://www.ai.sri.com/~lesaf/commandtalk.html: "CommandTalk: A Spoken-Language Interface for Battlefield Simulations", 1997, by Robert Moore, John Dowding, Harry Bratt, J. Mark Gawron, Yonael Gorfu and Adam Cheyer, in "Proceedings of the Fifth Conference on Applied Natural Language Processing", Washington, DC, pp. 1-7, Association for Computational Linguistics.
"The CommandTalk Spoken Dialogue System", 1999, by Amanda Stent, John Dowding, Jean Mark Gawron, Elizabeth Owen Bratt and Robert Moore, in "Proceedings of the Thirty-Seventh Annual Meeting of the ACL", pp. 183-190, University of Maryland, College Park, MD, Association for Computational Linguistics.
Stent, Amanda et al., "The CommandTalk Spoken Dialogue System", SRI International.
Moore, Robert et al., "CommandTalk: A Spoken-Language Interface for Battlefield Simulations", Oct. 23, 1997, SRI International.
Dowding, John et al., "Interpreting Language in Context in CommandTalk", Feb. 5, 1999, SRI International.
http://www.ai.sri.com/~oaa/infowiz.html, InfoWiz: An Animated Voice Interactive Information System, May 8, 2000.

(List continued on next page.)
Primary Examiner—James P. Trammell
Assistant Examiner—Firmin Backer
(74) Attorney, Agent, or Firm—Moser, Patterson & Sheridan, LLP; Kin-Wah Tong, Esq.

(57) ABSTRACT

A system, method, and article of manufacture are provided for navigating an electronic data source by means of spoken language. When a spoken input request is received from a user, it is interpreted. Additional input is solicited from the user in a modality different than the original request and used to refine the navigation query. The resulting interpretation of the request is thereupon used to automatically construct an operational navigation query to retrieve the desired information from one or more electronic network data sources.

132 Claims, 7 Drawing Sheets
`
`
`
`
[Front-page illustration: the FIG. 4 flowchart (see Drawing Sheet 5 of 7).]
`
`
`
`
`
U.S. PATENT DOCUMENTS

5,434,777 A      7/1995   Luciw ..................... 364/419
5,519,608 A      5/1996   Kupiec ................. 364/419.08
5,608,624 A      3/1997   Luciw ..................... 395/794
5,721,938 A      2/1998   Stuckey ................... 395/754
5,729,659 A      3/1998   Potter .................... 395/2.79
5,748,974 A      5/1998   Johnson ................... 395/759
5,774,859 A      6/1998   Houser et al. ............. 704/275
5,794,050 A      8/1998   Dahlgren et al. ........... 395/708
5,802,526 A      9/1998   Fawcett et al. ............ 707/104
5,805,775 A      9/1998   Eberman et al. ............ 395/12
5,855,002 A     12/1998   Armstrong ................. 704/270
5,890,123 A      3/1999   Brown et al. .............. 704/275
5,963,940 A     10/1999   Liddy et al. .............. 707/5
6,003,072 A     12/1999   Gerritsen et al. .......... 709/218
6,012,030 A      1/2000   French-St. George et al.
6,021,427 A      1/2000   Spagna et al.
6,026,388 A      2/2000   Liddy et al. .............. 707/1
6,080,202 A      6/2000   Strickland et al.
6,144,989 A     11/2000   Hodjat et al.
6,173,279 B1 *   1/2001   Levin et al. .............. 707/5
6,192,338 B1 *   2/2001   Haszto et al. ............. 704/257
6,226,666 B1     5/2001   Chang et al.
6,338,081 B1     1/2002   Furusawa et al. ........... 704/275
`
OTHER PUBLICATIONS

Dowding, John, "Interleaving Syntax and Semantics in an Efficient Bottom-up Parser", SRI International.
Moore, Robert et al., "Combining Linguistic and Statistical Knowledge Sources in Natural-Language Processing for ATIS", SRI International.
Dowding, John et al., "Gemini: A Natural Language System For Spoken-Language Understanding", SRI International.
Moran, Douglas B. et al., "Intelligent Agent-based User Interfaces", Artificial Intelligence Center, SRI International.
Martin, David L. et al., "Building Distributed Software Systems with the Open Agent Architecture".
Julia, Luc et al., "Cooperative Agents and Recognition System (CARS) for Drivers and Passengers", SRI International.
Moran, Douglas et al., "Multimodal User Interfaces in the Open Agent Architecture".
Cheyer, Adam et al., "Multimodal Maps: An Agent-based Approach", SRI International.
Cutkosky, Mark R. et al., "An Experiment in Integrating Concurrent Engineering Systems".
Martin, David et al., "Development Tools for the Open Agent Architecture", The Practical Application of Intelligent Agents and Multi-Agent Technology (PAAM96), London, Apr. 1996.
Cheyer, Adam et al., "The Open Agent Architecture", SRI International, AI Center.
Dejima, Inc., http://www.dejima.com/.
Cohen, Philip et al., "An Open Agent Architecture", AAAI Spring Symposium, pp. 1-8, Mar. 1994.
Martin, David et al., "Information Brokering in an Agent Architecture", Proceedings of the 2nd Int'l Conference on Practical Application of Intelligent Agents & Multi-Agent Technology, London, Apr. 1997.
`
`* cited by examiner
`
`
`
`
[Drawing Sheet 1 of 7, FIG. 1a: voice input device 102, communications box 104, network 106, request processing logic 300 (see FIG. 3), server 108, data source 110.]
`
`
`
[Drawing Sheet 2 of 7, FIG. 1b: client-side processing variant, coupled to network 106.]
`
`
`
[Drawing Sheet 3 of 7, FIG. 2: mobile information appliances 202 through 202n, network 206, request processing logic 300 (see FIG. 3), server 208, data sources 210 through 210n.]
`
`
`
[Drawing Sheet 4 of 7, FIG. 3: request processing logic 300, comprising a speech recognition engine, a natural language parser, query construction logic, and query refinement logic.]
`
`
`
[Drawing Sheet 5 of 7, FIG. 4 flowchart: 402 receive spoken NL request → 404 interpret request → 405 identify/select data source → 406 construct navigation query → 407 deficiencies? (YES: 412 solicit additional (multimodal) user input, then reinterpret) → 408 navigate data source → refine query? (YES: 412) → 410 transmit and display to client.]
`
`
`
[Drawing Sheet 6 of 7, FIG. 5: (from step 406, FIG. 4) → 520 scrape the online scripted form to extract an input template → 522 instantiate the input template using interpretation of step 404 → (to step 407, FIG. 4).]
`
`
`
[Drawing Sheet 7 of 7, FIG. 6: community of distributed, collaborating electronic agents, including a user interface agent, a natural language agent, a speech recognition agent, and domain agents such as telephone, calendar, web, and database agents.]
`
`NAVIGATING NETWORK-BASED
`ELECTRONIC INFORMATION USING
`SPOKEN INPUT WITH MULTIMODAL
`ERROR FEEDBACK
`
This is a Continuation-In-Part of co-pending U.S. patent application Ser. No. 09/225,198, filed Jan. 5, 1999, Provisional U.S. patent application Ser. No. 60/124,718, filed Mar. 17, 1999, Provisional U.S. patent application Ser. No. 60/124,720, filed Mar. 17, 1999, and Provisional U.S. patent application Ser. No. 60/124,719, filed Mar. 17, 1999, from which applications priority is claimed and these applications are incorporated herein by reference.
`BACKGROUND OF THE INVENTION
`
`The present invention relates generally to the navigation
`of electronic data by means of spoken natural language
`requests, and to feedback mechanisms and methods for
`resolving the errors and ambiguities that may be associated
`with such requests.
As global electronic connectivity continues to grow, and the universe of electronic data potentially available to users continues to expand, there is a growing need for information navigation technology that allows relatively naive users to navigate and access desired data by means of natural language input. In many of the most important markets—including the home entertainment arena, as well as mobile computing—spoken natural language input is highly desirable, if not ideal. As just one example, the proliferation of high-bandwidth communications infrastructure for the home entertainment market (cable, satellite, broadband) enables delivery of movies-on-demand and other interactive multimedia content to the consumer's home television set.
`
`For users to take full advantage of this content stream
`ultimately requires interactive navigation of content data-
`bases in a manner that is too complex for user-friendly
selection by means of a traditional remote-control clicker.
`Allowing spoken natural language requests as the input
`modality for rapidly searching and accessing desired content
`is an important objective for a successful consumer enter-
`tainment product in a context offering a dizzying range of
`database content choices. As further examples, this same
`need to drive navigation of (and transaction with) relatively
`complex data warehouses using spoken natural language
`requests applies equally to surfing the Internet/Web or other
`networks for general information, multimedia content, or
`e-commerce transactions.
`
In general, the existing navigational systems for browsing electronic databases and data warehouses (search engines, menus, etc.) have been designed without navigation via spoken natural language as a specific goal. So today's world is full of existing electronic data navigation systems that do not assume browsing via natural spoken commands, but rather assume text and mouse-click inputs (or in the case of TV remote controls, even less). Simply recognizing voice commands within an extremely limited vocabulary and grammar—the spoken equivalent of button/click input (e.g., speaking "channel 5" selects TV channel 5)—is really not sufficient by itself to satisfy the objectives described above. In order to deliver a true "win" for users, the voice-driven front-end must accept spoken natural language input in a manner that is intuitive to users. For example, the front-end should not require learning a highly specialized command language or format. More fundamentally, the front-end must allow users to speak directly in terms of what the user ultimately wants—e.g., "I'd like to see a Western film directed by Clint Eastwood"—as opposed to speaking in terms of arbitrary
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`2
`navigation structures (e.g., hierarchical layers of menus,
commands, etc.) that are essentially artifacts reflecting con-
`straints of the pre-existing text/click navigation system. At
`the same time, the front-end must recognize and accommo-
`date the reality that a stream of naive spoken natural
`language input will, over time, typically present a variety of
`errors and/or ambiguities: e.g., garbled/unrecognized words
`(did the user say “Eastwood” or “Easter”?) and under-
`constrained requests (“Show me the Clint Eastwood
`movie”). An approach is needed for handling and resolving
`such errors and ambiguities in a rapid, user-friendly, non-
`frustrating manner.
What is needed is a methodology and apparatus for rapidly constructing a voice-driven front-end atop an existing, non-voice data navigation system, whereby users can interact by means of intuitive natural language input not strictly conforming to the step-by-step browsing architecture of the existing navigation system, and wherein any errors or ambiguities in user input are rapidly and conveniently resolved. The solution to this need should be compatible with the constraints of a multi-user, distributed environment such as the Internet/Web or a proprietary high-bandwidth content delivery network; a solution contemplating one-at-a-time user interactions at a single location is insufficient, for example.
`
SUMMARY OF THE INVENTION
`
The present invention addresses the above needs by providing a system, method, and article of manufacture for navigating network-based electronic data sources in response to spoken input requests. When a spoken input request is received from a user, it is interpreted, such as by using a speech recognition engine to extract speech data from acoustic voice signals, and using a language parser to linguistically parse the speech data. The interpretation of the spoken request can be performed on a computing device locally with the user or remotely from the user. The resulting interpretation of the request is thereupon used to automatically construct an operational navigation query to retrieve the desired information from one or more electronic network data sources, which is then transmitted to a client device of the user. If the network data source is a database, the navigation query is constructed in the format of a database query language.
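By way of illustration only, the following sketch shows how such a query might be assembled once the spoken request has been interpreted; the table name, slot names, and function are hypothetical, not drawn from the patent:

    # Hypothetical sketch: mapping a parsed spoken request onto a
    # database-style navigation query. Table and column names are
    # invented for illustration.
    def build_navigation_query(interpretation: dict) -> str:
        """Build a SQL-format navigation query from a structured
        interpretation such as {'genre': 'western', 'director': '...'}."""
        clauses = [f"{field} = '{value}'" for field, value in interpretation.items()]
        return "SELECT title FROM movies WHERE " + " AND ".join(clauses)

    # "I'd like to see a Western film directed by Clint Eastwood"
    print(build_navigation_query(
        {"genre": "western", "director": "Clint Eastwood"}))
    # SELECT title FROM movies WHERE genre = 'western' AND director = 'Clint Eastwood'

Equivalent queries could of course be rendered in whatever query language the target data source expects.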
Typically, errors or ambiguities emerge in the interpretation of the spoken request, such that the system cannot instantiate a complete, valid navigational template. This is to be expected occasionally, and one preferred aspect of the invention is the ability to handle such errors and ambiguities in a relatively graceful and user-friendly manner. Instead of simply rejecting such input and defaulting to traditional input modes or simply asking the user to try again, a preferred embodiment of the present invention seeks to converge rapidly toward instantiation of a valid navigational template by soliciting additional clarification from the user as necessary, either before or after a navigation of the data source, via multimodal input, i.e., by means of menu selection or other input modalities including and in addition to spoken input. This clarifying, multi-modal dialogue takes advantage of whatever partial navigational information has been gleaned from the initial interpretation of the user's spoken request. This clarification process continues until the system converges toward an adequately instantiated navigational template, which is in turn used to navigate the network-based data and retrieve the user's desired information. The retrieved information is transmitted across the network and presented to the user on a suitable client display device.
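A minimal sketch of this convergence loop, under the assumption of a slot-based template (the slot names and the solicit helper are invented, not specified by the patent):

    # Hedged sketch of the clarification dialogue: keep soliciting
    # input, in whatever modality, until the navigational template is
    # adequately instantiated. Slot names are hypothetical.
    REQUIRED_SLOTS = ("genre", "director")

    def converge(template: dict, solicit) -> dict:
        """Fill missing slots via multimodal clarification."""
        while True:
            missing = [s for s in REQUIRED_SLOTS if not template.get(s)]
            if not missing:
                return template  # valid template: ready to navigate
            # The reply may arrive as speech, a menu selection, or a click.
            template[missing[0]] = solicit(f"Please specify a {missing[0]}:")

    # Example: "Show me the Clint Eastwood movie" leaves 'genre' open.
    print(converge({"director": "Clint Eastwood", "genre": None},
                   lambda prompt: "western"))

The loop deliberately reuses whatever partial information the first interpretation produced, rather than restarting the dialogue.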
`
`
`
`
`In a further aspect of the present invention, the construc-
`tion of the navigation query includes extracting an input
`template for an online scripted interface to the data source
`and using the input template to construct the navigation
`query. The extraction of the input
`template can include
`dynamically scraping the online scripted interface.
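For instance, such scraping might look like the following standard-library sketch; the form markup and field names are toy examples, not an actual data-source interface:

    # Sketch of scraping a scripted (e.g., CGI) search form to build
    # an input template, using only the Python standard library.
    from html.parser import HTMLParser
    from urllib.parse import urlencode

    class FormScraper(HTMLParser):
        """Collect the names of <input> fields as template slots."""
        def __init__(self):
            super().__init__()
            self.slots = []
        def handle_starttag(self, tag, attrs):
            if tag == "input":
                name = dict(attrs).get("name")
                if name:
                    self.slots.append(name)

    scraper = FormScraper()
    scraper.feed('<form action="/search">'
                 '<input name="title"><input name="director"></form>')
    template = {slot: None for slot in scraper.slots}   # extract template (cf. step 520)

    template["director"] = "Clint Eastwood"             # instantiate (cf. step 522)
    print(urlencode({k: v for k, v in template.items() if v}))
    # director=Clint+Eastwood

A real scraper would also capture the form's action URL and field types, consistent with steps 520 and 522 of FIG. 5.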
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`The invention, together with further advantages thereof,
`may best be understood by reference to the following
`description taken in conjunction with the accompanying
`drawings in which:
`FIG. 1a illustrates a system providing a spoken natural
`language interface for network-based information
`navigation,
`in accordance with an embodiment of the
`present invention with server-side processing of requests;
FIG. 1b illustrates another system providing a spoken
`natural language interface for network-based information
`navigation,
`in accordance with an embodiment of the
`present invention with client-side processing of requests;
`FIG. 2 illustrates a system providing a spoken natural
`language interface for network-based information
`navigation,
`in accordance with an embodiment of the
`present invention for a mobile computing scenario;
`FIG. 3 illustrates the functional logic components of a
`request processing module in accordance with an embodi-
`ment of the present invention;
`FIG. 4 illustrates a process utilizing spoken natural lan-
`guage for navigating an electronic database in accordance
with one embodiment of the present invention;
`FIG. 5 illustrates a process for constructing a navigational
`query for accessing an online data source via an interactive,
`scripted (e.g., CGI) form; and
FIG. 6 illustrates an embodiment of the present invention
`utilizing a community of distributed, collaborating elec-
`tronic agents.
`DETAILED DESCRIPTION OF THE
`INVENTION
`
`1. System Architecture
`a. Server-End Processing of Spoken Input
FIG. 1a is an illustration of a data navigation system driven by spoken natural language input, in accordance with one embodiment of the present invention. As shown, a user's voice input data is captured by a voice input device 102, such as a microphone. Preferably voice input device 102 includes a button or the like that can be pressed or held down to activate a listening mode, so that the system need not continually pay attention to, or be confused by, irrelevant background noise. In one preferred embodiment well-suited for the home entertainment setting, voice input device 102 is a portable remote control device with an integrated microphone, and the voice data is transmitted from device 102 preferably via infrared (or other wireless) link to communications box 104 (e.g., a set-top box or a similar communications device that is capable of retransmitting the raw voice data and/or processing the voice data) local to the user's environment and coupled to communications network 106. The voice data is then transmitted across network 106 to a remote server or servers 108. The voice data may preferably be transmitted in compressed digitized form, or alternatively—particularly where bandwidth constraints are significant—in analog format (e.g., via frequency modulated transmission), in the latter case being digitized upon arrival at remote server 108.
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`4
At remote server 108, the voice data is processed by
`request processing logic 300 in order to understand the
`user’s request and construct an appropriate query or request
`for navigation of remote data source 110, in accordance with
`the interpretation process exemplified in FIG. 4 and FIG. 5
`and discussed in greater detail below. For purposes of
`executing this process, request processing logic 300 com-
`prises functional modules including speech recognition
`engine 310, natural language (NL) parser 320, query con-
`struction logic 330, and query refinement
`logic 340, as
shown in FIG. 3. Data source 110 may comprise database(s),
`Internet/web site(s), or other electronic information
`repositories, and preferably resides on a central server or
`servers—which may or may not be the same as server 108,
`depending on the storage and bandwidth needs of the
`application and the resources available to the practitioner.
`Data source 110 may include multimedia content, such as
`movies or other digital video and audio content, other
`various forms of entertainment data, or other electronic
`information. The contents of data source 110 are
`navigated—i.e., the contents are accessed and searched, for
`retrieval of the particular information desired by the user—
`using the processes of FIGS. 4 and 5 as described in greater
`detail below.
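As a structural sketch only, the four modules enumerated above might be wired together as follows; the class and method names are hypothetical, and only the reference numerals follow FIG. 3:

    # Loose structural sketch of request processing logic 300. The
    # interfaces are invented; only the numerals follow FIG. 3.
    class RequestProcessingLogic:
        def __init__(self, engine_310, parser_320, construction_330, refinement_340):
            self.engine = engine_310              # speech recognition engine 310
            self.parser = parser_320              # natural language (NL) parser 320
            self.construction = construction_330  # query construction logic 330
            self.refinement = refinement_340      # query refinement logic 340

        def handle(self, voice_data: bytes) -> str:
            words = self.engine.recognize(voice_data)     # acoustic data -> words
            request = self.parser.parse(words)            # words -> structured request
            query = self.construction.build(request)      # request -> navigation query
            while self.refinement.is_deficient(query):    # cf. FIG. 4, step 407
                query = self.refinement.refine(query)     # multimodal clarification
            return query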
Once the desired information has been retrieved from data source 110, it is electronically transmitted via network 106 to the user for viewing on client display device 112. In a preferred embodiment well-suited for the home entertainment setting, display device 112 is a television monitor or similar audiovisual entertainment device, typically in stationary position for comfortable viewing by users. In addition, in such preferred embodiment, display device 112 is coupled to or integrated with a communications box (which is preferably the same as communications box 104, but may also be a separate unit) for receiving and decoding/formatting the desired electronic information that is received across communications network 106.
`Network 106 is a two-way electronic communications
network and may be embodied in electronic communication
`infrastructure including coaxial (cable television)
`lines,
`DSL, fiber-optic cable,
`traditional copper wire (twisted
`pair), or any other type of hardwired connection. Network
`106 may also include a wireless connection such as a
`satellite-based connection, cellular connection, or other type
`of wireless connection. Network 106 may be part of the
`Internet and may support TCP/IP communications, or may
`be embodied in a proprietary network, or in any other
`electronic communications network infrastructure, whether
`packet-switched or connection-oriented. A design consider-
`ation is that network 106 preferably provide suitable band-
`width depending upon the nature of the content anticipated
`for the desired application.
b. Client-End Processing of Spoken Input
FIG. 1b is an illustration of a data navigation system driven by spoken natural language input, in accordance with a second embodiment of the present invention. Again, a user's voice input data is captured by a voice input device 102, such as a microphone. In the embodiment shown in FIG. 1b, the voice data is transmitted from device 102 to request processing logic 300, hosted on a local speech processor, for processing and interpretation. In the preferred embodiment illustrated in FIG. 1b, the local speech processor is conveniently integrated as part of communications box 104, although implementation in a physically separate (but communicatively coupled) unit is also possible as will be readily apparent to those of skill in the art. The voice data is processed by the components of request processing logic 300 in order to understand the user's request and construct an appropriate query or request for navigation of remote data source 110, in accordance with the interpretation process exemplified in FIGS. 4 and 5 as discussed in greater detail below.
`
The resulting navigational query is then transmitted electronically across network 106 to data source 110, which preferably resides on a central server or servers 108. As in FIG. 1a, data source 110 may comprise database(s), Internet/web site(s), or other electronic information repositories, and preferably may include multimedia content, such as movies or other digital video and audio content, other various forms of entertainment data, or other electronic information. The contents of data source 110 are then navigated—i.e., the contents are accessed and searched, for retrieval of the particular information desired by the user—preferably using the process of FIGS. 4 and 5 as described in greater detail below. Once the desired information has been retrieved from data source 110, it is electronically transmitted via network 106 to the user for viewing on client display device 112.
In one embodiment in accordance with FIG. 1b and well-suited for the home entertainment setting, voice input device 102 is a portable remote control device with an integrated microphone, and the voice data is transmitted from device 102 preferably via infrared (or other wireless) link to the local speech processor. The local speech processor is coupled to communications network 106, and also preferably to client display device 112 (especially for purposes of query refinement transmissions, as discussed below in connection with FIG. 4, step 412), and preferably may be integrated within or coupled to communications box 104. In addition, especially for purposes of a home entertainment application, display device 112 is preferably a television monitor or similar audiovisual entertainment device, typically in stationary position for comfortable viewing by users. In addition, in such preferred embodiment, display device 112 is coupled to a communications box (which is preferably the same as communications box 104, but may also be a physically separate unit) for receiving and decoding/formatting the desired electronic information that is received across communications network 106.
`Design considerations favoring server-side processing
`and interpretation of spoken input requests, as exemplified
`in FIG. 1a, include minimizing the need to distribute costly
`computational hardware and software to all client users in
`order to perform speech and language processing. Design
`considerations favoring client-side processing, as exempli-
fied in FIG. 1b, include minimizing the quantity of data sent
`upstream across the network from each client, as the speech
`recognition is performed before transmission across the
`network and only the query data and/or request needs to be
`sent, thus reducing the upstream bandwidth requirements.
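To make the bandwidth tradeoff concrete, here is a purely illustrative sketch; the payload contents and helper names are invented, as the patent specifies no formats:

    # Illustrative contrast of what crosses the network in each
    # architecture. Payloads are invented for illustration.
    import os, zlib

    def server_side_payload(voice_data: bytes) -> bytes:
        """FIG. 1a: ship (compressed) voice data upstream; the server
        performs recognition and parsing."""
        return zlib.compress(voice_data)

    def client_side_payload(recognized_request: str) -> bytes:
        """FIG. 1b: recognition ran locally; only compact query data
        crosses the network."""
        return recognized_request.encode("utf-8")

    audio = os.urandom(8000)  # one second of toy 8 kHz, 8-bit audio
    print(len(server_side_payload(audio)))                        # thousands of bytes
    print(len(client_side_payload("western by Clint Eastwood")))  # a few dozen bytes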
c. Mobile Client Embodiment
A mobile computing embodiment of the present invention may be implemented by practitioners as a variation on the embodiments of either FIG. 1a or FIG. 1b. For example, as depicted in FIG. 2, a mobile variation in accordance with the server-side processing architecture illustrated in FIG. 1a may be implemented by replacing voice input device 102, communications box 104, and client display device 112, with an integrated, mobile, information appliance 202 such as a cellular telephone or wireless personal digital assistant (wireless PDA). Mobile information appliance 202 essentially performs the functions of the replaced components. Thus, mobile information appliance 202 receives spoken natural language input requests from the user in the form of voice data, and transmits that data (preferably via wireless
`
`10
`
`15
`
`20
`
`25
`
`30
`
`35
`
`40
`
`45
`
`50
`
`55
`
`60
`
`65
`
`6
`data receiving station 204) across communications network
`206 for server-side interpretation of the request, in similar
`fashion as described above in connection with FIG. 1.
`
`Navigation of data source 210 and retrieval of desired
`information likewise proceeds in an analogous manner as
`described above. Display information transmitted electroni-
`cally back to the user across network 206is displayed for the
`user on the display of information appliance 202, and audio
`information is output through the appliance’s speakers.
Practitioners will further appreciate, in light of the above teachings, that if mobile information appliance 202 is equipped with sufficient computational processing power, then a mobile variation of the client-side architecture exemplified in FIG. 2 may similarly be implemented. In that case, the modules corresponding to request processing logic 300 would be embodied locally in the computational resources of mobile information appliance 202, and the logical flow of data would otherwise follow in a manner analogous to that previously described in connection with FIG. 1b.
As illustrated in FIG. 2, multiple users, each having their own client input device, may issue requests, simultaneously or otherwise, for navigation of data source 210. This is equally true (though not explicitly drawn) for the embodiments depicted in FIGS. 1a and 1b. Data source 210 (or 110), being a network accessible information resource, has typically already been constructed to support access requests from simultaneous multiple network users, as known by practitioners of ordinary skill in the art. In the case of server-side speech processing, as exemplified in FIGS. 1a and 2, the interpretation logic and error correction logic modules are also preferably designed and implemented to support queuing and multi-tasking of requests from multiple simultaneous network users, as will be appreciated by those of skill in the art.
It will be apparent to those skilled in the art that additional implementations, permutations and combinations of the embodiments set forth in FIGS. 1a, 1b, and 2 may be created without straying from the scope and spirit of the present invention. For example, practitioners will understand, in light of the above teachings and design considerations, that it is possible to divide and allocate the functional components of request processing logic 300 between client and server. For example, speech recognition—in entirety, or perhaps just early stages such as feature extraction—might be performed locally on the client end, perhaps to reduce bandwidth requirements, while natural language parsing and other necessary processing might be performed upstream on the server end, so that more extensive computational power need not be distributed locally to each client. In that case, corresponding portions of request processing logic 300, such as speech recognition engine 310 or portions thereof, would reside locally at the client as in FIG. 1b, while other component modules would be hosted at the server end as in FIGS. 1a and 2.
`
Further, practitioners may choose to implement each of the various embodiments described above on any number of different hardware and software computing platforms and environments and various combinations thereof, including, by way of just a few examples: a general-purpose hardware microprocessor such as the Intel Pentium series; operating system software such as Microsoft Windows/CE, Palm OS, or Apple Mac OS (particularly for client devices and client-side processing), or Unix, Linux, or Windows/NT (the latter three particularly for network data servers and server-side processing), and/or proprietary information access platforms such as Microsoft's WebTV or the Diva Systems video-on-demand system.
`
`
`
`
`2. Processing Methodology
`The present invention provides a spoken natural language
`interface for interrogation of remote electronic databases
`and retrieval of desired information. A preferred embodi-
ment of the present invention utilizes the basic methodology
`outlined in the flow diagram of FIG. 4 in order to provide
`this interface. This methodology will now be discussed.
`a. Interpreting Spoken Natural Language Requests
`At step 402, the user’s spoken request for information is
`initially received in the form of raw (acoustic) voice data by
`a suitable input device, as previously discussed in connec-
`tion with FIGS. 1-2. At step 404 the voice data received
`from the user is interpreted in order to understand the user’s
`request for information. Preferably this step includes per-
`forming speech recognition in order to extract words from
`the voice data, and further includes natural language parsing
`of those words in order to generate a structured linguistic
`representation of the user’s request.
`Speech recognition in step 404 is performed using speech
recognition engine 310. A variety of commercial-quality
speech recognition engines are readily available on the
`market, as practitioners will know. For example, Nuance
`Communications offers a suite of speech recognition
`engines, including Nuance 6, its current flagship product,
`and Nuance Express, a lower cost package for entry-level
`applications. As one other example, IBM offers the ViaVoice
`speech recognition engine,
`including a low-cost shrink-
`wrapped version available through popular consumerdistri-
`bution channels. Basically, a speech recognition engine
`processes acoustic voice data and attempts to generate a text
`stream of recognized words.
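In implementation terms, the engine is often hidden behind a narrow interface so that any such commercial recognizer (Nuance, ViaVoice, or another) can be swapped in; a minimal sketch, with invented names and no actual vendor API depicted:

    # Minimal sketch of a vendor-neutral recognizer interface; all
    # names here are hypothetical.
    from typing import Protocol, Sequence

    class SpeechRecognizer(Protocol):
        def recognize(self, voice_data: bytes, lexicon: Sequence[str]) -> str:
            """Return a text stream of recognized words, matched
            against the supplied vocabulary lexicon."""
            ...

    def transcribe(engine: SpeechRecognizer, voice_data: bytes,
                   lexicon: Sequence[str]) -> list:
        # The lexicon can be biased toward the current dialogue context,
        # e.g., actor and director names during movie selection.
        return engine.recognize(voice_data, lexicon).split()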
`Typically, the speech recognition engine is provided with
`a vocabulary lexicon of likely words or phrases that the
`recognition engine can match against its analysis of acous-
`tical signals, for purposes of a given application. Preferably,
`the lexicon is dynamically adjusted to reflect the current user
`context, as established by the preceding user inputs. For
`example, if a user is engaged in a dialogue with the system
`about movie selection, the recognition engine’s vocabulary
`may preferably be adjusted to favor relevant words and
`phrases, such as a stored list of proper names for popular
`movie actors and directors, etc. Whereas if t