(10) Patent No.: US 6,757,718 B1
(45) Date of Patent: Jun. 29, 2004

Halverson et al.
`
`(54) MOBILE NAVIGATION OF NETWORK-
`BASED ELECTRONIC INFORMATION
`USING SPOKEN INPUT
`
(75) Inventors: Christine Halverson, San Jose, CA (US); Luc Julia, Menlo Park, CA (US); Dimitris Voutsas, Thessaloniki (GR); Adam Cheyer, Palo Alto, CA (US)
`
(73) Assignee: SRI International, Menlo Park, CA (US)
`
5,721,938 A   2/1998   Stuckey ................... 395/754
5,729,659 A   3/1998   Potter .................... 395/2.79
5,748,974 A   5/1998   Johnson ................... 395/759
5,774,859 A   6/1998   Houser et al. ............. 704/275
5,794,050 A   8/1998   Dahlgren et al. ........... 395/708
5,802,526 A   9/1998   Fawcett et al. ............ 707/104
5,805,775 A   9/1998   Eberman et al. ............ 395/12
5,855,002 A  12/1998   Armstrong ................. 704/270
5,890,123 A   3/1999   Brown et al. .............. 704/275
5,963,940 A  10/1999   Liddy et al. .............. 707/5

(List continued on next page.)
`
( * ) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by days.
`
FOREIGN PATENT DOCUMENTS

EP    0 867 861    9/1998  ............. G10L/5/06
WO    99/50826    10/1999  ............. G10L/3/00
WO    00/05638     2/2000
`
(21) Appl. No.: 09/608,872
`
(22) Filed: Jun. 30, 2000
`
Related U.S. Application Data

(63) Continuation of application No. 09/524,095, filed on Mar. 13, 2000, which is a continuation-in-part of application No. 09/225,198, filed on Jan. 5, 1999.

(60) Provisional application No. 60/124,720, filed on Mar. 17, 1999, provisional application No. 60/124,719, filed on Mar. 17, 1999, and provisional application No. 60/124,718, filed on Mar. 17, 1999.
`
(51) Int. Cl.7 ................................................ G06F 15/16
(52) U.S. Cl. ....................... 709/218; 709/202; 709/217; 709/219; 709/227; 704/257
(58) Field of Search ................................. 709/202, 218, 217, 219, 227; 707/5, 3, 4; 704/257, 270.1, 275, 246
`
(56) References Cited

U.S. PATENT DOCUMENTS

5,197,005 A   3/1993   Shwartz et al. ............ 364/419
5,386,556 A   1/1995   Hedin et al. .............. 395/600
5,434,777 A   7/1995   Luciw ..................... 364/419.13
5,519,608 A   5/1996   Kupiec .................... 364/419.08
5,608,624 A   3/1997   Luciw ..................... 395/794
`
OTHER PUBLICATIONS

International Search Report, Int'l Appl. No. PCT/US01/07987.

Stent, Amanda et al., "The CommandTalk Spoken Dialogue System", SRI International.

(List continued on next page.)
`
Primary Examiner—Frantz B. Jean
(74) Attorney, Agent, or Firm—Moser, Patterson & Sheridan, LLP; Kin-Wah Tong
(57) ABSTRACT

A system, method, and article of manufacture are provided for navigating an electronic data source by means of spoken language where a portion of the data link between a mobile information appliance of the user and the data source utilizes wireless communication. When a spoken input request is received from a user who is using the mobile information appliance, it is interpreted. The resulting interpretation of the request is thereupon used to automatically construct an operational navigation query to retrieve the desired information from one or more electronic network data sources, which is transmitted to the mobile information appliance.

27 Claims, 7 Drawing Sheets
`
`
`
`
`
US 6,757,718 B1
Page 2

U.S. PATENT DOCUMENTS

6,003,072 A   12/1999   Gerritsen et al. .......... 709/218
6,016,476 A    1/2000   Maes et al. ............... 705/1
6,026,388 A    2/2000   Liddy et al. .............. 707/1
6,102,030 A    8/2000   Brown et al. .............. 704/275
6,173,279 B1   1/2001   Levin et al. .............. 707/5
6,192,338 B1   2/2001   Haszto et al. ............. 704/257
6,314,365 B1  11/2001   Smith ..................... 340/988
6,317,684 B1  11/2001   Roeseler et al. ........... 340/990
6,349,257 B1   2/2002   Liu et al. ................ 340/56
6,353,661 B1   3/2002   Bailey, III ............... 379/88.17

OTHER PUBLICATIONS

Dowding, John et al., "Interpreting Language in Context in CommandTalk", Feb. 5, 1999, SRI International.

http://www.ai.sri.com/~oaa/infowiz.html, InfoWiz: An Animated Voice Interactive Information System, May 8, 2000.

Dowding, John, "Interleaving Syntax and Semantics in an Efficient Bottom-up Parser", SRI International.

Moore, Robert et al., "Combining Linguistic and Statistical Knowledge Sources in Natural-Language Processing for ATIS", SRI International.

Dowding, John et al., "Gemini: A Natural Language System For Spoken-Language Understanding", SRI International.

Moore, Robert et al., "CommandTalk: A Spoken-Language Interface for Battlefield Simulations", Oct. 23, 1997, SRI International.

* cited by examiner
`
`
`
U.S. Patent          Jun. 29, 2004          Sheet 1 of 7          US 6,757,718 B1

[FIG. 1a: system drawing]
`
`
`
U.S. Patent          Jun. 29, 2004          Sheet 2 of 7          US 6,757,718 B1

[FIG. 1b: system drawing]
`
`
`
U.S. Patent          Jun. 29, 2004          Sheet 3 of 7          US 6,757,718 B1

[FIG. 2: system drawing]
`
`
`
U.S. Patent          Jun. 29, 2004          Sheet 4 of 7          US 6,757,718 B1

[FIG. 3: REQUEST PROCESSING LOGIC 300, comprising SPEECH RECOGNITION ENGINE, NATURAL LANGUAGE PARSER, QUERY CONSTRUCTION LOGIC, and QUERY REFINEMENT LOGIC]
`
`
`
U.S. Patent          Jun. 29, 2004          Sheet 5 of 7          US 6,757,718 B1

[FIG. 4: flow diagram. 402 RECEIVE SPOKEN NL REQUEST; 404 INTERPRET REQUEST; 405 IDENTIFY/SELECT DATA SOURCE; 406 CONSTRUCT NAVIGATION QUERY; 407 REFINE QUERY?; 408 NAVIGATE DATA SOURCE; 410 TRANSMIT AND DISPLAY TO CLIENT; 412 SOLICIT ADDITIONAL (MULTIMODAL) USER INPUT]
`
`
`
U.S. Patent          Jun. 29, 2004          Sheet 6 of 7          US 6,757,718 B1

[FIG. 5: (from step 406, Fig. 4) SCRAPE THE ONLINE SCRIPTED FORM TO EXTRACT AN INPUT TEMPLATE; INSTANTIATE THE INPUT TEMPLATE USING INTERPRETATION OF STEP 404; (to step 407, Fig. 4)]
`
`
`
U.S. Patent          Jun. 29, 2004          Sheet 7 of 7          US 6,757,718 B1

[FIG. 6: drawing of a community of distributed, collaborating electronic agents]
`
`
`
`
`
`
`MOBILE NAVIGATION OF NETWORK-
`BASED ELECTRONIC INFORMATION
`USING SPOKEN INPUT
`
This application is a continuation of an application entitled NAVIGATING NETWORK-BASED ELECTRONIC INFORMATION USING SPOKEN NATURAL LANGUAGE INPUT WITH MULTIMODAL ERROR FEEDBACK, which was filed on Mar. 13, 2000 under Ser. No. 09/524,095 and which is a Continuation In Part of co-pending U.S. patent application Ser. No. 09/225,198, filed Jan. 5, 1999, Provisional U.S. patent application Ser. No. 60/124,718, filed Mar. 17, 1999, Provisional U.S. patent application Ser. No. 60/124,720, filed Mar. 17, 1999, and Provisional U.S. patent application Ser. No. 60/124,719, filed Mar. 17, 1999, from which applications priority is claimed and which applications are incorporated herein by reference.
`
`BACKGROUND OF THE INVENTION
`
`The present invention relates generally to the navigation
`of electronic data by means of spoken natural language
`requests, and to feedback mechanisms and methods for
`resolving the errors and ambiguities that may be associated
`with such requests.
`As global electronic connectivity continues to grow, and
`the universe of electronic data potentially available to users
`continues to expand, there is a growing need for information
`navigation technology that allows relatively naive users to
`navigate and access desired data by means of natural lan-
`guage input. In many of the most
`important markets—
`including the home entertainment arena, as well as mobile
`computing—spoken natural
`language input
`is highly
`desirable, if not ideal. As just one example, the proliferation
`of high-bandwidth communications infrastructure for the
`home entertainment market (cable, satellite, broadband)
`enables delivery of movies-on-demand and other interactive
`multimedia content to the consumer’s home television set.
`
`For users to take full advantage of this content stream
`ultimately requires interactive navigation of content data-
`bases in a manner that is too complex for user-friendly
`selection by means of a traditional remote-control clicker.
`Allowing spoken natural language requests as the input
`modality for rapidly searching and accessing desired content
`is an important objective for a successful consumer enter-
`tainment product in a context offering a dizzying range of
`database content choices. As further examples, this same
`need to drive navigation of (and transaction with) relatively
`complex data warehouses using spoken natural language
`requests applies equally to surfing the Internet/Web or other
`networks for general information, multimedia content, or
`e-commerce transactions.
`
`In general, the existing navigational systems for browsing
`electronic databases and data warehouses (search engines,
`menus, etc.), have been designed without navigation via
`spoken natural language as a specific goal. So today’s world
`is full of existing electronic data navigation systems that do
`not assume browsing via natural spoken commands, but
`rather assume text and mouse-click inputs (or in the case of
`TV remote controls, even less). Simply recognizing voice
`commands within an extremely limited vocabulary and
`grammar—the spoken equivalent of button/click input (e.g.,
`speaking “channel 5” selects TV channel 5)—is really not
`sufficient by itself to satisfy the objectives described above.
`In order to deliver a true “win” for users, the voice-driven
front-end must accept spoken natural language input in a manner that is intuitive to users. For example, the front-end
`should not require learning a highly specialized command
`language or format. More fundamentally, the front-end must
`allow users to speak directly in terms of what the user
ultimately wants—e.g., “I’d like to see a Western film directed by Clint Eastwood”—as opposed to speaking in
`terms of arbitrary navigation structures (e.g., hierarchical
`layers of menus, commands, etc.) that are essentially arti-
`facts reflecting constraints of the pre-existing text/click
`navigation system. At the same time,
`the front-end must
`recognize and accommodate the reality that a stream of
`naive spoken natural language input will, over time, typi-
`cally present a variety of errors and/or ambiguities: e.g.,
`garbled/unrecognized words (did the user say “Eastwood” or
`“Easter”?) and under-constrained requests (“Show me the
`Clint Eastwood movie”). An approach is needed for han-
`dling and resolving such errors and ambiguities in a rapid,
`user-friendly, non-frustrating manner.
`What
`is needed is a methodology and apparatus for
`rapidly constructing a voice-driven front-end atop an
`existing, non-voice data navigation system, whereby users
`can interact by means of intuitive natural language input not
`strictly conforming to the step-by-step browsing architecture
`of the existing navigation system, and wherein any errors or
`ambiguities in user input are rapidly and conveniently
`resolved. The solution to this need should be compatible
`with the constraints of a multi-user, distributed environment
`such as the Internet/Web or a proprietary high-bandwidth
`content delivery network; a solution contemplating one-at-
`a-time user interactions at a single location is insufficient, for
`example.
`
`SUMMARY OF THE INVENTION
`
The present invention addresses the above needs by providing a system, method, and article of manufacture for
`mobile navigation of network-based electronic data sources
`in response to spoken input requests. When a spoken input
`request is received from a user using a mobile information
`appliance that communicates with a network server via an at
`least partially wireless communications system,
`it
`is
`interpreted, such as by using a speech recognition engine to
`extract speech data from acoustic voice signals, and using a
`language parser to linguistically parse the speech data. The
`interpretation of the spoken request can be performed on a
`computing device locally with the user, such as the mobile
`information appliance, or remotely from the user. The result-
`ing interpretation of the request is thereupon used to auto-
`matically construct an operational navigation query to
`retrieve the desired information from one or more electronic
`network data sources, which is then transmitted to a client
`device of the user. If the network data source is a database,
`the navigation query is constructed in the format of a
`database query language.
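
By way of a non-limiting illustration only, the following sketch shows one way an interpretation could be rendered as such a database query. The attribute/value frame, the build_navigation_query function, and the movies table are assumptions made for this example; the specification does not prescribe a particular schema or query language.

def build_navigation_query(frame: dict) -> tuple:
    """Construct a parameterized SQL navigation query from an interpretation frame.

    Illustrative sketch only; the field names and the 'movies' table are hypothetical.
    """
    conditions, params = [], []
    for field, value in frame.items():
        if value is not None:
            conditions.append(f"{field} = ?")
            params.append(value)
    where_clause = " AND ".join(conditions) if conditions else "1=1"
    return f"SELECT title FROM movies WHERE {where_clause}", params

# Example: "I'd like to see a Western film directed by Clint Eastwood"
sql, params = build_navigation_query({"genre": "Western", "director": "Clint Eastwood"})
# sql == "SELECT title FROM movies WHERE genre = ? AND director = ?"
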
`Typically, errors or ambiguities emerge in the interpreta-
`tion of the spoken request, such that the system cannot
`instantiate a complete, valid navigational template. This is to
`be expected occasionally, and one preferred aspect of the
`invention is the ability to handle such errors and ambiguities
in a relatively graceful and user-friendly manner. Instead of
`simply rejecting such input and defaulting to traditional
`input modes or simply asking the user to try again, a
`preferred embodiment of the present
`invention seeks to
`converge rapidly toward instantiation of a valid navigational
`template by soliciting additional clarification from the user
`as necessary, either before or after a navigation of the data
`source, via multimodal input, i.e., by means of menu selec-
tion or other input modalities including and in addition to spoken input. This clarifying, multi-modal dialogue takes
`advantage of whatever partial navigational information has
`been gleaned from the initial interpretation of the user’s
`spoken request. This clarification process continues until the
`system converges toward an adequately instantiated navi-
`gational template, which is in turn used to navigate the
`network-based data and retrieve the user’s desired informa-
tion. The retrieved information is transmitted across the network and presented to the user on a suitable client display
`device.
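
As an informal sketch of such a clarification loop (not the implementation described in this specification), the code below keeps soliciting input for whichever template slots remain unfilled; the slot names and the solicit_user_input prompt are hypothetical stand-ins for any multimodal interaction.

REQUIRED_SLOTS = ("title", "genre", "director")  # hypothetical template slots

def solicit_user_input(slot: str) -> str:
    # Stand-in for any multimodal prompt: an on-screen menu, a follow-up
    # spoken question, etc.
    return input(f"Please specify the {slot}: ")

def converge_on_template(partial: dict) -> dict:
    """Fill in missing slots until the navigational template is fully instantiated."""
    template = dict(partial)
    for slot in REQUIRED_SLOTS:
        while not template.get(slot):
            template[slot] = solicit_user_input(slot)
    return template
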
`
`
`In a further aspect of the present invention, the construc-
`tion of the navigation query includes extracting an input
`template for an online scripted interface to the data source
`and using the input template to construct the navigation
`query. The extraction of the input
`template can include
`dynamically scraping the online scripted interface.
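
A rough sketch of this aspect, assuming an HTML/CGI form and Python's standard html.parser module, is given below; the class and function names are illustrative and are not taken from the specification.

from html.parser import HTMLParser

class FormTemplateScraper(HTMLParser):
    """Collect a form's action URL and its named input fields."""
    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = attrs.get("action")
        elif tag in ("input", "select") and "name" in attrs:
            self.fields.append(attrs["name"])

def extract_input_template(form_html: str) -> dict:
    scraper = FormTemplateScraper()
    scraper.feed(form_html)
    return {"action": scraper.action, "fields": scraper.fields}

def instantiate_template(template: dict, interpretation: dict) -> dict:
    # Fill the scraped fields from the interpretation of the spoken request.
    return {f: interpretation.get(f, "") for f in template["fields"]}
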
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`The invention, together with further advantages thereof,
`may best be understood by reference to the following
`description taken in conjunction with the accompanying
`drawings in which:
`FIG. 1a illustrates a system providing a spoken natural
`language interface for network-based information
`navigation,
`in accordance with an embodiment of the
`present invention with server-side processing of requests;
`FIG. 1b illustrates another system providing a spoken
`natural language interface for network-based information
`navigation,
`in accordance with an embodiment of the
`present invention with client-side processing of requests;
`FIG. 2 illustrates a system providing a spoken natural
`language interface for network-based information
`navigation,
`in accordance with an embodiment of the
`present invention for a mobile computing scenario;
`FIG. 3 illustrates the functional logic components of a
`request processing module in accordance with an embodi-
`ment of the present invention;
`FIG. 4 illustrates a process utilizing spoken natural lan-
`guage for navigating an electronic database in accordance
`with one embodiment of the present invention;
`FIG. 5 illustrates a process for constructing a navigational
`query for accessing an online data source via an interactive,
`scripted (e.g., CGI) form; and
`FIG. 6 illustrates an embodiment of the present invention
`utilizing a community of distributed, collaborating elec-
`tronic agents.
`
`DETAILED DESCRIPTION OF THE
`INVENTION
`
`1. System Architecture
`a. Server-End Processing of Spoken Input
`FIG. 1a is an illustration of a data navigation system
`driven by spoken natural language input, in accordance with
`one embodiment of the present invention. As shown, a user’s
`voice input data is captured by a voice input device 102,
`such as a microphone. Preferably voice input device 102
`includes a button or the like that can be pressed or held-
`down to activate a listening mode, so that the system need
`not continually pay attention to, or be confused by, irrelevant
`background noise. In one preferred embodiment well-suited
`for the home entertainment setting, voice input device 102
`is a portable remote control device with an integrated
`microphone, and the voice data is transmitted from device
`102 preferably via infrared (or other wireless) link to com-
munications box 104 (e.g., a set-top box or a similar communications device that is capable of retransmitting the
`raw voice data and/or processing the voice data) local to the
`user’s environment and coupled to communications network
The voice data is then transmitted across network 106 to a remote server or servers 108. The voice data may
`preferably be transmitted in compressed digitized form, or
`alternatively—particularly where bandwidth constraints are
`significant—in analog format (e.g., via frequency modulated
`transmission), in the latter case being digitized upon arrival
`at remote server 108.
`
At remote server 108, the voice data is processed by request processing logic 300 in order to understand the
`user’s request and construct an appropriate query or request
`for navigation of remote data source 110, in accordance with
`the interpretation process exemplified in FIG. 4 and FIG. 5
`and discussed in greater detail below. For purposes of
`executing this process, request processing logic 300 com-
`prises functional modules including speech recognition
`engine 310, natural language (NL) parser 320, query con-
`struction logic 330, and query refinement
`logic 340, as
`shown in FIG. 3. Data source 110 may comprise database(s),
`Internet/web site(s), or other electronic information
`repositories, and preferably resides on a central server or
`servers—which may or may not be the same as server 108,
`depending on the storage and bandwidth needs of the
`application and the resources available to the practitioner.
`Data source 110 may include multimedia content, such as
`movies or other digital video and audio content, other
`various forms of entertainment data, or other electronic
information. The contents of data source 110 are navigated—i.e., the contents are accessed and searched, for
`retrieval of the particular information desired by the user—
`using the processes of FIGS. 4 and 5 as described in greater
`detail below.
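
One possible composition of these functional modules is sketched below; the interfaces are hypothetical and are intended only to mirror the labels of FIG. 3, not to describe the actual implementation.

class RequestProcessingLogic:
    """Informal sketch of request processing logic 300 (FIG. 3)."""
    def __init__(self, recognizer, parser, query_builder, refiner):
        self.recognizer = recognizer        # speech recognition engine 310
        self.parser = parser                # natural language parser 320
        self.query_builder = query_builder  # query construction logic 330
        self.refiner = refiner              # query refinement logic 340

    def handle(self, voice_data, user_io):
        words = self.recognizer.recognize(voice_data)
        interpretation = self.parser.parse(words)
        query = self.query_builder.build(interpretation)
        if not query.is_complete():
            # Multimodal clarification, before or after navigation.
            query = self.refiner.refine(query, user_io)
        return query
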
Once the desired information has been retrieved from data source 110, it is electronically transmitted via network 106
`to the user for viewing on client display device 112. In a
`preferred embodiment well-suited for the home entertain-
`ment setting, display device 112 is a television monitor or
`similar audiovisual entertainment device, typically in sta-
`tionary position for comfortable viewing by users.
`In
`addition, in such preferred embodiment, display device 112
`is coupled to or integrated with a communications box
`(which is preferably the same as communications box 104,
`but may also be a separate unit) for receiving and decoding/
`formatting the desired electronic information that is received
`across communications network 106.
`
`Network 106 is a two-way electronic communications
`network and may be embodied in electronic communication
`infrastructure including coaxial (cable television)
`lines,
`DSL, fiber-optic cable,
`traditional copper wire (twisted
`pair), or any other type of hardwired connection. Network
`106 may also include a wireless connection such as a
`satellite-based connection, cellular connection, or other type
`of wireless connection. Network 106 may be part of the
`Internet and may support TCP/IP communications, or may
`be embodied in a proprietary network, or in any other
`electronic communications network infrastructure, whether
`packet-switched or connection-oriented. A design consider-
`ation is that network 106 preferably provide suitable band-
`width depending upon the nature of the content anticipated
`for the desired application.
`b. Client-End Processing of Spoken Input
`FIG. 1b is an illustration of a data navigation system
`driven by spoken natural language input, in accordance with
`a second embodiment of the present invention. Again, a
user’s voice input data is captured by a voice input device 102, such as a microphone. In the embodiment shown in
`FIG. 1b, the voice data is transmitted from device 202 to
request processing logic 300, hosted on a local speech
`processor, for processing and interpretation. In the preferred
`embodiment illustrated in FIG. 1b, the local speech proces-
`sor is conveniently integrated as part of communications box
`104, although implementation in a physically separate (but
`communicatively coupled) unit is also possible as will be
`readily apparent to those of skill in the art. The voice data is
`processed by the components of request processing logic
`300 in order to understand the user’s request and construct
`an appropriate query or request for navigation of remote data
`source 110, in accordance with the interpretation process
`exemplified in FIGS. 4 and 5 as discussed in greater detail
`below.
`
`The resulting navigational query is then transmitted elec-
`tronically across network 106 to data source 110, which
`preferably resides on a central server or servers 108. As in
`FIG. 1a, data source 110 may comprise database(s), Internet/
`web site(s), or other electronic information repositories, and
`preferably may include multimedia content, such as movies
`or other digital video and audio content, other various forms
`of entertainment data, or other electronic information. The
`contents of data source 110 are then navigated—i.e.,
`the
`contents are accessed and searched, for retrieval of the
`particular information desired by the user—preferably using
`the process of FIGS. 4 and 5 as described in greater detail
below. Once the desired information has been retrieved from data source 110, it is electronically transmitted via network 106 to the user for viewing on client display device 112.
In one embodiment in accordance with FIG. 1b and well-suited for the home entertainment setting, voice input
`device 102 is a portable remote control device with an
`integrated microphone, and the voice data is transmitted
`from device 102 preferably via infrared (or other wireless)
`link to the local speech processor. The local speech proces-
`sor is coupled to communications network 106, and also
`preferably to client display device 112 (especially for pur-
`poses of query refinement transmissions, as discussed below
`in connection with FIG. 4, step 412), and preferably may be
`integrated within or coupled to communications box 104. In
`addition, especially for purposes of a home entertainment
`application, display device 112 is preferably a television
`monitor or similar audiovisual entertainment device, typi-
`cally in stationary position for comfortable viewing by
`users. In addition, in such preferred embodiment, display
`device 112 is coupled to a communications box (which is
`preferably the same as communications box 104, but may
`also be a physically separate unit)
`for receiving and
`decoding/formatting the desired electronic information that
`is received across communications network 106.
`
`Design considerations favoring server-side processing
`and interpretation of spoken input requests, as exemplified
`in FIG. 1a, include minimizing the need to distribute costly
`computational hardware and software to all client users in
`order to perform speech and language processing. Design
`considerations favoring client-side processing, as exempli-
`fied in FIG. 1b, include minimizing the quantity of data sent
`upstream across the network from each client, as the speech
`recognition is performed before transmission across the
`network and only the query data and/or request needs to be
`sent, thus reducing the upstream bandwidth requirements.
`c. Mobile Client Embodiment
`
A mobile computing embodiment of the present invention
`may be implemented by practitioners as a variation on the
`embodiments of either FIG. 1a or FIG. 1b. For example, as
depicted in FIG. 2, a mobile variation in accordance with the server-side processing architecture illustrated in FIG. 1a
`may be implemented by replacing voice input device 102,
`communications box 104, and client display device 112,
`with an integrated, mobile, information appliance 202 such
`as a cellular telephone or wireless personal digital assistant
`(wireless PDA). Mobile information appliance 202 essen-
`tially performs the functions of the replaced components.
`Thus, mobile information appliance 202 receives spoken
`natural language input requests from the user in the form of
`voice data, and transmits that data (preferably via wireless
`data receiving station 204) across communications network
`206 for server-side interpretation of the request, in similar
`fashion as described above in connection with FIG. 1.
`
`Navigation of data source 210 and retrieval of desired
`information likewise proceeds in an analogous manner as
`described above. Display information transmitted electroni-
`cally back to the user across network 206 is displayed for the
`user on the display of information appliance 202, and audio
`information is output through the appliance’s speakers.
`Practitioners will further appreciate, in light of the above
`teachings,
`that
`if mobile information appliance 202 is
`equipped with sufficient computational processing power,
then a mobile variation of the client-side architecture exemplified in FIG. 2 may similarly be implemented. In that case,
`the modules corresponding to request processing logic 300
`would be embodied locally in the computational resources
`of mobile information appliance 202, and the logical flow of
`data would otherwise follow in a manner analogous to that
`previously described in connection with FIG. 1b.
`As illustrated in FIG. 2, multiple users, each having their
`own client input device, may issue requests, simultaneously
`or otherwise, for navigation of data source 210. This is
`equally true (though not explicitly drawn) for the embodi-
`ments depicted in FIGS. 1a and 1b. Data source 210 (or
`100), being a network accessible information resource, has
`typically already been constructed to support access requests
`from simultaneous multiple network users, as known by
`practitioners of ordinary skill in the art. In the case of
`server-side speech processing, as exemplified in FIGS. 1a
`and 2,
`the interpretation logic and error correction logic
`modules are also preferably designed and implemented to
`support queuing and multi-tasking of requests from multiple
`simultaneous network users, as will be appreciated by those
`of skill in the art.
`
`It will be apparent to those skilled in the art that additional
`implementations, permutations and combinations of the
`embodiments set forth in FIGS. 1a, 1b, and 2 may be created
`without straying from the scope and spirit of the present
`invention. For example, practitioners will understand,
`in
`light of the above teachings and design considerations, that
`it is possible to divide and allocate the functional compo-
`nents of request processing logic 300 between client and
`server. For example, speech recognition—in entirety, or
`perhaps just early stages such as feature extraction—might
`be performed locally on the client end, perhaps to reduce
`bandwidth requirements, while natural language parsing and
`other necessary processing might be performed upstream on
`the server end, so that more extensive computational power
`need not be distributed locally to each client. In that case,
`corresponding portions of request processing logic 300, such
`as speech recognition engine 310 or portions thereof, would
`reside locally at
`the client as in FIG. 1b, while other
`component modules would be hosted at the server end as in
`FIGS. 1a and 2.
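
An informal sketch of such a client/server split is shown below; all function names are assumed for illustration, and the choice of which stages run where is a design decision left to the practitioner.

def client_side(voice_samples, extract_features, send_upstream):
    # Early-stage processing (e.g., feature extraction) runs locally, so only
    # compact features rather than raw audio cross the network.
    features = extract_features(voice_samples)
    return send_upstream(features)

def server_side(features, recognizer, parser, query_builder):
    # Remaining recognition, parsing, and query construction run upstream.
    words = recognizer.decode(features)
    interpretation = parser.parse(words)
    return query_builder.build(interpretation)
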
`
Further, practitioners may choose to implement each
`of the various embodiments described above on any number
of different hardware and software computing platforms and environments and various combinations thereof, including,
`by way of just a few examples: a general-purpose hardware
`microprocessor such as the Intel Pentium series; operating
`system software such as Microsoft Windows/CE, Palm OS,
`or Apple Mac OS (particularly for client devices and client-
`side processing), or Unix, Linux, or Windows/NT (the latter
`three particularly for network data servers and server-side
`processing), and/or proprietary information access platforms
`such as Microsoft’s WebTV or the Diva Systems video-on-
`demand system.
`2. Processing Methodology
`The present invention provides a spoken natural language
`interface for interrogation of remote electronic databases
`and retrieval of desired information. A preferred embodi-
`ment of the present invention utilizes the basic methodology
`outlined in the flow diagram of FIG. 4 in order to provide
`this interface. This methodology will now be discussed.
`a. Interpreting Spoken Natural Language Requests
`At step 402, the user’s spoken request for information is
`initially received in the form of raw (acoustic) voice data by
`a suitable input device, as previously discussed in connec-
`tion with FIGS. 1—2. At step 404 the voice data received
`from the user is interpreted in order to understand the user’s
`request for information. Preferably this step includes per-
`forming speech recognition in order to extract words from
`the voice data, and further includes natural language parsing
`of those words in order to generate a structured linguistic
`representation of the user’s request.
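
For illustration only, the sketch below shows the general shape of step 404 and of one possible structured representation of a request; the frame layout and engine interfaces are assumptions rather than details from the specification.

def interpret_request(voice_data, speech_engine, nl_parser):
    # Step 404: extract words from the acoustic data, then parse them.
    words = speech_engine.recognize(voice_data)
    return nl_parser.parse(words)

# One possible structured interpretation of
# "Show me a Western directed by Clint Eastwood":
example_interpretation = {
    "action": "retrieve",
    "object": "movie",
    "constraints": {"genre": "Western", "director": "Clint Eastwood"},
}
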
`Speech recognition in step 404 is performed using speech
`recognition engine 310. A variety of commercial quality,
`speech recognition engines are readily available on the
`market, as practitioners will know. For example, Nuance
`Communications offers a suite of speech recognition
`engines, including Nuance 6, its current flagship product,
`and Nuance Express, a lower cost package for entry-level
`applications. As one other example, IBM offers the ViaVoice
`speech recognition engine,
`including a low-cost shrink-
`wrapped version available through popular consumer distri-
`bution channels. Basically, a speech recognition engine
`processes acoustic voice data and attempts to generate a text
`stream of recognized words.
`Typically, the speech recognition engine is provided with
`a vocabulary lexicon of likely words or phrases that the
`recognition engine can match against its analysis of acous-
`tical signals, for purposes of a given application. Preferably,
`the lexicon is dynamically adjusted to reflect the current user
`context, as established by the preceding user inputs. For
`example, if a user is engaged in a dialogue with the system
`about movie selection, the recognition engine’s vocabulary
`may preferably be adjusted to favor relevant words and
`phrases, such as a stored list of proper names for popular
`movie actors and directors, etc. Whereas if the current
`dialogue involves selection and viewing of a sports event,
`the engine’s vocabulary might preferably be adjusted to
`favor a stored list of proper names for professional sports
`teams, etc.
`In addition, a speech recognition engine is
`provided with language models that help the engine predict
`the most likely interpretation of a given segment of acous-
`tical voice data, in the current context of phonemes or words
`in which the segment appears. In addition, speech recogni-
`tion engines often echo to the user, in more or less real-time,
`a transcription of the engine’s best guess at what the user has
`said, giving the user an opportunity to confirm or reject.
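
A minimal sketch of such context-driven lexicon adjustment, with an assumed word list and weighting scheme, might look like the following:

CONTEXT_LEXICONS = {
    "movies": ["western", "director", "eastwood", "comedy", "actor"],
    "sports": ["touchdown", "quarter", "score", "team", "playoffs"],
}

def adjust_lexicon(base_lexicon: dict, context: str, boost: float = 5.0) -> dict:
    """Return a copy of the lexicon with context-relevant words weighted up."""
    adjusted = dict(base_lexicon)
    for word in CONTEXT_LEXICONS.get(context, []):
        adjusted[word] = adjusted.get(word, 1.0) * boost
    return adjusted
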
`In a further aspect of step 404, natural language inter-
`preter (or parser) 320 linguistically parses and interprets the
`textual output of the speech recognition engine. In a pre-
ferred embodiment of the present invention, the natural-language interpreter attempts to determine both the meaning
`of spoken words (semantic processing) as well as the
`grammar of the statement (syntactic processing), such as the
`Gemini Natural Language Understanding System developed
`by SRI International. The Gemini system is described in
`detail in publications entitled “Gemini: A Natural Language
`System for Spoken-Language Understanding” and