Maes et al.

(10) Patent No.: US 6,801,604 B2
(45) Date of Patent: Oct. 5, 2004

US006801604B2
`
`(54) UNIVERSAL IP-BASED AND SCALABLE
`ARCHITECTURES ACROSS
`CONVERSATIONAL APPLICATIONS USING
`WEB SERVICES FOR SPEECH AND AUDIO
`PROCESSING RESOURCES
`
`(75)
`
`Inventors: Stephane H. Maes, Danbury, CT (US);
`David M. Lubensky, Brookfield, CT
`(US); Andrzej Sakrajda, White Plains,
`NY (US)
`
`(73) Assignee:
`
`International Business Machines
`Corporation, Armonk, NY (US)
`
`( * ) Notice:
`
`Subject to any disclaimer, the term of this
`patent is extended or adjusted under 35
`U.S.C. 154(b) by 0 days.
`
`(21) Appl. No.: 10/183,125
`
(22) Filed: Jun. 25, 2002

(65) Prior Publication Data

US 2003/0088421 A1 May 8, 2003
`
Related U.S. Application Data

(60) Provisional application No. 60/300,755, filed on Jun. 25, 2001.
`
(51) Int. Cl.7 ................................................. H04M 1/64
(52) U.S. Cl. ................................ 379/88.17; 379/88.16; 704/270.1; 709/203; 709/231
(58) Field of Search ........................... 379/88.01–88.04, 379/88.16, 88.17, 88.23–88.25; 704/270–275; 717/114, 116; 709/228–231, 201–203, 249, 250
`
(56) References Cited

U.S. PATENT DOCUMENTS

2002/0184373 A1 * 12/2002 Maes ......................... 709/228
2002/0194388 A1 * 12/2002 Boloker et al. ............ 709/310
2003/0005174 A1 *  1/2003 Coffman et al. ........... 709/318
2003/0088421 A1 *  5/2003 Maes et al. ............. 704/270.1
`
`* cited by examiner
`
`Primary Examiner—Roland Foster
`(74) Attorney, Agent, or Firm—F. Chau & Associates, LLC
`
`(57)
`
`ABSTRACT
`
`Systems and methods for conversational computing and, in
`particular, to systems and methods for building distributed
`conversational applications using a Web services-based
`model wherein speech engines (e.g., speech recognition) and
`audio I/O systems are programmable services that can be
`asynchronously programmed by an application using a
`standard, extensible SERCP (speech engine remote control
`protocol), to thereby provide scalable and flexible IP-based
`architectures that enable deployment of the same application
`or application development environment across a wide range
`of voice processing platforms and networks/gateways (e.g.,
`PSTN (public switched telephone network), Wireless,
`Internet, and VoIP (voice over IP)). Systems and methods are
`further provided for dynamically allocating, assigning, con-
`figuring and controlling speech resources such as speech
`engines, speech pre/post processing systems, audio
`subsystems, and exchanges between speech engines using
`SERCP in a web service-based framework.
`
`22 Claims, 17 Drawing Sheets
`
[Representative drawing: Application (14), Task Manager, Router and Load Manager, Voice Response System (e.g., DT/6000)]
`
`TELESIGN EX1003
`
`Page 1
`
`
`
`
U.S. Patent          Oct. 5, 2004          Sheet 1 of 17          US 6,801,604 B2

[FIG. 1: conversational system with application, task manager, router and load manager, and voice response system]

TELESIGN EX1003
Page 2
`
`
`
`
`
`
U.S. Patent          Oct. 5, 2004          Sheet 2 of 17          US 6,801,604 B2

[FIG. 2: Each server has a unique address:
<profile>.<service>.<instance>.<host:port, listener audio port>
(elements 13a, 21, 25, 26)]

TELESIGN EX1003
Page 3
`
`
`
U.S. Patent          Oct. 5, 2004          Sheet 3 of 17          US 6,801,604 B2

[FIGS. 3a and 3b: application frameworks]

TELESIGN EX1003
Page 4
`
`
`
`
`
`
U.S. Patent          Oct. 5, 2004          Sheet 4 of 17          US 6,801,604 B2

[FIGS. 3c and 3d: application frameworks (multi-modal)]

TELESIGN EX1003
Page 5
`
`
`
`
`
`
`
U.S. Patent          Oct. 5, 2004          Sheet 5 of 17          US 6,801,604 B2

[FIG. 4: speech processing system with conversational browser]

TELESIGN EX1003
Page 6
`
`
`
`
`
`
`
`
U.S. Patent          Oct. 5, 2004          Sheet 6 of 17          US 6,801,604 B2

[FIG. 5: speech processing system]

TELESIGN EX1003
Page 7
`
`
`
`
U.S. Patent          Oct. 5, 2004          Sheet 7 of 17          US 6,801,604 B2

[FIG. 6: speech processing system]

TELESIGN EX1003
Page 8
`
`
`
U.S. Patent          Oct. 5, 2004          Sheet 8 of 17          US 6,801,604 B2

[FIG. 7: Speech Server (ASR, TTS, SPID, NLU); PC with Tel & Aud Control (Dialogic, NMS); Controller; DT; Data]

TELESIGN EX1003
Page 9
`
`
`
U.S. Patent          Oct. 5, 2004          Sheet 9 of 17          US 6,801,604 B2

FIG. 8:
90  Receive Call
91  Determine Call ID
92  Send Application Instance Request
93  Assign Application to Call
94  Load Application Presentation Layer
95  Provide Application with Audio I/O Port
96  Application Sends Request to Accept Call
97  Application Generates Control Message Requesting Audio Processing Services
98  Task Manager Sends Control Message to Router/Load Manager Requesting Services
99  Router/Load Manager Allocates/Assigns Appropriate Resources (speech engines)
100 Task Manager Transmits Control Messages to Assigned Speech Engine(s) to Program Engine(s) for Processing Incoming Call
101 Task Manager Receives Processing Results and Sends Results to Application

TELESIGN EX1003
Page 10
`
`
`
U.S. Patent          Oct. 5, 2004          Sheet 10 of 17          US 6,801,604 B2

[FIG. 9: speech processing system (application, task manager, speech engines, audio I/O)]

TELESIGN EX1003
Page 11
`
`
`
`
U.S. Patent          Oct. 5, 2004          Sheet 11 of 17          US 6,801,604 B2

[FIG. 10: Each server has a unique address; Audio Bus (52); elements 115, 130]

TELESIGN EX1003
Page 12
`
`
`
U.S. Patent          Oct. 5, 2004          Sheet 12 of 17          US 6,801,604 B2

[FIG. 11: speech processing system]

TELESIGN EX1003
Page 13
`
`
`
`
`
`
`
U.S. Patent          Oct. 5, 2004          Sheet 13 of 17          US 6,801,604 B2

FIG. 12: Application Environment / Task Manager (132) connected to Conversational Engines (131) via SERCP, an unspecified additional protocol (optional), TEL control (130), and RT-DSR exchanges (RTP stream, RTCP, call control)

FIG. 17: Conventional Web Services (150) with Web service compliant interface and behavior (e.g., WSDL and SOAP, etc.); Web Service with simplified or optimized API or protocol (151); Simplified Protocol (XML RPC, RPC, limited API, simpler messaging, etc.) (152)

TELESIGN EX1003
Page 14
`
`
`
U.S. Patent          Oct. 5, 2004          Sheet 14 of 17          US 6,801,604 B2

[FIG. 13: DSR system (codecs, protocols, network and session control, speech engines)]

TELESIGN EX1003
Page 15
`
`
`
`
`
U.S. Patent          Oct. 5, 2004          Sheet 15 of 17          US 6,801,604 B2

FIG. 14: Consumer (Client Application, Task Manager, Load Manager) (140) sends XML Service Requests (SERCP) to a Telephony/Audio I/O Service (141) and a Speech Engine Service (142), each comprising a Business Facade and Business Logic (146)

TELESIGN EX1003
Page 16
`
`
`
U.S. Patent          Oct. 5, 2004          Sheet 16 of 17          US 6,801,604 B2

FIG. 15: client/server DSR protocol exchange: establish connection; upstream codec negotiation; downstream codec negotiation; RTP + RTCP + RSVP + SERCP streams; GSM (payload type) framing; barge-in detection at frame xx; request for new upstream codec after frame xxx; upstream codec negotiation; etc.

TELESIGN EX1003
Page 17
`
`
`
U.S. Patent          Oct. 5, 2004          Sheet 17 of 17          US 6,801,604 B2

FIG. 16: SERCP exchanges over an established DSR connection with negotiated codecs: engine capability determinations; engine reservation (service combination); remote control commands including parameters and data file settings & associated CDP & frame; results, events, resulting downstream RTP; etc.

TELESIGN EX1003
Page 18
`
`
`US 6,801,604 B2
`
`1
`UNIVERSAL IP-BASED AND SCALABLE
`ARCHITECTURES ACROSS
`CONVERSATIONAL APPLICATIONS USING
`WEB SERVICES FOR SPEECH AND AUDIO
`PROCESSING RESOURCES
`
`CROSS REFERENCE TO RELATED
`APPLICATION
`
This application claims priority to U.S. Provisional Application Ser. No. 60/300,755, filed on Jun. 25, 2001, which is
`incorporated herein by reference.
`
`TECHNICAL FIELD
`
`The present invention relates generally to systems and
`methods for conversational computing and, in particular, to
`systems and methods for building distributed conversational
`applications using a Web services-based model wherein
`speech engines (e.g., speech recognition) and audio I/O
`systems are implemented as programmable services that can
`be asynchronously programmed by an application using a
`standard, extensible SERCP (speech engine remote control
`protocol), to thereby provide scalable and flexible IP-based
`architectures that enable deployment of the same application
`or application development environment across a wide range
`of voice processing platforms and networks/gateways (e.g.,
`PSTN (public switched telephone network), Wireless,
`Internet, and VoIP (voice over IP)). The invention is further
`directed to systems and methods for dynamically allocating,
`assigning, configuring and controlling speech resources such
`as speech engines, speech pre/post processing systems,
`audio subsystems, and exchanges between speech engines
`using SERCP in a web service-based framework.
`BACKGROUND
`
`Telephony generally refers to any telecommunications
`system involving the transmission of speech information in
`either wired or wireless environments. Telephony applica-
`tions include, for example, IP telephony and Interactive
`Voice Response (IVR), and other voice processing plat-
`forms. IP telephony allows voice, data and video collabo-
`ration through existing IP telephony-based networks such as
`LANs, WANs and the Internet as well as IMS (IP multime-
`dia services) over wireless networks. Previously, separate
`networks were required to handle traditional voice, data and
`video traffic, which limited their usefulness. Voice and data
connections were typically not available simultaneously.
`Each required separate transport protocols/mechanisms and
`infrastructures, which made them costly to install, maintain
and reconfigure and unable to interoperate. Currently, various
applications and APIs are commercially available that
enable convergence of PSTN telephony and telephony
`over Internet Protocol networks and 2.5G/3G wireless net-
`works. There is a convergence among fixed, mobile and
`nomadic wireless networks as well as with the Internet and
`
`voice networks, as exemplified by 2.5G, 3G and 4G.
`IVR is a technology that allows a telephone-based user to
`input or receive information remotely to or from a database.
Currently, there is widespread use of IVR services for
telephony access to information and transactions. An IVR
system typically (but not exclusively) uses spoken directed
dialog and generally operates as follows. A user will dial into
an IVR system and then listen to audio prompts that
`provide choices for accessing certain menus and particular
`information. Each choice is either assigned to one number
`on the phone keypad or associated with a word to be uttered
`by the user (in voice enabled IVRs) and the user will make
`
`a desired selection by pushing the appropriate button or
`uttering the proper word.
`By way of example, a typical banking ATM transaction
`allows a customer to perform money transfers between
savings, checking and credit card accounts, and check account
`balances using IVR over the telephone, wherein information
`is presented via audio menus. With the IVR application, a
`menu can be played to the user over the telephone, whereby
`the menu messages are followed by the number or button the
`user should press to select the desired option:
`a. “for instant account information, press one,”
`b. “for transfer and money payment, press two,”
`c. “for fund information, press three,”
`d. “for check information, press four,”
`e. “for stock quotes, press five,”
`f. “for help, press seven,” etc.
`To continue, the user may be prompted to provide iden-
`tification information. Over the telephone, the IVR system
`may playback an audio prompt requesting the user to enter
`his/her account number (via DTMF or speech), and the
`information is received from the user by processing the
`DTMF signaling or recognizing the speech. The user may
`then be prompted to input his/her SSN and the reply is
`processed in a similar way. When the processing is
`complete, the information is sent to a server, wherein the
`account information is accessed, formatted to audio replay,
`and then played back to the user over the telephone.
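The directed-dialog flow described above can be sketched as a simple menu state machine. The sketch below is illustrative only: the menu labels mirror the banking example, but the function names and re-prompt behavior are assumptions, not part of the patent.

```python
# Minimal sketch of a directed-dialog IVR menu driven by DTMF
# (keypad) input. Labels and option numbers follow the banking
# example above; the API itself is hypothetical.

MENU = {
    "1": "instant account information",
    "2": "transfer and money payment",
    "3": "fund information",
    "4": "check information",
    "5": "stock quotes",
    "7": "help",
}

def prompt_text():
    """Render the audio menu as the prompt text the IVR would play."""
    return ", ".join(f"for {label}, press {key}" for key, label in MENU.items())

def handle_selection(dtmf_key):
    """Route a DTMF key press to the selected service, or re-prompt."""
    if dtmf_key in MENU:
        return f"selected: {MENU[dtmf_key]}"
    return "invalid selection, replaying menu"
```

A voice-enabled IVR would follow the same routing logic, with the recognized word standing in for the DTMF key.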
`An IVR system may implement speech recognition in lieu
`of, or in addition to, DTMF keys. Conventional IVR appli-
`cations use specialized telephony hardware and IVR appli-
`cations use different software layers for accessing legacy
`database servers. These layers must be specifically designed
`for each application. Typically, IVR application developers
`offer
`their own proprietary speech engines and APIs
`(application program interface). The dialog development
`requires complex scripting and expert programmers and
`these proprietary applications are typically not portable from
`vendor to vendor (i.e., each application is painstakingly
`crafted and designed for specific business logic). Conven-
`tional IVR applications are typically written in specialized
`script languages that are offered by manufacturers in various
`incarnations and for different hardware platforms. The
`development and maintenance of such IVR applications
`requires qualified staff. Thus, current
`telephony systems
`typically do not provide interoperability, i.e., the ability of
`software and hardware on multiple machines from multiple
`vendors to communicate meaningfully.
`VoiceXML is a markup language that has been designed
`to facilitate the creation of speech applications such as IVR
`applications. Compared to conventional IVR programming
`frameworks that employ proprietary scripts and program-
`ming languages over proprietary/closed platforms,
`the
`VoiceXML standard provides a declarative programming
`framework based on XML (eXtensible Markup Language)
`and ECMAScript (see, e.g., the W3C XML specifications
`(www.w3.org/XML)
`and VoiceXML forum
`(www.voicexml.org)). VoiceXML is designed to run on
`web-like infrastructures of web servers and web application
servers (i.e., the voice browser). VoiceXML allows information
to be accessed by voice through a regular phone or a
mobile phone whenever it is difficult or not optimal to
interact through a wireless GUI micro-browser.
More importantly, VoiceXML is a key component to
building multi-modal systems such as multi-modal and
conversational user interfaces or mobile multi-modal browsers.
Multi-modal solutions exploit the fact that different
`
`TELESIGN EX1003
`
`Page 19
`
`
`
`
`interaction modes are more efficient for different user inter-
`actions. For example, depending on the interaction, talking
`may be easier than typing, whereas reading may be faster
`than listening. Multi-modal interfaces combine the use of
`multiple interaction modes, such as voice, keypad and
`display to improve the user interface to e-business.
`Advantageously, multi-modal browsers can rely on
`VoiceXML browsers and authoring to describe and render
`the voice interface.
`There are still key inhibitors to the deployment of com-
`pelling multi-modal applications. Most arise out of the
`current
`infrastructure and device platforms.
`Indeed,
`the
`current networking infrastructure is not configured for pro-
`viding seamless, multi-modal access to information. Indeed,
`although a plethora of information can be accessed from
`servers over a communications network using an access
`device (e.g., personal information and corporate information
`available on private networks and public information acces-
`sible via a global computer network such as the Internet), the
`availability of such information may be limited by the
`modality of the client/access device or the platform-specific
`software applications with which the user is interacting to
`obtain such information. For instance, current wireless net-
`work infrastructure and handsets do not provide simulta-
`neous voice and data access. Middleware, interfaces and
`protocols are needed to synchronize and manage the differ-
`ent channels. In light of the ubiquity of IP-based networks
such as the Internet, and the availability of a plethora of
`services and resources on the Internet, the advantages of
`open and interoperable telephony systems are particularly
`compelling for voice processing applications such as IP
`telephony systems and IVR.
`Another hurdle is that development of multi-modal/
`conversational applications using current
`technologies
`requires not only knowledge of the goal of the application
`and how the interaction with the users should be defined, but
`a wide variety of other interfaces and modules external to the
`application at hand, such as (i) connection to input and
`output devices (telephone interfaces, microphones, web
`browsers, palm pilot display); (ii) connection to variety of
`engines (speech recognition, natural
`language
`understanding, speech synthesis and possibly language
`generation); (iii) resource and network management; and
`(iv) synchronization between various modalities for multi-
`modal or conversational applications.
`Accordingly, there is strong desire for development of
`distributed conversational systems having scalable and flex-
`ible architectures, which enable implementation of such
`systems over a wide range of application environments and
`voice processing platforms.
`SUMMARY OF THE INVENTION
`
`The present invention relates generally to systems and
`methods for conversational computing and, in particular, to
`systems and methods for building distributed conversational
`applications using a Web services-based model wherein
`speech engines (e.g., speech recognition) and audio I/O
`systems are implemented as programmable services that can
`be asynchronously programmed by an application using a
`standard, extensible SERCP (speech engine remote control
`protocol), to thereby provide scalable and flexible IP-based
`architectures that enable deployment of the same application
`or application development environment across a wide range
`of voice processing platforms and networks/gateways (e.g.,
`PSTN (public switched telephone network), Wireless,
`Internet, and VoIP (voice over IP)).
`The invention is further directed to systems and methods
`for dynamically allocating, assigning, configuring and con-
`
`trolling speech resources such as speech engines, speech
`pre/post processing systems, audio subsystems, and
`exchanges between speech engines using SERCP in a web
`service-based framework.
`
`In one preferred embodiment, a SERCP framework,
`which is used for speech engine remote control and network
`and system load management,
`is implemented using an
`XML-based web service framework wherein speech engines
`and resources comprise programmable services, wherein (i)
`XML is used to represent data (and XML Schemas to
`describe data types); (ii) an extensible messaging format is
`based on SOAP;
`(iii) an extensible service description
`language is based on WSDL, or an extension thereof, as a
`mechanism to describe the commands/interface supported
`by a given service;
`(iv) UDDI
`(Universal Description,
`Discovery, and Integration) is used to advertise and locate
`the service; and wherein (v) WSFL (Web Service Flow
Language) is used to provide a generic mechanism for
combining speech processing services through flow composition.
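As a rough illustration of points (i) and (ii), a SERCP control message in such a framework would be a SOAP envelope carrying XML-represented parameters. The namespace, command name, and parameters below are hypothetical; the embodiment defers to SOAP and WSDL rather than fixing a message schema.

```python
# Sketch of building a SOAP-style SERCP control message with the
# Python standard library. The sercp namespace, the ConfigureEngine
# command, and its parameters are invented for illustration.
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
SERCP_NS = "urn:example:sercp"  # hypothetical namespace

def build_control_message(engine, command, params):
    ET.register_namespace("soap", SOAP_NS)
    env = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(env, f"{{{SOAP_NS}}}Body")
    cmd = ET.SubElement(body, f"{{{SERCP_NS}}}{command}")
    cmd.set("engine", engine)
    for name, value in params.items():
        # Each parameter becomes an XML-typed child element.
        ET.SubElement(cmd, f"{{{SERCP_NS}}}{name}").text = str(value)
    return ET.tostring(env, encoding="unicode")

msg = build_control_message(
    "asr01", "ConfigureEngine", {"grammar": "yesno.grxml", "nbest": 3})
```

In a full deployment the service's WSDL description (point iii) would declare which commands and parameter types a given engine accepts.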
`
`A conversational system according to an embodiment of
`the present invention assumes an application environment in
`which a conversational application comprises a collection of
`audio processing engines (e.g., audio I/O system, speech
`processing engines, etc.) that are dynamically associated
`with an application, wherein the exchange of audio between
`the audio processing engines is decoupled from control and
`application level exchanges and wherein the application
`generates control messages that configure and control the
`audio processing engines in a manner that renders the
`exchange of control messages independent of the application
`model and location of the engines. The speech processing
`engines can be dynamically allocated to the application on
`either a call, session, utterance or persistent basis.
`Preferably, the audio processing engines comprise web
`services that are described and accessed using WSDL (Web
`Services Description Language), or an extension thereof.
`In yet another aspect, a conversational system comprises
a task manager, which is used to abstract, from the
application, the discovery of the audio processing engines
`and remote control of the engines.
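A minimal sketch of this task-manager role follows, with engine discovery and remote control hidden behind a single interface. The class and method names, and the dictionary-based registry standing in for UDDI-style discovery, are assumptions for illustration, not the patent's API.

```python
# Sketch of a task manager that hides engine discovery and remote
# control from the application. The registry lookup and the control
# message transport are stubbed; names are illustrative only.

class TaskManager:
    def __init__(self, registry):
        # registry maps a service type (e.g. "asr") to engine addresses,
        # standing in for UDDI-style service discovery.
        self.registry = registry
        self.sent = []  # record of issued control messages, for inspection

    def discover(self, service_type):
        engines = self.registry.get(service_type, [])
        if not engines:
            raise LookupError(f"no engine advertises {service_type!r}")
        return engines[0]

    def control(self, service_type, command, **params):
        """Application-facing call: the app never sees engine addresses."""
        engine = self.discover(service_type)
        message = {"engine": engine, "command": command, "params": params}
        self.sent.append(message)  # a real system would transmit via SERCP
        return message

tm = TaskManager({"asr": ["asr01.example:8000"]})
result = tm.control("asr", "StartRecognition", grammar="digits")
```

The application addresses engines only by service type; where the engines run, and how control messages reach them, stays inside the task manager.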
`The systems and methods described herein may be used
`in various frameworks. One framework comprises a
`terminal-based application (located on the client or local to
`the audio subsystem) that remotely controls speech engine
`resources. One example of a terminal based application
`includes a wireless handset-based application that uses
remote speech engines, e.g., a multimodal application in “fat
`client configuration” with a voice browser embedded on the
`client that uses remote speech engines. Another example of
`a terminal-based application comprises a voice application
`that operates on a client having local embedded engines that
`are used for some speech processing tasks, and wherein the
`voice application uses remote speech engines when (i) the
`task is too complex for the local engine, (ii) the task requires
a specialized engine, (iii) it would not be possible to
download speech data files (grammars, etc.) without
`introducing significant delays, or (iv) when for IP, security
`or privacy reasons, it would not be appropriate to download
`such data files on the client or to perform the processing on
`the client or to send results from the client.
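The local-versus-remote decision sketched in conditions (i) through (iv) above might look like the following. The task fields and the numeric thresholds are hypothetical; the patent names the conditions but does not prescribe how they are tested.

```python
# Sketch of the local-vs-remote engine decision for a terminal-based
# application, following conditions (i)-(iv) above. Field names and
# thresholds are invented for illustration.

def use_remote_engine(task):
    # (i) the task is too complex for the local embedded engine
    if task.get("complexity", 0) > task.get("local_capacity", 1):
        return True
    # (ii) the task requires a specialized engine not present locally
    if task.get("needs_specialized") and not task.get("local_specialized"):
        return True
    # (iii) speech data files (grammars, etc.) are too large to download
    # without introducing significant delays
    if task.get("datafile_kb", 0) > 512:  # hypothetical threshold
        return True
    # (iv) IP, security, or privacy reasons forbid downloading the data
    # files or processing (or sending results) on the client
    if task.get("restricted_data"):
        return True
    return False
```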
`
`Another usage framework for the invention is to enable an
`application located in a network to remotely control different
`speech engines located in the network. For example, the
`invention may be used to (i) distribute the processing and
`perform load balancing, (ii) allow the use of engines opti-
`
`TELESIGN EX1003
`
Page 20
`
`
`
`
`mized for specific tasks, and/or to (iii) enable access and
`control of third party services specialized in providing
`speech engine capabilities.
`These and other aspects, features, and advantages of the
`present invention will become apparent from the following
`detailed description of the preferred embodiments, which is
`to be read in connection with the accompanying drawings.
`
`DESCRIPTION OF THE DRAWINGS
`
`FIG. 1 is a block diagram of a speech processing system
`according to an embodiment of the present invention.
`FIG. 2 is a block diagram of a speech processing system
`according to an embodiment of the invention.
`FIGS. 3a—3d are diagrams illustrating application frame-
`works that can be implemented in a speech processing
`system according to the invention.
`FIG. 4 is a block diagram of a speech processing system
`according to an embodiment of the invention, which uses a
`conversational browser.
`
`FIG. 5 is a block diagram of a speech processing system
`according to an embodiment of the invention.
`FIG. 6 is a block diagram of a speech processing system
`according to an embodiment of the invention.
`FIG. 7 is a block diagram of a speech processing system
`according to an embodiment of the invention.
`FIG. 8 is a flow diagram of a method for processing a call
`according to one aspect of the invention.
`FIG. 9 is a block diagram of a speech processing system
`according to an embodiment of the invention.
`FIG. 10 is a block diagram of a speech processing system
`according to an embodiment of the invention.
`FIG. 11 is a block diagram of a speech processing system
`according to an embodiment of the invention.
`FIG. 12 is a block diagram of a speech processing system
`according to an embodiment of the invention.
`FIG. 13 is a block diagram illustrating a DSR system that
`may be implemented in a speech processing system accord-
`ing to an embodiment of the invention.
`FIG. 14 is a block diagram of a web service system
`according to an embodiment of the invention.
`FIG. 15 is a diagram illustrating client/server communi-
`cation using a DSR protocol stack according to an embodi-
`ment of the present invention.
`FIG. 16 is a diagram illustrating client/server communi-
`cation of SERCP (speech engine remote control protocol)
`data exchanges according to an embodiment of the present
`invention.
`
`FIG. 17 is a block diagram of a web service system
`according to another embodiment of the invention.
`DETAILED DESCRIPTION OF PREFERRED
`EMBODIMENTS
`
The present invention is directed to systems and methods
`for implementing universal IP-based and scalable conver-
`sational applications and platforms that are interoperable
`across a plurality of conversational applications, program-
`ming or execution models and systems. The term “conver-
`sational” and “conversational computing” as used herein
`refers to seamless, multi-modal
`(or voice only) dialog
`(information exchanges) between user and machine and
`between devices or platforms of varying modalities (I/O
`capabilities), regardless of the I/O capabilities of the access
`device/channel, preferably, using open, interoperable com-
`
`munication protocols and standards, as well as a conversa-
`tional programming model (e.g., conversational gesture-
`based markup language) that separates the application data
`content (tier 3) and business logic (tier 2) from the user
`interaction and data model that the user manipulates. Con-
`versational computing enables humans and machines to
`carry on a dialog as natural as human-to-human dialog.
`Further, the term “conversational application” refers to an
`application that supports multi-modal, free flow interactions
`(e.g., mixed initiative dialogs) within the application and
`across independently developed applications, preferably
`using short term and long term context (including previous
`input and output) to disambiguate and understand the user’s
`intention. Preferably, conversational applications utilize
`NLU (natural language understanding). Multi-modal inter-
`active dialog comprises modalities such as speech (e.g.,
`authored in VoiceXML), visual
`(GUI)
`(e.g., HTML
`(hypertext markup language)), constrained GUI (e.g., WML
`(wireless markup language), CHTML (compact HTML),
`HDML (handheld device markup language)), and a combi-
`nation of such modalities (e.g., speech and GUI). Further,
`the invention supports voice only (mono-modal) machine
`driven dialogs and any level of dialog capability in between
`voice only and free flow multimodal capabilities. As
`explained below, the invention provides a universal archi-
`tecture that can handle all these types of capabilities and
`deployments.
`Conversational applications and platforms according to
`the present invention preferably comprise a scalable and
`flexible framework that enables deployment of various types
`of applications and application development environment to
`provide voice access using various voice processing plat-
`forms such as telephony cards and IVR systems over
`networks/mechanisms such as PSTN, wireless, Internet, and
`VoIP networks/gateways, etc. A conversational system
`according to the invention is preferably implemented in a
`distributed, multi-tier client/server environment, which
`decouples the conversational applications from distributed
`speech engines and the telephony/audio I/O components. A
`conversational platform according to the invention is pref-
`erably interoperable with the existing WEB infrastructure to
`enable delivery of voice applications over telephony, for
`example, taking advantage of the ubiquity of applications
`and resources available over the Internet. For example,
`preferred telephony applications and systems according to
`the invention enable business enterprises and service pro-
`viders to give callers access to their business applications
`and data, anytime, anyplace using any telephone or voice
`access device.
`
`Referring now to FIG. 1, a block diagram illustrates a
`conversational system 10 according to an embodiment of the
`invention. The system 10 comprises a client voice response
`system 11 that executes on a host machine, which is based,
for example, on an AIX, UNIX, or DOS/Windows operating
`system platform. The client application 11 provides the
`connectivity to the telephone line (analog or digital), other
voice networks (such as IMS, VoIP, etc.), wherein the appli-
`cation 11 may be considered as a gateway to the network (or
`a media processing entity), and other voice processing
`services (as explained below). Incoming calls/connections
`are answered by an appropriate client application running on
`the host machine. More specifically, the host machine can be
`connected to a PSTN, VoIP network, wireless network, etc.,
`and accessible by a user over an analog telephone line or an
ISDN (Integrated Services Digital Network) line, for
example. In addition, the host client machine 11 can be
`connected to a PBX (private branch exchange) system,
`
`TELESIGN EX1003
`
`Page 21
`
`
`
`
`central office or automatic call distribution center, a VoIP
`gateway, a wireless support node gateway, etc. The host
comprises the appropriate software and APIs that allow the
client application 11 to interface to various telephone systems
and video phone systems, such as PSTN, digital ISDN
and PBX access, and VoIP gateways, and to the voice services on the
servers. The system 10 is preferably operable in various
`connectivity environments including, for example, T1, E1,
`ISDN, CAS, SS7, VoIP, wireless, etc.
`The voice response system 11 (or gateway) comprises
`client enabling code that operates with one or more appli-
`cation servers and conversational engine servers over an IP
`(Internet Protocol)-based network 13. The IP network 13
may comprise a LAN, WAN, or a global communication
`network such as the Internet or wireless network (IMS). In
one exemplary embodiment, the host machine 11 comprises
`an IBM RS/6000 computer that comprises Direct Talk
`(DT/6000)®, a commercially available platform for voice
`processing applications. Direct Talk® is a versatile voice
`processing platform that provides expanded functionality to
`IVR applications. DirectTalk enables the development and
`operation of automated customer service solutions for vari-
`ous enterprises and service providers. Clients, customers,
`employees and other users can interact directly with busi-
`ness applications using telephones connected via public or
`private networks. DirectTalk supports scalable solutions
`from various telephony channels operating in customer
`premises or within telecommunication networks. It is to be
`understood, however, that the voice response system 11 may
`comprise any application that is accessible via telephone to
`provide telephone access to one or more applications and
`databases, provide interactive dialog with voice response,
`and data input via DTMF (dual tone multi-frequency). It is
`to be appreciated that other gateways and media processing
`entities can be considered.
`
`The system 10 further comprises one or more application
`servers (or web servers) and speech servers that are distrib-
`uted over the network 13. The system 10 comprises one or
`more conversational applications 14 and associate