Lecture Notes in Computer Science

Harry Bunt   Robbert-Jan Beun
Tijn Borghuis (Eds.)

Multimodal
Human-Computer
Communication

Systems, Techniques, and Experiments
`
`
`
`Lecture Notes in Artificial Intelligence
`
`1374
`
`Subseries of Lecture Notes in Computer Science
`Edited by J. G. Carbonell and J. Siekmann
`
`Lecture Notes in Computer Science
`Edited by G. Goos, J. Hartmanis and J. van Leeuwen
`
`
`
`
`
`
`Springer
`Berlin
`Heidelberg
`New York
`Barcelona
`Budapest
`Hong Kong
`London
`Milan
`Paris
`Santa Clara
`Singapore
`Tokyo
`
`
`
`
`Harry Bunt Robbert-Jan Beun
`Tijn Borghuis (Eds.)
`
`Multimodal
`Human-Computer
`Communication
`
`Systems, Techniques,
`and Experiments
`
`
`
`
`
`Volume Editors
`
`Harry Bunt
`
`Tilburg University
Warandelaan 2, 5000 LE Tilburg, The Netherlands
E-mail: bunt@kub.nl

Robbert-Jan Beun
Center for Research on User-System Interaction (IPO)
P.O. Box 513, 5600 MB Eindhoven, The Netherlands
E-mail: rjbeun@ipo.tue.nl

Tijn Borghuis
Eindhoven University of Technology
P.O. Box 513, 5600 MB Eindhoven, The Netherlands
E-mail: tijn@win.tue.nl
`
Cataloging-in-Publication Data applied for

Die Deutsche Bibliothek - CIP-Einheitsaufnahme

Multimodal human computer communication : systems, techniques, and experiments / Harry Bunt ... (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Budapest ; Hong Kong ; London ; Milan ; Paris ; Santa Clara ; Singapore ; Tokyo : Springer, 1998
(Lecture notes in computer science ; Vol. 1374 : Lecture notes in artificial intelligence)
ISBN 3-540-64380-X
`
`CR Subject Classification (1991): 1.2, H.5.1-2, 1.3.6, D.2.2, K.4.2
`
`ISSN 0302-9743
ISBN 3-540-64380-X Springer-Verlag Berlin Heidelberg New York
`
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law.

© Springer-Verlag Berlin Heidelberg 1998
Printed in Germany

Typesetting: Camera-ready by author

SPIN 10631926 06/9142 - 4 5 4 3 2 1 0 - Printed on acid-free paper
`
`
`
`
` Preface
`
`
`
This volume contains revised versions of seventeen selected papers from the First International Conference on Cooperative Multimodal Communication (CMC/95), held in Eindhoven, The Netherlands, in May 1995. This was the first conference in a series, of which the second was held in Tilburg, The Netherlands, in January 1998. Three of these papers were presented by invited speakers: those by Mark Maybury, Bonnie Webber, and Kent Wittenburg. From the submitted papers that were accepted by the CMC/95 program committee, thirteen were selected for publication in this volume, after revision.
We thank the program committee for their excellent and timely feedback to authors of submitted papers, and at a later stage for advising on the contents of this volume and for providing additional suggestions for improving the selected contributions. The program committee consisted of Norman Badler, Harry Bunt, Jeroen Groenendijk, Walther von Hahn, Dieter Huber, Hans Kamp, John Lee, Joseph Mariani, Mark Maybury, Paul Mc Kevitt, Rob Nederpelt, Kees van Overveld, Ray Perrault, Donia Scott, Wolfgang Wahlster, Bonnie Webber, and Kent Wittenburg. We thank the Royal Dutch Academy of Sciences (KNAW) and the Organization for Cooperation among Universities in Brabant (SOBU) for their grants that made the conference possible.
`
`Harry Bunt
`Robbert-Jan Beun
`Tijn Borghuis
`
`
`
`
Table of Contents

Issues in Multimodal Human-Computer Communication ....................... 1
Harry Bunt

Towards Cooperative Multimedia Interaction ............................. 13
Mark Maybury

Multimodal Cooperation with the DenK System ............................ 39
Harry Bunt, René Ahn, Robbert-Jan Beun, Tijn Borghuis and
Kees van Overveld

Synthesizing Cooperative Conversation .................................. 68
Catherine Pelachaud, Justine Cassell, Norman Badler, Mark Steedman,
Scott Prevost and Matthew Stone

Instructing Animated Agents: Viewing Language in Behavioral Terms ...... 89
Bonnie Webber

Part II: Techniques

Modeling and Processing of Oral and Tactile Activities in the
Georal Tactile System ................................................. 101
Jacques Siroux, Marc Guyomard, Franck Multon and
Christophe Remondeau

Multimodal Maps: An Agent-Based Approach .............................. 111
Adam Cheyer and Luc Julia

Using Cooperative Agents to Plan Multimodal Presentations ............. 122
Yi Han and Ingrid Zukerman

Developing Multimodal Interfaces: A Theoretical Framework
and Guided Propagation Networks ....................................... 158
Jean-Claude Martin, Remko Veldman and Dominique Béroule

A Multimedia Interface for Circuit Board Assembly
Fergal McCaffery, Michael McTear and Maureen Murphy

Visual Language Parsing: If I Had a Hammer...
Kent Wittenburg

Anaphora in Multimodal Discourse
John Lee and Keith Stenning

Part III: Experiments

Speakers' Responses to Requests for Repetition in a
Multimedia Language Processing Environment
Laurel Fais, Kyung-ho Loken-Kim and Young-Duk Park

Object Reference in Task-Oriented Keyboard Dialogues
Anita Cremers

Referent Identification Requests in Multi-Modal Dialogs
Tsuneaki Kato and Yukiko I. Nakano

Studies into Full Integration of Language and Action
Carla Huls and Edwin Bos

The Role of Multimodal Communication in Cooperation:
The Cases of Air Traffic Control
Marie-Christine Bressolle, Bruno Pavard and Marcel Leroux

Author Index
`
`
`
`
`Multimodal Maps: An Agent-Based Approach
`
Adam Cheyer and Luc Julia
`
`SRI International
`333 Ravenswood Ave
`Menlo Park, CA 94025 - USA
`
Abstract. In this paper, we discuss how multiple input modalities may be combined to produce more natural user interfaces. To illustrate this technique, we present a prototype map-based application for a travel planning domain. The application is distinguished by a synergistic combination of handwriting, gesture and speech modalities; access to existing data sources including the World Wide Web; and a mobile handheld interface. To implement the described application, a hierarchical distributed network of heterogeneous software agents was augmented by appropriate functionality for developing synergistic multimodal applications.
`
1 Introduction
`
As computer systems become more powerful and complex, efforts to make computer interfaces simpler and more natural become increasingly important. Natural interfaces should be designed to facilitate communication in ways people are already accustomed to using. Such interfaces allow users to concentrate on the tasks they are trying to accomplish, rather than worry about what they must do to control the interface.
In this paper, we begin by discussing what input modalities humans are comfortable using when interacting with computers, and how these modalities are best combined in order to produce natural interfaces. In Sect. 3, we present a prototype map-based application for the travel planning domain which uses a synergistic combination of several input modalities. Section 4 describes the agent-based approach we used to implement the application and the work on which it is based. In Sect. 5, we summarize our conclusions and future directions.
`
`2 Natural Input
`
2.1 Input Modalities
`
`
`
`
dimension direct manipulation interfaces. Gestures allow users to communicate a surprisingly wide range of meaningful requests with a few simple strokes. Research has shown that multiple gestures can be combined to form a dialog, with rules of temporal grouping overriding temporal sequencing (Rhyne, 1987). Gestural commands are particularly applicable to graphical or editing-type tasks.
Direct manipulation interactions possess many desirable qualities: communication is generally fast and concise; input techniques are easy to learn and remember; and the user has a good idea of what can be accomplished, as the visual presentation of the available actions is generally easily accessible. However, direct manipulation suffers from limitations when trying to access or describe entities which are not or cannot be visualized by the user.
The limitations of direct manipulation style interfaces can be addressed by another interface technology: natural language interfaces. Natural language interfaces excel in describing entities that are not currently displayed on the monitor, in specifying temporal relations between entities or actions, and in identifying members of sets. These strengths are exactly the weaknesses of direct manipulation interfaces, and conversely, the weaknesses of natural language interfaces (ambiguity, conceptual coverage, etc.) can be overcome by the strengths of direct manipulation.
Natural language content can be entered through different input modalities, including typing, handwriting, and speech. It is important to note that, while the same textual content can be provided by the three modalities, each modality has widely varying properties.
`
— Spoken language is the modality used first and foremost in human-human interactive problem solving (Cohen et al., 1990). Speech is an extremely fast medium, several times faster than typing or handwriting. In addition, speech input contains content that is not present in other forms of natural language input, such as prosody, tone, and characteristics of the speaker (age, sex, accent).
— Typing is the most common way of entering information into a computer, because it is reasonably fast, very accurate, and requires no computational resources.
— Handwriting has been shown to be useful for certain types of tasks, such as performing numerical calculations and manipulating names which are difficult to pronounce (Oviatt, 1994; Oviatt and Olsen, 1994). Because of its relatively slow production rate, handwriting may induce users to produce different types of input than is generated by spoken language; abbreviations, symbols and non-grammatical patterns may be expected to be more prevalent amid written input.
`
`
`2.2 Combination of Modalities
`
As noted in the previous section, direct manipulation and natural language seem to be very complementary modalities. It is therefore not surprising that a number of multimodal systems combine the two.
Notable among such systems is Cohen's Shoptalk system (Cohen, 1992), a prototype manufacturing and decision-support system that aids in tasks such as quality assurance monitoring and production scheduling. The natural language module of Shoptalk is based on the Chat-80 natural language system (Warren and Pereira, 1982) and is particularly good at handling time, tense, and temporal reasoning.
A number of systems have focused on combining the speed of speech with the reference provided by direct manipulation of a mouse pointer. Such systems include the XTRA system (Allegayer et al., 1989), CUBRICON (Neal and Shapiro, 1991), the PAC-Amodeus model (Nigay and Coutaz, 1993), and TAPAGE (Faure and Julia, 1994).
XTRA and CUBRICON are both systems that combine complex spoken input with mouse clicks, using several knowledge sources for reference identification. CUBRICON's domain is a map-based task, making it similar to the application developed in this paper. However, the two are different in that CUBRICON can only use direct manipulation to indicate a specific item, whereas our system produces a richer mixing of modalities by adding both gestural and written language as input modalities.
The PAC-Amodeus systems such as VoicePaint and Notebook allow the user to synergistically combine vocal or mouse-click commands when interacting with notes or graphical objects. However, due to the selected domains, the natural language input is very simple, generally of the style "Insert a note here".
TAPAGE is another system that allows true synergistic combination of spoken input with direct manipulation. Like PAC-Amodeus, TAPAGE's domain provides only simple linguistic input. However, TAPAGE uses a pen-based interface instead of a mouse, allowing gestural commands. TAPAGE, selected as a building block for our map application, will be described in more detail in Sect. 4.1.
Other interesting work regarding the simultaneous combination of hand gestures and gaze can be found in Bolt (1980) and Koons, Sparrell and Thorisson (1993).
`
3 A Multimodal Map Application
`
In this section, we will describe a prototype map-based application for a travel planning domain. In order to provide the most natural user interface possible, the application had to meet several requirements:
`
`
`
`
`
`Fig. 1. Multimodal application for travel planning
`
— The user interface must be light and fast enough to run on a handheld PDA, while able to access applications and data that may require a more powerful machine.
— Existing commercial or research natural language and speech recognition systems should be used.
— Through the multimodal interface, a user must be able to transparently access a wide variety of data sources, including information stored in HTML form on the World Wide Web.
`
As illustrated in Fig. 1, the user is presented with a pen-sensitive map display on which drawn gestures and written natural language statements may be combined with spoken input. As opposed to a static paper map, the location, resolution, and content presented by the map change according to the requests of the user. Objects of interest, such as restaurants, movie theaters, hotels, tourist sites, municipal buildings, etc. are displayed as icons. The user may ask the map to perform various actions. For example:

— distance calculation: e.g. "How far is the hotel from Fisherman's Wharf?"
— object location: e.g. "Where is the nearest post office?"
— filtering: e.g. "Display the French restaurants within 1 mile of this hotel."
— information retrieval: e.g. "Show me all available information about Alcatraz."
`
`The application also makes use of multimodal (multimedia) output as well
`as input; video, text, sound and voice can all be combined when presenting an
`answer to a query.
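By way of illustration only, the request types above might be represented internally as simple action frames once recognized; the following Python sketch is ours, with invented frame names and fields standing in for whatever representation the system actually uses.

    # Hypothetical action frames for the four example request types.
    # These names are illustrative, not taken from the system described here.
    requests = [
        {"action": "distance", "from": "hotel", "to": "Fisherman's Wharf"},
        {"action": "locate", "object": "post office", "constraint": "nearest"},
        {"action": "filter", "type": "restaurant", "cuisine": "French",
         "within_miles": 1, "of": "this hotel"},
        {"action": "info", "object": "Alcatraz"},
    ]

    def describe(frame):
        """Render a frame back into a short English gloss (for logging)."""
        return f"{frame['action']}: " + ", ".join(
            f"{k}={v}" for k, v in frame.items() if k != "action")

    for r in requests:
        print(describe(r))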
`
`
`
`
`
During input, requests can be entered using gestures (see Fig. 2 for sample gestures), handwriting, voice, or a combination of pen and voice. For instance, in order to calculate the distance between two points on the map, a command may be issued in any of the following ways:

— gesture, by simply drawing a line between the two points of interest.
— voice, by speaking "What is the distance from the post office to the hotel?"
— handwriting, by writing "dist. p.o. to hotel?"
— synergistic combination of pen and voice, by speaking "What is the distance from here to this hotel?" while simultaneously indicating the specified locations by pointing or circling.

Notice that in our example of synergistic combination of pen and voice, the arguments to the verb "distance" can be specified before, at the same time, or shortly after the vocalization of the request to calculate the distance. If a user's request is ambiguous or underspecified, the system will wait several seconds and then issue a prompt requesting additional information.
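To make this timing behavior concrete, here is a minimal sketch (our own simplification, not the system's code; the window lengths and names are invented) that pairs a spoken request with pointing gestures whose timestamps fall within a short window around the utterance, and prompts for any slot left unfilled.

    from dataclasses import dataclass

    @dataclass
    class Gesture:
        t: float          # timestamp in seconds
        location: tuple   # (x, y) map coordinates

    # Hypothetical fusion window: deictic gestures may arrive a little
    # before, during, or shortly after the spoken request.
    WINDOW_BEFORE = 3.0
    WINDOW_AFTER = 2.0

    def resolve_deictics(speech_start, speech_end, slots, gestures):
        """Fill each deictic slot with a gesture close in time to the speech."""
        usable = [g for g in gestures
                  if speech_start - WINDOW_BEFORE <= g.t <= speech_end + WINDOW_AFTER]
        filled = {}
        for slot in slots:                      # e.g. ["here", "this hotel"]
            if usable:
                filled[slot] = usable.pop(0).location
            else:                               # underspecified: prompt the user
                filled[slot] = None
                print(f"Please indicate '{slot}' on the map.")
        return filled

    # "What is the distance from here to this hotel?" spoken over 0.0-2.1 s,
    # with two circling gestures made while speaking.
    print(resolve_deictics(0.0, 2.1, ["here", "this hotel"],
                           [Gesture(0.4, (120, 85)), Gesture(1.8, (300, 40))]))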
The user interface runs on pen-equipped PCs or a Dauphin handheld PDA (Dauphin, DTR-1 User's Manual), using either a microphone or a telephone for voice input. The interface is connected either by modem or Ethernet to a server machine which manages database access, natural language processing and speech recognition for the application. The result is a mobile system that provides a synergistic pen/voice interface to remote databases.
In general, the speed of the system is quite acceptable. For gestural commands, which are handled locally on the user interface machine, a response is produced in less than one second. For handwritten commands, the time to recognize the handwriting, process the English query, access a database and begin to display the results on the user interface is less than three seconds (assuming an Ethernet connection, and good network and database response). Solutions to verbal commands are displayed three to five seconds after the end of speech has been detected; partial feedback indicating the current status of the speech recognition is provided earlier.
`
Fig. 2. Sample gestures: select; move, scroll, select; zoom in; distance
`
`
`
`
Our approach was to augment a proven agent-based architecture with functionalities developed for a synergistically multimodal application. The result is a flexible methodology for designing and implementing distributed multimodal applications.
`
`4.1 Building Blocks
Open Agent Architecture. The Open Agent Architecture (OAA) (Cohen et al., 1994) provides a framework for coordinating a society of agents which interact to solve problems for the user. Through the use of agents, the OAA provides distributed access to commercial applications, such as mail systems, calendar programs, databases, etc.
The Open Agent Architecture possesses several properties which make it a good candidate for our needs:

— An Interagent Communication Language (ICL) and Query Protocol have been developed, allowing agents to communicate among themselves. Agents can run on different platforms and be implemented in a variety of programming languages.
— Several natural language systems have been integrated into the OAA which convert English into the Interagent Communication Language. In addition, a speech recognition agent has been developed to provide transparent access to the Corona speech recognition system.
— The agent architecture has been used to provide natural language and agent access to various heterogeneous data and knowledge sources.
— Agent interaction is very fine-grained. The architecture was designed so that a number of agents can work together, when appropriate in parallel, to produce fast responses to queries.
The architecture for the OAA, based loosely on Schwartz's FLiPSIDE system (Schwartz, 1993), uses a hierarchical configuration where client agents connect to a "facilitator" server. Facilitators provide content-based message routing, global data management, and process coordination for their set of connected agents. Facilitators can, in turn, be connected as clients of other facilitators. Each facilitator records the published functionality of its sub-agents, and when queries arrive in Interagent Communication Language form, it is responsible for breaking apart any complex queries and for distributing goals to the appropriate agents. An agent solving a goal may require supporting information, and the agent architecture provides numerous means of requesting data from other agents or from the user.
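A facilitator of this kind can be pictured in a few lines; the following toy sketch is our illustration only, with an invented tuple-based goal format standing in for the ICL. It routes goals by published capability and splits conjunctive queries into sub-goals.

    class Agent:
        def __init__(self, name, capabilities, handler):
            self.name = name
            self.capabilities = set(capabilities)  # goal names this agent solves
            self.handler = handler

    class Facilitator:
        """Content-based router for its set of connected agents."""
        def __init__(self):
            self.agents = []

        def register(self, agent):
            # Record the published functionality of each sub-agent.
            self.agents.append(agent)

        def solve(self, goal):
            # A conjunctive query is broken apart and each sub-goal is
            # distributed to an agent that published a matching capability.
            if goal[0] == "and":
                return [self.solve(sub) for sub in goal[1:]]
            for agent in self.agents:
                if goal[0] in agent.capabilities:
                    return agent.handler(goal)
            raise LookupError(f"no agent for goal {goal[0]!r}")

    fac = Facilitator()
    fac.register(Agent("db", ["coordinates"],
                       lambda g: {"hotel": (300, 40)}.get(g[1])))
    fac.register(Agent("ui", ["display"], lambda g: f"shown: {g[1]}"))
    print(fac.solve(("and", ("coordinates", "hotel"), ("display", "hotel"))))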
Among the assortment of agent architectures, the Open Agent Architecture can be most closely compared to work by the ARPA knowledge sharing community (Genesereth and Singh, 1994). The OAA's query protocol, Interagent Communication Language and Facilitator mechanisms have similar instantiations in the SHADE project, in the form of KQML, KIF and various independent capability matchmakers. Other agent architectures, such as General Magic's Telescript (General Magic, 1995), MASCOS (Park et al., submitted), or the CORBA distributed object approach (Object Management Group, 1991), do not provide as fully developed mechanisms for interagent communication and delegation.
The Open Agent Architecture provides the capability of accessing distributed knowledge sources through natural language and voice, but it lacks integration with a synergistic multimodal interface.
`
TAPAGE. TAPAGE (édition de Tableaux par la Parole et le Geste) is a synergistic pen/voice system for designing and correcting tables.
To capture signals emitted during a user's interaction, TAPAGE integrates a set of modality agents, each responsible for a very specialized kind of signal (Faure and Julia, 1994). The modality agents are connected to an 'interpret agent' which is responsible for combining the inputs across all modalities to form a valid command for the application. The interpret agent receives filtered results from the modality agents, sorts the information into the correct fields, performs type-checking on the arguments, and prompts the user for any missing information, according to the model of the interaction. The interpret agent is also responsible for merging the data streams sent by the modality agents, and for resolving ambiguities among them, based on its knowledge of the application's internal state. Another function of the interpret agent is to produce reflexes: reflexes are actions output at the interface level without involving the functional core of the application.
The TAPAGE system can accept multimodal input, but it is not a distributed system; its functional core is fixed. In TAPAGE, the set of linguistic input is limited to a verb-object-argument format.
`
In the Open Agent Architecture, agents are distributed entities that can run on different machines, and communicate together to solve a task for the user. In TAPAGE, agents are used to provide streams of input to a central interpret process, responsible for merging incoming data. A generalization of these two types of agents could be:
Macro Agents: contain some knowledge and ability to reason about a domain, and can answer or make queries to other macro agents using the Interagent Communication Language.
Micro Agents: are responsible for handling a single input or output data stream, either filtering the signal to or from a hierarchically superior 'interpret' agent.
The network architecture that we used was hierarchical at two resolutions: micro agents are connected to a superior macro agent, and macro agents are connected to a facilitator agent (see Fig. 3).
`
`
`
The remainder of this section describes the agents of our application and the interactions among agents produced by a user's request.
Speech Recognition (SR) Agent: The SR agent provides a mapping from the Interagent Communication Language to the API for the Decipher (Corona) speech recognition system (Cohen et al., 1990), a continuous-speech, speaker-independent recognizer based on Hidden Markov Model technology. This macro agent is also responsible for supervising a child micro agent whose task is to control the speech data stream. The SR agent can provide feedback to an interface agent about the current status and progress of the micro agent (e.g. "listening", "end of speech detected", etc.). This agent is written in C.
Natural Language (NL) Parser Agent: translates English expressions into the Interagent Communication Language (ICL). For a more complete description of the ICL, see Cohen et al. (1994). The NL agent we selected for our application is the simplest of those integrated into the OAA. It is written in Prolog using Definite Clause Grammars, and supports a distributed vocabulary; each agent dynamically adds word definitions as it connects to the network. A current project is underway to integrate the Gemini natural language system (Cohen et al., 1990), a robust bottom-up parser and semantic interpreter specifically designed for use in Spoken Language Understanding projects.
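The distributed-vocabulary idea can be pictured with a toy lexicon: each agent contributes word-to-meaning entries as it connects, and the parser consults the union. The keyword-spotting "grammar" below is a deliberately tiny stand-in of our own, not the Prolog DCG parser itself.

    # Toy distributed vocabulary: each agent adds word definitions
    # when it connects to the network (agent names here are invented).
    lexicon = {}

    def connect_agent(agent_name, vocabulary):
        """Called when an agent joins; its words become parseable."""
        for word, meaning in vocabulary.items():
            lexicon[word] = (agent_name, meaning)

    connect_agent("map_db", {"hotel": ("object", "hotel"),
                             "restaurant": ("object", "restaurant")})
    connect_agent("calculator", {"far": ("relation", "distance")})

    def parse(utterance):
        """Map known words in the utterance to ICL-like terms."""
        return [lexicon[w][1] for w in utterance.lower().split() if w in lexicon]

    print(parse("How far is the restaurant from this hotel"))
    # -> [('relation', 'distance'), ('object', 'restaurant'), ('object', 'hotel')]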
Database Agents: Database agents can reside at local or remote locations and can be grouped hierarchically according to content. Micro agents can be connected to database agents to monitor relevant positions or events in real time. In our travel planning application, database agents provide maps for each city, as well as icons, vocabulary and information about available hotels, restaurants, movies, theaters, municipal buildings and tourist attractions. Three types of databases were used: Prolog databases, X.500 hierarchical databases, and data loaded automatically by scanning HTML pages from the World Wide Web (WWW). In one instance, a local newspaper provides weekly updates to its Mosaic-accessible list of current movie times and reviews, as well as adding several new restaurant reviews to a growing collection; this information is extracted by an HTML-reading database agent and made accessible to the agent architecture. Descriptions and addresses of new restaurants are presented to the user on request, and the user can choose to add them to the permanent database by specifying positional coordinates on the map (e.g. "add this new restaurant here"), information lacking in the WWW database.
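An HTML-reading database agent of this kind can be approximated with a small parser. The sketch below is ours, using Python's standard library and a made-up page layout rather than the newspaper's actual markup; it extracts restaurant names and review links from a listing page.

    from html.parser import HTMLParser

    class RestaurantListParser(HTMLParser):
        """Collects (name, href) pairs from anchors in a listing page."""
        def __init__(self):
            super().__init__()
            self.in_anchor = False
            self.href = None
            self.entries = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.in_anchor = True
                self.href = dict(attrs).get("href")

        def handle_data(self, data):
            if self.in_anchor and data.strip():
                self.entries.append((data.strip(), self.href))

        def handle_endtag(self, tag):
            if tag == "a":
                self.in_anchor = False

    # Invented sample of the kind of page such an agent might scan.
    page = """<ul>
      <li><a href="/reviews/chez-panisse">Chez Panisse</a></li>
      <li><a href="/reviews/zuni-cafe">Zuni Cafe</a></li>
    </ul>"""

    parser = RestaurantListParser()
    parser.feed(page)
    print(parser.entries)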
Reference Resolution Agent: This agent is responsible for merging requests arriving in parallel from different modalities, and for controlling interactions between the user interface agent, database agents and modality agents. In this implementation, the reference resolution agent is domain specific: knowledge is encoded as to what actions must be performed to resolve each possible type of ICL request in its particular domain. For a given ICL logical form, the agent can verify argument types, supply default values, and resolve argument references. Some argument references are descriptive ("How far is it to the hotel on Emerson Street?"); in this case, a domain agent will try to resolve the definite reference by
`
`
sending database agent requests. Other references, particularly when contextual or deictic, are resolved by the user interface agent ("What are the rates for this hotel?"). Once arguments to a query have been resolved, this agent coordinates the actions and calculations necessary to produce the result of the request.
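In miniature, the division of labor might look like this: descriptive references go to a database lookup, deictic ones back to the interface. Everything below is our schematic, with invented data structures, not the agent's actual logic.

    # Hypothetical resolution of arguments in an ICL-like logical form.
    database = {("hotel", "Emerson Street"): (210, 95)}
    deictic_queue = [(300, 40)]           # gestures reported by the UI agent

    def resolve(ref):
        kind = ref[0]
        if kind == "descriptive":
            # e.g. ("descriptive", "hotel", "Emerson Street"): ask a database agent.
            return database.get(ref[1:])
        if kind == "deictic":
            # e.g. ("deictic", "this hotel"): ask the user interface agent.
            return deictic_queue.pop(0) if deictic_queue else None
        raise ValueError(kind)

    request = ("distance",
               ("descriptive", "hotel", "Emerson Street"),
               ("deictic", "this hotel"))
    args = [resolve(r) for r in request[1:]]
    print(args)   # coordinates for both arguments, ready for the calculation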
Interface Agent: This macro agent is responsible for managing what is currently being displayed to the user, and for accepting the user's multimodal input. The Interface Agent also coordinates client modality agents and resolves ambiguities among them: handwriting and gestures are interpreted locally by micro agents and combined with results from the speech recognition agent, running on a remote speech server. The handwriting micro agent interfaces with the Microsoft PenWindows API and accesses a handwriting recognizer by CIC Corporation. The gesture micro agent accesses recognition algorithms developed for TAPAGE.
An important task for the interface agent is to record which objects of each type are currently salient, in order to resolve contextual references such as "the hotel" or "where I was before." Deictic references are resolved by gestural or direct manipulation commands. If no such indication is currently specified, the user interface agent waits long enough to give the user an opportunity to supply the value, and then prompts the user for it.
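A salience record with a deictic fallback could be kept as simply as this; the time-out value and the structure below are invented for illustration, not drawn from the system.

    import time

    class SalienceContext:
        """Tracks the most recently displayed object of each type."""
        def __init__(self, prompt_after=3.0):
            self.latest = {}              # e.g. {"hotel": (300, 40)}
            self.prompt_after = prompt_after

        def shown(self, obj_type, location):
            self.latest[obj_type] = location

        def resolve(self, obj_type, pending_gesture=None):
            # A deictic gesture wins; otherwise fall back to the salient
            # object; otherwise wait briefly, then prompt the user.
            if pending_gesture is not None:
                return pending_gesture
            if obj_type in self.latest:
                return self.latest[obj_type]
            time.sleep(self.prompt_after)  # give the user a chance to gesture
            print(f"Which {obj_type} do you mean? Please indicate it.")
            return None

    ctx = SalienceContext()
    ctx.shown("hotel", (300, 40))
    print(ctx.resolve("hotel"))           # -> (300, 40), i.e. "the hotel"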
`
`
`
Fig. 3. Agent Architecture for Map Application. (Legend: Facilitator Agents; Macro Agents; Modality Agents. NL: Natural Language Agent; SR: Speech Recognition Agent; RR: Reference Resolution Agent; UI: User Interface Agent; WWW: World Wide Web Agent.)
`
`
`
`
The following example illustrates the flow of events among agents for a typical multimodal request:

1. A user speaks: "How far is the restaurant from this hotel?"
2. The speech recognition agent monitors the status and results from its micro agent, sending feedback that is received by the user interface agent. When the string is recognized, a translation is requested.
3. The English request is received by the NL agent and translated into ICL form.
4. The reference resolution agent (RR) receives the ICL distance request containing one definite and one deictic reference and asks for resolution of these references.
5. The interface agent uses contextual structures to find what "the restaurant" refers to, and waits for the user to make a gesture indicating "the hotel", issuing prompts if necessary.
6. When the references have been resolved, the domain agent (RR) sends database requests asking for the coordinates of the items in question. It then calculates the distance according to the scale of the currently displayed map, and requests the user interface to produce output displaying the result of the calculation.
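Chained together, the six steps reduce to something like the following trace. All names, the logical form, and the map scale are our invention, standing in for the ICL messages that actually flow between agents.

    import math

    # Step 3: an invented logical form for the NL agent's translation.
    query = ("distance", ("definite", "restaurant"), ("deictic", "hotel"))

    # Step 5: the interface agent's salience context and the user's gesture.
    salient = {"restaurant": (120, 85)}
    gesture = (300, 40)

    def resolve(ref):
        return salient[ref[1]] if ref[0] == "definite" else gesture

    # Step 6: the RR agent resolves both references, then computes the
    # distance at the current map scale (miles per pixel is invented).
    (x1, y1), (x2, y2) = resolve(query[1]), resolve(query[2])
    MILES_PER_PIXEL = 0.01
    miles = math.dist((x1, y1), (x2, y2)) * MILES_PER_PIXEL
    print(f"The restaurant is {miles:.2f} miles from the hotel.")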
`
`5 Conclusions
`
By augmenting an existing agent-based architecture with concepts necessary for synergistic multimodal input, we were able to rapidly develop a map-based application for a travel planning task. The resulting application has met our initial requirements: a mobile, synergistic pen/voice interface providing good natural language access to heterogeneous distributed knowledge sources. The approach used was general and should provide a framework for developing synergistic multimodal applications for other domains.
The system described here is one of the first that accepts commands made of synergistic combinations of spoken language, handwriting and gestural input. This fusion of modalities can produce more complex interactions than in many systems, and the prototype application will serve as a testbed for acquiring a better understanding of multimodal input.
In the near future, we will continue to verify and extend our approach by building other multimodal applications. We are interested in generalizing the methodology even further; work has already begun on an agent-building tool which will simplify and automate many of the details of developing new agents and domains.
`
`References
`
Allegayer, J., Jansen-Winkeln, R., Reddig, C. and Reithinger, N. (1989) Bidirectional use of knowledge in the multi-modal NL access system XTRA. In Proceedings of IJCAI-89, Detroit, pp. 1492-1497.
Bolt, R. (1980) Put that there: Voice and Gesture at the Graphic Interface. Computer Graphics, 14(3), pp. 262-270.
Cohen, M., Murveit, H., Bernstein, J., Price, P. and Weintraub, M. (1990) The DECIPHER Speech Recognition System. In 1990 IEEE ICASSP, pp. 77-80.
Cohen, P. (1992) The role of natural language in a multimodal interface. In Proceedings of UIST'92, pp. 143-149.
Cohen, P.R., Cheyer, A., Wang, M. and Baeg, S.C. (1994) An Open Agent Architecture. In Proceedings AAAI'94 - SA, Stanford, pp. 1-8.
Dauphin DTR-1 User's Manual, Dauphin Technology, Inc., Lombard, Ill. 60148.
Faure, C. and Julia, L. (1994) An Agent-Based Architecture for a Multimodal Interface. In Proceedings AAAI'94 - IM4S, Stanford, pp. 82-86.
Genesereth, M. and Singh, N.P. (1994) A knowledge sharing approach to software interoperation. Unpublished manuscript, Computer Science Department, Stanford University.
General Magic Inc. (1995) Telescript Product Documentation.
Koons, D.B., Sparrell, C.J. and Thorisson, K.R. (1993) Integrating Simultaneous Input from Speech, Gaze and Hand Gestures. In Intelligent Multimedia Interfaces, Maybury, M.T. (ed.), Menlo Park: AAAI Press/MIT Press.
Maybury, M.T. (ed.) (1993) Intelligent Multimedia Interfaces, Menlo Park: AAAI Press/MIT Press.
Neal, J.G. and Shapiro, S.C. (1991) Intelligent Multi-media Interface Technology. In Intelligent User Interfaces, Sullivan, J.W. and Tyler, S.W. (eds.), Reading: Addison-Wesley Pub. Co., pp. 11-43.
Nigay, L. and Coutaz, J. (1993) A Design Space for Multimodal Systems: Concurrent Processing and Data Fusion. In Proceedings InterCHI'93, Amsterdam, ACM Press.
Object Management Group (1991) The Common Object Request Broker: Architecture and Specification. OMG Document Number 91.12.1.
Oviatt, S. (1994) Toward Empirically-Based Design of Multimodal Dialogue Systems. In Proceedings of AAAI'94 - IM4S, Stanford, pp. 30-36.
Oviatt, S. and Olsen, E. (1994) Integration Themes in Multimodal Human-Computer Interaction. In Proceedings of ICSLP'94, Yokohama, pp. 551-554.
Park, S.K., Choi, J.M., Myeong-Wuk, J., Lee, G.L. and Lim, Y.H. (submitted for publication) MASCOS: A Multi-Agent System as the Computer Secretary.
Rhyne, J. (1987) Dialogue Management for Gestural Interfaces. Computer Graphics.
Schwartz, D.G. (1993) Cooperating heterogeneous systems: A blackboard-based meta approach. Technical Report 93-112, Center for Automation and Intelligent Systems Research, Case Western Reserve University, Cleveland, Ohio (unpublished Ph.D. dissertation).
Sullivan, J. and Tyler, S. (eds.) (1991) Intelligent User Interfaces, Reading: Addison-Wesley.
Warren, D. and Pereira, F. (1982) An Efficient Easily Adaptable System for Interpreting Natural Language Queries. American Journal of Computational Linguistics.
`
`