QuickSet: Multimodal Interaction for Distributed Applications

Philip R. Cohen, Michael Johnston, David McGee, Sharon Oviatt,
Jay Pittman, Ira Smith, Liang Chen and Josh Clow
Center for Human Computer Communication
Oregon Graduate Institute of Science and Technology
P.O. Box 91000
Portland, OR 97291-1000 USA
Tel: 1-503-690-1326
E-mail: pcohen@cse.ogi.edu
http://www.cse.ogi.edu/CHCC
ABSTRACT

This paper presents an emerging application of multimodal interface research to distributed applications. We have developed the QuickSet prototype, a pen/voice system running on a hand-held PC, communicating via wireless LAN through an agent architecture to a number of systems, including NRaD's¹ LeatherNet system, a distributed interactive training simulator built for the US Marine Corps. The paper describes the overall system architecture, a novel multimodal integration strategy offering mutual compensation among modalities, and provides examples of multimodal simulation setup. Finally, we discuss our applications experience and evaluation.

KEYWORDS: multimodal interfaces, agent architecture, gesture recognition, speech recognition, natural language processing, distributed interactive simulation.
1. INTRODUCTION

A new generation of multimodal systems is emerging in which the user will be able to employ natural communication modalities, including voice, hand and pen-based gesture, eye-tracking, body movement, etc. [Koons et al., 1993; Oviatt, 1992, 1996; Waibel et al., 1995], in addition to the usual graphical user interface technologies. In order to make progress on building such systems, a principled method of modality integration, and a general architecture to support it, is needed. Such a framework should provide sufficient flexibility to enable rapid experimentation with different modality integration architectures and applications. This experimentation will allow researchers to discover how each communication modality can best contribute its strengths yet compensate for the weaknesses of the others.
¹ NRaD = US Navy Command and Control Ocean Systems Center Research Development Test and Evaluation (San Diego).
Fortunately, a new generation of distributed system frameworks is now becoming standardized, including the CORBA and DCOM frameworks for distributed object systems. At a higher level, multiagent architectures are being developed that allow integration and interoperation of semi-autonomous knowledge-based components or "agents". The advantages of these architectural frameworks are modularity, distribution, and asynchrony — a subsystem can request that a certain functionality be provided without knowing who will provide it, where it resides, how to invoke it, or how long to wait for it. In virtue of these qualities, these frameworks provide a convenient platform for experimenting with new architectures and applications.
In this paper, we describe QuickSet, a collaborative, multimodal system that employs such a distributed, multiagent architecture to integrate not only the various user interface components, but also a collection of distributed applications. QuickSet provides a new unification-based mechanism for fusing partial meaning representation fragments derived from the input modalities. In so doing, it selects the best joint interpretation among the alternatives presented by the underlying spoken language and gestural modalities. Unification also supports multimodal discourse. The system is scaleable from handheld to wall-sized interfaces, and interoperates across a number of platforms (PCs to UNIX workstations). Finally, QuickSet has been applied to a collaborative military training system, in which it is used to control a simulator and a 3-D virtual terrain visualization system.
This paper describes the "look and feel" of the multimodal interaction with a variety of back-end applications, and discusses the unification-based architecture that makes this new class of interface possible. Finally, the paper discusses the application of the technology for the Department of Defense.
2. QUICKSET

QuickSet is a collaborative, handheld, multimodal system for interacting with distributed applications. In virtue of its modular, agent-based design, QuickSet has been applied to a number of applications in a relatively short period of time, including:

• Simulation Set-up and Control — QuickSet is used to control LeatherNet [Clarkson and Yi, 1996], a system employed in training platoon leaders and company commanders at the USMC base at Twentynine Palms, California. LeatherNet simulations are created using the ModSAF simulator [Courtmanche and Ceranowicz, 1995] and can be visualized in a wall-sized virtual reality CAVE environment [Cruz-Neira et al., 1993; Zyda et al., 1992] called CommandVu. A QuickSet user can create entities, give them missions, and control the virtual reality environment from the handheld PC. QuickSet communicates over a wireless LAN via the Open Agent Architecture (OAA) [Cohen et al., 1994] to ModSAF and to CommandVu, each of which has been made into an agent in the architecture.
• Force Laydown — QuickSet is being used in a second effort called ExInit (Exercise Initialization), that enables users to create large-scale (division- and brigade-sized) exercises. Here, QuickSet interoperates via the agent architecture with a collection of CORBA servers.
• Medical Informatics — A version of QuickSet is used in selecting healthcare in Portland, Oregon. In this application, QuickSet retrieves data from a database of 2000 records about doctors, specialties, and clinics.

Next, we turn to the primary application of QuickSet technology.
3. NEW INTERFACES FOR DISTRIBUTED SIMULATION

Begun as SIMNET in the 1980's [Thorpe, 1987], distributed, interactive simulation (DIS) training environments attempt to provide a high degree of fidelity in simulating combat equipment, movement, atmospheric effects, etc. One of the U.S. Government's goals, which has partially motivated the present research, is to develop technologies that can aid in substantially reducing the time and effort needed to create large-scale scenarios. A recently achieved milestone is the ability to create and simulate a large-scale exercise, in which there may be on the order of 60,000 entities (e.g., a vehicle or a person). QuickSet addresses two phases of user interaction with these simulations: creating and positioning the entities, and supplying their initial behavior. In the first phase, a user "lays down" or places forces on the terrain, which need to be positioned in realistic ways, given the terrain, mission, available equipment, etc. In addition to force laydown, the user needs to supply them with behavior, which may involve complex maneuvering, communication, etc.

Our contribution to this overall effort is to rethink the nature of the user interaction. As with most modern simulators, DISs are controlled via graphical user interfaces (GUIs). However, GUI-based interaction is rapidly losing its benefits, especially when large numbers of entities need to be created and controlled, often resulting in enormous menu trees. At the same time, for reasons of mobility and affordability, there is a strong user desire to be able to create simulations on small devices (e.g., PDAs). This impending collision of trends for smaller screen size and for more entities requires a different paradigm for human-computer interaction with simulators.
A major design goal for QuickSet is to provide the same user input capabilities for handheld, desktop, and wall-sized terminal hardware. We believe that only voice- and gesture-based interaction comfortably spans this range. QuickSet provides both of these modalities because it has been demonstrated that there exist substantive language, task performance, and user preference advantages for multimodal interaction over speech-only and gesture-only interaction with map-based tasks [Oviatt, 1996; Oviatt, in press].² Specifically, for these tasks, multimodal input results in 36% fewer task performance errors, 35% fewer spoken disfluencies, 10% faster task performance, and 23% fewer words, as compared to a speech-only interaction. Multimodal pen/voice interaction is known to be advantageous for small devices, for mobile users who may encounter different circumstances, for error avoidance and correction, and for robustness [Oviatt, 1992; Oviatt, 1995].
In summary, a multimodal voice/gesture interface complements, but also promises to address the limitations of, current GUI technologies for controlling simulators. In addition, it has been shown to have numerous advantages over voice-only interaction for map-based tasks. These findings had a direct bearing on the interface design and architecture of QuickSet.
4. SYSTEM ARCHITECTURE

In order to build QuickSet, distributed agent technologies based on the Open Agent Architecture³ were employed because of its flexible asynchronous capabilities, its ability to run the same set of software components in a variety of hardware configurations, ranging from standalone on the handheld PC to distributed operation across numerous computers, and its easy connection to legacy applications. Additionally, the architecture supports user mobility in that less computationally-intensive agents (e.g., the map interface) can run on the handheld PC, while more computationally-intensive processes (e.g., natural language processing) can operate elsewhere on the network. The agents may be written in any programming language (here, Quintus Prolog, Visual C++, Visual Basic, and Java), as long as they communicate via an interagent communication language. The configuration of agents used in the QuickSet system is illustrated in Figure 1. A brief description of each agent follows.
Figure 1: The facilitator, channeling queries to capable agents. (The diagram shows the facilitator with its blackboard, providing routing, dispatching, and triggering.)
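As an illustration of the facilitated routing just described, the following minimal sketch (in Python, rather than the Prolog, C++, Visual Basic, or Java used for the actual agents) shows how agents might register capabilities with a facilitator and have requests routed to whichever agent can satisfy them. The class and method names are assumptions made for exposition; they do not correspond to the OAA's real interagent communication language.

    # Hypothetical sketch of facilitated routing (not the actual OAA interface).
    # An agent asks the facilitator for a capability without knowing which
    # agent provides it, where it runs, or how it is invoked.

    from collections import defaultdict

    class Facilitator:
        def __init__(self):
            self.capabilities = defaultdict(list)   # capability name -> provider agents

        def register(self, agent, capability):
            self.capabilities[capability].append(agent)

        def request(self, capability, **args):
            # Route the request to every agent that declared the capability and
            # collect their answers; the requester never names a provider.
            return [agent.handle(capability, **args)
                    for agent in self.capabilities[capability]]

    class ParserAgent:
        def handle(self, capability, **args):
            return ("parsed", args.get("text"))

    facilitator = Facilitator()
    facilitator.register(ParserAgent(), "parse_utterance")
    print(facilitator.request("parse_utterance", text="red T72 platoon"))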
² Our prior research [Cohen et al., 1989; Cohen, 1992] has demonstrated the advantages of a multimodal interface offering natural language and direct manipulation for controlling simulators and reviewing their results.
³ Open Agent Architecture is a trademark of SRI International.
5. EXAMPLES OF SPOKEN AND GESTURAL INTERACTION

5.1 LeatherNet

Holding QuickSet, the user views a map from the ModSAF simulation. With speech and pen, she then adds entities into the ModSAF simulation. For example, to create a unit in QuickSet, the user would hold the pen at the desired location and utter "red T72 platoon," resulting in a new platoon of the specified type being created. The user then adds a barbed-wire fence to the simulation by drawing a line at the desired location while uttering "barbed wire." A fortified line can be added multimodally, by drawing a simple line and speaking its label, or unimodally, by drawing its military symbology. A minefield of an amorphous shape is drawn and is labeled verbally. Finally, an M1A1 platoon is created as above. Then the user can assign a task to the platoon by saying "M1A1 platoon follow this route" while drawing the route with the pen.

Figure 4: QuickSet running on a wireless handheld PC. The user has created numerous units, fortifications and objectives.

The results of these commands are visible on the QuickSet screen, as seen in Figure 4, as well as on the ModSAF simulation, which has been executing the user's QuickSet commands in the virtual world (Figure 5).
Figure 5: Controlling the CommandVu 3-D visualization via QuickSet interaction. QuickSet tablets are on the desks.

Two specific aspects of QuickSet to be discussed below are its usage as a collaborative system, and its ability to control a virtual reality environment.
5.1.1 Collaboration

In virtue of the facilitated agent architecture, when two or more user interfaces connected to the same network of facilitators subscribe to and/or produce common messages, they (and their users) become part of a collaboration. The agent architecture offers a framework for heterogeneous collaboration, in that users can have very different interfaces, operating on different types of hardware platforms, and yet be part of a collaboration. For instance, by subscribing to the entity-location database messages, multiple QuickSet user interfaces can be notified of changes in the locations of entities, and can then render them in whatever form is suitable, including 2-D map-based, web-based, and 3-D virtual reality displays. Likewise, users can interact with different interfaces (e.g., placing entities on the 2-D map or in the 3-D VR) and thereby affect the views seen by other users. To allow for tighter synchronicity, the current implementation also allows users to decide to couple their interface to those of the other users connected to a given network of facilitators. Then, when one interface pans and zooms, the other coupled ones do as well. Furthermore, coupled interfaces subscribe to the "ink" messages, meaning one user's ink appears on the others' screens, immediately providing a shared drawing system. On the other hand, collaborative systems also require facilities to prevent users from interfering with one another. QuickSet incorporates authentication of messages in order that one user's speech is not accidentally integrated with another's gesture.

In the future, we will provide a subgrouping mechanism for users, such that there can be multiple collaborating groups using the same facilitator, thereby allowing users to choose to join collaborations of specific subgroups. Also to be developed is a method for handling conflicting actions during a collaboration.
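To make the subscription-based collaboration concrete, here is a small, hypothetical sketch of two coupled interfaces subscribing to entity-location and ink messages through a facilitator; the topic names and classes are invented for this example and are not QuickSet's actual message vocabulary.

    # Hypothetical sketch of message-based collaboration (topic names invented).

    from collections import defaultdict

    class Facilitator:
        def __init__(self):
            self.subscribers = defaultdict(list)
        def subscribe(self, topic, callback):
            self.subscribers[topic].append(callback)
        def post(self, topic, message):
            for callback in self.subscribers[topic]:
                callback(message)

    class MapInterface:
        def __init__(self, name, facilitator, coupled=False):
            self.name = name
            facilitator.subscribe("entity-location", self.on_entity_moved)
            if coupled:
                # Coupled interfaces also mirror each other's ink strokes.
                facilitator.subscribe("ink", self.on_ink)
        def on_entity_moved(self, msg):
            print(f"{self.name}: redraw entity {msg['id']} at {msg['position']}")
        def on_ink(self, msg):
            print(f"{self.name}: show ink stroke from {msg['author']}")

    fac = Facilitator()
    handheld = MapInterface("handheld", fac, coupled=True)
    wall = MapInterface("wall display", fac, coupled=True)
    fac.post("entity-location", {"id": "platoon-1", "position": (42.1, -71.0)})
    fac.post("ink", {"author": "handheld", "points": [(0, 0), (5, 5)]})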
5.1.2 Multimodal Control of Virtual Travel

Most terrain visualization systems allow only for flight control, either through a joystick (or equivalent), via keyboard commands, or via mouse movement. Unfortunately, to make effective use of such interfaces, people need to be pilots, or at least know where they are going. Believing this to be unnecessarily restrictive, our virtual reality set-up follows the approach recommended by Baker and Wickens [unpublished ms.], Brooks [1996], and Stoakley et al. [1995] in offering two "linked" displays — a 2-D "bird's-eye" map-based display (QuickSet), and the 3-D CommandVu visualization. In addition to the existing 3-D controls, the user can issue spoken or multimodal commands via the handheld PC to be executed by CommandVu. Sample commands are:

"CommandVu, heads up display on,"
"take me to objective alpha,"
"fly me to this platoon <gesture on QuickSet map>" (see Figure 4),
"fly me along this route <draws route on QuickSet map> at fifty meters."

Spoken interaction with virtual worlds offers distinct advantages over direct manipulation, in that users are able to describe entities and locations that are not in view, can be teleported to those out-of-view locations and entities, and can ask questions about entities in the scene. We are currently engaged in research to allow the user to gesture directly into the 3-D scene while speaking, a capability that will make these more sophisticated interactions possible.
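The following sketch suggests how such spoken or multimodal travel commands might be dispatched to the 3-D viewer. The command patterns and the viewer calls (fly_along, fly_to) are invented placeholders and not CommandVu's actual interface, and number words such as "fifty" are assumed to have been normalized to digits by the language agent.

    # Hypothetical sketch: dispatching travel commands to a 3-D viewer agent.
    # The patterns and viewer methods below are illustrative only.

    import re

    class VirtualTravelController:
        def __init__(self, viewer):
            self.viewer = viewer

        def handle(self, utterance, gesture=None):
            text = utterance.lower()
            m = re.search(r"fly me along this route at (\d+) meters", text)
            if m and gesture and gesture["type"] == "route":
                return self.viewer.fly_along(gesture["points"], altitude=int(m.group(1)))
            if "fly me to this" in text and gesture and gesture["type"] == "point":
                return self.viewer.fly_to(gesture["points"][0])
            if "take me to objective" in text:
                return self.viewer.fly_to_named(text.rsplit(" ", 1)[-1])
            return "unrecognized command"

    class FakeViewer:
        def fly_along(self, points, altitude):
            return f"flying along {len(points)} waypoints at {altitude} m"
        def fly_to(self, point):
            return f"flying to {point}"
        def fly_to_named(self, name):
            return f"flying to objective {name}"

    ctrl = VirtualTravelController(FakeViewer())
    # "fifty" is assumed to arrive from the language agent as "50".
    print(ctrl.handle("Fly me along this route at 50 meters",
                      gesture={"type": "route", "points": [(0, 0), (1, 2), (3, 4)]}))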
QuickSet interface: On the handheld PC is a geo-referenced map of some region,⁴ such that entities displayed on the map are registered to their positions on the actual terrain, and thereby to their positions on each of the various user interfaces connected to the simulation. The map interface provides the usual pan and zoom capabilities, multiple overlays, icons, etc. Two levels of map are shown at once, with a small rectangle shown on a miniature version of the larger-scale map indicating the portion of it shown on the main map interface. Employing pen, speech, or more frequently, multimodal input, the user can annotate the map, creating points, lines, and areas of various types. The user can also create entities, give them behavior, and watch the simulation unfold from the handheld. When the pen is placed on the screen, the speech recognizer is activated, thereby allowing users to speak and gesture simultaneously. The interface offers controls for various parameters of speech recognition, for loading different maps, for entering into collaborations with other users, for connecting to different facilitators, and for discovering other agents who are connected to the facilitator. The QuickSet system also offers a novel map-labeling algorithm that attempts to minimize the overlap of map labels as the user creates more complex scenarios, and as the entities move (cf. [Christensen et al., 1996]).
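The pen-driven activation of the speech recognizer described above can be pictured with the following minimal sketch; the event handlers and recognizer interface are assumptions made for illustration, including the choice to stop recognition on pen-up.

    # Hypothetical sketch: pen contact gates the speech recognizer so that the
    # user can speak and gesture simultaneously. The recognizer interface is
    # invented for illustration.

    import time

    class SpeechRecognizer:
        def __init__(self):
            self.listening = False
            self.start_time = None
        def start(self):
            self.listening = True
            self.start_time = time.time()   # time stamp useful for later integration
        def stop(self):
            self.listening = False

    class MapCanvas:
        def __init__(self, recognizer):
            self.recognizer = recognizer
            self.current_stroke = []
        def on_pen_down(self, x, y):
            # Activating recognition on pen contact lets speech and ink overlap.
            self.recognizer.start()
            self.current_stroke = [(x, y)]
        def on_pen_move(self, x, y):
            self.current_stroke.append((x, y))
        def on_pen_up(self):
            # Stopping on pen-up is an assumption of this sketch.
            self.recognizer.stop()
            return self.current_stroke      # handed to the gesture recognizer

    canvas = MapCanvas(SpeechRecognizer())
    canvas.on_pen_down(10, 10)
    canvas.on_pen_move(12, 15)
    stroke = canvas.on_pen_up()
    print(len(stroke), "ink points captured while the recognizer was live")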
Speech recognition agent: The speech recognition agent used in QuickSet is built on IBM's VoiceType Application Factory and VoiceType 3.0 recognizers, as well as the Microsoft Whisper speech recognizer.
Gesture recognition agent: QuickSet's pen-based gesture recognizer consists of both a neural network [Pittman, 1991; Manke et al., 1994] and a set of hidden Markov models. The digital ink is size-normalized, centered in a 2-D image, and fed into the neural network as pixels. The ink is also smoothed, resampled, converted to deltas, and given as input to the HMM recognizer. The system currently recognizes 68 pen gestures, including various military map symbols (platoon, mortar, fortified line, etc.), editing gestures (deletion, grouping), route indications, area indications, taps, etc. The probability estimates from the two recognizers are combined to yield probabilities for each of the possible interpretations. The inclusion of route and area indications creates a special problem for the recognizers, since route and area indications may have a variety of shapes. This problem is further compounded by the fact that the recognizer needs to be robust in the face of sloppy writing. More typically, sloppy forms of various map symbols, such as those illustrated in Figure 3, will often take the same shape as some route and area indications. A solution for this problem can be found by combining the outputs from the gesture recognizer with the outputs from the speech recognizer, as is described in the following section.
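The paper does not specify the exact rule used to combine the two recognizers' estimates, so the following sketch shows one plausible combination (a renormalized per-class product of the neural network and HMM scores); the class list is abbreviated and the numbers are invented.

    # Hypothetical sketch of combining gesture-recognizer scores. The product
    # plus renormalization rule is an assumption, not the paper's actual formula.

    def combine_scores(nn_probs, hmm_probs):
        """Combine two per-class probability estimates into one distribution."""
        combined = {label: nn_probs.get(label, 0.0) * hmm_probs.get(label, 0.0)
                    for label in set(nn_probs) | set(hmm_probs)}
        total = sum(combined.values())
        if total == 0.0:
            return combined
        return {label: score / total for label, score in combined.items()}

    # Made-up scores for one sloppy ink stroke: the NN slightly prefers "route",
    # the HMM slightly prefers "fortified_line"; the combination keeps both
    # alive so that speech can later disambiguate.
    nn = {"route": 0.45, "fortified_line": 0.35, "area": 0.20}
    hmm = {"route": 0.30, "fortified_line": 0.50, "area": 0.20}
    ranked = sorted(combine_scores(nn, hmm).items(), key=lambda kv: -kv[1])
    print(ranked)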
Figure 2: Pen drawings of routes and areas. Routes and areas do not have signature shapes that can be used to identify them.
⁴ QuickSet can employ either UTM or Latitude/Longitude coordinate systems.
Figure 3: Typical pen input from real users (sloppy drawings of symbols such as a mortar, a tank platoon, a deletion mark, and a mechanized company). The recognizer must be robust in the face of sloppy input.
Natural language agent: The natural language agent currently employs a definite clause grammar and produces typed feature structures as a representation of the utterance meaning. Currently, for the force laydown and mission assignment tasks, the language consists of noun phrases that label entities, as well as a variety of imperative constructs for supplying behavior.
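As a toy illustration of the kind of output such a grammar produces, a labeling noun phrase can be mapped to a typed feature structure like the "barbed wire" example of Section 6. The dictionary-based stand-in below, and its feature names, are simplifications assumed for this sketch, not the definite clause grammar QuickSet actually uses.

    # Toy sketch: mapping a labeling noun phrase to a typed feature structure.
    # The lexicon loosely follows the 'barbed wire' example in Section 6.

    import copy

    LEXICON = {
        "barbed wire": {
            "type": "create_line",
            "object": {"type": "line_obj", "style": "barbed_wire",
                       "color": "red", "label": "Barbed Wire"},
            # Underspecified: any value supplied here must itself be a line.
            "location": {"type": "line"},
        },
        "red t72 platoon": {
            "type": "create_unit",
            "object": {"type": "unit", "echelon": "platoon",
                       "equipment": "T72", "side": "red"},
            "location": {"type": "point"},
        },
    }

    def interpret_phrase(phrase):
        """Return a copy of the typed feature structure for a known phrase."""
        entry = LEXICON.get(phrase.strip().lower())
        return copy.deepcopy(entry) if entry else None

    print(interpret_phrase("barbed wire")["location"])   # {'type': 'line'}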
Text-to-Speech agent: Microsoft's text-to-speech system has been incorporated as an agent, residing on each individual PC.
Multimodal integration agent: The task of the integrator agent is to field incoming typed feature structures representing individual interpretations of speech and of gesture, and identify the best potential unified interpretation, multimodal or unimodal. In order for speech and gesture to be incorporated into a multimodal interpretation, they need to be both semantically and temporally compatible. The output of this agent is a typed feature structure representing the preferred interpretation, which is ultimately routed to the bridge agent for execution. A more detailed description of multimodal interpretation is in Section 6.
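A minimal sketch of the integrator's selection step follows; the compatibility and fusion arguments are stand-ins for the time-window and unification tests described in Section 6, and the product scoring rule is an assumption rather than QuickSet's documented method.

    # Hypothetical sketch of the integrator agent's selection step: it receives
    # scored interpretations of speech and of gesture, keeps joint candidates
    # that are temporally and semantically compatible, and emits the
    # highest-scoring interpretation (multimodal if possible, else unimodal).

    def select_interpretation(speech_alts, gesture_alts, compatible, fuse):
        """speech_alts, gesture_alts: lists of (feature_structure, probability)."""
        candidates = [(fuse(s_fs, g_fs), s_p * g_p)
                      for s_fs, s_p in speech_alts
                      for g_fs, g_p in gesture_alts
                      if compatible(s_fs, g_fs)]
        if not candidates:                    # nothing fuses: fall back to unimodal
            candidates = list(speech_alts) + list(gesture_alts)
        return max(candidates, key=lambda c: c[1], default=None)

    # Trivial stand-ins, for demonstration only.
    compatible = lambda s, g: s.get("location", {}).get("type") == g.get("type")
    fuse = lambda s, g: {**s, "location": g}
    speech = [({"cmd": "create_line", "location": {"type": "line"}}, 0.8)]
    gesture = [({"type": "line", "points": [(0, 0), (9, 9)]}, 0.6)]
    print(select_interpretation(speech, gesture, compatible, fuse))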
Simulation agent: The simulation agent, developed primarily by SRI International [Moore et al., 1997] but modified by us for multimodal interaction, serves as the communication channel between the OAA-brokered agents and the ModSAF simulation system. This agent offers an API for ModSAF that other agents can use.
Web display agent: The Web display agent can be used to create entities, points, lines, and areas, and posts queries for updates to the state of the simulation via Java code that interacts with the blackboard and facilitator. The queries are routed to the running ModSAF simulation, and the available entities can be viewed over a WWW connection.
CommandVu agent: Since the CommandVu virtual reality system is an agent, the same multimodal interface on the handheld PC can be used to create entities and to fly the user through the 3-D terrain.
Application bridge agent: The bridge agent generalizes the underlying applications' APIs to typed feature structures, thereby providing an interface to the various applications such as ModSAF, CommandVu, and ExInit. This allows for a domain-independent integration architecture in which constraints on multimodal interpretation are stated in terms of higher-level constructs such as typed feature structures, greatly facilitating reuse.
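To make the bridging idea concrete, the following hypothetical sketch maps a completed feature structure onto an application call; the back-end function names are invented placeholders, not the actual ModSAF or CommandVu APIs.

    # Hypothetical sketch of a bridge agent: it maps domain-independent typed
    # feature structures onto application-specific calls. The back-end method
    # names below are invented placeholders, not ModSAF's real API.

    class FakeModSAF:
        def create_linear_obstacle(self, style, points):
            return f"created {style} along {len(points)} points"
        def create_unit(self, echelon, equipment, side, position):
            return f"created {side} {equipment} {echelon} at {position}"

    class BridgeAgent:
        def __init__(self, backend):
            self.backend = backend
        def execute(self, fs):
            if fs["type"] == "create_line":
                return self.backend.create_linear_obstacle(
                    fs["object"]["style"], fs["location"]["points"])
            if fs["type"] == "create_unit":
                return self.backend.create_unit(
                    fs["object"]["echelon"], fs["object"]["equipment"],
                    fs["object"]["side"], fs["location"]["point"])
            raise ValueError(f"no mapping for {fs['type']}")

    bridge = BridgeAgent(FakeModSAF())
    print(bridge.execute({
        "type": "create_line",
        "object": {"style": "barbed_wire"},
        "location": {"type": "line", "points": [(0, 0), (3, 4), (6, 1)]},
    }))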
CORBA bridge agent: This agent converts OAA messages to CORBA IDL (Interface Definition Language) for the Exercise Initialization project.

To see how QuickSet is used, we present the following examples.
5.2 Exercise Initialization: ExInit

QuickSet has been incorporated into the DoD's new Exercise Initialization tool (ExInit), whose job is to create the force laydown and initial mission assignments for very large-scale simulated scenarios. Whereas previous manual methods for initializing scenarios resulted in a large number of people spending more than a year in order to create a division-sized scenario, a 60,000+ entity scenario recently took a single ExInit user 63 hours, most of which was computation.

ExInit is distinctive in its use of CORBA technologies as the interoperation framework, and its use of inexpensive off-the-shelf personal computers. ExInit's CORBA servers (written or integrated by MRJ Corp. and Ascent Technologies) include a relational database (Microsoft Access or Oracle), a geographical information system (CARIS), a "deployment" server that knows how to decompose a high-level unit into smaller ones and position them in realistic ways with respect to the terrain, a graphical user interface, and QuickSet for voice/gesture interaction.

In order for the QuickSet interface to work as part of the larger ExInit system, a CORBA bridge agent was written for the OAA, which communicated via IDL to the CORBA side, and via the interagent communication language to the OAA agents. Thus, to the CORBA servers, QuickSet is viewed as a Voice/Gesture server, whereas to the QuickSet agents, ExInit is simply another application agent. Users can interact with the QuickSet map interface (which offers a fluid multimodal interface), and view ExInit as a "back-end" application similar to ModSAF. A diagram of the QuickSet-ExInit architecture can be found in Figure 6. Shown there as well is a connection to DARPA's Advanced Logistics Program demonstration system, for which QuickSet is the user interface.

Figure 6: ExInit Architecture

To illustrate the use of QuickSet for ExInit, consider the example of Figure 7, in which a user has said "Multiple boundaries," followed in rapid succession by a series of multimodal utterances such as "Battalion <draws line>," "Company <draws line>," etc. The first utterance tells ExInit that subsequent input is to be interpreted as a boundary line, if possible. When the user then names an echelon and draws a line, the multimodal input is interpreted as a boundary of the appropriate echelon.

Figure 7: QuickSet used for ExInit — large-scale exercise initialization

Numerous features describing engineering works, such as a fortified line, a berm, minefields, etc., have also been added to the map using speech and gesture. Then the user creates a number of armored companies facing 45 degrees in defensive posture; he is now beginning to add armored companies facing 225 degrees, etc. Once the user is finished positioning the entities, he can ask for them to be deployed to a lower-level echelon (e.g., platoon).

An informal user test was recently run in which an experienced ExInit user (who had created the 60,000 entity scenario) designed his own test scenario involving the creation of 8 units and 15 control measures (e.g., the lines and areas shown in Figure 7). The user first entered the scenario via the ExInit graphical user interface, a standard Microsoft Windows mouse- and menu-based GUI. Then, after a relatively short training session with QuickSet, he created the same scenario using speech and gesture. Interaction via QuickSet resulted in a two-fold to seven-fold speedup, depending on the size of the units involved (companies or battalions). Although a more comprehensive user test remains to be conducted, this early data point indicates the productivity gains that can potentially be derived from using multimodal interaction.
5.3 Multimodal Interaction with Medical Information: MIMI

The last example of a QuickSet-based application is MIMI, which allows users to find appropriate health care in Portland, Oregon. Working with the Oregon Health Sciences University, a prototype was developed that allows users to inquire using speech and gesture about available health care providers. For example, a user might say "show me all psychiatrists in this neighborhood <circling gesture on map>". The system translates the multimodal input into a query to a database of doctor records. The query results in a series of icons being displayed on the map. Each of these icons contains one or more health care providers meeting the appropriate criterion. Figure 8 shows the map-based interaction supported by MIMI.
Users can ask to see details of the providers and clinics, ask follow-up questions, and inquire about transportation to those sites.

In summary, QuickSet provides a multimodal interface to a number of distributed applications, including simulation, force laydown, virtual reality, and medical informatics. The heart of the system is its ability to integrate continuous spoken language and continuous gesture. Section 6 discusses the unification-based architecture that supports this multimodal integration.
6. MULTIMODAL INTEGRATION

Given the advantages of multimodal interaction, the problem of integrating multiple communication modalities is key to future human-computer interfaces. However, in the sixteen years since the "Put-That-There" system [Bolt, 1980], research on multimodal integration has yet to yield a reusable, scaleable architecture for the construction of multimodal systems that integrate gesture and voice. As we reported in Johnston et al. [1997], we see four major limiting factors in previous approaches to multimodal integration:

• The majority of approaches only consider simple deictic pointing gestures made with a mouse [Brison and Vigouroux (ms.); Cohen, 1992; Neal and Shapiro, 1991; Wauchope, 1994] or with the hand [Bolt, 1980; Koons et al., 1993].

• Most previous approaches have been primarily language-driven, treating gesture as a secondary dependent mode [Neal and Shapiro, 1991; Cohen, 1992; Brison and Vigouroux (ms.); Koons et al., 1993; Wauchope, 1994]. In these approaches, integration of gesture is triggered by the appearance of expressions in the speech stream whose reference needs to be resolved, such as definite and deictic noun phrases (e.g., 'the platoon facing east,' 'this one', etc.).

• None of the existing approaches provide a well-understood and generally applicable common meaning representation for the different modes.

• None of the existing approaches provide a general and formally well-defined mechanism for multimodal integration.

6.1 Multimodal Architecture Requirements

In order to create such a mechanism we need:

• Parallel recognizers and "understanders" that produce a set of time-stamped meaning fragments for each continuous input stream.

• A common framework within which to represent those meaning fragments.

• A time-sensitive grouping process that decides which meaning fragments from each modality stream should be combined. For example, should the gesture in a sequence of <speech, gesture, speech> be interpreted with the preceding speech, the following speech, or by itself?

• Meaning "fusion" operations that combine semantically compatible meaning fragments. The modality combination operation needs to allow any meaningful part to be expressed in any of the available modalities.

• A process that chooses the best joint interpretation of the multimodal input. Such a process will support mutual compensation of modes — allowing, for example, speech to compensate for errors in gesture recognition, and vice versa.

• A flexible asynchronous architecture that allows multiprocessing and can keep pace with human input.
6.2 Overview of QuickSet's Approach to Multimodal Integration

Using a distributed agent architecture, we have developed a multimodal integration process for QuickSet that meets these goals.

• The system employs continuous speech and continuous gesture recognizers running in parallel. A wide range of continuous gestural input is supported, and integration may be driven by either mode.

• Typed feature structures are used to provide a clearly defined and well understood common meaning representation for the modes.

• Multimodal integration is accomplished through unification.

• The integration is sensitive to the temporal characteristics of the input in each mode.

• The unification-based integration method allows spoken language and gesture to compensate for recognition errors in the other modality.

• The agent architecture offers a flexible asynchronous framework within which to build multimodal systems.

In the remainder of this section, we briefly present the multimodal integration method. Further information can be found in [Johnston et al., 1997].
6.3 A Temporally-Sensitive Unification-Based Architecture for Multimodal Integration

One of the most significant challenges facing the development of effective multimodal interfaces concerns the integration of input from different modes. In QuickSet, inputs from each mode need to be both temporally and semantically compatible before they will be fused into an integrated meaning.
6.3.1 Temporal compatibility

In recent empirical work [Oviatt et al., 1997], it was discovered that when users speak and gesture in a sequential manner, they gesture first, then speak within a relatively short time window; speech rarely precedes gesture. As a consequence, our multimodal interpreter prefers to integrate gesture with speech that follows within a short time interval, rather than with preceding speech. If speech arrives after that interval, the gesture will be interpreted unimodally. This temporally-sensitive architecture requires that there at least be time stamps for the beginning and end of each input stream. However, this strategy may be difficult to implement for a distributed environment in which speech recognition and gesture recognition might be performed by different machines on a network, requiring a synchronization of clocks. For this reason, it is preferable to have speech and gestural processing performed on the same machine.
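A minimal sketch of such a time-window test appears below; the window length and the exact preference rule are assumptions for illustration, since the paper states only that gesture is preferentially integrated with speech that follows within a short interval.

    # Hypothetical sketch of the temporal-compatibility test. Each input carries
    # start/end time stamps; the specific window length is an assumption.

    from dataclasses import dataclass

    @dataclass
    class TimedInput:
        mode: str        # "speech" or "gesture"
        start: float     # seconds, on a shared clock
        end: float

    def temporally_compatible(gesture, speech, window=1.5):
        """Allow integration if the inputs overlap in time, or if speech begins
        no more than `window` seconds after the gesture ends."""
        if speech.start <= gesture.end and gesture.start <= speech.end:
            return True                      # simultaneous speech and gesture
        if 0.0 <= speech.start - gesture.end <= window:
            return True                      # speech shortly follows gesture
        return False                         # otherwise interpret unimodally

    g = TimedInput("gesture", start=2.0, end=2.6)
    s = TimedInput("speech", start=3.1, end=4.0)
    print(temporally_compatible(g, s))       # True: speech follows within window
    late = TimedInput("speech", start=6.0, end=6.8)
    print(temporally_compatible(g, late))    # False: gesture stands alone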
Unification can combine complementary or redundant input from both modes, but rules out contradictory inputs.

6.3.3 Advantages of typed feature structure unification

We identify four advantages of using typed feature structure unification to support multimodal integration — partiality, mutual compensation, structure sharing, and multimodal discourse. These are discussed below.
Partial meaning representations. The use of feature structures as a semantic representation framework facilitates the specification of partial meanings. Spoken or gestural input which partially specifies a command can be represented as an underspecified feature structure in which certain features are not instantiated, but are given a certain type based on the semantics of the input. For example, if a given speech input can be integrated with a line gesture, it can be assigned a feature structure with an underspecified location feature whose value is required to be of type line, as in Figure 9, where the spoken phrase 'barbed wire' is assigned the feature structure shown.
create_line
  object:   [ line_obj
              style: barbed_wire
              color: red
              label: "Barbed Wire" ]
  location: [ line ]

Figure 9: Feature structure for 'barbed wire'
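To illustrate how such an underspecified structure can combine with a gesture interpretation, the following simplified sketch performs unification over nested attribute-value structures; it ignores the type hierarchy of true typed unification, and the gesture structure (with its coordinates) is invented for the example.

    # Simplified sketch of typed feature structure unification. Real typed
    # unification also consults a type hierarchy; here types must simply match,
    # and any feature present in only one structure is copied through.

    def unify(a, b):
        """Return the unification of two feature structures, or None on clash."""
        if isinstance(a, dict) and isinstance(b, dict):
            result = dict(a)
            for key, b_val in b.items():
                if key in result:
                    merged = unify(result[key], b_val)
                    if merged is None:
                        return None          # contradictory values: rule it out
                    result[key] = merged
                else:
                    result[key] = b_val      # complementary information
            return result
        return a if a == b else None

    # Speech: 'barbed wire' -> create_line whose location must be a line.
    speech_fs = {"type": "create_line",
                 "object": {"type": "line_obj", "style": "barbed_wire",
                            "color": "red", "label": "Barbed Wire"},
                 "location": {"type": "line"}}
    # Gesture: the pen stroke, interpreted as a line with coordinates (invented).
    gesture_fs = {"location": {"type": "line",
                               "coords": [(94.1, 32.4), (94.3, 32.4)]}}

    print(unify(speech_fs, gesture_fs))      # fused multimodal interpretation
    # A contradictory gesture (e.g., an area) fails to unify:
    print(unify(speech_fs, {"location": {"type": "area"}}))   # None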
Since QuickSet is a task-based system directed toward setting up a scenario for simulation, this phrase is interpreted as a partial