Oregon Health & Science University
OHSU Digital Commons

Scholar Archive

July 1995

Spoken-language access to multimedia (SLAM): a multimodal interface to the World-Wide Web

David House

Follow this and additional works at: http://digitalcommons.ohsu.edu/etd

Recommended Citation
House, David, "Spoken-language access to multimedia (SLAM): a multimodal interface to the World-Wide Web" (1995). Scholar Archive. Paper 3416.

This Thesis is brought to you for free and open access by OHSU Digital Commons. It has been accepted for inclusion in Scholar Archive by an authorized administrator of OHSU Digital Commons. For more information, please contact champieu@ohsu.edu.

Spoken-Language Access to Multimedia (SLAM):
A Multimodal Interface to the World-Wide Web

David House

B.Sc., North Carolina State University, 1993

A thesis submitted to the faculty of the
Oregon Graduate Institute of Science & Technology
in partial fulfillment of the
requirements for the degree
Master of Science
in
Computer Science and Engineering

July 1995

The thesis "Spoken Language Access to Multimedia (SLAM): A Multimodal Interface to the World-Wide Web" by David House has been examined and approved by the following Examination Committee:

Dr. David G. Novick
Associate Professor
Thesis Research Advisor

Dr. Mark Fanty
Research Assistant Professor

Dr. Jonathan Walpole
Associate Professor

Acknowledgments

Above all, I would like to thank David Novick for the arrangements, guidance, ideas and support he provided in all aspects of this research.

I wish to give special thanks to Oscar Garcia for his contributions to the remote-recognition model and to Mark Fanty for work on the recognition module, as well as his thoughts on a speech-only interface to this system and other feedback. Further thanks go to Jonathan Walpole for his feedback and assistance as a thesis committee member, to Ken Maupin for his valuable help with World-Wide Web-, HTML-, and programming-related issues, and to Ron Cole for his ideas, support and feedback.

Additional thanks to Henry Churchyard for providing the code from htmlchek used for HTML document parsing, and to Michael Mauldin for providing statistics about common link words on the World-Wide Web. Thanks also to Yeshwant Muthusamy for providing a fast text-to-phoneme generator and to James Blakely for his help with creating the Motif code for the type-in version.

Finally, thanks to the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign for its role in the development of Mosaic.

Contents

Acknowledgments

Abstract

1 Introduction
  1.1 The problem
  1.2 The approach
  1.3 Overview of thesis

2 Related work
  2.1 Motivation
  2.2 Previous work with interface modalities for hypermedia
  2.3 Previous work with speech access to hypermedia systems
  2.4 Previous work with spoken-language extensions to WWW browsers

3 A comparison of speech- and mouse-based interfaces to hypermedia
  3.1 Mouse-based interfaces to hypermedia: advantages and disadvantages
  3.2 Speech-based interfaces to hypermedia: advantages and disadvantages

4 Issues in creating a spoken-language extension to Mosaic
  4.1 Options and trade-offs in creating a speech-enabled WWW browser
  4.2 Microphone variation
  4.3 Input style: Open microphone or touch-to-talk
  4.4 Recording format
  4.5 Location of recognition
  4.6 Choice of recognizer
  4.7 Generating recognition models

5 The SLAM System
  5.1 SLAM architecture
  5.2 SLAM implementation
  5.3 SLAM speech recognition
  5.4 Scope of speech access to links
  5.5 Parsing HTML documents
  5.6 Microphone input method
  5.7 Generating and storing speech-enabled documents
  5.8 Updating the user's screen
  5.9 Current status
  5.10 Network rates for SLAM
  5.11 User tests

6 Future Work
  6.1 Improved speech access
  6.2 Other interface improvements
  6.3 Improvements to the recognition
  6.4 Implementation improvements
  6.5 User studies
  6.6 Making greater use of the user's speech
  6.7 Speech-only access to the WWW
  6.8 Speech access to pictures and icons
  6.9 Unconstrained multimodal access to the WWW

Bibliography

Appendix A. SLAM code
  A.1 Modified code for gui.c
  A.2 Code for SLAM-enable
  A.3 Code for SLAM remote recognition server
  A.4 Code for SLAM remote recognition client

Appendix B. Network Tests of SLAM

Biographical Note

List of Tables

3.1 Mouse-Based Interaction with Hypermedia
3.2 Spoken-Language Interaction with Hypermedia

List of Figures

4.1 Hartsfield School's home page (overuse of "here" link)
4.2 Faculty of Dentistry home page (overuse of "here" link)
4.3 Recent publications page (overuse of "Postscript" link)
5.1 SLAM architecture
5.2 Input file for current SLAM demonstration system

Abstract

The World-Wide Web (WWW), a global networked information system based on hypertext, has become extremely popular since it became available in 1992. In order to improve the ease of access to the information available on the WWW, as well as to give increased exposure to spoken language systems, we developed Spoken Language Access to Multimedia (SLAM), a spoken language extension to the graphical user interface of the World-Wide Web browser Mosaic.

Although other research has been conducted on speech interfaces to hypertext, including speech interfaces to the WWW, SLAM differs in some key ways. For one, SLAM uses the complementary modalities of spoken language and direct manipulation to improve this interface to information on the Internet. Also, SLAM makes the advantages of spoken language systems available to a wider audience by providing a recognition server available remotely across a network.

This thesis describes previous work related to SLAM, particularly in the areas of multimodality and speech interfaces to hypertext and hypermedia systems, including speech access to the WWW. This thesis also examines the issues and architecture of what is believed to be the first spoken-language interface to the WWW to be easily run across platforms.

This work is sponsored by a Small Grant for Exploratory Research (number 9069-120) from the National Science Foundation.

Chapter 1

Introduction

1.1 The problem

The World-Wide Web (WWW) (CERN, 1994) is a network-based standard for hypermedia documents that combines documents prepared in HyperText Markup Language (HTML) (NCSA, 1994a) with an extensible set of multimedia resources. The most popular WWW browser with available source code is Mosaic (NCSA, 1994b), a cross-platform program developed and distributed by NCSA, now running in X11-based Unix, Macintosh and PC-Windows environments. As a hypermedia viewer, Mosaic combines the flexibility and navigability of hypermedia with multimedia outputs such as audio and GIF images.

The World-Wide Web, especially as viewed with Mosaic, is phenomenally popular. By mid-Spring of 1994, Internet traffic was doubling about every six months. Of this growth, the World-Wide Web's proportional usage was doubling approximately every four months. In absolute volume of traffic, use of the WWW was doubling every two and a half months (Wallach, 1994).

Much of the popularity of Mosaic can be attributed to its mouse-based interface, which can quickly, simply, and directly aid the user in browsing the variety of documents available on the Internet. However, inherent limitations in mouse-based interfaces make it difficult for users to perform complex commands and to access documents that cannot be reached by the visible links. Speech-based interfaces, on the other hand, perform well on these types of complex, nonvisual tasks, but speech input to computers is not nearly as widespread as other input methods.

1.2 The approach

The SLAM system simultaneously addresses the limitations of mouse-based WWW interfaces and the limited popularity of speech-based interfaces.

By maintaining the full functionality of the mouse-based Mosaic WWW browser while adding the speech input option, a system has been created for which the strengths of each complementary mode of input (modality) compensate for weaknesses in the other. The SLAM system does not merely add speech input to the existing Mosaic interface, but rather uses speech to allow access to information that was not directly available with the mouse-based system.

This research could broaden the market for speech-based interfaces in two ways. By making speech recognition available in a popular product like Mosaic, speech recognition will also become an increasingly popular way to access data. SLAM also enables the user to perform speech recognition on either the local machine or on a remote speech recognition server, so that it is not necessary for the user's client machine to have a speech recognizer in order to access the WWW with speech.
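
In outline, the remote-recognition path is a simple network round trip: the browser records an utterance, ships the audio to the recognition server, and reads back the recognized text. The actual SLAM client and server code appears in Appendices A.3 and A.4; the fragment below is only a minimal C sketch of the idea, and its one-shot protocol, function name, and error handling are assumptions of this example rather than SLAM's real interface.

#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Illustrative sketch only, not SLAM's actual protocol: ship `len`
 * bytes of recorded audio to a recognition server and copy the
 * recognized utterance into `result`. Returns 0 on success. */
int recognize_remote(const char *host, int port,
                     const unsigned char *audio, size_t len,
                     char *result, size_t maxlen)
{
    struct hostent *hp = gethostbyname(host);
    struct sockaddr_in addr;
    int s;
    ssize_t n;

    if (hp == NULL)
        return -1;
    if ((s = socket(AF_INET, SOCK_STREAM, 0)) < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons((unsigned short) port);
    memcpy(&addr.sin_addr, hp->h_addr_list[0], (size_t) hp->h_length);
    if (connect(s, (struct sockaddr *) &addr, sizeof(addr)) < 0) {
        close(s);
        return -1;
    }
    write(s, audio, len);            /* a real client loops on short writes */
    shutdown(s, SHUT_WR);            /* half-close: "utterance complete"    */
    n = read(s, result, maxlen - 1); /* recognized text comes back          */
    result[n > 0 ? n : 0] = '\0';
    close(s);
    return n > 0 ? 0 : -1;
}

Because the recognizer runs at the server, a client built along these lines needs nothing beyond audio capture and a network socket, which is what allows machines without any local recognition software to use speech.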

1.3 Overview of thesis

Chapter 2 describes others' research related to the SLAM project, including motivation for creating multimodal interfaces, previous work on input to hypermedia systems, and research investigating spoken language interfaces to the WWW. Chapter 3 describes the differences in speech- and mouse-based interfaces. Chapter 4 discusses issues involved with building a speech recognition system for the WWW.

Chapter 5 describes the choices that were made in building the SLAM system architecture, and describes user and network testing of the system. Chapter 6 describes open issues and outlines directions for future work with the SLAM project.

Appendix A contains source code and systems requirements developed in the course of this research. Appendix B describes the time trials of SLAM's current networked recognition client and shows the resulting data.

Chapter 2

Related Work

This chapter discusses previous and ongoing work in fields related to spoken-language navigation of the World-Wide Web. A motivation for research in this field is presented, followed by a brief overview of research comparing speech- and mouse-based input, particularly as used in multimodal systems. An overview of previous research into hypertext and hypermedia systems is given, with a mention of systems using speech access to hypermedia. The chapter concludes with a look at other groups exploring the issue of spoken-language access to the WWW.

2.1 Motivation

The SLAM project combines a variety of emerging technologies and techniques, such as spoken language interfaces, the World-Wide Web, and multimodal access to information systems.

This project is one way to address the important issue of studying multimodal interfaces involving speech. The report of the NSF Workshop on Spoken Language Understanding concluded that the performance characteristics of multimodal systems were one of the key research challenges in the field of spoken language research:

    "Interdisciplinary research will be needed to generate novel strategies for designing multimodal systems with performance characteristics superior to those of simpler unimodal alternatives. Among other things, the successful cultivation of such systems will require advance empirical work with human subjects, building a variety of new prototype systems, and the development of appropriate metrics for evaluating the accuracy, efficiency, learnability, expressive power, and other characteristics of different multimodal systems." (Cole, Hirschman et al., 1995, 12)

Development and availability of a spoken-language enhancement to an interface for the World-Wide Web would also increase the availability and visibility of spoken-language technology in the computing community as a whole. This may encourage other researchers and developers to refine and include spoken-language systems technology in future systems. In fact, there may also be a complementary effect, since adding spoken-language input to the World-Wide Web is likely to make the Web more easily used and thus more accessible to the general population. The area of human language technology has been identified as a grand challenge area necessary to support the national information infrastructure technology. A report of the Information Infrastructure Technology Task Group identifies "Intelligent Interfaces" as one of four broad topic areas of the Information Infrastructure Technology and Applications (IITA) program, and states that "Advanced user interfaces will bridge the gap between users and the future National Information Infrastructure... Work in this area includes development of technologies for speech recognition and generation..." (National Coordination Office for HPCC, 1994, 16-17).

The possibilities and practicality of multimodal interfaces to the Web will not be discovered via analytic methods alone. A substantial amount needs to be learned through empirical and experiential methods such as system building. Indeed, the potential interactions involved in multimodal systems are so complex that it may be impossible to discern their optimal structure without conducting advance exploratory research (Oviatt, 1992). Thus the determination of the important or tractable issues relating to such a project requires development, use, and testing of a spoken-language interface to the World-Wide Web.

Moreover, the availability of even an experimental spoken-language interface would enable the growing population of Web users to address these questions in the very practice of their own day-to-day computing. If a spoken-language interface is used in the "workaday world" of cooperative computing (Moran, 1990) exemplified by the Web, then we will have (a) empirical evidence of its utility and (b) a fund of varied experiences with the interface that could contribute to improvements. This community of users can tell us what is right and wrong with spoken-language interaction for hypermedia, thereby offering directions for further research in the field. Indeed, a widespread, easily-available spoken-language interface on the Internet could provide results useful to spoken-language systems research as a whole. In short, from a practical standpoint the idea is to make the interface available and see what happens, as in the case of the original Mosaic interface and other WWW browsers.

2.2 Previous work with interface modalities for hypermedia

The graphical user interface (GUI), especially with pointer-based direct manipulation, has become the predominant model for human-computer interaction. Even in innovative settings such as the World-Wide Web, which provides a rich hypermedia environment that includes outputs in hypertext, images and sound, the inputs to the system remain keyboard- and pointer-based. (As the most typical pointer is the mouse, we will use the term "mouse-based" interface to refer to pointer-based interfaces generally.)

The mouse-based direct-manipulation interface (Shneiderman, 1983) provided a rational and innovative means of interaction with computer systems. While physical pointing and bitmapped displays solved many of the problems with character-and-keyboard-based interfaces, direct manipulation based on physical pointing did not make use of the full range of expressive capabilities of human users. This omission was, no doubt, mostly a consequence of the relatively poor state of other means of expression as input modalities; spoken-language systems have made immense progress since 1983 (Cole, Hirschman et al., 1995).

Adding spoken-language capabilities to hypermedia holds the promise of extending users' abilities in ways they find appealing. Empirical studies of multimodal interfaces have looked at user preferences for different kinds of inputs. For example, Rudnicky (1993) showed that users preferred speech input, even if it meant spending a longer time on the task, when compared with using a keyboard and a direct manipulation device. Oviatt and Olsen (1994) found that users of multimodal interfaces displayed patterns of use that reflected the contrastive functionality of the available modalities.

Other researchers have investigated the comparative advantages of multimodal interfaces, including Cohen (1992) and Oviatt (1992, 1994). One of the goals of this research has been to attempt "to use the strengths of one modality to overcome the weaknesses of another" (Cohen, 1992, 143); Cohen proposed a framework for this analysis. Cohen's analytical framework involves comparing the strengths and weaknesses of modalities with respect to factors such as:

- intuitiveness,
- consistency of "look and feel,"
- whether options are apparent,
- safety,
- feedback,
- "direct engagement" with an entity,
- ability to describe,
- use of anaphora,
- establishing and maintaining context, and
- use of temporal relations.

Cohen studied multimodal interfaces in general terms, without specific consideration of interfaces for hypermedia. For multimodal interfaces in general, then, he observed that the advantages of pointer-based interfaces are that they are generally intuitive, unambiguous, and, if well-designed, can have a consistent "look and feel." Drawbacks to using such interfaces include difficulty in selecting items not currently visible, poor support for temporal relations, and difficulty using context to specify relations. Natural language systems overcome some of the weaknesses of pointer-based interfaces by allowing the specification of context, temporal relations, and unseen objects. On the other hand, language has the problem that the user may not know the vocabulary of the recognizer. Spoken language systems are also prone to other problems such as ambiguity and other causes of recognition errors (Cohen, 1992).

2.3 Previous work with speech access to hypermedia systems

Interactive hypertext systems have been proposed for fifty years; a useful survey is provided by Arons (1991). Such systems have a number of advantages for information retrieval over traditional databases, including that there is no need for training the user on the system and users do not require knowledge of a topic before searching for information. Some disadvantages of such systems are that users will have difficulty in actually getting specific information, and are likely to encounter the well-known "lost in hyperspace" effect (Daniel, 1994; Whalen, 1989), during which users get sidetracked and lost while navigating through a hypermedia environment.

One system (Stock, 1991) combines natural language and hypermedia to explore Italian frescoes. This system uses the hypermedia aspect to organize unstructured information, and uses the natural language aspect to help alleviate the problems of disorientation and the cognitive overhead of having too many links.

Other groups have investigated using speech with hypermedia systems. One hyperspeech system (Arons, 1991) enabled the user to navigate in an audio environment without a visual display; speech recognition was used to maneuver in a database of digitally recorded speech. This system was similar to a speech-only WWW browser in that the speech interface was goal-directed; the speech provided a form of direct addressing that is difficult to capture in other interfaces, so that users felt that they were navigating and in control. Arons acknowledged that "representing and manipulating a hypermedia database becomes much more complex in the speech domain than with traditional media." Related systems include those described by Resnick (1990) and Muller (1990), both cited by Arons (1991).

2.4 Previous work with spoken-language extensions to WWW browsers

Many groups around the country, and presumably around the world, are working on projects that are similar in many ways to OGI's SLAM system.

Earlier versions of MacMosaic had been compiled with speech recognition enhancements, but those compilations are no longer being performed, although they could be activated with some code changes (Stephenson, 1994).

MIT's Spoken Language Systems group has been working on GALAXY (Goddeau, 1994), a distributed system for on-line information that handles the natural language aspects of the system at a remote-recognition server. While the current focus of the GALAXY system is the travel domain, MIT is also believed to be applying this technology to creating a speech interface to the WWW as well.

Raman at DEC has begun work on a spoken language extension to Mosaic called RETRIEVER (Raman, 1995) that focuses on allowing easier access to the Web for people with disabilities. Paciello at DEC is working with Hardin at NCSA on the Mosaic Disability Project (Paciello, 1995), one aspect of which is speech recognition.

Hemphill at Texas Instruments has completed a prototype speech-enabled Mosaic (Hemphill, 1995) that allows for associating extended grammars and dialog states with links and hotlist items. Arbash at SRI developed a speech interface to Mosaic based on work related to the XTalk project (Arbash, 1994). Both of these are based on local speech recognition, unlike OGI's remote-recognition system.

Chapter 3

A Comparison of Speech- and Mouse-Based Interfaces to Hypermedia

This chapter examines issues involved in creating a spoken language extension to a hypermedia system, in particular the Mosaic World-Wide Web browser. I discuss general issues involved in creating a multimodal interface with spoken language and mouse-based systems.

Cohen's analytic framework (discussed in Chapter 2) will now be particularized and extended to deal specifically with the comparison of speech-based and mouse-based interfaces for hypermedia. This will be done in two steps, by looking at mouse-based and then speech-based interfaces in terms of their respective advantages and disadvantages for hypermedia systems.

3.1 Mouse-based interfaces to hypermedia: advantages and disadvantages

The physical pointing involved in mouse-based interfaces is the source of both advantages and disadvantages for this modality. From the user's perspective, pointing has the traditional advantage of direct manipulation, namely reference specified through a combination of action and location, as in double-clicking an icon to start a program. Moreover, the interface generally provides immediate feedback to the user that the reference was successful, typically by highlighting the selected entity. From the point of view of the author of a WWW document, mouse-based pointing has the advantage that the reference can be completely specified: the label of a link will appear exactly as the author wrote it. Additionally, physical pointing in this context has no referential ambiguity; when the user clicks a mouse button, the user and the author both know exactly to which entity the user is referring.
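
Concretely, the "label" of a link is the text that the author places inside an HTML anchor element; the URL itself stays hidden behind that text. A link of this kind looks like the following, where the URL and wording are invented for illustration:

<A HREF="http://example.org/slam.html">the SLAM project</A>

The user sees and clicks only the words "the SLAM project"; a speech interface must in turn decide which spoken words should select that link, a question Chapter 4 returns to for pages that overuse generic labels such as "here."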

Mouse-based interfaces also have a number of disadvantages, particularly of the "lost-in-hyperspace" variety. This well-known problem was identified for hypertext systems by Whalen and Patrick (1989), who proposed a text-based natural-language solution:

    "Users can have trouble actually getting to the specific information required. They may have to navigate through a large number of paragraphs to get to the desired goal. Along the way, users are likely to get sidetracked and lost." (289)

When reference is based on physical pointing to a graphically-represented entity, the absence of such an entity on the screen means that the user cannot refer to it. In other words, the act of reference depends on the physical location of the referent's presentation, which in hypermedia may be pages and documents away.

Hypermedia interfaces typically have standard features such as a "hotlist" and history windows in order to give users a place that contains references they might want and that are otherwise not displayed. But the user might also prefer to refer to an entity by a name other than that specified by the author; the only way the user has to specify an entity is to click on it. Finally, the spatial nature of the interface limits the set of things to which the user can refer. Users cannot describe entities (Cohen, 1992) instead of pointing to them. Similar problems exist with respect to actions. Because they are typically accomplished by selecting a command from a menu or by clicking on an icon, it is difficult to express complex actions other than as a perhaps tedious series of primitives. The advantages and disadvantages of mouse-based interfaces for hypermedia are summarized in Table 3.1.

Table 3.1: Mouse-Based Interaction with Hypermedia

  Advantages                               Disadvantages
  1. Deictic reference and combination     1. Reference depends on location
     of action and reference                  of referent
  2. Author completely specifies the       2. (a) User might prefer another
     representation of the entity             representation and (b) no other
                                              representation possible
  3. No referential ambiguity              3. Vocabulary of references limited
                                              to those with visible links
  4. Generally gives immediate feedback    4. Difficult to express complex
     that the user's action was               acts
     understood

3.2 Spoken-language-based interfaces to hypermedia: advantages and disadvantages

Many of the advantages and disadvantages of spoken-language-based interfaces for hypermedia turn out to be complements of those for mouse-based interfaces. From the user's standpoint, the ability to refer to an entity no longer depends on the location of its graphical representation.

Indeed, all referents are potentially available because the user can simply say the name of the referent without having to see it displayed. A related advantage is that the user can now have a number of different ways in which to refer to entities. Similarly, multiple action primitives could easily be combined into a single complex action that could include temporal and other sophisticated concepts that are not expressible in mouse-based interfaces. Another advantage is that the user's hands are freed for other activities. Indeed, it might be possible to build a spoken-language-only interface to hypermedia that could serve users by telephone instead of requiring a GUI.
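
To make the contrast with mouse-based reference concrete, consider resolving a recognized utterance against every nameable referent rather than against only the links currently drawn on the screen. SLAM's actual mechanism is described in Chapter 5; the fragment below is a deliberately simplified C sketch whose data structure, sample entries, and exact-match rule are all invented for this illustration.

#include <stdio.h>
#include <string.h>

/* Illustrative sketch only, not SLAM's data structures. A "referent"
 * is anything the user may name: a link on the current page, a hotlist
 * entry, or a history item. Unlike a mouse click, the lookup does not
 * care whether the referent is currently visible. */
struct referent {
    const char *spoken_name;   /* words the recognizer may return */
    const char *url;           /* where the referent points       */
    int         on_screen;     /* 1 if drawn on the current page  */
};

static struct referent known[] = {
    { "weather",       "http://example.org/weather.html", 1 },
    { "my hotlist",    "http://example.org/hotlist.html", 0 },  /* off screen */
    { "previous page", "http://example.org/history/1",    0 },  /* off screen */
};

/* Return the URL for a recognized utterance, or NULL if the name is
 * unknown. A real system must also cope with duplicate labels, which
 * are unambiguous to a mouse but ambiguous to speech (see below). */
const char *resolve(const char *utterance)
{
    size_t i;
    for (i = 0; i < sizeof(known) / sizeof(known[0]); i++)
        if (strcmp(utterance, known[i].spoken_name) == 0)
            return known[i].url;
    return NULL;
}

int main(void)
{
    const char *url = resolve("my hotlist");  /* referent not on screen */
    printf("%s\n", url ? url : "(no such referent)");
    return 0;
}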

Speech input to hypermedia also has characteristic disadvantages that are often reciprocal consequences of its advantages. For example, because references no longer depend on physical location, references may become ambiguous: a "hotlink" may be uniquely accessible via the mouse but ambiguously accessible via speech, because another hotlink might have the same label. This strongly suggests that designers of hypermedia interfaces...
