`OHSU Digital Commons
`
`Scholar Archive
`
`July 1995
`
`Spoken-language access to multimedia (SLAM) : a
`multimodal interface to the World-Wide Web
`David House
`
`Follow this and additional works at: http://digitalcommons.ohsu.edu/etd
`
`Recommended Citation
House, David, "Spoken-language access to multimedia (SLAM) : a multimodal interface to the World-Wide Web" (1995). Scholar
Archive. Paper 3416.
`
This Thesis is brought to you for free and open access by OHSU Digital Commons. It has been accepted for inclusion in Scholar Archive by an
authorized administrator of OHSU Digital Commons. For more information, please contact champieu@ohsu.edu.
`
`
`1
`
`
`
`Spoken-Language Access to Multimedia (SLAM):
`A Multimodal Interface to the World-Wide Web
`
`David House
`
B.Sc., North Carolina State University, 1993
`
`A thesis submitted to the faculty of the
`Oregon Graduate Institute of Science & Technology
`in partial fulfillment of the
`requirements for the degree
`Master of Science
`
`in
`
`Computer Science and Engineering
`
`July 1995
`
`2
`
`
`
The thesis "Spoken Language Access to Multimedia (SLAM): A Multimodal Interface to the
World-Wide Web" by David House has been examined and approved by the following
Examination Committee:

Dr. David G. Novick
Associate Professor
Thesis Research Advisor

Dr. Mark Fanty
Research Assistant Professor

Dr. Jonathan Walpole
Associate Professor
`
`3
`
`
`
`Acknowledgments
`
`Above all, I would like to thank David Novick for the arrangements, guidance, ideas
`
`and support he provided in all aspects of this research.
`
I wish to give special thanks to Oscar Garcia for his contributions to the remote-recogni-
`
`tion model and Mark Fanty for work on the recognition module, as well as his thoughts on
`
`a speech-only interface to this system and other feedback. Further thanks go to Jonathan
`
`Walpole for his feedback and assistance as a thesis committee member, to Ken Maupin for
`
`his valuable help with World-Wide Web-, HTML-, and programming-related issues, and
`
`to Ron Cole for his ideas, support and feedback.
`
`Additional thanks to Henry Churchyard for providing the code from htmlchek used for
`
HTML document parsing, and to Michael Mauldin for providing statistics about common
`
`link words on the World-Wide Web. Thanks also to Yeshwant Muthusamy for providing a
`
`fast text-to-phoneme generator and to James Blakely for his help with creating the Motif
`
`code for the type-in version.
`
`Finally, thanks to the National Center for Supercomputing Applications (NCSA) at the
`
`University of Illinois at Urbana-Champaign for its role in the development of Mosaic.
`
`4
`
`
`
Contents

Acknowledgments

Abstract

1 Introduction
1.1 The problem
1.2 The approach
1.3 Overview of thesis

2 Related work
2.1 Motivation
2.2 Previous work with interface modalities for hypermedia
2.3 Previous work with speech access to hypermedia systems
2.4 Previous work with spoken-language extensions to WWW browsers

3 A comparison of speech- and mouse-based interfaces to hypermedia
3.1 Mouse-based interfaces to hypermedia: advantages and disadvantages
3.2 Speech-based interfaces to hypermedia: advantages and disadvantages

4 Issues in creating a spoken-language extension to Mosaic
4.1 Options and trade-offs in creating a speech-enabled WWW browser
4.2 Microphone variation
4.3 Input style: Open microphone or touch-to-talk
4.4 Recording format
4.5 Location of recognition
4.6 Choice of recognizer
4.7 Generating recognition models

5 The SLAM System
5.1 SLAM architecture
5.2 SLAM implementation
5.3 SLAM speech recognition
5.4 Scope of speech-access to links
5.5 Parsing HTML documents
5.6 Microphone input method
5.7 Generating and storing speech-enabled documents
5.8 Updating the user's screen
5.9 Current status
5.10 Network rates for SLAM
5.11 User tests

6 Future Work
6.1 Improved speech access
6.2 Other interface improvements
6.3 Improvements to the recognition
6.4 Implementation improvements
6.5 User studies
6.6 Making greater use of the user's speech
6.7 Speech-only access to the WWW
6.8 Speech access to pictures and icons
6.9 Unconstrained multimodal access to the WWW

Bibliography

Appendix A. SLAM code
A.1 Modified code for gui.c
A.2 Code for SLAM-enable
A.3 Code for SLAM remote recognition server
A.4 Code for SLAM remote recognition client

Appendix B. Network Tests of SLAM

Biographical Note

List of Tables

3.1 Mouse-Based Interaction with Hypermedia
3.2 Spoken-Language Interaction with Hypermedia

List of Figures

4.1 Hartsfield School's home page (overuse of "here" link)
4.2 Faculty of Dentistry home page (overuse of "here" link)
4.3 Recent publications page (overuse of "Postscript" link)
5.1 SLAM architecture
5.2 Input file for current SLAM demonstration system
`
`
`
`Abstract
`
The World-Wide Web (WWW), a global networked information system based on hypertext, has become extremely popular since it became available in 1992. In order to improve the ease of access to the information available on the WWW, as well as to give increased exposure to spoken language systems, we developed Spoken Language Access to Multimedia (SLAM), a spoken language extension to the graphical user interface of the World-Wide Web browser Mosaic.
`
Although other research has been conducted on speech interfaces to hypertext, including speech interfaces to the WWW, SLAM differs in some key ways. For one, SLAM uses the complementary modalities of spoken language and direct manipulation to improve this interface to information on the Internet. Also, SLAM makes the advantages of spoken language systems available to a wider audience by providing a recognition server available remotely across a network.
`
This thesis describes previous work related to SLAM, particularly in the areas of multimodality and speech interfaces to hypertext and hypermedia systems, including speech access to the WWW. This thesis also examines the issues and architecture of what is believed to be the first spoken-language interface to the WWW to be easily run across platforms.
`
`This work is sponsored by a Small Grant for Exploratory Research (number 9069-120)
`
`from the National Science Foundation.
`
`
`
`
`Chapter 1
`
`Introduction
`
`1.1 The problem
`
`The World-Wide Web (WWW) (CERN, 1994) is a network-based standard for hyper-
`
`media documents that combines documents prepared in HyperText Markup Language
`
`(HTML) (NCSA, 1994a) with an extensible set of multimedia resources. The most popu-
`
lar WWW browser with available source code is Mosaic (NCSA, 1994b), a cross-platform
program developed and distributed by NCSA, now running in X11-based Unix, Macintosh
`
`and PC-Windows environments. As a hypermedia viewer, Mosaic combines the flexibility
`
`and navigability of hypermedia with multimedia outputs such as audio and GIF images.
`
`The World-Wide Web, especially as viewed with Mosaic, is phenomenally popular. By
`
`mid-Spring of 1994, Internet traffic was doubling about every six months. Of this growth,
`
`the World-Wide Web’s proportional usage was doubling approximately every four
`
months. In absolute volume of traffic, use of the WWW was doubling every two and a half
months (Wallach, 1994).
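As a quick consistency check on these figures, note that exponential growth rates add: absolute WWW traffic is the product of total Internet traffic (doubling every six months) and the WWW's share of that traffic (doubling every four months). With $t$ in months,

\[
2^{t/6} \cdot 2^{t/4} = 2^{t(1/6 + 1/4)} = 2^{5t/12},
\]

which gives a doubling time of $12/5 = 2.4$ months for absolute WWW traffic, in agreement with the reported two and a half months.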
`
`Much of the popularity of Mosaic can be attributed to its mouse-based interface,
`
`which can quickly, simply, and directly aid the user in browsing the variety of documents
`
`11
`
`11
`
`
`
`available on the Internet. However, inherent limitations in mouse-based interfaces make it
`
`difficult for users to perform complex commands and to access documents that cannot be
`
`reached by the visible links. Speech-based interfaces, on the other hand, perform well on
`
`these types of complex, nonvisual tasks, but speech input to computers is not nearly as
`
`widespread as other input methods.
`
`1.2 The approach
`
`The SLAM system simultaneously addresses the limitations of mouse-based WWW
`
`interfaces and the limited popularity of speech-based interfaces.
`
`By maintaining the full functionality of the mouse-based Mosaic WWW browser
`
`while adding the speech input option, a system has been created for which the strengths of
`
`each complimentary mode of input (modality) compensate for weaknesses in the other.
`
`The SLAM system does not merely add speech input to the existing Mosaic interface, but
`
`rather uses speech to allow access to information that was not directly available with the
`
`mouse-based system.
`
`This research could broaden the market for speech-based interfaces in two ways. By
`
`making speech recognition available in a popular product like Mosaic, speech recognition
`
`will also become an increasingly popular way to access data. SLAM also enables the user
`
`to perform speech recognition on either the local machine or on a remote speech
`
recognition server, so that it is not necessary for the user's client machine to have a speech
recognizer in order to access the WWW with speech.
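To make the remote-recognition option concrete, the sketch below shows the shape of a minimal recognition client in C. The server address and the wire format (raw audio samples sent out, one line of recognized text read back) are illustrative assumptions only; SLAM's actual remote recognition client and server appear in Appendix A.

    /* Sketch of a remote-recognition client.  The host, port, and wire
     * format here are illustrative assumptions, not SLAM's real
     * protocol; see Appendix A for the thesis's client and server. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netdb.h>
    #include <sys/socket.h>

    static int recognize_remotely(const char *host, const char *port,
                                  const char *audio, size_t audio_len,
                                  char *text, size_t text_len)
    {
        struct addrinfo hints, *res;
        int fd;
        ssize_t n;
        size_t got = 0;

        memset(&hints, 0, sizeof hints);
        hints.ai_socktype = SOCK_STREAM;          /* TCP connection */
        if (getaddrinfo(host, port, &hints, &res) != 0)
            return -1;
        fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
        if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen) < 0) {
            if (fd >= 0)
                close(fd);
            freeaddrinfo(res);
            return -1;
        }
        freeaddrinfo(res);

        /* Ship the recorded utterance, then half-close the socket so
         * the server knows the audio is complete. */
        if (write(fd, audio, audio_len) != (ssize_t)audio_len) {
            close(fd);
            return -1;
        }
        shutdown(fd, SHUT_WR);

        /* Read back the recognizer's answer, e.g. the link to follow. */
        while (got + 1 < text_len &&
               (n = read(fd, text + got, text_len - got - 1)) > 0)
            got += (size_t)n;
        text[got] = '\0';
        close(fd);
        return 0;
    }

Because the computationally expensive recognition stays on the server, any client that can record an utterance and open a network connection can offer speech input, even with no recognizer installed locally.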
`
`1.3 Overview of thesis
`
Chapter 2 describes others' research related to the SLAM project, including
`
`motivation for creating multimodal interfaces, previous work on input to hypermedia
`
`12
`
`12
`
`
`
`systems, and research investigating spoken language interfaces to the WWW. Chapter 3
`
`describes the differences in speech- and mouse-based interfaces. Chapter 4 discusses
`
`issues involved with building a speech recognition system for the WWW.
`
`Chapter 5 describes the choices that were made in building the SLAM system
`
`architecture, and describes user and network testing of the system. Chapter 6 describes
`
`open issues and outlines directions for future work with the SLAM project.
`
`Appendix A contains source code and systems requirements developed in the course
`
of this research. Appendix B describes the time trials of SLAM's current networked
`
`recognition client and shows the resulting data.
`
`13
`
`13
`
`
`
`Chapter 2
`
`Related Work
`
`This chapter discusses previous and ongoing work in fields related to spoken-language
`
navigation of the World-Wide Web. A motivation for research in this field is presented,
`
`followed by a brief overview of research comparing speech- and mouse-based input, par-
`
`ticularly as used in multimodal systems. An overview of previous research into hypertext
`
`and hypermedia systems is given, with a mention of systems using speech access to hyper-
`
`media. The chapter concludes with a look at other groups exploring the issue of spoken-
`
language access to the WWW.
`
`2.1 Motivation
`
`The SLAM project combines a variety of emerging technologies and techniques, such
`
`as spoken language interfaces, the World-Wide Web, and multimodal access to informa-
`
`tion systems.
`
`This project is one way to address the important issue of studying multimodal inter-
`
`faces involving speech. The report of the NSF Workshop on Spoken Language Under-
`
standing concluded that performance characteristics of multimodal systems were one of the
`
`key research challenges in the field of spoken language research:
`
`14
`
`14
`
`
`
`“Interdisciplinary research will be needed to generate novel strategies for
`designing multimodal systems with performance characteristics superior to
`those of simpler unimodal alternatives. Among other things, the successful
`cultivation of such systems will require advance empirical work with
`human subjects, building a variety of new prototype systems, and the
`development of appropriate metrics for evaluating the accuracy, efficiency,
learnability, expressive power, and other characteristics of different multi-
`modal systems.” (Cole, Hirschman et al., 1995, 12)
`
`Development and availability of a spoken-language enhancement to an interface for
`
`the World-Wide Web would also increase the availability and visibility of spoken-lam
`
`guage technology in the computing community as a whole. This may encourage other
`
`researchers and developers to refine and include spoken-language systems technology in
`
`future systems. 111 fact, there may also be a complementary effect, since adding spoken-
`
`language input to the World—Wide Web is likely to make the Web more easily used and
`
`thus more accessible to the general population. The area of human language technology
`
`has been identified as a grand challenge area necessary to support the national information
`
`infrastructure technology. A report of the Information Infrastructure Technology Task
`
`Group identifies “Intelligent Interfaces" as one of four broad topic areas of the Informa-
`
`tion Infrastructure Technology and Applications (IITA) program, and states that
`
`“Advanced user interfaces will bridge the gap between users and the future National
`
`Information Infrastructure... Work in this area includes development of technologies for
`
`speech recognition and generation...” (National Coordination Office for HPCC, 1994, 16-
`
`17).
`
The possibilities and practicality of multimodal interfaces to the Web will not be dis-
`
`covered via analytic methods alone. A substantial amount needs to be learned through
`
`empirical and experiential methods such as system building. Indeed, the potential interac-
`
`tions involved in multimodal systems are so complex that it may be impossible to discern
`
`15
`
`15
`
`
`
`their optimal structure without conducting advance exploratory research (Oviatt, 1992).
`
`Thus the determination of the important or tractable issues relating to such a project
`
`requires development, use, and testing of a spoken-language interface to the World-Wide
`
`Web.
`
`Moreover, the availability of even an experimental spoken-language interface would
`
`enable the growing population of Web users to address these questions in the very practice
`
of their own day-to-day computing. If a spoken-language interface is used in the "worka-
`
`day world” of cooperative computing (Moran, 1990) exemplified by the Web, then we will
`
`have (a) empirical evidence of its utility and (b) a fund of varied experiences with the
`
`interface that could contribute to improvements. This community of users can tell us what
`
`is right and wrong with spoken-language interaction for hypermedia, thereby offering
`
`directions for further research in the field. Indeed, a widespread, easily-available spoken-
`
language interface on the Internet could provide results useful to spoken-language systems
`
`research as a whole. In short, from a practical standpoint the idea is to make the interface
`
`available and see what happens, as in the case of the original Mosaic interface and other
`
`WWW browsers.
`
`2.2 Previous work with interface modalities for hypermedia
`
`The graphical user interface (GUI), especially with pointer-based direct manipulation,
`
`has become the predominant model for human-computer interaction. Even in innovative
`
`settings such as the World-Wide Web, which provides a rich hypermedia environment that
`
`includes outputs in hypertext, images and sound, the inputs to the system remain key-
`
board- and pointer-based. (As the most typical pointer is the mouse, we will use the term
`
"mouse-based" interface to refer to pointer-based interfaces generally.)
`
`16
`
`16
`
`
`
The mouse-based direct-manipulation interface (Shneiderman, 1983) provided a ratio-
`
`nal and innovative means of interaction with computer systems. While physical pointing
`
`and bitmapped displays solved many of the problems with character-and-keyboard-based
`
`interfaces, direct manipulation based on physical pointing did not make use of the full
`
`range of expressive capabilities of human users. This omission was, no doubt, mostly a
`
`consequence of the relatively poor state of other means of expression as input modalities;
`
spoken-language systems have made immense progress since 1983 (Cole, Hirschman et al., 1995).
`
`Adding spoken-language capabilities to hypermedia holds the promise of extending
`
users' abilities in ways they find appealing. Empirical studies of multimodal interfaces
`
`have looked at user preferences for different kinds of inputs. For example, Rudnicky
`
`(1993) showed that users preferred speech input, even if it meant spending a longer time
`
`on the task, when compared with using a keyboard and a direct manipulation device.
`
`Oviatt and Olsen (1994) found that users of multimodal interfaces displayed patterns of
`
`use that reflected the contrastive functionality of the available modalities.
`
`Other researchers have investigated the comparative advantages of multimodal inter-
`
`faces, including Cohen (1992) and Oviatt (1992, 1994). One of the goals of this research
`
has been to attempt "to use the strengths of one modality to overcome the weaknesses
of another" (Cohen, 1992, 143). Cohen proposed a framework for this analysis; his
`
`analytical framework involves comparing the strengths and weaknesses of modalities with
`
`respect to factors such as:
`
`- intuitiveness,
`
`- consistency of “look and feel,”
`
`- whether options are apparent,
`
`17
`
`17
`
`
`
`- safety,
`
`- feedback,
`
`- “direct engagement" with an entity,
`
`- ability to describe,
`
`- use of anaphora,
`
`- establishing and maintaining context, and
`
`- use of temporal relations.
`
`Cohen studied multimodal interfaces in general terms, without specific consideration
`
`of interfaces for hypermedia. For multimodal interfaces in general, then, he observed that
`
`the advantages of pointer-based interfaces are that they are generally intuitive, unambigu-
`
`ous, and, if well-designed, can have a consistent “look and feel.” Drawbacks to using
`
`such interfaces include difficulty in selecting items not currently visible, poor support for
`
`temporal relations, and difficulty using context to specify relations. Natural language sys-
`
`tems overcome some of the weaknesses of pointer-based interfaces by allowing the speci-
`
`fication of context, temporal relations, and unseen objects. On the other hand, language
`
`has the problem that the user may not know the vocabulary of the recognizer. Spoken lan-
`
`guage systems are also prone to other problems such as ambiguity and other causes of rec-
`
ognition errors (Cohen, 1992).
`
`2.3 Previous work with speech access to hypermedia systems
`
`Interactive hypertext systems have been proposed for fifty years; a useful survey is
`
`provided by Arons (1991). Such systems have a number of advantages for information
`
`retrieval over traditional databases, including that there is no need for training the user on
`
`the system and users do not require knowledge of a topic before searching for information.
`
`Some disadvantages of such systems are that users will have difficulty in actually getting
`
`18
`
`18
`
`
`
specific information, and are likely to encounter the well-known "lost in hyperspace"
`
`effect (Daniel, 1994; Whalen, 1989) during which users get sidetracked and lost while
`
`navigating through a hypermedia environment.
`
`One system (Stock, 1991) combines natural language and hypermedia to explore Ital-
`
`ian frescoes. This system uses the hypermedia aspect to organize unstructured informa-
`
tion, and uses the natural language aspect to help alleviate the problems of disorientation
`
`and the cognitive overhead of having too many links.
`
`Other groups have investigated using speech with hypermedia systems. One hyper-
`
`speech system (Arons, 1991) enabled the user to navigate in an audio environment with-
`
`out a visual display; speech recognition was used to maneuver in a database of digitally
`
`recorded speech. This system was similar to a speech-only WWW browser in that the
`
`speech interface was goal-directed; the speech provided a form of direct addressing that is
`
`difficult to capture in other interfaces, so that the user felt that they were navigating and in
`
`control. Arons acknowledged that “representing and manipulating a hypermedia database
`
`becomes much more complex in the speech domain than with traditional media." Related
`
`systems include those described by Resnick (1990) and Muller (1990), both cited by
`
`Arons (1991).
`
`2.4 Previous work with spoken-language extensions to WWW browsers
`
`Many groups around the country, and presumably around the world, are working on
`
`projects that are similar in many ways to OGI’s SLAM system.
`
`Earlier versions of MacMosaic had been compiled with speech recognition enhance-
`
`ments, but those compilations are no longer being performed, although they could be acti-
`
`vated with some code changes (Stephenson, 1994).
`
`19
`
`19
`
`
`
`I0
`
`MIT’s Spoken Language Systems group has been working on GALAXY (Goddeau,
`
`1994), a distributed system for on-line information that handles the natural language
`
`aspects of the system at a remote-recognition server. While the current focus of the GAL-
`
`AXY system is the travel domain, MIT is also believed to be applying this technology to
`
creating a speech interface to the WWW as well.
`
`Raman at DEC has begun work on a spoken language extension to Mosaic called
`
RETRIEVER (Raman, 1995) that focuses on allowing easier access to the Web for people
`
`with disabilities. Paciello at DEC is working with Hardin at NCSA on the Mosaic Disabil-
`
`ity Project (Paciello, 1995), one aspect of which is speech recognition.
`
Hemphill at Texas Instruments has completed a prototype speech-enabled Mosaic
(Hemphill, 1995) that allows for associating extended grammars and dialog states with
`
`links and hotlist items. Arbash at SRI developed a speech interface to Mosaic based on
`
`work related to the Xtallt project (Arbash, 1994). Both of these are based on local speech
`
`recognition, unlike OGI's remote-recognition system.
`
`20
`
`20
`
`
`
`Chapter 3
`
`A Comparison of Speech- and Mouse-based Interfaces
`
`to Hypermedia
`
`This chapter examines issues involved in creating a spoken language extension to a
`
`hypermedia system, in particular the Mosaic World-Wide Web browser. I discuss general
`
`issues involved in creating a multimodal interface with spoken language and mouse-based
`
`systems.
`
Cohen's analytic framework (discussed in Chapter 2) will now be particularized and
`
`extended to deal specifically with the comparison of speech-based and mouse-based inter-
`
faces for hypermedia. This will be done in two steps, by looking at mouse-based and then
`
`speech-based interfaces in terms of their respective advantages and disadvantages for
`
`hypermedia systems.
`
`3.1 Mouse-based interfaces to hypermedia: advantages and
`
`disadvantages
`
`The physical pointing involved in mouse-based interfaces is the source of both advan-
`
`tages and disadvantages for this modality. From the user's perspective, pointing has the
`
`21
`
`
`
`12
`
`traditional advantage of direct manipulation, namely reference specified through a combi-
`
nation of action and location, as in double-clicking an icon to start a program. Moreover,
`
`the interface generally provides immediate feedback to the user that the reference was suc-
`
`cessful, typically by highlighting the selected entity. From the point of view of the author
`
`of a WWW document, mouse-based pointing has the advantage that the reference can be
`
completely specified: the label of a link will appear exactly as the author wrote it. Addi-
`
`tionally, physical pointing in this context has no referential ambiguity; when the user
`
`clicks a mouse button, the user and the author both know exactly to which entity the user
`
`is referring.
`
`Mouse-based interfaces also have a number of disadvantages, particularly of the “lost-
`
in-hyperspace" variety. This well-known problem was identified for hypertext systems by
`
`Whalen and Patrick (1989), who proposed a text-based natural-language solution.
`
`“Users can have trouble actually getting to the specific information
`required. They may have to navigate through a large number of paragraphs
`to get to the desired goal. Along the way, users are likely to get sidetracked
`and lost.” (289)
`
`When reference is based on physical pointing to a graphically-represented entity, the
`
`absence of such an entity on the screen means that the user cannot refer to it. In other
`
`words, the act of reference depends on the physical location of the referent’s presentation,
`
`which in hypermedia may be pages and documents away.
`
Hypermedia interfaces typically have standard features such as a "hotlist" and history
`
`windows in order to give users a place that contains references they might want and that
`
`are otherwise not displayed. But the user might also prefer to refer to an entity by a name
`
`other than that specified by the author; the only way the user has to specify an entity is to
`
`click on it. Finally, the spatial nature of the interface limits the set of things to which the
`
`22
`
`22
`
`
`
`Table 3.1: Mouse-Based Interaction fith Hypermedia
`
`13
`
`Advantages
`
`Disadvantages
`
`1. Deictic reference and combination
`of action and reference
`
`1. Reference depends on location of
`referent
`
`2. Author completely specifies the
`
`
`representation of the entity
`
`
`
`2. (a) User might prefer another rep-
`resentation and (b) no other repre-
`sentation possible
`
`
`
`
`
`
`
`
`
`
`
`3. No referential ambiguity
`
`4. Generally gives immediate feed-
`back that user *5 action was under-
`
`stood
`
`3. Vocabulary of references limited
`to those with visible links
`
`4. Difficult to express complex acts
`
`user can refer. Users cannot describe entities (Cohen, 1992) instead of pointing to them.
`
`Similar problems exist with respect to actions. Because they are typically accomplished
`
`by selecting a command from a menu or by clicking on an icon, it is difficult to express
`
`complex actions other than as a perhaps tedious series of primitives. The advantages and
`
`disadvantages of mouse-based interfaces for hypermedia are summarized in Table 3.1.
`
`3.2 Spoken-language-based interfaces to hypermedia: advantages and
`
`disadvantages
`
`Many of the advantages and disadvantages of spoken-language-based interfaces for
`
`hypermedia turn out to be complements of those for mouse-based interfaces. From the
`
`user‘s standpoint, the ability to refer to an entity no longer depends on the location of its
`
`graphical representation.
`
`Indeed, all referents are potentially available because the user can simply say the name
`
`of the referent without having to see it displayed. A related advantage is that the user can
`
`now have a number of different ways in which to refer to entities. Similarly, multiple
`
`action primitives could easily be combined into a single complex action that could include
`
temporal and other sophisticated concepts that are not expressible in mouse-based inter-
`
`faces. Another advantage is that the user’s hands are freed for other activities. Indeed, it
`
`might be possible to build a spoken-language-only interface to hypermedia that could
`
`serve users by telephone instead of requiring a GUI.
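To make this concrete, the sketch below shows how a browser extension might gather a page's link labels into the recognizer's active vocabulary. The structure and function names are hypothetical (SLAM's actual HTML parsing, discussed in Chapter 5, builds on htmlchek), and the sketch also flags the label-collision problem taken up next: a label shared by two links is unambiguous to a mouse click but ambiguous to speech.

    /* Sketch: turn a page's link labels into a speech vocabulary,
     * flagging labels shared by more than one link.  A shared label is
     * unambiguous to a mouse click on a specific anchor, but ambiguous
     * to speech, where the recognizer hears only the label.
     * Hypothetical types and names, for illustration only. */
    #include <stdio.h>
    #include <string.h>

    struct link {
        const char *label;   /* anchor text the user would speak */
        const char *href;    /* destination URL */
    };

    static void build_vocabulary(const struct link *links, int n)
    {
        int i, j;
        for (i = 0; i < n; i++) {
            int seen_before = 0, shared = 0;
            for (j = 0; j < n; j++) {
                if (j != i && strcmp(links[j].label, links[i].label) == 0) {
                    shared = 1;
                    if (j < i)
                        seen_before = 1;   /* label already emitted */
                }
            }
            if (!seen_before)
                printf("%s%s\n", links[i].label,
                       shared ? "  (ambiguous)" : "");
        }
    }

    int main(void)
    {
        /* Hypothetical page with an overused "here" label, as in the
         * pages of Figures 4.1 and 4.2. */
        struct link page[] = {
            { "here",         "http://www.example.com/a.html" },
            { "publications", "http://www.example.com/pubs.html" },
            { "here",         "http://www.example.com/b.html" },
        };
        build_vocabulary(page, 3);
        return 0;
    }

Running this on the three-link example prints "here  (ambiguous)" followed by "publications", which is exactly the situation behind the overused "here" links shown in the figures.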
`
`Speech input to hypermedia also has characteristic disadvantages that are often recip-
`
`rocal consequences of its advantages. For example, because references no longer depend
`
`on physical location, references may become ambiguous: a “hotlink" may be uniquely
`
accessible via the mouse but ambiguously accessible via speech because another hotlink
`
`might have the same label. This strongly suggests that designers of hypermedia int