`GOOGLE EXHIBIT 1027
`
`Page 1 of 22
`
`
`
`
`
The BT Technology Journal is a quarterly periodical of technical papers published by British Telecommunications plc to promote an awareness, among workers in similar fields world-wide, of the Research and Development undertaken by BT in telecommunications and related sciences.
`
`EDITORIAL BOARD
`
`G White BSc PhD CEng FIEE SMIEEE, Chairman
J R W Ames MSc CEng MIEE
E L Cusack BSc PhD
I G Dufour EurIng CEng FIEE
P G Flavin CEng MIEE
P W France MSc PhD
J R Grierson MA PhD CEng FIEE
R C Nicol PhD CEng FIEE
S G Stockman BA PhD
A M Jell BA, Editor
D N Clough MA, Assistant Editor
`
Enquiries to the Editor: (01473) 623232, facsimile (01473) 620915, e-mail: bttj@ipswich.sac.co.uk
Internet access to the BT Laboratories information pages is available on: http://www.labs.bt.com
`
Unless otherwise stated, copyright of the papers appearing in the Journal is reserved by British Telecommunications plc. The views of contributors are not necessarily those of the Editorial Board, do not necessarily represent BT policy, nor reflect an endorsement for any commercial products.
`
The BT Technology Journal is distributed by Chapman and Hall, 2-6 Boundary Row, London SE1 8HN, UK.
`
The Journal is published four times per year in January, April, July and October. Subscription prices for 1996 are: print + Internet access: $214 (USA/Canada) £126 (EU) £140 (all other countries); print only: $185 (USA/Canada) £108 (EU) £122 (all other countries). Subscription prices for individuals are (print only): $95 (USA/Canada) £54 (EU) £54 (all other countries). Individual subscriptions must be paid for by personal cheque or credit card.
`
Any payment in US$ should be made to Routledge, Chapman and Hall Dollar Account: 051-70700-4, Barclays Bank New York Ltd., 300 Park Avenue, New York, NY 10022, USA.
`
Subscription rates to USA include airfreight to New York and second class postage thereafter. All other territories outside UK and Europe will be served by accelerated surface post.
Second class postage paid at Rahway, NJ. Postmaster: send address corrections to The BT Technology Journal, c/o Mercury Airfreight International Ltd Inc., 2323 Randolph Avenue, Avenel, NJ 07001, USA (US mailing agent).
`
All subscription enquiries should be made to Chapman and Hall Subscriptions Department.
All enquiries concerning editorial matters should be made to the Editor, SAC Technographic Ltd, 38 Anson Road, Martlesham Heath, Ipswich, Suffolk IP5 7RG (for voice, fax, e-mail details, see above).

BT Laboratories
Martlesham Heath Ipswich Suffolk England IP5 7RE
`
`Page 2 of 22
`
`
`
`
`
`
BT Technology Journal

Vol. 14 No. 1 January 1996
`
`THEME
`
Speech technology for telecommunications
`
`Foreword by C Wheddon
`
`Editorial by F A Westall, R D Johnston and A V Lewis
`
F A Westall, R D Johnston and
A V Lewis
`
`Speech technology for telecommunications
`
Speech is the easiest, most expressive and most natural means of human communication. Most of us have received intensive training in using it from the day we were born! But speech is more than just a way of transmitting words or ideas — it conveys the essence of human emotion, moods, and personality. It is BT's core business, accounting for over 90% of revenues. It is also our primary means to access the 26 million customers of the UK telephone networks, and to around a half a billion telephone users world-wide. This paper introduces the key speech technologies, described in detail in the associated papers in this issue, and makes some personal predictions about future trends and challenges in this important, exciting and far-reaching field.
`
W T K Wong
`
`Low rate speech coding for telecommunications
`
`28
`
Over the last decade major advances have been made in speech coding technology which is now widely used in international, digital mobile and satellite networks. The most recent techniques permit telephone network quality speech transmission at 8 kbit/s, but there are still demands for even lower rates and more flexible, good quality coding techniques for various network applications. This paper reviews the developments so far, and describes a new class of speech coding methods known as speech interpolation coding which has the potential to provide toll-quality speech coding at or below 4 kbit/s.
`
`P A Barrett, R M Voelcker and
`A V Lewis
`
Speech transmission over digital mobile radio channels
`
`45
`
The design of a speech channel for digital mobile radio applications is a trade-off between the key performance dimensions of speech quality, robustness to errors, delay, complexity and bit rate. An appropriate balance is often difficult to achieve, but is vital to customer satisfaction. This paper identifies the considerations in selecting a speech codec for mobile telephony applications, outlines techniques for robust and efficient speech transmission over a digital mobile radio channel and discusses how the resulting performance can be assessed. Throughout the paper, the half-rate GSM digital mobile radio system is used as an example.
`
`Page 3 of 22
`
`BT Technol J Vol 14 No 1 January 1996
`
`
`
`
Spoken language systems — beyond prompt and response

P J Wyard, A D Simons, S Appleby, E Kaneen, S H Williams and K R Preston

Spoken language systems allow users to interact with computers by speaking to them. This paper focuses on the most advanced systems, which seek to allow as natural a style of interaction as possible. Specifically this means the use of continuous speech recognition, natural language understanding to interpret the utterance, and an intelligent dialogue manager which allows a flexible style of 'conversation' between computer and user. This paper discusses the architecture of spoken language systems and the components of which they are made, and describes both a variety of possible approaches and the particular design decisions made in some systems developed at BT Laboratories. Three spoken language systems in the course of development are described — a multimodal interface to the BT Business Catalogue, an e-mail secretary which can be consulted over the telephone network, and a multimodal system to allow selection of films in the interactive TV environment.
`
`
`
1. Introduction
`
No science fiction image of the future is complete without the ever-present personable computer which can understand every word said to it. In spite of these popular media images, the goal of completely natural interaction between humans and machines is still some way off.
`
Interactive voice response (IVR) systems, which provide services over the telephone network, have been available since the mid-1980s. Initially they were restricted to interactive TouchTone® input with voice providing the response to the user. The use of such services was therefore limited to the population with TouchTone keypads. More recently applications using automatic speech recognition (ASR) have been developed. These often simply allow the option of spoken digit recognition as an alternative to keypad entry, thus allowing the service to be launched even in areas where TouchTone penetration is poor. Moving on from such systems the words which are spoken can be matched to the service. This allows these ASR-based services to be more user-friendly than their TouchTone counterparts because the user can directly answer the question: 'Which service do you require?' with 'weather' or 'sport' rather than 'for weather press 1, for sport press 2', etc. However, they still rely on selection from a predetermined menu of items at any point in the dialogue.

More sophisticated services are now becoming possible using emerging larger vocabulary speech recognition technology. However, it is not sensible to simply extend the menu-based approach to accommodate larger vocabularies.
`
`Page 4 of 22
`
Although well-engineered simple applications may be easy to use, more advanced services are likely to have complicated menu structures. If information can only be provided one item at a time, using a 'prompt and response' dialogue, rigid interaction styles may steer the user through a complex dialogue. This can result in the user becoming lost, or ending up with the wrong information. These problems are particularly significant for inexperienced users. On the other hand, experienced users may become bored by the large number of responses needed when they know exactly what they want. The menu-based structure required by systems which rely on isolated word input is often the limiting factor for new services. This limitation of the user interface is one of the greatest barriers to the usability of many IVR services.
`
Moving beyond the menu-style interaction towards conversational spoken language will allow users to express their requirements more directly and avoid tedious navigation through menus. This approach will also allow the user to take control of the interaction rather than using the more common 'prompt and response' dialogue.

BT is interested in the development of spoken language systems (SLS) to provide a key competitive advantage. SLSs allow users to interact with computers using conversational language rather than simply responding to system prompts with short or one word utterances. With the rapid increase in competition, service differentiation becomes a key factor in gaining market share. Systems
`
`
`
`SPOKEN LANGUAGE SYSTEMS
`
which allow users 24-hour remote access to information provide a very useful service for people who are in different time zones, or away from their office, or who need information immediately during unsocial hours. SLSs can be used to automate such services and also those which currently require human operators, thus freeing their time to deal with difficult situations where more complex, or more personalised advice is needed.
`
Current trends in information networking and the phenomenal growth of the Internet bring their attendant problems for our customers in keeping up with technology, finding what they need, and using information to their best advantage. Spoken language system technology can greatly enhance our customers' ease of access to information, thus increasing network revenue through new and increased usage. Systems which combine several modes of input and output, such as speech, graphics, text, video, mouse-control, touch and virtual reality, are known as multimodal spoken language systems. These allow far greater freedom of expression for users who, as a result, should feel more comfortable and less as though they are 'talking to a computer'. They are able to point, use gestures, speak, type; whatever comes most naturally to them. Spoken language systems will become increasingly important in the near future as progress in technology becomes more widely available.

The goal is to be able to build systems which are not restricted only to those motivated users who are prepared to spend time learning the language the machine understands. These new systems can be used by anyone who wants occasional access to a particular service. They will also help the user successfully gain the information or service they require by simply calling a number and asking for what they want. In fact, the aim is to put back some of the intelligence which existed in the network 50 years ago when a user simply lifted the handset and asked to be connected to the service or number required.

This paper discusses the design and implementation of spoken language systems and is organised as follows. Section 2 gives an outline of the architecture of an SLS. Section 3 discusses some of the systems currently under development at BTL. These include a multimodal system for access to the BT Business Catalogue, a speech-in/speech-out system for remote e-mail access and a system for accessing information about films. Section 4 describes the components of an SLS in some detail, giving concrete examples from current systems. Section 5 discusses future work which needs to be carried out to improve the quality and usability of SLSs, and section 6 draws some conclusions.

2. System overview

This section outlines a typical spoken language system architecture, from the information processing point of view (platform and inter-process communication issues are not dealt with to any great extent in this paper). The architecture and the key processing components are outlined.

The most basic form of SLS, a speech-in/speech-out (rather than multimodal) system, requires at least the following major components (described briefly below and in more detail in section 4).

•	Speech recognition — to convert an input speech utterance to a string of words.

•	Meaning extraction — to extract as much of the meaning as is necessary for the application from the recogniser output and encode it into a suitable meaning representation.

•	Database query — to retrieve the information specified by the output of the meaning extraction component. Some applications (e.g. home banking) may require a specific transaction to occur. Many applications may be a mixture of database query and transaction processing.

•	Dialogue manager — this controls the interaction or 'dialogue' between the system and the user, and co-ordinates the operation of all the other system components. It uses a dialogue model (generic information about how conversations progress) to aid the final interpretation of an utterance. This may not have been achieved by the 'meaning extraction' component, because the interpretation relies on an understanding of the conversation as a whole.

•	Response generation — to generate the text to be output in spoken form. Information retrieved by the database query component will be passed to the response generation component, together with instructions from the dialogue manager about how to generate the text (e.g. terse/verbose, polite/curt, etc).

•	Speech output module (text-to-speech synthesis or recorded speech).

At its simplest, processing consists of a linear sequence of calls to each component, as shown in Fig 1. A typical output of each stage from an application which accesses the BT Business Catalogue is shown. It is not necessary to understand the output of the 'meaning extraction' component in detail to realise that meaning extraction can be a non-trivial exercise. The simple linear sequence shown in Fig 1 is, in general, too inflexible. It is better if the dialogue manager is given greater control, to call the other components in a flexible order, according to the results at each stage.
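The linear sequence of component calls described above can be sketched in miniature as follows. The function bodies and the toy product data are invented stand-ins for illustration only, not BT's implementation:

```python
# Toy sketch of a linear SLS pipeline (hypothetical names and data).
PRODUCTS = {"Duet 50": 30, "Duet 80": 45, "Duet 100": 60}  # name -> price in pounds

def speech_recognition(utterance_audio):
    # Stand-in: a real recogniser would map audio to a word string.
    return "which phones cost less than the duet 100"

def meaning_extraction(words):
    # Encode the recognised words as a simple meaning representation.
    return {"type": "productQuestion", "constraint": ("price_less_than", "Duet 100")}

def database_query(meaning):
    # Retrieve the products satisfying the extracted constraint.
    _, reference = meaning["constraint"]
    limit = PRODUCTS[reference]
    return [name for name, price in PRODUCTS.items() if price < limit]

def response_generation(results):
    # Turn the retrieved list into output text for the speech module.
    return "The " + " and the ".join(results)

def pipeline(audio):
    return response_generation(database_query(meaning_extraction(speech_recognition(audio))))

print(pipeline(None))  # The Duet 50 and the Duet 80
```

In a real system each stage would of course be far richer, but the strict one-way hand-off between stages is exactly what makes this architecture inflexible.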
`
`
`Page 5 of 22
`
`
`
Fig 1	Example of a linear process flow in a spoken language system. (The figure traces the query 'Which phones cost less than the Duet 100' through speech recognition, meaning extraction into a productQuestion meaning representation (find all phone products P1 whose price is less than the price of the Duet 100), database query and response generation, producing the answer 'The Duet 50 and the Duet 80'.)

This leads to an architecture of the type shown in Fig 2.

Fig 2	Role of a dialogue manager in a spoken language system. (The dialogue manager sits at the centre, connected to the speech recognition, meaning extraction, database query and response generation components.)

The need for this more flexible architecture is illustrated by the processing sequence in Fig 3, which shows the dialogue manager as control centre, calling each component in an order determined by the results of processing at each stage. Although every processing stage is passed through the dialogue manager, this is not included in the sequence unless some non-trivial decision or action is taken. The example given in Fig 3 is largely driven by limitations of the recogniser, but the need for this sort of flexible architecture goes far beyond this. It will eventually enable the dialogue manager to act in an intelligent manner, co-ordinating the components and combining their outputs in a nonlinear manner.

So far in this section, the discussion has covered speech-in/speech-out systems. However, systems such as the BT Business Catalogue access system (see section 3.1) are multimodal and require a screen and a means of inputting text and mouse clicks and outputting text and graphics. These components must be added to the architecture shown in Fig 2, and the dialogue manager and response generator must be upgraded to deal with the extra modalities. However, most of the discussion of this section applies equally to multimodal systems.

3. Example systems

In this section three spoken language systems under development at BT Laboratories are described:

•	access to the BT Business Catalogue, known as BusCat — this was the first multimodal continuous speech input spoken language system,
`
`
`Page 6 of 22
`
`
`
`
•	an e-mail access system, which is speech in/speech out only, but has the conversational features described in this paper — it is also a dial-up service over the telephone network,

•	a film access system, in which users will be able to select films and videos using continuous speech and button pushes on a remote control handset — this system is targeted at the interactive TV environment.
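The nonlinear, dialogue-manager-centred control flow discussed in section 2 (and traced in Fig 3) can be sketched as a loop that inspects each component's result before deciding what to call next. The confidence threshold, slot names and toy extraction rules below are illustrative assumptions, not the actual BT architecture:

```python
# Hypothetical sketch: the dialogue manager as control centre (cf. Fig 3).
def meaning_extraction(words):
    # Toy extraction: look for a product class and a colour in the word string.
    classes = {"phones": "phones", "telephones": "phones"}
    found = next((c for w, c in classes.items() if w in words.split()), None)
    return {"product_class": found, "colour": "blue" if "blue" in words else None}

def dialogue_manager(turns):
    """Each turn is (recognised_words, confidence); returns the system responses."""
    pending = {}       # partially filled meaning representation, kept across turns
    responses = []
    for words, confidence in turns:
        if confidence < 0.5:
            # Recogniser unsure: reprompt rather than calling meaning extraction.
            responses.append("I did not understand that - please repeat")
            continue
        meaning = meaning_extraction(words)
        # Combine this semantic representation with the previous one.
        pending.update({k: v for k, v in meaning.items() if v is not None})
        if pending.get("product_class") is None:
            # Unresolved product class: ask a clarifying question.
            responses.append("What type of product do you require?")
            continue
        # Enough information gathered: a database query could now be made.
        responses.append(f"We have the following blue {pending['product_class']}: ...")
    return responses

turns = [("which phones come in blue", 0.2),   # low confidence -> reprompt
         ("which ones come in blue", 0.9),     # class misrecognised -> clarify
         ("telephones", 0.9)]                  # combined with the previous turn
print(dialogue_manager(turns))
```

The point of the sketch is that the order of component calls is decided at run time from intermediate results, rather than being fixed in advance as in the linear pipeline.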
`
Fig 3	Nonlinear process flow in spoken language systems. The figure traces the following exchange between the user, the dialogue manager and the other components:

User: 'Which phones come in blue?' The speech recogniser returns with a low confidence that it made a satisfactory recognition, so the dialogue manager tells the response module to prompt the user for repeat input; the response module generates and outputs 'I did not understand that - please repeat'.

User: 'Which phones come in blue?' The recogniser outputs 'Which ones come in blue?' (one word misrecognised), and meaning extraction produces a semantic representation which contains an unresolved product class. The dialogue manager realises that it cannot interpret the question, so it tells the response module to tell the user that it is missing information about the product class; the response module generates and outputs 'What type of product do you require?'

User: 'Telephones'. The recogniser outputs 'Telephones', and meaning extraction produces a semantic representation of the word telephones. The dialogue manager combines this semantic representation with the previous one - it now realises that it has sufficient information to make a database query. The database query returns with a list of the blue phones, and the response module generates and outputs 'We have the following blue phones: the Relate 100, the Relate 200 and the Duet 100'.

3.1	BusCat

The SLS BusCat provides direct access to a subset of the BT Business Catalogue, which covers a range of products such as telephones, answering machines and phone systems. The user has a screen displaying a Netscape WWW browser and speech input/output facilities. All the normal WWW browser features are present, such as the ability to click on links to other pages, and a display consisting of mixed text and graphics (see Fig 4).

Fig 4	The SLS BusCat system in use.

Additionally, in this system users may use continuous speech input, type questions into a free-text window, and listen to speech output generated by a text-to-speech (TTS) system. This multimodal interface enables users to request specific information about the products in the catalogue, or to browse through the catalogue.

In addition to its internal knowledge bases, the system has the capability to access external databases across a network. One application for this might be to provide a multimodal interface for such databases. Another is to allow the internal knowledge bases to be periodically updated from an external database.

The speech recogniser used is BT's Stap recogniser [1], and the text-to-speech system is BT's Laureate [2] system.

The overall structure of the system is shown in Fig 5. The system can cope with multiple simultaneous users.

The example in Table 1 gives a flavour of what it feels like to interact with the system. Here the user is already logged on to the system. From each WWW page there is a choice of:
`
`
`Page 7 of 22
`
`
`
`
•	speaking to the system,

•	clicking on a link,

•	typing into the free-text field.

Fig 5	Architecture of BusCat. (A Prolog database of dialogue information, holding the products and services, domain knowledge, the current query template, current user preferences and the dialogue history, is linked to the meaning extraction, dialogue manager and response generation components, with a WWW browser, speech recogniser and speech output module at the user end.)

Table 1	An example session with BusCat.

User input: 'What is on-hook dialling?'
System response: Textual (and optionally spoken) explanation of on-hook dialling: 'Time spent waiting for someone to answer the phone can often be lost time. But with this feature, you can dial without picking up the phone handset, leaving you free to carry on with something else until the second your call connects,' and a list of five phones which have this feature: Vanguard 10e, Relate 200, Relate 300, Relate 400, Converse 300.

User input: 'Which phones have on-hook dialling and cost less than 60 pounds?'
System response: Text: 'The following products meet your requirements,' and a list of four phones, each with a small picture, a short description and a price (Vanguard 10e, Relate 200, Relate 300, Converse 300).

User input: 'Which ones come in grey?'
System response: Text: 'The following products meet your requirements,' and a list of three phones, each with a small picture, a short description and a price (Vanguard 10e, Relate 200, Relate 300).

User input: The user clicks on the link next to the picture of the Relate 200.
System response: The system responds with a large picture of the Relate 200, a full description including all its features and a price.

In the interaction the user wants to know what on-hook dialling is. Having received an explanation of this feature, he decides he wants a phone with on-hook dialling which costs less than £60. Then he remembers he also wants it in grey to match his living room. He finally selects the Relate 200 telephone.

3.2	E-mail access

BT is very interested in the mobile telephony market. Speech-only natural language systems are very attractive to this market because people want to be able to keep in touch while on the move. They are likely to only have a mobile phone with which to do so. Speech access to information can have an added safety advantage over TouchTone interfaces as well as helping to improve the usability of the service.

E-mail has been chosen as the vehicle for producing a speech in/speech out natural language system because the information source (the user's own e-mail) is readily available. An existing system allows selected users to access their e-mail using TouchTone commands, thus providing information on whether natural language access confers real benefits to the users. It will certainly enable queries which are not possible currently, such as: 'Do I have any messages from John?'
`
`Page 8 of 22
`
The main differences between the e-mail service and the BusCat service are as follows. The e-mail service uses telephony speech only, rather than a multimodal system. The dialogue is more robust to errors and will allow the user or the system to take the initiative, whereas the Business Catalogue system is based on the user taking the initiative. The e-mail system architecture is being designed to decouple domain-independent and domain-dependent parts of the system, and the same approach is being used for the
`
`
`
`
`
`film access system (see section 3.3). This means that in the
`future there will be a good level of confidence in being able
`to produce spoken language systems for new domains more
`quickly than at present.
`
`An example of a dialogue the system might have with
`the user is shown in Table 2.
`
Table 2	An example session with an e-mail service.
`
User: (dials up system)
System: Good morning Alison. You have ten new messages and 20 old messages. Your first new message is from Mike Harper, regarding, Thanks for demo.
Note: The default startup is to tell the user how many old and new messages there are and to start listing the new ones.

User: Read it.
System: Message is: 'Thank you for your demo yesterday. The customers were most impressed.'

User: Forward it to Peter Smith and Sandra
System: Sandra Jones is already on the distribution list. Forwarding message to Peter Smith. Next message is from Colin North, regarding, Team Meeting.
Note: Resolution of partially specified name. Check on preconditions for forwarding. Does not confirm on forwarding a message.

User: Forward to Language Group
System: Forwarding message to all the people on your distribution list called 'Language Group'. Next message is from Laurence Cooper: Plans for Workshop.
Note: System assumes you want to carry on listing new messages unless you tell it otherwise.

User: Let's hear it.
System: Message is: 'Hi Alison, Have you completed the plans yet?'

User: Reply
System: I have four possible responses, yes, OK, no, or I'll call you.
Note: Choice of several stock replies.

User: Tell him yes
System: Reply to Laurence Cooper: Text is 'Yes'. Please confirm.
Note: Confirms before sending a reply.

User: OK
System: Reply sent. Next message is from ...

User: Stop.
System: ... OK. What would you like?
Note: Detection of user speaking terminates prompt. Listing new messages ended by user.

User: Have I any recent messages from Steve Brown or Steve Clark?
System: In the last week you have received 5 messages from Steve Brown and 4 from Steve Clark.
Note: Slightly more complex user query. System makes explicit what it thinks of as 'recent'.

User: List the ones from Clark.
System: Message 1 is about Meeting on the 31st. Message 2 is about Agenda for team meeting.
Note: Resolution of partially specified name.

3.3	Film access for interactive multimedia services

Interactive multimedia services (IMS) are currently being implemented by BT in Colchester and Ipswich [3]. The usability of the interface is still constrained to navigation using a TV remote control. This means that multiple layers of menus need to be traversed to get to the information required. Spoken language access would allow users to go straight to the information they are searching for, without requiring them to learn complex navigation procedures.

The video-on-demand subset of the IMS, which consists of over 4000 hours of material, including films, educational programmes, children's programmes, etc, was chosen. The SLS will allow users to give instructions such as: 'I want a comedy film starring Harrison Ford'. Part of the benefit of developing such a system is to ensure that the generic SLS framework is truly domain independent.

There is currently a text-based interface to the Internet movie database [4]. This allows users to enter queries such as: 'Tell me the ratings of comedy movies starring Harrison Ford'. The system performs the meaning extraction using a caseframe parser (section 4.2). This allows it to pick out the salient information from among extraneous words. It seems likely, from human analysis of typical queries about films, that this method is suitable.

An issue yet to be addressed is how best to reconcile the advantages of using speech with the limitations of current recognition technology. This is clearly illustrated in the present example, since the text-based interface can query the database of over 50 000 films and 100 000 cast names. No speech recogniser yet built can cope with this range of vocabulary. The obvious solution is to restrict the size of the database. A possible step in the right direction would be to couple the 'meaning extraction' component and recogniser much more closely, so that meaning extraction and recognition happen simultaneously. This might enable the recogniser to cut down the vocabulary size 'on the fly'. For example, given the input sentence: 'Which comedy movies star Burt Lancaster,' it could be established straightaway that the user was talking about comedies, then only about cinema films, and finally that the user was only interested in an actor. Therefore, by the time the recogniser gets to the name 'Burt Lancaster,' the number of possible words has reduced considerably.

This is the subject of further research and is discussed in more detail in the next section.
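The caseframe-style meaning extraction used for film queries can be illustrated with a toy parser that fills genre and actor slots while ignoring extraneous words. The slot lexicons are invented for the example and bear no relation to the real movie database:

```python
# Toy caseframe-style slot extraction for film queries (illustrative lexicons only).
GENRES = {"comedy", "thriller", "western"}
ACTORS = {"harrison ford", "burt lancaster"}

def parse_film_query(text):
    """Fill a small caseframe from free text, skipping extraneous words."""
    text = text.lower()
    frame = {"genre": None, "actor": None}
    for genre in GENRES:
        if genre in text:
            frame["genre"] = genre
    for actor in ACTORS:
        if actor in text:
            frame["actor"] = actor
    return frame

print(parse_film_query("I want a comedy film starring Harrison Ford"))
# {'genre': 'comedy', 'actor': 'harrison ford'}
```

A filled genre slot is also exactly the kind of constraint that could be fed back to the recogniser to narrow its active vocabulary 'on the fly', as suggested above.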
`
`
`Page 9 of 22
`
`
`
`
`4. Components of a spoken language system
`
4.1	Speech recognition
`
The job of a speech recogniser is typically thought of as converting a speech utterance into a string of text. The internal workings of speech recognisers are explained in some depth elsewhere [5]. This section looks at the recogniser's place within an SLS, and, in particular, at the language model (LM) the recogniser uses and the form of output that it provides.
`
A language model embodies information about which words or phrases are more likely than others at a given point in a dialogue.
`
One might imagine that in a system that accepts fluent language, for example an automated travel agent, the speech recogniser might need only one language model, that of the entire English language. It could then recognise anything that anyone said to it (assuming they are speaking English) and could inform the dialogue manager accordingly. Speech recognition is not yet accurate enough and a model of the entire English language does not exist. Instead, to get a working system, the recogniser must be given as much help as possible. It must be given hints about what the user is likely to say next to improve the chances of correctly recognising what has been said. If the dialogue manager knows that the customer wants to go on a cruise and has just asked them where they would like to go, it should prime the recogniser to be expecting a response that may well concern one of a number of specified cruise ports and, by the same token, is unlikely to have anything to do with backpacking in Nepal.
`
The remainder of this section discusses:

•	language models,

•	perplexity of a language model,

•	advantages and disadvantages of language models,

•	loading language models into the recogniser,

•	output from the recogniser.
`
4.1.1	Language models for the recogniser
`
The primary knowledge source for the speech recognition component is a set of statistical models, known as hidden Markov models or HMMs, which encode how likely a given acoustic utterance is, given a string of spoken words. A recogniser can decode a speech utterance purely on the basis of this acoustic-phonetic knowledge, and this is basically what happens in the case of single isolated-word recognition. However, in the case of recognising a string of words (which form part of a spoken language), the recogniser can use a second knowledge source, namely the intrinsic probability of the given string. This second knowledge source is known as the language model.
`
To take a classic example, a given utterance may have almost equal acoustic-phonetic probabilities of being 'recognise speech' or 'wreck a nice beach'. However, the intrinsic probability of the first string is likely to be higher than that of the second, particularly if this utterance came from the domain of a technical journal on speech technology.
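The preference for 'recognise speech' over 'wreck a nice beach' can be made concrete with a toy bigram language model. The probabilities below are invented purely for illustration and stand in for counts gathered from a hypothetical speech-technology corpus:

```python
import math

# Invented bigram probabilities ("<s>" marks the start of a sentence).
BIGRAM_PROB = {
    ("<s>", "recognise"): 0.01, ("recognise", "speech"): 0.5,
    ("<s>", "wreck"): 0.001, ("wreck", "a"): 0.1,
    ("a", "nice"): 0.05, ("nice", "beach"): 0.02,
}

def log_prob(sentence, floor=1e-6):
    """Sum of log bigram probabilities; unseen bigrams get a small floor value."""
    words = ["<s>"] + sentence.split()
    return sum(math.log(BIGRAM_PROB.get(bigram, floor))
               for bigram in zip(words, words[1:]))

# The two acoustically similar strings get very different language-model scores.
print(log_prob("recognise speech") > log_prob("wreck a nice beach"))  # True
```

Even with equal acoustic scores, the language-model term is enough to tip the decision towards the in-domain word string.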
`
`This can be expressed mathematically as follows. Let X
`be the acoustic utterance and let S be the sentence t
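The combination of the two knowledge sources is usually written as the following Bayes decomposition (stated here as general background, using the X and S of the text, rather than as this paper's own continuation):

```latex
\hat{S} \;=\; \arg\max_{S} P(S \mid X) \;=\; \arg\max_{S} P(X \mid S)\, P(S)
```

where P(X | S) is supplied by the acoustic (HMM) models and P(S) by the language model.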