WO1999003092A2

WO1999003092A2 - Modular speech recognition system and method

Info

Publication number: WO1999003092A2
Application number: PCT/US1998/012723
Authority: WO
Inventors: Arthur Gerald Herkert; Oleg Andric; Lu Chang; Gil Alterovitz
Original assignee: Motorola Inc.
Priority date: 1997-07-07
Filing date: 1998-06-18
Publication date: 1999-01-21
Also published as: WO1999003092A3; AU7978098A

Abstract

A modular speech recognition system (201) includes a bulk memory (230) that stores a plurality of specialized vocabulary databases (231), a front end processor (305) that generates a first set of feature vectors based on an analysis of a first set of sampled speech data received during a call, a local memory (370) that stores a specialized vocabulary database (371), and a recognition processor (355) that accepts the first set of feature vectors and generates a recognition result based on the first set of feature vectors and the specialized vocabulary database. The specialized vocabulary database (371) is a copy of one of the plurality of specialized vocabulary databases (231) stored in the bulk memory (230). The specialized vocabulary database (371) is selected from the plurality of specialized vocabulary databases (231) in response to information associated with the call.

Description

MODULAR SPEECH RECOGNITION SYSTEM AND METHOD

Field of the Invention

This invention relates in general to speech recognition systems and in particular to a speech recognition module that is used in a speech recognition system that handles interactive calls.

Background of the Invention

Until recently, pagers typically received and stored only numbers. However, developments in pager technology, including the introduction of alpha pagers (which allow for text data), have brought up an issue of a lack of a universal means to accept text messages from a caller for communication to pagers. Some solutions have been provided, notably WordSender™ and electronic mail paging servers. Unfortunately, these are not ubiquitous solutions that are applicable in every situation and accessible to the ordinary user. Telephone networks, if they could be utilized for this purpose, would present an acceptable universal solution, as evinced by their wide availability in the home, office, and many public facilities. The current approach to offering alpha messaging service by voice telephone communication typically requires service providers to deploy telephone operators to handle alpha paging requests. The necessity of human operators constitutes a significant portion of the costs associated with alpha paging and greatly limits the market segment to which the product may be targeted.

A possible solution to the problem of needing human operators is the use of speech recognition technology incorporating a dialog response system. To be useful in typical paging systems, the dialog response system must be capable of handling several, and often many, telephone calls simultaneously. Informative and effective dialogs that do not confuse or frustrate the user are required. Usually, an effective dialog system is not a generic one, but rather is dependent on a number of factors such as the subject matter and its associated jargon as well as the caller's geographic location, socio-economic background, linguistic considerations, and cultural expectations. Thus, a customizable system can be effective in increasing the robustness of user interfaces and data entry, while a generic system will likely not be appropriate in many situations.

The design of speech recognition systems typically includes trading off the parameters of speed of recognition, accuracy of recognition, and the amount of local memory that is used by the program performing the speech recognition. (Local memory is memory directly accessible on a high speed processor bus, as opposed to bulk memory such as hard disk memory.) Programs performing speech recognition have both an executable portion and a database portion. Both portions of the program must typically be located in memory directly accessible by a processor, such as random access memory (RAM), read only memory (ROM), or electrically programmable read only memory (EPROM). Such memory is needed in order to provide a speed of recognition that is sufficiently rapid for interactive speech recognition systems, while also being sufficiently accurate. When the executable portion and/or the database portion are made available in media normally used for distribution, such as compact disk read only memory (CD ROM) or floppy disk, sufficient local memory such as RAM is typically provided into which to copy the portion made available on such media. Although RAM memory costs have been declining, the cost of memory in speech recognition systems is still a predominant factor in providing a robust speech recognizer because the database portion of typical speech recognizers is large, particularly for those speech recognizers that are designed to be able to recover speech that encompasses anything more than a small set of jargon. As the number and size of vocabulary models and networks grow, the processor time needed to determine that a certain vocabulary model matches the actual speech pattern increases. Constraining the number and size of models places a limitation on the accuracy of the speech recognition algorithm. Thus, there appears to be an implied compromise among factors such as speed, accuracy, and the quantity of models that can be supported.

Thus, what is needed is a technique to provide a speech recognition system that can provide customized dialog responses for several telephone lines in an economical and efficient manner.

Brief Description of the Drawings

FIG. 1 shows an electrical block diagram of portions of a fixed portion of a radio communication system, in accordance with the preferred and alternative embodiments of the present invention.

FIG. 2 shows an electrical block diagram of a modular speech recognition system and other portions of a paging controller used in the radio communication system, in accordance with the preferred and alternative embodiments of the present invention.

FIG. 3 shows a more detailed electrical block diagram of the modular speech recognition module, in accordance with the preferred embodiment of the present invention.

FIG. 4 shows a flow chart that describes a method used in the paging controller for providing interactive dialog during one or more telephone calls, in accordance with the preferred and alternative embodiments of the present invention.

Detailed Description of the Drawings

Referring to FIG. 1, an electrical block diagram of portions of a fixed portion of a radio communication system 100 is shown in accordance with the preferred and alternative embodiments of the present invention. The fixed portion of the radio communication system 100 comprises conventional telephones 111 connected through a conventional switched telephone network (STN) 112 by conventional telephone links 113 to a system controller, which in this example is a paging controller 114. The paging controller 114 oversees the operation of a paging radio fixed network 116, typically comprising a plurality of radio frequency transmitter/receivers coupled to the paging controller 114 by conventional telephone links 115. The paging controller 114 encodes and decodes inbound and outbound telephone addresses into formats that are compatible with land line message switch computers. The paging controller 114 also functions to encode and schedule outbound messages, which can include such information as analog voice messages, digital alphanumeric messages, and response commands, for transmission by the radio frequency transmitter/receivers to a plurality of selective call radios (not shown in FIG. 1). The paging controller 114 further functions to decode inbound messages, including unsolicited and response messages, received by the radio frequency transmitter/receivers from the plurality of selective call radios.

It should be noted that the paging controller 114 is capable of operating in a distributed transmission control environment that allows mixing conventional cellular, simulcast, satellite, or other coverage schemes involving a plurality of radio frequency transmitter/receivers and conventional antennas, for providing reliable radio signals within a geographic area as large as a worldwide network. Moreover, as one of ordinary skill in the art would recognize, the telephonic and selective call radio communication system functions may reside in separate system controllers 114 which operate either independently or in a networked fashion.

It will be appreciated that the selective call radios are of several types of radios, including two way pagers, conventional mobile radios, conventional or trunked mobile radios which have a data terminal attached thereto, or which optionally have data terminal capability designed in. Each of the selective call radios assigned for use with the radio communication system fixed network 100 has an address assigned thereto which is a unique selective call address. The address enables the transmission of a message from the paging controller 114 only to the addressed selective call radio, and identifies messages and responses received at the paging controller 114 from the selective call radio. Furthermore, each of one or more of the selective call radios can have a unique telephone number assigned thereto, the telephone number being unique within the STN 112. A list of the assigned selective call addresses and correlated telephone numbers for the selective call radios is stored in the paging controller 114 in the form of a subscriber database.

Referring to FIG. 2, an electrical block diagram of a modular speech recognition system 201 and other portions of the paging controller 114 is shown, in accordance with the preferred embodiment of the present invention. The paging controller 114 schedules and queues data and stored voice messages for transmission to the selective call radios, connects telephone calls and uses a processor generated interactive dialog for determining messages to be transmitted to the selective call radios, and receives acknowledgments, demand responses, unsolicited data and stored audio messages, and telephone calls from the selective call radios. The paging controller 114 in this example comprises an STN interface 210 and the modular speech recognition system 201. The modular speech recognition system 201 comprises a paging controller processor 240, a hard disk memory 230, one type I speech recognition module (SRM-I) 220 and two essentially identical type II speech recognition modules (SRM-II) 221, 222, which are all intercoupled by an external bus 225. The STN interface 210 handles the switched telephone network (STN) 112 physical connection, connecting and disconnecting telephone calls at the telephone links 113, and routing call related information between the telephone links 113, the SRM-I 220 and the paging controller processor 240, under control of the paging controller processor 240.

When a telephone call is received, and sufficient resources are available to process the call, it is connected by the STN interface 210, under control of the paging controller processor 240, to the SRM-I 220 by a conventional serial processor interface 215 for processing of call information received during the call, and is connected to the paging controller processor 240 by the external bus 225 for communication of information that is generated by the paging controller processor 240 and transmitted in the telephone call, during an interactive dialog controlled by the paging controller processor 240. (Alternatively, the call can be connected from the STN interface 210 to the SRM-I 220 by the external bus 225.) The interactive dialog which is supported by the preferred embodiment of the present invention is substantially more sophisticated than those commonly in use today, in which the information presented by the caller is typically restricted to digits entered from the telephone keypad or clearly spoken single words, such as the digits or "yes" or "no." A plurality of specialized vocabulary databases are stored in the hard disk memory 230. The hard disk memory 230 also stores a plurality of conventional digitized voice response segments which are transmitted in a conventional manner during a call to a caller.

Call information is received as digital information, such as information that conveys the telephone number of the initiator of the telephone call. Call information also includes digitized analog information, such as digitized voice signals of the caller or digitized dual tone multifrequency tones generated by the caller's activation of telephone instrument keys, or computer generated digitized stored voice responses transmitted from the paging controller 114 to the caller. For simplicity, both the digital information and digitized analog information received or transmitted during a call and associated with the call are described herein as being "in the call" or "during the call." Digitized analog information is received and transmitted by the paging controller processor 240 in a plurality of simultaneous calls connected to the STN interface 210 in a conventional time multiplexed manner. One or more telephone calls are simultaneously connected by the STN interface 210, under control of the paging controller processor 240, to the SRM-I 220, which provides front end processing of information received in the calls, resulting in the generation of a series of sets of feature vectors for each telephone call, wherein each set typically represents a phrase of the received portion of an interactive dialog. The feature vectors generated from information received in each telephone call are coupled to a recognition processor 355, 356 (see FIG. 3) of one of the speech recognition modules 220, 221, 222 by the external bus 225. Since in the example shown in FIG. 2 there are three such recognition portions, three telephone calls can be front end processed simultaneously. The only significant difference between the SRM-II 221 and the SRM-II 222 is the identification code of each.
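The coupling of each call's feature vector sets to one recognition processor can be sketched as follows. This is only an illustrative model, not the disclosed implementation: the class names, the `connect`/`deliver`/`disconnect` methods, and the first-free assignment policy are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RecognitionProcessor:
    """Hypothetical stand-in for a recognition processor (355 or 356)."""
    module_id: str                      # label of the module, e.g. "SRM-I 220"
    call_id: Optional[int] = None       # call currently served, if any
    vector_sets: List[list] = field(default_factory=list)


class Dispatcher:
    """Routes each call's feature vector sets to one recognition processor."""

    def __init__(self, processors: List[RecognitionProcessor]):
        self.processors = processors
        self.by_call: Dict[int, RecognitionProcessor] = {}

    def connect(self, call_id: int) -> bool:
        """Reserve the first free processor (assumed policy); False if none."""
        for proc in self.processors:
            if proc.call_id is None:
                proc.call_id = call_id
                self.by_call[call_id] = proc
                return True
        return False

    def deliver(self, call_id: int, vector_set: list) -> str:
        """Copy one completed feature vector set to the call's processor."""
        proc = self.by_call[call_id]
        proc.vector_sets.append(list(vector_set))
        return proc.module_id

    def disconnect(self, call_id: int) -> None:
        """Release the processor when the call ends."""
        proc = self.by_call.pop(call_id)
        proc.call_id = None
        proc.vector_sets.clear()
```

With three recognition processors, as in the example of FIG. 2, a fourth simultaneous call is refused in this sketch until an earlier call disconnects.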

The hard disk drive 230 is a conventional disk drive, such as a 2.1 Gigabyte drive commonly supplied with computers sold today. Alternatively, another form of bulk memory such as a conventional compact disk read only memory (CD ROM) drive could be used. Paging controller 114 is preferably a Wireless Message Gateway™ Administrator! paging terminal manufactured by Motorola, Inc., of Schaumburg, Illinois, and modified by the addition of unique speech recognition modules 220, 221, 222, the unique control functions as described herein with reference to the paging controller processor 240, and the plurality of specialized vocabulary databases stored in the hard disk drive 230.

It will be appreciated that other conventional processing systems that include a telephone interface and support for the unique speech recognition modules 220, 221, 222 could alternatively be modified for use as the paging controller 114. It will be further appreciated that the paging controller 114 can be configured to handle more telephone calls by using more SRM-IIs, up to the capacity of the front end portion of the SRM-I, and more SRM-Is, up to the capacity of the STN interface 210, or the physical and/or capacity of the paging controller 114.

Referring to FIG. 3, a more detailed electrical block diagram of the modular speech recognition module is shown, in accordance with the preferred embodiment of the present invention. The SRM-I 220 and one SRM-II 222 are shown, as well as the hard disk memory 230, the paging controller processor 240, and the external bus 225.

The SRM-I 220 provides the front end processing described above (with reference to FIG. 2) by means of a front end processor 305 that comprises an electrically programmable read only memory (EPROM) 310, a random access memory (RAM) 320, a microprocessor 330, and an external bus input output driver (EXT BUS I/O DVR) 325, which are all mounted to a printed circuit board (not shown in FIGs. 1-2), and intercoupled by an internal bus 340. The EPROM 310 comprises a unique front end processing segment 315 as well as conventional segments, which together control the operation of the microprocessor 330, and thereby the front end processor 305. The RAM 320 comprises memory storage space sufficient to store three maximum sets of feature vectors. In this example, feature vector sets named feature vector set A (VS A) 321 and feature vector set B (VS B) 322 are stored in RAM 320.

The recognition processors 355, 356 of the SRM-I 220 and SRM-II 222 each comprise an EPROM 360, a RAM 370, a microprocessor 380, and an external bus input output driver (EXT I/O BUS DVR) 375, intercoupled by an internal bus 390. These circuits form the recognition processor 355 and are all mounted to the printed circuit board to which the circuits forming the front end processor 305 are also mounted, forming the SRM-I 220. The SRM-II 222 comprises a printed circuit board (not shown in FIGs. 1-2) having the same layout as the printed circuit board of the SRM-I 220, but having only the circuits that form the recognition processor 356 mounted thereto. The EPROM 360 comprises a unique recognizer segment 365 as well as conventional segments which control the operation of the microprocessor 380, and thereby the recognition processor 355. The RAM 370 comprises memory storage space sufficient to store one maximum set of feature vectors and one maximum specialized vocabulary database. In this example, a copy of the vector set B 322 is stored in recognition processor 356 of SRM-II 222 and a vector set named vector set C (VS C) 372 is stored in recognition processor 355 of SRM-I 220. In this example, a specialized vocabulary database named specialized vocabulary database N (VDB N) 373 is stored in recognition processor 356 of SRM-II 222 and a specialized vocabulary database named specialized vocabulary database M (VDB M) 371 is stored in recognition processor 355 of SRM-I 220.

The hard disk memory 230 comprises sufficient memory storage space to store P specialized vocabulary databases identified as specialized vocabulary databases 1 through P (VDB 1-P) 231. Specialized vocabulary database N 373 and specialized vocabulary database M 371 are copies of two of the P specialized vocabulary databases stored in the hard disk memory 230, although they may alternatively be copies of one specialized vocabulary database stored in the hard disk memory 230.

The paging controller processor 240 comprises EPROM 361, RAM 371, external bus input/output driver 376, and microprocessor 381, intercoupled by internal bus 391. The EPROM 361 comprises a unique dialog control segment 366 as well as conventional segments which control the operation of the microprocessor 381, and thereby the paging controller processor 240.

The RAMs 320, 370, 371 are conventional read/write RAMs, preferably 64 Megabytes each. The EPROMs 310, 360, 361 are conventional EPROMs programmed with conventional and unique segments as described above. The microprocessors 330 and 380 are microprocessors of the 56000 family of digital signal processors made by Motorola, Inc. of Schaumburg, IL. The microprocessor 381 is a microprocessor of the 68000 family of microprocessors made by Motorola, Inc. The internal busses 340, 390, 391 are parallel microprocessor busses of conventional design uniquely laid out on the printed circuit board described above for intercoupling the devices described above with reference to the speech recognition modules 220, 221, 222. By being directly coupled to the microprocessors 330, 380, and 381 by the respective internal busses 340, 390, 391, the RAMs 320, 370, 371 are local memories; that is, the stored information is read from and written to them by a central processing portion of the respective microprocessors 330, 380, and 381 on a random addressed basis, at a bussed speed. It will be appreciated that the RAMs 320, 370, 371 could alternatively be a portion of the microprocessors 330, 380, and 381 themselves, by being integrated on the same substrate as the central processing unit and other elements of the microprocessors 330, 380, and 381.

The external bus 225 is a microprocessor bus that intercouples the speech recognition modules 220, 221, 222, the hard disk memory, and the paging controller processor 240 by flat cables and connectors. The external bus input/output drivers 325, 375, 376 are conventional devices for driving the external bus 225, which is a conventional SCSI (small computer systems interface) bus.

It will be appreciated that the RAMs 320, 370, 371 could be dynamic or static RAM devices, that the EPROM could alternatively be of other type such as masked ROM or flash ROM, and that the microprocessors 330 and 380 could alternatively be of other types of digital signal processors or possibly microprocessors such as those of the PowerPC™ or Pentium® families of processors, and that the microprocessor 381 could be of another type such as a microprocessor of the PowerPC™ or Pentium® families of processors. It will be further appreciated that in alternative embodiments of the present invention, the functions of the front end processor 305 and recognition processor 355 could be provided by using one microprocessor of sufficient speed and capability. In such a case, the RAMs 320, 370 could be combined, although the memory space would have to be essentially the same as provided by both. A similar situation exists for the EPROMs 310, 360. With one microprocessor replacing the microprocessors 330, 380, only one internal bus is needed for the SRM-I 220, and vector sets which are communicated from the front end processor 305 to the recognition processor 355 are moved, when necessary, between RAM locations using the single internal bus. It will be further appreciated that in another alternative embodiment of the present invention wherein a paging controller supports sufficiently few telephone links 113, the functions of the SRM-I 220 and the paging controller processor 240 could further be combined using one microprocessor of sufficient speed and capability. In these alternative embodiments in which a processor performs the functions of two or more of the processors 305, 355, 240 of the preferred embodiment, the functions provided by the front end processor 305, the recognition processor 355, and the paging controller processor 240 are separated as described herein with reference to the preferred embodiment, and specialized vocabulary databases are locally copied into RAM from those stored in the hard disk memory.

In yet another alternative embodiment in accordance with the present invention, suitable for use with a plurality of non-trunked telephone lines, a plurality of analog recognition modules similar in design to the SRM-I 220 are used. The STN interface 210 in this embodiment is a telephone interface for connecting a plurality of analog telephone lines 113, and the analog recognition module differs from the SRM-I 220 essentially only in that it converts an analog signal from one line to digitized speech samples. Variations are possible wherein the STN interface 210 digitizes and time multiplexes several analog telephone lines and one or more SRM-Is 220 and a plurality of SRM-IIs 221 are used.

However, in all of the embodiments, a hard disk drive or other bulk memory is used to store the plurality of specialized vocabulary databases from which a specialized vocabulary database is copied to a local memory one at a time, providing the benefits of fast, accurate, and cost efficient interactive dialogs.
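The idea of keeping many databases in bulk memory while holding only one copy at a time in local memory can be modeled with a minimal sketch. The `LocalMemory` class, the dictionary standing in for the hard disk, and the byte-string databases are hypothetical illustrations, not details taken from the disclosure.

```python
class LocalMemory:
    """Hypothetical model of a recognition processor's RAM: one database at a time."""

    def __init__(self, capacity_bytes: int):
        self.capacity_bytes = capacity_bytes
        self.name = None    # name of the database currently loaded, if any
        self.data = None    # local copy of that database

    def load(self, name: str, bulk_memory: dict) -> None:
        """Copy one database from bulk memory, replacing any earlier copy."""
        source = bulk_memory[name]
        if len(source) > self.capacity_bytes:
            raise MemoryError(f"{name} does not fit in local memory")
        self.data = bytearray(source)   # a copy; the bulk original is untouched
        self.name = name
```

Because each load overwrites the previous copy, the local memory only needs to be as large as the largest specialized database, not the sum of all of them.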

Referring to FIG. 4, a flow chart is shown describing a method used in a paging controller 114 for providing interactive dialog during one or more telephone calls. In this example, recognition processor 356 of speech recognition module 222 is available and a new call (hereafter for simplicity, "the call") is received at the STN interface 210. At step 405, a connection is made to the call. The paging controller processor 240 controls the STN interface 210 to connect the call to the SRM-I 220 for front end processing (generation of a set of feature vectors). At step 410, in response to identification of the connection of a new telephone call, which is one form of predetermined digital information associated with the call, and in further response to identification of an exchange code that is the calling telephone's exchange, the paging controller processor 240 at step 420 selects a specialized vocabulary database from the specialized vocabulary databases 231 stored in the hard disk memory 230. Each of the specialized vocabulary databases 231 is designed for a relatively narrow set of jargon. For example, the specialized vocabulary database selected for response to the telephone call of the example is a vocabulary database specialized for identifying numbers in Spanish, in response to a call having just been connected and in response to the exchange number of the call received being one known to be primarily used by Spanish speaking callers (this combination of information is a predetermined set of digital information received in the call). Because the specialized vocabulary databases are specialized, each one will fit within the RAM 370 of one of the recognition processors 355, 356. The paging controller processor 240 controls a selected one of the speech recognition modules 220, 222 to download at step 430 a copy of the selected specialized vocabulary database into its RAM 370. For this example, it is assumed that SRM-II 222 is selected and a copy of specialized vocabulary database N 373 is downloaded into RAM 370.
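The selection at step 420 from digital call information might be sketched as a simple table lookup. The exchange codes, language tags, database names, and the fallback to a default language below are all assumptions made for illustration; the disclosure does not specify them.

```python
from typing import Optional

# Hypothetical lookup table; these exchange codes and language tags are
# illustrative only and do not come from the source.
EXCHANGE_LANGUAGE = {"555": "es", "556": "en"}
DEFAULT_LANGUAGE = "en"   # assumed fallback for unknown exchanges


def select_initial_database(new_call: bool, exchange_code: str) -> Optional[str]:
    """Pick the first database for a call from digital call information.

    Selection happens only when a new call has just been connected; the
    calling exchange then decides which language's number vocabulary to use.
    """
    if not new_call:
        return None
    language = EXCHANGE_LANGUAGE.get(exchange_code, DEFAULT_LANGUAGE)
    return f"numbers-{language}"
```

In this sketch, a newly connected call from exchange "555" yields "numbers-es", mirroring the Spanish number vocabulary of the example.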

Thereafter, the paging controller processor 240 further selects an initial dialog response phrase in Spanish (in digitized voice form), which says for example (translated to English), "Please say the paging number you are calling." The paging controller processor 240 communicates this dialog response phrase to the STN interface 210 for transmission in the telephone call to the caller.

When time multiplexed voice information is received in the phone call after the call is connected, the front end processor 305 analyzes the time multiplexed voice information obtained in the call at step 415 by generating speech samples therefrom and generates feature vectors therefrom in a conventional manner. This analysis continues until a predetermined break point occurs in the digitized voice signal, such as a 0.5 second pause, at which time a first set of conventional feature vectors 322 is completely generated at step 425 which is based on and represents a first set of conventional speech samples. The break point is determined by the front end processor 305. In this example, the set of feature vectors 322 is determined from the voice information initially received in the connected call. When completed, the set of feature vectors 322 is copied at step 435 to the RAM 370 of recognition processor 356, the same recognition processor selected by the paging controller processor 240 for receiving a copy of the specialized vocabulary database N 373. The recognition processor 356 then generates a recognition result at step 440 based on the feature vector set B 322 and the specialized vocabulary database N 373. For example, the caller may say "346-9876" (in Spanish), for which the recognition result is the set of numbers 3469876 in ASCII (American Standard Code for Information Interchange). The recognition result is determined quickly and accurately. It will be appreciated that while this function could be alternatively performed by the conventional method of asking the user to enter DTMF tones in the United States, the use of voice to provide the digits is a more ubiquitous solution to obtaining digits in countries where DTMF dialing is not as prevalent as in the United States. Furthermore, the use of voice digits is more natural for many users than using keypad keys.
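The break point detection described above can be illustrated with a small sketch that groups speech frames into sets separated by long pauses. The per-frame energy input, the energy threshold, and the frame duration are assumptions introduced here; the source only states that a break point such as a 0.5 second pause ends a set, and it leaves the feature vector computation itself as conventional.

```python
from typing import List, Sequence


def split_at_pauses(energies: Sequence[float], frame_s: float = 0.01,
                    pause_s: float = 0.5,
                    threshold: float = 0.1) -> List[List[int]]:
    """Group voiced frame indices into sets separated by long pauses.

    A frame counts as silent when its energy is below `threshold` (an
    assumed criterion). A run of silent frames lasting `pause_s` seconds
    is treated as a break point that closes the current set.
    """
    needed = round(pause_s / frame_s)   # silent frames that make a break point
    sets: List[List[int]] = []
    current: List[int] = []
    silent = 0
    for i, energy in enumerate(energies):
        if energy >= threshold:
            current.append(i)
            silent = 0
        else:
            silent += 1
            if silent == needed and current:
                sets.append(current)    # break point reached: set is complete
                current = []
    if current:
        sets.append(current)            # flush frames left at end of input
    return sets
```

Pauses shorter than `pause_s` do not close a set, so a caller who hesitates briefly between digits still produces a single set of frames.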

The paging controller 114 can alternatively select a specialized vocabulary database based on an identification of a predetermined set of recognition results generated from a set of sampled speech data received from the connected call, at step 445. This is illustrated by a continuation of the example being described above. The paging controller processor 240 identifies at step 445 the recognition result 3469876 as a set of seven digits identifying a pager used by a pediatrician. In response to this identification, the paging controller processor 240 selects another specialized vocabulary database L at step 420 for copying into the RAM 370 of recognition processor 356. The specialized vocabulary database L is a specialized database of jargon associated with pediatricians, in Spanish. In this manner, the next set of feature vectors generated by front end processor 305 from the digitized voice data received in the same call are copied to SRM II 222 and analyzed using specialized vocabulary database L. This again results in a fast and accurate generation of a recognition result. This process of repeatedly selecting a specialized vocabulary database from the set of specialized vocabulary databases stored in the hard disk memory 230, based on information associated with the telephone call is continued until the telephone call is completed.
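The repeated selection of a database based on earlier recognition results can be sketched as a loop. The function name, the `recognize` callback, and the `next_database` mapping are hypothetical; a real recognizer would operate on feature vectors rather than the placeholder strings used here.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def run_dialog(vector_sets: Iterable[str],
               recognize: Callable[[str, str], str],
               next_database: Dict[str, str],
               initial_database: str) -> List[Tuple[str, str]]:
    """Recognize each vector set, swapping databases when a result calls for it.

    `recognize(vectors, database)` stands in for the recognition processor.
    `next_database` maps predetermined recognition results to the database
    that should be loaded for the following vector set.
    """
    loaded = initial_database
    history = []
    for vectors in vector_sets:
        result = recognize(vectors, loaded)
        history.append((loaded, result))
        # A predetermined recognition result selects the next database;
        # any other result leaves the loaded database in place.
        loaded = next_database.get(result, loaded)
    return history
```

In the example of the text, the result 3469876 would map to the Spanish pediatric jargon database, which is then used for the next set of feature vectors in the same call.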

By now it will be appreciated that the preferred and alternative embodiments of the present invention provide a unique configuration of circuit devices and databases that permits a cost effective use of interactive dialogs in a paging system handling a wide variety of jargons (such as medicine, law, electrician, and real estate) in a wide variety of languages by avoiding the use of one large vocabulary database stored in a large RAM, which would be very costly, or one large vocabulary database stored in bulk (mass) memory, such as hard disk or CD ROM, in which recognition would be impractically slow due to the inherent slow access times of such mass memory. The unique arrangement involves using a series of smaller, specialized vocabulary databases selected during a telephone call; for example, 15 specialized vocabulary databases including the pediatrician jargon of the above example as well as electrician and real estate jargon in English, Spanish, Haitian, German, and French; wherein the specialized vocabulary database is based on data received within the call and copied into a RAM 370 of a recognition processor 355, 356 from a set of smaller, specialized vocabulary databases 231 stored in the hard disk memory 230. The unique arrangement further permits the economic handling of a plurality of simultaneous phone calls by separating the recognition processing function performed by the recognition processors 355, 356 from the relatively less intensive feature vector generation function performed by the front end processor 305 and the less intensive dialog control function performed by the paging controller processor 240.

It will be further appreciated that such benefits are available from using essentially the same method and unique arrangement of apparatus as described herein, for communication systems other than paging communication systems, such as catalog order placement systems and reservation systems.

We claim:

Claims

1. A speech recognition module, comprising: a local memory that stores a specialized vocabulary database; and a recognition processor that accepts a first set of feature vectors and generates a recognition result based on the specialized vocabulary database stored in the local memory and the first set of feature vectors, wherein the first set of feature vectors represents a first set of speech samples obtained during a call, and wherein the specialized vocabulary database is a copy of one of a plurality of specialized vocabulary databases stored in a bulk memory, and wherein the specialized vocabulary database is selected from the plurality of specialized vocabulary databases in response to information associated with the call.

2. The speech recognition module according to claim 1, wherein the first set of feature vectors is also stored in the local memory.

3. The speech recognition module according to claim 1, further comprising: a front end processor that generates the first set of feature vectors based on an analysis of the first set of sampled speech data.

4. The speech recognition module according to claim 3, wherein the front end processor generates a plurality of sets of feature vectors from a plurality of sets of sampled speech data obtained in a plurality of calls and couples to one or more of other speech recognition modules one of the plurality of sets of sampled speech data.

5. The speech recognition module according to claim 3, wherein the recognition processor comprises a first microprocessor controlled by a recognition program segment, and the local memory is a random access memory directly accessible by the first microprocessor, and wherein the speech recognition module further comprises a circuit board for mounting and coupling the first microprocessor and the random access memory, and wherein the bulk memory is located external to the circuit board, and wherein the speech recognition module further comprises a means for transferring the specialized vocabulary database from the bulk memory to the random access memory, and wherein the front end processor comprises a second microprocessor controlled by a speech analysis segment, and wherein the second microprocessor is mounted to the circuit board and coupled to the first microprocessor and the local memory.

6. The speech recognition module according to claim 5, wherein the first and the second microprocessors are embodied in a single microprocessor.

7. The speech recognition module according to claim 1, wherein the information is a set of digital information received during the call.

8. The speech recognition module according to claim 1, wherein the information is a set of recognition results generated by the recognition processor from a second set of feature vectors representing a second set of sampled speech data obtained during the call.

9. The speech recognition module according to claim 1, wherein the recognition processor comprises a first microprocessor controlled by a recognition program segment, and the local memory is a random access memory directly accessible by the first microprocessor.

10. The speech recognition module according to claim 9, further comprising a circuit board for mounting and coupling the first microprocessor and the random access memory, wherein the bulk memory is located external to the circuit board, and wherein the speech recognition module further comprises a means for transferring the specialized vocabulary database from the bulk memory to the random access memory.

11. A modular speech recognition system, comprising: a bulk memory that stores a plurality of specialized vocabulary databases; a front end processor that generates a first set of feature vectors based on an analysis of a first set of sampled speech data received during a call; a local memory that stores a specialized vocabulary database; and a recognition processor that accepts the first set of feature vectors and generates a recognition result based on the first set of feature vectors and the specialized vocabulary database, wherein the specialized vocabulary database is a copy of one of the plurality of specialized vocabulary databases stored in the bulk memory, and wherein the specialized vocabulary database is selected from the plurality of specialized vocabulary databases in response to information associated with the call.

12. The modular speech recognition system according to claim 11, wherein the speech recognition system further comprises a dialog controller that detects the information and selects the specialized vocabulary database for copying into the local memory.

13. The modular speech recognition system according to claim 12, wherein the information is digital information received during the call.

14. The modular speech recognition system according to claim 12, wherein the information is a set of recognition results generated by the recognition processor from a second set of feature vectors representing a second set of sampled speech data obtained during the call.

15. A method for speech recognition during a call, comprising in a system controller the steps of: selecting a specialized vocabulary database from a plurality of specialized vocabulary databases stored in a bulk memory, in response to information associated with the call; copying the specialized vocabulary database into a local memory; and generating a recognition result based on the specialized vocabulary database stored in the local memory and a first set of feature vectors that represents a first set of speech samples obtained during the call.

16. The method according to claim 15, further comprising the step of generating the first set of feature vectors based on an analysis of the first set of sampled speech data obtained during the call.

17. The method according to claim 15, wherein in the step of selecting, the information is digital information received during the call.

18. The method according to claim 15, wherein in the step of selecting, the information is a set of recognition results generated from a second set of feature vectors representing a second set of sampled speech data obtained during the call.