WO2002037469A2 - Speech generating system and method - Google Patents

Speech generating system and method

Info

Publication number
WO2002037469A2
Authority
WO
WIPO (PCT)
Prior art keywords
data
selecting
audio
user
text
Prior art date
Application number
PCT/IL2001/001009
Other languages
French (fr)
Other versions
WO2002037469A3 (en)
Inventor
Zeev Lavi
Moshe Gilad
Ronen Dvashi
Original Assignee
Infinity Voice Holdings Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Infinity Voice Holdings Ltd. filed Critical Infinity Voice Holdings Ltd.
Priority to AU2002214227A priority Critical patent/AU2002214227A1/en
Publication of WO2002037469A2 publication Critical patent/WO2002037469A2/en
Publication of WO2002037469A3 publication Critical patent/WO2002037469A3/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Definitions

  • the present invention relates to the field of text to speech conversion.
  • Advertisement targeting and personalization of WWW interactions generally require a user profile which is typically maintained at the behest of the site owner or the viewer. Such a profile may also be maintained by a third party advertiser that targets advertisements to a viewer through the WWW page or via other means.
  • An aspect of some embodiments of the invention relates to reading out and annotating WWW pages using computer generated speech.
  • a part of a WWW page is translated to a viewer's preferred language before being read out.
  • animation associated with the speech is added, for example an image of a stick figure waving appendages or lip movements for a face figure.
  • the animation may be determined automatically or it may be added manually to the page or an associated database.
  • an animation template database associates particular animation or an animation layout with a WWW page layout.
  • the WWW page layout may be determined, for example, by analyzing the WWW page and/or by the WWW page being created using an authoring system that uses standard layouts.
  • information other than or additional to animation is associated with a page template.
  • the read out subject matter comprises information missing from the WWW page, for example supplied by other means or information that is removed from the WWW page during conversion to a different screen format (e.g., a cellular telephone screen).
  • the added animation is added to support the understanding of a converted WWW page.
  • An aspect of some embodiments of the invention relates to the generation of sound effects in a speech rendering of a WWW page to indicate objects of interest and/or links.
  • a "bong" sound is generated prior to presenting a link, an image of interest, a matched search word and/or other objects of interest.
  • An aspect of some embodiments of the invention relates to a method of combining speech and sounds.
  • a sound effects set (possibly including speech) is mixed with an output of a text-to-speech generator, after the speech is generated, and before sounds are provided to a sound card.
  • a stereo signal is provided as an output.
  • An aspect of some embodiments of the invention relates to an audio-based automated salesman that is not associated with a particular site being viewed.
  • the salesman can suggest various products from the site, or from competing sites, to the viewer.
  • the salesman is not an agent of the viewer or of the site, but may receive a commission from the site and/or the viewer.
  • the salesman converses in the native language of the viewer, optionally providing a transaction and/or explanation of the products on the site.
  • a viewer may interact with the salesman using audio and/or visual tools and may, in some embodiments of the invention, conclude a sale via the salesman.
  • An aspect of some embodiments of the invention relates to a method of text presentation, in which a text to speech conversion system retrieves a previously prepared audio file, in replacement for a text segment and/or a non-readable element.
  • the element is an advertisement.
  • An aspect of some embodiments of the invention relates to a method of analyzing a WWW page for read out.
  • a target page is divided into readable elements, each of which can be selected to be read out, and a plurality of unreadable elements that can be removed.
  • the readable elements are automatically categorized, for example "menu", "link list" and "article headline".
  • the categorized elements are grouped into groups.
  • the groups are selected, for example pre-selected or ad-hoc, so that a resulting voice menu and/or voice menu structure used to read out the page has desired properties, for example, a minimum or maximum number of elements.
  • associations between the categories are predefined, to assist grouping of the categorized elements using a logical scheme.
  • a relative level of association between two categories can determine, for example, whether the two categories will be merged for a particular page, or whether other two categories will be merged.
  • the number of elements in a category may determine if it is to be merged (e.g., low numbers are merged, so that voice menus are not wasted).
  • the categorization of an item may be changed, for example if more than one categorization fits an item (or group of items), and the different categorization affords a more desirable menu structure.
  • a method of analyzing a WWW site for readout comprising: parsing the site to identify items for which to generate an audible indication; categorizing the identified items by category; grouping the categories; and generating at least one voice menu based on said grouping, wherein said grouping comprises grouping so that at least some of the generated menus have a desirable property.
  • said desirable property comprises a minimum number of elements in a menu.
  • said desirable property comprises a maximum number of elements in a menu.
  • grouping comprises grouping based on pre-defined associations of categories.
  • grouping comprises ordering said categories for presentation.
  • the generated menus include a main menu and sub menus.
  • said main menu is shorter than 10 items.
  • said main menu is shorter than 7 items.
  • said main menu is shorter than 5 items.
  • generating voice menus comprises merging the items in at least two categories into a single category.
  • grouping comprises changing the categorization of an item to achieve the desired property.
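The claimed analysis steps (parse, categorize, group, generate voice menus with a desirable property) lend themselves to a short sketch. The following Python fragment is illustrative only; the constants, function names and merging policy are assumptions, not taken from the patent:

```python
# Illustrative sketch of the claimed grouping step. MIN_ITEMS/MAX_ITEMS
# and the merging policy are assumptions, not taken from the patent.

MIN_ITEMS = 2   # menus shorter than this are merged with a neighbor
MAX_ITEMS = 7   # e.g., a main menu "shorter than 7 items"

def group_categories(categories):
    """categories: dict of category name -> list of readable items.
    Returns a list of (menu_title, items) pairs."""
    menus, pending_title, pending_items = [], None, []
    for name, items in categories.items():
        if len(items) < MIN_ITEMS:
            # Too small for its own menu: accumulate into a merged menu.
            pending_items.extend(items)
            pending_title = f"{pending_title} & {name}" if pending_title else name
            if len(pending_items) >= MIN_ITEMS:
                menus.append((pending_title, pending_items))
                pending_title, pending_items = None, []
        else:
            # Large categories are split into chunks of at most MAX_ITEMS.
            for i in range(0, len(items), MAX_ITEMS):
                chunk = items[i:i + MAX_ITEMS]
                title = name if len(items) <= MAX_ITEMS else f"{name} #{i // MAX_ITEMS + 1}"
                menus.append((title, chunk))
    if pending_items:  # leftover small group becomes its own menu
        menus.append((pending_title, pending_items))
    return menus

menus = group_categories({
    "main headline": ["Top story"],
    "advertisement": ["Sponsor clip"],
    "subject headlines": ["H1", "H2", "H3", "H4", "H5", "H6", "H7", "H8"],
})
```

Here the two single-item categories merge into one menu and the eight headlines split across two menus, so no voice menu exceeds the maximum.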
  • a method of audio browsing of data that includes text data comprising: selecting from a remote database, by a user, data including text data to be provided in an audio manner; automatically providing to said user, audio corresponding to said selected data; determining at least an indication of a content of said selected data; and automatically providing to said user, data in audio manner and relating to said determined indication.
  • selecting comprises selecting data by selecting a page.
  • selecting comprises selecting data by selecting a WWW site.
  • selecting comprises selecting data from a menu.
  • selecting comprises selecting using a telephone handset with no visual display assistance.
  • selecting comprises selecting using a telephone handset with a limited display incapable of satisfactorily displaying the data in a visual manner.
  • selecting comprises selecting using a cellular telephone.
  • said data comprises a text segment.
  • said data comprises an article.
  • said data comprises an audio clip.
  • said corresponding audio comprises a text to speech rendition of said text.
  • said corresponding audio comprises a translation of said text.
  • said corresponding audio comprises a recording of a human reading of said text.
  • determining at least an indication comprises matching a keyword of said data.
  • determining at least an indication comprises identifying a source of said data.
  • determining at least an indication comprises matching said data to a template.
  • said relating data comprises a help message.
  • said relating data comprises an unsolicited sales offer.
  • said relating data comprises a comparison with data from a different source.
  • said relating data comprises an unsolicited comment.
  • said relating data comprises an advertisement.
  • said relating data comprises audio of an interactive sales program.
  • said relating data is provided locally to said user. Alternatively or additionally, said relating data is provided to compensate for lack of visual display quality. Alternatively or additionally, said relating data is provided to compensate for data which is not presented and not selected by the user for audio presentation. In an exemplary embodiment of the invention, said relating data is provided in a language native to said user and other than a language of said data.
  • said relating data is personalized to match at least one attribute of said user.
  • said related data is sounded after said corresponding audio is sounded.
  • said related data is requested by said user.
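The audio-browsing claims above (determine an indication of content, then provide related audio) can be illustrated with a minimal keyword matcher. The keyword table, clip names and matching rule are hypothetical:

```python
# A minimal, hypothetical sketch of "determine an indication of content,
# then provide related audio": keywords found in the selected text pick
# a related clip. The keyword table and file names are illustrative.

RELATED_CLIPS = {
    "weather": "umbrella_ad.wav",
    "stock":   "broker_ad.wav",
    "travel":  "airline_ad.wav",
}

def related_clip(selected_text):
    words = {w.strip(".,").lower() for w in selected_text.split()}
    for keyword, clip in RELATED_CLIPS.items():
        if keyword in words:
            return clip
    return None  # no indication matched; play nothing extra

clip = related_clip("Stock markets rallied today on strong earnings.")
```

A real system would also consider the source of the data and page templates, as the claims list, rather than keywords alone.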
  • Fig. 1 is a schematic diagram of a configuration including an Internet speech generator, in accordance with an exemplary embodiment of the invention.
  • Fig. 2 is a schematic block diagram of a speech and sound mixing system, in accordance with an exemplary embodiment of the invention.
  • Fig. 3A is an exemplary WWW page to be read out in accordance with an exemplary embodiment of the invention.
  • Fig. 3B is a flowchart of a process of processing the page of Fig. 3A, in accordance with an exemplary embodiment of the invention.
  • Fig. 3C is a flowchart of a process of reading out the page of Fig. 3A, in accordance with an exemplary embodiment of the invention.
  • Fig. 4 is a block diagram of a cell-phone configuration, in accordance with an exemplary embodiment of the invention.
  • Fig. 5 is a schematic block diagram of a system topology, in accordance with an exemplary embodiment of the invention.
  • Fig. 1 is a schematic diagram of a configuration 100 including an Internet speech generator 108, in accordance with an exemplary embodiment of the invention.
  • Configuration 100 includes a viewer that browses a target site 106 via an Internet 104, for example using a browser executing on a general purpose computer, as known in the art or using other display tools.
  • Speech generator 108 generates speech and/or animation annotations for viewer 102.
  • speech generator 108 is on a separate computer from both viewer 102 and site 106.
  • some or all of the functionality of generator 108 may be located at viewer 102 and/or target site 106 and/or distributed between several computers, connected for example via Internet 104.
  • speech generator 108 is provided on a LAN that interconnects several viewers 102 and/or at an ISP or on a proxy server, for example one that serves a plurality of people with a same language need.
  • the speech generated by speech generator 108 may be transmitted over the network in one or more forms, for example:
  • audio (e.g., using standard methods)
  • codes (e.g., syllables, phonemic codes, optionally with phonic concatenation hints)
  • a program (e.g., a Java applet)
  • the speech generation is performed on the computer of viewer 102.
  • speech generator 108 comprises a page analyzer 110 that analyses the WWW page information at site 106. Exemplary types of analysis include selecting which text to convert to speech and link and advertisement detection. Speech annotation or conversion is performed by a speech annotator/converter 112. In an exemplary embodiment of the invention, this converter comprises a standard text-to-speech converter software module or unit. Alternatively or additionally, speech generator 108 comprises an animation annotator 114, for example for adding animation annotations to the WWW page or to the speech generated for the page. Alternatively or additionally, other multimedia elements may be added, for example, audio clips.
  • An optional database 116 may be provided, for example for storing page templates (described below), for storing associations of animation with speech, for storing sound clips (e.g., to replace text and/or advertisements) and/or for storing help messages (described below).
  • Speech generator 108 may interact with site 106 in various ways, including, for example, site 106 may retrieve the annotations from generator 108, for incorporation in its output; site 106 may be retrieved by viewer 102 via generator 108 or a separate server (not shown) that annotates the contents of site 106; or viewer 102 may retrieve site 106 and annotations from generator 108 in parallel, and add the annotations at viewer 102.
  • Fig. 2 is a schematic block diagram of a speech and sound mixing system 200, in accordance with an exemplary embodiment of the invention.
  • HTML data (202) is parsed to yield plain text (204).
  • a text to speech generator 206 converts the plain text into audio signals, to be outputted as sound-waves (208) at viewer 102.
  • HTML is a marked-up text file.
  • some of the mark ups are used to modify the audio output at viewer 102.
  • the mark-ups are passed to text-to-speech generator 206. However, this may require dedicated generator software and/or a special preprocessor for converting the text mark-ups into command parameters of generator 206.
  • the mark-ups are converted into audio effects and/or speech using a separate channel, which is mixed using a mixer 210 to form part of sounds 208.
  • links (212) cause a wave generator 214 to generate a "bong" sound preceding the recitation of the link.
  • wave generator 214 reads out the link using a phonetic readout, rather than as English.
  • the mark-up path is used to control various parameters of mixer 210, for example volume, speed and voice type (e.g., woman or child).
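The mixing path described above (effects mixed with the text-to-speech output after generation, before the sound card) might look as follows in toy form. Pure-Python 16-bit PCM mixing is shown only for illustration; a production system would use the platform sound API or numpy:

```python
# Toy sketch of the post-TTS mixing stage (mixer 210): effect samples
# are summed with speech samples and hard-clipped to the 16-bit PCM
# range before being handed to the sound card.

PCM_MAX, PCM_MIN = 32767, -32768

def mix(speech, effects, effects_gain=0.5):
    """speech, effects: lists of 16-bit samples; the shorter is zero-padded."""
    n = max(len(speech), len(effects))
    speech = speech + [0] * (n - len(speech))
    effects = effects + [0] * (n - len(effects))
    out = []
    for s, e in zip(speech, effects):
        v = int(s + effects_gain * e)
        out.append(max(PCM_MIN, min(PCM_MAX, v)))  # hard clip
    return out

mixed = mix([1000, 2000, 30000], [2000, 2000, 20000, 4000])
```

A stereo output, as the summary mentions, could be obtained by running this per channel with different gains.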
  • the generation of speech or other sound effects is automatic when the page is retrieved and/or displayed, for example immediately, or after a delay.
  • the audio may be generated and/or presented when viewer 102 interacts with a display.
  • links or active buttons are substituted for text with which audio is associated.
  • the viewer's browser can detect the interaction with such areas.
  • the audio is played.
  • the pages are based on a template of the page structure. Although over 2 billion WWW pages are extant at present, many of the pages match one of a small number of formats.
  • templates for these pages associate one or more of the following with the page format, for automatic generation of text (e.g., for annotations), speech and/or animation:
  • the WWW page is analyzed, for example using methods known in the art, for example to detect links (e.g., based on HTML tags), headlines (e.g., based on relative or absolute font size).
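The analysis just described (links detected from HTML tags, headlines from heading tags or font size) can be sketched with the standard html.parser module. The tag set and font-size threshold are illustrative assumptions:

```python
# Sketch of the page analysis described above: links are detected from
# HTML anchor tags and headlines from heading tags / large fonts. The
# thresholds are illustrative assumptions, not from the patent.

from html.parser import HTMLParser

class PageAnalyzer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links, self.headlines = [], []
        self._in_headline = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        # Treat <h1>-<h3> (or a large old-style <font size>) as headlines.
        if tag in ("h1", "h2", "h3") or (tag == "font" and int(attrs.get("size", 0) or 0) >= 5):
            self._in_headline = True

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3", "font"):
            self._in_headline = False

    def handle_data(self, data):
        if self._in_headline and data.strip():
            self.headlines.append(data.strip())

analyzer = PageAnalyzer()
analyzer.feed('<h1>Top Story</h1><p>Body text</p><a href="/more">more</a>'
              '<font size="6">Old-style headline</font>')
```

The collected links and headlines would then feed the categorization and grouping steps described for Fig. 3B.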
  • a user interface is provided for generator 108, for example to allow a user to set one or more of the following parameters:
  • speech characteristics such as speed, volume and voice type
  • Speech generator 108 may operate in various operational modes, including, for example, one or more of the following modes:
  • Unsolicited explanations may be provided, for example, based on a user profile that can indicate a viewer's past queries. Such a user profile and/or explanations may be stored locally to viewer 102 or remotely, for example at site 106 or at generator 108.
  • Unsolicited or solicited offer of a product which is the same as shown or is related to that shown on the WWW page.
  • in some cases the product is offered from the same vendor, in others from another vendor, for example under competing terms.
  • the contents of a competing site 106' for that product are provided to the viewer using audio.
  • a response from the user may be received using a speech input, which may be processed, for example at viewer 102, at generator 108 and/or at a separate speech recognition server.
  • speech input may be used for other exemplary operational modes as well.
  • DTMF input may be used.
  • speech generator 108 completes a transaction with the viewer, for example regarding the contents of the currently displayed page, or otherwise.
  • Animation can be added as a stand-alone element, or it may be associated with speech output or replace it, for example being in sign language.
  • the animation is that of a stick figure emphasizing the text or speech output.
  • the animation is that of a face, synchronized to the audio sounds.
  • Help is provided in response to a user request, for example, a user clicking on an item or an image of a product.
  • a software component at viewer 102 detects that a user clicked on a word and forwards this word (or image) to generator 108, for generating a "help" message.
  • Fig. 3A is an exemplary WWW page 300 to be read out in accordance with an exemplary embodiment of the invention.
  • Page 300, which is similar in structure to some news pages, includes readable and non-readable elements, elements at different levels of interest and elements having different levels of relevance to the page.
  • page 300 can include an article 306, having an image 308, a headline 310 and text paragraphs 312.
  • page 300 includes a link list 304 (e.g., a single item comprising multiple display and/or HTML elements), an auto-install control 302, other controls 314 and 316, an advertisement 326, a plurality of subject headlines (for other articles) 318 and lists of headlines 320.
  • page 300 may include a secondary article, for example including an image 322 and a headline 324.
  • page 300 is a multi-article page
  • many WWW pages include only a single article, which includes, for example, one or more images, titles and subtitles, text paragraphs and controls, usually at the start and/or end of the page.
  • Fig. 3B is a flowchart of a process 330 of processing page 300, in accordance with an exemplary embodiment of the invention.
  • the "nature" of the site is optionally recognized. For example, different variations and/or processing steps are performed on different types of WWW pages. Examples of WWW page types, include: News, portals, e-commerce, etc.
  • the page type is recognized by comparing the page address against a catalog of site and/or page types. Alternatively or additionally, the page address or site title may include keywords that identify the site type (e.g., checkout pages).
  • non-readable portions such as HTML commands (or other language commands, for other page description languages), images, text input boxes, pull-down lists and controls are removed. Optionally, some such portions may be retained, for example to allow a user to select them using a special menu.
  • a text tag portion may be retained, for example an image title, so the user can be aware of what is missing from the page.
  • an image or other non-readable item may be requested by a user to be forwarded to him, for example, by e-mail, to a cellular telephone display and/or a fax.
  • the readable parts are categorized.
  • the categorization is based on the type of read-out to be applied. Alternatively or additionally, the categorization is based on the hierarchical order of read-out. Exemplary categories include: headline, banner, main menu bar, links list, mail address, articles, sub-articles and tables.
  • a "type name" tag is added to each readable part, in HTML code or other parsed stream.
  • the names used are based on the identification of the site nature. Alternatively, standard nomenclature may be used.
  • the display elements may be categorized using various methods. For example, some WWW sites include tags, such as "headline", "menu" or "link". Although different tags may be used by different sites, a plurality of tags can be combined in a single category (or group, below). Alternatively or additionally, regular expressions or other rules may be used. In another example, a set of contiguous links is identified as a link list. If the text associated with the link is a multiword phrase, it is assumed the links are headlines. In another example, a sequence of paragraphs of same size font, possibly with headlines in another font, is recognized as an article. A multitude of text parsing engines are known in the art, for which a skilled practitioner may define recognition and categorization rules.
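The rules above (contiguous links form a link list, multiword link text suggests headlines, runs of paragraphs form an article) can be encoded directly. This sketch assumes the page has already been parsed into (kind, text) pairs; all names are illustrative:

```python
# Illustrative encoding of the categorization rules described above.
# Each parsed element is a (kind, text) pair; runs of contiguous links
# become a "link list" (or "headlines" when the link text is a
# multiword phrase), and runs of paragraphs become an "article".

def categorize(elements):
    categories, i = [], 0
    while i < len(elements):
        kind = elements[i][0]
        j = i
        while j < len(elements) and elements[j][0] == kind:
            j += 1  # extend the run of same-kind elements
        run = elements[i:j]
        if kind == "link":
            # Multiword link text suggests the links are headlines.
            multiword = all(len(text.split()) > 1 for _, text in run)
            categories.append(("headlines" if multiword else "link list", run))
        elif kind == "paragraph":
            categories.append(("article", run))
        else:
            categories.append((kind, run))
        i = j
    return categories

cats = categorize([
    ("link", "Home"), ("link", "Sports"),
    ("paragraph", "First paragraph..."), ("paragraph", "Second..."),
    ("link", "Mayor opens new bridge"), ("link", "Storm heads north"),
])
```

In practice the font-size cue mentioned above would also feed these rules, and templates could override them for known page types.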
  • rules and expressions may be used in addition to or instead of the use of templates.
  • a frames-like approach (as once used in AI) is used to assist in recognizing elements in a page of a certain type.
  • the readable parts are optionally organized in the order in which they will be presented in the menu.
  • the order may be a property of the site type. Alternatively or additionally, the order may be determined based on the number of each element of each type. Alternatively or additionally, the order may be at least partly random. Alternatively or additionally, the order may be based on a perceived relative importance of different items. Perceived importance may be determined, for example, based on selection (for readout) statistics (e.g., order, frequency) of this or other users.
  • each menu should not have too many options.
  • several readable parts are grouped together (340), so that the number of options will not be too great, for example, not over 5, 6, 7 or 8 readable groups.
  • very short menus are also undesirable, as they increase the total number of menus, so categories with few items are grouped together too.
  • the titles of the menus and/or menu elements are generated in real-time, to match the grouping of categories.
  • the categories are selected so that they can be naturally combined into single menu elements in various manners.
  • the final menu set can thus be, for example, a function of the number of elements in each category, their relative perceived importance and the particular categories available on the page.
  • the determined names and/or statistics of the site are stored in a database (342) for use the next time the page is read out.
  • a manual setup step for the page is triggered (344), for example based on the number of requests for the page and/or based on complaints.
  • page 300 is divided into the following groups: “menus”, “advertisement”, “main headline”, “links list”, “subject headlines #1”, “subject headlines #2”, and “secondary article”.
  • the advertisement is read out, without prompting the user.
  • alternatively, an audio advertisement (e.g., a wav file) may be played.
  • Fig. 3C is a flowchart 350 of an exemplary process of reading out an arbitrary page.
  • a site or page is chosen.
  • a user sets up a limited number of favorite sites.
  • the site is selected from a hierarchical list provided by the system.
  • the user enters the site address or a keyword by voice input.
  • the user uses the telephone keys to enter the site address and/or keywords.
  • the address may be ambiguous, however, such ambiguity may be settled, for example, by comparing the entry against a catalog of favorite and/or common sites.
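Settling an ambiguous keypad entry against a catalog of favorite sites, as suggested above, can be sketched by mapping catalog addresses to their phone-keypad digit strings. The mapping and matching rule are assumptions for illustration:

```python
# Hypothetical sketch of settling an ambiguous telephone-keypad entry
# against a catalog of favorite sites: catalog addresses are mapped to
# keypad digit strings and prefix-matched against the keyed digits.

LETTER_TO_DIGIT = dict(zip("abcdefghijklmnopqrstuvwxyz",
                           "22233344455566677778889999"))

def to_digits(name):
    # Non-letters (dots, digits) are skipped in this toy encoding.
    return "".join(LETTER_TO_DIGIT.get(ch, "") for ch in name.lower())

def resolve(keyed_digits, catalog):
    """Return catalog entries whose keypad encoding starts with the entry;
    more than one match means the entry is still ambiguous."""
    return [site for site in catalog if to_digits(site).startswith(keyed_digits)]

matches = resolve("266", ["cnn.com", "bnn.org", "amazon.com"])
```

Here "266" matches both "cnn.com" and "bnn.org", so the system would read the remaining candidates out for the user to choose between.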
  • advertisements on the site are played.
  • the system requests an audio clip to replace the text and/or image, from the advertisement provider.
  • text to speech methods are used.
  • the page is analyzed for readable and non-readable parts; this may take place, for example, between 352 and 354.
  • the order of readout and/or other readout properties can be a user-associated preference.
  • different preferences are associated with different pages, even for a same user.
  • a short menu of options is read out to a user. Responsive to the list, the user may dig deeper into the hierarchical structure of the site (e.g., alternative pages, sub-articles). Alternatively or additionally, the system may read out an article or part of an article (358), before returning to the options list.
  • the listing may change to reflect the fact that some articles have been read, for example, by putting them last in the list and/or using a different bong sound before the read and unread articles.
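The re-listing behavior described in this bullet can be sketched as a simple reordering, with a different cue sound for items already heard. The sound file names and data shapes are hypothetical:

```python
# Sketch of the re-listing behavior described above: articles already
# heard move to the end of the option list and get a different cue
# ("bong") sound. Sound names and tuple layout are illustrative.

def order_options(articles, read_ids):
    """articles: list of (id, title); returns (cue, id, title) tuples
    with unread articles first."""
    unread = [("bong_new.wav", i, t) for i, t in articles if i not in read_ids]
    read = [("bong_read.wav", i, t) for i, t in articles if i in read_ids]
    return unread + read

options = order_options(
    [(1, "Main article"), (2, "Weather"), (3, "Sports")],
    read_ids={2},
)
```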
  • the page includes tags indicating for which articles and/or other readable or non-readable page elements there is a previously prepared audio equivalent, at the WWW site server and/or at a different location.
  • tone keys are used to navigate the option lists of 356.
  • a user can activate the keys (or use a voice command) during a read out, for example, to bookmark, to stop, to fast forward, rewind, to receive help, to follow a link, to activate a preset utility, to go down a level in hierarchy or to go up a level in hierarchy.
  • the keys for these and/or other actions may be preset and/or read out to the user, as one of the options.
  • a key is active for an item while it is being read and for a short time after, possibly even after a next item is being read.
  • Table Ia shows a process of site analysis (generally corresponding to Fig. 3B), in accordance with an exemplary embodiment of the invention.
  • Table Ib shows the application of this method to a particular CNN main page.
  • Table IIa shows the steps in an exemplary process of reading out a page in accordance with an exemplary embodiment of the invention.
  • the system will read the banner to the user (either by reading the text in the banner or by playing the clip or other audio file of the banner).
  • the system will offer the user the articles to hear: "Press 1 for main article, press 2 for sub articles, press 0 to return "
  • some articles may be available only to members, which may require a payment authorization act or a log-in act. Alternatively, such acts may be implicit.
  • the system may warn the user of the cost of reading out an article. Possibly, the system detects one or more price-quotes on the WWW page and reads them out, for example as part of the menu.
  • Various databases, for example, have a standard record structure that includes a title, a link and a price quote. Such a structure may be used to drive parsing that detects the quote.
  • Fig. 4 is a block diagram of a cell-phone configuration 400, in accordance with an exemplary embodiment of the invention.
  • Information from a source site 402 is transmitted, for example over the Internet or via a dedicated line to a cellular operator 401.
  • the content is converted at operator 401, at source site 402 or intermediate between them, using a converter 404, which converts the format and/or level of detail from a format suitable for personal computers to a format suitable for cellular telephones. This conversion may be in real-time or it may be off-line.
  • a text to speech converter and/or annotator 406 preferably converts parts of the converted content to speech or adds a layer of audio annotations.
  • the annotations are designed to compensate for content removed or made less desirable by converter 404.
  • the converted and annotated content is then transmitted to a cellular telephone 408, using methods known in the art.
  • converter 404 and converter 406 are combined, for example, to convert an HTML page into a hybrid image and audio content.
  • the cellular telephone may serve as a browsing terminal in a configuration as shown in Fig. 1, possibly with no special allowance being made for cellular conversion, if any.
  • the cellular conversion may be performed after the audio annotations are added.
  • FIG. 5 is a more detailed schematic block diagram 500 of a system topology, in accordance with an exemplary embodiment of the invention.
  • a source 502 comprises, for example, one or more of a public web service 510, a hosted web service 512 and a corporate Intranet or Extranet.
  • the data from source 502 is provided to a gateway server 504, optionally through a proxy 516.
  • Gateway 504 may utilize, for example, multiple language/voice generation and/or translation engines 506.
  • An optional language ID engine 522 may be used to determine the language of the site, for example using methods known in the art, such as word recognition, character sets, language tags, letter frequency, page title and a language previously associated with the page address.
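Of the language-identification methods listed for engine 522, letter frequency is the easiest to sketch. The profiles below are toy stand-ins for real frequency tables, and the scoring rule (sum of squared differences) is an assumption:

```python
# Toy sketch of language ID engine 522 via one of the listed methods,
# letter frequency. The tiny profiles are illustrative stand-ins for
# real frequency tables, and the scoring rule is an assumption.

from collections import Counter

PROFILES = {
    "english": {"e": 0.13, "t": 0.09, "a": 0.08, "w": 0.02},
    "french":  {"e": 0.15, "t": 0.07, "a": 0.08, "w": 0.0001},
}

def identify_language(text):
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = len(letters) or 1
    best, best_score = None, float("inf")
    for lang, profile in PROFILES.items():
        # Sum of squared differences over the profiled letters.
        score = sum((counts.get(ch, 0) / total - freq) ** 2
                    for ch, freq in profile.items())
        if score < best_score:
            best, best_score = lang, score
    return best

lang = identify_language("What will the weather be like")
```

A deployed engine would combine this with the other listed signals (character sets, language tags, page title) rather than rely on frequencies alone.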
  • a data collection server 520 may be optionally provided for tracking usage of the system and/or for billing.
  • a telephone system 508, including a base station 526, a telephone company operating system 528 and a network 530, may be used as a user input and output device.
  • an Interactive Voice Response system 524 is used for receiving user input commands, by gateway server 504.
  • server 504 includes an application backbone and framework, to which are attached various software and/or hardware modules, for example, a telephony module, a network resource management module, a customization database module, a billing database module, e-mail and Intranet servers, ASR (automatic speech recognition) and TTS (text to speech) modules, an optimization engine (e.g., for aggregating page elements into menus), a web engine, a language server, an interactive ad server and/or content proxy servers.

Abstract

A method of analyzing a WWW site for readout, in which audible items are grouped (340) so as to generate voice menus (342) having a desirable property. Also disclosed is a method of audio browsing in which additional related audio data is sounded to a user.

Description

SPEECH GENERATING SYSTEM AND METHOD
FIELD OF THE INVENTION
The present invention relates to the field of text to speech conversion.
BACKGROUND OF THE INVENTION
Conversion of text to speech is known, including for reading out WWW pages, e.g., for the visually impaired. Many WWW pages include audio segments, for download or even for immediate playback.
The use of advertising on the Internet, including as part of WWW pages, is well known. A particular subject being studied by many operators in the field of Internet advertising is the targeting of advertisements. Advertisement targeting and personalization of WWW interactions generally require a user profile which is typically maintained at the behest of the site owner or the viewer. Such a profile may also be maintained by a third party advertiser that targets advertisements to a viewer through the WWW page or via other means.
PCT publication WO 98/44424, the disclosure of which is incorporated herein by reference, describes an automatic converter that modifies HTML pages by replacing text output objects by audio objects or by text objects containing translated subject matter.
PCT publication WO 00/07372, the disclosure of which is incorporated herein by reference, describes an automatic annotator that can add advertisements as an audio annotation on a stream that may include Internet information or can add ASL (American sign language) gestures to an audio stream, based on an automated voice recognition of the audio channel contents.
SUMMARY OF THE INVENTION
An aspect of some embodiments of the invention relates to reading out and annotating WWW pages using computer generated speech. In an exemplary embodiment of the invention, a part of a WWW page is translated to a viewer's preferred language before being read out. Alternatively or additionally, animation associated with the speech is added, for example an image of a stick figure waving appendages or lip movements for a face figure. The animation may be determined automatically or it may be added manually to the page or an associated database. Optionally, an animation template database associates particular animation or an animation layout with a WWW page layout. The WWW page layout may be determined, for example, by analyzing the WWW page and/or by the WWW page being created using an authoring system that uses standard layouts. Optionally, information other than or additional to animation is associated with a page template. Optionally, the read out subject matter comprises information missing from the WWW page, for example supplied by other means or information that is removed from the WWW page during conversion to a different screen format (e.g., a cellular telephone screen).
Optionally, the added animation is added to support the understanding of a converted WWW page.
An aspect of some embodiments of the invention relates to the generation of sound effects in a speech rendering of a WWW page to indicate objects of interest and/or links. In an exemplary embodiment of the invention, a "bong" sound is generated prior to presenting a link, an image of interest, a matched search word and/or other objects of interest.

An aspect of some embodiments of the invention relates to a method of combining speech and sounds. In an exemplary embodiment of the invention, a sound effects set (possibly including speech) is mixed with an output of a text-to-speech generator, after the speech is generated and before the sounds are provided to a sound card. Alternatively or additionally, a stereo signal is provided as an output.

An aspect of some embodiments of the invention relates to an audio-based automated salesman that is not associated with a particular site being viewed. In an exemplary embodiment of the invention, when a user enters a site, the salesman can suggest various products from the site, or from competing sites, to the viewer. In an exemplary embodiment of the invention, the salesman is not an agent of the viewer or of the site, but may receive a commission from the site and/or the viewer. Alternatively or additionally, the salesman converses in the native language of the viewer, optionally providing a translation and/or explanation of the products on the site. A viewer may interact with the salesman using audio and/or visual tools and may, in some embodiments of the invention, conclude a sale via the salesman.

An aspect of some embodiments of the invention relates to a method of text presentation, in which a text to speech conversion system retrieves a previously prepared audio file, in replacement for a text segment and/or a non-readable element. In an exemplary embodiment of the invention, the element is an advertisement.
An aspect of some embodiments of the invention relates to a method of analyzing a WWW page for read out. In an exemplary embodiment of the invention, a target page is divided into readable elements, each of which can be selected to be read out, and a plurality of unreadable elements that can be removed. In an exemplary embodiment of the invention, the readable elements are automatically categorized, for example "menu", "link list" and "article headline". Optionally, the categorized elements are grouped into groups. In an exemplary embodiment of the invention, the groups are selected, for example pre-selected or ad-hoc, so that a resulting voice menu and/or voice menu structure used to read out the page has desired properties, for example, a minimum or maximum number of elements. In an exemplary embodiment of the invention, associations between the categories are predefined, to assist grouping of the categorized elements using a logical scheme. Thus, a relative level of association between two categories can determine, for example, whether the two categories will be merged for a particular page, or whether two other categories will be merged. Alternatively or additionally, the number of elements in a category may determine if it is to be merged (e.g., low numbers are merged, so that voice menus are not wasted). In some embodiments of the invention, the categorization of an item may be changed, for example if more than one categorization fits an item (or group of items), and the different categorization affords a more desirable menu structure.
There is thus provided in accordance with an exemplary embodiment of the invention, a method of analyzing a WWW site for readout, comprising: parsing the site to identify items for which to generate an audible indication; categorizing the identified items by category; grouping the categories; and generating at least one voice menu based on said grouping, wherein said grouping comprises grouping so that at least some of the generated menus have a desirable property. Optionally, said desirable property comprises a minimum number of elements in a menu. Alternatively or additionally, said desirable property comprises a maximum number of elements in a menu.
In an exemplary embodiment of the invention, grouping comprises grouping based on pre-defined associations of categories. Alternatively or additionally, grouping comprises ordering said categories for presentation.
In an exemplary embodiment of the invention, the generated menus include a main menu and sub menus. Optionally, said main menu is shorter than 10 items. Optionally, said main menu is shorter than 7 items. Optionally, said main menu is shorter than 5 items.
Optionally, generating voice menus comprises merging the items in at least two categories into a single category.
In an exemplary embodiment of the invention, grouping comprises changing the categorization of an item to achieve the desired property.
There is also provided in accordance with an exemplary embodiment of the invention, a method of audio browsing of data that includes text data, comprising: selecting from a remote database, by a user, data including text data to be provided in an audio manner; automatically providing to said user, audio corresponding to said selected data; determining at least an indication of a content of said selected data; and automatically providing to said user, data in audio manner and relating to said determined indication. Optionally, selecting comprises selecting data by selecting a page. Alternatively or additionally, selecting comprises selecting data by selecting a WWW site. Alternatively or additionally, selecting comprises selecting data from a menu. Alternatively or additionally, selecting comprises selecting using a telephone handset with no visual display assistance. Alternatively or additionally, selecting comprises selecting using a telephone handset with a limited display incapable of satisfactorily displaying the data in a visual manner. Alternatively or additionally, selecting comprises selecting using a cellular telephone. In an exemplary embodiment of the invention, said data comprises a text segment. Alternatively or additionally, said data comprises an article. Alternatively or additionally, said data comprises an audio clip.
In an exemplary embodiment of the invention, said corresponding audio comprises a text to speech rendition of said text. Alternatively or additionally, said corresponding audio comprises a translation of said text. Alternatively or additionally, said corresponding audio comprises a recording of a human reading of said text. In an exemplary embodiment of the invention, determining at least an indication comprises matching a keyword of said data. Alternatively or additionally, determining at least an indication comprises identifying a source of said data. Alternatively or additionally, determining at least an indication comprises matching said data to a template.
In an exemplary embodiment of the invention, said relating data comprises a help message. Alternatively or additionally, said relating data comprises an unsolicited sales offer. Alternatively or additionally, said relating data comprises a comparison with data from a different source. Alternatively or additionally, said relating data comprises an unsolicited comment. Alternatively or additionally, said relating data comprises an advertisement. Alternatively or additionally, said relating data comprises audio of an interactive sales program.
In an exemplary embodiment of the invention, said relating data is provided locally to said user. Alternatively or additionally, said relating data is provided to compensate for lack of visual display quality. Alternatively or additionally, said relating data is provided to compensate for data which is not presented and not selected by the user for audio presentation. In an exemplary embodiment of the invention, said relating data is provided in a language native to said user and other than a language of said data.
In an exemplary embodiment of the invention, said relating data is personalized to match at least one attribute of said user. In an exemplary embodiment of the invention, said related data is sounded after said corresponding audio is sounded.
In an exemplary embodiment of the invention, said related data is requested by said user.
BRIEF DESCRIPTION OF THE FIGURES

Fig. 1 is a schematic diagram of a configuration including an Internet speech generator, in accordance with an exemplary embodiment of the invention;
Fig. 2 is a schematic block diagram of a speech and sound mixing system, in accordance with an exemplary embodiment of the invention;
Fig. 3A is an exemplary WWW page to be read out in accordance with an exemplary embodiment of the invention;
Fig. 3B is a flowchart of a process of processing the page of Fig. 3A, in accordance with an exemplary embodiment of the invention;
Fig. 3C is a flowchart of a process of reading out the page of Fig. 3A, in accordance with an exemplary embodiment of the invention;

Fig. 4 is a block diagram of a cell-phone configuration, in accordance with an exemplary embodiment of the invention; and
Fig. 5 is a schematic block diagram of a system topology, in accordance with an exemplary embodiment of the invention.
DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS

Fig. 1 is a schematic diagram of a configuration 100 including an Internet speech generator 108, in accordance with an exemplary embodiment of the invention. Configuration 100 includes a viewer 102 that browses a target site 106 via an Internet 104, for example using a browser executing on a general purpose computer, as known in the art, or using other display tools. Speech generator 108 generates speech and/or animation annotations for viewer 102. As shown, speech generator 108 is on a separate computer from both viewer 102 and site 106. Alternatively, some or all of the functionality of generator 108 may be located at viewer 102 and/or target site 106 and/or distributed between several computers, connected for example via Internet 104. In some embodiments of the invention, speech generator 108 is provided on a LAN that interconnects several viewers 102 and/or at an ISP or on a proxy server, for example one that serves a plurality of people with a same language need.
The speech generated by speech generator 108 may be transmitted over the network (LAN or Internet) as audio (e.g., using standard methods) or as codes (e.g., syllables, phonemic codes, optionally with phonic concatenation hints) for a program (e.g., a Java applet) on viewer 102 to convert into audio. Alternatively, the speech generation is performed on the computer of viewer 102.
In an exemplary embodiment of the invention, speech generator 108 comprises a page analyzer 110 that analyses the WWW page information at site 106. Exemplary types of analysis include selecting which text to convert to speech, and detection of links and advertisements. Speech annotation or conversion is performed by a speech annotator/converter 112. In an exemplary embodiment of the invention, this converter comprises a standard text-to-speech converter software module or unit. Alternatively or additionally, speech generator 108 comprises an animation annotator 114, for example for adding animation annotations to the WWW page or to the speech generated for the page. Alternatively or additionally, other multimedia elements may be added, for example, audio clips.
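The kind of page analysis performed by page analyzer 110 can be sketched, for illustration only, with a simple HTML parser that separates readable text runs from link targets. The class structure and example markup below are illustrative assumptions and are not part of the original disclosure.

```python
# Minimal sketch of a page analyzer: collect readable text and links
# from an HTML page.  Names and structure are illustrative only.
from html.parser import HTMLParser

class PageAnalyzer(HTMLParser):
    """Collects readable text segments and (href, anchor text) links."""
    def __init__(self):
        super().__init__()
        self.texts = []        # readable text segments, in document order
        self.links = []        # (href, anchor text) pairs
        self._in_link = False
        self._link_href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._link_href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_link:
            self.links.append((self._link_href, text))
        else:
            self.texts.append(text)

analyzer = PageAnalyzer()
analyzer.feed('<h1>News</h1><p>Main story.</p><a href="/sports">Sports</a>')
# analyzer.texts -> ['News', 'Main story.']
# analyzer.links -> [('/sports', 'Sports')]
```

The separated link list could then feed the "bong" annotation and menu generation stages described below.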
An optional database 116 may be provided, for example for storing page templates (described below), for storing associations of animation with speech, for storing sound clips (e.g., to replace text and/or advertisements) and/or for storing help messages (described below).
Speech generator 108 may interact with site 106 in various ways, including, for example, site 106 may retrieve the annotations from generator 108, for incorporation in its output; site 106 may be retrieved by viewer 102 via generator 108 or a separate server (not shown) that annotates the contents of site 106; or viewer 102 may retrieve site 106 and annotations from generator 108 in parallel, and add the annotations at viewer 102.
Fig. 2 is a schematic block diagram of a speech and sound mixing system 200, in accordance with an exemplary embodiment of the invention. HTML data (202) is parsed to yield plain text (204). A text to speech generator 206 converts the plain text into audio signals, to be outputted as sound-waves (208) at viewer 102. HTML is a marked-up text file. In an exemplary embodiment of the invention, some of the mark-ups are used to modify the audio output at viewer 102. In one exemplary embodiment of the invention, the mark-ups are passed to text-to-speech generator 206. However, this may require dedicated generator software and/or a special preprocessor for converting the text mark-ups into command parameters of generator 206. In an exemplary embodiment of the invention, the mark-ups are converted into audio effects and/or speech using a separate channel, which is mixed using a mixer 210 to form part of sounds 208. In an exemplary embodiment of the invention, links (212) cause a wave generator 214 to generate a "bong" sound preceding the recitation of the link. Alternatively or additionally, wave generator 214 reads out the link using a phonetic readout, rather than as English. Alternatively or additionally, the mark-up path is used to control various parameters of mixer 210, for example volume, speed and voice type (e.g., woman or child).
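The mixing step of mixer 210 can be sketched, for illustration, as a sample-by-sample sum of a sound-effect channel (e.g., the "bong" preceding a link) with the text-to-speech channel, with clipping, before the sounds are provided to a sound card. The sample values and 16-bit range below are illustrative assumptions.

```python
# Simplified sketch of the mixer of Fig. 2: sum two equal-rate sample
# streams, pad the shorter with silence, and clip to a 16-bit range.
# The sample values below are invented for illustration.

def mix_channels(speech, effects, peak=32767):
    """Mix two sample streams into one, clipping at +/- peak."""
    length = max(len(speech), len(effects))
    speech = speech + [0] * (length - len(speech))    # pad with silence
    effects = effects + [0] * (length - len(effects))
    mixed = []
    for s, e in zip(speech, effects):
        v = s + e
        mixed.append(max(-peak, min(peak, v)))        # clip to 16-bit range
    return mixed

speech = [1000, 2000, 3000]
bong = [30000, 31000]            # short effect, louder than the speech
out = mix_channels(speech, bong)
# out -> [31000, 32767, 3000]    (second sample clipped at 32767)
```

In a real system the two streams would be resampled to a common rate before mixing; that step is omitted here.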
In an exemplary embodiment of the invention, the generation of speech or other sound effects is automatic when the page is retrieved and/or displayed, for example immediate, or after a delay. Alternatively or additionally, the audio may be generated and/or presented when viewer 102 interacts with a display. In one example, links or active buttons are substituted for text with which audio is associated. Alternatively or additionally, the viewer's browser can detect the interaction with such areas. In another example, when a user clicks on a page portion that has associated audio, the audio is played.

In an exemplary embodiment of the invention, the pages are based on a template of the page structure. Although over 2 billion WWW pages are extant at present, many of the pages match one of a small number of formats. Typically, this is because there are a small number of accepted WWW page formats. Alternatively or additionally, many pages are generated using standardized tools that include formats. In an exemplary embodiment of the invention, templates for these pages associate one or more of the following with the page format, for automatic generation of text (e.g., for annotations), speech and/or animation:
(a) order of reading;
(b) identification of important vs. less important material;
(c) locations suitable for showing animation; and
(d) advertising.
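The template idea of items (a)-(d) above can be sketched, for illustration, as a lookup table keyed by page layout. The layout name, field names and element names below are hypothetical and not taken from the original disclosure.

```python
# Sketch of a template database: each known page layout is associated
# with a reading order, an importance set, animation slots and ad
# positions.  All names and values here are illustrative assumptions.

PAGE_TEMPLATES = {
    "news-front-page": {
        "reading_order": ["main_headline", "main_article",
                          "subject_headlines", "links_list"],
        "important": {"main_headline", "main_article"},
        "animation_slots": ["beside_main_article"],
        "ad_positions": ["before_main_article"],
    },
}

def reading_plan(layout, elements):
    """Order a page's elements per its template, skipping element
    types the template does not mention or the page does not have."""
    template = PAGE_TEMPLATES[layout]
    return [e for kind in template["reading_order"]
            for e in elements.get(kind, [])]

plan = reading_plan("news-front-page",
                    {"links_list": ["Sports", "Weather"],
                     "main_headline": ["Election results"]})
# plan -> ['Election results', 'Sports', 'Weather']
```

A matching template would be selected after the page format is recognized, as described in the following paragraph.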
Alternatively or additionally, the WWW page is analyzed, for example using methods known in the art, for example to detect links (e.g., based on HTML tags) and headlines (e.g., based on relative or absolute font size).
In an exemplary embodiment of the invention, a user interface is provided for generator 108, for example to allow a user to set one or more of the following parameters:
(a) speech characteristics, such as speed, volume and voice type;
(b) type and existence of animation; and
(c) existence and/or parameters of language translation.

In an exemplary embodiment of the invention, the user interface is via a WWW page.

Speech generator 108 may operate in various operational modes, including, for example, one or more of the following modes:
(a) Unsolicited explanations. Such explanations may be provided, for example, based on a user profile that can indicate a viewer's past queries. Such a user profile and/or explanations may be stored locally to viewer 102 or remotely, for example at site 106 or at generator 108.
(b) Unsolicited or solicited offer of a product which is the same as shown or is related to that shown on the WWW page. In some cases, the product is offered from the same vendor, in others from another vendor, for example under competing terms. Optionally, the contents of a competing site 106' for that product are provided to the viewer using audio.
In an exemplary embodiment of the invention, a response from the user may be received using a speech input, which may be processed, for example at viewer 102, at generator 108 and/or at a separate speech recognition server. Such speech input may be used for other exemplary operational modes as well. Alternatively or additionally, DTMF input may be used.
(c) Translation and recitations of various parts of site 106, for example automatically, or on selection of a text portion or other display object by a user.
(d) Perform transaction. In an exemplary embodiment of the invention, speech generator 108 completes a transaction with the viewer, for example regarding the contents of the currently displayed page or otherwise.
(e) Ask viewer 102 questions, for example, what is it that the viewer likes about the displayed page and/or product.
(f) Add animation. Animation can be added as a stand alone element or it may be associated with speech output or replace it, for example being in sign language. In an exemplary embodiment of the invention, the animation is that of a stick figure emphasizing the text or speech output. Alternatively or additionally, the animation is that of a face, synchronized to the audio sounds.
(g) Help. Help, as opposed to unsolicited explanations, is in response to a user request, for example, a user clicking on an item or an image of a product. In some embodiments of the invention, a software component at viewer 102 detects that a user clicked on a word and forwards this word (or image) to generator 108, for generating a "help" message.
Fig. 3A is an exemplary WWW page 300 to be read out in accordance with an exemplary embodiment of the invention. Page 300, which is similar in structure to some news pages, includes readable and non-readable elements, elements at different levels of interest and elements having different levels of relevance to the page. For example, page 300 can include an article 306, having an image 308, a headline 310 and text paragraphs 312. In addition, page 300 includes a link list 304 (e.g., a single item comprising multiple display and/or HTML elements), an auto-install control 302, other controls 314 and 316, an advertisement 326, a plurality of subject headlines (for other articles) 318 and lists of headlines 320. In addition, page 300 may include a secondary article, for example including an image 322 and a headline 324.
Although page 300 is a multi-article page, many WWW pages include only a single article, which includes, for example, one or more images, titles and subtitles, text paragraphs and controls, usually at the start and/or end of the page.
Fig. 3B is a flowchart of a process 330 of processing page 300, in accordance with an exemplary embodiment of the invention.
At 332, the "nature" of the site is optionally recognized. For example, different variations and/or processing steps are performed on different types of WWW pages. Examples of WWW page types include: news, portals, e-commerce, etc. In an exemplary embodiment of the invention, the page type is recognized by comparing the page address against a catalog of site and/or page types. Alternatively or additionally, the page address or site title may include keywords that identify the site type (e.g., checkout pages).

At 334, non-readable portions, such as HTML commands (or other language commands, for other page description languages), images, text input boxes, pull-down lists and controls are removed. Optionally, some such portions may be retained, for example to allow a user to select them using a special menu. Alternatively or additionally, a text tag portion may be retained, for example an image title, so the user can be aware of what is missing from the page. In an exemplary embodiment of the invention, an image or other non-readable item may be requested by a user to be forwarded to him, for example, by e-mail, to a cellular telephone display and/or a fax.
At 336, the readable parts are categorized. In an exemplary embodiment of the invention, the categorization is based on the type of read-out to be applied. Alternatively or additionally, the categorization is based on the hierarchical order of read-out. Exemplary categories include: headline, banner, main menu bar, links list, mail address, articles, sub-articles and tables. In an exemplary embodiment of the invention, a "type name" tag is added to each readable part, in HTML code or other parsed stream. Optionally, the names used are based on the identification of the site nature. Alternatively, standard nomenclature may be used.
The display elements may be categorized using various methods. For example, some WWW sites include tags, such as "headline", "menu" or "link". Although different tags may be used by different sites, a plurality of tags can be combined in a single category (or group, below). Alternatively or additionally, regular expressions or other rules may be used. In another example, a set of contiguous links is identified as a link list. If the text associated with the links is a multiword phrase, it is assumed the links are headlines. In another example, a sequence of paragraphs of same size font, possibly with headlines in another font, is recognized as an article. A multitude of text parsing engines are known in the art, for which a skilled practitioner may define recognition and categorization rules. The use of rules and expressions may be in addition to or instead of the use of templates. In an exemplary embodiment of the invention, a frames-like approach (as once used in AI) is used to assist in recognizing elements in a page of a certain type.

At 338, the readable parts are optionally organized, in the order in which they will be presented in the menu. The order may be a property of the site type. Alternatively or additionally, the order may be determined based on the number of elements of each type. Alternatively or additionally, the order may be at least partly random. Alternatively or additionally, the order may be based on a perceived relative importance of different items. Perceived importance may be determined, for example, based on selection (for readout) statistics (e.g., order, frequency) of this or other users.
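One of the categorization rules mentioned above — a contiguous run of links whose anchor texts are multiword phrases is taken to be a headline list, otherwise a plain link list — can be sketched as follows. The rule is a simplified illustration; real rule sets would be far richer.

```python
# Sketch of one categorization rule from the description: a run of
# contiguous links is a "headline list" when every anchor text is a
# multiword phrase, otherwise a plain "link list".  Category names
# are illustrative.

def categorize_link_run(anchor_texts):
    """Categorize a contiguous run of links by their anchor texts."""
    if anchor_texts and all(len(t.split()) > 1 for t in anchor_texts):
        return "headline list"
    return "link list"

assert categorize_link_run(["Home", "Sports", "Weather"]) == "link list"
assert categorize_link_run(
    ["Election results announced", "Storm hits the coast"]) == "headline list"
```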
In an exemplary embodiment of the invention, the user operates the system using a telephone. Thus, each menu cannot have too many options. In an exemplary embodiment of the invention, several readable parts are grouped together (340), so that the number of options will not be too great, for example, not over 5, 6, 7 or 8 readable groups. Alternatively or additionally, very short menus are undesirable, as they increase the total number of menus. So items with short menus are grouped together too.
Possibly, the titles of the menus and/or menu elements are generated in real-time, to match the grouping of categories. In an exemplary embodiment of the invention, the categories are selected so that they can be naturally combined into single menu elements in various manners. Optionally, the desirability of associating two or more particular categories into a single menu is predefined. The final menu set can thus be, for example, a function of the number of elements in each category, their relative perceived importance and the particular categories available on the page. Optionally, the determined names and/or statistics of the site are stored in a database (342) for use the next time the page is read out.
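The grouping step (340) can be sketched, for illustration, as merging any category with too few elements into its most closely associated neighbour, using a predefined association table. The threshold, association values and category names below are invented for illustration.

```python
# Sketch of menu grouping (340): categories shorter than MIN_ITEMS are
# merged into their most associated remaining category, so no voice
# menu is wasted on one or two items.  All values are illustrative.

MIN_ITEMS = 2   # assumed minimum menu length

# predefined association between category pairs (higher = merge first)
ASSOCIATION = {
    ("links list", "subject headlines"): 3,
    ("main headline", "subject headlines"): 2,
}

def group_categories(categories):
    """categories: dict of category name -> list of items."""
    merged = dict(categories)
    for name in list(merged):
        if len(merged[name]) >= MIN_ITEMS or len(merged) == 1:
            continue
        # pick the most associated partner still present
        partners = sorted(
            (c for c in merged if c != name),
            key=lambda c: ASSOCIATION.get((name, c),
                          ASSOCIATION.get((c, name), 0)),
            reverse=True)
        target = partners[0]
        merged[target] = merged[target] + merged.pop(name)
    return merged

menus = group_categories({
    "main headline": ["Election results"],
    "subject headlines": ["Sports", "Weather", "Business"],
    "links list": ["About", "Contact"],
})
# the one-item "main headline" is merged into "subject headlines"
```

A fuller implementation would also enforce the maximum menu length by splitting long categories into sub-menus.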
Optionally, a manual setup step for the page is triggered (344), for example based on the number of requests for the page and/or based on complaints.

In an exemplary embodiment of the invention, page 300 is divided into the following groups: "menus", "advertisement", "main headline", "links list", "subject headlines #1", "subject headlines #2", and "secondary article". In an exemplary embodiment of the invention, the advertisement is read out, without prompting the user. Optionally, an audio advertisement (e.g., a wav file) is provided by the advertisement provider instead of the text advertisement.

Fig. 3C is a flowchart 350 of an exemplary process of reading out an arbitrary page 300.
At 352, a site or page is chosen. In an exemplary embodiment of the invention, a user sets up a limited number of favorite sites. Alternatively or additionally, the site is selected from a hierarchical list provided by the system. Alternatively or additionally, the user enters the site address or a keyword by voice input. Alternatively or additionally, the user uses the telephone keys to enter the site address and/or keywords. In a tone telephone, the address may be ambiguous; however, such ambiguity may be settled, for example, by comparing the entry against a catalog of favorite and/or common sites.
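The keypad ambiguity mentioned above can be sketched, for illustration, by mapping each letter of a site name to its standard phone key and comparing the user's digit string against a catalog. The catalog contents are invented for illustration.

```python
# Sketch of settling an ambiguous tone-telephone entry: map site names
# onto keypad digits and match the typed digit string against a catalog
# of favorite and/or common sites.  The catalog is illustrative.

KEYPAD = {}
for digit, letters in [("2", "abc"), ("3", "def"), ("4", "ghi"),
                       ("5", "jkl"), ("6", "mno"), ("7", "pqrs"),
                       ("8", "tuv"), ("9", "wxyz")]:
    for ch in letters:
        KEYPAD[ch] = digit

def to_digits(name):
    """Spell a site name as the digits of a standard phone keypad."""
    return "".join(KEYPAD.get(ch, "") for ch in name.lower())

def resolve(entry, catalog):
    """Return the catalog sites whose names match the typed digits."""
    return [site for site in catalog if to_digits(site) == entry]

catalog = ["cnn", "bbc", "ann"]
# "cnn" and "ann" both spell 266, so the entry is ambiguous:
matches = resolve("266", catalog)
# matches -> ['cnn', 'ann']
```

When more than one site matches, the system could read the candidates out as a short menu for the user to pick from.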
At 354, advertisements on the site are played. In an exemplary embodiment of the invention, the system requests an audio clip from the advertisement provider, to replace the text and/or image. Alternatively, text to speech methods are used.
In the method of Fig. 3B, the page is analyzed for readable and non-readable parts; this may take place, for example, between 352 and 354. In an exemplary embodiment of the invention, the order of readout and/or other readout properties can be a user-associated preference. Optionally, different preferences are associated with different pages, even for a same user.
In an exemplary embodiment of the invention, a short menu of options is read out to a user. Responsive to the list, the user may dig deeper into the hierarchical structure of the site (e.g., alternative pages, sub articles). Alternatively or additionally, the system may read out an article or part of an article (358), before returning to the options list. The listing may change to reflect the fact that some articles have been read, for example, by putting them last in the list and/or using a different bong sound before the read and unread articles.
Optionally, some of the articles may be retrieved as audio files. In an exemplary embodiment of the invention, the page includes tags indicating for which articles and/or other readable or non-readable page elements there is a previously prepared audio equivalent, at the WWW site server and/or at a different location.
Once reading is completed, the user can exit (360).
In an exemplary embodiment of the invention, tone keys are used to navigate the option lists of 356. Alternatively or additionally, a user can activate the keys (or use a voice command) during a read out, for example, to bookmark, to stop, to fast forward, to rewind, to receive help, to follow a link, to activate a preset utility, to go down a level in hierarchy or to go up a level in hierarchy. The keys for these and/or other actions may be preset and/or read out to the user, as one of the options. Optionally, a key is active for an item while it is being read and for a short time after, possibly even after a next item is being read.
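The last behavior above — a key staying active for an item shortly after the next item begins — can be sketched, for illustration, with a readout schedule and a grace period. The grace period value and the simplified policy (a press within the grace window always targets the previous item) are assumptions for illustration.

```python
# Sketch of key timing during readout: a key press applies to the item
# currently being read, or to the previous item if the press comes
# within a short grace period after the next item started.  GRACE and
# the selection policy are illustrative assumptions.

GRACE = 1.5  # seconds a finished item remains selectable (assumed)

def item_for_keypress(press_time, schedule):
    """schedule: list of (start_time, item) pairs in readout order.
    Returns the item a key press at press_time applies to."""
    current = None
    previous = None
    for start, item in schedule:
        if press_time >= start:
            previous = current
            current = (start, item)
        else:
            break
    if current is None:
        return None
    start, item = current
    # shortly after a new item begins, the previous item is still active
    if previous is not None and press_time - start < GRACE:
        return previous[1]
    return item

schedule = [(0.0, "headline"), (5.0, "article")]
# a press at 5.8 s, just after the article began, targets the headline:
# item_for_keypress(5.8, schedule) -> 'headline'
```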
Following are tables showing examples of system messages and system readout, for reading out a WWW site (CNN in this example), in accordance with an exemplary embodiment of the invention. Table Ia shows a process of site analysis (generally corresponding to Fig. 3B), in accordance with an exemplary embodiment of the invention. Table Ib shows the application of this method to a particular CNN main page.
[Tables Ia and Ib appear as images in the original publication.]

TABLE Ib
Table IIa shows the steps in an exemplary process of reading out a page in accordance with an exemplary embodiment of the invention.
[The first part of Table IIa appears as an image in the original publication; the remainder of its text follows.]
seconds". Banner reading:
3. If there is a banner in the site, the system will read the banner to the user (either by reading the text in the banner or by playing the clip or other audio file of the banner).
4. Before reading / playing the Banner, the system will announce to the user: "The site is processed and will be read in a few seconds"
Article choosing:
5. The system will offer the user the articles to hear: "Press 1 for main article, press 2 for sub articles, press 0 to return "
6. If the user clicks "1" - the system reads to him (after the "Text to speech" operation) the main article title and the whole article. After reading the article, the system will ask the user for a next action by repeating the previous message.
7. If the user clicks "2" - the system will read the first "sub title" and then will announce "press 1 to hear article, press pound for next sub title, press star for previous article, press 0 to return".
8. If the article is the first one read, then the "*" option may not be offered. If it is the last one read, then the "#" option may not be offered.
9. If the user requested to hear the sub article, then the system will read the whole article to him. At the end, the system will return and read the previous message: "press 1 to hear article, press pound for next sub title, press star for previous article, press 0 to return".
End of process:
10. If the user asks to return (he clicks on "0") then the system stops and returns to the main menu.
Output: Web site is read to the customer
TABLE IIa

Table IIb shows the reading out of a main page of the CNN site (September 3, 2000, at 22:00 Israel time). As the user can choose various options of parts of the site to hear, and in order to simplify the presentation, several possibilities will be described.
[Table IIb appears as an image in the original publication.]
In some embodiments of the invention, some articles may be available only to members, which may require a payment authorization act or a log-in act. Alternatively, such acts may be implicit. The system may warn the user of the cost of reading out an article. Possibly, the system detects one or more price-quotes on the WWW page and reads them out, for example as part of the menu. Various databases, for example, have a standard record structure that includes a title, a link and a price quote. Such a structure may be used to drive parsing that detects the quote.
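Detecting a price quote in such a title/link/price record can be sketched, for illustration, with a simple pattern match. The record format and the dollar-amount pattern below are illustrative assumptions, not part of the original disclosure.

```python
# Sketch of price-quote detection in a record with a standard
# title/link/price structure.  The pattern and sample record are
# illustrative only.
import re

PRICE = re.compile(r"\$\s?\d+(?:\.\d{2})?")  # e.g. "$29.99" or "$5"

def extract_quotes(record_text):
    """Return all price quotes found in a record, for read-out."""
    return PRICE.findall(record_text)

record = 'Wireless headset - <a href="/item/42">details</a> - $29.99'
quotes = extract_quotes(record)
# quotes -> ['$29.99']
```

The detected quote could then be read out as part of the voice menu, or used to warn the user of a cost before an article is read.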
Fig. 4 is a block diagram of a cell-phone configuration 400, in accordance with an exemplary embodiment of the invention. Information from a source site 402 is transmitted, for example over the Internet or via a dedicated line, to a cellular operator 401. The content is converted at operator 401, at source site 402 or intermediate between them, using a converter 404, which converts the format and/or level of detail from a format suitable for personal computers to a format suitable for cellular telephones. This conversion may be in real-time or it may be off-line.
A text to speech converter and/or annotator 406 preferably converts parts of the converted content to speech or adds a layer of audio annotations. In an exemplary embodiment of the invention, the annotations are designed to compensate for content removed or made less desirable by converter 404. The converted and annotated content is then transmitted to a cellular telephone 408, using methods known in the art. Alternatively, converter 404 and converter 406 are combined, for example, to convert an HTML page into a hybrid image and audio content. Alternatively, the cellular telephone may serve as a browsing terminal in a configuration as shown in Fig. 1, possibly with no special allowance being made for cellular conversion, if any. For example, the cellular conversion may be performed after the audio annotations are added.
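The two-stage path through converter 404 and converter/annotator 406 can be pictured as a small pipeline. This is a sketch only; the stage functions, and the "desktop-only" marker used to stand in for content unsuited to a phone, are hypothetical placeholders for the conversion and annotation described above:

```python
def to_cellular_format(content):
    """Stage 1 (converter 404): reduce the level of detail.
    Here it simply drops lines marked as desktop-only - an
    assumed stand-in for real format conversion."""
    return [line for line in content.splitlines()
            if "desktop-only" not in line]

def annotate_with_audio(lines, tts):
    """Stage 2 (converter/annotator 406): attach a layer of audio
    annotations, which can compensate for detail removed in stage 1."""
    return [(line, tts(line)) for line in lines]

def convert_page(content, tts):
    """Full pipeline: format conversion, then audio annotation."""
    return annotate_with_audio(to_cellular_format(content), tts)
```

As the text notes, the two stages may equally be combined into one converter, or run in the opposite order (annotation first, cellular conversion after).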
Fig. 5 is a more detailed schematic block diagram 500 of a system topology, in accordance with an exemplary embodiment of the invention. A source 502 comprises, for example, one or more of a public web service 510, a hosted web service 512 and a corporate Intranet or Extranet.
The data from source 502 is provided to a gateway server 504, optionally through a proxy 516. Gateway 504 may utilize, for example, multiple language/voice generation and/or translation engines 506. An optional language ID engine 522 may be used to determine the language of the site, for example using methods known in the art, such as word recognition, character sets, language tags, letter frequency, page title and a language previously associated with the page address. A data collection server 520 may optionally be provided for tracking usage of the system and/or for billing. A telephone system 508, including a base station 526, a telephone company operating system 528 and a network 530, may be used as a user input and output device. In an exemplary embodiment of the invention, an Interactive Voice Response system 524 is used by gateway server 504 for receiving user input commands.
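One of the cues listed for language ID engine 522, letter frequency, can be sketched as a nearest-profile classifier. The profiles below are tiny toy values over a shared handful of letters, purely for illustration; real profiles would be estimated from large corpora:

```python
from collections import Counter

# Toy reference profiles (assumed values, not real corpus statistics).
PROFILES = {
    "english": {"e": 0.13, "t": 0.09, "n": 0.07},
    "german":  {"e": 0.17, "t": 0.06, "n": 0.10},
}

def identify_language(text):
    """Return the language whose letter-frequency profile is closest
    (by summed absolute difference) to the observed frequencies."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = len(letters) or 1  # avoid division by zero on empty input

    def distance(profile):
        return sum(abs(counts[ch] / total - freq)
                   for ch, freq in profile.items())

    return min(PROFILES, key=lambda lang: distance(PROFILES[lang]))
```

In the system described, such a classifier would be only one vote alongside word recognition, character sets, language tags and the page's history.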
In an exemplary embodiment of the invention, server 504 includes an application backbone and framework, to which are attached various software and/or hardware modules, for example, a telephony module, a network resource management module, a customization database module, a billing database module, e-mail and Intranet servers, ASR (automatic speech recognition) and TTS (text to speech) modules, an optimization engine (e.g., for aggregating page elements into menus), a web engine, a language server, an interactive ad server and/or content proxy servers.

It will be appreciated that the above described methods of web site annotation and readout may be varied in many ways, including changing the order of steps, changing which steps are performed on-line or off-line, such as table or index preparation, and varying the exact implementation used, which can include various hardware and software combinations. In addition, a multiplicity of features has been described. It should be appreciated that different features may be combined in different ways. In particular, not all the features are necessary in every exemplary embodiment of the invention. Software as described herein is preferably provided on a computer readable medium, such as a diskette or an optical disk. Alternatively or additionally, it may be stored on a computer, for example in a main memory or on a hard disk, both of which are also computer readable media. Where methods have been described, computer hardware programmed to perform the methods is also within the scope of the description. When used in the following claims, the terms "comprises", "includes", "have" and their conjugates mean "including but not limited to".
It will be appreciated by a person skilled in the art that the present invention is not limited by what has thus far been described. Rather, the scope of the present invention is limited only by the following claims.

Claims

1. A method of analyzing a WWW site for readout, comprising:
parsing the site to identify items for which to generate an audible indication;
categorizing the identified items by category;
grouping the categories; and
generating at least one voice menu based on said grouping,
wherein said grouping comprises grouping so that at least some of the at least one generated menu has a desirable property.
2. A method according to claim 1, wherein said desirable property comprises a minimum number of elements in a menu.
3. A method according to claim 1, wherein said desirable property comprises a maximum number of elements in a menu.
4. A method according to claim 1, wherein grouping comprises grouping based on predefined associations of categories.
5. A method according to claim 1, wherein grouping comprises ordering said categories for presentation.
6. A method according to claim 1, wherein said at least one menu comprises a main menu and sub menus.
7. A method according to claim 6, wherein said main menu is shorter than 10 items.
8. A method according to claim 6, wherein said main menu is shorter than 7 items.
9. A method according to claim 6, wherein said main menu is shorter than 5 items.
10. A method according to claim 1, wherein generating at least one voice menu comprises merging the items in at least two categories into a single category.
11. A method according to claim 1, wherein grouping comprises changing the categorization of an item to achieve the desired property.
12. A method of audio browsing of data that includes text data, comprising:
selecting from a remote database, by a user, data including text data to be provided in an audio manner;
automatically providing to said user, audio corresponding to said selected data;
determining at least an indication of a content of said selected data; and
automatically providing to said user, data in audio manner and relating to said determined indication.
13. A method according to claim 12, wherein selecting comprises selecting data by selecting a page.
14. A method according to claim 12, wherein selecting comprises selecting data by selecting a WWW site.
15. A method according to claim 12, wherein selecting comprises selecting data from a menu.
16. A method according to claim 12, wherein selecting comprises selecting using a telephone handset with no visual display assistance.
17. A method according to claim 12, wherein selecting comprises selecting using a telephone handset with a limited display incapable of satisfactorily displaying the data in a visual manner.
18. A method according to claim 12, wherein selecting comprises selecting using a cellular telephone.
19. A method according to claim 12, wherein said data comprises a text segment.
20. A method according to claim 12, wherein said data comprises an article.
21. A method according to claim 12, wherein said data comprises an audio clip.
22. A method according to claim 12, wherein said corresponding audio comprises a text to speech rendition of said text.
23. A method according to claim 12, wherein said corresponding audio comprises a translation of said text.
24. A method according to claim 12, wherein said corresponding audio comprises a recording of a human reading of said text.
25. A method according to claim 12, wherein determining at least an indication comprises matching a keyword of said data.
26. A method according to claim 12, wherein determining at least an indication comprises identifying a source of said data.
27. A method according to claim 12, wherein determining at least an indication comprises matching said data to a template.
28. A method according to claim 12, wherein said relating data comprises an advertisement.
29. A method according to claim 12, wherein said relating data comprises a help message.
30. A method according to claim 12, wherein said relating data comprises an unsolicited sales offer.
31. A method according to claim 12, wherein said relating data comprises a comparison with data from a different source.
32. A method according to claim 12, wherein said relating data comprises an unsolicited comment.
33. A method according to claim 12, wherein said relating data comprises audio of an interactive sales program.
34. A method according to claim 12, wherein said relating data is provided locally to said user.
35. A method according to claim 12, wherein said relating data is provided to compensate for lack of visual display quality.
36. A method according to claim 12, wherein said relating data is provided to compensate for data which is not presented and not selected by the user for audio presentation.
37. A method according to claim 12, wherein said relating data is provided in a language native to said user and other than a language of said data.
38. A method according to claim 12, wherein said relating data is personalized to match at least one attribute of said user.
39. A method according to claim 12, wherein said related data is sounded after said corresponding audio is sounded.
40. A method according to claim 12, wherein said related data is requested by said user.
PCT/IL2001/001009 2000-10-30 2001-10-30 Speech generating system and method WO2002037469A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2002214227A AU2002214227A1 (en) 2000-10-30 2001-10-30 Speech generating system and method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IL139347 2000-10-30
IL13934700A IL139347A0 (en) 2000-10-30 2000-10-30 Speech generating system and method

Publications (2)

Publication Number Publication Date
WO2002037469A2 true WO2002037469A2 (en) 2002-05-10
WO2002037469A3 WO2002037469A3 (en) 2002-08-29

Family

ID=11074769

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IL2001/001009 WO2002037469A2 (en) 2000-10-30 2001-10-30 Speech generating system and method

Country Status (3)

Country Link
AU (1) AU2002214227A1 (en)
IL (1) IL139347A0 (en)
WO (1) WO2002037469A2 (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5333237A (en) * 1989-10-10 1994-07-26 Hughes Aircraft Company Hypermedia structured knowledge base system
US5463713A (en) * 1991-05-07 1995-10-31 Kabushiki Kaisha Meidensha Synthesis of speech from text
US5884262A (en) * 1996-03-28 1999-03-16 Bell Atlantic Network Services, Inc. Computer network audio access and conversion system


US9430463B2 (en) 2014-05-30 2016-08-30 Apple Inc. Exemplar-based natural language processing
US11257504B2 (en) 2014-05-30 2022-02-22 Apple Inc. Intelligent assistant for home automation
US9338493B2 (en) 2014-06-30 2016-05-10 Apple Inc. Intelligent automated assistant for TV user interactions
US10904611B2 (en) 2014-06-30 2021-01-26 Apple Inc. Intelligent automated assistant for TV user interactions
US9668024B2 (en) 2014-06-30 2017-05-30 Apple Inc. Intelligent automated assistant for TV user interactions
US10659851B2 (en) 2014-06-30 2020-05-19 Apple Inc. Real-time digital assistant knowledge updates
US10446141B2 (en) 2014-08-28 2019-10-15 Apple Inc. Automatic speech recognition based on user feedback
US9818400B2 (en) 2014-09-11 2017-11-14 Apple Inc. Method and apparatus for discovering trending terms in speech requests
US10431204B2 (en) 2014-09-11 2019-10-01 Apple Inc. Method and apparatus for discovering trending terms in speech requests
US10074360B2 (en) 2014-09-30 2018-09-11 Apple Inc. Providing an indication of the suitability of speech recognition
US9986419B2 (en) 2014-09-30 2018-05-29 Apple Inc. Social reminders
US9668121B2 (en) 2014-09-30 2017-05-30 Apple Inc. Social reminders
US9886432B2 (en) 2014-09-30 2018-02-06 Apple Inc. Parsimonious handling of word inflection via categorical stem + suffix N-gram language models
US9646609B2 (en) 2014-09-30 2017-05-09 Apple Inc. Caching apparatus for serving phonetic pronunciations
US10127911B2 (en) 2014-09-30 2018-11-13 Apple Inc. Speaker identification and unsupervised speaker adaptation techniques
US11556230B2 (en) 2014-12-02 2023-01-17 Apple Inc. Data detection
US10552013B2 (en) 2014-12-02 2020-02-04 Apple Inc. Data detection
US9711141B2 (en) 2014-12-09 2017-07-18 Apple Inc. Disambiguating heteronyms in speech synthesis
US9865280B2 (en) 2015-03-06 2018-01-09 Apple Inc. Structured dictation using intelligent automated assistants
US10311871B2 (en) 2015-03-08 2019-06-04 Apple Inc. Competing devices responding to voice triggers
US9886953B2 (en) 2015-03-08 2018-02-06 Apple Inc. Virtual assistant activation
US9721566B2 (en) 2015-03-08 2017-08-01 Apple Inc. Competing devices responding to voice triggers
US10567477B2 (en) 2015-03-08 2020-02-18 Apple Inc. Virtual assistant continuity
US11087759B2 (en) 2015-03-08 2021-08-10 Apple Inc. Virtual assistant activation
US9899019B2 (en) 2015-03-18 2018-02-20 Apple Inc. Systems and methods for structured stem and suffix language models
US9842105B2 (en) 2015-04-16 2017-12-12 Apple Inc. Parsimonious continuous-space phrase representations for natural language processing
US10083688B2 (en) 2015-05-27 2018-09-25 Apple Inc. Device voice control for selecting a displayed affordance
US10127220B2 (en) 2015-06-04 2018-11-13 Apple Inc. Language identification from short strings
US10356243B2 (en) 2015-06-05 2019-07-16 Apple Inc. Virtual assistant aided communication with 3rd party service in a communication session
US10101822B2 (en) 2015-06-05 2018-10-16 Apple Inc. Language input correction
US10186254B2 (en) 2015-06-07 2019-01-22 Apple Inc. Context-based endpoint detection
US10255907B2 (en) 2015-06-07 2019-04-09 Apple Inc. Automatic accent detection using acoustic models
US11025565B2 (en) 2015-06-07 2021-06-01 Apple Inc. Personalized prediction of responses for instant messaging
US10747498B2 (en) 2015-09-08 2020-08-18 Apple Inc. Zero latency digital assistant
US11500672B2 (en) 2015-09-08 2022-11-15 Apple Inc. Distributed personal assistant
US10671428B2 (en) 2015-09-08 2020-06-02 Apple Inc. Distributed personal assistant
US9697820B2 (en) 2015-09-24 2017-07-04 Apple Inc. Unit-selection text-to-speech synthesis using concatenation-sensitive neural networks
US10366158B2 (en) 2015-09-29 2019-07-30 Apple Inc. Efficient word encoding for recurrent neural network language models
US11010550B2 (en) 2015-09-29 2021-05-18 Apple Inc. Unified language modeling framework for word prediction, auto-completion and auto-correction
US11587559B2 (en) 2015-09-30 2023-02-21 Apple Inc. Intelligent device identification
US10691473B2 (en) 2015-11-06 2020-06-23 Apple Inc. Intelligent automated assistant in a messaging environment
US11526368B2 (en) 2015-11-06 2022-12-13 Apple Inc. Intelligent automated assistant in a messaging environment
US10049668B2 (en) 2015-12-02 2018-08-14 Apple Inc. Applying neural network language models to weighted finite state transducers for automatic speech recognition
US10223066B2 (en) 2015-12-23 2019-03-05 Apple Inc. Proactive assistance based on dialog communication between devices
US10446143B2 (en) 2016-03-14 2019-10-15 Apple Inc. Identification of voice inputs providing credentials
US9934775B2 (en) 2016-05-26 2018-04-03 Apple Inc. Unit-selection text-to-speech synthesis based on predicted concatenation parameters
US9972304B2 (en) 2016-06-03 2018-05-15 Apple Inc. Privacy preserving distributed evaluation framework for embedded personalized systems
US10249300B2 (en) 2016-06-06 2019-04-02 Apple Inc. Intelligent list reading
US11069347B2 (en) 2016-06-08 2021-07-20 Apple Inc. Intelligent automated assistant for media exploration
US10049663B2 (en) 2016-06-08 2018-08-14 Apple, Inc. Intelligent automated assistant for media exploration
US10354011B2 (en) 2016-06-09 2019-07-16 Apple Inc. Intelligent automated assistant in a home environment
US10067938B2 (en) 2016-06-10 2018-09-04 Apple Inc. Multilingual word prediction
US10490187B2 (en) 2016-06-10 2019-11-26 Apple Inc. Digital assistant providing automated status report
US10192552B2 (en) 2016-06-10 2019-01-29 Apple Inc. Digital assistant providing whispered speech
US10509862B2 (en) 2016-06-10 2019-12-17 Apple Inc. Dynamic phrase expansion of language input
US10733993B2 (en) 2016-06-10 2020-08-04 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US11037565B2 (en) 2016-06-10 2021-06-15 Apple Inc. Intelligent digital assistant in a multi-tasking environment
US10297253B2 (en) 2016-06-11 2019-05-21 Apple Inc. Application integration with a digital assistant
US11152002B2 (en) 2016-06-11 2021-10-19 Apple Inc. Application integration with a digital assistant
US10269345B2 (en) 2016-06-11 2019-04-23 Apple Inc. Intelligent task discovery
US10089072B2 (en) 2016-06-11 2018-10-02 Apple Inc. Intelligent device arbitration and control
US10521466B2 (en) 2016-06-11 2019-12-31 Apple Inc. Data driven natural language event detection and classification
US10553215B2 (en) 2016-09-23 2020-02-04 Apple Inc. Intelligent automated assistant
US10043516B2 (en) 2016-09-23 2018-08-07 Apple Inc. Intelligent automated assistant
US10593346B2 (en) 2016-12-22 2020-03-17 Apple Inc. Rank-reduced token representation for automatic speech recognition
US10755703B2 (en) 2017-05-11 2020-08-25 Apple Inc. Offline personal assistant
US11405466B2 (en) 2017-05-12 2022-08-02 Apple Inc. Synchronization and task delegation of a digital assistant
US10410637B2 (en) 2017-05-12 2019-09-10 Apple Inc. User-specific acoustic models
US10791176B2 (en) 2017-05-12 2020-09-29 Apple Inc. Synchronization and task delegation of a digital assistant
US10810274B2 (en) 2017-05-15 2020-10-20 Apple Inc. Optimizing dialogue policy decisions for digital assistants using implicit feedback
US10482874B2 (en) 2017-05-15 2019-11-19 Apple Inc. Hierarchical belief states for digital assistants
US11217255B2 (en) 2017-05-16 2022-01-04 Apple Inc. Far-field extension for digital assistant services

Also Published As

Publication number Publication date
IL139347A0 (en) 2001-11-25
AU2002214227A1 (en) 2002-05-15
WO2002037469A3 (en) 2002-08-29

Similar Documents

Publication Publication Date Title
WO2002037469A2 (en) Speech generating system and method
US8849895B2 (en) Associating user selected content management directives with user selected ratings
US8510277B2 (en) Informing a user of a content management directive associated with a rating
US8849659B2 (en) Spoken mobile engine for analyzing a multimedia data stream
US9092542B2 (en) Podcasting content associated with a user account
US6885736B2 (en) System and method for providing and using universally accessible voice and speech data files
US8001490B2 (en) System, method and computer program product for a content publisher for wireless devices
EP0848373B1 (en) A system for interactive communication
CN100568241C (en) Method and system for centralized content management
US20070214148A1 (en) Invoking content management directives
US6771743B1 (en) Voice processing system, method and computer program product having common source for internet world wide web pages and voice applications
US20020097261A1 (en) Apparatus and method for simple wide-area network navigation
US20060155769A1 (en) Serving signals
JP2008027454A (en) System and method for using voice over telephone to access, process, and carry out transaction over internet
US20070208564A1 (en) Telephone based search system
WO2001014999A2 (en) System and method for structured news release generation and distribution
KR20040035589A (en) System for providing information converted in response to search request
WO2002063460A2 (en) Method and system for automatically creating voice xml file
KR20010085572A (en) Electronic bulletin board system and mail server
JPH11232192A (en) Data processing system and method for archiving and accessing electronic message
JP3789614B2 (en) Browser system, voice proxy server, link item reading method, and storage medium storing link item reading program
US20040150676A1 (en) Apparatus and method for simple wide-area network navigation
US10672037B1 (en) Automatic generation of electronic advertising messages containing one or more automatically selected stock photography images
US7272659B2 (en) Information rewriting method, recording medium storing information rewriting program and information terminal device
KR20050045650A (en) Information suppling system and method with info-box

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP