CN101482975A

CN101482975A - Method and apparatus for converting words into animation

Info

Publication number: CN101482975A
Application number: CNA2008100190049A
Authority: CN
Inventors: 李嘉辉
Original assignee: AVANTOUCH SOFTWARE Co Ltd
Priority date: 2008-01-07
Filing date: 2008-01-07
Publication date: 2009-07-15

Abstract

A method for converting text into animation and device thereof is capable of converting text into corresponding animation, comprising steps of text inputting; speech synthesizing, synthesizing an input text in order to obtain corresponding audio files; video synthesizing, including steps of spelling analyzing and mouth shape synthesizing, which are doing phonetic analysis for the text to obtain corresponding spellings for the text, then extracting images corresponding to the spelling from a preset data base of mouth shape of spelling, and finally synthesizing the images into a video files; animation synthesizing, synthesizing the audio file and video file into images. The device for converting text into animation comprises a tent inputting module, a speech synthesizing module, a video synthesizing module, and an animation synthesizing module. The text input by users are processed by the modules, thereby the animation in accordance with text is obtained, achieving the goal of conversion of text to animation.

Description

A kind of method and apparatus of converting words into animation

Technical field

The present invention relates to a kind of method and apparatus of converting words into animation, and be particularly related to a kind of method and apparatus of the converting words into animation in conjunction with mobile communication technology and multimedia technology.

Background technology

In today of mobile communication technology and multimedia information technology develop rapidly, the application of audio frequency, video and people's work, life are more and more closer, are also playing the part of more and more important role in commerce is used.

Mobile communication has experienced first generation analog network mobile communication technology (1G), second generation digital network mobile communication technology (2G) afterwards, be about to welcome 3G (Third Generation) Moblie technology (3G), compare with the second generation mobile communication technology that is using at present with the first generation that with the analogue technique is representative, 3G (Third Generation) Moblie technology (3G) will have wideer bandwidth, higher transmission speed.3G (Third Generation) Moblie technology (3G) can not only transporting speech, can also transmit data, thereby quick and easy wireless application is provided.Can realize that high speed data transfer, wideband multimedia service and streaming media service are another principal features of 3G (Third Generation) Moblie technology (3G).3G (Third Generation) Moblie technology (3G) high-speed mobile can be inserted and the service of internet protocol-based combines, provide real-time multimedia and flow medium function, for example: real-time video phone (video conference), video/audio stream, long distance wireless supervision, multimedia real-time game, video request program etc.Therefore, the application of 3G (Third Generation) Moblie technology (3G) has the application of wide development space, particularly audio frequency, video to become a new direction of the market demand.As the technical foundation of phonetic synthesis and Video Applications, design one reasonably, the processing module of converting words into animation has very important research and practical value efficiently.

The patent of number of patent application CN200510034257.X (patent publication No. CN1707550) foundation and the access method thereof of pronunciation mouth shape cartoon databank " sound pronunciation with " has been described a kind of with sound pronunciation and the related technical scheme of pronunciation mouth shape animation.Its implementation is: set up three databases earlier, i.e. dictionary database and its corresponding voice bank, the basic picture library of the phonetic symbol pronunciation degree of lip-rounding; Then in the expression mode of relative value or number percent, set up three subdata bases or comprise the total data storehouses of a subdata base in all shape of the mouth as one speaks picture and phonetic symbol, phonetic symbol and individual character, individual character and sentence are interrelated respectively; During visit, the content Word message input with learning is decomposed into individual character with sentence; According to dictionary database, find the phonetic symbol and the pronunciation of individual character correspondence; According to according to the basic picture library of the phonetic symbol pronunciation degree of lip-rounding, find pronunciation degree of lip-rounding parent map from phonetic symbol; Last dispensed give whenever dehisce the type picture and with the sound synchronous playing, make that sound and pronunciation mouth shape figure's is synchronous.

Yet the method has certain limitation, and at first, that this method must rely on is personal, distinctive dictionary database and its corresponding voice bank, the basic picture library of the phonetic symbol pronunciation degree of lip-rounding, yet these databases can not be used on the common mobile communication equipment; Secondly, the method obtains the principle of the degree of lip-rounding based on phonetic symbol.

In view of the above problems, for provide one reasonably, the processing module of converting words into animation efficiently, the method of the converting words into animation of the voice system that meets Chinese character is provided, make this method have more compatibility simultaneously, can be widely used in mobile communication and multimedia information technique field, the present invention has designed a kind of method and apparatus of converting words into animation.

Summary of the invention

First purpose of the present invention is to provide a kind of method of converting words into animation, literal can be converted into corresponding animation, and described method comprises: the literal input step; The phonetic synthesis step is carried out phonetic synthesis to obtain corresponding audio files with the literal of importing; The video synthesis step, with the literal of input by the video synthesis module synthetic after, obtain the video file corresponding with literal; The animation synthesis step synthesizes audio file and video file and obtains the animation corresponding with literal.

According to the method for the converting words into animation that the object of the invention provided, wherein the video synthesis step more comprises: spelling analyzing step and shape of the mouth as one speaks synthesis step.The spelling analyzing step is carried out spelling analyzing to obtain the phonetic corresponding with literal with literal; Shape of the mouth as one speaks synthesis step extracts the shape of the mouth as one speaks picture corresponding with phonetic from default phonetic shape of the mouth as one speaks data bank, at last shape of the mouth as one speaks picture is synthesized video file.

According to the method for the converting words into animation that the object of the invention provided, wherein more comprise: video conversion module, above-mentioned video file is compressed and the form conversion after by the video conversion step, transform into and be fit to the final video file used.

According to the method for the converting words into animation that the object of the invention provided, wherein the animation synthesis step more comprises shape of the mouth as one speaks synchronizing step, so that audio file and video file are synchronous and T.T. is consistent.

Purpose according to the present invention provides a kind of method of converting words into animation, wherein the spelling analyzing step is vowel and non-vowel with described spelling analyzing, shape of the mouth as one speaks synthesis step is written into and synthesizes video file for extracting respectively and resolved vowel and the pairing shape of the mouth as one speaks picture of non-vowel that comes out from default shape of the mouth as one speaks data bank.Especially, in order to reach the purpose that reduces synthetic complexity, the shape of the mouth as one speaks picture of non-vowel correspondence is the pairing shape of the mouth as one speaks picture of one of them vowel.

Purpose according to the present invention provides a kind of method of converting words into animation, wherein the spelling analyzing step is initial consonant and simple or compound vowel of a Chinese syllable with described spelling analyzing, shape of the mouth as one speaks synthesis step is written into and synthesizes video file for extracting respectively and resolved initial consonant and the pairing shape of the mouth as one speaks picture of simple or compound vowel of a Chinese syllable that comes out from default shape of the mouth as one speaks data bank.Especially, in order to reach the purpose that reduces synthetic complexity and improve synthetic precision, initial consonant and simple or compound vowel of a Chinese syllable all can be subdivided into different groups again according to articulation type is different, and same group initial consonant and simple or compound vowel of a Chinese syllable are very little owing to distinguishing, and can use identical shape of the mouth as one speaks picture.

Purpose according to the present invention provides a kind of method of converting words into animation, for synthetic animation being adapted to different application demands, more comprise the following steps: the animated transition step, animation file by compression and form conversion, is transformed into the final animation file that is fit to application.

Another object of the present invention is to provide a kind of device of converting words into animation, comprising: the literal load module, and the user is used for input characters; Phonetic synthesis module, described literal obtain and described literal corresponding audio files after synthesizing by the phonetic synthesis module; Video synthesis module, described literal obtain the video file corresponding with described literal after synthesizing by the video synthesis module; The animation synthesis module, after described audio file and described video file synthesize by the animation synthesis module, obtain the animation corresponding with described literal, the literal of user's input is through finally obtaining the animation corresponding with literal, to realize the purpose of converting words into animation after the above-mentioned module.

According to the device of the converting words into animation that the object of the invention provided, wherein the video synthesis module more comprises: the spelling analyzing module, and the literal of being imported obtains the phonetic corresponding with literal by after the spelling analyzing module parses; Shape of the mouth as one speaks synthesis module, the shape of the mouth as one speaks picture corresponding with phonetic that extracts from the default phonetic shape of the mouth as one speaks data bank in outside behind shape of the mouth as one speaks synthesis module, synthesizes video file.

According to the device of the converting words into animation that the object of the invention provided, wherein more comprise: video conversion module, above-mentioned video file is compressed and the form conversion after by video conversion module, transform into and be fit to the final video file used.

According to the device of the converting words into animation that the object of the invention provided, wherein the animation synthesis module more comprises shape of the mouth as one speaks synchronization module, so that audio file and video file are synchronous and T.T. is consistent.

The method and apparatus of the converting words into animation that the object of the invention provided mainly has the following advantages:

1, can be according to the voice system and the principle of Chinese character, provide one rationally, the disposal route of converting words into animation efficiently;

2, have good compatibility, can be widely used in mobile communication and multimedia information technique field.

Description of drawings

Fig. 1 is a kind of method flow diagram of converting words into animation;

Fig. 2 is the apparatus function piece figure according to a kind of converting words into animation of Fig. 1;

Fig. 3 is the concrete exploded view according to shape of the mouth as one speaks synthesis step in the video synthesis step in the method for the converting words into animation of first embodiment of the invention.

Embodiment

Embodiment one

Please refer to Fig. 1, Fig. 1 is the process flow diagram according to the method 100 of a kind of converting words into animation of specific embodiment.Converting words into animation method 100 can be converted into literal corresponding animation, and converting words into animation method 100 comprises the following steps: literal input step 110, phonetic synthesis step 120, video synthesis step 130 and animation synthesis step 140.

The user is input characters or accept the literal that other people transmit at first, pass through phonetic synthesis step 120 then, the literal of these inputs or the literal that other people transmitted are carried out phonetic synthesis (Text-To-Speech, TTS), obtain and described literal corresponding audio files, Word message is converted into the audio-frequency information that to listen.At present speech synthesis technique is ripe relatively, can directly adopt ripe existing commercial module, the phonetic synthesis module that provides as companies such as Microsoft, IBM.Generally speaking, what comprise following parameter in the phonetic synthesis step 120 is provided with the sound setting, as: male voice, female voice, children; The audio sample size is provided with, as 8 audio samples, 16 audio samples; Sampling rate or the like.By obtaining after the phonetic synthesis step 120 and described literal corresponding audio files, and generally be input in the animation synthesis step 140 in the mode of wav formatted file.

When carrying out phonetic synthesis step 120, also need through video synthesis step 130, video synthesis step 130 comprises 2 important ingredients: spelling analyzing step 132 and shape of the mouth as one speaks synthesis step 134.The literal of these inputs or the literal that other people transmitted are at first resolved to concrete phonetic through spelling analyzing step 132.Generally speaking, as long as the system of spelling input method is provided, as: the Chinese character pinyin input method of mobile communication terminal, Chinese character pinyin input method of computer system or the like, all have a built-in encode Chinese characters for computer and phonetic storehouse, the purpose of phonetic can be realized Chinese character is converted in needs this encode Chinese characters for computer of dependence of spelling analyzing step 132 and phonetic storehouse.This also is why can realize compatible reason easily according to the method 100 of converting words into animation.Then, according to extracting the shape of the mouth as one speaks file corresponding in preset rule in the shape of the mouth as one speaks synthesis step 134 and the externally default phonetic shape of the mouth as one speaks data bank, and then be written into the pairing shape of the mouth as one speaks static images of described shape of the mouth as one speaks file, and then synthesize video file with phonetic.

Please refer to Fig. 3, Fig. 3 is the concrete exploded view according to shape of the mouth as one speaks synthesis step in the video synthesis step in the method for the converting words into animation of the embodiment of the invention one.In the present embodiment, video synthesis step 130 will be the literal extraction of being imported and be written into corresponding video file that it is all trickleer to ignore initial consonant, mouth shape, is difficult to differentiate difference, so ignores the influence of initial consonant among the embodiment one according to different simple or compound vowel of a Chinese syllable. has 40 simple or compound vowel of a Chinese syllable in the Chinese phonetic alphabet; "a"、"ai"、"ao"、"an"、"ang"、"o"、"ou"、"e"、"ei"、"en"、"eng"、"er"、"-i"、"i"、"ia"、"iao"、"ian"、"iang"、"ie"、"iu"、"in"、"ing"、"iong"，"iou"、"u"、"ua"、"uo"、"uai"、"uei"、"ui"、"un"、"uan"、"uen"、"uang"、"ueng"、"ong"、"v"、"ve"、"van"、"vn"。 The user at first input characters or accept literal that other people transmit after, as: " today, weather was pretty good; Go out object for appreciation "; This section literal at first resolves to concrete phonetic (the 2nd row as shown in Fig. 3 form) through spelling analyzing step 132; Then; According to preset rule:ignore initial consonant; Only extract and be written into corresponding video file according to simple or compound vowel of a Chinese syllable; And from the default phonetic shape of the mouth as one speaks data bank in outside, extract the shape of the mouth as one speaks file corresponding (the 5th row as shown in Fig. 3 form) with phonetic, and then be written into the pairing shape of the mouth as one speaks static images of described shape of the mouth as one speaks file (the 6th row as shown in Fig. 3 form), and then synthesize video file.Wherein, video file can be generally the GIF formatted file for the dynamic shape of the mouth as one speaks picture of being made up of according to sequential described shape of the mouth as one speaks static images; Perhaps the dynamic video of being made up of according to sequential described shape of the mouth as one speaks static images is generally the AVI formatted file.

After respectively audio file and video file being ready to by phonetic synthesis step 120 and video synthesis step 130, then will described audio file and described video file be synthesized the animation that the existing audio frequency that obtains correspondence has video again through animation synthesis step 140.For example: the AVI format video file that generates through video synthesis step 130 can synthesize by further corresponding with it audio file, obtains the animation file that a new existing audio frequency has video again; And can make up by further corresponding with it audio file (for example can be the audio file of amr form) through the GIF format video file that video synthesis step 130 generates, obtain the animation file that existing audio frequency has video again.

In addition, in order to keep the true to nature of animation, in animation synthesis step 140, it should be noted that: the first, the T.T. of video file (comprising above-mentioned dynamic shape of the mouth as one speaks picture or dynamic video) should be consistent with audio file (being generally the wav formatted file); The second, pairing audio file of each literal and video file also should with the real case near-synchronous.Therefore, more comprise shape of the mouth as one speaks synchronizing step 142 at animation synthesis step 140, in shape of the mouth as one speaks synchronizing step 142, calculate each literal pairing audio file time and video file time, so that described audio file and described video file are synchronous and T.T. is consistent.It should be noted that, if be under the prerequisite that GIF format video file and corresponding audio file (for example can be the audio file of amr form) are made up in video synthesis step 130, because GIF format video file itself can't add sound, therefore make up the animation file that is produced and in fact contain 2 files: a GIF format video file and an amr format audio file, therefore, in order to access smooth existing audio frequency the animation file of video is arranged again, the shape of the mouth as one speaks synchronizing step 142 of this moment is even more important.

Shape of the mouth as one speaks synchronizing step 142 concrete principle of work can body with reference among the figure 3, for example: 2880 milliseconds of total times spent of TTS file that " today, weather was pretty good; go out object for appreciation " this section literal generates, wherein, comma takies 530 milliseconds, 213 milliseconds of remaining 11 Chinese characters (2880-530) ÷, 11 ≈ then need be according to a definite sequence (sequence shown in the row of the 7th as shown in Fig. 3 form) when extracting and being written into described shape of the mouth as one speaks static images.

In addition, in order better to adapt to the needs of using, the method 100 of converting words into animation also further comprises animated transition step 150, described animation file is passed through compression and form conversion etc., transform into the final animation file that is fit to application, as be converted into FLV (Flash Video) the stream media format file that is adapted at using on the computer network, perhaps be fit to the 3GP formatted file of application and mobile communication terminal.

Please refer to Fig. 2, Fig. 2 is the apparatus function piece figure according to a kind of converting words into animation of Fig. 1.The device 200 of described converting words into animation comprises: literal load module 210, and the user is used for input characters; Phonetic synthesis module 220, literal obtain and described literal corresponding audio files after synthesizing by phonetic synthesis module 210; Video synthesis module 230, literal obtain the video file corresponding with described literal after synthesizing by video synthesis module 230; Animation synthesis module 240, after audio file and video file synthesize by animation synthesis module 240, obtain the animation corresponding with described literal, the literal of user's input is through finally obtaining the animation corresponding with literal after the above-mentioned module 210～240, to realize the purpose of converting words into animation.

The video synthesis module 230 of the device 200 of converting words into animation more comprises: spelling analyzing module 232, the literal of being imported obtain the phonetic corresponding with literal after resolving by spelling analyzing module 232; Shape of the mouth as one speaks synthesis module 234, the shape of the mouth as one speaks picture corresponding with phonetic that extracts from the default phonetic shape of the mouth as one speaks data bank in outside behind shape of the mouth as one speaks synthesis module, synthesizes video file.

The device 200 of converting words into animation, wherein animation synthesis module 240 comprises shape of the mouth as one speaks synchronization module 242, so that audio file and video file are synchronous and T.T. is consistent.

The device 200 of converting words into animation wherein more comprises: video conversion module 250, above-mentioned video file is compressed and the form conversion after by video conversion module, and transform into and be fit to the final video file used.

Each module principle of work of the device 200 of converting words into animation is corresponding with each step in the method 100 of converting words into animation, does not repeat them here.

In the present embodiment, though though in video synthesis step 130, ignored the influence of initial consonant, only extract and be written into corresponding video file according to simple or compound vowel of a Chinese syllable, but in real process, if reach higher precision, the influence of initial consonant also can be taken into account fully, under the situation of considering the initial consonant influence, its principle of work basically identical, difference is that phonetic shape of the mouth as one speaks data bank need increase the shape of the mouth as one speaks of initial consonant, therefore phonetic shape of the mouth as one speaks data bank relatively can be bigger, algorithm difficulty relatively, do not repeat them here.

Embodiment two

According to the method for the converting words into animation of the embodiment of the invention two, except that the principle of work of video synthesis step 130 and embodiment distinguished to some extent, other each steps realized principle basically identicals, so all the other each steps will no longer be introduced.Analysis mode among analysis mode among the embodiment two in the spelling analyzing step 132 and the embodiment one is different, only distinguishes vowel and non-vowel in the spelling analyzing step 132.Specifically: the Chinese phonetic alphabet is made up of initial consonant and simple or compound vowel of a Chinese syllable, and simple or compound vowel of a Chinese syllable is mainly combined by vowel, and therefore, the combination of vowel can all simple or compound vowel of a Chinese syllable pronunciations of approximate simulation.Have 5 vowels in the Chinese phonetic alphabet, be respectively " a ", " e ", " i ", " o ", " u ", therefore, only need set up the pairing shape of the mouth as one speaks static images of these 5 vowels in advance.Its pairing shape of the mouth as one speaks static images also set up respectively in all the other non-vowels.When resolved one-tenth vowel of Chinese character literal and non-vowel, shape of the mouth as one speaks synthesis step 134 will extract the picture corresponding with described phonetic, and synthesize described video file respectively according to these vowels and non-vowel from default phonetic shape of the mouth as one speaks data bank.

It needs to be noted, because the non-vowel shape of the mouth as one speaks is all trickleer, be difficult to differentiate difference, and because the u in the vowel relatively is fit to replace simulating for non-vowel, therefore only need to set up " a ", " e ", " i ", " o ", the pairing shape of the mouth as one speaks static images of " u " these 5 vowels, can realize the purpose of whole converting words into animation by the mode of permutation and combination.

For example: text: today, weather was pretty good, its phonetic is: Jin Tian Tian Qi Bu Cuo, can resolve to vowel according to above-mentioned spelling analyzing method: uiu uiau uiua ui uu uuo, from default phonetic shape of the mouth as one speaks data bank, extract the picture corresponding then respectively successively, and synthesize the video file that needs with described vowel.

Embodiment two described methods have phonetic shape of the mouth as one speaks data bank and set up easily and safeguard, take up room little, algorithm simple, the advantage that realizes easily.Though but the phonetic of vowel simulation and real phonetic are approximate, but still significant difference is arranged, therefore the animation file effect that adopts this scheme to obtain is compared effect with embodiment one and can be differed from.

Embodiment three

According to the method for the converting words into animation of the embodiment of the invention three, except that the principle of work of video synthesis step 130 and embodiment distinguished to some extent, other each steps realized principle basically identicals, so all the other each steps will no longer be introduced.Same, all different among analysis mode among the embodiment three in the spelling analyzing step 132 and embodiment one and the embodiment two.Specifically: the Chinese phonetic alphabet is made up of initial consonant and simple or compound vowel of a Chinese syllable, and according to the difference of articulation type, 21 initial consonants can be divided into 8 classes.Initial consonant pronunciation mechanism of the same type is similar, and when sound joined influencing of coarticulation before and after considering, the syllable with initial consonant of the same type was similar to the influence of preceding syllable.This 8 class is respectively: plosive is unaspirated: " b ", " d ", " g ", plosive is supplied gas: " p ", " t ", " k ", affricate is unaspirated: " z ", " zh ", " j ", affricate is supplied gas: " c ", " ch ", " q ", fricative voiceless sound: " f ", " s ", " sh ", " x ", " h ", fricative voiced sound: " r ", nasal sound: " m ", " n ", lateral: " l ", therefore, only need set up corresponding shape of the mouth as one speaks static images in advance and get final product for these 8 initial consonant classifications.

In like manner, according to the difference of articulation type, 40 simple or compound vowel of a Chinese syllable can be divided into 4 classes.Simple or compound vowel of a Chinese syllable pronunciation mechanism of the same type is similar, and when sound joined influencing of coarticulation before and after considering, the syllable with simple or compound vowel of a Chinese syllable of the same type was similar to the influence of back syllable.Opening is exhaled class: " a ", " ai ", " ao ", " an ", " ang ", " o ", " ou ", " e ", " ei ", " en ", " eng ", " er ", class of syllables with i as the final or a final beginning with i class " i ", " i ", " ia ", " iao ", " ian ", " iang ", " ie ", " iu ", " in ", " ing ", " iong ", " iou ", heal up and exhale class " u ", " ua ", " uo ", " uai ", " uei ", " ui ", " un ", " uan ", " uen ", " uang ", " ueng ", " ong ", scoop up mouth and exhale class " v ", " ve ", " van ", " vn ", therefore, only need set up corresponding shape of the mouth as one speaks static images in advance for these 4 simple or compound vowel of a Chinese syllable classifications gets final product.

After Chinese character literal difference this 8 class initial consonant classification of resolved one-tenth and 4 class simple or compound vowel of a Chinese syllable classifications, shape of the mouth as one speaks synthesis step 134 will be respectively according to 8 class initial consonant classifications and 4 class simple or compound vowel of a Chinese syllable classifications, from default phonetic shape of the mouth as one speaks data bank, extract and described 8 class initial consonant classifications and the corresponding picture of 4 class simple or compound vowel of a Chinese syllable classifications, and synthesize described video file.

In the present embodiment, in video synthesis step 130, though serve as according to extraction and be written into corresponding video file with initial consonant classification and simple or compound vowel of a Chinese syllable classification.But in real process, still can adopt other mode to carry out spelling analyzing step 132 and shape of the mouth as one speaks synthesis step 134.For example: only consider the initial consonant classification, ignore simple or compound vowel of a Chinese syllable; Only consider the simple or compound vowel of a Chinese syllable classification, ignore initial consonant; Consider the initial consonant classification, consider simple or compound vowel of a Chinese syllable itself simultaneously; Consider the simple or compound vowel of a Chinese syllable classification, consider initial consonant itself simultaneously.The array mode that these are different will influence fidelity of the taking up room of foundation of shape of the mouth as one speaks resources bank and maintenance, shape of the mouth as one speaks resources bank, shape of the mouth as one speaks simulation or the like, therefore, can come to determine to carry out in which way the spelling analyzing and the shape of the mouth as one speaks according to the actual needs and synthesize.For example: when needing to realize the real-time video phone or implementing literal video/audio synchronous playing on mobile communication terminal, because the display screen of mobile communication terminal is less, therefore less demanding to fidelity can adopt the mode as implementing two.When will be,,, therefore can adopt as implementing one or implement three mode to having relatively high expectations of fidelity because display screen is bigger at giant-screen real-time video phone.

In sum, though the present invention discloses as above with preferred embodiment, it is not in order to limit the present invention.Those of ordinary skill in the technical field under the utility model, in not breaking away from spirit and scope of the present utility model, various variations and the retouching done all are considered as being equal to the utility model, all belong to protection content of the present invention.

Claims

1, a kind of method of converting words into animation can be converted into literal corresponding animation, it is characterized in that comprising the following steps:

(1) literal input step;

(2) phonetic synthesis step is carried out phonetic synthesis with described literal, obtains and described literal corresponding audio files;

(3) video synthesis step carries out video with described literal and synthesizes, and obtains the video file corresponding with described literal;

(4) animation synthesis step synthesizes described audio file and described video file, obtains the animation file corresponding with described literal.

2, the method for converting words into animation according to claim 1 is characterized in that: described video synthesis step comprises---the spelling analyzing step, described literal is carried out spelling analyzing, and obtain the phonetic corresponding with described literal; Shape of the mouth as one speaks synthesis step extracts the picture corresponding with described phonetic, and synthesizes described video file from default phonetic shape of the mouth as one speaks data bank.

3, the method for converting words into animation according to claim 2 is characterized in that: the dynamic picture of described video file for being made up of according to sequential described picture.

4, the method for converting words into animation according to claim 2 is characterized in that: the dynamic video of described video file for being made up of according to sequential described picture.

5, the method for converting words into animation according to claim 1 and 2 is characterized in that: described animation synthesis step comprises shape of the mouth as one speaks synchronizing step, makes described audio file and described video file synchronous, and T.T. is consistent.

6, the method for converting words into animation according to claim 2 is characterized in that: described spelling analyzing step is to be vowel and non-vowel with described spelling analyzing.

7, the method for converting words into animation according to claim 6 is characterized in that: described shape of the mouth as one speaks synthesis step extracts and described vowel and the corresponding picture of described non-vowel respectively from default shape of the mouth as one speaks data bank, and synthesizes described video file.

8, the method for converting words into animation according to claim 7 is characterized in that: the picture of the described non-vowel correspondence that described shape of the mouth as one speaks synthesis step extracts from default shape of the mouth as one speaks data bank is the pairing picture of one of described vowel.

9, the method for converting words into animation according to claim 2 is characterized in that: described spelling analyzing step is to be initial consonant and simple or compound vowel of a Chinese syllable with described spelling analyzing.

10, the method for converting words into animation according to claim 9, it is characterized in that: described initial consonant is further resolved to the unaspirated class of plosive, plosive supply gas class, the unaspirated class of affricate, affricate supply gas class, fricative voiceless sound class, fricative voiced sound class, nasal sound class, lateral class.

11, the method for converting words into animation according to claim 9 is characterized in that: described simple or compound vowel of a Chinese syllable is further resolved to, and opening is exhaled class, class of syllables with i as the final or a final beginning with i class, is healed up and exhale class, Cuo Kou to exhale class.

12, according to the method for claim 9 or 10 described converting words into animation, it is characterized in that: described shape of the mouth as one speaks synthesis step only extracts the picture corresponding with one of described all kinds of initial consonants or all kinds of initial consonants from default shape of the mouth as one speaks data bank, and synthesizes video file.

13, according to the method for claim 9 or 11 described converting words into animation, it is characterized in that: described shape of the mouth as one speaks synthesis step only extracts the picture corresponding with one of described all kinds of simple or compound vowel of a Chinese syllable or all kinds of simple or compound vowel of a Chinese syllable from default shape of the mouth as one speaks data bank, and synthesizes video file.

14, according to the method for claim 9 or 10 or 11 described converting words into animation, it is characterized in that: described shape of the mouth as one speaks synthesis step extracts and described all kinds of initial consonants and the corresponding described picture of described each simple or compound vowel of a Chinese syllable respectively from default shape of the mouth as one speaks data bank, and synthesizes video file.

15, the method for converting words into animation according to claim 1 and 2, it is characterized in that: further comprise the animated transition step after the described animation synthesis step, this step by compression and form conversion, transforms into the final animation file that is fit to application with described animation file.

16, a kind of device of converting words into animation can be converted into literal corresponding animation, it is characterized in that comprising:

(1) literal load module, the user is used for importing described literal;

(2) phonetic synthesis module, described literal obtain and described literal corresponding audio files after synthesizing by the phonetic synthesis module;

(3) video synthesis module, described literal obtain the video file corresponding with described literal after synthesizing by the video synthesis module;

(4) animation synthesis module, described audio file and described video file obtain the animation corresponding with described literal after synthesizing by the animation synthesis module.

17, the device of converting words into animation according to claim 16 is characterized in that: described video synthesis module more comprises---the spelling analyzing module, and described literal obtains the phonetic corresponding with described literal by after the spelling analyzing module parses; And shape of the mouth as one speaks synthesis module, the picture corresponding with described phonetic that extracts from the default phonetic shape of the mouth as one speaks data bank in outside is by synthesizing described video file behind the shape of the mouth as one speaks synthesis module.

18, according to the device of claim 16 or 17 described converting words into animation, it is characterized in that: this device further comprises video conversion module, described video file compresses by video conversion module and the form conversion, transforms into the final video file that is fit to application.

19, according to the device of claim 16 or 17 described converting words into animation, it is characterized in that: described animation synthesis module further comprises shape of the mouth as one speaks synchronization module, makes described audio file and described video file synchronous, and T.T. is consistent.