WO2017107518A1

WO2017107518A1 - Method and apparatus for parsing voice content

Info

Publication number: WO2017107518A1
Application number: PCT/CN2016/096186
Authority: WO
Inventors: 周蕾蕾
Original assignee: 乐视控股（北京）有限公司; 乐视致新电子科技（天津）有限公司
Priority date: 2015-12-25
Filing date: 2016-08-22
Publication date: 2017-06-29
Also published as: CN105912521A

Abstract

A method and apparatus for parsing a voice content. The method comprises: generating a first word segmentation dictionary by combining a word group in a specified field with a word group in a non-specified field, and performing word segmentation on a corpus stored in a machine according to the first word segmentation dictionary to obtain a word group in the corpus (11); making statistics, in the corpus, on the probability or frequency of occurrence of each word group in the word group in the corpus, and adjusting the probability or frequency according to a pre-determined rule so that the probability or frequency of occurrence of the word group in the specified field in the word group in the corpus increases (12); generating a second word segmentation dictionary by combining the word group in the corpus with the adjusted probability or frequency, and performing word segmentation on a voice content sent by a user according to the second word segmentation dictionary to obtain a word group in the voice content (13); and parsing the word group in the voice content according to a grammar file to obtain a corresponding semanteme (14). By means of the method, the probability of occurrence of a word group in a specified field in all word groups in a machine increases, thereby improving the accuracy rate of the machine parsing a semanteme of a voice content.

Description

Method and device for analyzing voice content

The present application claims priority to Chinese Patent Application No. 201510995231.5, filed on Dec. 25, 2015, the entire disclosure of which is incorporated herein by reference. In the application.

Technical field

The present application relates to the field of information processing, and in particular, to a method and apparatus for analyzing voice content.

Background technique

Natural language processing technology can help people to communicate better with the machine. For example, the voice recognition module in the computer recognizes the voice content sent by the user, and parses the voice content to obtain the semantics corresponding to the voice content. Finally, The computer performs related operations based on the parsed semantics.

At present, the general method for the machine to parse the voice content sent by the user is: the first step: establish a language model, usually before the language model is established, it is necessary to artificially mark some commonly used corpora, for example, the user is directed to "I want to see Andy Lau's "The concert" is marked with the corpus, in which "I" can mark adult pronouns, "Andy Lau" is marked as a star name, etc., and then the words in the corpus are classified according to the content of the mark, for example, the personal pronoun is a class, star The name is a class, etc., complete the classification of the phrase, that is, complete the establishment of the language model; the second step: according to the phrase in the established language model, the word content of the user input is cut, usually using CRF (Conditional Random Field) The method, for example, the voice content input by the user to the computer is "When is there a concert of Andy Lau", at which time the computer cuts the corpus according to the phrase in the phonetic model, for example, if in the star name category in the language model There is the word "Andy Lau", which has the word "singing" in the verb category, in the noun category. There are phrases such as "time" and "concert". According to these phrases, the corpus can be cut into "what / time / have / Andy Lau / concert / concert", or cut into "what / time / have /Andy Lau / / singing / meeting", because the language model has two phrases "singing" and "concert", in this case, we must compare the probability of the two words appearing in the corpus, for example, " The singer has a higher probability of appearing in the corpus than the "concert", so the above corpus is preferentially cut into "what / time / have / Andy Lau / / singing / meeting"; the third step: the cut phrase and The grammar files in the machine are matched to resolve the semantics of the user's voice content, and BNF (Backus-Naur Form) is a grammar frequently used by users.

With the continuous development and updating of information, the number of phrases in certain specific fields has gradually increased, but the corpus of machines containing these specific domain phrases is limited, so when the language model is established, it may lead to certain specific fields. The probability that a phrase appears in all phrases of a language model is relatively small. When the machine cuts the voice content sent by the user according to the language model, the voice content sent by the user may be incorrectly cut due to the small probability of occurrence of certain specific domain phrases, thereby causing the machine error analysis. The voice content input by the user, for example, in the above example, if "singing" has a greater probability of appearing in the corpus than "song", then the above corpus is preferentially cut into "what/time/have/Andy Lau// Singing/meeting, obviously, this does not match the semantics of the voice content sent by the user.

Summary of the invention

In view of the above problems, the embodiment of the present application provides a method and apparatus for parsing voice content, which is used to solve the problem that a machine error parses a user-entered voice content due to a small number of corpora in a specific domain when a language model is established.

An embodiment of the present application provides a method for parsing voice content, the method comprising: combining a phrase in a specific domain with a phrase in a non-specific domain to generate a first word dictionary, according to the first word dictionary in the machine The stored corpus performs a word segmentation to obtain a phrase in the corpus; statistics a probability or frequency of occurrence of each phrase in the corpus in the phrase in the corpus, and adjusts the probability or frequency according to a predetermined rule to make a specific The probability or frequency of occurrence of a phrase in the field in a phrase in the corpus; combining a phrase in the corpus with the adjusted probability or frequency to generate a second word dictionary, and according to the second slice The word dictionary performs word-cutting on the voice content sent by the user, obtains a phrase in the voice content, and parses the phrase in the voice content according to the grammar file to obtain corresponding semantics.

Preferably, the combining the phrase in the specific domain and the phrase in the non-specific domain to generate the first word dictionary comprises:

Generating a corpus stored by the machine according to a phrase of a specific domain, and obtaining a phrase of a specific domain in the corpus;

Counting the probability or frequency of occurrence of a phrase for each particular domain in the corpus in a particular domain of the phrase in the corpus;

According to the ranking of the probability or the frequency, a preset number of phrases are selected from the phrases of the specific domain in the corpus, and the selected phrases are combined with the phrases in the non-specific domain to generate a first word dictionary.

Preferably, the translating the specific content of the voice content sent by the user according to the second word-cut dictionary include:

According to the second word-cutting dictionary, the speech content sent by the user is respectively cut using the method of the backward maximum cut word and the forward minimum cut word, and if the two types of word-cutting methods are different, the The second word dictionary searches for the probability or frequency corresponding to the different phrases, and selects a phrase with a large probability or frequency as the final word segment.

Preferably, the second word dictionary includes:

Address area and phrase area;

The address area, the guiding machine searches for a position of a phrase in the voice content after the word-cut sent by the user in the second word-cut dictionary;

The phrase area stores a corresponding phrase in the address area.

Preferably, the parsing the phrases in the voice content according to the grammar file specifically includes:

Matching the phrase in the voice content with the phrase in the grammar file, if the phrase in the voice content completely matches the phrase in the grammar file, the analysis is successful; if the full match fails, the keyword matching is performed .

Preferably, the keyword matching specifically includes:

Matching the phrase in the voice content with the keyword in the grammar file, if the matching is successful, the parsing is successful; if the matching is unsuccessful, the parsing fails.

Preferably, the phrase of the specific domain comprises at least one of the following:

Chinese characters; English letters; numbers.

An apparatus for parsing voice content, the apparatus comprising: a combining unit, a statistic unit, a word cutting unit, and a parsing unit; wherein

The combining unit is configured to combine a phrase in a specific domain and a phrase in a non-specific domain to generate a first word dictionary, and perform a word cutting on the corpus stored in the machine according to the first word dictionary to obtain the corpus Phrase in

The statistical unit is configured to count a probability or frequency of occurrence of each phrase in the corpus in a phrase in the corpus, and adjust the probability or frequency according to a predetermined rule, so that a phrase in a specific domain is in the corpus The probability or frequency of occurrence in the phrase in the phrase increases;

The word-cutting unit is configured to combine the phrase in the corpus with the adjusted probability or frequency to generate a second word-cut dictionary, and perform word-cutting on the voice content sent by the user according to the second word-cut dictionary. Obtaining a phrase in the voice content;

The parsing unit is configured to parse a phrase in the voice content according to a grammar file to obtain a corresponding semantic.

Preferably, the combining unit includes: a word subunit, a statistical subunit, and a combined subunit; wherein

The word-cutting unit, configured to perform a word-cutting on a corpus stored by the machine according to a phrase of a specific domain, to obtain a phrase of a specific domain in the corpus;

The statistical subunit is configured to count a probability or a frequency of occurrence of a phrase of each specific domain in the corpus in a phrase in a specific domain in the corpus;

The combining subunit, configured to select a preset number of phrases from a phrase of a specific domain in the corpus according to the ranking of the probability or frequency, and combine the selected phrase with a phrase in a non-specific domain to generate The first word dictionary.

Preferably, the word-cutting unit comprises:

Combining subunits, cutting subunits, and finding subunits; wherein

a combination subunit, configured to combine the phrase in the corpus and the adjusted probability or frequency to generate a second word dictionary;

The word-cutting unit is configured to perform a word-cutting on the voice content sent by the user according to the second word-cutting dictionary, using a backward maximum cut word and a forward minimum cut word;

The finding subunit is configured to search for a probability or a frequency corresponding to the different phrase in the second word dictionary, when the phrases obtained by the two word-cutting methods are different, and select a probability or a frequency with a large frequency The phrase is the final word of the word.

The embodiment of the present application provides an electronic device, including the device for parsing voice content according to any of the foregoing embodiments.

The embodiment of the present application provides a non-transitory computer readable storage medium, wherein the non-transitory computer readable storage medium can store computer instructions, which can implement the parsing of voice content provided by the embodiments of the present application. Some or all of the steps in the various implementations of the method.

An embodiment of the present application provides an electronic device, including: one or more processors; and a memory; Wherein the memory stores instructions executable by the one or more processors, the instructions being configured to perform the method of parsing the voice content of any of the above-described embodiments of the present application.

An embodiment of the present application provides a computer program product, the computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions, when the program instructions are executed by a computer, The computer is caused to perform the method for analyzing voice content according to any of the above embodiments of the present application.

When applying the language model of the present application, by adjusting the probability or frequency of occurrence of each phrase in all phrases in the stored corpus in the machine, the probability or frequency of occurrence of the phrase in the specific domain in all phrases is increased, thereby Improve the accuracy of the machine's semantics of parsing user speech content.

DRAWINGS

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, a brief description of the drawings used in the embodiments or the prior art description will be briefly described below. Obviously, the drawings in the following description It is a certain embodiment of the present application, and other drawings can be obtained according to the drawings without any creative work for those skilled in the art.

1 is a schematic flowchart of a method for parsing voice content according to Embodiment 1 of the present application;

2 is a schematic flowchart of a language model adaptation according to Embodiment 1 of the present application;

3 is a schematic diagram of an address area portion in a second word dictionary provided by Embodiment 1 of the present application;

4 is a schematic diagram of a portion of a phrase region in a second word dictionary provided by Embodiment 1 of the present application;

FIG. 5 is a schematic flowchart of a method for cutting a user voice content by using a joint manner of a backward maximum cut word and a forward minimum cut word according to Embodiment 1 of the present application;

6 is a schematic diagram of a syntax prepared by using a syntax tree according to Embodiment 1 of the present application;

FIG. 7 is a schematic flowchart of a method for matching voice content sent by a user according to a grammar file according to Embodiment 1 of the present application;

FIG. 8 is a schematic flowchart of a complete method for parsing voice content according to Embodiment 1 of the present application;

FIG. 9 is a schematic structural diagram of an apparatus for analyzing voice content according to Embodiment 2 of the present application;

FIG. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

detailed description

The technical solutions in the embodiments of the present application are clearly and completely described in the following with reference to the accompanying drawings in the embodiments of the present application. Some embodiments are applied, not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without departing from the inventive scope are the scope of the present application.

In view of the problems existing in the prior art in analyzing the voice content sent by the user, the embodiment of the present application provides a method and apparatus for analyzing voice content, which are used to solve a domain-specific corpus when building a language model. There are few problems that cause machine errors to resolve the voice content entered by the user.

Example 1

The embodiment of the present application provides a method for parsing voice content, which is used to improve the accuracy of semantic analysis of semantics in a user's voice content. FIG. 1 is a schematic flowchart of a method for parsing voice content according to an embodiment of the present application. The method is as follows:

Step 11: Combine the phrase in the specific domain with the phrase in the non-specific domain to generate a first word dictionary, and perform a word segmentation on the corpus stored in the machine according to the first word dictionary to obtain a phrase in the corpus.

In this step, firstly, the dictionary of a specific domain is screened, and the dictionary of these specific fields is combined to generate a full dictionary, for example, a phrase in a field such as a computer, a machine, or an entertainment is combined into a specific domain dictionary, and a specific domain dictionary is selected. As a full dictionary; then CRF word-cutting according to the phrase in the full dictionary to the stored corpus in the machine (as in step 21 of FIG. 2), obtaining a phrase in a specific field in the corpus; and then counting each of the specific fields The probability or frequency of occurrence of the phrase in all the specific domain phrases in the obtained corpus, and according to the probability or frequency ranking, the phrase is selected as a dynamic dictionary according to the preset number (step 22 of FIG. 2), for example, in the word cutting The following corpus selects the top 50,000 topic-specific phrases of probability, and combines them into a dynamic dictionary, in which the 50,000 phrases can contain many phrases frequently used by users in specific fields; finally, the dynamic dictionary and non-specific fields will be generated. The phrase in the phrase is combined (as in step 23 of Figure 2), generating an offline word dictionary, the first word dictionary, where the non-specific Phrases in a domain refer to phrases that users often use, and do not include phrases in a specific domain. For example, phrases in a non-specific field can include personal pronouns, such as you, me, him, etc.; phrases in non-specific fields are also Can include common verbs, such as playing, thinking, wanting, taking, etc.

After generating the first word-cut dictionary, cutting the corpus stored in the machine according to the first word-cut dictionary (step 24 of FIG. 2), obtaining all the phrases in the corpus, and using the phrase in the language model Training corpus, the phrase here contains both phrases in a specific field and phrases in a non-specific field. There are many ways to cut words in a corpus. Here is an example of one way, such as The corpus stored in the machine is: I want to see Andy Lau's concert, and then CRF cut the corpus according to the first word dictionary. For example, the phrase in the first word dictionary is: I, think, see , want to see, Andy Lau, the concert, then according to the phrase in the first word dictionary can cut the corpus into: I / think / see / Andy Lau / / concert, or cut into: I / want to see / Andy Lau / / Concert, then you need to compare the probability or frequency of the words "think" and "want to see" in the corpus. If the probability or frequency of the latter is greater, then "I want to see , Andy Lau, the singer "These phrases as training corpus language model.

Step 12: Statistics the probability or frequency of occurrence of each phrase in the corpus in the phrase in the corpus, and adjust the probability or frequency according to a predetermined rule, so that the phrase in the specific domain is in the phrase in the corpus The probability or frequency of occurrence increases.

In step 11, the training corpus in the language model is obtained, that is, the phrases in all the corpora in the machine are obtained. In this step, the language model needs to be trained (step 25 in Fig. 2), which can be performed by using the SRILM tool. The training of the language model may include, but is not limited to, the probability or frequency of occurrence of each phrase in all corpora of the statistical machine in all phrases. Here, the SRILM language model training tool is only an exemplary description, and may also be other training methods. Specifically limited.

After training the language model, the user needs to test the training results. For example, to check the probability of occurrence of each phrase. When checking the probability of the corresponding phrase, it may be found that the phrases in some specific corpora are often appearing in the corpus, but relatively non-specific. Some of the similar phrases in the field have a low probability of occurrence, so that when the relevant corpus is cut, the phrases in these specific corpora may be annihilated by some other similar phrases, causing the word-cutting error, so that the machine cannot correctly resolve the user's Voice content. For example, if the user inputs the speech content of "dog stick" on the computer, if "hit" is a phrase in a non-specific field, "dog stick" is a phrase in a specific field, and "hit" is a phrase in all corpora. The probability of occurrence is greater than the probability of "dog sticking", so the computer will cut the "dog stick" into "hitting/dog stick", causing the computer to not correctly parse the user's semantics.

There are many ways to solve the above problems. For example, you can take the probability or frequency of redistributing the phrases in each specific domain in the corpus, so that the machine can more accurately cut the voice content sent by the user. word. Here is a specific way to redistribute the probability: firstly, the probability sums of the non-specific domain phrases and the specific domain phrases appearing in all corpora are respectively expressed as Psum1 and Psum2; then each non-specific domain phrase is displayed. The probability of occurrence is divided by Psum1 to obtain P1. Similarly, the probability of occurrence of each particular domain phrase is divided by Psum2 to obtain P2; finally P1 is multiplied by the weight coefficient k1, P2 is multiplied by the weight coefficient k2, respectively, to obtain the non-specific domain phrase respectively. And the final probability that each phrase in a particular domain phrase appears in all corpora, where the user can set the values of k1 and k2 according to individual needs, but the sum of k1 and k2 is 1, and to be in the dynamic dictionary The final probability of occurrence of each phrase is greater than the final probability of occurrence of each phrase in a non-specific field, and the weighting factor k1 is smaller than Psum1. By reallocating the probability of occurrence of non-specific domain phrases and domain-specific phrases, the probability of occurrence of specific domain phrases is increased, so that the machine can more accurately cut the voice content sent by the user, and improve the semantics of the user's voice content. accuracy.

Step 13: Combine the phrase in the corpus with the adjusted probability or frequency to generate a second word dictionary, and perform a word cut on the voice content sent by the user according to the second word dictionary to obtain the voice. The phrase in the content.

After reallocating the probability or frequency of occurrence of each phrase, the adaptive process of the language model is completed (step 26 of Figure 2), at which time the machine can output the adaptive language model (step 27 of Figure 2). ). The adaptive language model has both the phrases obtained after training and the probability and frequency of redistribution corresponding to each phrase. Then, the adaptive language model needs to be converted into a second word dictionary. The structure of the second word dictionary has many kinds. The main purpose of the second word dictionary is to help the machine send the voice to the user faster and more accurately. The content is cut.

Here, the structure of one of the second word-cutting dictionaries is exemplarily described, and the structure of the second word-cutting dictionary includes two parts, an address area and a phrase area. The address information in the address area helps the machine find the corresponding position of the phrase in the second word dictionary according to the phrase after the user cuts the word; the phrase stored in the phrase area is the phrase corresponding to the address area.

Specifically, the address area may include 10 Arabic numerals (ie, 0 to 9), 26 uppercase letters or lowercase letters (ie, A to Z or a to z), and address information corresponding to the commonly used Chinese characters. Here, the numbers and letters are in full-width format, and each number or letter itself occupies two bytes. The address information corresponding to each number, letter or Chinese character occupies four bytes, which is assumed to be commonly used in the second word dictionary. There are 6768 Chinese characters, and the address information corresponding to numbers, letters and Chinese characters is shared (10+26+6768)*4=27216. If the first address is uniDict, the first address of the phrase area is uniDict+27216, as shown in Figure 3. Address area Schematic diagram of the address information in the domain: the first address of the phrase area is uniDict+27216, the address holds the phrase with the number “0”; the address corresponding to the letter area “A” is uniDict+40, and the address is saved with the letter “A”. The address of the first phrase; the address corresponding to the commonly used Chinese character "ah" is uniDict+144, which holds the address of the phrase headed by the Chinese character "ah".

Specifically, in the phrase area, taking the full-width number “0” as an example, as shown in FIG. 4 is a schematic diagram of a phrase in a phrase area: the first address corresponding to “0” is uniDict+27216, and can be seen as “0”. The phrase that is the first word can be "05 mm". If the user wants to find a phrase with the first word of "0", look down from the first address uniDict+27216 until the guard mark is encountered. The guard mark here refers to The phrase with the "0" as the first word in the second word dictionary has reached the last one. By using the first word of the phrase to divide the phrase area in the word dictionary, the efficiency of the machine in finding the phrase in the word dictionary is improved. In order to save space in the word dictionary, the first word may not be stored in the phrase portion. For example, "05 mm" shown in Fig. 4 is stored in the dictionary as "5 mm".

In addition to storing phrases in the phrase area, there are other parameters, here are a few examples:

Wordlen: indicates the length of the phrase;

Buf: means to remove the phrase content of the first word, then sizeof(buf)=wordlen-2, indicating the length of the phrase after the first word is removed;

Frequency: indicates the frequency corresponding to the phrase after redistribution in the language model, then sizeof(frequency)=2 bytes, indicating the length of the frequency;

Reclen: indicates the space occupied by storing a phrase, sizeof(reclen)=1 bytes, where reclen=sizeof(reclen)+sizeof(frequency)+sizeof(buf)+sizeof(wordlen);

Guard: indicates the end of each partition, sizeof(guard) = 1 byte.

The second word dictionary can include numbers, letters and Chinese characters, which can improve the accuracy of the machine to parse the semantics of the user's voice content. For example, the voice content input by the user is "when to play Journey 2", if the word is cut In the dictionary, only "Journey to the West", without the number "2", may cut the above voice content into "what / time / play / Journey to the West / ah", which may lead to machine parsing errors.

In this step, the manner in which the voice content sent by the user is cut according to the second word-cutting dictionary is very different. For example, the word can be cut in the manner of the backward maximum cut word, or the forward minimum cut word can be used. Ways to cut words. Here is a method of word-cutting in which the backward maximum cut word and the forward minimum cut word are combined: for example, for the voice content of the "time of the teenager baggage broadcast on the TV" sent by the user, As shown in FIG. 5, if the maximum length of the search maxLen=5 and the minimum length minLen=2 are specified, in the backward maximum cut word, the “time of broadcast” is searched first, and the corresponding phrase is not found in the cut word dictionary. Then subtract a word into "out of time" to re-search, after the search, the corresponding dictionary is not found in the word dictionary, and then the "time" is searched, so that the search is performed in turn by the word reduction method, and finally arrives. When the minimum length, that is, "time", find the corresponding phrase in the word dictionary; then use the above method to search for the phrase before "time", and finally complete the word for the voice content. In the forward minimum cut word, first search for the phrase in front of the voice content, for example, first search for the word "juvenile" in the word dictionary, and find the corresponding phrase, then the phrase after the "boy" Searching, that is, searching for "Baoqing", and finding that there is no corresponding phrase in the word dictionary, then re-searching for one more word, that is, searching for "Baoqingtian", using the same method to finally complete the cutting of the voice content. word.

After cutting the words of the user's voice content by using the above-mentioned backward maximum cut word and forward minimum cut word combination, if the obtained word results are different, that is, the obtained phrases are different, the words are compared by comparing different phrases. The probability or frequency in the dictionary determines the final result of the word. As shown in Figure 5, if the speech content of the "Juvenile Bao Qingtian broadcasts on the TV" is the maximum backward word and the forward minimum word, the result of the backward maximum word is "Junior/Bag". "Blue/Day in / Satellite TV / Broadcast / / Time", and after using the minimum forward word, the result is "Youth / Bao Qing Tian / In / Satellite / Broadcast / / Time", this time can be compared by The probability or frequency of "Tian Zai" and "Bao Qingtian" appearing in the cut word dictionary, and the probability or frequency of "Bao Qingtian" is found to be large, then the final result of the word cut is "Juvenile / Bao Qingtian / in / TV / broadcast Out / / time". Here, the word-cutting method of combining the maximum cut word and the forward minimum cut word makes the result of the cut word more accurate.

Step 14: Parse the phrases in the voice content according to the grammar file to obtain corresponding semantics.

There are many grammars used in grammar files. Here is the BNF grammar as an example. The basic rules of BNF grammar include but are not limited to the following aspects:

The content contained in <>: is mandatory, and is a non-terminal node whose syntax must be further explained;

[]: The content contained in the content is optional, indicating that its content can be skipped;

|: means to choose one of the left and right sides, which is equivalent to the meaning of "or";

(): indicates a combination;

However, in practical applications, sometimes these grammar rules cannot meet the needs of the user. The embodiment of the present application expands on the basis of the BNF grammar rules, and the following rules are added:

#: indicates a comment;

:: the delimiter of the non-terminal node and its interpretation;

;: indicates the end of the statement in the grammar;

"": indicates that an external dictionary file is referenced;

&root(<name>): written at the beginning of the grammar, indicating that the grammar name is name;

&keyword(textFrag,key,defaultValue,showValue): This function is used to extract the keywords of the input text. The function specifically indicates: the input is inputTextFrag, if inputTextFrag and textFrag match successfully, then key=showValue, otherwise key=defaultValue; and the function can not define showValue, that is, the function can be defined as: &keyword(textFrag, key, defaultValue ), if the inputTextFrag entered successfully matches textFrag, it is directly assigned to textFrag, which is key=textFrag.

Specifically, the function of the above function is illustrated. For example, the function defined in the machine is: &keyword(Beijing|Tianjin|Shanghai, place, local); &keyword(raining|snowing, weather, undefined, weather); &keyword(Tomorrow | Today | The day after tomorrow, date, today). If the text content input by the user is “It’s raining tomorrow”, the machine searches for the keyword in the defined function. First, there is no specific address in the content input by the user, so the function “&keyword(Beijing|Tianjin|Shanghai The keyword matching in "place, local)" fails. At this time, the content input by the user is automatically assigned. The specific value is the defaultValue of the function. The defaultValue here is "local"; then the "tomorrow" entered by the user is The keyword in the function "&keyword" is matched with "tomorrow" in the function. Because showValue is not defined in the function, the time entered by the user is directly assigned. "Tomorrow"; finally, the "rain" input by the user is matched with the keyword in "&keyword" (the rain, snow, weather, undefined, weather), and the "rain" in the function Successful match, because the function defines showValue, and the showValue of the function is "weather", so the "rain" input by the user is replaced with "weather". Then, according to the function, the machine matches the "when it rains tomorrow" input by the user into "local tomorrow weather" and performs related operations. The order in which the content input by the user is matched in the above example is only an exemplary description. The matching order is not specifically limited herein. For example, in the above example, the words "tomorrow" and "rain" are input by the user. In order, you can match "Tomorrow" first, or match "rain" first, or you can match both words at the same time.

&duplicate(TextFrag,least,most): This function indicates that the TextFrag is repeated m times. The value range of m is: least≤n≤most, for example, the definition function: &duplicate(TextFrag,1,3), the output content at this time Is: TextFrag[TextFrag][TextFrag];

&comb(textFrag1, textFrag2,...,textFragN): This function indicates that the syntax fragments TextFrag1, TextFrag2, ..., textFragN are arranged and combined, for example, the definition function: &comb(Text Frag1, TextFrag2); the output content is: (TextFrag1TextFrag2 )|(TextFrag2TextFrag1).

There are many ways to extend the BNF grammar rules. The above is only an exemplary description. For example, the above definitions of symbols may be replaced with other symbols; or the same symbols may indicate other meanings, which are not specifically limited herein. In addition, in order to explain the above syntax more clearly, the following is an example of a grammar file written based on the above grammar rules, and the contents of the file are as follows:

&root(<Video on Demand>);

#key words:

#type:Video category

#moviNamee: Film name

#year:年

According to the grammar rules defined above, the grammar file is parsed: the name of the grammar file is "video on demand", and the grammar file has three keywords: type, movie, and year. Specifically, if the text content "plays the 2002 film infernal" for the text content, the defined grammar file can be:

#例:Playing the 2002 film infernal

<Video on Demand>: [Play][<Year>][of]&comb([<Category List>], <Video List>)

<category list>: &keyword (movie | TV series, type, unspecified);

<video list>: &keyword("movieList.dic", movieName, unspecified);

<year>: &keyword((<time>year), year,unspecified);

<time>: &duplicate(<number>, 2, 4);

<number>:0|1|2|3|4|5|6|7|8|9;

In addition, in order to facilitate the machine to parse the text content input by the user according to the defined grammar rules, Each grammar is written in the form of a grammar tree, and finally a grammar file is written in the form of a "grammar forest." Taking the grammar file written in the above-mentioned "Playing Movies in 2002" as an example, the syntax tree written is as shown in Fig. 6: in the first level of the syntax tree, the file name is displayed: "video on demand"; second In the level, there are four parts: the first part is “play”, the second part is “year”, the third part is “of”, the fourth part is “film list” and “category list”, among them, “video list” It can be a movie or a TV show. This hierarchically categorizes the content of the grammar file in the form of a syntax tree, which facilitates the machine to parse the voice content input by the user.

After completing the definition of the relevant grammar, the machine can match the voice content sent by the user according to the grammar file, and the matching manner includes two types: full matching and keyword matching. The schematic diagram of the specific matching process is as shown in FIG. 7: firstly, the voice content input by the user is fully matched according to the grammar file (step 71 in FIG. 7), where the voice content is the voice content after the word is cut; the matching result is judged. (Step 72 in Figure 7), if the full match is successful, the matching result is printed (as in step 73 in Figure 7); if the full match fails, the keyword matching is performed (step 74 in Figure 7), specifically It means: searching for the corresponding keyword from the keyword list in the grammar file, and if the matching is successful, printing the matching result.

For the above matching process, a detailed description is given by way of example: for example, the voice content input by the user is “I want to play the 2002 infernal movie”, the machine converts the voice content into corresponding text content, and cuts the text content. The word, the result of the word cut is "I want / play / 2002 / / movie / infernal". Then match the text content according to the grammar file: First, "I want" in the grammar file does not have the corresponding word to cover, that is, the full match fails; then the keyword matching, as follows:

Type=movie; movieName=infernal; year=2002; in the keyword matching process, as long as the keywords in the input text can match the keywords in the keyword list in the grammar file, so as to match the full match The way to use keyword matching is more flexible, and the constraints on the input text content are smaller, which improves the probability of successful matching.

Through the above process of parsing the voice content input by the user according to the grammar file, it can be found that: to enable the machine to quickly and accurately parse the semantics of the user's voice content, the user should be as standardized as possible when writing the grammar file in the machine. Here is an example of the specification and writing skills of grammar files:

1. The grammar is as comprehensive as possible. Here you can write examples in the written grammar rules. The specific process is: first design the user scene; then write the example sentence; finally, cover the example sentence according to the written grammar.

2. According to the grammar scene, the keywords should be clear, which is convenient for the machine to perform keyword matching.

3. When writing a grammar file, try to avoid it. For example, the grammar fragment in the grammar file is "[today][[][[]][weather]", according to the grammar fragment can cover "/ /Weather" text content, obviously, the text content does not conform to human language habits, and this serious overproduction will reduce the advantages of the grammar file structure. In order to reduce this over-production situation, the grammar file can be split into several sub-entries. For example, for the grammar grammar fragment described above, it can be written as: the first-level sub-entries are: "[today][of]<广州>[] [Weather]"; the second sub-entries are: "[Today][Guangzhou][[][Weather]", the third-level subentry is: "[Today][[][Guangzhou][Weather]", so It is possible to reduce the overproduction in the grammar file.

4. When writing grammar files, try to use a hierarchical writing method to make the grammar files have good readability. For example, the syntax tree rules mentioned above.

5. The phrase in the grammar file is as close as possible to the phrase in the word dictionary, which makes the machine more accurately parse the user's voice content. For example, "I want to know" can be cut into "I want / know" according to the word dictionary, and the phrase in the grammar file should be consistent, which can be "[I want] <know>" instead of "[I ][Think] <Know> and so on.

Here, when parsing the voice content, it is necessary to consider the influence of the word-cutting. For example, the voice content sent by the user is "I want to make a call", and the machine may cut the voice content into "I want/play/telephone", at this time although the machine There is an error in the word cut, but the grammar file should be parsed according to the "call" method, which can reduce the machine's parsing error due to the wording error.

6. When the grammar file is written in the syntax tree, at least one mandatory option is included in the root node, otherwise the input text is overwritten by the syntax, causing the machine to parse the error. For example, the grammar fragment in the grammar file is "[today][[][[]][weather]", because the phrases in the grammar fragment are all optional, if the user input the voice content is "Today's Shanghai The weather, this time can also match the phrases in the grammar file, obviously, this will lead to machine parsing errors.

7. When writing a grammar file using the syntax tree, if the required phrase in the root node is also a keyword, you can set: defaultValue=error. When the voice content sent by the user cannot match the mandatory content in the root node, the error is directly output to prevent the machine from performing the keyword matching operation and waste the resources of the machine.

In order to understand the embodiment of the present application more clearly, the method for analyzing the voice content provided above is performed. The description of the system is as shown in FIG. 8: the first step: the adaptive process of the language model (step 81 in FIG. 8), specifically: adjusting each phrase in the corpus in the phrase in the corpus The probability or frequency of occurrence increases the probability and frequency of occurrence of phrases in a particular domain in a phrase in the corpus of the machine; the second step: cutting the speech content sent by the user according to the word-cut dictionary (as in Figure 8) Step 82); Step 3: Perform a full match on the voice content after the word cut according to the grammar file (step 83 in FIG. 8), at which time the machine determines whether the full match is successful (step 84 in FIG. 8), if If the matching is successful, the matching result is printed (such as step 85 in FIG. 8), where the grammar file can be in the form of a syntax tree; and the fourth step: if the full matching fails, the keyword matching is performed (step 86 in FIG. 8). ), the matching result is printed after the keyword matching is successful. The process of completing the matching is the process of the machine parsing the user's voice content.

The beneficial effects obtained by the method for parsing voice content provided by the embodiments of the present application are as follows:

1. When training the language model, adjust the probability or frequency of occurrence of each phrase in the stored corpus in all phrases in the machine, so that the probability or frequency of occurrence of phrases in a particular field in all phrases increases, thereby increasing the machine. Analyze the accuracy of the semantics of user voice content.

2. The word dictionary in the embodiment of the present application includes an address area and a phrase area, and the first word partition is used in the phrase area, so that the machine can quickly find the position of the corresponding phrase in the word dictionary. In addition, the phrases in the phrase area contain numbers, letters, and Chinese characters, increasing the accuracy of the machine's ability to resolve the semantics of the user's voice content.

3. The embodiment of the present application expands on the basis of the existing BNF grammar rules, and provides the writing skills of the grammar rules, improves the readability of the grammar file, and improves the semantic accuracy of the machine parsing the user's voice content. .

4. When matching the voice content sent by the user according to the grammar file, the full matching and the keyword matching are adopted, so that the matching is more comprehensive, thereby improving the accuracy of the machine parsing the semantics of the user's voice content.

Finally, it should be understood that those skilled in the art can understand that all or part of the process of implementing the above embodiments can be completed by a computer program to instruct related hardware, and the program can be stored in a non-transitory computer. In a readable storage medium, the program, when executed, may include the flow of an embodiment of the methods as described above. The non-transitory computer readable storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).

Example 2

A method for parsing voice content is provided in Embodiment 1. Correspondingly, the embodiment of the present application provides an apparatus for parsing voice content, which is used to improve the accuracy of semantic analysis of semantics in a user's voice content.

An apparatus for parsing voice content, the apparatus comprising: a combining unit 91, a statistic unit 92, a word cutting unit 93, and a parsing unit 94;

The combining unit 91 may be configured to combine a phrase in a specific domain with a phrase in a non-specific domain to generate a first word dictionary, and perform a word cutting on the corpus stored in the machine according to the first word dictionary to obtain the a phrase in a corpus;

The statistic unit 92 may be configured to count the probability or frequency of occurrence of each phrase in the corpus in the phrase in the corpus, and adjust the probability or frequency according to a predetermined rule, so that the phrase in the specific domain is in the corpus The probability or frequency of occurrence in the phrase in the phrase increases;

The word unit 93 may be configured to combine the phrase in the corpus with the adjusted probability or frequency to generate a second word dictionary, and perform word cutting on the voice content sent by the user according to the second word dictionary. Obtaining a phrase in the voice content;

The parsing unit 94 is configured to parse the phrases in the voice content according to the grammar file to obtain corresponding semantics.

The working process of the above device embodiment is: first step: the combining unit 91 combines the phrase in the specific domain with the phrase in the non-specific domain to generate a first word dictionary, and stores the same in the machine according to the first word dictionary. The corpus performs a word cut to obtain a phrase in the corpus; the second step: the statistic unit 92 counts the probability or frequency of occurrence of each phrase in the corpus in the corpus, and adjusts the probability or frequency according to a predetermined rule, so that the specific The probability or frequency of occurrence of a phrase in the field in a phrase in the corpus increases; a third step: the word segmentation unit 93 combines the phrase in the corpus with the adjusted probability or frequency to generate a second word dictionary, and according to the phrase The second word-cut dictionary performs word-cutting on the voice content sent by the user to obtain a phrase in the voice content. The fourth step: the parsing unit 94 parses the phrase in the voice content according to the grammar file to obtain corresponding semantics.

The foregoing apparatus embodiments improve the accuracy of the machine to resolve the semantic accuracy of the user's voice content. For example, in one embodiment, the combining unit 91 includes: a word subunit, a statistical subunit, and a combined subunit. ;among them,

a word subunit, which can be used to perform a word segmentation on a corpus stored by the machine according to a phrase of a specific domain, to obtain a phrase of a specific domain in the corpus; and obtain an artificial markup method according to the prior art. The way in which a particular field of phrase is used, it is more convenient to use a machine-cut word to obtain a specific domain phrase.

a statistical subunit, which can be used to count the probability or frequency of occurrence of a phrase of each particular domain in the corpus in a particular domain of the corpus;

The combining subunit may be configured to select a preset number of phrases from a phrase of a specific domain in the corpus according to the ranking of the probability or frequency, and combine the selected phrase with a phrase in a non-specific domain to generate a first All word dictionary. Select the phrase with the highest probability or frequency of the phrase in a specific field, and generate the first word dictionary from the phrases that often appear in the corpus, which can improve the efficiency of machine word cutting.

In another embodiment, the word-cutting unit 93 includes:

Combining subunits, cutting subunits, and finding subunits; wherein

Combining a subunit, the phrase in the corpus and the adjusted probability or frequency are combined to generate a second word dictionary; wherein each phrase in the corpus is adjusted to appear in a phrase in the corpus The probability or frequency increases the probability and frequency of occurrences of phrases in a particular domain in a phrase in the corpus of the machine, thereby increasing the accuracy of the machine's semantics of parsing the user's speech content.

The word segmentation unit may be configured to perform a word segmentation on the voice content sent by the user according to the second word dictionary, using a backward maximum cut word and a forward minimum cut word;

The search subunit may be configured to search for a probability or a frequency corresponding to the different phrase in the second word dictionary when the phrases obtained by the two word-cutting methods are different, and select a phrase with a large probability or a frequency As the final word of the word.

The above-mentioned word segmentation unit and the search sub-unit are used to cut the word content of the user by using a word-cutting method in which the backward maximum word segmentation and the forward minimum word segmentation are combined, so that the result of the word segmentation is more accurate.

The beneficial effects obtained by the above device embodiments are the same as or similar to those obtained by the foregoing method embodiments. To avoid repetition, no further details are provided herein.

In another embodiment of the present application, an electronic device is provided, including the device for parsing voice content according to any of the foregoing embodiments.

In another embodiment of the present application, a non-transitory computer readable storage medium is also provided, the non-transitory computer readable storage medium storing computer executable instructions executable by any of the above methods The method of parsing voice content in the example.

10 is a hardware node of an electronic device for performing a method for parsing voice content according to an embodiment of the present application. The schematic diagram, as shown in FIG. 10, includes:

One or more processors 1010 and a memory 1020 are illustrated by one processor 1010 in FIG.

The apparatus that performs the method of parsing the voice content may further include: an input device 1030 and an output device 1040.

The processor 1010, the memory 1020, the input device 1030, and the output device 1040 may be connected by a bus or other means, as exemplified by a bus connection in FIG.

The memory 1020 is used as a non-transitory computer readable storage medium, and can be used for storing a non-volatile software program, a non-volatile computer executable program, and a module, such as a program corresponding to the method for parsing voice content in the embodiment of the present application. An instruction/module (for example, the combination unit 91, the statistical unit 92, the word-cutting unit 93, and the parsing unit 94 shown in FIG. 9). The processor 1010 executes various functional applications and data processing of the server by running non-volatile software programs, instructions, and modules stored in the memory 1020, that is, a method of parsing the voice content by the above method embodiments.

The memory 1020 may include a storage program area and an storage data area, wherein the storage program area may store an operating system, an application required for at least one function; the storage data area may store data created according to use of the device that parses the voice content, and the like. Moreover, memory 1020 can include high speed random access memory, and can also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, memory 1020 can optionally include memory remotely disposed relative to processor 1010, which can be connected to a device that parses the voice content over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

Input device 1030 can receive input numeric or character information, as well as generate key signal inputs related to user settings and function control of the device that parses the voice content. The output device 1040 can include a display device such as a display screen.

The one or more modules are stored in the memory 1020, and when executed by the one or more processors 1010, perform the method of parsing voice content in any of the above method embodiments.

The above products can perform the methods provided by the embodiments of the present application, and have the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the method provided by the embodiments of the present application.

The electronic device of the embodiment of the present application exists in various forms, including but not limited to:

(1) Mobile communication devices: These devices are characterized by mobile communication functions and are mainly aimed at providing voice and data communication. Such terminals include: smart phones (such as iPhone), multimedia phones, functional phones, and low-end phones.

(2) Ultra-mobile personal computer equipment: This type of equipment belongs to the category of personal computers, has computing and processing functions, and generally has mobile Internet access. Such terminals include: PDAs, MIDs, and UMPC devices, such as the iPad.

(3) Portable entertainment devices: These devices can display and play multimedia content. Such devices include: audio, video players (such as iPod), handheld game consoles, e-books, and smart toys and portable car navigation devices.

(4) Server: A device that provides computing services. The server consists of a processor, a hard disk, a memory, a system bus, etc. The server is similar to a general-purpose computer architecture, but because of the need to provide highly reliable services, processing power and stability High reliability in terms of reliability, security, scalability, and manageability.

(5) Other electronic devices with data interaction functions.

The device embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, ie may be located A place, or it can be distributed to multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement without deliberate labor.

Through the description of the above embodiments, those skilled in the art can clearly understand that the various embodiments can be implemented by means of software plus a necessary general hardware platform, and of course, by hardware. Based on such understanding, the above-described technical solutions may be embodied in the form of software products in essence or in the form of software products, which may be stored in a computer readable storage medium such as ROM/RAM, magnetic Discs, optical discs, etc., include instructions for causing a computer device (which may be a personal computer, server, or network device, etc.) to perform the methods described in various embodiments or portions of the embodiments.

Finally, it should be noted that the above embodiments are only used to explain the technical solutions of the present application, and are not limited thereto; although the present application is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they can still Modify the technical solutions described in the foregoing embodiments, or part of the techniques The features are equivalently substituted; and the modifications or substitutions do not detract from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims

A method for parsing voice content, which is characterized by being applied to an electronic device, the method comprising:

Combining a phrase in a specific domain with a phrase in a non-specific domain to generate a first word dictionary, and performing a word segmentation on the corpus stored in the machine according to the first word dictionary to obtain a phrase in the corpus;

Counting a probability or frequency of occurrence of each phrase in the corpus in a phrase in the corpus, and adjusting the probability or frequency according to a predetermined rule such that a phrase in a particular domain appears in a phrase in the corpus Or increase the frequency;

Combining the phrase in the corpus with the adjusted probability or frequency to generate a second word dictionary, and performing a word cut on the voice content sent by the user according to the second word dictionary to obtain the voice content phrase;

The phrases in the voice content are parsed according to a grammar file to obtain corresponding semantics.
The method according to claim 1, wherein the combining the phrases in the specific domain and the phrases in the non-specific domain to generate the first word dictionary comprises:

Generating a corpus stored by the machine according to a phrase of a specific domain, and obtaining a phrase of a specific domain in the corpus;

Counting the probability or frequency of occurrence of a phrase for each particular domain in the corpus in a particular domain of the phrase in the corpus;

According to the ranking of the probability or the frequency, a preset number of phrases are selected from the phrases of the specific domain in the corpus, and the selected phrases are combined with the phrases in the non-specific domain to generate a first word dictionary.
The method according to claim 1, wherein the translating the voice content sent by the user according to the second word-cut dictionary comprises:

According to the second word-cutting dictionary, the speech content sent by the user is respectively cut using the method of the backward maximum cut word and the forward minimum cut word, and if the two types of word-cutting methods are different, the The second word dictionary searches for the probability or frequency corresponding to the different phrases, and selects a phrase with a large probability or frequency as the final word segment.
The method of claim 1 wherein said second word-cut dictionary comprises:

Address area and phrase area;

The address area, the guiding machine searches for a position of a phrase in the voice content after the word-cut sent by the user in the second word-cut dictionary;

The phrase area stores a corresponding phrase in the address area.
The method according to claim 1, wherein the parsing the phrases in the voice content according to the grammar file specifically comprises:

Matching the phrase in the voice content with the phrase in the grammar file, if the phrase in the voice content completely matches the phrase in the grammar file, the analysis is successful; if the full match fails, the keyword matching is performed .
The method according to claim 5, wherein the keyword matching specifically comprises:

Matching the phrase in the voice content with the keyword in the grammar file, if the matching is successful, the parsing is successful; if the matching is unsuccessful, the parsing fails.
The method of claim 1 wherein the phrase of the particular domain comprises at least one of the following:

Chinese characters; English letters; numbers.
An apparatus for parsing a voice content, the apparatus comprising: a combination unit, a statistics unit, a word-cutting unit, and a parsing unit; wherein

The combining unit is configured to combine a phrase in a specific domain and a phrase in a non-specific domain to generate a first word dictionary, and perform a word cutting on the corpus stored in the machine according to the first word dictionary to obtain the corpus Phrase in

The statistical unit is configured to count a probability or frequency of occurrence of each phrase in the corpus in a phrase in the corpus, and adjust the probability or frequency according to a predetermined rule, so that a phrase in a specific domain is in the corpus The probability or frequency of occurrence in the phrase in the phrase increases;

The word-cutting unit is configured to combine the phrase in the corpus with the adjusted probability or frequency to generate a second word-cut dictionary, and perform word-cutting on the voice content sent by the user according to the second word-cut dictionary. Obtaining a phrase in the voice content;

The parsing unit is configured to parse a phrase in the voice content according to a grammar file to obtain a corresponding semantic.
The apparatus according to claim 8, wherein the combining unit comprises: a word subunit, a statistical subunit, and a combined subunit; wherein

The word-cutting unit, configured to perform a word-cutting on a corpus stored by the machine according to a phrase of a specific domain, to obtain a phrase of a specific domain in the corpus;

The statistical subunit is configured to count a probability or a frequency of occurrence of a phrase of each specific domain in the corpus in a phrase in a specific domain in the corpus;

The combining subunit, configured to select a preset number of phrases from a phrase of a specific domain in the corpus according to the ranking of the probability or frequency, and combine the selected phrase with a phrase in a non-specific domain to generate The first word dictionary.
The apparatus according to claim 8, wherein said word-cutting unit comprises:

Combining subunits, cutting subunits, and finding subunits; wherein

a combination subunit, configured to combine the phrase in the corpus and the adjusted probability or frequency to generate a second word dictionary;

The word-cutting unit is configured to perform a word-cutting on the voice content sent by the user according to the second word-cutting dictionary, using a backward maximum cut word and a forward minimum cut word;

The finding subunit is configured to search for a probability or a frequency corresponding to the different phrase in the second word dictionary, when the phrases obtained by the two word-cutting methods are different, and select a probability or a frequency with a large frequency The phrase is the final word of the word.
An electronic device, comprising the apparatus for parsing voice content according to any one of claims 8-10.
A non-transitory computer readable storage medium, wherein the non-transitory computer readable storage medium stores computer instructions for causing the computer to perform the method of any of claims 1-7 .
An electronic device, comprising:

One or more processors; and,

a memory communicatively coupled to the one or more processors; wherein

The memory stores instructions executable by the one or more processors, the instructions being executed by the one or more processors to enable the one or more processors to:

Combining a phrase in a specific domain with a phrase in a non-specific domain to generate a first word dictionary, and performing a word segmentation on the corpus stored in the machine according to the first word dictionary to obtain a phrase in the corpus;

Counting a probability or frequency of occurrence of each phrase in the corpus in a phrase in the corpus, and adjusting the probability or frequency according to a predetermined rule such that a phrase in a particular domain appears in a phrase in the corpus Or increase the frequency;

Combining the phrase in the corpus with the adjusted probability or frequency to generate a second word dictionary, and performing a word cut on the voice content sent by the user according to the second word dictionary to obtain the voice content phrase;

The phrases in the voice content are parsed according to a grammar file to obtain corresponding semantics.
A computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to execute The method of claims 1-7.