CN103617239A - Method and device for identifying named entity and method and device for establishing classification model - Google Patents


Publication number
CN103617239A
Authority
CN
China
Prior art keywords
named entity
classification
classification model
marked
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310611971.5A
Other languages
Chinese (zh)
Inventor
李超
李兴建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201310611971.5A
Publication of CN103617239A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes

Abstract

The invention provides a method and device for identifying a named entity and a method and device for creating a classification model. The method for identifying a named entity includes the following steps: a named entity to be identified is acquired; the named entity to be identified is sent to a search engine to obtain search results, and feature information is extracted from the search results; the named entity to be identified and the feature information are sent to a preset classification model, so that at least one classification category of the named entity to be identified is obtained from the preset classification model. With this method, a named entity can be identified from search results when neither context nor user click behavior records are available. This adds a further route for classifying and identifying named entities and is especially useful for a cold-start search engine. The method can also improve the accuracy and efficiency of named entity identification.

Description

Method and device for identifying a named entity, and method and device for creating a classification model
Technical field
The present invention relates to the field of Internet technology, and in particular to a method and device for identifying a named entity and a method and device for creating a classification model.
Background
With the rapid development of Internet technology, information services have become increasingly widespread. Named entity identification is important groundwork for information service applications such as information extraction, question answering systems, syntactic analysis, machine translation, and metadata annotation for the Internet. A named entity may be a person name, an organization name, a place name, or any other entity identified by a name; more broadly, named entities also include numbers, dates, currencies, addresses, and the like.
Conventionally, named entity identification means identifying, in the text to be processed, named entities of three major classes (entity, time, and number) and seven subclasses (person name, organization name, place name, time, date, currency, and percentage). At present, identification relies mainly on the context in the text to be processed. When there is no context and a single word must be classified on its own, user click behavior records have to be obtained and the named entity is judged from those records. The prior art therefore has the following problem: without user click behavior records, the named entity cannot be identified.
Summary of the invention
The present invention aims to solve at least one of the technical problems described above.
To this end, a first object of the present invention is to propose a method for identifying a named entity. The method identifies a named entity from search results when neither context nor user click behavior records are available, which adds a further route for classifying and identifying named entities. It can also improve the accuracy and efficiency of named entity identification.
A second object of the present invention is to propose a method for creating a classification model.
A third object of the present invention is to propose a device for identifying a named entity.
A fourth object of the present invention is to propose a device for creating a classification model.
To achieve these objects, the method for identifying a named entity according to an embodiment of the first aspect of the invention comprises the following steps: obtaining a named entity to be identified; sending the named entity to be identified to a search engine to obtain search results, and extracting feature information from the search results; and sending the named entity to be identified and the feature information to a preset classification model, so as to obtain at least one classification category of the named entity to be identified from the preset classification model.
In the method for identifying a named entity according to the embodiment of the invention, the named entity to be identified is sent to a search engine to obtain search results, feature information is extracted from the search results, and the named entity and the feature information are sent to a preset classification model to obtain the entity's classification category. A named entity can thus be identified from search results when neither context nor user click behavior records are available. This adds a further route for classifying and identifying named entities and is especially useful for a cold-start search engine. The method can also improve the accuracy and efficiency of named entity identification.
To achieve these objects, the method for creating a classification model according to an embodiment of the second aspect of the invention comprises the following steps: obtaining category-labelled sample named entities; sending the category-labelled sample named entities to a search engine, and obtaining the search results that the search engine returns for them; extracting feature information from the returned search results; and training with an existing algorithm on the category-labelled named entities, their corresponding labelled categories, and the corresponding feature information to create a first classification model.
In the method for creating a classification model according to the embodiment of the invention, category-labelled sample named entities are sent to a search engine, the search results returned for them are obtained, feature information is extracted from the returned search results, and a first classification model is created by training with an existing algorithm on the labelled named entities, their labelled categories, and the corresponding feature information. The classification model used by the identification method is thus created by supervised learning on search engine results, and the classification category of a named entity is obtained from that model, which improves identification efficiency.
To achieve these objects, the device for identifying a named entity according to an embodiment of the third aspect of the invention comprises: a named entity acquisition module for obtaining a named entity to be identified; an extraction module for sending the named entity to be identified to a search engine to obtain search results and extracting feature information from the search results; and a classification category acquisition module for sending the named entity to be identified and the feature information to a preset classification model, so as to obtain at least one classification category of the named entity to be identified from the preset classification model.
In the device for identifying a named entity according to the embodiment of the invention, the extraction module sends the named entity to be identified to a search engine to obtain search results and extracts feature information from them, and the classification category acquisition module sends the named entity and the feature information to a preset classification model to obtain the entity's classification category. A named entity can thus be identified from search results when neither context nor user click behavior records are available. This adds a further route for classifying and identifying named entities and is especially useful for a cold-start search engine. The device can also improve the accuracy and efficiency of named entity identification.
To achieve these objects, the device for creating a classification model according to an embodiment of the fourth aspect of the invention comprises: a sample named entity acquisition module for obtaining category-labelled sample named entities; a search result acquisition module for sending the category-labelled sample named entities to a search engine and obtaining the search results that the search engine returns for them; an extraction module for extracting feature information from the returned search results; and a creation module for training with an existing algorithm on the category-labelled named entities, their corresponding labelled categories, and the corresponding feature information to create a first classification model.
In the device for creating a classification model according to the embodiment of the invention, the search result acquisition module sends category-labelled sample named entities to a search engine and obtains the search results returned for them, the extraction module extracts feature information from the returned search results, and the creation module creates a first classification model by training with an existing algorithm on the labelled named entities, their labelled categories, and the corresponding feature information. The classification model used by the identification method is thus created by supervised learning on search engine results, and the classification category of a named entity is obtained from that model, which improves identification efficiency.
Additional aspects and advantages of the invention will be set forth in part in the description that follows, and in part will become apparent from that description or be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of a method for identifying a named entity according to an embodiment of the invention;
Fig. 2 is a flowchart of a method for identifying a named entity according to a specific embodiment of the invention;
Fig. 3 is a flowchart of a method for creating a classification model according to an embodiment of the invention;
Fig. 4 is a flowchart of a method for creating a classification model according to a specific embodiment of the invention;
Fig. 5 is a flowchart of a method for creating a classification model according to another specific embodiment of the invention;
Fig. 6 is a structural diagram of a device for identifying a named entity according to an embodiment of the invention;
Fig. 7 is a structural diagram of a device for identifying a named entity according to a specific embodiment of the invention;
Fig. 8 is a structural diagram of a device for creating a classification model according to an embodiment of the invention;
Fig. 9 is a structural diagram of a device for creating a classification model according to a specific embodiment of the invention;
Fig. 10 is a structural diagram of a device for creating a classification model according to another specific embodiment of the invention.
Detailed description
Embodiments of the invention are described in detail below. Examples of the embodiments are shown in the drawings, in which the same or similar reference numerals denote the same or similar elements, or elements with the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the invention and are not to be construed as limiting it. On the contrary, the embodiments of the invention cover all changes, modifications, and equivalents falling within the spirit and scope of the appended claims.
In the description of the invention, it should be understood that the terms "first", "second", and so on are used for description only and are not to be construed as indicating or implying relative importance. It should also be noted that, unless otherwise expressly specified and limited, the terms "connected" and "connection" are to be interpreted broadly: a connection may be fixed, detachable, or integral; mechanical or electrical; direct, or indirect through an intermediary. Those of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the circumstances. In addition, unless otherwise stated, "a plurality of" means two or more.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing the steps of a specific logical function or process. The scope of the preferred embodiments of the invention includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functions involved, as will be understood by those skilled in the art to which the embodiments of the invention belong.
To solve the problem that a named entity cannot be identified when both context and user click behavior records are lacking, the invention proposes a method and device for identifying a named entity and a method and device for creating a classification model. These are described below with reference to the drawings.
A method for identifying a named entity comprises the following steps: obtaining a named entity to be identified; sending the named entity to be identified to a search engine to obtain search results, and extracting feature information from the search results; and sending the named entity to be identified and the feature information to a preset classification model, so as to obtain at least one classification category of the named entity to be identified from the preset classification model.
Fig. 1 is a flowchart of a method for identifying a named entity according to an embodiment of the invention.
As shown in Fig. 1, the method for identifying a named entity comprises the following steps:
S101: obtain the named entity to be identified.
A named entity may be a person name, an organization name, a place name, or any other entity identified by a name; more broadly, a named entity may also be a number, date, currency, address, and the like. Other entities identified by a name include, for example, films and TV series, books, games, and songs.
"Named entity" here should be interpreted broadly. It is not limited to the types mentioned in the background section, and the named entities of the embodiments of the invention can span many fields.
S102: send the named entity to be identified to a search engine to obtain search results, and extract feature information from the search results.
Specifically, the obtained named entity to be identified is sent to the search engine as a query term, the search engine retrieves search results for that term, and the feature information corresponding to the named entity is extracted from those search results.
For example, the URL (Uniform Resource Locator), title (webpage title), abstract (summary), and so on are first extracted from the search results. Unigrams are then extracted from the URL, title, abstract, etc. as features. A unigram is a single word; for example, the unigrams of "Liu Dehua wife Zhu Liqian photo" are: Liu Dehua / wife / Zhu Liqian / photo. Bigrams can also be extracted as features. A bigram is two consecutive words; the bigrams of the same phrase are: Liu Dehua wife / wife Zhu Liqian / Zhu Liqian photo. Trigrams can be extracted as well. A trigram is three consecutive words; the trigrams of the same phrase are: Liu Dehua wife Zhu Liqian / wife Zhu Liqian photo.
S103: send the named entity to be identified and the feature information to a preset classification model, so as to obtain at least one classification category of the named entity to be identified from the preset classification model.
For example, for the named entity "Love", one or more of the classification categories book title, film title, and song title can be obtained from its feature information and the preset classification model.
The preset classification model is a model trained in advance; it is described in detail in later embodiments.
In the method for identifying a named entity according to the embodiment of the invention, the named entity to be identified is sent to a search engine to obtain search results, feature information is extracted from the search results, and the named entity and the feature information are sent to a preset classification model to obtain the entity's classification category. A named entity can thus be identified from search results when neither context nor user click behavior records are available. This adds a further route for classifying and identifying named entities and is especially useful for a cold-start search engine. The method can also improve the accuracy and efficiency of named entity identification.
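The n-gram feature extraction described for step S102 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `ngrams` and `extract_features` are hypothetical, and the input is assumed to be already segmented into words (word segmentation itself is outside the sketch).

```python
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> List[str]:
    """Return every run of n consecutive tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_features(tokens: Sequence[str], max_n: int = 3) -> List[str]:
    """Collect unigram, bigram and trigram features from one token sequence."""
    features: List[str] = []
    for n in range(1, max_n + 1):
        features.extend(ngrams(tokens, n))
    return features


# Assumed pre-segmented search-result title (the example phrase from the text).
tokens = ["Liu Dehua", "wife", "Zhu Liqian", "photo"]
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
all_features = extract_features(tokens)
```

In practice the same function would be applied to each field of a search result (URL, title, abstract), and the resulting features passed to the preset classification model together with the named entity.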
Fig. 2 is a flowchart of a method for identifying a named entity according to a specific embodiment of the invention.
In an embodiment of the invention, if several classification categories are obtained from the preset classification model, each with a corresponding confidence, the categories can be sorted by confidence and the ranking provided. Specifically, as shown in Fig. 2, the method for identifying a named entity comprises the following steps:
S201: obtain the named entity to be identified.
A named entity may be a person name, an organization name, a place name, or any other entity identified by a name; more broadly, a named entity may also be a number, date, currency, address, and the like. Other entities identified by a name include, for example, films and TV series, books, games, and songs.
"Named entity" here should be interpreted broadly. It is not limited to the types mentioned in the background section, and the named entities of the embodiments of the invention can span many fields.
S202: send the named entity to be identified to a search engine to obtain search results, and extract feature information from the search results.
Specifically, the obtained named entity to be identified is sent to the search engine as a query term, the search engine retrieves search results for that term, and the feature information corresponding to the named entity is extracted from those search results.
For example, the URL, title, abstract, and so on are first extracted from the search results. Unigrams are then extracted from the URL, title, abstract, etc. as features. A unigram is a single word; for example, the unigrams of "Liu Dehua wife Zhu Liqian photo" are: Liu Dehua / wife / Zhu Liqian / photo. Bigrams can also be extracted as features. A bigram is two consecutive words; the bigrams of the same phrase are: Liu Dehua wife / wife Zhu Liqian / Zhu Liqian photo. Trigrams can be extracted as well. A trigram is three consecutive words; the trigrams of the same phrase are: Liu Dehua wife Zhu Liqian / wife Zhu Liqian photo.
S203: send the named entity to be identified and the feature information to a preset classification model, so as to obtain at least one classification category of the named entity to be identified from the preset classification model.
For example, for the named entity "Love", one or more of the classification categories book title, film title, and song title can be obtained from its feature information and the preset classification model.
The preset classification model is a model trained in advance; it is described in detail in later embodiments.
It should be understood that, in embodiments of the invention, a named entity may have one or more classification categories. For example, the named entity "A Chinese Ghost Story" can have several: it may be a game, a film, or a TV series. By contrast, "Let the Bullets Fly" has one classification category: it is a film.
S204: sort the classification categories by their corresponding confidence, and provide the ranking.
Confidence can also be understood as reliability: the degree to which the classification category obtained for the named entity to be identified from the preset classification model is reliable and accurate. The confidence corresponding to a classification category can be obtained by weighting the classification result.
Specifically, when the named entity to be identified has several classification categories, the categories can be sorted by their confidence, for example so that a higher confidence places the category earlier in the order. The ranking is then provided, and it shows which of the entity's classification categories is in stronger demand.
In the method for identifying a named entity according to the embodiment of the invention, when several classification categories are obtained from the preset classification model, each with a corresponding confidence, the categories can be sorted by confidence and the ranking provided. The ranking shows which of the entity's classification categories is in stronger demand, which improves reliability.
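Step S204 amounts to sorting categories by descending confidence. The sketch below assumes the classification model has already produced a confidence per category; the scores and the name `rank_categories` are invented for illustration, and the text does not specify how the weighting that yields the confidence is computed.

```python
from typing import Dict, List, Tuple


def rank_categories(confidences: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort categories by confidence, highest first.

    Ties are broken alphabetically so the output order is deterministic.
    """
    return sorted(confidences.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical confidences for the entity "A Chinese Ghost Story".
scores = {"game": 0.35, "film": 0.50, "tv_series": 0.15}
ranking = rank_categories(scores)
top_category = ranking[0][0]
```

The first element of the ranking is the category with the strongest demand in the sense used above.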
The preset classification model may be any of several classification models: a preset first, second, third, or fourth classification model. Different preset classification models can be created by training with an existing algorithm on different feature parameters.
In one embodiment of the invention, the preset classification model can be created by training with an existing algorithm on search results collected from the search engine. Specifically, when the preset classification model is the preset first classification model, it is created by the following steps:
S101': obtain category-labelled sample named entities.
For example, some named entities can be labelled in advance with classification categories and then used as the category-labelled sample named entities. To increase the accuracy of the classification model, each category-labelled sample named entity has exactly one labelled category.
S102': send the category-labelled sample named entities to the search engine, and obtain the search results that the search engine returns for them.
S103': extract feature information from the returned search results.
For example, the URL, title, abstract, and so on are first extracted from the search results. Unigrams are then extracted from the URL, title, abstract, etc. as features. A unigram is a single word; for example, the unigrams of "Liu Dehua wife Zhu Liqian photo" are: Liu Dehua / wife / Zhu Liqian / photo. Bigrams can also be extracted as features. A bigram is two consecutive words; the bigrams of the same phrase are: Liu Dehua wife / wife Zhu Liqian / Zhu Liqian photo. Trigrams can be extracted as well. A trigram is three consecutive words; the trigrams of the same phrase are: Liu Dehua wife Zhu Liqian / wife Zhu Liqian photo.
S104': train with an existing algorithm on the category-labelled named entities, their corresponding labelled categories, and the corresponding feature information to create the preset first classification model.
The existing algorithm may be a linear SVM (Support Vector Machine) algorithm, or one of many other existing algorithms, for example a decision tree induction algorithm or the KNN (K-Nearest Neighbor) algorithm. Because the linear SVM algorithm adapts well and achieves high classification accuracy, it is the preferred choice.
Thus, the preset first classification model used by the identification method is created by supervised learning on search engine results, and the classification category of a named entity is obtained from that model, which improves identification efficiency.
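Steps S101' to S104' can be sketched end to end as below. Two substitutions are made to keep the example self-contained and should be read as assumptions: a canned dictionary (`FAKE_INDEX`, with made-up snippets) stands in for a real search engine, and a small KNN classifier with Jaccard similarity is used instead of the preferred linear SVM, since KNN is one of the alternative existing algorithms the text names. All function names are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List, Set, Tuple

SearchFn = Callable[[str], List[Dict[str, str]]]
Model = List[Tuple[Set[str], str]]


def features_from_results(results: List[Dict[str, str]]) -> Set[str]:
    """S103': unigram features (whitespace tokens) from titles and abstracts."""
    feats: Set[str] = set()
    for result in results:
        for field in ("title", "abstract"):
            feats.update(result.get(field, "").lower().split())
    return feats


def build_model(samples: Dict[str, str], search: SearchFn) -> Model:
    """S101'-S104': query each labelled sample, store (features, label) pairs."""
    return [(features_from_results(search(entity)), label)
            for entity, label in samples.items()]


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Set overlap in [0, 1]; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(model: Model, feats: Set[str], k: int = 3) -> str:
    """KNN: majority vote among the k stored samples most similar to feats."""
    nearest = sorted(model, key=lambda pair: jaccard(pair[0], feats),
                     reverse=True)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Made-up search snippets standing in for a real search engine.
FAKE_INDEX = {
    "Let the Bullets Fly": [{"title": "Let the Bullets Fly film review",
                             "abstract": "watch the film online director cast"}],
    "Avatar": [{"title": "Avatar film trailer",
                "abstract": "director cast film box office"}],
    "Yesterday": [{"title": "Yesterday song lyrics",
                   "abstract": "listen to the song album lyrics"}],
    "Hey Jude": [{"title": "Hey Jude lyrics", "abstract": "song album listen"}],
    "Titanic": [{"title": "Titanic film cast",
                 "abstract": "watch the film director"}],
}


def fake_search(query: str) -> List[Dict[str, str]]:
    return FAKE_INDEX.get(query, [])


labelled = {"Let the Bullets Fly": "film", "Avatar": "film",
            "Yesterday": "song", "Hey Jude": "song"}
model = build_model(labelled, fake_search)
prediction = classify(model, features_from_results(fake_search("Titanic")))
```

With a real corpus, the stored pairs would instead be vectorized and fed to a linear SVM trainer; the data flow from labelled samples through search results to features and labels stays the same.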
In one embodiment of the invention, if a named entity appears as the main subject of a webpage's title, the webpage's category is likely to be the named entity's category. The webpages of the category-labelled sample named entities can therefore be obtained and processed to extract the text feature information of those sample named entities in the webpages, and the preset classification model is then created by training with an existing algorithm. Specifically, when the preset classification model is the preset second classification model, it is created by the following steps:
S201': obtain category-labelled sample named entities.
S202': send the category-labelled sample named entities to the search engine, and obtain the search results that the search engine returns for them.
S203': extract feature information from the returned search results.
S204': obtain the webpages of the category-labelled sample named entities.
Specifically, the webpage in which a category-labelled sample named entity appears can be obtained from the returned search results. The webpage contains the sample named entity and may also contain a title, text content, and so on.
S205': obtain the text feature information of the category-labelled sample named entities in the webpages.
For example, the text of the webpage in which the sample named entity appears is obtained from the search results, and unigrams are then extracted from the webpage text as features. A unigram is a single word; for example, the unigrams of "Liu Dehua wife Zhu Liqian photo" are: Liu Dehua / wife / Zhu Liqian / photo. Bigrams can also be extracted from the webpage text as features. A bigram is two consecutive words; the bigrams of the same phrase are: Liu Dehua wife / wife Zhu Liqian / Zhu Liqian photo. Trigrams can be extracted as well. A trigram is three consecutive words; the trigrams of the same phrase are: Liu Dehua wife Zhu Liqian / wife Zhu Liqian photo.
S206': train with an existing algorithm on the category-labelled named entities, their corresponding labelled categories, the corresponding feature information, and the text feature information to create the preset second classification model.
Thus, creating the preset second classification model makes it possible to classify more "long-tail" named entities, which improves named entity identification.
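The second model trains on two feature sources at once: search-result features and webpage text features. One possible way to keep the sources distinguishable is to prefix each feature with its origin before merging. The text does not prescribe this; the prefixes and the function names `namespaced` and `combine` are assumptions made for the sketch.

```python
from typing import Dict, Iterable, Set


def namespaced(prefix: str, features: Iterable[str]) -> Set[str]:
    """Tag each feature with the source it came from."""
    return {f"{prefix}:{feature}" for feature in features}


def combine(groups: Dict[str, Iterable[str]]) -> Set[str]:
    """Merge several feature groups into one set without collisions."""
    combined: Set[str] = set()
    for prefix, features in groups.items():
        combined |= namespaced(prefix, features)
    return combined


# Hypothetical features for one labelled sample.
snippet_features = ["film", "review"]          # from S203'
page_features = ["film", "director", "cast"]   # from S205'
combined = combine({"snippet": snippet_features, "page": page_features})
```

Here the word "film" contributes two separate features, one per source, so a trainer can weight evidence from the full webpage text differently from evidence in the search snippet.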
In one embodiment of the invention, the user click behavior log of the category-labeled sample named entity can be obtained, the user's click feature information for that sample named entity is then derived from the log, and training is performed with an existing algorithm to create the preset classification model. Specifically, when the preset classification model is the third preset classification model, it is created by the following steps:
S301', obtain the category-labeled sample named entity.
S302', send the category-labeled sample named entity to a search engine, and obtain the search results that the search engine returns for it.
S303', extract feature information from the returned search results.
S304', obtain the user click behavior log of the category-labeled sample named entity.
The user click behavior log may include the sample named entity, information about the web pages containing it (such as URL and title), and so on.
S305', obtain the user's click feature information for the category-labeled sample named entity.
Specifically, the click feature information can be obtained from the user click behavior log; for the concrete method, refer to the method described above for obtaining the feature information of the sample named entity.
S306', train with an existing algorithm on the category-labeled named entities, their labeled categories, the corresponding feature information, and the click feature information to create the third preset classification model.
Thus, creating the third preset classification model makes it possible to identify the category of a named entity in combination with users' click behavior, which gives higher accuracy than classification based only on the feature information in the search results.
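Steps S304' and S305' can be sketched as follows. The text only says that the log may contain the named entity and page information such as URL and title, and that click features are obtained in the same way as the other features; the ClickRecord layout, the "click:" prefix, and the function name below are therefore assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, NamedTuple


class ClickRecord(NamedTuple):
    """Hypothetical layout of one row of the user click behavior log."""
    query: str               # named entity the user searched for
    url: str                 # URL of the clicked page
    title_tokens: List[str]  # pre-segmented title of the clicked page


def click_features(entity: str, log: Iterable[ClickRecord],
                   max_n: int = 3) -> Dict[str, int]:
    """Count title n-grams (n = 1..max_n) over the pages clicked for entity."""
    counts: Counter = Counter()
    for rec in log:
        if rec.query != entity:
            continue  # ignore clicks that belong to other queries
        toks = rec.title_tokens
        for n in range(1, max_n + 1):
            for i in range(len(toks) - n + 1):
                counts["click:" + " ".join(toks[i:i + n])] += 1
    return dict(counts)


log = [
    ClickRecord("love", "http://example.com/a", ["love", "movie", "online"]),
    ClickRecord("love", "http://example.com/b", ["love", "movie"]),
    ClickRecord("other", "http://example.com/c", ["love", "song"]),
]
feats = click_features("love", log)
print(feats)
```

Prefixing the feature names keeps click-derived n-grams distinct from n-grams taken from search results when both feed the same model.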
In one embodiment of the invention, all of the feature information of the category-labeled sample named entity can be treated as one overall feature parameter, and the preset classification model is trained from this feature parameter with an existing algorithm. Specifically, when the preset classification model is the fourth preset classification model, it is created as follows: train with an existing algorithm on the category-labeled named entities, their labeled categories, the corresponding feature information, the text feature information, and the click feature information to create the fourth preset classification model. This makes the preset classification model more complete, so that the category obtained for a named entity to be identified is more accurate.
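Treating the three kinds of feature information as one overall feature parameter can be sketched as a merge of sparse feature dictionaries. The group names and the prefixing scheme are assumptions for illustration; the text does not specify how the features are combined.

```python
from typing import Dict


def merge_feature_groups(**groups: Dict[str, float]) -> Dict[str, float]:
    """Merge named feature groups into one sparse vector.

    Each feature name is prefixed with its group name so that the same
    n-gram coming from different sources stays a separate dimension.
    """
    merged: Dict[str, float] = {}
    for group_name, feats in groups.items():
        for name, value in feats.items():
            merged[f"{group_name}:{name}"] = value
    return merged


# Hypothetical feature values for one labeled sample.
search_feats = {"love movie": 2.0}
page_feats = {"love movie": 1.0}
click_feats = {"love song": 3.0}

vector = merge_feature_groups(search=search_feats, page=page_feats, click=click_feats)
print(vector)
```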
As the four embodiments above show, different preset classification models can be trained with an existing algorithm from different feature parameters, so that a named entity to be identified can be recognized with different preset classification models, which improves the accuracy of the recognition result.
To make the effect of the invention clearer, the implementation of the above embodiments is illustrated below. For example, when a new named entity appears on the Internet or in daily life, or when a named entity acquires a new category, the named entity needs to be recognized to obtain its category. First, the named entity to be identified, such as the word "Love", is obtained from the Internet or daily life. "Love" is then sent to a search engine as the search term, and the search engine returns multiple corresponding search results, such as a web page titled "Love - HD online viewing and Xunlei download - Movies - Xunlei Kankan" and a web page whose content description is "I Love You, singer: Wang Ruolin, album: Start From Here collection". The corresponding unigram, bigram, and trigram features are extracted from the search results. Finally, "Love" and these unigram, bigram, and trigram features are sent to the preset classification model, which indicates that "Love" is both a song and a film, so "Love" belongs to the music category and also to the film and television category.
The classification model plays a very important role in named entity recognition: once it has been created, it can be used to recognize a named entity to be identified and thereby obtain the category of that named entity. Therefore, to implement the above embodiments, the invention also proposes a method for creating a classification model.
A method for creating a classification model comprises the following steps: obtaining a category-labeled sample named entity; sending the category-labeled sample named entity to a search engine, and obtaining the search results that the search engine returns for it; extracting feature information from the returned search results; and training with an existing algorithm on the category-labeled named entities, their labeled categories, and the corresponding feature information to create a first classification model.
Fig. 3 is a flowchart of a method for creating a classification model according to one embodiment of the invention.
As shown in Fig. 3, the method for creating a classification model comprises the following steps:
S301, obtain the category-labeled sample named entity.
For example, some named entities can be labeled with categories in advance and then used as category-labeled sample named entities. To increase the accuracy of the classification model, each category-labeled sample named entity has exactly one labeled category.
S302, send the category-labeled sample named entity to a search engine, and obtain the search results that the search engine returns for it.
S303, extract feature information from the returned search results.
For example, the URL, title, abstract, and so on are first extracted from the search results. Unigrams are then extracted from the URL, title, abstract, and so on as features. A unigram is a single word; for example, the unigrams of "Liu Dehua wife Zhu Liqian photo" are: Liu Dehua / wife / Zhu Liqian / photo. Bigrams can also be extracted from the URL, title, abstract, and so on as features. A bigram is two consecutive words; the bigrams of the same phrase are: Liu Dehua wife / wife Zhu Liqian / Zhu Liqian photo. In addition, trigrams can be extracted from the URL, title, abstract, and so on as features. A trigram is three consecutive words; the trigrams of the same phrase are: Liu Dehua wife Zhu Liqian / wife Zhu Liqian photo.
S304, train with an existing algorithm on the category-labeled named entities, their labeled categories, and the corresponding feature information to create the first classification model.
The existing algorithm may be a linear SVM algorithm or any of various other existing algorithms, such as decision tree induction or the KNN algorithm. It should be understood that the linear SVM algorithm is preferred because it adapts well and achieves relatively high classification accuracy.
In the method for creating a classification model of this embodiment of the invention, the category-labeled sample named entity is sent to a search engine, the search results that the search engine returns for it are obtained, feature information is extracted from the returned search results, and the first classification model is created by training with an existing algorithm on the category-labeled named entities, their labeled categories, and the corresponding feature information. The classification model is thus created by supervised learning from the search engine's results for named entities, and the category of a named entity is then obtained from the classification model, which improves recognition efficiency.
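Step S304 names linear SVM as the preferred algorithm without giving details. The sketch below is one possible realization under stated assumptions, not the patent's implementation: a binary linear SVM trained by subgradient descent on the L2-regularized hinge loss, extended to several categories with a one-vs-rest scheme. The function names, hyperparameters, and toy samples are all hypothetical.

```python
from typing import Dict, List, Tuple

Features = Dict[str, float]


def train_linear_svm(samples: List[Tuple[Features, int]], epochs: int = 50,
                     lr: float = 0.1, reg: float = 0.01) -> Features:
    """Binary linear SVM (labels +1/-1); returns sparse weights incl. a bias."""
    w: Features = {}
    for _ in range(epochs):
        for feats, y in samples:
            x = dict(feats, __bias__=1.0)  # constant feature acts as the bias
            margin = y * sum(w.get(k, 0.0) * v for k, v in x.items())
            for k in w:  # L2 regularization shrinks every weight slightly
                w[k] *= 1.0 - lr * reg
            if margin < 1.0:  # hinge loss is active only inside the margin
                for k, v in x.items():
                    w[k] = w.get(k, 0.0) + lr * y * v
    return w


def train_one_vs_rest(samples: List[Tuple[Features, str]]) -> Dict[str, Features]:
    """Train one binary SVM per labeled category."""
    labels = sorted({label for _, label in samples})
    return {
        label: train_linear_svm([(f, 1 if l == label else -1) for f, l in samples])
        for label in labels
    }


def score(w: Features, feats: Features) -> float:
    x = dict(feats, __bias__=1.0)
    return sum(w.get(k, 0.0) * v for k, v in x.items())


def predict(models: Dict[str, Features], feats: Features) -> str:
    """Return the category whose model gives the highest score."""
    return max(models, key=lambda label: score(models[label], feats))


# Toy labeled samples: n-gram features -> single labeled category.
samples = [
    ({"movie": 1.0, "watch online": 1.0}, "film"),
    ({"movie": 1.0, "download": 1.0}, "film"),
    ({"singer": 1.0, "album": 1.0}, "music"),
    ({"lyrics": 1.0, "singer": 1.0}, "music"),
]
models = train_one_vs_rest(samples)
print(predict(models, {"movie": 1.0}), predict(models, {"singer": 1.0}))
```

In practice an established library implementation of linear SVM would normally replace the hand-written training loop; the loop is shown only to keep the example self-contained.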
Fig. 4 is a flowchart of a method for creating a classification model according to a specific embodiment of the invention.
If a named entity appears as the main subject of a web page's title, the category of the web page is likely to be the category of the named entity. Therefore, the web page of the category-labeled sample named entity can be obtained and processed to extract the text feature information of that sample named entity in the web page, and a second classification model can then be trained with an existing algorithm. Specifically, as shown in Fig. 4, the method for creating a classification model comprises the following steps:
S401, obtain the category-labeled sample named entity.
S402, send the category-labeled sample named entity to a search engine, and obtain the search results that the search engine returns for it.
S403, extract feature information from the returned search results.
S404, obtain the web page of the category-labeled sample named entity.
Specifically, the web page containing the category-labeled sample named entity can be obtained from the returned search results; this page contains the sample named entity and may also include a title, text content, and so on.
S405, obtain the text feature information of the category-labeled sample named entity in the web page.
For example, the text of the web page containing the sample named entity is obtained from the search results, and unigrams are then extracted from the page text as features. A unigram is a single word; for example, the unigrams of "Liu Dehua wife Zhu Liqian photo" are: Liu Dehua / wife / Zhu Liqian / photo. Bigrams can also be extracted from the page text as features. A bigram is two consecutive words; the bigrams of the same phrase are: Liu Dehua wife / wife Zhu Liqian / Zhu Liqian photo. In addition, trigrams can be extracted from the page text as features. A trigram is three consecutive words; the trigrams of the same phrase are: Liu Dehua wife Zhu Liqian / wife Zhu Liqian photo.
S406, train with an existing algorithm on the category-labeled named entities, their labeled categories, the corresponding feature information, and the text feature information to create the second classification model.
In the method for creating a classification model of this embodiment of the invention, the web page of the category-labeled sample named entity is obtained, the text feature information of that sample named entity in the web page is obtained, and the second classification model is created by training with an existing algorithm on the category-labeled named entities, their labeled categories, the corresponding feature information, and the text feature information. This makes it possible to classify more "long-tail" named entities, which improves named entity recognition.
Fig. 5 is a flowchart of a method for creating a classification model according to another specific embodiment of the invention.
To improve the accuracy of the named entity recognition result, the user click behavior log of the category-labeled sample named entity can first be obtained, the user's click feature information for that sample named entity is then derived from the log, and training is performed with an existing algorithm to create the preset classification model. Specifically, as shown in Fig. 5, the method for creating a classification model comprises the following steps:
S501, obtain the category-labeled sample named entity.
S502, send the category-labeled sample named entity to a search engine, and obtain the search results that the search engine returns for it.
S503, extract feature information from the returned search results.
S504, obtain the user click behavior log of the category-labeled sample named entity.
The user click behavior log may include the sample named entity, information about the web pages containing it (such as URL and title), and so on.
S505, obtain the user's click feature information for the category-labeled sample named entity.
Specifically, the click feature information can be obtained from the user click behavior log; for the concrete method, refer to the method described above for obtaining the feature information of the sample named entity.
S506, train with an existing algorithm on the category-labeled named entities, their labeled categories, the corresponding feature information, and the click feature information to create the third classification model.
In the method for creating a classification model of this embodiment of the invention, the user click behavior log of the category-labeled sample named entity is obtained, the user's click feature information for that sample named entity is obtained, and the third classification model is created by training with an existing algorithm on the category-labeled named entities, their labeled categories, the corresponding feature information, and the click feature information. This makes it possible to identify the category of a named entity in combination with users' click behavior, which gives higher accuracy than classification based only on the feature information in the search results.
In one embodiment of the invention, all of the feature information of the category-labeled sample named entity can be treated as one overall feature parameter, and the preset classification model is trained from this feature parameter with an existing algorithm. Specifically, the method for creating a classification model further comprises: training with an existing algorithm on the category-labeled named entities, their labeled categories, the corresponding feature information, the text feature information, and the click feature information to create a fourth classification model. This makes the preset classification model more complete, so that the category obtained for a named entity to be identified is more accurate.
To implement the above embodiments, the invention also proposes a device for recognizing a named entity.
A device for recognizing a named entity comprises: a named entity acquisition module, configured to obtain a named entity to be identified; an extraction module, configured to send the named entity to be identified to a search engine to obtain search results, and to extract feature information from the search results; and a category acquisition module, configured to send the named entity to be identified and the feature information to a preset classification model, so as to obtain at least one category of the named entity to be identified from the preset classification model.
Fig. 6 is a schematic structural diagram of a device for recognizing a named entity according to one embodiment of the invention.
As shown in Fig. 6, the device for recognizing a named entity comprises: a named entity acquisition module 110, an extraction module 120, and a category acquisition module 130.
Specifically, the named entity acquisition module 110 is configured to obtain the named entity to be identified. A named entity may be a person's name, an organization name, a place name, or any other entity identified by a name; more broadly, named entities also include numbers, dates, currencies, addresses, and so on. Other entities identified by a name include, for example, films and television programs, books, games, and songs. The term "named entity" here should be interpreted broadly and is not limited to the types mentioned in the background section; the named entities of the embodiments of the invention may relate to many fields.
The extraction module 120 is configured to send the named entity to be identified to a search engine to obtain search results, and to extract feature information from the search results. More specifically, the extraction module 120 sends the obtained named entity to the search engine as the search term, the search engine returns search results for this search term, and the extraction module extracts from those results the feature information corresponding to the named entity to be identified.
For example, the URL, title, abstract, and so on are first extracted from the search results. Unigrams are then extracted from the URL, title, abstract, and so on as features. A unigram is a single word; for example, the unigrams of "Liu Dehua wife Zhu Liqian photo" are: Liu Dehua / wife / Zhu Liqian / photo. Bigrams can also be extracted from the URL, title, abstract, and so on as features. A bigram is two consecutive words; the bigrams of the same phrase are: Liu Dehua wife / wife Zhu Liqian / Zhu Liqian photo. In addition, trigrams can be extracted from the URL, title, abstract, and so on as features. A trigram is three consecutive words; the trigrams of the same phrase are: Liu Dehua wife Zhu Liqian / wife Zhu Liqian photo.
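What the extraction module does with the URL, title, and abstract of each search result can be sketched as follows. The dictionary layout of a search result, the pre-segmented token lists, and the "field|n|gram" naming are assumptions for illustration; the text does not say whether features from different fields are kept apart.

```python
from collections import Counter
from typing import Dict, List


def ngrams(tokens: List[str], n: int) -> List[str]:
    """All runs of n consecutive tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def search_result_features(results: List[Dict[str, List[str]]],
                           max_n: int = 3) -> Counter:
    """Count field-tagged unigrams, bigrams, and trigrams over all results."""
    feats: Counter = Counter()
    for result in results:
        for field in ("url", "title", "abstract"):
            tokens = result.get(field, [])  # a missing field contributes nothing
            for n in range(1, max_n + 1):
                for gram in ngrams(tokens, n):
                    feats[f"{field}|{n}|{gram}"] += 1
    return feats


# One hypothetical search result with pre-segmented fields.
results = [{
    "url": ["movie", "example", "com"],
    "title": ["love", "HD", "online"],
    "abstract": ["love", "film"],
}]
feats = search_result_features(results)
print(feats.most_common(5))
```

Tagging each n-gram with its field lets a linear model weight a word in the title differently from the same word in the abstract.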
The category acquisition module 130 is configured to send the named entity to be identified and the feature information to the preset classification model, so as to obtain at least one category of the named entity to be identified from the preset classification model. For example, for the named entity "Love", one or more of the categories book title, movie title, and song can be obtained from its feature information and the preset classification model. The preset classification model is a model trained in advance, which is described in detail in subsequent embodiments.
It should be understood that, in embodiments of the invention, a named entity may have one or more categories. For example, the named entity "A Chinese Ghost Story" can have several categories: it can be a game, a film, or a TV series. As another example, "Let the Bullets Fly" has one category: it is a film.
In one embodiment of the invention, if multiple categories are obtained from the preset classification model, each category has a corresponding confidence. The confidence can be understood as the reliability, that is, how reliable and accurate the category obtained for the named entity to be identified from the preset classification model is. The confidence of a category can be obtained by weighting the classification results.
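The text states only that a category's confidence can be obtained by weighting the classification results and gives no formula. The sketch below shows one possible reading under explicit assumptions: each classification result is a (category, weight) vote, and a category's confidence is its share of the total weight. The function name and the example weights are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def weighted_confidence(votes: List[Tuple[str, float]]) -> Dict[str, float]:
    """Turn weighted category votes into confidences that sum to 1."""
    totals: Dict[str, float] = defaultdict(float)
    for category, weight in votes:
        totals[category] += weight
    total = sum(totals.values())
    if total == 0:
        return {}  # no evidence, so no confidence can be assigned
    return {category: weight / total for category, weight in totals.items()}


# Hypothetical votes, e.g. one per search result with a rank-based weight.
votes = [("film", 1.0), ("music", 0.5), ("film", 0.5)]
confidence = weighted_confidence(votes)
print(confidence)
```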
In the device for recognizing a named entity of this embodiment of the invention, the extraction module sends the named entity to be identified to a search engine to obtain search results and extracts feature information from the search results, and the category acquisition module sends the named entity to be identified and the feature information to the preset classification model to obtain the category of the named entity from the preset classification model. Thus, a named entity can be recognized from search results even when no context or user click behavior record is available, which adds a further way of classifying named entities and is especially significant for a cold-start search engine. In addition, the accuracy of named entity recognition and the recognition efficiency can be improved.
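The cooperation of the three modules can be sketched as a small class with an injected search function and an injected classifier. Everything here is a stand-in chosen for illustration: the class and method names are hypothetical, fake_search replaces a real search engine, and toy_classify replaces the preset classification model with made-up scores.

```python
from typing import Callable, Dict, List


class NamedEntityRecognizer:
    """Extraction module plus category acquisition module in one object."""

    def __init__(self, search: Callable[[str], List[str]],
                 classify: Callable[[str, Dict[str, int]], Dict[str, float]]) -> None:
        self.search = search      # entity -> text snippets of search results
        self.classify = classify  # (entity, features) -> category scores

    def extract(self, entity: str) -> Dict[str, int]:
        """Send the entity to the search function and count unigram features."""
        feats: Dict[str, int] = {}
        for snippet in self.search(entity):
            for token in snippet.lower().split():
                feats[token] = feats.get(token, 0) + 1
        return feats

    def recognize(self, entity: str) -> Dict[str, float]:
        """Pass the entity and its features to the classification model."""
        return self.classify(entity, self.extract(entity))


def fake_search(entity: str) -> List[str]:
    # Stand-in for a search engine call; returns two canned snippets.
    return [f"{entity} HD movie watch online", f"{entity} song album singer"]


def toy_classify(entity: str, feats: Dict[str, int]) -> Dict[str, float]:
    # Stand-in for the preset classification model; scores are made up.
    scores: Dict[str, float] = {}
    if feats.get("movie"):
        scores["film"] = 0.6
    if feats.get("song"):
        scores["music"] = 0.4
    return scores


recognizer = NamedEntityRecognizer(fake_search, toy_classify)
categories = recognizer.recognize("Love")
print(categories)
```

Injecting the search function and the model keeps the recognizer independent of any particular search engine or training method, which also makes it easy to exercise without network access.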
Fig. 7 is a schematic structural diagram of a device for recognizing a named entity according to a specific embodiment of the invention.
As shown in Fig. 7, the device for recognizing a named entity comprises: a named entity acquisition module 110, an extraction module 120, a category acquisition module 130, and a sorting module 140.
Specifically, the sorting module 140 is configured to sort multiple categories by their corresponding confidence and to provide the sorted result. More specifically, when the named entity to be identified has multiple categories, the sorting module 140 sorts them by the confidence of each category, for example placing a category earlier the higher its confidence, and provides the sorted result, from which it can be seen which categories of the named entity are in stronger demand.
In the device for recognizing a named entity of this embodiment of the invention, the sorting module sorts multiple categories by their corresponding confidence and provides the sorted result, from which it can be seen which categories of the named entity are in stronger demand, improving reliability.
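The behavior of the sorting module can be sketched in a few lines. The function name and the confidence values are hypothetical; breaking ties alphabetically is an added assumption that merely makes the order deterministic.

```python
from typing import Dict, List, Tuple


def rank_categories(confidences: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort categories by descending confidence, ties broken by name."""
    return sorted(confidences.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical confidences for a named entity with three categories.
ranking = rank_categories({"game": 0.2, "film": 0.5, "tv series": 0.3})
print(ranking)
```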
To implement the above embodiments, the invention further proposes a device for creating a classification model.
A device for creating a classification model comprises: a sample named entity acquisition module, configured to obtain a category-labeled sample named entity; a search result acquisition module, configured to send the category-labeled sample named entity to a search engine and to obtain the search results that the search engine returns for it; an extraction module, configured to extract feature information from the returned search results; and a creation module, configured to train with an existing algorithm on the category-labeled named entities, their labeled categories, and the corresponding feature information to create a first classification model.
Fig. 8 is a schematic structural diagram of a device for creating a classification model according to one embodiment of the invention.
As shown in Fig. 8, the device for creating a classification model comprises: a sample named entity acquisition module 210, a search result acquisition module 220, an extraction module 230, and a creation module 240.
Specifically, the sample named entity acquisition module 210 is configured to obtain the category-labeled sample named entity. For example, some named entities can be labeled with categories in advance and then used as category-labeled sample named entities. To increase the accuracy of the classification model, each category-labeled sample named entity has exactly one labeled category.
The search result acquisition module 220 is configured to send the category-labeled sample named entity to a search engine and to obtain the search results that the search engine returns for it.
The extraction module 230 is configured to extract feature information from the returned search results. For example, the URL, title, abstract, and so on are first extracted from the search results. Unigrams are then extracted from the URL, title, abstract, and so on as features. A unigram is a single word; for example, the unigrams of "Liu Dehua wife Zhu Liqian photo" are: Liu Dehua / wife / Zhu Liqian / photo. Bigrams can also be extracted from the URL, title, abstract, and so on as features. A bigram is two consecutive words; the bigrams of the same phrase are: Liu Dehua wife / wife Zhu Liqian / Zhu Liqian photo. In addition, trigrams can be extracted from the URL, title, abstract, and so on as features. A trigram is three consecutive words; the trigrams of the same phrase are: Liu Dehua wife Zhu Liqian / wife Zhu Liqian photo.
The creation module 240 is configured to train with an existing algorithm on the category-labeled named entities, their labeled categories, and the corresponding feature information to create the first classification model. The existing algorithm may be a linear SVM algorithm or any of various other existing algorithms, such as decision tree induction or the KNN algorithm. It should be understood that the linear SVM algorithm is preferred because it adapts well and achieves relatively high classification accuracy.
In the device for creating a classification model of this embodiment of the invention, the search result acquisition module sends the category-labeled sample named entity to a search engine and obtains the search results that the search engine returns for it, the extraction module extracts feature information from the returned search results, and the creation module trains with an existing algorithm on the category-labeled named entities, their labeled categories, and the corresponding feature information to create the first classification model. The classification model is thus created by supervised learning from the search engine's results for named entities, and the category of a named entity is then obtained from the classification model, which improves recognition efficiency.
Fig. 9 is a schematic structural diagram of a device for creating a classification model according to a specific embodiment of the invention.
As shown in Fig. 9, the device for creating a classification model comprises: a sample named entity acquisition module 210, a search result acquisition module 220, an extraction module 230, a creation module 240, a web page acquisition module 250, and a text feature information acquisition module 260.
Specifically, the web page acquisition module 250 is configured to obtain the web page of the category-labeled sample named entity. More specifically, the web page acquisition module 250 can obtain the web page containing the category-labeled sample named entity from the returned search results; this page contains the sample named entity and may also include a title, text content, and so on.
The text feature information acquisition module 260 is configured to obtain the text feature information of the category-labeled sample named entity in the web page. For example, the text feature information acquisition module 260 can obtain the text of the web page containing the sample named entity from the search results and then extract unigrams from the page text as features. A unigram is a single word; for example, the unigrams of "Liu Dehua wife Zhu Liqian photo" are: Liu Dehua / wife / Zhu Liqian / photo. Bigrams can also be extracted from the page text as features. A bigram is two consecutive words; the bigrams of the same phrase are: Liu Dehua wife / wife Zhu Liqian / Zhu Liqian photo. In addition, trigrams can be extracted from the page text as features. A trigram is three consecutive words; the trigrams of the same phrase are: Liu Dehua wife Zhu Liqian / wife Zhu Liqian photo.
In one embodiment of the invention, the creation module 240 is further configured to train with an existing algorithm on the category-labeled named entities, their labeled categories, the corresponding feature information, and the text feature information to create a second classification model.
In the device for creating a classification model of this embodiment of the invention, the web page acquisition module obtains the web page of the category-labeled sample named entity, the text feature information acquisition module obtains the text feature information of that sample named entity in the web page, and the creation module trains with an existing algorithm on the category-labeled named entities, their labeled categories, the corresponding feature information, and the text feature information to create the second classification model. This makes it possible to classify more "long-tail" named entities, which improves named entity recognition.
Fig. 10 is a schematic structural diagram of a device for creating a classification model according to another specific embodiment of the invention.
As shown in Fig. 10, the device for creating a classification model comprises: a sample named entity acquisition module 210, a search result acquisition module 220, an extraction module 230, a creation module 240, a web page acquisition module 250, a text feature information acquisition module 260, a behavior log acquisition module 270, and a click feature information acquisition module 280.
Specifically, the behavior log acquisition module 270 is configured to obtain the user click behavior log of the category-labeled sample named entity. The user click behavior log may include the sample named entity, information about the web pages containing it (such as URL and title), and so on.
The click feature information acquisition module 280 is configured to obtain the user's click feature information for the category-labeled sample named entity. More specifically, the click feature information acquisition module 280 can obtain this click feature information from the user click behavior log; for the concrete method, refer to the method described above for obtaining the feature information of the sample named entity.
In one embodiment of the invention, the creation module 240 is further configured to train with an existing algorithm on the category-labeled named entities, their labeled categories, the corresponding feature information, and the click feature information to create a third classification model.
In the device for creating a classification model of this embodiment of the invention, the behavior log acquisition module obtains the user click behavior log of the category-labeled sample named entity, the click feature information acquisition module obtains the user's click feature information for that sample named entity, and the creation module trains with an existing algorithm on the category-labeled named entities, their labeled categories, the corresponding feature information, and the click feature information to create the third classification model. This makes it possible to identify the category of a named entity in combination with users' click behavior, which gives higher accuracy than classification based only on the feature information in the search results.
In one embodiment of the invention, all of the feature information of the category-labeled sample named entity can be treated as one overall feature parameter, and the preset classification model is trained from this feature parameter with an existing algorithm. Specifically, the creation module 240 is further configured to train with an existing algorithm on the category-labeled named entities, their labeled categories, the corresponding feature information, the text feature information, and the click feature information to create a fourth classification model. This makes the preset classification model more complete, so that the category obtained for a named entity to be identified is more accurate.
It should be understood that each part of the invention may be implemented in hardware, software, firmware, or a combination thereof. In the embodiments above, multiple steps or methods may be implemented in software or firmware that is stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following techniques well known in the art may be used: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGA), field-programmable gate arrays (FPGA), and so on.
In this specification, a description that refers to the terms "an embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, such references do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
Although embodiments of the invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations may be made to these embodiments without departing from the principle and spirit of the invention; the scope of the invention is defined by the claims and their equivalents.

Claims (18)

1. A method for recognizing a named entity, characterized by comprising the following steps:
obtaining a named entity to be recognized;
sending the named entity to be recognized to a search engine to obtain search results, and extracting feature information from the search results; and
sending the named entity to be recognized and the feature information to a preset classification model, so as to obtain at least one classification category of the named entity to be recognized according to the preset classification model.
2. The method according to claim 1, characterized in that, if a plurality of classification categories are obtained through the preset classification model, each classification category corresponds to a confidence.
3. The method according to claim 2, characterized by further comprising:
sorting the plurality of classification categories according to their corresponding confidences, and providing the sorted result.
4. The method according to any one of claims 1 to 3, characterized in that the preset classification model is a first preset classification model, and the first preset classification model is created according to the following steps:
obtaining sample named entities with labeled categories;
sending the sample named entities with labeled categories to a search engine, and obtaining the search results returned by the search engine for the sample named entities with labeled categories;
extracting feature information from the returned search results; and
training, using an existing algorithm, on the named entities with labeled categories, the corresponding labeled categories, and the corresponding feature information, so as to create the first preset classification model.
5. The method according to claim 4, characterized in that the preset classification model is a second preset classification model, and the second preset classification model is created according to the following steps:
obtaining web pages of the sample named entities with labeled categories;
obtaining text feature information of the sample named entities with labeled categories in the web pages; and
training, using an existing algorithm, on the named entities with labeled categories, the corresponding labeled categories, the corresponding feature information, and the text feature information, so as to create the second preset classification model.
6. The method according to claim 5, characterized in that the preset classification model is a third preset classification model, and the third preset classification model is created according to the following steps:
obtaining a user click behavior log for the sample named entities with labeled categories;
obtaining users' click feature information for the sample named entities with labeled categories; and
training, using an existing algorithm, on the named entities with labeled categories, the corresponding labeled categories, the corresponding feature information, and the click feature information, so as to create the third preset classification model.
7. The method according to claim 6, characterized in that the preset classification model is a fourth preset classification model, and the fourth preset classification model is created according to the following step:
training, using an existing algorithm, on the named entities with labeled categories, the corresponding labeled categories, the corresponding feature information, the text feature information, and the click feature information, so as to create the fourth preset classification model.
8. A method for creating a classification model, characterized by comprising the following steps:
obtaining sample named entities with labeled categories;
sending the sample named entities with labeled categories to a search engine, and obtaining the search results returned by the search engine for the sample named entities with labeled categories;
extracting feature information from the returned search results; and
training, using an existing algorithm, on the named entities with labeled categories, the corresponding labeled categories, and the corresponding feature information, so as to create a first classification model.
9. The method according to claim 8, characterized by further comprising:
obtaining web pages of the sample named entities with labeled categories;
obtaining text feature information of the sample named entities with labeled categories in the web pages; and
training, using an existing algorithm, on the named entities with labeled categories, the corresponding labeled categories, the corresponding feature information, and the text feature information, so as to create a second classification model.
10. The method according to claim 8, characterized by further comprising:
obtaining a user click behavior log for the sample named entities with labeled categories;
obtaining users' click feature information for the sample named entities with labeled categories; and
training, using an existing algorithm, on the named entities with labeled categories, the corresponding labeled categories, the corresponding feature information, and the click feature information, so as to create a third classification model.
11. The method according to claim 9 or 10, characterized by further comprising:
training, using an existing algorithm, on the named entities with labeled categories, the corresponding labeled categories, the corresponding feature information, the text feature information, and the click feature information, so as to create a fourth classification model.
12. An apparatus for recognizing a named entity, characterized by comprising:
a named entity acquisition module, configured to obtain a named entity to be recognized;
an extraction module, configured to send the named entity to be recognized to a search engine to obtain search results, and to extract feature information from the search results; and
a classification category acquisition module, configured to send the named entity to be recognized and the feature information to a preset classification model, so as to obtain at least one classification category of the named entity to be recognized according to the preset classification model.
13. The apparatus according to claim 12, characterized in that, if a plurality of classification categories are obtained through the preset classification model, each classification category corresponds to a confidence.
14. The apparatus according to claim 13, characterized by further comprising:
a sorting module, configured to sort the plurality of classification categories according to their corresponding confidences, and to provide the sorted result.
15. An apparatus for creating a classification model, characterized by comprising:
a sample named entity acquisition module, configured to obtain sample named entities with labeled categories;
a search result acquisition module, configured to send the sample named entities with labeled categories to a search engine, and to obtain the search results returned by the search engine for the sample named entities with labeled categories;
an extraction module, configured to extract feature information from the returned search results; and
a creation module, configured to train, using an existing algorithm, on the named entities with labeled categories, the corresponding labeled categories, and the corresponding feature information, so as to create a first classification model.
16. The apparatus according to claim 15, characterized by further comprising:
a web page acquisition module, configured to obtain web pages of the sample named entities with labeled categories;
a text feature information acquisition module, configured to obtain text feature information of the sample named entities with labeled categories in the web pages; wherein
the creation module is further configured to train, using an existing algorithm, on the named entities with labeled categories, the corresponding labeled categories, the corresponding feature information, and the text feature information, so as to create a second classification model.
17. The apparatus according to claim 15, characterized by further comprising:
a behavior log acquisition module, configured to obtain a user click behavior log for the sample named entities with labeled categories;
a click feature information acquisition module, configured to obtain users' click feature information for the sample named entities with labeled categories; wherein
the creation module is further configured to train, using an existing algorithm, on the named entities with labeled categories, the corresponding labeled categories, the corresponding feature information, and the click feature information, so as to create a third classification model.
18. The apparatus according to claim 16 or 17, characterized in that the creation module is further configured to train, using an existing algorithm, on the named entities with labeled categories, the corresponding labeled categories, the corresponding feature information, the text feature information, and the click feature information, so as to create a fourth classification model.
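For orientation only, the recognition flow of claims 1 to 3 (search, feature extraction, classification, ranking by confidence) might be sketched as below. This is not part of the claims. The search function is a stub, the token-count features, the keyword-weight model, and the normalization of scores into confidences are all assumptions, and the weights are assumed to be non-negative.

```python
def fake_search(entity):
    """Stub standing in for a search engine; returns hypothetical result titles."""
    return [
        f"{entity} movie trailer",
        f"{entity} film review",
        f"{entity} novel adaptation",
    ]


def extract_features(results):
    """Count lowercase tokens across result titles (a stand-in feature extractor)."""
    counts = {}
    for title in results:
        for token in title.lower().split():
            counts[token] = counts.get(token, 0) + 1
    return counts


def classify(features, weights):
    """Return (category, confidence) pairs sorted by descending confidence.

    `weights` maps category -> {token: non-negative weight}; confidences are
    the category scores normalized to sum to 1.
    """
    scores = {
        category: sum(w * features.get(token, 0) for token, w in table.items())
        for category, table in weights.items()
    }
    total = sum(scores.values())
    if total == 0:
        return []
    ranked = [(c, s / total) for c, s in scores.items() if s > 0]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical preset model, for illustration only.
weights = {
    "film": {"movie": 1.0, "film": 1.0, "trailer": 0.5},
    "book": {"novel": 1.0},
}
ranked = classify(extract_features(fake_search("Inception")), weights)
```

Here `ranked` lists every category with a positive score, highest confidence first, which corresponds to providing the sorted result of claim 3.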
CN201310611971.5A 2013-11-26 2013-11-26 Method and device for identifying named entity and method and device for establishing classification model Pending CN103617239A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310611971.5A CN103617239A (en) 2013-11-26 2013-11-26 Method and device for identifying named entity and method and device for establishing classification model

Publications (1)

Publication Number Publication Date
CN103617239A true CN103617239A (en) 2014-03-05

Family

ID=50167942

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310611971.5A Pending CN103617239A (en) 2013-11-26 2013-11-26 Method and device for identifying named entity and method and device for establishing classification model

Country Status (1)

Country Link
CN (1) CN103617239A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1629845A (en) * 2003-12-16 2005-06-22 微软公司 Query recognizer
US20110231347A1 (en) * 2010-03-16 2011-09-22 Microsoft Corporation Named Entity Recognition in Query

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
段焕中: "事务类搜索意图分类模型研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103984705B (en) * 2014-04-25 2018-05-04 北京奇虎科技有限公司 A kind of methods of exhibiting of search result, device and system
CN103984705A (en) * 2014-04-25 2014-08-13 北京奇虎科技有限公司 Search result displaying method, device and system
CN104615621B (en) * 2014-06-25 2017-11-21 腾讯科技(深圳)有限公司 Correlation treatment method and system in search
CN104615621A (en) * 2014-06-25 2015-05-13 腾讯科技(深圳)有限公司 Method and system for processing correlations in searches
CN104102739A (en) * 2014-07-28 2014-10-15 百度在线网络技术(北京)有限公司 Entity library expansion method and device
CN106294341A (en) * 2015-05-12 2017-01-04 阿里巴巴集团控股有限公司 A kind of Intelligent Answer System and theme method of discrimination thereof and device
CN104978587B (en) * 2015-07-13 2018-06-01 北京工业大学 A kind of Entity recognition cooperative learning algorithm based on Doctype
CN105045909A (en) * 2015-08-11 2015-11-11 北京京东尚科信息技术有限公司 Method and device for recognizing commodity name from text
CN105045909B (en) * 2015-08-11 2018-04-03 北京京东尚科信息技术有限公司 The method and apparatus that trade name is identified from text
CN105894089A (en) * 2016-04-21 2016-08-24 百度在线网络技术(北京)有限公司 Method of establishing credit investigation model, credit investigation determination method and the corresponding apparatus thereof
US11068655B2 (en) 2016-09-29 2021-07-20 Tencent Technology (Shenzhen) Company Limited Text recognition based on training of models at a plurality of training nodes
CN107885716B (en) * 2016-09-29 2020-02-11 腾讯科技(深圳)有限公司 Text recognition method and device
CN107885716A (en) * 2016-09-29 2018-04-06 腾讯科技(深圳)有限公司 Text recognition method and device
CN107038183A (en) * 2016-10-09 2017-08-11 北京百度网讯科技有限公司 Webpage label method and device
CN107038183B (en) * 2016-10-09 2021-01-29 北京百度网讯科技有限公司 Webpage labeling method and device
CN108205524A (en) * 2016-12-20 2018-06-26 北京京东尚科信息技术有限公司 Text data processing method and device
CN108205524B (en) * 2016-12-20 2022-01-07 北京京东尚科信息技术有限公司 Text data processing method and device
CN107609094A (en) * 2017-09-08 2018-01-19 北京百度网讯科技有限公司 Data disambiguation method, device and computer equipment
CN107609094B (en) * 2017-09-08 2020-12-04 北京百度网讯科技有限公司 Data disambiguation method and device and computer equipment
WO2019064137A1 (en) * 2017-09-27 2019-04-04 International Business Machines Corporation Extraction of expression for natural language processing
CN107622126A (en) * 2017-09-28 2018-01-23 联想(北京)有限公司 The method and apparatus sorted out to the solid data in data acquisition system
CN108959552A (en) * 2018-06-29 2018-12-07 北京百度网讯科技有限公司 Recognition methods, device, equipment and the storage medium of question and answer class query statement
CN109582792A (en) * 2018-11-16 2019-04-05 北京奇虎科技有限公司 A kind of method and device of text classification
CN111159424A (en) * 2019-12-27 2020-05-15 东软集团股份有限公司 Method, device, storage medium and electronic equipment for labeling knowledge graph entities
CN113128225A (en) * 2019-12-31 2021-07-16 阿里巴巴集团控股有限公司 Named entity identification method and device, electronic equipment and computer storage medium
CN111428506A (en) * 2020-03-31 2020-07-17 联想(北京)有限公司 Entity classification method, entity classification device and electronic equipment
CN111428506B (en) * 2020-03-31 2023-02-21 联想(北京)有限公司 Entity classification method, entity classification device and electronic equipment
CN111523314A (en) * 2020-07-03 2020-08-11 支付宝(杭州)信息技术有限公司 Model confrontation training and named entity recognition method and device
WO2022262113A1 (en) * 2021-06-16 2022-12-22 北京来也网络科技有限公司 Information extraction method and apparatus based on rpa and ai, and device and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20140305