CN103268312A

CN103268312A - Training corpus collection system and method based on user feedback

Info

Publication number: CN103268312A
Application number: CN2013101590251A
Authority: CN
Inventors: 蒋昌俊; 程久军; 陈闳中; 闫春钢; 何良华; 侯静玉
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2013-05-03
Filing date: 2013-05-03
Publication date: 2013-08-28
Anticipated expiration: 2033-05-03
Also published as: CN103268312B

Abstract

Disclosed are a training corpus collection system and method based on user feedback. Based on the characteristics of user feedback, combined with applications on mobile terminals, the method judges whether a trained model completes the corresponding language processing task well, and thereby determines whether the current labeling is correct. Correctly labeled texts are stored in a database, which realizes the generation and collection of the training corpus. With the development of the Internet, especially the mobile Internet, more and more applications are built specifically for the mobile Internet; they provide users with various kinds of information through servers, and can also collect users' habits and information through mobile devices. As a novel learning mechanism based on user feedback, the method uses the information fed back by users to give machine learning new approaches and methods, and is applicable to tasks such as spam identification.

Description

A training corpus collection system and method based on user feedback

Technical field

The present invention relates to a training corpus collection method.

Technical background

After decades of development, natural language processing has made considerable progress. At present, related areas such as Chinese word segmentation and text classification mainly adopt machine learning methods based on statistical models. Meanwhile, many problems in machine learning can be formalized as sequence learning problems. In a sequence learning problem, a number of data points form an ordered whole, and each data point must be assigned a class label. Because the data points in a sequence have rich and complex dependencies on one another, this type of problem is very challenging. Classical machine learning methods are limited by the assumption that data points are independent; they cannot take the dependencies between preceding and following data into account and so lose much important information, which degrades classification performance or even renders it ineffective.

Therefore, for learning and prediction tasks on sequence data, the classical hidden Markov model has long been used, and most mature products on the market are based on it. However, the hidden Markov model is a generative model; to guarantee correct inference it must satisfy a strict independence assumption, which clearly does not match reality. The maximum entropy Markov model, which appeared later, overcomes the independence assumption problem but suffers from the label bias problem. In 2001, the conditional random field model proposed by Lafferty et al. solved the label bias problem on that basis. Because of its complexity, the conditional random field model does have problems such as a heavy training workload and slow convergence, but on the whole it is a comparatively advanced statistical learning model at present.
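For reference, the linear-chain conditional random field mentioned above is usually written in the following standard textbook form. The formula is not taken from this patent; the feature functions f_k, the weights \lambda_k and the normalizer Z(x) are the conventional notation.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\Bigl( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Bigr),
\qquad
Z(x) = \sum_{y'} \exp\Bigl( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Bigr)
```

Because Z(x) normalizes over whole label sequences rather than over the successors of each state, as the maximum entropy Markov model does, the label bias problem does not arise. The sum over all sequences is also one reason training is comparatively expensive.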

Although conditional random fields have achieved significant success, training an effective model still faces the challenge of prohibitive cost. The cost comes mainly from two sources. First, designing features is expensive. Features are crucial to a classifier, yet designing them requires the participation and guidance of domain experts, at great cost. Second, model training requires a large amount of labeled corpus, and the cost of corpus labeling is huge. How to reduce the labeling cost is a problem that machine learning researchers have worked to solve in recent years. Because of the complexity of sequence data, these two problems are even more pronounced for sequence learning, and there is still no effective solution. To address the shortage of labeled corpus, scholars have proposed many methods in recent years, such as building a corpus with self-training algorithms.

Summary of the invention

The object of the present invention is to overcome the deficiencies of the prior art by disclosing a training corpus collection method based on user feedback that is low in cost and effective.

To achieve the above object, the present invention provides a system, characterized in that the system comprises an application client and a server, and data is transmitted between the client and the server over the HTTP protocol.

An application program module is installed in the application client; through this module the user enters information and has the information segmented into words and recognized. The application client transmits the input information to the server in XML file format.

The server comprises two parts, a recognition system and a database. The recognition system is a trained conditional random field model. The recognition system parses the XML file to obtain the character string of the user's input, feeds the characters in order into the conditional random field model for word segmentation, and then post-processes the segmentation result according to the characteristics of the application itself to obtain the information the user needs. The server returns this information to the user.
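A common way to cast word segmentation as the per-character labeling task a conditional random field solves is to tag each character as B (begin), M (middle), E (end) or S (single-character word). The source does not say which tag set is used, so the B/M/E/S scheme and the helper name `tags_to_words` below are assumptions made only for illustration. The sketch shows how a predicted tag sequence is turned back into words.

```python
def tags_to_words(chars: str, tags: list[str]) -> list[str]:
    """Group characters into words from per-character B/M/E/S labels.

    B = begin, M = middle, E = end of a multi-character word; S = single.
    The tag scheme is an assumption; the source does not specify one.
    """
    if len(chars) != len(tags):
        raise ValueError("one tag is required per character")
    words: list[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):      # a word closes on E or S
            words.append(current)
            current = ""
    if current:                    # tolerate a sequence that ends mid-word
        words.append(current)
    return words


words = tags_to_words("今天天气好", ["B", "E", "B", "E", "S"])
```

In a real deployment the tag list would come from the trained model rather than being written by hand.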

After receiving the information, the user reacts differently depending on the result: if the returned result is correct, the user goes on to the next operation; if the returned result is wrong, the user abandons the current operation or redoes the previous operation. The client judges the correctness of the word segmentation result from the user's operation behavior and returns the judgement to the server, and the server decides on that basis whether to put the segmentation of the user's input into the database. The database is a MySQL database used to store the labeled training corpus that has been recognized.

The database holds the labeled corpus in one table. The table stores the collected corpus and also records information such as the corresponding user and the submission time, for convenient later review and mining. Table 1 lists the meaning of each field in this table.

Based on the above system, a training corpus collection method based on user feedback is characterized in that, taking word segmentation as the background, it comprises the following steps:

1. In the recognition system of the server, select a trained random field recognition model and put it into practical use.

2. The user provides information by text input on the client; the client transmits the information to the random field recognition model in the server's recognition system for recognition, and the recognition result is fed back to the user. Specific implementation: the client converts the input text into XML and transmits it to the server. On the server, the server program first parses the XML file to obtain the character string of the user's input, feeds the characters in order into the random field recognition model of the recognition system for word segmentation, and then post-processes the segmentation result according to the characteristics of the application itself to obtain the information the user needs. The server returns this information to the user.

3. Collect the user's behavior, judge the result of this recognition, and according to that result label this recognition and store it in the database as training corpus. Specific implementation: after receiving the information, the user reacts differently depending on the result. If the returned result is correct, the user goes on to the next operation; if the returned result is wrong, the user abandons the current operation or redoes the previous operation. From these two kinds of user behavior the client can judge the correctness of the word segmentation result. The client returns this judgement to the server, and the server decides on that basis whether to put the segmentation of the user's input into the database.

4. Repeat step 2 and step 3, combining user feedback with the training algorithm, to gradually improve the training corpus in the database.
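The repetition of steps 2 and 3 can be sketched as a simple loop. This is a minimal illustration under stated assumptions: the function name `collect`, the `retrain_every` threshold and the idea of triggering retraining after a fixed number of accepted entries are all hypothetical, since the source says only that user feedback is combined with the training algorithm.

```python
from typing import Callable, Iterable

Segmenter = Callable[[str], list[str]]


def collect(
    inputs: Iterable[tuple[str, bool]],
    segmenter: Segmenter,
    retrain_every: int = 2,
) -> tuple[list[tuple[str, list[str]]], int]:
    """Run steps 2 and 3 repeatedly.

    Each input is (text, accepted), where `accepted` stands for the
    correctness judgement derived from the user's behavior.
    Returns the collected corpus and how often retraining was triggered.
    """
    corpus: list[tuple[str, list[str]]] = []
    retrain_count = 0
    for text, accepted in inputs:
        words = segmenter(text)          # step 2: recognize and reply
        if not accepted:                 # step 3: discard rejected results
            continue
        corpus.append((text, words))
        if len(corpus) % retrain_every == 0:
            retrain_count += 1           # placeholder for a real retraining run
    return corpus, retrain_count


# Whitespace splitting stands in for the trained model in this toy run.
stream = [("a b", True), ("c d", False), ("e f", True), ("g h", True)]
corpus, retrains = collect(stream, str.split)
```

Only results the user implicitly accepted reach the corpus, so the stored data grows without any manual labeling.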

Based on the characteristics of user feedback, and in combination with applications on mobile terminals, the present invention judges whether the trained model completes the corresponding language processing task well, and can thereby determine whether the current labeling is correct. Storing the correctly labeled text in the database realizes the generation and collection of the training corpus.

With the development of the Internet, and of the mobile Internet in particular, more and more applications are built specifically for the mobile Internet. These applications provide users with various kinds of information through servers, and can also use mobile devices to collect users' habits and information. As a new learning mechanism based on user feedback, the present invention uses the information fed back by users to give machine learning new approaches and methods, and can be applied to tasks such as spam identification.

Description of drawings

Fig. 1 is a structural diagram of the system.

Fig. 2 is a diagram of the relationship between the user, the recognition module, and the database.

Embodiment

The technical solution of the present invention is further described below with reference to the accompanying drawings and a case.

The system has two major components: a client and a server. The client is developed mainly on the Android system, one of the mainstream smartphone operating systems at present. The server is based on the LAMP architecture.

Data is transmitted between the client and the server over the HTTP protocol.

An application program module is installed in the application client; through this module the user enters information and has the information segmented into words and recognized. The application client transmits the input information to the server in XML file format.

The server comprises two parts, a recognition system and a database. The recognition system is a trained conditional random field model. The recognition system parses the XML file to obtain the character string of the user's input, feeds the characters in order into the conditional random field model for word segmentation, and then post-processes the segmentation result according to the characteristics of the application itself to obtain the information the user needs. The server returns this information to the user. After receiving the information, the user reacts differently depending on the result: if the returned result is correct, the user goes on to the next operation; if the returned result is wrong, the user abandons the current operation or redoes the previous operation. The client judges the correctness of the word segmentation result from the user's operation behavior and returns the judgement to the server, and the server decides on that basis whether to put the segmentation of the user's input into the database. The database is a MySQL database used to store the labeled training corpus that has been recognized.

Based on the above system, the training corpus collection method based on user feedback is implemented in the following steps:

2. The user provides information by text input on the client; the client transmits the information to the random field recognition model in the server's recognition system for recognition, and the recognition result is fed back to the user.

Specific implementation: the client converts the input text into XML and transmits it to the server. On the server, the server program first parses the XML file to obtain the character string of the user's input, feeds the characters in order into the random field recognition model of the recognition system for word segmentation, and then post-processes the segmentation result according to the characteristics of the application itself to obtain the information the user needs. The server returns this information to the user.

3. Collect the user's behavior, judge the result of this recognition, and according to that result label this recognition and store it in the database as training corpus. Specific implementation: after receiving the information, the user reacts differently depending on the result. If the returned result is correct, the user goes on to the next operation; if the returned result is wrong, the user abandons the current operation or redoes the previous operation. From these two kinds of user behavior the client can judge the correctness of the word segmentation result. The client returns this judgement to the server, and the server decides on that basis whether to put the segmentation of the user's input into the database.

4. Repeat step 2 and step 3, combining user feedback with the training algorithm, to gradually improve the training corpus in the database.

Case

(1) Building the experimental system and platform

The present invention discloses the training corpus collection method with word segmentation as the background. First, a conditional random field is chosen as the recognition model. We use CRFsuite as the open-source framework for the conditional random field model. For parameter estimation, CRFsuite supports the comparatively advanced SGD (Stochastic Gradient Descent) algorithm, which greatly reduces training time. After the recognition system is built, it is connected to the server, which runs Linux. The client runs on Android and is attached to the relevant application. To implement the designed functions, we use Android 4.2, the latest version at present. The database is MySQL, a relational database management system developed by the Swedish company MySQL AB and now owned by Oracle. A relational database stores data in separate tables rather than putting all data in one large repository, which increases speed and improves flexibility. SQL, the language MySQL uses, is the most commonly used standardized language for accessing databases. MySQL adopts a dual-licensing policy, with a community edition and a commercial edition. Because it is small, fast, and low in total cost of ownership, and above all because it is open source, small and medium-sized websites generally choose MySQL as their site database. Because the community edition performs excellently, combining it with PHP and Apache forms a good development environment. Fig. 1 is a structural diagram of the system. The server comprises two parts, a recognition system and a database; the recognition system is a trained conditional random field model. The database is a MySQL database used to store the recognized training corpus.

(2) How user feedback is obtained and how the corpus is labeled

To obtain user feedback, the client part is first embedded in a concrete application. The application should be one that requires the user to input information and requires word segmentation and recognition of that information. The user enters information in the application, and the application client then transmits it to the server in a given format. Here the HTTP protocol and XML files are used for data transmission. XML (Extensible Markup Language) is a markup language used to mark up electronic documents so that they are structured. It can be used to mark data and define data types, and it is a source language that allows users to define their own markup languages. XML is a subset of the Standard Generalized Markup Language (SGML) and is well suited to transmission over the Web. XML provides a uniform way to describe and exchange structured data independently of applications or vendors.
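The client side of this exchange can be sketched as follows. The source does not give the XML schema or the server address, so the element names `request`, `user` and `text`, the `SERVER_URL` value and both function names are hypothetical. The sketch only prepares the HTTP request and does not send it.

```python
import urllib.request
import xml.etree.ElementTree as ET

SERVER_URL = "http://example.com/segment"  # hypothetical endpoint


def build_payload(user_id: str, text: str) -> bytes:
    """Wrap the user's input in a small XML document (assumed schema)."""
    root = ET.Element("request")
    ET.SubElement(root, "user").text = user_id
    ET.SubElement(root, "text").text = text
    return ET.tostring(root, encoding="utf-8")


def build_request(user_id: str, text: str) -> urllib.request.Request:
    """Prepare an HTTP POST carrying the XML payload; nothing is sent here."""
    return urllib.request.Request(
        SERVER_URL,
        data=build_payload(user_id, text),
        headers={"Content-Type": "application/xml; charset=utf-8"},
        method="POST",
    )


req = build_request("u1", "今天天气好")
```

Passing `req` to `urllib.request.urlopen` would perform the actual transfer; an Android client would do the equivalent with its own HTTP library.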

After the information arrives at the server, the server program first parses the XML file to obtain the character string of the user's input. The characters are fed in order into the recognition system for word segmentation, and the segmentation result is then post-processed according to the characteristics of the application itself to obtain the information the user needs. The server then returns the information to the user.
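A minimal sketch of this server-side pipeline is shown below. The XML element names and all function names are assumptions, and `segment` is only a whitespace-splitting stand-in for the trained conditional random field model. The application-specific post-processing is left out because the source does not describe it.

```python
import xml.etree.ElementTree as ET


def parse_request(xml_text: str) -> str:
    """Extract the user's input string from the XML request body."""
    root = ET.fromstring(xml_text)
    text = root.findtext("text")  # <text> is an assumed element name
    if text is None:
        raise ValueError("request has no <text> element")
    return text


def segment(text: str) -> list[str]:
    """Stand-in for the trained CRF model: splits on whitespace only."""
    return text.split()


def build_response(words: list[str]) -> str:
    """Serialize the segmentation result as XML for the client."""
    root = ET.Element("response")
    for word in words:
        ET.SubElement(root, "word").text = word
    return ET.tostring(root, encoding="unicode")


def handle(xml_text: str) -> str:
    """Parse the request, segment the text, and return the XML reply."""
    return build_response(segment(parse_request(xml_text)))


reply = handle("<request><user>u1</user><text>book a table</text></request>")
```

Keeping parsing, segmentation and serialization in separate functions makes it easy to swap the stand-in segmenter for a real model later.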

After receiving the information, the user reacts differently depending on the result: if the returned result is correct, the user goes on to the next operation; if the returned result is wrong, the user abandons the current operation or redoes the previous operation. The client can therefore observe the user's operation behavior and from it judge the correctness of the word segmentation result. The client then returns the judgement to the server, and the server decides on that basis whether to put the segmentation of the user's input into the database. Fig. 2 is a diagram of the relationship between the user, the recognition module, and the database.
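The feedback rule described above can be sketched as a small decision function. The action names, the `Interaction` record and the in-memory `corpus` list are hypothetical; a real client would have to map concrete user interface events onto these three actions.

```python
from dataclasses import dataclass
from enum import Enum


class UserAction(Enum):
    CONTINUE = "continue"   # user proceeds to the next operation
    ABANDON = "abandon"     # user gives up the current operation
    RETRY = "retry"         # user redoes the previous operation


@dataclass
class Interaction:
    user_id: str
    text: str
    words: list[str]
    action: UserAction


def is_correct(action: UserAction) -> bool:
    """Client-side judgement: only continuing counts as implicit approval."""
    return action is UserAction.CONTINUE


def maybe_store(corpus: list[dict], item: Interaction) -> bool:
    """Server-side decision: keep the labeled text only when judged correct."""
    if not is_correct(item.action):
        return False
    corpus.append({"author": item.user_id, "text": item.text, "words": item.words})
    return True


corpus: list[dict] = []
kept = maybe_store(
    corpus, Interaction("u1", "今天天气好", ["今天", "天气", "好"], UserAction.CONTINUE)
)
dropped = maybe_store(
    corpus, Interaction("u2", "今天天气好", ["今", "天天", "气好"], UserAction.RETRY)
)
```

Treating both abandoning and retrying as rejection is conservative: it may discard some correct segmentations, but it keeps wrongly labeled text out of the corpus.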

(3) Storage scheme for the corpus in the database

In the database, the labeled corpus is kept in one table, which stores the collected corpus. The table also records information such as the corresponding user and the submission time, for convenient later review and mining. Table 1 gives the meaning of each field in this table.

Table 1. Meaning of each field in the corpus table

Field	Explanation
id	Corpus entry id
group_name	Group
author	Id of the submitting user
timeinfo	Time
moreinfo	Reserved field
title	Title
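The table layout can be sketched as follows. The source uses MySQL; this sketch uses Python's built-in `sqlite3` module as a stand-in so that it runs without a server. The column types and the `add_entry` helper are assumptions. The source does not say which column holds the labeled text, so placing it in `title` here is purely illustrative.

```python
import sqlite3
from datetime import datetime, timezone

# Field names follow Table 1; the column types are assumed.
SCHEMA = """
CREATE TABLE corpus (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,  -- corpus entry id
    group_name TEXT,                               -- group
    author     TEXT NOT NULL,                      -- id of the submitting user
    timeinfo   TEXT NOT NULL,                      -- submission time
    moreinfo   TEXT,                               -- reserved field
    title      TEXT NOT NULL                       -- title
)
"""


def add_entry(conn, author, title, group_name=None):
    """Insert one accepted corpus entry and return its id."""
    now = datetime.now(timezone.utc).isoformat()
    cur = conn.execute(
        "INSERT INTO corpus (group_name, author, timeinfo, title) VALUES (?, ?, ?, ?)",
        (group_name, author, now, title),
    )
    conn.commit()
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
entry_id = add_entry(conn, "u1", "今天/天气/好", group_name="demo")
rows = conn.execute(
    "SELECT author, title FROM corpus WHERE id = ?", (entry_id,)
).fetchall()
```

Recording `author` and `timeinfo` with every entry is what allows the later review and mining that the text mentions.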

Innovative point of the present invention: user feedback is used to label the corpus, which serves as a method of corpus collection. By collecting the content of user feedback in an application, the correctness of the labeling system's output is determined, and the correctly labeled corpus is stored in the database.

Claims

1. A training corpus collection system based on user feedback, characterized in that the system comprises an application client and a server, and data is transmitted between the client and the server over the HTTP protocol;

2. The system as claimed in claim 1, characterized in that the database holds the labeled corpus in one table; the table stores the collected corpus and also records information such as the corresponding user and the submission time, for convenient later review and mining; the meaning of each field in this table is as listed in Table 1.

3. A training corpus collection method based on user feedback, characterized in that, taking word segmentation as the background, it comprises the following steps:

(1) in the recognition system of the server, selecting a trained random field recognition model and putting it into practical use;

(2) the user provides information by text input on the client; the client transmits the information to the random field recognition model in the server's recognition system for recognition, and the recognition result is fed back to the user; specific implementation: the client converts the input text into XML and transmits it to the server; on the server, the server program first parses the XML file to obtain the character string of the user's input, feeds the characters in order into the random field recognition model of the recognition system for word segmentation, and then post-processes the segmentation result according to the characteristics of the application itself to obtain the information the user needs; the server returns this information to the user;

(3) collecting the user's behavior, judging the result of this recognition, and according to that result labeling this recognition and storing it in the database as training corpus;

specific implementation: after receiving the information, the user reacts differently depending on the result: if the returned result is correct, the user goes on to the next operation; if the returned result is wrong, the user abandons the current operation or redoes the previous operation; from these two kinds of user behavior the client can judge the correctness of the word segmentation result; the client returns this judgement to the server, and the server decides on that basis whether to put the segmentation of the user's input into the database;

(4) repeating step (2) and step (3), combining user feedback with the training algorithm, to gradually improve the training corpus in the database.