CN103268312B - Corpus collection system and method based on user feedback

Info

Publication number: CN103268312B
Application number: CN201310159025.1A
Authority: CN
Inventors: 蒋昌俊; 程久军; 陈闳中; 闫春钢; 何良华; 侯静玉
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2013-05-03
Filing date: 2013-05-03
Publication date: 2016-04-06
Anticipated expiration: 2033-05-03
Also published as: CN103268312A

Abstract

A corpus collection system and method based on user feedback. Using the characteristics of user feedback, together with an application on a mobile terminal, the invention judges whether a trained model has completed the corresponding language processing task well, and thereby determines whether the current annotation is correct. Correctly annotated text is stored in a database, which achieves the generation and collection of a corpus. With the development of the Internet, and of the mobile Internet in particular, more and more applications are designed specifically for the mobile Internet; these applications provide various information to users through a server and can also use the mobile device to collect users' habits and information. As a new learning mechanism based on user feedback, the present invention uses the information in user feedback to provide new ideas and methods for machine learning, and can be applied to tasks such as spam filtering.

Description

Corpus collection system and method based on user feedback

Technical field

The present invention relates to a corpus collection method.

Technical background

After decades of development, natural language processing has made considerable progress. At present, related areas such as Chinese word segmentation and text classification mainly use machine learning methods based on statistical models. Many problems in machine learning can be formalized as sequence learning problems. In a sequence learning problem, a number of data points form an ordered whole, and each data point must be assigned a class label. Because rich and complex sequential dependencies exist among the data points in a sequence, such problems are very challenging. Classical machine learning methods are limited by the assumption that data points are independent; they cannot take the dependencies between neighboring data points into account and therefore lose important information, which degrades classification or even renders it ineffective.
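To make the sequence-labeling formulation concrete, the following minimal sketch casts word segmentation as per-character labeling. The B/M/E/S tag set (begin, middle, end, single) and the function names are assumptions chosen for illustration; the patent does not specify a tag scheme.

```python
from typing import List


def words_to_tags(words: List[str]) -> List[str]:
    """Convert a segmented sentence into one B/M/E/S tag per character."""
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character, zero or more middle characters, last character
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild the words from characters and their B/M/E/S tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends mid-word
        words.append(current)
    return words
```

Under this encoding, a sequence model only has to predict one of four labels per character, and the segmentation is recovered deterministically from the labels.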

For this reason, the classical hidden Markov model has long been used for learning and prediction on sequence data, and most mature products currently on the market are based on it. However, the hidden Markov model is a generative model: for its derivation to be correct, a strict independence assumption must hold, and this assumption clearly does not match reality. The maximum entropy Markov model, which appeared later, overcomes the independence assumption problem but suffers from the label bias problem. In 2001, the conditional random field model proposed by Lafferty et al. solved the label bias problem on that basis. Admittedly, because of its complexity the conditional random field model has a heavy training cost and converges slowly, but overall it is among the more advanced statistical learning models at present.
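For reference, a linear-chain conditional random field is commonly written in the following form. The symbols (input sequence x, label sequence y, feature functions f_k, weights λ_k, normalizer Z) are standard textbook notation and are not taken from this patent.

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \right),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'}
    \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \right)
```

Because Z(x) normalizes over entire label sequences rather than at each position separately, the model does not exhibit the label bias of the maximum entropy Markov model; computing Z(x) and its gradient during training is also the main source of the training cost mentioned above.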

Although conditional random fields have been notably successful, training an effective model is still very costly. The cost comes mainly from two sources. The first is feature design. Features are crucial to a classifier, but designing them requires the participation and guidance of domain experts, which is expensive. The second is annotation: model training requires a large amount of annotated corpus, and annotating a corpus is very costly. How to reduce the annotation cost is a problem that machine learning researchers have worked on in recent years. Because sequence data is complex, both problems are more pronounced for sequence learning, and no effective solution yet exists. To address the shortage of annotated corpora, researchers have proposed many methods in recent years, such as building a corpus with self-training algorithms.

Summary of the invention

The object of the present invention is to overcome the deficiencies of the prior art by disclosing a corpus collection method based on user feedback that is low in cost and effective.

To achieve the above object, the present invention provides a system solution, characterized in that the system comprises an application client and a server, and data is transmitted between the client and the server over the HTTP protocol;

An application program module is installed in the application client; through this module the user inputs information and submits it for word segmentation. The application client transmits the input information to the server as an XML file;

The server comprises two parts, a recognition system and a database, the recognition system being a trained conditional random field model. The recognition system parses the XML file to obtain the character string of the user's input, feeds the characters in order into the conditional random field model for word segmentation, and then processes the segmentation result according to the characteristics of the application itself to obtain the information the user needs, which the server returns to the user;

After receiving the information, the user reacts differently depending on the result: if the returned result is correct, the user goes on to the next operation; if the returned result is wrong, the user abandons the current operation or repeats the previous operation. The client judges the correctness of the segmentation result from this user behavior and returns the judgment to the server, and the server decides on that basis whether to store the segmentation of the user's input in the database. The database is a MySQL database and stores the recognized, annotated corpus.

The database contains one table for the annotated corpus, which holds the collected corpus entries. The table also records information such as the corresponding user and the submission time, to facilitate later inspection and mining. Table 1 lists the meaning of each field in this table.

Based on the above system, a corpus collection method based on user feedback is provided, characterized in that, with word segmentation as the background task, it comprises the following steps:

1. In the recognition system of the server, a trained conditional random field recognition model is selected and put into practical use.

2. The user provides information on the client by text input; the client transmits the information to the conditional random field recognition model in the server's recognition system for recognition, and the recognition result is fed back to the user. Specific implementation: the client converts the input text into XML and transmits it to the server. On the server, the server program first parses the XML file to obtain the character string of the user's input. The characters are fed in order into the conditional random field recognition model of the recognition system for word segmentation, the segmentation result is then processed according to the characteristics of the application itself to obtain the information the user needs, and the server returns this information to the user.

3. The user's behavior is collected and used to judge the result of this recognition; according to the judgment, the recognition is annotated and stored in the database as corpus. Specific implementation: after receiving the information, the user reacts differently depending on the result. If the returned result is correct, the user goes on to the next operation; if the returned result is wrong, the user abandons the current operation or repeats the previous operation. From these two kinds of user behavior the client can judge the correctness of the segmentation result. The client returns the judgment to the server, and the server decides on that basis whether to store the segmentation of the user's input in the database.

4. Steps 2 and 3 are repeated; combining user feedback with the training algorithm, the corpus in the database is gradually improved.
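The four steps above can be sketched as a simple loop. Everything named here (`collect_round`, `collect`, the `segment` and `retrain` callables) is hypothetical, and user feedback is modeled as a precomputed flag for simplicity. The patent does not say how or when the model is retrained, so retraining appears only as a caller-supplied callback.

```python
from typing import Callable, Iterable, List, Tuple

Segmenter = Callable[[str], List[str]]
# One interaction: the text the user typed and whether the user's subsequent
# behavior indicated that the returned result was correct.
Interaction = Tuple[str, bool]


def collect_round(segment: Segmenter,
                  interactions: Iterable[Interaction],
                  corpus: List[List[str]]) -> int:
    """Steps 2 and 3: segment each input, keep results the user accepted."""
    added = 0
    for text, accepted in interactions:
        words = segment(text)          # step 2: recognition on the server
        if accepted:                   # step 3: feedback says it was correct
            corpus.append(words)
            added += 1
    return added


def collect(segment: Segmenter,
            rounds: Iterable[Iterable[Interaction]],
            retrain: Callable[[List[List[str]]], Segmenter]) -> List[List[str]]:
    """Step 4: repeat steps 2 and 3, retraining whenever new samples arrive."""
    corpus: List[List[str]] = []
    for interactions in rounds:
        if collect_round(segment, interactions, corpus) > 0:
            segment = retrain(corpus)  # hypothetical training hook
    return corpus
```

Only accepted results enter the corpus, so the quality of the collected data depends on how reliably user behavior reflects correctness.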

Using the characteristics of user feedback, together with an application on a mobile terminal, the present invention judges whether a trained model has completed the corresponding language processing task well, and thereby determines whether the current annotation is correct. Correctly annotated text is stored in the database, which achieves the generation and collection of a corpus.

With the development of the Internet, and of the mobile Internet in particular, more and more applications are designed specifically for the mobile Internet; these applications provide various information to users through a server and can also use the mobile device to collect users' habits and information. As a new learning mechanism based on user feedback, the present invention uses the information in user feedback to provide new ideas and methods for machine learning, and can be applied to tasks such as spam filtering.

Brief description of the drawings

Fig. 1 is a structural diagram of the system.

Fig. 2 is a diagram of the relationship among the user, the recognition module, and the database.

Embodiment

The technical solution of the present invention is further described below with reference to the accompanying drawings and a case.

The system is divided into two major components: a client and a server. The client is developed mainly on the Android system, one of today's mainstream smartphone operating systems. The server is based on the LAMP architecture.

Data is transmitted between the client and the server over the HTTP protocol.

An application program module is installed in the application client; through this module the user inputs information and submits it for word segmentation. The application client transmits the input information to the server as an XML file.

The server comprises two parts, a recognition system and a database, the recognition system being a trained conditional random field model. The recognition system parses the XML file to obtain the character string of the user's input, feeds the characters in order into the conditional random field model for word segmentation, and then processes the segmentation result according to the characteristics of the application itself to obtain the information the user needs, which the server returns to the user. After receiving the information, the user reacts differently depending on the result: if the returned result is correct, the user goes on to the next operation; if the returned result is wrong, the user abandons the current operation or repeats the previous operation. The client judges the correctness of the segmentation result from this user behavior and returns the judgment to the server, and the server decides on that basis whether to store the segmentation of the user's input in the database. The database is a MySQL database and stores the recognized, annotated corpus.

Based on the above system, the corpus collection method based on user feedback is implemented in the following steps:

2. The user provides information on the client by text input; the client transmits the information to the conditional random field recognition model in the server's recognition system for recognition, and the recognition result is fed back to the user.

Specific implementation of this step: the client converts the input text into XML and transmits it to the server. On the server, the server program first parses the XML file to obtain the character string of the user's input. The characters are fed in order into the conditional random field recognition model of the recognition system for word segmentation, the segmentation result is then processed according to the characteristics of the application itself to obtain the information the user needs, and the server returns this information to the user.

3. The user's behavior is collected and used to judge the result of this recognition; according to the judgment, the recognition is annotated and stored in the database as corpus. Specific implementation: after receiving the information, the user reacts differently depending on the result. If the returned result is correct, the user goes on to the next operation; if the returned result is wrong, the user abandons the current operation or repeats the previous operation. From these two kinds of user behavior the client can judge the correctness of the segmentation result. The client returns the judgment to the server, and the server decides on that basis whether to store the segmentation of the user's input in the database.

4. Steps 2 and 3 are repeated; combining user feedback with the training algorithm, the corpus in the database is gradually improved.

Case

(1) Construction of the experimental system and platform

The present invention discloses a corpus collection method with word segmentation as the background task. First, a conditional random field is chosen as the recognition model. We use CRFsuite as the open-source framework for the conditional random field model. For parameter estimation, CRFsuite supports the relatively advanced SGD (Stochastic Gradient Descent) algorithm, which greatly shortens training time. Once the recognition system has been built, it is connected to the server, which runs Linux. The client runs on Android and is embedded in the relevant application. To fully realize the designed functionality, we use Android 4.2, the latest version at the time of filing. The database is MySQL, a relational database management system developed by the Swedish company MySQL AB and currently owned by Oracle. A relational database stores data in separate tables rather than putting all data in one large repository, which increases speed and improves flexibility. The SQL language used by MySQL is the most widely used standardized language for accessing databases. MySQL uses a dual licensing policy and is offered in a community edition and a commercial edition. Because it is small, fast, and has a low total cost of ownership, and especially because it is open source, MySQL is the usual choice of database for small and medium-sized websites. The community edition performs very well and, combined with PHP and Apache, forms a good development environment. Fig. 1 is a structural diagram of the system. The server comprises two parts, a recognition system and a database; the recognition system is a trained conditional random field model. The database is a MySQL database and stores the recognized corpus.
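A CRF toolkit such as CRFsuite trains on sequences of items, each described by a set of features. The sketch below shows one plausible way to turn a sentence into per-character features for segmentation. The feature templates (current, previous, and next character, plus one bigram) and all names are assumptions for illustration, not the features used by the inventors, and no CRFsuite call is made.

```python
from typing import Dict, List


def char_features(sentence: str) -> List[Dict[str, str]]:
    """Return one feature dictionary per character of the sentence."""
    features: List[Dict[str, str]] = []
    for i, ch in enumerate(sentence):
        item = {"c0": ch}  # the current character
        # Neighbors, with explicit markers at the sentence boundaries.
        item["c-1"] = sentence[i - 1] if i > 0 else "<BOS>"
        item["c+1"] = sentence[i + 1] if i < len(sentence) - 1 else "<EOS>"
        item["c-1c0"] = item["c-1"] + item["c0"]  # bigram ending here
        features.append(item)
    return features
```

Context features of this kind are what let a conditional random field use neighboring characters when labeling each position.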

(2) Acquisition of user feedback and manner of corpus annotation

To obtain user feedback, the client part is first embedded in a specific application. The application should be one that requires the user to input information and requires that information to be segmented. The user inputs information in the application, and the application client then transmits it to the server in a defined format. Here, the HTTP protocol and XML files are used for data transmission. XML, the Extensible Markup Language, is a markup language used to mark up electronic documents so that they are structured. It can be used to tag data and define data types, and it is a source language that allows users to define their own markup languages. XML is a subset of the Standard Generalized Markup Language (SGML) and is well suited to transmission over the Web. XML provides a uniform way to describe and exchange structured data independently of any application or vendor.
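A minimal sketch of the client-side packaging described above, written in Python for brevity even though the actual client is an Android application. The element names (`request`, `user`, `text`), the URL passed by the caller, and the function names are hypothetical; the patent does not specify the XML schema.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_payload(user_id: str, text: str) -> bytes:
    """Wrap the user's input in a small XML document (schema is assumed)."""
    root = ET.Element("request")
    ET.SubElement(root, "user").text = user_id
    ET.SubElement(root, "text").text = text
    return ET.tostring(root, encoding="utf-8")


def build_request(url: str, payload: bytes) -> urllib.request.Request:
    """Prepare an HTTP POST carrying the XML; the caller decides when to send."""
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/xml; charset=utf-8"},
        method="POST",
    )
```

Sending is then a matter of passing the prepared request to `urllib.request.urlopen`. Building the document with an XML library rather than string concatenation ensures that characters such as `<` and `&` in the user's input are escaped.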

After the information arrives at the server, the server program first parses the XML file to obtain the character string of the user's input. The characters are fed in order into the recognition system for word segmentation, and the segmentation result is then processed according to the characteristics of the application itself to obtain the information the user needs. The server then returns the information to the user.
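The server-side flow can be sketched as follows. The XML layout is an assumption (a `text` element under the root), and `toy_segment` is a deliberately trivial stand-in, forward maximum matching against a tiny dictionary, for the trained conditional random field model; it is used only so that the example runs without a model file. The application-specific post-processing is omitted because it depends on the host application.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Set

VOCAB: Set[str] = {"同济", "大学", "上海"}  # toy dictionary, illustration only


def toy_segment(text: str, vocab: Set[str] = VOCAB, max_len: int = 4) -> List[str]:
    """Stand-in segmenter: forward maximum matching (NOT the patent's CRF)."""
    words: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


def handle_request(xml_bytes: bytes,
                   segment: Callable[[str], List[str]] = toy_segment) -> List[str]:
    """Parse the XML, extract the input string, and segment it."""
    root = ET.fromstring(xml_bytes)
    text = root.findtext("text", default="")
    return segment(text)
```

Passing the segmenter in as a parameter keeps the parsing code independent of the model, so a trained model could replace the stand-in without changing `handle_request`.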

After receiving the information, the user reacts differently depending on the result: if the returned result is correct, the user goes on to the next operation; if the returned result is wrong, the user abandons the current operation or repeats the previous operation. The client can therefore observe the user's behavior and from it judge the correctness of the segmentation result. The client then returns the judgment to the server, and the server decides on that basis whether to store the segmentation of the user's input in the database. Fig. 2 shows the relationship among the user, the recognition module, and the database.
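The feedback rule above amounts to mapping observed user actions to a verdict on the segmentation. In this sketch the action names (`continue`, `abandon`, `redo`) and the function names are hypothetical labels for the behaviors the text describes; a real client would derive them from UI events.

```python
from typing import List, Optional

ACCEPT_ACTIONS = {"continue"}          # user proceeds to the next step
REJECT_ACTIONS = {"abandon", "redo"}   # user gives up or repeats the last step


def judge(action: str) -> Optional[bool]:
    """Return True/False for a correct/incorrect result, None if unknown."""
    if action in ACCEPT_ACTIONS:
        return True
    if action in REJECT_ACTIONS:
        return False
    return None  # unrecognized behavior: make no claim either way


def maybe_store(words: List[str], action: str, corpus: List[List[str]]) -> bool:
    """Server side: keep the segmentation only when it was judged correct."""
    if judge(action) is True:
        corpus.append(words)
        return True
    return False
```

Treating unrecognized actions as "no verdict" rather than as acceptance is a design choice of this sketch; it keeps ambiguous interactions out of the corpus.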

(3) Storage scheme of the corpus in the database

The database contains one table for the annotated corpus, which holds the collected corpus entries. The table also records information such as the corresponding user and the submission time, to facilitate later inspection and mining. Table 1 gives the meaning of each field in this table.

Table 1: Meaning of each field in the corpus table

Field	Explanation
id	Corpus entry id
group_name	Group
author	Id of the submitting user
timeinfo	Time
moreinfo	Reserved field
title	Title
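As an illustration, the corpus table could be created and filled as below. The sketch uses Python's built-in `sqlite3` module as a stand-in for MySQL so that it runs without a database server. The table name `corpus`, the column types, and the helper names are assumptions, since the patent gives only the field names and their meanings; which column holds the annotated text is not stated, and the example places it in `title` purely for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS corpus (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,  -- corpus entry id
    group_name TEXT,                               -- group
    author     TEXT,                               -- id of the submitting user
    timeinfo   TEXT,                               -- time of submission
    moreinfo   TEXT,                               -- reserved field
    title      TEXT                                -- title
)
"""


def open_corpus_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database and make sure the corpus table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_entry(conn: sqlite3.Connection, group: str, author: str,
              timeinfo: str, title: str) -> int:
    """Insert one entry and return its generated id."""
    cur = conn.execute(
        "INSERT INTO corpus (group_name, author, timeinfo, moreinfo, title) "
        "VALUES (?, ?, ?, NULL, ?)",
        (group, author, timeinfo, title),
    )
    conn.commit()
    return cur.lastrowid
```

Recording `author` and `timeinfo` with every entry is what makes the later inspection and mining mentioned above possible, for example filtering out entries from a user whose feedback proves unreliable.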

Innovation of the present invention: user feedback is used to annotate the corpus, and thereby serves as a method of corpus collection. By collecting the user's feedback within the application, the correctness of the annotation system's output is determined, and correctly annotated corpus entries are stored in the database.

Claims

1. A corpus collection system based on user feedback, characterized in that the system comprises an application client and a server, and data is transmitted between the client and the server over the HTTP protocol;

After receiving the information, the user reacts differently depending on the result: if the returned result is correct, the user goes on to the next operation; if the returned result is wrong, the user abandons the current operation or repeats the previous operation; the client judges the correctness of the segmentation result from the user's behavior and returns the judgment to the server, and the server decides on that basis whether to store the segmentation of the user's input in the database; the database is a MySQL database and stores the recognized, annotated corpus;

The database contains one table for the annotated corpus, which holds the collected corpus entries; the table also records information such as the corresponding user and the submission time, to facilitate later inspection and mining; the meaning of each field in this table is as follows:

Field "id", meaning "corpus entry id"

Field "group_name", meaning "group"

Field "author", meaning "id of the submitting user"

Field "timeinfo", meaning "time"

Field "moreinfo", meaning "reserved field"

Field "title", meaning "title".

2. A corpus collection method based on user feedback, characterized in that, with word segmentation as the background task, it comprises the following steps:

(1) In the recognition system of the server, a trained conditional random field recognition model is selected and put into practical use;

(2) The user provides information on the client by text input; the client transmits the information to the conditional random field recognition model in the server's recognition system for recognition, and the recognition result is fed back to the user; specific implementation: the client converts the input text into XML and transmits it to the server; on the server, the server program first parses the XML file to obtain the character string of the user's input; the characters are fed in order into the conditional random field recognition model of the recognition system for word segmentation; the segmentation result is then processed according to the characteristics of the application itself to obtain the information the user needs, and the server returns this information to the user;

(3) The user's behavior is collected and used to judge the result of this recognition; according to the judgment, the recognition is annotated and stored in the database as corpus;

Specific implementation: after receiving the information, the user reacts differently depending on the result: if the returned result is correct, the user goes on to the next operation; if the returned result is wrong, the user abandons the current operation or repeats the previous operation; from these two kinds of user behavior the client can judge the correctness of the segmentation result; the client returns the judgment to the server, and the server decides on that basis whether to store the segmentation of the user's input in the database;

(4) Steps (2) and (3) are repeated; combining user feedback with the training algorithm, the corpus in the database is gradually improved.