CN103268312A - Training corpus collection system and method based on user feedback - Google Patents

Publication number
CN103268312A
Authority
CN
China
Prior art keywords
user, information, server, database, result
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013101590251A
Other languages
Chinese (zh)
Other versions
CN103268312B (en)
Inventor
蒋昌俊
程久军
陈闳中
闫春钢
何良华
侯静玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2013-05-03
Filing date
2013-05-03
Publication date
2013-08-28
Application filed by Tongji University
Priority to CN201310159025.1A
Publication of CN103268312A
Application granted
Publication of CN103268312B
Legal status: Active

Abstract

Disclosed are a training corpus collection system and method based on user feedback. Drawing on the characteristics of user feedback and of applications on mobile terminals, the method judges whether a trained model has completed the corresponding language processing task well, and thereby determines whether the current annotation is correct. Correctly annotated texts are stored in a database, which generates and collects the training corpus. With the development of the internet, and of the mobile internet in particular, more and more applications are built specifically for the mobile internet; they provide users with various kinds of information through servers and can collect users' habits and information through mobile devices. As a novel learning mechanism based on user feedback, the method uses the information fed back by users to give machine learning new approaches, and it is applicable to tasks such as spam email recognition.

Description

A training corpus collection system based on user feedback and method thereof
Technical field
The present invention relates to a training corpus collection method.
Technical background
After decades of development, natural language processing has made considerable progress. At present, related fields such as Chinese word segmentation and text classification mainly adopt machine learning methods based on statistical models. Many problems in machine learning can also be formalized as sequence learning problems. In a sequence learning problem, a number of data points form an ordered whole, and each data point must be assigned its own class label. Because rich and complex dependencies exist between the data points of a sequence, such problems are very challenging. Classical machine learning methods are limited by the assumption that data points are independent; they cannot take the dependencies between neighbouring data into account and so lose much important information, which degrades the classification result or even renders it invalid.
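To make the "one label per data point" view concrete for word segmentation, the following minimal sketch converts a segmented sentence into per-character tags and back. The B/M/E/S tag set (begin, middle, end, single) is a common convention and is an assumption here; the patent does not name a tag scheme, and the function names are illustrative.

```python
# Illustrative only: the B/M/E/S tag set is an assumption, not taken from the patent.

def words_to_tags(words):
    """Turn a segmented sentence into one tag per character."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")                      # single-character word
        else:
            tags.append("B")                      # first character of a word
            tags.extend("M" * (len(word) - 2))    # middle characters, if any
            tags.append("E")                      # last character of a word
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their (predicted) tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):                     # a word ends at this character
            words.append(current)
            current = ""
    if current:                                   # tolerate a sequence that ends mid-word
        words.append(current)
    return words


words = ["同济", "大学", "在", "上海"]
tags = words_to_tags(words)
print(tags)
print(tags_to_words("".join(words), tags))
```

A sequence model then predicts the tag of each character from its neighbours, which is exactly the dependency that independent per-point classifiers ignore.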
For this reason, the classical hidden Markov model has long been used for learning and prediction on sequence data, and most mature products on the market are based on it. However, the hidden Markov model is a generative model: to guarantee that its derivation is correct it must satisfy strict independence assumptions, which clearly do not hold in practice. The maximum entropy Markov model that appeared later overcomes the independence assumption but suffers from the label bias problem. In 2001, the conditional random field model proposed by Lafferty et al. solved the label bias problem. Owing to its complexity, the conditional random field has drawbacks of its own, such as a heavy training workload and slow convergence, but on the whole it is a relatively advanced statistical learning model at present.
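The patent gives no formulas. For reference, the standard linear-chain conditional random field can be written as below, where the notation (input sequence x, label sequence y of length T, feature functions f_k with weights λ_k) is introduced here and is not the patent's:

```latex
p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
  \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \right),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'}
  \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \right)
```

The normalizer Z(x) sums over whole label sequences rather than over the successors of a single state. This global normalization is what removes the label bias of the maximum entropy Markov model, and computing it during training is one reason for the heavy training workload mentioned above.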
Although conditional random fields have achieved notable success, training an effective model still faces prohibitive costs. These costs come mainly from two sources. First, designing features is expensive. Features are crucial to a classifier, yet designing them requires the participation and guidance of domain experts. Second, model training requires a large amount of annotated corpus, and corpus annotation is very costly. Reducing annotation cost is a problem that machine learning researchers have worked on in recent years. Because of the complexity of sequence data, both problems are more pronounced in sequence learning, and no effective solution exists yet. To address the shortage of annotated corpus, researchers have proposed many methods in recent years, such as building the training corpus with self-training algorithms.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art by disclosing a training corpus collection method based on user feedback that is low in cost and effective.
 
To achieve the above object, the invention provides a system with the following technical scheme: the system comprises an application client and a server side, and data is transmitted between the client and the server over the HTTP protocol.
An application program module is installed in the application client; through it the user enters information and requests word-segmentation recognition of that information. The application client transmits the entered information to the server as an XML file.
The server contains two parts, a recognition system and a database. The recognition system is a trained conditional random field model. It parses the XML file to obtain the character string of the user's input, feeds the string in order into the conditional random field model for word segmentation, and then processes the segmentation result according to the characteristics of the application to obtain the information the user needs, which the server returns to the user.
After receiving the information, the user reacts differently depending on the result: if the returned result is correct, the user goes on to the next operation; if it is wrong, the user abandons the current operation or redoes the earlier operation. The client judges from the user's operation behaviour whether the segmentation result is correct and returns this judgement to the server, and the server decides on that basis whether to store the segmentation of the user's input in the database. The database is a MySQL database and stores the recognized, annotated training corpus.
The database holds one table for the annotated training corpus. The table stores the collected corpus and also records information such as the corresponding user and the submission time, for later review and mining. The meaning of each field in this table is listed in Table 1.
Based on the above system, a training corpus collection method based on user feedback is provided, characterized in that it takes word segmentation as its setting and comprises the following steps:
1. In the recognition system of the server, select a trained random field recognition model and put it into practical use.
2. The user provides information by text input on the client; the client transmits it to the random field recognition model in the server's recognition system for recognition, and the recognition result is fed back to the user. Specific implementation: the client converts the input text into XML and transmits it to the server side. There the server program first parses the XML file to obtain the character string of the user's input, feeds the string in order into the random field recognition model of the recognition system for word segmentation, and processes the segmentation result according to the characteristics of the application to obtain the information the user needs, which the server returns to the user.
3. Collect the user's behaviour, judge the result of this recognition, and on that basis annotate the recognition and store it in the database as training corpus. Embodiment: after receiving the information, the user reacts differently depending on the result: if the returned result is correct, the user goes on to the next operation; if it is wrong, the user abandons the current operation or redoes the earlier operation. From these two kinds of user behaviour the client can judge whether the segmentation result is correct. The client returns this judgement to the server, and the server decides on that basis whether to store the segmentation of the user's input in the database.
4. Repeat steps 2 and 3; combining user feedback with the training algorithm, the training corpus in the database is improved step by step.
According to the characteristics of user feedback, and in combination with applications on mobile terminals, the invention judges whether the trained model has completed the corresponding language processing task well and can thereby determine whether the annotation is correct. Storing the correctly annotated text in the database realizes the generation and collection of the training corpus.
With the development of the internet, and of the mobile internet in particular, more and more applications are built specifically for the mobile internet. These applications provide users with various kinds of information through servers and can also use mobile devices to collect users' habits and information. As a new learning mechanism based on user feedback, the invention uses the information fed back by users to give machine learning new approaches, and it can be applied to tasks such as spam email recognition.
Description of the drawings
Fig. 1 is a structural diagram of the system.
Fig. 2 is a diagram of the relationship between the user, the recognition module and the database.
Embodiment
The technical scheme of the invention is further described below with reference to the drawings and a case.
 
The system has two major components: the client and the server side. The client is developed mainly on the Android system, one of the mainstream smartphone operating systems at present. The server side is a server based on the LAMP architecture.
Data is transmitted between the client and the server over the HTTP protocol.
An application program module is installed in the application client; through it the user enters information and requests word-segmentation recognition of that information. The application client transmits the entered information to the server as an XML file.
The server contains two parts, a recognition system and a database. The recognition system is a trained conditional random field model. It parses the XML file to obtain the character string of the user's input, feeds the string in order into the conditional random field model for word segmentation, and then processes the segmentation result according to the characteristics of the application to obtain the information the user needs, which the server returns to the user. After receiving the information, the user reacts differently depending on the result: if the returned result is correct, the user goes on to the next operation; if it is wrong, the user abandons the current operation or redoes the earlier operation. The client judges from the user's operation behaviour whether the segmentation result is correct and returns this judgement to the server, and the server decides on that basis whether to store the segmentation of the user's input in the database. The database is a MySQL database and stores the recognized, annotated training corpus.
Based on the above system, the training corpus collection method based on user feedback is realized in the following steps:
1. In the recognition system of the server, select a trained random field recognition model and put it into practical use.
2. The user provides information by text input on the client; the client transmits it to the random field recognition model in the server's recognition system for recognition, and the recognition result is fed back to the user. Specific implementation: the client converts the input text into XML and transmits it to the server side. There the server program first parses the XML file to obtain the character string of the user's input, feeds the string in order into the random field recognition model of the recognition system for word segmentation, and processes the segmentation result according to the characteristics of the application to obtain the information the user needs, which the server returns to the user.
3. Collect the user's behaviour, judge the result of this recognition, and on that basis annotate the recognition and store it in the database as training corpus. Embodiment: after receiving the information, the user reacts differently depending on the result: if the returned result is correct, the user goes on to the next operation; if it is wrong, the user abandons the current operation or redoes the earlier operation. From these two kinds of user behaviour the client can judge whether the segmentation result is correct. The client returns this judgement to the server, and the server decides on that basis whether to store the segmentation of the user's input in the database.
4. Repeat steps 2 and 3; combining user feedback with the training algorithm, the training corpus in the database is improved step by step.
Case
(1) Building the experimental system and platform
The invention takes word segmentation as its setting to disclose the training corpus collection method. First, a conditional random field is chosen as the recognition model. We use CRFsuite, an open-source framework for conditional random field models. For parameter estimation CRFsuite supports the relatively advanced SGD (Stochastic Gradient Descent) algorithm, which greatly reduces training time. Once the recognition system is built, it is connected to the server, which runs Linux. The client runs on Android and is attached to the relevant application. To implement the functions required by the design, we use Android 4.2, the latest version at the time of writing.
The database is MySQL, a relational database management system developed by the Swedish company MySQL AB and now owned by Oracle. A relational database stores data in separate tables rather than putting all data in one big repository, which increases speed and improves flexibility. SQL, the language MySQL uses, is the most common standardized language for accessing databases. MySQL is dual-licensed and comes in a community edition and a commercial edition. Because it is small, fast and low in total cost of ownership, and above all open source, small and medium-sized websites generally choose MySQL as their site database. Thanks to the excellent performance of the community edition, it combines with PHP and Apache into a good development environment.
Fig. 1 is the structural diagram of the system. The server contains two parts, a recognition system and a database. The recognition system is a trained conditional random field model, and the MySQL database stores the recognized training corpus.
(2) Obtaining user feedback and annotating the corpus
To obtain user feedback, the client part is first built into a specific application. The application should be one in which the user enters information that requires word-segmentation recognition. The user enters information in the application, and the application client then transmits it to the server in a defined format. Here the HTTP protocol and XML files are used for data transmission. XML (Extensible Markup Language) is a markup language for marking up electronic documents so that they are structured; it can be used to tag data and define data types, and it is a metalanguage that lets users define their own markup languages. XML is a subset of the Standard Generalized Markup Language (SGML) and is well suited to transmission over the Web. It provides a uniform way to describe and exchange structured data independently of applications or vendors.
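The client side of this exchange might look like the following minimal sketch. The patent does not define the XML schema or the server address, so the element names (`request`, `user`, `text`), the URL and the function names are all hypothetical; the request object is only built here, not sent.

```python
import urllib.request
import xml.etree.ElementTree as ET

SERVER_URL = "http://example.com/segment"  # hypothetical endpoint, not from the patent


def build_payload(user_id, text):
    """Wrap the user's input in a small XML document (element names are assumed)."""
    root = ET.Element("request")
    ET.SubElement(root, "user").text = user_id
    ET.SubElement(root, "text").text = text
    return ET.tostring(root, encoding="utf-8")   # bytes, ready to be an HTTP body


def build_http_request(payload, url=SERVER_URL):
    """Prepare an HTTP POST carrying the XML payload (not sent here)."""
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/xml; charset=utf-8"},
        method="POST",
    )


payload = build_payload("user-42", "同济大学在上海")
request = build_http_request(payload)
print(request.get_method(), request.full_url)
print(payload.decode("utf-8"))
```

Passing the prepared object to `urllib.request.urlopen` would perform the actual transfer; an Android client would use the platform's own HTTP and XML libraries instead.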
When the information reaches the server side, the server program first parses the XML file to obtain the character string of the user's input. The string is fed in order into the recognition system for word segmentation, and the segmentation result is then processed according to the characteristics of the application to obtain the information the user needs. The server then returns this information to the user.
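The server-side flow (parse, segment, post-process) can be sketched as follows. To keep the example self-contained, the trained conditional random field is replaced by a plainly different technique, forward maximum matching over a toy lexicon; the lexicon, the XML element names and the post-processing rule (keep multi-character words) are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

LEXICON = {"同济", "大学", "同济大学", "上海"}   # hypothetical toy lexicon


def segment(text, lexicon=LEXICON, max_len=4):
    """Stand-in for the trained CRF: forward maximum matching over a lexicon."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def handle_request(xml_bytes):
    """Parse the XML request, segment the text, then post-process the result."""
    root = ET.fromstring(xml_bytes)
    text = root.findtext("text", default="")
    words = segment(text)
    # Application-specific processing is not specified in the patent;
    # as an assumed example, keep only multi-character words as keywords.
    keywords = [w for w in words if len(w) > 1]
    return {"user": root.findtext("user"), "words": words, "keywords": keywords}


request_xml = "<request><user>u1</user><text>同济大学在上海</text></request>".encode("utf-8")
print(handle_request(request_xml))
```

In the system described here, `segment` would instead call the trained conditional random field model, and the dictionary returned by `handle_request` would be serialized back to the client.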
After receiving the information, the user reacts differently depending on the result: if the returned result is correct, the user goes on to the next operation; if it is wrong, the user abandons the current operation or redoes the earlier operation. The client can therefore observe the user's operation behaviour and judge from it whether the segmentation result is correct. The client then returns this judgement to the server, and the server decides on that basis whether to store the segmentation of the user's input in the database. Fig. 2 shows the relationship between the user, the recognition module and the database.
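The decision rule in this paragraph can be sketched as below. The event names (`continue`, `abandon`, `redo`) and the in-memory `Corpus` class are hypothetical; the patent only states that continuing implies a correct result and that abandoning or redoing implies a wrong one.

```python
from dataclasses import dataclass, field

# Hypothetical names for the user behaviours the client can observe.
CONTINUE, ABANDON, REDO = "continue", "abandon", "redo"


def is_correct(action):
    """Client-side judgement: only continuing counts as implicit approval."""
    return action == CONTINUE


@dataclass
class Corpus:
    """In-memory stand-in for the server's corpus database."""
    samples: list = field(default_factory=list)

    def record(self, text, words, action):
        """Store the segmentation only when the user's behaviour confirms it."""
        if is_correct(action):
            self.samples.append((text, words))
            return True
        return False


corpus = Corpus()
corpus.record("同济大学在上海", ["同济大学", "在", "上海"], CONTINUE)   # kept
corpus.record("同济大学在上海", ["同济", "大学在", "上海"], REDO)       # discarded
print(len(corpus.samples))
```

Because the label comes from behaviour rather than from an annotator, such a rule treats any unrelated reason for abandoning an operation as a rejection; that loses some correct samples but keeps wrong ones out of the corpus.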
(3) Storage scheme of the training corpus in the database
The database holds one table for the annotated training corpus, which stores the collected corpus. The table also records information such as the corresponding user and the submission time, for later review and mining. Table 1 gives the meaning of each field in this table.
Table 1. Meaning of each field in the corpus table
Field        Description
id           Corpus id
group_name   Group
author       Id of the submitting user
timeinfo     Time
moreinfo     Reserved field
title        Title
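A table with the fields of Table 1 could be created and filled as in the sketch below. It uses Python's built-in `sqlite3` as a stand-in so that it runs without a server; the described system uses MySQL, whose DDL differs slightly (for example `AUTO_INCREMENT`). The table name, column types and the choice of storing the annotated text in `title` are assumptions, since the patent gives only field names and one-word descriptions.

```python
import sqlite3

# Column names follow Table 1; the table name and types are assumed for illustration.
SCHEMA = """
CREATE TABLE corpus (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,  -- corpus id
    group_name TEXT,                               -- group
    author     TEXT,                               -- id of the submitting user
    timeinfo   TEXT,                               -- submission time
    moreinfo   TEXT,                               -- reserved field
    title      TEXT                                -- title (assumed to hold the annotated text)
)
"""


def open_corpus():
    """Create an in-memory database containing the corpus table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def store_sample(conn, group_name, author, timeinfo, annotated_text):
    """Insert one confirmed sample and return its generated id."""
    cur = conn.execute(
        "INSERT INTO corpus (group_name, author, timeinfo, moreinfo, title) "
        "VALUES (?, ?, ?, NULL, ?)",
        (group_name, author, timeinfo, annotated_text),
    )
    conn.commit()
    return cur.lastrowid


conn = open_corpus()
row_id = store_sample(conn, "segmentation", "u1", "2013-05-03 10:00:00", "同济大学/在/上海")
print(row_id, conn.execute("SELECT author, title FROM corpus").fetchall())
```

Recording `author` and `timeinfo` with every sample is what makes the later review and mining mentioned above possible, for instance filtering out samples from a user whose feedback proves unreliable.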
 
Innovation of the invention: user feedback is used to annotate the corpus, and this serves as a corpus collection method. By collecting the content of user feedback within an application, the correctness of the annotation system's output is determined, and the correctly annotated corpus is stored in the database.

Claims (3)

1. A training corpus collection system based on user feedback, characterized in that the system comprises an application client and a server side, and data is transmitted between the client and the server over the HTTP protocol;
an application program module is installed in the application client, through which the user enters information and requests word-segmentation recognition of that information; the application client transmits the entered information to the server as an XML file;
the server contains two parts, a recognition system and a database, the recognition system being a trained conditional random field model; the recognition system parses the XML file to obtain the character string of the user's input, feeds the string in order into the conditional random field model for word segmentation, and processes the segmentation result according to the characteristics of the application to obtain the information the user needs, which the server returns to the user;
after receiving the information, the user reacts differently depending on the result: if the returned result is correct, the user goes on to the next operation; if it is wrong, the user abandons the current operation or redoes the earlier operation; the client judges from the user's operation behaviour whether the segmentation result is correct and returns this judgement to the server, and the server decides on that basis whether to store the segmentation of the user's input in the database; the database is a MySQL database and stores the recognized, annotated training corpus.
2. The system as claimed in claim 1, characterized in that the database holds one table for the annotated training corpus; the table stores the collected corpus and also records information such as the corresponding user and the submission time, for later review and mining. The meaning of each field in this table is:
Field        Description
id           Corpus id
group_name   Group
author       Id of the submitting user
timeinfo     Time
moreinfo     Reserved field
title        Title
3. A training corpus collection method based on user feedback, characterized in that it takes word segmentation as its setting and comprises the following steps:
(1) in the recognition system of the server, select a trained random field recognition model and put it into practical use;
(2) the user provides information by text input on the client, the client transmits it to the random field recognition model in the server's recognition system for recognition, and the recognition result is fed back to the user; specific implementation: the client converts the input text into XML and transmits it to the server side; there the server program first parses the XML file to obtain the character string of the user's input, feeds the string in order into the random field recognition model of the recognition system for word segmentation, and processes the segmentation result according to the characteristics of the application to obtain the information the user needs, which the server returns to the user;
(3) collect the user's behaviour, judge the result of this recognition, and on that basis annotate the recognition and store it in the database as training corpus;
embodiment: after receiving the information, the user reacts differently depending on the result: if the returned result is correct, the user goes on to the next operation; if it is wrong, the user abandons the current operation or redoes the earlier operation; from these two kinds of user behaviour the client can judge whether the segmentation result is correct; the client returns this judgement to the server, and the server decides on that basis whether to store the segmentation of the user's input in the database;
(4) repeat steps (2) and (3); combining user feedback with the training algorithm, the training corpus in the database is improved step by step.
CN201310159025.1A 2013-05-03 2013-05-03 A kind of corpus collection system based on user feedback and method thereof Active CN103268312B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310159025.1A CN103268312B (en) 2013-05-03 2013-05-03 A kind of corpus collection system based on user feedback and method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310159025.1A CN103268312B (en) 2013-05-03 2013-05-03 A kind of corpus collection system based on user feedback and method thereof

Publications (2)

Publication Number Publication Date
CN103268312A true CN103268312A (en) 2013-08-28
CN103268312B CN103268312B (en) 2016-04-06

Family

ID=49011943

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310159025.1A Active CN103268312B (en) 2013-05-03 2013-05-03 A kind of corpus collection system based on user feedback and method thereof

Country Status (1)

Country Link
CN (1) CN103268312B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6766320B1 (en) * 2000-08-24 2004-07-20 Microsoft Corporation Search engine with natural language-based robust parsing for user query and relevance feedback learning
US20060143254A1 (en) * 2004-12-24 2006-06-29 Microsoft Corporation System and method for using anchor text as training data for classifier-based search systems
CN102063194A (en) * 2010-04-16 2011-05-18 百度在线网络技术(北京)有限公司 Method, equipment, server and system for inputting characters by user
CN102214227A (en) * 2011-06-23 2011-10-12 华南理工大学 Automatic public opinion monitoring method based on internet hierarchical structure storage
CN102426591A (en) * 2011-10-31 2012-04-25 北京百度网讯科技有限公司 Method and device for operating corpus used for inputting contents
CN102930022A (en) * 2012-10-31 2013-02-13 中国运载火箭技术研究院 User-oriented information search engine system and method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIAOHUI YU 等: "Query Segmentation Using Conditional Random Fields", 《PROCEEDINGS OF THE FIRST INTERNATIONAL WORKSHOP ON KEYWORD SEARCH ON STRUCTURED DATA》 *
王鑫 等: "基于用户反馈和增量学习的垃圾邮件识别方法", 《清华大学学报(自然科学版)》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI734085B (en) * 2019-03-13 2021-07-21 中華電信股份有限公司 Dialogue system using intention detection ensemble learning and method thereof
CN114532923A (en) * 2022-02-11 2022-05-27 珠海格力电器股份有限公司 Health detection method and device, sweeping robot and storage medium
CN114532923B (en) * 2022-02-11 2023-09-12 珠海格力电器股份有限公司 Health detection method and device, sweeping robot and storage medium

Also Published As

Publication number Publication date
CN103268312B (en) 2016-04-06

Similar Documents

Publication Publication Date Title
CN106874378B (en) Method for constructing knowledge graph based on entity extraction and relation mining of rule model
CN106250412B (en) Knowledge mapping construction method based on the fusion of multi-source entity
CN104391942B (en) Short essay eigen extended method based on semantic collection of illustrative plates
CN101174273B (en) News event detecting method based on metadata analysis
CN104881488B (en) Configurable information extraction method based on relation table
CN101470728B (en) Method and device for automatically abstracting text of Chinese news web page
CN101079024B (en) Special word list dynamic generation system and method
CN109493265A (en) A kind of Policy Interpretation method and Policy Interpretation system based on deep learning
CN109543034B (en) Text clustering method and device based on knowledge graph and readable storage medium
CN110110075A (en) Web page classification method, device and computer readable storage medium
CN105718579A (en) Information push method based on internet-surfing log mining and user activity recognition
CN106484829B (en) A kind of foundation and microblogging diversity search method of microblogging order models
CN102890702A (en) Internet forum-oriented opinion leader mining method
CN103942340A (en) Microblog user interest recognizing method based on text mining
CN102207946B (en) Knowledge network semi-automatic generation method
CN103324666A (en) Topic tracing method and device based on micro-blog data
CN103914478A (en) Webpage training method and system and webpage prediction method and system
CN107392143A (en) A kind of resume accurate Analysis method based on SVM text classifications
CN103294781A (en) Method and equipment used for processing page data
CN1936893A (en) Method and system for generating input-method word frequency base based on internet information
CN103020293A (en) Method and system for constructing ontology base in mobile application
CN103559234A (en) System and method for automated semantic annotation of RESTful Web services
CN103838837A (en) Remote-sensing metadata integration method based on lexeme templates
CN104346331A (en) Retrieval method and system for XML database
CN115827819A (en) Intelligent question and answer processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant