CN103268312B - Corpus collection system based on user feedback and method thereof - Google Patents

Corpus collection system based on user feedback and method thereof

Info

Publication number
CN103268312B
Authority
CN
China
Prior art keywords
user
information
server
database
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310159025.1A
Other languages
Chinese (zh)
Other versions
CN103268312A (en)
Inventor
蒋昌俊
程久军
陈闳中
闫春钢
何良华
侯静玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Abstract

A corpus collection system based on user feedback, and a method thereof. Using the characteristics of user feedback in combination with an application on a mobile terminal, the invention judges whether a trained model has performed the corresponding language processing task well, and thereby determines whether a given annotation is correct. Correctly annotated text is stored in a database, which achieves the generation and collection of a corpus. With the development of the Internet, and the mobile Internet in particular, more and more applications target the mobile Internet; these applications provide information to users through a server and can also use the mobile device to collect users' habits and information. As a new learning mechanism based on user feedback, the invention uses the feedback information to provide a new approach and method for machine learning, and can be applied to tasks such as spam filtering.

Description

Corpus collection system based on user feedback and method thereof
Technical field
The present invention relates to a corpus collection method.
Technical background
After decades of development, natural language processing has made considerable progress. At present, in related areas such as Chinese word segmentation and text classification, machine learning methods based on statistical models are mainly used. Meanwhile, many problems in machine learning can be formalized as sequence learning problems. In a sequence learning problem, a number of data points form an ordered whole, and each data point must be assigned a class label. Because rich and complex sequential dependencies exist among the data points in a sequence, such problems are very challenging. Classical machine learning methods are limited by the assumption that data points are independent; they cannot take the dependencies between neighboring data into account and therefore lose much important information, which reduces classification performance or even renders the classifier ineffective.
For this reason, hidden Markov models, the most classical approach, have long been used for learning and prediction on sequence data, and most mature products currently on the market are based on them. However, the hidden Markov model is a generative model: to guarantee a correct derivation it must satisfy strict independence assumptions, and these assumptions clearly do not hold in practice. The maximum entropy Markov model, which appeared later, overcomes the independence assumption problem but suffers from the label bias problem. In 2001, the conditional random field model proposed by Lafferty et al. solved the label bias problem on that basis. Admittedly, because of its complexity the conditional random field model has a heavy training load and converges slowly, but on the whole it is currently a relatively advanced statistical learning model.
Although conditional random fields have achieved notable success, training an effective model still faces the challenge of prohibitive cost. The cost comes mainly from two sources. The first is the high cost of feature design. Features are crucial to a classifier, but designing them requires the participation and guidance of domain experts, which is very expensive. Second, model training requires a large amount of annotated language material, and the cost of corpus annotation is enormous. How to reduce annotation cost has been a problem that machine learning researchers have worked on in recent years. Because of the complexity of sequence data, both problems are more pronounced for sequence learning, and no effective solution exists yet. In recent years scholars have proposed many methods to address the shortage of annotated material, for example building a corpus with self-training algorithms.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art by disclosing a corpus collection method based on user feedback that is low in cost and effective.
To achieve this object, the invention provides a system solution, characterized in that the system comprises an application client and a server, and data is transmitted between the client and the server over the HTTP protocol;
An application program module is installed in the application client; through this module the user inputs information, and that information is submitted for word segmentation. The application client transmits the input information to the server in XML file format;
The server comprises two parts, a recognition system and a database, and the recognition system is a trained conditional random field model. The recognition system parses the XML file to obtain the character string of the user's input, and the characters are fed in order into the conditional random field model for word segmentation. The segmentation result is then processed according to the characteristics of the application itself to obtain the information the user needs, and the server returns this information to the user;
After receiving the information, the user reacts differently depending on the result: if the returned result is correct, the user continues with the next operation; if the returned result is wrong, the user abandons the current operation or repeats the previous one. The client judges the correctness of the segmentation result from this user behavior and returns the judgment to the server, and the server decides on that basis whether to put the segmentation of the user's input into the database. The database is a MySQL database and stores the recognized, annotated corpus.
The database contains one table for the annotated corpus, which stores the collected language material; the table also records the corresponding user, the submission time, and other information, to facilitate later review and mining. The meaning of each field in this table is given in Table 1.
Based on the above system, a corpus collection method based on user feedback, characterized in that it takes word segmentation as its background and comprises the following steps:
1. In the recognition system of the server, select a trained random field recognition model and put it into practical use.
2. The user provides information on the client by text input; the client transmits the information to the random field recognition model in the server's recognition system for recognition, and the recognition result is fed back to the user. Specific implementation: the client converts the input text into XML-format information and transmits it to the server. On the server, the server program first parses the XML file to obtain the character string of the user's input; the characters are fed in order into the random field recognition model of the recognition system for word segmentation; the segmentation result is then processed according to the characteristics of the application itself to obtain the information the user needs, and the server returns this information to the user.
3. Collect the user's behavior, judge the result of this recognition, annotate the recognition according to that judgment, and store it in the database as corpus material. Specific implementation: after receiving the information, the user reacts differently depending on the result: if the returned result is correct, the user continues with the next operation; if the returned result is wrong, the user abandons the current operation or repeats the previous one. From these two kinds of user behavior the client can judge the correctness of the segmentation result. The client returns this judgment to the server, and the server decides on that basis whether to put the segmentation of the user's input into the database.
4. Repeat step 2 and step 3 and, combining user feedback with the training algorithm, gradually improve the corpus in the database.
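Steps 2 to 4 can be sketched as a simple collection loop. This is a minimal illustration only, not the patented implementation: the names `collection_round`, `toy_segment`, and `toy_feedback` are hypothetical, the whitespace splitter stands in for the trained random field model, and the feedback function stands in for the behavior observed on the client. Retraining the model on the collected corpus is not shown.

```python
from typing import Callable, List, Tuple

Segmenter = Callable[[str], List[str]]
Feedback = Callable[[str, List[str]], bool]


def collection_round(
    texts: List[str], segment: Segmenter, user_accepts: Feedback
) -> List[Tuple[str, List[str]]]:
    """Run one pass of steps 2 and 3 and return the accepted samples."""
    accepted = []
    for text in texts:
        words = segment(text)            # step 2: the server segments the input
        if user_accepts(text, words):    # step 3: feedback judges correctness
            accepted.append((text, words))
    return accepted


def toy_segment(text: str) -> List[str]:
    # Hypothetical stand-in for the trained model: split on whitespace.
    return text.split()


def toy_feedback(text: str, words: List[str]) -> bool:
    # Hypothetical stand-in for observed behavior: accept multi-word results.
    return len(words) > 1


# Step 4: repeat the round over successive batches, growing the corpus.
corpus: List[Tuple[str, List[str]]] = []
for batch in (["new york city", "hello"], ["machine learning"]):
    corpus.extend(collection_round(batch, toy_segment, toy_feedback))
```

Only samples that the feedback function accepts reach `corpus`, which mirrors the server's decision in step 3.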
Using the characteristics of user feedback in combination with an application on a mobile terminal, the invention judges whether the trained model has performed the corresponding language processing task well, and thereby determines whether a given annotation is correct. Correctly annotated text is stored in the database, which achieves the generation and collection of a corpus.
With the development of the Internet, and the mobile Internet in particular, more and more applications target the mobile Internet; these applications provide information to users through a server and can also use the mobile device to collect users' habits and information. As a new learning mechanism based on user feedback, the invention uses the feedback information to provide a new approach and method for machine learning, and can be applied to tasks such as spam filtering.
Brief description of the drawings
Fig. 1 is a diagram of the system structure.
Fig. 2 is a diagram of the relationship between the user, the recognition module, and the database.
Embodiment
The technical solution of the invention is further described below with reference to the drawings and a case.
The system is divided into two major components: a client and a server. The client is developed mainly for the Android system, one of the mainstream smartphone operating systems. The server is based on the LAMP stack.
Data is transmitted between the client and the server over the HTTP protocol.
An application program module is installed in the application client; through this module the user inputs information, and that information is submitted for word segmentation. The application client transmits the input information to the server in XML file format.
The server comprises two parts, a recognition system and a database, and the recognition system is a trained conditional random field model. The recognition system parses the XML file to obtain the character string of the user's input, and the characters are fed in order into the conditional random field model for word segmentation. The segmentation result is then processed according to the characteristics of the application itself to obtain the information the user needs, and the server returns this information to the user. After receiving the information, the user reacts differently depending on the result: if the returned result is correct, the user continues with the next operation; if the returned result is wrong, the user abandons the current operation or repeats the previous one. The client judges the correctness of the segmentation result from this user behavior and returns the judgment to the server, and the server decides on that basis whether to put the segmentation of the user's input into the database. The database is a MySQL database and stores the recognized, annotated corpus.
Based on the above system, the corpus collection method based on user feedback is implemented in the following steps:
1. In the recognition system of the server, select a trained random field recognition model and put it into practical use.
2. The user provides information on the client by text input; the client transmits the information to the random field recognition model in the server's recognition system for recognition, and the recognition result is fed back to the user.
Specific implementation of step 2: the client converts the input text into XML-format information and transmits it to the server. On the server, the server program first parses the XML file to obtain the character string of the user's input; the characters are fed in order into the random field recognition model of the recognition system for word segmentation; the segmentation result is then processed according to the characteristics of the application itself to obtain the information the user needs, and the server returns this information to the user.
3. Collect the user's behavior, judge the result of this recognition, annotate the recognition according to that judgment, and store it in the database as corpus material. Specific implementation: after receiving the information, the user reacts differently depending on the result: if the returned result is correct, the user continues with the next operation; if the returned result is wrong, the user abandons the current operation or repeats the previous one. From these two kinds of user behavior the client can judge the correctness of the segmentation result. The client returns this judgment to the server, and the server decides on that basis whether to put the segmentation of the user's input into the database.
4. Repeat step 2 and step 3 and, combining user feedback with the training algorithm, gradually improve the corpus in the database.
Case
(1) Building the experimental system and platform
The invention takes word segmentation as its background and discloses a corpus collection method. First, a conditional random field is selected as the recognition model. We adopt CRFsuite as the open-source framework for the conditional random field model. For parameter estimation, CRFsuite supports the relatively advanced SGD (Stochastic Gradient Descent) algorithm, which greatly shortens training time. Once the recognition system is built, it is connected to the server, which runs Linux. The client runs Android and is attached to the relevant application. To fully realize the designed functions, we use Android 4.2, the latest version at the time. The database is MySQL, a relational database management system developed by the Swedish company MySQL AB and currently owned by Oracle. A relational database stores data in separate tables rather than putting all data in one large repository, which increases speed and improves flexibility. SQL, the language MySQL uses, is the most commonly used standardized language for accessing databases. MySQL is dual-licensed and comes in a community edition and a commercial edition; because it is small, fast, and has a low total cost of ownership, and especially because it is open source, it is commonly chosen as the site database for small and medium-sized websites. Because the community edition performs well, it combines with PHP and Apache to form a good development environment. Fig. 1 is the system structure diagram. The server comprises two parts, a recognition system and a database, and the recognition system is a trained conditional random field model. The database is a MySQL database and stores the recognized corpus.
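A conditional random field for word segmentation is commonly trained on per-character labels. The sketch below shows one way to turn an already segmented sentence into such training pairs and simple context features. The B/M/E/S tag scheme, the feature names, and the functions `words_to_tags` and `char_features` are assumptions made for illustration; the patent does not specify a tag set or a feature template, and no CRFsuite call is shown here.

```python
from typing import Dict, List, Tuple


def words_to_tags(words: List[str]) -> List[Tuple[str, str]]:
    """Label each character B (begin), M (middle), E (end) or S (single)."""
    pairs: List[Tuple[str, str]] = []
    for word in words:
        if not word:
            continue  # skip empty strings defensively
        if len(word) == 1:
            pairs.append((word, "S"))
            continue
        pairs.append((word[0], "B"))
        for ch in word[1:-1]:
            pairs.append((ch, "M"))
        pairs.append((word[-1], "E"))
    return pairs


def char_features(chars: List[str], i: int) -> Dict[str, str]:
    """Hypothetical feature template: current, previous and next character."""
    return {
        "cur": chars[i],
        "prev": chars[i - 1] if i > 0 else "<BOS>",
        "next": chars[i + 1] if i < len(chars) - 1 else "<EOS>",
    }


pairs = words_to_tags(["同济大学", "很", "好"])
chars = [ch for ch, _ in pairs]
features = [char_features(chars, i) for i in range(len(chars))]
```

Each sentence then yields one feature sequence and one label sequence of equal length, which is the input shape a sequence labeler expects.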
(2) Acquisition of user feedback and the manner of corpus annotation
To acquire user feedback, the client part is first embedded in a specific application. The application should be one that requires the user to input information and requires word segmentation of that information. The user inputs information in the application, and the application client then transmits the information to the server in a defined format. Here the HTTP protocol and XML files are used for data transmission. XML, the Extensible Markup Language, is a markup language for marking up electronic documents so that they have structure; it can be used to mark up data and define data types, and it is a source language that lets users define their own markup languages. XML is a subset of the Standard Generalized Markup Language (SGML) and is well suited to transmission over the Web. XML provides a uniform way to describe and exchange structured data independently of any application or vendor.
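The client side of this exchange might look like the following sketch, which wraps the user's input in XML and prepares an HTTP POST. The element names `request`, `user`, and `text`, the URL, and both function names are hypothetical; the patent states only that HTTP and XML are used, not the document layout. The request is built but deliberately not sent.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_request_xml(user_id: str, text: str) -> bytes:
    """Wrap the user's input in a small XML document (layout is assumed)."""
    root = ET.Element("request")
    ET.SubElement(root, "user").text = user_id
    ET.SubElement(root, "text").text = text
    # ElementTree escapes characters such as '<' and '&' automatically.
    return ET.tostring(root, encoding="utf-8")


def make_http_request(url: str, payload: bytes) -> urllib.request.Request:
    """Prepare, without sending, an HTTP POST that carries the XML payload."""
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/xml; charset=utf-8"},
        method="POST",
    )


payload = build_request_xml("u42", "我想去同济大学")
request = make_http_request("http://example.com/segment", payload)
```

Passing `request` to `urllib.request.urlopen` would perform the actual transfer; an Android client would use its own HTTP stack, so this Python version only illustrates the message shape.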
After the information arrives at the server, the server program first parses the XML file to obtain the character string of the user's input. The characters are fed in order into the recognition system for word segmentation, and the segmentation result is then processed according to the characteristics of the application itself to obtain the information the user needs. The server then returns the information to the user.
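On the server, the parse-then-segment flow described above could be sketched as follows. The `text` element name, the B/M/E/S tag scheme, and the `fixed_tagger` stub are assumptions; in the described system the tags would come from the trained conditional random field model rather than a hard-coded list.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List

Tagger = Callable[[List[str]], List[str]]


def parse_user_text(xml_bytes: bytes) -> str:
    """Extract the user's input string from the XML request."""
    node = ET.fromstring(xml_bytes).find("text")
    if node is None or node.text is None:
        raise ValueError("request has no <text> element")
    return node.text


def tags_to_words(chars: List[str], tags: List[str]) -> List[str]:
    """Merge characters into words: a word closes on an E or S tag."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends without E or S
        words.append(current)
    return words


def segment(text: str, tagger: Tagger) -> List[str]:
    chars = list(text)  # characters are fed to the model in order
    return tags_to_words(chars, tagger(chars))


def fixed_tagger(chars: List[str]) -> List[str]:
    # Hypothetical stub standing in for the trained model's predictions.
    return ["B", "M", "M", "E", "B", "E"][: len(chars)]


xml_request = "<request><text>同济大学很好</text></request>".encode("utf-8")
words = segment(parse_user_text(xml_request), fixed_tagger)
```

The application-specific post-processing that turns `words` into the information returned to the user is left out, since the patent does not describe it.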
After receiving the information, the user reacts differently depending on the result: if the returned result is correct, the user continues with the next operation; if the returned result is wrong, the user abandons the current operation or repeats the previous one. The client can therefore observe the user's behavior and judge from it whether the segmentation result was correct. The client then returns the judgment to the server, and the server decides on that basis whether to put the segmentation of the user's input into the database. Fig. 2 shows the relationship between the user, the recognition module, and the database.
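The client-side judgment can be expressed as a small mapping from observed behavior to a correctness flag. The action names, the `request_id` field, and both function names below are hypothetical; the patent names the behaviors (continuing, abandoning, repeating the previous operation) but not how they are encoded.

```python
from enum import Enum
from typing import Dict, Union


class UserAction(Enum):
    CONTINUE = "continue"   # user proceeds to the next operation
    ABANDON = "abandon"     # user gives up the current operation
    GO_BACK = "go_back"     # user repeats the previous operation


def segmentation_was_correct(action: UserAction) -> bool:
    """Only continuing counts as implicit confirmation of the result."""
    return action is UserAction.CONTINUE


def build_feedback(request_id: str, action: UserAction) -> Dict[str, Union[str, bool]]:
    """Assemble the judgment that the client would return to the server."""
    return {"request_id": request_id, "correct": segmentation_was_correct(action)}


feedback = build_feedback("r-001", UserAction.GO_BACK)
```

Treating behavior as a label is a heuristic: a user may abandon an operation for reasons unrelated to segmentation quality, so a real deployment would probably want additional filtering before storing samples.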
(3) Corpus storage scheme in the database
The database contains one table for the annotated corpus, which stores the collected language material. The table also records the corresponding user, the submission time, and other information, to facilitate later review and mining. Table 1 lists the meaning of each field in this table.
Table 1. Meaning of each field in the corpus table
Field        Explanation
id           Corpus entry ID
group_name   Group
author       ID of the submitting user
timeinfo     Time
moreinfo     Reserved field
title        Title
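The table layout above can be tried out with the sketch below. It uses Python's built-in `sqlite3` module as a stand-in so that the example runs without a MySQL server; the column types, the table name `corpus`, and the function `store_if_correct` are assumptions. The patent does not say which field holds the annotated text, so storing the space-joined segmentation in `title` is purely illustrative.

```python
import sqlite3
from datetime import datetime, timezone
from typing import List

# Field names follow Table 1; the types are assumed, not given in the patent.
SCHEMA = """
CREATE TABLE corpus (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    group_name TEXT,
    author     TEXT NOT NULL,
    timeinfo   TEXT NOT NULL,
    moreinfo   TEXT,
    title      TEXT
)
"""


def store_if_correct(
    conn: sqlite3.Connection,
    author: str,
    segmented: List[str],
    correct: bool,
    group_name: str = "default",
) -> bool:
    """Insert the sample only when user feedback marked it as correct."""
    if not correct:
        return False
    conn.execute(
        "INSERT INTO corpus (group_name, author, timeinfo, title) "
        "VALUES (?, ?, ?, ?)",
        (
            group_name,
            author,
            datetime.now(timezone.utc).isoformat(),
            " ".join(segmented),  # illustrative placement of the annotation
        ),
    )
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
stored = store_if_correct(conn, "u42", ["同济大学", "很好"], correct=True)
skipped = store_if_correct(conn, "u42", ["同", "济大学很好"], correct=False)
```

On MySQL the same idea would use `AUTO_INCREMENT` and a client library with its own placeholder syntax; those details are outside what the patent specifies.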
Innovation of the invention: user feedback is used to annotate language material, which serves as a method of collecting a corpus. By collecting user feedback within the application, the system determines whether the annotation produced by the annotation system is correct, and stores the correctly annotated material in the database.

Claims (2)

1. A corpus collection system based on user feedback, characterized in that the system comprises an application client and a server, and data is transmitted between the client and the server over the HTTP protocol;
An application program module is installed in the application client; through this module the user inputs information, and that information is submitted for word segmentation; the application client transmits the input information to the server in XML file format;
The server comprises two parts, a recognition system and a database, and the recognition system is a trained conditional random field model; the recognition system parses the XML file to obtain the character string of the user's input, the characters are fed in order into the conditional random field model for word segmentation, the segmentation result is then processed according to the characteristics of the application itself to obtain the information the user needs, and the server returns this information to the user;
After receiving the information, the user reacts differently depending on the result: if the returned result is correct, the user continues with the next operation; if the returned result is wrong, the user abandons the current operation or repeats the previous one; the client judges the correctness of the segmentation result from this user behavior and returns the judgment to the server, and the server decides on that basis whether to put the segmentation of the user's input into the database; the database is a MySQL database and stores the recognized, annotated corpus;
The database contains one table for the annotated corpus, which stores the collected language material; the table also records the corresponding user, the submission time, and other information, to facilitate later review and mining; the meaning of each field in this table is as follows:
Field "id", meaning "corpus entry ID"
Field "group_name", meaning "group"
Field "author", meaning "ID of the submitting user"
Field "timeinfo", meaning "time"
Field "moreinfo", meaning "reserved field"
Field "title", meaning "title".
2. A corpus collection method based on user feedback, characterized in that it takes word segmentation as its background and comprises the following steps:
(1) in the recognition system of the server, selecting a trained random field recognition model and putting it into practical use;
(2) the user provides information on the client by text input; the client transmits the information to the random field recognition model in the server's recognition system for recognition, and the recognition result is fed back to the user; specific implementation: the client converts the input text into XML-format information and transmits it to the server; on the server, the server program first parses the XML file to obtain the character string of the user's input; the characters are fed in order into the random field recognition model of the recognition system for word segmentation; the segmentation result is then processed according to the characteristics of the application itself to obtain the information the user needs, and the server returns this information to the user;
(3) collecting the user's behavior, judging the result of this recognition, annotating the recognition according to that judgment, and storing it in the database as corpus material;
Specific implementation: after receiving the information, the user reacts differently depending on the result: if the returned result is correct, the user continues with the next operation; if the returned result is wrong, the user abandons the current operation or repeats the previous one; from these two kinds of user behavior the client can judge the correctness of the segmentation result; the client returns this judgment to the server, and the server decides on that basis whether to put the segmentation of the user's input into the database;
(4) repeating step (2) and step (3) and, combining user feedback with the training algorithm, gradually improving the corpus in the database.
CN201310159025.1A 2013-05-03 2013-05-03 A kind of corpus collection system based on user feedback and method thereof Active CN103268312B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310159025.1A CN103268312B (en) 2013-05-03 2013-05-03 A kind of corpus collection system based on user feedback and method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310159025.1A CN103268312B (en) 2013-05-03 2013-05-03 A kind of corpus collection system based on user feedback and method thereof

Publications (2)

Publication Number Publication Date
CN103268312A (en) 2013-08-28
CN103268312B (en) 2016-04-06

Family

ID=49011943

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310159025.1A Active CN103268312B (en) 2013-05-03 2013-05-03 A kind of corpus collection system based on user feedback and method thereof

Country Status (1)

Country Link
CN (1) CN103268312B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI734085B (en) * 2019-03-13 2021-07-21 中華電信股份有限公司 Dialogue system using intention detection ensemble learning and method thereof
CN114532923B (en) * 2022-02-11 2023-09-12 珠海格力电器股份有限公司 Health detection method and device, sweeping robot and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6766320B1 (en) * 2000-08-24 2004-07-20 Microsoft Corporation Search engine with natural language-based robust parsing for user query and relevance feedback learning
CN102063194A (en) * 2010-04-16 2011-05-18 百度在线网络技术(北京)有限公司 Method, equipment, server and system for inputting characters by user
CN102214227A (en) * 2011-06-23 2011-10-12 华南理工大学 Automatic public opinion monitoring method based on internet hierarchical structure storage
CN102426591A (en) * 2011-10-31 2012-04-25 北京百度网讯科技有限公司 Method and device for operating corpus used for inputting contents
CN102930022A (en) * 2012-10-31 2013-02-13 中国运载火箭技术研究院 User-oriented information search engine system and method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7480667B2 (en) * 2004-12-24 2009-01-20 Microsoft Corporation System and method for using anchor text as training data for classifier-based search systems

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Query Segmentation Using Conditional Random Fields;Xiaohui Yu 等;《Proceedings of the first International Workshop on Keyword Search on Structured Data》;20091231;21-26 *
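Spam identification method based on user feedback and incremental learning; Wang Xin et al.; Journal of Tsinghua University (Science and Technology); 2006-01-31; Vol. 46, No. 1; 70-73 *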

Also Published As

Publication number Publication date
CN103268312A (en) 2013-08-28

Similar Documents

Publication Publication Date Title
CN106874378B (en) Method for constructing knowledge graph based on entity extraction and relation mining of rule model
KR102094659B1 (en) Automatic generation of headlines
CN104881488B (en) Configurable information extraction method based on relation table
CN103914494B (en) Method and system for identifying identity of microblog user
CN109543034B (en) Text clustering method and device based on knowledge graph and readable storage medium
CN102955848B (en) A kind of three-dimensional model searching system based on semanteme and method
CN105205699A (en) User label and hotel label matching method and device based on hotel comments
Pasupat et al. Zero-shot entity extraction from web pages
CN106201465A (en) Software project personalized recommendation method towards open source community
CN109766417A (en) A kind of construction method of the literature annals question answering system of knowledge based map
CN103870506B (en) Webpage information extraction method and system
CN106484829B (en) A kind of foundation and microblogging diversity search method of microblogging order models
CN107392143A (en) A kind of resume accurate Analysis method based on SVM text classifications
CN103294781A (en) Method and equipment used for processing page data
CN107247739B (en) A kind of financial bulletin text knowledge extracting method based on factor graph
CN102890702A (en) Internet forum-oriented opinion leader mining method
CN104484380A (en) Personalized search method and personalized search device
CN102750316A (en) Concept relation label drawing method based on semantic co-occurrence model
CN104199938B (en) Agricultural land method for sending information and system based on RSS
CN103678412A (en) Document retrieval method and device
CN112925901B (en) Evaluation resource recommendation method for assisting online questionnaire evaluation and application thereof
CN102135976A (en) Hypertext markup language page structured data extraction method and device
CN103970898A (en) Method and device for extracting information based on multistage rule base
CN104346331A (en) Retrieval method and system for XML database
CN104699797A (en) Webpage data structured analytic method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant