US20110010374A1 - Filtering Information Using Targeted Filtering Schemes - Google Patents

Filtering Information Using Targeted Filtering Schemes

Info

Publication number
US20110010374A1
Authority
US
United States
Prior art keywords
user
filtering
data
information
targeted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US12/667,145
Other versions
US8725746B2 (en
Inventor
Junjie Yang
Liang Ni
Zhenghua Zhang
Zhenyu Zhang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Assigned to ALIBABA GROUP HOLDING LIMITED. Assignment of assignors interest (see document for details). Assignors: NI, LIANG; YANG, JUNJIE; ZHANG, ZHENGHUA; ZHANG, ZHENYU
Publication of US20110010374A1
Priority to US14/197,118 (US9201953B2)
Application granted
Publication of US8725746B2
Legal status: Active
Adjusted expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/335: Filtering based on additional data, e.g. user or group profiles
    • G06F16/337: Profile generation, learning or modification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55: Detecting local intrusion or implementing counter-measures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9535: Search customisation based on user profiles and personalisation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00: Commerce
    • G06Q30/02: Marketing; Price estimation or determination; Fundraising
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00: User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/04: Real-time or near real-time messaging, e.g. instant messaging [IM]
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00: User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/21: Monitoring or handling of messages
    • H04L51/212: Monitoring or handling of messages using filtering or selective blocking
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/02: Network architectures or network communication protocols for network security for separating internal from external traffic, e.g. firewalls
    • H04L63/0227: Filtering policies
    • H04L63/0245: Filtering by information in the payload

Definitions

  • the present disclosure relates to the field of information security technologies, and particularly to methods and apparatuses for filtering user information.
  • methods for promulgating information by users through the Internet have become more and more effective and comprehensive.
  • methods for promulgating information include using instant messaging tools, sending various kinds of information by email, or posting information on a forum on a network.
  • part of this circulated information may be undesirable to a user, or information may be illegally promulgated and need to be filtered.
  • Existing methods for filtering user information are based on direct keyword determination. If a relevant keyword appears in user information, the associated user is determined to be a target user.
  • the present disclosure provides a method and an apparatus for filtering user information.
  • the method takes into account not only specific keywords in the user information, but also related user-characteristic data (e.g., user activity data), and allows determining targeted user characteristics from multiple aspects of user activities.
  • the disclosed method adopts different filtering schemes for different types of targeted users to improve the recognition accuracy with respect to the target user information.
  • the method determines a suitable filtering scheme using a correspondence relationship between the filtering scheme and keywords and user-characteristic data.
  • the method uses modeling of one or more sample users and multiple candidate filtering schemes to formulate a targeted filtering scheme.
  • the method obtains keywords and user-characteristic data of a user, and selects a filtering scheme according to the correspondence relationship between the filtering scheme and the keywords and user-characteristic data.
  • the selected filtering scheme is used to filter the information of the user.
  • the filtering scheme is modeled to filter targeted users whose information has similar keywords and user-characteristic data.
  • the correspondence relationship between the filtering scheme and the keywords and user-characteristic data may be configured using a pre-stage (e.g., off-line) modeling procedure.
  • the procedure sets (e.g., extracts and configures) targeted keywords and user-characteristic data from a data collection of targeted users, and generates user-characteristic parameters of a target user based on the targeted keywords and user-characteristic data.
  • the procedure formulates the filtering scheme based on the filtered user-characteristic parameters, and establishes a correspondence relationship between the filtering scheme and the targeted keywords and user-characteristic data.
  • To generate user-characteristic parameters of the target user, the method identifies valid data in the set keywords and user-characteristic data, selects one or more sample users based on the valid data, and obtains the user-characteristic parameters of the target user based on the keywords and user-characteristic data of the one or more sample users.
  • the user-characteristic parameters of the target user may be obtained by generating useful variables such as aggregated variables, ratio variables and average variables.
  • An aggregated variable of the target user is generated based on a total frequency according to the keywords and user-characteristic data of the one or more sample users.
  • a ratio variable of the target user is generated based on a send/receive ratio according to the user-characteristic data of the one or more sample users.
  • An average variable is generated based on an average frequency according to the keywords and user-characteristic data of the one or more sample users.
  • the filtering scheme is modeled to filter targeted users using a pre-stage (e.g., off-line) procedure.
  • this procedure sets keywords and user-characteristic data from a data collection of targeted users, and generates user-characteristic parameters of a target user based on the extracted keywords and user-characteristic data.
  • the procedure selects one or more rule generation parameters from the user-characteristic parameters, and generates multiple filtering schemes based on the rule generation parameter.
  • the procedure selects the filtering scheme having the highest accuracy among the multiple filtering schemes to be the filtering scheme for the target user.
  • the method scores a user based on the selected filtering scheme.
  • the information of the user is filtered according to the filtering scheme only if the user's score exceeds a preset threshold.
  • the user-characteristic data may include user activity data, user information data and network user-characteristic data.
  • the apparatus has a configuration module, an acquisition module and a filtering module.
  • the configuration module is used for configuring a correspondence relationship between a targeted filtering scheme and targeted keywords and user-characteristic data from a data collection of targeted users.
  • the acquisition module is used for acquiring keywords and user-characteristic data of a present user.
  • the filtering module is used for selecting the targeted filtering scheme according to the correspondence relationship based on the acquired keywords and user-characteristic data of the present user, and for filtering information of the present user according to the targeted filtering scheme.
  • the configuration module may also be adapted to perform a pre-stage (e.g., off-line) modeling process to generate the targeted filtering scheme for the target user based on the filtered user-characteristic parameters.
  • the disclosed apparatus may employ a server computer to perform both the modeling and the filtering.
  • FIG. 1 shows a flow chart of an exemplary method for filtering user information in accordance with the present disclosure.
  • FIG. 2 shows a flow chart illustrating an exemplary process for configuring a correspondence relationship among keywords, user-characteristic data and the scheme for filtering target user information in accordance with the present disclosure.
  • FIG. 3 shows a flow chart of an exemplary method for filtering user information in accordance with the present disclosure.
  • FIG. 4 shows a structural diagram of an exemplary apparatus for filtering user information in accordance with the present disclosure.
  • the present disclosure provides a method and an apparatus for filtering user information.
  • the method takes into account not only specific keywords in the user information, but also other user-related information.
  • the filtering techniques disclosed herein allow determining the characteristics of a target user (e.g., a user who spreads information that is unwanted by others, or a user who illegally promulgates information) from multiple aspects of user activities, and process the target user accordingly to improve the recognition accuracy with respect to the target user, and to strengthen the information security.
  • FIG. 1 shows a flow chart of an exemplary method 100 for filtering user information in accordance with the present disclosure.
  • the order in which a process is described is not intended to be construed as a limitation, and any number of the described process blocks may be combined in any order to implement the method, or an alternate method.
  • the exemplary method 100 includes a procedure described as follows.
  • Block 101 configures a correspondence relationship between a filtering scheme and user keywords and user-characteristic data.
  • This stage is the preparation stage for filtering user information.
  • Various filtering schemes may be designed to target various types of targeted users. For example, one filtering scheme may be used to target users who circulate pornographic information, and another filtering scheme may be used to target users who send out spam information containing unsolicited advertisements.
  • a correspondence relationship is built between filtering schemes and various types of user keywords and user-characteristic data, such that a suitable filtering scheme can be selected according to the keywords and user-characteristic data of an actual user.
  • the preparation stage is a pre-stage which can be performed separately from the actual filtering application. Such pre-stage may even be done off-line using a data collection of the information of targeted users.
  • a filtering scheme may be defined by various variables (e.g., a keyword, a type of keywords, frequency, etc.) and their threshold values. Filtering schemes may differ from one another in various ways and to different degrees. For example, filtering schemes targeting different types of users may differ in the variables that define them. In comparison, filtering schemes targeting the same type of users may have the same or similar variables but different threshold values, which give the filtration a different emphasis (or a different effectiveness).
  • One exemplary filtering scheme performs filtering based on frequencies of one or more keywords. For example, if a keyword A appears N (N ≥ 1) times, the associated user information is then filtered. A correspondence relationship may be established between this exemplary filtering scheme and the keyword and its filtering requirement (i.e., frequency of the keyword).
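  • As a minimal sketch of the frequency-based scheme just described, the following Python fragment counts occurrences of a keyword across a user's messages and flags the user once the count reaches N. The function names, the case-insensitive substring matching and the sample messages are illustrative assumptions, not details taken from the disclosure.

```python
import re


def keyword_frequency(text: str, keyword: str) -> int:
    """Count case-insensitive, non-overlapping occurrences of keyword in text."""
    return len(re.findall(re.escape(keyword), text, flags=re.IGNORECASE))


def should_filter(messages: list[str], keyword: str, min_count: int) -> bool:
    """Return True when the keyword appears at least min_count (N >= 1) times."""
    if min_count < 1:
        raise ValueError("min_count must be at least 1")
    total = sum(keyword_frequency(message, keyword) for message in messages)
    return total >= min_count


# Hypothetical sample messages for one user.
messages = ["Download the catalog now", "new catalog online", "hello"]
flagged = should_filter(messages, "catalog", 2)
```

  • Here the correspondence relationship reduces to the pair (keyword, N); richer schemes would add further variables and thresholds.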
  • Block 102 obtains keywords and user-characteristic data of a user. This is the beginning of the actual filtration stage. The system analyzes the information of the present user in order to decide whether the information of the user needs to be filtered, and if yes, which filtering scheme should be used.
  • Block 103 selects the filtering scheme using the correspondence relationship based on the keywords and user-characteristic data of the user.
  • the correspondence relationship is configured at block 101 .
  • the system compares the keywords and user-characteristic data of the present user with the keywords and the user-characteristic data that correspond to each available filtering scheme and decides which one is suitable for the present user. In general, a closer match indicates a more suitable filtering scheme.
  • the filtering scheme and its correspondence relationship with the keywords and user-characteristic data are configured using a pre-stage modeling process represented by block 101 . Details of this modeling process are described further below.
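  • The selection step at block 103 can be illustrated with a small sketch. The disclosure does not specify how closeness of a match is measured, so the measure below (an equal-weight blend of keyword overlap and numeric closeness), the class name SchemeProfile, the min_match cutoff and all sample values are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class SchemeProfile:
    """Keywords and characteristic values that correspond to one filtering scheme."""
    name: str
    keywords: frozenset
    characteristics: dict  # e.g. {"send_receive_ratio": 20.0}


def keyword_overlap(a, b):
    """Jaccard similarity between two keyword sets (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def characteristic_closeness(user, profile):
    """Mean min/max ratio over the profile's characteristics (1.0 means identical)."""
    if not profile:
        return 0.0
    scores = []
    for key, target in profile.items():
        value = user.get(key, 0.0)
        high = max(abs(value), abs(target))
        scores.append(1.0 if high == 0 else min(abs(value), abs(target)) / high)
    return sum(scores) / len(scores)


def select_scheme(user_keywords, user_characteristics, profiles, min_match=0.5):
    """Return the closest-matching profile, or None if no match reaches min_match."""
    best, best_score = None, -1.0
    for profile in profiles:
        score = (0.5 * keyword_overlap(user_keywords, profile.keywords)
                 + 0.5 * characteristic_closeness(user_characteristics,
                                                  profile.characteristics))
        if score > best_score:
            best, best_score = profile, score
    return best if best_score >= min_match else None


profiles = [
    SchemeProfile("spam", frozenset({"discount", "buy", "free"}),
                  {"send_receive_ratio": 20.0}),
    SchemeProfile("adult", frozenset({"video", "adult", "download"}),
                  {"send_receive_ratio": 10.0}),
]
chosen = select_scheme(frozenset({"video", "download", "online"}),
                       {"send_receive_ratio": 12.0}, profiles)
```

  • Returning None when nothing matches well enough reflects the case in which the present user's information does not need to be filtered.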
  • FIG. 2 shows a flow chart illustrating an exemplary process 200 for configuring a correspondence relationship among keywords, user-characteristic data and the scheme for filtering target user information in accordance with the present disclosure.
  • Block 201 sets targeted keywords and user-characteristic data from a data collection of targeted users. This may involve extracting and configuring target keywords and user-characteristic data from the data collection of targeted users.
  • the user-characteristic data may include various types of data such as user activity data, user information data, and network user-characteristic data, as described below.
  • the user activity data includes one or more types of information including frequencies of characteristic word groups in the information sent by the user within a time frame, the number of times the user sends/receives information, and the amount of information the user sends/receives.
  • the user information data includes one or more types of information including initial login time of the user, user activities after login, and the number of contacts of the user.
  • the network user-characteristic data includes one or more types of information including the number of user IDs within the same IP, and the number of user IDs within the same machine ID.
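  • One possible way to hold the three kinds of user-characteristic data listed above is a set of small record types. The following sketch uses Python dataclasses; every class and field name is a hypothetical choice made for illustration and does not come from the disclosure.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class UserActivityData:
    # Frequencies of characteristic word groups within a time frame.
    keyword_group_frequencies: dict = field(default_factory=dict)
    messages_sent: int = 0
    messages_received: int = 0
    bytes_sent: int = 0
    bytes_received: int = 0


@dataclass
class UserInformationData:
    initial_login_time: Optional[datetime] = None
    activities_after_login: list = field(default_factory=list)
    contact_count: int = 0


@dataclass
class NetworkCharacteristicData:
    user_ids_on_same_ip: int = 0
    user_ids_on_same_machine: int = 0


@dataclass
class UserCharacteristicData:
    activity: UserActivityData = field(default_factory=UserActivityData)
    info: UserInformationData = field(default_factory=UserInformationData)
    network: NetworkCharacteristicData = field(default_factory=NetworkCharacteristicData)


# Hypothetical record for one user.
record = UserCharacteristicData()
record.activity.messages_sent = 120
record.activity.messages_received = 4
record.network.user_ids_on_same_ip = 7
```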
  • Block 202 generates user-characteristic parameters of a target user based on the user-characteristic data.
  • a target user does not have to be a specific user. Instead, a target user may be an abstract user representing a type of targeted users such as those who circulate pornographic information.
  • An exemplary process for generating such user-characteristic parameters is described as follows.
  • Valid data in the user-characteristic data are first identified. Specifically, upon obtaining enough data, data cleaning is required to eliminate some fields or records. For example, certain data contents are configured to be essential while other data contents are set to be non-essential according to user requirements. The non-essential data contents can be then eliminated, leaving only data contents that are essential.
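  • The data cleaning step can be sketched as a projection onto the fields configured as essential. The field names below, and the choice to drop records that lack a user ID, are assumptions for illustration only; which contents count as essential depends on user requirements.

```python
# Hypothetical configuration of essential fields.
ESSENTIAL_FIELDS = ("user_id", "message", "sent_at")


def keep_essential(records, essential=ESSENTIAL_FIELDS):
    """Drop non-essential fields, and drop records that cannot be attributed to a user."""
    cleaned = []
    for record in records:
        if record.get("user_id") is None:
            continue  # assumption: a record without a user ID is eliminated
        # Missing essential values are kept as None for later replacement.
        cleaned.append({name: record.get(name) for name in essential})
    return cleaned


raw = [
    {"user_id": "u1", "message": "hello", "sent_at": "2009-01-01T10:00",
     "client_skin": "blue"},
    {"message": "orphan record"},
    {"user_id": "u2", "message": "hi", "font": "serif"},
]
cleaned = keep_essential(raw)
```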
  • one or more sample users are selected from the available targeted users for modeling.
  • the information records in the data collection are sampled to determine a target user which is used for building a rule model.
  • the information records may include messages sent or promulgated by the targeted users.
  • User-characteristic parameters of the target user are obtained based on user-characteristic data of the sample users.
  • the user-characteristic parameters are characteristic properties of the target user.
  • the user-characteristic parameters are defined by straight variables (such as the frequency of the keyword) and derived variables. According to the modeling objective, derived variables are obtained using existing data in order to better understand activities of a user with a more comprehensive perspective. The derived variables are acquired based on performing combinational computation on multiple user-characteristic data.
  • derived variables may be designed and computed from the user-characteristic data. Several examples are described below.
  • An aggregated variable is a statistical result of all user-characteristic data.
  • the ratio variable embodies a proportional relationship of different statuses of the user-characteristic data of the target user.
  • the average variable embodies an average frequency of the user-characteristic data of the target user within a unit time.
  • the computation may use the user-characteristic data of the selected sample users only.
  • Block 203 filters anomalies (e.g., anomalous values) in the user-characteristic parameters.
  • An exemplary filtering process used here searches for variables that need to be eliminated and for missing values that need to be replaced.
  • the filtering process replaces any missing value in the user-characteristic parameters by a replacement value.
  • a replacement rule for missing data values is set.
  • a rule may be set to replace all missing values by zeros.
  • the filtering process also replaces an irregular value (a value that does not satisfy a format rule) by a regular value.
  • the filtering process may replace traditional characters by corresponding simplified characters, uppercases by lowercases, and SBC cases by DBC cases, etc., in all text information.
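  • The anomaly filtering at block 203 might look like the following sketch. It assumes that "SBC case" and "DBC case" refer to full-width and half-width character forms, and it approximates that conversion with Unicode NFKC normalization, which also folds other compatibility characters. The three-entry traditional-to-simplified table is a toy stand-in; a real system would need a complete mapping.

```python
import unicodedata

# Tiny illustrative traditional-to-simplified table (assumption, far from complete).
TRAD_TO_SIMP = str.maketrans({"電": "电", "視": "视", "頻": "频"})


def fill_missing(parameters, replacement=0):
    """Replace missing (None) parameter values by a replacement value such as zero."""
    return {name: (replacement if value is None else value)
            for name, value in parameters.items()}


def normalize_text(text):
    """Fold full-width forms, map traditional to simplified characters, and lowercase."""
    text = unicodedata.normalize("NFKC", text)  # full-width letters become ASCII
    text = text.translate(TRAD_TO_SIMP)
    return text.lower()


parameters = fill_missing({"keyword_types": None, "messages_sent": 3})
normalized = normalize_text("\uff21\uff36 視頻")  # full-width "AV" plus traditional text
```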
  • Block 204 generates a filtering scheme for the target user based on the filtered user-characteristic parameters.
  • Filtering schemes are generated using a modeling process described herein. Upon obtaining data that satisfy the requirements in the foregoing blocks, the process enters into a modeling stage to generate candidate filtering schemes and select the best performing candidate filtering schemes as targeted filtering schemes.
  • Modeling includes tasks such as selecting a suitable algorithm, selecting suitable parameters, formulating a model verification scheme, formulating a data sampling plan, and configuring model parameters.
  • An exemplary modeling process selects one or more user-characteristic parameters from the filtered user-characteristic parameters to be rule generation parameters, and generates multiple filtering schemes by adjusting the rule generation parameters.
  • the process then selects, from the multiple filtering schemes, the filtering scheme that has demonstrated the highest accuracy in testing to be the filtering scheme of the target user.
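  • To illustrate generating several candidate schemes and keeping the most accurate one, the sketch below treats a single keyword-frequency threshold as the rule generation parameter and evaluates each candidate on labeled samples. The sample data, the candidate values and the use of plain classification accuracy are assumptions; the disclosure does not fix a particular algorithm or accuracy measure.

```python
def accuracy(threshold, samples):
    """Fraction of samples classified correctly by the rule 'frequency >= threshold'.

    samples is a list of (keyword_frequency, is_targeted_user) pairs.
    """
    correct = sum((frequency >= threshold) == label for frequency, label in samples)
    return correct / len(samples)


def select_best_threshold(candidates, samples):
    """Return the candidate threshold with the highest accuracy on the samples."""
    return max(candidates, key=lambda threshold: accuracy(threshold, samples))


# Hypothetical labeled test samples: (keyword frequency, known targeted user?).
samples = [(0, False), (1, False), (2, False), (4, True), (6, True), (3, True)]
candidates = [1, 2, 3, 4, 5]  # each value defines one candidate filtering scheme
best = select_best_threshold(candidates, samples)
```

  • In practice the accuracy would normally be measured on samples held out from those used to build the candidates.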
  • Modeling and data preparation may be performed interactively.
  • An initial result of modeling may produce new requirements for data preparation, while a result of data preparation may directly affect model construction.
  • Pattern rules of the target user are formed using the above process. Furthermore, in a practical application, the system may score a user based on the filtering scheme of the target user. If the score of the user exceeds a preset threshold, information of the user is filtered to monitor and ensure the network security.
  • the filtering scheme may be applied in information filtering tasks in various interactive networked processes implemented for information communication. Examples of such processes include email, forums, and instant messaging. The scope of the present disclosure covers such applications using the filtering scheme in these processes.
  • FIG. 3 shows an example process 300 of filtering user information of users who circulate pornographic information.
  • the system analyzes chat messages of these targeted users, and discovers implicit models for the information sent by these targeted users who circulate pornographic information.
  • the models are obtained through data mining modeling.
  • a filtering scheme for a target user who circulates pornographic information is generated using a generating scheme.
  • the target user is an abstract user representing a certain type of targeted users.
  • the filtering scheme thus generated is then used to monitor the information of this type of users.
  • Block 301 sets the user-characteristic data of targeted users through analyzing the targeted information.
  • the targeted information may be a data collection of the targeted users (e.g., messages sent by users who are known to be circulating pornographic information).
  • the user-characteristic data is set by extracting and configuring the targeted information.
  • the user-characteristic data may include user activity data, user information data, and network user-characteristic data. Specific scopes and results of the data setting are described as follows.
  • Setting of the user activity data includes:
  • Frequencies of keywords such as catalog, movie, channel, video, animation, cartoon, picture, show, watch, download, online, pornographic, erotic, sexual, passion, adult, ethics, porn star, classics, R-rating, “A” film, uncoded, clear, AV, etc.
  • the user-characteristic parameters of the target user (i.e., a representative user who sends pornographic information in the illustrated example) reflect whether a user is sending pornographic information.
  • the target user required in modeling is found through analysis and screening. An exemplary process is described as follows.
  • Block 302 identifies the valid data in the user-characteristic data, and eliminates the invalid variables and observations.
  • Block 303 selects one or more sample users, and determines a target object (a target user).
  • This process sets a user who sends pornographic information as a model object, and samples and extracts the communicated information records of this type of users to determine a target user for building the model.
  • Examples of the communicated information records include chat records, message records, and email records.
  • Block 304 computes derived variables.
  • this process computes the derived variables using the above data in order to understand the customer activities of the target user more comprehensively.
  • three primary types of derived variables are used in modeling: aggregated variable, ratio variable, and average variable. These derived variables are described further below.
  • An example of an aggregated variable is the number of targeted keyword types that appear. For example, if information contains keywords “AV”, “porn star”, and “R-rating”, the number of targeted keyword types that appeared is three, and therefore the value of the corresponding aggregated variable is three.
  • Another example of an aggregated variable is the number of keyword appearances measured by keyword groups. For example, keywords such as “watch”, “download”, and “online” are categorized into the same homogeneous group, and the total number of times these keywords appear is computed for the same group.
  • An example of a ratio variable is the send/receive ratio, such as a ratio between the number of times information is sent and the number of times information is received, or a ratio between the number of bytes of information sent and the number of bytes of information received.
  • This type of variable may be indicative of a certain type of users who send out large quantities of information but seldom receive information from others.
  • An example of an average variable is the average frequency of a keyword type's appearance. This can be measured by the frequency of a keyword divided by the total frequency of all keywords.
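  • The three kinds of derived variables described for block 304 can be computed as in the following sketch. The keyword groups are neutral placeholders, the tokenizer only handles lowercase Latin words, and dividing by at least one received message to avoid division by zero is an assumption of this example rather than something the disclosure states.

```python
import re
from collections import Counter

# Hypothetical keyword groups; keywords in one group are counted together.
KEYWORD_GROUPS = {
    "access": {"watch", "download", "online"},
    "media": {"video", "movie", "picture"},
}


def keyword_counts(messages, keywords):
    """Count how often each targeted keyword appears across the messages."""
    counts = Counter()
    for message in messages:
        for token in re.findall(r"[a-z]+", message.lower()):
            if token in keywords:
                counts[token] += 1
    return counts


def derived_variables(messages, sent, received):
    """Compute aggregated, ratio and average variables for one user."""
    all_keywords = set().union(*KEYWORD_GROUPS.values())
    counts = keyword_counts(messages, all_keywords)
    total = sum(counts.values())
    return {
        # Aggregated: number of distinct targeted keyword types that appear.
        "keyword_types": len(counts),
        # Aggregated: appearances summed per keyword group.
        "group_counts": {group: sum(counts[k] for k in members)
                         for group, members in KEYWORD_GROUPS.items()},
        # Ratio: messages sent per message received.
        "send_receive_ratio": sent / max(received, 1),
        # Average: each keyword's frequency divided by the total frequency.
        "keyword_share": {k: c / total for k, c in counts.items()} if total else {},
    }


variables = derived_variables(
    ["Watch the video online", "download video", "hi"], sent=50, received=2)
```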
  • Block 305 filters user-characteristic parameters.
  • a variable which has a missing value is replaced according to a replacement rule for missing values of data. For example, a missing value is replaced by zero.
  • an exemplary data cleaning step may convert all text information from traditional Chinese characters to simplified Chinese characters, uppercases to lowercases, and SBC cases to DBC cases.
  • Block 306 generates a filtering scheme of the target user based on the filtered user-characteristic parameters.
  • Modeling includes such tasks as selecting a suitable algorithm, selecting suitable parameters, formulating a model verification scheme, formulating a data sampling plan, and configuring model parameters.
  • Modeling and data preparation are interactive processes. An initial result of modeling may produce new requirements for data preparation, while a result of data preparation may directly affect model construction.
  • the model satisfies an accuracy requirement. From all models that have satisfied the accuracy requirement, one or more models having the highest accuracy are selected to be the filtering scheme(s) for the target user. That is, the selected filtering scheme(s) are to be used for filtering users who promulgate pornographic information.
  • the disclosed method achieves time-division monitoring and collecting of the records of the communicated information.
  • the data mining and modeling may be a pre-stage preparation carried out at the backend. This may even be done off-line.
  • the system uses the targeted filtering scheme generated by the modeling to give each user a score. If a user's score exceeds a preset threshold, the system concludes that the user is promulgating pornographic information, and therefore applies the filtering scheme to filter the information of the user.
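  • A scoring step of this kind could be sketched as follows. The disclosure does not say how the score is computed, so the weighted-rule form, the rule values, the variable names and the threshold of 0.6 are all assumptions chosen only to make the example concrete.

```python
# Hypothetical scheme: (variable name, minimum value, weight) for each rule.
RULES = [
    ("keyword_types", 3, 0.5),
    ("send_receive_ratio", 10.0, 0.3),
    ("user_ids_on_same_ip", 5, 0.2),
]
SCORE_THRESHOLD = 0.6  # assumed preset threshold


def score_user(variables, rules=RULES):
    """Sum the weights of the rules whose variable reaches its minimum value."""
    return sum(weight for name, minimum, weight in rules
               if variables.get(name, 0) >= minimum)


def needs_filtering(variables, threshold=SCORE_THRESHOLD):
    """Filter the user's information only if the score exceeds the threshold."""
    return score_user(variables) > threshold


suspect = {"keyword_types": 4, "send_receive_ratio": 25.0, "user_ids_on_same_ip": 1}
ordinary = {"keyword_types": 1, "send_receive_ratio": 12.0}
```

  • With these sample values the first user scores 0.8 and is filtered, while the second scores 0.3 and is left alone.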
  • the system may apply other relevant controlling measures to penalize the user accordingly. For instance, the system may enlist the user into a monitoring system. A supervising network security officer may then determine, from a business point of view, whether the user who has been enlisted into the monitoring system has met the penalty requirements, and penalize the user if the user meets the penalty requirements.
  • FIG. 4 shows a structural diagram of an exemplary apparatus 400 for filtering user information in accordance with the present disclosure.
  • Information filtering apparatus 400 includes a configuration module 410 , an acquisition module 420 and a filtering module 430 .
  • the configuration module 410 is used for configuring a correspondence relationship between a filtering scheme and the keywords and user-characteristic data of targeted users.
  • the acquisition module 420 is used for acquiring keywords and user-characteristic data of a present user.
  • the filtering module 430 is used for finding and selecting the targeted filtering scheme according to the correspondence relationship using the keywords and user-characteristic data of the present user, and for further filtering information of the target user according to the selected filtering scheme.
  • a parameter generation sub-module 412 is used for setting the keywords and user-characteristic data of information of targeted users, and generating user-characteristic parameters of a representative target user based on the set keywords and user-characteristic data.
  • a filtering sub-module 414 is used for filtering out anomalies in the user-characteristic parameters that are generated by the parameter generation sub-module 412 .
  • a rule generation sub-module 416 is used for generating the filtering scheme for the target user based on the filtered user-characteristic parameters.
  • the parameter generation sub-module 412 may use various units (not shown) to perform various functions.
  • a recognition unit (not shown) may be used for recognizing valid data in the user-characteristic data
  • a selection unit may be used for selecting one or more sample users based on the valid data recognized by the recognition unit
  • a computation unit is used for computing the user-characteristic parameters of the target user based on user-characteristic data of the one or more sample users selected by the selection unit.
  • the filtering sub-module 414 may use various units (not shown) to perform various functions. For example, a first filtering unit is used for replacing a missing value in the user-characteristic parameters by a replacement value; and a second filtering unit is used for replacing an irregular value in the user-characteristic parameters by a regular value.
  • the rule generation sub-module 416 may use various units (not shown) to perform its own related functions. For example, a parameter selection unit is used for selecting the one or more user-characteristic parameters obtained to be rule generation parameter(s); a rule computing unit is used for generating multiple filtering schemes based on the rule generation parameters selected by the parameter selection unit; and a rule selection unit is used for selecting a filtering scheme having the highest accuracy from the filtering schemes to be the filtering scheme of the target user.
  • a searching sub-module 434 is used for finding the targeted filtering scheme according to the correspondence relationship using the keywords and user-characteristic data of a present user.
  • a filtering sub-module 436 is used for filtering the target user information based on the selected filtering scheme.
  • a determination sub-module 432 is used for scoring a present user based on the targeted filtering scheme of the target user, and triggering the filtering sub-module 436 if the score of the present user exceeds a preset threshold.
  • a “module” or a “unit” in general refers to a functionality designed to perform a particular task or function.
  • a module or unit can be a device which is a tool or machine designed to perform a particular task or function.
  • a module or unit can be a piece of hardware, software, a plan or scheme, or a combination thereof, for effectuating a purpose associated with the particular task or function.
  • the functions of multiple modules or units may be achieved using a single device.
  • configuration module 410, acquisition module 420 and filtering module 430 may either be implemented in an integrated machine (a server or use of a system) or each in a separate machine (a computer or a server).
  • configuration module 410 is implemented in a first machine
  • acquisition module 420 and filtering module 430 are implemented in a second machine which is connected to (but not necessarily integrated with) the first machine.
  • the exemplary embodiments disclosed herein may have the following advantages. Because the method establishes filtering rules of a target user to determine user activities based on specific keywords and other user related information, characteristics of user activities can be determined from multiple aspects, and processed accordingly to improve recognition accuracy with respect to a certain type of targeted users and improve the efficiency of information security.
  • the disclosed method and system may be implemented using software and universal hardware platform, or can be implemented using hardware only. However, in many instances, implementation using a combination of software and hardware is preferred. Based on this understanding, the technical schemes of the present disclosure, or portions contributing to existing technologies, may be implemented in the form of software products which are stored in a storage media.
  • the software includes instructions for a computing device (e.g., a cell phone, a personal computer, a server or a networked device) to execute the method described in the exemplary embodiments of the current disclosure.

Abstract

A method for filtering user information takes into account not only specific keywords in the user information, but also related user-characteristic data (e.g., user activity data), and allows targeted user characteristics to be determined from multiple aspects of user activities. In one aspect, the disclosed method adopts different filtering schemes for different types of targeted users to improve the recognition accuracy with respect to the target user information. The method determines a suitable filtering scheme using a correspondence relationship between the filtering scheme and keywords and user-characteristic data. The method uses modeling of sample users and multiple candidate filtering schemes to formulate the targeted filtering scheme. An apparatus for implementing the method is also disclosed.

Description

    RELATED APPLICATIONS
  • This application is a national stage application of international patent application PCT/US09/48817 filed Jun. 26, 2009, entitled “FILTERING INFORMATION USING TARGETED FILTERING SCHEMES” which claims priority from Chinese patent application, Application No. 200810126362.X, filed Jun. 26, 2008, entitled “METHOD AND APPARATUS FOR FILTERING USER INFORMATION”, which applications are hereby incorporated in their entirety by reference.
  • TECHNICAL FIELD
  • The present disclosure relates to the field of information security technologies, and particularly to methods and apparatuses for filtering user information.
  • BACKGROUND
  • The development of digital technologies has resulted in a tremendous amount of data being generated everywhere. The large daily volume of transaction data of a bank is but one example. Data processing to obtain beneficial information has been a rewarding experience. Rapid development of computer technology has opened up a variety of possibilities for data processing, and promoted great leaps in the development of database technology. Faced with the ever increasing data volume, however, people are becoming increasingly dissatisfied with the query function of a database. One basic question has emerged: Can we obtain from the data real information or knowledge that is actually useful for decision making? The conventional database technology, which is only good at data management, has become powerless in the face of this basic question, and so has conventional statistical technology, which faces its own great challenges. Therefore, a new method for processing a massive amount of data is urgently needed.
  • At the same time, methods for promulgating information by users through the Internet have become more and more effective and comprehensive. Examples of methods for promulgating information include using instant messaging tools, sending various kinds of information by email, or posting information on a forum on a network. However, part of this circulated information may be undesirable to a user, or information may be illegally promulgated and need to be filtered. Existing methods for filtering user information are based on direct keyword determination. If a relevant keyword appears in user information, the associated user is determined to be a target user.
  • However, the existing technical scheme only matches information using keywords and does not analyze the information or user characteristics from other aspects, and therefore may incur a high false-alarm rate. For example, if “win a prize” is used as a keyword to filter fake prize-winning advertisements, and if something like “I won a prize today” appears in a message of a user, a system may falsely conclude that what the user is sending is a fake prize-winning advertisement, and therefore filter out the message of the user, causing the user to fail to perform related normal operations such as chatting and leaving comments.
  • SUMMARY
  • The present disclosure provides a method and an apparatus for filtering user information. The method takes into account not only specific keywords in the user information, but also related user-characteristic data (e.g., user activity data), and allows determining targeted user characteristics from multiple aspects of user activities. The disclosed method adopts different filtering schemes for different types of targeted users to improve the recognition accuracy with respect to the target user information. The method determines a suitable filtering scheme using a correspondence relationship between the filtering scheme and keywords and user-characteristic data. The method uses modeling of one or more sample users and multiple candidate filtering schemes to formulate targeted filtering scheme.
  • In one embodiment, the method obtains keywords and user-characteristic data of a user, and selects a filtering scheme according to the correspondence relationship between the filtering scheme and the keywords and user-characteristic data. The selected filtering scheme is used to filter the information of the user. The filtering scheme is modeled to filter targeted users whose information has similar keywords and user-characteristic data.
  • The correspondence relationship between the filtering scheme and the keywords and user-characteristic data may be configured using a pre-stage (e.g., off-line) modeling procedure. In one embodiment, the procedure sets (e.g., extracts and configures) targeted keywords and user-characteristic data from a data collection of targeted users, and generates user-characteristic parameters of a target user based on the targeted keywords and user-characteristic data. After filtering out anomalies in the user-characteristic parameters, the procedure formulates the filtering scheme based on the filtered user-characteristic parameters, and establishes a correspondence relationship between the filtering scheme and the targeted keywords and user-characteristic data.
  • To generate user-characteristic parameters of the target user, the method identifies valid data in the set keywords and user-characteristic data, selects one or more sample users based on the valid data, and obtains the user-characteristic parameters of the target user based on the keywords and user-characteristic data of the one or more sample users.
  • The user-characteristic parameters of the target user may be obtained by generating useful variables such as aggregated variables, ratio variables and average variables. An aggregated variable of the target user is generated based on a total frequency according to the keywords and user-characteristic data of the one or more sample users. A ratio variable of the target user is generated based on a send/receive ratio according to the user-characteristic data of the one or more sample users. An average variable is generated based on an average frequency according to the keywords and user-characteristic data of the one or more sample users.
  • The filtering scheme is modeled to filter targeted users using a pre-stage (e.g., off-line) procedure. In one embodiment, this procedure sets keywords and user-characteristic data from a data collection of targeted users, and generates user-characteristic parameters of a target user based on the extracted keywords and user-characteristic data. The procedure then selects one or more rule generation parameters from the user-characteristic parameters, and generates multiple filtering schemes based on the rule generation parameters. The procedure selects the filtering scheme having the highest accuracy among the multiple filtering schemes to be the filtering scheme for the target user.
  • In an exemplary implementation, the method scores a user based on the selected filtering scheme. The information of the user is filtered according to the filtering scheme only if the user's score exceeds a preset threshold.
  • The user-characteristic data may include user activity data, user information data and network user-characteristic data.
  • Another aspect of the disclosure is an apparatus for filtering user information. The apparatus has a configuration module, an acquisition module and a filtering module. The configuration module is used for configuring a correspondence relationship between a targeted filtering scheme and targeted keywords and user-characteristic data from a data collection of targeted users. The acquisition module is used for acquiring keywords and user-characteristic data of a present user. The filtering module is used for selecting the targeted filtering scheme according to the correspondence relationship based on the acquired keywords and user-characteristic data of the present user, and for filtering information of the present user according to the targeted filtering scheme.
  • The configuration module may also be adapted to perform a pre-stage (e.g., off-line) modeling process to generate the targeted filtering scheme for the target user based on the filtered user-characteristic parameters.
  • The disclosed apparatus may employ a server computer to perform both the modeling and the filtering.
  • This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
  • DESCRIPTION OF DRAWINGS
  • The detailed description is described with reference to the accompanying figures. In the figures, the use of the same reference numbers in different figures indicates similar or identical items.
  • FIG. 1 shows a flow chart of an exemplary method for filtering user information in accordance with the present disclosure.
  • FIG. 2 shows a flow chart illustrating an exemplary process for configuring a correspondence relationship among keywords, user-characteristic data and the scheme for filtering target user information in accordance with the present disclosure.
  • FIG. 3 shows a flow chart of an exemplary method for filtering user information in accordance with the present disclosure.
  • FIG. 4 shows a structural diagram of an exemplary apparatus for filtering user information in accordance with the present disclosure.
  • DETAILED DESCRIPTION
  • The present disclosure provides a method and an apparatus for filtering user information. The method takes into account not only specific keywords in the user information, but also other user-related information. Using the filtering techniques disclosed herein, the method and apparatus allow determining the characteristics of a target user (e.g., a user who spreads information that is unwanted by others, or a user who illegally promulgates information) from multiple aspects of user activities, and process the target user accordingly to improve the recognition accuracy with respect to the target user, and to strengthen the information security.
  • FIG. 1 shows a flow chart of an exemplary method 100 for filtering user information in accordance with the present disclosure. In this description, the order in which a process is described is not intended to be construed as a limitation, and any number of the described process blocks may be combined in any order to implement the method, or an alternate method. As shown in FIG. 1, the exemplary method 100 includes a procedure described as follows.
  • Block 101 configures a correspondence relationship between a filtering scheme and user keywords and user-characteristic data. This stage is the preparation stage for filtering user information. Various filtering schemes may be designed to target various types of targeted users. For example, one filtering scheme may be used to target users who circulate pornographic information, and another filtering scheme may be used to target users who send out spam information containing unsolicited advertisements. During the preparation stage represented by block 101, a correspondence relationship is built between filtering schemes and various types of user keywords and user-characteristic data, such that a suitable filtering scheme can be selected according to the keywords and user-characteristic data of an actual user. As will be described in further detail herein, the preparation stage is a pre-stage which can be performed separately from the actual filtering application. Such pre-stage may even be done off-line using a data collection of the information of targeted users.
  • In the present disclosure, a filtering scheme may be defined by various variables (e.g., a keyword, a type of keywords, frequency, etc.) and their threshold values. Filtering schemes may differ from one another in various ways and to different degrees. For example, filtering schemes targeting different types of users may differ from each other in the variables which define them. In comparison, filtering schemes targeting the same type of users may have the same or similar variables but different threshold values, which provide different emphases or flavors (or different effectiveness) of filtration.
  • One exemplary filtering scheme performs filtering based on frequencies of one or more keywords. For example, if a keyword A appears N (N≧1) times, associated user information is then filtered. A correspondence relationship may be established between this exemplary filtering scheme and the keyword and its filtering requirement (i.e., frequency of the keyword).
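  • The frequency-based scheme above can be sketched as follows. This is a minimal illustration only: the function names `keyword_frequencies` and `should_filter`, the case-insensitive substring matching, and the sample message are assumptions made for the example and are not specified in the disclosure.

```python
import re
from collections import Counter


def keyword_frequencies(message: str, keywords: list[str]) -> Counter:
    """Count case-insensitive, non-overlapping occurrences of each keyword."""
    text = message.lower()
    counts: Counter = Counter()
    for kw in keywords:
        counts[kw] = len(re.findall(re.escape(kw.lower()), text))
    return counts


def should_filter(message: str, keyword: str, n: int = 1) -> bool:
    """Return True when `keyword` appears at least `n` times (n >= 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return keyword_frequencies(message, [keyword])[keyword] >= n


# Hypothetical message: the keyword appears twice, so a scheme with N = 2 fires.
sample = "Win a prize! Click here to win a prize now."
print(should_filter(sample, "win a prize", n=2))
```

  • A pure keyword count of this kind is exactly what the later blocks supplement with user-characteristic data, since frequency alone cannot separate an advertisement from ordinary conversation.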
  • Block 102 obtains keywords and user-characteristic data of a user. This is the beginning of the actual filtration stage. The system analyzes the information of the present user in order to decide whether the information of the user needs to be filtered, and if yes, which filtering scheme should be used.
  • Block 103 selects the filtering scheme using the correspondence relationship based on the keywords and user-characteristic data of the user. The correspondence relationship is configured at block 101. The system compares the keywords and user-characteristic data of the present user with the keywords and the user-characteristic data that correspond to each available filtering scheme and decides which one is suitable for the present user. In general, a closer match indicates a more suitable filtering scheme.
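  • One possible way to realize the "closer match" selection is sketched below. The disclosure does not say how closeness is measured; the Jaccard overlap of keyword sets, the `CORRESPONDENCE` table, the scheme names and the `min_similarity` cutoff are all assumptions introduced for illustration, and user-characteristic data is left out for brevity.

```python
from typing import Optional


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical correspondence relationship: scheme name -> targeted keywords.
CORRESPONDENCE: dict[str, set[str]] = {
    "spam_ads": {"prize", "free", "discount"},
    "adult_content": {"video", "download", "adult"},
}


def select_scheme(
    user_keywords: set[str],
    correspondence: dict[str, set[str]],
    min_similarity: float = 0.2,
) -> Optional[str]:
    """Return the best-matching scheme name, or None if nothing is close enough."""
    best_name, best_sim = None, 0.0
    for name, targeted in correspondence.items():
        sim = jaccard(user_keywords, targeted)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name if best_sim >= min_similarity else None


print(select_scheme({"free", "prize", "hello"}, CORRESPONDENCE))
```

  • Returning `None` models the case in which the present user's information does not need to be filtered at all.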
  • The filtering scheme and its correspondence relationship with the keywords and user-characteristic data are configured using a pre-stage modeling process represented by block 101. Detail of this modeling process will be described further below.
  • An exemplary procedure for Block 101 includes a process illustrated in FIG. 2. FIG. 2 shows a flow chart illustrating an exemplary process 200 for configuring a correspondence relationship among keywords, user-characteristic data and the scheme for filtering target user information in accordance with the present disclosure.
  • Block 201 sets targeted keywords and user-characteristic data from a data collection of targeted users. This may involve extracting and configuring targeted keywords and user-characteristic data from the data collection of targeted users. The user-characteristic data may include various types of data such as user activity data, user information data, and network user-characteristic data, as described below.
  • The user activity data includes one or more types of information including frequencies of characteristic word groups in the information sent by the user within a time frame, the number of times the user sends/receives information, and the amount of information the user sends/receives.
  • The user information data includes one or more types of information including initial login time of the user, user activities after login, and the number of contacts of the user.
  • The network user-characteristic data includes one or more types of information including the number of user IDs within the same IP, and the number of user IDs within the same machine ID.
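  • The three kinds of user-characteristic data listed above could be held in a record such as the following. This is a sketch of one possible data layout; every class and field name is hypothetical and chosen only to mirror the wording of the description.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class UserActivityData:
    # Frequencies of characteristic word groups within a time frame.
    word_group_frequencies: dict[str, int] = field(default_factory=dict)
    messages_sent: int = 0
    messages_received: int = 0
    bytes_sent: int = 0
    bytes_received: int = 0


@dataclass
class UserInformationData:
    initial_login_time: Optional[str] = None  # e.g. an ISO 8601 timestamp
    activities_after_login: list[str] = field(default_factory=list)
    contact_count: int = 0


@dataclass
class NetworkCharacteristicData:
    user_ids_on_same_ip: int = 0
    user_ids_on_same_machine_id: int = 0


@dataclass
class UserCharacteristicData:
    activity: UserActivityData = field(default_factory=UserActivityData)
    information: UserInformationData = field(default_factory=UserInformationData)
    network: NetworkCharacteristicData = field(default_factory=NetworkCharacteristicData)


user = UserCharacteristicData()
user.activity.messages_sent = 120
user.network.user_ids_on_same_ip = 15
print(user)
```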
  • Block 202 generates user-characteristic parameters of a target user based on the user-characteristic data. During the modeling stage, a target user does not have to be a specific user. Instead, a target user may be an abstract user representing a type of targeted users such as those who circulate pornographic information. An exemplary process for generating such user-characteristic parameters is described as follows.
  • Valid data in the user-characteristic data are first identified. Specifically, upon obtaining enough data, data cleaning is required to eliminate some fields or records. For example, certain data contents are configured to be essential while other data contents are set to be non-essential according to user requirements. The non-essential data contents can be then eliminated, leaving only data contents that are essential.
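  • The data cleaning step can be illustrated with a short sketch. The field names in `ESSENTIAL_FIELDS` and the policy of dropping records that lack an essential field are assumptions; the disclosure only states that some fields or records are eliminated according to user requirements.

```python
# Hypothetical set of fields configured as essential.
ESSENTIAL_FIELDS = frozenset({"user_id", "messages_sent", "messages_received"})


def clean_records(records: list[dict], essential=ESSENTIAL_FIELDS) -> list[dict]:
    """Keep only essential fields; drop records missing any of them."""
    cleaned = []
    for record in records:
        if not essential.issubset(record):  # iterating a dict yields its keys
            continue
        cleaned.append({name: record[name] for name in sorted(essential)})
    return cleaned


raw = [
    {"user_id": "u1", "messages_sent": 40, "messages_received": 4, "buddies": 7},
    {"user_id": "u2", "messages_sent": 3},  # incomplete record, dropped
]
print(clean_records(raw))
```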
  • Based on the above-identified valid data, one or more sample users are selected from the available targeted users for modeling. The information records in the data collection are sampled to determine a target user which is used for building a rule model. The information records may include messages sent or promulgated by the targeted users.
  • User-characteristic parameters of the target user are obtained based on user-characteristic data of the sample users. The user-characteristic parameters are characteristic properties of the target user. The user-characteristic parameters are defined by straight variables (such as the frequency of the keyword) and derived variables. According to the modeling objective, derived variables are obtained using existing data in order to better understand activities of a user with a more comprehensive perspective. The derived variables are acquired based on performing combinational computation on multiple user-characteristic data.
  • Different kinds of derived variables may be designed and computed from the user-characteristic data. Several examples are described below.
  • (1) Compute the total frequency of the user-characteristic data, and generate an aggregated variable of the target user. An aggregated variable is a statistical result of all user-characteristic data.
  • (2) Compute a send/receive ratio of the information having the user-characteristic data, and create a ratio variable of the target user. The ratio variable embodies a proportional relationship of different statuses of the user-characteristic data of the target user.
  • (3) Compute an average frequency of the user-characteristic data, and create an average variable of the target user. The average variable embodies an average frequency of the user-characteristic data of the target user within a unit time.
  • In the above, the computation may use the user-characteristic data of the selected sample users only.
  • Block 203 filters anomalies (e.g., anomalous values) in the user-characteristic parameters.
  • An exemplary filtering process used here searches for variables that need to be eliminated and missing values that need to be replaced. Specifically, the filtering process replaces any missing value in the user-characteristic parameters by a replacement value. To do this, a replacement rule for missing values in the data is set. For example, a rule may be set to replace all missing values by zeros. The filtering process also replaces an irregular value (a value that does not satisfy a format rule) by a regular value. For example, the filtering process may replace traditional characters by corresponding simplified characters, uppercases by lowercases, and SBC cases by DBC cases, etc., in all text information.
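  • The missing-value rule can be sketched as below. Treating both `None` and NaN as "missing", and the parameter names in the example, are assumptions; the zero replacement value follows the example rule given in the text.

```python
import math


def is_missing(value) -> bool:
    """Treat None and floating-point NaN as missing (an assumed convention)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def replace_missing(params: dict, replacement=0.0) -> dict:
    """Return a copy of `params` with every missing value replaced."""
    return {
        name: (replacement if is_missing(value) else value)
        for name, value in params.items()
    }


# Hypothetical user-characteristic parameters with two missing entries.
raw_params = {"keyword_types": 3, "send_receive_ratio": None, "avg_freq": float("nan")}
print(replace_missing(raw_params))
```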
  • Block 204 generates a filtering scheme for the target user based on the filtered user-characteristic parameters.
  • Filtering schemes are generated using a modeling process described herein. Upon obtaining data that satisfy the requirements in the foregoing blocks, the process enters into a modeling stage to generate candidate filtering schemes and select the best performing candidate filtering schemes as targeted filtering schemes. Modeling includes tasks such as selecting a suitable algorithm, selecting suitable parameters, formulating a model verification scheme, formulating a data sampling plan, and configuring model parameters.
  • An exemplary modeling process selects one or more user-characteristic parameters from the filtered user-characteristic data to be rule generation parameters, and generates multiple filtering schemes by adjusting the rule generation parameters. The process then selects the filtering scheme that has demonstrated the highest accuracy in testing among the multiple filtering schemes to be the filtering scheme of the target user.
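  • The generate-then-select step can be sketched as follows. The disclosure does not fix a modeling algorithm, so this example assumes the simplest possible candidate: a single parameter compared against a threshold. The `Scheme` class, the parameter names and the labeled `SAMPLES` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scheme:
    """A candidate scheme: flag a user when `parameter` >= `threshold`."""
    parameter: str
    threshold: float

    def flags(self, params: dict) -> bool:
        return params.get(self.parameter, 0.0) >= self.threshold


def accuracy(scheme: Scheme, samples: list) -> float:
    """Fraction of labeled samples the scheme classifies correctly."""
    correct = sum(scheme.flags(params) == label for params, label in samples)
    return correct / len(samples)


def best_scheme(parameters: list, thresholds: list, samples: list) -> Scheme:
    """Generate one candidate per (parameter, threshold) and keep the most accurate."""
    candidates = [Scheme(p, t) for p in parameters for t in thresholds]
    return max(candidates, key=lambda s: accuracy(s, samples))


# Hypothetical labeled sample users: (parameters, is_targeted_user).
SAMPLES = [
    ({"keyword_types": 3, "send_receive_ratio": 9.0}, True),
    ({"keyword_types": 2, "send_receive_ratio": 6.0}, True),
    ({"keyword_types": 1, "send_receive_ratio": 1.2}, False),
    ({"keyword_types": 0, "send_receive_ratio": 0.8}, False),
]

best = best_scheme(["keyword_types", "send_receive_ratio"], [1, 2, 5], SAMPLES)
print(best, accuracy(best, SAMPLES))
```

  • When several candidates tie, `max` keeps the first one generated; a real system would likely break ties on held-out data rather than on generation order.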
  • Modeling and data preparation may be performed interactively. An initial result of modeling may produce new requirements for data preparation, while a result of data preparation may directly affect model construction.
  • Pattern rules of the target user are formed using the above process. Furthermore, in a practical application, the system may score a user based on the filtering scheme of the target user. If the score of the user exceeds a preset threshold, information of the user is filtered to monitor and ensure the network security.
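  • The score-and-threshold step might look like the sketch below. The disclosure does not describe how the score is computed; the weighted sum, the `WEIGHTS` values and the threshold of 5.0 are assumptions used only to make the control flow concrete.

```python
# Hypothetical weights that a modeling stage might attach to each parameter.
WEIGHTS = {"keyword_types": 2.0, "send_receive_ratio": 0.5}
THRESHOLD = 5.0  # assumed preset threshold


def score_user(params: dict, weights: dict = WEIGHTS) -> float:
    """Weighted sum of the user's parameters; absent parameters count as 0."""
    return sum(weight * params.get(name, 0.0) for name, weight in weights.items())


def needs_filtering(params: dict, threshold: float = THRESHOLD) -> bool:
    """Filter the user's information only when the score exceeds the threshold."""
    return score_user(params) > threshold


suspicious = {"keyword_types": 3, "send_receive_ratio": 8.0}
ordinary = {"keyword_types": 0, "send_receive_ratio": 1.0}
print(score_user(suspicious), needs_filtering(suspicious))
print(score_user(ordinary), needs_filtering(ordinary))
```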
  • The filtering scheme may be applied in information filtering tasks in various interactive networked processes implemented for information communication. Examples of such processes include email, forums, and instant messaging. The scope of the present disclosure covers such applications using the filtering scheme in these processes.
  • Exemplary implementations of the present disclosure are described in further detail using accompanying figures and exemplary embodiments.
  • To illustrate the method for filtering user information, FIG. 3 shows an example process 300 of filtering user information of users who circulate pornographic information. The system analyzes chat messages of these targeted users, and discovers implicit models for the information sent by these targeted users who circulate pornographic information. The models are obtained through data mining modeling. A filtering scheme for a target user who circulates pornographic information is generated using a generating scheme. During the modeling process which generates filtering schemes, the target user is an abstract user representing a certain type of targeted users. The filtering scheme thus generated is then used to monitor the information of this type of users.
  • The process 300 of FIG. 3 is further described in detail as follows.
  • Block 301 sets the user-characteristic data of targeted users by analyzing the targeted information. The targeted information may be a data collection of the targeted users (e.g., messages sent by users who are known to be circulating pornographic information). The user-characteristic data is set by extracting and configuring the targeted information. The user-characteristic data may include user activity data, user information data, and network user-characteristic data. Specific scopes and results of the data setting are described as follows.
  • 1. Setting of the user activity data, including:
  • (1) Frequencies of keywords such as catalog, movie, channel, video, animation, cartoon, picture, show, watch, download, online, pornographic, erotic, sexual, passion, adult, ethics, porn star, classics, R-rating, “A” film, uncoded, clear, AV, etc.
  • (2) The number of times the user sends information, and the total number of bytes of the sent information.
  • (3) The number of times the user receives information, and the total number of bytes of the received information.
  • (4) The number of times the user sends information to strangers.
  • 2. Setting of the scopes for user information data, including:
  • (1) The time at which the user first logs in.
  • (2) The activities of the user.
  • (3) The number of buddies of the user.
  • 3. Setting of the scopes for network user-characteristic data, including:
  • (1) The number of users on the same IP.
  • (2) The number of users on the same MAC address.
  • Upon completing the settings, the user-characteristic parameters of the target user (i.e., a representative user who sends pornographic information in the illustrated example) are generated based on the user-characteristic data that have just been set. The user-characteristic parameters would reflect whether a user is sending pornographic information. The target user required in modeling is found through analysis and screening. An exemplary process is described as follows.
  • Block 302 identifies the valid data in the user-characteristic data, and eliminates the invalid variables and observations.
  • For example, data regarding the number of buddies a user has added and the number of times a user has sent information to strangers are unavailable under existing technologies. Therefore, options related to these contents are removed from the setting result of the user-characteristic data.
  • Block 303 selects one or more sample users, and determines a target object (a target user).
  • This process sets a user who sends pornographic information as a model object, samples and extracts the communicated information records of this type of users to determine a target user for building the model. Examples of the communicated information records include chat records, message records, and email records.
  • Block 304 computes derived variables.
  • Based on the model object, this process computes the derived variables using the above data in order to understand the customer activities of the target user more comprehensively. In the present exemplary embodiment, three primary types of derived variables are used in modeling: aggregated variable, ratio variable, and average variable. These derived variables are described further below.
  • 1. Aggregated Variables
  • One example of an aggregated variable is the number of targeted keyword types that appear. For example, if information contains the keywords “AV”, “porn star”, and “R-rating”, the number of targeted keyword types that appeared is three, and therefore the value of the corresponding aggregated variable is three.
  • Another example of an aggregated variable is the number of keyword appearances measured by keyword groups. For example, keywords such as “watch”, “download”, and “online” are categorized into the same homogeneous group, and the total number of times these keywords appear is computed for the same group.
  • 2. Ratio Variables
  • One example of a ratio variable is the send/receive ratio, such as a ratio between the number of times information is sent and the number of times information is received, and a ratio between the number of bytes of information sent and the number of bytes of information received. This type of variables may be indicative of a certain type of users who send out large quantities of information but seldom receive information from others.
  • 3. Average Variables
  • One example of an average variable is the average frequency of a keyword type's appearance. This can be measured by the frequency of a keyword divided by the total frequency of all keywords.
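  • The three kinds of derived variables can be computed as in the sketch below. The keyword groups, the sample frequencies and the guards against division by zero are assumptions added for the example; the formulas themselves follow the descriptions above.

```python
# Hypothetical keyword groups; "access" mirrors the watch/download/online example.
KEYWORD_GROUPS = {
    "access": ["watch", "download", "online"],
    "media": ["movie", "video", "picture"],
}


def keyword_type_count(freqs: dict) -> int:
    """Aggregated variable: number of distinct keyword types that appear."""
    return sum(1 for count in freqs.values() if count > 0)


def group_totals(freqs: dict, groups: dict = KEYWORD_GROUPS) -> dict:
    """Aggregated variable: total appearances per keyword group."""
    return {g: sum(freqs.get(k, 0) for k in kws) for g, kws in groups.items()}


def send_receive_ratio(sent: int, received: int) -> float:
    """Ratio variable; the denominator is floored at 1 to avoid dividing by zero."""
    return sent / max(received, 1)


def average_frequency(freqs: dict, keyword: str) -> float:
    """Average variable: one keyword's frequency over the total of all keywords."""
    total = sum(freqs.values())
    return freqs.get(keyword, 0) / total if total else 0.0


freqs = {"watch": 2, "download": 1, "online": 0, "video": 1}
print(keyword_type_count(freqs), group_totals(freqs))
print(send_receive_ratio(40, 4), average_frequency(freqs, "watch"))
```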
  • Block 305 filters user-characteristic parameters.
  • A variable which has a missing value is replaced according to a replacement rule for missing values of data. For example, a missing value is replaced by zero.
  • For text information, an exemplary data cleaning step may convert all text information from traditional Chinese characters to simplified Chinese characters, uppercases to lowercases, and SBC cases to DBC cases.
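  • Part of this text cleaning can be sketched with the Python standard library. The sketch assumes that "SBC case" and "DBC case" refer to full-width and half-width character forms, as the terms are commonly used in Chinese text processing, and relies on Unicode NFKC normalization to fold full-width forms. Traditional-to-simplified conversion needs a character mapping table that the standard library does not provide, so it is deliberately left out.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Fold full-width forms to half-width (via NFKC) and lowercase the result.

    Traditional-to-simplified Chinese conversion is NOT performed here; it
    would require an external mapping table.
    """
    return unicodedata.normalize("NFKC", text).lower()


# Full-width Latin letters and a full-width (ideographic) space.
print(normalize_text("ＡＶ　Ｄｏｗｎｌｏａｄ"))
```

  • Normalizing before keyword matching keeps a keyword list from being evaded simply by typing the same word in full-width or mixed-case characters.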
  • Block 306 generates a filtering scheme of the target user based on the filtered user-characteristic parameters.
  • After the user-characteristic parameters have been prepared, the process enters into the modeling stage. Modeling includes such tasks as selecting a suitable algorithm, selecting suitable parameters, formulating a model verification scheme, formulating a data sampling plan, and configuring model parameters.
  • Modeling and data preparation are interactive processes. An initial result of modeling may produce new requirements for data preparation, while a result of data preparation may directly affect model construction.
  • Because user-characteristic parameters and modeling algorithms can be varied, multiple rule models may be computed. In order to select the most accurate model among the computed results to be a final filtering scheme of the target user, filtering tests for the models may be performed on model data or training data which contains information of targeted users whose behavior is known. TABLE 1 shows an example of a model prediction result.
  • TABLE 1
    Statistics of model testing result
                      Predict: False    Predict: True
    Actual: False     896               87
    Actual: True      173               423
  • Based on the data in TABLE 1, the accuracy of the associated model is computed as follows:

  • Accuracy = (Predicted True and Actual True + Predicted False and Actual False) / (total number of samples) = (423 + 896) / (896 + 423 + 87 + 173) = 1319 / 1579 ≈ 83.5%
  • Based on the above computed result, a determination is made whether the model satisfies an accuracy requirement. From all models that have satisfied the accuracy requirement, one or more models having the highest accuracy are selected to be the filtering scheme(s) for the target user. That is, the selected filtering scheme(s) are to be used for filtering users who promulgate pornographic information.
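  • The accuracy figure for TABLE 1 can be reproduced with a few lines of code. The function and argument names are illustrative; the four counts are taken directly from the table.

```python
def accuracy(true_neg: int, false_pos: int, false_neg: int, true_pos: int) -> float:
    """Share of correct predictions among all tested samples."""
    total = true_neg + false_pos + false_neg + true_pos
    return (true_pos + true_neg) / total


# Counts from TABLE 1 (rows are actual values, columns are predictions).
acc = accuracy(true_neg=896, false_pos=87, false_neg=173, true_pos=423)
print(f"{acc:.1%}")  # prints 83.5%
```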
  • The disclosed method achieves time-division monitoring and collecting of the records of the communicated information. The data mining and modeling may be a pre-stage preparation carried out at the backend. This may even be done off-line. In application, the system uses the targeted filtering scheme generated by the modeling to give each user a score. If a user's score exceeds a preset threshold, the system concludes that the user is promulgating pornographic information, and therefore applies the filtering scheme to filter the information of the user. The system may apply other relevant controlling measures to penalize the user accordingly. For instance, the system may enlist the user into a monitoring system. A supervising network security officer may then determine, from a business point of view, whether the user who has been enlisted into the monitoring system has met the penalty requirements, and penalize the user if the user needs the penalty requirements.
  • FIG. 4 shows a structural diagram of an exemplary apparatus 400 for filtering user information in accordance with the present disclosure. Information filtering apparatus 400 includes a configuration module 410, an acquisition module 420 and a filtering module 430.
  • The configuration module 410 is used for configuring a correspondence relationship between a filtering scheme and the keywords and user-characteristic data of targeted users. The acquisition module 420 is used for acquiring keywords and user-characteristic data of a present user. The filtering module 430 is used for finding and selecting the targeted filtering scheme according to the correspondence relationship using the keywords and user-characteristic data of the present user, and for further filtering information of the target user according to the selected filtering scheme.
  • In the configuration module 410, a parameter generation sub-module 412 is used for setting the keywords and user-characteristic data of information of targeted users, and generating user-characteristic parameters of a representative target user based on the set keywords and user-characteristic data. A filtering sub-module 414 is used for filtering out anomalies in the user-characteristic parameters that are generated by the parameter generation sub-module 412. A rule generation sub-module 416 is used for generating the filtering scheme for the target user based on the filtered user-characteristic parameters.
  • The parameter generation sub-module 412 may use various units (not shown) to perform various functions. For example, a recognition unit (not shown) may be used for recognizing valid data in the user-characteristic data; a selection unit may be used for selecting one or more sample users based on the valid data recognized by the recognition unit; and a computation unit is used for computing the user-characteristic parameters of the target user based on user-characteristic data of the one or more sample users selected by the selection unit.
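  • The three units can be sketched as three small functions. The variable kinds (total frequency, send/receive ratio, average frequency) follow claim 4; the record fields `keyword_hits`, `sent` and `received`, the validity rule, and the sample values are hypothetical.

```python
REQUIRED_FIELDS = ("keyword_hits", "sent", "received")


def is_valid(record):
    """Recognition unit (assumed rule): every required field is a non-negative number."""
    return all(
        isinstance(record.get(key), (int, float)) and record[key] >= 0
        for key in REQUIRED_FIELDS
    )


def select_sample_users(records):
    """Selection unit: keep only the users whose data is valid."""
    return {user: rec for user, rec in records.items() if is_valid(rec)}


def compute_parameters(samples):
    """Computation unit: aggregated, ratio, and average variables over the sample users."""
    if not samples:
        raise ValueError("no sample users selected")
    total_hits = sum(rec["keyword_hits"] for rec in samples.values())
    total_sent = sum(rec["sent"] for rec in samples.values())
    total_received = sum(rec["received"] for rec in samples.values())
    return {
        "total_frequency": total_hits,
        # The ratio is left undefined (None) when nothing was received.
        "send_receive_ratio": total_sent / total_received if total_received else None,
        "average_frequency": total_hits / len(samples),
    }


# Hypothetical records; u3 has a missing keyword count and is rejected.
RECORDS = {
    "u1": {"keyword_hits": 12, "sent": 40, "received": 10},
    "u2": {"keyword_hits": 8, "sent": 20, "received": 10},
    "u3": {"keyword_hits": None, "sent": 5, "received": 1},
}

if __name__ == "__main__":
    print(compute_parameters(select_sample_users(RECORDS)))
```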
  • The filtering sub-module 414 may use various units (not shown) to perform various functions. For example, a first filtering unit is used for replacing a missing value in the user-characteristic parameters by a replacement value; and a second filtering unit is used for replacing an irregular value in the user-characteristic parameters by a regular value.
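  • One possible reading of the two filtering units is sketched below. Treating `None` as a missing value, treating values outside a fixed range as irregular, and replacing an irregular value with the nearest bound are assumptions; the passage above does not say how the replacement value or the regular value is chosen.

```python
def clean_parameters(params, replacement, lower, upper):
    """Replace missing values and clamp irregular values into [lower, upper]."""
    cleaned = {}
    for name, value in params.items():
        if value is None:
            # First filtering unit: substitute the replacement value.
            cleaned[name] = replacement
        elif value < lower:
            # Second filtering unit: an out-of-range value becomes the nearest bound.
            cleaned[name] = lower
        elif value > upper:
            cleaned[name] = upper
        else:
            cleaned[name] = value
    return cleaned


if __name__ == "__main__":
    raw = {"a": None, "b": 500, "c": -3, "d": 7}  # hypothetical parameters
    print(clean_parameters(raw, replacement=0, lower=0, upper=100))
```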
  • The rule generation sub-module 416 may use various units (not shown) to perform its own related functions. For example, a parameter selection unit is used for selecting the one or more user-characteristic parameters obtained to be rule generation parameter(s); a rule computing unit is used for generating multiple filtering schemes based on the rule generation parameters selected by the parameter selection unit; and a rule selection unit is used for selecting a filtering scheme having the highest accuracy from the filtering schemes to be the filtering scheme of the target user.
  • In the filtering module 430, a searching sub-module 434 is used for finding the targeted filtering scheme according to the correspondence relationship using the keywords and user-characteristic data of a present user. A filtering sub-module 436 is used for filtering the target user information based on the selected filtering scheme. A determination sub-module 432 is used for scoring a present user based on the targeted filtering scheme of the target user, and triggering the filtering sub-module 436 if the score of the present user exceeds a preset threshold.
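  • How the three sub-modules might cooperate is sketched below. The class and method names, the representation of the correspondence relationship as (targeted keywords, scoring function) pairs, the keyword-overlap lookup, and the choice to drop messages containing a targeted keyword are all hypothetical; the disclosure leaves these details open.

```python
class FilteringModule:
    def __init__(self, correspondence, threshold):
        # correspondence: list of (frozenset of targeted keywords, scoring function) pairs
        self.correspondence = correspondence
        self.threshold = threshold

    def search(self, keywords):
        """Searching sub-module: find a scheme whose targeted keywords overlap the user's."""
        user_keywords = set(keywords)
        for targeted_keywords, scheme in self.correspondence:
            if targeted_keywords & user_keywords:
                return targeted_keywords, scheme
        return None

    def exceeds_threshold(self, scheme, characteristics):
        """Determination sub-module: score the user and compare with the threshold."""
        return scheme(characteristics) > self.threshold

    def filter_messages(self, messages, targeted_keywords):
        """Filtering sub-module: drop messages that contain a targeted keyword."""
        return [m for m in messages if not any(k in m.lower() for k in targeted_keywords)]

    def process(self, keywords, characteristics, messages):
        match = self.search(keywords)
        if match is None:
            return list(messages)
        targeted_keywords, scheme = match
        if not self.exceeds_threshold(scheme, characteristics):
            return list(messages)
        return self.filter_messages(messages, targeted_keywords)


def example_scheme(characteristics):
    """Hypothetical scoring function for one targeted filtering scheme."""
    return 0.5 * characteristics["keyword_frequency"]


MODULE = FilteringModule([(frozenset({"spamword"}), example_scheme)], threshold=3.0)

if __name__ == "__main__":
    msgs = ["hello there", "buy spamword now"]
    print(MODULE.process(["spamword", "hello"], {"keyword_frequency": 10}, msgs))
    print(MODULE.process(["spamword", "hello"], {"keyword_frequency": 2}, msgs))
```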
  • In the present disclosure, a “module” or a “unit” in general refers to a functionality designed to perform a particular task or function. A module or unit can be a device which is a tool or machine designed to perform a particular task or function. A module or unit can be a piece of hardware, software, a plan or scheme, or a combination thereof, for effectuating a purpose associated with the particular task or function. The functions of multiple modules or units may be achieved using a single device.
  • In particular, configuration module 410, acquisition module 420 and filtering module 430 may either be implemented in an integrated machine (a server or use of a system) or each in a separate machine (a computer or a server). In one embodiment, configuration module 410 is implemented in a first machine, and acquisition module 420 and filtering module 430 are implemented in a second machine which is connected to (but not necessarily integrated with) the first machine.
  • The exemplary embodiments disclosed herein may have the following advantages. Because the method establishes filtering rules of a target user to determine user activities based on specific keywords and other user related information, characteristics of user activities can be determined from multiple aspects, and processed accordingly to improve recognition accuracy with respect to a certain type of targeted users and improve the efficiency of information security.
  • From the exemplary embodiments described above, it is appreciated that the disclosed method and system may be implemented using software and a universal hardware platform, or can be implemented using hardware only. However, in many instances, implementation using a combination of software and hardware is preferred. Based on this understanding, the technical schemes of the present disclosure, or portions contributing to existing technologies, may be implemented in the form of software products which are stored in a storage medium. The software includes instructions for a computing device (e.g., a cell phone, a personal computer, a server or a networked device) to execute the method described in the exemplary embodiments of the current disclosure.
  • It is appreciated that the potential benefits and advantages discussed herein are not to be construed as a limitation or restriction to the scope of the appended claims.
  • Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as exemplary forms of implementing the claims.

Claims (17)

1. A method for filtering user information, the method comprising:
obtaining keywords and user-characteristic data of a user;
selecting a filtering scheme according to a correspondence relationship between the filtering scheme and the keywords and user-characteristic data, the filtering scheme being modeled to filter targeted users whose information has similar keywords and user-characteristic data; and
filtering the information of the user according to the selected filtering scheme.
2. The method as recited in claim 1, wherein the correspondence relationship between the filtering scheme and the keywords and user-characteristic data is configured using a procedure comprising:
setting targeted keywords and user-characteristic data from a data collection of targeted users;
generating user-characteristic parameters of a target user based on the targeted keywords and user-characteristic data;
filtering out anomalies in the user-characteristic parameters;
formulating the filtering scheme based on the filtered user-characteristic parameters; and
establishing a correspondence relationship between the filtering scheme and the targeted keywords and user-characteristic data.
3. The method as recited in claim 2, wherein generating user-characteristic parameters of the target user comprises:
identifying valid data in the extracted keywords and user-characteristic data;
selecting one or more sample users based on the valid data; and
obtaining the user-characteristic parameters of the target user based on the keywords and user-characteristic data of the one or more sample users.
4. The method as recited in claim 3, wherein obtaining the user-characteristic parameters of the target user comprises:
generating an aggregated variable of the target user based on a total frequency according to the keywords and user-characteristic data of the one or more sample users;
generating a ratio variable of the target user based on a send/receive ratio according to the user-characteristic data of the one or more sample users; and
generating an average variable based on an average frequency according to the keywords and user-characteristic data of the one or more sample users.
5. The method as recited in claim 2, wherein filtering out anomalies in the user-characteristic parameters comprises:
replacing a missing value in the user-characteristic parameters of the target user by a replacement value; and
replacing an irregular data in the user-characteristic parameters of the target user by a regular value.
6. The method as recited in claim 1, wherein the filtering scheme is modeled to filter targeted users using a procedure comprising:
setting keywords and user-characteristic data from a data collection of targeted users;
generating user-characteristic parameters of a target user based on the extracted keywords and user-characteristic data;
selecting one or more rule generation parameters from the user-characteristic parameters;
generating multiple filtering schemes based on the rule generation parameter; and
selecting the filtering scheme having the highest accuracy among the multiple filtering schemes to be the filtering scheme for the target user.
7. The method as recited in claim 1, further comprising:
scoring the user based on the selected filtering scheme, wherein the information of the user is filtered according to the filtering scheme only if the user's score exceeds a preset threshold.
8. The method as recited in claim 1, wherein the user-characteristic data comprises user activity data, user information data and network user-characteristic data.
9. The method as recited in claim 8, wherein
the user activity data includes one or more of the following types of information: frequencies of characteristic word groups in information sent by the user within a time frame, the number of times the user sends/receives information, and the amount of information the user sends/receives;
the user information data includes one or more of the following types of information: initial login time of the user, user activities after login, and the number of contacts of the user; and
the network user-characteristic data includes one or more of the following types of information: the number of users within a same IP, and the number of users within a same machine ID.
10. An apparatus for filtering user information, the apparatus comprising:
a configuration module used for configuring a correspondence relationship between a targeted filtering scheme and targeted keywords and user-characteristic data from a data collection of targeted users;
an acquisition module used for acquiring keywords and user-characteristic data of a present user; and
a filtering module used for selecting the targeted filtering scheme according to the correspondence relationship based on the acquired keywords and user-characteristic data of the present user, and for filtering information of the present user according to the targeted filtering scheme.
11. The apparatus as recited in claim 10, wherein the configuration module comprises:
a parameter generation sub-module used for setting up the targeted keywords and user-characteristic data from the data collection of targeted users, and for generating user-characteristic parameters of a target user based on the targeted keywords and user-characteristic data;
a filtering sub-module used for filtering out anomalies in the user-characteristic parameters generated by the parameter generation sub-module; and
a rule generation sub-module used for generating the targeted filtering scheme for the target user based on the filtered user-characteristic parameters.
12. The apparatus as recited in claim 11, wherein the parameter generation sub-module comprises:
a recognition unit used for recognizing valid data in the targeted user-characteristic data;
a selection unit used for selecting one or more sample users based on the valid data; and
a computation unit used for computing the user-characteristic parameters of the target user based on the user-characteristic data of the one or more sample users selected by the selection unit.
13. The apparatus as recited in claim 11, wherein the filtering sub-module comprises:
a first filtering unit used for replacing a missing value in the user-characteristic parameters by a replacement value; and
a second filtering unit used for replacing an irregular value in the user-characteristic parameters by a regular value.
14. The apparatus as recited in claim 11, wherein the rule generation sub-module comprises:
a parameter selection unit used for selecting one or more rule generation parameters;
a rule computing unit used for generating multiple candidate filtering schemes based on the rule generation parameters selected by the parameter selection unit; and
a rule selection unit used for selecting a filtering scheme having the highest accuracy among the candidate filtering schemes to be the targeted filtering scheme of the target user.
15. The apparatus as recited in claim 11, wherein the filtering module comprises:
a searching sub-module used for finding the targeted filtering scheme according to the correspondence relationship based on the keywords and user-characteristic data of the present user; and
a filtering sub-module used for filtering the information of the present user according to the targeted filtering scheme.
16. The apparatus as recited in claim 15, wherein the filtering module further comprises:
a determination sub-module used for determining a score for the present user according to the target filtering scheme, and for triggering the filtering sub-module if the score of the present user exceeds a preset threshold.
17. The apparatus as recited in claim 10, wherein the apparatus comprises a server computer.
US12/667,145 2008-06-26 2009-06-26 Filtering information using targeted filtering schemes Active 2031-05-03 US8725746B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US14/197,118 US9201953B2 (en) 2008-06-26 2014-03-04 Filtering information using targeted filtering schemes

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN200810126362.X 2008-06-26
CN200810126362 2008-06-26
CN200810126362XA CN101616101B (en) 2008-06-26 2008-06-26 Method and device for filtering user information
PCT/US2009/048817 WO2009158593A1 (en) 2008-06-26 2009-06-26 Filtering information using targeted filtering schemes

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2009/048817 A-371-Of-International WO2009158593A1 (en) 2008-06-26 2009-06-26 Filtering information using targeted filtering schemes

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US14/197,118 Continuation US9201953B2 (en) 2008-06-26 2014-03-04 Filtering information using targeted filtering schemes

Publications (2)

Publication Number Publication Date
US20110010374A1 (en) 2011-01-13
US8725746B2 US8725746B2 (en) 2014-05-13

Family

ID=41444970

Family Applications (2)

Application Number Title Priority Date Filing Date
US12/667,145 Active 2031-05-03 US8725746B2 (en) 2008-06-26 2009-06-26 Filtering information using targeted filtering schemes
US14/197,118 Active US9201953B2 (en) 2008-06-26 2014-03-04 Filtering information using targeted filtering schemes

Family Applications After (1)

Application Number Title Priority Date Filing Date
US14/197,118 Active US9201953B2 (en) 2008-06-26 2014-03-04 Filtering information using targeted filtering schemes

Country Status (6)

Country Link
US (2) US8725746B2 (en)
EP (1) EP2291734A4 (en)
JP (1) JP5453410B2 (en)
CN (1) CN101616101B (en)
HK (1) HK1138957A1 (en)
WO (1) WO2009158593A1 (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102202037A (en) * 2010-03-24 2011-09-28 北京创世网赢高科技有限公司 Information publishing system
CN102202036A (en) * 2010-03-24 2011-09-28 北京创世网赢高科技有限公司 Method for issuing information
CN102279893B (en) * 2011-09-19 2015-07-22 索意互动(北京)信息技术有限公司 Many-to-many automatic analysis method of document group
CN102571484B (en) * 2011-12-14 2014-08-27 上海交通大学 Method for detecting and finding online water army
US9100366B2 (en) 2012-09-13 2015-08-04 Cisco Technology, Inc. Early policy evaluation of multiphase attributes in high-performance firewalls
CN103136448B (en) * 2013-02-02 2015-12-02 深圳先进技术研究院 Measurement and Data Processing exact method and system and data processing method and system
CN104143148A (en) * 2013-05-07 2014-11-12 苏州精易会信息技术有限公司 Advertisement setting method applied to management software system
CN103581186B (en) * 2013-11-05 2016-09-07 中国科学院计算技术研究所 A kind of network security situational awareness method and system
CN104184653B (en) * 2014-07-28 2018-03-23 小米科技有限责任公司 A kind of method and apparatus of message screening
CN104539514B (en) * 2014-12-17 2018-07-17 广州酷狗计算机科技有限公司 Information filtering method and device
CN104915423B (en) * 2015-06-10 2018-06-26 深圳市腾讯计算机系统有限公司 The method and apparatus for obtaining target user
CN105930258B (en) * 2015-11-13 2019-04-26 中国银联股份有限公司 A kind of method and device of parameter filtering
CN106856598B (en) * 2015-12-08 2020-04-14 中国移动通信集团公司 Method and system for optimizing spam strategy
CN107809368B (en) * 2016-09-09 2019-01-29 腾讯科技(深圳)有限公司 Information filtering method and device
CN106789572B (en) * 2016-12-19 2019-09-24 重庆博琨瀚威科技有限公司 A kind of instant communicating system and instant communication method for realizing adaptive message screening
CN108322317B (en) * 2017-01-16 2022-07-29 腾讯科技(深圳)有限公司 Account identification association method and server
US10681024B2 (en) * 2017-05-31 2020-06-09 Konica Minolta Laboratory U.S.A., Inc. Self-adaptive secure authentication system
CN111078520B (en) * 2019-12-17 2023-04-11 四川新网银行股份有限公司 Method for judging panic and busy degree of bank user interface operation
CN112311933B (en) * 2020-10-27 2021-10-15 杭州天宽科技有限公司 Sensitive information shielding method and system
CN112257048B (en) * 2020-12-21 2021-10-08 南京韦科韬信息技术有限公司 Information security protection method and device

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050171955A1 (en) * 2004-01-29 2005-08-04 Yahoo! Inc. System and method of information filtering using measures of affinity of a relationship
US20050198159A1 (en) * 2004-03-08 2005-09-08 Kirsch Steven T. Method and system for categorizing and processing e-mails based upon information in the message header and SMTP session
US7113977B1 (en) * 2002-06-26 2006-09-26 Bellsouth Intellectual Property Corporation Blocking electronic mail content
US20070156886A1 (en) * 2005-12-29 2007-07-05 Microsoft Corporation Message Organization and Spam Filtering Based on User Interaction
US20070156677A1 (en) * 1999-07-21 2007-07-05 Alberti Anemometer Llc Database access system
US7260837B2 (en) * 2000-03-22 2007-08-21 Comscore Networks, Inc. Systems and methods for user identification, user demographic reporting and collecting usage data usage biometrics
US20070282770A1 (en) * 2006-05-15 2007-12-06 Nortel Networks Limited System and methods for filtering electronic communications
US7320020B2 (en) * 2003-04-17 2008-01-15 The Go Daddy Group, Inc. Mail server probability spam filter
US20080140781A1 (en) * 2006-12-06 2008-06-12 Microsoft Corporation Spam filtration utilizing sender activity data
US20090089279A1 (en) * 2007-09-27 2009-04-02 Yahoo! Inc., A Delaware Corporation Method and Apparatus for Detecting Spam User Created Content
US20090144276A1 (en) * 2003-11-24 2009-06-04 Feng-Wei Chen Russell Computerized data mining system and program product
US20090150365A1 (en) * 2007-12-05 2009-06-11 Palo Alto Research Center Incorporated Inbound content filtering via automated inference detection
US20090149203A1 (en) * 2007-12-10 2009-06-11 Ari Backholm Electronic-mail filtering for mobile devices
US7558832B2 (en) * 2003-03-03 2009-07-07 Microsoft Corporation Feedback loop for spam prevention
US20090187988A1 (en) * 2008-01-18 2009-07-23 Microsoft Corporation Cross-network reputation for online services
US20090222557A1 (en) * 2008-02-29 2009-09-03 Raymond Harry Putra Rudy Analysis system, information processing apparatus, activity analysis method and program product
US7831464B1 (en) * 2006-04-06 2010-11-09 ClearPoint Metrics, Inc. Method and system for dynamically representing distributed information

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002024274A (en) * 2000-07-06 2002-01-25 Oki Electric Ind Co Ltd Device and method for information filtering
JP2003067304A (en) * 2001-08-27 2003-03-07 Kddi Corp Electronic mail filtering system, electronic mail filtering method, electronic mail filtering program and recording medium recording it
CN1270258C (en) * 2002-12-20 2006-08-16 中国科学院计算技术研究所 Multi keyword matching method for rapid content analysis
JP2005332048A (en) * 2004-05-18 2005-12-02 Nippon Telegr & Teleph Corp <Ntt> Method for distributing content information, content distribution server, program for distributing content information, and recording medium with program recorded thereon
US8131747B2 (en) * 2006-03-15 2012-03-06 The Invention Science Fund I, Llc Live search with use restriction
CN101616101B (en) 2008-06-26 2012-01-18 阿里巴巴集团控股有限公司 Method and device for filtering user information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Anirudh Ramachandran, Nick Feamster, and Santosh Vempala. Filtering spam with behavioral blacklisting. 2007. In Proceedings of the 14th ACM conference on Computer and communications security (CCS '07). ACM. 342-351. *
Shlomo Hershkop and Salvatore J. Stolfo. Combining email models for false positive reduction. 2005. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining (KDD '05). ACM, 98-107. *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9201953B2 (en) 2008-06-26 2015-12-01 Alibaba Group Holding Limited Filtering information using targeted filtering schemes
CN103198161A (en) * 2013-04-28 2013-07-10 中国科学院计算技术研究所 Microblog ghostwriter identifying method and device
CN103678700A (en) * 2013-12-27 2014-03-26 纳容众慧(北京)科技有限公司 Web page data processing method and device
US20150339314A1 (en) * 2014-05-25 2015-11-26 Brian James Collins Compaction mechanism for file system
CN105608352A (en) * 2015-12-31 2016-05-25 联想(北京)有限公司 Information processing method and server
US11283743B1 (en) * 2017-07-06 2022-03-22 Meta Platforms, Inc. Techniques for scam detection and prevention
US11677704B1 (en) 2017-07-06 2023-06-13 Meta Platforms, Inc. Techniques for scam detection and prevention
CN109840274A (en) * 2018-12-28 2019-06-04 北京百度网讯科技有限公司 Data processing method and device, storage medium
CN110457918A (en) * 2019-01-09 2019-11-15 腾讯科技(深圳)有限公司 Filter out method, apparatus, node and the medium of illegal contents in block chain data

Also Published As

Publication number Publication date
CN101616101B (en) 2012-01-18
WO2009158593A1 (en) 2009-12-30
EP2291734A1 (en) 2011-03-09
JP2011526393A (en) 2011-10-06
US20140188913A1 (en) 2014-07-03
HK1138957A1 (en) 2010-09-03
CN101616101A (en) 2009-12-30
US9201953B2 (en) 2015-12-01
JP5453410B2 (en) 2014-03-26
US8725746B2 (en) 2014-05-13
EP2291734A4 (en) 2013-09-25

Legal Events

Date Code Title Description
AS Assignment

Owner name: ALIBABA GROUP HOLDING LIMITED, CAYMAN ISLANDS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:YANG, JUNJIE;NI, LIANG;ZHANG, ZHENGHUA;AND OTHERS;REEL/FRAME:023716/0358

Effective date: 20091204

STCF Information on status: patent grant

Free format text: PATENTED CASE

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551)

Year of fee payment: 4

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 8