WO2008066341A1 - Method and apparatus for preventing from abusing search logs - Google Patents

Method and apparatus for preventing from abusing search logs

Info

Publication number
WO2008066341A1
Authority
WO
WIPO (PCT)
Prior art keywords
summary information
address
search
search word
abnormal action
Prior art date
Application number
PCT/KR2007/006104
Other languages
French (fr)
Inventor
Yong-Dai Kim
Jang Min O
Jae Geol Choi
Dong Wook Kim
Youn Sik Lee
Original Assignee
Nhn Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nhn Corporation filed Critical Nhn Corporation
Priority to JP2009539187A priority Critical patent/JP5118707B2/en
Publication of WO2008066341A1 publication Critical patent/WO2008066341A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/552Detecting local intrusion or implementing counter-measures involving long-term monitoring or reporting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions

Definitions

  • FIG. 9 illustrates top 20 search words with high abuse scores which are calculated after the abuse inspection is performed.
  • a discrete probability model of each sample is expressed in the form of a histogram.
  • the vertical axis represents a probability value.
  • the scale of the vertical axis is fixed to [0, 1].
  • the horizontal axis represents a hash bucket index.
  • a search word name and an abuse score are recorded on the top of each figure.
  • the top 20 search words are each expected to include an abnormal action since their abuse scores range from 3 to 9, which are all higher than 1.
  • FIG. 10 illustrates example results of the point-deduction processing of the top 20 abusive search words detected according to the invention.
  • Original hash buckets before the point-deduction and hash buckets to which the point-deduction logic has been applied are shown in pairs in each row. By comparing the scores, one can confirm that, after point-deduction, the abuse scores are less than 1 so that abnormal actions have been removed.
  • FIG. 11 illustrates example comparisons of the discrete probability distribution values with the point-deduction processing results according to the invention.
  • abnormal abuse scores of 3-9 of the search words were corrected to a normal range of 1 or less.
  • an abuse score of 9.673833 of a search word "type" was corrected to a level of about 0.211166 within the normal range after the point-deduction processing was performed to remove abnormal actions.
  • the vertical axis after point-deduction was scaled to [0,0.1].
  • FIG. 12 illustrates a probability model of a mother group which was a basis for calculating the KL distance at the point-deduction logic.
  • FIG. 13 illustrates example results of point-deduction processing of top 40 search words.
  • An abuse score of each search word is written on the left side.
  • a total search count before point-deduction and a point-deduction count calculated according to the point-deduction logic are written for each search word. That is, it is possible to remove contaminated parts from each search word detected as an abnormal action to correct the search log by subtracting a point-deduction count calculated according to the point- deduction logic from a total search count of the search word.
  • search word abuse problems can be overcome sufficiently through the abuse detection and treatment method using search word summary information as described above.
  • search word summary information may have an abuse score of less than 1 so that it is detected as a normal action although it actually involves an abnormal action due to search word abuse.
  • the abusive action can be additionally corrected using IP address summary information.
  • a detailed description of this method is omitted herein since it is similar to the abuse detection and treatment method using search word summary information.
  • the invention has suggested the method and apparatus for preventing abuse of search logs to maintain the search logs clean through diagnosis and post-processing of search abuses in the search logs.
  • a hash-bucket-based data structure is built to express IP address summary information and search word summary information and is then converted into a discrete probability model to express input data.
  • the invention has suggested a point-deduction technique which converts abnormal samples into normal samples based on an information theory.
  • the above method for preventing abuse of search logs can be written as a computer program. Codes and code segments which constitute the program can be easily inferred by computer programmers in the art of the invention.
  • the program is stored in a computer-readable medium. The stored program is read and executed by a computer to implement the method for preventing abuse of search logs.
  • the information storage medium includes a magnetic recording medium, an optical recording medium, and a carrier-wave medium.

Abstract

A method for preventing abuse of a search log is provided. The method includes selecting a subject of abnormal action inspection from the search log and detecting an abnormal action by scoring the extent of deviation of the selected subject from the normal. The method may further include correcting the search log by removing the detected abnormal action from the search log using a predetermined point-deduction logic. This method efficiently removes detected abnormal actions from search logs, thereby preventing abuse of search logs and maintaining the search logs clean.

Description

METHOD AND APPARATUS FOR PREVENTING FROM ABUSING SEARCH LOGS
Technical Field
[1] The present invention relates to Internet search, and more particularly to a method and apparatus for efficiently preventing abuse of search logs. Background Art
[2] Recently, various services have been provided to users over the Internet along with the development of Internet technologies. The most typical is the search service. The search service is a service through which a search service provider provides, when a user inputs a search word into a search box of a search site provided by the search service provider, information corresponding to the input search word as a search result.
[3] Search words, which users input to use the search service, and information of search actions of users are stored in the form of search logs. By analyzing the search logs, search service providers can provide various search services to users.
[4] In the case of keyword advertisements, the amount to be charged is determined based on the popularities of keywords. The popularities of search words are determined based on patterns of the search words obtained through analysis of search logs. Through these popularities, search service providers can present an unbiased and reasonable charging basis to each advertisement requester.
[5] Search service providers provide a variety of primary and secondary services using search logs. For example, popular search word services and associated search word services provide, using search logs, search words currently appealing to users and their associated search words. These services have been successful because the assumption that the great amount of search logs is created by the genuine intention of Internet users has held.
[6] However, attempts to distort search logs so as to reflect the unfair intentions of a specific individual or group have increased recently. The proportion of such distortion is expected to gradually increase in the future. Such abusive actions on search logs contaminate the search logs and reduce the reliability of business models based on the search logs and the quality of such services. Disclosure of Invention
Technical Problem
[7] Therefore, the present invention has been made in view of the above problems, and it is an object of the present invention to provide a method and apparatus for preventing abuse of search logs, wherein search logs are tracked and analyzed to detect abnormal actions and to remove contaminated parts from the search logs. Technical Solution
[8] In accordance with one aspect of the present invention, the above and other objects can be accomplished by the provision of a method for preventing abuse of a search log, the method including selecting a subject of abnormal action inspection from the search log, and detecting an abnormal action by scoring an extent of deviation of the selected subject from a normal. In one embodiment, the method may further include correcting the search log by removing the detected abnormal action from the search log using a predetermined point-deduction logic.
[9] The step of selecting the subject of abnormal action inspection includes generating, from the search log, at least one of search word summary information by statistically analyzing an input count of a specific search word at each IP address within a predetermined time window and IP address summary information by statistically analyzing an input count of each search word at a specific IP address within the predetermined time window, wherein the step of detecting the abnormal action includes detecting the abnormal action from the at least one of the search word summary information and the IP address summary information.
[10] Here, the step of generating the summary information includes generating, from the search log, at least one of an input count vector of a specific search word at each IP address within a predetermined time window and an input count vector of each search word at a specific IP address within the predetermined time window, and reducing a dimension of the input count vector of the specific search word at each IP address to generate the search word summary information or reducing a dimension of the input count vector of each search word at the specific IP address to generate the IP address summary information.
[11] The step of reducing the dimension of the input count vector includes converting the input count vector of a specific search word at each IP address and the input count vector of each search word at a specific IP address into count vectors of a limited number of hash buckets.
[12] Here, the search word summary information and the IP address summary information are modeled as a multidimensional distribution using a statistical method.
[13] On the other hand, the step of detecting the abnormal action includes calculating a score corresponding to an extent of abnormality of at least one of the search word summary information and the IP address summary information modeled as the multidimensional distribution according to an extent of deviation from a center, and determining that an abnormal action is included in the at least one of the search word summary information and the IP address summary information whose calculated score is equal to or higher than a reference value. Here, the step of detecting the abnormal action further includes compressing data by reducing a dimension of the at least one of the modeled search word summary information and IP address summary information before calculating the score.
[14] The step of calculating the score includes calculating the score corresponding to the extent of abnormality as a ratio to a reference value using a statistic modeled through a sum of samples of independent standard normal distributions of the reduced dimension.
[15] The step of correcting the search log includes removing the abnormal action from at least one of the search word summary information and the IP address summary information in which the abnormal action has been detected using a point-deduction logic that is based on an information theory for measuring a difference of distributions.
[16] In accordance with another aspect of the present invention, the above and other objects can also be accomplished by the provision of an apparatus for preventing abuse of a search log, the apparatus including a preprocessor for selecting a subject of abnormal action inspection from a search log, an abnormal action detector for detecting an abnormal action by scoring an extent of deviation of the selected subject from a normal, and an abnormal action corrector for correcting the search log by removing the detected abnormal action from the search log using a predetermined point-deduction logic. Brief Description of the Drawings
[17] FIG. 1 is a schematic block diagram of an apparatus for preventing abuse of search logs according to one embodiment of the invention;
[18] FIG. 2 is a flow chart of a method for preventing abuse of search logs according to one embodiment of the invention;
[19] FIG. 3 is a detailed flow chart of a procedure for selecting a subject of abnormal action inspection according to one embodiment of the invention;
[20] FIG. 4 is a detailed flow chart of a procedure for detecting abnormal actions according to one embodiment of the invention;
[21] FIG. 5 illustrates a statistical method used in the abnormal action detection procedure;
[22] FIG. 6 is a detailed flow chart of a procedure for correcting a search log according to one embodiment of the invention;
[23] FIG. 7 illustrates a point-deduction logic used in the search log correction procedure according to one embodiment of the invention;
[24] FIG. 8 illustrates a user interface screen according to one embodiment of the invention; and [25] FIGS. 9 to 13 illustrate experimental results of the performance of the apparatus for preventing abuse of search logs according to the embodiment of the invention. Mode for the Invention
[26] Preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings. In the following description of the present invention, a detailed description of known functions and configurations incorporated herein will be omitted when it may obscure the subject matter of the present invention. Also, the terms used in the following description are defined taking into consideration the functions obtained in accordance with the present invention. The definitions of these terms should be determined based on the whole content of this specification because they may be changed in accordance with the intention of a user or operator or a usual practice.
[27] FIG. 1 is a schematic block diagram of an apparatus for preventing abuse of search logs according to one embodiment of the invention. As shown, the apparatus for preventing abuse of search logs includes a preprocessor 10, an abnormal action detector 20, and an abnormal action corrector 30.
[28] The preprocessor 10 selects subjects of abnormal action inspection from a search log. Here, since the number of search logs is very great when considering the number of IP addresses at which search words have been input, the number of the search words, or combinations thereof, abnormal action inspection is not performed throughout the search logs but subjects of abnormal action inspection are selected through the preprocessor 10.
[29] To accomplish this, the preprocessor 10 first generates candidate IP addresses and search words which are remarkable at the time of inspection and generates an input value to be used at the inspection step.
[30] The preprocessor 10 generates, from a search log, an input count vector of a specific search word at each IP address and/or an input count vector of each search word at a specific IP address within a predetermined time window and reduces the dimension of each of the generated input count vectors to generate search word summary information and/or IP address summary information.
[31] In this manner, the preprocessor 10 generates search word summary information by statistically analyzing the input count of a specific search word at each IP address, generates IP address summary information by statistically analyzing the input count of each search word at a specific IP address, or generates combinations thereof. The generated search word summary information and/or IP address summary information can be modeled as a multidimensional distribution using a statistical method.
[32] In another embodiment of the invention, in order to reduce the number of the subjects of abnormal action inspection, search words and/or IP addresses which are remarkable to some extent can be selected as subjects of inspection by applying the concept of a "method and system for detecting, in real time, search words whose popularities are on a rapid rise" described in Korean Registered Patent No. 522029 which was filed by this applicant.
[33] The abnormal action detector 20 detects subjects with abnormal actions among the subjects selected by the preprocessor 10 by scoring the extents to which the selected subjects deviate from the normal. That is, the abnormal action detector 20 applies a scoring technique based on a statistical method to perform a procedure for calculating an abnormal action score of each IP address and/or search word.
[34] The abnormal action detector 20 calculates, as a score, the extent of abnormality of the search word summary information and/or IP address summary information, which has been modeled as the multidimensional distribution, using a statistical method and determines that an abnormal action is included in the search word summary information and/or IP address summary information whose calculated score is equal to or higher than a reference value. In order to increase the efficiency of data processing, data can be compressed by reducing the dimension of the modeled search word summary information and/or IP address summary information before the score is calculated.
[35] The abnormal action corrector 30 corrects the search log by removing the abnormal action detected by the abnormal action detector 20 from the search log using a predetermined point-deduction logic. In one embodiment, the abnormal action corrector 30 can remove contaminated parts from the search word summary information and/or IP address summary information in which an abnormal action has been detected using the point-deduction logic that is based on an information theory for measuring the difference of distributions. Specifically, the abnormal action corrector 30 performs a procedure for deducting the search count of the abnormal action using the point- deduction logic so that only normal actions remain in the search log. In this manner, abusive actions of search words for unfair purposes can be detected and treated to keep the search word clean.
[36] A method for preventing abuse of search logs according to one embodiment of the invention will now be described in detail with reference to the configuration of the apparatus for preventing abuse of search logs according to the embodiment of the invention.
[37] FIG. 2 is a flow chart of the method for preventing abuse of search logs according to the embodiment of the invention.
[38] As shown in FIG. 2, in order to prevent abuse of search logs, first, subjects of abnormal action inspection are selected from a search log (S100). In one embodiment, search word summary information generated by statistically analyzing the input count of a specific search word at each IP address and/or IP address summary information generated by statistically analyzing the input count of each search word at a specific IP address are selected as subjects of abnormal action inspection from the search log. Here, the search word summary information and the IP address summary information can be modeled as a multidimensional distribution using a statistical method.
[39] Then, the extents to which the selected search word summary information and/or IP address summary information deviates from the normal are scored to detect an abnormal action (S200). In one embodiment, the method may further include the step S300 of correcting the search log by removing the detected abnormal action using the predetermined point-deduction logic.
[40] A procedure for selecting a subject of inspection will now be described in detail with reference to FIG. 3.
[41] As shown in FIG. 3, in order to select a subject of abnormal action inspection, first, an input count vector of a specific search word at each IP address and/or an input count vector of each search word at a specific IP address are generated within a predetermined time window from the search log (S110). Thereafter, the dimension of each of the generated input count vectors (i.e., the input count vector of the specific search word at each IP address and/or the input count vector of each search word at the specific IP address) is reduced to generate search word summary information and/or IP address summary information (S120).
[42] The above inspection subject selection procedure will now be described in more detail with reference to a specific embodiment. Of course, this is just one embodiment of the inspection subject selecting method and various modifications are possible.
[43]
[44] 1. First Step - Preprocessing Step
[45] In order to perform search word abuse inspection, there is a need to generate IP address summary information and search word summary information from a search log database. At one IP address, a number of search words are input for a predetermined time. In order to measure the extents to which the patterns of searches performed at the IP address differ from those of other IP addresses, it is necessary to generate IP address summary information. In addition, one search word is inputted from various IP addresses. Accordingly, there is a need to generate summary information of each IP address at which the corresponding search word is inputted.
[46] However, there is a need to select IP addresses and search words which are subjects of inspection since the number of IP addresses, the number of search words, and the number of combinations thereof are very great. Processing all of them may cause memory problems. [48] 1) Representation of Input Vector
[49] The following vector representations may be employed to generate IP address summary information and search word summary information.
[50] When N_I is the total number of IP addresses and N_Q is the total number of search words, information of a specific IP address during a predetermined time window W at a specific search time can be represented by a vector indicating the input count of each search word at the specific IP address as follows.
[51]
[52] MATHEMATICAL EXPRESSION 1
[53] (c^q_1, ..., c^q_{N_Q})
[54] where c^q_k is the number of times a kth search word has been input at the specific IP address.
[55] Similarly, information of a specific search word during the predetermined time window W can be represented by a vector indicating the input count of the search word at each IP address as follows.
[56] MATHEMATICAL EXPRESSION 2
[57] (c_1, ..., c_{N_I})
[58] where c_k is the number of times the specific search word has been input at a kth IP address.
[59] However, maintaining all the vector representations must face memory problems since the total number of IP addresses N_I and the total number of search words N_Q are very great.
[60]
[61] 2) Selection of To-Be-Inspected IP addresses and Search Words using Hash Buckets
[62] The number of different search words among search words that have been input at a specific IP address within a predetermined time window W will be very small compared to the total number of search words N_Q. The number of different IP addresses among IP addresses at which a specific search word has been inputted within a predetermined time window W will also be very small compared to the total number of IP addresses N_I. Based on these characteristics, the memory problems described above can be overcome by generating summary information of the specific IP address and summary information of the specific search word. That is, the problems can be solved using a much smaller number of hash buckets than the total number of search words or the total number of IP addresses.
[63] If the number of buckets D ≪ N_I, N_Q, summary information of a specific IP address can be represented by a hash bucket count vector as follows.
[64]
[65] MATHEMATICAL EXPRESSION 3
[66] (h^q_1, ..., h^q_D)
[67] where h^q_k is the number of hits of a kth bucket at the specific IP address. When a search word q has been inputted at the specific IP address, the index k of a bucket associated with the search word q is calculated using a hash function as follows.
[68]
[69] MATHEMATICAL EXPRESSION 4
[70] k = hash(q) % D
[71] Then, the count of the bucket corresponding to the calculated index k is incremented.
[72] Through this procedure, information of the specific IP address can be represented in summary by a vector having a length corresponding to the number of buckets D as expressed in Mathematical Expression 3 to generate IP address summary information. In the same manner, information of a search word can also be represented in summary by a vector having a length corresponding to the number of buckets D to generate search word summary information.
[73] Thus, the memory problems can be overcome by representing each of the IP address information and the search word information in summary by a vector having a length corresponding to the number of buckets D, which is much smaller than the total number of IP addresses N_I and/or the total number of search words N_Q.
[74] On the other hand, the IP address summary information and the search word summary information generated through the above procedure can be modeled as a multidimensional distribution using a statistical method.
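By way of illustration, the following Python sketch builds the hash-bucket count vectors of Mathematical Expressions 3 and 4 from a stream of (IP address, search word) log records. The record format, the bucket count D, and the helper names are assumptions made for this example, not details taken from the patent.

```python
import hashlib
from collections import defaultdict

D = 1024  # number of hash buckets; assumed to be much smaller than N_I and N_Q

def bucket_index(token, num_buckets=D):
    """k = hash(token) % D (Mathematical Expression 4), using a stable hash."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

def summarize(log_records):
    """Build D-dimensional hash-bucket count vectors within one time window.

    log_records is an iterable of (ip_address, search_word) pairs, an assumed
    simplification of the search log.  Returns two dictionaries:
      ip_summary[ip]     -> counts of hashed search words input at that IP address
      word_summary[word] -> counts of hashed IP addresses from which that word was input
    """
    ip_summary = defaultdict(lambda: [0] * D)
    word_summary = defaultdict(lambda: [0] * D)
    for ip, word in log_records:
        ip_summary[ip][bucket_index(word)] += 1    # Expression 3 for the IP address
        word_summary[word][bucket_index(ip)] += 1  # the same idea for the search word
    return ip_summary, word_summary

# Toy usage
records = [("1.2.3.4", "flower"), ("1.2.3.4", "flower"), ("5.6.7.8", "flower")]
ip_s, word_s = summarize(records)
print(sum(ip_s["1.2.3.4"]), sum(word_s["flower"]))  # 2 3
```

Whichever hash function is used, the essential property is only that it maps an unbounded vocabulary of search words or IP addresses onto a fixed, small index range.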
[75] A method for scoring the extents of abnormal actions based on vector representations using hash buckets expressed in Mathematical Expression 3 will now be described in detail with reference to FIG. 4 which is a detailed flow chart of the abnormal action detection procedure.
[76] As shown in FIG. 4, in order to detect an abnormal action, first, the dimension of search word summary information and/or IP address summary information modeled as a multidimensional distribution using a statistical method is reduced to compress data (S210). In one embodiment, principal components analysis (PCA), which maps input data to orthogonal coordinates, can be used as a method for compressing data.
[77] Then, the extent of abnormality of the search word summary information and/or IP address summary information with the reduced dimension is calculated as a score according to the extent of deviation from the center (S220). In one embodiment, the score corresponding to the extent of abnormality can be calculated as a ratio to a reference value using a statistic modeled through the sum of samples of independent standard normal distributions of the reduced dimension.
[78] Finally, it is determined that an abnormal action is included in search word summary information and/or IP address summary information whose calculated score is equal to or greater than a reference value (S230). In other words, search word summary information and/or IP address summary information is detected as an abnormal action if its calculated score is equal to or greater than the reference level.
[79] The above abnormal action detection procedure will now be described in more detail with reference to a specific embodiment. Of course, this is just one embodiment of the abnormal action detection method and various modifications are possible.
[80]
[81] 2. Second Step - Abnormal Action Detection Step
[82] As expressed above in Mathematical Expression 3, IP address summary information and search word summary information can be expressed, respectively, by a vector which includes, as elements, the respective input counts of each search word at a specific IP address and a vector which includes, as elements, the respective input counts of a specific search word at each IP address.
[83] If the vector is h = (h_1, ..., h_D), it exhibits a discrete distribution and can be expressed by Discrete(x | p).
[84] Here, p is a probability vector which is calculated as follows.
[85]
[86] MATHEMATICAL EXPRESSION 5
[87] p_k = h_k / (h_1 + h_2 + ... + h_D)
[88] Finally, using the probability vector p, IP address summary information and/or search word summary information is represented by a probability vector set as follows.
[89]
[90] MATHEMATICAL EXPRESSION 6
[91] p = (p_1, ..., p_D)
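As a small illustration, a bucket count vector h from the preprocessing sketch above can be normalized into the probability vector p of Mathematical Expressions 5 and 6; NumPy is assumed here purely for convenience.

```python
import numpy as np

h = np.array([3, 0, 5, 2], dtype=float)  # toy hash-bucket count vector with D = 4
p = h / h.sum()                           # Expression 5: p_k = h_k / (h_1 + ... + h_D)
print(p)                                  # [0.3 0.  0.5 0.2], i.e. Expression 6: p = (p_1, ..., p_D)
```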
[92] In the following, the invention suggests a method for scoring the extent of deviation of search word summary information and/or IP address summary information, represented using the probability vector p as in Mathematical Expression 6, from the normal action.
[93]
[94] 1) Data Compression using PCA
[95] According to one embodiment of the invention, a data compression procedure is performed for smoother data processing. Specifically, data is compressed by reducing the dimension D corresponding to the number of buckets using principal components analysis (PCA). More specifically, this method is to find principal component vectors which give a high variance to the mapped values of the discrete probability distribution Discrete(x | p) representing IP address summary information or search word summary information. That is, this method is to find some eigenvectors which most closely exhibit the characteristics of the discrete probability distribution.
[96] In this PCA method, generally, only d principal component vectors (d < D), which account for most of the variance of the corresponding discrete probability distribution, are used as the principal component vectors. Here, for input data mapped to the d principal component vectors, the mapped values are uncorrelated with one another, although their variances differ, since the principal component vectors are orthogonal to each other. A detailed description of PCA is omitted herein since it is a widely known method.
[97] Using this PCA method, the discrete probability distribution indicating IP address summary information or search word summary information, whose dimension D corresponds to the number of buckets, is reduced to d dimensions, much lower than D, to compress the data, thereby increasing the data processing efficiency.
[98] A method for scoring the extent of deviation from the normal action using the d-dimensional input data for which PCA has been done will now be described in detail.
[99]
[100] 2) Scoring Method to measure Extent of Abnormality
[101] It will be understood that the components of the input data mapped to the d-dimensional principal component vectors through the above PCA method have different variances. This indicates that the scaling of each dimension is different. In this case, a prewhitening method can be used to scale the principal component vectors so that the variance of each dimension becomes 1, which helps with visualization and post-processing.
[102] Given a prewhitened mapping matrix E, let us express the mapped value of the input vector x by a d-dimensional vector x~ = (x~_1, ..., x~_d). Here, x~_i and x~_j have no correlation and the variance var{x~_i} = 1.
[103] Now, let us assume the following in order to score the abnormal action according to the invention.
[104] 1) Each x~_i complies with a standard normal distribution N(0, 1).
[105] 2) x~_i and x~_j are independent of each other when i ≠ j.
[106] Here, although the lack of correlation does not imply independence, the invention uses this stronger assumption in order to increase the efficiency of data processing.
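The following NumPy sketch shows one way to obtain the d-dimensional prewhitened projection x~ described above from a collection of D-dimensional probability vectors. The sample data, the value of d, and the function names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(16), size=500)  # toy mother group: 500 probability vectors, D = 16
d = 3                                      # reduced dimension d << D (assumed)

# Principal components analysis on the centered data.
mean = P.mean(axis=0)
cov = np.cov(P - mean, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues come back in ascending order
order = np.argsort(eigvals)[::-1][:d]      # keep the d components with the largest variance
components, variances = eigvecs[:, order], eigvals[order]

def whiten(x):
    """Map a D-dimensional vector x to the d-dimensional prewhitened vector x~.

    Each projected component is divided by the square root of its variance,
    so that var{x~_i} = 1 over the mother group.
    """
    return (x - mean) @ components / np.sqrt(variances)

x_tilde = whiten(P[0])
print(x_tilde.shape)  # (3,)
```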
[107] We can define the following statistic under these assumptions.
[108]
[109] MATHEMATICAL EXPRESSION 7
[110] S = Σ_{i=1}^{d} (x~_i)²
[111] Generally, in statistics, a chi-square (χ²) distribution with d degrees of freedom is modeled through the sum of the squares of samples of d independent standard normal distributions. Thus, under the above assumptions, we can assume that the statistic S as expressed in Mathematical Expression 7 complies with the chi-square distribution with d degrees of freedom.
[112] Now let us define s* as the smallest value s which satisfies cdf{χ²(s; d)} ≥ 1 − α. Here, cdf{χ²(s; d)} represents the cumulative probability distribution value up to the boundary s, and α is an error or significance level which is preferably set to 0.05 or 0.01. As a result, s* indicates the upper boundary of a normal range which does not exceed the threshold α, and it can be considered that all S values exceeding s* are included in an abnormal range.
[113] Accordingly, in the invention, to score the extent of deviation from the normal action, an abuse score is defined as follows.
[114]
[115] MATHEMATICAL EXPRESSION 8
[116] score = s/s*
[117] That is, as the score increases above 1, the corresponding probability 1 − cdf{χ²(s; d)} decreases below the threshold α. This provides a basis on which to determine whether the corresponding action is a very rare occasion under the given assumptions. That is, if the abuse score defined according to Mathematical Expression 8 is higher than 1, it can be determined that the corresponding action is a rare occasion out of the normal range and thus that it is an abnormal action.
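Continuing the sketch above, the abuse score of Mathematical Expressions 7 and 8 can be computed as follows; SciPy's chi-square distribution supplies s*, and α = 0.05 is an assumed significance level.

```python
import numpy as np
from scipy.stats import chi2

alpha = 0.05  # error or significance level (0.05 or 0.01 in the text)
d = 3         # reduced dimension, matching the PCA sketch above
s_star = chi2.ppf(1.0 - alpha, df=d)  # smallest s with cdf{chi^2(s; d)} >= 1 - alpha

def abuse_score(x_tilde):
    """Expression 7: S = sum_i (x~_i)^2; Expression 8: score = S / s*."""
    S = float(np.sum(np.square(x_tilde)))
    return S / s_star

# A sample whose whitened components all sit about two standard deviations out
print(abuse_score(np.array([2.0, -2.0, 2.0])))  # > 1, i.e. flagged as abnormal
```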
[118] The statistical method will now be described in detail with reference to an example of FIG. 5. FIG. 5 illustrates an example of a chi-square distribution with one degree of freedom. In FIG. 5, s* indicates the upper boundary 902 of the normal range of a chi-square distribution when the threshold indicating an error or significance level is α, and it can be considered that all S values exceeding s* are included in an abnormal range.
[119] That is, the region 904, which is obtained by subtracting the cumulative probability distribution cdf{χ²(s; d)} from 1, is an abnormal region, and it can be considered that all S values included in the region 904 are included in the abnormal range.
[120] A procedure for correcting a search log according to one embodiment of the invention will now be described with reference to FIG. 6.
[121] As shown in FIG. 6, in order to correct a search log, contaminated parts are removed from search word summary information and/or IP address summary information, which has been detected as an abnormal action, using the point-deduction logic which is based on an information theory for measuring the difference of distributions (S310).
[122] An abnormal action can be removed using a Kullback-Leibler (KL) distance indicating the difference between the distributions of a probability model of a mother group and a probability model of the search word summary information and/or IP address summary information in which the abnormal action has been detected.
[123] The above search log correction procedure will now be described in more detail with reference to a specific embodiment. Of course, this is just one embodiment of the search log correction method and various modifications are possible.
[124]
[125] 3. Third Step - Search Log Correction Step
[126] 1) Means for Measuring Difference of Distributions - KL Distance
[127] As described above, the point-deduction logic, which is used for search log correction according to the embodiment of the invention, uses a KL distance as means for measuring the difference between the distributions of a probability model of a mother group and a probability model of the search word summary information and/or IP address summary information in which the abnormal action has been detected.
[128] This KL distance is based on an information theory (Cover and Thomas (1991)).
For example, given two distributions p and q, the KL distance between the two distributions can be obtained as follows.
[129]
[130] MATHEMATICAL EXPRESSION 9
[131] KL(p, q) = Σ_i p_i log(p_i / q_i)
[132] Accordingly, the KL distance has a value of zero when the two distributions are identical.
[133]
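For reference, a direct NumPy implementation of the KL distance of Mathematical Expression 9 might look as follows; treating buckets with p_i = 0 as contributing nothing is a common convention assumed here.

```python
import numpy as np

def kl_distance(p, q):
    """KL(p, q) = sum_i p_i * log(p_i / q_i); zero when the two distributions are identical."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                   # 0 * log(0 / q_i) is taken as 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

print(kl_distance([0.5, 0.5], [0.5, 0.5]))  # 0.0
print(kl_distance([0.9, 0.1], [0.5, 0.5]))  # > 0
```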
[134] 2) Point-Deduction Logic
[135] For the sake of convenience, let us assume that the N data used to construct a model are a mother group and let us express the mother group by an N×D matrix M. An ith row m_i of M is a vector storing the count of each hash bucket. The matrix M is normalized based on rows to obtain a discrete probability model m.
[136]
[137] MATHEMATICAL EXPRESSION 10
[138]
m_i = M_i / Σ_j M_ij, i = 1, ..., N (each row of M divided by its row sum)
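The equation image for Mathematical Expression 10 is not legible in this text, so the following sketch only assumes the most literal reading of the sentence above: each row of M is divided by its row sum, and, as a further assumption, the normalized rows are averaged into a single mother-group distribution m for use in Mathematical Expression 11. Treat it as one possible reading, not the patent's exact formula.

```python
import numpy as np

def mother_group_model(M: np.ndarray) -> np.ndarray:
    """Assumed reading of Mathematical Expression 10: normalize each row of the
    N x D count matrix M to a probability vector, then (assumption) average the
    rows into one mother-group distribution m over the D hash buckets."""
    row_sums = M.sum(axis=1, keepdims=True)
    normalized_rows = M / np.maximum(row_sums, 1)  # guard against empty rows
    return normalized_rows.mean(axis=0)
```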
[139] When h is a hash bucket vector of an abnormal pattern and p is a discrete probability model of the same, a KL distance between the discrete probability distribution m of the mother group and the discrete probability distribution p which is a subject of inspection is calculated as follows.
[140]
[141] MATHEMATICAL EXPRESSION 11
[142]
KL(p, m) = Σ_i p_i log(p_i / m_i)
[143] By reducing the counts of specific elements in the hash bucket vector h using Mathematical Expression 11, it is possible to reduce the difference between the discrete probability model of the mother group and the changed discrete probability model.
[144] Specifically, as the term p_i log(p_i / m_i) of a hash bucket i takes a higher positive value, the KL distance between the two distributions increases, making the distribution p abnormal. Accordingly, when the threshold is β, hash buckets which satisfy p_i log(p_i / m_i) > β are candidates for correction, to which the point-deduction logic is applied in order to remove the abnormal action and keep the search log clean.
[145] FIG. 7 illustrates the point-deduction logic used in the search log correction procedure according to the embodiment of the invention.
[146] The overall point-deduction logic is shown in FIG. 7. Here, the function "find()" retrieves the index of an element which satisfies the condition in "()". The function "ceil()" retrieves the smallest integer greater than the argument in "()". The operator ".*" performs element-wise multiplication of vectors, and "score" indicates the abuse score defined in Mathematical Expression 8 above. "P" indicates a search word input count, "p" indicates the normalized probability function of "P", "β" indicates the threshold for selecting candidates for correction, and "f" indicates the KL distance between the discrete probability distribution m of the mother group and the discrete probability distribution p which is a subject of inspection.
[147] The overall point-deduction logic is as follows. First, the input count of a specific search word at each IP address or the input count of each search word at a specific IP address is normalized to obtain a probability function, and a KL distance is calculated based on the difference from the probability function of the mother group (904). An index i for which the obtained KL distance is greater than the threshold β is then obtained. The obtained index indicates a search word or IP address including an abnormal action. The search count corresponding to the obtained index is reduced (906) and the threshold β is adjusted.
[148] This point-deduction logic is repeated until the score enters the normal range such that score < 1, or until there is no candidate which exceeds the threshold β. The threshold β is increased at each repetition of the logic in order to apply a stricter standard for point deduction at the next repetition, since point deduction has already been done on an important abnormal action at the initial stage of the previous repetition.
[149] FIG. 8 illustrates a user interface screen provided by an apparatus for preventing abuse of search logs according to one embodiment of the invention.
[150] As shown in FIG. 8, a search word list and an IP address list selected as subjects of inspection are displayed in a left window of the screen and a count to be subjected to point deduction according to an abuse score calculated as the extent of abnormality is displayed in a middle window.
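Before turning to the experimental results, the iterative procedure of paragraphs [147] and [148] can be sketched roughly as follows. This is only an approximation of FIG. 7: the function name, the fraction by which offending bucket counts are reduced, and the caller-supplied score_fn are assumptions; the candidate test p_i log(p_i / m_i) > β, the stopping conditions, and the increase of β by a scale value at each pass follow the description above.

```python
import numpy as np

def point_deduction(P, m, score_fn, beta=np.log(1.8), scale=np.log(1.3),
                    reduce_frac=0.1, max_iter=100):
    """Rough sketch of the point-deduction loop (not the patent's exact logic).

    P        -- hash-bucket count vector of the suspect search word or IP address
    m        -- discrete probability distribution of the mother group
    score_fn -- callable mapping a probability vector to its abuse score s/s*
                (the PCA/chi-square scoring used in the detection step)
    """
    P = np.asarray(P, dtype=float).copy()
    for _ in range(max_iter):
        total = P.sum()
        if total == 0:
            break
        p = P / total                                    # normalized probability function of P
        contrib = p * np.log((p + 1e-12) / (m + 1e-12))  # per-bucket KL terms p_i*log(p_i/m_i)
        candidates = np.flatnonzero(contrib > beta)      # find(): buckets exceeding beta
        if score_fn(p) <= 1.0 or candidates.size == 0:   # stop once the score is back in range
            break
        # Deduct points from the offending buckets; ceil() keeps the deduction integral.
        P[candidates] -= np.ceil(P[candidates] * reduce_frac)
        P = np.maximum(P, 0.0)
        beta += scale                                    # apply a stricter standard next pass
    return P
```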
[151] FIGS. 9 to 13 illustrate experimental results according to a method for preventing abuse of search logs according to one embodiment of the invention.
[152] Let us consider results on July 7, 2006 at around 12:30 in order to check the performance of the method for preventing abuse of search logs according to the embodiment of the invention. In this experiment, a time window W is set to one hour and the number of hash buckets D is set to 32. In addition, a threshold α = 0.01, β=log(1.8), scale=log(1.3) are set.
[153] A model is built from search word summary information. The abuse inspection described above is performed on a candidate set for inspection. FIG. 9 illustrates the top 20 search words with the highest abuse scores calculated after the abuse inspection is performed. The discrete probability model of each sample is expressed in the form of a histogram. The vertical axis represents a probability value, and its scale is fixed to [0, 1]. The horizontal axis represents the hash bucket index. A search word name and an abuse score are recorded at the top of each figure. The top 20 search words are each expected to include an abnormal action since their abuse scores range from 3 to 9, all higher than 1.
[154] FIG. 10 illustrates example results of the point-deduction processing of the top 20 abusive search words detected according to the invention. Original hash buckets before the point-deduction and hash buckets to which the point-deduction logic has been applied are shown in pairs in each row. By comparing the scores, one can confirm that, after point-deduction, the abuse scores are less than 1 so that abnormal actions have been removed.
[155] FIG. 11 illustrates example comparisons of the discrete probability distribution values with the point-deduction processing results according to the invention. One can confirm that the abnormal abuse scores of 3 to 9 of the search words were corrected to a normal range of 1 or less. For example, it can be seen from FIG. 11 that the abuse score of 9.673833 of the search word "type" was corrected to a level of about 0.211166, within the normal range, after the point-deduction processing was performed to remove abnormal actions. For the sake of convenience, the vertical axis after point-deduction was scaled to [0, 0.1].
[156] FIG. 12 illustrates a probability model of a mother group which was a basis for calculating the KL distance at the point-deduction logic.
[157] When comparing scores before and after point-deduction, one can tell that abnormal actions were removed through the point-deduction logic so that the abnormal abuse scores of search words were recovered to normal levels.
[158] FIG. 13 illustrates example results of point-deduction processing of the top 40 search words. The abuse score of each search word is written on the left side. A total search count before point-deduction and a point-deduction count calculated according to the point-deduction logic are written for each search word. That is, it is possible to remove contaminated parts from each search word detected as an abnormal action, thereby correcting the search log, by subtracting the point-deduction count calculated according to the point-deduction logic from the total search count of the search word.
[159] The above description has been given of the method in which a search count of a search word, which is determined to be abnormal using search word summary information, is deducted through point-deduction to maintain information of normal actions only in a search log. Most search word abuse problems can be overcome sufficiently through the abuse detection and treatment method using search word summary information as described above. However, in rare cases, search word summary information may have an abuse score of less than 1 so that it is detected as a normal action although it actually involves an abnormal action due to search word abuse. In this case, the abusive action can be additionally corrected using IP address summary information. A detailed description of this method is omitted herein since it is similar to the abuse detection and treatment method using search word summary information.
[160] In the above description, the invention has suggested the method and apparatus for preventing abuse of search logs to maintain the search logs clean through diagnosis and post-processing of search abuses in the search logs. Specifically, a hash-bucket-based data structure is built to express IP address summary information and search word summary information and is then converted into a discrete probability model to express input data.
[161] The invention has also suggested a technique capable of detecting samples which are abnormal compared to normal samples. Additionally, the invention has suggested a statistics-based scoring technique in which input data is transformed to a space of orthogonal principal component vectors through a PCA method and the extent of deviation from the center is then measured.
[162] Finally, the invention has suggested a point-deduction technique which converts abnormal samples into normal samples based on an information theory.
[163] The above method for preventing abuse of search logs can be written as a computer program. Codes and code segments which constitute the program can be easily inferred by computer programmers in the art of the invention. The program is stored in a computer-readable medium. The stored program is read and executed by a computer to implement the method for preventing abuse of search logs. The information storage medium includes a magnetic recording medium, an optical recording medium, and a carrier-wave medium.
[164] Although the invention has been described focusing on the preferred embodiments, those skilled in the art will appreciate that the invention may be carried out in modified forms without departing from the essential characteristics of the present invention. Therefore, the above embodiments should be construed in all aspects as illustrative and not restrictive. The scope of the invention should be determined by the appended claims and their legal equivalents, not by the above description, and all changes coming within the equivalency range of the appended claims should be construed as being embraced in the invention.
[165]
Industrial Applicability
[166] As described above, according to the invention, it is possible to efficiently detect abusive search words having abnormal actions from a search log. It is also possible to deduct the search count of each search word determined to be abusive so that only the valid search count of each search word remains, thereby removing the abusive action, preventing abuse of search logs, and keeping the search logs clean.

Claims

[1] A method for preventing abuse of a search log, the method comprising: selecting a subject of abnormal action inspection from the search log; and detecting an abnormal action by scoring an extent of deviation of the selected subject from a normal action.
[2] The method according to claim 1, further comprising correcting the search log by removing the detected abnormal action from the search log using a predetermined point-deduction logic.
[3] The method according to claim 2, wherein the step of selecting the subject of abnormal action inspection includes: generating, from the search log, at least one of search word summary information by statistically analyzing an input count of a specific search word at each IP address within a predetermined time window and IP address summary information by statistically analyzing an input count of each search word at a specific IP address within the specific time window, wherein the step of detecting the abnormal action includes detecting the abnormal action from the at least one of the search word summary information and the IP address summary information.
[4] The method according to claim 3, wherein the step of generating the summary information includes: generating, from the search log, at least one of an input count vector of a specific search word at each IP address within a predetermined time window and an input count vector of each search word at a specific IP address within the predetermined time window; and reducing a dimension of the input count vector of the specific search word at each IP address to generate the search word summary information or reducing a dimension of the input count vector of each search word at the specific IP address to generate the IP address summary information.
[5] The method according to claim 4, wherein the step of reducing the dimension of the input count vector includes: converting the input count vector of a specific search word at each IP address and the input count vector of each search word at a specific IP address into count vectors of a limited number of hash buckets.
[6] The method according to claim 3, wherein the search word summary information and the IP address summary information is modeled as a multidimensional distribution using a statistical method.
[7] The method according to claim 6, wherein the step of detecting the abnormal action includes: calculating a score corresponding to an extent of abnormality of at least one of the search word summary information and the IP address summary information modeled as the multidimensional distribution according to an extent of deviation from a center; and determining that an abnormal action is included in the at least one of the search word summary information and the IP address summary information whose calculated score is equal to or higher than a reference value.
[8] The method according to claim 7, wherein the step of detecting the abnormal action further includes: compressing data by reducing a dimension of the at least one of the modeled search word summary information and IP address summary information before calculating the score.
[9] The method according to claim 8, wherein the step of compressing the data is performed using a principal components analysis method which maps input data to orthogonal coordinates.
[10] The method according to claim 8, wherein the step of calculating the score includes calculating the score corresponding to the extent of abnormality as a ratio to a reference value using a statistic modeled through a sum of samples of independent standard normal distributions of the reduced dimension.
[11] The method according to claim 10, wherein the score of the extent of abnormality is calculated using an equation of score = s/s*, where the statistic s complies with a chi-square distribution with d degrees of freedom modeled through a sum of samples of independent standard normal distributions and s* indicates the upper boundary of a normal range which does not exceed a threshold α.
[12] The method according to claim 11, wherein it is determined that all statistics s exceeding s* are included in an abnormal range.
[13] The method according to claim 6, wherein the point-deduction logic removes the abnormal action using a Kullback-Leibler (KL) distance indicating a difference between respective distributions of a probability model of a mother group and a probability model of at least one of the search word summary information and the IP address summary information in which the abnormal action has been detected.
[14] The method according to claim 3, wherein the step of correcting the search log includes: removing the abnormal action from at least one of the search word summary information and the IP address summary information in which the abnormal action has been detected using a point-deduction logic that is based on an information theory for measuring a difference of distributions.
[15] The method according to claim 1, wherein the step of selecting the subject of abnormal action inspection includes selecting a search word and/or an IP address as a subject of abnormal action inspection, and the step of detecting the abnormal action includes detecting the abnormal action from the selected search word and/or IP address.
[16] A computer-readable storage medium storing a program for causing a computer to perform the method according to any one of claims 1 to 15.
[17] An apparatus for preventing abuse of a search log, the apparatus comprising: a preprocessor for selecting a subject of abnormal action inspection from a search log; an abnormal action detector for detecting an abnormal action by scoring an extent of deviation of the selected subject from a normal action; and an abnormal action corrector for correcting the search log by removing the detected abnormal action from the search log using a predetermined point-deduction logic.
[18] The apparatus according to claim 17, wherein, in order to select the subject of abnormal action inspection, the preprocessor generates, from the search log, search word summary information by statistically analyzing an input count of a specific search word at each IP address within a predetermined time window and/or IP address summary information by statistically analyzing an input count of each search word at a specific IP address within the predetermined time window, and the abnormal action detector detects the abnormal action from at least one of the search word summary information and/or the IP address summary information.
[19] The apparatus according to claim 18, wherein the search word summary information and/or the IP address summary information is modeled as a multidimensional distribution using a statistical method.
[20] The apparatus according to claim 19, wherein the preprocessor reduces a dimension of an input count vector of the specific search word at each IP address to generate the search word summary information and reduces a dimension of an input count vector of each search word at the specific IP address to generate the IP address summary information.
[21] The apparatus according to claim 19, wherein the abnormal action detector calculates a score corresponding to an extent of abnormality of the modeled search word summary information and/or IP address summary information according to an extent of deviation from a center, and determines that an abnormal action is included in the search word summary information and/or the IP address summary information whose calculated score is equal to or higher than a reference level.
[22] The apparatus according to claim 21, wherein the abnormal action detector compresses data by reducing a dimension of the modeled search word summary information and/or IP address summary information before calculating the score.
[23] The apparatus according to claim 19, wherein the abnormal action corrector removes the abnormal action from the search word summary information and/or the IP address summary information in which the abnormal action has been detected using a point-deduction logic that is based on an information theory for measuring a difference of distributions.
PCT/KR2007/006104 2006-11-29 2007-11-29 Method and apparatus for preventing from abusing search logs WO2008066341A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2009539187A JP5118707B2 (en) 2006-11-29 2007-11-29 Search log misuse prevention method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2006-0119284 2006-11-29
KR1020060119284A KR100837334B1 (en) 2006-11-29 2006-11-29 Method and apparatus for preventing from abusing search logs

Publications (1)

Publication Number Publication Date
WO2008066341A1 true WO2008066341A1 (en) 2008-06-05

Family

ID=39468078

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2007/006104 WO2008066341A1 (en) 2006-11-29 2007-11-29 Method and apparatus for preventing from abusing search logs

Country Status (3)

Country Link
JP (1) JP5118707B2 (en)
KR (1) KR100837334B1 (en)
WO (1) WO2008066341A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210035025A1 (en) * 2019-07-29 2021-02-04 Oracle International Corporation Systems and methods for optimizing machine learning models by summarizing list characteristics based on multi-dimensional feature vectors

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101358266B1 (en) * 2012-03-30 2014-02-20 (주)네오위즈게임즈 Method of detecting game abuser and game abuser server performing the same

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006079454A (en) * 2004-09-10 2006-03-23 Fujitsu Ltd Search keyword analysis method, search keyword analysis program and search keyword analysis apparatus
US20060224554A1 (en) * 2005-03-29 2006-10-05 Bailey David R Query revision using known highly-ranked queries
US7136860B2 (en) * 2000-02-14 2006-11-14 Overture Services, Inc. System and method to determine the validity of an interaction on a network

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100516929B1 (en) * 2002-10-23 2005-09-23 한국과학기술정보연구원 Apparatus and method for analyzing task management, and storage media having program thereof
US7681181B2 (en) * 2004-09-30 2010-03-16 Microsoft Corporation Method, system, and apparatus for providing custom product support for a software program based upon states of program execution instability
US7848501B2 (en) * 2005-01-25 2010-12-07 Microsoft Corporation Storage abuse prevention


Also Published As

Publication number Publication date
KR100837334B1 (en) 2008-06-12
KR20080048827A (en) 2008-06-03
JP5118707B2 (en) 2013-01-16
JP2010511246A (en) 2010-04-08


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07834389

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2009539187

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07834389

Country of ref document: EP

Kind code of ref document: A1