WO2010021530A1 - System and method for displaying relevant textual advertising based on semantic similarity - Google Patents

System and method for displaying relevant textual advertising based on semantic similarity

Info

Publication number
WO2010021530A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
terms
vector
term
documents
Prior art date
Application number
PCT/MX2008/000109
Other languages
Spanish (es)
French (fr)
Inventor
Ramón Felipe BRENA PINERO
Eduardo Héctor RAMIREZ RANGEL
Original Assignee
Instituto Tecnologico Y De Estudios Superiores De Monterrey
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Instituto Tecnologico Y De Estudios Superiores De Monterrey filed Critical Instituto Tecnologico Y De Estudios Superiores De Monterrey
Priority to PCT/MX2008/000109 priority Critical patent/WO2010021530A1/en
Publication of WO2010021530A1 publication Critical patent/WO2010021530A1/en
Priority to MX2010011323A priority patent/MX2010011323A/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising

Definitions

  • the aim of the invention is to provide Internet users with advertisements semantically related to the documents being consulted.
  • a method is presented that selects the most relevant ads from a collection of possible ads.
  • LSA Latent Semantic Analysis
  • the matrices U and V provide a spatial representation of the semantic relationship between terms and documents, so that the semantic similarity of terms can be calculated as the cosine distance between vectors in the matrix U, and the semantic similarity of documents as the cosine distance between vectors in the matrix V^T.
  • the method has the disadvantage of a high update cost, since introducing a new document into the collection requires recalculating the entire matrix.
  • the LSA method cannot handle polysemy; that is, the fact that a word appears near others does not make it possible to conclude that the word has different meanings.
  • P(d,w) = [Sum over topics z] P(z) P(w|z) P(d|z)
  • the proposed invention fulfills a comparable objective in that it allows the latent structure of document collections to be extracted and semantic similarity to be calculated, by means of an algorithm and a simplified representation of topics defined as "semantic contexts".
  • the presented method makes use of information theory metrics, search indexes and local optimization algorithms to extract an unknown number of topics and can be scaled to much larger document collections.
  • Keyword selection is difficult. For example, the advertiser often does not choose enough related keywords for their campaign. This leads to a low exposure of the campaign.
  • Keyword selection is subject to ambiguity, because the advertiser can choose keywords that have multiple meanings, that is, polysemic words.
  • This situation may cause the ad to be presented in situations where it is not relevant.
  • the advertiser may mistakenly choose unrelated words. As in the previous case, this may lead to the presentation of irrelevant advertisements, which results in a lost-opportunity cost for both the advertiser and the system operator.
  • the system and method presented in this invention increases the relevance of the advertisements presented to the user, by semantically relating the advertisements to the electronic documents that are being read by a user at a given time.
  • the semantic relationship method performed by the system uses the statistical properties of language, and is therefore able to detect the semantic similarity of a given pair of documents (one of which may be an advertisement) that do not necessarily share terms in common but in fact relate to the same concepts.
  • FIGURES Figure 1: Flowchart illustrating the general method of ad impression
  • Figure 1 is a flowchart illustrating the general method of ad impression. It assumes a prior topic-identification process (described in Figure 3) (1), which produces a data structure identified as the "Topic Structure" (2). It also assumes that a collection of candidate advertisements has been stored in the database (3), and that the topics have been identified using the methods described later in this document.
  • the system can be executed by a user as follows: suppose the user consults an electronic document, typically a web page (4). Next, the system associates (5) the topics of the candidate advertisements with those related to the document in question and generates a list of related advertisements (6), which correspond to the same topics as the document consulted by the user (4).
  • FIG. 2 is a detailed flow chart illustrating the process followed by the system presented in this invention.
  • the first step to perform is the pre-processing of the terms of the documents (8).
  • the pre-processing is done sequentially, taking each document from the collection and applying the following transformations.
  • the first phase consists of separating the document into sentences, according to the punctuation and hypertext separators such as line breaks, tables and title tags.
  • sentences are reduced to word lists, eliminating words with purely grammatical functions, such as articles, pronouns, adverbs and the like, usually known as "stop-words".
  • For example, the English sentence "The quick brown fox jumps over the lazy dog" is reduced to the list: {quick, brown, fox, jumps, lazy, dog}. The set of all relevant terms included in the document is called the "vocabulary".
  • an inverted index is created.
  • the inverted index is a mapping between each term and the identifiers of the documents that contain that term.
  • Inverted indexes are a widely used technique in the field of information retrieval for efficiently locating the documents that contain a specific term.
  • a table of terms is constructed. Each record in the table of terms contains additional information about each term, for example its unique numerical identifier (called the term-id), the frequency of the term (the number of documents in which the term appears) and the sentence frequency of the term (the number of sentences in which the term appears).
  • Another necessary preparation phase is the generation of a matrix of co-occurrences of terms.
  • both the columns (j) and the rows (i) correspond to vocabulary terms, and cell (i, j) stores the number of times terms i and j occur in the same sentence.
  • the appearance of two terms in the same sentence is called a co-occurrence. Only terms with a frequency above a certain level are taken into account to feed the matrix; in other words, only terms that appear in a minimum number of sentences are stored in the database.
  • the next step is the construction of the set of Topics (9), which is prior to the use of the system by the end user.
  • the construction of the set of topics is illustrated in the Figure 3 and will be described later in this document.
  • W = {w1, ..., wk}.
  • the terms in a semantic context are the words that together "best" describe a given topic, where the exact meaning of "best" will be explained shortly.
  • the set of k words W is also called "core".
  • the terms in a core do not contain general elements of language, such as articles, prepositions or adverbs as a result of the preprocessing described in (8).
  • DW represents the set of documents that contain all the terms in W. Documents in DW are considered semantically close to each other.
  • the main characteristic that distinguishes a "core" from an arbitrary set of k vocabulary terms is that the metric called force is maximal when applied to it.
  • the "force" formula is the criterion for determining what a core is. The force is defined in turn, using the following formula:
  • c is a scale constant
  • J(W) is the joint frequency of the words, which is the number of documents in which all the words in the set W co-occur.
  • D(W) represents the quantity defined as the "disjoint frequency", which is the sum of the sizes of each of the disjoint sets of documents in which the i-th term occurs without co-occurring with any of the remaining words in the set W.
  • a weighted vector (t1, w1), (t2, w2), ..., (tn, wn) of the terms of each topic is calculated, where for each term ti, its weight wi represents the importance of the term ti in the topic considered.
  • the documents that match the query represented by the corresponding "core" are retrieved, i.e., the set DW of documents containing all the core words.
  • each document is represented as a vector of terms with the frequency of each term in the document, that is, [(t1, f1j), (t2, f2j), ..., (tn, fnj)] for a document j.
  • all the frequencies for the documents in DW are added, yielding a vector [(t1, f1,1 + f1,2 + ...), ..., (tn, fn,1 + fn,2 + ...)].
  • the standard TF-IDF formula is applied to calculate the weight of each term with respect to the core.
  • the TF-IDF formula is: wij = tf(i,j) · log(N / ni)
  • the similarity between the ad vector and the topic vector is calculated for each of the topic vectors.
  • This similarity is obtained with the standard "cosine distance", which is nothing but the scalar product of the vectors divided by the product of their magnitudes. This number provides a measure of the similarity of each ad with each topic.
  • a database (3) is formed with the similarities between each ad and each of the topics.
  • a "topic similarity vector" Td will be a vector of the form (T1, w1), (T2, w2), ..., (Tn, wn), where the Ti are the topics and the wi the weights, which are reciprocals of the cosine distance between the ad d and the topic Ti. This concludes the calculation of the similarity between advertisements and topics (11). After the previously described phases have been completed, the system can receive web documents through the network (12).
  • the user's request may contain the address of a remote document residing in the network, or the full text of the document may be locally available; therefore, to determine which case applies, a test is performed to verify whether the document is available (13) in the database (documents that were at some time in the database but have expired are not considered locally available). If the document is indeed in the database, the method retrieves its topic vector (16) from the per-document topic base. If not, the new document is stored in the index and in the database
  • the method is used to calculate its similarity to the topics (15), that is: construct a weighted term vector for the document, calculate the similarity of the document vector with each of the topic vectors and store the results in the topic-document base.
  • the method proceeds to rank the advertisements (17) for the document consulted by the user, which will be referred to as "d".
  • the method first selects the candidate advertisements using a pre-selection criterion. For each of these candidate advertisements, its topic vector is retrieved from the database. Finally, the cosine distance is calculated between each ad topic vector and the topic vector of the document "d", and the results (distances) are sorted in ascending order, so that the smallest distances appear first. The procedure ends when the ranked list of ads (18) is generated.
  • Figure 3 is a flowchart illustrating the process of extracting topics from the collection. It begins with a given set of pre-processed documents (19), which may be part of an organization's repository or a sample of a very large collection such as the Internet; the pre-processing was described in previous sections (8) and includes, for example, the elimination of non-essential terms, separation into sentences, construction of term frequency vectors and construction of the term co-occurrence matrix.
  • the result of this process is a set of "cores" (that is, sets of k terms, where k is a small integer, typically 3 or 4) of maximum force, using the measure defined in the formula described above.
  • an initial group of k terms called a "seed" is obtained for each document by taking the k terms with the highest TF-IDF for that document.
  • the initial cores are the seeds calculated in the previous phase.
  • each of the cores is systematically modified, changing one of its terms to test whether the strength of the resulting variant increases; if it does, the variant takes the place of the core it came from and the original core is discarded; if it does not, a new variant is tested.
  • the difficulty of this step lies in avoiding testing too many variants, since in principle, if there are n terms in the vocabulary (typically several thousand), then there are n! / (k! (n-k)!) possible variants, which is an intractable number even for a small value of k.
  • the co-occurrence matrix serves to avoid testing every possible combination of terms; the procedure described considers only the terms with a significant level of co-occurrence with the k-1 terms remaining in the core, that is, only terms with co-occurrences above a predetermined level are candidates to replace a core term. Once all viable candidate terms have been tested for each of the core terms without increasing the strength, the core is guaranteed to have maximal strength. When two or more cores being refined are identical, these cores are merged into one. Thus, the procedure produces as a final result a collection of unique cores with maximal strength (22).

Abstract

The invention presents a method for finding advertisements that are semantically related to the documents being consulted. To that end, the invention presents a method which selects the most relevant semantically related advertisements from a collection of possible advertisements, given a document consulted by an Internet user. For that purpose, the invention presents a method which calculates structures called "semantic contexts" that represent topics or contexts. The invention likewise presents a method which uses the semantic contexts to measure the conceptual proximity between a document and an advertisement. Since this method does not depend on an exact word match like many other methods in the prior art, it is less vulnerable to synonymy, polysemy and word omissions.

Description

SYSTEM AND METHOD FOR DISPLAYING RELEVANT TEXTUAL ADVERTISING BASED ON SEMANTIC SIMILARITY
FIELD OF THE INVENTION
The aim of the invention is to provide Internet users with advertisements semantically related to the documents they are consulting. For this purpose, a method is presented that selects the most relevant ads from a collection of possible ads.
BACKGROUND OF THE INVENTION
Over the last 10 years, the advertising of products and services has followed a trend of migration from traditional media such as radio, television and the press toward computer networks such as GSM and the Internet. This trend is expected to keep growing at an accelerated rate in the coming years, because electronic media offer greater possibilities for targeting advertising and marketing campaigns and for measuring their effectiveness.
Techniques in the state of the art define relevance as a function of the lexical similarity between a pair of documents. Such a definition is effective in applications where one document is actually an explicit query provided by the user, but it is not effective when it comes to automatically relating two documents, such as a web page and a short textual advertisement of fewer than 20 words. A problem that arises in both cases is that the words used in the two documents must be highly similar, or the same. This situation is a limitation of information retrieval techniques and has been characterized by Furnas et al. (Furnas 1987) as the "vocabulary mismatch problem". In 1989, Scott Deerwester et al. (US-4839853) presented a method to solve the vocabulary mismatch problem in information retrieval, based on the calculation of a latent semantic structure. The method, usually known as Latent Semantic Analysis (LSA), is a corpus-based method that begins by creating a term-document matrix X; then, taking each of the rows of the matrix, a new matrix (X^T)(X) is produced. The new matrix contains information about how each term relates to the others, in terms of their total frequency in the documents. Finally, by factoring the matrix (X^T)(X) with the Singular Value Decomposition (SVD) method and assuming a fixed number of dimensions, three derived matrices U, V and Sigma are obtained, where: X = (U)(Sigma)(V^T)
In the resulting expression, the matrices U and V provide a spatial representation of the semantic relationship between terms and documents, so that the semantic similarity of terms can be calculated as the cosine distance between vectors in the matrix U, and the semantic similarity of documents as the cosine distance between vectors in the matrix V^T.
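To make the LSA decomposition concrete, the following is a minimal illustrative sketch in Python with numpy; it is not part of the patented method, and the toy matrix, the term labels and the choice of two latent dimensions are assumptions made only for the example:

```python
import numpy as np

# Toy term-document matrix X: rows are terms, columns are documents.
X = np.array([
    [2, 0, 1, 0],   # "fox"
    [1, 1, 0, 0],   # "dog"
    [0, 2, 0, 1],   # "auction"
    [0, 1, 0, 2],   # "bid"
], dtype=float)

# SVD factorization X = U * Sigma * V^T, truncated to 2 latent dimensions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
U_k, Vt_k = U[:, :k], Vt[:k, :]

def cosine(a, b):
    """Scalar product divided by the product of the magnitudes."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Term-term similarity from rows of U; document-document similarity
# from columns of V^T, as described above.
print(cosine(U_k[0], U_k[1]))          # "fox" vs "dog"
print(cosine(Vt_k[:, 0], Vt_k[:, 1]))  # document 0 vs document 1
```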
One of the main contributions of the LSA method is that it showed the feasibility of solving the vocabulary mismatch problem using a latent semantic structure; however, the spatial representation of the semantic structure is computationally complex and limits the application of the method to collections no larger than a few thousand documents.
In addition, the method has the disadvantage of a high update cost, since introducing a new document into the collection requires recalculating the entire matrix. Moreover, because of the spatial representation, the LSA method cannot handle polysemy; that is, the fact that a word appears near others does not make it possible to conclude that the word has different meanings.
Later, building on the foundations of the LSA method, Hofmann (US-6687696) developed a new latent structure extraction system that can also be used to improve information retrieval and to generate personalized recommendations. Hofmann's model was called Probabilistic Latent Semantic Indexing (PLSI); it was inspired by the LSA principle but recast it on statistical foundations. In PLSI, each document is modeled as a "bag of words", where each word is assumed to have been generated with a certain probability by a hidden topic, and consequently the document would have been generated by a certain number of topics drawn from a probability distribution. Under this assumption, PLSI is considered a "generative model", which can be expressed as follows: P(d,w) = [Sum over topics z] P(z) P(w|z) P(d|z). Thus, the problem of finding the semantic structure becomes the problem of defining a probability distribution for each latent class, P(z), and for each of the words in the class, P(w|z). With these inputs it is also feasible to calculate the mixture of topics or classes for a document, that is, P(z|d). To perform this task, the PLSI method proposes maximizing the likelihood function using an Expectation-Maximization (EM) algorithm.
The EM algorithm tries to maximize: L = [Sum over documents d] [Sum over words w] n(d,w) log P(d,w). Although PLSI brings some improvements in terms of perplexity with respect to LSA and has the important advantage of not requiring a complete reconstruction of the model to analyze unseen documents, building the model is computationally expensive and is not feasible for analyzing collections on the order of millions of documents, such as the Internet. Another limitation of PLSI is that the number of latent classes or topics is an arbitrary number, and this number needs to be small because it determines the computational complexity of the method.
Therefore, the proposed invention fulfills a comparable objective in that it allows the latent structure of document collections to be extracted and semantic similarity to be calculated, by means of an algorithm and a simplified representation of topics defined as "semantic contexts". The presented method makes use of information-theoretic metrics, search indexes and local optimization algorithms to extract an unknown number of topics, and it can be scaled to much larger document collections.
One of the challenges in online advertising is to provide the client with highly relevant advertisements. The more relevant an ad is to the person browsing the Internet, the more likely that person is to follow the ad's link and eventually complete a commercial transaction. Currently, the most modern systems operate under an auction scheme in which advertisers select keywords and place bids in the auction to get their advertising displayed. The system tries to maximize the relevance of the ads based on the content of the electronic document being read by the user at that moment, or on the queries placed by users in Internet search engines.
The process of creating ad campaigns is not trivial for the advertiser, who is asked to manually choose the keyword variants that will trigger the display of the ad. In this process, the following three problems can occur: 1. Keyword selection is difficult. For example, the advertiser often does not choose enough related keywords for the campaign. This leads to low exposure for the campaign.
2. Keyword selection is subject to ambiguity, because the advertiser can choose keywords that have multiple meanings, that is, polysemous words.
This situation can cause the ad to be presented in situations where it is not relevant.
3. The advertiser may mistakenly choose unrelated words. As in the previous case, this can lead to the presentation of irrelevant advertisements, which results in a lost-opportunity cost for both the advertiser and the system operator.
The system and method presented in this invention increase the relevance of the advertisements presented to the user by semantically relating the advertisements to the electronic documents being read by a user at a given moment.
The semantic relation method performed by the system uses the statistical properties of language, and is therefore able to detect the semantic similarity of a given pair of documents (one of which may be an advertisement) that do not necessarily share terms in common but in fact relate to the same concepts.
BRIEF DESCRIPTION OF THE FIGURES Figure 1. Flowchart illustrating the general method of ad impression,
Figure 2. Detailed flowchart illustrating the process,
Figure 3. Detailed flowchart illustrating the process of extracting topics from the collection. DETAILED DESCRIPTION OF THE INVENTION
Figure 1 is a flowchart illustrating the general method of ad impression. It assumes a prior topic-identification process (1), described in Figure 3, which produces a data structure identified as the "Topic Structure" (2). It also assumes that a collection of candidate advertisements has been stored in the database (3), and that the topics have been identified using the methods described later in this document. Once this has been done, the system can be used as follows. Suppose the user consults an electronic document, typically a web page (4). Next, the system associates (5) the topics of the candidate advertisements with those related to the document in question and generates a list of related advertisements (6), which correspond to the same topics as the document consulted by the user (4). Figure 2 is a detailed flowchart illustrating the process followed by the system presented in this invention. The first step is the pre-processing of the terms of the documents (8). Pre-processing is done sequentially, taking each document of the collection and applying the following transformations. When a document is pre-processed, the first phase consists of separating the document into sentences, according to punctuation and to hypertext separators such as line breaks, tables and title tags. Then the sentences are reduced to lists of words, eliminating words with purely grammatical functions, such as articles, pronouns, adverbs and the like, usually known as "stop-words".
For example, the English sentence "The quick brown fox jumps over the lazy dog" is reduced to the list: {quick, brown, fox, jumps, lazy, dog}. The set of all relevant terms included in the document is called the "vocabulary".
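A minimal sketch of this pre-processing phase in Python; the regular expressions and the stop-word list below are illustrative assumptions, not the actual separators or lists used by the system:

```python
import re

# Illustrative stop-word list; a real system would use a full per-language list.
STOP_WORDS = {"the", "a", "an", "over", "and", "or", "of", "in", "is"}

def preprocess(document: str) -> list[list[str]]:
    """Split a document into sentences, then reduce each sentence
    to a list of lower-cased terms with stop-words removed."""
    # Split on punctuation; hypertext separators (line breaks, table and
    # title tags) would be handled analogously after stripping markup.
    sentences = re.split(r"[.!?]\s*|\n", document)
    result = []
    for sentence in sentences:
        terms = [w.lower() for w in re.findall(r"[A-Za-z]+", sentence)]
        terms = [w for w in terms if w not in STOP_WORDS]
        if terms:
            result.append(terms)
    return result

print(preprocess("The quick brown fox jumps over the lazy dog."))
# [['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']]
```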
Subsequently, as part of the pre-processing phase (8), an inverted index is created. The inverted index is a mapping between each term and the identifiers of the documents that contain that term. Inverted indexes are a widely used technique in the field of information retrieval for efficiently locating the documents that contain a given term. Additionally, a table of terms is built. Each record in the table of terms contains additional information about each term, for example its unique numeric identifier (called the term-id), the frequency of the term (the number of documents in which the term appears) and the sentence frequency of the term (the number of sentences in which the term appears).
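The inverted index and the table of terms can be sketched as follows, assuming each document has already been reduced to lists of sentence terms as above; the field names are hypothetical:

```python
from collections import defaultdict

def build_index(docs: dict[int, list[list[str]]]):
    """docs maps a doc-id to its list of pre-processed sentences.
    Returns (inverted_index, term_table), where the term table stores,
    per term, its document frequency and its sentence frequency."""
    inverted = defaultdict(set)     # term -> set of doc-ids containing it
    doc_freq = defaultdict(int)     # term -> number of documents
    sent_freq = defaultdict(int)    # term -> number of sentences
    for doc_id, sentences in docs.items():
        seen_in_doc = set()
        for sentence in sentences:
            for term in set(sentence):
                sent_freq[term] += 1
                seen_in_doc.add(term)
        for term in seen_in_doc:
            inverted[term].add(doc_id)
            doc_freq[term] += 1
    term_table = {t: {"term_id": i,
                      "doc_freq": doc_freq[t],
                      "sent_freq": sent_freq[t]}
                  for i, t in enumerate(sorted(doc_freq))}
    return inverted, term_table
```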
Another necessary preparation phase is the generation of a term co-occurrence matrix. In this matrix, both the columns (j) and the rows (i) correspond to vocabulary terms, and cell (i, j) stores the number of times terms i and j occur in the same sentence. The appearance of two terms in the same sentence is called a co-occurrence. Only terms with a frequency above a certain level are taken into account to feed the matrix; in other words, only terms that appear in a minimum number of sentences are stored in the database.
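A sketch of the co-occurrence matrix construction under the same assumptions; the minimum sentence-frequency threshold is a hypothetical parameter standing in for the "certain level" mentioned in the text:

```python
from collections import defaultdict
from itertools import combinations

def cooccurrences(docs, min_sent_freq=2):
    """Count, for each pair of terms, the number of sentences in which
    they appear together. Terms below the sentence-frequency threshold
    are ignored, as in the filtering step described above."""
    sent_freq = defaultdict(int)
    for sentences in docs.values():
        for sentence in sentences:
            for term in set(sentence):
                sent_freq[term] += 1
    matrix = defaultdict(int)  # (term_i, term_j) -> count, with i < j
    for sentences in docs.values():
        for sentence in sentences:
            kept = sorted({t for t in sentence
                           if sent_freq[t] >= min_sent_freq})
            for i, j in combinations(kept, 2):
                matrix[(i, j)] += 1
    return matrix
```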
Once the construction of the matrix is finished, it is stored in the database (3) so that its information can be used by the processes mentioned in Figure 2 (7), and the pre-processing step (8) ends. The next step is the construction of the set of Topics (9), which takes place before the system is used by the end user. The construction of the set of topics is illustrated in Figure 3 and will be described later in this document. For the moment, consider each topic in a document to be represented by a "semantic context" defined by a set of k terms W = {w1, ..., wk}. The terms in a semantic context are the words that together "best" describe a given topic, where the exact meaning of "best" will be explained shortly. The set of k words W is also called a "core". As a result of the pre-processing described in (8), the terms in a core contain no general language elements such as articles, prepositions or adverbs. DW denotes the set of documents that contain all the terms in W. The documents in DW are considered semantically close to each other. The main characteristic that distinguishes a "core" from an arbitrary set of k vocabulary terms is that the metric called "force" is maximal when applied to it. The "force" formula is thus the criterion for determining what a core is. The force is defined, in turn, by the following formula:
force(W) = c · J(W) / D(W)
In this formula, c is a scale constant and J(W) is the joint frequency of the words, which is the number of documents in which all the words of the set W co-occur. The term D(W) represents the quantity defined as the "disjoint frequency", which is the sum of the sizes of each of the disjoint sets of documents in which the i-th term occurs without co-occurring with any of the remaining words of the set W.
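Reading J(W) and D(W) against the inverted index, the force of a candidate core can be computed with set operations; this sketch assumes the ratio form force(W) = c · J(W) / D(W) reconstructed above, with c an arbitrary scale constant:

```python
def force(core, inverted, c=1.0):
    """Force of a candidate core W, assuming force(W) = c * J(W) / D(W).
    inverted maps each term to the set of doc-ids containing it."""
    doc_sets = [inverted[t] for t in core]
    # J(W): documents where all the terms of W co-occur.
    joint = set.intersection(*doc_sets)
    # D(W): for each term, documents where it occurs without
    # co-occurring with ANY of the remaining terms of W.
    disjoint = 0
    for i, t in enumerate(core):
        others = set.union(*(doc_sets[:i] + doc_sets[i + 1:]))
        disjoint += len(inverted[t] - others)
    return c * len(joint) / disjoint if disjoint else float("inf")
```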
The process for obtaining the cores, that is, the sets of k terms with maximal force, is explained in Figure 3 and is presented at the end of the explanation of Figure 2. Assuming that the calculation of the cores has been completed and that the core information has been stored in the database (3), the process continues in Figure 2 with the calculation of the topic weight vector (10). For each discovered core, a weighted vector will be calculated in order to determine its similarity to any document, as explained next.
In this phase, a weighted vector (t1, w1), (t2, w2), ..., (tn, wn) of the terms of each topic is calculated, where for each term ti its weight wi represents the importance of ti in the topic under consideration. To calculate the weight vector for each topic, the documents matching the query represented by the corresponding "core" are retrieved, that is, the set DW of documents containing all the core words. To carry out this calculation, each document is represented as a vector of terms with the frequency of each term in the document, that is, [(t1, f1j), (t2, f2j), ..., (tn, fnj)] for a document j. Then all the frequencies for the documents in DW are added, yielding a vector [(t1, f1,1 + f1,2 + ...), ..., (tn, fn,1 + fn,2 + ...)]. To this vector, the standard TF-IDF formula is applied to calculate the weight of each term with respect to the core. The TF-IDF formula is: wij = tf(i,j) · log(N / ni), where wij is the weight of term i in document j, tf(i,j) is the number of occurrences of term i in document j, N is the total number of documents in the corpus, ni is the number of documents in which term i occurs, and log is a logarithmic function. Once this step is completed, a normalization is performed by dividing each weight by the sum of the weights, resulting in a unit vector.
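A sketch of this topic weight vector computation (10), assuming the inverted index and per-document frequency dictionaries from the earlier steps; the names are illustrative:

```python
import math

def topic_vector(core, inverted, doc_term_freqs, n_total_docs):
    """Weighted term vector for a topic, per the steps above: sum the
    term frequencies over DW, apply TF-IDF, then normalize by the sum
    of weights. doc_term_freqs maps doc-id -> {term: frequency}."""
    # DW: documents matching the query represented by the core.
    dw = set.intersection(*(inverted[t] for t in core))
    summed = {}
    for doc_id in dw:
        for term, f in doc_term_freqs[doc_id].items():
            summed[term] = summed.get(term, 0) + f
    # w_i = tf_i * log(N / n_i), with n_i the document frequency of term i.
    weights = {t: f * math.log(n_total_docs / len(inverted[t]))
               for t, f in summed.items()}
    total = sum(weights.values()) or 1.0
    return {t: w / total for t, w in weights.items()}
```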
In the next phase, the system calculates the similarity of the ads to the topics (11). To do so, a weighted term vector is calculated for each of the ads, using a process similar to the one that builds the vectors for each topic (10), described above. Then, the similarity between the ad vector and the topic vector is calculated for each of the topic vectors. This similarity is obtained with the standard "cosine distance", which is simply the scalar product of the vectors divided by the product of their magnitudes. This number provides a measure of the similarity of each ad to each topic. Then a database (3) is formed with the similarities between each ad and each of the topics. For an ad "d", a "topic similarity vector" Td will be a vector of the form (T1, w1), (T2, w2), ..., (Tn, wn), where the Ti are the topics and the wi are the weights, which are reciprocals of the cosine distance between the ad d and the topic Ti. This concludes the calculation of the similarity between ads and topics (11).
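The cosine measure used throughout is, as stated, the scalar product divided by the product of the magnitudes; over sparse term-weight dictionaries it can be sketched as:

```python
import math

def cosine_similarity(u: dict, v: dict) -> float:
    """Scalar product of two sparse term-weight vectors divided by
    the product of their magnitudes."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Per the text, an ad's topic similarity vector Td stores one entry per
# topic, with a weight derived from this cosine measure (its reciprocal
# in the patent's formulation).
```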
After the previously described phases have been completed, the system can receive web documents through the network (12). The user's request may contain the address of a remote document residing in the network, or the full text of the document may be locally available; therefore, to determine which case applies, a test is performed to verify whether the document is available (13) in the database (documents that were at some time in the database but have expired are not considered locally available). If the document is indeed in the database, the method retrieves its topic vector (16) from the per-document topic base. If not, the new document is stored in the index and in the database (14), and the method is used to calculate its similarity to the topics (15), that is: construct a weighted term vector for the document, calculate the similarity of the document vector with each of the topic vectors, and store the results in the topic-document base.
In either case, after the calculation of the similarity of the document to the topics (15) or the retrieval of the document's topic vector (16), the method proceeds to rank the ads (17) for the document consulted by the user, which will be referred to as "d". To this end, the method first selects the candidate ads using a pre-selection criterion. For each of these candidate ads, its topic vector is retrieved from the database. Finally, the cosine distance is calculated between each ad topic vector and the topic vector of document "d", and the results (distances) are sorted in ascending order, so that the smallest distances appear first. The procedure ends when the ranked list of ads (18) is generated.
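A sketch of this ranking step (17)-(18), using the cosine_similarity helper above; taking the distance as 1 − similarity, so that ascending order places the most similar ads first, is an assumption about the exact convention, and the pre-selection of candidates is assumed to have already happened:

```python
def rank_ads(doc_topics: dict, candidate_ads: dict) -> list:
    """Order candidate ads by cosine distance between each ad's topic
    vector and the topic vector of the consulted document d.
    candidate_ads maps ad-id -> topic vector (after pre-selection).
    Smallest distances (most similar ads) come first."""
    scored = []
    for ad_id, ad_topics in candidate_ads.items():
        distance = 1.0 - cosine_similarity(ad_topics, doc_topics)
        scored.append((distance, ad_id))
    scored.sort()  # ascending: smallest distance first
    return [ad_id for _, ad_id in scored]
```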
Figure 3 is a flowchart illustrating the process of extracting topics from the collection. It begins with a given set of pre-processed documents (19), which may be part of an organization's repository or a sample of a very large collection such as the Internet; the pre-processing was described in previous sections (8) and includes, for example, the elimination of non-essential terms, separation into sentences, construction of term frequency vectors and construction of the term co-occurrence matrix. The result of this process is a set of "cores" (that is, sets of k terms, where k is a small integer, typically 3 or 4) of maximal force, using the measure defined in the formula described above.
Next, in the seed calculation (20), for each document in the collection an initial group of k terms called a "seed" is obtained by taking the k terms with the highest TF-IDF for that document. Then the central part of the method is carried out, which is the core refinement process (21). The initial cores are the seeds calculated in the previous phase. In this phase, each core is systematically modified, changing one of its terms to test whether the force of the resulting variant increases; if it does, the variant takes the place of the core it came from and the original core is discarded; if it does not, a new variant is tested. The difficulty of this step lies in avoiding testing too many variants, since in principle, if there are n terms in the vocabulary (typically several thousand), then there are n! / (k! (n-k)!) possible variants, an intractable number even for a small value of k. At this point, the co-occurrence matrix serves to avoid testing every possible combination of terms; the described procedure considers only the terms with a significant level of co-occurrence with the k-1 remaining terms of the core, that is, only terms with co-occurrences above a predetermined level are candidates to replace a core term. Once all viable candidate terms have been tested for each of the core terms without managing to increase the force, the core is guaranteed to have maximal force. When two or more cores being refined turn out to be identical, they are merged into one. Thus, the procedure produces as its final result a collection of unique cores of maximal force (22).
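The seed-and-refinement local search (20)-(21) can be sketched as a hill-climbing loop; it assumes the force and cooccurrences helpers from the earlier sketches, and the co-occurrence threshold is a hypothetical parameter:

```python
def refine_core(seed, vocabulary, inverted, cooc, min_cooc=3):
    """Hill-climbing refinement of one core: repeatedly try replacing a
    single term with a candidate that co-occurs sufficiently with the
    k-1 remaining terms, keeping a variant whenever its force is higher."""
    core = list(seed)
    improved = True
    while improved:
        improved = False
        for i in range(len(core)):
            rest = [t for j, t in enumerate(core) if j != i]
            # Only terms co-occurring enough with all remaining terms are tried.
            candidates = [t for t in vocabulary
                          if t not in core
                          and all(cooc.get(tuple(sorted((t, r))), 0) >= min_cooc
                                  for r in rest)]
            for cand in candidates:
                variant = rest + [cand]
                if force(variant, inverted) > force(core, inverted):
                    core, improved = variant, True
                    break
            if improved:
                break
    # Identical refined cores are later merged, yielding unique maximal cores.
    return frozenset(core)
```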

Claims

Having presented the invention, which is novel, and having described it sufficiently, we claim as our exclusive property:
1. A method for retrieving a relevant subset of advertisements, having an information retrieval system that retrieves the set of textual advertisements given the content of a document, characterized by comprising the following stages:
(a) Identifying the topics existing in a collection of web documents;
(b) Associating textual advertisements with the extracted topics by applying a semantic similarity metric;
(c) Associating the document with said topics by applying a semantic similarity metric;
(d) Semantically ordering the advertisements retrieved for a given document.
2. The method for retrieving a relevant subset of advertisements, having an information retrieval system that retrieves the set of textual advertisements given the content of a document, according to claim 1, wherein stage (a), which consists of identifying the topics existing in a collection of web documents, comprises the following sub-stages:
(a) Compiling a collection of documents;
(b) Building an index of terms per document;
(c) Building a term-by-term matrix;
(d) Extracting the topics from each of the documents;
(e) Building a weighted vector, Tv, for each of the topics in the database.
3. The method for retrieving a relevant subset of advertisements, having an information retrieval system that retrieves the set of textual advertisements given the content of a document, according to claim 2, wherein sub-stage (b), which consists of building an index of terms per document, comprises the following sub-stages:
(a) Identifying the sentences existing in each of the documents in the collection;
(b) Removing non-significant words (stop-words) from the terms of each sentence;
(c) Accumulating the count of sentences in which each term occurs;
(d) Accumulating the count of documents in which each term occurs;
(e) Maintaining the list of documents in which each term occurs.
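A minimal sketch of the indexing sub-stages of claim 3, assuming a naive whitespace tokenizer, period-based sentence splitting, and a toy stop-word list (all illustrative assumptions):

```python
from collections import defaultdict

STOP_WORDS = {"the", "of", "and"}  # illustrative stop-word list

def build_index(documents):
    """documents: list of strings. Returns per-term sentence counts,
    document counts, and document lists, as in sub-stages (c)-(e)."""
    sentence_count = defaultdict(int)   # sentences in which a term occurs
    document_count = defaultdict(int)   # documents in which a term occurs
    postings = defaultdict(list)        # documents in which a term occurs

    for doc_id, text in enumerate(documents):
        seen_in_doc = set()
        for sentence in text.split("."):             # naive sentence split
            terms = {w.lower() for w in sentence.split()} - STOP_WORDS
            for term in terms:
                sentence_count[term] += 1
            seen_in_doc |= terms
        for term in seen_in_doc:
            document_count[term] += 1
            postings[term].append(doc_id)
    return sentence_count, document_count, postings
```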
4. The method for retrieving a relevant subset of advertisements, having an information retrieval system that retrieves the set of textual advertisements given the content of a document, according to claim 2, wherein sub-stage (c), which consists of building a term-by-term matrix, comprises the following sub-stages:
(a) Generating term-to-term mappings for each combination of words in each sentence;
(b) Accumulating the sum of term-to-term co-occurrences in the corresponding cell of the matrix;
(c) Accumulating the sum of co-occurrences per document in the corresponding cell of the term-to-term matrix.
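The co-occurrence accumulation of claim 4 might look as follows; the sparse Counter representation of the term-by-term matrix is an assumption made for brevity:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_matrix(sentences):
    """sentences: list of term lists. Accumulates term-to-term
    co-occurrence counts per sentence, as in sub-stages (a)-(b).
    The per-document accumulation of sub-stage (c) is analogous,
    iterating over documents instead of sentences."""
    matrix = Counter()
    for terms in sentences:
        for a, b in combinations(sorted(set(terms)), 2):
            matrix[(a, b)] += 1   # one cell of the sparse term-by-term matrix
    return matrix
```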
5. The method for retrieving a relevant subset of advertisements, having an information retrieval system that retrieves the set of textual advertisements given the content of a document, according to claim 2, wherein sub-stage (d), which consists of extracting the topics from each of the documents, comprises the following sub-stages:
(a) Computing a term-frequency vector over the terms of the document;
(b) Computing a new normalized, weighted vector for each of the terms in the term-frequency vector;
(c) Generating a seed set of terms;
(d) Iteratively replacing each of the terms of the seed set with the term that yields the highest strength evaluation;
(e) Storing the 3-term combination with the highest strength evaluation in the topic database.
6. The method for retrieving a relevant subset of advertisements, having an information retrieval system that retrieves the set of textual advertisements given the content of a document, according to claim 5, wherein sub-stage (d), which consists of iteratively replacing each of the terms of the seed set with the term that yields the highest strength evaluation, comprises using the term-to-term matrix to select the k terms ordered by the sum of their co-occurrences per sentence in descending order, k being an arbitrary integer constant.
7. The method for retrieving a relevant subset of advertisements, having an information retrieval system that retrieves the set of textual advertisements given the content of a document, according to claim 5, wherein sub-stage (d), which consists of iteratively replacing each of the terms of the seed set with the term that yields the highest strength evaluation, comprises computing the strength metric for each candidate replacement, which consists of the following sub-stages:
(a) Counting the number of documents in which the 3 words appear simultaneously, identifying that quantity as J;
(b) Counting the number of documents in which the first word occurs but the second and third do not, identifying that quantity as d1;
(c) Counting the number of documents in which the second word occurs but the first and third do not, identifying that quantity as d2;
(d) Counting the number of documents in which the third word occurs but the first and second do not, identifying that quantity as d3;
(e) Computing the strength of the set, identified as F, by dividing J by the sum d1 + d2 + d3.
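Sub-stages (a)-(e) of claim 7 translate directly into set operations over the per-term document lists of the index; a sketch, in which the set-valued `postings` representation and the zero-denominator guard are assumptions:

```python
def strength(postings, w1, w2, w3):
    """Strength F = J / (d1 + d2 + d3) for a 3-term core; `postings`
    maps each term to the set of documents containing it."""
    s1, s2, s3 = postings[w1], postings[w2], postings[w3]
    j  = len(s1 & s2 & s3)     # documents where all three words appear
    d1 = len(s1 - s2 - s3)     # first word only (w.r.t. the other two)
    d2 = len(s2 - s1 - s3)
    d3 = len(s3 - s1 - s2)
    denom = d1 + d2 + d3
    return j / denom if denom else float("inf")  # guard is an assumption
```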
8. The method for retrieving a relevant subset of advertisements, having an information retrieval system that retrieves the set of textual advertisements given the content of a document, according to claim 5, wherein sub-stage (d), which consists of iteratively replacing each of the terms of the seed set with the term that yields the highest strength evaluation, comprises using the terms-per-document index.
9. The method for retrieving a relevant subset of advertisements, having an information retrieval system that retrieves the set of textual advertisements given the content of a document, according to claim 5, wherein sub-stage (b), which consists of computing a new normalized, weighted vector for each of the terms in the term-frequency vector, comprises the following sub-stages:
(a) Retrieving the total number of documents existing in the terms-per-document index, N;
(b) Retrieving the total number of documents in which said term occurs, F;
(c) Assigning the result of the formula w*log(N/F), where w represents the current weight in the vector, as the new weight in the vector.
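The re-weighting of claim 9 is the classical TF-IDF scheme; a one-function sketch, assuming the vector is held as a dict from term to weight:

```python
import math

def normalize(tf_vector, document_count, n_docs):
    """Re-weight each term by w * log(N / F), per sub-stage (c);
    `document_count[t]` is F for term t and `n_docs` is N."""
    return {t: w * math.log(n_docs / document_count[t])
            for t, w in tf_vector.items()}
```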
10. The method for retrieving a relevant subset of advertisements, having an information retrieval system that retrieves the set of textual advertisements given the content of a document, according to claim 5, wherein sub-stage (c), which consists of generating a seed set of 3 terms, comprises the following sub-stages:
(a) Sorting the terms by the aforementioned weight in descending order;
(b) Removing those whose total number of occurrences in the index is greater than 5;
(c) Selecting the 3 highest-weighted terms as the seed set.
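A sketch of the seed selection of claim 10, assuming dict-based inputs; the occurrence threshold of 5 and k = 3 follow the claim text, while the parameter names are illustrative:

```python
def seed(weighted_vector, total_occurrences, max_occurrences=5, k=3):
    """Pick the k highest-weighted terms whose total number of
    occurrences in the index does not exceed the threshold."""
    rare = [(t, w) for t, w in weighted_vector.items()
            if total_occurrences[t] <= max_occurrences]
    rare.sort(key=lambda tw: tw[1], reverse=True)   # sub-stage (a)
    return frozenset(t for t, _ in rare[:k])        # sub-stage (c)
```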
11. The method for retrieving a relevant subset of advertisements, having an information retrieval system that retrieves the set of textual advertisements given the content of a document, according to claim 1, wherein stage (b), which consists of associating textual advertisements with the extracted topics by applying a semantic similarity metric, comprises the following sub-stages:
(a) Building a weighted term vector, Av, for each advertisement to be analyzed, including the title, text, links, and keywords provided by the user;
(b) Computing the cosine distance between said advertisement vector Av and each of the topic vectors, Tv;
(c) Storing the resulting topic-similarity vector in the advertisement-topic database.
12. The method for retrieving a relevant subset of advertisements, having an information retrieval system that retrieves the set of textual advertisements given the content of a document, according to claim 1, wherein stage (c), which consists of associating the document with said topics by applying a semantic similarity metric, comprises the following sub-stages:
(a) Building a weighted term vector, Dv, for the document to be analyzed;
(b) Computing the cosine distance between said document vector Dv and each of the topic vectors, Tv;
(c) Storing the topic-similarity column vector in the document-topic database.
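Claims 11 and 12 apply the same cosine measure, once to the advertisement vector Av and once to the document vector Dv; a minimal sketch, assuming sparse dict-based term vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity (the claims' 'cosine distance') between two
    sparse weighted term vectors, represented here as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def topic_similarities(vector, topic_vectors):
    # One similarity entry per topic vector Tv, for either Av or Dv.
    return [cosine(vector, tv) for tv in topic_vectors]
```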
13. The method for retrieving a relevant subset of advertisements, having an information retrieval system that retrieves the set of textual advertisements given the content of a document, according to claim 2, wherein sub-stage (e), which consists of building a weighted vector, Tv, for each of the topics in the database, comprises the following sub-stages:
(a) Finding all the documents in which the 3 words of the topic co-occur, D;
(b) Building a term-frequency vector for each of the retrieved documents;
(c) Computing the vector sum of the aforementioned frequency vectors to obtain a new frequency vector, Tfv, in which the weight of each term is the sum of that term's frequencies over the set D;
(d) Computing a new set of weights, W, by applying a normalization function to each of the weights of the vector Tfv.
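The vector Tfv of claim 13 is the element-wise sum of the term-frequency vectors of the documents in D, to which the normalization of claim 14 is then applied; a sketch assuming Counter-based frequency vectors:

```python
from collections import Counter

def topic_vector(doc_tf_vectors):
    """Tfv: element-wise sum of the term-frequency vectors of the
    documents in which the topic's 3 words co-occur (the set D)."""
    tfv = Counter()
    for v in doc_tf_vectors:
        tfv.update(v)   # adds per-term frequencies
    return tfv
```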
14. The method for retrieving a relevant subset of advertisements, having an information retrieval system that retrieves the set of textual advertisements given the content of a document, according to claim 13, wherein sub-stage (d), which consists of computing a new set of weights, W, by applying a normalization function to each of the weights of the vector Tfv, comprises the following sub-stages:
(a) Retrieving the total number of documents existing in the terms-per-document index, N;
(b) Retrieving the total number of documents in which the given term occurs, F;
(c) Assigning the result of the formula w*log(N/F), where w represents the current weight of the term, as the term's new weight in the vector.
15. The method for retrieving a relevant subset of advertisements, having an information retrieval system that retrieves the set of textual advertisements given the content of a document, according to claim 1, wherein stage (d), which consists of semantically ordering the advertisements retrieved for a given document, comprises the following sub-stages:
(a) Generating a list of candidate advertisements by selecting those that belong to the same topics as the document;
(b) Retrieving the normalized column vector for each of the candidate advertisements from the advertisement-topic database;
(c) Retrieving the topic vectors associated with the document under analysis, V;
(d) Building the advertisement-topic similarity matrix, A, by transposing all the advertisement-topic similarity vectors, that is, [f(a1), f(a2) ... f(an)]^T;
(e) Retrieving the document-topic similarity column vector, T, for the document under consideration from the document-topic database;
(f) Computing the column vector R by multiplying the advertisement-topic matrix A by the document-topic column vector T, that is, R = A x T;
(g) Obtaining the semantic ordering of the advertisements by sorting the elements of the column vector R.
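The final ranking of claim 15 reduces to the matrix-vector product R = A x T followed by a sort; a sketch assuming dense list-based rows for A:

```python
def rank_ads(ad_topic_vectors, doc_topic_vector):
    """Each row of A is one candidate ad's topic-similarity vector;
    R = A x T scores every ad against the document's topic vector T,
    and the ads are returned ordered by that score (sub-stages (d)-(g))."""
    scores = [sum(a * t for a, t in zip(row, doc_topic_vector))
              for row in ad_topic_vectors]
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return order, scores
```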


