WO2010021530A1 - System and method for displaying relevant textual advertising based on semantic similarity - Google Patents

System and method for displaying relevant textual advertising based on semantic similarity

Info

Publication number
WO2010021530A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
terms
vector
term
documents
Prior art date
Application number
PCT/MX2008/000109
Other languages
Spanish (es)
French (fr)
Inventor
Ramón Felipe BRENA PINERO
Eduardo Héctor RAMIREZ RANGEL
Original Assignee
Instituto Tecnologico Y De Estudios Superiores De Monterrey
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Instituto Tecnologico Y De Estudios Superiores De Monterrey filed Critical Instituto Tecnologico Y De Estudios Superiores De Monterrey
Priority to PCT/MX2008/000109 priority Critical patent/WO2010021530A1/en
Publication of WO2010021530A1 publication Critical patent/WO2010021530A1/en
Priority to MX2010011323A priority patent/MX2010011323A/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/02Marketing; Price estimation or determination; Fundraising

Definitions

  • the aim of the invention is to provide Internet users with advertisements semantically related to the documents being consulted.
  • a method is presented that selects the most relevant ads from a collection of possible ads.
  • LSA Latent Semantic Analysis
  • the matrices U and V provide a spatial representation of the semantic relationship between terms and documents, so that the semantic similarity of terms can be calculated as the cosine distance between vectors in the matrix U, and the semantic similarity of documents as the cosine distance between vectors in the matrix V^T.
  • the method has the disadvantage of a high update cost, since introducing a new document into the collection requires recalculating the entire matrix.
  • the LSA method cannot handle polysemy; that is, the fact that a word appears near others does not make it possible to conclude that the word has different meanings.
  • P(d,w) = [Sum over topics z] P(z) P(w|z) P(d|z)
  • the proposed invention fulfills a comparable objective in that it allows the latent structure of document collections to be extracted and semantic similarity to be calculated, by means of an algorithm and a simplified representation of topics defined as "semantic contexts".
  • the presented method makes use of information theory metrics, search indexes and local optimization algorithms to extract an unknown number of topics and can be scaled to much larger document collections.
  • Keyword selection is difficult. For example, the advertiser often does not choose enough related keywords for their campaign. This leads to a low exposure of the campaign.
  • Keyword selection is subject to ambiguity, because the advertiser can choose keywords that have multiple meanings, that is, polysemic words.
  • This situation may cause the ad to be presented in situations where it is not relevant.
  • the advertiser may mistakenly choose unrelated words. As in the previous case, this may lead to the presentation of irrelevant advertisements, which results in a lost-opportunity cost for both the advertiser and the system operator.
  • the system and method presented in this invention increases the relevance of the advertisements presented to the user, by semantically relating the advertisements to the electronic documents that are being read by a user at a given time.
  • the semantic relationship method performed by the system uses the statistical properties of language, and is therefore able to detect the semantic similarity of a given pair of documents (one of which may be an advertisement) that do not necessarily share terms in common but in fact relate to the same concepts.
  • FIGURES Figure 1: Flowchart illustrating the general method of ad impression
  • Figure 1 is a flowchart illustrating the general method of ad impression. It assumes a prior topic-identification process (described in Figure 3) (1), which produces a data structure identified as the "Topic Structure" (2). It also assumes that a collection of candidate advertisements has been stored in the database (3), and that the topics have been identified using the methods described later in this document.
  • the system can be executed by a user as follows: suppose the user consults an electronic document, typically a web page (4). Next, the system associates (5) the topics of the candidate advertisements with those related to the document in question and generates a list of related advertisements (6), which correspond to the same topics as the document consulted by the user (4).
  • FIG. 2 is a detailed flow chart illustrating the process followed by the system presented in this invention.
  • the first step to perform is the pre-processing of the terms of the documents (8).
  • the pre-processing is done sequentially, taking each document from the collection and applying the following transformations.
  • the first phase consists of separating the document into sentences, according to the punctuation and hypertext separators such as line breaks, tables and title tags.
  • sentences are reduced to word lists, eliminating words with purely grammatical functions, such as articles, pronouns, adverbs and the like, usually known as "stop-words".
  • For example, the English sentence "The quick brown fox jumps over the lazy dog" is reduced to the list: {quick, brown, fox, jumps, lazy, dog}. The set of all relevant terms included in the document is called the "vocabulary".
  • an inverted index is created.
  • the inverted index is a mapping between each term and the identifiers of the documents that contain that term.
  • Inverted indexes are a widely used technique in the field of information retrieval for efficiently locating the documents that contain a specific term.
  • a table of terms is constructed. Each record in the table of terms contains additional information about each term, for example its unique numerical identifier (called the term-id), the frequency of the term (the number of documents in which the term appears) and the sentence frequency of the term (the number of sentences in which the term appears).
  • Another necessary preparation phase is the generation of a matrix of co-occurrences of terms.
  • both the columns (j) and the rows (i) correspond to vocabulary terms, and cell (i, j) stores the number of times terms i and j occur in the same sentence.
  • the appearance of two terms in the same sentence is called a co-occurrence. Only terms with a frequency above a certain level are taken into account to feed the matrix; in other words, only terms that appear in a minimum number of sentences are stored in the database.
  • the next step is the construction of the set of Topics (9), which is prior to the use of the system by the end user.
  • the construction of the set of topics is illustrated in the Figure 3 and will be described later in this document.
  • W = {w1, ..., wk}.
  • the terms in a semantic context are the words that together "best" describe a given topic, where the exact meaning of "best" will be explained shortly.
  • the set of k words W is also called "core".
  • the terms in a core do not contain general elements of language, such as articles, prepositions or adverbs as a result of the preprocessing described in (8).
  • DW represents the set of documents that contain all the terms in W. Documents in DW are considered semantically close to each other.
  • the main characteristic that distinguishes a "core" from an arbitrary set of k vocabulary terms is that the metric called force is maximal when applied to it.
  • the "force" formula is the criterion for determining what a core is. The force is defined in turn, using the following formula:
  • c is a scale constant
  • J(W) is the joint frequency of the words, which is the number of documents in which all the words in the set W co-occur.
  • D(W) represents the quantity defined as the "disjoint frequency", which is the sum of the sizes of each of the disjoint sets of documents in which the i-th term occurs without co-occurring with any of the remaining words in the set W.
  • a weighted vector (t1, w1), (t2, w2), ..., (tn, wn) of the terms of each topic is calculated, where for each term ti, its weight wi represents the importance of the term ti in the topic considered.
  • the documents that match the query represented by the corresponding "core" are retrieved, i.e., the set DW of documents containing all the core words.
  • each document is represented as a vector of terms with the frequency of each term in the document, that is, [(t1, f1j), (t2, f2j), ..., (tn, fnj)] for a document j.
  • all the frequencies for the documents in DW are added, yielding a vector [(t1, f1,1 + f1,2 + ...), ..., (tn, fn,1 + fn,2 + ...)].
  • the standard TF-IDF formula is applied to calculate the weight of each term with respect to the core.
  • the TF-IDF formula is: wij = tf(i,j) · log(N / ni)
  • the similarity between the ad vector and the topic vector is calculated for each of the topic vectors.
  • This similarity is obtained with the standard "cosine distance", which is nothing but the scalar product of the vectors divided by the product of their magnitudes. This number provides a measure of the similarity of each ad with each topic.
  • a database (3) is formed with the similarities between each ad and each of the topics.
  • a "topic similarity vector" Td will be a vector of the form (T1, w1), (T2, w2), ..., (Tn, wn), where the Ti are the topics and the wi the weights, which are reciprocals of the cosine distance between the ad d and the topic Ti. This concludes the calculation of the similarity between advertisements and topics (11). After the previously described phases have been completed, the system can receive web documents through the network (12).
  • the user's request may contain the address of a remote document residing in the network, or the full text of the document may be locally available; therefore, to determine which case applies, a test is performed to verify whether the document is available (13) in the database (documents that were at some time in the database but have expired are not considered locally available). If the document is indeed in the database, the method retrieves its topic vector (16) from the per-document topic base. If not, the new document is stored in the index and in the database
  • the method is used to calculate its similarity to the topics (15), that is: construct a weighted term vector for the document, calculate the similarity of the document vector with each of the topic vectors and store the results in the topic-document base.
  • the method proceeds to rank the advertisements (17) for the document consulted by the user, which will be referred to as "d".
  • the method first selects the candidate advertisements using a pre-selection criterion. For each of these candidate advertisements, its topic vector is retrieved from the database. Finally, the cosine distance is calculated between each ad topic vector and the topic vector of the document "d", and the results (distances) are sorted in ascending order, so that the smallest distances appear first. The procedure ends when the ranked list of ads (18) is generated.
  • Figure 3 is a flowchart illustrating the process of extracting topics from the collection. It begins with a given set of pre-processed documents (19), which may be part of an organization's repository or a sample of a very large collection such as the Internet; the pre-processing was described in previous sections (8) and includes, for example, the elimination of non-essential terms, separation into sentences, construction of term frequency vectors and construction of the term co-occurrence matrix.
  • the result of this process is a set of "cores" (that is, sets of k terms, where k is a small integer, typically 3 or 4) of maximum force, using the measure defined in the formula described above.
  • an initial group of k terms called a "seed" is obtained for each document by taking the k terms with the highest TF-IDF for that document.
  • the initial cores are the seeds calculated in the previous phase.
  • each of the cores is systematically modified, changing one of its terms to test whether the strength of the resulting variant increases; if it does, the variant takes the place of the core it came from and the original core is discarded; if it does not, a new variant is tested.
  • the difficulty of this step lies in avoiding testing too many variants, since in principle, if there are n terms in the vocabulary (typically several thousand), then there are n! / (k! (n-k)!) possible variants, which is an intractable number even for a small value of k.
  • the co-occurrence matrix serves to avoid testing every possible combination of terms; the procedure described considers only the terms with a significant level of co-occurrence with the k-1 terms remaining in the core, that is, only terms with co-occurrences above a predetermined level are candidates to replace a core term. Once all viable candidate terms have been tested for each of the core terms without increasing the strength, the core is guaranteed to have maximal strength. When two or more cores being refined are identical, these cores are merged into one. Thus, the procedure produces as a final result a collection of unique cores with maximal strength (22).

Abstract

The invention presents a method for finding advertisements that are semantically related to the documents being consulted. To that end, the invention presents a method which selects the most relevant semantically related advertisements from a collection of possible advertisements, given a document consulted by an Internet user. For that purpose, the invention presents a method which calculates structures called "semantic contexts" that represent topics or contexts. The invention likewise presents a method which uses the semantic contexts to measure the conceptual proximity between a document and an advertisement. Since this method does not depend on an exact word match like many other methods in the prior art, it is less vulnerable to synonymy, polysemy and word omissions.

Description

SYSTEM AND METHOD FOR DISPLAYING RELEVANT TEXTUAL ADVERTISING BASED ON SEMANTIC SIMILARITY
FIELD OF THE INVENTION
The aim of the invention is to provide Internet users with advertisements semantically related to the documents they are consulting. For this purpose, a method is presented that selects the most relevant ads from a collection of possible ads.
BACKGROUND OF THE INVENTION
Over the last 10 years, the advertising of products and services has followed a trend of migration from traditional media such as radio, television and the press toward computer networks such as GSM and the Internet. This trend is expected to keep growing at an accelerated rate in the coming years, because electronic media offer greater possibilities for targeting advertising and marketing campaigns and for measuring their effectiveness.
Techniques in the state of the art define relevance as a function of the lexical similarity between a pair of documents. Such a definition is effective in applications where one document is actually an explicit query provided by the user, but it is not effective when it comes to automatically relating two documents, such as a web page and a short textual advertisement of fewer than 20 words. A problem that arises in both cases is that the words used in the two documents must be highly similar, or the same. This situation is a limitation of information retrieval techniques and has been characterized by Furnas et al. (Furnas 1987) as the "vocabulary mismatch problem". In 1989, Scott Deerwester et al. (US-4839853) presented a method to solve the vocabulary mismatch problem in information retrieval, based on the calculation of a latent semantic structure. The method, usually known as Latent Semantic Analysis (LSA), is a corpus-based method that begins by creating a term-document matrix X; then, taking each of the rows of the matrix, a new matrix (X^T)(X) is produced. The new matrix contains information about how each term relates to the others, in terms of their total frequency in the documents. Finally, by factoring the matrix (X^T)(X) with the Singular Value Decomposition (SVD) method and assuming a fixed number of dimensions, three derived matrices U, V and Sigma are obtained, where: X = (U)(Sigma)(V^T)
In the resulting expression, the matrices U and V provide a spatial representation of the semantic relationship between terms and documents, so that the semantic similarity of terms can be calculated as the cosine distance between vectors in the matrix U, and the semantic similarity of documents as the cosine distance between vectors in the matrix V^T.
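To make the LSA decomposition concrete, the following is a minimal illustrative sketch in Python with numpy; it is not part of the patented method, and the toy matrix, the term labels and the choice of two latent dimensions are assumptions made only for the example:

```python
import numpy as np

# Toy term-document matrix X: rows are terms, columns are documents.
X = np.array([
    [2, 0, 1, 0],   # "fox"
    [1, 1, 0, 0],   # "dog"
    [0, 2, 0, 1],   # "auction"
    [0, 1, 0, 2],   # "bid"
], dtype=float)

# SVD factorization X = U * Sigma * V^T, truncated to 2 latent dimensions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
U_k, Vt_k = U[:, :k], Vt[:k, :]

def cosine(a, b):
    """Scalar product divided by the product of the magnitudes."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Term-term similarity from rows of U; document-document similarity
# from columns of V^T, as described above.
print(cosine(U_k[0], U_k[1]))          # "fox" vs "dog"
print(cosine(Vt_k[:, 0], Vt_k[:, 1]))  # document 0 vs document 1
```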
One of the main contributions of the LSA method is that it showed the feasibility of solving the vocabulary mismatch problem using a latent semantic structure; however, the spatial representation of the semantic structure is computationally complex and limits the application of the method to collections no larger than a few thousand documents.
In addition, the method has the disadvantage of a high update cost, since introducing a new document into the collection requires recalculating the entire matrix. Moreover, because of the spatial representation, the LSA method cannot handle polysemy; that is, the fact that a word appears near others does not make it possible to conclude that the word has different meanings.
Later, building on the foundations of the LSA method, Hofmann (US-6687696) developed a new latent structure extraction system that can also be used to improve information retrieval and to generate personalized recommendations. Hofmann's model was called Probabilistic Latent Semantic Indexing (PLSI); it was inspired by the LSA principle but recast it on statistical foundations. In PLSI, each document is modeled as a "bag of words", where each word is assumed to have been generated with a certain probability by a hidden topic, and consequently the document would have been generated by a certain number of topics drawn from a probability distribution. Under this assumption, PLSI is considered a "generative model", which can be expressed as follows: P(d,w) = [Sum over topics z] P(z) P(w|z) P(d|z). Thus, the problem of finding the semantic structure becomes the problem of defining a probability distribution for each latent class, P(z), and for each of the words in the class, P(w|z). With these inputs it is also feasible to calculate the mixture of topics or classes for a document, that is, P(z|d). To perform this task, the PLSI method proposes maximizing the likelihood function using an Expectation-Maximization (EM) algorithm.
The EM algorithm tries to maximize: L = [Sum over documents d] [Sum over words w] n(d,w) log P(d,w). Although PLSI brings some improvements in terms of perplexity with respect to LSA and has the important advantage of not requiring a complete reconstruction of the model to analyze unseen documents, building the model is computationally expensive and is not feasible for analyzing collections on the order of millions of documents, such as the Internet. Another limitation of PLSI is that the number of latent classes or topics is an arbitrary number, and this number needs to be small because it determines the computational complexity of the method.
Therefore, the proposed invention fulfills a comparable objective in that it allows the latent structure of document collections to be extracted and semantic similarity to be calculated, by means of an algorithm and a simplified representation of topics defined as "semantic contexts". The presented method makes use of information-theoretic metrics, search indexes and local optimization algorithms to extract an unknown number of topics, and it can be scaled to much larger document collections.
One of the challenges in online advertising is to provide the client with highly relevant advertisements. The more relevant an ad is to the person browsing the Internet, the more likely that person is to follow the ad's link and eventually complete a commercial transaction. Currently, the most modern systems operate under an auction scheme in which advertisers select keywords and place bids in the auction to get their advertising displayed. The system tries to maximize the relevance of the ads based on the content of the electronic document being read by the user at that moment, or on the queries placed by users in Internet search engines.
The process of creating ad campaigns is not trivial for the advertiser, who is asked to manually choose the keyword variants that will trigger the display of the ad. In this process, the following three problems can occur: 1. Keyword selection is difficult. For example, the advertiser often does not choose enough related keywords for the campaign. This leads to low exposure for the campaign.
2. Keyword selection is subject to ambiguity, because the advertiser can choose keywords that have multiple meanings, that is, polysemous words.
This situation can cause the ad to be presented in situations where it is not relevant.
3. The advertiser may mistakenly choose unrelated words. As in the previous case, this can lead to the presentation of irrelevant advertisements, which results in a lost-opportunity cost for both the advertiser and the system operator.
The system and method presented in this invention increase the relevance of the advertisements presented to the user by semantically relating the advertisements to the electronic documents being read by a user at a given moment.
The semantic relation method performed by the system uses the statistical properties of language, and is therefore able to detect the semantic similarity of a given pair of documents (one of which may be an advertisement) that do not necessarily share terms in common but in fact relate to the same concepts.
BRIEF DESCRIPTION OF THE FIGURES Figure 1. Flowchart illustrating the general method of ad impression,
Figure 2. Detailed flowchart illustrating the process,
Figure 3. Detailed flowchart illustrating the process of extracting topics from the collection. DETAILED DESCRIPTION OF THE INVENTION
Figure 1 is a flowchart illustrating the general method of ad impression. It assumes a prior topic-identification process (1), described in Figure 3, which produces a data structure identified as the "Topic Structure" (2). It also assumes that a collection of candidate advertisements has been stored in the database (3), and that the topics have been identified using the methods described later in this document. Once this has been done, the system can be used as follows. Suppose the user consults an electronic document, typically a web page (4). Next, the system associates (5) the topics of the candidate advertisements with those related to the document in question and generates a list of related advertisements (6), which correspond to the same topics as the document consulted by the user (4). Figure 2 is a detailed flowchart illustrating the process followed by the system presented in this invention. The first step is the pre-processing of the terms of the documents (8). Pre-processing is done sequentially, taking each document of the collection and applying the following transformations. When a document is pre-processed, the first phase consists of separating the document into sentences, according to punctuation and to hypertext separators such as line breaks, tables and title tags. Then the sentences are reduced to lists of words, eliminating words with purely grammatical functions, such as articles, pronouns, adverbs and the like, usually known as "stop-words".
For example, the English sentence "The quick brown fox jumps over the lazy dog" is reduced to the list: {quick, brown, fox, jumps, lazy, dog}. The set of all relevant terms included in the document is called the "vocabulary".
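A minimal sketch of this pre-processing phase in Python; the regular expressions and the stop-word list below are illustrative assumptions, not the actual separators or lists used by the system:

```python
import re

# Illustrative stop-word list; a real system would use a full per-language list.
STOP_WORDS = {"the", "a", "an", "over", "and", "or", "of", "in", "is"}

def preprocess(document: str) -> list[list[str]]:
    """Split a document into sentences, then reduce each sentence
    to a list of lower-cased terms with stop-words removed."""
    # Split on punctuation; hypertext separators (line breaks, table and
    # title tags) would be handled analogously after stripping markup.
    sentences = re.split(r"[.!?]\s*|\n", document)
    result = []
    for sentence in sentences:
        terms = [w.lower() for w in re.findall(r"[A-Za-z]+", sentence)]
        terms = [w for w in terms if w not in STOP_WORDS]
        if terms:
            result.append(terms)
    return result

print(preprocess("The quick brown fox jumps over the lazy dog."))
# [['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']]
```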
Subsequently, as part of the pre-processing phase (8), an inverted index is created. The inverted index is a mapping between each term and the identifiers of the documents that contain that term. Inverted indexes are a widely used technique in the field of information retrieval for efficiently locating the documents that contain a given term. Additionally, a table of terms is built. Each record in the table of terms contains additional information about each term, for example its unique numeric identifier (called the term-id), the frequency of the term (the number of documents in which the term appears) and the sentence frequency of the term (the number of sentences in which the term appears).
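The inverted index and the table of terms can be sketched as follows, assuming each document has already been reduced to lists of sentence terms as above; the field names are hypothetical:

```python
from collections import defaultdict

def build_index(docs: dict[int, list[list[str]]]):
    """docs maps a doc-id to its list of pre-processed sentences.
    Returns (inverted_index, term_table), where the term table stores,
    per term, its document frequency and its sentence frequency."""
    inverted = defaultdict(set)     # term -> set of doc-ids containing it
    doc_freq = defaultdict(int)     # term -> number of documents
    sent_freq = defaultdict(int)    # term -> number of sentences
    for doc_id, sentences in docs.items():
        seen_in_doc = set()
        for sentence in sentences:
            for term in set(sentence):
                sent_freq[term] += 1
                seen_in_doc.add(term)
        for term in seen_in_doc:
            inverted[term].add(doc_id)
            doc_freq[term] += 1
    term_table = {t: {"term_id": i,
                      "doc_freq": doc_freq[t],
                      "sent_freq": sent_freq[t]}
                  for i, t in enumerate(sorted(doc_freq))}
    return inverted, term_table
```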
Another necessary preparation phase is the generation of a term co-occurrence matrix. In this matrix, both the columns (j) and the rows (i) correspond to vocabulary terms, and cell (i, j) stores the number of times terms i and j occur in the same sentence. The appearance of two terms in the same sentence is called a co-occurrence. Only terms with a frequency above a certain level are taken into account to feed the matrix; in other words, only terms that appear in a minimum number of sentences are stored in the database.
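A sketch of the co-occurrence matrix construction under the same assumptions; the minimum sentence-frequency threshold is a hypothetical parameter standing in for the "certain level" mentioned in the text:

```python
from collections import defaultdict
from itertools import combinations

def cooccurrences(docs, min_sent_freq=2):
    """Count, for each pair of terms, the number of sentences in which
    they appear together. Terms below the sentence-frequency threshold
    are ignored, as in the filtering step described above."""
    sent_freq = defaultdict(int)
    for sentences in docs.values():
        for sentence in sentences:
            for term in set(sentence):
                sent_freq[term] += 1
    matrix = defaultdict(int)  # (term_i, term_j) -> count, with i < j
    for sentences in docs.values():
        for sentence in sentences:
            kept = sorted({t for t in sentence
                           if sent_freq[t] >= min_sent_freq})
            for i, j in combinations(kept, 2):
                matrix[(i, j)] += 1
    return matrix
```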
Once the construction of the matrix is finished, it is stored in the database (3) so that its information can be used by the processes mentioned in Figure 2 (7), and the pre-processing step (8) ends. The next step is the construction of the set of Topics (9), which takes place before the system is used by the end user. The construction of the set of topics is illustrated in Figure 3 and will be described later in this document. For the moment, consider each topic in a document to be represented by a "semantic context" defined by a set of k terms W = {w1, ..., wk}. The terms in a semantic context are the words that together "best" describe a given topic, where the exact meaning of "best" will be explained shortly. The set of k words W is also called a "core". As a result of the pre-processing described in (8), the terms in a core contain no general language elements such as articles, prepositions or adverbs. DW denotes the set of documents that contain all the terms in W. The documents in DW are considered semantically close to each other. The main characteristic that distinguishes a "core" from an arbitrary set of k vocabulary terms is that the metric called "force" is maximal when applied to it. The "force" formula is thus the criterion for determining what a core is. The force is defined, in turn, by the following formula:
force(W) = c · J(W) / D(W)
In this formula, c is a scale constant and J(W) is the joint frequency of the words, which is the number of documents in which all the words of the set W co-occur. The term D(W) represents the quantity defined as the "disjoint frequency", which is the sum of the sizes of each of the disjoint sets of documents in which the i-th term occurs without co-occurring with any of the remaining words of the set W.
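Reading J(W) and D(W) against the inverted index, the force of a candidate core can be computed with set operations; this sketch assumes the ratio form force(W) = c · J(W) / D(W) reconstructed above, with c an arbitrary scale constant:

```python
def force(core, inverted, c=1.0):
    """Force of a candidate core W, assuming force(W) = c * J(W) / D(W).
    inverted maps each term to the set of doc-ids containing it."""
    doc_sets = [inverted[t] for t in core]
    # J(W): documents where all the terms of W co-occur.
    joint = set.intersection(*doc_sets)
    # D(W): for each term, documents where it occurs without
    # co-occurring with ANY of the remaining terms of W.
    disjoint = 0
    for i, t in enumerate(core):
        others = set.union(*(doc_sets[:i] + doc_sets[i + 1:]))
        disjoint += len(inverted[t] - others)
    return c * len(joint) / disjoint if disjoint else float("inf")
```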
The process for obtaining the cores, that is, the sets of k terms with maximal force, is explained in Figure 3 and is presented at the end of the explanation of Figure 2. Assuming that the calculation of the cores has been completed and that the core information has been stored in the database (3), the process continues in Figure 2 with the calculation of the topic weight vector (10). For each discovered core, a weighted vector will be calculated in order to determine its similarity to any document, as explained next.
In this phase, a weighted vector (t1, w1), (t2, w2), ..., (tn, wn) of the terms of each topic is calculated, where for each term ti its weight wi represents the importance of ti in the topic under consideration. To calculate the weight vector for each topic, the documents matching the query represented by the corresponding "core" are retrieved, that is, the set DW of documents containing all the core words. To carry out this calculation, each document is represented as a vector of terms with the frequency of each term in the document, that is, [(t1, f1j), (t2, f2j), ..., (tn, fnj)] for a document j. Then all the frequencies for the documents in DW are added, yielding a vector [(t1, f1,1 + f1,2 + ...), ..., (tn, fn,1 + fn,2 + ...)]. To this vector, the standard TF-IDF formula is applied to calculate the weight of each term with respect to the core. The TF-IDF formula is: wij = tf(i,j) · log(N / ni), where wij is the weight of term i in document j, tf(i,j) is the number of occurrences of term i in document j, N is the total number of documents in the corpus, ni is the number of documents in which term i occurs, and log is a logarithmic function. Once this step is completed, a normalization is performed by dividing each weight by the sum of the weights, resulting in a unit vector.
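A sketch of this topic weight vector computation (10), assuming the inverted index and per-document frequency dictionaries from the earlier steps; the names are illustrative:

```python
import math

def topic_vector(core, inverted, doc_term_freqs, n_total_docs):
    """Weighted term vector for a topic, per the steps above: sum the
    term frequencies over DW, apply TF-IDF, then normalize by the sum
    of weights. doc_term_freqs maps doc-id -> {term: frequency}."""
    # DW: documents matching the query represented by the core.
    dw = set.intersection(*(inverted[t] for t in core))
    summed = {}
    for doc_id in dw:
        for term, f in doc_term_freqs[doc_id].items():
            summed[term] = summed.get(term, 0) + f
    # w_i = tf_i * log(N / n_i), with n_i the document frequency of term i.
    weights = {t: f * math.log(n_total_docs / len(inverted[t]))
               for t, f in summed.items()}
    total = sum(weights.values()) or 1.0
    return {t: w / total for t, w in weights.items()}
```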
In the next phase, the system calculates the similarity of the ads to the topics (11). To do so, a weighted term vector is calculated for each of the ads, using a process similar to the one that builds the vectors for each topic (10), described above. Then, the similarity between the ad vector and the topic vector is calculated for each of the topic vectors. This similarity is obtained with the standard "cosine distance", which is simply the scalar product of the vectors divided by the product of their magnitudes. This number provides a measure of the similarity of each ad to each topic. Then a database (3) is formed with the similarities between each ad and each of the topics. For an ad "d", a "topic similarity vector" Td will be a vector of the form (T1, w1), (T2, w2), ..., (Tn, wn), where the Ti are the topics and the wi are the weights, which are reciprocals of the cosine distance between the ad d and the topic Ti. This concludes the calculation of the similarity between ads and topics (11).
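The cosine measure used throughout is, as stated, the scalar product divided by the product of the magnitudes; over sparse term-weight dictionaries it can be sketched as:

```python
import math

def cosine_similarity(u: dict, v: dict) -> float:
    """Scalar product of two sparse term-weight vectors divided by
    the product of their magnitudes."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Per the text, an ad's topic similarity vector Td stores one entry per
# topic, with a weight derived from this cosine measure (its reciprocal
# in the patent's formulation).
```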
After the previously described phases have been completed, the system can receive web documents through the network (12). The user's request may contain the address of a remote document residing in the network, or the full text of the document may be locally available; therefore, to determine which case applies, a test is performed to verify whether the document is available (13) in the database (documents that were at some time in the database but have expired are not considered locally available). If the document is indeed in the database, the method retrieves its topic vector (16) from the per-document topic base. If not, the new document is stored in the index and in the database (14), and the method is used to calculate its similarity to the topics (15), that is: construct a weighted term vector for the document, calculate the similarity of the document vector with each of the topic vectors, and store the results in the topic-document base.
In either case, after the calculation of the similarity of the document to the topics (15) or the retrieval of the document's topic vector (16), the method proceeds to rank the ads (17) for the document consulted by the user, which will be referred to as "d". To this end, the method first selects the candidate ads using a pre-selection criterion. For each of these candidate ads, its topic vector is retrieved from the database. Finally, the cosine distance is calculated between each ad topic vector and the topic vector of document "d", and the results (distances) are sorted in ascending order, so that the smallest distances appear first. The procedure ends when the ranked list of ads (18) is generated.
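A sketch of this ranking step (17)-(18), using the cosine_similarity helper above; taking the distance as 1 − similarity, so that ascending order places the most similar ads first, is an assumption about the exact convention, and the pre-selection of candidates is assumed to have already happened:

```python
def rank_ads(doc_topics: dict, candidate_ads: dict) -> list:
    """Order candidate ads by cosine distance between each ad's topic
    vector and the topic vector of the consulted document d.
    candidate_ads maps ad-id -> topic vector (after pre-selection).
    Smallest distances (most similar ads) come first."""
    scored = []
    for ad_id, ad_topics in candidate_ads.items():
        distance = 1.0 - cosine_similarity(ad_topics, doc_topics)
        scored.append((distance, ad_id))
    scored.sort()  # ascending: smallest distance first
    return [ad_id for _, ad_id in scored]
```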
Figure 3 is a flowchart illustrating the process of extracting topics from the collection. It begins with a given set of pre-processed documents (19), which may be part of an organization's repository or a sample of a very large collection such as the Internet; the pre-processing was described in previous sections (8) and includes, for example, the elimination of non-essential terms, separation into sentences, construction of term frequency vectors and construction of the term co-occurrence matrix. The result of this process is a set of "cores" (that is, sets of k terms, where k is a small integer, typically 3 or 4) of maximal force, using the measure defined in the formula described above.
Next, in the seed calculation (20), for each document in the collection an initial group of k terms called a "seed" is obtained by taking the k terms with the highest TF-IDF for that document. Then the central part of the method is carried out, which is the core refinement process (21). The initial cores are the seeds calculated in the previous phase. In this phase, each core is systematically modified, changing one of its terms to test whether the force of the resulting variant increases; if it does, the variant takes the place of the core it came from and the original core is discarded; if it does not, a new variant is tested. The difficulty of this step lies in avoiding testing too many variants, since in principle, if there are n terms in the vocabulary (typically several thousand), then there are n! / (k! (n-k)!) possible variants, an intractable number even for a small value of k. At this point, the co-occurrence matrix serves to avoid testing every possible combination of terms; the described procedure considers only the terms with a significant level of co-occurrence with the k-1 remaining terms of the core, that is, only terms with co-occurrences above a predetermined level are candidates to replace a core term. Once all viable candidate terms have been tested for each of the core terms without managing to increase the force, the core is guaranteed to have maximal force. When two or more cores being refined turn out to be identical, they are merged into one. Thus, the procedure produces as its final result a collection of unique cores of maximal force (22).
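The seed-and-refinement local search (20)-(21) can be sketched as a hill-climbing loop; it assumes the force and cooccurrences helpers from the earlier sketches, and the co-occurrence threshold is a hypothetical parameter:

```python
def refine_core(seed, vocabulary, inverted, cooc, min_cooc=3):
    """Hill-climbing refinement of one core: repeatedly try replacing a
    single term with a candidate that co-occurs sufficiently with the
    k-1 remaining terms, keeping a variant whenever its force is higher."""
    core = list(seed)
    improved = True
    while improved:
        improved = False
        for i in range(len(core)):
            rest = [t for j, t in enumerate(core) if j != i]
            # Only terms co-occurring enough with all remaining terms are tried.
            candidates = [t for t in vocabulary
                          if t not in core
                          and all(cooc.get(tuple(sorted((t, r))), 0) >= min_cooc
                                  for r in rest)]
            for cand in candidates:
                variant = rest + [cand]
                if force(variant, inverted) > force(core, inverted):
                    core, improved = variant, True
                    break
            if improved:
                break
    # Identical refined cores are later merged, yielding unique maximal cores.
    return frozenset(core)
```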

Claims

Having presented the invention, which is novel, and having described it sufficiently, we claim as our exclusive property:
1. A method for retrieving a relevant subset of advertisements, having an information retrieval system that retrieves the set of textual advertisements given the content of a document, characterized by comprising the following stages:
(a) Identifying the topics existing in a collection of web documents;
(b) Associating textual advertisements with the extracted topics by applying a semantic similarity metric;
(c) Associating the document with said topics by applying a semantic similarity metric;
(d) Semantically ordering the advertisements retrieved for a given document.
2. The method for retrieving a relevant subset of advertisements, having an information retrieval system that retrieves the set of textual advertisements given the content of a document, according to claim 1, wherein stage (a), which consists of identifying the topics existing in a collection of web documents, comprises the following sub-stages:
(a) Compiling a collection of documents;
(b) Building an index of terms per document;
(c) Building a term-by-term matrix;
(d) Extracting the topics from each of the documents;
(e) Building a weighted vector, Tv, for each of the topics in the database.
3. The method for retrieving a relevant subset of advertisements, having an information retrieval system that retrieves the set of textual advertisements given the content of a document, according to claim 2, wherein sub-stage (b), which consists of building an index of terms per document, comprises the following sub-stages:
(a) Identifying the sentences existing in each of the documents in the collection;
(b) Removing non-significant words (stop-words) from the terms of each sentence;
(c) Accumulating the count of sentences in which each term occurs;
(d) Accumulating the count of documents in which each term occurs;
(e) Maintaining the list of documents in which each term occurs.
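A minimal sketch of the indexing sub-stages of claim 3, assuming a naive whitespace tokenizer, period-based sentence splitting, and a toy stop-word list (all illustrative assumptions):

```python
from collections import defaultdict

STOP_WORDS = {"the", "of", "and"}  # illustrative stop-word list

def build_index(documents):
    """documents: list of strings. Returns per-term sentence counts,
    document counts, and document lists, as in sub-stages (c)-(e)."""
    sentence_count = defaultdict(int)   # sentences in which a term occurs
    document_count = defaultdict(int)   # documents in which a term occurs
    postings = defaultdict(list)        # documents in which a term occurs

    for doc_id, text in enumerate(documents):
        seen_in_doc = set()
        for sentence in text.split("."):             # naive sentence split
            terms = {w.lower() for w in sentence.split()} - STOP_WORDS
            for term in terms:
                sentence_count[term] += 1
            seen_in_doc |= terms
        for term in seen_in_doc:
            document_count[term] += 1
            postings[term].append(doc_id)
    return sentence_count, document_count, postings
```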
4. The method for retrieving a relevant subset of advertisements, having an information retrieval system that retrieves the set of textual advertisements given the content of a document, according to claim 2, wherein sub-stage (c), which consists of building a term-by-term matrix, comprises the following sub-stages:
(a) Generating term-to-term mappings for each combination of words in each sentence;
(b) Accumulating the sum of term-to-term co-occurrences in the corresponding cell of the matrix;
(c) Accumulating the sum of co-occurrences per document in the corresponding cell of the term-to-term matrix.
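The co-occurrence accumulation of claim 4 might look as follows; the sparse Counter representation of the term-by-term matrix is an assumption made for brevity:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_matrix(sentences):
    """sentences: list of term lists. Accumulates term-to-term
    co-occurrence counts per sentence, as in sub-stages (a)-(b).
    The per-document accumulation of sub-stage (c) is analogous,
    iterating over documents instead of sentences."""
    matrix = Counter()
    for terms in sentences:
        for a, b in combinations(sorted(set(terms)), 2):
            matrix[(a, b)] += 1   # one cell of the sparse term-by-term matrix
    return matrix
```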
5. The method for retrieving a relevant subset of advertisements, having an information retrieval system that retrieves the set of textual advertisements given the content of a document, according to claim 2, wherein sub-stage (d), which consists of extracting the topics from each of the documents, comprises the following sub-stages:
(a) Computing a term-frequency vector over the terms of the document;
(b) Computing a new normalized, weighted vector for each of the terms in the term-frequency vector;
(c) Generating a seed set of terms;
(d) Iteratively replacing each of the terms of the seed set with the term that yields the highest strength evaluation;
(e) Storing the 3-term combination with the highest strength evaluation in the topic database.
6. The method for retrieving a relevant subset of advertisements, having an information retrieval system that retrieves the set of textual advertisements given the content of a document, according to claim 5, wherein sub-stage (d), which consists of iteratively replacing each of the terms of the seed set with the term that yields the highest strength evaluation, comprises using the term-to-term matrix to select the k terms ordered by the sum of their co-occurrences per sentence in descending order, k being an arbitrary integer constant.
7. The method for retrieving a relevant subset of advertisements, having an information retrieval system that retrieves the set of textual advertisements given the content of a document, according to claim 5, wherein sub-stage (d), which consists of iteratively replacing each of the terms of the seed set with the term that yields the highest strength evaluation, comprises computing the strength metric for each candidate replacement, which consists of the following sub-stages:
(a) Counting the number of documents in which the 3 words appear simultaneously, identifying that quantity as J;
(b) Counting the number of documents in which the first word occurs but the second and third do not, identifying that quantity as d1;
(c) Counting the number of documents in which the second word occurs but the first and third do not, identifying that quantity as d2;
(d) Counting the number of documents in which the third word occurs but the first and second do not, identifying that quantity as d3;
(e) Computing the strength of the set, identified as F, by dividing J by the sum d1 + d2 + d3.
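Sub-stages (a)-(e) of claim 7 translate directly into set operations over the per-term document lists of the index; a sketch, in which the set-valued `postings` representation and the zero-denominator guard are assumptions:

```python
def strength(postings, w1, w2, w3):
    """Strength F = J / (d1 + d2 + d3) for a 3-term core; `postings`
    maps each term to the set of documents containing it."""
    s1, s2, s3 = postings[w1], postings[w2], postings[w3]
    j  = len(s1 & s2 & s3)     # documents where all three words appear
    d1 = len(s1 - s2 - s3)     # first word only (w.r.t. the other two)
    d2 = len(s2 - s1 - s3)
    d3 = len(s3 - s1 - s2)
    denom = d1 + d2 + d3
    return j / denom if denom else float("inf")  # guard is an assumption
```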
8. The method for retrieving a relevant subset of advertisements, having an information retrieval system that retrieves the set of textual advertisements given the content of a document, according to claim 5, wherein sub-stage (d), which consists of iteratively replacing each of the terms of the seed set with the term that yields the highest strength evaluation, comprises using the terms-per-document index.
9. The method for retrieving a relevant subset of advertisements, having an information retrieval system that retrieves the set of textual advertisements given the content of a document, according to claim 5, wherein sub-stage (b), which consists of computing a new normalized, weighted vector for each of the terms in the term-frequency vector, comprises the following sub-stages:
(a) Retrieving the total number of documents existing in the terms-per-document index, N;
(b) Retrieving the total number of documents in which said term occurs, F;
(c) Assigning the result of the formula w*log(N/F), where w represents the current weight in the vector, as the new weight in the vector.
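The re-weighting of claim 9 is the classical TF-IDF scheme; a one-function sketch, assuming the vector is held as a dict from term to weight:

```python
import math

def normalize(tf_vector, document_count, n_docs):
    """Re-weight each term by w * log(N / F), per sub-stage (c);
    `document_count[t]` is F for term t and `n_docs` is N."""
    return {t: w * math.log(n_docs / document_count[t])
            for t, w in tf_vector.items()}
```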
10. The method for retrieving a relevant subset of advertisements, having an information retrieval system that retrieves the set of textual advertisements given the content of a document, according to claim 5, wherein sub-stage (c), which consists of generating a seed set of 3 terms, comprises the following sub-stages:
(a) Sorting the terms by the aforementioned weight in descending order;
(b) Removing those whose total number of occurrences in the index is greater than 5;
(c) Selecting the 3 highest-weighted terms as the seed set.
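A sketch of the seed selection of claim 10, assuming dict-based inputs; the occurrence threshold of 5 and k = 3 follow the claim text, while the parameter names are illustrative:

```python
def seed(weighted_vector, total_occurrences, max_occurrences=5, k=3):
    """Pick the k highest-weighted terms whose total number of
    occurrences in the index does not exceed the threshold."""
    rare = [(t, w) for t, w in weighted_vector.items()
            if total_occurrences[t] <= max_occurrences]
    rare.sort(key=lambda tw: tw[1], reverse=True)   # sub-stage (a)
    return frozenset(t for t, _ in rare[:k])        # sub-stage (c)
```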
11. The method for retrieving a relevant subset of advertisements, having an information retrieval system that retrieves the set of textual advertisements given the content of a document, according to claim 1, wherein stage (b), which consists of associating textual advertisements with the extracted topics by applying a semantic similarity metric, comprises the following sub-stages:
(a) Building a weighted term vector, Av, for each advertisement to be analyzed, including the title, text, links, and keywords provided by the user;
(b) Computing the cosine distance between said advertisement vector Av and each of the topic vectors, Tv;
(c) Storing the resulting topic-similarity vector in the advertisement-topic database.
12. The method for retrieving a relevant subset of advertisements, having an information retrieval system that retrieves the set of textual advertisements given the content of a document, according to claim 1, wherein stage (c), which consists of associating the document with said topics by applying a semantic similarity metric, comprises the following sub-stages:
(a) Building a weighted term vector, Dv, for the document to be analyzed;
(b) Computing the cosine distance between said document vector Dv and each of the topic vectors, Tv;
(c) Storing the topic-similarity column vector in the document-topic database.
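Claims 11 and 12 apply the same cosine measure, once to the advertisement vector Av and once to the document vector Dv; a minimal sketch, assuming sparse dict-based term vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity (the claims' 'cosine distance') between two
    sparse weighted term vectors, represented here as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def topic_similarities(vector, topic_vectors):
    # One similarity entry per topic vector Tv, for either Av or Dv.
    return [cosine(vector, tv) for tv in topic_vectors]
```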
13. The method for retrieving a relevant subset of advertisements, having an information retrieval system that retrieves the set of textual advertisements given the content of a document, according to claim 2, wherein sub-stage (e), which consists of building a weighted vector, Tv, for each of the topics in the database, comprises the following sub-stages:
(a) Finding all the documents in which the 3 words of the topic co-occur, D;
(b) Building a term-frequency vector for each of the retrieved documents;
(c) Computing the vector sum of the aforementioned frequency vectors to obtain a new frequency vector, Tfv, in which the weight of each term is the sum of that term's frequencies over the set D;
(d) Computing a new set of weights, W, by applying a normalization function to each of the weights of the vector Tfv.
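The vector Tfv of claim 13 is the element-wise sum of the term-frequency vectors of the documents in D, to which the normalization of claim 14 is then applied; a sketch assuming Counter-based frequency vectors:

```python
from collections import Counter

def topic_vector(doc_tf_vectors):
    """Tfv: element-wise sum of the term-frequency vectors of the
    documents in which the topic's 3 words co-occur (the set D)."""
    tfv = Counter()
    for v in doc_tf_vectors:
        tfv.update(v)   # adds per-term frequencies
    return tfv
```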
14. The method for retrieving a relevant subset of advertisements, having an information retrieval system that retrieves the set of textual advertisements given the content of a document, according to claim 13, wherein sub-stage (d), which consists of computing a new set of weights, W, by applying a normalization function to each of the weights of the vector Tfv, comprises the following sub-stages:
(a) Retrieving the total number of documents existing in the terms-per-document index, N;
(b) Retrieving the total number of documents in which the given term occurs, F;
(c) Assigning the result of the formula w*log(N/F), where w represents the current weight of the term, as the term's new weight in the vector.
15. The method for retrieving a relevant subset of advertisements, having an information retrieval system that retrieves the set of textual advertisements given the content of a document, according to claim 1, wherein stage (d), which consists of semantically ordering the advertisements retrieved for a given document, comprises the following sub-stages:
(a) Generating a list of candidate advertisements by selecting those that belong to the same topics as the document;
(b) Retrieving the normalized column vector for each of the candidate advertisements from the advertisement-topic database;
(c) Retrieving the topic vectors associated with the document under analysis, V;
(d) Building the advertisement-topic similarity matrix, A, by transposing all the advertisement-topic similarity vectors, that is, [f(a1), f(a2) ... f(an)]^T;
(e) Retrieving the document-topic similarity column vector, T, for the document under consideration from the document-topic database;
(f) Computing the column vector R by multiplying the advertisement-topic matrix A by the document-topic column vector T, that is, R = A x T;
(g) Obtaining the semantic ordering of the advertisements by sorting the elements of the column vector R.
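The final ranking of claim 15 reduces to the matrix-vector product R = A x T followed by a sort; a sketch assuming dense list-based rows for A:

```python
def rank_ads(ad_topic_vectors, doc_topic_vector):
    """Each row of A is one candidate ad's topic-similarity vector;
    R = A x T scores every ad against the document's topic vector T,
    and the ads are returned ordered by that score (sub-stages (d)-(g))."""
    scores = [sum(a * t for a, t in zip(row, doc_topic_vector))
              for row in ad_topic_vectors]
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return order, scores
```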


