WO2016193797A1 - Method of and system for generating a hashed complex vector - Google Patents

Method of and system for generating a hashed complex vector

Info

Publication number
WO2016193797A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
hashed
vector
complex
hash function
Prior art date
Application number
PCT/IB2015/058957
Other languages
French (fr)
Inventor
Vyacheslav Vyacheslavovich ALIPOV
Andrey Vladimirovich GULIN
Egor Aleksandrovich Samosvat
Andrey Sergeevich MISHCHENKO
Original Assignee
Yandex Europe Ag
Yandex Llc
Yandex Inc.
Priority date
Filing date
Publication date
Application filed by Yandex Europe Ag, Yandex Llc, Yandex Inc.
Publication of WO2016193797A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9014 Indexing; Data structures therefor; Storage structures using hash tables
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Definitions

  • machine-learned ranking ("MLR") models may rely on tree models and feature vectors comprising one or more parameters to associate a document (e.g., a web page) and a parameter of interest (e.g., a ranking value).
  • the one or more parameters of a feature vector may be used to define a specific path in a tree model thereby allowing identification of which parameter of interest is to be associated with a particular document.
  • the one or more parameters may be of different types such as binary type, integer type and/or category type.
  • a system hosting a tree model for the purpose of associating a document to a parameter of interest may receive a feature vector comprising parameters associated with the document.
  • the parameters may be implemented as first data associated with the document and second data also associated with the document.
  • the first data may be of a binary type and/or of a real number type and the second data may be of a category type.
  • the second data may be indicative of multiple features, for example a feature f1 representative of a URL and a feature f2 representative of a key word.
  • the system hosting the tree model may establish that at least a portion of the feature vector is unknown to the tree model.
  • f2 is established as being the portion of the feature vector that is unknown to the tree model.
  • the system may establish that it is necessary to add one or more new fields to a data structure representing the tree model, namely f2.
  • a conventional approach consists of allocating, in a memory of the system wherein the data modelizing the tree model is stored, additional memory arrays.
  • the size of the memory arrays to be allocated may depend both on the data modelizing the tree model and on f2.
  • the size of the memory arrays to be allocated may depend on a number of possible values that f2 may take.
  • such an approach may present inefficiencies in how the memory of the system is used, as the memory arrays allocated as a result of the number of possible values of f2 may represent a substantial size, in particular if the number of possible values for f2 is high.
  • because the memory arrays are allocated based on the number of possible values of f2, it may often be the case that the memory arrays contain only a limited number of actual entries for f2, thereby resulting in memory arrays being allocated but not actually used at a given time.
  • the present technology arises from an observation made by the inventor(s) that data of a category type representing one or more parameters of a feature vector may be processed by applying a hash function to generate a hash vector. Multiple hash vectors may be generated, for example, a hash vector may be generated for each one of the parameters of category type. The generated hash vectors may then be used in the generation of a hashed complex vector that then becomes a key of a hash table. The key may then be retrieved to allow identifying multiple parameters of category type.
  • adding new data of category type to a data structure representing a tree model may be limited to creating a new entry to the hash table thereby limiting (if not avoiding, under certain circumstances) allocation of memory arrays depending on a number of possible values the new data may take.
  • the present technology may allow allocation of memory arrays without having to enumerate all possible combinations associated with a category type beforehand. The present technology thereby results in a more efficient and flexible use of the memory of the system while allowing association between a document and a parameter of interest based on a tree model.
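
By way of illustration only, the following minimal Python sketch shows this pipeline end to end. The function names, the MD5-based hash functions and the example label value are assumptions made for the sketch; the patent does not prescribe any particular hash function.

```python
import hashlib

def h_cat(value: str) -> int:
    """Hash one parameter of category type (a hash function such as H1)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")

def hashed_complex_vector(mask: str, categorical_values: list[str]) -> int:
    """Combine the mask and the per-parameter hash values into a complex
    vector, then hash the whole vector (a hash function such as H2)."""
    complex_vector = (mask, tuple(h_cat(v) for v in categorical_values))
    digest = hashlib.md5(repr(complex_vector).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")

# The key indexes a hash table; adding a new categorical value adds only
# one entry instead of allocating arrays sized by the number of possible
# values the new data may take.
hash_table: dict[int, float] = {}
key = hashed_complex_vector("010", ["yandex.ru", "See Eiffel Tower"])
hash_table[key] = 0.85  # hypothetical parameter of interest (e.g., a ranking value)
```

The hash table grows by one entry per newly observed combination, which mirrors the memory behaviour described above.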
  • various implementations of the present technology provide a computer-implemented method of generating a hashed complex vector indicative of an association between a document and a parameter of interest, the document being associated with first data and second data, the method comprising: accessing, from a non-transitory computer-readable medium, the first data associated with the document, the first data being at least one of a binary type and a real number type; accessing, from the non-transitory computer-readable medium, the second data associated with the document, the second data being a category type; generating, by a processor, a mask vector based on the first data, the mask vector comprising a plurality of numbers corresponding to a path in a tree model, each one of the plurality of numbers being indicative of a branch associated with a node of the tree model; generating, by the processor, a hash vector based on the second data by applying a first hash function to the second data; generating, by the processor, a complex vector comprising the mask vector and the hash vector, the complex vector being indicative of a leaf of the tree model; generating, by the processor, a hashed complex vector by applying a second hash function to the complex vector; and storing the hashed complex vector in the non-transitory computer-readable medium.
  • the method further comprises: accessing, from the non-transitory computer-readable medium, a collection of previously generated hashed complex vectors; if the hashed complex vector corresponds to one of the previously generated hashed complex vectors, associating the parameter of interest associated with the corresponding previously generated hashed complex vector with the document; and if the hashed complex vector does not correspond to any of the previously generated hashed complex vectors, adding the hashed complex vector to the collection of previously generated hashed complex vectors.
  • adding the hashed complex vector to the collection of previously generated hashed complex vectors further comprises associating a parameter of interest to the hashed complex vector.
  • the leaf of the tree model is associated with a parameter of interest based on a machine learning algorithm using a training document.
  • the first data is indicative of at least one of a number of clicks, a number of views and a document ranking and wherein the second data is indicative of at least one of a URL, a domain name, an IP address, a search query and a key word.
  • the tree model is an oblivious tree model.
  • each one of the plurality of numbers comprised by the mask vector is a binary number.
  • the first data comprises at least one integer variable and wherein generating the mask vector comprises applying a translation function to generate a binary number associated with the integer variable.
  • the second data comprises a first categorical variable and a second categorical variable.
  • generating the hash vector comprises applying the first hash function to the first categorical variable and a third hash function to the second categorical variable.
  • the first hash function and the third hash function are one of a same hash function and distinct hash functions.
  • the first hash function and the second hash function are one of a same hash function and distinct hash functions.
  • the node of the tree model corresponds to a condition, the condition having been determined based on a machine learning algorithm.
  • adding the hashed complex vector to the collection of previously generated hashed complex vectors further comprises adding a node to the tree model.
  • the parameter of interest is indicative of at least one of a search result prediction, a probability of click, a document relevance, a user interest, a URL and a number of clicks.
  • various implementations of the present technology provide a non-transitory computer-readable medium storing program instructions for generating a hashed vector indicative of an association between a document and a parameter of interest, the program instructions being executable by a processor of a computer-based system to carry out one or more of the above-recited methods.
  • various implementations of the present technology provide a computer-based system, such as, for example, but without being limitative, an electronic device comprising at least one processor and a memory storing program instructions for generating a hashed vector indicative of an association between a document and a parameter of interest, the program instructions being executable by one or more processors of the computer-based system to carry out one or more of the above-recited methods.
  • an "electronic device”, an “electronic device”, a “server”, a, “remote server”, and a “computer- based system” are any hardware and/or software appropriate to the relevant task at hand.
  • some non-limiting examples of hardware and/or software include computers (servers, desktops, laptops, netbooks, etc.), smartphones, tablets, network equipment (routers, switches, gateways, etc.) and/or combination thereof.
  • "computer-readable medium" and "memory" are intended to include media of any nature and kind whatsoever, non-limiting examples of which include RAM, ROM, disks (CD-ROMs, DVDs, floppy disks, hard disk drives, etc.), USB keys, flash memory cards, solid-state drives, and tape drives.
  • an "indication" of an information element may be the information element itself or a pointer, reference, link, or other indirect mechanism enabling the recipient of the indication to locate a network, memory, database, or other computer-readable medium location from which the information element may be retrieved.
  • an indication of a document could include the document itself (i.e. its contents), or it could be a unique document descriptor identifying a file with respect to a particular file system, or some other means of directing the recipient of the indication to a network location, memory address, database table, or other location where the file may be accessed.
  • the degree of precision required in such an indication depends on the extent of any prior understanding about the interpretation to be given to information being exchanged as between the sender and the recipient of the indication. For example, if it is understood prior to a communication between a sender and a recipient that an indication of an information element will take the form of a database key for an entry in a particular table of a predetermined database containing the information element, then the sending of the database key is all that is required to effectively convey the information element to the recipient, even though the information element itself was not transmitted as between the sender and the recipient of the indication.
  • references to a "first" element and a "second" element do not preclude the two elements from being the same actual real-world element.
  • a "first" server and a "second" server may be the same software and/or hardware; in other cases they may be different software and/or hardware.
  • Implementations of the present technology each have at least one of the above-mentioned object and/or aspects, but do not necessarily have all of them. It should be understood that some aspects of the present technology that have resulted from attempting to attain the above-mentioned object may not satisfy this object and/or may satisfy other objects not specifically recited herein.
  • Figure 1 is a diagram of a computer system suitable for implementing the present technology and/or being used in conjunction with implementations of the present technology;
  • Figure 2 is a diagram of a networked computing environment in accordance with an embodiment of the present technology;
  • Figure 3 is a diagram illustrating a tree model and two exemplary feature vectors in accordance with an embodiment of the present technology;
  • Figure 4 is a diagram illustrating a generation of a hashed complex vector in accordance with an embodiment of the present technology;
  • Figure 5 is a diagram illustrating a generation of a hashed complex vector in accordance with another embodiment of the present technology; and
  • Figure 6 is a flowchart illustrating a computer-implemented method implementing embodiments of the present technology.
  • any functional block labeled as a "processor” or a "graphics processing unit” may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software.
  • the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared.
  • the processor may be a general purpose processor, such as a central processing unit (CPU) or a processor dedicated to a specific purpose, such as a graphics processing unit (GPU).
  • Other hardware, conventional and/or custom, may also be included.
  • Referring to FIG. 1, there is shown a computer system 100 suitable for use with some implementations of the present technology, the computer system 100 comprising various hardware components including one or more single or multi-core processors collectively represented by processor 110, a graphics processing unit (GPU) 111, a solid-state drive 120, a random access memory 130, a display interface 140, and an input/output interface 150.
  • Communication between the various components of the computer system 100 may be enabled by one or more internal and/or external buses 160 (e.g. a PCI bus, universal serial bus, IEEE 1394 "Firewire” bus, SCSI bus, Serial-ATA bus, etc.), to which the various hardware components are electronically coupled.
  • the display interface 140 may be coupled to a monitor 142 (e.g. via an HDMI cable 144) visible to a user 170, and the input/output interface 150 may be coupled to a touchscreen (not shown), a keyboard 151 (e.g. via a USB cable 153) and a mouse 152 (e.g. via a USB cable 154), each of the keyboard 151 and the mouse 152 being operable by the user 170.
  • the solid-state drive 120 stores program instructions suitable for being loaded into the random access memory 130 and executed by the processor 110 and/or the GPU 111 for processing activity indications associated with a user.
  • the program instructions may be part of a library or an application.
  • Referring to FIG. 2, there is shown a networked computing environment 200 suitable for use with some implementations of the present technology, the networked computing environment 200 comprising a master server 210 in communication with a first slave server 220, a second slave server 222 and a third slave server 224 (also referred to as the slave servers 220, 222, 224 hereinafter) via a network (not depicted) enabling these systems to communicate.
  • the network can be implemented as the Internet.
  • the network may be implemented differently, such as any wide-area communications network, local-area communications network, a private communications network and the like.
  • the networked computing environment 200 may contain more or fewer slave servers without departing from the scope of the present technology. In some embodiments, no "master server - slave server" configuration may be required; a single server may be sufficient. The number of servers and the type of architecture is therefore not limitative to the scope of the present technology.
  • a communication channel (not depicted) between the master server 210 and the slave servers 220, 222, 224 may be established to allow data exchange.
  • data exchange may occur on a continuous basis or, alternatively, upon occurrence of certain events.
  • a data exchange may occur as a result of the master server 210 receiving first data and second data associated with a document for which association with a parameter of interest is to be made by the networked computing environment.
  • the master server 210 may receive the first data and the second data from a frontend search engine server (not depicted) and send the first and the second data to one or more of the slave servers 220, 222, 224.
  • the one or more slave servers 220, 222, 224 may process the first data and the second data in accordance with the present technology to generate a hashed complex vector indicative of an association between the document and the parameter of interest.
  • the generated hashed complex vector and/or the parameter of interest may be transmitted to the master server 210 that, in turn, may transmit the generated hashed complex vector and/or the parameter of interest to the frontend search engine server.
  • the one or more slave servers 220, 222, 224 may directly transmit the generated hashed complex vector and/or the parameter of interest to the frontend search engine server without going through the intermediary step of communicating with the master server 210.
  • the master server 210 can be implemented as a conventional computer server and may comprise some or all of the features of the computer system 100 depicted at FIG. 1.
  • the master server 210 can be implemented as a Dell™ PowerEdge™ Server running the Microsoft™ Windows Server™ operating system. Needless to say, the master server 210 can be implemented in any other suitable hardware and/or software and/or firmware or a combination thereof.
  • the master server 210 is a single server. In alternative non-limiting embodiments of the present technology, the functionality of the master server 210 may be distributed and may be implemented via multiple servers.
  • the master server 210 comprises a communication interface (not depicted) structured and configured to communicate with various entities (such as the frontend search engine server and/or the slave servers 220, 222, 224, for example and other devices potentially coupled to the network) via the network.
  • the master server 210 further comprises at least one computer processor (e.g., a processor 110 of the master server 210) operationally connected with the communication interface and structured and configured to execute various processes to be described herein.
  • the general purpose of the master server 210 is to coordinate the processing of the first data and the second data associated with the document by the slave servers 220, 222, 224.
  • the first data and the second data may be transmitted to some or all of the slave servers 220, 222, 224 so that the slave servers 220, 222, 224 may conduct associations between the document and a parameter of interest.
  • the master server 210 may receive, from the slave servers 220, 222, 224, the parameter of interest to be associated with the document.
  • the master server 210 may be limited to sending the first data and the second data without receiving any parameter of interest in return.
  • This scenario may occur upon determination by one or more of the slave servers 220, 222, 224 that the first data and second data lead to modification of one of the tree models hosted on the slave servers 220, 222, 224.
  • the master server 210 may transmit the first data and the second data to the slave servers 220, 222, 224 along with a parameter of interest to be associated with the first data and the second data.
  • one of the tree models hosted by the slave servers 220, 222, 224 may be modified so that the first data and/or the second data may be associated with the parameter of interest in the tree model.
  • the slave servers 220, 222, 224 may transmit a message to the master server 210, the message being indicative of a modification made to one of the tree models.
  • other variations as to how the master server 210 interacts with the slave servers 220, 222, 224 may be envisioned without departing from the scope of the present technology and may become apparent to the person skilled in the art of the present technology.
  • the slave servers 220, 222, 224 can be implemented as conventional computer servers and may comprise some or all of the features of the computer system 100 depicted at FIG. 1.
  • the slave servers 220, 222, 224 can be implemented as a Dell™ PowerEdge™ Server running the Microsoft™ Windows Server™ operating system.
  • the slave servers 220, 222, 224 can be implemented in any other suitable hardware and/or software and/or firmware or a combination thereof.
  • the slave servers 220, 222, 224 operate on a distributed architecture basis. In alternative non-limiting embodiments, a single slave server may be relied upon to operate the present technology.
  • each one of the slave servers 220, 222, 224 may comprise a communication interface (not depicted) structured and configured to communicate with various entities (such as the frontend search engine server and/or the master server 210, for example and other devices potentially coupled to the network) via the network.
  • Each one of the slave servers 220, 222, 224 further comprises at least one computer processor (e.g., similar to the processor 110 depicted at FIG. 1) operationally connected with the communication interface and structured and configured to execute various processes to be described herein.
  • Each one of the slave servers 220, 222, 224 further may comprise one or more memories (e.g., similar to the solid-state drive 120 and/or the random access memory 130 depicted at FIG. 1).
  • the general purpose of the slave servers 220, 222, 224 is to coordinate the processing of the first data and the second data associated with the document.
  • the first data and the second data may be received from the master server 210 and/or the frontend server.
  • the slave servers 220, 222, 224 may conduct associations between the document and a parameter of interest. Once the associations have been conducted, the slave servers 220, 222, 224 may transmit to the master server 210 the parameter of interest to be associated with the document. In some other embodiments, the slave servers 220, 222, 224 may not transmit any parameter of interest to the master server 210.
  • This scenario may occur upon determination by the slave servers 220, 222, 224 that the first data and second data lead to modification of one of the tree models that they host.
  • the slave servers 220, 222, 224 may receive the first data and the second data from the master server 210 along with a parameter of interest to be associated with the first data and the second data.
  • one of the tree models hosted by the slave servers 220, 222, 224 may be modified so that the first data and/or the second data may be associated with the parameter of interest in the tree model.
  • the slave servers 220, 222, 224 may transmit a message to the master server 210, the message being indicative of a modification made to one of the tree models.
  • Other variations as to how the slave servers 220, 222, 224 interact with the master server 210 may be envisioned without departing from the scope of the present technology and may become apparent to the person skilled in the art of the present technology.
  • the slave servers 220, 222, 224 may each be communicatively coupled to "hash table 1" database 230, "hash table 2" database 232 and "hash table n" database 234 (referred to as "the databases 230, 232, 234" hereinafter).
  • the databases 230, 232, 234 may be part of the slave servers 220, 222, 224 (e.g., stored in the memories of the slave servers 220, 222, 224 such as the solid-state drive 120 and/or the random access memory 130) or be hosted on distinct database servers. In some embodiments, a single database accessed by the slave servers 220, 222, 224 may be sufficient.
  • the number of databases and the arrangement of the databases 230, 232, 234 are therefore not limitative to the scope of the present technology.
  • the databases 230, 232, 234 may be used to access and/or store data relating to one or more hash tables representative of tree models generated in accordance with the present technology.
  • the databases 230, 232, 234 may be accessed by the slave servers 220, 222, 224 to identify a parameter of interest to be associated with the document further to the processing of the first data and the second data by the slave servers 220, 222, 224 in accordance with the present technology.
  • the databases 230, 232, 234 may be accessed by the slave servers 220, 222, 224 to store a new entry (also referred to as a "hashed complex vector" and/or "key” hereinafter) in the one or more hash tables, the new entry having been generated further to the processing of the first data and the second data and being representative of a parameter of interest to be associated with the document.
  • the new entry may be representative of a modification made to a tree model modelized by the hash table.
  • turning to FIG. 3, the first set of parameters 330 and the second set of parameters 340 depicted therein may equally be referred to as feature vectors.
  • the tree model 300 may have been generated in accordance with the present technology and may modelize an association between a document and a parameter of interest.
  • the document may take multiple forms and formats to represent documents of various natures, such as, but without being limitative, text files, text documents, web pages, audio files, video files and so on.
  • the document may equally be referred to as a file without departing from the scope of the present technology.
  • the file may be a document searchable by a search engine.
  • the parameter of interest may take multiple forms and formats to represent an indication of an order or ranking of a document, for example, but without being limitative.
  • the parameter of interest may be referred to as a label and/or a ranking, in particular in the context of search engines.
  • the parameter of interest may be generated by a machine-learning algorithm using a training document.
  • other methods may be used such as, but without being limitative, manually defining the parameter of interest. How the parameter of interest is generated is therefore not limitative and multiple embodiments may be envisioned without departing from the scope of the present technology and may become apparent to the person skilled in the art of the present technology.
  • a path throughout the tree model 300 may be defined by the first set of parameters 330 and/or the second set of parameters 340.
  • the tree model 300 comprises multiple nodes each connected to one or more branches.
  • a first node 302, a second node 304, a third node 306, a fourth node 308 and a fifth node 310 are depicted.
  • Each one of the first node 302, the second node 304, the third node 306, the fourth node 308 and the fifth node 310 is associated with a condition.
  • the first node 302 is associated with a condition "if Page_rank < 3.5" associated with two branches (i.e., true represented by a binary number "0" and false represented by a binary number "1");
  • the second node 304 is associated with a condition "Is main page?" associated with two branches (i.e., true represented by a binary number "0" and false represented by a binary number "1");
  • the third node 306 is associated with a condition "if Number_clicks < 5,000" associated with two branches (i.e., true represented by a binary number "0" and false represented by a binary number "1");
  • the fourth node 308 is associated with a condition "which URL?" associated with more than two branches (i.e., each one of the branches is associated with a different URL, for example, the URL "yandex.ru"); and
  • the fifth node 310 is associated with a condition "which search query?" associated with more than two branches (i.e., each one of the branches is associated with a different search query, for example, the search query "See Eiffel Tower").
  • the tree model 300 may associate a document (such as, for example, but without being limitative, a web page in the html format) with the parameter of interest associated with the leaf 312, the association being defined by a path through the tree model 300 based on the first set of parameters 330 and/or the second set of parameters 340. It should be appreciated that for purpose of clarity, only a portion of the tree model 300 is illustrated. The person skilled in the art of the present technology may appreciate that the number of nodes, branches and leaves is virtually unlimited and solely depends on the complexity of the tree model to be modelized.
  • the tree model may be an oblivious tree model comprising a set of nodes each comprising two branches (i.e., true represented by a binary number "0" and false represented by a binary number "1").
  • the present technology is not limited to oblivious tree models and multiple variations may be envisioned by the person skilled in the art of the present technology, such as for example, a tree model comprising a first portion defining an oblivious tree model and a second portion defining a non-oblivious tree model as exemplified by the tree model 300 (e.g., a first portion defined by the first node 302, the second node 304 and the third node 306 and a second portion defined by the fourth node 308 and the fifth node 310).
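
As a concrete illustration of such a mixed tree model, the Python sketch below encodes the oblivious portion of the tree model 300 (nodes 302, 304 and 306) as a list of binary conditions and derives the corresponding binary path; the feature names and values are hypothetical stand-ins for the first data.

```python
# Oblivious portion of the tree model 300: each level applies one binary
# condition; true is represented by "0" and false by "1".
oblivious_conditions = [
    lambda f: f["page_rank"] < 3.5,       # first node 302: "if Page_rank < 3.5"
    lambda f: f["is_main_page"],          # second node 304: "Is main page?"
    lambda f: f["number_clicks"] < 5000,  # third node 306: "if Number_clicks < 5,000"
]

def path_mask(features: dict) -> str:
    """Walk the oblivious portion and record one binary digit per node."""
    return "".join("0" if condition(features) else "1"
                   for condition in oblivious_conditions)

# The non-oblivious portion (fourth node 308 "which URL?" and fifth node
# 310 "which search query?") branches on categorical values directly.
features = {"page_rank": 2.0, "is_main_page": False, "number_clicks": 3500}
print(path_mask(features))  # "010", matching the first data 343 of FIG. 3
```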
  • the first set of parameters 330 illustrates an example of parameters defining the path exemplified by the tree model 300.
  • the set of parameters 330 comprises first data 335 and second data 339.
  • the first data 335 and the second data 339 may be associated with the document and allow defining the path in the tree model 300 described in the paragraph above.
  • the first data 335 may be of binary type and/or of real number type (e.g., integer number type, floating-point number type).
  • the first data 335 may represent a path in an oblivious portion of the tree model, as is the case in the example depicted in FIG. 3. Other variations may also be possible without departing from the scope of the present technology.
  • in the example of FIG. 3, the first data 335 comprises a first component 332 associated with a value "01" and a second component 334 associated with a value "3500". Even though the term "component" is used in the present description, it should be understood that the term "variable" may be equally used and may therefore be considered as being an equivalent to "component".
  • the first component 332 comprises the binary sequence "01” which, once projected in the tree model 300, allows establishing a first portion of the path. In the example of FIG. 3, the first portion of the path is established by applying a first binary digit "0" of the sequence "01" to the first node 302 and then a second binary digit "1" of the sequence "01" to the second node 304.
  • the second component 334, once projected in the tree model 300, allows establishing a second portion of the path.
  • the second portion of the path is established by applying the number "3500" to the third node 306.
  • even though the example of FIG. 3 illustrates the first data 335 as comprising the first component 332 and the second component 334, the number of components and the number of digits comprised by one of the components is not limitative and many variations may be envisioned without departing from the scope of the present technology.
  • the second data 339 may be of category type.
  • the second data 339 may also be referred to as categorical features and may comprise, for example, but without being limitative, a host, a URL, a domain name, an IP address, a search query and/or a key word.
  • the second data 339 may be broadly described as comprising label categories allowing categorisation of information.
  • the second data may take the form of a chain and/or string of characters and/or digits.
  • the second data 339 may comprise a parameter that may take more than two values, as is the case in the example of FIG. 3.
  • what the second data 339 may comprise is not limitative and many variations may be envisioned without departing from the scope of the present technology.
  • the second data 339 may represent a path in a non-oblivious portion of the tree model, as is the case in the example depicted at FIG. 3. Other variations may also be possible without departing from the scope of the present technology.
  • the second data 339 comprises a first component 336 associated with a value "yandex.ru” and a second component 338 associated with a value "See Eiffel Tower".
  • the first component 336 comprises a string of characters "yandex.ru" which, once projected in the tree model 300, allows establishing a fourth portion of the path.
  • the fourth portion of the path is established by applying the string of characters "yandex.ru" to the fourth node 308.
  • the second component 338, once projected in the tree model 300, allows establishing a fifth portion of the path.
  • the fifth portion of the path is established by applying the string of characters "See Eiffel Tower" to the fifth node 310 thereby leading to the leaf 312 and the parameter of interest associated therewith.
  • even though the example of FIG. 3 illustrates the second data 339 as comprising the first component 336 and the second component 338, the number of components and the number of digits and/or characters comprised by one of the components is not limitative and many variations may be envisioned without departing from the scope of the present technology.
  • the second set of parameters 340 illustrates another example of parameters defining the path exemplified by the tree model 300.
  • the second set of parameters 340 comprises first data 343 and second data 349.
  • the second set of parameters 340 may be associated with the document and allows defining the path in the tree model 300 described in the paragraph above.
  • the second set of parameters 340 is similar in all aspects to the first set of parameters 330 with the exception of the first data 343.
  • the first data 343 comprises a sequence of digits "010" whereas the first data 335 comprises the first component 332 associated with the value "01" and the second component 334 associated with the value "3500".
  • in the first data 343, the value "3500" has been represented by a binary digit "0", which is the output of the value "3500" applied to the condition associated with the third node 306 (i.e., "Number_clicks < 5,000").
  • the first data 343 may be considered as an alternative representation to the first data 335 of a same path in the tree model 300.
  • a real number value may be translated into a binary value, in particular for cases wherein the node of the tree model to which the value is to be applied corresponds to an oblivious section of the tree model.
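
A minimal sketch of such a translation function follows; the threshold is taken from the condition of the third node 306, and the function name is hypothetical.

```python
def translate(value: float, threshold: float) -> str:
    """Translation function: apply the node's condition to a real number
    value; true is represented by "0" and false by "1"."""
    return "0" if value < threshold else "1"

# First data 335 ("01" and 3500) and first data 343 ("010") represent the
# same path: the value "3500" collapses to the digit "0" at node 306.
mask = "01" + translate(3500, 5000)  # third node 306: "Number_clicks < 5,000"
assert mask == "010"
```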
  • the second data 349 comprises a first component 344 and a second component 346 that are identical to the first component 336 and the second component 338 of the second data 339.
  • turning to FIG. 4, a complex vector 350 comprises a component MASK, a hash value h1 and a hash value h2.
  • the component MASK may be generated from the first data 343.
  • the component MASK may be generated from the first component 332 and the second component 334 of the first data 335.
  • the component MASK may correspond to the path between the first node 302 and the fourth node 308 in the tree model 300.
  • the hash value h1 may be generated by applying a hash function H1 to the first component 344 of the second data 349.
  • the hash function may be any function that may be readily apparent to the person skilled in the art of the present technology. Many variations may be envisioned without departing from the scope of the present technology.
  • the hash value h2 may be generated by applying the hash function H1 to the second component 346 of the second data 349. Even though the complex vector 350 is depicted as comprising the component MASK, the hash value h1 and the hash value h2, it should be understood that the complex vector may comprise more or fewer MASK components and more or fewer hash values. This aspect of the present technology is therefore not limitative and many variations may be envisioned without departing from the scope of the present technology.
  • a key 360 is generated by applying a hash function H2 to the complex vector 350.
  • the hash function may be any function that may be readily apparent to the person skilled in the art of the present technology. Many variations may be envisioned without departing from the scope of the present technology.
  • the hash function H1 may be a same hash function as the hash function H2.
  • the hash function H1 may be a different hash function from the hash function H2.
  • the key 360 may also be referred to as a hashed complex vector.
  • the key 360, once generated, may be stored in a memory, such as one of the databases 230, 232, 234 depicted in FIG. 2.
  • the key 360 once accessed, may allow identifying a path in a tree model.
  • a tree model may be partially or entirely modelized by a set of keys generated in a similar fashion as the key 360.
  • because the key 360 is a single entry in a hash table, the key may be retrieved to allow identifying multiple parameters of category type such as the first component 344 and the second component 346.
  • adding new data of category type to a data structure representing a tree model may be limited to creating a new entry to the hash table thereby limiting (if not avoiding, under certain circumstances) allocation of memory arrays depending on a number of possible values the new data may take.
  • the present technology may allow allocation of memory arrays without having to enumerate all possible combinations associated with a category type beforehand.
  • the present technology thereby results in a more efficient use of the memory of the system while allowing association between a document and a parameter of interest based on a tree model.
  • generation of the key 360 may allow accessing a parameter of interest previously associated with the key 360 based on an existing data structure representing a tree model.
  • generation of the key 360 may allow adding a new parameter of interest not previously existing in the data structure representing the tree model.
  • the new parameter of interest may be added by creating one or more new branches in the tree model. Because of how the key 360 is generated, adding the new parameter of interest may be conducted while limiting the amount of additional memory to be allocated for cases wherein an association between a document and the parameter of interest includes data of category type.
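
For illustration purposes only, the lookup-or-insert behaviour described above may be sketched as follows, with an ordinary Python dictionary standing in for the hash table keyed by hashed complex vectors such as the key 360; all names are hypothetical.

```python
def parameter_for_key(hash_table: dict, key: int, new_parameter: float) -> float:
    """Return the parameter of interest previously associated with the key,
    or add the key as a new entry, which amounts to growing the tree model
    without allocating arrays for every possible categorical value."""
    if key in hash_table:          # key corresponds to a previously
        return hash_table[key]     # generated hashed complex vector
    hash_table[key] = new_parameter
    return new_parameter
```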
  • turning to FIG. 5, a complex vector 450 comprises a component MASK, a hash value h1 and a hash value h2.
  • the component MASK is similar to the component MASK of the complex vector 350.
  • the hash value h1 may be generated by applying the hash function H1 to the first component 344 of the second data 349.
  • the hash function H1 is similar to the hash function H1 depicted in FIG. 4.
  • the hash value h1 of the complex vector 350 is similar to the hash value h1 of the complex vector 450.
  • the hash value h2 may be generated by applying a hash function H3 to the second component 346 of the second data 349.
  • the hash function H3 may be a different hash function than the hash function H1 and/or the hash function H2.
  • the embodiment of FIG. 5 illustrates that, in some embodiments of the present technology, multiple hash functions may be used for the generation of the hash value h1 and the hash value h2.
  • a key 460 is generated by applying the hash function H2 to the complex vector 450.
  • the hash function H2 may be the same as or different from the hash function H3.
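
A sketch of this multi-hash-function variant follows; deriving the distinct hash functions H1, H2 and H3 by salting a single SHA-256 digest is an assumption made for the sketch, not something the patent prescribes.

```python
import hashlib

def make_hash(name: str):
    """Derive a distinct, named hash function (e.g., H1, H2, H3) by salting."""
    def h(value: str) -> int:
        digest = hashlib.sha256(f"{name}:{value}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "little")
    return h

H1, H2, H3 = make_hash("H1"), make_hash("H2"), make_hash("H3")

def key_460(mask: str, first_component: str, second_component: str) -> int:
    """Complex vector 450: MASK, H1(first component), H3(second component),
    hashed as a whole with H2 to produce the key 460."""
    complex_vector = (mask, H1(first_component), H3(second_component))
    return H2(repr(complex_vector))

print(key_460("010", "yandex.ru", "See Eiffel Tower"))
```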
  • FIG. 6 shows a flowchart illustrating a computer-implemented method 600 of generating a hashed complex vector indicative of an association between a document and a parameter of interest.
  • the parameter of interest is indicative of at least one of a search result prediction, a probability of click, a document relevance, a user interest, a URL and a number of clicks.
  • the method 600 starts at step 602 by accessing, from a non-transitory computer- readable medium, the first data associated with the document, the first data being at least one of a binary type and a real number type. Then, the method 600 proceeds to a step 604 by accessing, from the non-transitory computer-readable medium, the second data associated with the document, the second data being a category type.
  • the first data is indicative of at least one of a number of clicks, a number of views and a document ranking and wherein the second data is indicative of at least one of a URL, a domain name, an IP address, a search query and a key word.
  • the second data comprises a first categorical variable and a second categorical variable.
  • at a step 606, the method 600 proceeds to generating, by a processor, a mask vector based on the first data, the mask vector comprising a plurality of numbers corresponding to a path in a tree model, each one of the plurality of numbers being indicative of a branch associated with a node of the tree model.
  • the tree model is an oblivious tree model.
  • the node of the tree model corresponds to a condition, the condition having been determined based on a machine learning algorithm.
  • each one of the plurality of numbers comprised by the mask vector is a binary number.
  • the first data comprises at least one integer variable and wherein generating the mask vector comprises applying a translation function to generate a binary number associated with the integer variable.
  • the method 600 then proceeds, at a step 608, to generating, by the processor, a hash vector based on the second data by applying a first hash function to the second data.
  • generating the hash vector comprises applying the first hash function to the first categorical variable and a third hash function to the second categorical variable.
  • the first hash function and the third hash function are one of same hash function and distinct hash function.
  • at a step 610, the method 600 proceeds to generating, by the processor, a complex vector comprising the mask vector and the hash vector, the complex vector being indicative of a leaf of the tree model.
  • the leaf of the tree model is associated with a parameter of interest based on a machine learning algorithm using a training document.
  • at a step 612, the method 600 proceeds to generating, by the processor, a hashed complex vector by applying a second hash function to the complex vector.
  • the method 600 may then proceed to a step 614 wherein the hashed complex vector is stored in the non-transitory computer-readable medium.
  • the first hash function and the second hash function are one of same hash function and distinct hash function.
  • the method 600 may comprise accessing, from the non-transitory computer-readable medium, a collection of previously generated hashed complex vectors. If the hashed complex vector corresponds to one of the previously generated hashed complex vectors, the method 600 may associate the parameter of interest associated with the corresponding previously generated hashed complex vector with the document. If the hashed complex vector does not correspond to any of the previously generated hashed complex vectors, the method 600 may add the hashed complex vector to the collection of previously generated hashed complex vectors.
  • adding the hashed complex vector to the collection of previously generated hashed complex vectors further comprises associating a parameter of interest to the hashed complex vector. In yet some other embodiments, adding the hashed complex vector to the collection of previously generated hashed complex vectors further comprises adding a node to the tree model.
  • [106] [Clause 16] A computer-implemented system (220, 222, 224) configured to perform the method of any one of clauses 1 to 15.
  • [107] [Clause 17] A non-transitory computer-readable medium (120, 130), comprising computer-executable instructions that cause a system to execute the method according to any one of clauses 1 to 16.
  • the signals can be sent-received using optical means (such as a fibre-optic connection), electronic means (such as using wired or wireless connection), and mechanical means (such as pressure-based, temperature-based or any other suitable physical parameter based).

Abstract

A computer-implemented method (600) of and a system (220, 222, 224) for generating a hashed complex vector (360) indicative of an association between a document and a parameter of interest, the document being associated with first data (343) and second data (349). The method comprises accessing (602) the first data (343); accessing (604) the second data (349), the second data being a category type; generating (606), by a processor, a mask vector based on the first data; generating (608) a hash vector; generating (610) a complex vector (350) comprising the mask vector and the hash vector; generating (612) a hashed complex vector by applying a second hash function (H2) to the complex vector (350); and storing (614) the hashed complex vector (360).

Description

METHOD OF AND SYSTEM FOR GENERATING A HASHED COMPLEX
VECTOR
CROSS-REFERENCE
[01] The present application claims priority to Russian Patent Application No 2015120563, filed June 01, 2015, entitled "METHOD OF AND SYSTEM FOR GENERATING A HASHED COMPLEX VECTOR", the entirety of which is incorporated herein.
FIELD
[02] The present technology relates to systems and methods for generating a hashed complex vector. In particular, the hashed complex vector may be indicative of an association between a document and a parameter of interest. The document may be associated with first data and second data which may be processed to associate a parameter of interest to the document.
BACKGROUND
[03] Typically, in building a search-efficient data collection management system such as a web search engine, data items are indexed according to some or all of the possible search terms that may be contained in search queries. Thus, conventionally an "inverted index" of the data collection is created, maintained, and updated by the system. The inverted index will comprise a large number of "posting lists" to be reviewed during execution of a search query. Each posting list corresponds to a potential search term and contains "postings", which are references to the data items in the data collection that include that search term (or otherwise satisfy some other condition that is expressed by the search term). For example, if the data items are text documents, as is often the case for Internet (or "Web") search engines, then search terms are individual words (and/or some of their most often used combinations), and the inverted index comprises one posting list for every word that has been encountered in at least one of the documents.
[04] Search queries, especially those made by human users, typically have the form of a simple list of one or more words, which are the "search terms" of the search query. Every such search query may be understood as a request to the search engine to locate every data item in the data collection containing each and every one of the search terms specified in the search query. Processing of a search query will involve searching through one or more posting lists of the inverted index. As was discussed above, typically there will be a posting list corresponding to each of the search terms in the search query. Posting lists are searched as they can be easily stored and manipulated in a fast access memory device, whereas the data items themselves cannot (the data items are typically stored in a slower access storage device). This generally allows search queries to be performed at a much higher speed.
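
For background illustration only, a toy Python sketch of such an inverted index with posting lists follows; the documents, their numbering and the helper names are invented for the example and are not part of the patent.

```python
from collections import defaultdict

# Data items numbered within the collection; each posting list references
# the items containing its word.
documents = {1: "see the eiffel tower", 2: "eiffel tower tickets"}

inverted_index: dict[str, list[int]] = defaultdict(list)
for item_number, text in sorted(documents.items()):
    for word in sorted(set(text.split())):
        inverted_index[word].append(item_number)

def search(query: str) -> set[int]:
    """Locate every data item containing each and every search term."""
    return set.intersection(*(set(inverted_index[term]) for term in query.split()))

print(search("eiffel tower"))  # {1, 2}
```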
[05] Typically, each data item in a data collection is numbered. Rather than being ordered in some chronological, geographical or alphabetical order in the data collection, data items are commonly ordered (and thus numbered) within the data collection in descending order of what is known in the art as their "query-independent relevance" (hereinafter abbreviated to "QIR"). QIR is a system-calculated heuristic parameter defined in such a way that the data items with a higher QIR value are statistically more likely to be considered by a search requester of any search query as sufficiently relevant to them. The data items in the data collection will be ordered so that those with a higher QIR value will be found first when a search is done. They will thus appear at (or towards) the beginning of the search result list (which is typically shown in various pages, with those results at the beginning of the search result list being shown on the first page). Thus, each posting list in the inverted index will contain postings, a list of references to data items containing the term with which that posting list is associated, with the postings being ordered in descending QIR value order. (This is very commonly the case in respect of web search engines.).
[06] It should be evident, however, that such a heuristic QIR parameter may not provide for an optimal ordering of the search results in respect of any given specific query, as it will clearly be the case that a data item which is generally relevant in many searches (and thus high in terms of QIR) may not be specifically relevant in any particular case. Further, the relevance of any one particular data item will vary between searches. Because of this, conventional search engines implement various methods for filtering, ranking and/or reordering search results to present them in an order that is believed to be relevant to the particular search query yielding those search results. This is known in the art as "query-specific relevance" (hereinafter abbreviated "QSR"). Many parameters are typically taken into account when determining QSR. These parameters include: various characteristics of the search query; of the search requester; of the data items to be ranked; data having been collected during (or, more generally, some "knowledge" learned from) past similar search queries.
[07] Thus, the overall process of executing a search query can be considered as having two broad distinct stages: a first stage wherein all of the search results are collected based (in part) on their QIR values, aggregated and ordered in descending QIR order; and a second stage wherein at least some of the search results are reordered according to their QSR. A new QSR-ordered list of the search results is then created and delivered to the search requester, typically in parts, starting with the part containing the search results with the highest QSR.
[08] Typically, in the first stage, the collecting of the search results stops after some predefined maximum number of results has been attained or some predefined minimum QIR threshold has been reached. This is known in the art as "pruning": once the pruning condition has been reached, it is very likely that the relevant data items have already been located.
[09] Typically, in the second stage, a shorter, QSR-ordered list (which is a subset of the search results of the first stage) is produced. This is because a conventional web search engine, when conducting a search of its data collection (which contains several billion data items) for data items satisfying a given search query, may easily produce a list of tens of thousands of search results (and even more in some cases). Obviously the search requester cannot be provided with such a number of search results. Hence the great importance of narrowing down the search results actually provided to the requester to a few tens of result items that are potentially of highest relevance to the search requester.
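By way of a non-limiting illustration, the two-stage process described in paragraphs [07] to [09] may be sketched as follows; the index layout, the pruning constant and the qsr() scorer are assumptions made purely for the purpose of illustration:

```python
MAX_CANDIDATES = 1000  # pruning condition: stop collecting after this many hits


def two_stage_search(inverted_index, query_terms, qsr, top_n=10):
    postings = [inverted_index[term] for term in query_terms]
    others = [set(p) for p in postings[1:]]
    candidates = []
    # Stage 1: postings are stored in descending QIR order, so the first
    # matches found are statistically the most relevant ones.
    for doc_id in postings[0]:
        if all(doc_id in s for s in others):
            candidates.append(doc_id)
            if len(candidates) >= MAX_CANDIDATES:  # pruning
                break
    # Stage 2: re-rank the pruned subset by QSR and keep only the top few.
    return sorted(candidates, key=qsr, reverse=True)[:top_n]


index = {"eiffel": [3, 1, 7], "tower": [3, 7, 9]}  # postings in QIR order
print(two_stage_search(index, ["eiffel", "tower"], qsr=lambda doc_id: -doc_id))  # [3, 7]
```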
[10] In order to address the ranking needs required for proper operation of web search engines, such as, for example but without being limited thereto, the generation of QIR values and/or QSR values, multiple constructions of ranking models have been developed over recent years. These ranking models may enable ranking of documents (e.g., web pages, text files, image files and/or video files) according to one or more parameters. Under some approaches, machine-learning algorithms are used for the construction and operation of ranking models; this is typically referred to as machine-learned ranking (hereinafter abbreviated to "MLR"). As a person skilled in the art of the present technology may appreciate, MLR is not limited to web search engines per se but may be applicable to a broad range of information retrieval systems.
[11] Under some approaches, the ranking generated by an MLR model may consist of a ranking value associated with a document, which may equally be referred to as a "parameter of interest" or "label". The document may equally be referred to as a file. The ranking may consist of an "absolute ranking" reflecting an absolute order of a first document compared to a second document such as, for example, the QIR value. In some other instances, the ranking may consist of a "relative ranking" reflecting a relative order of the first document compared to the second document given a particular context such as, for example, the QSR value. In order to associate documents with parameters of interest, MLR models may, in some instances, be generated and maintained through machine-learning algorithms relying on one or more training samples. The number of MLR models required for a particular application may greatly vary. Conventional web search engines such as Yandex™ may rely on several thousands of MLR models used in parallel during the processing of search queries.
[12] In some instances, MLR models may rely on tree models and feature vectors comprising one or more parameters to associate a document (e.g., a web page) with a parameter of interest (e.g., a ranking value). The one or more parameters of a feature vector may be used to define a specific path in a tree model, thereby allowing identification of which parameter of interest is to be associated with a particular document. Under some approaches, the one or more parameters may be of different types, such as a binary type, an integer type and/or a category type.
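By way of a non-limiting illustration of how the parameters of a feature vector may define a path in a tree model, consider the following sketch; the node conditions and leaf values are assumptions made purely for the purpose of illustration:

```python
tree = {
    "test": lambda f: f["page_rank"] < 3.5,        # real-number parameter
    True:  {"test": lambda f: f["is_main_page"],   # binary parameter
            True:  {"leaf": 0.71},                 # parameter of interest
            False: {"leaf": 0.12}},
    False: {"leaf": 0.05},
}


def parameter_of_interest(node, features):
    # Follow one branch per node until a leaf is reached.
    while "leaf" not in node:
        node = node[node["test"](features)]
    return node["leaf"]


print(parameter_of_interest(tree, {"page_rank": 2.0, "is_main_page": True}))  # 0.71
```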
SUMMARY
[13] Under some approaches, a system hosting a tree model for the purpose of associating a document with a parameter of interest may receive a feature vector comprising parameters associated with the document. In some instances, the parameters may be implemented as first data associated with the document and second data also associated with the document. As an example, the first data may be of a binary type and/or of a real number type and the second data may be of a category type. The second data may be indicative of multiple features, for example a feature f1 representative of a URL and a feature f2 representative of a key word. As a person skilled in the art of the present technology may appreciate, upon processing the feature vector, the system hosting the tree model may establish that at least a portion of the feature vector is unknown to the tree model. For the purpose of exemplifying this situation, f2 is taken to be the portion of the feature vector that is unknown to the tree model. As a result, the system may establish that it is necessary to add one or more new fields, namely f2, to a data structure representing the tree model. When facing such a situation, a conventional approach consists of allocating additional memory arrays in the memory of the system wherein the data modelizing the tree model is stored. The size of the memory arrays to be allocated may depend both on the data modelizing the tree model and on f2; in particular, it may depend on the number of possible values that f2 may take. As a person skilled in the art of the present technology may appreciate, such an approach may present inefficiencies in how the memory of the system is used, as the memory arrays allocated according to the number of possible values of f2 may be of substantial size, in particular if the number of possible values for f2 is high. In addition, the memory arrays being allocated based on the number of possible values of f2, it may often be the case that the memory arrays contain only a limited number of actual entries of f2, therefore resulting in memory arrays being allocated but not actually used at a given time.
[14] There is therefore a need for improved methods and systems aiming at more efficiently managing memory usage of a system wherein data modelizing a tree model is stored. In particular, there is a need for limiting the constraint associated with allocating memory arrays of a memory of the system depending on a number of possible values that a parameter, in particular a parameter of category type, may take.
[15] The present technology arises from an observation made by the inventor(s) that data of a category type representing one or more parameters of a feature vector may be processed by applying a hash function to generate a hash vector. Multiple hash vectors may be generated; for example, a hash vector may be generated for each one of the parameters of category type. The generated hash vectors may then be used in the generation of a hashed complex vector that then becomes a key of a hash table. The key may then be retrieved to allow identifying multiple parameters of category type. As a result, adding new data of category type to a data structure representing a tree model may be limited to creating a new entry in the hash table, thereby limiting (if not avoiding, under certain circumstances) allocation of memory arrays depending on a number of possible values the new data may take. Under certain circumstances, the present technology may allow allocation of memory arrays without having to enumerate all possible combinations associated with a category type beforehand. The present technology thereby results in a more efficient and flexible use of the memory of the system while allowing association between a document and a parameter of interest based on a tree model.
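By way of a non-limiting illustration of this observation, the following sketch hashes each parameter of category type, combines the resulting hashes with the remainder of the feature encoding into a complex vector, and hashes that vector once more into a single hash table key; the h() helper, the seeds and the string encodings are assumptions made purely for the purpose of illustration:

```python
import hashlib


def h(value: str, seed: int = 0) -> int:
    """Stand-in 64-bit hash; any suitable hash function could be used."""
    data = f"{seed}:{value}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


mask = "010"                                           # binary/real-number part
hash_vector = [h("yandex.ru"), h("See Eiffel Tower")]  # one hash per category
complex_vector = mask + ":" + ":".join(str(x) for x in hash_vector)
key = h(complex_vector, seed=1)                        # the hashed complex vector

table = {key: 0.71}  # the key indexes a parameter of interest in a hash table
```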
[16] Thus, in one aspect, various implementations of the present technology provide a computer-implemented method of generating a hashed complex vector indicative of an association between a document and a parameter of interest, the document being associated with first data and second data, the method comprising: • accessing, from a non-transitory computer-readable medium, the first data associated with the document, the first data being at least one of a binary type and a real number type;
• accessing, from the non-transitory computer-readable medium, the second data associated with the document, the second data being a category type;
• generating, by a processor, a mask vector based on the first data, the mask vector comprising a plurality of numbers corresponding to a path in a tree model, each one of the plurality of numbers being indicative of a branch associated with a node of the tree model;
• generating, by the processor, a hash vector based on the second data by applying a first hash function to the second data;
• generating, by the processor, a complex vector comprising the mask vector and the hash vector, the complex vector being indicative of a leaf of the tree model;
• generating, by the processor, a hashed complex vector by applying a second hash function to the complex vector; and
• storing, in the non-transitory computer-readable medium, the hashed complex vector.
[17] In some implementations, the method further comprises: accessing, from the non-transitory computer-readable medium, a collection of previously generated hashed complex vectors; if the hashed complex vector corresponds to one of the previously generated hashed complex vectors, associating the parameter of interest associated with the corresponding previously generated hashed complex vector with the document; and if the hashed complex vector does not correspond to any of the previously generated hashed complex vectors, adding the hashed complex vector to the collection of previously generated hashed complex vectors.
[18] In some further implementations, adding the hashed complex vector to the collection of previously generated hashed complex vectors further comprises associating a parameter of interest to the hashed complex vector.
[19] In some implementations, the leaf of the tree model is associated with a parameter of interest based on a machine learning algorithm using a training document.
[20] In some further implementations, the first data is indicative of at least one of a number of clicks, a number of views and a document ranking, and the second data is indicative of at least one of a URL, a domain name, an IP address, a search query and a key word.
[21] In some implementations, the tree model is an oblivious tree model.
[22] In some further implementations, each one of the plurality of numbers comprised by the mask vector is a binary number.
[23] In some implementations, the first data comprises at least one integer variable and generating the mask vector comprises applying a translation function to generate a binary number associated with the integer variable.
[24] In some further implementations, the second data comprises a first categorical variable and a second categorical variable.
[25] In some implementations, generating the hash vector comprises applying the first hash function to the first categorical variable and a third hash function to the second categorical variable.
[26] In some further implementations, the first hash function and the third hash function are one of a same hash function and distinct hash functions.
[27] In some implementations, the first hash function and the second hash function are one of a same hash function and distinct hash functions.
[28] In some further implementations, the node of the tree model corresponds to a condition, the condition having been determined based on a machine learning algorithm.
[29] In some implementations, adding the hashed complex vector to the collection of previously generated hashed complex vectors further comprises adding a node to the tree model.
[30] In some further implementations, the parameter of interest is indicative of at least one of a search result prediction, a probability of click, a document relevance, a user interest, a URL and a number of clicks.
[31] In other aspects, various implementations of the present technology provide a non-transitory computer-readable medium storing program instructions for generating a hashed vector indicative of an association between a document and a parameter of interest, the program instructions being executable by a processor of a computer-based system to carry out one or more of the above-recited methods.
[32] In other aspects, various implementations of the present technology provide a computer-based system, such as, for example, but without being limitative, an electronic device comprising at least one processor and a memory storing program instructions for generating a hashed vector indicative of an association between a document and a parameter of interest, the program instructions being executable by one or more processors of the computer-based system to carry out one or more of the above-recited methods.
[33] In the context of the present specification, unless expressly provided otherwise, an "electronic device", a "server", a "remote server", and a "computer-based system" are any hardware and/or software appropriate to the relevant task at hand. Thus, some non-limiting examples of hardware and/or software include computers (servers, desktops, laptops, netbooks, etc.), smartphones, tablets, network equipment (routers, switches, gateways, etc.) and/or combinations thereof.
[34] In the context of the present specification, unless expressly provided otherwise, the expressions "computer-readable medium" and "memory" are intended to include media of any nature and kind whatsoever, non-limiting examples of which include RAM, ROM, disks (CD-ROMs, DVDs, floppy disks, hard disk drives, etc.), USB keys, flash memory cards, solid-state drives, and tape drives.
[35] In the context of the present specification, unless expressly provided otherwise, an "indication" of an information element may be the information element itself or a pointer, reference, link, or other indirect mechanism enabling the recipient of the indication to locate a network, memory, database, or other computer-readable medium location from which the information element may be retrieved. For example, an indication of a document could include the document itself (i.e. its contents), or it could be a unique document descriptor identifying a file with respect to a particular file system, or some other means of directing the recipient of the indication to a network location, memory address, database table, or other location where the file may be accessed. As one skilled in the art would recognize, the degree of precision required in such an indication depends on the extent of any prior understanding about the interpretation to be given to information being exchanged as between the sender and the recipient of the indication. For example, if it is understood prior to a communication between a sender and a recipient that an indication of an information element will take the form of a database key for an entry in a particular table of a predetermined database containing the information element, then the sending of the database key is all that is required to effectively convey the information element to the recipient, even though the information element itself was not transmitted as between the sender and the recipient of the indication.
[36] In the context of the present specification, unless expressly provided otherwise, the words "first", "second", "third", etc. have been used as adjectives only for the purpose of allowing for distinction between the nouns that they modify from one another, and not for the purpose of describing any particular relationship between those nouns. Thus, for example, it should be understood that the use of the terms "first server" and "third server" is not intended to imply any particular order, type, chronology, hierarchy or ranking (for example) of/between the servers, nor is their use (by itself) intended to imply that any "second server" must necessarily exist in any given situation. Further, as is discussed herein in other contexts, reference to a "first" element and a "second" element does not preclude the two elements from being the same actual real-world element. Thus, for example, in some instances, a "first" server and a "second" server may be the same software and/or hardware; in other cases they may be different software and/or hardware.
[37] Implementations of the present technology each have at least one of the above-mentioned objects and/or aspects, but do not necessarily have all of them. It should be understood that some aspects of the present technology that have resulted from attempting to attain the above-mentioned object may not satisfy this object and/or may satisfy other objects not specifically recited herein.
[38] Additional and/or alternative features, aspects and advantages of implementations of the present technology will become apparent from the following description, the accompanying drawings and the appended claims.
BRIEF DESCRIPTION OF THE DRAWINGS
[39] For a better understanding of the present technology, as well as other aspects and further features thereof, reference is made to the following description which is to be used in conjunction with the accompanying drawings, where:
[40] Figure 1 is a diagram of a computer system suitable for implementing the present technology and/or being used in conjunction with implementations of the present technology;
[41] Figure 2 is a diagram of a networked computing environment in accordance with an embodiment of the present technology;
[42] Figure 3 is a diagram illustrating a tree model and two exemplary feature vectors in accordance with an embodiment of the present technology;
[43] Figure 4 is a diagram illustrating a generation of a hashed complex vector in accordance with an embodiment of the present technology;
[44] Figure 5 is a diagram illustrating a generation of a hashed complex vector in accordance with another embodiment of the present technology; and
[45] Figure 6 is a flowchart illustrating a computer-implemented method implementing embodiments of the present technology.
[46] It should also be noted that, unless otherwise explicitly specified herein, the drawings are not to scale.
DETAILED DESCRIPTION
[47] The examples and conditional language recited herein are principally intended to aid the reader in understanding the principles of the present technology and not to limit its scope to such specifically recited examples and conditions. It will be appreciated that those skilled in the art may devise various arrangements which, although not explicitly described or shown herein, nonetheless embody the principles of the present technology and are included within its spirit and scope.
[48] Furthermore, as an aid to understanding, the following description may describe relatively simplified implementations of the present technology. As persons skilled in the art would understand, various implementations of the present technology may be of a greater complexity.
[49] In some cases, what are believed to be helpful examples of modifications to the present technology may also be set forth. This is done merely as an aid to understanding, and, again, not to define the scope or set forth the bounds of the present technology. These modifications are not an exhaustive list, and a person skilled in the art may make other modifications while nonetheless remaining within the scope of the present technology. Further, where no examples of modifications have been set forth, it should not be interpreted that no modifications are possible and/or that what is described is the sole manner of implementing that element of the present technology.
[50] Moreover, all statements herein reciting principles, aspects, and implementations of the present technology, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof, whether they are currently known or developed in the future. Thus, for example, it will be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative circuitry embodying the principles of the present technology. Similarly, it will be appreciated that any flowcharts, flow diagrams, state transition diagrams, pseudo-code, and the like represent various processes which may be substantially represented in computer-readable media and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
[51] The functions of the various elements shown in the figures, including any functional block labeled as a "processor" or a "graphics processing unit", may be provided through the use of dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, by a single shared processor, or by a plurality of individual processors, some of which may be shared. In some embodiments of the present technology, the processor may be a general purpose processor, such as a central processing unit (CPU) or a processor dedicated to a specific purpose, such as a graphics processing unit (GPU). Moreover, explicit use of the term "processor" or "controller" should not be construed to refer exclusively to hardware capable of executing software, and may implicitly include, without limitation, digital signal processor (DSP) hardware, network processor, application specific integrated circuit (ASIC), field programmable gate array (FPGA), read-only memory (ROM) for storing software, random access memory (RAM), and non-volatile storage. Other hardware, conventional and/or custom, may also be included.
[52] Software modules, or simply modules which are implied to be software, may be represented herein as any combination of flowchart elements or other elements indicating performance of process steps and/or textual description. Such modules may be executed by hardware that is expressly or implicitly shown.
[53] With these fundamentals in place, we will now consider some non-limiting examples to illustrate various implementations of aspects of the present technology.
[54] Referring to FIG. 1, there is shown a computer system 100 suitable for use with some implementations of the present technology, the computer system 100 comprising various hardware components including one or more single or multi-core processors collectively represented by processor 110, a graphics processing unit (GPU) 111, a solid-state drive 120, a random access memory 130, a display interface 140, and an input/output interface 150.
[55] Communication between the various components of the computer system 100 may be enabled by one or more internal and/or external buses 160 (e.g. a PCI bus, universal serial bus, IEEE 1394 "Firewire" bus, SCSI bus, Serial-ATA bus, etc.), to which the various hardware components are electronically coupled. The display interface 140 may be coupled to a monitor 142 (e.g. via an HDMI cable 144) visible to a user 170, and the input/output interface 150 may be coupled to a touchscreen (not shown), a keyboard 151 (e.g. via a USB cable 153) and a mouse 152 (e.g. via a USB cable 154), each of the keyboard 151 and the mouse 152 being operable by the user 170.
[56] According to implementations of the present technology, the solid-state drive 120 stores program instructions suitable for being loaded into the random access memory 130 and executed by the processor 110 and/or the GPU 111 for processing activity indications associated with a user. For example, the program instructions may be part of a library or an application.
[57] In FIG. 2, there is shown a networked computing environment 200 suitable for use with some implementations of the present technology, the networked computing environment 200 comprising a master server 210 in communication with a first slave server 220, a second slave server 222 and a third slave server 224 (also referred to as the slave servers 220, 222, 224 hereinafter) via a network (not depicted) enabling these systems to communicate. In some non-limiting embodiments of the present technology, the network can be implemented as the Internet. In other embodiments of the present technology, the network may be implemented differently, such as any wide-area communications network, local-area communications network, a private communications network and the like.
[58] The networked computing environment 200 may contain more or fewer slave servers without departing from the scope of the present technology. In some embodiments, no "master server - slave server" configuration may be required; a single server may be sufficient. The number of servers and the type of architecture are therefore not limitative of the scope of the present technology.
[59] In one embodiment, a communication channel (not depicted) between the master server 210 and the slave servers 220, 222, 224 may be established to allow data exchange. Such data exchange may occur on a continuous basis or, alternatively, upon occurrence of certain events. For example, in the context of crawling webpages and/or processing a search query, a data exchange may occur as a result of the master server 210 receiving first data and second data associated with a document for which association with a parameter of interest is to be made by the networked computing environment. In some embodiments, the master server 210 may receive the first data and the second data from a frontend search engine server (not depicted) and send the first and the second data to one or more of the slave servers 220, 222, 224. Once received from the master server 210, the one or more slave servers 220, 222, 224 may process the first data and the second data in accordance with the present technology to generate a hashed complex vector indicative of an association between the document and the parameter of interest. The generated hashed complex vector and/or the parameter of interest may be transmitted to the master server 210 that, in turn, may transmit the generated hashed complex vector and/or the parameter of interest to the frontend search engine server. In some alternative embodiments, the one or more slave servers 220, 222, 224 may directly transmit the generated hashed complex vector and/or the parameter of interest to the frontend search engine server without going through the intermediary step of communicating with the master server 210.
[60] The master server 210 can be implemented as a conventional computer server and may comprise some or all of the features of the computer system 100 depicted at FIG. 1. In an example of an embodiment of the present technology, the master server 210 can be implemented as a Dell™ PowerEdge™ Server running the Microsoft™ Windows Server™ operating system. Needless to say, the master server 210 can be implemented in any other suitable hardware and/or software and/or firmware or a combination thereof. In the depicted non-limiting embodiment of present technology, the master server 210 is a single server. In alternative non-limiting embodiments of the present technology, the functionality of the master server 210 may be distributed and may be implemented via multiple servers.
[61] The implementation of the master server 210 is well known to the person skilled in the art of the present technology. However, briefly speaking, the master server 210 comprises a communication interface (not depicted) structured and configured to communicate with various entities (such as the frontend search engine server and/or the slave servers 220, 222, 224, for example and other devices potentially coupled to the network) via the network. The master server 210 further comprises at least one computer processor (e.g., a processor 110 of the master server 210) operationally connected with the communication interface and structured and configured to execute various processes to be described herein.
[62] The general purpose of the master server 210 is to coordinate the processing of the first data and the second data associated with the document by the slave servers 220, 222, 224. As previously described, in an embodiment, the first data and the second data may be transmitted to some or all of the slave servers 220, 222, 224 so that the slave servers 220, 222, 224 may conduct associations between the document and a parameter of interest. In return, the master server 210 may receive, from the slave servers 220, 222, 224, the parameter of interest to be associated with the document. In some other embodiments, the master server 210 may be limited to sending the first data and the second data without receiving any parameter of interest in return. This scenario may occur upon determination by one or more of the slave servers 220, 222, 224 that the first data and the second data lead to a modification of one of the tree models hosted on the slave servers 220, 222, 224. In some embodiments, the master server 210 may transmit the first data and the second data to the slave servers 220, 222, 224 along with a parameter of interest to be associated with the first data and the second data. In such instances, one of the tree models hosted by the slave servers 220, 222, 224 may be modified so that the first data and/or the second data may be associated with the parameter of interest in the tree model. In some embodiments, once one of the tree models hosted by the slave servers 220, 222, 224 has been modified, the slave servers 220, 222, 224 may transmit a message to the master server 210, the message being indicative of a modification made to one of the tree models. Other variations as to how the master server 210 interacts with the slave servers 220, 222, 224 may be envisioned without departing from the scope of the present technology and may become apparent to the person skilled in the art of the present technology. In addition, it should also be expressly understood that, in order to simplify the description presented herein above, the configuration of the master server 210 has been greatly simplified. It is believed that those skilled in the art will be able to appreciate implementational details for the master server 210 and for components thereof that may have been omitted for the purposes of simplification of the description.
[63] The slave servers 220, 222, 224 can be implemented as conventional computer servers and may comprise some or all of the features of the computer system 100 depicted at FIG. 1. In an example of an embodiment of the present technology, the slave servers 220, 222, 224 can be implemented as a Dell™ PowerEdge™ Server running the Microsoft™ Windows Server™ operating system. Needless to say, the slave servers 220, 222, 224 can be implemented in any other suitable hardware and/or software and/or firmware or a combination thereof. In the depicted non-limiting embodiment of present technology, the slave servers 220, 222, 224 operate on a distributed architecture basis. In alternative non-limiting embodiments, a single slave server may be relied upon to operate the present technology.
[64] The implementation of the slave servers 220, 222, 224 is well known to the person skilled in the art of the present technology. However, briefly speaking, each one of the slave servers 220, 222, 224 may comprise a communication interface (not depicted) structured and configured to communicate with various entities (such as the frontend search engine server and/or the master server 210, for example, and other devices potentially coupled to the network) via the network. Each one of the slave servers 220, 222, 224 further comprises at least one computer processor (e.g., similar to the processor 110 depicted at FIG. 1) operationally connected with the communication interface and structured and configured to execute various processes to be described herein. Each one of the slave servers 220, 222, 224 may further comprise one or more memories (e.g., similar to the solid-state drive 120 and/or the random access memory 130 depicted at FIG. 1).
[65] The general purpose of the slave servers 220, 222, 224 is to coordinate the processing of the first data and the second data associated with the document. As previously described, in an embodiment, the first data and the second data may be received from the master server 210 and/or the frontend server. Once received, the slave servers 220, 222, 224 may conduct associations between the document and a parameter of interest. Once the associations have been conducted, the slave servers 220, 222, 224 may transmit to the master server 210 the parameter of interest to be associated with the document. In some other embodiments, the slave servers 220, 222, 224 may not transmit any parameter of interest to the master server 210. This scenario may occur upon determination by the slave servers 220, 222, 224 that the first data and the second data lead to a modification of one of the tree models that they host. In some embodiments, the slave servers 220, 222, 224 may receive the first data and the second data from the master server 210 along with a parameter of interest to be associated with the first data and the second data. As previously detailed in connection with the description of the master server 210, in such instances, one of the tree models hosted by the slave servers 220, 222, 224 may be modified so that the first data and/or the second data may be associated with the parameter of interest in the tree model. In some embodiments, once one of the tree models hosted by the slave servers 220, 222, 224 has been modified, the slave servers 220, 222, 224 may transmit a message to the master server 210, the message being indicative of a modification made to one of the tree models. Other variations as to how the slave servers 220, 222, 224 interact with the master server 210 may be envisioned without departing from the scope of the present technology and may become apparent to the person skilled in the art of the present technology. In addition, it should also be expressly understood that, in order to simplify the description presented herein above, the configuration of the slave servers 220, 222, 224 has been greatly simplified. It is believed that those skilled in the art will be able to appreciate implementational details for the slave servers 220, 222, 224 and for components thereof that may have been omitted for the purposes of simplification of the description.
[66] Still referring to FIG. 2, the slave servers 220, 222, 224 may each be communicatively coupled to a "hash table 1" database 230, a "hash table 2" database 232 and a "hash table n" database 234 (referred to as "the databases 230, 232, 234" hereinafter). The databases 230, 232, 234 may be part of the slave servers 220, 222, 224 (e.g., stored in the memories of the slave servers 220, 222, 224 such as the solid-state drive 120 and/or the random access memory 130) or be hosted on distinct database servers. In some embodiments, a single database accessed by the slave servers 220, 222, 224 may be sufficient. The number of databases and the arrangement of the databases 230, 232, 234 are therefore not limitative of the scope of the present technology. The databases 230, 232, 234 may be used to access and/or store data relating to one or more hash tables representative of tree models generated in accordance with the present technology. In some embodiments, the databases 230, 232, 234 may be accessed by the slave servers 220, 222, 224 to identify a parameter of interest to be associated with the document further to the processing of the first data and the second data by the slave servers 220, 222, 224 in accordance with the present technology. In some other embodiments, the databases 230, 232, 234 may be accessed by the slave servers 220, 222, 224 to store a new entry (also referred to as a "hashed complex vector" and/or a "key" hereinafter) in the one or more hash tables, the new entry having been generated further to the processing of the first data and the second data and being representative of a parameter of interest to be associated with the document. In such embodiments, the new entry may be representative of a modification made to a tree model modelized by the hash table.
[67] More details regarding how the first data and the second data are processed will be provided in connection with the description of FIG. 3 to FIG. 5.
[68] Turning now to FIG. 3, a tree model 300, a first set of parameters 330 and a second set of parameters 340 are depicted. The first set of parameters 330 and the second set of parameters 340 may equally be referred to as feature vectors. The tree model 300 may have been generated in accordance with the present technology and may modelize an association between a document and a parameter of interest. The document may take multiple forms and formats to represent documents of various natures, such as, but without being limitative, text files, text documents, web pages, audio files, video files and so on. The document may equally be referred to as a file without departing from the scope of the present technology. In an embodiment, the file may be a document searchable by a search engine. However, multiple embodiments may be envisioned without departing from the scope of the present technology and may become apparent to the person skilled in the art of the present technology. As previously discussed, the parameter of interest may take multiple forms and formats to represent, for example, but without being limitative, an indication of an order or ranking of a document. In some embodiments, the parameter of interest may be referred to as a label and/or a ranking, in particular in the context of search engines. In some embodiments, the parameter of interest may be generated by a machine-learning algorithm using a training document. In some alternative embodiments, other methods may be used, such as, but without being limitative, manually defining the parameter of interest. How the parameter of interest is generated is therefore not limitative and multiple embodiments may be envisioned without departing from the scope of the present technology and may become apparent to the person skilled in the art of the present technology.
[69] A path throughout the tree model 300 may be defined by the first set of parameters 330 and/or the second set of parameters 340. The tree model 300 comprises multiple nodes each connected to one or more branches. In the embodiment depicted at FIG. 3, a first node 302, a second node 304, a third node 306, a fourth node 308 and a fifth node 310 are depicted. Each one of the first node 302, the second node 304, the third node 306, the fourth node 308 and the fifth node 310 is associated with a condition. The first node 302 is associated with a condition "if Page_rank < 3.5" associated with two branches (i.e., true represented by a binary number "0" and false represented by a binary number "1"), the second node 304 is associated with a condition "Is main page?" associated with two branches (i.e., true represented by a binary number "0" and false represented by a binary number "1"), the third node 306 is associated with a condition "if Number_clicks < 5,000" associated with two branches (i.e., true represented by a binary number "0" and false represented by a binary number "1"), the fourth node 308 is associated with a condition "which URL?" associated with more than two branches (i.e., each one of the branches is associated with a different URL, for example, the URL "yandex.ru") and the fifth node 310 is associated with a condition "which Search query?" associated with more than two branches (i.e., each one of the branches is associated with a different search query, for example, the search query "See Eiffel Tower"). In addition, the fifth node 310, via the branch "See Eiffel Tower", is associated with a leaf 312. In some embodiments, the leaf 312 may be indicative of a parameter of interest.
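By way of a non-limiting illustration, the portion of the tree model 300 described above may be encoded as follows; the data layout and the value of the parameter of interest are assumptions made purely for the purpose of illustration:

```python
oblivious_nodes = [
    ("Page_rank",     lambda v: v < 3.5),    # first node 302: true -> "0", false -> "1"
    ("Is_main_page",  lambda v: bool(v)),    # second node 304
    ("Number_clicks", lambda v: v < 5000),   # third node 306
]
categorical_nodes = ["URL", "Search_query"]  # fourth node 308, fifth node 310

# A leaf is addressed by the binary path plus the categorical branch values,
# e.g. the leaf 312 reached in FIG. 3 (the leaf value here is made up):
leaves = {("010", "yandex.ru", "See Eiffel Tower"): 0.71}
```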
[70] As a result of the above-described configuration, the tree model 300 may associate a document (such as, for example, but without being limitative, a web page in the html format) with the parameter of interest associated with the leaf 312, the association being defined by a path through the tree model 300 based on the first set of parameters 330 and/or the second set of parameters 340. It should be appreciated that, for purposes of clarity, only a portion of the tree model 300 is illustrated. The person skilled in the art of the present technology may appreciate that the number of nodes, branches and leaves is virtually unlimited and solely depends on the complexity of the tree model to be modelized. In addition, in some embodiments, the tree model may be an oblivious tree model comprising a set of nodes each comprising two branches (i.e., true represented by a binary number "0" and false represented by a binary number "1"). However, the present technology is not limited to oblivious tree models and multiple variations may be envisioned by the person skilled in the art of the present technology, such as, for example, a tree model comprising a first portion defining an oblivious tree model and a second portion defining a non-oblivious tree model, as exemplified by the tree model 300 (e.g., a first portion defined by the first node 302, the second node 304 and the third node 306 and a second portion defined by the fourth node 308 and the fifth node 310).
[71] The first set of parameters 330 illustrates an example of parameters defining the path exemplified by the tree model 300. The first set of parameters 330 comprises first data 335 and second data 339. The first data 335 and the second data 339 may be associated with the document and allow defining the path in the tree model 300 described in the paragraph above. The first data 335 may be of binary type and/or of real number type (e.g., integer number type, floating number type). In some embodiments, the first data 335 may represent a path in an oblivious portion of the tree model, as is the case in the example depicted in FIG. 3. Other variations may also be possible without departing from the scope of the present technology. In the example of FIG. 3, the first data 335 comprises a first component 332 associated with a value "01" and a second component 334 associated with a value "3500". Even though the term "component" is used in the present description, it should be understood that the term "variable" may be equally used and may therefore be considered an equivalent of "component". The first component 332 comprises the binary sequence "01" which, once projected in the tree model 300, allows establishing a first portion of the path. In the example of FIG. 3, the first portion of the path is established by applying a first binary digit "0" of the sequence "01" to the first node 302 and then a second binary digit "1" of the sequence "01" to the second node 304. The second component 334, once projected in the tree model 300, allows establishing a second portion of the path. In the example of FIG. 3, the second portion of the path is established by applying the number "3500" to the third node 306. Even though the example of FIG. 3 illustrates the first data 335 as comprising the first component 332 and the second component 334, the number of components and the number of digits comprised by any one of the components is not limitative and many variations may be envisioned without departing from the scope of the present technology.
[72] The second data 339 may be of category type. In some embodiments, the second data 339 may also be referred to as categorical features and may comprise, for example, but without being limitative, a host, a URL, a domain name, an IP address, a search query and/or a key word. In some embodiments, the second data 339 may be broadly described as comprising label categories allowing categorisation of information. In some embodiments, the second data 339 may take the form of a chain and/or string of characters and/or digits. In yet some embodiments, the second data 339 may comprise a parameter that may take more than two values, as is the case in the example of FIG. 3, thereby resulting in the tree model 300 having as many branches connected to a node as there are possible values of the parameter. What the second data 339 may comprise is not limitative and many variations may be envisioned without departing from the scope of the present technology. In some embodiments, the second data 339 may represent a path in a non-oblivious portion of the tree model, as is the case in the example depicted in FIG. 3. Other variations may also be possible without departing from the scope of the present technology.
[73] In the example of FIG. 3, the second data 339 comprises a first component 336 associated with a value "yandex.ru" and a second component 338 associated with a value "See Eiffel Tower". The first component 336 comprises a string of characters "yandex.ru" which, once projected in the tree model 300, allows establishing a third portion of the path. In the example of FIG. 3, the third portion of the path is established by applying the string of characters "yandex.ru" to the fourth node 308. The second component 338, once projected in the tree model 300, allows establishing a fourth portion of the path. In the example of FIG. 3, the fourth portion of the path is established by applying the string of characters "See Eiffel Tower" to the fifth node 310, thereby leading to the leaf 312 and the parameter of interest associated therewith. Even though the example of FIG. 3 illustrates the second data 339 as comprising the first component 336 and the second component 338, the number of components and the number of digits and/or characters comprised by any one of the components is not limitative and many variations may be envisioned without departing from the scope of the present technology.
[74] Turning now to the second set of parameters 340, the second set of parameters 340 illustrates another example of parameters defining the path exemplified by the tree model 300. The second set of parameters 340 comprises first data 343 and second data 349. As with the first set of parameters 330, the second set of parameters 340 may be associated with the document and allows defining the path in the tree model 300 described above. The second set of parameters 340 is similar in all aspects to the first set of parameters 330 with the exception of the first data 343. The first data 343 comprises a sequence of digits "010" whereas the first data 335 comprises the first component 332 associated with the value "01" and the second component 334 associated with the value "3500". As a person skilled in the art of the present technology may appreciate, in the first data 343, the value "3500" has been represented by a binary digit "0", which is the output of the value "3500" applied to the condition associated with the third node 306 (i.e., "Number_clicks < 5,000"). As a result, the first data 343 may be considered an alternative representation to the first data 335 of a same path in the tree model 300. Consequently, in some embodiments, a real number value may be translated into a binary value, in particular for cases wherein a node of a tree model to which the real number value is to be applied corresponds to an oblivious section of the tree model. Other variations may also be possible and the example of the second set of parameters 340 should not be construed as being limitative of the scope of the present technology. The second data 349 comprises a first component 344 and a second component 346 that are identical to the first component 336 and the second component 338 of the second data 339.
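By way of a non-limiting illustration, such a translation function may be sketched as follows, using the values of FIG. 3:

```python
def translate(value: float, threshold: float) -> str:
    # "0" for true, "1" for false, matching the branch numbering of FIG. 3.
    return "0" if value < threshold else "1"


first_data_343 = "01" + translate(3500, 5000)  # "3500" at the third node 306 -> "0"
assert first_data_343 == "010"
```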
[75] Turning now to FIG. 4, an example of how a hashed complex vector indicative of an association between a document and a parameter of interest is generated is illustrated. The example is based on the second set of parameters 340 depicted in FIG. 3. A complex vector 350 comprises a component MASK, a hash value h1 and a hash value h2. In an embodiment, the component MASK may be generated from the first data 343. In some other embodiments, the component MASK may be generated from the first component 332 and the second component 334 of the first data 335. As a result, the component MASK may correspond to the path between the first node 302 and the fourth node 308 in the tree model 300.
[76] In some embodiments, the hash value h1 may be generated by applying a hash function H1 to the first component 344 of the second data 349. The hash function may be any function that may be readily apparent to the person skilled in the art of the present technology. Many variations may be envisioned without departing from the scope of the present technology. The hash value h2 may be generated by applying the hash function H1 to the second component 346 of the second data 349. Even though the complex vector 350 is depicted as comprising the component MASK, the hash value h1 and the hash value h2, it should be understood that the complex vector may comprise more or fewer components MASK and more or fewer hash values. This aspect of the present technology is therefore not limitative and many variations may be envisioned without departing from the scope of the present technology.
[77] Still referring to FIG. 4, a key 360 is generated by applying a hash function H2 to the complex vector 350. As with the hash function H1, the hash function H2 may be any function that may be readily apparent to the person skilled in the art of the present technology. Many variations may be envisioned without departing from the scope of the present technology. In some embodiments, the hash function H1 may be a same hash function as the hash function H2. In some other embodiments, the hash function H1 may be a different hash function from the hash function H2. In some embodiments, the key 360 may also be referred to as a hashed complex vector. The key 360, once generated, may be stored in a memory, such as one of the databases 230, 232, 234 depicted in FIG. 2. As a person skilled in the art of the present technology may appreciate, the key 360, once accessed, may allow identifying a path in a tree model. As a result, a tree model may be partially or entirely modelized by a set of keys generated in a similar fashion to the key 360. In addition, because the key is a single entry in a hash table, the key may be retrieved to allow identifying multiple parameters of category type such as the first component 344 and the second component 346. As a result, adding new data of category type to a data structure representing a tree model may be limited to creating a new entry in the hash table, thereby limiting (if not avoiding, under certain circumstances) allocation of memory arrays depending on a number of possible values the new data may take. Under certain circumstances, the present technology may allow allocation of memory arrays without having to enumerate all possible combinations associated with a category type beforehand. The present technology thereby results in a more efficient use of the memory of the system while allowing association between a document and a parameter of interest based on a tree model.
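By way of a non-limiting illustration, the generation of the key 360 depicted in FIG. 4 may be sketched as follows; the H() helper, the seeds and the byte encodings are assumptions made purely for the purpose of illustration and are not prescribed by the present technology:

```python
import hashlib


def H(value: str, seed: int) -> int:
    data = f"{seed}:{value}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


H1 = lambda s: H(s, seed=1)  # first hash function
H2 = lambda s: H(s, seed=2)  # second hash function


def key_360(mask: str, components: tuple) -> int:
    hashes = [H1(c) for c in components]                    # h1, h2, ...
    complex_vector = mask + "".join(f"{x:016x}" for x in hashes)
    return H2(complex_vector)                               # the key 360


table = {key_360("010", ("yandex.ru", "See Eiffel Tower")): 0.71}
# A previously unseen category value only adds one entry; no array sized by
# the number of possible values of the category is ever allocated:
table[key_360("010", ("yandex.ru", "Visit Louvre"))] = 0.33
```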
[78] In some embodiments, generation of the key 360 may allow accessing a parameter of interest previously associated with the key 360 based on an existing data structure representing a tree model. In some other embodiments, generation of the key 360 may allow adding a new parameter of interest not previously existing in the data structure representing the tree model. In such embodiments, the new parameter of interest may be added by creating one or more new branches in the tree model. Because of how the key 360 is generated, adding the new parameter of interest may be conducted while limiting the amount of additional memory to be allocated for cases wherein an association between a document and the parameter of interest includes data of category type.
[79] Turning now to FIG. 5, another example of how a hashed complex vector indicative of an association between a document and a parameter of interest is generated is illustrated. The example is also based on the second set of parameters 340 depicted in FIG. 3. A complex vector 450 comprises a component MASK, a hash value h1 and a hash value h2. In an embodiment, the component MASK is similar to the component MASK of the complex vector 350. In some embodiments, the hash value h1 may be generated by applying the hash function H1 to the first component 344 of the second data 349. The hash function H1 is similar to the hash function H1 depicted in FIG. 4. As a result, the hash value h1 of the complex vector 350 is similar to the hash value h1 of the complex vector 450. In the embodiment depicted in FIG. 5, the hash value h2 may be generated by applying a hash function H3 to the second component 346 of the second data 349. The hash function H3 may be a different hash function from the hash function H1 and/or the hash function H2. As a result, the embodiment of FIG. 5 illustrates that, in some embodiments of the present technology, multiple hash functions may be used for the generation of the hash value h1 and the hash value h2.
[80] As in the embodiment depicted in FIG. 4, a key 460 is generated by applying the hash function H2 to the complex vector 450. The hash function H2 may be similar to or different from the hash function H3.
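By way of a non-limiting illustration, the variant of FIG. 5 may be sketched as follows, a distinct seed standing in for the distinct hash function H3:

```python
import hashlib


def H(value: str, seed: int) -> int:
    data = f"{seed}:{value}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


h1 = H("yandex.ru", seed=1)          # hash function H1
h2 = H("See Eiffel Tower", seed=3)   # hash function H3, distinct from H1
```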
[81] Having described, with reference to FIG. 1 to FIG. 5, some non-limiting example instances of systems and computer-implemented methods used in connection with the problem of generating a hashed complex vector indicative of an association between a document and a parameter of interest, we shall now describe general solutions to the problem with reference to FIG. 6.
[82] More specifically, FIG. 6 shows a flowchart illustrating a computer-implemented method 600 of generating a hashed complex vector indicative of an association between a document and a parameter of interest. In some embodiments, the parameter of interest is indicative of at least one of a search result prediction, a probability of click, a document relevance, a user interest, a URL and a number of clicks.
[83] The method 600 starts at a step 602 by accessing, from a non-transitory computer-readable medium, the first data associated with the document, the first data being at least one of a binary type and a real number type. Then, the method 600 proceeds to a step 604 by accessing, from the non-transitory computer-readable medium, the second data associated with the document, the second data being a category type. In some embodiments, the first data is indicative of at least one of a number of clicks, a number of views and a document ranking, and the second data is indicative of at least one of a URL, a domain name, an IP address, a search query and a key word. In yet some embodiments, the second data comprises a first categorical variable and a second categorical variable.
[84] Then, at a step 606, the method 600 proceeds to generating, by a processor, a mask vector based on the first data, the mask vector comprising a plurality of numbers corresponding to a path in a tree model, each one of the plurality of numbers being indicative of a branch associated with a node of the tree model. In some embodiments, the tree model is an oblivious tree model. In some embodiments, the node of the tree model corresponds to a condition, the condition having been determined based on a machine learning algorithm.
[85] In some embodiments, each one of the plurality of numbers comprised by the mask vector is a binary number. In yet some embodiments, the first data comprises at least one integer variable and generating the mask vector comprises applying a translation function to generate a binary number associated with the integer variable.
[86] At a step 608, the method 600 then proceeds to generating, by the processor, a hash vector based on the second data by applying a first hash function to the second data. In some embodiments, generating the hash vector comprises applying the first hash function to the first categorical variable and a third hash function to the second categorical variable. In some embodiments, the first hash function and the third hash function are one of a same hash function and distinct hash functions. Then, at a step 610, the method 600 proceeds to generating, by the processor, a complex vector comprising the mask vector and the hash vector, the complex vector being indicative of a leaf of the tree model. In some embodiments, the leaf of the tree model is associated with a parameter of interest based on a machine learning algorithm using a training document.
[87] At a step 612, the method 600 proceeds to generating, by the processor, a hashed complex vector by applying a second hash function to the complex vector. The method 600 may then proceed to step 614 wherein the hashed complex vector is stored in the non-transitory computer-readable medium.
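Steps 612 and 614 may be sketched as follows; here again, the concrete second hash function H2, the number of buckets and the storage container are illustrative assumptions rather than features taught by the present description:

    import hashlib

    # Hypothetical second hash function H2, applied to the complex vector as
    # a whole rather than to its individual components.
    def h2(vector, num_buckets=2**20):
        key = ",".join(str(component) for component in vector).encode("utf-8")
        digest = hashlib.md5(key).digest()
        return int.from_bytes(digest[:8], "big") % num_buckets

    # An illustrative complex vector: three mask digits followed by two
    # hashed categorical values (numbers chosen arbitrarily).
    complex_vector = [1, 1, 1, 52194, 3307]
    hashed_complex_vector = h2(complex_vector)

    # Step 614: a file stands in here for the non-transitory
    # computer-readable medium.
    with open("hashed_vectors.txt", "a") as storage:
        storage.write(f"{hashed_complex_vector}\n")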
[88] In some embodiments, the first hash function and the second hash function may be a same hash function or distinct hash functions. In yet some other embodiments, the method 600 may comprise accessing, from the non-transitory computer-readable medium, a collection of previously generated hashed complex vectors. If the hashed complex vector corresponds to one of the previously generated hashed complex vectors, the method 600 may associate with the document the parameter of interest associated with the corresponding previously generated hashed complex vector. If the hashed complex vector does not correspond to any of the previously generated hashed complex vectors, the method 600 may add the hashed complex vector to the collection of previously generated hashed complex vectors. In some embodiments, adding the hashed complex vector to the collection of previously generated hashed complex vectors further comprises associating a parameter of interest to the hashed complex vector. In yet some other embodiments, adding the hashed complex vector to the collection of previously generated hashed complex vectors further comprises adding a node to the tree model.
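The collection of previously generated hashed complex vectors may be sketched, without limitation, as a simple mapping; the dictionary and helper function below are assumptions made for illustration:

    # Maps a hashed complex vector to its parameter of interest, if any.
    collection = {}

    def associate_or_add(hashed_complex_vector, collection):
        # If the vector was generated before, reuse its parameter of
        # interest; otherwise add the vector so that a parameter of interest
        # may be associated with it later (e.g., by a machine learning
        # algorithm using a training document).
        if hashed_complex_vector in collection:
            return collection[hashed_complex_vector]
        collection[hashed_complex_vector] = None
        return None

    parameter = associate_or_add(123456, collection)  # first sight: None
    collection[123456] = 0.75                         # parameter learned later
    parameter = associate_or_add(123456, collection)  # now returns 0.75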
[89] While the above-described implementations have been described and shown with reference to particular steps performed in a particular order, it will be understood that these steps may be combined, sub-divided, or re-ordered without departing from the teachings of the present technology. Accordingly, the order and grouping of the steps is not a limitation of the present technology.
[90] As such, the methods and systems implemented in accordance with some non-limiting embodiments of the present technology can be represented as follows, presented in numbered clauses.
[91] [Clause 1] A computer-implemented method (600) of generating a hashed complex vector (360) indicative of an association between a document and a parameter of interest, the document being associated with first data (343) and second data (349), the method comprising:
• accessing (602), from a non-transitory computer-readable medium (120, 130), the first data (343) associated with the document, the first data (343) being at least one of a binary type and a real number type;
• accessing (604), from the non-transitory computer-readable medium (120, 130), the second data (349) associated with the document, the second data (349) being a category type;
• generating (606), by a processor (110), a mask vector based on the first data (343), the mask vector comprising a plurality of numbers corresponding to a path in a tree model (300), each one of the plurality of numbers being indicative of a branch associated with a node of the tree model;
• generating (608), by the processor (110), a hash vector based on the second data by applying a first hash function (H1) to the second data;
• generating (610), by the processor (110), a complex vector (350) comprising the mask vector and the hash vector, the complex vector (350) being indicative of a leaf of the tree model (300);
• generating (612), by the processor (110), a hashed complex vector (360) by applying a second hash function (H2) to the complex vector (350); and
• storing (614), in the non-transitory computer-readable medium (130), the hashed complex vector (360).
[92] [Clause 2] The method of clause 1, wherein the method further comprises:
• accessing, from the non-transitory computer-readable medium (130), a collection of previously generated hashed complex vectors;
• if the hashed complex vector (360) corresponds to one of the previously generated hashed complex vectors, associating a parameter of interest associated to the corresponding previously generated hashed complex vectors to the document; and
• if the hashed complex vector (360) does not correspond to any of the previously generated hashed complex vectors, adding the hashed complex vector to the collection of previously generated hashed complex vectors.
[93] [Clause 3] The method of clause 2, wherein adding the hashed complex vector (360) to the collection of previously generated hashed complex vectors further comprises associating a parameter of interest to the hashed complex vector (360).
[94] [Clause 4] The method of any one of clauses 1 to 3, wherein the leaf of the tree model (300) is associated with a parameter of interest based on a machine learning algorithm using a training document.
[95] [Clause 5] The method of any one of clauses 1 to 4, wherein the first data (343) is indicative of at least one of a number of clicks, a number of views and a document ranking and wherein the second data is indicative of at least one of a URL, a domain name, an IP address, a search query and a key word.
[96] [Clause 6] The method of any one of clauses 1 to 5, wherein the tree model (300) is an oblivious tree model.
[97] [Clause 7] The method of any one of clauses 1 to 6, wherein each one of the plurality of numbers comprised by the mask vector is a binary number.
[98] [Clause 8] The method of clause 7, wherein the first data (343) comprises at least one integer variable and wherein generating the mask vector comprises applying a translation function to generate a binary number associated with the integer variable.
[99] [Clause 9] The method of any one of clauses 1 to 8, wherein the second data (349) comprises a first categorical variable and a second categorical variable.
[100] [Clause 10] The method of any one of clauses 1 to 9, wherein generating the hash vector comprises applying the first hash function (H1) to the first categorical variable and a third hash function (H3) to the second categorical variable.
[101] [Clause 11] The method of clause 10, wherein the first hash function (H1) and the third hash function (H3) are one of a same hash function and distinct hash functions.
[102] [Clause 12] The method of any one of clauses 1 to 11, wherein the first hash function (H1) and the second hash function (H2) are one of a same hash function and distinct hash functions.
[103] [Clause 13] The method of any one of clauses 1 to 12, wherein the node of the tree model (300) corresponds to a condition, the condition having been determined based on a machine learning algorithm.
[104] [Clause 14] The method of clause 2, wherein adding the hashed complex vector (360) to the collection of previously generated hashed complex vectors further comprises adding a node to the tree model (300).
[105] [Clause 15] The method of any one of clauses 1 to 14, wherein the parameter of interest is indicative of at least one of a search result prediction, a probability of click, a document relevance, a user interest, a URL and a number of clicks.
[106] [Clause 16] A computer-implemented system (220, 222, 224) configured to perform the method of any one of clauses 1 to 15.
[107] [Clause 17] A non-transitory computer-readable medium (120, 130), comprising computer-executable instructions that cause a system to execute the method according to any one of clauses 1 to 15.
[108] It should be expressly understood that not all technical effects mentioned herein need to be enjoyed in each and every embodiment of the present technology. For example, embodiments of the present technology may be implemented without the user enjoying some of these technical effects, while other embodiments may be implemented with the user enjoying other technical effects or none at all.
[109] Some of these steps and the sending-receiving of signals between them are well known in the art and, as such, have been omitted in certain portions of this description for the sake of simplicity. The signals can be sent-received using optical means (such as a fibre-optic connection), electronic means (such as a wired or wireless connection), and mechanical means (such as pressure-based, temperature-based or any other suitable physical-parameter-based means).
[110] Modifications and improvements to the above-described implementations of the present technology may become apparent to those skilled in the art. The foregoing description is intended to be exemplary rather than limiting. The scope of the present technology is therefore intended to be limited solely by the scope of the appended claims.

Claims

1. A computer-implemented method of generating a hashed complex vector indicative of an association between a document and a parameter of interest, the document being associated with first data and second data, the method comprising:
accessing, from a non-transitory computer-readable medium, the first data associated with the document, the first data being at least one of a binary type and a real number type;
accessing, from the non-transitory computer-readable medium, the second data associated with the document, the second data being a category type;
generating, by a processor, a mask vector based on the first data, the mask vector comprising a plurality of numbers corresponding to a path in a tree model, each one of the plurality of numbers being indicative of a branch associated with a node of the tree model;
generating, by the processor, a hash vector based on the second data by applying a first hash function to the second data;
generating, by the processor, a complex vector comprising the mask vector and the hash vector, the complex vector being indicative of a leaf of the tree model;
generating, by the processor, a hashed complex vector by applying a second hash function to the complex vector; and
storing, in the non-transitory computer-readable medium, the hashed complex vector.
2. The method of claim 1, wherein the method further comprises:
accessing, from the non-transitory computer-readable medium, a collection of previously generated hashed complex vectors;
if the hashed complex vector corresponds to one of the previously generated hashed complex vectors, associating a parameter of interest associated to the corresponding previously generated hashed complex vectors to the document; and
if the hashed complex vector does not correspond to any of the previously generated hashed complex vectors, adding the hashed complex vector to the collection of previously generated hashed complex vectors.
3. The method of claim 2, wherein adding the hashed complex vector to the collection of previously generated hashed complex vectors further comprises associating a parameter of interest to the hashed complex vector.
4. The method of claim 1, wherein the leaf of the tree model is associated with a parameter of interest based on a machine learning algorithm using a training document.
5. The method of claim 1, wherein the first data is indicative of at least one of a number of clicks, a number of views and a document ranking and wherein the second data is indicative of at least one of a URL, a domain name, an IP address, a search query and a key word.
6. The method of claim 1, wherein the tree model is an oblivious tree model.
7. The method of claim 1, wherein each one of the plurality of numbers comprised by the mask vector is a binary number.
8. The method of claim 7, wherein the first data comprises at least one integer variable and wherein generating the mask vector comprises applying a translation function to generate a binary number associated with the integer variable.
9. The method of claim 1, wherein the second data comprises a first categorical variable and a second categorical variable.
10. The method of claim 1, wherein generating the hash vector comprises applying the first hash function to the first categorical variable and a third hash function to the second categorical variable.
11. The method of claim 10, wherein the first hash function and the third hash function are one of a same hash function and distinct hash functions.
12. The method of claim 1, wherein the first hash function and the second hash function are one of a same hash function and distinct hash functions.
13. The method of claim 1, wherein the node of the tree model corresponds to a condition, the condition having been determined based on a machine learning algorithm.
14. The method of claim 2, wherein adding the hashed complex vector to the collection of previously generated hashed complex vectors further comprises adding a node to the tree model.
15. The method of claim 1, wherein the parameter of interest is indicative of at least one of a search result prediction, a probability of click, a document relevance, a user interest, a URL and a number of clicks.
16. A computer-implemented system for generating a hashed complex vector indicative of an association between a document and a parameter of interest, the document being associated with first data and second data, the system comprising:
a non-transitory computer-readable medium;
a processor configured to perform:
accessing, from a non-transitory computer-readable medium, the first data associated with the document, the first data being at least one of a binary type and a real number type;
accessing, from the non-transitory computer-readable medium, the second data associated with the document, the second data being a category type;
generating, by a processor, a mask vector based on the first data, the mask vector comprising a plurality of numbers corresponding to a path in a tree model, each one of the plurality of numbers being indicative of a branch associated with a node of the tree model;
generating, by the processor, a hash vector based on the second data by applying a first hash function to the second data;
generating, by the processor, a complex vector comprising the mask vector and the hash vector, the complex vector being indicative of a leaf of the tree model;
generating, by the processor, a hashed complex vector by applying a second hash function to the complex vector; and
storing, in the non-transitory computer-readable medium, the hashed complex vector.
17. The system of claim 16, wherein the processor is further configured to perform:
accessing, from the non-transitory computer-readable medium, a collection of previously generated hashed complex vectors;
if the hashed complex vector corresponds to one of the previously generated hashed complex vectors, associating a parameter of interest associated to the corresponding previously generated hashed complex vectors to the document; and
if the hashed complex vector does not correspond to any of the previously generated hashed complex vectors, adding the hashed complex vector to the collection of previously generated hashed complex vectors.
18. The system of claim 17, wherein adding the hashed complex vector to the collection of previously generated hashed complex vectors further comprises associating a parameter of interest to the hashed complex vector.
19. The system of claim 16, wherein the leaf of the tree model is associated with a parameter of interest based on a machine learning algorithm using a training document.
20. The system of claim 16, wherein the first data is indicative of at least one of a number of clicks, a number of views and a document ranking and wherein the second data is indicative of at least one of a URL, a domain name, an IP address, a search query and a key word.
21. The system of claim 16, wherein the tree model is an oblivious tree model.
22. The system of claim 16, wherein each one of the plurality of numbers comprised by the mask vector is a binary number.
23. The system of claim 22, wherein the first data comprises at least one integer variable and wherein generating the mask vector comprises applying a translation function to generate a binary number associated with the integer variable.
24. The system of claim 16, wherein the second data comprises a first categorical variable and a second categorical variable.
25. The system of claim 16, wherein generating the hash vector comprises applying the first hash function to the first categorical variable and a third hash function to the second categorical variable.
26. The system of claim 25, wherein the first hash function and the third hash function are one of a same hash function and distinct hash functions.
27. The system of claim 16, wherein the first hash function and the second hash function are one of a same hash function and distinct hash functions.
28. The system of claim 16, wherein the node of the tree model corresponds to a condition, the condition having been determined based on a machine learning algorithm.
29. The system of claim 17, wherein adding the hashed complex vector to the collection of previously generated hashed complex vectors further comprises adding a node to the tree model.
30. The system of claim 16, wherein the parameter of interest is indicative of at least one of a search result prediction, a probability of click, a document relevance, a user interest, a URL and a number of clicks.
PCT/IB2015/058957 2015-06-01 2015-11-19 Method of and system for generating a hashed complex vector WO2016193797A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
RU2015120563 2015-06-01
RU2015120563A RU2015120563A (en) 2015-06-01 2015-06-01 METHOD AND SYSTEM FOR CREATING A HASHED COMPLEX VECTOR

Publications (1)

Publication Number Publication Date
WO2016193797A1 true WO2016193797A1 (en) 2016-12-08

Family

ID=57440673

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2015/058957 WO2016193797A1 (en) 2015-06-01 2015-11-19 Method of and system for generating a hashed complex vector

Country Status (2)

Country Link
RU (1) RU2015120563A (en)
WO (1) WO2016193797A1 (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020143787A1 (en) * 2001-03-31 2002-10-03 Simon Knee Fast classless inter-domain routing (CIDR) lookups
US20060026138A1 (en) * 2004-01-09 2006-02-02 Gavin Robertson Real-time indexes
US20060136390A1 (en) * 2004-12-22 2006-06-22 International Business Machines Corporation Method and system for matching of complex nested objects by multilevel hashing
US20070217676A1 (en) * 2006-03-15 2007-09-20 Kristen Grauman Pyramid match kernel and related techniques
US20110225372A1 (en) * 2009-04-27 2011-09-15 Lsi Corporation Concurrent, coherent cache access for multiple threads in a multi-core, multi-thread network processor
US20110153611A1 (en) * 2009-12-22 2011-06-23 Anil Babu Ankisettipalli Extracting data from a report document
US20110225589A1 (en) * 2010-03-12 2011-09-15 Lsi Corporation Exception detection and thread rescheduling in a multi-core, multi-thread network processor
US20120203745A1 (en) * 2011-02-08 2012-08-09 Wavemarket Inc. System and method for range search over distributive storage systems
US20130173571A1 (en) * 2011-12-30 2013-07-04 Microsoft Corporation Click noise characterization model

Also Published As

Publication number Publication date
RU2015120563A (en) 2016-12-20

Similar Documents

Publication Publication Date Title
US11341419B2 (en) Method of and system for generating a prediction model and determining an accuracy of a prediction model
US10963794B2 (en) Concept analysis operations utilizing accelerators
US11163957B2 (en) Performing semantic graph search
CN106605221B (en) Multi-user search system with method for instant indexing
US11853334B2 (en) Systems and methods for generating and using aggregated search indices and non-aggregated value storage
US9311823B2 (en) Caching natural language questions and results in a question and answer system
US9684713B2 (en) Methods and systems for retrieval of experts based on user customizable search and ranking parameters
US9449096B2 (en) Identifying influencers for topics in social media
US9189565B2 (en) Managing tag clouds
CA2992822A1 (en) Methods and systems for identifying a level of similarity between a filtering criterion and a data item within a set of streamed documents
US20150012529A1 (en) Pivot facets for text mining and search
US11347815B2 (en) Method and system for generating an offline search engine result page
WO2015023518A2 (en) Browsing images via mined hyperlinked text snippets
US10678820B2 (en) System and method for computerized semantic indexing and searching
TWI579715B (en) Search servers, end devices, and search methods for use in a distributed network
WO2022269510A1 (en) Method and system for interactive searching based on semantic similarity of semantic representations of text objects
US11074266B2 (en) Semantic concept discovery over event databases
RU2634223C2 (en) Method (optional) and system (optional) for management of data associated with hierarchical structure
US11816159B2 (en) Method of and system for generating a training set for a machine learning algorithm (MLA)
US11550777B2 (en) Determining metadata of a dataset
RU2721159C1 (en) Method and server for generating meta-attribute for ranging documents
WO2016193797A1 (en) Method of and system for generating a hashed complex vector
US9754030B2 (en) Free text search engine system and method
WO2017001906A1 (en) Method of and system for updating a data table

Legal Events

Date Code Title Description

121 Ep: the epo has been informed by wipo that ep was designated in this application
Ref document number: 15894047
Country of ref document: EP
Kind code of ref document: A1

NENP Non-entry into the national phase
Ref country code: DE

122 Ep: pct application non-entry in european phase
Ref document number: 15894047
Country of ref document: EP
Kind code of ref document: A1