WO2005114484A1 - Systems and methods of geographical text indexing - Google Patents
- Publication number
- WO2005114484A1 (PCT/US2005/017697)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- geographic
- text string
- coordinates
- documents
- text
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9537—Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/313—Selection or weighting of terms for indexing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/38—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
Definitions
- This invention relates to document databases, geographical information retrieval, and search engines.
- Text search engines are among a widely used family of tools that enable users to search documents for specific words, called keywords, and for key phrases. Text search engines also typically support queries that include range constraints, phrase queries, wildcard queries, and Boolean combinations of any permissible query.
- a searcher looks for information that corresponds to a range of spatial geographical locations. Such a range is specified as a range of geographical coordinates, such as a latitude and longitude range.
- a searcher must use a special search engine that employs specially constructed spatial indices, such as R-trees or quad-trees, which index data records according to geographic fields in the records.
- Embodiments described herein employ a variety of methods for geographic text searching that use traditional text search indices without creating separate geographic indices. These techniques allow a generic keyword search system to limit results to specific geographic domains without special indexing for geographic coordinate and natural language confidence score metadata. Further, these techniques allow the unmodified generic keyword search system to sort the results of such multiply-constrained queries according to relevance factors with at least some knowledge of the multiple constraints. Other embodiments described herein describe modifications that can be made to generic keyword search systems to enable their relevance sorting functions to have more awareness of the geographic information in the documents. Such a modified search system is referred to herein as an "enhanced search engine."
- embodiments described herein address two specific challenges in constructing geographic search systems: 1) efficiently generating lists of documents that match searches comprising both geographic and non-geographic search constraints, and 2) efficiently sorting such lists based on relevance functions that incorporate both geographic and non-geographic assessments of the pertinence of each document to the specified search. This is achieved by encoding geographic coordinates, confidence scores, emphasis scores, and other information in specially formatted strings. The described embodiment teaches several methods of formatting these strings such that they can be accessed using generic text search commands.
- a document with a sentence that matches both geographic and non-geographic query constraints is clearly more relevant than a document that matches the constraints via paragraphs at opposite ends of the document.
- This and other combined relevance functions require whole document analysis, which is extremely expensive to perform at the time of joining results from separate indices.
- This re-sorted intersection, also known as a sorted join, takes time proportional to the size of the two lists being joined, which is typically the size of the collection of documents. For collections of millions of documents, this could mean minutes, hours, or even days to compute search results.
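As a rough illustration of why a sorted join scales with the length of both lists, the following sketch intersects two sorted lists of document identifiers by walking each list once. The function name and the sample identifiers are assumptions made for illustration, and the re-sorting by relevance that the text mentions is left out.

```python
def sorted_join(text_hits, geo_hits):
    """Intersect two ascending lists of document ids in a single pass.

    Every element of both lists may be visited, so the cost grows with
    len(text_hits) + len(geo_hits).
    """
    i = j = 0
    matches = []
    while i < len(text_hits) and j < len(geo_hits):
        if text_hits[i] == geo_hits[j]:
            matches.append(text_hits[i])
            i += 1
            j += 1
        elif text_hits[i] < geo_hits[j]:
            i += 1
        else:
            j += 1
    return matches


# Hypothetical document ids returned by a text index and a spatial index.
joined = sorted_join([1, 3, 5, 7], [3, 4, 7, 9])
```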
- Described herein are a variety of methods of representing geographic location metadata about documents in textual strings that can be indexed as though they were regular keywords and can be searched for using a variety of common keyword search techniques, including trailing wildcard queries, phrase queries, and Boolean operator queries. Certain embodiments employ graphical user interface techniques for utilizing this geographic information. In general, the system of the geographic mapping user interface interacts with one or several text search indices containing such specially encoded geographic metadata. The techniques described herein allow geographic metadata to be added to existing text search infrastructure, possibly without any modification of the existing text search indexing software. Specific modifications useful for further improving performance are also disclosed.
- coordinate metadata is typically stored in an index.
- systems such as those described in U.S. Patent Applications No. 09/791,533 and No. 10/633,915, also owned by the assignee of the present application and incorporated herein by reference, use a special index for holding textual information from documents in a highly unique structure that permits geographic range searches to be combined with text searches.
- These prior art systems achieve the goal of efficiently computing sorted joins by holding both textual and geographic data in an unusual data structure.
- This specialized index data structure, known as CartaTrees, arranges all the words from the documents into spatial trees that resemble traditional geographic quadtrees.
- Geographic distances on the Earth provide exactly such a grounded distance metric: the distance between any two points can be measured in kilometers, independent of any documents mentioning these points.
- For generic text search systems to hold geographic information, they must use multi-dimensional range query indices, such as R-trees, quad-trees, or other special spatial data indexes that are separate from their text indices. This separation typically forces such systems to take a long time to answer queries that combine these operators with other text search commands.
- Generating relevance-sorted result lists based on geographic ranges is either impossible or extremely slow in traditional text search engines.
- a geoparser is a software system that creates geographic coordinates based on information about electronic files.
- a geoparser might use human input to decide what coordinates to associate with a file, or it might operate fully automatically to generate geographic coordinates to describe points, lines, polygons, and other geographic entities relevant to the file.
- confidence scores are numbers indicating the likelihood that a particular coordinate or geographic entity is actually correctly associated with the file.
- a fully automatic geoparser might interpret the natural language context of the document to guess which locations the author intended. The quality of these guesses is estimated by the confidence scores (geoconfidence) output by the geoparser along with the coordinates describing the geographic entities. Geoconfidence typically figures into relevance scoring of files in response to queries that include geographic constraints. Thus, by encoding geoconfidence in a manner that allows it to be stored with geographic coordinates in a generic text search engine, these methods allow a traditional text search engine to answer some forms of relevance-sorted geographic range queries without using comparison operators and without using any special metadata tables and without necessarily requiring special loading techniques separate from those used to process all the other words in the documents.
- the encodings described herein can be used in almost any text search engine without special modification to the text search engine and without need for separate geographic data structures. Useful modifications to a generic search system are possible.
- the invention contemplates a variety of specific enhancements to a generic search system, which make it more capable of computing good relevance functions on documents containing the specially formatted geographic strings.
- generic search engines typically assign word positions to every word in a document and would normally assign word positions to every geographic string added to a document.
- standoff metadata described below
- generic search engines typically have no notion of confidence scores. The invention teaches two methods of coping with this. As mentioned above, the first is to encode the geoconfidence in the specially formatted geographic string. The second method is to enhance the search engine to treat confidence as a property of all words in the documents.
- the present invention allows further modifications, such as standoff notation and confidence scores, to operate on the same generic text index structure that holds all the other words.
- the present invention is a key enabler for a wide variety of additional geographic search enhancements to generic text search systems.
- a key concept is that of a hierarchical coordinate system.
- a hierarchical coordinate system is a graph representation of a manifold, or region of an affine space.
- An affine space as traditionally defined in mathematics, is a space in which any two points can be connected by a vector. There is not necessarily a preferred origin for the coordinates in an affine space, and the coordinates need not be flat (i.e. Euclidean). For example, unprojected latitude/longitude coordinates on the surface of the Earth are an example of coordinates in non-Euclidean affine space. Each point in the affine space can be defined by an n-tuple of numbers.
- hierarchical coordinate systems define objects with extent.
- a hierarchical coordinate system can refer to very small areas using a long string. However, to describe an actual point, a hierarchical string would have to be infinitely long.
- This area property of hierarchical strings is integral to the methods disclosed here.
- a polygon on the surface of the Earth has area, and a set of polygons inscribed inside that polygon also have areal extent.
- the country of Germany can be described by a polygon with areal extent.
- the various provinces inside of Germany can be described by polygons that also have areal extent.
- a hierarchical coordinate system is constructed by assigning names to each of these polygons and including in each name all the names of its enclosing polygons.
- the enclosing polygons are parents of the child polygons in a tree structure.
- a hierarchical coordinate system is simply a naming convention on such a tree structure, or directed acyclic graph.
- the hierarchical coordinate system allows the name of each polygon to unambiguously identify all of the parent nodes above it in the tree.
- the Military Grid Reference System (MGRS) and the Quaternary Triangular Mesh (QTM) are examples of hierarchical coordinate systems.
- MGRS Military Grid Reference System
- QTM Quaternary Triangular Mesh
- the earth is covered by a mesh of triangles, and each triangle is subdivided into four new "child" triangles.
- To initialize the QTM tree structure eight large triangles are placed on the Earth in the shape of an octahedron (See http ://w w w.
- any triangle can be identified by a string that lists first the largest enclosing triangle, and then the next smaller enclosing triangle, and then the next smaller, and so on until the number of the smallest triangle is listed.
- a triangle covering part of Germany might be the 2nd triangle within the 3rd triangle of the 5th large triangle used to initialize the tree structure. This triangle over Germany would be identified by the string 532. This triangle contains four triangles at the next level down in the hierarchy, which have the names 5320, 5321, 5322, and 5323. Each of these also contains four triangles, and so on to any level of depth. Deeper levels correspond to higher spatial precision.
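The naming convention in the QTM example can be sketched with plain string operations: a child name is the parent name plus one digit, so every enclosing triangle is a prefix of the names it contains. The helper names below are assumptions for illustration only and are not taken from any QTM library.

```python
def qtm_children(cell):
    """Names of the four triangles directly inside `cell`."""
    return [cell + digit for digit in "0123"]


def qtm_ancestors(cell):
    """Names of all enclosing triangles, largest first."""
    return [cell[:n] for n in range(1, len(cell))]


def encloses(outer, inner):
    """True if triangle `outer` is `inner` itself or one of its ancestors."""
    return inner.startswith(outer)


children = qtm_children("532")
ancestors = qtm_ancestors("5321")
```

Because containment reduces to a prefix test, a longer string always names a smaller area inside the area named by any of its prefixes.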
- Another defining feature of hierarchical coordinate strings is that symbols on opposite ends of the string refer to large and small scales. Each additional symbol in the string corresponds to a progressively smaller scale. As with any decimal-like system, the symbols could be written right-to-left or left-to-right with obviously appropriate changes to the generic query styles. Any string of symbols designating progressively smaller areas (or hypervolumes) of an affine space can be used as a hierarchical coordinate.
- Such a hierarchical coordinate system can be constructed from any affine vector.
- the n-tuple of numbers defining a point in an affine space can be reformatted in the spirit of a hierarchical coordinate system using methods described below.
- the invention teaches a method of converting any affine space vector n-tuple into a useful hierarchical representation.
- the invention utilizes such hierarchical tree representations of affine spaces to construct word-like strings that contain higher-than-one-dimensional meaning, such as for example, geographic meaning.
- word-like strings can be constructed for any data object with spatial coordinates. Regardless of whether the original spatial coordinates were formatted as affine vectors that had to be converted or were already formatted as hierarchical tree coordinates, the invention teaches a number of methods for formatting the hierarchical strings for use in a generic text search engine. These formatting techniques allow generic text search commands to operate on the specially encoded strings such that they can detect the geographic meaning of the string without requiring the generic text search engine to have any notion of geography.
- the described embodiment uses hierarchical coordinate systems in two ways: first, to access hierarchical string encodings via generic text search commands used in a text index designed for holding only words; and second, to allow the specially formatted hierarchical strings to impact the relevance scoring that sorts the results produced in response to queries.
- a "query style" is any type of search command that might be issued to a search engine.
- the wildcard query style allows the user to find documents containing words that include a substring specified by the wildcard query.
- the commonly known syntax for regular expressions applies here. For example, searching for: te?t
- a particular query style used in some embodiments is the trailing wildcard query style, which puts an asterisk at the end of the query string, as follows: te*
- Another type of query style is the phrase query style.
- a phrase search is typically designated by putting quotation marks around the query words, as follows:
- Another query style is a Boolean query style, which allows the user to combine various other query styles into single expressions using the commonly known AND, OR, and NOT operators.
- Generic query styles refer to those query styles that operate on strings without interpreting any meaning in the strings.
- An example of a non-generic query style is a standard range query, which attributes relational meaning to the data in the fields against which the query operates.
- the commonly known greater-than and less-than operators can only be applied to data objects that have been cast into a meaningful form.
- this meaning creation is achieved by putting the data objects in a typed field, where the type is isomorphic to the integers. Since the greater-than and less-than operators can be defined on the integers, one can use the isomorphism between the typed field and the integers to apply the range operators.
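A small sketch may help show why range operators need a typed field. Applied to untyped strings, a greater-than comparison falls back to character-by-character ordering, which disagrees with numeric ordering. The sample values are hypothetical.

```python
# Hypothetical field values stored as untyped strings.
values = ["9", "10", "48"]

# Greater-than on raw strings compares character by character,
# so "10" sorts before "2" and is wrongly excluded.
as_strings = [v for v in values if v > "2"]

# Casting to integers gives the field the ordering that range operators assume.
as_integers = [v for v in values if int(v) > 2]
```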
- This meaning creation step is not required for generic query styles, which can operate on untyped strings of symbols alone. Such untyped strings are often referred to as unstructured data.
- Generic query styles operate on unstructured data.
- the described embodiment constructs a geographic search system using only generic query styles. That is, it builds a geographic search system utilizing an index designed only to handle unstructured data. Even if an engine supports a variety of non-generic query styles, they are likely to perform slowly when combined with word searches on large collections of documents (as discussed above).
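To make the idea concrete, here is a minimal sketch of a toy inverted index that treats geographic strings as ordinary tokens and answers a combined query with a keyword lookup, a trailing wildcard, and a Boolean AND. The index layout, function names, and sample documents are assumptions for illustration; apart from 42842585, which appears in the text, the digit strings are made up.

```python
from collections import defaultdict


def build_index(docs):
    """Map every whitespace-separated token to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def keyword(index, word):
    """Documents containing the exact token."""
    return set(index.get(word.lower(), set()))


def trailing_wildcard(index, prefix):
    """Documents containing any token that starts with `prefix` (the `prefix*` style)."""
    hits = set()
    for token, doc_ids in index.items():
        if token.startswith(prefix):
            hits |= doc_ids
    return hits


# Hypothetical documents with a geographic string inserted as a plain token.
docs = {
    "d1": "hotel near the station 42842585",
    "d2": "hotel by the harbour 51030712",
    "d3": "museum guide 42849911",
}
index = build_index(docs)

# Boolean AND of a keyword and a trailing wildcard on the geographic string.
result = keyword(index, "hotel") & trailing_wildcard(index, "4284")
```

The index never interprets the digits; the geographic restriction comes entirely from the shared prefix.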
- the described embodiment further discloses an enhanced search engine that can efficiently compute some forms of geographically aware relevance for sorting the results.
- three factors of high importance are described.
- the described embodiment teaches how to capture these three factors when using specially formatted hierarchical string encodings via generic query styles on both generic search engines and enhanced search engines.
- the described embodiment uses these specially formatted hierarchical string encodings to allow an enhanced map search interface to access multiple document repositories via text search engines that support different types of generic query styles.
- Such an enhanced map search interface can perform so-called federated search across multiple repositories and efficiently merge the results together into one or more result sets.
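One possible way to merge results from several repositories is sketched below. It assumes that each engine returns (document id, score) pairs already sorted by descending score and that the scores are comparable across engines; neither assumption is stated in the text, and the function name is hypothetical.

```python
import heapq


def merge_results(*result_lists):
    """Merge per-engine result lists, each sorted by descending score, into one list."""
    return list(heapq.merge(*result_lists, key=lambda hit: hit[1], reverse=True))


# Hypothetical hits from two repositories.
engine_a = [("a1", 0.9), ("a2", 0.4)]
engine_b = [("b1", 0.7)]
merged = merge_results(engine_a, engine_b)
```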
- the invention features a method of processing a document.
- the method involves: identifying a plurality of one or more geospatial references within the document; and for each identified geospatial reference of the plurality of geospatial references: (1) associating a geographical location with the identified geospatial reference, the geographic location being represented by a set of coordinates of a selected coordinate system; (2) generating a geographic text string that encodes the geographic coordinates, wherein generating a geographic text string involves interleaving the coordinates of the set of coordinates or otherwise acquiring a hierarchical representation of the coordinates; (3) formatting the geographic text string for use with a selected query style; and (4) associating the geographic text string with the identified geospatial reference.
- the selected coordinate system is a non-hierarchical coordinate system on the globe or a portion of the globe (e.g. comprising latitude and longitude coordinates or, for another example, comprising Massachusetts State Plane Coordinates).
- the selected coordinate system is a hierarchical coordinate system (e.g. comprising a mesh of nested shapes, such as a triangular mesh.)
- a specific example of a hierarchical coordinate system is the quaternary triangular mesh coordinate system.
- Associating the geographic text string with the identified geospatial reference involves inserting that geographic text string into the document at the location of the corresponding geospatial reference.
- associating the geographic text string with the identified geospatial reference involves placing that geographic text string into a separate file, which also identifies the geospatial reference with which that geographical text string is associated in the document. For each identified geospatial reference of the plurality of geospatial references also determining a confidence level for the associated geographical location and wherein encoding the geographical location as a geographic text string involves encoding both the geographical location and the confidence level into the geographic text string. Generating the geographic text string involves representing the confidence level within the text string as a corresponding bin of a plurality of bins, each of said plurality of bins representing a different range of confidence levels.
- the invention features another method of processing a document. The method involves: identifying a plurality of one or more geospatial references within the document; and for each identified geospatial reference of the plurality of geospatial references: (1) associating a geographical location with that identified geospatial reference, the geographical location being represented by a set of coordinates of a selected coordinate system; (2) determining a confidence level for that associated geographical location; (3) encoding both the geographical location and the confidence level for that identified geospatial reference as a geographic text string; and (4) associating the geographic text string with the identified geospatial reference.
- Encoding involves interleaving the coordinates of the set of coordinates for that associated geographical location to generate the geographic text string.
- Encoding both the geographical location and the confidence level for that identified geospatial reference as a geographic text string involves representing the confidence level within the text string as a corresponding bin of a plurality of bins, wherein each of the plurality of bins represents a different range of confidence levels.
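The binning idea can be sketched as follows. The token layout used here, `c<bin>g<digits>`, the choice of four equal-width bins, and the function names are all assumptions made for illustration; the text does not fix a particular format.

```python
def confidence_bin(confidence, n_bins=4):
    """Map a confidence in [0, 1] to a bin number from 0 to n_bins - 1."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie between 0 and 1")
    return min(int(confidence * n_bins), n_bins - 1)


def geo_token(geo_string, confidence, n_bins=4):
    """Build one indexable token holding both the bin and the location digits."""
    return "c%dg%s" % (confidence_bin(confidence, n_bins), geo_string)


def threshold_query(geo_prefix, min_confidence, n_bins=4):
    """Boolean OR of trailing-wildcard terms, one per bin at or above the threshold."""
    first = confidence_bin(min_confidence, n_bins)
    terms = ["c%dg%s*" % (b, geo_prefix) for b in range(first, n_bins)]
    return " OR ".join(terms)


token = geo_token("42842585", 0.95)
query = threshold_query("4284", 0.5)
```

With this layout a confidence threshold becomes a short Boolean expression over wildcard terms, so no comparison operator is needed at query time.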
- encoding both the geographical location and the confidence level for that identified geospatial reference as a geographic text string involves representing the confidence level as a number string and interleaving the number string along with the coordinates of the set of coordinates for that associated geographical location to generate the geographic text string.
- the selected coordinate system is an affine coordinate system (e.g. employing latitude and longitude coordinates).
- the selected coordinate system is a hierarchical coordinate system.
- Associating the geographic text string with the identified geospatial reference involves inserting that geographic text string into the document at the location of the corresponding geospatial reference.
- Associating the geographic text string with the identified geospatial reference involves placing that geographic text string into a separate file, which also identifies the geospatial reference with which that geographical text string is associated in the document.
- the invention features a method of processing a set of documents.
- the method involves: for each document in the set of documents, identifying a plurality of one or more geospatial references within that document; and for each identified geospatial reference of the plurality of geospatial references within that document: (1) associating a geographical location with the identified geospatial reference, the geographical location being represented by a set of coordinates of a selected coordinate system; (2) determining a confidence level for the associated geographical location; (3) encoding the geographical location and its confidence level into a geographic text string; and (4) associating the geographic text string with the identified geospatial reference.
- the invention features a method of constructing a text search query for identifying among a plurality of documents those documents that contain geospatial references that are associated with a geographic location.
- the method involves: receiving an identification of the geographical location; in response to receiving that specification, representing said geographical location as a set of coordinates; and generating a geographical text string from the set of geographical coordinates by interleaving the coordinates of the set of coordinates for that geographical location.
- the method also includes submitting the geographical text string to a text search engine, which searches a text index for the plurality of documents to identify those documents that contain geospatial references that are associated with said geographic location.
- the method further includes receiving a specification of a confidence, wherein generating the geographical text string further involves combining a representation of the confidence level with the set of geographical coordinates to generate the geographic text string.
- Another embodiment includes a client application that constructs text search queries for multiple text search engines using the special text strings described herein.
- the text encodings and query formats for the different text search engines may vary.
- the client application can combine the results from these various engines into one or more result sets and display them to a user in a text readout or on a geographic map.
- Fig. 1 is a high level block diagram showing the principal elements of the geographical location text indexing and searching system.
- Fig. 2 is a flow diagram illustrating the process for generating a text index that can be used to submit geospatial queries to a document repository.
- Fig. 3 is a flow diagram illustrating the process for conducting geospatial queries of a document repository.
- Figs. 4A and 4B are diagrams illustrating the decomposition of a query from a mapping application into multiple queries.
- System 100 includes: a document repository 101, which contains all of the documents within the search space for the system; a geoparser 104, which identifies and tags the geospatial references within the documents stored in repository 101 with a special text string and places the tagged documents into temporary document repository 102; text indexing software 106, which generates a text index 108 for all documents stored in temporary document repository 102; and text search software 110, which operates on text index 108 to find all documents in document repository 101 that are responsive to a search query 112 specified by a user.
- System 100 also includes a keyword search user interface 114 and a map user interface 116.
- Keyword search user interface 114 enables the user to specify whatever keywords are to be included within the search query; and map user interface 116 enables the user to specify whatever geospatial ranges are to be used in the search query and to also specify confidence thresholds that limit the results to only those geospatial references that meet the corresponding specified confidence thresholds.
- text search engine 110 uses text index 108 to find all relevant documents and returns the results to the user, typically in the form of a visual output on a display device or as printed output or as a saved electronic file.
- Geoparser 104 processes each text document found in document repository 101 and for each document produces geographic coordinates, such as (latitude, longitude, altitude), for the corresponding geospatial references that are found within that document.
- the function that is performed by geoparser 104 is referred to as geoparsing.
- geoparsing involves looking for references within a document that have geographical significance or meaning (i.e., geospatial references). For example, geoparser 104 might look for names of cities (e.g.
- geoparser 104 is implemented in code, which performs the geoparsing functions automatically, as described in U.S. Patent Application Nos. 09/791,533 and 10/633,915.
- a human can also perform the functions of a geoparser and enter the relevant information about the document by hand.
- Geoparser 104 also generates a confidence score that indicates the probability that the identified textual reference actually refers to the location that geoparser 104 associates with the reference. Stated differently, it can also be viewed as the probability that the author of the document would agree with the software's choice of coordinates for that reference. These coordinates and confidence scores are data about the data in the document (namely the geospatial references within the document), so they are called "metadata." Confidence scores are typically represented as percentages that indicate the probability that a human would agree with the location chosen by the software to represent the author's original wording. A confidence score of 68% could be interpreted to mean that sixty-eight out of a hundred human readers would agree that these coordinates are what the author intended.
- a particular geographic reference might be tagged with several candidate locations of varying confidence. For example, there are at least 44 cities in the world known as Paris, so a particular reference to the word "Paris" might not clearly identify which particular location was intended by the author. In such a case, an automatic geoparser might tag this reference with the coordinates for the Paris in central France at 95% confidence and the Paris in the state of Texas at 57% confidence, and other locations with other confidence scores.
- confidence scores are to allow the system to present the most correct and most useful results first, so a human reader can understand and cope with search results from large collections of documents.
- search results are plotted on a map search user interface (which in the described embodiment is functionality that is implemented by search engine 110). By sorting the results according to confidence score, those locations that are most likely to have been tagged correctly are presented to the user first.
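A minimal sketch of presenting candidate locations by confidence, using the Paris example from the text. The record layout, the third candidate, and the threshold parameter are assumptions for illustration.

```python
# Candidate locations for one reference to "Paris"; the last entry is hypothetical.
candidates = [
    {"place": "Paris, Texas", "confidence": 0.57},
    {"place": "Paris, France", "confidence": 0.95},
    {"place": "another Paris", "confidence": 0.10},
]


def rank(candidates, threshold=0.0):
    """Keep candidates at or above the threshold, most confident first."""
    kept = [c for c in candidates if c["confidence"] >= threshold]
    return sorted(kept, key=lambda c: c["confidence"], reverse=True)


ranked_places = [c["place"] for c in rank(candidates, threshold=0.5)]
```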
- Geoparser 104 represents the location and confidence information (i.e., the metadata) as a specially structured text string that encodes the coordinate and confidence metadata in a way that it can be searched by using traditional text search indexing software. These special encodings take advantage of either phrase search or wildcard queries or Boolean operators to represent range queries.
- the encoding method that is employed by geoparser 104 converts the multiple spatial coordinates identifying a particular location into a single geographic text string. It does this by interleaving the digits that make up the coordinates of the location. So, for example, if the coordinates are (48.28°, 24.55°), which specify a position in terms of a (latitude, longitude), then one constructs the special text string by alternately taking a digit from each coordinate starting with the leftmost digit (i.e., the most significant digit) and adding it to the text string until all of the digits have been used. In the case of the coordinates (48.28°, 24.55°) this process produces the following string: "42842585".
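The interleaving step for this example can be sketched in a few lines. The function names are assumptions, and the sketch takes digit strings of equal length with the decimal point already removed.

```python
def interleave(*coords):
    """Alternate digits from each coordinate string, most significant digit first."""
    if len(set(len(c) for c in coords)) != 1:
        raise ValueError("all coordinate strings must have the same length")
    return "".join("".join(digits) for digits in zip(*coords))


def deinterleave(geo_string, dims=2):
    """Recover the per-dimension digit strings from an interleaved string."""
    return [geo_string[d::dims] for d in range(dims)]


# (48.28, 24.55) written as digit strings without the decimal point.
geo_string = interleave("4828", "2455")
```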
- This interleaving technique can be applied to any multi-dimensional spatial coordinate system in which displacement along each coordinate dimension is represented by a string (typically a string of numerical digits) and each element of the string (or each digit) represents a larger spatial range than the element (or digit) to its right.
- the "4" digit represents a range that extends between 40.00° and 49.99°.
- the next digit, namely "8", represents a range that extends between 8.00° and 8.99°, which is ten times smaller.
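The area covered by a prefix of an interleaved string can be computed by giving each successive digit a range ten times smaller than the previous digit of the same dimension. The sketch below assumes non-negative coordinates with two integer digits per dimension, as in the (48.28°, 24.55°) example; the function name and parameters are hypothetical.

```python
from fractions import Fraction


def prefix_ranges(prefix, dims=2, int_digits=2):
    """Return one (low, high) pair per dimension for an interleaved digit prefix.

    `high` is exclusive, so the leading latitude digit "4" yields (40, 50).
    """
    sizes = [Fraction(10) ** int_digits] * dims  # span before any digit is read
    lows = [Fraction(0)] * dims
    for k, ch in enumerate(prefix):
        d = k % dims                   # dimension this digit belongs to
        sizes[d] = sizes[d] / 10       # each digit narrows its range tenfold
        lows[d] = lows[d] + int(ch) * sizes[d]
    return [(lows[d], lows[d] + sizes[d]) for d in range(dims)]


one_digit = prefix_ranges("4")
four_digits = prefix_ranges("4284")
```

Exact fractions are used instead of floats so that the range boundaries stay exact at any prefix length.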
- Such coordinate systems include the Universal Transverse Mercator (UTM) system. As described above, each coordinate pair is usually assumed to have infinite precision, with an infinitely long string of zeros implicitly tacked on to the end. When interleaving these coordinates, it is helpful to pad them on the left and right with enough zeros to make all coordinate dimensions the same length regardless of the actual number of significant digits and regardless of the precision.
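Zero padding can be sketched as below. The field widths (three integer digits, two fractional digits) and the function name are assumptions, and negative coordinates are rejected because the text does not say how signs should be encoded.

```python
def to_digits(value, int_width=3, frac_width=2):
    """Fixed-width digit string for a non-negative coordinate, zero padded on both sides."""
    if value < 0:
        raise ValueError("sign handling is outside the scope of this sketch")
    scaled = round(value * 10 ** frac_width)
    digits = str(scaled).zfill(int_width + frac_width)
    if len(digits) > int_width + frac_width:
        raise ValueError("value does not fit in the chosen width")
    return digits


lat = to_digits(48.28)    # padded on the left
lon = to_digits(124.5)    # padded on the right
padded = "".join(a + b for a, b in zip(lat, lon))
```

Without the padding, the tens digit of one coordinate would be paired with the hundreds digit of the other, and shared prefixes would no longer correspond to shared areas.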
- Hierarchical coordinate systems such as the military grid reference system (MGRS) and the quaternary triangular mesh (QTM), are already in a single-string format.
- the interleaving procedure described above for affine space coordinates is a method for generating hierarchical coordinates that correspond to the affine space.
- The geographic string encodings described here are simply string representations of hierarchical coordinates. The described embodiment teaches unique uses of these strings in geographic text retrieval that can be applied to strings from any hierarchical coordinate system or any other coordinate system converted to a hierarchical string.
- geoparser 104 inserts this geographic text string directly into the document next to the geospatial reference.
- This approach is referred to herein as the "inline" method.
- geoparser 104 actually modifies the document, which results in altering the positions of all words within the document that follow the location at which the special text string is inserted.
- the inline method "warps" the document and this will likely affect the search results when proximity conditions are used in a search query.
- An alternative approach that avoids this problem is referred to as the "standoff" method.
- a separate file is created that carries the special text strings.
- the separate file also specifies the character positions identifying the locations of the corresponding geospatial references within the actual document. This allows the geographic text strings to be associated with one character position, a character range, one word position, or a chosen set of words in the document.
- the standoff method does not warp the document and permits the geographic text strings to participate in relevance ranking computations that use textual proximity.
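The difference between the two methods can be illustrated with a small sketch. The `StandoffTag` record, the helper names, and the sample sentence (with the made-up place name "Exampleville") are assumptions for illustration; the source does not specify a file format for standoff metadata.

```python
from dataclasses import dataclass


@dataclass
class StandoffTag:
    start: int       # character offset where the geospatial reference begins
    end: int         # character offset just past the reference
    geo_string: str  # encoded geographic text string


def insert_inline(text: str, end: int, geo_string: str) -> str:
    """Inline method: splice the geo string into the text after the reference."""
    return text[:end] + " " + geo_string + text[end:]


def make_standoff(text: str, reference: str, geo_string: str) -> StandoffTag:
    """Standoff method: leave the text alone and record where the reference is."""
    start = text.index(reference)
    return StandoffTag(start, start + len(reference), geo_string)


doc = "A roadblock was reported near Exampleville on Tuesday."
tag = make_standoff(doc, "Exampleville", "42842585")
inline_doc = insert_inline(doc, tag.end, tag.geo_string)

print(doc.split().index("on"))         # word position in the untouched text
print(inline_doc.split().index("on"))  # shifted by one after inline insertion
```

The inline copy pushes every later word one position to the right, which is the "warping" that affects proximity operators, while the standoff tag leaves the document text byte-for-byte unchanged.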
- Generic search engines typically do not support standoff metadata.
- An enhanced search engine may handle standoff metadata.
- Geoparser 104 stores the encoded geographic metadata information in temporary document repository 102 as part of the documents either as inline or standoff metadata. Adding these special strings to copies of the documents essentially tricks traditional text indexing software into interpreting these special strings as regular words thereby making them searchable by conventional text search software using generic query styles. This, in turn, enables a conventional text search engine to easily locate all documents that contain geographic representations that are relevant to geographic ranges specified by the map user interface.
- Typically, although not always, multiple documents are stored in document repository 101 and can be bulk processed in batches to create temporary document repository 102 in which the metadata is added.
- individual documents can be geoparsed as part of a larger processing system, such as a document tagging pipeline or a document editor user interface that allows a user to check the accuracy of the metadata output by the geoparser.
- Documents stored in repository 102 typically have document identifiers, such as URLs, that allow users to retrieve a document simply by entering the document identifier into a viewer, such as entering a URL into a web browser.
- Text indexing engine 106 processes documents from repository 102 to create an "inverted index" or text index 108 that can be operated by text search engine 110 to allow users to retrieve documents based on the keywords and/or the geospatial references contained in the document instead of requiring the user to know the document identifier.
- Text index 108 is usually represented as large files stored on disks or in memory. Text index 108 allows users to retrieve documents or document references, such as URLs, based on search query commands input through a keyword search user interface 114.
- Keyword search user interface 114 allows users to construct queries that are used for searching the document in repository 102.
- the search query will typically include one or more strings of characters and possibly operators, such as quotation marks to denote sets of strings separated by spaces, asterisks to denote wildcard matching, and AND/OR/NOT operators to denote Boolean operations.
- Text search engine 110 then applies these commands to the information that it has stored in text index 108 about the documents in temporary document repository 102.
- the information in text index 108 is typically organized by the text indexing engine that created the index to optimize the time required to apply these commands.
- Text indexing engine 106 might create and store a list of all document identifiers for documents that contain any word beginning with "cat," including documents that contain the words "catalog" and "catastrophe." This allows the text index to answer a wildcard query of the form "cat*" simply by returning that list of document identifiers, which is much faster than reprocessing every document in search of words that match that query command.
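A toy version of such a precomputed prefix list might look like the following. It is only a sketch of the idea: real indexing engines use far more compact structures, and the function name, the `max_len` cutoff, and the sample documents are assumptions.

```python
from collections import defaultdict


def build_prefix_index(docs: dict[str, str], max_len: int = 8) -> dict[str, set[str]]:
    """Map every word prefix (up to max_len characters) to the documents containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            for n in range(1, min(len(word), max_len) + 1):
                index[word[:n]].add(doc_id)
    return index


docs = {
    "d1": "the catalog is new",
    "d2": "a catastrophe struck",
    "d3": "dogs bark",
    "d4": "roadblock 42842585",
}
index = build_prefix_index(docs)

# A trailing wildcard query such as cat* becomes a single dictionary lookup.
print(sorted(index["cat"]))   # ['d1', 'd2']
# An interleaved geographic string is indexed like any other word.
print(sorted(index["4284"]))  # ['d4']
```

The second lookup shows why the encoded coordinates need no special handling: to the index they are just another word with searchable prefixes.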
- Map user interface 116 enables the user to define through a graphical user interface the geographic regions that are to be included as search criteria. It is referred to as an "enhanced" map user interface because it not only specifies the geospatial ranges that are input by the user through a graphical user interface but it also converts those geospatial ranges into geographic string encodings such as are described below in greater detail. These are supplied to text search engine 110, which uses them to search text index 108 to identify the relevant documents in temporary document repository 102.
- Map user interface 116 interacts with text search engine 110 via keyword search user interface 114, which is a generic keyword search user interface that is able to interact with text search engine 110.
- Keyword search user interface is the interface into which the user types the keywords that will make up part of the overall search query that is to be applied by text search engine 110.
- An alternative approach would be to design map user interface 116 to interact directly with the text search engine 110, in which case it might incorporate the functionality of a keyword search user interface thereby allowing the user to enter keywords or search commands that are passed to the text index software along with the encoded geographic queries.
- Map user interface 116 can be implemented by any one of a large number of map viewing applications, including, for example, an ESRI ArcGIS client running on a desktop computer that employs the Windows operating system or a web-browser-based application served by a web server that has been enhanced with the ability to issue queries to a text search engine using the encodings described below.
- the results from text search engine 110 are typically plotted on the map in the viewing application.
- Map search user interface 116 allows a user to select a spatial domain of interest by zooming a map image.
- the viewable map area within the image can then be used as the query constraint, or the user may be allowed to define the spatial search criteria by highlighting areas of interest on the map.
- a two-dimensional map search user interface might show a latitude-longitude map of a region like Europe and allow a user to draw a loop around their area of interest.
- a three-dimensional map search user interface might show a fly-through of a building complex and allow a user to select a parallelepiped surrounding a hallway of interest.
- the multi-dimensional domains of interest are then combined with keyword search commands and sent to generic text search engine 110 which uses only generic query styles to represent both the geographic and non-geographic query constraints. This retrieves documents or document identifiers that match both the spatial domain and keyword constraints.
- Fig. 2 shows a flow diagram of the process by which the system builds the text indexes that include the geographic text strings.
- the operator or system administrator provides a repository of all documents that are to be searchable (step 202).
- the geoparser goes through each document in the repository to identify geospatial references (step 204). For each geospatial reference that is identified in a document, the geoparser determines the geographical locations to which that geospatial reference might refer; it computes a confidence score for those locations; and it constructs metadata containing that information (step 206).
- the geoparser then encodes the metadata into a geographic text string of the type described above (step 208), and it inserts those into the document using either the inline approach or the standoff approach (step 210). After the geoparser processes all documents in the document repository in that way, the resulting augmented document repository is ready to be indexed by the text indexing engine.
- the system might apply the geoparser to the documents as they are passed through a processing pipeline between the repository and the indexing engine.
- the metadata need not be stored in the repository.
- the metadata can be associated with the documents in-memory as they are passed into the indexing engine.
- the text indexing engine indexes the documents in the repository using techniques that are commonly employed by such engines (step 210). However, because the geospatial information has been added to the documents as special text strings, the text indexing engine will index that information in the same way that it indexes all keywords and keyword phrases that are found within the corpus of documents.
- The resulting inverted index, which may include many indices, each one for a different keyword or keyword phrase, maps all keywords and text strings to the appropriate documents in the document repository.
- Fig. 3 shows how the system enables a user to search for all documents that are relevant to a query that includes one or more keywords and a geographical region of interest.
- The map user interface presents the user with a visual graphical representation that enables the user to specify the geographical region or regions that are to be part of the search query (step 302). Through this interface the user identifies all geographical regions for which the user wants to see documents that contain geospatial references that are relevant to those geographical regions. The user is also permitted by the interface to specify a confidence threshold, which instructs the search engine to ignore any documents that contain geospatial references for which the probability that the reference refers to the specified geographic region is not sufficiently high.
- Another part of the interface, namely the keyword search user interface, enables the user to also specify a list of keywords that are to form part of the search query.
- the interface also enables the user to use conventional Boolean and other standard operators and conditions to construct the keyword search query (step 304). For example, keyword 1 w/in 3 of keyword2 might be written as
- the user interface then generates the appropriate search strings that are to be presented to the text search engine to define the search criteria that are to be applied to the search (step 306). As part of this operation, it encodes the selected geographical regions into the special strings of the type that are described elsewhere in this document.
- the system presents the search commands to the search engine, which then conducts the search (step 308).
- The search engine presents the results to the user in some useful form, e.g. as information displayed in a visual display or printed out in hard copy or stored on electronic media (step 310).
Constructing Hierarchical Coordinates from Affine Space n-Tuples
- the geographic coordinate metadata created by the geoparser is converted to hierarchical coordinates by interleaving, as described in this section.
- This interleaving can be performed on any multi-dimensional affine coordinate tuple, such as those on the sphere of the Earth or in Euclidean three-dimensional space.
- the tuple could include latitude, longitude, and meters above sea level, or x-feet east and y-feet north of a particular anchor point.
- Interleaving takes the first digit of each coordinate and concatenates them, and then the second digit from each coordinate and concatenates them to the string of first digits, and so on through all the digits.
- the coordinate location 432 feet east and 987 feet north can be encoded as:
- a hierarchical coordinate refers to an area. In this example, each coordinate refers to a square. The longer the string, the smaller the square.
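To make the "longer string, smaller square" relationship concrete, the sketch below interleaves the two coordinates and then recovers the square named by a prefix. Taking the east digit before the north digit is an assumption (the source notes that other orders are possible), as are the function names and the three-digit field width.

```python
def interleave(east: str, north: str) -> str:
    """Alternate digits, east digit first (ordering is an assumption)."""
    return "".join(e + n for e, n in zip(east, north))


def square_for_prefix(prefix: str, digits: int = 3):
    """Return ((east_min, east_max), (north_min, north_max)) in whole feet
    for a non-empty, even-length prefix of an interleaved two-coordinate string."""
    east_digits, north_digits = prefix[0::2], prefix[1::2]
    side = 10 ** (digits - len(east_digits))  # side length of the square
    east_min = int(east_digits) * side
    north_min = int(north_digits) * side
    return (east_min, east_min + side - 1), (north_min, north_min + side - 1)


code = interleave("432", "987")
print(code)                       # 493827
print(square_for_prefix("49"))    # ((400, 499), (900, 999))
print(square_for_prefix("4938"))  # ((430, 439), (980, 989))
print(square_for_prefix(code))    # ((432, 432), (987, 987))
```

Each additional pair of digits shrinks the side of the square by a factor of ten, so a prefix of the string is a coarser cell that contains every finer one.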
- The geoparser might encode these coordinates by first shifting the origin so that negative symbols do not appear. To keep the number of left-of-decimal-point digits the same amongst all the coordinates, the geoparser adds padding zeroes. So, for the location mentioned above, the geoparser could shift the origin 90° south and 180° west and pad with zeros to produce the following interleaving encoding:
- This string encoding is equivalent to a hierarchy of rectangular areas.
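The origin shift and zero padding can be sketched as follows. The function name, the field widths (three digits left of the decimal point, two right), and the sample inputs are illustrative assumptions; they are not claimed to reproduce the encoding elided from the text above.

```python
from decimal import Decimal


def encode_lat_lon(lat: str, lon: str, left: int = 3, right: int = 2) -> str:
    """Shift the origin 90 degrees south and 180 degrees west, pad, then interleave."""
    # After the shift both coordinates are non-negative.
    shifted = (Decimal(lat) + 90, Decimal(lon) + 180)
    fields = []
    for value in shifted:
        # Fixed-point text with `right` decimals, zero-padded on the left.
        whole, frac = f"{value:.{right}f}".split(".")
        fields.append(whole.zfill(left) + frac)
    return "".join(a + b for a, b in zip(*fields))


print(encode_lat_lon("48.28", "24.55"))    # 1230842585
print(encode_lat_lon("-33.87", "151.21"))  # 0353611231
```

The second call shows the point of the shift: a southern-hemisphere latitude yields a plain digit string with no minus sign to confuse the text index.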
- n-tuple interleaving described here preserves the singularities of the original coordinate system. For example, latitude-longitude coordinates behave poorly at the poles, by having many very different coordinates for nearly the same location. A hierarchical coordinate system constructed directly from latitude-longitude by interleaving still contains this problem, by having squares of equal "size" cover very different amounts of real ground when considered at the poles versus at the equator.
- a document containing hierarchical string used in the example above can be found using a trailing wild card query such as 000004013504* since this query would retrieve any string between 000004013504000000000 and 000004013504999999999.
- This range of text strings corresponds to the encodings for all locations within the three-dimensional bounding box ranging from (00050.00°, 00100.00°, 04340.00) to (00059.99°, 00109.99°, 04349.99).
- the right-most digits in these strings are the least significant.
- the last n-digits correspond to the least significant digit in each of the coordinate directions. It is typical to assume infinite precision on these coordinates, which implies an infinite string of zeros appended to the right of these least significant digits.
- the documents retrieved by the range query will include all those with matching prefix string (most significant digits) regardless of the precision (i.e. length of non-zero string).
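The prefix behaviour can be checked with a short sketch that pads and interleaves the three-dimensional running example, (00057.79°, 00101.81°, 04349.00), and tests it against the trailing wildcard query. The helper names and the five-plus-two digit field layout are assumptions chosen to match the padded values shown in the text.

```python
def pad(value: str, left: int = 5, right: int = 2) -> str:
    """Zero-pad to `left` whole digits and `right` fractional digits, dropping the point."""
    whole, _, frac = value.partition(".")
    return whole.zfill(left) + frac.ljust(right, "0")[:right]


def interleave(*fields: str) -> str:
    return "".join("".join(group) for group in zip(*fields))


def matches_wildcard(token: str, query: str) -> bool:
    """True if `token` satisfies a trailing wildcard query such as 0150*."""
    return query.endswith("*") and token.startswith(query[:-1])


encoded = interleave(pad("57.79"), pad("101.81"), pad("4349"))
print(encoded)                                     # 000004013504719780910
print(matches_wildcard(encoded, "000004013504*"))  # True

# A coordinate recorded with less precision still shares the same prefix.
coarse = interleave(pad("57.7"), pad("101.8"), pad("4349"))
print(matches_wildcard(coarse, "000004013504*"))   # True
```

Padding on the right with zeros is what lets coordinates of different precision fall under the same wildcard prefix.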
- The trailing wildcard query style can be combined with non-geographic query constraints. For example, to find documents that refer both to the word "roadblock" and a location within the bounding box with latitude greater than or equal to 50 degrees and less than 60 degrees, and longitude greater than or equal to 100 and less than 110 degrees, a query like one of the following might be sent to the text search index: roadblock 0150*
- the first example requires that the document contain the word roadblock and also contain the exact phrase following the magic string.
- The second example requires that the document contain roadblock within 40 words of the magicstring phrase.
- The third example shows how a special identifying string, such as the characters "magicstring," might be attached to the beginning of the specially encoded geographic string in order to ensure that the wild card search only acts on those numbers that were inserted by the geoparser and not other extraneous numbers occurring in the documents.
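The effect of the identifying string can be seen in a small sketch. The helper names and the token list are assumptions; the token "015099" stands in for an unrelated number that happens to occur in a document.

```python
MAGIC = "magicstring"  # identifying string named in the text


def wildcard_term(lat_prefix: str, lon_prefix: str, magic: str = "") -> str:
    """Interleave the leading digits of each coordinate and add a trailing wildcard."""
    interleaved = "".join(a + b for a, b in zip(lat_prefix, lon_prefix))
    return f"{magic}{interleaved}*"


def matches(token: str, term: str) -> bool:
    return token.startswith(term.rstrip("*"))


plain = wildcard_term("05", "10")          # latitude 05x, longitude 10x
tagged = wildcard_term("05", "10", MAGIC)
print(f"roadblock {plain}")                # roadblock 0150*
print(f"roadblock {tagged}")               # roadblock magicstring0150*

tokens = ["roadblock", "015099", "magicstring0150717891"]
print([t for t in tokens if matches(t, plain)])   # ['015099']
print([t for t in tokens if matches(t, tagged)])  # ['magicstring0150717891']
```

Without the prefix the wildcard picks up the stray number; with it, only strings written by the geoparser can match, provided the geoparser also prepends the same prefix when it writes them.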
- each prefix might be prepended with a magicstring to ensure that it is uniquely identifiable via the query. If the indexing engine supports the standoff method, then all the prefixes can be associated only with the character or word positions of the geographic reference. While this design may require the text index to hold many more words, the words can be stored in a simple index that need not support wildcard queries. As with the wildcard query style, this string matching query style can be combined with non-geographic query constraints. For example, to find roadblocks within a particular area, one need only issue a query for: roadblock 0150
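A sketch of this prefix-enumeration style follows. Emitting one prefix per complete round of digits (that is, stepping by the number of dimensions) is an assumption made here, as are the function name and sample data.

```python
def prefix_words(geo_string: str, dims: int = 2, magic: str = "magicstring") -> list[str]:
    """List every prefix that ends on a complete round of interleaved digits."""
    return [magic + geo_string[:n] for n in range(dims, len(geo_string) + 1, dims)]


words = prefix_words("0150717891")
print(words)
# ['magicstring01', 'magicstring0150', 'magicstring015071',
#  'magicstring01507178', 'magicstring0150717891']

# The range query is now an exact word match, so no wildcard support is needed.
indexed = set(words) | {"roadblock"}
print("magicstring0150" in indexed and "roadblock" in indexed)  # True
```

The index holds five words for this one location instead of one, which is the storage cost the text mentions, in exchange for queries that need only exact matching.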
- The proximity operator could be used to find roadblock within a certain number of words of the spatial reference. This illustrates a problem with the proposed technique. If the specially formatted hierarchical strings are inserted inline, then the word proximity operator might count them as part of the separation between query words. This is not the most correct behavior. By accepting standoff metadata, an enhanced search engine avoids this problem. Standoff metadata allows multiple of the specially encoded geographic strings to occupy the same word position as already existing words in the document.
- Typical generic text search engines are equipped with the ability to search for a phrase.
- a phrase search can be more efficient than a trailing wild card search because the system does not have to generate a list of all the sub-words beginning with the search string that precedes the wild card.
- Another cause of inefficiency in wildcard searches comes from the use of separate indices: if the prefix index does not include character positions, then searches on the prefix index must be joined with a word position index in order to compute textual proximity based word relevance functions. In this method, the system needs only to search for word combinations using the phrase search generic query styles.
- Phrase searching can treat the sought for elements of the text string as separate words, and search only for the required word combinations.
- a special string is added to the beginning of the encoding.
- the following string is added to a document: magicstring01 50 71 78 91
- the first example requires that the document contain the word roadblock and also contain the exact phrase following the magic string.
- The second example requires that the document contain roadblock within 40 words of the magicstring phrase.
- the phrases can be any size. However, there might be an advantage to selecting a size that corresponds to the number of dimensions of the coordinate space. In the above example, the coordinate space had two dimensions, namely, latitude and longitude; and the phrase that was selected had two digits. Thus, by adding another set of three characters to the trailing end of the phrase search specified above, one reduces the size of the query box by a factor of ten along each dimension.
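The phrase form can be sketched by cutting the interleaved string into words of one digit per dimension and matching consecutive tokens. The helper names and the sample document are assumptions; a real engine would answer the phrase query from its positional index rather than by scanning tokens.

```python
def to_phrase(geo_string: str, dims: int = 2, magic: str = "magicstring") -> str:
    """Split the interleaved string into dims-digit words; the magic string
    is attached to the first word."""
    words = [geo_string[i:i + dims] for i in range(0, len(geo_string), dims)]
    return magic + " ".join(words)


def contains_phrase(doc_tokens: list[str], phrase_tokens: list[str]) -> bool:
    """True if phrase_tokens occur consecutively in doc_tokens."""
    n = len(phrase_tokens)
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))


phrase = to_phrase("0150717891")
print(phrase)  # magicstring01 50 71 78 91

doc = f"roadblock near town {phrase} reported".split()
print(contains_phrase(doc, "magicstring01 50".split()))     # True: 10-degree box
print(contains_phrase(doc, "magicstring01 50 71".split()))  # True: 1-degree box
print(contains_phrase(doc, "magicstring01 51".split()))     # False
```

Adding one more word to the query phrase narrows the box by a factor of ten in each dimension, matching the behaviour described above.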
- The geoparser can also add natural language confidence scores about the geographic metadata to the specially formatted hierarchical strings simply by treating confidence as another coordinate dimension. To extend the previous example, assume that it now includes a confidence score of 88%: (latitude, longitude, altitude, confidence) = (00057.79°, 00101.81°, 04349.00, 00088.00)
- the geoparser could encode the confidence as though it were a fourth affine coordinate dimension. For trailing wild card queries, this would look like this: magicstring0000004001305048719878009100
- The wild card query magicstring0000004001305048* retrieves documents referring to the latitude, longitude, altitude bounding box ranging from (50.00°, 100.00°, 4340m) to (59.99°, 109.99°, 4349m) with a confidence level between 80.00% and 89.99%. And in the case of phrase searching, the phrase search string "magicstring0000 0040 0130 5048" retrieves the same set of documents.
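The four-dimensional encoding can be reproduced with the same interleaving step. In this sketch the padded digit strings are written out by hand from the values given in the text; the variable names are illustrative.

```python
def interleave(*fields: str) -> str:
    return "".join("".join(group) for group in zip(*fields))


# latitude, longitude, altitude, confidence: each padded to 5 + 2 digits
fields = ("0005779", "0010181", "0434900", "0008800")
encoded = interleave(*fields)
print("magicstring" + encoded)
# magicstring0000004001305048719878009100

# Phrase form: one word per round of four digits (one digit per dimension).
phrase = " ".join(encoded[i:i + 4] for i in range(0, len(encoded), 4))
print("magicstring" + phrase)
# magicstring0000 0040 0130 5048 7198 7800 9100

print(encoded.startswith("0000004001305048"))  # True
```

The fourth word, 5048, carries the tens digit of latitude, longitude, and altitude together with the tens digit of the confidence, which is why the 16-digit prefix bounds the confidence to the 80 to 89.99% band.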
- the queries are forced to use the same degree of precision along all coordinate directions. If the coordinates have different numbers of significant digits, a query may specify a relatively small range in one dimension and a relatively large range in another dimension. Normalizing all the coordinate dimensions to a range between 0 and 1 mitigates this problem.
- the latitude is divided by 180, which is the largest deviation it can experience.
- the longitude is divided by 360, which is the largest deviation it can experience.
- the altitude is normalized to 50,000 meters above sea level, which is an arbitrary maximum altitude. Since the confidence score is already normalized to one, it usually need not be changed.
- the resulting normalized coordinates would be:
- The normalized coordinates encode as: 320828881260089050806600, for trailing wildcard searches, and 3208 2888 1260 0890 5080 6600, for phrase searching.
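The normalization and encoding steps above can be worked through in code. This sketch assumes six fractional digits per dimension with half-up rounding (choices that happen to reproduce the strings in the text) and uses `decimal` to avoid floating-point surprises; the function names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

# Divisors from the text: latitude, longitude, altitude (m), confidence.
SCALES = (Decimal(180), Decimal(360), Decimal(50000), Decimal(1))


def normalize(values, places: int = 6) -> list[str]:
    """Scale each value into [0, 1) and return its fractional digits."""
    quantum = Decimal(1).scaleb(-places)  # 0.000001 for six places
    digits = []
    for value, scale in zip(values, SCALES):
        frac = (Decimal(value) / scale).quantize(quantum, rounding=ROUND_HALF_UP)
        digits.append(format(frac, "f").split(".")[1])
    return digits


def interleave(*fields: str) -> str:
    return "".join("".join(group) for group in zip(*fields))


fields = normalize(("57.79", "101.81", "4349", "0.88"))
print(fields)   # ['321056', '282806', '086980', '880000']
encoded = interleave(*fields)
print(encoded)  # 320828881260089050806600
print(" ".join(encoded[i:i + 4] for i in range(0, len(encoded), 4)))
# 3208 2888 1260 0890 5080 6600
```

After normalization each digit position covers the same fraction of every dimension's full range, so a prefix no longer mixes a narrow range in one dimension with a wide range in another.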
- the geoparser can use a mixed encoding strategy in which the encoding scheme bins one or more of the coordinates and represents the binned coordinates in a way that excludes them from the interleaved coordinate encoding.
- the following bins can be defined:
- phrase search query-capable text search engines or any of the listed prefixes for an engine that does not necessarily support either phrase searches or wildcard searches.
- the interleaving scheme described above can be applied to coordinates from any affine space.
- Geographic mapping projections are examples of affine space coordinates. They often use sphere-like coordinates on the globe. Common examples include "unprojected" latitude-longitude and Universal Transverse Mercator (UTM).
- Grid coordinate systems, also known as "hierarchical" coordinate systems, such as the military grid reference system (MGRS) and the quaternary triangular mesh (QTM), are already in a hierarchical representation. Such grid coordinate systems do not need to be interleaved.
- QTM embeds an octahedron in the earth and then subdivides its triangular faces into four triangles, which are further subdivided into four triangles ad infinitum.
- Each face of the octahedron is numbered 0 to 7, and each triangular subdivision is numbered 0 to 3.
- The vertices of the polyhedron are then projected to the surface along radial lines of the sphere. Any point on the surface can now be specified to any level of precision with a longer or shorter string of digits, where the first symbol ranges from 0 to 7, and each subsequent symbol ranges from 0 to 3.
- a trailing wild card query retrieves all locations within the last triangle number specified in the query.
- the grid string can be formatted for the various types of generic query styles. For example,
- Most text search engines provide results with snippets of text containing instances of the search words from the original documents.
- the geoparser adds extra information to the existing encodings by appending one or more letter/number pairs to the encoded string.
- The search engine retrieves this information to help the user locate within the text of the document the geotags of interest. For example, in order to indicate that the words used to make a particular geotag started 12 characters preceding the first character in this geotag, the letter/number pair "c12" is added, as follows: magicstringA2012 0302 1023 0203 012c12.
- the addition of such information to the geographical metadata information allows the application that presents search results to the user to do so in a way that is more intelligible to the user.
- The system can highlight the geotags in one color and their normalized representations in another color.
- the map user interface constructs the desired query from multiple sub-queries.
- the mapping application takes a domain specified by user input and converts it to a set of multiple queries that use generic query styles, such as trailing wildcards or phrases. The mapping application then combines these multiple queries with Boolean OR operators to form a single query expression.
- the mapping application sends multiple queries to the text search engine. In the latter case, the mapping application may have to combine several result lists that are returned by the search engine and it may have to trim results that fall outside the range intended by the user's input.
- Trimming is done by searching through the returned documents and identifying those for which the geospatial references fall outside of the user's specified range. But since the set of returned documents is usually small in number in comparison to the number stored in the repository, the trimming operation is typically not that time consuming.
- An example of multiple queries is illustrated in Fig. 4A, in which the bold-lined box 302 indicates the rectangular range queried by a user. According to the method shown in Fig. 4A, the mapping application merges four sub-queries, indicated by boxes 304, 306, 308, and 310, and then trims results that fall outside the bold box. Alternatively, the mapping application generates a single four-part OR query for results falling in boxes 304, 306, 308, or 310, and then trims the results.
- the mapping application merges six sub-queries indicated by boxes 312, 314, 316, 318, 320, and 322, or alternatively generates a single six-part Boolean OR query.
- This method requires no trimming; however, it requires that the boxes be defined so that their boundaries fall on the boundary of the bold box. Meeting the second condition might require using a box size that is so small that the number of searches that need to be performed by the search engine seriously deteriorates the efficiency of the procedure.
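The trade-off between a few large boxes with trimming and many small boxes without it can be sketched as follows. The sketch assumes whole-degree, origin-shifted (non-negative) coordinates with inclusive bounds and three digits per coordinate; the function names and sample hits are hypothetical.

```python
def covering_prefixes(lat_range, lon_range, cell: int = 10, width: int = 3) -> list[str]:
    """Interleaved prefixes of all cell-degree boxes that overlap the query box."""
    (lat_lo, lat_hi), (lon_lo, lon_hi) = lat_range, lon_range
    keep = width - len(str(cell)) + 1  # digits kept per coordinate
    prefixes = []
    for lat in range(lat_lo // cell * cell, lat_hi + 1, cell):
        for lon in range(lon_lo // cell * cell, lon_hi + 1, cell):
            a = str(lat).zfill(width)[:keep]
            b = str(lon).zfill(width)[:keep]
            prefixes.append("".join(x + y for x, y in zip(a, b)))
    return prefixes


def trim(hits, lat_range, lon_range):
    """Drop (lat, lon, doc_id) hits that fall outside the user's box."""
    return [h for h in hits
            if lat_range[0] <= h[0] <= lat_range[1]
            and lon_range[0] <= h[1] <= lon_range[1]]


box = ((55, 64), (105, 114))
coarse = covering_prefixes(*box, cell=10)
print(" OR ".join(p + "*" for p in coarse))  # 0150* OR 0151* OR 0160* OR 0161*

fine = covering_prefixes(*box, cell=1)
print(len(fine))  # 100 sub-queries, but nothing to trim

hits = [(57, 108, "docA"), (52, 101, "docB")]  # both lie inside box 0150
print(trim(hits, *box))  # [(57, 108, 'docA')]
```

Four coarse sub-queries over-fetch and must be trimmed, while exact tiling with one-degree boxes needs a hundred sub-queries for the same region, which is the efficiency concern raised above.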
- The enhanced map search user interface might query multiple search engines. Since the different search engines might handle different generic query styles more or less efficiently, they can be "wrapped" in different embodiments of this invention. One might be set up to use trailing wildcard generic query styles to implement range queries, and another might be set up to use phrase search generic query styles.
- When the client receives results from the various search engines, it can merge the results into one or more result sets to present to the user.
- confidence scores are typically generated by the geoparser to indicate the likelihood that a particular coordinate was intended by the author of the document.
- The most powerful way to incorporate confidence scores into a search engine is to enhance the index so that each word carries with it a general confidence value.
- Such a general confidence value can be assigned to any type of word, geographic or non-geographic, and can be used to indicate the likelihood that the author intended for that word to be in the document.
- most of the words were written by the author, so most of them have 100% confidence.
- When metadata is added to the document by various automated processes, some of the text may have less than 100% confidence.
- A scoring function operating on a result list can utilize this per-term confidence information directly as a generic feature in the search engine. If a search engine does not support this notion of confidence, then it can be incorporated into the specially formatted hierarchical strings using either the confidence binning method or by treating it as an additional affine coordinate, as described above. Either of these methods requires the enhanced map search interface to formulate queries for ranges or bins of confidence, and thus to enforce the impact of confidence on the relevance from outside the search engine.
- the client issuing the queries does this by using a generic query style to first request documents within a high confidence range or bin, e.g., greater than 80% confidence, and then if not enough results are returned, the client can request additional documents in a lower range or bin.
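The client-side fallback from high to lower confidence bins might be sketched like this. The `search` callback, the bin tokens such as `confhigh`, and the canned hit lists are all hypothetical stand-ins; the source does not define how bins are spelled in a query.

```python
from typing import Callable, Sequence


def tiered_search(search: Callable[[str], Sequence[str]],
                  queries_high_to_low: Sequence[str],
                  wanted: int) -> list[str]:
    """Issue queries from the highest confidence bin downward until
    at least `wanted` distinct document identifiers have been collected."""
    results: list[str] = []
    seen: set[str] = set()
    for query in queries_high_to_low:
        for doc_id in search(query):
            if doc_id not in seen:
                seen.add(doc_id)
                results.append(doc_id)
        if len(results) >= wanted:
            break  # enough results; skip the lower-confidence bins
    return results


# Stand-in for a real engine: a canned mapping from query text to hits.
FAKE_HITS = {
    "roadblock magicstring0150* confhigh": ["doc1"],
    "roadblock magicstring0150* confmid": ["doc1", "doc2", "doc3"],
    "roadblock magicstring0150* conflow": ["doc4"],
}
hits = tiered_search(lambda q: FAKE_HITS.get(q, []), list(FAKE_HITS), wanted=2)
print(hits)  # ['doc1', 'doc2', 'doc3']
```

Because higher-confidence results are appended first, the merged list is already ordered by confidence tier, which is one simple way to let confidence influence ranking from outside the engine.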
- An enhanced search engine can incorporate confidence values directly into its relevance computation in a variety of ways, including simply multiplying the document's relevance by the highest confidence that matches the constraint.
- the specially formatted geographic strings are particularly effective, because they become part of the document without warping the length of document. Regardless of which method is used, both methods associate the specially formatted geographic strings with specific regions of text in the document. The geographic strings are given word positions in the text. This means that they are automatically and seamlessly incorporated into any word-proximity calculation performed by the search engine's generic relevance calculation. Even with the warping of the inline insertion method, this provides dramatically better results than attempting to merge results from two separate indices.
- the third enhancement contemplated relates to term frequencies.
- relevance functions use the frequency of a term to determine its importance. Intuitively, one expects that rare words are more important than common words included in a user's search.
- The frequencies of occurrence are calculated by dividing the number of occurrences of the word by the total number of words.
- TDF term-document frequency
- TCF term-corpus frequency
- Relevance calculations typically include various functions involving logarithms and other mathematical curves applied to the ratio of these two frequencies. If the total number of words in the collection or in a document includes all the specially formatted hierarchical strings, then the relevance function might be warped by their presence. This can be avoided by constructing a relevance function that ignores the magicstring words in its counting of word occurrences.
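A minimal sketch of a frequency count that skips the inserted strings is shown below. It assumes the wildcard-style encoding, where each geographic string is a single token beginning with the magic string; the phrase-style encoding would need position-based filtering instead. The function name and sample tokens are illustrative.

```python
def term_frequency(term: str, tokens: list[str], magic: str = "magicstring") -> float:
    """Occurrences of `term` divided by the word count, ignoring geoparser tokens."""
    counted = [t for t in tokens if not t.startswith(magic)]
    return counted.count(term) / len(counted) if counted else 0.0


tokens = "roadblock near town magicstring0150717891 roadblock".split()

naive = tokens.count("roadblock") / len(tokens)
print(naive)                                # 0.4, diluted by the inserted string
print(term_frequency("roadblock", tokens))  # 0.5, based on the author's words only
```

The same filter would apply when counting words across the whole corpus, so that heavily geotagged documents are not penalized for the extra tokens.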
- The text string encoding of the spatial coordinate systems can be interleaved in different orders, such as by taking a digit of the longitude before the corresponding digit of latitude, or by taking the altitude digit first.
- confidence information can be combined with the spatial coordinate-derived text string according to other encoding schemes, as long as a key word query can be formulated for the desired searches.
- Geospatial ranges can be two-dimensional, three-dimensional, or n-dimensional, each with regular or arbitrarily defined boundaries. The ranges can be measured in familiar "absolute" coordinates, such as latitude and longitude, or in relative coordinates, such as coordinates with respect to an arbitrary point.
- Any desired coordinate normalization scheme can be used that offers users the ability to specify geospatial ranges of interest. Such ranges can include similar absolute ranges in each of several dimensions, or disparate ranges in one or more of the dimensions.
- the geographic string formats can be applied to any hierarchical coordinate system or hierarchical representation of any affine space.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CA002566280A CA2566280A1 (en) | 2004-05-19 | 2005-05-19 | Systems and methods of geographical text indexing |
JP2007527466A JP2007538343A (en) | 2004-05-19 | 2005-05-19 | Geographic text indexing system and method |
EP05751762A EP1763799A1 (en) | 2004-05-19 | 2005-05-19 | Systems and methods of geographical text indexing |
AU2005246368A AU2005246368A1 (en) | 2004-05-19 | 2005-05-19 | Systems and methods of geographical text indexing |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US57255804P | 2004-05-19 | 2004-05-19 | |
US60/572,558 | 2004-05-19 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2005114484A1 true WO2005114484A1 (en) | 2005-12-01 |
Family
ID=34970556
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2005/017697 WO2005114484A1 (en) | 2004-05-19 | 2005-05-19 | Systems and methods of geographical text indexing |
Country Status (6)
Country | Link |
---|---|
US (1) | US20050278378A1 (en) |
EP (1) | EP1763799A1 (en) |
JP (1) | JP2007538343A (en) |
AU (1) | AU2005246368A1 (en) |
CA (1) | CA2566280A1 (en) |
WO (1) | WO2005114484A1 (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2007334799A (en) * | 2006-06-19 | 2007-12-27 | Fujitsu Ltd | Information provision program, recording medium which records the program, information provision device and information provision method |
WO2008019344A2 (en) * | 2006-08-04 | 2008-02-14 | Metacarta, Inc. | Systems and methods for obtaining and using information from map images |
WO2010070410A1 (en) | 2008-12-16 | 2010-06-24 | Foundation For Research And Technology-Hellas | System and method for classifying and storing related forms of data |
US7908280B2 (en) | 2000-02-22 | 2011-03-15 | Nokia Corporation | Query method involving more than one corpus of documents |
US8200676B2 (en) | 2005-06-28 | 2012-06-12 | Nokia Corporation | User interface for geographic search |
US9286404B2 (en) | 2006-06-28 | 2016-03-15 | Nokia Technologies Oy | Methods of systems using geographic meta-metadata in information retrieval and document displays |
US9411896B2 (en) | 2006-02-10 | 2016-08-09 | Nokia Technologies Oy | Systems and methods for spatial thumbnails and companion maps for media objects |
US9436715B2 (en) | 2012-03-30 | 2016-09-06 | Fujitsu Limited | Data management apparatus and data management method |
US9721157B2 (en) | 2006-08-04 | 2017-08-01 | Nokia Technologies Oy | Systems and methods for obtaining and using information from map images |
Families Citing this family (94)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA2328795A1 (en) | 2000-12-19 | 2002-06-19 | Advanced Numerical Methods Ltd. | Applications and performance enhancements for detail-in-context viewing technology |
US8416266B2 (en) | 2001-05-03 | 2013-04-09 | Noregin Assetts N.V., L.L.C. | Interacting with detail-in-context presentations |
CA2345803A1 (en) | 2001-05-03 | 2002-11-03 | Idelix Software Inc. | User interface elements for pliable display technology implementations |
US7213214B2 (en) | 2001-06-12 | 2007-05-01 | Idelix Software Inc. | Graphical user interface with zoom for detail-in-context presentations |
US9760235B2 (en) | 2001-06-12 | 2017-09-12 | Callahan Cellular L.L.C. | Lens-defined adjustment of displays |
US7084886B2 (en) | 2002-07-16 | 2006-08-01 | Idelix Software Inc. | Using detail-in-context lenses for accurate digital image cropping and measurement |
CA2361341A1 (en) | 2001-11-07 | 2003-05-07 | Idelix Software Inc. | Use of detail-in-context presentation on stereoscopically paired images |
CA2370752A1 (en) | 2002-02-05 | 2003-08-05 | Idelix Software Inc. | Fast rendering of pyramid lens distorted raster images |
US8120624B2 (en) | 2002-07-16 | 2012-02-21 | Noregin Assets N.V. L.L.C. | Detail-in-context lenses for digital image cropping, measurement and online maps |
US20070064018A1 (en) * | 2005-06-24 | 2007-03-22 | Idelix Software Inc. | Detail-in-context lenses for online maps |
CA2393887A1 (en) | 2002-07-17 | 2004-01-17 | Idelix Software Inc. | Enhancements to user interface for detail-in-context data presentation |
CA2406047A1 (en) | 2002-09-30 | 2004-03-30 | Ali Solehdin | A graphical user interface for digital media and network portals using detail-in-context lenses |
CA2449888A1 (en) | 2003-11-17 | 2005-05-17 | Idelix Software Inc. | Navigating large images using detail-in-context fisheye rendering techniques |
CA2411898A1 (en) | 2002-11-15 | 2004-05-15 | Idelix Software Inc. | A method and system for controlling access to detail-in-context presentations |
US9489853B2 (en) * | 2004-09-27 | 2016-11-08 | Kenneth Nathaniel Sherman | Reading and information enhancement system and method |
US7486302B2 (en) | 2004-04-14 | 2009-02-03 | Noregin Assets N.V., L.L.C. | Fisheye lens graphical user interfaces |
US8106927B2 (en) | 2004-05-28 | 2012-01-31 | Noregin Assets N.V., L.L.C. | Graphical user interfaces and occlusion prevention for fisheye lenses with line segment foci |
US9317945B2 (en) | 2004-06-23 | 2016-04-19 | Callahan Cellular L.L.C. | Detail-in-context lenses for navigation |
US7714859B2 (en) | 2004-09-03 | 2010-05-11 | Shoemaker Garth B D | Occlusion reduction and magnification for multidimensional data presentations |
US7995078B2 (en) | 2004-09-29 | 2011-08-09 | Noregin Assets, N.V., L.L.C. | Compound lenses for multi-source data presentation |
US7801897B2 (en) * | 2004-12-30 | 2010-09-21 | Google Inc. | Indexing documents according to geographical relevance |
US7650345B2 (en) * | 2005-02-28 | 2010-01-19 | Microsoft Corporation | Entity lookup system |
US7580036B2 (en) | 2005-04-13 | 2009-08-25 | Catherine Montagnese | Detail-in-context terrain displacement algorithm with optimizations |
US7933395B1 (en) * | 2005-06-27 | 2011-04-26 | Google Inc. | Virtual tour of user-defined paths in a geographic information system |
CA2928051C (en) * | 2005-07-15 | 2018-07-24 | Indxit Systems, Inc. | Systems and methods for data indexing and processing |
US8031206B2 (en) | 2005-10-12 | 2011-10-04 | Noregin Assets N.V., L.L.C. | Method and system for generating pyramid fisheye lens detail-in-context presentations |
EP1840511B1 (en) * | 2006-03-31 | 2016-03-02 | BlackBerry Limited | Methods and apparatus for retrieving and displaying map-related data for visually displayed maps of mobile communication devices |
US7983473B2 (en) | 2006-04-11 | 2011-07-19 | Noregin Assets, N.V., L.L.C. | Transparency adjustment of a presentation |
US20070271259A1 (en) * | 2006-05-17 | 2007-11-22 | It Interactive Services Inc. | System and method for geographically focused crawling |
US20080010273A1 (en) | 2006-06-12 | 2008-01-10 | Metacarta, Inc. | Systems and methods for hierarchical organization and presentation of geographic search results |
US7747562B2 (en) * | 2006-08-15 | 2010-06-29 | International Business Machines Corporation | Virtual multidimensional datasets for enterprise software systems |
US7895150B2 (en) * | 2006-09-07 | 2011-02-22 | International Business Machines Corporation | Enterprise planning and performance management system providing double dispatch retrieval of multidimensional data |
EP2070006B1 (en) * | 2006-09-08 | 2015-07-01 | FortiusOne, Inc. | System and method for web enabled geo-analytics and image processing |
US20080082578A1 (en) | 2006-09-29 | 2008-04-03 | Andrew Hogue | Displaying search results on a one or two dimensional graph |
US8918755B2 (en) * | 2006-10-17 | 2014-12-23 | International Business Machines Corporation | Enterprise performance management software system having dynamic code generation |
US20080208847A1 (en) * | 2007-02-26 | 2008-08-28 | Fabian Moerchen | Relevance ranking for document retrieval |
US8347202B1 (en) * | 2007-03-14 | 2013-01-01 | Google Inc. | Determining geographic locations for place names in a fact repository |
US8024454B2 (en) * | 2007-03-28 | 2011-09-20 | Yahoo! Inc. | System and method for associating a geographic location with an internet protocol address |
US8621064B2 (en) * | 2007-03-28 | 2013-12-31 | Yahoo! Inc. | System and method for associating a geographic location with an Internet protocol address |
US8244772B2 (en) * | 2007-03-29 | 2012-08-14 | Franz, Inc. | Method for creating a scalable graph database using coordinate data elements |
CN101627620B (en) * | 2007-05-31 | 2011-10-19 | 株式会社Pfu | Electronic document encryption system, decoding system, and method |
US7747988B2 (en) | 2007-06-15 | 2010-06-29 | Microsoft Corporation | Software feature usage analysis and reporting |
US7765216B2 (en) * | 2007-06-15 | 2010-07-27 | Microsoft Corporation | Multidimensional analysis tool for high dimensional data |
US7870114B2 (en) | 2007-06-15 | 2011-01-11 | Microsoft Corporation | Efficient data infrastructure for high dimensional data analysis |
US9026938B2 (en) | 2007-07-26 | 2015-05-05 | Noregin Assets N.V., L.L.C. | Dynamic detail-in-context user interface for application access and content access on electronic displays |
US8060535B2 (en) * | 2007-08-08 | 2011-11-15 | Siemens Enterprise Communications, Inc. | Method and apparatus for information and document management |
US20090165116A1 (en) * | 2007-12-20 | 2009-06-25 | Morris Robert P | Methods And Systems For Providing A Trust Indicator Associated With Geospatial Information From A Network Entity |
FR2929778B1 (en) * | 2008-04-07 | 2012-05-04 | Canon Kk | METHODS AND DEVICES FOR ITERATIVE BINARY CODING AND DECODING FOR XML TYPE DOCUMENTS. |
US8463774B1 (en) * | 2008-07-15 | 2013-06-11 | Google Inc. | Universal scores for location search queries |
US7991756B2 (en) * | 2008-08-12 | 2011-08-02 | International Business Machines Corporation | Adding low-latency updateable metadata to a text index |
CN101661461B (en) * | 2008-08-29 | 2016-01-13 | 阿里巴巴集团控股有限公司 | Determine the method for core geographic information in document, system |
US8060582B2 (en) * | 2008-10-22 | 2011-11-15 | Google Inc. | Geocoding personal information |
US8402058B2 (en) * | 2009-01-13 | 2013-03-19 | Ensoco, Inc. | Method and computer program product for geophysical and geologic data identification, geodetic classification, organization, updating, and extracting spatially referenced data records |
US20100179754A1 (en) * | 2009-01-15 | 2010-07-15 | Robert Bosch Gmbh | Location based system utilizing geographical information from documents in natural language |
KR20100101204A (en) * | 2009-03-09 | 2010-09-17 | 한국전자통신연구원 | Method for retrievaling ucc image region of interest based |
US8275546B2 (en) * | 2009-09-29 | 2012-09-25 | Microsoft Corporation | Travelogue-based travel route planning |
US8281246B2 (en) * | 2009-09-29 | 2012-10-02 | Microsoft Corporation | Travelogue-based contextual map generation |
US8977632B2 (en) * | 2009-09-29 | 2015-03-10 | Microsoft Technology Licensing, Llc | Travelogue locating mining for travel suggestion |
US8204886B2 (en) * | 2009-11-06 | 2012-06-19 | Nokia Corporation | Method and apparatus for preparation of indexing structures for determining similar points-of-interests |
US8706717B2 (en) * | 2009-11-13 | 2014-04-22 | Oracle International Corporation | Method and system for enterprise search navigation |
US9009163B2 (en) * | 2009-12-08 | 2015-04-14 | Intellectual Ventures Fund 83 Llc | Lazy evaluation of semantic indexing |
US9557735B2 (en) * | 2009-12-10 | 2017-01-31 | Fisher-Rosemount Systems, Inc. | Methods and apparatus to manage process control status rollups |
US20110196602A1 (en) * | 2010-02-08 | 2011-08-11 | Navteq North America, Llc | Destination search in a navigation system using a spatial index structure |
US8676807B2 (en) | 2010-04-22 | 2014-03-18 | Microsoft Corporation | Identifying location names within document text |
US8572076B2 (en) | 2010-04-22 | 2013-10-29 | Microsoft Corporation | Location context mining |
US8489641B1 (en) * | 2010-07-08 | 2013-07-16 | Google Inc. | Displaying layers of search results on a map |
US8566026B2 (en) * | 2010-10-08 | 2013-10-22 | Trip Routing Technologies, Inc. | Selected driver notification of transitory roadtrip events |
CA2760624C (en) * | 2010-12-07 | 2015-04-07 | Rakuten, Inc. | Server, dictionary creation method, dictionary creation program, and computer-readable recording medium recording the program |
WO2012082859A1 (en) * | 2010-12-14 | 2012-06-21 | The Regents Of The University Of California | High efficiency prefix search algorithm supporting interactive, fuzzy search on geographical structured data |
TWI431491B (en) * | 2010-12-20 | 2014-03-21 | King Yuan Electronics Co Ltd | Comparison device and method for comparing test pattern files of a wafer tester |
US8626681B1 (en) * | 2011-01-04 | 2014-01-07 | Google Inc. | Training a probabilistic spelling checker from structured data |
US20120282950A1 (en) * | 2011-05-06 | 2012-11-08 | Gopogo, Llc | Mobile Geolocation String Building System And Methods Thereof |
CN103609144A (en) * | 2011-06-16 | 2014-02-26 | 诺基亚公司 | Method and apparatus for resolving geo-identity |
US8688688B1 (en) | 2011-07-14 | 2014-04-01 | Google Inc. | Automatic derivation of synonym entity names |
JP2013065116A (en) * | 2011-09-15 | 2013-04-11 | Fujitsu Ltd | Information management method and information management apparatus |
JP5782948B2 (en) * | 2011-09-15 | 2015-09-24 | 富士通株式会社 | Information management method and information management apparatus |
US20130117719A1 (en) * | 2011-11-07 | 2013-05-09 | Sap Ag | Context-Based Adaptation for Business Applications |
JP5670944B2 (en) * | 2012-03-29 | 2015-02-18 | 日本電信電話株式会社 | Document summarization apparatus, method and program |
JP6032467B2 (en) * | 2012-06-18 | 2016-11-30 | 株式会社日立製作所 | Spatio-temporal data management system, spatio-temporal data management method, and program thereof |
US9262511B2 (en) * | 2012-07-30 | 2016-02-16 | Red Lambda, Inc. | System and method for indexing streams containing unstructured text data |
US8595317B1 (en) | 2012-09-14 | 2013-11-26 | Geofeedr, Inc. | System and method for generating, accessing, and updating geofeeds |
WO2014071055A1 (en) * | 2012-10-31 | 2014-05-08 | Virtualbeam, Inc. | Distributed association engine |
US9311416B1 (en) * | 2012-12-31 | 2016-04-12 | Google Inc. | Selecting content using a location feature index |
US10229415B2 (en) | 2013-03-05 | 2019-03-12 | Google Llc | Computing devices and methods for identifying geographic areas that satisfy a set of multiple different criteria |
WO2015014189A1 (en) * | 2013-08-02 | 2015-02-05 | 优视科技有限公司 | Method and device for accessing website |
US11138243B2 (en) | 2014-03-06 | 2021-10-05 | International Business Machines Corporation | Indexing geographic data |
US20150278860A1 (en) * | 2014-03-25 | 2015-10-01 | Google Inc. | Dynamically determining a search radius to select online content |
EP3143526A4 (en) | 2014-05-12 | 2017-10-04 | Diffeo, Inc. | Entity-centric knowledge discovery |
US11194865B2 (en) * | 2017-04-21 | 2021-12-07 | Visa International Service Association | Hybrid approach to approximate string matching using machine learning |
CN108776667B (en) * | 2018-05-04 | 2022-10-21 | 昆明理工大学 | Space keyword query method and device based on geohash and B-Tree |
US11140128B2 (en) * | 2018-10-05 | 2021-10-05 | Palo Alto Research Center Incorporated | Hierarchical geographic naming associated to a recursively subdivided geographic grid referencing |
KR102206289B1 (en) * | 2019-06-05 | 2021-01-22 | 네이버 주식회사 | Method and system for integrating poi search coverage |
CN114791942B (en) * | 2022-06-21 | 2022-09-20 | 广东省智能机器人研究院 | Spatial text density clustering retrieval method |
CN115269500B (en) * | 2022-08-01 | 2023-05-30 | 生态环境部卫星环境应用中心 | Ecological environment data storage method, ecological environment data retrieval method and electronic equipment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1072987A1 (en) * | 1999-07-29 | 2001-01-31 | International Business Machines Corporation | Geographic web browser and iconic hyperlink cartography |
US20010011270A1 (en) * | 1998-10-28 | 2001-08-02 | Martin W. Himmelstein | Method and apparatus of expanding web searching capabilities |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
FR2684214B1 (en) * | 1991-11-22 | 1997-04-04 | Sepro Robotique | INDEXING CARD FOR GEOGRAPHIC INFORMATION SYSTEM AND SYSTEM INCLUDING APPLICATION. |
DE69422406T2 (en) * | 1994-10-28 | 2000-05-04 | Hewlett Packard Co | Method for performing data chain comparison |
US5659732A (en) * | 1995-05-17 | 1997-08-19 | Infoseek Corporation | Document retrieval over networks wherein ranking and relevance scores are computed at the client for multiple database documents |
US5893093A (en) * | 1997-07-02 | 1999-04-06 | The Sabre Group, Inc. | Information search and retrieval with geographical coordinates |
US5845278A (en) * | 1997-09-12 | 1998-12-01 | Infoseek Corporation | Method for automatically selecting collections to search in full text searches |
US5991754A (en) * | 1998-12-28 | 1999-11-23 | Oracle Corporation | Rewriting a query in terms of a summary based on aggregate computability and canonical format, and when a dimension table is on the child side of an outer join |
US6493711B1 (en) * | 1999-05-05 | 2002-12-10 | H5 Technologies, Inc. | Wide-spectrum information search engine |
US6556990B1 (en) * | 2000-05-16 | 2003-04-29 | Sun Microsystems, Inc. | Method and apparatus for facilitating wildcard searches within a relational database |
US20020107918A1 (en) * | 2000-06-15 | 2002-08-08 | Shaffer James D. | System and method for capturing, matching and linking information in a global communications network |
US6741981B2 (en) * | 2001-03-02 | 2004-05-25 | The United States Of America As Represented By The Administrator Of The National Aeronautics And Space Administration (Nasa) | System, method and apparatus for conducting a phrase search |
2005
- 2005-05-19 EP EP05751762A patent/EP1763799A1/en not_active Withdrawn
- 2005-05-19 CA CA002566280A patent/CA2566280A1/en not_active Abandoned
- 2005-05-19 JP JP2007527466A patent/JP2007538343A/en active Pending
- 2005-05-19 WO PCT/US2005/017697 patent/WO2005114484A1/en not_active Application Discontinuation
- 2005-05-19 AU AU2005246368A patent/AU2005246368A1/en not_active Abandoned
- 2005-05-19 US US11/133,138 patent/US20050278378A1/en not_active Abandoned
Non-Patent Citations (2)
Title |
---|
POULIQUEN B ; STEINBERGER R ; IGNAT C ; DE GROEVE T: "Geographical information recognition and visualization in texts written in various languages", PROCEEDINGS OF THE 2004 ACM SYMPOSIUM ON APPLIED COMPUTING, 14 March 2004 (2004-03-14), pages 1051 - 1058, XP002340571, ISBN: 1-58113-812-1 * |
REES T: ""C-Squares", a New Spatial Indexing System and its Applicability to the Description of Oceanographic Datasets", OCEANOGRAPHY, vol. 16, no. 1, March 2003 (2003-03-01), pages 11 - 19, XP002340570, Retrieved from the Internet <URL:http://www.marine.csiro.au/csquares/csq-article-Mar03.pdf> [retrieved on 20050811] * |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9201972B2 (en) | 2000-02-22 | 2015-12-01 | Nokia Technologies Oy | Spatial indexing of documents |
US7917464B2 (en) | 2000-02-22 | 2011-03-29 | Metacarta, Inc. | Geotext searching and displaying results |
US7908280B2 (en) | 2000-02-22 | 2011-03-15 | Nokia Corporation | Query method involving more than one corpus of documents |
US7953732B2 (en) | 2000-02-22 | 2011-05-31 | Nokia Corporation | Searching by using spatial document and spatial keyword document indexes |
US8200676B2 (en) | 2005-06-28 | 2012-06-12 | Nokia Corporation | User interface for geographic search |
US9411896B2 (en) | 2006-02-10 | 2016-08-09 | Nokia Technologies Oy | Systems and methods for spatial thumbnails and companion maps for media objects |
US9684655B2 (en) | 2006-02-10 | 2017-06-20 | Nokia Technologies Oy | Systems and methods for spatial thumbnails and companion maps for media objects |
US10810251B2 (en) | 2006-02-10 | 2020-10-20 | Nokia Technologies Oy | Systems and methods for spatial thumbnails and companion maps for media objects |
US11645325B2 (en) | 2006-02-10 | 2023-05-09 | Nokia Technologies Oy | Systems and methods for spatial thumbnails and companion maps for media objects |
JP2007334799A (en) * | 2006-06-19 | 2007-12-27 | Fujitsu Ltd | Information provision program, recording medium which records the program, information provision device and information provision method |
US9286404B2 (en) | 2006-06-28 | 2016-03-15 | Nokia Technologies Oy | Methods of systems using geographic meta-metadata in information retrieval and document displays |
WO2008019344A2 (en) * | 2006-08-04 | 2008-02-14 | Metacarta, Inc. | Systems and methods for obtaining and using information from map images |
WO2008019344A3 (en) * | 2006-08-04 | 2008-03-27 | Metacarta Inc | Systems and methods for obtaining and using information from map images |
US9721157B2 (en) | 2006-08-04 | 2017-08-01 | Nokia Technologies Oy | Systems and methods for obtaining and using information from map images |
WO2010070410A1 (en) | 2008-12-16 | 2010-06-24 | Foundation For Research And Technology-Hellas | System and method for classifying and storing related forms of data |
US9436715B2 (en) | 2012-03-30 | 2016-09-06 | Fujitsu Limited | Data management apparatus and data management method |
Also Published As
Publication number | Publication date |
---|---|
AU2005246368A1 (en) | 2005-12-01 |
CA2566280A1 (en) | 2005-12-01 |
JP2007538343A (en) | 2007-12-27 |
US20050278378A1 (en) | 2005-12-15 |
EP1763799A1 (en) | 2007-03-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20050278378A1 (en) | Systems and methods of geographical text indexing | |
US8015183B2 (en) | System and methods for providing statstically interesting geographical information based on queries to a geographic search engine | |
US9721157B2 (en) | Systems and methods for obtaining and using information from map images | |
US7801893B2 (en) | Similarity detection and clustering of images | |
Faloutsos | Searching multimedia databases by content | |
CN110399457A (en) | A kind of intelligent answer method and system | |
US20080059452A1 (en) | Systems and methods for obtaining and using information from map images | |
US20080065685A1 (en) | Systems and methods for presenting results of geographic text searches | |
KR20060047885A (en) | Method and system for schema matching of web databases | |
JP2005525659A (en) | Apparatus and method for retrieving structured content, semi-structured content, and unstructured content | |
Simpson | XPath and XPointer: Locating Content in XML Documents | |
US8700661B2 (en) | Full text search using R-trees | |
US7979452B2 (en) | System and method for retrieving task information using task-based semantic indexes | |
Gog et al. | Improved single-term top-k document retrieval | |
JP3430273B2 (en) | Database search device and database search method | |
JP3578045B2 (en) | Full-text search method and apparatus, and storage medium storing full-text search program | |
Deng et al. | LAF: a new XML encoding and indexing strategy for keyword‐based XML search | |
Chen | Building a web‐snippet clustering system based on a mixed clustering method | |
Sabri et al. | Performance Analysis for Mining Images of Deep Web | |
Ohr | NASH: Range Search over Temporal, Numerical, and Geographical Annotated Documents | |
Zezula et al. | Processing XML queries with tree signatures | |
Lee et al. | Spatial knowledge representation for iconic image database | |
Sideridis et al. | Fragkiskos Gryllakis | |
Dominick | Models for graphically-enhanced data base management system design. | |
Thom | Design of Document Database Systems |
Legal Events
Code | Title | Description |
---|---|---|
AK | Designated states | Kind code of ref document: A1; Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KM KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NG NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW |
AL | Designated countries for regional patents | Kind code of ref document: A1; Designated state(s): GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
WWE | Wipo information: entry into national phase | Ref document number: 2566280; Country of ref document: CA |
WWE | Wipo information: entry into national phase | Ref document number: 2005246368; Country of ref document: AU |
WWE | Wipo information: entry into national phase | Ref document number: 2005751762; Country of ref document: EP; Ref document number: 2007527466; Country of ref document: JP |
NENP | Non-entry into the national phase | Ref country code: DE |
WWW | Wipo information: withdrawn in national office | Ref document number: DE |
ENP | Entry into the national phase | Ref document number: 2005246368; Country of ref document: AU; Date of ref document: 20050519; Kind code of ref document: A |
WWP | Wipo information: published in national office | Ref document number: 2005246368; Country of ref document: AU |
WWP | Wipo information: published in national office | Ref document number: 2005751762; Country of ref document: EP |