WO2014074317A1 - Extraction and clarification of ambiguities for addresses in documents - Google Patents

Extraction and clarification of ambiguities for addresses in documents

Info

Publication number
WO2014074317A1
Authority
WO
WIPO (PCT)
Prior art keywords
names
documents
context
clouds
place names
Prior art date
Application number
PCT/US2013/066493
Other languages
French (fr)
Inventor
Islam EL-ASHI
Mark AYZENSHTAT
Original Assignee
Evernote Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Evernote Corporation
Publication of WO2014074317A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Definitions

  • This application is directed to the field of extracting and analyzing information, especially in conjunction with personal and shared content management systems.
  • Geographic information is one of the most common components of everyday information streams and documents. Consumers are using geographic information in conjunction with shopping, travel, entertainment, communications, healthcare, leisure and many other activities. For businesses, geographic information defines production and supplier sites, local and worldwide offices, transportation, inventories and distributions centers, postal delivery, etc. Urban and rural planning by municipal, state and federal governments, military planning and intelligence are all utilizing vast amounts of geographic information.
  • GNIS Geographic Names Information System
  • GNS GEOnet Names Server
  • TGN Getty Thesaurus of Geographic Names
  • mapping applications are offering individuals comprehensive and intuitive ways to obtain geographical information. Accordingly, geographical data are becoming an increasingly common component of different types of personal content, such as multimedia electronic notes supported by various note-taking software, for example, the Evernote service and software, developed by the Evernote Corporation of Redwood City, California.
  • the purpose of including geographical names into a document may be discovered only by analyzing additional content of the document (which, in its turn, may provide verification of the validity of geographical name).
  • extracting and clarifying ambiguities of addresses in documents includes constructing a plurality of context clouds, where each of the context clouds corresponds to place names and homographs therefor, determining place names in each of the documents using a dictionary of geographic names, clarifying ambiguities for the place names determined in each of the documents using the context clouds and residual documents, where the residual documents correspond to each of the documents having detected place names removed therefrom, and providing a relevance score for each remaining one of the detected place names.
  • Using the context clouds and residual documents may include determining the relevance score of each of the place names and determining a relevance score of corresponding homographs for each of the place names.
  • Detecting place names may include looking up place name words and bigrams in the dictionary of geographic names. Detecting place names may further include using a Named-Entity Recognition technique, wherein recognized Named-Entity Recognition entries are limited only to geographic items that have an associated place name in the dictionary of geographic names.
  • the documents may include notes.
  • a cloud-based content management system with mobile clients may extract and clarify ambiguities of addresses in documents. At least one of the mobile clients may be a mobile device that includes software that is one of: pre-loaded with the device, installed from an app store, and downloaded from a Web site. The mobile device may use an operating system selected from the group consisting of: iOS, Android OS, Windows Phone OS, Blackberry OS and mobile versions of Linux OS.
  • computer software, provided in a non-transitory computer-readable medium, extracts and clarifies ambiguities of addresses in documents.
  • the software includes executable code that constructs a plurality of context clouds, where each of the context clouds corresponds to place names and homographs therefor, executable code that determines place names in each of the documents using a dictionary of geographic names, executable code that clarifies ambiguities for the place names determined in each of the documents using the context clouds and residual documents, where the residual documents correspond to each of the documents having detected place names removed therefrom, and executable code that provides a relevance score for each remaining one of the detected place names.
  • Using the context clouds and residual documents may include determining the relevance score of each of the place names and determining a relevance score of corresponding homographs for each of the place names.
  • the software may also include executable code that constructs a list of place names contained in each of the documents with the corresponding relevance scores. A place name that is determined to be less relevant than a particular homograph therefor may be eliminated.
  • the software may also include executable code that provides generalization for remaining ones of the detected place names to provide additional augmented place names.
  • Executable code that constructs a plurality of context clouds may construct a list of geographic names from existing databases. Executable code that constructs a plurality of context clouds may consult lexical resources to add homographs of the geographical names to detect ambiguities in the geographic names.
  • additional information may be provided based on online resources that include at least one of: encyclopedia sites, news sites, travel sites, and business sites.
  • Web crawlers may be used for at least one of: constructing the list of geographic names, adding homographs, and building context clouds.
  • the relevance score may be based on a tf-idf score and on uniqueness of terms in context clouds.
  • the context clouds may be periodically refreshed to obtain upgrades.
  • Executable code that detects place names may look up place name words and bigrams in the dictionary of geographic names.
  • Executable code that detects place names may use a Named-Entity Recognition technique, where recognized Named-Entity Recognition entries are limited only to geographic items that have an associated place name in the dictionary of geographic names.
  • the documents may include notes.
  • the cloud-based content management system may also include a mobile client device that includes software that is one of: pre-loaded with the device, installed from an app store, and downloaded from a Web site.
  • the mobile device may use an operating system selected from the group consisting of: iOS, Android OS, Windows Phone OS, Blackberry OS and mobile versions of Linux OS.
  • the proposed system uses a custom designed dictionary of geographic names, where entries are enhanced with additional information, for extraction, clarification of ambiguities, and possible generalization of geographic terms and address information contained in personal notes and other documents and for generating an output place list with document specific scores.
  • a system output is intended for use by other applications, including, but not limited to: typeahead search, such as described in U.S. Patent Application No.
  • a custom designed dictionary of geographic names may include two components: a hierarchical dictionary of place names with accompanying location and other information; and
  • a context cloud accompanying each entry and containing key facts, points of interest, remarkable people, news and other data associated with a root entry, and possibly similar context clouds describing each homograph of the root entry.
  • the dictionary of geographic names is constructed in three phases. First, a set of available databases of geographic names may be merged into an initial version of the dictionary, which may represent, due to such combination, both address information (such as in postal databases or in the GNIS) and place information (such as in the TGN).
  • each entry in the initial dictionary may belong to one or more hierarchies built from an administrative, feature or other categorization schemes; for example, the City of San Jose may belong to an administrative hierarchy as follows:
  • a custom web crawler may visit online encyclopedia, dictionary and other special resources to compile lists of homographs associated with dictionary entries and add basic information about each homograph. For example, parsing a disambiguation section of Wikipedia for a key entry "Moscow" yields 23 place names in the US, Canada, India and the UK, and five non-toponymic homographs. Additional search in the homograph section of the Merriam-Webster dictionary adds a place name Moscow River absent from the Wikipedia list.
  • another custom crawler may visit online resources, such as encyclopedia, news, travel, business and other sites related to dictionary entries (in an embodiment, the crawler also may visit similar sites for each homograph of a dictionary entry) and retrieve relevant information, which is subsequently compiled into context clouds, attached to dictionary terms and possibly to homographs; each of the context clouds may subsequently be used to clarify certain ambiguities.
  • Context clouds may include substantive context defining terms based on various relevance metrics, such as tf-idf weighting; these metrics may also be used as a relevance score for retained context defining terms.
  • Characteristic terms may subsequently receive Bayesian weighting based on uniqueness of the characteristic terms; for example, a term that is common for context clouds of both a place name and a homograph may receive a lower score compared with another term that has occurrences only in a context cloud of a place name and is not associated with context clouds of homographs corresponding to the place name.
  • an original tf-idf score of a term in a context cloud (a metric of the relevance of the term within a given corpus of texts forming the context cloud) may be modified by multiplying the score by a non-uniqueness coefficient.
  • ν(t) is a non-uniqueness coefficient of a term t;
  • H is a set of homographs of a given place name (not including the place name itself);
  • C(h) is a context cloud of a homograph h;
  • tf * idf is a tf-idf score of a respective term in the corresponding context cloud.
  • Context clouds may be periodically refreshed by the crawler through re-visiting relevant sites to obtain upgrades or through searching for new sites associated with dictionary terms.
  • NER Named-Entity Recognition
  • ambiguities in extracted place names may be clarified by using a remaining portion of content of a note or a document to verify a relevance of each extracted place name based on context clouds associated with core dictionary entries.
  • synonyms of dictionary entries may also be used to provide context clouds for clarifying ambiguities. This may result in an acceptance or rejection of extracted geographical names and may also define relative scores of retained names.
  • strong connotation of an extracted term with a non-toponymic homonym of the term may cause an immediate rejection, as in the example of the word combination Manhattan Project.
  • one or several substantive terms of a context cloud found in a note and associated with a geographical name extracted from the note may serve as evidence for accepting a name.
  • neither of the words Australia or travel may represent a satisfactory basis for clarifying ambiguous data to determine whether the word Darwin in a note is being used as a place name; the data may remain ambiguous since Charles Darwin traveled extensively during his lifetime, including to Australia.
  • Port Darwin is known as a world capital of lightning, which term would likely be substantive in a correctly compiled context cloud for the geographic name [Port] Darwin and has no particular reason to score high and be substantive in a context cloud for the homograph [Charles] Darwin.
  • the system may be utilizing not only context clouds of immediately extracted place names but also context clouds of adjacent place names in a geographic hierarchy and their geographic neighbors.
  • a word aquarium in a note with a place name Santa Cruz may add evidence in favor of selecting the City of Santa Cruz, California vs. hundreds of toponymic and non-toponymic homographs of the term; such evidence may be obtained by scanning a geographic hierarchy and extracting a neighbor city of Monterey with its landmark Monterey Bay Aquarium.
  • an additional analysis of the retained place names may be performed for possible generalization. For example, in an event that multiple neighboring places are named in a note or a document, one or more of the parents of the places in the dictionary hierarchy may be added as a suggestion to the list of extracted names (for example, if a note contains three place names Cupertino, Sunnyvale and Redwood City then a list of suggestions may include Santa Clara County and/or Silicon Valley Cities as a generalized place name).
  • the system generates a list of place names contained in a note, with corresponding scores indicating certainty and possibly with the scores of closest contenders of the place names in associated lists of homographs.
  • FIG. 1 is a schematic illustration of creation of a custom dictionary of geographic names, according to an embodiment of the system described herein.
  • FIG. 2 is a schematic illustration of retrieval of place names from a note, clarifying ambiguities of the place names, and generalization, according to an embodiment of the system described herein.
  • FIG. 3 is a principal system flow diagram describing a process of extracting and clarifying ambiguities of place names from notes, according to an embodiment of the system described herein.
  • FIG. 4 is a system flow diagram describing the process of building a dictionary of geographic names, according to an embodiment of the system described herein.
  • FIG. 5 is a system flow diagram describing clarifying ambiguities of place names using context clouds, according to an embodiment of the system described herein.
  • FIG. 6 is a system flow diagram describing in more detail clarifying ambiguities of place names using context clouds, according to an embodiment of the system described herein.
  • The system described herein provides a mechanism for extracting place names, such as geographic locations and postal addresses, from notes and documents, and for clarifying ambiguities and generalizing extracted place names.
  • the system uses a specially designed dictionary of geographic names with context clouds accompanying entries in the system to identify and verify geographical terms, separate the terms from homographs and generalize extracted names when possible.
  • the resulting information may be used by other applications such as typeahead search, action notes, note atlas, etc., as explained elsewhere herein.
  • FIG. 1 is a schematic illustration 100 showing creation of a custom dictionary of geographic names.
  • Three phases of creation of the dictionary of geographic names are enumerated I, II, and III, as shown by items 110, 120, 130.
  • various pre-existing sources such as databases of geographic names 140a, 140b, 140c, may be merged into an initial, preliminary version of the dictionary 150.
  • the system may use other licensable sources, such as the official USPS database of US postal addresses and similar databases in other countries.
  • a web crawler 160 may visit dedicated lexical resources, for example, sections of Wikipedia, a homograph section of the Merriam-Webster dictionary, etc.
  • The purpose of phase II is to add homographs as potential sources of ambiguity to the initial version of the dictionary of geographic names, as exemplified by a list of homograph entries 170.
  • another set of web crawlers 180 may visit diverse online resources related to geographic info and place names, such as encyclopedia, news, travel, business, postal and other sites, to compile context clouds 190 associated with each root entry.
  • the web crawlers 180 may also compile context clouds for some or all homographs of dictionary entries.
  • FIG. 2 is a schematic illustration 200 illustrating retrieval of place names from a note, clarification of ambiguities relating to the place names, and generalization. Similarly to FIG. 1, subsequent phases of data retrieval and processing are enumerated I, II, III, IV as shown by items 210, 215, 220, 225.
  • a note or a document 230 is analyzed for the presence of place names. Such analysis may be performed by looking up unigrams, bigrams and possibly results of Named Entity Recognition from a textual part of the note 230 in a dictionary of geographic names 235; the textual part may include, in some embodiments, note attachments, results of handwriting, image, voice and video recognition, etc. Identified place names may be compiled into a preliminary list 240.
  • ambiguities relating to place names may be clarified using context clouds as follows: a residual note 245 remaining after putting aside detected place names retrieved at phase I (the residual note) is compared with a plurality of sets of context clouds 250 to assess comparative relevance of the residual note to context clouds associated with the actual place names vs. homographs of the actual place names.
  • Each of the three sets of context clouds 250 shown on the illustration 200 corresponds to a place name from the preliminary list 240 shown as place1, place2, ..., placeN and may represent several context clouds from the dictionary of geographic names: one for the actual extracted place and the remaining for homographs of the extracted place in the dictionary.
  • Sets of context clouds may be symbolically depicted by double cloud shapes where a smaller cloud shape with a solid boundary illustrates the context cloud for the place name (such as Port Darwin or Moscow, Russia) while a larger outer cloud shape with a dashed boundary illustrates a subset of context clouds for homographs of place names (such as Charles Darwin, Moscow Cantata by Tchaikovsky or MoSCoW Method).
  • relevance scores may be assigned to pairs (residual note, place name) and (residual note, homograph) for all homographs; the scores may measure degrees of relevance of content of the residual note to context clouds corresponding to the place name, on the one hand, and to each homograph, on the other.
  • a place name with a low score may be immediately dropped from the list, as illustrated by a place name 255 and a trash bin 260.
  • a score of a place name with a homograph that is more relevant to the residual note content than the place name may be decreased, and the homograph may be added to the list next to the place name to give users or other applications additional data for analysis.
  • place names that survive phase II may undergo a generalization process.
  • the place names may be supplemented with one or more area names retrieved from the dictionary of geographic names. This may be provided by a lookup in the dictionary of geographic names 265, which retrieves an Area name 270 to augment several extracted place names 275 {place1, ..., placeK}.
  • the remainder of the place names on the list through the last item placeN may not be generalized in this schematic example. For instance, place names Cupertino, Sunnyvale, Redwood City present in one note may lead to
  • a list of place names 280 with potentially added general terms and with confidence scores of all items on the list may be compiled and presented as a system output that may be utilized by other applications, by services and/or directly by users.
  • the list of place names 280 may be ordered by decreasing scores, with due respect to potential grouping of items introduced in the generalization phase III.
  • a flow diagram 300 illustrates extracting and clarifying place names from notes. Processing begins at a step 310 where the dictionary of geographic names with context clouds is built. This step is explained in more detail elsewhere herein.
  • processing proceeds to a step 320 where the system scans through a note to detect unigrams, bigrams and, optionally, using results of Named Entity Recognition, looks up the dictionary of geographic names to identify place names among scanned terms.
  • processing proceeds to a step 330 where ambiguities in preliminary place names are clarified using context clouds, as explained elsewhere herein. See, for example, phase II of the illustration 200, described above. The step 330 may result in dropping some of the preliminary place names.
  • processing proceeds to a step 340 where generalized place names may be added, as explained in more detail elsewhere herein. See, for example, phase III in the illustration 200.
  • a flow diagram 400 illustrates in more detail building a dictionary of geographic names as set forth in the step 310 of the flow diagram 300, described above. Processing begins at a step 410 where existing dictionary sources are merged into a preliminary version of the dictionary of geographic names.
  • processing proceeds to a step 420 where a special web crawler visits dedicated online sites (such as encyclopedia and online dictionaries) and compiles lists of homographs for some or all entries of the preliminary version of the dictionary, as explained elsewhere herein. See, for example, phase II of the illustration 100.
  • the crawler or a standalone utility may analyze existing offline resources to identify homographs for dictionary entries.
  • processing proceeds to a step 430 where another (or possibly the same) web crawler may visit a variety of relevant web sites (or, in some embodiments, a standalone utility may parse relevant offline resources) to compile context clouds for basic entries of the dictionary of geographic names and, in some embodiments, to also build context clouds for some or each of the homographs identified at the step 420.
  • processing proceeds to a test step 440 where it is determined whether the context clouds (built previously) need a refresh. If so, then processing proceeds back to the step 430; otherwise, following the step 440, processing is complete.
  • a flow diagram 500 illustrates clarifying ambiguity for place names using context clouds as set forth in the step 330 of the flow diagram 300, described above.
  • Processing begins at a step 510 where the system chooses the first extracted place name from a preliminary list built from a note, as explained elsewhere herein.
  • processing proceeds to a step 520 where a relevance score for the place name and possibly other (related) place names and relevance scores are determined.
  • processing at the step 520 is described in more detail elsewhere herein.
  • processing proceeds to a step 530, where a chosen place name and a corresponding relevance score are added to the output list; this step reflects the situation when the relevance of the chosen place name beats all its homographs in the dictionary.
  • processing proceeds to a test step 540 where it is determined whether the chosen place name is the last place name on the preliminary list extracted from the note. If not, processing proceeds to a step 550 where a next place name is chosen from the preliminary list. After the step 550, processing proceeds back to the step 520, discussed above, and an assessment process for a new place name starts. If it is determined at the test step 540 that the chosen place name is the last place name on the preliminary list extracted from the note, control transfers from the test step 540 to a step 560 where an output list of place names is sent to a generalization module explained elsewhere herein in connection with phase III of the illustration 200. After the step 560, processing is complete.
  • a flow diagram 600 illustrates in more detail the step 520 of the flow diagram 500, described above.
  • Processing begins at a step 615 where a context cloud corresponding to the chosen place name is retrieved from a dictionary of geographic names.
  • processing proceeds to a step 620 where a residual note is compared with the context cloud and a relevance score is calculated.
  • processing proceeds to a step 625 where the system looks up the dictionary for homographs of the chosen place name.
  • processing proceeds to a test step 630 where it is determined if homographs exist. If so, then processing proceeds to a step 635 where a first homograph is chosen; otherwise, processing proceeds to a step 665.
  • processing proceeds to a step 640 where a context cloud corresponding to the chosen homograph is retrieved from the dictionary.
  • processing proceeds to a step 645 where a residual note is compared with the context cloud for the chosen homograph and a relevance score is calculated.
  • processing proceeds to a test step 650 where it is determined if the chosen place name is more relevant than the currently evaluated homograph (in other words, if the relevance score for the place name is higher than for the homograph). If it is determined at the test step 650 that the chosen place name is not more relevant than the currently evaluated homograph, then processing is complete. Otherwise, processing proceeds to a test step 655 where it is determined if the current homograph is the last entry on the list of homographs for the chosen place name.
  • processing proceeds to a step 660 where the next homograph is chosen. After the step 660, processing proceeds back to the step 640, discussed above, which can be independently reached from the step 635. If it is determined at the test step 655 that the current homograph is the last entry on the list of homographs for the chosen place name, processing proceeds to the step 665 where the chosen place name and a corresponding relevance score of the place name are added to an output list. Note that the step 665 can also be reached from the test step 630 if no homographs are found. After the step 665, processing is complete.
  • the system may also be implemented on and/or in cooperation with a cloud based content management system with mobile clients, such as the system provided by the Evernote service and software, developed by the Evernote Corporation of Redwood City, California.
  • One or more of the mobile clients may be a mobile device, such as a conventional hand-held smartphone running one of several major mobile platforms, including iOS, Android,
  • the mobile device may run additional user software (e.g., an application) that provides the functionality described herein.
  • the user software may be bundled (pre-loaded), installed from an app store or downloaded from a Web site.
  • Software implementations of the system described herein may include executable code that is stored in a computer readable medium and executed by one or more processors.
  • the computer readable medium may be non-transitory and include a computer hard drive, ROM, RAM, flash memory, portable computer storage media such as a CD-ROM, a DVD-ROM, a flash drive, an SD card and/or other drive with, for example, a universal serial bus (USB) interface, and/or any other appropriate tangible or non-transitory computer readable medium or computer memory on which executable code may be stored and executed by a processor.
  • USB universal serial bus
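
The context-cloud weighting described in the list above (tf-idf scores multiplied by a non-uniqueness coefficient) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the method of the application: the exact form of the coefficient is not given in this text, so the `1 / (1 + shared weight)` damping in `non_uniqueness` is an assumption, as are all function names and the sample clouds.

```python
import math
from collections import Counter


def tf_idf(term, doc_tokens, corpus):
    """Plain tf-idf of `term` in one tokenized document of `corpus`."""
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    docs_with_term = sum(1 for doc in corpus if term in doc)
    idf = math.log(len(corpus) / docs_with_term) if docs_with_term else 0.0
    return tf * idf


def non_uniqueness(term, homograph_clouds):
    """ASSUMED coefficient in (0, 1]: equals 1.0 when no homograph cloud
    contains the term and shrinks as the term's weight there grows."""
    shared = sum(cloud.get(term, 0.0) for cloud in homograph_clouds)
    return 1.0 / (1.0 + shared)


def adjust_cloud(place_cloud, homograph_clouds):
    """Scale each tf-idf weight of a place-name cloud by the coefficient."""
    return {
        term: weight * non_uniqueness(term, homograph_clouds)
        for term, weight in place_cloud.items()
    }


# Hypothetical clouds (term -> tf-idf weight) for the Darwin example.
place_cloud = {"lightning": 0.5, "australia": 0.4}
homograph_clouds = [{"australia": 0.25, "evolution": 0.9}]  # [Charles] Darwin
adjusted = adjust_cloud(place_cloud, homograph_clouds)
print(adjusted)
```

In this toy data the term "lightning" keeps its full weight because it appears only in the place-name cloud, while "australia" is reduced because the homograph cloud shares it.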

Abstract

Extracting and clarifying ambiguities of addresses in documents includes constructing a plurality of context clouds, where each of the context clouds corresponds to place names and homographs therefor, determining place names in each of the documents using a dictionary of geographic names, clarifying ambiguities for the place names determined in each of the documents using the context clouds and residual documents, where the residual documents correspond to each of the documents having detected place names removed therefrom, and providing a relevance score for each remaining one of the detected place names. Using the context clouds and residual documents may include determining the relevance score of each of the place names and determining a relevance score of corresponding homographs for each of the place names.
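
The steps summarized in the abstract (dictionary lookup, residual document, relevance scoring against context clouds) can be sketched in Python as below. This is an illustrative sketch only: the `DICTIONARY` layout, the function names, the sample weights, and the simple "sum of matching cloud weights" relevance score are all assumptions made for the example, not details taken from the application.

```python
import re


def tokenize(text):
    """Lower-case word tokens (ASCII letters only, for brevity)."""
    return re.findall(r"[a-z]+", text.lower())


def detect_place_names(tokens, dictionary):
    """Return (name, start, length) hits, trying bigrams before unigrams."""
    hits, i = [], 0
    while i < len(tokens):
        bigram = " ".join(tokens[i:i + 2])
        if len(tokens) - i >= 2 and bigram in dictionary:
            hits.append((bigram, i, 2))
            i += 2
        elif tokens[i] in dictionary:
            hits.append((tokens[i], i, 1))
            i += 1
        else:
            i += 1
    return hits


def residual(tokens, hits):
    """The document with detected place names removed."""
    drop = {j for _, start, n in hits for j in range(start, start + n)}
    return [t for j, t in enumerate(tokens) if j not in drop]


def relevance(residual_tokens, cloud):
    """ASSUMED score: total weight of cloud terms present in the residual."""
    present = set(residual_tokens)
    return sum(w for term, w in cloud.items() if term in present)


def clarify(text, dictionary):
    """Keep a place name only if it beats every one of its homographs."""
    tokens = tokenize(text)
    hits = detect_place_names(tokens, dictionary)
    rest = residual(tokens, hits)
    kept = []
    for name in dict.fromkeys(h[0] for h in hits):  # de-duplicate, keep order
        entry = dictionary[name]
        score = relevance(rest, entry["cloud"])
        if all(score > relevance(rest, c) for c in entry["homographs"].values()):
            kept.append((name, score))
    return sorted(kept, key=lambda item: -item[1])


# Hypothetical dictionary of geographic names with context clouds.
DICTIONARY = {
    "darwin": {
        "cloud": {"lightning": 0.5, "harbour": 0.3},
        "homographs": {"Charles Darwin": {"evolution": 0.9, "species": 0.6}},
    },
    "santa cruz": {"cloud": {"boardwalk": 0.7, "surf": 0.4}, "homographs": {}},
}

print(clarify("Watched a lightning storm over the harbour in Darwin", DICTIONARY))
print(clarify("Notes on Darwin, evolution and the origin of species", DICTIONARY))
```

The first note keeps "darwin" as a place name because only the place-name cloud matches the residual text; the second drops it because the homograph cloud is more relevant. A place name with no homographs, such as "santa cruz" here, is kept without comparison.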

Description

EXTRACTION AND CLARIFICATION OF AMBIGUITIES FOR
ADDRESSES IN DOCUMENTS
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to U.S. Prov. App. No. 61/723,880, filed November 8, 2012, and entitled "EXTRACTING AND DISAMBIGUATING ADDRESSES FROM DOCUMENTS", which is incorporated herein by reference.
TECHNICAL FIELD
This application is directed to the field of extracting and analyzing information, especially in conjunction with personal and shared content management systems.
BACKGROUND OF THE INVENTION
Geographic information is one of the most common components of everyday information streams and documents. Consumers are using geographic information in conjunction with shopping, travel, entertainment, communications, healthcare, leisure and many other activities. For businesses, geographic information defines production and supplier sites, local and worldwide offices, transportation, inventories and distributions centers, postal delivery, etc. Urban and rural planning by municipal, state and federal governments, military planning and intelligence are all utilizing vast amounts of geographic information.
Many information systems are designated for maintaining and augmenting lists of geographical names. Online map services, such as Google Maps, Yahoo! Maps, Bing Maps by Microsoft, MapQuest, OpenStreetMap, Nokia Maps, deCarta Maps, and many other similar solutions, are providing addresses to individuals and organizations, as well as to web and other application developers via open APIs. Additionally, national authorities in many countries are providing comprehensive lists of geographic names via dedicated systems. This includes postal databases, National Boards of Geographic Names, and other organizations. For example, the United States Geological Survey (USGS) in cooperation with the United States Board on Geographic Names (BGN) has developed the Geographic Names Information System, GNIS, with over two million entries; the GEOnet Names Server (GNS) provides access to the National Geospatial-Intelligence Agency's and the BGN's database of geographic feature names for locations outside the United States. The US Postal Office database is open for use by businesses and the public and includes every address that is marked by a nine-digit US ZIP code. The Getty Thesaurus of Geographic Names (TGN) includes names and associated information about administrative political entities (cities, nations, regions and other entities), as well as natural features (mountains, rivers, etc.), and historical places.
Combined with business directories and POI (Point of Interest) databases, such as the POI Factory, enhanced by location aware devices, including GPS equipped mobile devices (phones), GeoIP services on desktops, and navigational devices in automobiles, mapping applications are offering individuals comprehensive and intuitive ways to obtain geographical information. Accordingly, geographical data are becoming an increasingly common component of different types of personal content, such as multimedia electronic notes supported by various note-taking software, for example, the Evernote service and software, developed by the Evernote Corporation of Redwood City, California.
In contrast with directly obtaining geographical information for personal and business needs, which is, for the most part, unambiguous and straightforward, the reverse task of extracting addresses and other geographical information from personal content for advanced search and other purposes proves significantly more complex, ambiguous and challenging. Many geographical names duplicate personal, historical and other non-toponymic terms, which introduces significant uncertainty into the process of geographical labeling of a document. For example, the term The Manhattan Project has a conventional non-toponymic meaning, while the Manhattan Skyline is a toponymic term; analogously, the geographic name [Port] Darwin, the capital city of the Northern Territory, Australia, is both associated with and contrasts with the name Charles Darwin, the naturalist and writer after whom the city was named. In some cases, a document may mention several geographical objects in the same area, hinting at the possibility of a meaningful generalization.
Additionally, in some cases, the purpose of including geographical names in a document may be discovered only by analyzing additional content of the document (which, in turn, may verify the validity of a geographical name).
Accordingly, it is desirable to provide an efficient and relatively accurate mechanism for extraction of geographical names and addresses from personal content.
SUMMARY OF THE INVENTION
According to the system described herein, extracting and clarifying ambiguities of addresses in documents includes constructing a plurality of context clouds, where each of the context clouds corresponds to place names and homographs therefor, determining place names in each of the documents using a dictionary of geographic names, clarifying ambiguities for the place names determined in each of the documents using the context clouds and residual documents, where the residual documents correspond to each of the documents having detected place names removed therefrom, and providing a relevance score for each remaining one of the detected place names. Using the context clouds and residual documents may include determining the relevance score of each of the place names and determining a relevance score of corresponding homographs for each of the place names. Extracting and clarifying ambiguities of addresses in documents may include constructing a list of place names contained in each of the documents with the corresponding relevance scores. A place name that is determined to be less relevant than a particular homograph therefor may be eliminated. Extracting and clarifying ambiguities of addresses in documents may include providing generalization for remaining ones of the detected place names to provide additional augmented place names. Constructing a plurality of context clouds may include constructing a list of geographic names from existing databases. Constructing a plurality of context clouds may also include consulting lexical resources to add homographs of the geographical names to detect ambiguities in the geographic names. For each of the homographs, additional information may be provided based on online resources that include at least one of: encyclopedia sites, news sites, travel sites, and business sites. Web crawlers may be used for at least one of: constructing the list of geographic names, adding homographs, and building context clouds. The relevance score may be based on a tf-idf score and on uniqueness of terms in context clouds. The context clouds may be periodically refreshed to obtain upgrades. Detecting place names may include looking up place name words and bigrams in the dictionary of geographic names. Detecting place names may further include using a Named-Entity Recognition technique, wherein recognized Named-Entity Recognition entries are limited only to geographic items that have an associated place name in the dictionary of geographic names. The documents may include notes. A cloud-based content management system with mobile clients may extract and clarify ambiguities of addresses in documents. At least one of the mobile clients may be a mobile device that includes software that is one of: pre-loaded with the device, installed from an app store, and downloaded from a Web site. The mobile device may use an operating system selected from the group consisting of: iOS, Android OS, Windows Phone OS, Blackberry OS and mobile versions of Linux OS.
According further to the system described herein, computer software, provided in a non-transitory computer-readable medium, extracts and clarifies ambiguities of addresses in documents. The software includes executable code that constructs a plurality of context clouds, where each of the context clouds corresponds to place names and homographs therefor, executable code that determines place names in each of the documents using a dictionary of geographic names, executable code that clarifies ambiguities for the place names determined in each of the documents using the context clouds and residual documents, where the residual documents correspond to each of the documents having detected place names removed therefrom, and executable code that provides a relevance score for each remaining one of the detected place names. Using the context clouds and residual documents may include determining the relevance score of each of the place names and determining a relevance score of corresponding homographs for each of the place names. The software may also include executable code that constructs a list of place names contained in each of the documents with the corresponding relevance scores. A place name that is determined to be less relevant than a particular homograph therefor may be eliminated. The software may also include executable code that provides generalization for remaining ones of the detected place names to provide additional augmented place names. Executable code that constructs a plurality of context clouds may construct a list of geographic names from existing databases. Executable code that constructs a plurality of context clouds may consult lexical resources to add homographs of the geographical names to detect ambiguities in the geographic names. For each of the homographs, additional information may be provided based on online resources that include at least one of: encyclopedia sites, news sites, travel sites, and business sites. 
Web crawlers may be used for at least one of: constructing the list of geographic names, adding homographs, and building context clouds. The relevance score may be based on a tf-idf score and on uniqueness of terms in context clouds. The context clouds may be periodically refreshed to obtain upgrades. Executable code that detects place names may look up place name words and bigrams in the dictionary of geographic names. Executable code that detects place names may use a Named-Entity Recognition technique, where recognized Named-Entity Recognition entries are limited only to geographic items that have an associated place name in the dictionary of geographic names. The documents may include notes.
According further to the system described herein, a cloud-based content management system that extracts and clarifies ambiguities of addresses in documents includes a plurality of context clouds, where each of the context clouds corresponds to place names and homographs therefor, a dictionary of geographic names, and at least one processor having executable code that clarifies ambiguities for the place names determined in each of the documents using the context clouds and residual documents, where the residual documents correspond to each of the documents having detected place names removed therefrom, and having executable code that provides a relevance score for each remaining one of the detected place names. The cloud-based content management system may also include a mobile client device that includes software that is one of: pre-loaded with the device, installed from an app store, and downloaded from a Web site. The mobile device may use an operating system selected from the group consisting of: iOS, Android OS, Windows Phone OS, Blackberry OS and mobile versions of Linux OS.
The proposed system uses a custom designed dictionary of geographic names, where entries are enhanced with additional information, for extraction, clarification of ambiguities, and possible generalization of geographic terms and address information contained in personal notes and other documents and for generating an output place list with document specific scores. A system output is intended for use by other applications, including, but not limited to: typeahead search, such as described in U.S. Patent Application No. 13/924,905 titled: "GENERATING AND RANKING INCREMENTAL SEARCH SUGGESTIONS FOR PERSONAL CONTENT", filed on June 24, 2013 by Ayzenshtat, et al. and incorporated by reference herein; design and execution of action notes, such as described in Published U.S. Patent Application No. US20130212463 titled: "SMART DOCUMENT PROCESSING WITH ASSOCIATED ONLINE DATA AND ACTION STREAMS", filed on October 31, 2012 by Pachikov, et al. and incorporated by reference herein; enhancing a note atlas system, such as described in U.S. Patent Application No. 13/905,422 titled: "NOTE ATLAS", filed on May 30, 2013 by Constantinou, et al. and incorporated by reference herein; and possibly for other tasks.
A custom designed dictionary of geographic names may include two components:
• A hierarchical dictionary of place names with accompanying location and other information; and
• A context cloud accompanying each entry and containing key facts, points of interest, remarkable people, news and other data associated with a root entry, and possibly similar context clouds describing each homograph of the root entry.
The dictionary of geographic names is constructed in three phases. First, a set of available databases of geographic names may be merged into an initial version of the dictionary, which may represent, due to such combination, both address information (such as in postal databases or in the GNIS) and place information (such as in the TGN).
Additionally, each entry in the initial dictionary may belong to one or more hierarchies built from administrative, feature or other categorization schemes; for example, the City of San Jose may belong to an administrative hierarchy as follows:
United States>California>Santa Clara County>City of San Jose
and/or to an informal regional hierarchy such as:
United States>California>Northern California>San Francisco Bay Area>Silicon Valley>San Jose.
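As an illustration of how one entry can sit in several hierarchies at once, the following is a minimal Python sketch. The class `PlaceEntry`, its fields, and the helper `parse_path` are hypothetical names chosen for this example; the source does not specify a storage format.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


def parse_path(text: str) -> Tuple[str, ...]:
    """Split a 'A>B>C' hierarchy string into a root-to-leaf tuple of names."""
    return tuple(part.strip() for part in text.split(">"))


@dataclass
class PlaceEntry:
    """One dictionary entry (illustrative layout, not taken from the source)."""
    name: str
    # Each hierarchy is a root-to-leaf path whose last element is this entry.
    hierarchies: List[Tuple[str, ...]] = field(default_factory=list)

    def parents(self) -> Set[str]:
        """Immediate parent of the entry in every hierarchy it belongs to."""
        return {path[-2] for path in self.hierarchies if len(path) >= 2}

    def ancestors(self) -> Set[str]:
        """All names above the entry in any of its hierarchies."""
        return {name for path in self.hierarchies for name in path[:-1]}


san_jose = PlaceEntry(
    name="San Jose",
    hierarchies=[
        parse_path("United States>California>Santa Clara County>City of San Jose"),
        parse_path(
            "United States>California>Northern California>"
            "San Francisco Bay Area>Silicon Valley>San Jose"
        ),
    ],
)
```

Storing full paths rather than a single parent pointer keeps the administrative and informal regional hierarchies independent, which is one possible way to support the generalization step described later.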
At a second phase, a custom web crawler may visit online encyclopedia, dictionary and other special resources to compile lists of homographs associated with dictionary entries and add basic information about each homograph. For example, parsing a disambiguation section of Wikipedia for a key entry "Moscow" yields 23 place names in the US, Canada, India and the UK, and five non-toponymic homographs. Additional search in the homograph section of the Merriam-Webster dictionary adds a place name Moscow River absent from the Wikipedia list.
At the third phase, another custom crawler may visit online resources, such as encyclopedia, news, travel, business and other sites related to dictionary entries (in an embodiment, the crawler also may visit similar sites for each homograph of a dictionary entry) and retrieve relevant information, which is subsequently compiled into context clouds, attached to dictionary terms and possibly to homographs; each of the context clouds may subsequently be used to clarify certain ambiguities. Context clouds may include substantive context defining terms based on various relevance metrics, such as tf-idf weighting; these metrics may also be used as a relevance score for retained context defining terms. Characteristic terms may subsequently receive Bayesian weighting based on uniqueness of the characteristic terms; for example, a term that is common for context clouds of both a place name and a homograph may receive a lower score compared with another term that has occurrences only in a context cloud of a place name and is not associated with context clouds of homographs corresponding to the place name. In an embodiment, an original tf-idf score of a term in a context cloud (a metric of the relevance of the term within a given corpus of texts forming the context cloud) may be modified by multiplying the score by a non-uniqueness coefficient. The non-uniqueness coefficient may be calculated as one minus the average relative tf-idf score of the same term in context clouds corresponding to homographs (rather than the original place name), where a relative tf-idf score of a term is the tf-idf score of the term divided by the maximum of such scores for all terms in a context cloud:
$$U(t) = 1 - \frac{\displaystyle\sum_{h \in H,\; t \in C(h)} \frac{tf * idf_h(t)}{\max_{s \in C(h)} tf * idf_h(s)}}{\left|\{h \in H \mid t \in C(h)\}\right|},$$
where $U(t)$ is a non-uniqueness coefficient of a term $t$;
$H$ is a set of homographs of a given place name (not including the place name itself);
$C(h)$ is a context cloud of a homograph $h$;
$tf * idf_h(\cdot)$ is a tf-idf score of a respective term in the corresponding context cloud $C(h)$.
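The non-uniqueness coefficient described above can be sketched in Python as follows. This is only an illustration under stated assumptions: context clouds are modeled as plain dictionaries mapping terms to tf-idf scores, the function names are hypothetical, the sample numbers are invented, and the coefficient is taken to be 1 when no homograph cloud contains the term (a case the source does not address).

```python
def relative_score(cloud, term):
    """tf-idf of `term` divided by the largest tf-idf in the same cloud."""
    return cloud[term] / max(cloud.values())


def non_uniqueness(term, homograph_clouds):
    """One minus the average relative tf-idf of `term` over the
    homograph clouds that actually contain it."""
    containing = [cloud for cloud in homograph_clouds if term in cloud]
    if not containing:
        # Assumption: a term absent from every homograph cloud keeps its score.
        return 1.0
    average = sum(relative_score(c, term) for c in containing) / len(containing)
    return 1.0 - average


def adjusted_score(term, place_cloud, homograph_clouds):
    """Original tf-idf score multiplied by the non-uniqueness coefficient."""
    return place_cloud[term] * non_uniqueness(term, homograph_clouds)


# Invented tf-idf values for the place name [Port] Darwin and one homograph.
port_darwin = {"lightning": 2.0, "travel": 1.0, "australia": 1.5}
charles_darwin = {"evolution": 4.0, "travel": 2.0, "australia": 1.0}

scores = {t: adjusted_score(t, port_darwin, [charles_darwin]) for t in port_darwin}
```

With these sample values, "lightning" keeps its full score because it never appears in the homograph cloud, while "travel" and "australia" are discounted in proportion to how prominent they are in the homograph cloud.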
Context clouds may be periodically refreshed by the crawler through re-visiting relevant sites to obtain upgrades or through searching for new sites associated with dictionary terms.
Extraction of geographic information from notes or documents may also be done in several steps. First, place names may be found in a note by looking up place name words and bigrams (ordered pairs of adjacent words) in the dictionary of geographic names, as explained elsewhere herein. Additionally or alternatively, a Named-Entity Recognition (NER) technique may be used where recognized NER entries may be limited only to those geographic items that have associated place names in the dictionary.
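The word and bigram lookup in the first step might look like the sketch below. The tokenizer, the tiny `gazetteer` set standing in for the dictionary of geographic names, and the choice to prefer a bigram match over a unigram match at the same position are all assumptions made for illustration; Named-Entity Recognition is not shown.

```python
import re


def tokenize(text):
    """Split text into alphabetic word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[A-Za-z]+", text)


def find_place_names(text, gazetteer):
    """Return (token_index, matched_name) pairs for unigrams and bigrams
    found in `gazetteer`, a set of lower-cased place names."""
    tokens = tokenize(text)
    found = []
    i = 0
    while i < len(tokens):
        unigram = tokens[i].lower()
        bigram = " ".join(tokens[i:i + 2]).lower()
        if i + 1 < len(tokens) and bigram in gazetteer:
            # Assumption: a two-word match takes priority over a one-word match.
            found.append((i, bigram))
            i += 2
        elif unigram in gazetteer:
            found.append((i, unigram))
            i += 1
        else:
            i += 1
    return found


gazetteer = {"santa cruz", "darwin", "moscow"}  # hypothetical sample entries
note = "We drove from Santa Cruz to see the aquarium, then flew to Darwin."
matches = find_place_names(note, gazetteer)
```

The returned list plays the role of the preliminary list of place names; every item on it is still ambiguous at this point and is resolved in the next step.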
At a second step, ambiguities in extracted place names may be clarified by using a remaining portion of content of a note or a document to verify a relevance of each extracted place name based on context clouds associated with core dictionary entries. In an embodiment, synonyms of dictionary entries may also be used to provide context clouds for clarifying ambiguities. This may result in an acceptance or rejection of extracted geographical names and may also define relative scores of retained names. Thus, strong connotation of an extracted term with a non-toponymic homonym of a term may cause an immediate rejection, as in the above example of the word combination Manhattan project. In other cases, one or several substantive terms of a context cloud found in a note and associated with a geographical name extracted from the note may serve as evidence for accepting a name. For example, neither of the words Australia or travel may represent a satisfactory basis for clarifying ambiguous data to determine whether a word Darwin in a note is being used as a place name; the data may remain ambiguous since Charles Darwin traveled extensively during his lifetime, including to Australia. However, an occurrence of a word lightning in the same note may provide a sufficient piece of evidence in favor of the geographic interpretation, since Port Darwin is known as a world capital of lightning, which term would likely be substantive in a correctly compiled context cloud for the geographic name [Port] Darwin and has no particular reason to score high and be substantive in a context cloud for the homograph [Charles] Darwin.
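To make the second step concrete, the sketch below builds the remaining portion of a note (with the extracted place names removed) and scores it against two context clouds. The source does not define the exact scoring rule, so the sum of cloud weights over matching terms is purely an assumption, as are the function names and the sample weights.

```python
import re


def residual_terms(text, place_names):
    """Lower-cased words of `text` after removing the detected place names."""
    lowered = text.lower()
    for name in place_names:
        lowered = re.sub(r"\b" + re.escape(name.lower()) + r"\b", " ", lowered)
    return re.findall(r"[a-z]+", lowered)


def relevance(residual, cloud):
    """Assumed scoring rule: sum of cloud weights over the distinct
    residual terms that occur in the cloud."""
    return sum(cloud[term] for term in set(residual) if term in cloud)


note = "Trip to Darwin: watched lightning over the harbour."
residual = residual_terms(note, ["Darwin"])

# Invented context clouds for the place name and for its homograph.
port_darwin = {"lightning": 2.0, "harbour": 1.0, "australia": 1.5}
charles_darwin = {"evolution": 4.0, "beagle": 2.0, "australia": 1.0}

place_score = relevance(residual, port_darwin)
homograph_score = relevance(residual, charles_darwin)
```

In this toy note the word lightning tips the balance toward the geographic reading, mirroring the Darwin example in the text.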
In the process of clarifying ambiguities, the system may be utilizing not only context clouds of immediately extracted place names but also context clouds of adjacent place names in a geographic hierarchy and their geographic neighbors. For example, a word aquarium in a note with a place name Santa Cruz may add evidence in favor of selecting the City of Santa Cruz, California vs. hundreds of toponymic and non-toponymic homographs of the term; such evidence may be obtained by scanning a geographic hierarchy and extracting a neighbor city of Monterey with its landmark Monterey Bay Aquarium.
At a third step, an additional analysis of the retained place names may be performed for possible generalization. For example, in the event that multiple neighboring places are named in a note or a document, one or more of the parents of those places in the dictionary hierarchy may be added as a suggestion to the list of extracted names (for example, if a note contains three place names Cupertino, Sunnyvale and Redwood City, then a list of suggestions may include Santa Clara County and/or Silicon Valley Cities as a generalized place name).
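One simple way to realize the generalization step is to count how many retained place names share a parent in the hierarchy and to suggest parents shared by several of them. The threshold `min_children`, the function name, and the toy `parents_of` mapping below are assumptions for illustration only.

```python
from collections import Counter


def generalize(place_names, parents_of, min_children=2):
    """Suggest parent areas shared by at least `min_children` of the
    retained place names. `parents_of` maps a place to its parent areas."""
    counts = Counter()
    for name in place_names:
        for parent in parents_of.get(name, ()):
            counts[parent] += 1
    return sorted(p for p, n in counts.items() if n >= min_children)


# Hypothetical fragment of the dictionary hierarchy.
parents_of = {
    "Cupertino": {"Santa Clara County", "Silicon Valley"},
    "Sunnyvale": {"Santa Clara County", "Silicon Valley"},
    "Redwood City": {"San Mateo County", "Silicon Valley"},
}

suggestions = generalize(["Cupertino", "Sunnyvale", "Redwood City"], parents_of)
```

A production system would presumably also attach a score to each suggestion, for example based on the fraction of extracted places covered, but the source leaves that choice open.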
Following the above three steps of the process, the system generates a list of place names contained in a note, with corresponding scores indicating certainty and possibly with the scores of closest contenders of the place names in associated lists of homographs.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments of the system described herein will now be explained in more detail in accordance with the figures of the drawings, which are briefly described as follows.
FIG. 1 is a schematic illustration of creation of a custom dictionary of geographic names, according to an embodiment of the system described herein.
FIG. 2 is a schematic illustration of retrieval of place names from a note, clarifying ambiguities of the place names, and generalization, according to an embodiment of the system described herein.
FIG. 3 is a principal system flow diagram describing a process of extracting and clarifying ambiguities of place names from notes, according to an embodiment of the system described herein.
FIG. 4 is a system flow diagram describing the process of building a dictionary of geographic names, according to an embodiment of the system described herein.
FIG. 5 is a system flow diagram describing clarifying ambiguities of place names using context clouds, according to an embodiment of the system described herein.
FIG. 6 is a system flow diagram describing in more detail clarifying ambiguities of place names using context clouds, according to an embodiment of the system described herein.
DETAILED DESCRIPTION OF VARIOUS EMBODIMENTS
The system described herein provides a mechanism for extracting place names, such as geographic locations and postal addresses, from notes and documents, and for clarifying ambiguities and generalizing extracted place names. The system uses a specially designed dictionary of geographic names with context clouds accompanying entries in the system to identify and verify geographical terms, separate the terms from homographs and generalize extracted names when possible. The resulting information may be used by other applications such as typeahead search, action notes, note atlas, etc., as explained elsewhere herein.
FIG. 1 is a schematic illustration 100 showing creation of a custom dictionary of geographic names. Three phases of creation of the dictionary of geographic names are enumerated I, II, and III, as shown by items 110, 120, 130. In phase I, various pre-existing sources, such as databases of geographic names 140a, 140b, 140c, may be merged into an initial, preliminary version of the dictionary 150. In addition to publicly available databases, the system may use other licensable sources, such as the official USPS database of US postal addresses and similar databases in other countries. In phase II, a web crawler 160 may visit dedicated lexical resources, for example, sections of Wikipedia, a homograph section of the Merriam-Webster dictionary, etc. The purpose of phase II is to add homographs as potential sources of ambiguity to the initial version of the dictionary of geographic names, as exemplified by a list of homograph entries 170. In phase III, another set of web crawlers 180 may visit diverse online resources related to geographic info and place names, such as encyclopedia, news, travel, business, postal and other sites, to compile context clouds 190 associated with each root entry. In some embodiments, the web crawlers 180 may also compile context clouds for some or all homographs of dictionary entries.
FIG. 2 is a schematic illustration 200 illustrating retrieval of place names from a note, clarification of ambiguities relating to the place names, and generalization. Similarly to FIG. 1, subsequent phases of data retrieval and processing are enumerated I, II, III, IV as shown by items 210, 215, 220, 225.
In phase I, a note or a document 230 is analyzed for the presence of place names. Such analysis may be performed by looking up, in a dictionary of geographic names 235, unigrams, bigrams and possibly results of Named Entity Recognition found in a textual part of the note 230, which may include, in some embodiments, note attachments, results of handwriting, image, voice and video recognition, etc. Identified place names may be compiled into a preliminary list 240.
In phase II, ambiguities relating to place names may be clarified using context clouds as follows: a residual note 245, which remains after setting aside the place names detected in phase I, is compared with a plurality of sets of context clouds 250 to assess comparative relevance of the residual note to context clouds associated with the actual place names vs. homographs of the actual place names. Each of the three sets of context clouds 250 shown on the illustration 200 corresponds to a place name from the preliminary list 240 shown as place1, place2, ..., placeN and may represent several context clouds from the dictionary of geographic names: one for the actual extracted place and the remaining ones for homographs of the extracted place in the dictionary. Sets of context clouds may be symbolically depicted by double cloud shapes where a smaller cloud shape with a solid boundary illustrates the context cloud for the place name (such as Port Darwin or Moscow, Russia) while a larger outer cloud shape with a dashed boundary illustrates a subset of context clouds for homographs of place names (such as Charles Darwin, Moscow Cantata by Tchaikovsky or MoSCoW Method). For each found place name, relevance scores may be assigned to pairs (residual note, place name) and (residual note, homograph) for all homographs; the scores may measure degrees of relevance of content of the residual note to context clouds corresponding to the place name, on the one hand, and to each homograph, on the other hand. If a homograph is more relevant to the residual note content than a place name then, in some embodiments, a place name with a low score may be immediately dropped from the list, as illustrated by a place name 255 and a trash bin 260.
Alternatively, the score of a place name having a homograph that is more relevant to the residual note content than the place name may be decreased, and the homograph may be added to the list next to the place name to give users or other applications additional data that may be used for analysis.
In phase III, place names that survive phase II may undergo a generalization process. Thus, if several names on the list of place names extracted from a note belong to a same geographic area, the place names may be supplemented with one or more area names retrieved from the dictionary of geographic names. This may be provided by a lookup in the dictionary of geographic names 265, which retrieves an Area name 270 to augment several extracted place names 275 {place1, ..., placeK}. The remainder of the place names on the list through the last item placeN may not be generalized in this schematic example. For instance, place names Cupertino, Sunnyvale, Redwood City present in one note may lead to augmenting the list with the terms Santa Clara County, Silicon Valley and/or San Francisco Bay Area, each with a corresponding score, which may depend on other content found in the note.
In phase IV, a list of place names 280 with potentially added general terms and with confidence scores of all items on the list may be compiled and presented as a system output that may be utilized by other applications, by services and/or directly by users. In some embodiments, the list of place names 280 may be ordered by decreasing scores, with due respect to potential grouping of items introduced in the generalization phase III.
Referring to FIG. 3, a flow diagram 300 illustrates extracting and clarifying place names from notes. Processing begins at a step 310 where the dictionary of geographic names with context clouds is built. This step is explained in more detail elsewhere herein. After the step 310, processing proceeds to a step 320 where the system scans through a note to detect unigrams and bigrams, optionally uses results of Named Entity Recognition, and looks up the dictionary of geographic names to identify place names among the scanned terms. After the step 320, processing proceeds to a step 330 where ambiguities in preliminary place names are clarified using context clouds, as explained elsewhere herein. See, for example, phase II of the illustration 200, described above. The step 330 may result in dropping some of the preliminary place names. After the step 330, processing proceeds to a step 340 where generalized place names may be added, as explained in more detail elsewhere herein. See, for example, phase III in the illustration 200. After the step 340, processing proceeds to a step 350 where an output list of place names with scores is formed, potentially augmented by generalized place names with their scores, as explained in more detail elsewhere herein. See, for example, phase IV of the illustration 200. After the step 350, processing is complete.
Referring to FIG. 4, a flow diagram 400 illustrates in more detail building a dictionary of geographic names as set forth in the step 310 of the flow diagram 300, described above. Processing begins at a step 410 where existing dictionary sources are merged into a preliminary version of the dictionary of geographic names. After the step 410, processing proceeds to a step 420 where a special web crawler visits dedicated online sites (such as encyclopedia and online dictionaries) and compiles lists of homographs for some or all entries of the preliminary version of the dictionary, as explained elsewhere herein. See, for example, phase II of the illustration 100. In addition to visiting web sites, the crawler or a standalone utility may analyze existing offline resources to identify homographs for dictionary entries. After the step 420, processing proceeds to a step 430 where another (or possibly the same) web crawler may visit a variety of relevant web sites (or, in some embodiments, a standalone utility may parse relevant offline resources) to compile context clouds for basic entries of the dictionary of geographic names and, in some embodiments, to also build context clouds for some or each of the homographs identified at the step 420.
After the step 430, processing proceeds to a test step 440 where it is determined whether the context clouds (built previously) need a refresh. If so, then processing proceeds back to the step 430; otherwise, following the step 440, processing is complete.
Referring to FIG. 5, a flow diagram 500 illustrates clarifying ambiguity for place names using context clouds as set forth in the step 330 of the flow diagram 300, described above. Processing begins at a step 510 where the system chooses the first extracted place name from a preliminary list built from a note, as explained elsewhere herein. After the step 510, processing proceeds to a step 520 where a relevance score for the place name is determined, possibly along with other (related) place names and their relevance scores. Processing at the step 520 is described in more detail elsewhere herein. After the step 520, processing proceeds to a step 530, where the chosen place name and a corresponding relevance score are added to the output list; this step reflects the situation when the relevance of the chosen place name beats all its homographs in the dictionary. After the step 530, processing proceeds to a test step 540 where it is determined whether the chosen place name is the last place name on the preliminary list extracted from the note. If not, processing proceeds to a step 550 where a next place name is chosen from the preliminary list. After the step 550, processing proceeds back to the step 520, discussed above, and an assessment process for a new place name starts. If it is determined at the test step 540 that the chosen place name is the last place name on the preliminary list extracted from the note, control transfers from the test step 540 to a step 560 where an output list of place names is sent to a generalization module explained elsewhere herein in connection with phase III of the illustration 200. After the step 560, processing is complete.
Referring to FIG. 6, a flow diagram 600 illustrates in more detail the step 520 of the flow diagram 500, described above. Processing begins at a step 615 where a context cloud corresponding to the chosen place name is retrieved from a dictionary of geographic names.
After the step 615, processing proceeds to a step 620 where a residual note is compared with the context cloud and relevance score is calculated. After the step 620, processing proceeds to a step 625 where the system looks up the dictionary for homographs of the chosen place name. After the step 625, processing proceeds to a test step 630 where it is determined if homographs exist. If so, then processing proceeds to a step 635 where a first homograph is chosen; otherwise, processing proceeds to a step 665. After the step 635, processing proceeds to a step 640 where a context cloud corresponding to the chosen homograph is retrieved from the dictionary.
After the step 640, processing proceeds to a step 645 where a residual note is compared with the context cloud for the chosen homograph and a relevance score is calculated. After the step 645, processing proceeds to a test step 650 where it is determined if the chosen place name is more relevant than the currently evaluated homograph (in other words, if the relevance score for the place name is higher than for the homograph). If it is determined at the test step 650 that the chosen place name is not more relevant than the currently evaluated homograph, then processing is complete. Otherwise, processing proceeds to a test step 655 where it is determined if the current homograph is the last entry on the list of homographs for the chosen place name. If it is determined at the test step 655 that the current homograph is not the last entry on the list of homographs for the chosen place name, then processing proceeds to a step 660 where the next homograph is chosen. After the step 660, processing proceeds back to the step 640, discussed above, which can be independently reached from the step 635. If it is determined at the test step 655 that the current homograph is the last entry on the list of homographs for the chosen place name, processing proceeds to the step 665 where the chosen place name and a corresponding relevance score of the place name is added to an output list. Note that the step 665 can also be reached from the test step 630 if no homographs are found. After the step 665, processing is complete.
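The control flow of the flow diagrams 500 and 600 can be summarized by the following Python sketch. Step numbers in the comments refer to the text; the `overlap` scoring function, the dictionary-based cloud representation, and the sample data are assumptions introduced for illustration and are not prescribed by the source.

```python
def overlap(residual, cloud):
    """Assumed relevance measure: total weight of cloud terms found in the residual note."""
    return sum(weight for term, weight in cloud.items() if term in residual)


def evaluate_place(residual, place_cloud, homograph_clouds, relevance=overlap):
    """Return the place score if it beats every homograph, otherwise None."""
    place_score = relevance(residual, place_cloud)        # steps 615-620
    for cloud in homograph_clouds:                        # steps 630-660
        if place_score <= relevance(residual, cloud):     # test step 650
            return None                                   # place name is not output
    return place_score                                    # step 665


def clarify(residual, candidates):
    """Loop of FIG. 5: `candidates` maps each preliminary place name to
    a pair (place_cloud, list_of_homograph_clouds)."""
    output = []
    for name, (place_cloud, homograph_clouds) in candidates.items():
        score = evaluate_place(residual, place_cloud, homograph_clouds)
        if score is not None:
            output.append((name, score))
    return output


# Invented residual note terms and context clouds.
residual = {"project", "physicists", "lightning", "storm"}
candidates = {
    "manhattan": ({"skyline": 2.0, "borough": 1.5},
                  [{"project": 3.0, "physicists": 2.0}]),
    "darwin": ({"lightning": 2.0, "harbour": 1.0},
               [{"evolution": 4.0, "beagle": 2.0}]),
}

accepted = clarify(residual, candidates)
```

Note that, following the test step 650 as described, a tie between a place name and a homograph is treated as a rejection, and a place name without homographs is accepted regardless of its score.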
The system may also be implemented on and/or in cooperation with a cloud based content management system with mobile clients, such as the system provided by the Evernote service and software, developed by the Evernote Corporation of Redwood City, California. One or more of the mobile clients may be a mobile device, such as a conventional hand-held smartphone running one of several major mobile platforms, including iOS, Android, Windows Phone OS, Blackberry OS and mobile versions of Linux. The mobile device may run additional user software (e.g., an application) that provides the functionality described herein. The user software may be bundled (pre-loaded), installed from an app store or downloaded from a Web site.
Various embodiments discussed herein may be combined with each other in appropriate combinations in connection with the system described herein. Additionally, in some instances, the order of steps in the flowcharts, flow diagrams and/or described flow processing may be modified, where appropriate. Similarly, elements and areas of screens described in screen layouts may vary from the illustrations presented herein. Further, various aspects of the system described herein may be implemented using software, hardware, a combination of software and hardware and/or other computer-implemented modules or devices having the described features and performing the described functions. The mobile device may be a cell phone, although other devices are also possible.
Software implementations of the system described herein may include executable code that is stored in a computer readable medium and executed by one or more processors. The computer readable medium may be non-transitory and include a computer hard drive, ROM, RAM, flash memory, portable computer storage media such as a CD-ROM, a DVD-ROM, a flash drive, an SD card and/or other drive with, for example, a universal serial bus (USB) interface, and/or any other appropriate tangible or non-transitory computer readable medium or computer memory on which executable code may be stored and executed by a processor. The system described herein may be used in connection with any appropriate operating system.
Other embodiments of the invention will be apparent to those skilled in the art from a consideration of the specification or practice of the invention disclosed herein. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the invention being indicated by the following claims.

Claims

What is claimed is:
1. A method of extracting and clarifying ambiguities of addresses in documents, comprising:
constructing a plurality of context clouds, wherein each of the context clouds corresponds to place names and homographs therefor;
determining place names in each of the documents using a dictionary of geographic names;
clarifying ambiguities for the place names determined in each of the documents using the context clouds and residual documents, wherein the residual documents correspond to each of the documents having detected place names removed therefrom; and
providing a relevance score for each remaining one of the detected place names.
2. A method, according to claim 1, wherein using the context clouds and residual documents includes determining the relevance score of each of the place names and determining a relevance score of corresponding homographs for each of the place names.
3. A method, according to claim 1, further comprising:
constructing a list of place names contained in each of the documents with the corresponding relevance scores.
4. A method, according to claim 1, wherein a place name that is determined to be less relevant than a particular homograph therefor is eliminated.
5. A method, according to claim 1, further comprising:
providing generalization for remaining ones of the detected place names to provide additional augmented place names.
6. A method, according to claim 1, wherein constructing a plurality of context clouds includes constructing a list of geographic names from existing databases.
7. A method, according to claim 6, wherein constructing a plurality of context clouds also includes consulting lexical resources to add homographs of the geographical names to detect ambiguities in the geographic names.
8. A method, according to claim 7, wherein, for each of the homographs, additional information is provided based on online resources that include at least one of: encyclopedia sites, news sites, travel sites, and business sites.
9. A method, according to claim 8, wherein web crawlers are used for at least one of: constructing the list of geographic names, adding homographs, and building context clouds.
10. A method, according to claim 1, wherein the relevance score is based on a tf-idf score and on uniqueness of terms in context clouds.
11. A method, according to claim 1, wherein the context clouds are periodically refreshed to obtain upgrades.
12. A method, according to claim 1, wherein detecting place names includes looking up place name words and bigrams in the dictionary of geographic names.
13. A method, according to claim 1, wherein detecting place names further includes using a Named-Entity Recognition technique, wherein recognized Named-Entity Recognition entries are limited only to geographic items that have an associated place name in the dictionary of geographic names.
14. A method, according to claim 1, wherein the documents include notes.
15. A method, according to claim 1, wherein a cloud-based content management system with mobile clients extracts and clarifies ambiguities of addresses in documents.
16. A method, according to claim 15, wherein at least one of the mobile clients is a mobile device that includes software that is one of: pre-loaded with the device, installed from an app store, and downloaded from a Web site.
17. A method, according to claim 16, wherein the mobile device uses an operating system selected from the group consisting of: iOS, Android OS, Windows Phone OS, Blackberry OS and mobile versions of Linux OS.
18. Computer software, provided in a non-transitory computer-readable medium, that extracts and clarifies ambiguities of addresses in documents, the software comprising:
executable code that constructs a plurality of context clouds, wherein each of the context clouds corresponds to place names and homographs therefor;
executable code that determines place names in each of the documents using a dictionary of geographic names;
executable code that clarifies ambiguities for the place names determined in each of the documents using the context clouds and residual documents, wherein the residual documents correspond to each of the documents having detected place names removed therefrom; and
executable code that provides a relevance score for each remaining one of the detected place names.
19. Computer software, according to claim 18, wherein using the context clouds and residual documents includes determining the relevance score of each of the place names and determining a relevance score of corresponding homographs for each of the place names.
20. Computer software, according to claim 18, further comprising:
executable code that constructs a list of place names contained in each of the documents with the corresponding relevance scores.
21. Computer software, according to claim 18, wherein a place name that is determined to be less relevant than a particular homograph therefor is eliminated.
22. Computer software, according to claim 18, further comprising:
executable code that provides generalization for remaining ones of the detected place names to provide additional augmented place names.
23. Computer software, according to claim 18, wherein executable code that constructs a plurality of context clouds constructs a list of geographic names from existing databases.
24. Computer software, according to claim 23, wherein executable code that constructs a plurality of context clouds consults lexical resources to add homographs of the geographical names to detect ambiguities in the geographic names.
25. Computer software, according to claim 24, wherein, for each of the homographs, additional information is provided based on online resources that include at least one of: encyclopedia sites, news sites, travel sites, and business sites.
26. Computer software, according to claim 25, wherein web crawlers are used for at least one of: constructing the list of geographic names, adding homographs, and building context clouds.
27. Computer software, according to claim 18, wherein the relevance score is based on a tf-idf score and on uniqueness of terms in context clouds.
28. Computer software, according to claim 18, wherein the context clouds are periodically refreshed to obtain upgrades.
29. Computer software, according to claim 18, wherein executable code that detects place names looks up place name words and bigrams in the dictionary of geographic names.
30. Computer software, according to claim 18, wherein executable code that detects place names uses a Named-Entity Recognition technique, wherein recognized Named-Entity Recognition entries are limited only to geographic items that have an associated place name in the dictionary of geographic names.
31. Computer software, according to claim 18, wherein the documents include notes.
32. A cloud-based content management system that extracts and clarifies ambiguities of addresses in documents, comprising:
a plurality of context clouds, wherein each of the context clouds corresponds to place names and homographs therefor;
a dictionary of geographic names; and
at least one processor having executable code that clarifies ambiguities for the place names determined in each of the documents using the context clouds and residual documents, wherein the residual documents correspond to each of the documents having detected place names removed therefrom and having executable code that provides a relevance score for each remaining one of the detected place names with mobile clients extracts and clarifies ambiguities of addresses in documents.
33. A cloud-based content management system, according to claim 32, further comprising: a mobile client device that includes software that is one of: pre-loaded with the device, installed from an app store, and downloaded from a Web site.
34. A cloud-based content management system, according to claim 33, wherein the mobile device uses an operating system selected from the group consisting of: iOS, Android OS, Windows Phone OS, Blackberry OS and mobile versions of Linux OS.
PCT/US2013/066493 2012-11-08 2013-10-24 Extraction and clarification of ambiguities for addresses in documents WO2014074317A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201261723880P 2012-11-08 2012-11-08
US61/723,880 2012-11-08
US201314056184A 2013-10-17 2013-10-17
US14/056,184 2013-10-17

Publications (1)

Publication Number Publication Date
WO2014074317A1 (en) 2014-05-15

Family

ID=50685073

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2013/066493 WO2014074317A1 (en) 2012-11-08 2013-10-24 Extraction and clarification of ambiguities for addresses in documents

Country Status (1)

Country Link
WO (1) WO2014074317A1 (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040103117A1 (en) * 2002-11-27 2004-05-27 Michael Segler Building a geographic database
US20050080613A1 (en) * 2003-08-21 2005-04-14 Matthew Colledge System and method for processing text utilizing a suite of disambiguation techniques
US6996520B2 (en) * 2002-11-22 2006-02-07 Transclick, Inc. Language translation system and method using specialized dictionaries
US20080104019A1 (en) * 2006-10-26 2008-05-01 Microsoft Corporation Associating Geographic-Related Information with Objects
US20090144609A1 (en) * 2007-10-17 2009-06-04 Jisheng Liang NLP-based entity recognition and disambiguation
US20090187538A1 (en) * 2008-01-17 2009-07-23 Navteq North America, Llc Method of Prioritizing Similar Names of Locations for use by a Navigation System
US20100179754A1 (en) * 2009-01-15 2010-07-15 Robert Bosch Gmbh Location based system utilizing geographical information from documents in natural language
US20100229082A1 (en) * 2005-09-21 2010-09-09 Amit Karmarkar Dynamic context-data tag cloud
US20110145348A1 (en) * 2009-12-11 2011-06-16 CitizenNet, Inc. Systems and methods for identifying terms relevant to web pages using social network messages
US20120246731A1 (en) * 2011-03-21 2012-09-27 Mocana Corporation Secure execution of unsecured apps on a device

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110597943A (en) * 2019-09-16 2019-12-20 腾讯科技(深圳)有限公司 Interest point processing method and device based on artificial intelligence and electronic equipment
CN111797628A (en) * 2020-06-03 2020-10-20 武汉理工大学 Travel memory place name disambiguation method based on time geography
CN111797628B (en) * 2020-06-03 2024-03-08 武汉理工大学 Method for disambiguating tourist names based on time geography
CN112966511A (en) * 2021-02-08 2021-06-15 广州探迹科技有限公司 Entity word recognition method and device
CN112966511B (en) * 2021-02-08 2024-03-15 广州探迹科技有限公司 Entity word recognition method and device
US20230093453A1 (en) * 2021-09-16 2023-03-23 Centripetal Networks, Inc. Malicious homoglyphic domain name generation and associated cyber security applications
US11856005B2 (en) * 2021-09-16 2023-12-26 Centripetal Networks, Llc Malicious homoglyphic domain name generation and associated cyber security applications
WO2024000656A1 (en) * 2022-06-29 2024-01-04 青岛海尔科技有限公司 Place name recognition method, system and apparatus, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13852460

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13852460

Country of ref document: EP

Kind code of ref document: A1