US20130338996A1 - Transliteration For Query Expansion - Google Patents

Transliteration For Query Expansion

Info

Publication number
US20130338996A1
Authority
US
United States
Prior art keywords
transliterated
term
terms
source
query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US14/010,204
Inventor
Lalitesh Kumar Katragadda
Vineet Gupta
Piyush Prahladka
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Google LLC
Priority to US14/010,204
Assigned to GOOGLE INC. Assignment of assignors' interest (see document for details). Assignors: Vineet Gupta, Lalitesh Kumar Katragadda, Piyush Prahladka
Publication of US20130338996A1

Classifications

    • G06F17/30976
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9032 Query formulation
    • G06F16/90332 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language

Definitions

  • This specification relates to query expansion for users submitting queries to search engines.
  • Search engines, and in particular Internet search engines, aim to identify resources (e.g., web pages, images, text documents, multimedia content) that are relevant to a user's needs and to present information about the resources in a manner that is most useful to the user.
  • Internet search engines return search results in response to a user submitted query. If a user is dissatisfied with the search results returned for a query, the user can attempt to refine the query to better match the user's needs.
  • Some search engines provide a user with suggested alternative queries, for example, expanded queries, that the search engine identifies as being related to the user's query.
  • Techniques for finding synonyms of query words for query expansion typically depend on natural language models or user search log data. The identified synonyms of query words can be used to expand a query in an attempt to identify additional or more relevant resources to improve user search experience.
  • Electronic documents are typically written in many different languages.
  • Each language is normally expressed in a particular writing system (i.e., a script), which is usually characterized by a particular alphabet.
  • For example, the English language is expressed using the Latin alphabet while the Hindi language is normally expressed using the Devanāgarī alphabet.
  • the scripts used by some languages include a particular alphabet that has been extended to include additional marks or characters.
  • the script of one language is used to represent words normally written in the script of another language.
  • a transliterated term can be a term that has been converted from one script to another script or a phonetic representation in one script of a term in another script.
  • Techniques for finding synonyms of query words for query expansion may not work well for finding synonyms of query terms that are transliterated terms.
  • current natural language techniques do not work well with transliterated data, and search log data typically provide poor coverage for most transliterated variations.
  • This specification describes technologies relating to identifying candidate synonyms of transliterated terms for query expansion.
  • one aspect of the subject matter described in this specification can be embodied in computer-implemented methods that include the actions of identifying, using one or more computers, multiple transliterated terms in a target language, for each transliterated term of the multiple transliterated terms in the target language, mapping the transliterated term to one or more terms in a source language, and for a first transliterated term of the multiple transliterated terms in the target language, identifying one or more second transliterated terms of the multiple transliterated terms in the target language as candidate synonyms of the first transliterated term, where each of the one or more second transliterated terms is mapped to at least one term in the source language that is also mapped from the first transliterated term.
  • Other embodiments of this aspect include corresponding systems, apparatus, and computer program products.
  • Identifying the multiple transliterated terms in the target language can further include identifying from web resources terms containing only characters of the target language.
  • the aspect can further include computing a statistic for each identified term containing only characters of the target language, comparing the statistic for each identified term to a specified threshold, and including a particular identified term in the multiple transliterated terms in the target language if the statistic for the particular identified term exceeds the specified threshold.
  • the statistic for each identified term can be a ratio of a probability of occurrence of the identified term in web resources of a top-level domain associated with one or more locales where the source language is spoken to a probability of occurrence of the identified term in web resources of a top-level domain associated with any locale.
  • the statistic for each identified term can be a ratio of a probability of occurrence of the identified term in web resources associated with one or more locales where the source language is spoken to a probability of occurrence of the identified term in web resources associated with any locale.
  • the association of a web resource with a locale where the source language is spoken can be determined by a top-level domain of the web resource.
  • Mapping the transliterated term to one or more terms in the source language can further include transliterating the transliterated term in the target language to the one or more terms in the source language.
  • Each of the one or more second transliterated terms identified as candidate synonyms of the first transliterated term can have a confidence value with respect to the first transliterated term that is above a specified threshold.
  • the confidence value of a second transliterated term can be a function of a number of terms in the source language that are mapped from both the first transliterated term and the second transliterated term.
  • Transliterating the transliterated term in the target language to a term in the source language can further include generating a transliteration score for the transliteration of the transliterated term in the target language to the term in the source language.
  • the confidence value of a second transliterated term can be a function of one or more of a probability of occurrence of the second transliterated term in web resources, the transliteration score for the transliteration of the second transliterated term to a term in the source language that is also mapped from the first transliterated term, and the transliteration score for the transliteration of the first transliterated term to the term in the source language.
  • the aspect can further include, for the first transliterated term of the multiple transliterated terms in the target language, identifying one or more terms in the source language that are mapped from the first transliterated term and from at least one of the one or more second transliterated terms as candidate synonyms of the first transliterated term.
  • the aspect can further include receiving a query including the first transliterated term, expanding the query with one or more of the candidate synonyms of the first transliterated term, providing the expanded query to a search engine, and receiving search results for the expanded query.
  • the aspect can further include receiving a query including the first transliterated term, and providing one or more expanded queries for selection by a user, each expanded query including the query and one or more of the candidate synonyms of the first transliterated term.
  • the aspect can further include receiving a query including the first transliterated term, providing the query to a search engine, where the search engine identifies as a possible search result for the query a web resource that includes at least one of the candidate synonyms of the first transliterated term but does not include any term in the query, and modifying a score associated with the web resource, the score for use in ranking possible search results for the query.
  • the aspect can further include receiving a query including the first transliterated term, providing the query to a search engine, where the search engine identifies as a possible search result for the query a web resource that includes at least one of the terms in the source language that is mapped from the first transliterated term and from at least one of the one or more second transliterated terms but does not include any term in the query, and modifying an information retrieval score associated with the web resource, the information retrieval score for use in ranking possible search results for the query.
  • Another aspect of the subject matter described in this specification can be embodied in computer-implemented methods that include the actions of generating, using one or more computers, a training group of possible transliterated synonyms in a target language, training a probabilistic model using the training group to learn probabilities of spelling variations in transliterated synonyms in the target language, and applying the probabilistic model to a particular transliterated term in the target language to identify one or more candidate synonyms of the particular transliterated term.
  • Other embodiments of this aspect include corresponding systems, apparatus, and computer program products.
  • Another aspect of the subject matter described in this specification can be embodied in computer-implemented methods that include the actions of identifying, using one or more computers, multiple transliterated terms in a target language, for a first transliterated term of the multiple transliterated terms in the target language, identifying one or more second transliterated terms of the multiple transliterated terms in the target language as candidate synonyms of the first transliterated term, and using the candidate synonyms of the first transliterated term to expand queries including the first transliterated term.
  • Other embodiments of this aspect include corresponding systems, apparatus, and computer program products.
  • Transliterated terms are identified as candidate synonyms for a particular transliterated term, where the candidate synonyms can be used for expanding queries including the particular transliterated term.
  • Transliterated synonyms in a target language can be identified for newer transliterated terms (e.g., terms transliterated from terms in a source language from current news stories or current cultural references), which may have poor coverage in user search log data.
  • a system that can expand a user's query to include candidate transliterated synonyms for a given transliterated term may return better search results than a search system that does not have the same query expansion capability.
  • FIG. 1 is a block diagram of an example search system.
  • FIGS. 2A-2C illustrate an example technique for identifying candidate synonyms for a transliterated term.
  • FIG. 3 is a flow chart of an example process for identifying candidate synonyms for a transliterated term.
  • FIG. 4 is a flow chart of an example process for providing search results for an expanded query that includes a transliterated term and a candidate synonym.
  • FIG. 5 is a flow chart of an example process for identifying candidate synonyms for a transliterated term.
  • FIG. 1 is a block diagram of an example search system 114 that can be used to provide search results relevant to submitted queries as can be implemented in an Internet, an intranet, or another client and server environment.
  • the search system 114 is an example of an information retrieval system in which the systems, components, and techniques described below can be implemented.
  • a user 102 can interact with the search system 114 through a client device 104 .
  • the client 104 can be a computer coupled to the search system 114 through a local area network (LAN) or wide area network (WAN), e.g., the Internet.
  • the search system 114 and the client device 104 can be one machine.
  • a user can install a desktop search application on the client device 104 .
  • the client device 104 will generally include a random access memory (RAM) 106 and a processor 108 .
  • a user 102 can submit a query 110 to a search engine 130 within a search system 114 .
  • the query 110 is transmitted through a network to the search system 114 .
  • the search system 114 can be implemented as, for example, computer programs running on one or more computers in one or more locations that are coupled to each other through a network.
  • the search system 114 includes an index database 122 and a search engine 130 .
  • the search system 114 responds to the query 110 by generating search results 128 , which are transmitted through the network to the client device 104 in a form that can be presented to the user 102 (e.g., as a search results web page to be displayed in a web browser running on the client device 104 ).
  • the search engine 130 identifies resources that match the query 110 .
  • the search engine 130 will generally include an indexing engine 120 that indexes resources (e.g., web pages, images, or news articles on the Internet), an index database 122 that stores the index information, and a ranking engine 152 (or other software) that ranks the resources that match the query 110 .
  • the search engine 130 can transmit the search results 128 through the network to the client device 104 for presentation to the user 102 .
  • a query includes one or more terms that are transliterated terms.
  • Transliteration converts a term in a source language to a transliterated term in a target language. After conversion, the letters or characters of the term in the source language are represented by letters or characters of the target language.
  • a machine learning technique for transliteration is described, for example, in U.S. patent application Ser. No. 12/043,854, titled “Machine Learning for Transliteration,” filed Mar. 6, 2008.
  • Transliterations do not have a notion of correct spelling. As a result, there often exist multiple spellings in a target language for transliterations of a word in a source language. For a particular term in a source language having multiple transliterations in a target language, transliterated terms in the target language that vary from a given transliterated term in the target language can be treated as candidate synonyms of the given transliterated term. These candidate transliterated synonyms are different possible transliterations of the same term in the source language.
  • the Hindi word can be transliterated into English as “chakrabarti” or “chakrabarty”.
  • the transliterated term “chakrabarty” can be identified as a candidate synonym of the given transliterated term, “chakrabarti”.
  • Candidate synonyms identified for a given transliterated term can be used to expand queries that include the given transliterated term. For example, if there is a popular new Hindi song available on several websites on the Internet, a user may find it difficult to search for the song if the websites transliterate a Hindi word in the song title to a first transliterated term while the user enters a query with a second transliterated term for the same Hindi word. A search system that can expand the user's query to include candidate transliterated synonyms for the second transliterated term may return better search results than a search system that does not have the same query expansion capability.
  • FIGS. 2A-2C illustrate an example technique for identifying candidate synonyms for a transliterated term.
  • the example technique can be used to expand a query including the transliterated term to include synonyms of the transliterated term in an attempt to improve the search results returned for the query.
  • the example technique uses transliteration techniques to determine what terms in a target language (e.g., English) are transliterated from the same term in a source language (e.g., Hindi).
  • FIG. 2A illustrates a list 210 of possible transliterated terms in English, the target language, where the source language is Hindi.
  • a system can generate or identify the list 210 of possible transliterated terms in any number of different ways.
  • the system can identify the possible transliterated terms of the list 210 from web resources as terms containing only characters of the target language (e.g., Latin characters).
  • the identified terms containing only characters of the target language include words with meaning in the target language and possible transliterated terms without meaning in the target language.
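  • A minimal sketch of the character filter described above, assuming that a term is a whitespace-delimited token and that "characters of the target language" can be approximated by the ASCII Latin letters. The name `latin_only_terms` and the sample text are illustrative and do not come from the specification.

```python
import re

# Assumption: the target script is approximated by the letters a-z.
LATIN_TERM = re.compile(r"[a-z]+")

def latin_only_terms(text):
    """Return lowercased tokens made up solely of Latin letters."""
    terms = []
    for token in text.lower().split():
        token = token.strip(".,;:!?\"'()")  # drop surrounding punctuation
        if LATIN_TERM.fullmatch(token):
            terms.append(token)
    return terms

sample = "Shriram sang राम bhajan, track2 today"
terms = latin_only_terms(sample)
```

  • As the text notes, the output mixes ordinary English words ("sang", "today") with possible transliterated terms ("shriram", "bhajan"), so a further statistical filter is still needed.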
  • the system can compute statistics for the identified terms containing only characters of the target language and can compare the statistics to a specified threshold. That is, for each identified term, a statistic is computed and compared to a threshold, where the system includes the identified term in the list 210 of possible transliterated terms if the statistic for the identified term exceeds the specified threshold.
  • transliterated terms in English may have a higher probability of occurring on an Indian web resource than on non-Indian web resources.
  • the statistic for each identified term containing only Latin characters can be a function of the probability of occurrence on an Indian web resource.
  • the statistic for each identified term is a ratio of the probability of occurrence of the identified term in web resources of a top-level domain associated with one or more locales (e.g., countries or regions) where the source language is spoken to the probability of occurrence of the identified term in web resources of a top-level domain associated with any locale.
  • the statistic could be the ratio of the probability of the identified term occurring on an Indian web page to the probability of the identified term occurring on any web page. If the statistic computed for a particular identified term exceeds a specified threshold, the particular identified term can be included in the list 210 of possible transliterated terms.
  • the statistic for each identified term is a ratio of the probability of occurrence of the identified term in web resources associated with one or more locales (e.g., countries or regions) where the source language is spoken to the probability of occurrence of the identified term in web resources associated with any locale.
  • the association of a web resource with a locale where the source language is spoken can be determined by a top-level domain of the web resource.
  • the statistic could be the ratio of the probability of the identified term occurring on an Indian web domain to the probability of the identified term occurring on any web domain. If the statistic computed for a particular identified term exceeds a specified threshold, the particular identified term can be included in the list 210 of possible transliterated terms.
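  • The ratio statistic can be sketched as follows. The counts, the threshold value, and the function name `locale_ratio` are assumptions made for illustration; the specification does not give concrete values.

```python
def locale_ratio(term, locale_counts, global_counts):
    """Ratio of P(term | locale resources) to P(term | all resources)."""
    locale_total = sum(locale_counts.values())
    global_total = sum(global_counts.values())
    p_locale = locale_counts.get(term, 0) / locale_total
    p_global = global_counts.get(term, 0) / global_total
    if p_global == 0:
        return 0.0  # term never seen globally; treat as no evidence
    return p_locale / p_global

# Hypothetical term counts; real values would come from a crawl.
indian_counts = {"shriram": 40, "the": 960}
all_counts = {"shriram": 50, "the": 9950}

THRESHOLD = 2.0  # assumed value, not taken from the specification

possible_transliterations = [
    term for term in all_counts
    if locale_ratio(term, indian_counts, all_counts) > THRESHOLD
]
```

  • With these invented counts, "shriram" is about eight times more likely on Indian resources than overall and passes the threshold, while "the" has a ratio just under one and is excluded.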
  • a particular web page or a particular web domain may use a particular identified term an exceptionally large number of times, which could skew the statistic for the particular identified term.
  • the system caps the statistic for each identified term or a component of the statistic for each identified term at a specified limit to prevent skewing of the statistic. For example, the system can cap the per-page contribution of the identified term on Indian web pages or the per-domain contribution of the identified term on Indian domains.
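  • One way to cap the per-page contribution is to clip each page's count before summing. This is a sketch under assumed values; the cap of 3 and the name `capped_count` are hypothetical.

```python
def capped_count(per_page_counts, cap):
    """Sum a term's occurrences, counting at most `cap` per page."""
    return sum(min(count, cap) for count in per_page_counts)

# Hypothetical occurrences of one term on four pages; one page repeats
# the term 500 times and would otherwise dominate the total.
occurrences = [1, 2, 500, 1]

raw_total = sum(occurrences)                     # 504
capped_total = capped_count(occurrences, cap=3)  # 1 + 2 + 3 + 1 = 7
```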
  • the statistic for each identified term is a ratio of the probability of the identified term being included in a query submitted to a search engine having an interface in the source language to the probability of the identified term being included in a query submitted to a search engine having an interface in any language.
  • the system can compute the statistic using Indian and non-Indian search logs.
  • the system computes multiple statistics for each identified term containing only characters of the target language and compares the multiple statistics to respective thresholds. If the multiple statistics for a particular identified term each exceed a respective threshold, the system can include the particular identified term in the list 210 of possible transliterated terms.
  • the possible transliterated terms of the list 210 can alternatively be identified by crawling only known web resources associated with the source language.
  • the system can identify the possible transliterated terms by crawling known Indian websites, for example, Indian blog sites or websites that translate Hindi songs or Hindi technical textbooks.
  • FIG. 2B illustrates relations 215 between each possible transliterated term of the list 210 and one or more terms 220 in the source language, Hindi.
  • Each relation 215 is the result of mapping an element of a first group (i.e., the possible transliterated terms in the target language) to one or more elements of a second group (i.e., the terms 220 in the source language). That is, mapping forms a one-way relation between a possible transliterated term in the target language and one or more terms 220 in the source language.
  • the relations 215 are the result of mapping by transliteration performed, for example, by an English-to-Hindi machine transliterator, implemented as an element of a system.
  • mapping includes generating a transliteration score 225 for each transliteration from a possible transliterated term in the target language to a term 220 in the source language.
  • FIG. 2B illustrates the transliteration score 225 for each transliteration, including the score from “sreeram” to H2 (e.g., score(E1 to H2)), the score from “shriram” to H2 (e.g., score(E3 to H2)), and the score from “shreeram” to H6 (e.g., score(E4 to H6)).
  • the transliteration score 225 of a given possible transliterated term of the list 210 can be a component of a confidence value of the given possible transliterated term with respect to another possible transliterated term.
  • the system can use these confidence values in identifying the possible transliterated terms that should be considered as candidate synonyms for a particular transliterated term.
  • the transliteration scores 225 and the confidence values are described in more detail with respect to FIG. 2C .
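  • The relations and transliteration scores of FIG. 2B could be held in a nested dictionary, sketched below. The labels "H2" and "H6" stand in for the Hindi terms of the figure, and every score value is invented for illustration.

```python
# Each transliterated term maps one-way to source-language terms, with
# a transliteration score per mapping. All scores are hypothetical.
relations = {
    "sreeram":  {"H2": 0.6, "H6": 0.3},
    "shriram":  {"H2": 0.8},
    "shreeram": {"H2": 0.5, "H6": 0.4},
    "sriraam":  {"H6": 0.7},
}

def transliteration_score(target_term, source_term):
    """Look up the score for one mapping; None if the mapping is absent."""
    return relations.get(target_term, {}).get(source_term)
```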
  • FIG. 2C illustrates identifying, for a first possible transliterated term 230 , one or more second possible transliterated terms 240 as candidate synonyms of the first possible transliterated term 230 .
  • If the transliterator maps a term 220 in the source language from two or more possible transliterated terms in the target language, this suggests a synonym relationship between the two or more possible transliterated terms in the target language.
  • H2 is a Hindi word in the source language that is mapped by the transliterator from three possible transliterated terms: “sreeram”, “shriram”, and “shreeram”, suggesting that the three transliterated terms are synonyms.
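  • A sketch of how shared source-language terms can yield candidate synonyms, using an inverted mapping from source terms to transliterated terms. The mappings are hypothetical; "H2" and "H6" are placeholder labels, and "krishna" with "H9" is an invented unrelated entry.

```python
from collections import defaultdict

mapped = {
    "sreeram":  {"H2", "H6"},
    "shriram":  {"H2"},
    "shreeram": {"H2", "H6"},
    "sriraam":  {"H6"},
    "krishna":  {"H9"},
}

def candidate_synonyms(first_term, mapped):
    """Terms sharing at least one source-language term with first_term."""
    # Invert the mapping: source term -> transliterated terms mapped to it.
    inverse = defaultdict(set)
    for term, sources in mapped.items():
        for source in sources:
            inverse[source].add(term)
    candidates = set()
    for source in mapped.get(first_term, ()):
        candidates |= inverse[source]
    candidates.discard(first_term)  # a term is not its own synonym
    return candidates
```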
  • the system identifies the second possible transliterated terms 240 as candidate synonyms of the first possible transliterated term 230 by identifying the possible transliterated terms of the list 210 that are mapped to at least one term 220 in the source language that is also mapped from the first possible transliterated term 230 . Intersections of the terms 220 in the source language give candidate groups for transliterated synonyms. Several techniques can be implemented to increase the reliability of the candidate groups for transliterated synonyms. In some implementations, each of the possible transliterated terms of the list 210 other than the first possible transliterated term 230 has a confidence value with respect to the first possible transliterated term 230 .
  • If a particular possible transliterated term has a confidence value with respect to the first possible transliterated term 230 that is above a specified threshold, the particular possible transliterated term is a second possible transliterated term 240 identified as a candidate synonym of the first possible transliterated term 230 . If mapping does not produce a transliteration score 225 for each transliteration, the confidence value for a given second possible transliterated term 240 can be a function of the number of terms 220 in the source language that are mapped from both the first possible transliterated term 230 and the given second possible transliterated term 240 .
  • “shriram” and “sriraam” each map to only one term 220 (i.e., H2 and H6, respectively) that is also mapped from “sreeram”, the first possible transliterated term 230 .
  • the transliterated term “shreeram” maps to two terms 220 (i.e., H2 and H6) that are also mapped from “sreeram”, the first possible transliterated term 230 .
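  • The count-based confidence for the FIG. 2C example can be sketched as a set intersection. Using the raw count as the confidence value is one possible choice of function, assumed here for illustration.

```python
def shared_source_count(first_sources, second_sources):
    """Number of source-language terms mapped from both terms."""
    return len(first_sources & second_sources)

# Mappings as described for FIG. 2C; "H2" and "H6" are placeholder labels.
sreeram = {"H2", "H6"}
counts = {
    "shriram":  shared_source_count(sreeram, {"H2"}),
    "sriraam":  shared_source_count(sreeram, {"H6"}),
    "shreeram": shared_source_count(sreeram, {"H2", "H6"}),
}
```

  • Under this measure “shreeram” has a higher confidence with respect to “sreeram” than either “shriram” or “sriraam”.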
  • the confidence value for a given second possible transliterated term 240 can be a function of the transliteration scores 225 of the first possible transliterated term 230 and of the given second possible transliterated term 240 .
  • the confidence value for “shriram”, a second possible transliterated term 240 , with respect to “sreeram”, the first possible transliterated term 230 , where both transliterated terms map to H2, can be a function of the transliteration scores 225 score(E1 to H2) and score(E3 to H2).
  • the confidence value for a given second possible transliterated term 240 is a function of a probability of occurrence of the given second possible transliterated term 240 in web resources.
  • the probability of occurrence can be the per-page contribution in web resources or the per-domain contribution in web resources of the given second possible transliterated term 240 .
  • a higher probability of occurrence suggests that the given second possible transliterated term 240 is a more common form of the transliteration from the term in the source language.
  • a higher probability suggests higher confidence in the common transliterated term, which can be reflected in a higher confidence value for the transliterated term.
  • the confidence value for a given second possible transliterated term 240 is a function of multiple components, e.g., the transliteration scores 225 and a probability of occurrence.
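  • The specification leaves the combining function open. The sketch below assumes a simple product of the two transliteration scores and the probability of occurrence; the product, the input values, and the threshold are all assumptions.

```python
def confidence(score_first, score_second, p_occurrence):
    """Assumed combination: product of both scores and the probability."""
    return score_first * score_second * p_occurrence

# Invented values for "sreeram" to H2, "shriram" to H2, and the
# probability of "shriram" occurring in web resources.
value = confidence(score_first=0.6, score_second=0.8, p_occurrence=0.5)

THRESHOLD = 0.2  # assumed; the specification does not give a value
is_candidate = value > THRESHOLD
```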
  • Although FIG. 2C includes as second possible transliterated terms 240 all possible transliterated terms that map to a term 220 in the source language that is also mapped from the first possible transliterated term 230 , implementation of any of the above techniques for increasing the reliability of candidate groups can reduce the group of candidate synonyms to a subgroup of the second possible transliterated terms 240 illustrated in FIG. 2C .
  • the system identifies one or more of the terms 220 in the source language that are mapped from the first possible transliterated term 230 and from at least one of the second possible transliterated terms 240 as candidate synonyms of the first possible transliterated term 230 in addition to or instead of the second possible transliterated terms 240 .
  • the system can identify the terms H2 and H6 as candidate synonyms of “sreeram”.
  • the system identifies the terms 220 in the source language that are mapped from the same transliterated term in the target language as a candidate synonym group. For the example of FIG. 2C , the system can identify the terms H2 and H6, mapped from the same transliterated terms “sreeram” and “shreeram”, as a candidate synonym group.
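  • Source-language synonym groups can be sketched by grouping on the set of source terms each transliterated term maps to. The mappings are hypothetical, and requiring at least two source terms per group is an assumption.

```python
mapped = {
    "sreeram":  {"H2", "H6"},
    "shreeram": {"H2", "H6"},
    "shriram":  {"H2"},
}

def source_synonym_groups(mapped):
    """Map each multi-term source group to the terms that produce it."""
    groups = {}
    for term, sources in mapped.items():
        if len(sources) > 1:  # a group needs at least two source terms
            groups.setdefault(frozenset(sources), set()).add(term)
    return groups
```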
  • the system can use the candidate transliterated synonyms (i.e., the second possible transliterated terms 240 ) for query expansion.
  • the search system can identify one or more candidate transliterated synonyms of the first possible transliterated term 230 .
  • the query can be expanded with one or more of the identified candidate transliterated synonyms of the first possible transliterated term 230 .
  • the system can expand a query including “sreeram” to include one or more of “shriram”, “shreeram”, and “sriraam”.
  • the system ranks the candidate synonyms by confidence value, and the system selects only N candidate synonyms with the N highest confidence values for including in expanded queries.
  • the system provides the expanded query to a search engine (e.g., the search engine 130 of FIG. 1 ), and receives search results for the expanded query.
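  • A sketch of expanding a query with the N highest-confidence candidate synonyms. The `OR` syntax, the confidence values, and the name `expand_query` are assumptions; an actual search engine may represent expanded queries differently.

```python
def expand_query(query, synonyms_by_term, n=2):
    """Add the n highest-confidence synonyms of each query term."""
    expanded = []
    for term in query.split():
        ranked = sorted(
            synonyms_by_term.get(term, {}).items(),
            key=lambda item: item[1],
            reverse=True,
        )
        variants = [term] + [synonym for synonym, _ in ranked[:n]]
        if len(variants) > 1:
            expanded.append("(" + " OR ".join(variants) + ")")
        else:
            expanded.append(term)
    return " ".join(expanded)

# Invented confidence values for the candidate synonyms of "sreeram".
synonyms = {"sreeram": {"shriram": 0.24, "shreeram": 0.31, "sriraam": 0.12}}
expanded = expand_query("sreeram songs", synonyms)
```

  • Here `expanded` keeps the original term first and drops “sriraam”, which has the lowest confidence of the three candidates.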
  • In some implementations, if the system selects a possible transliterated term as a candidate transliterated synonym for a given transliterated term, the system also selects the given transliterated term as a candidate transliterated synonym for the possible transliterated term. In other implementations, if the system selects a possible transliterated term as a candidate transliterated synonym for a given transliterated term, the system does not select the given transliterated term as a candidate transliterated synonym for the possible transliterated term. That is, there may or may not be reverse mapping of transliterated synonyms.
  • mapping candidate transliterated synonyms to a given transliterated term occurs on the document side of a query search. For example, where “a” is a candidate transliterated synonym of “b,” if a user submits a query including the transliterated term “b” but not the transliterated term “a” and if a web document contains “a” but not “b,” the search system (e.g., the search system 114 of FIG. 1 ) can treat the web document as if the web document also contains “b,” so that the web document is a candidate search result for the search including “b.” However, since the web document does not actually include “b,” the search system can reduce a score associated with the web document (e.g., an information retrieval score for ranking the web document as a candidate search result), which, consequently, can reduce the chance of the web document being returned for the search.
  • document-level mapping of candidate synonyms includes one or more terms 220 in the source language.
  • the search system can treat a web document containing “sreeram” as if the web document also contains the Hindi word H 2 or H 6 .
  • the search system can also reduce a score associated with the web document accordingly.
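  • Document-level mapping with a reduced score can be sketched as follows. The specification does not say how much the score is reduced, so the multiplicative `penalty` parameter, the function name `score_document`, and the sample data are all assumptions made for illustration.

```python
def score_document(query_term, doc_terms, synonyms, base_score, penalty=0.5):
    """Score a document for a single-term query.

    `doc_terms` is the set of terms in the document and `synonyms` maps a
    term to its set of candidate synonyms. Returns None when the document
    contains neither the query term nor any candidate synonym. `penalty` is
    a hypothetical multiplier in (0, 1] applied when the document matches
    only through a candidate synonym.
    """
    if query_term in doc_terms:
        return base_score
    if doc_terms & synonyms.get(query_term, set()):
        return base_score * penalty
    return None


# Hypothetical synonym table and documents.
example_synonyms = {"sreeram": {"shriram", "shreeram", "sriraam"}}
print(score_document("sreeram", {"shriram", "song"}, example_synonyms, 1.0))  # 0.5
```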
  • FIG. 3 is a flow chart of an example process 300 for identifying candidate synonyms for a transliterated term. For convenience, the example process 300 will be described with reference to the example technique of FIGS. 2A-2C and a system that performs the process 300 .
  • the system identifies multiple transliterated terms in a target language (step 310 ). For example, the system identifies the possible transliterated terms of the list 210 in FIG. 2A .
  • For each transliterated term of the multiple transliterated terms in the target language, the system maps the transliterated term to one or more terms in a source language (step 320).
  • FIG. 2B illustrates an example of mapping using an English-to-Hindi transliterator.
  • For a first transliterated term of the multiple transliterated terms in the target language, the system identifies one or more second transliterated terms of the multiple transliterated terms as candidate synonyms of the first transliterated term (step 330). Each of the one or more second transliterated terms is mapped to at least one term in the source language that is also mapped from the first transliterated term.
  • FIG. 2C illustrates second possible transliterated terms 240 (i.e., “shriram”, “shreeram”, and “sriraam”) identified as candidate synonyms of a first possible transliterated term 230 (i.e., “sreeram”).
  • the candidate synonyms can be used for query expansion, for example, as described with respect to FIG. 4 .
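  • Steps 320 and 330 of the process 300 can be sketched as an inverted index from source-language terms back to transliterated terms. This is a minimal sketch under stated assumptions: the mapping below is invented for illustration and is not the mapping of FIG. 2B, the labels "H1", "H2", and "H6" are placeholders for source-language terms, and the function name `candidate_synonyms` is hypothetical.

```python
from collections import defaultdict


def candidate_synonyms(mapping):
    """Identify candidate synonyms among transliterated terms.

    `mapping` is the result of step 320: transliterated term -> set of
    source-language terms. The return value maps each transliterated term
    to the other transliterated terms that share at least one
    source-language term with it (step 330).
    """
    by_source = defaultdict(set)
    for term, sources in mapping.items():
        for source in sources:
            by_source[source].add(term)

    result = {term: set() for term in mapping}
    for terms in by_source.values():
        for term in terms:
            result[term] |= terms - {term}
    return result


# Hypothetical output of a target-to-source transliterator.
example_mapping = {
    "sreeram": {"H2", "H6"},
    "shreeram": {"H2", "H6"},
    "shriram": {"H2"},
    "sriraam": {"H6"},
    "seetaram": {"H1"},
}
example_result = candidate_synonyms(example_mapping)
print(sorted(example_result["sreeram"]))  # ['shreeram', 'shriram', 'sriraam']
```

  • In this sketch the relation is symmetric, which corresponds to the implementations in which reverse mapping of transliterated synonyms is used.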
  • FIG. 4 is a flow chart of an example process 400 for providing search results for an expanded query that includes a transliterated term and a candidate synonym.
  • the example process 400 will be described with reference to the example technique of FIGS. 2A-2C and a system that performs the process 400.
  • the system receives a query including a first transliterated term (step 410 ).
  • the query can include the transliterated term “sreeram” illustrated in FIG. 2C .
  • the system provides one or more expanded queries for selection by a user, where each expanded query includes the query and one or more candidate synonyms of the first transliterated term (step 420 ).
  • the candidate synonyms can be identified, for example, using the example process 300 of FIG. 3 .
  • the system can provide expanded queries that also include one or more of “shriram”, “shreeram”, and “sriraam”, as illustrated in FIG. 2C .
  • the system receives a selection of an expanded query from the user (step 430 ).
  • the expanded queries can be presented to the user as selectable hyperlinks on an interface of a web browser running on a client device (e.g., the client device 104 of FIG. 1 ).
  • the system can receive the selection of an expanded query as a selection by the user of the hyperlink for the selected expanded query.
  • In some implementations, the system generates an expanded query with one or more of the candidate synonyms and proceeds to step 440 without performing steps 420 and 430.
  • the system provides the expanded query to a search engine (step 440 ).
  • the system can submit the expanded query to the search engine 130 of FIG. 1 .
  • the search engine performs the search, sending search results for the expanded query to the system.
  • the system receives the search results for the expanded query (step 450 ).
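  • Building the expanded query that is provided to the search engine can be sketched as follows. The specification does not fix a representation for an expanded query; joining a term and its candidate synonyms with `OR` is an assumption, as are the function name `expand_query` and the `max_synonyms` parameter.

```python
def expand_query(query, synonyms, max_synonyms=3):
    """Replace each query term that has candidate synonyms with an OR group.

    `synonyms` maps a transliterated term to a list of candidate synonyms,
    assumed here to be ordered from most to least confident.
    """
    parts = []
    for term in query.split():
        alternatives = synonyms.get(term, [])[:max_synonyms]
        if alternatives:
            parts.append("(" + " OR ".join([term, *alternatives]) + ")")
        else:
            parts.append(term)
    return " ".join(parts)


# Hypothetical synonym table.
example_table = {"sreeram": ["shriram", "shreeram", "sriraam"]}
print(expand_query("sreeram song", example_table))
# (sreeram OR shriram OR shreeram OR sriraam) song
```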
  • In some implementations, the system provides the received query of step 410 to the search engine without expanding the query. Instead, the system performs document-level mapping as described above with respect to FIG. 2C.
  • the search engine can identify as a possible search result for the query a web resource that includes at least one of the candidate synonyms of the first transliterated term but does not include any term (e.g., the first transliterated term) in the query.
  • the search engine can identify as a possible search result for the query a web resource that does not include any term (e.g., the first transliterated term) in the query but that does include at least one of the terms in a source language that is mapped from the first transliterated term and from at least one of the candidate synonyms.
  • the system can modify (e.g., reduce) a score for use in ranking that is associated with the web resource identified as a possible search result.
  • FIG. 5 is a flow chart of an example process 500 for identifying candidate synonyms for a transliterated term.
  • the example process 500 will be described with reference to a system that performs the process 500 .
  • the process 500 directly learns possible variations in spelling for transliterated terms in a target language. Since transliterated synonyms are generally phonetically similar, the variations between the transliterated synonyms are language specific.
  • the system generates a training group of possible transliterated synonyms in a target language (step 510 ).
  • the system trains a probabilistic model using the training group to learn probabilities of spelling variations in transliterated synonyms in the target language (step 520 ).
  • the system applies the probabilistic model to a particular transliterated term in the target language to identify one or more candidate synonyms of the particular transliterated term (step 530 ).
  • the system can use the candidate synonyms for query expansion as described above.
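  • The three steps of the process 500 can be sketched with a very simple model. The specification does not name a model family, so everything here is an assumption: spelling variations are taken to be substring rewrites found by aligning pairs of known synonyms with Python's `difflib.SequenceMatcher`, the probability of a rewrite is estimated as the number of times it is observed divided by the number of times its left-hand side occurs in the training terms, and the training pairs are invented.

```python
from collections import Counter
from difflib import SequenceMatcher


def extract_rewrites(a, b):
    """Yield (src, dst) substring rewrites that turn spelling a into b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        src, dst = a[i1:i2], b[j1:j2]
        if not src or not dst:
            # Pure insertion or deletion: anchor it to the preceding character.
            if i1 == 0:
                continue  # no left context; ignored in this sketch
            src, dst = a[i1 - 1] + src, a[i1 - 1] + dst
        yield src, dst


def train(pairs):
    """Estimate P(dst | src) from pairs of possible transliterated synonyms."""
    rewrites, occurrences = Counter(), Counter()
    for a, b in pairs:
        for x, y in ((a, b), (b, a)):  # variations are treated as symmetric
            for src, dst in extract_rewrites(x, y):
                rewrites[(src, dst)] += 1
    sources = {src for src, _ in rewrites}
    for pair in pairs:
        for term in pair:
            for src in sources:
                occurrences[src] += term.count(src)
    return {(s, d): c / occurrences[s] for (s, d), c in rewrites.items()}


def candidates(term, model, min_prob=0.2):
    """Apply each sufficiently probable rewrite once at each position in term."""
    out = {}
    for (src, dst), prob in model.items():
        if prob < min_prob:
            continue
        start = term.find(src)
        while start != -1:
            variant = term[:start] + dst + term[start + len(src):]
            out[variant] = max(out.get(variant, 0.0), prob)
            start = term.find(src, start + 1)
    out.pop(term, None)
    return out


# Hypothetical training group (step 510), training (520), application (530).
example_pairs = [("sreeram", "shreeram"), ("sri", "shri"), ("shriram", "shreeram")]
example_model = train(example_pairs)
print(sorted(candidates("sriram", example_model)))  # ['shriram', 'sreeram']
```

  • The sketch applies only one rewrite at a time and uses almost no context, so it can propose implausible spellings (for example, inserting an "h" after an "s" that is already followed by "h"). A practical model would condition on more context and combine several rewrites.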
  • Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them.
  • Embodiments of the subject matter described in this specification can be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a tangible program carrier for execution by, or to control the operation of, data processing apparatus.
  • the tangible program carrier can be a propagated signal or a computer-readable medium.
  • the propagated signal is an artificially generated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a computer.
  • the computer-readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine-readable propagated signal, or a combination of one or more of them.
  • data processing apparatus encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers.
  • the apparatus can include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
  • a computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
  • a computer program does not necessarily correspond to a file in a file system.
  • a program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
  • a computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
  • the processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output.
  • the processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
  • processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer.
  • a processor will receive instructions and data from a read-only memory or a random access memory or both.
  • the essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data.
  • a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks.
  • a computer need not have such devices.
  • a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, a device with spoken language input, to name just a few.
  • a smart phone is an example of a device with spoken language input, which can accept voice input (e.g., a user query spoken into a microphone on the device).
  • Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.
  • the processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
  • embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer.
  • Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
  • Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components.
  • the components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.
  • the computing system can include clients and servers.
  • a client and server are generally remote from each other and typically interact through a communication network.
  • the relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

Abstract

Methods, systems, and apparatus, including computer program products, for identifying candidate synonyms of transliterated terms for query expansion. In one aspect, a method includes identifying multiple transliterated terms in a target language. For each transliterated term of the multiple transliterated terms in the target language, the transliterated term is mapped to one or more terms in a source language. For a first transliterated term of the multiple transliterated terms in the target language, one or more second transliterated terms of the multiple transliterated terms in the target language are identified as candidate synonyms of the first transliterated term, where each of the one or more second transliterated terms is mapped to at least one term in the source language that is also mapped from the first transliterated term.

Description

    CROSS-REFERENCE TO RELATED APPLICATION
  • This Application is a continuation under 35 U.S.C. §120 of U.S. patent application Ser. No. 12/503,806, filed Jul. 15, 2009 and entitled “Transliteration for Query Expansion”, which claims the benefit under 35 U.S.C. §119(e) of U.S. patent application Ser. No. 61/082,165, filed Jul. 18, 2008. All of the aforementioned patent applications are hereby incorporated by reference in their entirety.
    BACKGROUND
  • This specification relates to query expansion for users submitting queries to search engines.
  • Search engines, and in particular Internet search engines, aim to identify resources (e.g., web pages, images, text documents, multimedia content) that are relevant to a user's needs and to present information about the resources in a manner that is most useful to the user. Internet search engines return search results in response to a user-submitted query. If a user is dissatisfied with the search results returned for a query, the user can attempt to refine the query to better match the user's needs.
  • Some search engines provide a user with suggested alternative queries, for example, expanded queries, that the search engine identifies as being related to the user's query. Techniques for finding synonyms of query words for query expansion typically depend on natural language models or user search log data. The identified synonyms of query words can be used to expand a query in an attempt to identify additional or more relevant resources to improve user search experience.
  • Electronic documents are typically written in many different languages. Each language is normally expressed in a particular writing system (i.e., a script), which is usually characterized by a particular alphabet. For example, the English language is expressed using the Latin alphabet while the Hindi language is normally expressed using the Devanāgarī alphabet. The scripts used by some languages include a particular alphabet that has been extended to include additional marks or characters. In transliteration, the script of one language is used to represent words normally written in the script of another language. For example, a transliterated term can be a term that has been converted from one script to another script or a phonetic representation in one script of a term in another script. Techniques for finding synonyms of query words for query expansion may not work well for finding synonyms of query terms that are transliterated terms. For example, current natural language techniques do not work well with transliterated data, and search log data typically provide poor coverage for most transliterated variations.
    SUMMARY
  • This specification describes technologies relating to identifying candidate synonyms of transliterated terms for query expansion.
  • In general, one aspect of the subject matter described in this specification can be embodied in computer-implemented methods that include the actions of identifying, using one or more computers, multiple transliterated terms in a target language, for each transliterated term of the multiple transliterated terms in the target language, mapping the transliterated term to one or more terms in a source language, and for a first transliterated term of the multiple transliterated terms in the target language, identifying one or more second transliterated terms of the multiple transliterated terms in the target language as candidate synonyms of the first transliterated term, where each of the one or more second transliterated terms is mapped to at least one term in the source language that is also mapped from the first transliterated term. Other embodiments of this aspect include corresponding systems, apparatus, and computer program products.
  • These and other embodiments can optionally include one or more of the following features. Identifying the multiple transliterated terms in the target language can further include identifying from web resources terms containing only characters of the target language. The aspect can further include computing a statistic for each identified term containing only characters of the target language, comparing the statistic for each identified term to a specified threshold, and including a particular identified term in the multiple transliterated terms in the target language if the statistic for the particular identified term exceeds the specified threshold.
  • The statistic for each identified term can be a ratio of a probability of occurrence of the identified term in web resources of a top-level domain associated with one or more locales where the source language is spoken to a probability of occurrence of the identified term in web resources of a top-level domain associated with any locale. The statistic for each identified term can be a ratio of a probability of occurrence of the identified term in web resources associated with one or more locales where the source language is spoken to a probability of occurrence of the identified term in web resources associated with any locale. The association of a web resource with a locale where the source language is spoken can be determined by a top-level domain of the web resource.
  • Mapping the transliterated term to one or more terms in the source language can further include transliterating the transliterated term in the target language to the one or more terms in the source language. Each of the one or more second transliterated terms identified as candidate synonyms of the first transliterated term can have a confidence value with respect to the first transliterated term that is above a specified threshold. The confidence value of a second transliterated term can be a function of a number of terms in the source language that are mapped from both the first transliterated term and the second transliterated term. Transliterating the transliterated term in the target language to a term in the source language can further include generating a transliteration score for the transliteration of the transliterated term in the target language to the term in the source language. The confidence value of a second transliterated term can be a function of one or more of a probability of occurrence of the second transliterated term in web resources, the transliteration score for the transliteration of the second transliterated term to a term in the source language that is also mapped from the first transliterated term, and the transliteration score for the transliteration of the first transliterated term to the term in the source language.
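  • One possible confidence function can be sketched as follows. The specification says only that the confidence value can be a function of the number of shared source-language terms and of the transliteration scores; summing the product of the two transliteration scores over the shared source-language terms is an assumption, as are the function names and the sample scores. "H2" and "H6" are placeholders for source-language terms.

```python
def confidence(first, second, scores):
    """Confidence that `second` is a synonym of `first`.

    `scores` maps a transliterated term to a dict of
    {source-language term: transliteration score}. The value is the sum,
    over source-language terms mapped from both terms, of the product of
    the two transliteration scores. If every score is 1.0, this is simply
    the number of shared source-language terms.
    """
    shared = scores[first].keys() & scores[second].keys()
    return sum(scores[first][s] * scores[second][s] for s in shared)


def synonyms_above(first, scores, threshold):
    """Return the terms whose confidence with respect to `first` exceeds threshold."""
    return sorted(
        term for term in scores
        if term != first and confidence(first, term, scores) > threshold
    )


# Hypothetical transliteration scores.
example_scores = {
    "sreeram": {"H2": 0.8, "H6": 0.5},
    "shreeram": {"H2": 0.9, "H6": 0.4},
    "sriraam": {"H6": 0.2},
}
print(synonyms_above("sreeram", example_scores, 0.5))  # ['shreeram']
```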
  • The aspect can further include, for the first transliterated term of the multiple transliterated terms in the target language, identifying one or more terms in the source language that are mapped from the first transliterated term and from at least one of the one or more second transliterated terms as candidate synonyms of the first transliterated term. The aspect can further include receiving a query including the first transliterated term, expanding the query with one or more of the candidate synonyms of the first transliterated term, providing the expanded query to a search engine, and receiving search results for the expanded query. The aspect can further include receiving a query including the first transliterated term, and providing one or more expanded queries for selection by a user, each expanded query including the query and one or more of the candidate synonyms of the first transliterated term.
  • The aspect can further include receiving a query including the first transliterated term, providing the query to a search engine, where the search engine identifies as a possible search result for the query a web resource that includes at least one of the candidate synonyms of the first transliterated term but does not include any term in the query, and modifying a score associated with the web resource, the score for use in ranking possible search results for the query. The aspect can further include receiving a query including the first transliterated term, providing the query to a search engine, where the search engine identifies as a possible search result for the query a web resource that includes at least one of the terms in the source language that is mapped from the first transliterated term and from at least one of the one or more second transliterated terms but does not include any term in the query, and modifying an information retrieval score associated with the web resource, the information retrieval score for use in ranking possible search results for the query.
  • Another aspect of the subject matter described in this specification can be embodied in computer-implemented methods that include the actions of generating, using one or more computers, a training group of possible transliterated synonyms in a target language, training a probabilistic model using the training group to learn probabilities of spelling variations in transliterated synonyms in the target language, and applying the probabilistic model to a particular transliterated term in the target language to identify one or more candidate synonyms of the particular transliterated term. Other embodiments of this aspect include corresponding systems, apparatus, and computer program products.
  • Another aspect of the subject matter described in this specification can be embodied in computer-implemented methods that include the actions of identifying, using one or more computers, multiple transliterated terms in a target language, for a first transliterated term of the multiple transliterated terms in the target language, identifying one or more second transliterated terms of the multiple transliterated terms in the target language as candidate synonyms of the first transliterated term, and using the candidate synonyms of the first transliterated term to expand queries including the first transliterated term. Other embodiments of this aspect include corresponding systems, apparatus, and computer program products.
  • Particular embodiments of the subject matter described in this specification can be implemented to realize one or more of the following advantages. Transliterated terms are identified as candidate synonyms for a particular transliterated term, where the candidate synonyms can be used for expanding queries including the particular transliterated term. Transliterated synonyms in a target language can be identified for newer transliterated terms (e.g., terms transliterated from terms in a source language from current news stories or current cultural references), which may have poor coverage in user search log data. A system that can expand a user's query to include candidate transliterated synonyms for a given transliterated term may return better search results than a search system that does not have the same query expansion capability.
  • The details of one or more embodiments of the subject matter described in this specification are set forth in the accompanying drawings and the description below.
  • Other features, objects, and advantages of the subject matter will be apparent from the description, the drawings, and the claims.
    BRIEF DESCRIPTION OF DRAWINGS
  • FIG. 1 is a block diagram of an example search system.
  • FIGS. 2A-2C illustrate an example technique for identifying candidate synonyms for a transliterated term.
  • FIG. 3 is a flow chart of an example process for identifying candidate synonyms for a transliterated term.
  • FIG. 4 is a flow chart of an example process for providing search results for an expanded query that includes a transliterated term and a candidate synonym.
  • FIG. 5 is a flow chart of an example process for identifying candidate synonyms for a transliterated term.
  • Like reference symbols and designations in the various drawings indicate like elements.
    DETAILED DESCRIPTION
  • FIG. 1 is a block diagram of an example search system 114 that can be used to provide search results relevant to submitted queries as can be implemented in an Internet, an intranet, or another client and server environment. The search system 114 is an example of an information retrieval system in which the systems, components, and techniques described below can be implemented.
  • A user 102 can interact with the search system 114 through a client device 104. For example, the client device 104 can be a computer coupled to the search system 114 through a local area network (LAN) or wide area network (WAN), e.g., the Internet. In some implementations, the search system 114 and the client device 104 can be one machine. For example, a user can install a desktop search application on the client device 104. The client device 104 will generally include a random access memory (RAM) 106 and a processor 108.
  • A user 102 can submit a query 110 to a search engine 130 within a search system 114. When the user 102 submits a query 110, the query 110 is transmitted through a network to the search system 114. The search system 114 can be implemented as, for example, computer programs running on one or more computers in one or more locations that are coupled to each other through a network. The search system 114 includes an index database 122 and a search engine 130. The search system 114 responds to the query 110 by generating search results 128, which are transmitted through the network to the client device 104 in a form that can be presented to the user 102 (e.g., as a search results web page to be displayed in a web browser running on the client device 104).
  • When the query 110 is received by the search engine 130, the search engine 130 identifies resources that match the query 110. The search engine 130 will generally include an indexing engine 120 that indexes resources (e.g., web pages, images, or news articles on the Internet), an index database 122 that stores the index information, and a ranking engine 152 (or other software) that ranks the resources that match the query 110. The search engine 130 can transmit the search results 128 through the network to the client device 104 for presentation to the user 102.
  • In some scenarios, a query includes one or more terms that are transliterated terms. Transliteration converts a term in a source language to a transliterated term in a target language. After conversion, the letters or characters of the term in the source language are represented by letters or characters of the target language. A machine learning technique for transliteration is described, for example, in U.S. patent application Ser. No. 12/043,854, titled “Machine Learning for Transliteration,” filed Mar. 6, 2008.
  • Terms transliterated from one language to another language can be used in Internet resources. For example, Indic languages like Hindi, Tamil, Telugu, Kannada, and Malayalam are sometimes transliterated to English on Internet resources (e.g., Indian blogs or electronic Indian technical textbooks). These languages, along with some non-Indic languages (e.g., Chinese and other logographic writing systems), often do not have well-developed alternate input mechanisms, such that it is cumbersome to enter characters in these languages.
  • Transliterations do not have a notion of correct spelling. As a result, there often exist multiple spellings in a target language for transliterations of a word in a source language. For a particular term in a source language having multiple transliterations in a target language, transliterated terms in the target language that vary from a given transliterated term in the target language can be treated as candidate synonyms of the given transliterated term. These candidate transliterated synonyms are different possible transliterations of the same term in the source language.
  • As an example, the Hindi word shown in the original publication as an inline image (Figure US20130338996A1-20131219-P00001) can be transliterated into English as “chakrabarti” or “chakrabarty”. Thus, the transliterated term “chakrabarty” can be identified as a candidate synonym of the given transliterated term, “chakrabarti”.
  • Candidate synonyms identified for a given transliterated term can be used to expand queries that include the given transliterated term. For example, if there is a popular new Hindi song available on several websites on the Internet, a user may find it difficult to search for the song if the websites transliterate a Hindi word in the song title to a first transliterated term while the user enters a query with a second transliterated term for the same Hindi word. A search system that can expand the user's query to include candidate transliterated synonyms for the second transliterated term may return better search results than a search system that does not have the same query expansion capability.
  • FIGS. 2A-2C illustrate an example technique for identifying candidate synonyms for a transliterated term. For convenience, the example technique will be described with reference to a system that performs the technique. The example technique can be used to expand a query including the transliterated term to include synonyms of the transliterated term in an attempt to improve the search results returned for the query. The example technique uses transliteration techniques to determine what terms in a target language (e.g., English) are transliterated from the same term in a source language (e.g., Hindi). Several techniques can be implemented to increase the precision or quality of the candidate synonyms.
  • FIG. 2A illustrates a list 210 of possible transliterated terms in English, the target language, where the source language is Hindi. A system can generate or identify the list 210 of possible transliterated terms in any number of different ways.
  • For example, the system can identify the possible transliterated terms of the list 210 from web resources as terms containing only characters of the target language (e.g., Latin characters). The identified terms containing only characters of the target language include words with meaning in the target language and possible transliterated terms without meaning in the target language.
  • To separate the possible transliterated terms from non-transliterated terms (e.g., the words with meaning), the system can compute statistics for the identified terms containing only characters of the target language and can compare the statistics to a specified threshold. That is, for each identified term, a statistic is computed and compared to a threshold, where the system includes the identified term in the list 210 of possible transliterated terms if the statistic for the identified term exceeds the specified threshold.
  • In one example where English is the target language and Hindi is the source language, transliterated terms in English may have a higher probability of occurring on an Indian web resource than on non-Indian web resources. In this example, the statistic for each identified term containing only Latin characters can be a function of the probability of occurrence on an Indian web resource.
  • In some implementations, the statistic for each identified term is a ratio of the probability of occurrence of the identified term in web resources of a top-level domain associated with one or more locales (e.g., countries or regions) where the source language is spoken to the probability of occurrence of the identified term in web resources of a top-level domain associated with any locale. For example, the statistic could be the ratio of the probability of the identified term occurring on an Indian web page to the probability of the identified term occurring on any web page. If the statistic computed for a particular identified term exceeds a specified threshold, the particular identified term can be included in the list 210 of possible transliterated terms.
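  • The ratio statistic described above can be sketched as follows. This is a minimal illustration, not the system's implementation: the function names, the toy counts, and the threshold value are all assumptions, and the probabilities are estimated simply as relative term frequencies.

```python
from collections import Counter

def locale_ratio(term, locale_counts, all_counts):
    """Ratio of P(term | locale resources) to P(term | all resources).

    Probabilities are estimated as relative frequencies (an assumption).
    """
    locale_total = sum(locale_counts.values())
    all_total = sum(all_counts.values())
    term_all = all_counts.get(term, 0)
    if locale_total == 0 or all_total == 0 or term_all == 0:
        return 0.0
    p_locale = locale_counts.get(term, 0) / locale_total
    p_all = term_all / all_total
    return p_locale / p_all

def possible_transliterations(terms, locale_counts, all_counts, threshold):
    """Keep the terms whose ratio statistic exceeds the threshold."""
    return [t for t in terms
            if locale_ratio(t, locale_counts, all_counts) > threshold]

# Hypothetical toy counts: occurrences on Indian-TLD pages vs. all pages.
locale_counts = Counter({"shriram": 40, "music": 60})
all_counts = Counter({"shriram": 50, "music": 950})

kept = possible_transliterations(
    ["shriram", "music"], locale_counts, all_counts, threshold=2.0)
print(kept)
```

In this toy data, “shriram” is far more concentrated on the locale-specific pages than “music”, so only “shriram” passes the threshold.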
  • In some other implementations, the statistic for each identified term is a ratio of the probability of occurrence of the identified term in web resources associated with one or more locales (e.g., countries or regions) where the source language is spoken to the probability of occurrence of the identified term in web resources associated with any locale. The association of a web resource with a locale where the source language is spoken can be determined by a top-level domain of the web resource. For example, the statistic could be the ratio of the probability of the identified term occurring on an Indian web domain to the probability of the identified term occurring on any web domain. If the statistic computed for a particular identified term exceeds a specified threshold, the particular identified term can be included in the list 210 of possible transliterated terms.
  • In some scenarios, a particular web page or a particular web domain may use a particular identified term an exceptionally large number of times, which could skew the statistic for the particular identified term. In some implementations, the system caps the statistic for each identified term or a component of the statistic for each identified term at a specified limit to prevent skewing of the statistic. For example, the system can cap the per-page contribution of the identified term on Indian web pages or the per-domain contribution of the identified term on Indian domains.
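  • As a rough sketch of the capping idea, the per-page contribution of a term can be clipped before the counts are summed. The representation of a page as a list of tokens, the function name, and the cap value are assumptions made only for illustration.

```python
def capped_count(term, pages, cap):
    """Sum occurrences of `term` over pages, counting at most `cap` per page."""
    return sum(min(page.count(term), cap) for page in pages)

# Hypothetical pages, each represented as a list of tokens.
pages = [
    ["sreeram"] * 500,      # one page that repeats the term excessively
    ["sreeram", "song"],
    ["song", "lyrics"],
]

raw = sum(page.count("sreeram") for page in pages)   # uncapped total
capped = capped_count("sreeram", pages, cap=3)       # capped total
print(raw, capped)
```

With the cap applied, the single repetitive page contributes no more than any other page that uses the term a few times, so it cannot dominate the statistic.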
  • In some implementations, the statistic for each identified term is a ratio of the probability of the identified term being included in a query submitted to a search engine having an interface in the source language to the probability of the identified term being included in a query submitted to a search engine having an interface in any language. For example, the system can compute the statistic using Indian and non-Indian search logs.
  • In some implementations, to separate the possible transliterated terms from the non-transliterated terms (e.g., words with meaning in the target language), the system computes multiple statistics for each identified term containing only characters of the target language and compares the multiple statistics to respective thresholds. If the multiple statistics for a particular identified term each exceed a respective threshold, the system can include the particular identified term in the list 210 of possible transliterated terms.
  • The possible transliterated terms of the list 210 can alternatively be identified by crawling only known web resources associated with the source language. For the example where the source language is Hindi, the system can identify the possible transliterated terms by crawling known Indian websites, for example, Indian blog sites or websites that translate Hindi songs or Hindi technical textbooks.
  • FIG. 2B illustrates relations 215 between each possible transliterated term of the list 210 and one or more terms 220 in the source language, Hindi. Each relation 215 is the result of mapping an element of a first group (i.e., the possible transliterated terms in the target language) to one or more elements of a second group (i.e., the terms 220 in the source language). That is, mapping forms a one-way relation between a possible transliterated term in the target language and one or more terms 220 in the source language. In the example technique of FIG. 2B, the relations 215 are the result of mapping by transliteration performed, for example, by an English-to-Hindi machine transliterator, implemented as an element of a system.
  • In some implementations, mapping includes generating a transliteration score 225 for each transliteration from a possible transliterated term in the target language to a term 220 in the source language. For example, FIG. 2B illustrates the transliteration score 225 for each transliteration, including the score from “sreeram” to H2 (e.g., score_{E1 to H2}), the score from “shriram” to H2 (e.g., score_{E3 to H2}), and the score from “shreeram” to H6 (e.g., score_{E4 to H6}).
  • If transliteration scores 225 are generated by mapping, the transliteration score 225 of a given possible transliterated term of the list 210 can be a component of a confidence value of the given possible transliterated term with respect to another possible transliterated term. The system can use these confidence values in identifying the possible transliterated terms that should be considered as candidate synonyms for a particular transliterated term. The transliteration scores 225 and the confidence values are described in more detail with respect to FIG. 2C.
  • FIG. 2C illustrates identifying, for a first possible transliterated term 230, one or more second possible transliterated terms 240 as candidate synonyms of the first possible transliterated term 230.
  • If the transliterator maps a term 220 in the source language from two or more possible transliterated terms in the target language, this suggests a synonym relationship between the two or more possible transliterated terms in the target language. For example, H2 is a Hindi word in the source language that is mapped by the transliterator from three possible transliterated terms: “sreeram”, “shriram”, and “shreeram”, suggesting that the three transliterated terms are synonyms.
  • In the example technique of FIG. 2C, the system identifies the second possible transliterated terms 240 as candidate synonyms of the first possible transliterated term 230 by identifying the possible transliterated terms of the list 210 that are mapped to at least one term 220 in the source language that is also mapped from the first possible transliterated term 230. Intersections of the terms 220 in the source language give candidate groups for transliterated synonyms. Several techniques can be implemented to increase the reliability of the candidate groups for transliterated synonyms. In some implementations, each of the possible transliterated terms of the list 210 other than the first possible transliterated term 230 has a confidence value with respect to the first possible transliterated term 230. In these implementations, if a particular possible transliterated term has a confidence value with respect to the first possible transliterated term 230 that is above a specified threshold, the particular possible transliterated term is a second possible transliterated term 240 identified as a candidate synonym of the first possible transliterated term 230. If mapping does not produce a transliteration score 225 for each transliteration, the confidence value for a given second possible transliterated term 240 can be a function of the number of terms 220 in the source language that are mapped from both the first possible transliterated term 230 and the given second possible transliterated term 240.
  • For example, “shriram” and “sriraam” each map to only one term 220 (i.e., H2 and H6, respectively) that is also mapped from “sreeram”, the first possible transliterated term 230. The transliterated term “shreeram” maps to two terms 220 (i.e., H2 and H6) that are also mapped from “sreeram”, the first possible transliterated term 230. The overlap with “sreeram” of mapped terms 220 in the source language is greater for “shreeram” than for “shriram” and “sriraam”, suggesting that “shreeram” might be a more reliable candidate synonym for “sreeram” than either “shriram” or “sriraam”. This increased reliability can be reflected in a higher confidence value for “shreeram” with respect to “sreeram”.
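  • The intersection-based identification and the overlap count can be sketched as follows. The mapping for “sreeram”, “shriram”, “shreeram”, and “sriraam” follows the example above; the unrelated entry (“krishna” mapped to H9) and the function name are hypothetical additions for illustration.

```python
from collections import defaultdict

# Target-language term -> source-language terms it transliterates to.
mapping = {
    "sreeram": {"H2", "H6"},
    "shriram": {"H2"},
    "shreeram": {"H2", "H6"},
    "sriraam": {"H6"},
    "krishna": {"H9"},   # hypothetical unrelated term
}

def candidate_synonyms(first, mapping):
    """Return {second_term: number of source terms shared with `first`}."""
    # Invert the mapping: source term -> target terms mapped to it.
    inverse = defaultdict(set)
    for term, sources in mapping.items():
        for source in sources:
            inverse[source].add(term)

    overlap = defaultdict(int)
    for source in mapping.get(first, ()):
        for other in inverse[source]:
            if other != first:
                overlap[other] += 1
    return dict(overlap)

result = candidate_synonyms("sreeram", mapping)
print(sorted(result.items()))
```

Here the overlap count serves directly as the confidence value, so “shreeram” (two shared source terms) ranks above “shriram” and “sriraam” (one each), and the unrelated term is never considered.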
  • If mapping produces a transliteration score 225 for each transliteration, the confidence value for a given second possible transliterated term 240 can be a function of the transliteration scores 225 of the first possible transliterated term 230 and of the given second possible transliterated term 240. For example, the confidence value for “shriram”, a second possible transliterated term 240, with respect to “sreeram”, the first possible transliterated term 230, where both transliterated terms map to H2, can be a function of the transliteration scores 225 score_{E1 to H2} and score_{E3 to H2}.
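  • One possible form of such a function is sketched below. The specification does not say how the scores are combined; multiplying the two scores for each shared source term and summing the products is purely an assumption, as are the score values and the function name.

```python
# Hypothetical transliteration scores: (target term, source term) -> score.
scores = {
    ("sreeram", "H2"): 0.6,
    ("sreeram", "H6"): 0.3,
    ("shriram", "H2"): 0.5,
    ("shreeram", "H2"): 0.4,
    ("shreeram", "H6"): 0.5,
}

def confidence(first, second, scores):
    """Sum, over shared source terms, of the product of the two scores.

    The product-and-sum combination is an illustrative assumption.
    """
    first_sources = {s for (t, s) in scores if t == first}
    second_sources = {s for (t, s) in scores if t == second}
    shared = first_sources & second_sources
    return sum(scores[(first, s)] * scores[(second, s)] for s in shared)

print(confidence("sreeram", "shriram", scores))
print(confidence("sreeram", "shreeram", scores))
```

Under this choice, a second term that shares more source terms, or shares them with higher scores, receives a higher confidence value.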
  • In some implementations, the confidence value for a given second possible transliterated term 240 is a function of a probability of occurrence of the given second possible transliterated term 240 in web resources. For example, the probability of occurrence can be the per-page contribution in web resources or the per-domain contribution in web resources of the given second possible transliterated term 240. Generally, a higher probability of occurrence suggests that the given second possible transliterated term 240 is a more common form of the transliteration from the term in the source language. A higher probability suggests higher confidence in the common transliterated term, which can be reflected in a higher confidence value for the transliterated term.
  • In some implementations, the confidence value for a given second possible transliterated term 240 is a function of multiple components, e.g., the transliteration scores 225 and a probability of occurrence. Although FIG. 2C includes as second possible transliterated terms 240 all possible transliterated terms that map to a term 220 in the source language that is also mapped from the first possible transliterated term 230, implementation of any of the above techniques for increasing the reliability of candidate groups can reduce the group of candidate synonyms to a subgroup of the second possible transliterated terms 240 illustrated in FIG. 2C.
  • In some implementations, the system identifies one or more of the terms 220 in the source language that are mapped from the first possible transliterated term 230 and from at least one of the second possible transliterated terms 240 as candidate synonyms of the first possible transliterated term 230 in addition to or instead of the second possible transliterated terms 240. For example, for the first possible transliterated term 230, “sreeram”, the system can identify the terms H2 and H6 as candidate synonyms of “sreeram”. In some implementations, the system identifies the terms 220 in the source language that are mapped from the same transliterated term in the target language as a candidate synonym group. For the example of FIG. 2C, the system can identify the terms H2 and H6, mapped from the same transliterated terms “sreeram” and “shreeram”, as a candidate synonym group.
  • The system can use the candidate transliterated synonyms (i.e., the second possible transliterated terms 240) for query expansion. For example, when a search system (e.g., the search system 114 of FIG. 1) receives a query including the first possible transliterated term 230, the search system can identify one or more candidate transliterated synonyms of the first possible transliterated term 230. The query can be expanded with one or more of the identified candidate transliterated synonyms of the first possible transliterated term 230. In the example of FIG. 2C, the system can expand a query including “sreeram” to include one or more of “shriram”, “shreeram”, and “sriraam”. In some implementations, the system ranks the candidate synonyms by confidence value, and the system selects only N candidate synonyms with the N highest confidence values for including in expanded queries. The system provides the expanded query to a search engine (e.g., the search engine 130 of FIG. 1), and receives search results for the expanded query.
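  • A minimal sketch of top-N query expansion is given below. The confidence values, the function name, and the “(a OR b)” query syntax are assumptions; an actual search engine may represent expanded queries differently.

```python
def expand_query(query, synonyms, n):
    """Expand each query term with its N highest-confidence candidate synonyms.

    `synonyms` maps a term to {candidate synonym: confidence value}.
    """
    expanded = []
    for token in query.split():
        candidates = synonyms.get(token, {})
        # Rank candidates by confidence, highest first, and keep the top N.
        top = sorted(candidates, key=candidates.get, reverse=True)[:n]
        if top:
            expanded.append("(" + " OR ".join([token] + top) + ")")
        else:
            expanded.append(token)
    return " ".join(expanded)

# Hypothetical confidence values for the candidates of "sreeram".
synonyms = {"sreeram": {"shriram": 0.30, "shreeram": 0.39, "sriraam": 0.15}}

expanded_query = expand_query("sreeram songs", synonyms, n=2)
print(expanded_query)
```

With N = 2, the lowest-confidence candidate “sriraam” is left out of the expanded query, and terms without candidates pass through unchanged.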
  • In some implementations, if the system selects a possible transliterated term as a candidate transliterated synonym for a given transliterated term, the system also selects the given transliterated term as a candidate transliterated synonym for the possible transliterated term. In other implementations, if the system selects a possible transliterated term as a candidate transliterated synonym for a given transliterated term, the system does not select the given transliterated term as a candidate transliterated synonym for the possible transliterated term. That is, there may or may not be reverse mapping of transliterated synonyms. For example, if a first transliterated term “a” is rarely used and a second transliterated term “b” is often used, query expansion of “a” with “b” generally makes sense, because the expansion will result in more search results returned. However, automatically expanding queries of “b” with “a” may not make sense, because the expansion may return irrelevant search results.
  • In some implementations, instead of expanding a query with one or more candidate transliterated synonyms, mapping candidate transliterated synonyms to a given transliterated term occurs on the document side of a query search. For the above example, if a user submits a query including the transliterated term “b” but not the transliterated term “a” and if a web document contains “a” but not “b,” the search system (e.g., the search system 114 of FIG. 1) can treat the web document as if the web document also contains “b,” so that the web document is a candidate search result for the search including “b.” However, since the web document does not actually include “b,” the search system can reduce a score associated with the web document (e.g., an information retrieval score for ranking the web document as a candidate search result), which, consequently, can reduce the chance of the web document being returned for the search.
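  • The document-side alternative can be sketched as a scoring function that accepts a synonym match at a reduced score. The base score, the multiplicative penalty, the requirement that every query term match, and all names are assumptions for illustration only.

```python
def score_document(query_terms, doc_terms, synonyms,
                   base_score=1.0, penalty=0.5):
    """Score a document, treating a contained synonym as a weaker match.

    `synonyms` maps a query term to the set of terms that count as its
    synonyms when found in a document. Returns None if some query term
    has neither an exact nor a synonym match.
    """
    score = base_score
    for term in query_terms:
        if term in doc_terms:
            continue                      # exact match, no reduction
        if synonyms.get(term, set()) & doc_terms:
            score *= penalty              # synonym match, reduced score
        else:
            return None                   # document is not a candidate
    return score

synonyms = {"b": {"a"}}                   # a document's "a" can stand in for "b"

print(score_document(["b"], {"a", "song"}, synonyms))
print(score_document(["b"], {"b", "song"}, synonyms))
print(score_document(["b"], {"c"}, synonyms))
```

A document containing only “a” remains a candidate result for the query “b”, but it ranks below a document that actually contains “b”.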
  • In some implementations, document-level mapping of candidate synonyms includes one or more terms 220 in the source language. For the example of FIG. 2C, the search system can treat a web document containing “sreeram” as if the web document also contains the Hindi word H2 or H6. The search system can also reduce a score associated with the web document accordingly.
  • FIG. 3 is a flow chart of an example process 300 for identifying candidate synonyms for a transliterated term. For convenience, the example process 300 will be described with reference to the example technique of FIGS. 2A-2C and a system that performs the process 300.
  • The system identifies multiple transliterated terms in a target language (step 310). For example, the system identifies the possible transliterated terms of the list 210 in FIG. 2A.
  • For each transliterated term of the multiple transliterated terms in the target language, the system maps the transliterated term to one or more terms in a source language (step 320). FIG. 2B illustrates an example of mapping using an English-to-Hindi transliterator.
  • For a first transliterated term of the multiple transliterated terms in the target language, the system identifies one or more second transliterated terms of the multiple transliterated terms as candidate synonyms of the first transliterated term (step 330). Each of the one or more second transliterated terms is mapped to at least one term in the source language that is also mapped from the first transliterated term. For example, FIG. 2C illustrates second possible transliterated terms 240 (i.e., “shriram”, “shreeram”, and “sriraam”) identified as candidate synonyms of a first possible transliterated term 230 (i.e., “sreeram”). The candidate synonyms can be used for query expansion, for example, as described with respect to FIG. 4.
  • FIG. 4 is a flow chart of an example process 400 for providing search results for an expanded query that includes a transliterated term and a candidate synonym. For convenience, the example process 400 will be described with reference to the example technique of FIGS. 2A-2C and a system that performs the process 400.
  • The system receives a query including a first transliterated term (step 410). For example, the query can include the transliterated term “sreeram” illustrated in FIG. 2C.
  • The system provides one or more expanded queries for selection by a user, where each expanded query includes the query and one or more candidate synonyms of the first transliterated term (step 420). The candidate synonyms can be identified, for example, using the example process 300 of FIG. 3. For a query including the transliterated term “sreeram”, the system can provide expanded queries that also include one or more of “shriram”, “shreeram”, and “sriraam”, as illustrated in FIG. 2C.
  • The system receives a selection of an expanded query from the user (step 430). For example, the expanded queries can be presented to the user as selectable hyperlinks on an interface of a web browser running on a client device (e.g., the client device 104 of FIG. 1). The system can receive the selection of an expanded query as a selection by the user of the hyperlink for the selected expanded query. In some implementations, the system generates an expanded query with one or more of the candidate synonyms and proceeds to step 440 without performing steps 420 and 430. The system provides the expanded query to a search engine (step 440). For example, the system can submit the expanded query to the search engine 130 of FIG. 1. The search engine performs the search, sending search results for the expanded query to the system. The system receives the search results for the expanded query (step 450).
  • In some implementations, the system provides the received query of step 410 to the search engine without expanding the query. Instead, the system performs document-level mapping as described above with respect to FIG. 2C. For example, the search engine can identify as a possible search result for the query a web resource that includes at least one of the candidate synonyms of the first transliterated term but does not include any term (e.g., the first transliterated term) in the query. Alternatively, the search engine can identify as a possible search result for the query a web resource that does not include any term (e.g., the first transliterated term) in the query but that does include at least one of the terms in a source language that is mapped from the first transliterated term and from at least one of the candidate synonyms. When document-level mapping is implemented, the system can modify (e.g., reduce) a score for use in ranking that is associated with the web resource identified as a possible search result.
  • FIG. 5 is a flow chart of an example process 500 for identifying candidate synonyms for a transliterated term. For convenience, the example process 500 will be described with reference to a system that performs the process 500. In general, the process 500 directly learns possible variations in spelling for transliterated terms in a target language. Since transliterated synonyms are generally phonetically similar, the variations between the transliterated synonyms are language specific.
  • The system generates a training group of possible transliterated synonyms in a target language (step 510). The system trains a probabilistic model using the training group to learn probabilities of spelling variations in transliterated synonyms in the target language (step 520). The system applies the probabilistic model to a particular transliterated term in the target language to identify one or more candidate synonyms of the particular transliterated term (step 530). The system can use the candidate synonyms for query expansion as described above.
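  • The three steps of process 500 can be illustrated with a deliberately simple stand-in for the probabilistic model. The specification does not describe the model; here, character-level substitutions are counted from aligned training pairs (using Python's `difflib.SequenceMatcher`) and their relative frequencies are treated as probabilities. The training pairs, the function names, and the probability threshold are all hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher

def learn_substitutions(pairs):
    """Estimate P(substitution) from pairs of transliterated synonyms."""
    counts = Counter()
    for a, b in pairs:
        opcodes = SequenceMatcher(None, a, b).get_opcodes()
        for tag, i1, i2, j1, j2 in opcodes:
            if tag == "replace":
                counts[(a[i1:i2], b[j1:j2])] += 1
    total = sum(counts.values())
    return {rule: c / total for rule, c in counts.items()} if total else {}

def candidates(term, probs, min_prob):
    """Apply each sufficiently probable substitution at each position."""
    out = set()
    for (src, dst), p in probs.items():
        if p < min_prob:
            continue
        start = term.find(src)
        while start != -1:
            out.add(term[:start] + dst + term[start + len(src):])
            start = term.find(src, start + 1)
    out.discard(term)
    return out

# Step 510: a hypothetical training group of transliterated synonym pairs.
training = [("chakrabarti", "chakrabarty"),
            ("shastri", "shastry"),
            ("tiwari", "tiwary")]
probs = learn_substitutions(training)          # step 520
variants = candidates("murti", probs, 0.5)     # step 530
print(probs, variants)
```

A realistic model would condition substitutions on context and would need far more training data; this sketch only shows how learned spelling variations can be applied to a new term.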
  • Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a tangible program carrier for execution by, or to control the operation of, data processing apparatus. The tangible program carrier can be a propagated signal or a computer-readable medium. The propagated signal is an artificially generated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a computer. The computer-readable medium can be a machine-readable storage device, a machine-readable storage substrate, a memory device, a composition of matter effecting a machine-readable propagated signal, or a combination of one or more of them.
  • The term “data processing apparatus” encompasses all apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.
  • A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.
  • The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
  • Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, a device with spoken language input, to name just a few. A smart phone is an example of a device with spoken language input, which can accept voice input (e.g., a user query spoken into a microphone on the device).
  • Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
  • To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
  • Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.
  • The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
  • While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
  • Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
  • Particular embodiments of the subject matter described in this specification have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.

Claims (22)

What is claimed is:
1. A computer-implemented method comprising:
identifying, using one or more computers, a plurality of transliterated terms in a target language, each of the transliterated terms representing a conversion in the target language of one or more respective source terms in a source language;
for each transliterated term of the plurality of transliterated terms in the target language, mapping the transliterated term to one or more of the source terms in the source language;
determining that a first transliterated term of the transliterated terms is mapped to one or more terms of the source terms in the source language and that one or more second transliterated terms of the transliterated terms are also mapped to the one or more terms; and
identifying, based on the one or more terms being mapped to the first transliterated term and being mapped to the one or more second transliterated terms, the one or more second transliterated terms as candidate synonyms of the first transliterated term.
2. The method of claim 1, where mapping the transliterated term to one or more source terms in the source language further comprises:
transliterating the transliterated term in the target language to the one or more source terms in the source language.
3. The method of claim 2, where each of the one or more second transliterated terms identified as candidate synonyms of the first transliterated term has a confidence value with respect to the first transliterated term that satisfies a specified threshold.
4. The method of claim 3, where the confidence value of a second transliterated term of the one or more second transliterated terms is a function of a number of the one or more terms in the source language that are mapped from both the first transliterated term and the second transliterated term.
5. The method of claim 3, where transliterating the transliterated term in the target language to the one or more source terms in the source language further comprises:
generating a transliteration score for each transliteration of the transliterated term in the target language to a respective source term of the one or more source terms.
6. The method of claim 5, where the confidence value of a second transliterated term of the one or more second transliterated terms is a function of one or more of a probability of occurrence of the second transliterated term in web resources, the transliteration scores for the transliteration of the second transliterated term to the one or more same terms that are also mapped from the first transliterated term, and the transliteration scores for the transliteration of the first transliterated term to the one or more same terms.
7. The method of claim 1, where identifying the plurality of transliterated terms in the target language further comprises:
identifying terms containing only characters of the target language.
8. The method of claim 7, further comprising:
computing a statistic for each identified term containing only characters of the target language;
comparing the statistic for each said identified term to a specified threshold; and
including a particular said identified term in the plurality of transliterated terms in the target language if the statistic for the particular identified term satisfies the specified threshold.
9. The method of claim 1, further comprising:
for the first transliterated term of the plurality of transliterated terms in the target language, identifying the one or more same terms in the source language that are mapped from the first transliterated term and from the one or more second transliterated terms as a candidate synonym of the first transliterated term.
10. The method of claim 1, further comprising:
receiving a query including the first transliterated term;
expanding the query with one or more of the candidate synonyms of the first transliterated term;
providing the expanded query to a search engine; and
receiving search results for the expanded query.
11. The method of claim 1, further comprising:
receiving a query including the first transliterated term; and
providing one or more expanded queries for selection by a user, each expanded query including the query and one or more of the candidate synonyms of the first transliterated term.
12. The method of claim 1, further comprising:
receiving a query including the first transliterated term;
providing the query to a search engine, where the search engine identifies as a possible search result for the query a web resource that includes at least one of the candidate synonyms of the first transliterated term but does not include any term in the query; and
modifying a score associated with the web resource, the score for use in ranking possible search results for the query.
13. The method of claim 1, further comprising:
receiving a query including the first transliterated term;
providing the query to a search engine, where the search engine identifies as a possible search result for the query a web resource that includes at least one of the terms in the source language that is mapped from the first transliterated term and from the one or more second transliterated terms but does not include any term in the query; and
modifying an information retrieval score associated with the web resource, the information retrieval score for use in ranking possible search results for the query.
14. A system comprising:
one or more computers configured to perform operations including:
identifying, using one or more computers, a plurality of transliterated terms in a target language, each of the transliterated terms representing a conversion in the target language of one or more respective source terms in a source language;
for each transliterated term of the plurality of transliterated terms in the target language, mapping the transliterated term to one or more of the source terms in the source language;
determining that a first transliterated term of the transliterated terms is mapped to one or more terms of the source terms in the source language and that one or more second transliterated terms of the transliterated terms are also mapped to the one or more terms; and
identifying, based on the one or more terms being mapped to the first transliterated term and being mapped to the one or more second transliterated terms, the one or more second transliterated terms as candidate synonyms of the first transliterated term.
15. The system of claim 14, where mapping the transliterated term to one or more source terms in the source language further comprises:
transliterating the transliterated term in the target language to the one or more source terms in the source language.
16. The system of claim 15, where each of the one or more second transliterated terms identified as candidate synonyms of the first transliterated term has a confidence value with respect to the first transliterated term that satisfies a specified threshold.
17. The system of claim 16, where the confidence value of a second transliterated term of the one or more second transliterated terms is a function of a number of the one or more terms in the source language that are mapped from both the first transliterated term and the second transliterated term.
18. The system of claim 16, where transliterating the transliterated term in the target language to the one or more source terms in the source language further comprises:
generating a transliteration score for each transliteration of the transliterated term in the target language to a respective source term of the one or more source terms.
19. The system of claim 18, where the confidence value of a second transliterated term of the one or more second transliterated terms is a function of one or more of a probability of occurrence of the second transliterated term in web resources, the transliteration scores for the transliteration of the second transliterated term to the one or more same terms that are also mapped from the first transliterated term, and the transliteration scores for the transliteration of the first transliterated term to the one or more same terms.
20. The system of claim 14, where identifying the plurality of transliterated terms in the target language further comprises:
identifying terms containing only characters of the target language.
21. The system of claim 20, where the one or more computers are further configured to perform operations including:
computing a statistic for each identified term containing only characters of the target language;
comparing the statistic for each said identified term to a specified threshold; and
including a particular said identified term in the plurality of transliterated terms in the target language if the statistic for the particular identified term satisfies the specified threshold.
22. The system of claim 14, where the one or more computers are further configured to perform operations including:
for the first transliterated term of the plurality of transliterated terms in the target language, identifying the one or more same terms in the source language that are mapped from the first transliterated term and from the one or more second transliterated terms as a candidate synonym of the first transliterated term.
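For illustration only, and not as part of the claims, the candidate-synonym identification of claims 1 to 4 and the query expansion of claim 10 can be sketched in Python. The back-transliteration table, the function names, the use of the raw count of shared source terms as the confidence value, the threshold values, and the `OR` query syntax are all assumptions made for this sketch; the claims do not prescribe any of them.

```python
from collections import defaultdict

# Hypothetical, hand-written back-transliteration table (illustrative data):
# each transliterated term in the target language maps to the source-language
# terms it may have been converted from. A real system would obtain these
# mappings from a transliteration model rather than a fixed dictionary.
TRANSLITERATIONS = {
    "dilli": {"दिल्ली", "दिली"},
    "dilhi": {"दिल्ही", "दिल्ली"},
    "delhi": {"देल्ही", "दिल्ली", "दिली"},
    "mumbai": {"मुंबई"},
}


def candidate_synonyms(mapping, threshold=1):
    """Return {first_term: {second_term: confidence}} for terms sharing sources.

    Assumption: confidence is simply the number of source terms that both
    transliterated terms are mapped to (one possible reading of claim 4).
    """
    # Invert the mapping: source term -> transliterated terms mapped to it.
    by_source = defaultdict(set)
    for term, sources in mapping.items():
        for source in sources:
            by_source[source].add(term)

    # Count, for every ordered pair of distinct terms, the shared source terms.
    shared = defaultdict(lambda: defaultdict(int))
    for terms in by_source.values():
        for first in terms:
            for second in terms:
                if first != second:
                    shared[first][second] += 1

    # Keep only candidates whose confidence satisfies the threshold (claim 3).
    return {
        first: {second: n for second, n in others.items() if n >= threshold}
        for first, others in shared.items()
    }


def expand_query(query, synonyms):
    """Expand each query term with its candidate synonyms (claim 10 sketch)."""
    parts = []
    for term in query.split():
        alternatives = [term] + sorted(synonyms.get(term, {}))
        if len(alternatives) > 1:
            parts.append("(" + " OR ".join(alternatives) + ")")
        else:
            parts.append(term)
    return " ".join(parts)
```

With this sample table, `expand_query("dilli hotels", candidate_synonyms(TRANSLITERATIONS))` yields a query in which `dilli` is grouped with `delhi` and `dilhi`, while `hotels`, which has no entry in the table, is passed through unchanged. Raising the threshold to 2 keeps only `delhi` as a candidate for `dilli`, since those two terms share two source terms.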
US14/010,204 2008-07-18 2013-08-26 Transliteration For Query Expansion Abandoned US20130338996A1 (en)

Priority Applications (1)

Application Number Publication Number Priority Date Filing Date Title
US14/010,204 US20130338996A1 (en) 2008-07-18 2013-08-26 Transliteration For Query Expansion

Applications Claiming Priority (3)

Application Number Publication Number Priority Date Filing Date Title
US8216508P 2008-07-18 2008-07-18
US12/503,806 US8521761B2 (en) 2008-07-18 2009-07-15 Transliteration for query expansion
US14/010,204 US20130338996A1 (en) 2008-07-18 2013-08-26 Transliteration For Query Expansion

Related Parent Applications (1)

Application Number Relation Publication Number Priority Date Filing Date Title
US12/503,806 Continuation US8521761B2 (en) 2008-07-18 2009-07-15 Transliteration for query expansion

Publications (1)

Publication Number Publication Date
US20130338996A1 (en) 2013-12-19

Family

ID=41531175

Family Applications (2)

Application Number Status Publication Number Priority Date Filing Date Title
US12/503,806 Active 2032-04-26 US8521761B2 (en) 2008-07-18 2009-07-15 Transliteration for query expansion
US14/010,204 Abandoned US20130338996A1 (en) 2008-07-18 2013-08-26 Transliteration For Query Expansion

Country Status (3)

Country Link
US (2) US8521761B2 (en)
KR (1) KR20100009520A (en)
CN (2) CN104111972B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060112091A1 (en) * 2004-11-24 2006-05-25 Harbinger Associates, Llc Method and system for obtaining collection of variants of search query subjects
US20070011154A1 (en) * 2005-04-11 2007-01-11 Textdigger, Inc. System and method for searching for a query
US20070288448A1 (en) * 2006-04-19 2007-12-13 Datta Ruchira S Augmenting queries with synonyms from synonyms map

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0877173A (en) 1994-09-01 1996-03-22 Fujitsu Ltd System and method for correcting character string
US5787452A (en) 1996-05-21 1998-07-28 Sybase, Inc. Client/server database system with methods for multi-threaded data processing in a heterogeneous language environment
US7610189B2 (en) 2001-10-18 2009-10-27 Nuance Communications, Inc. Method and apparatus for efficient segmentation of compound words using probabilistic breakpoint traversal
US7031911B2 (en) 2002-06-28 2006-04-18 Microsoft Corporation System and method for automatic detection of collocation mistakes in documents
CN100437573C (en) * 2003-09-17 2008-11-26 国际商业机器公司 Identifying related names
US20050216253A1 (en) 2004-03-25 2005-09-29 Microsoft Corporation System and method for reverse transliteration using statistical alignment

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140012563A1 (en) * 2012-07-06 2014-01-09 International Business Machines Corporation Providing multi-lingual searching of mono-lingual content
US8918308B2 (en) * 2012-07-06 2014-12-23 International Business Machines Corporation Providing multi-lingual searching of mono-lingual content
US9418158B2 (en) 2012-07-06 2016-08-16 International Business Machines Corporation Providing multi-lingual searching of mono-lingual content
US9792367B2 (en) 2012-07-06 2017-10-17 International Business Machines Corporation Providing multi-lingual searching of mono-lingual content
US10140371B2 (en) 2012-07-06 2018-11-27 International Business Machines Corporation Providing multi-lingual searching of mono-lingual content
KR20200100360A (en) * 2019-02-18 2020-08-26 네이버 주식회사 Method and system for extracting foreign synonym using transliteration model
JP2020135877A (en) * 2019-02-18 2020-08-31 ネイバー コーポレーションNAVER Corporation Method and system for automatic extraction of synonymous loan word using transliteration model
KR102192376B1 (en) * 2019-02-18 2020-12-17 네이버 주식회사 Method and system for extracting foreign synonym using transliteration model
JP7014830B2 (en) 2019-02-18 2022-02-01 ネイバー コーポレーション Methods and systems for automatically extracting foreign synonyms using a transliteration model
US11263208B2 (en) 2019-03-05 2022-03-01 International Business Machines Corporation Context-sensitive cross-lingual searches

Also Published As

Publication number Publication date
CN104111972A (en) 2014-10-22
CN101630333B (en) 2014-07-16
US20100017382A1 (en) 2010-01-21
KR20100009520A (en) 2010-01-27
US8521761B2 (en) 2013-08-27
CN101630333A (en) 2010-01-20
CN104111972B (en) 2018-01-09

Legal Events

Date Code Title Description
AS Assignment

Owner name: GOOGLE INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:KATRAGADDA, LALITESH KUMAR;GUPTA, VINEET;PRAHLADKA, PIYUSH;SIGNING DATES FROM 20130909 TO 20130923;REEL/FRAME:031258/0664

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION