CA2640035A1 - Formulating data search queries

Info

Publication number: CA2640035A1
Application number: CA002640035A
Authority: CA
Inventors: Eric M. Robinson; Edward L. Walter
Original assignee: Individual
Current assignee: FTI Technology LLC
Priority date: 2006-01-27
Filing date: 2007-01-26
Publication date: 2007-08-09
Anticipated expiration: 2027-01-26
Also published as: CA2640035C; US20070179940A1; WO2007089672A1; EP1977350A1

Abstract

A system (10) and method (80) for formulating data search queries (142) is presented. A user interface (50) operable to specify an unstructured search criteria for a search query (142) on one or more documents (40) is provided. An input portal (23) is exported to receive a data excerpt (51) selected to be searched against the documents (40). A selectable inclusiveness control (52) is exported to specify a granularity of inclusion (141) of matching tokens (142) within each document (40). A selectable proximity control (53) is exported to specify a degree of nearness (140) of the tokens (142) within each document (40). Tokens (142) derived from the data excerpt (51) and parameters corresponding to the granularity of inclusion (141) and the degree of nearness (140) are compiled into the search query (142).

Claims

1. A system (10) for formulating data search queries (142), comprising:
a user interface (50) operable to specify an unstructured search criteria for a search query (142) on one or more documents (40), comprising:
an input portal (23) to receive a data excerpt (51) selected to be searched against the documents (40);
a selectable inclusiveness control (52) to specify a granularity of inclusion (141) of matching tokens (142) within each document (40);
a selectable proximity control (53) to specify a degree of nearness (140) of the tokens (142) within each document (40); and
a document searcher (35) to compile tokens (142) derived from the data excerpt (51) and parameters corresponding to the granularity of inclusion (141) and the degree of nearness (140) into the search query (142).
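
As a rough illustration of the compiling step in Claim 1, the following Python sketch bundles tokens taken from a data excerpt together with the two control values. The `SearchQuery` structure, the `compile_query` function, and the lowercase word-splitting tokenizer are assumptions made for this example; the claim does not prescribe any particular representation.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class SearchQuery:
    """Hypothetical container for a compiled search query."""
    tokens: List[str]
    inclusiveness: float  # value taken from the inclusiveness control
    nearness: float       # value taken from the proximity control


def compile_query(excerpt: str, inclusiveness: float, nearness: float) -> SearchQuery:
    """Derive tokens from the excerpt and combine them with both parameters."""
    # Assumed tokenizer: lowercase the excerpt and keep runs of word characters.
    tokens = re.findall(r"\w+", excerpt.lower())
    return SearchQuery(tokens=tokens, inclusiveness=inclusiveness, nearness=nearness)
```

A search engine would then read the token list and the two parameters from the compiled object when evaluating each document.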

2. A system (10) according to Claim 1, further comprising:
a storage (136) to maintain the target corpus (137) comprising the documents (40) indexed to facilitate searching; and
a search engine (135) to execute the search query (142) against the documents (40) maintained in the target corpus (137), wherein search results (56) identified by the search query (142) execution are presented (90).

3. A system (10) according to Claim 1, further comprising:
a parser to extract the tokens (142) from the data excerpt (51).

4. A system (10) according to Claim 1, wherein the granularity of inclusiveness (141) varies on a continuum between a Boolean OR operation of all tokens (142) and a Boolean AND operation of all tokens (142).

5. A system (10) according to Claim 1, wherein a number of tokens h (142) that must be matched by one or more words (41-46) in each target document (40) is determined in accordance with the equation:

h = int(N * p + 1)

where N is a total number of the tokens (142) and 0.0 <= p < 1.0 is a value representing the granularity of inclusiveness (141) specified through the selectable inclusiveness control (52).
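
As an illustration of the equation in Claim 5, the following minimal Python sketch computes h and applies it to a document. The function names `required_matches` and `document_matches`, the use of distinct tokens for N, and the treatment of a document as a list of words are assumptions made for this example and do not appear in the claims.

```python
from typing import Sequence


def required_matches(n_tokens: int, p: float) -> int:
    """Return h = int(N * p + 1), the number of tokens a document must match."""
    if n_tokens < 1:
        raise ValueError("n_tokens must be at least 1")
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0.0 <= p < 1.0")
    return int(n_tokens * p + 1)


def document_matches(tokens: Sequence[str], words: Sequence[str], p: float) -> bool:
    """Assumed matching rule: at least h distinct tokens occur among the words."""
    distinct = set(tokens)
    h = required_matches(len(distinct), p)
    return len(distinct & set(words)) >= h
```

With p = 0.0 the result is h = 1, so any single matching token suffices, which behaves like a Boolean OR. As p approaches 1.0, h reaches N, which behaves like a Boolean AND.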

6. A system (10) according to Claim 1, wherein the degree of nearness (140) varies on a continuum between a span equal to a number of the tokens (142) and a number of terms (41-46) in each document (40).

7. A system (10) according to Claim 1, wherein a span s to be applied and a number of tokens (142) to combine c during searching of each document (40) are determined in accordance with the equations:

s = p
c = MaxInt(2, N * p^2)

where N is a number of the tokens (142) and 0.0 < p <= 1.0 is a value representing the degree of nearness (140) specified through the selectable proximity control (53).
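
To make the role of c in Claim 7 concrete, the sketch below computes c and tests whether some window of consecutive terms in a document contains at least c distinct tokens. Several things here are assumptions: MaxInt is read as the larger of two integers after truncating the product, the span is supplied directly as an argument rather than derived from p, and the sliding-window test and the names `combine_count` and `has_near_match` are hypothetical.

```python
from typing import Sequence


def combine_count(n_tokens: int, p: float) -> int:
    """Return c = MaxInt(2, N * p^2), reading MaxInt as max over integers."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must satisfy 0.0 < p <= 1.0")
    return max(2, int(n_tokens * p * p))


def has_near_match(tokens: Sequence[str], terms: Sequence[str], span: int, c: int) -> bool:
    """True if any window of `span` consecutive terms holds >= c distinct tokens."""
    if span < 1:
        raise ValueError("span must be at least 1")
    wanted = set(tokens)
    # Slide a fixed-size window over the document terms; a document shorter
    # than the span is checked as a single window.
    for start in range(max(1, len(terms) - span + 1)):
        window = set(terms[start:start + span])
        if len(wanted & window) >= c:
            return True
    return False
```

A narrow span demands that the tokens sit close together, while a span as long as the document only requires that they co-occur somewhere in it.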

8. A system (10) according to Claim 1, further comprising:
a document analyzer to assign weights to terms (41-46) based on structural location within each document (40), wherein the search query terms (142) are modified to favor the terms (41-46) having higher weights over the terms (41-46) having lower weights.

9. A system (10) according to Claim 8, wherein the higher weights are assigned to the terms (41-46) occurring in a structural location selected from the group comprising titles, headings, tables of content, and indexes.
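
The weighting in Claims 8 and 9 can be pictured with a small Python sketch. The numeric weights, the `LOCATION_WEIGHTS` table, the "body" location, and the helpers `term_weights` and `rank_query_terms` are all hypothetical; the claims state only that terms in titles, headings, tables of content, and indexes receive higher weights.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical weights: the claims give no numeric values.
LOCATION_WEIGHTS: Dict[str, float] = {
    "title": 4.0,
    "heading": 3.0,
    "table_of_contents": 2.0,
    "index": 2.0,
    "body": 1.0,
}


def term_weights(located_terms: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Map each term to the highest weight of any location it occurs in."""
    weights: Dict[str, float] = {}
    for term, location in located_terms:
        w = LOCATION_WEIGHTS.get(location, LOCATION_WEIGHTS["body"])
        weights[term] = max(w, weights.get(term, 0.0))
    return weights


def rank_query_terms(tokens: Iterable[str], weights: Dict[str, float]) -> List[str]:
    """Order query tokens so that higher-weighted terms come first."""
    return sorted(tokens, key=lambda t: weights.get(t, 0.0), reverse=True)
```

Ordering is only one way to "favor" a term; a search engine could equally apply the weight as a score multiplier.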

10. A system (10) according to Claim 1, further comprising:
a query processor to broaden the tokens (142), comprising:
a word analyzer to derive a normalized root stem for each token (142) and to identify one or more synonyms for the normalized root stem, wherein the synonyms are conjunctively included with the token (142) in the search query (142).
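
The broadening step of Claim 10 might look like the following Python sketch. The suffix-stripping stemmer, the `SYNONYMS` table, and the function names are toy stand-ins chosen for this example, not components named in the claims; how each group is combined into the final query is left out.

```python
from typing import Dict, List

# Hypothetical synonym table keyed by root stem.
SYNONYMS: Dict[str, List[str]] = {
    "search": ["find", "lookup"],
    "document": ["file", "record"],
}


def root_stem(token: str) -> str:
    """Toy stemmer: lowercase and strip a few common English suffixes."""
    word = token.lower()
    for suffix in ("ing", "es", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def broaden(tokens: List[str]) -> List[List[str]]:
    """Return one group per token: the token followed by its stem's synonyms."""
    groups = []
    for token in tokens:
        stem = root_stem(token)
        groups.append([token] + SYNONYMS.get(stem, []))
    return groups
```

A production system would substitute a real stemmer and thesaurus for the toy table used here.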

11. A system (10) according to Claim 1, further comprising:
a selection control operable to specify at least one of one or more required terms (41-46) and one or more optional terms (41-46) in the data excerpt (51), wherein the search query terms (142) are modified to always include the required terms (41-46) and to permissively include the optional terms (41-46).
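
One possible reading of the required and optional terms in Claim 11 is sketched below in Python. The function name `satisfies_selection` and the rule that optional terms only contribute a hit count are assumptions for illustration.

```python
from typing import Iterable, Set, Tuple


def satisfies_selection(
    words: Iterable[str], required: Set[str], optional: Set[str]
) -> Tuple[bool, int]:
    """Return (accepted, optional_hits) for one document.

    accepted is True only if every required term occurs in the document;
    optional_hits counts the optional terms that also occur.
    """
    present = set(words)
    accepted = required <= present
    optional_hits = len(optional & present)
    return accepted, optional_hits
```

Under this reading a document missing any required term is rejected outright, while optional terms can still be used to rank the accepted documents.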

12. A system (10) according to Claim 1, further comprising:
an ordering control operable to specify precedence of the tokens (142), wherein the search query terms (142) are modified to favor the terms (41-46) having higher precedence.

13. A system (10) according to Claim 1, further comprising:
a search scope control operable to specify documents (40) to be searched, wherein the search query (142) is modified to search the specified documents (40).

14. A system (10) according to Claim 1, wherein the selectable inclusiveness control (52) and the selectable proximity control (53) are provided as one of single selectable controls or combined controls selected from the group comprising rotary or gimbal knobs, slider bars, radio buttons, and user input mechanisms that allow continuous or discrete selection over a fixed range of rotation, movement, or selection.

15. A system (10) according to Claim 1, wherein the data excerpt (51) comprises at least one of textual data, binary data, and an encapsulated search query (142).

16. A method (80) for formulating data search queries (142), comprising:
providing (82) a user interface (50) operable to specify an unstructured search criteria for a search query (142) on one or more documents (40), comprising:
exporting an input portal (23) to receive a data excerpt (51) selected to be searched against the documents (40);
exporting a selectable inclusiveness control (52) to specify a granularity of inclusion (141) of matching tokens (142) within each document (40);
exporting a selectable proximity control (53) to specify a degree of nearness (140) of the tokens (142) within each document (40); and
compiling tokens (142) derived from the data excerpt (51) and parameters corresponding to the granularity of inclusion (141) and the degree of nearness (140) into the search query (142).

17. A method (80) according to Claim 16, further comprising:
maintaining the target corpus (137) comprising the documents (40) indexed to facilitate searching;
executing the search query (142) against the documents (40) maintained in the target corpus (137); and
presenting (90) search results (56) identified by the search query (142) execution.

18. A method (80) according to Claim 16, further comprising:
extracting the tokens (142) from the data excerpt (51).

19. A method (80) according to Claim 16, further comprising:
varying the granularity of inclusiveness (141) on a continuum between a Boolean OR operation of all tokens (142) and a Boolean AND operation of all tokens (142).

20. A method (80) according to Claim 16, further comprising:
determining a number of tokens h (142) that must be matched by one or more words (41-46) in each target document (40) in accordance with the equation:

h = int(N * p + 1)

where N is a total number of the tokens (142) and 0.0 <= p < 1.0 is a value representing the granularity of inclusiveness (141) specified through the selectable inclusiveness control (52).

21. A method (80) according to Claim 16, further comprising:
varying the degree of nearness (140) on a continuum between a span equal to a number of the tokens (142) and a number of terms (41-46) in each document (40).

22. A method (80) according to Claim 16, further comprising:
determining a span s to be applied and a number of tokens (142) to combine c during searching of each document (40) in accordance with the equations:

s =
c = MaxInt(2, N * p^2)

where N is a number of the tokens (142) and 0.0 < p <= 1.0 is a value representing the degree of nearness (140) specified through the selectable proximity control (53).

23. A method (80) according to Claim 16, further comprising:
assigning weights to terms (41-46) based on structural location within each document (40); and
modifying the search query terms (142) to favor the terms (41-46) having higher weights over the terms (41-46) having lower weights.

24. A method (80) according to Claim 23, wherein the higher weights are assigned to the terms (41-46) occurring in a structural location selected from the group comprising titles, headings, tables of content, and indexes.

25. A method (80) according to Claim 16, further comprising:
broadening the tokens (142), comprising:
deriving a normalized root stem for each token (142);
identifying one or more synonyms for the normalized root stem; and
conjunctively including the synonyms with the token (142) in the search query (142).

26. A method (80) according to Claim 16, further comprising:
exporting a selection control operable to specify at least one of one or more required terms (41-46) and one or more optional terms (41-46) in the data excerpt (51); and
modifying the search query terms (142) to always include the required terms (41-46) and to permissively include the optional terms (41-46).

27. A method (80) according to Claim 16, further comprising:
exporting an ordering control operable to specify precedence of the tokens (142); and
modifying the search query terms (142) to favor the terms (41-46) having higher precedence.

28. A method (80) according to Claim 16, further comprising:
exporting a search scope control operable to specify documents (40) to be searched; and
limiting the search query (142) to search the specified documents (40).

29. A method (80) according to Claim 16, further comprising:
providing the selectable inclusiveness control (52) and the selectable proximity control (53) as one of single selectable controls or combined controls selected from the group comprising rotary or gimbal knobs, slider bars, radio buttons, and user input mechanisms that allow continuous or discrete selection over a fixed range of rotation, movement, or selection.

30. A method (80) according to Claim 16, wherein the data excerpt (51) comprises at least one of textual data, binary data, and an encapsulated search query (142).

31. A computer-readable storage medium holding code for performing the method (80) according to Claim 16.