Summary of the invention
The technical problem the present invention addresses is that information transmission on existing digital TV platforms is unidirectional. Information can only travel from the head-end search server to the user's set-top box; user information and intent cannot be passed back to the server, and the range of information the set-top box can process is very small. To overcome these deficiencies of the prior art, a search function needs to be added to digital television. The object of the present invention is to provide a system that performs full-text search in a one-way cable TV set-top box.
This object is achieved by the following technical scheme: a system that performs full-text search in a one-way cable TV set-top box, comprising two main parts. The first part is the retrieval server, located at the head-end search server, which comprises a data entry module and an index management module. The second part, which is the characterizing feature, is the search client application, which comprises a query function module and which must receive two broad classes of data during retrieval: index files and data files.
The data entry module is mainly responsible for defining and entering data tables and for adding, deleting, and modifying the actual data. The index management module is responsible for defining indexes, generating index files, and deleting indexes. An index is built on a previously defined data table field; an index can be built on only one field, and each index corresponds to one index file. The system builds its indexes with a nested indexing method, that is, an index on the index, which greatly speeds up the search process. The query function module of the search client application is responsible for processing user input, processing the query, and displaying the results. In a single retrieval, the search client application needs to receive both index files and data files.
The data entry module mentioned in this scheme is the basis for building indexes and for the query program. It is described below in two parts.
A. Data table format definition
The information entered by the end user is written, in the form of a two-dimensional table (binary), to a file on the head-end search server. In programming terms, five field types can be defined. Besides the basic database field attributes, a data table has one further attribute, "is keyword". In any one data table, exactly one column has "is keyword" set to true, and this attribute determines whether an index can be built. The five field types are as follows:
Text (e.g., text)
Number (e.g., number)
Date (e.g., date)
Timestamp (e.g., timestamp)
File
The date type carries no hours, minutes, or seconds; the timestamp type does.
The file type is in essence a number type: the content stored in such a field is 0 or 1. The purpose of this field type is to let the program, after reading it, assemble the path of the multimedia file from the field content. In this way the data file (such as a picture or a short video) can be located accurately when data packets are later captured. For example, if the file-type field PIC1 in table tbl has the value 1, the picture to look for in a later data packet is located as: tbl_PIC1 ID.GIF
Once the data table format is defined, data entry can be implemented in several ways, such as manual entry and import. The data is ultimately written to the file system of the head-end search server in binary format.
B. Data table description information definition
So that the set-top box query program knows which field of the original data each server-side index file was built on, a description file for this table information was specially designed. The table information is described below in XML format; other description formats may be considered in an implementation.
This information defines the data table format, the type of each field, the name of the index file built on each field, the name of the indexed column, and so on. It can serve as a header attached to the beginning of the index file. On the frequency used to broadcast the data, the table description information is obtained when the index information is read. This packaging approach needs further discussion; it is offered here only as an idea for reference.
Regarding the index management module mentioned in this scheme: an "index" is like the keyword index table often appended to the back of a thick book (for example, Beijing: pages 12, 34; Shanghai: pages 3, 77; ...), which helps the reader find the pages with the relevant content faster. A database index speeds up queries on the same principle: consider how many times faster it is to look something up through the index at the back of a book than to leaf through the content page by page. The other reason an index is efficient is that it is sorted. The key to building an efficient retrieval system is therefore an inverted index mechanism similar to the index of a scientific or technical book. While the data source (for example, many articles) is stored in sequence, a separate sorted keyword list stores the keyword ==> article mapping, in the form [keyword ==> number of each article in which the keyword appears, occurrence count (possibly including positions: start offset, end offset), frequency of occurrence]. Retrieval is then the process of turning a fuzzy query into a logical combination of several exact queries that can use the index, which greatly improves the efficiency of multi-keyword queries.
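The keyword ==> article mapping described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `build_inverted_index` and `search_all`, the in-memory dictionary layout, and the use of whitespace-delimited English words as keywords are all assumptions made for illustration.

```python
import re
from collections import defaultdict

def build_inverted_index(articles):
    """Map keyword -> {article number: [(start offset, end offset), ...]}."""
    index = defaultdict(dict)
    for article_id, text in articles.items():
        # Assumption: keywords are plain words; real segmentation is more involved.
        for match in re.finditer(r"\w+", text.lower()):
            positions = index[match.group()].setdefault(article_id, [])
            positions.append((match.start(), match.end()))
    # Keeping the keyword list sorted mirrors the sorted index described above.
    return dict(sorted(index.items()))

def search_all(index, keywords):
    """Combine exact lookups: article numbers that contain every keyword."""
    postings = [set(index.get(word.lower(), {})) for word in keywords]
    return sorted(set.intersection(*postings)) if postings else []

articles = {1: "Beijing and Shanghai", 2: "Shanghai news", 3: "Beijing news"}
index = build_inverted_index(articles)
hits = search_all(index, ["Beijing", "news"])
```

The occurrence count of a keyword in an article is the length of its position list, so it does not need to be stored separately in this sketch.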
Index management mainly comprises defining, building, and deleting indexes. Defining an index specifies which field of which table the index is built on. Building an index means scanning the content of the data table field, counting the keywords together with their frequency and number of occurrences, and writing the statistics to a file. Deleting an index cancels the index definition and deletes the corresponding index file.
For all defined indexes, the database contains a special index information table (sys_index) that stores all index information. The index information table is defined as follows:
Field information | Type
Information table primary key ID | Numeric
Data table name | Character
Data table field | Character
Index file storage path | Character
An index is built on a field of a data table. At present only single-field indexes are provided; composite indexes are not supported. That is, an index can be built on only one field and not on several fields. Each index corresponds to one index file.
There are two main index types: text and numeric. Date and time values are first converted to a numeric value, for example a number of milliseconds, and are then handled as numeric. The correspondence between database field types and index types is shown in the following table:
A text-type index file consists of three parts: file header information, secondary index information (an index over the index file), and data information. A numeric-type index file consists of two parts: file header information and data information.
The file header information mainly comprises the start and end positions of the file header, the start and end positions of the secondary index information, the start and end positions of the data content, and the total number of keywords.
The secondary index information applies mainly to text-type index files and records the start and end position of each major class within the file. Keywords are classified by the pinyin initial of their first Chinese character, giving 26 major classes arranged in order from a to z. In a later query, the secondary index information is first located through the file header, and the start and end positions of the data for the relevant class are obtained, which narrows the search range.
The data information consists of multiple triples. The triple format is: keyword, occurrence count, occurrence positions [start and end position of the database table record in the database file, ...]. For the text type, keywords are arranged from a to z by pinyin or by letter. For the numeric type no word segmentation is needed: each distinct value, such as 1 or 1.1, is a keyword. The occurrence positions within a triple are ordered strictly by the order in which the records appear in the data file.
The text-type index file format is as follows:
File header information | Start and end positions of the file header, start and end positions of the secondary index information, start and end positions of the data content, total number of keywords
Secondary index information | (type, start position, end position)
Data content | keyword, occurrence count, occurrence positions [start and end position of the database table record in the database file, ...]
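The three-part layout above can be sketched in memory as follows. This is only an illustration under stated assumptions: list positions stand in for byte offsets in the file, the names `initial_of`, `build_text_index`, and `lookup` are hypothetical, and keywords are taken to be Latin or already romanized, because mapping a Chinese character to its pinyin initial needs a lookup table that is not shown here.

```python
def initial_of(keyword):
    """Class letter of a keyword (assumes Latin or romanized keywords)."""
    return keyword[0].lower()

def build_text_index(triples):
    """triples: list of (keyword, occurrence count, [(start, end), ...])."""
    data = sorted(triples, key=lambda triple: triple[0].lower())
    secondary = {}  # class letter -> (start, end) range within `data`
    for pos, (keyword, _count, _positions) in enumerate(data):
        letter = initial_of(keyword)
        first, _ = secondary.get(letter, (pos, pos))
        secondary[letter] = (first, pos + 1)
    header = {"keyword_total": len(data)}
    return header, secondary, data

def lookup(index, keyword):
    """Narrow the range through the secondary index, then scan in order."""
    _header, secondary, data = index
    start, end = secondary.get(initial_of(keyword), (0, 0))
    for candidate, count, positions in data[start:end]:
        if candidate == keyword:
            return count, positions
    return None

index = build_text_index([
    ("shanghai", 2, [(120, 180), (300, 360)]),
    ("beijing", 1, [(0, 60)]),
    ("search", 1, [(60, 120)]),
])
```

Here `lookup(index, "shanghai")` only scans the two entries filed under "s", which is the narrowing effect the secondary index is meant to provide.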
The system provides two kinds of index file, so there are correspondingly two index-building processes, as follows:
1. No word segmentation required
The prerequisite is that the data table provides the keywords of each record. The index builder treats the supplied keywords as the result of text segmentation and stores each keyword triple in the index file in alphabetical order or in order of pinyin initial.
2. Word segmentation required
The main steps are:
(1) Use the segmentation program to split the actual value of the indexed field into individual keywords and store them in the index file.
(2) For each keyword, record where the keyword occurs in the data records.
(3) Based on the result of (2), establish the index relationships in the index file.
(4) Build the index for each keyword, sorted by the position at which the data records appear in the original data file, ready for user queries.
The most important, and also the most troublesome, part of index building is word segmentation, especially Chinese word segmentation. Chinese word segmentation means cutting a sequence of Chinese characters into meaningful words. English takes the word as its unit, and words are separated by spaces; Chinese takes the character as its unit, and the characters in a sentence must be joined together to express a meaning.
The segmentation program processes an actual value as follows. It first splits the value on delimiters (such as punctuation marks, spaces, and invalid words such as "的", "地", and "得"). It then checks for repeated strings and, if there are any, discards the extra copies and keeps only one. Next it checks for English text or digits; if there are any, each run of English or digits is kept as a whole and cut apart from the Chinese text before and after it. Finally, the bigram (two-character) segmentation algorithm is applied to the remaining Chinese segments, and the result is the keywords. In the bigram segmentation algorithm, every two adjacent Chinese characters are taken as a keyword. The advantage of this algorithm is that it is simple and does not miss any possible keyword; the disadvantage is that it produces a large number of "words" that are not real words. Given the processing power and cache size of the set-top box, bigram segmentation is the more reasonable choice.
For example, the following sentence is segmented:
The development history of search engines proves that nothing is impossible, only unimagined; letting people obtain information more conveniently and accurately is the mission of the search engine.
Word segmentation result: [search] [index] [engine] [holding up] [send out] [development] [exhibition is gone through] [history] [history card] [proof] [not having] [have and do] [doing not] [less than] [to] [having only] [have and think] [thinking not] [less than] [allow people] [people] [more] [more square] [convenience] [just standard] [accurately] [true] [obtain] [obtaining] [winning the confidence] [information] [breath is] [being to search] [search] [index] [engine] [holding up] [make] [mission]
Client query
The query module screens the data file mainly on the basis of the index file and the data table description information obtained from the server. The main processing steps are:
1. Receive the index file packet with keywords, parse and segment the string entered by the user, and obtain the keywords.
2. If necessary, receive the index file packet generated by bigram segmentation, then find the data records that satisfy the conditions in the index file. If there are several keywords, look each one up in its own index file first and then merge the results. If a keyword is of a date type, first convert the date to a millisecond-level numeric value before querying.
3. Receive the data file packet that excludes the content of file-type fields, and display the text information.
4. If multimedia needs to be displayed, receive the multimedia data packet and display the multimedia information.
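Step 2 converts a date keyword to a millisecond value before it is looked up in a numeric index. A minimal Python sketch of that conversion follows; the reference point (the Unix epoch in UTC), the input formats, and the name `date_to_millis` are assumptions, since the text does not specify them.

```python
from datetime import datetime, timedelta, timezone

# Assumption: milliseconds are counted from 1970-01-01 00:00:00 UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def date_to_millis(text, fmt="%Y-%m-%d"):
    """Convert a date or timestamp string to whole milliseconds."""
    moment = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    # Integer division of two timedeltas avoids floating-point rounding.
    return (moment - EPOCH) // timedelta(milliseconds=1)

day = date_to_millis("1970-01-02")
second = date_to_millis("1970-01-01 00:00:01", "%Y-%m-%d %H:%M:%S")
```

The same conversion must be used on the server when the index is built, otherwise the numeric keywords on the two sides will not match.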
For the first and second steps above, there are two possible processing modes:
1. Parse the query terms entered by the set-top box user against the keyword lexicon obtained from the keyword index file received from the server, and locate them in the keyword index. If the number of results reaches a predetermined value, the query ends; otherwise the second query mode is entered.
2. Segment the query terms entered by the set-top box user using bigram segmentation, then use the index built by bigram segmentation to locate the offset of each whole record in the data file.
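The fallback between the two modes can be sketched as follows. This is a simplified illustration, not the described implementation: both indexes are plain dictionaries from keyword to record positions, mode 1 uses substring matching against the lexicon instead of full maximum matching, the results of mode 1 are assumed to be kept when mode 2 runs, and the names `collect`, `bigrams`, and `two_stage_query` are hypothetical.

```python
def collect(index, units):
    """Record positions for every query unit found in the index, no repeats."""
    found = []
    for unit in units:
        for position in index.get(unit, []):
            if position not in found:
                found.append(position)
    return found

def bigrams(text):
    """Every two adjacent characters; a single character stays as it is."""
    return [text[i:i + 2] for i in range(len(text) - 1)] or [text]

def two_stage_query(query, lexicon_index, bigram_index, enough=10):
    # Mode 1: use only the lexicon keywords that occur in the query.
    units = [word for word in lexicon_index if word in query]
    results = collect(lexicon_index, units)
    if len(results) >= enough:
        return results
    # Mode 2: too few results, so fall back to the bigram index.
    for position in collect(bigram_index, bigrams(query)):
        if position not in results:
            results.append(position)
    return results

lexicon_index = {"电视": [(0, 40)]}
bigram_index = {"电视": [(0, 40)], "视机": [(40, 90)]}
```

With `enough=1` the query "电视机" stops after mode 1; with the default threshold it continues into the bigram index and also returns the second record.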
The two query modes above share the following query steps:
Split the input into words
Locate the records using the location algorithm
The two query modes are analyzed below according to these two steps.
a. Query mode with a keyword lexicon file:
1. Split the input into words
Segmentation with a keyword lexicon is also called vocabulary-based segmentation. There are two typical algorithms: leftward maximum matching and rightward maximum matching.
Example: the following illustrates how "Beijing Tiananmen" is segmented by rightward maximum matching in line with Chinese usage. Starting from the character for "north" (the first character of "Beijing"), one character is added to form "Beijing", and the lexicon is checked for this phrase. If it is present, characters continue to be added to the right one at a time and looked up; if not, the longest phrase found so far is taken as one query unit.
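The character-by-character extension in this example can be sketched in Python as follows. The function name `segment_forward`, the sample lexicon, and the Chinese rendering of the example phrase are assumptions made for illustration.

```python
def segment_forward(text, lexicon):
    """Rightward matching by extension: start from one character and keep
    adding the next character while the longer phrase is in the lexicon."""
    units = []
    i = 0
    while i < len(text):
        j = i + 1
        while j < len(text) and text[i:j + 1] in lexicon:
            j += 1
        units.append(text[i:j])  # longest phrase reached from position i
        i = j
    return units

lexicon = {"北京", "天安", "天安门"}
units = segment_forward("北京天安门", lexicon)
```

Because the phrase is grown one character at a time, a long word is only reached if its shorter prefixes are also in the lexicon; without "天安" in the sample lexicon, "天安门" would be cut into single characters. Textbook forward maximum matching avoids this by trying the longest candidate first.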
Vocabulary-based segmentation gives very good query results, which is why we make this mode the first choice. Its index efficiency is high: the index is generally about 30% of the size of the original text. The hit rate in queries is therefore also high, with little junk data. But it also has significant drawbacks: the word list is costly to maintain, each language must be maintained separately, and content such as word-frequency statistics must also be included.
In our system, the keyword information indexed from the head-end search server is very limited. A massive number of keywords cannot occur, so the index we build is also very small. If the relevant records can be found first in this way, that is the outcome we most hope for.
This design was chosen after weighing efficiency against hit rate, in light of the advantages and disadvantages of vocabulary-based segmentation described above.
2. Locate the records using the location algorithm
As described in the section on index building, the header part of the index file contains a hash index ordered by pinyin. Each query unit obtained from the split above can be located by the pinyin of its first character, which gives the start and end positions in the index.
A sequential matching search then locates the start and end positions of the keyword's records in the data file.
b. Query mode with bigram segmentation:
1. Split the input into words
The segmentation method here is exactly the same as the one described in the earlier section on index building.
2. Locate the records using the location algorithm
This is similar to the way the keyword index is read.
The two methods read the index in essentially the same way: both use the pinyin index in the header part of the index and then use the secondary index.
The pinyin index is used first to find the key range, and the query unit is then compared against the entries in that range to find the start and end positions of the records in the data file. This completes the index search.
Finally, a data structure for the data table can be built from the table description information to store the data extracted from the data file.
The beneficial effect of the invention is that it solves the problem of performing full-text search on a one-way digital television platform: information transmission is no longer unidirectional, and the user can obtain the search function with a hand-held remote control.
Embodiment
Fig. 1 is the overall system design diagram, showing that the full-text retrieval system comprises two main parts: the retrieval server and the search client application. Figs. 2 and 3 show the retrieval server flow chart and the search client application flow chart, respectively. Figs. 4 and 5 show the index building flow chart and the client search flow chart, respectively, giving the concrete operating steps and procedures for index building and client search, ending with "save file". Figs. 6 and 7 show the result pages of the two embodiments, respectively. Details follow.
Embodiment 1: Sunshine Government Affairs search (full-text word search)
Sunshine Government Affairs is a brand-new platform for communication between the government and the public, providing the latest and most authoritative government affairs information. Through it, TV users can view the industry regulations promulgated by the state, introductions to government agencies, hotline numbers, and the government's public notices and announcements, and can keep up with government work and policy direction. It significantly narrows the distance between the government and the public and makes government affairs truly open.
The volume of this government affairs information is considerable, running to tens of thousands, even hundreds of millions, of items. It is quite difficult for a user to find the information he wants among them. The usual approach is for the user to leaf through the items one by one until he finds what he needs. The user might spend three or four hours searching the government affairs information and only three or four minutes reading the useful part, which is a great waste of the user's time and energy.
The Sunshine Government Affairs search function makes all of this very simple. The TV user enters the search function and types in search keywords; the function searches all government affairs information for items that satisfy the conditions and then displays the items found. In this way the user needs only three or four minutes to find the regulation or announcement he needs.
Data entry and index management
a) The data table format is defined as follows:
b) Data table and index description information
The data table information is described here in XML format. This content is attached to the front of the index file built later and is sent to the client by the search server as header information.
<?xml version="1.0" encoding="gb2312"?>
<metainfo>
<table-name>tbl_org_info</table-name>
<columns>
<column type="number" primaryKey="true">ID</column>
<column type="text">TITLE</column>
<column type="text" is_keyword="true">CONTENT</column>
</columns>
<indexes>
<index name="index_tbl_org_info_content">
<column>CONTENT</column>
</index>
</indexes>
</metainfo>
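A client could read this header with a standard XML parser. The following Python sketch uses the standard library module xml.etree.ElementTree; the function name `parse_descriptor` and the returned structure are assumptions, and the XML declaration is omitted from the sample string for brevity.

```python
import xml.etree.ElementTree as ET

DESCRIPTOR = """<metainfo>
  <table-name>tbl_org_info</table-name>
  <columns>
    <column type="number" primaryKey="true">ID</column>
    <column type="text">TITLE</column>
    <column type="text" is_keyword="true">CONTENT</column>
  </columns>
  <indexes>
    <index name="index_tbl_org_info_content">
      <column>CONTENT</column>
    </index>
  </indexes>
</metainfo>"""

def parse_descriptor(xml_text):
    """Return (table name, {column: attributes}, {index name: column})."""
    root = ET.fromstring(xml_text)
    columns = {col.text: dict(col.attrib) for col in root.find("columns")}
    indexes = {idx.get("name"): idx.findtext("column")
               for idx in root.find("indexes")}
    return root.findtext("table-name"), columns, indexes

table, columns, indexes = parse_descriptor(DESCRIPTOR)
```

From `indexes` the query program learns which index file belongs to which field, which is the purpose of the header described above.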
Client query
Because government affairs information, unlike business information, does not have many fixed keywords, the search server does not send a custom keyword lexicon to the client. The search program skips the keyword lexicon query mode and goes directly to the bigram segmentation query mode.
The design of the client query interface is shown in Fig. 6a.
The program then performs bigram segmentation on the query condition entered by the user (here, the content of the government affairs information). For example, if the user enters the three-character Chinese term for "heating fee", the program splits the query keyword into two minimum query units: the first two characters ("heating") and the last two characters ("warm fee").
After reading the header information of the index file, the program knows that the data table's index is built on the CONTENT field. Using the pinyin index at the beginning of the index file, it can then find the start and end positions corresponding to "heating" and "warm fee" (the positions in the indexed data file). The result page is shown in Fig. 6.
Embodiment 2: Discounted goods search (text plus discount figure)
Market competition is now very fierce, and merchants constantly run discount promotions to catch consumers' attention, draw more customers, and earn higher revenue. But how can consumers be told about these discounts quickly and efficiently? And how can a consumer quickly find the discount information for the goods he needs among all the discount information on offer? The discounted goods search function meets these needs.
Data entry and index management
a) The data table format is defined as follows:
b) Data table and index description information
<?xml version="1.0" encoding="gb2312"?>
<metainfo>
<table-name>tbl_discount</table-name>
<columns>
<column type="number" primaryKey="true">ID</column>
<column type="text">NAME</column>
<column type="text">TYPE</column>
<column type="number">DISCOUNT</column>
<column type="PICTURE">ICON_1</column>
</columns>
<indexes>
<index name="tbl_discount_NAME">
<column>NAME</column>
</index>
<index name="tbl_discount_DISCOUNT">
<column>DISCOUNT</column>
</index>
</indexes>
</metainfo>
Client query
In the discounted goods query, the search client first receives a keyword lexicon provided by a merchant. The query first searches the index file built from this lexicon; if the result set reaches 10 records, the query ends. If it does not reach 10, the bigram segmentation query continues. The bigram segmentation query works exactly as in the Sunshine Government Affairs embodiment above.
From the header information, the query program knows that the indexes are built on the NAME and DISCOUNT columns respectively, and the query locates the start and end positions of the required data in the data file through the two index files.
The user search interface is shown in Fig. 7a.
The query list page and the details page are shown in Fig. 7b.