CN104008171A - Legal database establishing method and legal retrieving service method - Google Patents

Legal database establishing method and legal retrieving service method

Info

Publication number
CN104008171A
Authority
CN
China
Prior art keywords
law
document
lexical item
legal
entry document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410242810.8A
Other languages
Chinese (zh)
Inventor
刘婕
张程
赵晓芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Priority to CN201410242810.8A
Publication of CN104008171A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/18 Legal services; Handling legal documents

Abstract

The invention provides a legal database establishing method comprising: 1) for a new legal text, splitting the received legal text by item to obtain the corresponding legal item documents and creating a unique identifier for each; 2) segmenting every legal item document into words and, for every word item obtained through segmentation, establishing or updating the unique record corresponding to that word item in a content-based inverted index, wherein every record of the content-based inverted index comprises every legal item document whose content contains the word item of that record, together with the corresponding index information; 3) returning to step 1) to process the next legal text until all legal texts are processed. The invention also provides a corresponding retrieving service method. With these methods, retrieval results that are precise to the level of individual legal items can be obtained in a single retrieval.

Description

A law database construction method and a legal retrieval service method
Technical field
The present invention relates to computer text information retrieval, and in particular to a law database construction method and a legal retrieval service method.
Background
Information retrieval is the process of organizing and storing recorded information in a certain way and finding the relevant information according to the user's needs. With information retrieval technology, people can more easily find the knowledge they need in massive amounts of data, which improves the efficiency of knowledge acquisition.
A legal retrieval system applies information retrieval technology to the text of laws and regulations. It helps the staff of people's congresses at all levels, Party and government organs, and legal bodies such as courts, procuratorates, and law firms quickly find the laws and regulations they need. Legal retrieval systems also provide retrieval services to the public.
Current legal retrieval systems, such as the "Chinese Laws and Regulations Retrieval System" of the National People's Congress and "Beida Fabao" of Peking University, all retrieve over the full text of laws and regulations combined with metadata such as title, date, issuing department, category, level of effect, and timeliness, and the results they return take the full text of a law or regulation as the basic unit. However, users often need to find the provisions that may apply to the facts of a case, so after obtaining the results they still have to search within them for the relevant provisions.
On the other hand, users often want to find all the laws relevant to the facts of a case, yet current legal retrieval relies on exact keyword matching. If the keywords are not accurate enough, the results may be incomplete, and some relevant laws may fall outside the result set. To find more relevant laws, users therefore have to try several keywords or keyword combinations and search repeatedly before they finally find all the relevant entries they need. The convenience of existing legal retrieval thus needs to be improved.
There is therefore an urgent need for a legal retrieval service scheme that helps users find the laws and regulations they need more quickly.
Summary of the invention
The task of the present invention is to overcome the deficiencies of the prior art and to provide a legal retrieval service scheme that helps users find the laws and regulations they need more quickly.
The invention provides a law database construction method comprising the following steps:
1) the law database receives a new law text, splits the received law text by entry to obtain the corresponding law entry documents, and creates a unique identifier for each;
2) each law entry document is segmented into words; for each lexical item obtained by segmentation, the unique record corresponding to that lexical item is created or updated in a content-based inverted index, where each record of the content-based inverted index includes every law entry document whose content contains the lexical item of that record, together with the corresponding index information;
3) return to step 1) to receive and process the next law text, until all law texts have been processed.
In step 2), the index information comprises the inverse document frequency of the lexical item and the word frequency of the lexical item in each law entry document, where the inverse document frequency is computed over the law entry documents in the law database.
Step 2) comprises the following sub-steps:
21) traverse each law entry document obtained by splitting, and segment the current law entry document into words;
22) traverse all lexical items obtained by segmentation; for each lexical item, compute its word frequency in the current law entry document and look up the record corresponding to it in the content-based inverted index. If a record for the current lexical item already exists, add to the record the identifier of the current law entry document and the word frequency of the lexical item in that document, and update the inverse document frequency of the lexical item. If no record exists, add the current lexical item to the dictionary of the content-based inverted index and add a new record comprising the inverse document frequency of the lexical item, the identifier of the current law entry document, and the word frequency of the lexical item in that document.
The present invention also provides a legal retrieval service method based on the above law database, comprising the following steps:
4) obtain the retrieval vector that acts on the content field;
5) for each keyword in the retrieval vector, use the content-based inverted index to find each law entry document whose content contains the keyword, together with the corresponding index information;
6) sort the hit law entry documents according to the corresponding index information.
In step 5), the index information comprises the inverse document frequency of the lexical item and the word frequency of the lexical item in each law entry document, where the inverse document frequency is computed over the law entry documents in the law database.
Step 6) comprises the following sub-steps:
61) for each law entry document hit in step 5), obtain a law entry document vector with the same dimension as the retrieval vector; each element of the law entry document vector corresponds to one keyword, and its value is derived from the inverse document frequency of that keyword found in step 5) and the word frequency of that keyword in the content of the law entry document;
62) take the similarity between the law entry document vector and the retrieval vector as the retrieval similarity of the law entry document, and sort the hit law entry documents by retrieval similarity.
In step 62), the similarity between the law entry document vector and the retrieval vector is their cosine similarity.
In step 6), the value of each element of the law entry document vector is the product of the inverse document frequency of the corresponding keyword found in step 5) and the word frequency of that keyword in the content of the law entry document.
A law entry document comprises meta-information and content; the meta-information comprises the title of the law text to which the entry belongs, and the chapter or section and the number of the entry within that law text.
Step 6) further comprises: taking the law to which each hit law entry document belongs as a hit law; deriving the retrieval similarity of each hit law from the retrieval similarities of its hit law entry documents; sorting the hit laws; and then displaying, in sorted order, the content and meta-information of each hit law entry document within each hit law.
The legal retrieval service method further comprises the step:
7) for each hit law, find and display its related laws according to the similarity between the hit law and the other laws in the law database.
Related laws are determined by the similarity between laws. The similarity between two laws is obtained as follows: all law titles are segmented into lexical items; the lexical items belonging to the subject, predicate, and object of each title are extracted according to part of speech; the extracted lexical items form a feature subspace; every law title is converted into a lexical item vector on that feature subspace; and the similarity of the two title vectors in the feature subspace is taken as the similarity between the two laws.
In step 7), for each hit law, an association graph of the hit law and its related laws is displayed. The association graph comprises a series of points and the edges connecting them; each point represents the hit law or one of its related laws, and each edge shows the similarity between the two laws at its endpoints.
Compared with the prior art, the present invention has the following technical effects:
1. A single retrieval yields results that are accurate to the level of the law entry.
2. Besides the law entries that match the retrieval statement, all related laws can also be obtained, which helps users find the laws relevant to the facts of a case more completely and makes laws and regulations easier to retrieve.
Brief description of the drawings
Embodiments of the invention are described in detail below with reference to the accompanying drawings, in which:
Fig. 1 is a schematic diagram of the overall flow of one embodiment of the invention;
Fig. 2 is a schematic flowchart of building a law database whose storage unit is the law entry document, in one embodiment of the invention;
Fig. 3 shows an example of the structure of the dictionary and the index record table of the inverted index in one embodiment of the invention;
Fig. 4 is a schematic flowchart of the retrieval service in one embodiment of the invention;
Fig. 5 is a schematic flowchart of the associative retrieval service in one embodiment of the invention;
Fig. 6 shows an example of the association graph of a hit law and its related laws in one embodiment of the invention.
Embodiments
According to one embodiment of the invention, a legal retrieval service method is provided. As shown in Fig. 1, it comprises three parts. The first part builds a law database whose storage unit is the law entry document, together with the corresponding inverted index. The second part receives a retrieval statement and, based on the law database and the inverted index, returns results that are accurate to the level of the law entry. The third part takes the results of the second part, finds the laws related to the laws to which those results belong, and adds the related laws to the results. The three parts are described in detail below.
Part 1: Building a law database whose storage unit is the law entry document, and the corresponding inverted index. In the prior art, a whole law usually forms one law document, and the law database stores data in units of law documents. In the present embodiment, by contrast, the basic storage unit of the law database is the law entry document; that is, each law entry forms a separate document. For ease of understanding, the law document "Electoral Law of the National People's Congress and Local People's Congresses of the People's Republic of China" is used as an example below. The text of this law document mainly comprises a title, notes, a table of contents, and the body.
Fig. 2 is a schematic flowchart of building a law database whose storage unit is the law entry document, in one embodiment of the invention. With reference to Fig. 2, law documents are input to the law database one after another, and steps 11 to 14 below are carried out for each law document.
Step 11: Identify the structure of the law document and split it. Using predefined rules, the structural information of the law document, such as parts, chapters, and sections, is identified; each entry in the law text is then identified and located, and the text is split entry by entry into N sub-documents. Taking the law document "Electoral Law of the National People's Congress and Local People's Congresses of the People's Republic of China" as an example, it has 66 entries and is therefore split into 66 sub-documents. Each sub-document includes the content of the legal provision, the title of the law to which it belongs, and its position in the hierarchical structure of that law. For example, the sub-document corresponding to Article 1 of this electoral law stores the following. Provision content: in accordance with Article 12 of the Common Program of the Chinese People's Political Consultative Conference, the National People's Congress of the People's Republic of China and the local people's congresses at all levels are elected by the people of all nationalities through universal suffrage. Title of the law: Electoral Law of the National People's Congress and Local People's Congresses of the People's Republic of China. Position in the law: Chapter 1, Article 1.
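The splitting in step 11 can be sketched as follows. This is a minimal illustration, not the patent's rule set: it assumes English-style headings of the form "Chapter N ..." and "Article N ..." at the start of a line (a Chinese statute would need patterns for its own heading forms), and the names `ArticleDoc`, `split_law`, and the `title#n` identifier scheme are hypothetical.

```python
import re
from dataclasses import dataclass

# Assumed heading formats; real rules depend on the source documents.
CHAPTER_RE = re.compile(r"^Chapter\s+\d+\b")
ARTICLE_RE = re.compile(r"^(Article\s+\d+)\b[.:]?\s*(.*)$")


@dataclass
class ArticleDoc:
    doc_id: str     # unique identifier of the law entry document
    law_title: str  # title of the law the entry belongs to
    chapter: str    # chapter heading the entry sits under ("" if none)
    number: str     # entry number, e.g. "Article 1"
    content: str    # text of the provision


def split_law(law_title, body):
    """Split the body of one law into one ArticleDoc per entry."""
    docs, chapter, current = [], "", None
    for raw in body.splitlines():
        line = raw.strip()
        if not line:
            continue
        if CHAPTER_RE.match(line):
            chapter, current = line, None  # a new chapter closes the open entry
            continue
        m = ARTICLE_RE.match(line)
        if m:
            doc_id = f"{law_title}#{len(docs) + 1}"
            current = ArticleDoc(doc_id, law_title, chapter, m.group(1), m.group(2))
            docs.append(current)
        elif current is not None:
            current.content += " " + line  # continuation of the current entry
    return docs
```

Lines that appear before the first entry (title, notes, table of contents) are skipped here, so only the body is turned into entry documents.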
Step 12: Build an index over the sub-documents obtained by splitting (the law entry documents). After splitting, each entry is treated as a sub-document and word segmentation is performed on the content field (that is, on the content part of the sub-document). For each lexical item obtained by segmentation (repeated words count as the same lexical item), its word frequency (tf) and inverse document frequency (idf) are computed, and the inverted index is built on this basis. The inverted index has two parts: a dictionary and a postings table (the index record table). Fig. 3 shows an example of the structure of the dictionary and the postings table in an inverted index. As shown in Fig. 3, each record is identified uniquely by a lexical item, which is stored in the dictionary. The dictionary also stores a link to the corresponding record in the postings table and the inverse document frequency of the lexical item in the law database. Note that this inverse document frequency is computed over all law entry documents in the law database, not, as is usual, over whole law documents. In the postings table, each record is stored as a linked list of the law entries in which the lexical item occurs. For example, in Fig. 3 the record for lexical item 1 has four nodes representing law entry documents 1, 2, 3, and 4, meaning that lexical item 1 occurs in all four; the record for lexical item 2 has two nodes representing law entry documents 5 and 6, meaning that lexical item 2 occurs in both. Each node records the id of the law entry, the frequency of the lexical item in that law entry document, and other information such as the positions at which the lexical item occurs in it.
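A minimal sketch of the content index described in step 12 follows. Several details are assumptions made for illustration: the class name `ContentIndex` is hypothetical, a Python dict stands in for the dictionary plus linked-list postings table, tokens are supplied by the caller (a real system would use a Chinese word segmenter), positions are omitted, and idf is taken as log(N / df), which the text does not specify.

```python
import math
from collections import Counter, defaultdict


class ContentIndex:
    """Inverted index over law entry documents (content field only)."""

    def __init__(self):
        # lexical item -> {law entry document id: word frequency}
        self.postings = defaultdict(dict)
        self.num_docs = 0  # number of law entry documents, not whole laws

    def add_document(self, doc_id, tokens):
        """Create or update the record of every lexical item in one entry."""
        self.num_docs += 1
        for term, tf in Counter(tokens).items():
            self.postings[term][doc_id] = tf

    def idf(self, term):
        """Inverse document frequency over law entry documents.

        Assumed formula: log(N / df); 0.0 for an unknown lexical item.
        """
        df = len(self.postings.get(term, ()))
        return math.log(self.num_docs / df) if df else 0.0
```

Computing idf on demand from the current document count plays the role of "updating the inverse document frequency" each time an entry is added; storing the value in the dictionary, as the text describes, is an equivalent design that trades update cost for lookup speed.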
Step 13: Build indexes over the other information fields of the law, such as title, issuing time, and issuing body. The title is segmented into words before its inverted index is built. The other fields are not segmented; the whole content of each field is treated as a single lexical item. For example, when the issuing body is the Central People's Government Committee, "Central People's Government Committee" as a whole is one lexical item in that inverted index.
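The difference between the segmented title field and the unsegmented metadata fields can be sketched as below. The function names, the field names `title`, `issuer`, and `issue_date`, and the whitespace tokenizer are all hypothetical stand-ins; the sample law and date in the usage note are placeholders, not data from the source.

```python
from collections import defaultdict


def new_field_indexes():
    """One inverted index per field: field -> lexical item -> set of law ids."""
    return defaultdict(lambda: defaultdict(set))


def index_metadata(indexes, law_id, title, issuer, issue_date):
    # The title is segmented (whitespace split stands in for a real segmenter).
    for term in title.lower().split():
        indexes["title"][term].add(law_id)
    # Other fields are indexed whole: the entire value is one lexical item.
    indexes["issuer"][issuer].add(law_id)
    indexes["issue_date"][issue_date].add(law_id)
```

For instance, after `index_metadata(idx, "law-1", "Sample Tax Law", "Central People's Government Committee", "2000-01-01")`, the issuer index contains the full name as a single key and no entry for the word "central" alone.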
Step 14: Store the inverted indexes of the content field, the title field, and the other metadata fields such as issuing time and issuing body in the system as files.
Part 2: Receiving a retrieval statement and, based on the law database and the corresponding inverted index, returning results that are accurate to the level of the law entry. The present embodiment can provide combined retrieval over multiple fields. It can also group the relevant entries that belong to the same law or regulation and display them together. In general, the retrieval service offers a simple mode and an advanced mode. In simple mode, the same retrieval statement is searched on the title and content fields, and the user enters the retrieval statement directly. Advanced mode additionally supports filtering on enumerated metadata values through the metadata fields; in this mode the user specifies the fields to search and, for each field, enters a retrieval statement or selects an enumerated value, for example "content: consumer rights protection & title: protection law & issuing body (enumerated value): National People's Congress". The retrieval service returns the relevant entry contents and their metadata. A retrieval statement can be a word (such as "economy"), a set of words (such as "economic policy"), or a phrase (such as "economic policy"). Different fields are usually searched in different ways: on the content and title fields the retrieval statement usually needs to be segmented into words, whereas on the other metadata fields it is not segmented and is used directly as the keyword for that field. The entry-level retrieval of the present embodiment mainly concerns the retrieval service that acts on the content field, so the description below focuses on that service; the remaining parts, which are not essential to the invention, are not described further here.
Fig. 4 is a schematic flowchart of the retrieval service in one embodiment of the invention. With reference to Fig. 4, the retrieval service comprises steps 21 to 24 below.
Step 21: Receive the retrieval statement that acts on the content field. As mentioned above, a retrieval statement can be a word (such as "economy"), a set of words (such as "economic policy"), or a phrase (such as "economic policy").
Step 22: Segment the retrieval statement into words to obtain one or more search keywords, which form the retrieval vector.
Step 23: On the content field, for each keyword, use the inverted index of this field to find the inverse document frequency of the keyword, each law entry document in which the keyword occurs, and the word frequency of the keyword in each of those documents. The inverted index stores the index records of all lexical items of this field in the law database, so once the index record of the lexical item corresponding to a keyword is found, the required information is available. For example, when the keywords are "economy" and "policy", the index records of the lexical items "economy" and "policy" are looked up in the inverted index. The index record of "economy" gives the inverse document frequency of "economy", each law entry document that contains "economy", and the word frequency of "economy" in each of them. Likewise, the index record of "policy" gives the inverse document frequency of "policy", each law entry document that contains "policy", and the word frequency of "policy" in each of them. The union of the law entry document lists of "economy" and "policy" is then the set of all documents on this field that are relevant to the retrieval. If advanced mode is selected, retrieval is carried out on every field with the one or more keywords given for that field.
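The lookup and union in step 23 can be sketched as follows. The function name `lookup` and the postings layout (lexical item mapped to a dict of document id and word frequency) are assumptions for illustration.

```python
def lookup(postings, keywords):
    """Return {doc_id: {keyword: word frequency}} for every hit document.

    The keys of the result are the union of the postings of all keywords.
    """
    hits = {}
    for kw in keywords:
        for doc_id, tf in postings.get(kw, {}).items():
            hits.setdefault(doc_id, {})[kw] = tf
    return hits


# Illustrative postings for the two keywords used in the text.
postings = {
    "economy": {"entry-1": 2, "entry-2": 1},
    "policy": {"entry-2": 3, "entry-3": 1},
}
hits = lookup(postings, ["economy", "policy"])
```

Here `hits` covers entry-1, entry-2, and entry-3; entry-2 carries a word frequency for both keywords, which is what the ranking step needs.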
Step 24: Calculate the retrieval relevance of each law entry document found, and sort the documents by retrieval relevance, with higher relevance ranked first. The information of the sorted law entry documents is then returned as the retrieval result. For a retrieval that acts only on the content field, a law entry document vector with the same dimension as the retrieval vector is obtained from the results of step 23. Each element of the law entry document vector corresponds to one keyword, and its value is derived from the inverse document frequency of that keyword found in step 23 and the word frequency of that keyword in the content of the law entry document. The similarity between the law entry document vector and the retrieval vector can be used directly as the retrieval similarity of the law entry document on the content field, and the hit law entry documents are sorted by this retrieval similarity. This yields the combined retrieval result of the retrieval statement on the content field. The similarity between the law entry document vector and the retrieval vector is their cosine similarity. The value of each element of the law entry document vector is the product of the inverse document frequency of the corresponding keyword found in step 23 and the word frequency of that keyword in the content of the law entry document.
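A minimal sketch of the ranking in step 24 follows. It assumes the hit structure `{doc_id: {keyword: word frequency}}` and an `idf` dict keyed by keyword, both hypothetical; the query vector uses the word frequency of each keyword in the retrieval statement, and each document element is tf times idf, as the text describes.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(query_tokens, hits, idf):
    """Sort hit law entry documents by cosine similarity to the query."""
    q_tf = Counter(query_tokens)
    terms = list(q_tf)                  # one dimension per distinct keyword
    q_vec = [q_tf[t] for t in terms]    # query side: word frequency only
    scored = []
    for doc_id, tfs in hits.items():
        d_vec = [tfs.get(t, 0) * idf.get(t, 0.0) for t in terms]  # tf * idf
        scored.append((doc_id, cosine(q_vec, d_vec)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With equal idf values, a document that contains both query keywords once ranks above a document that contains only one of them twice, because its vector points in the same direction as the query vector.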
For retrieval in advanced mode, the retrieval relevance of a law entry document is the linear weighted sum of its relevance on each field. The relevance of a law entry document on one field is, under the vector space model of text, the cosine similarity between the vector representation of the entry document on that field and the vector representation of the retrieval content (the retrieval vector). In the vector representation of the entry document, the value of each dimension is the product of the inverse document frequency of the lexical item and its word frequency in the law entry document; in the vector representation of the retrieval content, the value of each dimension is only the word frequency of the lexical item. This yields the combined retrieval result of the retrieval statement, whose ordering takes into account every field and the different influence of the keywords on each field.
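The linear weighted sum over fields can be sketched in a few lines. The function name and the weight values in the usage note are hypothetical; the text does not state how the weights are chosen.

```python
def combined_relevance(field_scores, weights):
    """Linear weighted sum of per-field cosine similarities.

    field_scores: {field name: relevance of the entry on that field}
    weights:      {field name: weight}; fields without a weight count as 0.
    """
    return sum(weights.get(field, 0.0) * score
               for field, score in field_scores.items())
```

For example, with assumed weights of 0.7 for the content field and 0.3 for the title field, an entry scoring 0.8 on content and 0.5 on title gets 0.7 × 0.8 + 0.3 × 0.5 = 0.71.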
Further, in one embodiment, the law entry documents found in step 24 (the hit law entry documents) are grouped by the law to which they belong. The retrieval relevance of each whole law text is calculated and used in the relevance ranking; it equals the sum of the retrieval relevance of the hit law entry documents that belong to that law. In this way the list of retrieved entries is grouped by law, the relevance of each law is computed from the relevance of its entries, and the results are re-sorted, so that results are displayed law by law, only the relevant entries of each law are listed rather than its full text, and the entries within each law appear in order of relevance. This makes the results more logical, tidier, and easier for users to browse.
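The grouping described above can be sketched as follows, assuming a list of `(doc_id, relevance)` pairs and a hypothetical mapping `law_of` from entry document id to law title.

```python
from collections import defaultdict


def group_by_law(ranked_entries, law_of):
    """Group hit entries by law and rank laws by the sum of entry relevance.

    Returns a list of (law, law relevance, [(doc_id, relevance), ...]),
    with laws and the entries inside each law in descending relevance.
    """
    groups = defaultdict(list)
    for doc_id, score in ranked_entries:
        groups[law_of[doc_id]].append((doc_id, score))
    result = []
    for law, entries in groups.items():
        entries.sort(key=lambda pair: pair[1], reverse=True)
        result.append((law, sum(score for _, score in entries), entries))
    result.sort(key=lambda item: item[1], reverse=True)
    return result
```

Because the law relevance is a sum, a law with several moderately relevant entries can rank above a law with a single highly relevant entry.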
Part 3: Based on the results of the second part, finding the laws related to the laws to which those results belong, and adding the related laws to the results. This part is in effect an associative retrieval service. It exploits the standardized structure of the text of laws and regulations to calculate degrees of association and to derive a graphical description of the associations, so that the associations between laws and regulations are shown more intuitively and users can consult the information associated with the results.
Fig. 5 is a schematic flowchart of the associative retrieval service in one embodiment of the invention. With reference to Fig. 5, the associative retrieval service comprises steps 31 to 34 below.
Step 31: Extract legal features. Law texts have a standardized structure, and their titles in particular largely indicate the field and subject that a law or regulation addresses. The subject matter of a law can therefore be obtained by analyzing its title and expressed as a vector in a feature subspace. Analysis of the titles of laws and regulations shows that their syntactic structure is relatively simple, and that the subject and object (noun parts) and the predicate (verb part) of a title essentially cover the main content of the law or regulation. Through word segmentation and part-of-speech analysis, the subject, predicate, and object components of a title can easily be found and extracted as the features that represent the title.
A concrete example with three law titles follows. First, Chinese word segmentation splits each law title into lexical items (shown here as English glosses). For the title of law 1, "Income Tax Law of the People's Republic of China for Enterprises with Foreign Investment and Foreign Enterprises", the segmentation result is:
People's Republic of China / foreign trader / investment / enterprise / and / foreign country / enterprise / income tax / law
For the title of law 2, "Regulation on the Merger of Domestic Enterprises by Foreign Investors", the segmentation result is:
about / foreign country / investor / merge / domestic / enterprise / of / regulation
For the title of law 3, "Regulation on Electronic Patent Applications", the segmentation result is:
about / electronics / patent / application / of / regulation
The vector space of these three law titles is the set of all their lexical items: {merge, of, electronics, law, about, regulation, and, domestic, enterprise, application, income tax, investment, investor, foreign country, foreign trader, People's Republic of China, patent}.
Each law title is represented as a vector in this vector space; each element of the vector stands for one lexical item, and its value is the word frequency of that lexical item in the title.
The vector representations of the three law titles are as follows:
Further, in order to get rid of the interference with the irrelevant lexical item of legal subject matter, can also after to law title participle, carry out part of speech identification, find subject and predicate, object component in title, and be extracted as the feature that represents title, and then constitutive characteristic vector.Wherein, the fixedly suffix of law title, regulation for example, notice, methods etc., also can be considered the lexical item irrelevant with legal subject matter, and the irrelevant suffix of these and content is also removed.
In the example, for the title of law 1, "Income Tax Law of the People's Republic of China for Enterprises with Foreign Investment and Foreign Enterprises", the tagged word segmentation result is:
the People's Republic of China/noun, foreign trader/noun, investment/verb, enterprise/noun, and/conjunction, foreign country/noun, enterprise/noun, income tax/noun, law/noun
For the title of law 2, "Regulations on the Merger of Domestic Enterprises by Foreign Investors", the tagged word segmentation result is:
about/preposition, foreign country/noun, investor/noun, merger/verb, domestic/place word, enterprise/noun, of/auxiliary word, regulation/noun
For the title of law 3, "Regulations on Electronic Patent Applications", the tagged word segmentation result is:
about/preposition, electronics/noun, patent/noun, application/verb, of/auxiliary word, regulation/noun
The feature subspace obtained from the three titles is then:
{electronics, enterprise, income tax, investment, investor, foreign country, foreign trader, the People's Republic of China, patent}
The vector representations of the three laws in the feature subspace are as follows:
       Electronics  Enterprise  Income tax  Investment  Investor  Foreign country  Foreign trader  The People's Republic of China  Patent
Law 1  0            2           1           1           0         1                1               1                               0
Law 2  0            1           0           0           1         1                0               0                               0
Law 3  1            0           0           0           0         0                0               0                               1
Step 32: Calculate the similarity between laws. As described above, feature extraction allows the title of a law or regulation to be described as a lexical item vector in the feature subspace. In other words, the title is represented with a keyword vector space model, but the space is restricted to the extracted feature lexical items. The similarity of two titles in the feature subspace can then be used as the similarity of the corresponding laws.
In one embodiment, the similarity of two laws is the cosine similarity of their titles calculated on the feature subspace:
\[ \mathrm{CosSimilarity}(\vec{A}, \vec{B}) = \frac{\vec{A} \cdot \vec{B}}{|\vec{A}|\,|\vec{B}|} = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}} \]
For law 1, law 2 and law 3 in the example above, the similarity calculation results are as follows:
CosSimilarity(law 1, law 3) = 0
CosSimilarity(law 2, law 3) = 0
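The cosine similarity can be checked against the three feature-subspace vectors of the example. The sketch below is illustrative only; the function name `cos_similarity` and the convention of returning 0 for an all-zero vector are assumptions. The value for law 1 and law 2, which is not listed in the text, follows from the same formula.

```python
import math


def cos_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_a * norm_b)


# Feature-subspace vectors of the three example laws, in the element order
# electronics, enterprise, income tax, investment, investor, foreign country,
# foreign trader, the People's Republic of China, patent.
law1 = [0, 2, 1, 1, 0, 1, 1, 1, 0]
law2 = [0, 1, 0, 0, 1, 1, 0, 0, 0]
law3 = [1, 0, 0, 0, 0, 0, 0, 0, 1]

print(round(cos_similarity(law1, law2), 3))  # 0.577
print(cos_similarity(law1, law3))            # 0.0
print(cos_similarity(law2, law3))            # 0.0
```

Law 1 and law 2 share the lexical items "enterprise" and "foreign country", which gives a dot product of 3 and a similarity of 3 / (3 · √3) ≈ 0.577, whereas law 3 shares no feature lexical item with either of them.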
Step 33: Based on the similarity between laws, return the laws associated with the laws in the second part of the retrieval results. To reduce the actual amount of computation and to avoid generating association relations whose degree of association is too small, the laws are clustered before the association relations are extracted, producing several sets of laws and regulations with high internal similarity. The clustering uses the cosine similarity calculated above and a hierarchical clustering method: based on a preset threshold, laws with high mutual similarity are grouped into one class. For example, law 1 and law 2 are grouped into one class, while law 3 belongs to another class. The pairwise similarity values within each cluster are recorded so that the associated retrieval results can be sorted when they are returned. Association relations are extracted only within a cluster. In one example, the system obtains N matching laws for the query keywords; a search for "income tax", for instance, returns the "Income Tax Law of the People's Republic of China for Enterprises with Foreign Investment and Foreign Enterprises" (law 1). At the same time, the system looks up the pre-stored law association clustering result, obtains the cluster containing law 1, takes from this cluster the top K related laws that satisfy the threshold (sorted by similarity value), such as law 2, and returns them as the associated retrieval results of law 1. Law 3 does not belong to the same cluster and is therefore not returned as an associated retrieval result of law 1.
Further, when associated results are returned, the laws serving as associated retrieval results can be sorted by their similarity to the law in the second part of the retrieval results: a related law with a larger similarity to that law is ranked higher.
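Clustering followed by cluster-restricted lookup can be sketched as follows. This is a minimal illustration under stated assumptions: the text only specifies "hierarchical clustering" with a preset threshold, so the single-linkage merge rule, the threshold value 0.5, and the names `cluster_laws` and `related_laws` are hypothetical choices made for this example.

```python
import math
from itertools import combinations


def cos_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def cluster_laws(vectors, threshold):
    """Agglomerative clustering (single linkage, an assumed choice): keep
    merging the two most similar clusters while their similarity reaches
    the threshold. Returns the clusters and the pairwise similarities."""
    sims = {frozenset(pair): cos_similarity(vectors[pair[0]], vectors[pair[1]])
            for pair in combinations(vectors, 2)}
    clusters = [{name} for name in vectors]
    while True:
        best, best_pair = threshold, None
        for c1, c2 in combinations(clusters, 2):
            link = max(sims[frozenset((a, b))] for a in c1 for b in c2)
            if link >= best:
                best, best_pair = link, (c1, c2)
        if best_pair is None:
            return clusters, sims
        c1, c2 = best_pair
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 | c2)


def related_laws(hit, clusters, sims, k, threshold):
    """Top-k laws from the hit law's own cluster, most similar first."""
    cluster = next(c for c in clusters if hit in c)
    scored = [(sims[frozenset((hit, other))], other)
              for other in cluster if other != hit]
    scored = [item for item in scored if item[0] >= threshold]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in scored[:k]]


vectors = {
    "law 1": [0, 2, 1, 1, 0, 1, 1, 1, 0],
    "law 2": [0, 1, 0, 0, 1, 1, 0, 0, 0],
    "law 3": [1, 0, 0, 0, 0, 0, 0, 0, 1],
}
clusters, sims = cluster_laws(vectors, threshold=0.5)
print(related_laws("law 1", clusters, sims, k=5, threshold=0.5))  # ['law 2']
print(related_laws("law 3", clusters, sims, k=5, threshold=0.5))  # []
```

With the example vectors, law 1 and law 2 end up in one cluster and law 3 in another, so law 2 is returned for law 1 and nothing is returned for law 3, matching the behaviour described in the text. Because the clustering and the pairwise similarities can be computed in advance, a query only needs a lookup inside one cluster.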
Meanwhile, for the related laws obtained, a graph structure describing the association relations is generated from the pre-stored pairwise similarity values between laws: G(V, E). The vertex set V is the set of the laws in the retrieval results and their related laws. An edge in E indicates that the two nodes (two laws) it connects are associated; the shorter the edge, the closer the relation and the larger the similarity between the two laws. Each edge can further display the similarity value between the two laws at its end points.
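One possible in-memory form of G(V, E) is sketched below. The similarity values are made up purely for illustration, and the mapping from similarity to edge length (length = 1 − similarity) is an assumption; the text only requires that a larger similarity gives a shorter edge.

```python
def build_association_graph(hit_law, related, similarity):
    """Build G(V, E) for one hit law and its related laws.

    similarity maps frozenset({law_a, law_b}) to a value in (0, 1];
    pairs that are missing from the mapping get no edge.
    """
    vertices = [hit_law] + list(related)
    edges = []
    for i, a in enumerate(vertices):
        for b in vertices[i + 1:]:
            s = similarity.get(frozenset((a, b)), 0.0)
            if s > 0:
                edges.append({
                    "ends": (a, b),
                    "similarity": s,     # label shown on the edge
                    "length": 1.0 - s,   # assumed mapping: more similar, shorter
                })
    return vertices, edges


# Hypothetical similarity values, chosen only to exercise the function.
similarity = {
    frozenset(("hit law", "related law 1")): 0.6,
    frozenset(("hit law", "related law 2")): 0.9,
    frozenset(("hit law", "related law 3")): 0.3,
    frozenset(("related law 1", "related law 2")): 0.4,
}
vertices, edges = build_association_graph(
    "hit law", ["related law 1", "related law 2", "related law 3"], similarity)

shortest = min(edges, key=lambda edge: edge["length"])
print(shortest["ends"])  # ('hit law', 'related law 2')
```

A drawing layer could then place the vertices so that the drawn distances approximate the `length` values and print `similarity` next to each edge.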
Fig. 6 shows an example of the association relation graph of a hit law and its related laws in one embodiment of the present invention. As shown in Fig. 6, the hit law is most similar to related law 2 and least similar to related law 3; in addition, related law 1 and related law 2 are also similar to each other.
In the above embodiments, a brand-new law database is built from law entry documents and the corresponding inverted indexes, so that a single retrieval yields results that are accurate to the individual law entry. Moreover, the above embodiments obtain not only the law entries that match the retrieval statement but also the related laws. In the prior art, by contrast, a user who wants to find more related laws often has to try multiple keywords or keyword combinations and retrieve repeatedly before finally finding the required law entries. The present invention therefore helps the user find the laws relevant to a case more conveniently and more comprehensively, and reduces the difficulty of retrieving information on laws and regulations.
The foregoing describes only illustrative embodiments of the present invention and is not intended to limit the scope of the present invention. Any equivalent variation, modification or combination made by a person skilled in the art without departing from the concept and principle of the present invention shall fall within the scope of protection of the present invention.

Claims (12)

1. A law database construction method, comprising the following steps:
1) for a new legal text, splitting the received legal text by entry to obtain the corresponding law entry documents, and creating a corresponding unique identifier for each of them;
2) performing word segmentation on each law entry document and, for each lexical item obtained by the segmentation, creating or updating the unique record corresponding to that lexical item in a content-based inverted index, wherein each record of the content-based inverted index comprises: each law entry document in whose content the lexical item corresponding to the record appears, and the corresponding index information;
3) returning to step 1) to process the next legal text until all legal texts have been processed.
2. The law database construction method according to claim 1, characterized in that, in step 2), the index information comprises: the inverse document frequency of the corresponding lexical item, and the word frequency with which the corresponding lexical item appears in each law entry document; wherein the inverse document frequency is calculated over the law entry documents in the law database.
3. The law database construction method according to claim 2, characterized in that step 2) comprises the following sub-steps:
21) traversing each law entry document obtained by the splitting, and performing word segmentation on the current law entry document;
22) traversing all lexical items obtained by the segmentation and, for each lexical item: calculating the word frequency with which the current lexical item appears in the current law entry document, and searching the content-based inverted index for the record corresponding to the current lexical item; if an existing record of the current lexical item is found, adding to that record the identifier of the current law entry document and the word frequency of the current lexical item in the current law entry document, and updating the inverse document frequency of the current lexical item; if no existing record of the current lexical item is found, adding the current lexical item to the dictionary of the content-based inverted index and adding a new record, the new record comprising the inverse document frequency of the current lexical item, the identifier of the current law entry document, and the word frequency of the current lexical item in the current law entry document.
4. A legal retrieval service method based on the law database construction method according to claim 1, comprising the following steps:
4) obtaining a retrieval vector that acts on the content field;
5) for each keyword in the retrieval vector, finding, according to the content-based inverted index, each law entry document in whose content the keyword appears, together with the corresponding index information;
6) sorting the hit law entry documents according to the corresponding index information.
5. The legal retrieval service method according to claim 4, characterized in that, in step 5), the index information comprises: the inverse document frequency of the corresponding lexical item, and the word frequency with which the corresponding lexical item appears in each law entry document; wherein the inverse document frequency is calculated over the law entry documents in the law database.
6. The legal retrieval service method according to claim 5, characterized in that step 6) comprises the following sub-steps:
61) for each law entry document hit in step 5), obtaining a law entry document vector with the same dimension as the retrieval vector, wherein each element of the law entry document vector corresponds to one keyword, and the value of each element is derived from the inverse document frequency of that keyword found in step 5) and the word frequency with which that keyword appears in the content of the law entry document;
62) taking the similarity between the law entry document vector and the retrieval vector as the retrieval similarity of the corresponding law entry document in the content field, and sorting the hit law entry documents according to the retrieval similarity.
7. The legal retrieval service method according to claim 6, characterized in that, in step 62), the similarity between the law entry document vector and the retrieval vector is the cosine similarity of the two vectors.
8. The legal retrieval service method according to claim 7, characterized in that, in step 6), the value of each element of the law entry document vector is the product of the inverse document frequency, found in step 5), of the keyword corresponding to that element and the word frequency with which that keyword appears in the content of the law entry document.
9. The legal retrieval service method according to claim 6, characterized in that the law entry document comprises meta-information and content, the meta-information comprising the title of the legal text to which the law entry belongs, and the chapter and number of the law entry within that legal text.
10. The legal retrieval service method according to claim 9, characterized in that step 6) further comprises: taking the law to which each hit law entry document belongs as a hit law; deriving the retrieval similarity of each hit law from the retrieval similarities of its hit law entry documents and sorting the hit laws accordingly; and then displaying, in the sorted order, the content and meta-information of each hit law entry document in each hit law.
11. The legal retrieval service method according to claim 10, characterized in that the legal retrieval service method further comprises the step of:
7) for each hit law, searching for and displaying the related laws of the hit law according to the similarity between the hit law and the other laws in the law database;
wherein the related laws are determined according to the similarity between laws, and the similarity between two laws is obtained as follows: performing word segmentation on all law titles to obtain a series of lexical items; extracting, according to part of speech, the lexical items that belong to the subject, predicate and object structures of the titles; forming a feature subspace from the extracted lexical items; converting every law title into a lexical item vector on the feature subspace; and taking the similarity, on the feature subspace, of the two lexical item vectors corresponding to two law titles as the similarity between the two laws.
12. The legal retrieval service method according to claim 11, characterized in that, in step 7), for each hit law, an association relation graph of the hit law and its related laws is displayed, the association relation graph comprising a series of points and edges connecting the points, each point representing the hit law or one related law of the hit law, and each edge displaying the similarity between the two laws corresponding to its two end points.
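As an informal illustration of the index construction and ranking recited in claims 1 to 8, and not as part of the claims, the following minimal sketch builds a content-based inverted index and ranks hit law entry documents by the cosine similarity of TF-IDF vectors. Several details are assumptions made only for this example: whitespace splitting stands in for Chinese word segmentation, the inverse document frequency is computed on demand as log(N / df) + 1 rather than stored and updated in each record, every keyword of the retrieval vector has weight 1, and the entry identifiers are hypothetical.

```python
import math
from collections import Counter, defaultdict


def cos_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class InvertedIndex:
    def __init__(self):
        # lexical item -> {law entry document identifier: word frequency}
        self.postings = defaultdict(dict)
        self.doc_count = 0

    def add_entry(self, entry_id, content, tokenize=str.split):
        """Segment one law entry document and create or update its records."""
        self.doc_count += 1
        for term, tf in Counter(tokenize(content)).items():
            self.postings[term][entry_id] = tf

    def idf(self, term):
        """Inverse document frequency over the indexed law entry documents
        (assumed variant: log(N / df) + 1)."""
        df = len(self.postings.get(term, {}))
        return math.log(self.doc_count / df) + 1.0 if df else 0.0

    def search(self, keywords):
        """Return (entry id, retrieval similarity) pairs, best match first."""
        hits = set()
        for keyword in keywords:
            hits.update(self.postings.get(keyword, {}))
        query = [1.0] * len(keywords)  # assumed retrieval vector weights
        scored = []
        for entry_id in hits:
            doc_vector = [
                self.postings.get(k, {}).get(entry_id, 0) * self.idf(k)
                for k in keywords
            ]
            scored.append((cos_similarity(doc_vector, query), entry_id))
        scored.sort(key=lambda item: (-item[0], item[1]))
        return [(entry_id, score) for score, entry_id in scored]


index = InvertedIndex()
index.add_entry("lawA-art1", "enterprise income tax rate")
index.add_entry("lawA-art2", "enterprise registration")
index.add_entry("lawB-art1", "patent application fee")

for entry_id, score in index.search(["enterprise", "tax"]):
    print(entry_id, round(score, 3))
```

In this toy corpus the entry that contains both keywords ranks above the entry that contains only "enterprise", and the patent entry is not hit at all. With a single keyword every hit has similarity 1 under these assumptions, so a practical system would need additional ranking signals.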
CN201410242810.8A 2014-06-03 2014-06-03 Legal database establishing method and legal retrieving service method Pending CN104008171A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410242810.8A CN104008171A (en) 2014-06-03 2014-06-03 Legal database establishing method and legal retrieving service method

Publications (1)

Publication Number Publication Date
CN104008171A (en) 2014-08-27

Family

ID=51368828

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410242810.8A Pending CN104008171A (en) 2014-06-03 2014-06-03 Legal database establishing method and legal retrieving service method

Country Status (1)

Country Link
CN (1) CN104008171A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20140827