WO2011091470A1

WO2011091470A1 - Query processing of tree-structured data

Info

Publication number: WO2011091470A1
Application number: PCT/AU2011/000082
Authority: WO
Inventors: Sebastian Maneth; Kim Nguyen
Original assignee: National Ict Australia Limited
Priority date: 2010-01-27
Filing date: 2011-01-27
Publication date: 2011-08-04

Abstract

A computer-implemented method for processing a query of tree-structured data, comprising: (a) constructing an automaton from the query, wherein the automaton comprises one or more states, one or more labels, and one or more transitions each associated with a state and a label; (b) analysing the one or more transitions of the automaton based on properties of the tree-structured data, or a subset of the tree-structured data; and (c) based on the analysis, updating the automaton for traversal of the tree-structured data, or the subset of the tree-structured data. Steps (b) and (c) are performed repeatedly during query processing to facilitate jumping from one node of the tree-structured data to another node. A computer system and a computer program for processing a query of tree-structured data are also disclosed.

Description

Query Processing of Tree-Structured Data

Cross Reference to Related Applications

The present application claims priority from Australian Provisional Application No 2010900322 filed on 27 January 2010, the content of which is incorporated herein by reference. The present application is related to corresponding international applications that claim priority from Australian Provisional Application No 2010900320 and Australian Provisional Application No 2010900321 respectively, the content of which is also incorporated herein by reference.

Technical Field

This disclosure relates generally to query processing, and more particularly to a computer-implemented method for processing a query of tree-structured data. Other aspects include a computer program to implement the method and a computer system for processing a query of tree-structured data.

Background

Extensible Markup Language (XML), a tree-structured data model defined by the World Wide Web Consortium (W3C), is slowly replacing the conventional relational data model in applications for electronic commerce, business reporting and bioinformatics. Unlike the relational data model, an XML document contains not only data, but also the relationship of the data, expressed using tags or markup constructs such as <section> and </section>.

As more documents are stored and queried in XML format, query languages such as XPath (XML Path Language) and XQuery have also become increasingly popular. XPath, which is simpler than, and forms the basis of, XQuery, provides a path-like syntax for navigating nodes in a tree and selecting nodes based on search criteria. XPath query engines can be divided into two categories: sequential and indexed. In the sequential or streaming approach, each query must sequentially read a whole collection of data such that, ideally, only one pass over the data is required. In the indexed approach, the tree-structured data is pre-processed to build an index to guide query processing, such that traversal of the whole collection is avoided. For many time-critical applications, query run time is important.

Summary

In a first aspect, there is provided a computer-implemented method for processing a query of tree-structured data, comprising:

(a) constructing an automaton from the query, wherein the automaton comprises one or more states, and one or more transitions associated with each state;

(b) analysing the one or more transitions of the automaton based on properties of the tree-structured data, or a subset of the tree-structured data; and

(c) based on the analysis, updating the automaton for traversal of the tree-structured data, or the subset of the tree-structured data,

wherein steps (b) and (c) are performed repeatedly during query processing to facilitate jumping from one subset to another subset of the tree-structured data.

Using the method, the automaton associated with a query is updated based on properties of the tree-structured data, or a subset of the data. This allows the automaton to be adapted according to the properties, thereby improving query run time during query processing by, for example, removing transitions that necessitate the same subset of data to be traversed several times. Further, since tree automata are semantic constructions, similar queries can be processed independently of their syntax. Repeating steps (b) and (c) allows the automaton to be updated dynamically during query processing according to the properties of the tree or a subset of the tree, thereby improving the efficiency and run time of the query processing.

Traversal of the tree-structured data, or the subset of the data, may be according to a top-bottom traversal order.

The properties of the tree-structured data, or the subset of the tree-structured data, may be stored in a tree index representing a hierarchical structure of the tree-structured data, or the subset of the data. Alternatively or in addition, the properties of the tree-structured data, or the subset of the tree-structured data, may be stored in a text index representing textual content of the tree-structured data, or the subset of the data. For example, the properties may be a set of labels of the tree-structured data, or the subset of the data. Step (c) may comprise replacing one or more transitions with a jump to a labelled node in the tree-structured data, or a subset of the tree-structured data, using the tree index. Step (b) may comprise determining one or more transitions that will succeed based on the properties of the tree-structured data, or the subset of the tree-structured data. In this case, step (c) may also comprise simplifying the automaton by removing the determined one or more transitions from the automaton.

Step (b) may also comprise determining one or more transitions that cannot be satisfied based on the properties of the tree-structured data, or the subset of the tree-structured data. In this case, step (c) may comprise simplifying the automaton by removing the determined one or more transitions from the automaton.

The method may further comprise traversing the tree-structured data, or the subset of the tree-structured data, according to the updated automaton. The tree-structured data may be Extensible Markup Language (XML) data. The query may be an XPath query.

In a second aspect, there is provided a computer program to implement the method according to the first aspect. The computer program may be embodied in a computer-readable medium such that code of the computer program, when executed, causes a computer system to implement the method according to the first aspect.

In a third aspect, there is provided a computer system for processing a query of tree-structured data, comprising:

a parsing unit to construct an automaton from the query, wherein the automaton comprises one or more states, and one or more transitions associated with each state; and

a processing unit to (a) analyse the one or more transitions of the automaton based on properties of the tree-structured data, or a subset of the tree-structured data; and (b) based on the analysis, update the automaton for traversal of the tree-structured data, or the subset of the tree-structured data,

wherein the processing unit performs steps (a) and (b) repeatedly during query processing to facilitate jumping from one subset to another subset of the tree-structured data.

Brief Description of Drawings

Non-limiting example(s) will now be described with reference to the accompanying drawings, in which:

Fig. 1 is an exemplary system for query processing.

Fig. 2 is a schematic diagram of steps performed by a query engine.

Fig. 3(a) is an exemplary XML document.

Fig. 3(b) is a text collection created based on the XML document in Fig. 3(a).

Fig. 3(c) is a tree structure created based on the XML document in Fig. 3(a).

Fig. 3(d) is an XML model created based on the XML document in Fig. 3(a).

Fig. 4 is an exemplary tree diagram for an XML document.

Fig. 5 is a flowchart of steps performed by a processing unit of the query engine during query processing.

Fig. 6(a) is a tree diagram of a subset of tree-structured data with footnote and marginnote nodes.

Fig. 6(b) is an automaton updated according to the properties of the subset in Fig. 6(a).

Fig. 7(a) is a tree diagram of a subset of tree-structured data without any footnote nodes.

Fig. 7(b) is an automaton updated according to the properties of the subset in Fig. 7(a).

Detailed Description

Referring first to Fig. 1, the system 100 comprises a query engine 110 and a data store 120 in communication with a plurality of communications devices 152 over a communications network 140, 142. The devices 152 are each operated by a user 150. The communications network 140 may be a local area network (LAN) or wide area network (WAN), wireless or wired.

Referring also to Fig. 2, the query engine 110 comprises an indexing unit 112, a query parsing unit 114 and a processing unit 116. The query engine 110 processes queries of a collection of XML documents 122 (tree-structured data) in the data store 120. A query may be submitted by a user 150, or by a server 154.

The query engine 110 uses an indexed approach to pre-process the XML collection 122, so that later queries can be solved without traversing the entire collection. This indexed approach is distinguishable from the streaming approach, where XML documents are not stored on disk but rather fed "bit by bit" to the "streaming" query engine, which is unable to keep track of what it has seen of the document. That is, upon seeing a piece of the input document, the streaming query engine decides whether to flag that piece as a result or pass it to another process. By contrast, indexed documents are stored in the data store 120.

Firstly, the indexing unit 112 performs pre-processing data analysis on documents in the XML collection to determine the structure and content of the documents 122; see step 210. Results of the data analysis are used during index generation to build a text index 124 and a tree index 126 for use in query processing; see step 220.

After receiving a query, the parsing unit 114 analyses or parses a path expression for the query; see step 230. Based on the query, an automaton is constructed by the parsing unit 114 before further processing is performed; see step 235. The constructed automaton comprises one or more states, each state being associated with one or more transitions.

Then in step 240, the processing unit 116 analyses the constructed automaton based on properties of the XML collection, or a subset of the XML collection. Based on the analysis, the processing unit 116 determines whether to update the constructed automaton for traversal of the XML collection or a subset of the XML collection. The results of the query are then presented to the user 150; see step 250.

The index comprises the text 124 and tree 126 indices created by the indexing unit 112. Specifically, the text index 124 is used to facilitate counting of the number of text nodes matching a simple predicate in a query. The tree index 126 provides an approximation of the number of nodes or potential results in a subset of nodes in the tree starting from a particular node. The indexing 220, automaton construction 235 and query processing 240 steps performed by the query engine 110 will now be explained further below.

Indexing 220

XML documents can be regarded as a "text collection" or a set of strings organised into a labelled "tree structure". The strings correspond to textual content of the data while the tree structure defines the hierarchical structure of the tree.

Referring now to Fig. 3, the tree in Fig. 3(d) corresponds to the XML data in Fig. 3(a). The tree is formed by solid edges, whereas dotted edges display the connection with the set of texts. There are two types of identifier in the tree: text identifiers (numbers in italics) assigned to text content, and global identifiers (numbers in non-italics) assigned to internal and leaf nodes.

There are a number of internal nodes represented by the following symbols:

& is a dummy root (labelled 1) that is added to create a tree instead of a forest;

# is a node (6, 8, 10, 16) associated with a string or textual content ("Soon discontinued", "blue", "40" and "30" respectively);

@ is a node (3, 12) associated with an attribute ("name"); and

% is a leaf node (1, 5) of an attribute node (3, 12) and is associated with a value ("pen", "rubber") of an attribute ("name").

Using the above representation, there is exactly one string content associated with each tree leaf, and those strings are referred to as texts. In the example in Fig. 3(d), there are six texts, which are associated with the tree leaves and labelled using text identifiers from left to right: 1 - "pen", 2 - "Soon discontinued", 3 - "blue", 4 - "40", 5 - "rubber" and 6 - "30".

The indexing unit 112 analyses the XML data in Fig. 3(a) to create the text index in Fig. 3(b) and the tree index in Fig. 3(c).

Text Index 124

The text index 124 allows pattern matching during query processing. Textual content is represented as a succinct full-text self index [1] that is generally known as the FM-index [2]. The text collection T stores the content of the XML data as $-terminated strings so that each text corresponds to one string. In the example in Fig. 3(b), T is a concatenated sequence of d texts:

T = pen$Soon discontinued$blue$40$rubber$30$, where $ is a delimiter; see 310.

Given a string T of total length u, from an alphabet of size σ, the FM-index is based on the Burrows-Wheeler transform (BWT) [3] of string T. Assume T ends with the special endmarker '$' and let M be a matrix whose rows are all the cyclic rotations of T in lexicographic order. The last column L of M forms a permutation of T which is the BWT string L = T^bwt. The matrix is only conceptual; the FM-index uses only the T^bwt string. Note L[i] is the symbol preceding the i-th lexicographically smallest row of M. The resulting permutation is reversible. The first column of M, denoted F, contains all symbols of T in lexicographic order; see 320 in Fig. 3(b). There exists a simple last-to-first mapping from symbols in L to F [4]. Let C[c] be the total number of symbols in T that are lexicographically less than c, and let rank_c(L, i) be the number of occurrences of symbol c in L[1, i]. Now the LF-mapping can be defined as:

LF(i) = C[L[i]] + rank_{L[i]}(L, i).

The symbols of T can be read in reverse order by starting from the end-marker location i and applying LF(i) recursively: we get T^bwt[i], T^bwt[LF(i)], T^bwt[LF(LF(i))] and so on. Finally, after u steps, we get the first symbol of T. The values C[c] can be stored in a small array of σ log u bits. Function rank_c(L, i) can be computed in O(log σ) time with a wavelet tree data structure requiring only uH_0(T) + o(u log σ) bits [5], [6].
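The BWT and the LF-mapping described above can be illustrated with a minimal Python sketch. It is not the succinct FM-index itself: the matrix M is built explicitly by sorting rotations and rank is computed by scanning, which is only practical for small inputs. The function names and the single-text example "rubber$" are assumptions made for illustration.

```python
def bwt_via_rotations(text):
    """Return L, the last column of the sorted cyclic-rotation matrix M."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)


def build_c(symbols):
    """C[c] = number of symbols that are lexicographically smaller than c."""
    counts = {}
    for ch in symbols:
        counts[ch] = counts.get(ch, 0) + 1
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    return c_table


def rank(symbol, seq, i):
    """Occurrences of symbol in seq[1..i] (1-based, inclusive)."""
    return seq[:i].count(symbol)


def lf(bwt, c_table, i):
    """LF(i) = C[L[i]] + rank_{L[i]}(L, i), with 1-based i."""
    symbol = bwt[i - 1]
    return c_table[symbol] + rank(symbol, bwt, i)


def read_backwards(bwt):
    """Recover T from L by starting at the end-marker and applying LF u times."""
    c_table = build_c(bwt)
    i = bwt.index("$") + 1          # location of the end-marker in L
    collected = []
    for _ in range(len(bwt)):
        collected.append(bwt[i - 1])
        i = lf(bwt, c_table, i)
    return "".join(reversed(collected))


L_STRING = bwt_via_rotations("rubber$")
RECOVERED = read_backwards(L_STRING)
```

The sketch assumes a single text with a unique end-marker; a collection with several $-terminated texts additionally needs the Doc structure to tell the end-markers apart.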

During query processing, pattern matching is supported via backward searching on the BWT [4]. Given a pattern P[1, m], backward searching is performed as follows:

1. Start with the range [sp, ep] = [1, u] of rows in M.

2. At each step i ∈ {m, m-1, ..., 1}, update the range [sp, ep] to [sp', ep'] to match all rows of M that have P[i, m] as a prefix:

sp' = C[P[i]] + rank_{P[i]}(L, sp-1) + 1 and

ep' = C[P[i]] + rank_{P[i]}(L, ep).
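The two backward-search steps above can be sketched in Python as follows. This is an illustrative sketch only: the BWT is obtained by explicitly sorting rotations and rank is a linear scan, whereas the FM-index would use a wavelet tree. The helper names are assumptions; the text collection is the example T given above.

```python
def bwt_via_rotations(text):
    """Last column L of the sorted cyclic-rotation matrix (small inputs only)."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(row[-1] for row in rotations)


def build_c(symbols):
    """C[c] = number of symbols lexicographically smaller than c."""
    c_table, total = {}, 0
    for ch in sorted(set(symbols)):
        c_table[ch] = total
        total += symbols.count(ch)
    return c_table


def rank(symbol, seq, i):
    """Occurrences of symbol in seq[1..i] (1-based, inclusive); 0 when i == 0."""
    return seq[:i].count(symbol)


def backward_search(pattern, bwt, c_table):
    """Return the number of rows of M prefixed by pattern (0 if none)."""
    sp, ep = 1, len(bwt)                          # step 1: all rows of M
    for symbol in reversed(pattern):              # step 2: i = m, m-1, ..., 1
        if symbol not in c_table:
            return 0
        sp = c_table[symbol] + rank(symbol, bwt, sp - 1) + 1
        ep = c_table[symbol] + rank(symbol, bwt, ep)
        if sp > ep:
            return 0
    return ep - sp + 1


T = "pen$Soon discontinued$blue$40$rubber$30$"
BWT = bwt_via_rotations(T)
C = build_c(T)
```

Here `backward_search("on", BWT, C)` counts the occurrences of "on" across the whole collection; restricting the count to a range of text identifiers requires the Doc structure described next.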

To find out the location of each occurrence, the text is traversed backwards from each sp ≤ i ≤ ep (virtually, using LF on T^bwt) until a sampled position is found. This is a sampling carried out at regular text positions, so that the corresponding positions in T^bwt are marked in a bitmap B_s[1, u], and the text position corresponding to T^bwt[i], if B_s[i] = 1, is stored in a samples array P_s[rank_1(B_s, i)]. T^bwt contains all end-markers in some permuted order; see 320 in Fig. 3(b). This permutation is represented with a data structure Doc, that maps from positions of $s in T^bwt to text numbers, and also allows two-dimensional range searching [7]; see 330 in Fig. 3(c). Thus, the text corresponding to a terminator T^bwt[i] = $ is Doc[rank_$(T^bwt, i)]. Furthermore, given a range [sp, ep] of T^bwt and a range of text identifiers [x, y], Doc can be used to find identifiers of all $-terminators within the [sp, ep] × [x, y] range in O(log d) time per answer. In practice, Doc can be implemented as a plain array using d log d bits.

The basic pattern matching feature of the FM-index is extended to support XPath functions. Given a pattern and a range of text identifiers to be searched, these XPath functions return all text identifiers that match the query within the range. In addition, existential (i.e. is there a match in the range?) and counting (i.e. how many matches in the range?) queries are supported. Exemplary XPath functions are as follows:

(a) starts-with(P, [x, y]): The goal is to find texts in the [x, y] range prefixed by the given pattern P. After the backward search, the range [sp, ep] in T^bwt contains the end-markers of all the texts prefixed by P. Now [sp, ep] × [x, y] can be mapped to Doc, and existential and counting queries can be answered in O(log d) time. Matching text identifiers can be reported in O(log d) time per identifier.

(b) ends-with(P, [x, y]): Backward searching is localized to texts [x, y] by choosing [sp, ep] = [x, y] as the starting interval. After the backward search, the resulting range [sp, ep] contains all possible matches; thus, existential and counting queries can be answered in constant time. To find out text identifiers for each occurrence, the text must be traversed backwards to find a sampled position.

(c) operator = (P, [x, y]): Texts that are equal to P, and in range, can be found as follows. Do the backward search as in ends-with, then map to the $-terminators as in starts-with. Time complexities are the same as in starts-with.

(d) contains(P, [x, y]): To find texts that contain P, we start with the normal backward search and finish as in ends-with. In this case there might be several occurrences inside one text, which have to be filtered. Thus, the time complexity is proportional to the total number of occurrences, O(l log σ) for each. Existential and counting queries are as slow as reporting queries, but the O(|P| log σ)-time counting of all the occurrences of P can still be useful for query optimization.

(e) operators ≤, <, ≥, >: The operator ≤ matches texts that are lexicographically smaller than or equal to the given pattern. It can be solved like the starts-with query, but updating only the ep of each backward search step, while sp = 1 stays constant. If at some point there are no occurrences of P[i] = c within the prefix L[1, ep], we find those of smaller symbols, ep = C[c], and continue for P[1, i - 1]. Other operators can be supported analogously, and costs are as for starts-with.
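The intended results of functions (a) to (e) can be pinned down with a naive reference sketch in Python. It deliberately does not use the FM-index or Doc: it scans the texts directly with built-in string operations, so it shows only what the reporting, existential and counting queries return for a range of text identifiers, not how the index answers them. The names `report`, `exists` and `count` are hypothetical, and the ordering used for ≤, <, ≥ and > is Python's code-point order.

```python
import operator

# The six texts of the example collection, indexed by text identifier 1..6.
TEXTS = ["pen", "Soon discontinued", "blue", "40", "rubber", "30"]

PREDICATES = {
    "starts-with": lambda text, p: text.startswith(p),
    "ends-with": lambda text, p: text.endswith(p),
    "=": operator.eq,
    "contains": lambda text, p: p in text,
    "<=": operator.le,
    "<": operator.lt,
    ">=": operator.ge,
    ">": operator.gt,
}


def report(name, pattern, x, y, texts=TEXTS):
    """Reporting query: identifiers d in [x, y] whose text satisfies the predicate."""
    test = PREDICATES[name]
    return [d for d in range(x, y + 1) if test(texts[d - 1], pattern)]


def exists(name, pattern, x, y, texts=TEXTS):
    """Existential query: is there a match in the range?"""
    return bool(report(name, pattern, x, y, texts))


def count(name, pattern, x, y, texts=TEXTS):
    """Counting query: how many texts match in the range?"""
    return len(report(name, pattern, x, y, texts))
```

For example, `report("contains", "b", 1, 6)` yields the identifiers of "blue" and "rubber", while narrowing the range to [1, 4] leaves only "blue".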

Tree Index 126

As shown in Fig. 3(c), the tree index 126 is represented by the following compact data structures, which provide navigation and indexed access to it.

(a) Par 350: The balanced parentheses representation [8] of the tree structure. This is obtained by traversing the tree in depth-first-search (DFS) order, writing a "(" whenever the indexing unit 112 reaches a node, and a ")" when the indexing unit 112 leaves the node (thus it is easily produced during the XML parsing). This way, every node is represented by a pair of matching opening and closing parentheses. A tree node will be identified by the position of its opening parenthesis in Par (that is, a node will be just an integer index within Par).

(b) Tag 360: A sequence of the tag identifiers of each tree node, including an opening and a closing version of each tag, to mark the beginning and ending point of each node. These tags are numbers in [1, 2t] and are aligned with Par so that the tag of node i is simply Tag[i]. For example, Tag[1] returns the root node & (labelled 1) in the tree in Fig. 3(d) and Tag[4] is "@name" as represented by "n" (4th position). The sequence also comprises corresponding closing tags "/&" (last position) and "/n" (7th position).

Rank and select queries are also required on Tag. Several sequence representations supporting these are known [9], and a practical representation that favours speed over space is selected. First, the indexing unit 112 stores the tags in an array using ⌈log 2t⌉ bits per field, which gives constant time access to Tag[i]. The rank and select queries over the sequence of tags are answered by a second structure. Consider the binary matrix R[1..2t][1..2n] such that R[i, j] = 1 if Tag[j] = i; see 370 in Fig. 3(c). Each row of the matrix R is represented using Okanohara and Sadakane's structure sarray [10]. The structure supports access and select in O(1) time, and rank in O(log n) time.
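A minimal Python sketch of how Par and Tag might be produced in one depth-first pass is given below. The tree literal is a hypothetical fragment and not a transcription of Fig. 3(d); tags are kept as strings rather than numbers in [1, 2t], and rank and select are linear scans standing in for the sarray structure.

```python
# Hypothetical tree fragment as nested (label, children) tuples.
TREE = ("&", [("part", [("@", [("name", [("%", [])])]),
                        ("color", [("#", [])])])])


def build_par_and_tag(tree):
    """Emit '(' and the opening tag on entering a node, ')' and '/tag' on leaving."""
    par, tag = [], []

    def visit(node):
        label, children = node
        par.append("(")
        tag.append(label)
        for child in children:
            visit(child)
        par.append(")")
        tag.append("/" + label)

    visit(tree)
    return "".join(par), tag


def rank(symbol, seq, i):
    """Number of occurrences of symbol in seq[1..i] (1-based, inclusive)."""
    return sum(1 for s in seq[:i] if s == symbol)


def select(symbol, seq, j):
    """1-based position of the j-th occurrence of symbol in seq, or None."""
    seen = 0
    for position, s in enumerate(seq, start=1):
        if s == symbol:
            seen += 1
            if seen == j:
                return position
    return None


PAR, TAG = build_par_and_tag(TREE)
```

Because the two sequences are aligned, the node at position i of Par has tag `TAG[i - 1]` in this 0-based Python list.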

The tree structure comprising the data structures Par and Tag can then be used during query processing. The following operations over the tree structure are useful to support XPath queries over the tree. Let tag be a tag identifier.

(a) Basic Tree Operations [11]

Let x be a node (a position in Par), the tree operations are:

Close(x): The closing parenthesis matching Par[x]. If x is a small subtree this takes a few local accesses to Par, otherwise a few non-local table accesses.

Preorder(x) = rank_((Par, x): Preorder number of x.

SubtreeSize(x) = (Close(x) - x + 1)/2: Number of nodes in the subtree rooted at x.

IsAncestor(x, y) = x ≤ y ≤ Close(x): Whether x is an ancestor of y.

FirstChild(x) = x + 1: First child of x, if any.

NextSibling(x) = Close(x) + 1: Next sibling of x, if any.

Parent(x): Parent of x. Somewhat costlier than Close(x) in practice, because the answer is less likely to be near x in Par.
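The basic tree operations can be sketched over a plain Python string holding Par. This is a naive sketch: Close and Parent scan the string instead of using the succinct structures of [11], and the "if any" cases return None. The example Par string is a hypothetical seven-node tree, and positions are 1-based as in the text.

```python
PAR = "((((()))(())))"   # hypothetical 7-node tree in balanced parentheses


def close(par, x):
    """Position of the closing parenthesis matching the opening one at x."""
    depth = 0
    for pos in range(x, len(par) + 1):
        depth += 1 if par[pos - 1] == "(" else -1
        if depth == 0:
            return pos
    raise ValueError("unbalanced parentheses")


def preorder(par, x):
    """rank_( (Par, x): number of opening parentheses in Par[1..x]."""
    return par[:x].count("(")


def subtree_size(par, x):
    return (close(par, x) - x + 1) // 2


def is_ancestor(par, x, y):
    return x <= y <= close(par, x)


def first_child(par, x):
    """x + 1 if that position opens a node, otherwise x is a leaf."""
    return x + 1 if par[x] == "(" else None


def next_sibling(par, x):
    nxt = close(par, x) + 1
    return nxt if nxt <= len(par) and par[nxt - 1] == "(" else None


def parent(par, x):
    """Scan left for the nearest unmatched opening parenthesis."""
    depth = 0
    for pos in range(x - 1, 0, -1):
        if par[pos - 1] == ")":
            depth += 1
        elif depth == 0:
            return pos
        else:
            depth -= 1
    return None
```

In this example, node 3 has a subtree of three nodes closed at position 8, and its next sibling is the node opening at position 9.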

(b) Connecting to Tags

The following operations are important for fast XPath evaluation.

SubtreeTags(x, tag): Returns the number of occurrences of tag within the subtree rooted at node x. This is rank_tag(Tag, Close(x)) - rank_tag(Tag, x - 1).

Tag(x): Gives the tag identifier of node x.

TaggedDesc(x, tag): The first node labelled tag strictly within the subtree rooted at x. This is select_tag(Tag, rank_tag(Tag, x) + 1) if it is < Close(x), and undefined otherwise.

TaggedPrec(x, tag): The last node labelled tag with preorder smaller than that of node x, and not an ancestor of x. Let r = rank_tag(Tag, x - 1). If select_tag(Tag, r) is not an ancestor of node x, we stop. Otherwise, we set r = r - 1 and iterate.

TaggedFoll(x, tag): The first node labelled tag with preorder larger than that of x, and not in the subtree of x. This is select_tag(Tag, rank_tag(Tag, Close(x)) + 1).

(c) Connecting the Text and the Tree

Conversion between text numbers, tree nodes, and global identifiers is easily carried out by using Par and a bitmap B of 2n bits that marks the opening parentheses of tree leaves containing text, plus o(n) extra bits to support rank or select queries. Bitmap B enables the computation of the following operations:

LeafNumber(x): Gives the number of leaves up to x in Par. This is rank_1(B, x).

TextIds(x): Gives the range of text identifiers that descend from node x. This is simply [LeafNumber(x - 1) + 1, LeafNumber(Close(x))].

XMLIdText(d): Gives the global identifier for the text with identifier d. This is Preorder(select_1(B, d)).

XMLIdNode(x): Gives the global identifier for a tree node x. This is just Preorder(x).
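The tag operations of (b) and the text operations of (c) can be sketched together in Python. The Par, Tag and B sequences below describe a hypothetical seven-node tree with two text leaves; rank, select and Close are linear scans standing in for the succinct structures, and undefined results are returned as None.

```python
PAR = "((((()))(())))"
TAG = ["&", "part", "@", "name", "%", "/%", "/name", "/@",
       "color", "#", "/#", "/color", "/part", "/&"]
# Bitmap B marks the opening parentheses of leaves that carry text.
B = [1 if label in ("%", "#") else 0 for label in TAG]


def rank(symbol, seq, i):
    """Occurrences of symbol in seq[1..i] (1-based); 0 when i <= 0."""
    return sum(1 for s in seq[:max(i, 0)] if s == symbol)


def select(symbol, seq, j):
    """1-based position of the j-th occurrence of symbol, or None."""
    seen = 0
    for position, s in enumerate(seq, start=1):
        if s == symbol:
            seen += 1
            if seen == j:
                return position
    return None


def close(x):
    depth = 0
    for pos in range(x, len(PAR) + 1):
        depth += 1 if PAR[pos - 1] == "(" else -1
        if depth == 0:
            return pos
    raise ValueError("unbalanced parentheses")


def subtree_tags(x, tag):
    return rank(tag, TAG, close(x)) - rank(tag, TAG, x - 1)


def tagged_desc(x, tag):
    pos = select(tag, TAG, rank(tag, TAG, x) + 1)
    return pos if pos is not None and pos < close(x) else None


def tagged_foll(x, tag):
    return select(tag, TAG, rank(tag, TAG, close(x)) + 1)


def leaf_number(x):
    return rank(1, B, x)


def text_ids(x):
    return (leaf_number(x - 1) + 1, leaf_number(close(x)))


def xml_id_text(d):
    return PAR[:select(1, B, d)].count("(")   # Preorder(select_1(B, d))
```

For instance, `tagged_desc(3, "color")` is undefined because the only "color" node lies outside the subtree of node 3, whereas `tagged_foll(3, "#")` jumps directly to position 10.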

Automaton Construction 235

A tree automaton is an abstract machine, consisting of states and transitions, which can be used to traverse a tree. An automaton A is a tuple comprising ℒ, the infinite set of all possible tree labels; Q, the finite set of states; a set of initial states that is a subset of Q; and δ, the set of transitions. Generally, the translation from an XPath query to an automaton can be done in one pass through a parse tree created by the parsing unit 114; see step 230 in Fig. 2. The resulting automaton is, roughly speaking, "isomorphic" to the original query.

Consider a query /descendant::listitem/descendant::keyword. The corresponding automaton is a 4-tuple:

A = (ℒ, {q₀, q₁}, {q₀}, δ),

where ℒ is the infinite set of all possible tree labels, {q₀, q₁} is the set of states, {q₀} is the set of initial states, and the set δ contains the following transitions:

1  q₀, {listitem} → ↓₁ q₁

2  q₀, ℒ - {@, #} → ↓₁ q₀

3  q₀, ℒ → ↓₂ q₀

4  q₁, {keyword} → mark

5  q₁, ℒ - {@, #} → ↓₁ q₁

6  q₁, ℒ → ↓₂ q₁

Each transition is characterised by a starting state (q₀ or q₁) and at least one condition, such as whether the current node is labelled {listitem} or {keyword}. If the condition is satisfied, the action on the right hand side of the transition is performed. Action mark (transition 4) represents marking a node as a result node. Action ↓₁ q₁ (transitions 1 and 5) represents going to the first child node and changing to state q₁. Similarly, action ↓₁ q₀ (transition 2) represents traversing to the first child node, but remaining in state q₀. Finally, actions ↓₂ q₀ (transition 3) and ↓₂ q₁ (transition 6) represent traversing to the next sibling node, and changing to state q₀ and q₁ respectively. The automaton is non-deterministic, in that for a given state and label, the automaton can change to more than one state.

The above automaton starts in initial state {q₀} and traverses the tree until it finds a subtree labelled listitem. If the subtree is found, the automaton changes to state {q₁} and continues to traverse the subtree to look for a tag keyword, or possibly another tag listitem. In the latter case, the automaton returns to state {q₀} according to transition 1.

Condition ℒ - {@, #} in transitions 2 and 5 ensures that, according to the semantics of the descendant axis, only element nodes (i.e. not text # or attributes @) are considered. According to transition 4, a node labelled keyword that is found in state {q₀, q₁} will be marked as a result node.
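The six transitions above can be exercised with a small Python sketch. It assumes a simple reading of the automaton in which every applicable transition is followed for each (node, state) pair, which suffices for this query because it has no predicates; it is not the evaluation strategy of the processing unit 116. The document tree, the action names "down1" and "down2" (for ↓₁ and ↓₂) and all function names are hypothetical.

```python
def is_label(name):
    return lambda label: label == name


def is_element(label):          # condition "all labels except @ and #"
    return label not in ("@", "#")


def any_label(label):           # condition "all labels"
    return True


TRANSITIONS = [
    ("q0", is_label("listitem"), ("down1", "q1")),   # transition 1
    ("q0", is_element, ("down1", "q0")),             # transition 2
    ("q0", any_label, ("down2", "q0")),              # transition 3
    ("q1", is_label("keyword"), ("mark", None)),     # transition 4
    ("q1", is_element, ("down1", "q1")),             # transition 5
    ("q1", any_label, ("down2", "q1")),              # transition 6
]


def encode(tree):
    """Number nodes in preorder and record first-child / next-sibling links."""
    labels, first_child, next_sibling = [], [], []

    def add(node):
        label, children = node
        me = len(labels)
        labels.append(label)
        first_child.append(None)
        next_sibling.append(None)
        previous = None
        for child in children:
            cid = add(child)
            if previous is None:
                first_child[me] = cid
            else:
                next_sibling[previous] = cid
            previous = cid
        return me

    add(tree)
    return labels, first_child, next_sibling


def run(tree, transitions=TRANSITIONS, initial="q0"):
    """Return the preorder numbers (0-based) of all marked nodes."""
    labels, first_child, next_sibling = encode(tree)
    marked, seen, stack = set(), set(), [(0, initial)]
    while stack:
        node, state = stack.pop()
        if node is None or (node, state) in seen:
            continue
        seen.add((node, state))
        for source, condition, (action, target) in transitions:
            if source != state or not condition(labels[node]):
                continue
            if action == "mark":
                marked.add(node)
            elif action == "down1":
                stack.append((first_child[node], target))
            else:
                stack.append((next_sibling[node], target))
    return sorted(marked)


TREE = ("doc", [("listitem", [("keyword", [("#", [])]),
                              ("para", [("keyword", [])])]),
                ("keyword", []),
                ("listitem", [("text", [])])])
```

In this tree only the two keyword nodes below the first listitem are marked; the keyword that is a sibling of the listitem is visited in state q₀ only and is therefore not a result.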

In a first example, consider the following query for extracting captions of all figures in the tree-structured data in Fig. 4:

Query 1: XPath: //caption

Based on Query 1, the parsing unit 114 constructs the following automaton having one state, q₀, and two transitions (Y1) and (Y2) according to step 235 in Fig. 2.

Automaton 1:

q₀, caption → Mark (Y1)

where operator '∨' necessitates an earlier action (e.g. ↓₁ q₀ in (Y1)) to be performed before a later action (e.g. ↓₂ q₀ in (Y2)) positioned to the right of the earlier action. If Automaton 1 is used during query processing, the tree in Fig. 4 is traversed to mark all caption nodes. The automaton starts its computation at the root node ("document") 402 in state q₀. At this current node, transition (Y1) is not performed because the node is not a caption node. Instead, transition (Y2) is performed because its condition (* is a wildcard) is satisfied. The first action (↓₁ q₀) instructs the processing unit 116 to go to a first child node ("page" 404) of the current node ("document" 402). Once the nodes in the subtree starting from the "page" 404 are considered, the second action (↓₂ q₀) instructs the processing unit 116 to consider the subtree starting from the second child node ("page" 406). This process is repeated until all caption nodes are found.

Since the "page" node 404 is not a caption node, transition (Y1) is not performed and transition (Y2) instructs the processing unit 116 to move to the first child node ("para" 408). The current node is set to this child node ("para" 408). Again, once all nodes in the subset starting from the current node ("para" 408) are considered, the second action (↓₂ q₀) instructs the processing unit 116 to go to the next sibling node ("figure" 410). The current node is then set to this sibling node ("figure" 410). At the "figure" node 410, transition (Y2) instructs the processing unit 116 to move to the first child node "caption" 412. In this case, the caption node 412 is marked as a result node because the condition of (Y1) is satisfied. Other parts of the tree are then considered, from the subset starting from the second child node 406 to that from child nodes 414, 416 and finally 418. In this case, three caption nodes 412, 420 and 422 are found.

However, Automaton 1 is not optimal because the whole tree in Fig. 4 needs to be traversed to satisfy the query. In a second example, consider a more complicated query for extracting all figures on a page that contains both footnotes and margin notes:

Query 2: XPath: //page[.//footnote and .//marginnote]//figure

Based on Query 2, the parsing unit 114 constructs the following automaton having four states q₀, q₁, q₂ and q₃, and eight transitions (T1) to (T8); see 235 in Fig. 2.

Automaton 2:

q₀, page → ↓₁q₁ & ↓₁q₂ & ↓₁q₃ (T1)

q₁, footnote → OK (T3)

q₂, marginnote → OK (T5)

q₃, figure → Mark (T7)

where operator '&' necessitates all actions (e.g. ↓₁q₁, ↓₁q₂ and ↓₁q₃ in (T1)) to be performed, and operator '∨' necessitates an earlier action (e.g. ↓₁ q₀ in (T1)) to be performed before a later action (e.g. ↓₂ q₀ in (T2)) to the right of the earlier action.

If Automaton 2 is used during query processing, the first (T1) and second (T2) transitions look for a "page" node in the tree-structured data. If a page node is found, the first transition (T1) instructs the processing unit 116 to execute actions of states q₁, q₂ and q₃. States q₁ and q₂ test the presence of nodes ("OK") but do not mark them. State q₃ will only be considered if the page contains both footnotes and margin notes.

However, Automaton 2 is not optimal because transitions (T4) and (T6) require two traversals of the same subtree of the tree-structured data. In some cases, if a footnote is not located in the subtree, transitions (T5) to (T8) can be avoided.

If the XPath query is changed to XPath: //page[.//marginnote and .//footnote]//figure, there will be no savings in query run time if no footnotes are found after locating at least one margin note. As such, an automaton constructed based on the syntax of the XPath query may not be optimal.

Query Processing 240

To improve query run time, an automaton constructed by the parsing unit 114 is first analysed by the processing unit 116 prior to query processing. Automaton analysis is performed using the properties of the tree-structured data in the form of the textual content of the tree, as stored in the text index 124, and its hierarchical structure, as stored in the tree index 126.

Referring now to the flowchart in Fig. 5, the processing unit 116 first determines a current node in the tree-structured data; step 510. The processing unit 116 analyses the transitions in the constructed automaton to determine whether the automaton can be dynamically updated based on the properties of the subtree starting from the current node; see steps 520 and 530. Properties of the subtree are stored in the text 124 and tree 126 indices in the data store 120. If no improvements can be made, the original automaton will be used to traverse the subtree and any results stored; see steps 555 and 560. Otherwise, the processing unit 116 will proceed to update the automaton based on the analysis; see step 540.

In a first example, Automaton 2 is analysed to determine whether the automaton can be dynamically updated based on the properties of the subtree starting from current node ("page" 610) in Fig. 6(a). Transitions (T3) and (T5) only test the presence of nodes with tags "footnote" and "marginnote" respectively. Based on the text 124 and tree 126 indices associated with this subset (not shown), the subtree in Fig. 6(a) clearly comprises such nodes. As such, transitions (T3) and (T5) are certain to succeed, and they are removed from the automaton constructed by the parsing unit 114. The updated automaton is shown in Fig. 6(b), where removed or inactive transitions are shown struck through for the subtree in Fig. 6(a). Compared with the original automaton, the same subset of data is not traversed multiple times: first to determine the presence of "footnote" nodes and then to determine the presence of "marginnote" nodes. The remaining transitions, namely the last action of (T1), (T7) and (T8), allow the automaton to be processed more quickly. Based on the updated automaton, the processing unit 116 traverses the subset and stores any results obtained; see steps 550 and 560 in Fig. 5.

In another example shown in Fig. 7, Automaton 2 is updated differently based on different properties of the subtree in Fig. 7(a). In particular, transition (T3) cannot be satisfied because the subtree does not have any "footnote" nodes. As such, transition (T1) also cannot be satisfied, leaving (T2) as the remaining transition. However, action ↓₁q₀ of transition (T2) also cannot be satisfied because there is no child page node below the root page node of the subtree.

In this case, the updated automaton is shown in Fig. 7(b) where removed or inactive transitions are struck through for the subtree in Fig. 7(a). Compared with the original automaton, the subtree in Fig. 7(a) is not traversed at all because the text 124 and tree 126 indices show that this subset will not return any results. As such, query run time is reduced because the processing unit 116 does not have to waste time traversing subsets with no results. This also avoids the need to traverse the entire tree. Based on the remaining transition (T2), the processing unit 116 jumps to the next sibling node to proceed with the traversal; see arrow 710 in Fig. 7(a).
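The effect of skipping a subtree that cannot yield results may be sketched as follows. This is an illustration under stated assumptions, not the implementation of the embodiment: the `Node` class, the per-node `tags` set (a stand-in for what the text 124 and tree 126 indices report) and the `required` parameter are all hypothetical.

```python
class Node:
    """Toy tree node; `tags` is an assumed stand-in for an index lookup."""

    def __init__(self, tag, children=()):
        self.tag = tag
        self.children = list(children)
        # Tags occurring anywhere in this subtree, computed once at build time.
        self.tags = {tag}.union(*(c.tags for c in self.children))


def count_nodes(node):
    """Number of nodes a full traversal of `node`'s subtree would touch."""
    return 1 + sum(count_nodes(c) for c in node.children)


def traverse_siblings(siblings, required):
    """Traverse only sibling subtrees that can still satisfy the query.

    A subtree missing any tag in `required` is skipped without being
    entered, which corresponds to jumping to the next sibling.
    Returns (matched, visited): subtree roots traversed and nodes touched.
    """
    matched, visited = [], 0
    for node in siblings:
        if not required <= node.tags:
            continue  # no result possible in this subtree: jump to next sibling
        matched.append(node)
        visited += count_nodes(node)
    return matched, visited


# First page has no "footnote" (as in Fig. 7(a)); second page has one.
p1 = Node("page", [Node("text"), Node("marginnote")])
p2 = Node("page", [Node("footnote"), Node("marginnote")])
matched, visited = traverse_siblings([p1, p2], {"footnote"})
```

Here only the second page is traversed, so three nodes are touched instead of six.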

The above steps are repeated during query processing to specialise or adapt the automaton according to the properties of the relevant subtree; see steps 570 and 590. Once the relevant nodes are considered, the processing unit 116 proceeds to report the results; see step 580.

As such, using the text 124 and tree 126 indices, the constructed automaton can be specialised such that, during query processing, the processing unit 116 is able to determine and jump to the next node of interest to avoid traversing unnecessary parts of the tree-structured data. Where applicable, actions that need to be performed on the same subset of the tree-structured data are simplified such that at most one traversal is performed on the same subset. For instance, such jumps occur when the automaton possesses specific pairs of transitions: one going from a state q₀ to a distinct state q₁ or to a marking action, and another going from state q₀ to q₀ itself.
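A pair of transitions of this kind amounts to scanning the tree until a given label is reached, which can be replaced by a direct jump. The following is a minimal sketch of such a jump, assuming that the tree is given as a list of labels in preorder and that the tree index 126 can be approximated by a dictionary from each label to its sorted preorder positions. The function names and the toy labels are hypothetical; the indices of the embodiment are not implemented this way.

```python
from bisect import bisect_right
from collections import defaultdict


def build_label_index(labels):
    """Map each label to the sorted preorder positions where it occurs."""
    index = defaultdict(list)
    for pos, label in enumerate(labels):
        index[label].append(pos)
    return index


def next_labelled(index, label, pos):
    """Return the first preorder position after `pos` carrying `label`, or None."""
    positions = index.get(label, [])
    i = bisect_right(positions, pos)
    return positions[i] if i < len(positions) else None


def jump_all(index, label):
    """Reach every `label` node by repeated jumps, starting before the root."""
    hits, pos = [], -1
    while (pos := next_labelled(index, label, pos)) is not None:
        hits.append(pos)
    return hits


# Toy document in preorder; the labels are illustrative only.
labels = ["doc", "page", "text", "footnote", "page", "text", "page", "marginnote"]
index = build_label_index(labels)
page_positions = jump_all(index, "page")
```

In this sketch the number of jumps equals the number of matching nodes (three here) rather than the number of nodes in the document (eight), which is the saving the jump is intended to provide.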

For instance, consider the pairs T1/T2 or T7/T8 in Fig. 6(b). T1/T2 mean that if the current node is labelled "page" then the automaton can proceed to the actions associated with state q₁ and, if not, it stays in q₀ and continues to look for a "page"-labelled node. In other words, the automaton is just traversing the tree until it reaches a "page"-labelled node. Such "traversal until label" behaviour can be replaced altogether by a jump from the current node to the next "page"-labelled node. These jumps are possible using the tree index 126. Furthermore, this analysis of the automaton is not lost; rather, the data structure representing the automaton is updated to reflect the behaviour that the query engine has determined. For instance, the pair of transitions T1/T2 is replaced by a single instruction specifying that, in state q₀, the automaton has to jump to the next "page"-labelled node.

For many applications, query run time is important but queries are often "weakly specified", which means that the queries are relatively general. Examples of weakly specified queries are "all titles" and "the titles of all books". This is to be contrasted with more specific queries such as "the title of all books whose author is Kipling and published after 1900". If an XML document has 20 million elements but only 100 titles satisfy a weakly specified query, a query engine 110 using the above method only has to perform 100 jumps to the relevant title nodes instead of 20 million lookups.

Variations

It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the method and system as shown in the specific embodiments. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.

It should also be understood that, unless specifically stated otherwise as apparent from the following discussion, it is appreciated that throughout the description, discussions utilizing terms such as "receiving", "processing", "retrieving", "selecting", "calculating", "determining", "displaying" or the like, refer to the action and processes of a computer system, or similar electronic computing device, that processes and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices. Unless the context clearly requires otherwise, words using singular or plural number also include the plural or singular number respectively.

It should be understood that the techniques described herein might be implemented using a variety of technologies. For example, the methods described herein may be implemented by a series of computer executable instructions residing on a suitable computer readable medium. Suitable computer readable media may include volatile (e.g. RAM) and/or non-volatile (e.g. ROM, disk) memory, carrier waves and transmission media (e.g. copper wire, coaxial cable, fibre optic media). Exemplary carrier waves may take the form of electrical, electromagnetic or optical signals conveying digital data streams along a local network or a publicly accessible network such as the Internet.

References

[1] G. Navarro and V. Mäkinen, "Compressed full-text indexes," ACM Comp. Surv., vol. 39, no. 1, 2007.

[2] P. Ferragina and G. Manzini, "Indexing compressed text," J. ACM, vol. 54, no. 4, pp. 552-581, 2005.

[3] M. Burrows and D. J. Wheeler, "A block-sorting lossless data compression algorithm," Digital Equipment Corporation, Tech. Rep. 124, 1994.

[4] P. Ferragina and G. Manzini, "Indexing compressed text," J. ACM, vol. 54, no. 4, pp. 552-581, 2005.

[5] P. Ferragina, G. Manzini, V. Mäkinen, and G. Navarro, "Compressed representations of sequences and full-text indexes," ACM TALG, vol. 3, no. 2, 2007.

[6] R. Grossi, A. Gupta, and J. S. Vitter, "High-order entropy-compressed text indexes," in SODA, 2003, pp. 841-850.

[7] V. Mäkinen and G. Navarro, "Rank and select revisited and extended," Theor. Comput. Sci., vol. 387, no. 3, pp. 332-347, 2007.

[8] I. Munro and V. Raman, "Succinct representation of balanced parentheses, static trees and planar graphs," in FOCS, 1997, pp. 118-126.

[9] F. Claude and G. Navarro, "Practical rank/select queries over arbitrary sequences," in SPIRE, 2008, pp. 176-187.

[10] D. Okanohara and K. Sadakane, "Practical entropy-compressed rank/select dictionary," in ALENEX, 2007.

[11] K. Sadakane and G. Navarro, "Fully-functional static and dynamic succinct trees," in SODA, 2010.

Claims

1. A computer-implemented method for processing a query of tree-structured data, comprising:

(a) constructing an automaton from the query, wherein the automaton comprises one or more states, one or more labels, and one or more transitions each associated with a state and a label;

(b) analysing the one or more transitions of the automaton based on properties of the tree-structured data, or a subset of the tree-structured data; and

(c) based on the analysis, updating the automaton for traversal of the tree-structured data, or the subset of the tree-structured data,

wherein steps (b) and (c) are performed repeatedly during query processing to facilitate jumping from one node of the tree-structured data to another node.

2. The computer-implemented method of claim 1, wherein traversal of the tree-structured data, or the subset of the data, is according to a top-bottom traversal order.

3. The computer-implemented method of claim 1 or 2, wherein the properties of the tree-structured data, or the subset of the tree-structured data, are stored in a text index representing textual content of the tree-structured data, or the subset of the data.

4. The computer-implemented method of claim 1, 2 or 3, wherein the properties of the tree-structured data, or the subset of the tree-structured data, are stored in a tree index representing a hierarchical structure of the tree-structured data, or the subset of the data.

5. The computer-implemented method of claim 4, wherein step (c) comprises replacing one or more transitions with a jump to a labelled node in the tree-structured data, or a subset of the tree-structured data, using the tree index.

6. The computer-implemented method of any one of claims 1 to 4, wherein step (b) comprises determining one or more transitions that will succeed based on the properties of the tree-structured data, or the subset of the tree-structured data.

7. The computer-implemented method of claim 6, wherein step (c) comprises simplifying the automaton by removing the determined one or more transitions from the automaton.

8. The computer-implemented method of any one of the preceding claims, wherein step (b) comprises determining one or more transitions that cannot be satisfied based on the properties of the tree-structured data, or the subset of the tree-structured data.

9. The computer-implemented method of claim 8, wherein step (c) comprises simplifying the automaton by removing the determined one or more transitions from the automaton.

10. The computer-implemented method of any one of the preceding claims, further comprising traversing the tree-structured data, or the subset of the tree-structured data, according to the updated automaton.

11. The computer-implemented method of any one of the preceding claims, wherein the tree-structured data is Extensible Markup Language (XML) data.

12. The computer-implemented method of any one of the preceding claims, wherein the query is an XPath query.

13. A computer program to implement the method of any one of the preceding claims.

14. A computer-implemented system for processing a query of tree-structured data, comprising:

a processing unit to (a) analyse the one or more transitions of the automaton based on properties of the tree-structured data, or a subset of the tree-structured data; and (b) based on the analysis, update the automaton for traversal of the tree-structured data, or the subset of the tree-structured data, wherein the processing unit performs steps (a) and (b) repeatedly during query processing to facilitate jumping from one subset to another subset of the tree-structured data.