WO2002075533A1

WO2002075533A1 - Method and apparatus for processing information

Info

Publication number: WO2002075533A1
Application number: PCT/FI2002/000212
Authority: WO
Inventors: Antti Jokipii
Original assignee: Republica Jyväskylä Oy
Priority date: 2001-03-16
Filing date: 2002-03-15
Publication date: 2002-09-26
Also published as: FI20010536A0; FI20010536A

Abstract

The invention relates to the processing of data in XML (Extensible Markup Language) form so that the information contained in the data is made available to existing software components. The invention also relates to the centralized management of the definitions made in order to process the information. When XML source data (301, 405) is processed, a set of rules (302, 404) is used that defines exactly which software components (304) the source data is used with, what the source data (301, 405) is like and what is done to the source data at each step. The source data (301, 405) is modified in the processing component (303, 406), in a way defined in the set of rules (302, 404), so that the modified source data is in the form required by the reusable software components (304).

Description

Method and apparatus for processing information

The invention relates to the processing of data in XML (Extensible Markup Language) form so that the information contained in said data is made available to existing program components. In addition, the invention relates to the centralized management of the definitions made in order to process the data.

The employed program component needs its input information in a certain form. There is no uniform input format shared between program components; each program component has its own requirements as regards the input information. The interface of the program component defines in which form and in which order said program component needs the input information. The data described by XML must therefore be modified into the form required by the program component. At the priority date of the present application, additional information on the XML language can be found at the address http://www.w3.org/XML/. In most cases, if the XML to be fed in is not particularly designed for said program component, it must also be possible to fetch the necessary pieces of information from their respective locations in a given order. In most applications using XML, for instance in XML-based messaging systems, only certain parts of the input XML data are used.

When the source data is read, XML parsers are used to organize the input data into a form in which the program components can use it. In addition, using the input data requires a considerable amount of programming work so that the required parsers, and the data produced by said parsers, can be put to proper use in the internal logic of the program component in question. Among generally known XML parsers, there are two main types: event-based (SAX, Simple API for XML; API, Application Programming Interface) and tree-based (DOM, Document Object Model). At the priority date of the present application, more information on said XML parsers can be found at the addresses http://www.megginson.com/SAX/index.html and http://www.w3.org/DOM/. Let us now observe known arrangements, their operation and the drawbacks connected thereto.

One example of an event-based parser is SAX, the operation of which is illustrated in figure 1. The source data 101 is read into the SAX parser 102, which breaks the original source data down into parsing events, for example the beginning and end of each element. SAX then transmits said events directly to the event-processing classes of the employed program components 103. Generally the events are not arranged in an internal structure but are processed as they are triggered, which makes the processing light. On the other hand, the source data 101 can only be processed event by event. Events are processed through so-called handlers 103a of the programs. Said handlers 103a must be produced separately in each event-processing program. The actual processing of the events in the program components 103 is very similar to the processing of events in a graphical user interface.
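The event-by-event processing described above can be sketched with the SAX API shipped in the Java standard library. This is only an illustrative sketch: the class name SaxEventSketch, the method collectEvents and the sample document are assumptions made for the example and do not come from the application.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration of event-based parsing; names are not from the source text.
final class SaxEventSketch {

    /** Parses the XML and returns the element events in the order the parser reported them. */
    static List<String> collectEvents(String xml) {
        List<String> events = new ArrayList<>();
        // The handler plays the role of the per-program event handler (103a in figure 1).
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                events.add("start:" + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("end:" + qName);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the source data", e);
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(collectEvents("<message><receiver/><transmitter/></message>"));
    }
}
```

Each event is delivered once and then discarded by the parser, so anything the program needs later has to be stored by the handler itself, as the list in this sketch does.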

An example of a parser producing a tree structure is illustrated in figure 2. The parser illustrated in figure 2 represents the type DOM 202. An application using the DOM 202 definition creates an application programming interface (API) for instance to source data 201 in the XML language. By means of the DOM interface 203, the contents and structure of the source data 201 can be accessed, and said features can then be modified. DOM includes a number of objects that represent XML documents, as well as a model describing how said objects can be combined and an application programming interface through which the objects can be handled.

DOM 202 defines the logical structure of the documents, as well as the way in which the documents can be processed. Any kind of data contained in an XML document can be processed, transformed, removed or added by means of DOM 202. DOM 202 collects the source data into an internal tree structure 203 and allows the employed program 204 to navigate in said structure. Each of the program components 204 includes a separate part 204a that can fetch the required data from the tree structure 203 created by DOM.
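As a rough illustration of the tree-based approach, the following sketch builds the complete DOM tree with the Java standard library and then fetches a single attribute from it, which is the kind of work the part 204a has to do. The class name DomTreeSketch, the method readAttribute and the sample element and attribute names are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical illustration of tree-based parsing; names are not from the source text.
final class DomTreeSketch {

    /** Returns the attribute of the first element with the given name, or null if there is none. */
    static String readAttribute(String xml, String elementName, String attributeName) {
        try {
            // The whole document is read into memory before anything can be fetched.
            Document tree = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element first = (Element) tree.getElementsByTagName(elementName).item(0);
            return first == null ? null : first.getAttribute(attributeName);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the source data", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<message><receiver address=\"a@example.org\"/></message>";
        System.out.println(readAttribute(xml, "receiver", "address"));
    }
}
```

The tree can be navigated freely and repeatedly, but its construction cost grows with the size of the whole document even when only one value is needed.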

One of the drawbacks of a tree-based parser is its insufficient capacity when processing a large amount of data. A drawback of the event-based arrangement is the management of complex structures as well as of references to certain locations, for instance by means of XPath. At the priority date of the present application, additional information on such XPath references can be found at the address http://www.w3.org/TR/xpath. Moreover, in event-based structures, the events collected in the parser are transmitted to the program components only once, whereafter they cannot be further processed. Both of the above described arrangements have the drawback that reprogramming is always required when there are changes in the data to be fed in. In many applications, programming interfaces based on the tree structure are feasible, but they require a lot of system resources, particularly when the source document to be processed is large. In addition, the production of a tree structure is a slow process, which means that this technique is not suitable in time-critical environments. Neither is the event-based model fast enough for time-critical systems, particularly in cases where only a part of certain types of events is needed.

Yet another drawback is caused by the fact that many program components designed to be reused contain parts made for processing data. In particular, if the source data to be fed in must be in the XML form, the structure of the program component contains a lot of coding that is specific to the source data. This makes reusability virtually impossible, because reusability means that a program component can be transferred to another environment or program without changing it. As a result, new program components for processing the data must be written for each system.

When the input data or source data is in the XML language, another problem is created in that the XML standard only defines how the data must be coded, but it does not in any way define for instance the names, attributes or mutual relations of the employed elements. If the source data is in the XML form, a change in its grammar may in the worst case cause changes in all employed program components. Said changes are difficult to manage. Changes in an XML structure are typical, because XML based languages are undergoing a continuous process of change as their standardization proceeds.

Attempts have been made to solve the problems of the current technology by using tools that automatically generate, for the program components, processing parts, i.e. frameworks, that are suitable for XML-based data processing. One of these tools is, for example, Breeze XML Studio; at the priority date of the present application, additional information on said tool can be found on the web pages at www.breezefactor.com. A corresponding tool is being developed in the Adelard project of Sun Microsystems, on which information can be found, at the priority date of the present application, on the web pages at www.sun.com. However, these tools only somewhat alleviate the problem, because the program component itself still contains the parts required for processing data in the XML language. According to the XML Data Binding (JSR-31) produced by the Java Community Process, XML is modified so that Java classes corresponding to the XML structure are created. Additional information on this solution can be found, at the priority date of the present application, at the address http://java.sun.com/aboutJava/communityprocess/jsr/jsr_031_xmld.html. This technique makes it easier to use XML, but it solves neither the problem of changing XML nor that of using several different formats for the same program components.

An object of the invention is to parse source data in the XML form so that it can be reprocessed by reusable program components. A particular object of the invention is to provide said reusability of the program components when the source data is in the XML form. Moreover, an object of the invention is a centralized management of the data processing. Yet another object of the invention is to facilitate fast parsing and use of the source data in the XML form and thus to enable real-time processing of the source data with the defined program components.

The object is realized by generating the required definitions only once in one location, so that they can be managed in a centralized way. In particular, the source data in the XML form is processed by a processing component that has at its disposal said definition of what the input source data in the XML form is like, and of how and with what program components said data should be processed. The processing rate can be increased by loading as many program components as possible into memory in advance, prior to parsing the source data in the XML form, and by using a special parser that reads only the required information from the source data.

The invention is characterized by what is set forth in the characterizing part of the independent claims. The dependent claims describe the preferred embodiments of the invention in more detail.

When developing programs, it is economical and effective to use already existing program components as much as possible. In order to maintain effectiveness and functionality, also the new features are advantageously realized, as far as possible, in the form of program components. The advantage of program components is that the same functionality need not be realized many times, but the ready-made, already existing embodiments can be reused. Thus the quality of the programs is improved, because there can be used already tested program components, and only their compatibility must be tested. Moreover, projects can be realized with a schedule that is more accurate and faster than before, because less programming and testing is needed in the production of the programs.

According to the invention, the processing of input data in the XML form and the use of the employed program components can both be managed at one centralized location. Existing data and already defined features can be reused. According to the invention, said features are defined in one common definition only. This definition is called MAP. MAP includes a description of the employed source data, information as to how the source data should be processed by the program components, descriptions of the necessary parts of all interfaces of the employed program components, as well as information on how said program components are put to use. The employed program components can be completely separate and unaware of each other. Program components are always used through an interface that defines the data required for using the program component in question.

MAP can be written inside the processing component, or it can be loaded when the processing component is started, for example from an XML file or a database containing the definition. MAP tells unambiguously how the data is processed and what program components are used for processing the data, as well as what kind of interfaces or parts thereof are used. Because MAP describes the functions as uses of existing interfaces, the employed program components need not be modified. This lightens the process as a whole and considerably increases the modifiability of the system.

All employed program components, and the interfaces required for using them, are defined in MAP. In addition, MAP describes the employed source data and defines what kind of procedures the source data is subjected to at each step. MAP describes only those parts of the source data that are employed, not necessarily the whole of the source data. Consequently, when MAP is created, it must be known exactly what program components should be used and what the input information is. In order to facilitate this process, it is possible to create a graphical user interface that identifies the existing program components and automatically finds out the interfaces used by them.
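One conceivable way for such a tool to find out the interface of an existing Java program component is reflection. The sketch below lists the public operations of a class by name; the class InterfaceDiscoverySketch and its method publicOperations are hypothetical names chosen for the example, and the application does not state how the discovery would actually be implemented.

```java
import java.lang.reflect.Method;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper; the class and method names here are illustrative only.
final class InterfaceDiscoverySketch {

    /** Lists the public operations of a class as "name(paramType,...)" strings. */
    static SortedSet<String> publicOperations(String className) {
        try {
            SortedSet<String> operations = new TreeSet<>();
            for (Method method : Class.forName(className).getMethods()) {
                StringBuilder signature = new StringBuilder(method.getName()).append('(');
                Class<?>[] params = method.getParameterTypes();
                for (int i = 0; i < params.length; i++) {
                    if (i > 0) {
                        signature.append(',');
                    }
                    signature.append(params[i].getName());
                }
                operations.add(signature.append(')').toString());
            }
            return operations;
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("Unknown class: " + className, e);
        }
    }

    public static void main(String[] args) {
        // Prints every public operation that a MAP definition could refer to.
        publicOperations("java.lang.StringBuffer").forEach(System.out::println);
    }
}
```

A list of this kind could be offered to the person writing MAP, so that method names and argument types do not have to be typed in by hand.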

The processing component operates so that first the source data is fed to the processing component. Then the defined program components, their interfaces, a description of the source data and operational instructions are read from MAP. The parsing of the source data is started on the basis of the MAP information, and simultaneously the source data is converted into a program that can be run according to the MAP instructions. Thereafter the program components themselves are run. The running order is defined in MAP, and it can contain for instance conditions, jumps or calls, by which the order of execution can be affected during the run. Thus, for example, internal loop structures can be created in MAP. The obtained end result is a program defined by the source data and MAP, which program is easily run and can use the required program components for processing the desired tasks. The source data fed in during the processing controls the running of the program components. Moreover, external parameters can be given for controlling the process or as data to be used in the processing.

MAP can be dynamically modified, which means that the processing component can be used dynamically. When the description as to what should be done to the components is only realized at one location, i.e. in MAP, the probability of errors is reduced. When the logic of the application is programmed in the employed program components, and the application logic and the required XML data are separate, it is not necessary to modify the program components. Because the employed program components are generally tested in advance, there is now tested only the system composed by means of MAP and the input data.

This kind of processing component and connected MAP are easily realized in several different environments and applications. All definitions can be reused. The operation is more reliable, because the probability of errors is reduced, when the definitions and instructions are created in one location only. The application is effective, and the data-specific part need not be included in the program components. This means an essential reduction in the work of programmers and designers and clarifies the structure of the designed program components and programs as well as improves their reusability.

The invention is explained in more detail with reference to a few preferred embodiments and to the accompanying drawings, where

figure 1 illustrates an arrangement according to the prior art,

figure 2 illustrates an arrangement according to the prior art,

figure 3 illustrates an arrangement according to the invention, and

figure 4 illustrates an arrangement according to an embodiment of the invention.

Figures 1 and 2 have already been dealt with in the description of the prior art and existing suggestions for solutions. In the following description of the invention, we shall now observe the drawings that illustrate the present invention in more detail, starting from figure 3. However, let us first define and explain some of the terms used in the text.

The term 'program component' means an independent unit with a given function and unambiguously defined interfaces and other dependencies with respect to its surroundings according to what is agreed on. Program components can be distributed irrespective of each other, and third parties can use them for compiling their programs. Program components are used through a given interface. In this application, the term 'interface' describes the way a certain service is used, but not how it is implemented. An interface should be understood as a number of operations that can be called when using a program component. By means of an interface, the features offered by a program component can be used. In practice, the features described in the interface must be available and ready to be called during processing. The calls can be direct calls, such as for example Java Reflection and C# Reflection, or remote calls, RPC (Remote Procedure Call), with any possible structure, for instance CORBA/IIOP (Common Object Request Broker Architecture/Internet Inter-ORB Protocol) or RMI (Remote Method Invocation).

Let us now observe the arrangement according to the invention for processing source data in the XML form into a form required by the program components, and for transmitting parsed data and the information contained therein to the program components with reference to figure 3. The information transmitted to the program components can be for instance parsed source data or static data, fed in the program component, or it can be a program call without parameters or some information indirectly connected to the source data, such as the size of the source data. In an arrangement according to figure 3, the source data 301 that should be treated in the program component 304 is fed to the processing component 303. In this example, the source data 301 means, according to the invention, data in the XML form. The processing component 303 also needs the information contained by MAP 302. In MAP 302, there are defined the employed program components 304 and their interfaces. MAP 302 also contains the description of the employed source data 301. The program to be run is created by means of MAP 302, and MAP contains the information that tells the processing component 303 how the source data 301 is run for each program component 304.

The actual processing of the source data 301 according to the instructions of MAP 302 is performed so that the processing component 303 transmits to each program component 304 separately the part that is meant for it. The program components 304 are not necessarily bound to each other in any way. The processing component 303 also takes care of error processing and of transmitting data and events between the program components. Consequently, during a run, the arrangement according to the invention takes care of the processing of the source data 301 and feeds, on the basis of the information contained in MAP 302, the desired information in the desired form to the program components 304 and uses the program components 304 in a desired way for processing this information. Owing to this structure, program components that were not originally designed for processing said source data can be used in the processing of the data. Moreover, new program components can be designed so that they only contain application logic that is independent of the source data, by which application logic the data should then be processed. This again improves the reusability of the program components and thus improves the effectiveness of program construction and the quality of said programs.

A given part of MAP always deals with a given program component. If there should be added new parts to be processed by the program component or the program, it is possible, by modifying MAP, to obtain for instance a new XML document to be processed, wherefore there are not needed any heavy applications in between. It is also possible to add new features for processing source data in an existing program only by changing MAP.

Let us observe, by means of simple examples, how MAP is created. First in MAP there are defined all program components to which the data to be run is transmitted, as well as the interfaces of said program components. The definition can be for instance the following XML definition, where the class to be used is defined as a "java.lang.StringBuffer", i.e. a character string in the Java language, its constructor and one method "append" for this class as follows:

<object id="1">
<class name="java.lang.StringBuffer"/>
</object>
<object id="2">
<constructor classRef="1"/>
</object>
<object id="3">
<method classRef="1" methodName="append">
<argument datatype="java.lang.String"/>
</method>
</object>

'Class' is an abstract concept, and 'instance' is its practical occurrence, i.e. an available occurrence of a class. In the above example, a new instance of the Java class is created by means of a constructor, and the task of the defined "java.lang.StringBuffer" class is to contain text data that can be modified. The method "append" adds the character string to the instance of the StringBuffer class.

Next MAP defines the structure of the source data, i.e. gives an exact description of the source data to be fed in. Among the source data in the XML language, there is defined an element called "stringdata" and an attribute called "value" contained therein, as follows:

<element name="stringdata">

The next example refers to a corresponding spot in MAP, where consequently the element <content> refers to incoming XML:

</element>

In addition, MAP contains detailed information as to how the above defined source data in the XML language should be processed. In MAP, there also is collected the running order and the mutual relations of the program components in a tree structure. In this example, MAP also reports all classes and files used in the processing. In fact, MAP is a kind of meta-ini file of the program and tells what program components the program uses and how. Moreover, MAP contains information of how the running of the program should be controlled. For example for the data according to the examples dealt with here, the functions can be defined as follows:

<element name="stringdata"> <object id="4">

Thus, in the above example there is created a new instance of the defined class "java.lang.StringBuffer", and the defined method "append" is executed for this class. For the method parameter, there is fed from the source data the contents of the attribute "value" of the element "stringdata" under treatment. The input can be for instance as follows:

When the XML fed in the processing component corresponds to the example given above, by means of the MAP information given above the obtained end result is an instance of the StringBuffer class containing the text "Test".
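The behaviour described in this example can be approximated in plain Java as follows. The sketch hard-codes what the MAP fragments above express (the class "java.lang.StringBuffer", its constructor and the method "append" with a "java.lang.String" argument) and applies it to each "stringdata" element found by a SAX parser. The class name MapExampleSketch, the method process and the input document `<stringdata value="Test"/>` are assumptions made for the illustration; in particular, the input is inferred from the surrounding description, and the sketch is not the processing component of the application.

```java
import java.io.ByteArrayInputStream;
import java.lang.reflect.Constructor;
import java.lang.reflect.Method;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration; the rules are hard-coded instead of being read from a MAP file.
final class MapExampleSketch {

    /** Returns the text collected from the attribute "value" of the element "stringdata". */
    static String process(String xml) {
        try {
            // Counterparts of the objects with id 1, 2 and 3 in the MAP fragment.
            Class<?> cls = Class.forName("java.lang.StringBuffer");
            Constructor<?> constructor = cls.getConstructor();
            Method append = cls.getMethod("append", Class.forName("java.lang.String"));

            Object[] instance = new Object[1];
            DefaultHandler handler = new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName, Attributes atts)
                        throws SAXException {
                    if (!"stringdata".equals(qName)) {
                        return; // no rule is defined for other elements
                    }
                    try {
                        instance[0] = constructor.newInstance();
                        append.invoke(instance[0], atts.getValue("value"));
                    } catch (ReflectiveOperationException e) {
                        throw new SAXException(e);
                    }
                }
            };
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
            return instance[0] == null ? null : instance[0].toString();
        } catch (Exception e) {
            throw new IllegalStateException("Processing failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(process("<stringdata value=\"Test\"/>"));
    }
}
```

The StringBuffer class itself contains no XML-specific code; everything that ties it to the source data sits in the rules, which is the separation the description aims at.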

Owing to MAP, in the employed program components there can be fed information in the required form irrespective of the form of the original source data. MAP must be defined separately for each application. If MAP is stored in the XML file, parts of the MAP description can be used when making a new MAP description. In addition, a graphic user interface can be constructed in order to facilitate the creation of MAP.

In the following exemplary case, there are defined, as DTD (document type definition) descriptions, elements that can be used when creating an XML based MAP definition. Moreover, the definition also contains the attributes of said elements and elements contained therein.

<!ELEMENT map (init, engineInit, element*)>

The init element contained in the example includes all the object elements that are common to all program components to be run. The object elements to be loaded for each program component separately are included in the engineInit element. An object element of this kind has the form

<!ELEMENT object (class, constructor, method, instance, methodCall)>

<!ATTLIST object id ID #REQUIRED>

The object element stores the object returned by the element it contains in a location defined by the value of its attribute "id". Thus the object can be retrieved for further use on the basis of the name defined in the attribute "id". The class element returns the class definition, and the constructor element returns the class constructor, by means of which new instances of said class can later be created. The method element returns the method of a given class that is to be put to use; by means of said object, the method in question can also be used later on. The instance element returns the desired class instance as an object. The methodCall element returns the result of a method call.

The lifetime of the objects in memory is defined on the basis of the element under which the object element is defined. Objects defined under the init element are kept in memory as long as the processing component (engine) exists. Objects defined under the engineInit element are kept in memory as long as a single item of source data is being processed. Objects defined under an element element live as long as the element in question is being processed, i.e. as long as the element still has action elements left to be processed. An element element could be, for instance, as follows:

<!ELEMENT element (object*, action*)>

<!ATTLIST element name CDATA #IMPLIED>

Hence, an element element contains the object elements and action elements needed by the element in question. The action elements define the procedures performed for the element in different situations. Moreover, the element element returns the values returned by the action element. The attribute "name" defines the element to which said rules apply. In the definition, the attribute "name" may contain a path reference (XPath) instead of a mere element name. Different elements can then be referred to so that their logical connection is taken into account. For example, a distinction can be made whether the attribute "name" is located under the element receiver or under the element transmitter. The action element contains the actions that are performed with the element in question.
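The effect of using a path instead of a bare element name can be illustrated with a small sketch that keeps track of the currently open elements during SAX parsing and compares the resulting path with a rule. This is a deliberately simplified suffix comparison, not an XPath implementation, and the class name PathRuleSketch, the method matchingPaths and the sample document are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration; only plain "a/b" style paths are supported, not full XPath.
final class PathRuleSketch {

    /** Returns the full paths of all elements whose path ends with the given rule path. */
    static List<String> matchingPaths(String xml, String rulePath) {
        List<String> openElements = new ArrayList<>();
        List<String> matches = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                openElements.add(qName);
                String path = "/" + String.join("/", openElements);
                if (path.endsWith("/" + rulePath)) {
                    matches.add(path);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                openElements.remove(openElements.size() - 1);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the source data", e);
        }
        return matches;
    }

    public static void main(String[] args) {
        String xml = "<message><receiver><name/></receiver><transmitter><name/></transmitter></message>";
        System.out.println(matchingPaths(xml, "name"));          // both name elements
        System.out.println(matchingPaths(xml, "receiver/name")); // only the receiver's name
    }
}
```

With the bare name, a rule applies to both occurrences; with the path, it applies only to the occurrence under receiver.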

Let us now observe, with reference to figure 4, the operation of an arrangement according to a preferred embodiment of the invention. In the top row, MAP 404 is created. The original MAP information can be contained in the processing component, or it can be loaded from an external source, for instance from the XML document 401 as is illustrated in figure 4. The MAPfactory 402 reads the MAP information 401 provided in XML form. In this embodiment, all original MAP information 401 is put to use by means of the parser 403. The type of the employed parser is not important from the point of view of the invention, which means that it can be any type of parser that is best suited for the purpose in question, for instance a SAX parser. The MAP 404 itself is created by means of the original MAP information 401, the MAPfactory 402 and the parser 403.

The lower row in the drawing describes in more detail the operation of the invention that was already discussed with reference to figure 3. The source data 405 in XML form is first fed to the EngineFactory 406, which forms the tree to be run, i.e. the EngineTree 408. The EngineFactory 406 obtains the desired parts of the source data 405 by means of the employed parser 407. If a SAX parser is used at step 407, the EngineFactory 406 must first collect a sufficient number of SAX events in order to know which pieces of the MAP information are needed. However, according to the invention, the parser 407 can be replaced by a particular speed-optimized parser that fetches, on the basis of the information contained in MAP 404, only those parts of the source data 405 in the XML form that are needed in order to run the desired program components. Thus the source data 405 is processed faster than with the prior art. Speed is important for example in real-time systems, such as mobile services.
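The application does not disclose the internals of the speed-optimized parser. As a rough approximation of the idea, the following sketch uses the pull parser (StAX) of the Java standard library, reads only the elements named in a given set and stops as soon as all of them have been found, so that the rest of the source data is never parsed. The class name SelectiveReaderSketch, the method readNeeded and the sample element names are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration of reading only the parts of the source data that the rules need.
final class SelectiveReaderSketch {

    /** Maps each needed element name to the value of the given attribute on its first occurrence. */
    static Map<String, String> readNeeded(String xml, Set<String> neededElements, String attribute) {
        Map<String, String> found = new LinkedHashMap<>();
        Set<String> remaining = new HashSet<>(neededElements);
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            try {
                // Stop pulling events as soon as every needed element has been seen.
                while (!remaining.isEmpty() && reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT
                            && remaining.remove(reader.getLocalName())) {
                        found.put(reader.getLocalName(), reader.getAttributeValue(null, attribute));
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not read the source data", e);
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<order><id value=\"7\"/><note value=\"x\"/><total value=\"42\"/></order>";
        Set<String> needed = new HashSet<>();
        needed.add("id");
        needed.add("total");
        System.out.println(readNeeded(xml, needed, "value"));
    }
}
```

In an arrangement of the kind described, the set of needed elements would come from MAP 404 rather than from the caller.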

MAP 404 transmits to the EngineFactory 406 information as to what kind of program the program to be run should be. On the basis of the obtained information, the EngineFactory 406 forms an EngineTree 408, which is a tree-like structure. Thereafter the program contained by the EngineTree 408 can be run by means of the EngineRunner 409. MAP 404 transmits to the EngineTree 408 a pointer to itself. By means of said pointer, the EngineTree 408 can point directly to MAP 404. When the EngineTree 408 has pointers to the elements of MAP 404, the EngineTree 408 receives the required running code directly from MAP 404. The source data 405 in the XML form is parsed according to the created MAP 404. The parsing of the source data 405 and MAP 404 operate together, so that there is first searched the root element of the XML source data 405, and then the MAP 404 is asked how it should be processed and what information of the source data 405 there is next required. The parsing is continued as far as this information, and always when the right piece of information is found, an inquiry is sent to the MAP 404 in order to find out how it should be processed.

Parts of the source data 405 can be run by the EngineRunner 409 before the whole of the source data 405 is available and parsed. This means that the source data 405 to be processed can be, for instance, a data flow in the XML form. The program contained by the EngineTree 408 can be run by the EngineRunner 409 while the EngineFactory 406 is still creating the EngineTree 408. The EngineTree 408 waits for running instructions, and the process continues when the necessary information is at hand. The EngineTree 408 transmits information further to chosen program components, which can be any suitable programs.

Claims

1. A method for processing XML source data by a software component (304), characterized in that

- there is established a processing component (303, 406) for processing source data, and XML source data is parsed by said processing component,

- there is established a set of rules (302, 404) that describe the XML source data and its processing,

- the XML source data (301, 405) is modified in said processing component (303, 406), in a way defined by the set of rules (302, 404), into a form required by the software component (304), and

- at least one of said modified source data and information connected therewith are transferred, in a way defined in said set of rules (302, 404), to be available for the software component (304) to process.

2. A method according to claim 1, characterized in that said set of rules (302, 404) contains an exact description as to what software components (304) are employed for using modified source data, what kind of data the employed XML source data (301, 405) is and what is done to the XML source data.

3. A method according to claim 1, characterized in that said set of rules (302, 404) is at least partly available for use already before the modifying proper of the XML source data.

4. A method according to claim 1, characterized in that said set of rules (302, 404) is at least partly established at the same time as the XML source data is being modified.

5. A method according to claim 1, characterized in that contents of said set of rules (302, 404) are reused later.

6. A method according to claim 1, characterized in that on the basis of the XML source data and said set of rules, there is created an executable structure (407), and when executing (303, 408) said structure, there are used existing software components (304).

7. A system for processing XML source data (301, 405), which system comprises a software component (304) that requires its input data to appear in a certain form, characterized in that the system comprises

- means for establishing a set of rules (302, 404) that describe the XML source data and its processing, and

- a processing component arranged to modify the XML source data (301, 405) according to said set of rules (302, 404) for modified source data to appear in the form required by the software component (304), and

- the software component (304) arranged to process at least one of modified source data, information associated therewith and information defined in said set of rules (302, 404), so that the software component is itself arranged to be maintained unchanged.

8. A system according to claim 7, characterized in that it comprises information (401) required for establishing said set of rules and means for creating (402) said set of rules in a form required by the processing component, and means (403) for reading said information required for establishing said set of rules in order to create said set of rules (302, 404).

9. A system according to claim 7, characterized in that the software component (304) is a separate program with certain defined interfaces.

10. A system according to claim 7, characterized in that it contains a processing component (303, 406) for receiving XML source data (301, 405) and said set of rules (302, 404) as well as for creating an executable structure (408) according to said set of rules (302, 404).

11. A system according to claim 10, characterized in that it contains an executing component (409) for executing said created executable structure (408).

12. A computer program for processing XML source data, which computer program comprises a software component (304) that requires its input data to appear in a certain form, characterized in that the computer program contains

- means for establishing a set of rules (302, 404) that describe the XML source data and its processing, and

- a processing component arranged to modify XML source data (301, 405) according to said set of rules (302, 404), for modified source data to appear in the form required by the software component (304), and

- the software component (304) arranged to process at least one of modified source data, information associated therewith and information defined in said set of rules (302, 404), so that the software component (304) is itself arranged to be maintained unchanged.