WO2011128512A2 - Method and apparatus for a control device - Google Patents

Method and apparatus for a control device

Info

Publication number
WO2011128512A2
WO2011128512A2 (application PCT/FI2011/050333)
Authority
WO
WIPO (PCT)
Prior art keywords
variables
variable
expression
expressed
information
Prior art date
Application number
PCT/FI2011/050333
Other languages
French (fr)
Other versions
WO2011128512A3 (en)
Inventor
Antti Rauhala
Original Assignee
Antti Rauhala
Priority date
Filing date
Publication date
Application filed by Antti Rauhala filed Critical Antti Rauhala
Publication of WO2011128512A2 publication Critical patent/WO2011128512A2/en
Publication of WO2011128512A3 publication Critical patent/WO2011128512A3/en

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution

Definitions

  • the present invention relates to control devices and especially to an apparatus that implements control operations according to sampled information collected from one or more information sources.
  • an apparatus used as a control device inputs a sample of measured and/or stored information, and on the basis of the sample automatically generates a control indication that maps to a defined control operation of the apparatus.
  • the desired sample information may vary statistically. Accordingly, selection of an operation of the apparatus is not deterministic but requires the use of estimation. This means that parameters of a system are modeled as random variables of known a priori distribution, and the model describes the physical system scenario in which the parameters apply.
  • Performance of the apparatus corresponds with the validity of the selected control operations, i.e. the accuracy of the estimation.
  • Conventionally one may try to improve the accuracy of estimations by increasing the number of input samples and/or by increasing the brute processing power of the device performing the estimation.
  • many times, however, the sampling is not under the control of the operator of the apparatus, so one has to achieve results with the material at hand.
  • an increase in processing power is expensive and may in any case prove inadequate when the sampled environment is more complex.
  • complexity may rise exponentially with the number of system variables. This means that estimation of most such systems is in practice impossible; the combined processing power of the whole world would not be able to perform all the necessary computations in reasonable time.
  • the technical challenge is thus how to improve the performance of a control device that determines control operations according to statistically varying samples, without significantly increasing the processing required in the apparatus.
  • An object of the present invention is thus to provide a method and an apparatus for implementing the method so as to solve or at least alleviate the above problems.
  • the objects of the invention are achieved by a method, an apparatus and a computer program product, which are characterized by what is stated in the independent claims.
  • the preferred embodiments of the invention are disclosed in the dependent claims.
  • the invention provides a method that comprises defining for a system a model for re-expressing system variables through re-expressed variables, a re-expressed variable expressing a state combination of two or more other variables; inputting a sample of a system; translating the sample into a binary form that comprises a plurality of system variables; using the model to re-express the determined system variables through the re-expressed variables; using the re-expressed variables to predict a system variable that is not available through the sample; and implementing a control operation determined according to the predicted system variable.
  • the invention provides an apparatus that comprises means for defining for a system a model for re-expressing system variables through re-expressed variables, a re-expressed variable expressing a state combination of two or more other variables; means for inputting a sample of a system; means for translating the sample into a binary form that comprises a plurality of system variables; means for using the model to re-express the determined system variables through the re-expressed variables; means (202) for using the re-expressed variables to predict a system variable that is not available through the sample; and means for implementing a control operation determined according to the predicted system variable.
  • the invention provides a computer program product readable by a computer and encoding a computer program of instructions for executing a computer process controlling functions in an information system.
  • the process includes the steps of defining for a system a model for re-expressing system variables through re-expressed variables, a re-expressed variable expressing a state combination of two or more other variables; inputting a sample of a system; translating the sample into a binary form that comprises a plurality of system variables; using the model to re-express the determined system variables through the re-expressed variables; using the re-expressed variables to predict a system variable that is not available through the sample; and implementing a control operation determined according to the predicted system variable.
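The claimed chain of steps can be sketched as a minimal processing pipeline. All function names, the bit layout, and the toy statistics below are illustrative assumptions for the sketch, not taken from the patent:

```python
# Sketch of the claimed pipeline: binarize a sample, re-express it with
# pair expressions, then predict an unavailable bit. A "pair expression"
# here is a learned bit that is true when two member bits are in a
# given state combination (hypothetical encoding).

def translate_to_binary(sample, layout):
    """Map a raw sample dict to a tuple of bits using a fixed bit layout."""
    return tuple(1 if layout[i](sample) else 0 for i in range(len(layout)))

def re_express(bits, pair_expressions):
    """Append one bit per pair expression: 1 iff both member bits match."""
    extra = [1 if bits[i] == a and bits[j] == b else 0
             for (i, a, j, b) in pair_expressions]
    return bits + tuple(extra)

def predict_missing(bits, known_mask, stats):
    """Toy predictor: return the historical mean of each unknown bit."""
    return [stats[i] if not known_mask[i] else bits[i]
            for i in range(len(bits))]

# Hypothetical bit layout: [is_young, has_income, paid_back]
layout = [lambda s: s["age"] < 35,
          lambda s: s["income"] >= 2000,
          lambda s: s.get("paid_back", False)]
bits = translate_to_binary({"age": 28, "income": 2500}, layout)
expanded = re_express(bits, [(0, 1, 1, 1)])   # E1 = (young AND has_income)
pred = predict_missing(bits, [True, True, False], [0.5, 0.6, 0.8])
print(bits, expanded, pred)   # (1, 1, 0) (1, 1, 0, 1) [1, 1, 0.8]
```

In the real apparatus the prediction step is a naive Bayesian predictor over the re-expressed variables; the mean-based predictor here only marks where that step sits in the chain.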
  • the invention can be utilized to improve accuracy of prediction.
  • the improved accuracy of prediction permits the apparatus carrying out the inventive method, i.e. the control device, to operate in a more efficient manner.
  • the improved efficiency is a result of reduction in the number of trial-and-error cycles.
  • a search engine, also known as an information retrieval server (IR server)
  • a conventional IR server receives queries from a number of client terminals. The queries contain key words, some or all of which must be contained in the documents that the IR server is to return to the client terminals that send the queries.
  • the hits (list of retrieved documents) returned by the conventional IR server primarily depend on the key words of the query.
  • the hit list may be complemented by documents from sponsoring partners. But the hit list returned by the conventional IR server is the same regardless of the identity of the client terminal that sent the query, or the user of that terminal.
  • the hit list returned by the conventional IR server is not influenced by the time, day or location of the inquiring terminal or user. For instance, if a young user requests information on buses, it is likely that the young user is interested in bus routes and schedules. On the other hand, if a business manager requests information on buses, it may be more likely that they are interested in the technical details of buses or the economics of running a bus service.
  • Figure 1 provides a functional description of an apparatus
  • Figure 2 illustrates basic functional entities within a processing engine
  • Figure 3 illustrates elements of an exemplary pair expression driven learning process of a learning engine
  • Figure 4 illustrates steps in a method for creating a model for a system in a language learner of the apparatus
  • Figure 5 illustrates elements of a prediction process associated with the learning process
  • Figure 6 illustrates an exemplary process of the apparatus
  • Figure 7 illustrates an exemplary hardware configuration for the implementation of the apparatus.
  • Figure 1 provides a functional description of an apparatus (A) 100 according to an embodiment of the invention.
  • the apparatus 100 inputs a plurality of samples s_i.
  • a sample s refers to a block of information that is coded in electronic form to allow computer devices and computer software of a processing engine 101 to convert, store, process, transmit, and/or retrieve in the processes of the apparatus 100.
  • a sample s is typically a data record that comprises one or more information elements Xk that the processing engine 101 is able to automatically detect and treat as separate parts of the sample.
  • processing engine relates to an information technologies component of the apparatus, and therefore refers to any computer related equipment or interconnected communications system or subsystems of equipment that may be used in the acquisition, storage, manipulation, management, movement, control, display, switching, interchange, transmission, or reception of information, and includes software, firmware, and/or hardware.
  • Automatically in this context means that the recognition and treatment functions may be performed with the software, firmware, and/or hardware of the processing engine 101 without simultaneous support operations with the human mind.
  • the one or more information elements Xk of the sample carry measured or empirical data that the processing engine 101 converts to a defined binary format.
  • the resulting binary vectors are used as system variable values for prediction that provides an estimate for one or more system variables that are not available through the sample data.
  • a system variable value may be considered non-available, for example, when it is missing from the sample, or is non-trusted and therefore not directly applicable.
  • One or more of these estimates are output as a control indication c_i to a mapping engine 102 of the apparatus. Configuration and operations of the processing engine 101 will be discussed in more detail later in this description.
  • the mapping engine 102 comprises an input interface to the processing engine 101 and may comprise one or more additional input interfaces to other internal units of the apparatus or to external entities.
  • the mapping engine 102 also comprises an output interface to an operations engine 103, and mapping means for determining a control operation call coc_i on the basis of information input via the input interfaces.
  • the mapping means may comprise a simple one-to-one mapping, for example a table, where one control indication c_i corresponds with one control operation call coc_i.
  • the mapping means may also comprise a more complex algorithm that, on the basis of the control indication c_i input by the processing engine 101, and possibly some other internal and/or external data input via the one or more additional input interfaces, computes one or more control operation calls coc_i.
  • the mapping engine outputs the one or more determined control operation calls coc_i to the operations engine 103.
  • the operations engine 103 comprises a library of control operations and, in response to a control operation call coc_i from the mapping engine 102, implements a corresponding control operation co_i.
  • the operation of the apparatus thus comprises a consecutive sequence of control operations initiated on the basis of sample data fed into the apparatus.
  • the process that is controlled by the control operations may be automatic or manual. However, the operations of the apparatus are automatic.
  • the mapping engine 102 and the operations engine 103 are relatively simple control elements of the apparatus. Improving the relevancy and/or accuracy of the apparatus operations as a whole thus rests largely on the accuracy of the control indications c_i provided by the processing engine 101.
  • the processing engine 101 of Figure 1 is a learning machine that analyzes a large binary variable system to recognize patterns within variables, and may utilize the knowledge of these patterns to compress individual samples and make predictions on unknown variables.
  • Figure 2 illustrates in more detail the basic functional entities within the processing engine 101 of Figure 1.
  • the processing engine 101 comprises a learning engine 201 and a predicting engine 202.
  • Figure 3 illustrates in more detail elements of an exemplary pair expression driven learning process of the learning engine 201.
  • This historical information I may originate from one or more sources and be in one or more formats.
  • numeric/continuous variables are typically normalized and then fitted into some limited number of categories. Operations to transform free format data to binary vectors are generally known to a person skilled in the art and will not be discussed here in more detail.
  • the binary information is given to a language learner Λ 304, which iteratively constructs the referred system model M(L, S', σ) 305.
  • in the following, the basic principles associated with the language learner Λ are discussed in more detail.
  • The sum of the system variable entropies is called here the naive system entropy, because it is equivalent to the system entropy when the 'naive' assumption of statistical independence holds.
  • Statistical independence refers here to a situation where an occurrence of one event makes it neither more nor less probable that the other occurs. The name is adopted from the context of naive Bayesian classification, where the statistical independence assumption is called naive because, for the sake of reduced complexity, it is systematically assumed for systems even if it might not fully apply.
  • the upper boundary for the system entropy can thus be set as the naive system entropy: H(S) <= H(X_1) + H(X_2) + ... + H(X_n).
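The bound above (the joint entropy never exceeds the sum of the per-variable entropies, with equality under independence) can be checked numerically. This is a minimal sketch with made-up sample data, not code from the patent:

```python
from collections import Counter
from math import log2

def entropy(probabilities):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

def naive_and_true_entropy(samples):
    """samples: list of equal-length bit tuples. Returns the naive
    system entropy (sum of per-variable entropies) and the true joint
    entropy, both estimated from the empirical distribution."""
    n = len(samples)
    naive = 0.0
    for i in range(len(samples[0])):
        counts = Counter(s[i] for s in samples)
        naive += entropy([c / n for c in counts.values()])
    joint = Counter(samples)
    true = entropy([c / n for c in joint.values()])
    return naive, true

# Two perfectly correlated bits: each carries 1 bit of entropy on its
# own, but the joint system also has only 1 bit, so the naive bound
# overestimates by the mutual information.
data = [(0, 0), (1, 1), (0, 0), (1, 1)]
naive, true = naive_and_true_entropy(data)
print(naive, true)   # 2.0 1.0
```

The gap between the two values is exactly the redundancy that re-expression tries to squeeze out of the system.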
  • the reduced entropy in a subsystem translates into a reduction of entropy in the entire system.
  • the patterns present in a subsystem can be applied to the entire system. This approach provides means for analyzing the system at a desired level, and thus for effectively avoiding the curse of dimensionality.
  • the size of an observed subsystem can be considered as a variable window or entropy window, as it determines how big the patterns are that can be recognized with it. While a two-variable window may expose many of the system regularities, complex systems may contain patterns which cannot be spotted with small entropy windows. An example of such a pattern is a relation where a variable is the result of XOR:ing two other variables. If all variable probabilities equal 0.5, the window must be at least 3 variables wide to be able to recognize the mutual information/dependency between the variables.
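The XOR example can be reproduced directly: every 2-variable window over a uniform XOR system measures zero mutual information, while the 3-variable window reveals a full bit of redundancy. A small sketch (the helper names are illustrative):

```python
from collections import Counter
from math import log2
from itertools import product

def entropy_of(counts, n):
    """Shannon entropy in bits from a Counter of outcomes over n samples."""
    return -sum((c / n) * log2(c / n) for c in counts.values())

def mutual_information(samples, idx_a, idx_b):
    """Empirical I(A;B) = H(A) + H(B) - H(A,B) for two variable indices."""
    n = len(samples)
    ha = entropy_of(Counter(s[idx_a] for s in samples), n)
    hb = entropy_of(Counter(s[idx_b] for s in samples), n)
    hab = entropy_of(Counter((s[idx_a], s[idx_b]) for s in samples), n)
    return ha + hb - hab

# C = A XOR B over the full truth table (uniform probabilities 0.5).
samples = [(a, b, a ^ b) for a, b in product((0, 1), repeat=2)]

# Every 2-variable window sees zero mutual information...
print(round(mutual_information(samples, 0, 2), 9))   # 0.0
# ...but the 3-variable window exposes 1 bit of redundancy:
n = len(samples)
h_all = entropy_of(Counter(samples), n)
naive = sum(entropy_of(Counter(s[i] for s in samples), n) for i in range(3))
print(naive - h_all)   # 1.0
```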
  • the entropy summed over all variables fitting the window appears to be related to the entropy of the variables forming the patterns.
  • traces of the pattern, or even the entire pattern, can be detected with much smaller entropy windows. For example, consider a system of n random variables and one 'oddness' variable, which is true only if an odd number of the other variables are true. If the probabilities of the variables in this system are 0.5, an (n+1)-sized window is required to spot the mutual information.
  • the subset acts as a w-sized window.
  • the maximum amount of information that can be extracted with a single w-sized window is: I_max^w = max I(X_i1; X_i2; ...; X_iw), where the maximum is taken over all w-variable subsets of the system.
  • This value can be used to limit the maximum mutual information I_max^(k,w) that can be extracted with k windows of w variables each; this measurement is limited by k * I_max^w and ultimately by the system mutual information I(S).
  • the actual value of I_max^(k,w) still depends on the properties of the system. If the properties of an n-variable system allow I_max^(k,w) to approximate the system mutual information so that I_max^(k,w) ≈ I(S) with moderately growing window size w << n and with a moderately growing window amount k, it has been noticed that the system analysis approach can be applied very effectively.
  • Susceptibility of a system to the system analysis is related to the number of states that need to be traced to reveal a degree of the system's mutual information. This traced state count and the system state count (which behaves exponentially) are different things, and depending on the system properties the growth of the ideal traced state count may behave differently, for example linearly. In a system that is static, except for point-like random disruptions and their consequences within a time period t, for each system variable X it is possible to limit a subsystem S that has all the variables whose disruptions will hold effect with the variable X.
  • the exclusiveness between C and E_1 and between C and E_4, and the implication from E_2 and E_3 to C, are easy to detect with the 2-variable window.
  • the re-expression raises the effective window size to 3, and similar kind of operations can be used to grow the effective window size to any arbitrary value and therefore to detect patterns of arbitrary size. In this sense re-expression provides a way to virtually increase the traced window sizes.
  • the window size is grown around regular subsystems, where regularity is revealed by elevated level of subsystem mutual information.
  • the learning engine 201 applies smart synthesis by targeting surprisingly common state combinations for re-expression. Surprisingly common refers here to a situation where the occurrences of the state combination exceed a predefined level.
  • a pair expression corresponds here to a processed variable that may be used to express a state combination of two other variables. This expression mechanism is further supported by a technique called variation reduction, which virtually reduces the expressed variables by making them undefined whenever the formed expression is true.
  • the exemplary learning engine 302 of the present embodiment uses 2-sized windows for system analysis and it implements the concept of system synthesis by using pair expressions for re-expressing the original data as lower-entropy variables.
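The interplay of a pair expression and variation reduction can be sketched as follows. The encoding (a pair expression bit plus `None` for the now-undefined member bits) is an illustrative assumption, not the patent's exact scheme:

```python
# A pair expression E fires when two member variables are in a given
# state combination; variation reduction then marks the member bits
# undefined (None) so they no longer contribute variation (entropy).

def apply_pair_expression(bits, i, a, j, b):
    """Return (E, reduced bits): E = 1 iff bits[i] == a and bits[j] == b;
    when E fires, the member bits are replaced by None (undefined)."""
    e = 1 if bits[i] == a and bits[j] == b else 0
    reduced = list(bits)
    if e:
        reduced[i] = reduced[j] = None
    return e, tuple(reduced)

# The combination (bit0 = 1, bit1 = 1) is the "surprisingly common"
# state combination being re-expressed in this toy example:
e, reduced = apply_pair_expression((1, 1, 0), 0, 1, 1, 1)
print(e, reduced)   # 1 (None, None, 0)
e, reduced = apply_pair_expression((1, 0, 0), 0, 1, 1, 1)
print(e, reduced)   # 0 (1, 0, 0)
```

When the expression fires, the two member bits become implicitly known and need not be encoded; when it does not fire, they stay explicit.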
  • Figure 4 illustrates steps in a method for creating the model M for the system S in the language learner Λ of the learning engine 201.
  • the language learner Λ checks whether a surprisingly common state combination Cs (402) is found from S. This may be done by minimizing, over all system variable state pairs (v_i, v_j), a cost property based on the joint probability p(v_i ∧ v_j).
  • Minimizing this value effectively drives the estimated maximum system entropy down, thus optimizing the system for naive compression, and drives the system variables to be statistically more independent (in a weighted manner), thus optimizing the system for naive predicting.
  • the more common a state combination is, the more statistically independent it will be. Also, the more common the state combination is, the more accurate the dependency estimate will be and the more significant the combination is for the total prediction error.
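One concrete way to surface "surprisingly common" state combinations is to compare each observed pair frequency against the frequency predicted under independence. This scoring is an illustration of the idea only; the patent minimizes a related entropy-based property rather than this exact ratio:

```python
from collections import Counter
from itertools import combinations

def surprising_pairs(samples, threshold=1.5):
    """Return state combinations ((i, a), (j, b), ratio) whose observed
    frequency exceeds the independence prediction p(a) * p(b) by at
    least `threshold`. Illustrative scoring, not the patent's formula."""
    n = len(samples)
    width = len(samples[0])
    marginals = [Counter(s[i] for s in samples) for i in range(width)]
    found = []
    for i, j in combinations(range(width), 2):
        joint = Counter((s[i], s[j]) for s in samples)
        for (a, b), count in joint.items():
            expected = marginals[i][a] * marginals[j][b] / n
            if expected and count / expected >= threshold:
                found.append(((i, a), (j, b), count / expected))
    return found

# Bits 0 and 1 always co-occur; bit 2 is independent noise, so only
# the two co-occurring state combinations stand out.
data = [(1, 1, 0), (1, 1, 1), (0, 0, 0), (0, 0, 1)]
print(surprising_pairs(data))
```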
  • a variable V in this new encoding is explicitly declared only under some condition C, called the context.
  • its ideal codeword size is determined by its entropy, which is determined by its probability. Because the state of the variable needs to be encoded only under condition C_i, the entropy should be based on the probability p(V_i | C_i) instead of p(V_i).
  • when variable Y is predicted, knowing that variable V_i was both true and explicitly declared, it makes more sense to use the conditional probability p(Y | V_i ∧ C_i), which utilizes more information, and so the prediction is more accurate.
  • the introduction of expression E_1 may be considered as a translation of the existing system S = {V_1, ..., V_n} into a new system S^1, which includes both the pair expression and the interesting conditional variables, while the non-interesting and non-conditional original variables are not included as separate variables.
  • This model of system definition means that the system and its properties are changed when one introduces a pair expression. If one seeks to introduce another pair expression E_2, the expression will be applied to the modified system S^1 instead of the original system S.
  • each system can be expressed in the form S^i = {E_1, ..., E_i, V_1|C_1, ..., V_k|C_k} or recursively S^i = {E_i, S^(i-1)}.
  • the language learner Λ of the present embodiment may eliminate the redundancy by developing rules for identifying redundantly declared variable states, marking them implicit, omitting them from encoding, and reformulating the system to have conditional variables defined only when their state is explicitly declared, so that the re-expressed system is of the form S^i = {E_i, V_1|C_1, ..., V_k|C_k}. This framework is used for deriving the translated system S^i.
  • For each system S^i there is typically one variable which can be asserted never to be expressed by other variables or made implicit by the variation reduction mechanism, and that is the newest pair expression E_i, whose state should be read and encoded first. After reading the bit holding the state of E_i, one can transform system S^i bits to system S^(i-1) bits by dropping the E_i bit and by inserting the bits determined by the E_i state (guaranteeing that S^(i-1) is explicit). Now the same conclusion about encoding priority can be repeated for the S^(i-1) bit sequence, and for all the systems and bit sequences until the original system's bit sequence is recovered. This means that the natural encoding order is preferably from the newest expression to the oldest expression and then to the original variables.
  • the language learner Λ may use variable determination rules to mark variable states as implicitly known. These variable determination rules may comprise:
  • the state of the lower priority variable is known to be the complement of the state expressed by E_i.
  • On the extreme right, under column S^1, one can see the binary form of the system S^1 states.
  • system states may be encoded from newest expression to the oldest up to the original variables, and all implicitly known variable states (bits) may be left out.
  • E_2 will modify the system S^1 further to a form S^2 that contains E_2 together with the conditionally reduced variables.
  • the second expression E_2 will express, in a logical syntax, the state combination in which B and C are both true.
  • when E_2 is true, B and C are implicitly true, A is implicitly false and E_1 is implicitly false, as E_1 and E_2 are exclusive.
  • Similarly, it is then possible to provide a table that describes how the original system states are transformed to translated states.
  • the system S can be re-interpreted by using the pair expressions E_1 and E_2 so that the variables A, B and C in their conditionally reduced form no longer vary. Instead, the reduced A', B' and C' are always zero and have zero entropy. This effectively turns a 3-variable system with 8 possible states into a 2-variable system with 4 possible states. Because it is known that E_1 and E_2 exclude each other, the number of possible states can be reduced to 3. This state reduction results in a heavy compression rate for this heavily regular system. In fact, when the bits of E_1 and E_2 are encoded with codewords whose sizes are according to their entropy, the encoded system state size is ideal and equals the system entropy.
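The 8-states-to-3-states reduction can be reproduced by enumeration. The concrete expressions below are chosen to match the spirit of the worked example (two exclusive pair-style expressions covering a heavily regular system), not its exact definitions:

```python
from itertools import product

# Toy heavily regular 3-variable system: only three of the eight
# possible (A, B, C) states ever occur.
occurring = {(1, 0, 0), (0, 1, 1), (0, 0, 0)}

# Hypothetical expressions: E1 fires on (A=1, B=0, C=0) and E2 fires
# on (A=0, B=1, C=1); they can never fire together.
def re_express(state):
    a, b, c = state
    e1 = int(a == 1 and b == 0 and c == 0)
    e2 = int(a == 0 and b == 1 and c == 1)
    return (e1, e2)   # reduced A', B', C' are implied and omitted

translated = {re_express(s) for s in occurring}
print(len(list(product((0, 1), repeat=3))))   # 8 raw states
print(sorted(translated))                     # [(0, 0), (0, 1), (1, 0)]
# E1 and E2 are exclusive, so (1, 1) never occurs: 3 reachable states
# encoded in 2 bits instead of 3.
```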
  • the language learner Λ has an initial language L^0, which does not comprise any pair expressions, and for which S^0 = S applies.
  • an expression is understood as a function from an initial system state s^i to a translated system state s^(i+1), so that the expression is E^(i+1): S^i -> S^(i+1).
  • a system state s can be expressed in a translated form of any generation i, so that s^i = E^i(s^(i-1)), which is equal to s^i = E^i(E^(i-1)(... E^1(s) ...)).
  • occurrence statistics for the variable states and their contexts are calculated and maintained. This numerical information can be used to calculate, for each variable V_i and context C_i, the probabilities p(V_i | C_i).
  • One model for checking whether the entire language L has been formed (414) is greedy search, where the benefit of adding each potential pair expression E is measured with a benefit function b(E), and the best possible pair expression is added if its benefit value is above a certain threshold t. If there are no pair expressions above threshold t, the expression adding stops and the language is complete. So if cand(L) is defined as the set of all possible pair expressions for language L, the pair expression learner function selects from cand(L) the expression with the highest benefit.
  • in the following, the benefit function is referred to simply as b(E), assuming implicitly that the used probabilities are based on the statistics σ that again are calculated through the examined language L^i and the data s_t.
  • Other models and means for checking the appropriate extent of re-expressions for the system may be applied without deviating from the scope of protection.
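The greedy learner loop can be sketched end to end. The benefit function here (excess count over the independence prediction) is a stand-in for the patent's entropy-based b(E), and the variation-reduction encoding is an assumption; only the loop structure (score candidates, add the best while above threshold, re-express, repeat) mirrors the description:

```python
from itertools import combinations

def benefit(samples, i, a, j, b):
    """Illustrative benefit: how many more times the combination occurs
    than independence predicts (stand-in for an entropy-based b(E))."""
    n = len(samples)
    pi = sum(1 for s in samples if s[i] == a) / n
    pj = sum(1 for s in samples if s[j] == b) / n
    observed = sum(1 for s in samples if s[i] == a and s[j] == b)
    return observed - pi * pj * n

def learn_language(samples, t=0.5, max_expressions=4):
    """Greedy search: repeatedly add the best pair expression while its
    benefit exceeds t; variation reduction marks member bits undefined
    (None) wherever the new expression fires."""
    language, width = [], len(samples[0])
    for _ in range(max_expressions):
        score, best = max(
            (benefit(samples, i, a, j, b), (i, a, j, b))
            for i, j in combinations(range(width), 2)
            for a in (0, 1) for b in (0, 1))
        if score <= t:
            break
        language.append(best)
        i, a, j, b = best
        new = []
        for s in samples:
            fired = s[i] == a and s[j] == b
            s = list(s) + [1 if fired else 0]
            if fired:
                s[i] = s[j] = None
            new.append(tuple(s))
        samples, width = new, width + 1
    return language

data = [(1, 1, 0), (1, 1, 1), (0, 0, 0), (0, 0, 1)] * 5
lang = learn_language(data, max_expressions=1)
print(lang)   # [(0, 1, 1, 1)]: bit0=1 together with bit1=1 is common
```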
  • the pair expression learning or inclusion mechanism of Λ minimizes the naive entropy of the re-expressed system and forms an optimal language-statistics pair (L, σ) for the prediction work.
  • these aims are basically different sides of the same goal; they both benefit from the reduction in the naive system entropy.
  • an arbitrary number of pair expressions cannot, however, be applied.
  • for compressing, unlimited introduction of expressions would bloat the serialized language definition horribly, while the consequence for predicting is overfitting.
  • ΔL is the entropy of the pair expression entry, i.e. it expresses the growth of the language definition in bits.
  • a further way to approach the problem is to have the naive assumption of statistical independence as the base assumption, and use a pair expression only in the situation when the likelihood that the variables are not independent is significantly higher than the likelihood that the variables are independent. This can be expressed as a likelihood ratio condition.
  • Figure 5 illustrates in more detail elements of an exemplary pair expression enhanced prediction process of the predicting engine 202.
  • Predicting can usually be expressed as requesting the probability for variable Y when the states x_1, x_2, ..., x_n of variables X_1, X_2, ..., X_n are known. This can be expressed as a conditional probability p(Y | x_1, x_2, ..., x_n), which under the naive independence assumption can be given in the form p(Y | x_1, ..., x_n) ≈ p(Y) * p(x_1 | Y) * ... * p(x_n | Y) / (p(x_1) * ... * p(x_n)).
  • conditional variables can also be referred to with the notation V|C. The conditional variables are either exclusive, or for each two conditional variable states v_i, v_j the statistical dependency d(v_i, v_j) is constrained to be close to 1. Both of these conditions reduce the systematic error that results from the naive assumption.
  • condition C which describes the state of some set of variables in system S.
  • the task is to know the probability of unknown variable Y under this condition C.
  • variables of the form V' are used as a notation for conditional variables of the form V|C. This means that when this notation is removed, the previous naive approximate equation changes its form to use the conditional probabilities of the declared variables.
  • When the 'balancing' positive state is removed, the statistical properties of the variable change, which brings the original below-one state dependency value closer to 1. As long as the state's dependency value remains below one, the variable combination will be subject to starvation as n approaches infinity, until there are no state combinations with either below-one or above-one dependency values left (instead the expression reduces to expressing individual system states with exclusive individual expressions). It is also worth noting that the lower the original below-one dependency value is, the higher the balancing state combination's dependency value will be, and the earlier the below-one dependency value will be neutralized by pair expression formation.
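The naive Bayesian prediction step described above can be sketched with standard naive Bayes over the binarized variables. The loan-style data, variable names, and add-one smoothing below are illustrative assumptions; only the conditional-probability form p(Y) * prod p(x_i | Y), normalized over both states of Y, follows the description:

```python
from math import prod

def naive_bayes(samples, target_idx, evidence):
    """p(Y=1 | evidence) under the naive independence assumption, with
    add-one smoothing to avoid zero counts. `evidence` maps a variable
    index to its observed bit."""
    def score(y):
        group = [s for s in samples if s[target_idx] == y]
        p_y = len(group) / len(samples)
        likelihood = prod(
            (sum(1 for s in group if s[i] == v) + 1) / (len(group) + 2)
            for i, v in evidence.items())
        return p_y * likelihood
    s1, s0 = score(1), score(0)
    return s1 / (s1 + s0)

# Hypothetical training bits: (young, low_income, paid_back)
history = [(1, 1, 0), (1, 1, 0), (1, 0, 1), (0, 0, 1),
           (0, 0, 1), (0, 1, 0), (0, 0, 1), (1, 0, 1)]
# Probability of payback for a young, low-income applicant:
p = naive_bayes(history, 2, {0: 1, 1: 1})
print(round(p, 3))   # 0.175
```

In the apparatus the same predictor runs over the re-expressed variables, where the near-independence enforced by pair expression learning keeps the naive assumption's systematic error small.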
  • Figure 5 illustrates elements of a prediction process associated with the learning process of Figures 3 and 4.
  • sample information s_i 501 on a target entity is input to the predicting engine.
  • This sample information s_i is transformed with the language preprocessor L_p 502 into the binary format s 503 that was applied also in the learning engine 201.
  • the sample information s_i may originate from the same one or more sources as the training data s_t used by the learning engine, and be in the same format as the training data sets.
  • the sample information s_i may alternatively originate from other sources and/or be in another format than the training data sets.
  • in the latter case, the language preprocessor L_p 502 needs to comprise conversion algorithms for both formats of s_i and s_t, or a conversion algorithm for converting between the formats of s_i and s_t.
  • the preprocessed information s 503 is then forwarded to the re-expression phase 504, where it is recursively re-expressed with the pair expression language L that was generated by the learning engine 201 on the basis of the training data s_t.
  • This re-expressed sample s' comprises one or more unavailable variables that need to be estimated by prediction.
  • this prediction is made with a naive Bayesian predictor 506 that outputs the re-expressed sample s_i' 507 complemented with probabilities p for the unavailable parameters. Operations of the Bayesian predictor 506 do not require any adjustments due to the re-expression.
  • the Bayesian predictor 506 uses the statistical information σ that was generated by the learning engine 201 on the basis of the training data s_t.
  • the predicting engine then uses one or more of these probability values as a control indication c_i to the mapping engine 102.
  • Figure 6 illustrates an exemplary process where the apparatus is used in a system that inputs complex sampled information collected from one or more information sources on a target entity (a person or a legal entity) and implements a control operation by outputting a classification indication for the target entity.
  • the exemplary classification relates to the capability of the target entity to fully pay back a bank loan granted to him.
  • the processing engine inputs sampled data records associated to the target entity, and computes a control indication that now indicates a probability p for the situation that the target entity will not pay back his loan.
  • the mapping engine comprises mapping between control indications and classes of classification, where one class corresponds with a level of recommendation on whether to grant the loan or not.
  • the mapping engine maps the received probability p to a class, and generates a control operation call for an output function with that particular class.
  • the operations engine of the apparatus outputs information including the determined class, e.g. through the user interface of the apparatus.
  • the situation can be described such that there is some piece of information on the customer, and the user of the system needs to create control information on whether the customer should be granted a loan or not.
  • the system first gains the necessary experience to make estimations, i.e. it forms understanding on the context on the basis of earlier experiences on one or more existing customers.
  • the information that comprises data on features of a number of training targets and experiences gained from their payback behavior is denoted with I.
  • This information may be in various formats; for example, one part may be in an Excel file and another part in a text file, perhaps exported from a database.
  • the input data is first translated (602) into binary form to form a system S of bit vectors.
  • a language preprocessor Lp, where a bit at each specific location is adjusted to have a specific meaning. For example, a bit at the end of a vector may be associated with the experience, i.e. be true when the customer did pay back his loan and false when the customer for some reason did not pay back his loan. Another bit may describe a defined personal property of the customer. For example, bit number one may be true when the customer is young. Bit number four may describe whether the customer is of limited means, and so forth.
  • the information may comprise discrete categorical properties such as whether the customer paid back the granted loan or not. Such properties are easy to bring into binary format; a defined bit may be assigned to indicate whether the customer paid or not.
  • the age of the customer represents another type of property. Allocating one bit for each age would be ineffective. Instead, age could be turned into a binary format by first dividing customers into three different age groups such as junior (less than 35 years), middle (35-55 years) and senior (55+ years). Three bits are needed, one for each of these categories, and exactly one bit is always true while the other ones are not. For example, the junior bit will be true only for bit vectors associated to young customers below 35 years of age, and for those the other bits (middle, senior) are false.
  • A similar technique can be used for coding e.g. the customer's monthly income into bits. There may be one bit for customers with lower income (<2000 € per month), one for normal income customers (2000 €-4000 €) and one for high income customers (4000+ €).
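The one-bit-per-group coding of age and income described above can be sketched as follows; the function and label names are illustrative, not from the patent.

```python
def one_hot_bin(value, edges, labels):
    """Map a numeric value to a one-hot dict over ordered bins.

    `edges` are the upper bounds of all bins except the last,
    open-ended one; exactly one label maps to True.
    """
    index = len(edges)  # default: last (open-ended) bin
    for i, upper in enumerate(edges):
        if value < upper:
            index = i
            break
    return {label: (i == index) for i, label in enumerate(labels)}

# Age groups: junior (<35), middle (35-55), senior (55+)
age_bits = one_hot_bin(29, edges=[35, 55], labels=["junior", "middle", "senior"])
# Income groups: low (<2000 €), normal (2000-4000 €), high (4000+ €)
income_bits = one_hot_bin(4500, edges=[2000, 4000], labels=["low", "normal", "high"])
```

A 29-year-old customer thus sets only the junior bit, and a 4500 € income sets only the high-income bit.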
  • the information may also comprise some other type of information, like the bank official's personal evaluation about the customer.
  • the binary information then undergoes a language learning phase (604) where a pair expression language L based on system properties is recursively constructed, the system is re-expressed and statistics from the re-expressed system are calculated. General stages of this phase have been described in detail in connection with Figures 3 and 4 above.
  • the information I can be transformed into binary form so that 3 bits are allocated for the age group (junior/middle/senior), 3 bits are allocated for the income level (low/normal/high), 1 bit for the evaluated impression (positive/negative) and 1 bit for loan eligibility.
  • the re-expression phase comprises a search for surprisingly common binary variable state pairs.
  • This surprising commonness is evaluated in this embodiment with the following equation:
  • the equation can be interpreted to represent a specially weighted statistical dependency value. With this kind of statistical analysis, the bit for juniority can be observed to be surprisingly common with the bit for the lower income level. The previous equation gives a weighted dependency value of 3.17 for this binary variable state pair.
  • the next phase is then to form a new expression to express this value.
  • the expression is like any binary variable that is true whenever the juniority bit and the low income bit are true. Following pair expression notation, the expression can be marked as <junior, low>. Part of the pair expression philosophy is avoidance of redundancy, and for this reason the states that are expressed by the new expression are then removed from the data. 1-bits for junior and low are consequently replaced by a '-' mark to mark the points from which states were removed.
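The formation of a pair expression such as <junior, low> and the removal of the expressed states can be sketched roughly as follows. This is a simplified illustration, not the patent's exact algorithm: the specially weighted equation is not reproduced in this text, so the plain statistical dependency d(a; b) = p(a & b) / (p(a) p(b)) is used instead.

```python
def dependency(rows, a, b):
    """Plain statistical dependency d(a; b) = p(a & b) / (p(a) * p(b))
    over rows of bit values, for columns a and b."""
    n = len(rows)
    p_a = sum(r[a] == 1 for r in rows) / n
    p_b = sum(r[b] == 1 for r in rows) / n
    p_ab = sum(r[a] == 1 and r[b] == 1 for r in rows) / n
    if p_a == 0 or p_b == 0:
        return 0.0
    return p_ab / (p_a * p_b)

def add_pair_expression(rows, a, b):
    """Append an <a, b> expression bit to each row and replace the
    variable states now carried by the expression with a '-' mark."""
    for r in rows:
        both = r[a] == 1 and r[b] == 1
        if both:
            r[a] = r[b] = "-"   # states removed; the new bit expresses them
        r.append(1 if both else 0)
    return rows
```

After re-expression, the new expression column varies instead of the two original columns wherever both bits were set.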
  • the reduced junior, middle, senior and low variables as well as the expression <junior, low> no longer vary in the re-expressed data.
  • the reduced junior, senior and low variables and <junior, low> are zero and the reduced middle variable is one.
  • the process typically comprises a threshold and only potential expressions that yield a benefit above this threshold are introduced in the language.
  • the threshold is advantageously set according to how well the sample is considered to represent the system. If a sample would describe the system perfectly, threshold could be 0, but this typically cannot be assumed.
  • a typical optimal threshold value is between 50 and 70.
  • the example used 6 samples, but in practice hundreds or thousands of historical cases may need to be analyzed to allow the detected patterns and the predictions made from them to have any statistical significance.
  • the system is ready to process samples associated to individual customers.
  • a sample / associated to a customer that applies for a loan from the bank is a data entry with one non-available variable, i.e. the eligibility for the loan.
  • the sample is input to the system (606), and again translated (608) into binary form s.
  • the binary form sample is then re-expressed (610) with the language L learned in step 604 and fed into a naïve Bayesian predictor that estimates (612) a probability value for the unknown variable representing the customer's eligibility for the loan. This value is mapped into a recommendation that may then be output (614) through the user interface of the system to the user.
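A minimal sketch of the naïve Bayesian prediction step over binary features is shown below. The patent does not specify the exact estimator, so add-one (Laplace) smoothing and a simple log-space normalization are assumptions for illustration.

```python
import math

def naive_bayes_probability(training_rows, query_bits):
    """Estimate p(target = 1 | features) with naïve Bayes over binary features.

    training_rows: list of (features, target) pairs; features is a bit tuple.
    query_bits: observed feature bits of the new sample.
    Add-one smoothing keeps zero counts from collapsing the product.
    """
    def class_score(target_value):
        rows = [f for f, t in training_rows if t == target_value]
        prior = (len(rows) + 1) / (len(training_rows) + 2)
        log_score = math.log(prior)
        for i, bit in enumerate(query_bits):
            matches = sum(1 for f in rows if f[i] == bit)
            log_score += math.log((matches + 1) / (len(rows) + 2))
        return log_score

    s1, s0 = class_score(1), class_score(0)
    m = max(s1, s0)                    # subtract max for numeric stability
    e1, e0 = math.exp(s1 - m), math.exp(s0 - m)
    return e1 / (e1 + e0)
```

The returned probability can then be mapped to a recommendation class exactly as the mapping engine does above.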
  • the operative action may further or alternatively comprise implementing a data entry and/or changing content of a data entry of the customer in a data repository where customers' background information is stored.
  • the apparatus is a special-purpose control device that predicts unavailable variables on the basis of statistically significant amounts of available training information and generates a control operation on the basis of this estimation.
  • the example of Figure 6 shows how prediction based on earlier recorded payback experiences may be used to automatically provide a control value associated to an individual user and indicating his eligibility to pay back the loan.
  • Use of pair expression based re-expression enhances the accuracy of the prediction without requiring larger samples and/or without essentially increasing the required amount of processing. Pre-processing of the samples with the pair expression based re-expression also improves compression of the data, so less data storage capacity is needed in the processing of the information.
  • the apparatus provides an automatic control operation (visual, audio, change of state in a data record, entry of a data record, transmission of a message, etc.) on the basis of an individual sample and a prediction model created on the basis of earlier recorded data.
  • the apparatus may be used as a control device that provides a risk estimate for a target entity it receives information on.
  • the risk estimate relates to the probability that a defined risk will materialize for that particular target entity.
  • This risk estimate may then be used to initiate operative action(s) to reduce the probability that the risk materializes.
  • the apparatus may be used as a control device that samples various parameters of an industrial process, and monitors a statistically behaving fault in the process. If counteractions against the fault are expensive or slow down the process itself, it is preferred to trigger them only when the probability of the fault increases.
  • On the basis of historical data on the appearance of the fault, the control device generates a pair-expression based model (L, ⁇, S'). The control device then inputs samples in defined intervals, preprocesses them with the generated pair-expression based model and determines the probability for appearance of the fault.
  • Mapping in this context comprises comparing the computed probability against a predefined trigger level and selecting a control operation call to either initiate a counteractive operation in case the estimated probability exceeds a first predefined trigger level, or to end a counteractive operation when the probability is below another predefined trigger level.
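The two trigger levels above amount to a simple hysteresis controller, which can be sketched as follows; the class name and the level values are assumptions for illustration, not from the patent.

```python
class FaultController:
    """Map an estimated fault probability to a counteraction on/off decision.

    Two trigger levels give hysteresis: the counteraction starts only above
    the upper level and stops only below the lower level, so the controller
    does not oscillate when the estimate hovers near a single threshold.
    """
    def __init__(self, start_level=0.8, stop_level=0.3):  # assumed levels
        self.start_level = start_level
        self.stop_level = stop_level
        self.active = False

    def update(self, fault_probability):
        if not self.active and fault_probability > self.start_level:
            self.active = True       # initiate counteractive operation
        elif self.active and fault_probability < self.stop_level:
            self.active = False      # end counteractive operation
        return self.active
```

Keeping the counteraction active until the estimate drops below the lower level avoids repeatedly starting and stopping an expensive operation.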
  • a more accurate estimation is achieved, which shows as improved performance of the control device.
  • Counteractive measures are applied in the system controlled by the control device only when necessary and therefore the throughput of the industrial process is improved.
  • the target entity is a building and the risk relates to a defined structural defect, like mold buildup, that may have serious consequences if it is allowed to develop without interference.
  • Structural details of buildings are well recorded, and also a lot of environmental data is available for analyses. Combining these data with experimental information on the occurrences of the defect enables estimating the risk that an individual building will suffer from this defect.
  • the information to be considered may be very complex, comprising structural details of the particular building, statistics on environmental data (temperatures, rain), and other location-based information (industrial, urban or rural location, etc.).
  • a great number of observations need to be run first to achieve any statistical relevance for the results.
  • the control device may be configured to map different probabilities of the structural defect to different reminder schemes and thereby allow generating and sending of reminders to owners of buildings with increased risk level.
  • the system may be also configured to map different probabilities of the structural defect to specific check periods to plan inspections against this particular defect. Different variables have different dependencies, so if experimental data for more than one defect is available, the control device may be configured to provide an adapted checking schedule for all these defects.
  • the control device may be used in a system that inputs complex information collected from one or more public information sources on a person and provides a risk classification for the person.
  • the risk classification relates to the probability that the person will suffer from a defined statistically behaving symptom or disease. For example, let us consider that the risk relates to a defined disease. Medical data and history of a population is well recorded, and a lot of other information that may influence the risk of developing the disease is also available. Combining these features with experimental information on the occurrences of the disease enables estimating the risk of an individual person acquiring this disease.
  • This risk estimation may be mapped to a control operation, like initiation of regular checks against that disease.
  • the sampled information to be considered is again very complex and comprises a significant amount of different types of data. Furthermore, a mass of records needs to be analyzed first for the results to achieve any statistical relevance. However, by preprocessing the sampled information with the proposed pair expression-based re-expression, such analyses are now achievable with acceptable accuracy but with reduced use of processing and memory capacity in the apparatus that performs the analysis.
  • the apparatus may be used in a control device that extracts various features (structural data, content, routing information) of an incoming message.
  • the training information comprises such features of a group of earlier received messages and an indication on whether the message was detected to be an unwanted message (e.g. spam).
  • the probability that a message is an unwanted message is effectively and more accurately estimated.
  • By mapping the probability values to different levels of blocking operations, unwanted messages may be blocked more effectively.
  • the apparatus may be used in a control device that estimates an expected tendency in stock prices on the basis of successive, not necessarily regularly timed samples.
  • the training information comprises defined information on defined companies and defined relevant events at one point of time, and experimental information on the associated change detected right after the point of time. This information may be used to develop a re- expression model that is then applied to pre-process a sample taken at some later point of time.
  • the control operation comprises outputting an indication of the expected trend to the user of the apparatus, and/or changing a state of a record storing a value for the expected trend.
  • the apparatus may be used in a control device that estimates missing parts, or parts that are not relied on, in a defined type of information sequence.
  • the control operation then comprises replacing the unavailable piece of information with the estimated piece of information.
  • the unavailable piece of information may be, for example, a part in a DNA sequence, or an attenuated telecommunications signal.
  • The above aspects describe exemplary application fields of the embodiments of the control device.
  • Other such application fields comprise computer vision, natural language processing, syntactic pattern recognition, speech and handwriting recognition, object recognition, machine perception, search engines, advertisements, adaptive websites, spam filtering, data mining, expert systems, brain-machine interfaces, gaming (heuristic, modeling), software engineering, robot locomotion, agents (heuristic, modeling).
  • pair expression behaves like naïve Bayes and tries to minimize measurement error (which dominates small sample problems) simply because, in the lack of expressions, the approximation error is high.
  • the granularity is increased around state combinations, which are surprisingly common. This reduces approximation error with minimal cost to measurement error (because measurement error is reduced for surprisingly common state combinations).
  • pair expression forms one expression for each existing system state, and reduces to an ordinary system state trace, thus minimizing approximation error. In this sense the algorithm is well behaved, as it works to find a proper compromise between the two errors.
  • the inventive method can be complemented by pair expression processing. This technique may be utilized to improve the operating efficiency of an information retrieval server.
  • a side benefit is that the operating efficiency of the network that interconnects the information retrieval server ("IR server") and client terminals is also improved.
  • Pair expression processing attempts to provide an answer to the question of how to provide people with the information they need.
  • a conventional IR server responds to user queries by providing a list of candidate documents in a ranking order which is based on a match between the query and document contents, plus some global ranking statistics, i.e. statistics obtained from a large number of users.
  • the conventional IR server fails to take into account individual properties of a given user. The user must activate downloading of several candidate documents and inspect them for relevant content. Because the individual properties of a given user do not affect the order in which the IR server arranges the candidate documents, users must download a large number of candidate documents for visual inspection, until a sufficiently relevant document is discovered. This causes an excessive burden on the IR server, the interconnecting network architecture, the client terminals and their users.
  • a conventional procedure for finding a desired document includes first a step of making a query in a specialized information retrieval (IR) database and retrieving a list of documents matching the query. After or during the retrieval step, the relevance of the documents is estimated using a separate algorithm. The result is a list of matching documents sorted by their relevance, so that the most relevant document is presented to the client terminal at the top of the list.
  • s(d) describes the event of document d being the desired document under some query context information c.
  • the query context information may include for example query (e.g. 'eiffel') as well as some information on the user, time and location.
  • Predicting may be based on information obtained from the history, and the most relevant document for a user, under a given set of conditions, is most likely a document identical or similar to a document that the user has previously selected under similar conditions.
  • the inventive IR server should have access to some sizable training data, which contains a large number of samples, such as thousands or millions of samples.
  • the IR server operator operates a service mainly aimed for travelers to find locations and web pages related to the locations the travelers are in.
  • the service should also be accessible from the home countries of the travelers.
  • the operation of the IR server according to the present embodiment is based on the assumption that the needs and interests of users follow certain patterns. In this illustrative but non-restrictive description, the assumption is made that young users are more interested in entertainment, senior users prefer tourist attractions, and adult users at working age typically travel for purposes of business and are interested in business-related information on various companies. We will further assume that all user groups are interested in buying new things. Accordingly, for the purposes of the present example, the IR server classifies documents into four categories: 1) business-related documents, 2) entertainment-related documents, 3) shopping-related documents and 4) tourism-related documents.
  • the IR server processes two kinds of relevance information.
  • a first kind of relevance information is document-related relevance information that can be summarized as "what". This kind of relevance information is processed by every conventional IR server.
  • an IR server of the present embodiment processes relevance information related to the query context. This kind of relevance information can be described as "who requested this information", "under what conditions (when, where, ...)".
  • the query context data is organized into 6 different binary variables.
  • the first three binary variables (Y, A, S) describe whether the user is classified as "young", "adult” or “senior”.
  • the two next binary variables describe whether the query was initiated during traditional working hours (W) or during a weekend (w).
  • the sixth binary variable describes whether the user is abroad (a) or not.
  • the document data has 4 variables, one for each category of 1) business (b), 2) entertainment (e), 3) shopping (s) and 4) tourism-related documents (t).
  • a vector of bits is basically the state of a vector of binary variables. So for a query context where a young person made a query during a weekend, while travelling abroad, the query context can be described with the following bitmap or vector:
  • bits in order are young, adult, senior, work hours, weekend, abroad.
  • a document related to entertainment (e) can be described with: best
  • the IR server needs to describe an event in which a young (Y) person, when abroad (a), during a weekend (w), found a document related to entertainment (e).
  • the IR server can describe it with the following sequence YASWwabest
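Assuming the bit order Y, A, S, W, w, a, b, e, s, t plus a final "selected" bit (as the labels above suggest), the event encoding can be sketched as:

```python
# Fixed field order: user group, time context, location, document
# categories, and whether the candidate document was selected.
FIELDS = ["Y", "A", "S", "W", "w", "a", "b", "e", "s", "t", "selected"]

def encode_event(true_fields):
    """Encode an event as a bit string in the fixed field order above."""
    return "".join("1" if name in true_fields else "0" for name in FIELDS)

# A young (Y) user abroad (a), during a weekend (w), selected an
# entertainment (e) document:
event = encode_event({"Y", "w", "a", "e", "selected"})
```

The same encoder reproduces the senior/tourism example discussed below as the bit string 00101100011.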
  • For example, a business (b) document that was not selected can be described by the following bit vector:
  • bits labeled "Y", "A" and "S" again indicate whether the user is classified as young, adult or senior, respectively.
  • the document category portion of the bitmap (the bits labeled "best") indicates whether or not the document belongs to one or more of the four categories, namely business, entertainment, shopping and tourism. It is self-evident that the number of categories is purely exemplary and kept low for the interest of brevity and clarity, and an actual IR server will be capable of processing enormously larger numbers of categories, and accordingly enormously larger bitmaps.
  • a document context bitmap of 00101100011 indicates that the user is a senior person traveling abroad, and that the user, who initiated the query during a weekend, was interested in tourism-related documents.
  • the last bit value of '1' indicates that the user selected a candidate document.
  • In this exemplary set there are three young users, three adults and three seniors (three rows with a '1' bit in each of the Y, A and S columns), and the exemplary set intentionally exhibits the patterns presented before.
  • the object of the present embodiment is to provide users with documents they consider relevant under their currently prevailing conditions. Finding relevant documents automatically, instead of forcing the users to find the needle in the proverbial haystack, reduces the number of irrelevant documents that the IR server must retrieve and transmit over the interconnecting network.
  • the patterns of human interest are not known a priori or indicated explicitly, and the patterns have to be learned instead.
  • the IR server has to analyze information on past behavior of a large number of prior users.
  • the pair expression technique can be used for detecting the patterns.
  • the IR server can express all input variables (not the selected variable). By running the described algorithm with a threshold of 2, the algorithm will form the following expressions for re-expressing the data.
  • bits labeled 1, 2, 3 and 4 are bits for expression variables.
  • the re-expression algorithm adds new expressions.
  • the bit labeled '1' corresponds to expression <workhours, business>
  • bit '2' corresponds to expression <young, entertainment>
  • bit '3' to expression <senior, tourism>
  • bit '4' to <adult, <workhours, business>>
  • the algorithm has effectively classified six out of the total of nine different query contexts.
  • the described patterns are visible in the constructed language and they do help figuring out the usage patterns.
  • the IR server needs to gather statistics between the input variables (query context, document and expressions) and the one output variable (selected).
  • the output variable does not vary because the above description has not considered the documents that were not selected for each query.
  • the context-aware IR server can formulate a proper data set by creating entries for each document for each query, and by marking which document was actually selected by the user. This process will result in a total of 36 samples, of which the first 12 samples are described here:
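The expansion into one entry per candidate document per query can be sketched as follows; the data layout (tuples of context bits, document bits and a trailing "selected" bit) is an assumption for illustration.

```python
def build_samples(queries):
    """Expand (context, candidates, selected) queries into per-document rows.

    Each row pairs the query context bits with one candidate document's
    category bits, plus a final 'selected' bit marking the chosen document.
    """
    samples = []
    for context_bits, candidate_docs, selected_doc in queries:
        for doc_bits in candidate_docs:
            selected = 1 if doc_bits == selected_doc else 0
            samples.append(context_bits + doc_bits + (selected,))
    return samples

# One query with four candidate documents; the entertainment one was chosen.
docs = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)]  # b, e, s, t
queries = [((1, 0, 0, 0, 1, 1), docs, (0, 1, 0, 0))]  # young, weekend, abroad
samples = build_samples(queries)
```

With nine queries of four candidates each, this construction yields the 36 samples mentioned above, with exactly one selected row per query.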
  • O(N) stands for the complexity of the prediction algorithm, wherein N indicates the number of input variables.
  • the benefits of pair-expression include an improved ability to detect context-related patterns in the data.
  • the inventive context-aware IR server can use the detected patterns to provide an extremely fast and powerful scoring mechanism.
  • the use of pair expression in connection with an IR server reduces computational burden and thus shortens response times. The speed increase is possible because in connection with IR servers what matters is not the actual probability that any given document will be selected from a client terminal, but rather the mutual order or ranking of those probabilities.
  • Because pair expression can be coupled with any of a wide variety of statistical prediction algorithms, it can be combined with tailored statistical scoring algorithms in order to provide better results.
  • After gathering the statistics from the re-expressed data, the context-aware IR server has stored a bitmap or bit vector that describes the states of the input variables. Next, the IR server can proceed to scoring. In a first step, the IR server may re-express the bit vector and proceed to calculating the score.
  • One exemplary but not restrictive scoring method of reasonable simplicity is simplified Bayesian scoring. This scoring method estimates the likelihood of document selection with the following formula.
  • d(a; b) is the statistical dependency p(a & b) / p(a) p(b).
  • the IR server can further develop the scores by utilizing a logarithmic version of the above formula.
  • An effect of the logarithmic version of the formula is that this version of the formula outputs the same units that the industry standard TF-IDF algorithm does:
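The scoring formulas themselves are not reproduced in this text, so the following sketch assumes a common simplified Bayesian form: the selection prior multiplied by the dependency d(v; s) of each active input variable v with the selection event s, with the logarithmic variant as the corresponding sum of log terms.

```python
import math

def dependency(p_ab, p_a, p_b):
    """Statistical dependency d(a; b) = p(a & b) / (p(a) * p(b))."""
    return p_ab / (p_a * p_b)

def score(prior_selected, deps):
    """Assumed simplified Bayesian score: the selection prior times the
    dependency of each active input variable with the selection event."""
    result = prior_selected
    for d in deps:
        result *= d
    return result

def log_score(prior_selected, deps):
    """Logarithmic variant: a sum of log-dependencies, comparable in
    spirit to TF-IDF term weights."""
    return math.log(prior_selected) + sum(math.log(d) for d in deps)
```

Because only the ranking of scores matters for the IR server, the monotonic log transform preserves the document order while turning products into cheap sums.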
  • the documents can be sorted in the order from highest scored document to the lowest scored.
  • the example described herein is a very much simplified version in comparison with real-world examples. The description of a simplified version permits a human reader to follow the example at the level of individual bits.
  • the use of pair expression with sample sizes as small as this and with a threshold of two would lead to overlearning with severely harmful results.
  • the re-expression operation and the scoring technique of the present embodiment can be used to produce superior results with very good performance.
  • an IR server implementing this embodiment of the present invention is capable of utilizing information on usage patterns to adjust relevance scoring of documents in a context-aware manner.
  • a benefit of the context- aware relevance ranking of documents is that depending on current context, users of the IR server will be provided with documents whose relevance is supposedly optimal given the current context.
  • the search for the optimally, or at least sufficiently, relevant document is changed from a brute-force exhaustive search between IR server and user to an intelligent search, most of which is performed in the context-aware IR server itself.
  • the burden on the IR server's storage subsystem and the interconnecting network is diminished.
  • FIG. 7 schematically shows a block diagram of an information retrieval server IRS.
  • the information retrieval server comprises one or more central processing units CP1 ... CPn, generally denoted by reference numeral 705.
  • Embodiments comprising multiple processing units 705 are preferably provided with a load balancing unit 715 that balances processing load among the multiple processing units 705.
  • the multiple processing units CP1 ... CPn may be implemented as separate processor components or as physical processor cores or virtual processors within a single component case.
  • the information retrieval server IRS also comprises a network interface 720 for communicating with one or more client terminals CT1 through CTn, via data networks DN, such as the internet. In a typical but non-restrictive scenario, the client terminals CT1 ...
  • CTn may be conventional data processing devices with internet browsing capabilities, such as desktop or laptop computers, smart telephones, entertainment devices or the like.
  • the information retrieval server IRS also comprises or utilizes input-output circuitry 725 which constitutes a user interface of the information retrieval server IRS and comprises an input circuitry 730 and an output circuitry 735.
  • the nature of the user interface depends on which kind of computer is used to implement the information retrieval server IRS. If the information retrieval server IRS is a dedicated computer, it may not need a local user interface, such as a keyboard and display, and the user interface may be a remote interface, in which case the information retrieval server IRS is managed remotely, such as from a web browser over the internet, for example.
  • Such remote management may be accomplished via the same network interface 720 that the information retrieval server utilizes for traffic between itself and the client terminals CT1 ... CTn, mobile devices and service providers, or a separate management interface may be utilized.
  • the user interface may be utilized for obtaining traffic statistics.
  • the information retrieval server IRS also comprises memory 750 for storing program instructions, operating parameters and variables.
  • Reference numeral 760 denotes a program suite for the information retrieval server IRS.
  • the program suite 760 comprises program code for instructing the processor to execute the steps of the inventive method, namely:
  • the control operation includes control of an information retrieval server.
  • a conventional information retrieval server receives a query from a client, wherein the query contains a list of keywords.
  • the conventional information retrieval server retrieves documents matching the keywords and presents that list in the order of relevance, wherein the relevance is computed on the basis of document-based properties and, optionally, some global preference ratings.
  • the document-based properties include number of occurrences of the query keywords in the documents, mutual proximity of the keywords in the documents, publication date, modification date, or the like.
  • the global preference ratings include voting results of the documents, number of internet links referring to the documents, or the like.
  • the embodiment shown in Figure 7 contains a content database 710 and profile database 712, both of which are accessible to the central processing units CP1 ... CPn.
  • the content database 710 can be similar to the ones used by conventional information retrieval servers.
  • the content database 710 comprises documents and/or abstracts of documents and, optionally, global ranking information, such as numbers of links originating from or pointing to the documents, voting results from previous users, or the like.
  • the profile database 712 contains user profile data.
  • An example, albeit a simplistic one, of such user profile data was described under the heading "Use of pair expression in an information retrieval server”.
  • Reference numeral 780 denotes an area of the memory 750 used to store parameters and variables.
  • the computer programs may be stored on a computer program distribution medium that is readable by a computer or a processor.
  • the computer program medium may be, but is not limited to, an electric, a magnetic, an optical, an infrared or a semiconductor system, device or transmission medium.
  • the computer program medium may include at least one of the following media: a computer readable medium, a program storage medium, a record medium, a computer readable memory, a random access memory, an erasable programmable read-only memory, a computer readable software distribution package, a computer readable signal, a computer readable telecommunications signal, computer readable printed matter, and a computer readable compressed software package.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Feedback Control In General (AREA)

Abstract

A method and apparatus for controlling operations according to sampled information collected from one or more information sources. A model for re-expressing system variables through re-expressed variables, where a re-expressed variable expresses a state combination of two or more other variables, is defined for a system. The model is used to re-express system variables through the re-expressed variables. These variables are used to predict a system variable that is not available through the sample. A control operation determined according to the predicted system variable is output. Performance of a control device is improved, but the required processing capacity of the apparatus is not significantly increased.

Description

METHOD AND APPARATUS FOR A CONTROL DEVICE
FIELD OF THE INVENTION
[0001] The present invention relates to control devices and especially to an apparatus that implements control operations according to sampled information collected from one or more information sources.
BACKGROUND OF THE INVENTION
[0002] In order to determine an operation to be implemented, an apparatus used as a control device inputs a sample of measured and/or stored information, and on the basis of the sample automatically generates a control indication that maps to a defined control operation of the apparatus. In complex system environments the desired sample information may vary statistically. Accordingly, selection of an operation of the apparatus is not deterministic but requires the use of estimation. This means that parameters of a system are modeled as random variables of known a priori distribution, and the model describes the physical system scenario in which the parameters apply.
[0003] Performance of the apparatus corresponds with the validity of the selected control operations, i.e. the accuracy of the estimation. Conventionally one may try to improve the accuracy of estimations by increasing the amount of input samples and/or by increasing the brute processing power of the device performing the estimation. However, the sampling is often not under the control of the operator of the apparatus, so results have to be achieved with the material at hand. On the other hand, an increase of processing power is expensive and may in any case prove inadequate when the sampled environment is more complex. In some systems complexity may rise exponentially in relation to the amount of system variables. This means that estimation of most such systems is in practice impossible; the combined processing power of the whole world would not be able to perform all the necessary computations in reasonable time.
[0004] The technical challenge is thus how to improve the performance of a control device that determines control operations according to statistically varying samples without significantly increasing the required processing capacity of the apparatus.
SUMMARY OF THE INVENTION
[0005] An object of the present invention is thus to provide a method and an apparatus for implementing the method so as to solve or at least alleviate the above problems. The objects of the invention are achieved by a method, an apparatus and a computer program product, which are characterized by what is stated in the independent claims. The preferred embodiments of the invention are disclosed in the dependent claims.
[0006] In an aspect, the invention provides a method that comprises defining for a system a model for re-expressing system variables through re-expressed variables, a re-expressed variable expressing a state combination of two or more other variables; inputting a sample of a system; translating the sample into a binary form that comprises a plurality of system variables; using the model to re-express the determined system variables through the re-expressed variables; using the re-expressed variables to predict a system variable that is not available through the sample; and implementing a control operation determined according to the predicted system variable.
[0007] In another aspect, the invention provides an apparatus that comprises means for defining for a system a model for re-expressing system variables through re-expressed variables, a re-expressed variable expressing a state combination of two or more other variables; means for inputting a sample of a system; means for translating the sample into a binary form that comprises a plurality of system variables; means for using the model to re-express the determined system variables through the re-expressed variables; means (202) for using the re-expressed variables to predict a system variable that is not available through the sample; and means for implementing a control operation determined according to the predicted system variable.
[0008] In another aspect, the invention provides a computer program product readable by a computer and encoding a computer program of instructions for executing a computer process controlling functions in an information system. The process includes the steps of defining for a system a model for re-expressing system variables through re-expressed variables, a re-expressed variable expressing a state combination of two or more other variables; inputting a sample of a system; translating the sample into a binary form that comprises a plurality of system variables; using the model to re-express the determined system variables through the re-expressed variables; using the re-expressed variables to predict a system variable that is not available through the sample; and implementing a control operation determined according to the predicted system variable.
[0009] The invention can be utilized to improve the accuracy of prediction. In several fields of application the improved accuracy of prediction permits the apparatus carrying out the inventive method, i.e. the control device, to operate in a more efficient manner. The improved efficiency is a result of a reduction in the number of trial-and-error cycles. A search engine, also known as an information retrieval server (IR server), is a prime example of an apparatus that uses resources more efficiently as a result of the inventive method. A conventional IR server receives queries from a number of client terminals. The queries contain key words, some or all of which must be contained in the documents that the IR server is to return to the client terminals that sent the queries. The hits (list of retrieved documents) returned by the conventional IR server primarily depend on the key words of the query. In addition, the hit list may be complemented by documents from sponsoring partners. But the hit list returned by the conventional IR server is the same regardless of the identity of the client terminal that sent the query, or the user of that terminal. Furthermore, the hit list returned by the conventional IR server is not influenced by the time, day or location of the inquiring terminal or user. For instance, if a young user requests information on buses, it is likely that the young user is interested in bus routes and schedules. On the other hand, if a business manager requests information on buses, it may be more likely that they are interested in the technical details of buses or the economics of running a bus service. Because the results provided by conventional IR servers are not influenced by the profile of the inquiring users, the users are likely to receive large amounts of information which is irrelevant for them.
The users receive these documents, which are typically displayed by their internet browsers; they then reject the documents and browse further documents, hoping to find ones that are relevant for them. On one hand, the user experience leaves something to be desired, and on the other hand the conventional IR servers burden their mass storage systems and communication resources by retrieving and transmitting documents that are irrelevant for the particular user.
[0010] The invention and its various embodiments provide several further advantages that are discussed in connection with respective embodiments in the following description.
BRIEF DESCRIPTION OF THE DRAWINGS
[0011] In the following the invention will be described in greater detail by means of preferred embodiments with reference to the attached drawings, in which:
Figure 1 provides a functional description of an apparatus;
Figure 2 illustrates basic functional entities within a processing engine;
Figure 3 illustrates elements of an exemplary pair expression driven learning process of a learning engine;
Figure 4 illustrates steps in a method for creating a model for a system in a language learner of the apparatus;
Figure 5 illustrates elements of a prediction process associated with the learning process;
Figure 6 illustrates an exemplary process of the apparatus;
Figure 7 illustrates an exemplary hardware configuration for the implementation of the apparatus.
DETAILED DESCRIPTION OF THE INVENTION
[0012] It is appreciated that the following embodiments are exemplary. Furthermore, although the specification may in various places refer to "an", "one", or "some" embodiment(s), reference is not necessarily made to the same embodiment(s), nor does the feature in question apply only to a single embodiment. Single features of different embodiments may be combined to provide further embodiments.
[0013] A variety of configurations applying a variety of information processing technologies may be used separately or in combinations to implement the embodiments of the invention. Information processing devices and technologies evolve continuously, and embodiments of the invention may require a number of modifications. Therefore all words and expressions of this specification should be interpreted broadly, as they are intended merely to illustrate, not to restrict, the embodiments.
[0014] Figure 1 provides a functional description of an apparatus (A) 100 according to an embodiment of the invention. During its operation the apparatus 100 inputs a plurality of samples si. A sample si refers to a block of information that is coded in electronic form to allow computer devices and computer software of a processing engine 101 to convert, store, process, transmit, and/or retrieve it in the processes of the apparatus 100. A sample si is typically a data record that comprises one or more information elements Xk that the processing engine 101 is able to automatically detect and treat as separate parts of the sample.

[0015] In general the term processing engine relates to an information technology component of the apparatus, and therefore refers to any computer-related equipment or interconnected communications system or subsystems of equipment that may be used in the acquisition, storage, manipulation, management, movement, control, display, switching, interchange, transmission, or reception of information, and includes software, firmware, and/or hardware. Automatically in this context means that the recognition and treatment functions may be performed with the software, firmware, and/or hardware of the processing engine 101 without simultaneous support operations with the human mind.
[0016] The one or more information elements Xk of the sample carry measured or empirical data that the processing engine 101 converts to a defined binary format. The resulting binary vectors are used as system variable values for prediction that provides an estimate for one or more system variables that are not available through the sample data. A system variable value may be considered non-available, for example, when it is missing from the sample, or is non-trusted and therefore not directly applicable. One or more of these estimates are output as a control indication ci to a mapping engine 102 of the apparatus. The configuration and operations of the processing engine 101 will be discussed in more detail later in this description.
[0017] The mapping engine 102 comprises an input interface to the processing engine 101 and may comprise one or more additional input interfaces to other internal units of the apparatus or to external entities. The mapping engine 102 also comprises an output interface to an operations engine 103, and mapping means for determining a control operation call coci on the basis of information input via the input interfaces. The mapping means may comprise, for example, a simple one-to-one mapping, for example a table, where one control indication ci corresponds with one control operation call coci. The mapping means may also comprise a more complex algorithm that, on the basis of the control indication ci input by the processing engine 101, and possibly some other internal and/or external data input via the one or more additional input interfaces, computes one or more control operation calls coci. The mapping engine outputs the one or more determined control operation calls coci to the operations engine 103.
[0018] The operations engine 103 comprises a library of control operations and in response to a control operation call coci from the mapping engine 102 implements a corresponding control operation coi. The operation of the apparatus thus comprises a consecutive sequence of control operations initiated on the basis of sample data fed into the apparatus. The process that is controlled by the control operations may be automatic or manual. However, the operations of the apparatus are automatic.
[0019] It may be noticed that the mapping engine 102 and the operations engine 103 are relatively simple control elements of the apparatus. Improving the relevancy and/or accuracy of the apparatus operations as a whole is thus by far based on the accuracy of the control indications ci provided by the processing engine 101.
[0020] The processing engine 101 of Figure 1 is a learning machine that analyzes a large binary variable system to recognize patterns within variables, and may utilize the knowledge of these patterns to compress individual samples and make predictions on unknown variables. Figure 2 illustrates in more detail the basic functional entities within the processing engine 101 of Figure 1. The processing engine 101 comprises a learning engine 201 and a predicting engine 202. The learning engine 201 inputs a set of training data st that represents a system S and automatically recognizes patterns in the system S. Based on the system properties, the learning engine 201 iteratively constructs an N-expression language L that re-expresses the system S'=L(S) and calculates statistics β from the re-expressed system. This results in a system model M(L,S',β), which is fed into and applied by the predicting engine 202 during processing of subsequent data samples si.
[0021] In the following, the embodiment of Figure 2 is illustrated in more detail with a processing engine that applies a naïve Bayesian predictor. It should be noted, however, that the solution is also applicable to other prediction mechanisms, like multilayer perceptrons (MLP). Furthermore, the following example applies pair expression, i.e. re-expression through two system variables. As will be discussed later, re-expression may also be implemented through more than two system variables.
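To make the naïve Bayesian prediction step concrete, the following sketch estimates an unavailable binary system variable from observed binary variables under the naive independence assumption. The function names, the Laplace smoothing, and the toy data are illustrative assumptions, not the patent's implementation.

```python
# Hedged sketch of naive Bayesian prediction of a missing binary variable.
from collections import defaultdict

def train_counts(samples, target):
    """Count joint occurrences of each (variable, value) pair with the target."""
    counts = {0: defaultdict(int), 1: defaultdict(int)}
    totals = {0: 0, 1: 0}
    for s in samples:
        y = s[target]
        totals[y] += 1
        for var, val in s.items():
            if var != target:
                counts[y][(var, val)] += 1
    return counts, totals

def predict(counts, totals, observed):
    """Return p(Y=1 | observed) under the naive independence assumption."""
    score = {}
    n = totals[0] + totals[1]
    for y in (0, 1):
        p = totals[y] / n
        for var, val in observed.items():
            # Laplace smoothing avoids zero probabilities
            p *= (counts[y][(var, val)] + 1) / (totals[y] + 2)
        score[y] = p
    return score[1] / (score[0] + score[1])

samples = [{"a": 1, "b": 1, "y": 1}, {"a": 1, "b": 0, "y": 1},
           {"a": 0, "b": 1, "y": 0}, {"a": 0, "b": 0, "y": 0}]
counts, totals = train_counts(samples, "y")
p = predict(counts, totals, {"a": 1, "b": 1})
```

The smoothing constant is a common default choice; any prediction mechanism with the same interface could be substituted, as the paragraph above notes.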
[0022] Figure 3 illustrates in more detail elements of an exemplary pair expression driven learning process of the learning engine 201. In this case training data st 301 is input in the form of historical information I, which comprises a number of recorded cases I = (i1, i2, i3, ...) of a system. This historical information I may originate from one or more sources and be in one or more formats. To make the re-expression of this historical information I into binary information S representing the system, the learning engine 201 uses a defined language preprocessor Lp 302, which preprocesses the freely formatted historical information into a predefined binary format S = {s1, s2, s3, ...} 303. In the language preprocessor 302 numeric/continuous variables are typically normalized and then fitted into some limited number of categories. Operations to transform free format data to binary vectors are generally known to a person skilled in the art and will not be discussed here in more detail.
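As a hedged illustration of the normalize-and-categorize step (the patent does not specify the preprocessor Lp in detail), a continuous value can be normalized and one-hot encoded into a limited number of binary category variables:

```python
# Illustrative sketch, not the patent's exact preprocessor Lp: a continuous
# variable is normalized to [0, 1] and fitted into a limited number of
# categories, each category becoming one binary system variable.
def binarize(value, lo, hi, n_bins=4):
    """One-hot encode a numeric value into n_bins binary variables."""
    x = (value - lo) / (hi - lo)          # normalize to [0, 1]
    k = min(int(x * n_bins), n_bins - 1)  # clamp the top edge into the last bin
    return [1 if i == k else 0 for i in range(n_bins)]

# e.g. a temperature reading mapped onto four category bits
bits = binarize(21.5, lo=-30.0, hi=50.0)
```

Exactly one bit is set per categorized variable, which keeps the resulting binary vector compatible with per-state probability bookkeeping later in the process.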
[0023] After this preparing phase, the binary information is given to a language learner Λ 304, which iteratively constructs the referred system model M(L,S',β) 305. In the following, the basic principles associated with the language learner Λ are discussed in more detail.
[0024] In order to derive L, S', and β, the binary variable system may be expressed as

S = (X1, X2, X3, ..., Xn)

where S denotes the system and X1, X2, X3, ..., Xn are Boolean variables, called in this context system variables, as their states are used for expressing each individual state of the system

s = (x1, x2, x3, ..., xn)

where x1, x2, x3, ..., xn are the Boolean states xi ∈ {0, 1} of the system variables.
[0025] The information entropy H(S) of a system is known to be less than or equal to the summed entropies of the system variables:

H(S) ≤ H(X1) + H(X2) + ... + H(Xn)
[0026] The sum of system variable entropies is called here the naive system entropy, because it is equivalent to the system entropy when the 'naive' assumption of statistical independence holds. Statistical independence refers here to a situation where an occurrence of one event makes it neither more nor less probable that the other occurs. The name is adopted from the context of naive Bayesian classification, where the statistical independence assumption is called naive because, for the sake of reduced complexity, it is systematically assumed for systems even if it might not fully apply.
[0027] This naive system entropy may be marked with Hnaive(S). Using this notation, the system mutual information is equal to the difference I(S) = Hnaive(S) − H(S).
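These quantities can be computed directly from sample data. The sketch below (an illustration, using empirical state frequencies) computes Hnaive(S), H(S) and their difference I(S) for two fully dependent variables:

```python
# Sketch of the entropy quantities above: the naive system entropy Hnaive(S)
# sums per-variable entropies, and the system mutual information is
# I(S) = Hnaive(S) - H(S). Values are in bits.
from collections import Counter
from math import log2

def entropy(states):
    """Shannon entropy of the empirical distribution over observed states."""
    n = len(states)
    return -sum(c / n * log2(c / n) for c in Counter(states).values())

def naive_entropy(samples):
    """Sum of the individual variable entropies."""
    return sum(entropy([s[i] for s in samples]) for i in range(len(samples[0])))

# Two fully dependent variables: H(S) = 1 bit, Hnaive(S) = 2 bits, I(S) = 1 bit.
samples = [(0, 0), (1, 1), (0, 0), (1, 1)]
H = entropy(samples)
H_naive = naive_entropy(samples)
I = H_naive - H
```

The mutual information I(S) is exactly the entropy reduction the learning engine later tries to expose through re-expression.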
[0028] If part of the variables in the system are expressed in the form of a subsystem S1, the system can be re-expressed as

S = (X1, X2, ..., S1, ..., Xn)
[0029] The upper boundary for the system entropy can then be set as:

H(S) ≤ H(X1) + H(X2) + ... + H(S1) + ... + H(Xn)
[0030] If dependencies revealing the subsystem mutual information I(S1) = Hnaive(S1) − H(S1) are recognized, the upper bound for the system entropy is recognized to decrease similarly:

H(S) ≤ Hnaive(S) − I(S1)
[0031] For a number of non-overlapping subsystems the equation obtains the following form:

H(S) ≤ Hnaive(S) − I(S1) − I(S2) − ... − I(Sk)
[0032] Accordingly, the reduced entropy in the subsystem translates into a reduction of entropy in the entire system. In other words, the patterns present in a subsystem can be applied to the entire system. This approach provides means for analyzing the system at a desired level, and thus for effectively avoiding the curse of dimensionality.
[0033] When performing system analysis, the size of an observed subsystem can be considered as a variable window or entropy window, as it determines how big the patterns are that can be recognized with it. While a two-variable window may expose many of the system regularities, complex systems may contain patterns which cannot be spotted with small entropy windows. An example of such patterns is a relation where a variable is the result of XORing two other variables. If all variable probabilities equal 0.5, the window must be at least 3 variables wide to be able to recognize the mutual information/dependency between the variables.
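The XOR observation above can be checked numerically. In this sketch (illustrative names, uniform truth-table distribution) every 2-variable window over C = A XOR B carries zero mutual information, while the 3-variable window exposes one full bit:

```python
# Sketch of the entropy-window limitation: for C = A XOR B with p = 0.5,
# 2-variable windows show no mutual information, a 3-variable window does.
from collections import Counter
from math import log2
from itertools import product

def entropy(states):
    n = len(states)
    return -sum(c / n * log2(c / n) for c in Counter(states).values())

# Full XOR truth table, each row equally likely.
rows = [(a, b, a ^ b) for a, b in product((0, 1), repeat=2)]

def window_mi(cols):
    """Naive entropy of the windowed variables minus their joint entropy."""
    joint = entropy([tuple(r[c] for c in cols) for r in rows])
    naive = sum(entropy([r[c] for r in rows]) for c in cols)
    return naive - joint

mi2 = window_mi((0, 2))     # a 2-variable window: 0 bits
mi3 = window_mi((0, 1, 2))  # the 3-variable window: 1 bit
```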
[0034] The reason why the window may also be referred to as an entropy window is that the summed entropy of all variables fitting in the window seems to be related to the entropy of the variables forming the patterns. Even if there are patterns which contain a great number of the variables, if the naive/real entropy of the system is low, traces of the pattern or even the entire pattern can be detected with much smaller entropy windows. For example, consider a system of n random variables and one 'oddness' variable, which is true only if an odd number of the other variables are true. If the probabilities of the variables in this system are 0.5, an (n+1)-sized window is required to spot the mutual information. If the same variables' probabilities were, for example, 0.01, a degree of mutual information around the oddness variable could be detected with variable windows much smaller than n: if a non-oddness variable X is true, it is very unlikely that any other non-oddness variable would be true at the same time, and the relationship between X and the oddness variable Y approximates the implication X→Y, which is easy to detect with a two-variable window.
[0035] More generally, for a sequence of n variables X1, X2, ..., Xn and a measurement of mutual information for some subset of size w of these variables, the subset acts as a w-sized window. The maximum amount of information that can be extracted with a single w-sized window is:

I1 = max I(Xi1; Xi2; ...; Xiw)

[0036] This value can be used to limit the maximum mutual information Ik that can be extracted with k windows of w variables each, so that the measurement is limited by Ik ≤ k·I1 and ultimately by the system mutual information Ik ≤ I(S). The actual value of Ik still depends on the properties of the system. If the properties of an n-variable system allow Ik to approximate the system mutual information I so that Ik ≈ I with a moderately growing window size O(w) ≪ n and with a moderately growing window amount k, it has been noticed that the system analysis approach can be applied very effectively.
[0037] Susceptibility of a system to the system analysis is related to the number of states that need to be traced to reveal a degree of the system's mutual information. This traced state count and the system state count (which behaves exponentially) are different things, and depending on the system properties the growth of the ideal traced state count may behave differently, for example linearly.

[0038] In a system that is static, except for point-like random disruptions and their consequences within a time period t, it is possible for each system variable X to limit a subsystem S that has all the variables whose disruptions will affect the variable X. Under these conditions it is possible to find a window size w so that the mutual information of the entire system can be determined through w-sized windows, and the system is not severely touched by the curse of dimensionality. While the state complexity still rises exponentially, the price of solving the system does not do so; instead the (ideal) price rises linearly. This ideal price may not reflect the real price, because the limited set of (sub)system states that needs to be traced is not known. The term price represents here a combination of performance-related aspects of the analysis, such that an increase in price is associated with higher use of memory, a higher amount of samples, lower accuracy of results, or increased processing, for example.
[0039] For such systems there exists a window size w and a factor a so that the mutual information of an n-variable system can be determined with k entropy windows of size w, where k < a·n. The amount of traced states then has an upper limit that grows linearly in n, as the number of states per window is constant. Because the price of solving such a system grows linearly, such systems can be called linearly complex systems and the system property can be called linear complexity.
[0040] Other systems are touched by the curse of dimensionality in the sense that the amount of states that need to be traced grows exponentially. They are called here exponential complexity systems. Similarly, if the amount of tracked states grows logarithmically, the system is called a logarithmic complexity system. The important notion made here is that many, if not most, of the systems encountered in the field of engineering may actually be analyzed without the curse of dimensionality, and their internal patterns and regularities can be expressed in linear or even logarithmic size (when compared to the system variable count). Due to this, in such systems it is possible to determine automatic control operations with a good performance. Furthermore, control operations may be based on sampling systems that have earlier been considered too complex for system analysis with the time and processing capacity available.
[0041] For example, a 3-variable XOR system, where C = A XOR B, can be recognized with a 2-variable window if A and B are re-expressed with variables E1=A∧B, E2=¬A∧B, E3=A∧¬B and E4=¬A∧¬B. In such a situation, the exclusiveness between C and E1 and between C and E4, and the implication from E2 and E3 to C, are easy to detect with the 2-variable window. In a sense, the re-expression raises the effective window size to 3, and similar kinds of operations can be used to grow the effective window size to any arbitrary value and therefore to detect patterns of arbitrary size. In this sense re-expression provides a way to virtually increase the traced window sizes.
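The effect of this re-expression can be demonstrated numerically. In the sketch below (an illustration over the uniform XOR truth table), the 2-variable window (A, C) carries no mutual information, while the window (E2, C) with E2 = ¬A∧B does:

```python
# Sketch of re-expression virtually raising the window size: the 2-variable
# window (E2, C) reveals the XOR dependency that the window (A, C) hides.
from collections import Counter
from math import log2
from itertools import product

def entropy(states):
    n = len(states)
    return -sum(c / n * log2(c / n) for c in Counter(states).values())

def mutual_info(xs, ys):
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# Columns: A, B, C = A XOR B, and the pair expression E2 = (not A) and B.
table = [(a, b, a ^ b, (1 - a) & b) for a, b in product((0, 1), repeat=2)]
A = [r[0] for r in table]
C = [r[2] for r in table]
E2 = [r[3] for r in table]

mi_ac = mutual_info(A, C)    # zero: XOR is invisible to this 2-window
mi_e2c = mutual_info(E2, C)  # positive: re-expression reveals the pattern
```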
[0042] In smart synthesis methods the window size is grown around regular subsystems, where regularity is revealed by an elevated level of subsystem mutual information. The learning engine 201 applies smart synthesis by targeting surprisingly common state combinations for re-expression. Surprisingly common refers here to a situation where the occurrences of the state combination exceed a predefined level. A pair expression corresponds here to a processed variable that may be used to express a state combination of two other variables. This expression mechanism is further supported by a technique called variation reduction, which virtually reduces expressed variables by making them undefined whenever the formed expression is true. The exemplary learning engine 201 of the present embodiment uses 2-sized windows for system analysis and implements the concept of system synthesis by using pair expressions for re-expressing the original data as lower-entropy variables.
[0043] Let us consider a system S, a naïve predictor P doing naïve Bayesian predictions, and a naïve compressor C that utilizes only variable state probabilities p(v) for encoding. The technical challenge then becomes to find a translation L from system S to a translated system S'=L(S) so that the error of the naïve predictor P is minimized and/or the compression rate of the naïve compressor C is maximized when system S' is used instead of S. Thus instead of seeking to optimize conventional prediction or compression algorithms as such, the learning engine 201 is configured to create a translation mechanism Λ that optimizes the language L=Λ(S) and the re-expressed system S'=L(S).
[0044] Figure 4 illustrates steps in a method for creating the model M for the system S in the language learner Λ of the learning engine 201. The method begins in a stage where statistically varying historical information on system S is input (400) in binary format S = {s1, s2, s3, ...} to the language learner Λ. The language learner Λ checks whether a surprisingly common state combination Cs (402) is found from S. This may be done by minimizing the following property over all system variable state pairs (vi, vj):

p(vi ∧ vj) · log( p(vi) · p(vj) / p(vi ∧ vj) )
[0045] Minimizing this value effectively drives the estimated maximum system entropy down, thus optimizing the system for naïve compression, and drives the system variables to be statistically more independent (in a weighted manner), thus optimizing the system for naïve predicting. The more common a state combination is, the more statistically independent it will be. Also, the more common the state combination is, the more accurate the dependency estimate will be and the more significant the combination is for the total prediction error.
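One hedged reading of the 'surprisingly common' criterion is scoring each variable pair by how much its joint probability exceeds the independence estimate and picking the minimum score; the exact scoring formula below is an assumption for illustration, and all names and data are hypothetical:

```python
# Hedged sketch: rank candidate state pairs so that surprisingly common
# combinations (joint probability well above the independence estimate)
# score lowest; the lowest-scoring pair is the re-expression candidate Cs.
from math import log2
from itertools import combinations

def best_pair(samples):
    """Return the variable pair whose joint true-state is most surprisingly common."""
    n = len(samples)
    m = len(samples[0])
    best, best_score = None, float("inf")
    for i, j in combinations(range(m), 2):
        p_i = sum(s[i] for s in samples) / n
        p_j = sum(s[j] for s in samples) / n
        p_ij = sum(s[i] and s[j] for s in samples) / n
        if 0 < p_ij and 0 < p_i < 1 and 0 < p_j < 1:
            # negative when p_ij exceeds the independence estimate p_i * p_j
            score = p_ij * log2((p_i * p_j) / p_ij)
            if score < best_score:
                best, best_score = (i, j), score
    return best

# Variables 0 and 1 co-occur far more often than independence would predict.
samples = [(1, 1, 0), (1, 1, 1), (1, 1, 0), (0, 0, 1), (0, 0, 0), (1, 0, 1)]
pair = best_pair(samples)
```

A full implementation would score all four state combinations of each pair, not only the (true, true) state sketched here.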
[0046] If Cs is found (404), the language learner Λ generates (406) an expression that is true when Cs is true, and not true when Cs is not true. In the technique, singular states vi, vj of variables Vi, Vj ∈ V are expressed with an expression Ek, which is true on condition

Ek ⇔ (vi ∧ vj)
Variables Vi, Vj can be original binary variables or pair expressions. The benefit of pair expression comes from greater granularity when examining the statistical properties of variable combinations. In other words, it is possible to provide a better compression rate and better predictions when the probabilities p(¬Vi, ¬Vj), p(¬Vi, Vj), p(Vi, ¬Vj) and p(Vi, Vj) are known, compared to knowing only p(Vi) and p(Vj). Typically a greater granularity leads to a decrease in the system maximum entropy, which then increases understanding of the system.
[0047] While the introduction of expressions increases granularity, it also adds new variables and drives more dependencies and redundancy into the system. The increase in the variable count could drive the observed 'naïve' system entropy up, if the added dependencies are not taken into account. Now, if there is some known subset of variable states, it may become possible to determine the states of some otherwise 'unknown' variables based on these dependencies. Knowledge of such determination operations can then be used to eliminate the added redundancy from both entropy calculations and encoding.
[0048] Accordingly, for a system state s, if the state

Ek = (vi ∧ vj)

is known to be true, the states of the original variables do not need to be explicitly declared, because their combined states are (vi, vj). A more complex situation arises when Ek is false while Vi is known to be vi. In this case one can reason that Vj must be ¬vj, as vi ∧ vj is known to be false. Equally, the implication (¬Ek ∧ vj) → ¬vi applies. Similarly, when we have expressions

E1 = (vi ∧ vj), E2 = (¬vi ∧ vj), E3 = (vi ∧ ¬vj)
and E4 = (¬vi ∧ ¬vj), it is known that all of these expressions are in fact exclusive. If any of the expressions is known to be true, it is implicitly known that all other expressions are false. The same mechanism applies when exclusive expressions are E1=(a∧b) and E2=(¬b∧c) for variables A, B and C.

[0049] Introduction of a pair expression E to the system S may introduce redundancy to the produced system S∘E. This redundancy is removed (408) by the language learner Λ such that S translates into S'. The elimination of redundancy leads to the concepts of explicitly and implicitly known variables. In a situation where the translated system is written and read in a bit format, the bits need to be written and read in some order so as to know which variable's state they declare. The bit for a first variable needs to be explicitly read (and declared in the data), but depending on its state, the states of a number of remaining variables may become known. For example, the first bit may declare an expression E=A∧B to be true, and because the states of A and B become implicitly known, one no longer needs to explicitly declare and read the states of A and B. Similarly, every explicitly read bit thereafter may reveal the states of some following variables, making them implicitly known. As the process is deterministic, given some variable ordering, it is possible to unambiguously determine for each system state the implicitly known variables (called the implicit variables for s) and the explicitly declared variables (called the explicit variables for s).
[0050] A variable Vi in this new encoding is explicitly declared under some condition Ci, called a context. When this variable is encoded, its ideal codeword size is determined by its entropy, which is determined by its probability. Because the state of the variable needs to be encoded only under condition Ci, the entropy should be based on the probability p(Vi|Ci) instead of p(Vi). For entropy calculations and encoding, one is interested in a conditional variable Vi|Ci, and it is possible to ignore variable Vi and its states whenever Ci is false. Similarly, when a variable Y is predicted, knowing that variable Vi was both true and explicitly declared, it makes more sense to use the conditional probability p(Y|Vi∧Ci) instead of p(Y|Vi), because p(Y|Vi∧Ci) utilizes more information and so the prediction is more accurate. This means that after the introduction of pair expressions, one can focus on the conditional variables and the original variables can be considered not to hold much relevance. Since the variables are in this way transformed into a reduced context form, this mechanism is called the variable reduction mechanism.
[0051] As discussed above, the introduction of expression E1 may be considered as a translation of the existing system S = (V1, V2, ..., Vn) into a new system of form

S1 = (E1, V1|C1, V2|C2, ..., Vn|Cn)

which includes both the pair expression and the interesting conditional variables, while the non-interesting and non-conditional original variables V1, V2, ..., Vn are not included as separate variables. This model of system definition means that the system and its properties are changed when one introduces a pair expression. If one seeks to introduce another pair expression E2, the expression will be applied to the modified system S1 instead of the original system S. When introducing a sequence of pair expressions E1, E2, ..., Em, the expressions are applied to systems S0, S1, ..., Sm−1, which form a series. If symbol Ei denotes a function that converts an earlier 'generation' system Si−1 into a later generation system Si, each system can be expressed in form Si = Ei(Ei−1(...E1(S)...)) or recursively Si = Ei(Si−1). The sequence of pair expressions forms a pair expression language L = Em∘Em−1∘...∘E1 that can be used for translating the system, S' = L(S), and its reverse function can be used for interpreting the pair expressed system.
[0052] Accordingly, the language learner Λ of the present embodiment may eliminate the redundancy by developing rules for identifying redundantly declared variable states, marking them implicit, omitting them from encoding, and reformulating the system to have conditional variables defined only when their state is explicitly declared, so that the re-expressed system is of form S1 = {E1, V1|C1, V2|C2, ..., Vn|Cn}. This framework is used for deriving the translated system S' = L(S).
[0053] Finally, the language learner Λ may incorporate the idea that the series of pair expressions E1, E2, ..., Em is applied to the series of systems S0, S1, ..., Sm-1, so that the final translated system is S' = Em(Em-1(...E1(S)...)), where the series of pair expressions forms a pair expression language L = Em ∘ Em-1 ∘ ... ∘ E1.
[0054] For each system Si there is typically one variable which can be asserted never to be expressed by other variables or made implicit by the variable reduction mechanism, and that is the newest pair expression Ei, whose state should be read and encoded first. After reading the bit holding the Ei state, one can transform system Si bits to system Si-1 bits by dropping the Ei bit and by inserting the bits determined by the Ei state (guaranteeing that they are explicit). Now, the same conclusion of encoding priority can be repeated for the Si-1 bit sequence and all the systems and bit sequences until the original system's bit sequence is recovered. This means that the natural encoding order is preferably from the newest expression to the oldest expression and then to the original variables.

[0055] During its operation, the language learner Λ may use variable determination rules to mark variable states to be implicitly known. These variable determination rules may comprise:
1. Expression rule: if expression E is true, the expressed variable states are implicitly known to be whatever E expresses.

2. Expressed state exclusion rule: if E is not true, but one expressed variable is in a state that E expresses, it is known implicitly that the other expressed variable must be in the negation of the state E expresses.

3. Expression exclusion rule, which states that no two expressions can express the same variable at the same time. If an expression expressing a state of variable V is true, all other expressions expressing any state of variable V are implicitly known to be false. Also more complicated chains of expressions obey this rule: if E2 expresses E1, which expresses V, while E4 expresses E3 expressing V, then also E2 and E4 are exclusive.
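The first two determination rules can be sketched in code for a single pair expression over binary variables. This is an illustrative sketch only; the function and parameter names are invented here and do not appear in the patent.

```python
# Hypothetical sketch of determination rules 1 and 2 for one pair
# expression E = <vi, vj> over binary variables.

def mark_implicit(state, expr):
    """Return the set of variable names whose states become implicitly
    known, given explicit states and expr = (var_i, want_i, var_j, want_j)."""
    vi, wi, vj, wj = expr
    implicit = set()
    # Rule 1 (expression rule): if E is true, both expressed states
    # are implied by E and need not be encoded.
    if state.get(vi) == wi and state.get(vj) == wj:
        implicit.update([vi, vj])
    # Rule 2 (expressed state exclusion): E is false but the first
    # variable is in its expressed state, so the other variable must be
    # in the negation of its expressed state.
    elif state.get(vi) == wi:
        implicit.add(vj)  # vj is implicitly known to be (not wj)
    return implicit
```

For instance, with expression `('A', True, 'B', True)` a state where both A and B are true makes both implicit, while a state with only A true makes B implicitly false.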
[0056] Consider an original system state s with variable states x1, x2, ..., xn and a language L with expressions E1, E2, ..., Em. The language function can be defined through expression functions so that L = Em ∘ ... ∘ E1. In this case, the algorithm may be defined through expression function Ei+1, which translates an older generation system state into a newer generation one, si+1 = Ei+1(si). For each system state si and variable Vj the language learner Λ has both the variable's state and the variable's explicitness/implicitness stored.
[0057] Accordingly:
1. Considering system state si and expression Ei+1, the language learner Λ declares expression Ei+1 state to be true if and only if the expressed variables Vi and Vj have the expressed states vi and vj, and are explicit.

2. If the expression Ei+1 is set true, the expressed variables Vi, Vj and the older excluded expressions Ej, j<i+1 (which are known to be false) are marked implicit.

3. If the expression Ei+1 is not true, but the expressed variable Vi with higher priority has its expressed state, the lower priority variable Vj is marked implicit, as its state is known to be the negation ¬vj.
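The three translation steps above can be sketched as a single function operating on one generation of the system state. Names and the state representation (a dict mapping each variable to a `(value, explicit)` pair) are invented for this sketch.

```python
# Illustrative sketch of one translation step E^{i+1}: evaluate the pair
# expression on an older-generation state and mark variables implicit.

def translate(state, expr_name, expr):
    """state: dict var -> (value, explicit_flag); expr: (vi, wi, vj, wj)."""
    vi, wi, vj, wj = expr
    s = dict(state)
    vi_val, vi_ex = s[vi]
    vj_val, vj_ex = s[vj]
    # Step 1: the expression is true iff both expressed variables are
    # explicit and in their expressed states.
    e_true = vi_ex and vj_ex and vi_val == wi and vj_val == wj
    if e_true:
        # Step 2: the expressed variables become implicit.
        s[vi] = (vi_val, False)
        s[vj] = (vj_val, False)
    elif vi_ex and vi_val == wi:
        # Step 3: higher-priority variable has the expressed state, so
        # the lower-priority variable is implicitly known to be (not wj).
        s[vj] = (vj_val, False)
    s[expr_name] = (e_true, True)
    return s
```

Repeating this step for E1, E2, ..., Em yields the fully re-expressed state; the exclusion bookkeeping for older expressions is omitted here for brevity.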
[0058] Re-expression can be continued until the entire re-expression language L has been derived. After completed re-expression, the implicit variable states can be put aside without permanent loss of information.

[0059] The interpretation function (Ei)-1: Si → Si-1 does the opposite. Having si and expression Ei, there may be a need to translate from si to si-1. To perform this translation, the language learner Λ may analyse the state of expression Ei, and:
1. If the state is true, the expressed variables Vi, Vj are marked to have the expressed states vi, vj, and the excluded expressions Ej with j<i to have false state. All these variables are also marked explicit.

2. If the state is false, while the higher priority expressed variable has the expressed state, the state of the lower priority variable is known to be the complement of the state expressed by Ei. The state of the higher priority variable may not be explicitly known when Ei is resolved, so the determination may be made via complex state determination rules or at the moment the higher priority variable's state becomes explicitly known.
[0060] Let us consider a simple system S of binary variables A, B and C with expression E1 attempting to express true states of A and B, and E2 attempting to express true states of B and C. Languages may be defined such that the oldest expressions hold priority over newer expressions, which means that in the case of system state s = A ∧ B ∧ C, expression E1 is allowed to be true, expressing true A and true B, while E2 is determined false, because B is already expressed by E1. This is convenient, because when re-expressing, the language learner Λ may process the oldest expression first and the newest expression last (remembering that S' = Em(Em-1(...E1(S)...))).
[0061] Let us consider the introduction of expression E1 = A ∧ B and the resulting intermediate system S1 = {E1, A|CA, B|CB, C|CC}. The unknowns for this system are the context variables of form CX. Because E1 expresses A and B, B's context resolves to CB = ¬E1. Now, based on the state exclusion rule: whenever E1 is false but B is true, A is implicitly known to be false, thus reducing conditional A's context to CA = ¬E1 ∧ ¬B. Knowing all this, the intermediate system resolves to S1 = {E1, A|¬E1 ∧ ¬B, B|¬E1, C}.
[0062] The following table demonstrates relationships between different original system states and intermediate system states.
[Table: original system states of variables A, B and C, and the corresponding intermediate system S1 states with implicitly known states omitted.]
[0063] On the extreme right, under column S1, one can see the binary form of the system S1 states. In the binary format, the system states may be encoded from the newest expression to the oldest, up to the original variables, and all implicitly known variable states (bits) may be left out.
[0064] Now, introducing the second expression E2 will modify the system S1 further to form S2 = {E2, E1|CE1, A|CA, B|CB, C|CC}. In here, the second expression will have form E2 = (B|¬E1) ∧ C with a logical syntax, which interprets as E2 = ¬E1 ∧ B ∧ C = ¬(B ∧ A) ∧ B ∧ C = B ∧ ¬A ∧ C. Thus, if E2 is true, B and C are implicitly true, A is implicitly false, and E1 is implicitly false, as E1 and E2 are exclusive. One can further specify the system variable contexts, which are CE1 = ¬E2, CA = ¬E1 ∧ ¬B, CB = ¬E1 ∧ ¬E2 and CC = ¬E2. Similarly, it is then possible to provide a table that describes how the original system states are transformed to translated states:
[Table: original system states of A, B and C, and the corresponding translated system S2 states.]
[0065] Note that even while expressions E1 and E2 are exclusive, E2 is not marked implicit when E1 is true. This is because in 'right-to-left' reading, whenever the state of E1 is read, the state of E2 is already known, and thus the true state of E1 is not useful for determining the state of expression E2.
[0066] It is further noted that most of the variables under discussion here are conditional variables of form V|C, i.e. they are defined only under definition context C. For convenience, variables are assumed to have a definition context C, so that for original system variables of form V the definition context is simply defined true, C = true. It may be convenient to mark a reduced variable with notation V' = V|C with implicit context, so that p(V') = p(V|C) and p(V'|P) = p(V|P ∧ C). In this paper, the V' syntax is used only when the variable is reduced because of re-expression. If a variable gets re-expressed a number of times, syntaxes V'', V''' and so forth may be used to mark this. Furthermore, as an addition to notation E = vi ∧ vj, notation E = <vi, vj> may be used to emphasize that the question is of pair expressions and not of any arbitrary conjunction.
[0067] In a further example, let us consider a heavily regular system with three states: ¬A ∧ B ∧ ¬C, A ∧ ¬B ∧ ¬C and ¬A ∧ B ∧ C. Consider pair expressions E1 = <¬A, B> and E2 = <E1, C> and the system states. The state expressions can be visualized with the following logical table (dotted cells for E1, lined for E2):

[Logical table: the three system states with the cells covered by expressions E1 and E2 marked.]
[0068] The system S can be re-interpreted by using the pair expressions E1 and E2 so that the variables A, B and C in their conditional reduced form no longer vary. Instead, the reduced A', B' and C' are always zero and have zero entropy. This effectively turns a 3-variable system with 8 possible states into a 2-variable system with 4 possible states. Because it is known that E1 and E2 exclude each other, the amount of possible states can be reduced to 3. This state reduction results in a heavy compression rate for this heavily regular system. In fact, when the bits of E1 and E2 are encoded with codewords whose sizes accord with their entropy, the encoded system state size is ideal and equal to the system entropy.
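The naïve entropy reduction discussed above can be illustrated with a small helper that sums per-variable binary entropies over a sample. This is a generic sketch; the sample states and variable names below are illustrative, not taken from the table.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a binary variable with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def naive_entropy(samples, variables):
    """Naive system entropy: sum of marginal binary entropies.
    samples is a list of dicts mapping variable name -> bool."""
    n = len(samples)
    total = 0.0
    for v in variables:
        p = sum(1 for s in samples if s[v]) / n
        total += binary_entropy(p)
    return total
```

Comparing `naive_entropy` over the original variables against the same quantity over the expression bits (with the reduced variables contributing zero) shows the compression gain for a regular system.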
[0069] A pair expression language does not only contain a number of ordered pair expressions and starved variables, but also knowledge of the variables present in the system. Given language L, it is basically possible to transform any sample s = {x1, x2, ..., xn} in original system S into translated form s' in translated system S'. Equally, the language provides means to translate system S' state s' into original form s = L-1(s'). Pair expressed data may include samples in either translated form or in an intermediate form, which is easy to convert to translated or original forms. Pair expression statistics record information collected from the pair expressed data to enable pair expression formation, predicting and compression.
[0070] Accordingly, in the beginning the language learner Λ has an initial language L0, which does not comprise any pair expressions, and for which L0(S) = S applies. In this context the term expression is understood as a function from a system state si to a translated system state si+1, so that the expression is Ei+1: Si → Si+1. The language L is developed incrementally by introducing expressions one by one, so that the generation i language Li can be defined as Li = Ei ∘ Ei-1 ∘ ... ∘ E1, or recursively as Li = Ei ∘ Li-1. If different re-expressed systems and system states are marked with a similar notation, a system state s can be expressed in a translated form of any generation i, so that si = Li(s), which is equal to si = Ei(Ei-1(...E1(s)...)).
[0071] There are also some samples of system states s1, s2, ..., sn, which can be expressed as a subsystem SD = {s1, s2, ..., sn} that is essentially a subset of the state set of the original system (SD ⊆ S). Furthermore, there is a translated SD' = L(SD), which can be called pair expressed data or a pair expressed sample.
[0072] For a translated sample SD', special statistics β' = β(SD') are generated and maintained. The statistics contain for a translated system variable Vi' ∈ S', defined in context Ci, at least the following values:

- n(Vi'), which describes in how many samples Vi' is both defined and true, and n(Ci), which describes in how many samples Vi' is defined.

[0073] Also for each variable pair Vi', Vj' the values n(Vi' ∧ Vj') and n(Ci ∧ Cj) are calculated and maintained. This numerical information can be used to calculate for each Vi' and Vj' the probabilities p(Vi'|Ci) and p(Vi' ∧ Vj'|Ci ∧ Cj). This knowledge is applied by the learning engine in pair expression language forming.
[0074] The language learner Λ thus accepts a language Li and statistics βi and gives back a pair expression Ei+1 = Λ(Li, βi), which can be used to form another generation of the language, Li+1 = Ei+1 ∘ Li. Accordingly, after generating an expression, the language learner Λ stores (410) the translated sample SDi+1 and associated statistics βi+1. After forming the next generation of the language, another generation of the system data and statistics can again be formed, until the entire language L is formed. If in step 404 no surprisingly common states are detected, S' = S (412) and no re-expression language is formed.
[0075] One model to check whether the entire language L has been formed (414) is one of greedy search, where the benefit of adding each potential pair expression E is measured with a benefit function b(E, β), and the best possible pair expression is considered provided that its benefit value is above a certain threshold t. If there are no pair expressions above threshold t, the expression adding stops and the language is complete. So if ξ(L) is defined as the set of all possible pair expressions for language L, the pair expression learner function can be defined as:

Λ(L, β) = E ∈ ξ(L) | b(E, β) ≥ t ∧ ∀ D ∈ ξ(L), D ≠ E: b(E, β) ≥ b(D, β)

[0076] With this model, the operation of language learner Λ is approximated by defining the benefit function b(E, β) and threshold t. Hereinafter the benefit function is referred to simply as b(E), assuming implicitly that the used probabilities are based on the statistics β, which again are calculated through the examined language Li and data SDi. Other models and means for checking the appropriate extent of re-expressions for the system may be applied without deviating from the scope of protection.
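One greedy step of this search can be sketched as follows, using the bit-saving benefit measure introduced later in equation (14). The function names are invented for the sketch, and the benefit definition here is one concrete choice among those the text allows.

```python
import itertools
import math

def benefit(p_ij, p_i, p_j):
    """Ideal bits saved per sample by re-expressing the state pair:
    p(vi ∧ vj) * log2(p(vi ∧ vj) / (p(vi) p(vj))); cf. equation (14)."""
    if p_ij == 0.0:
        return 0.0
    return p_ij * math.log2(p_ij / (p_i * p_j))

def best_pair(samples, variables, t=0.0):
    """Greedy learner step: the variable pair with the highest benefit,
    or None if no pair exceeds the threshold t."""
    n = len(samples)
    best, best_b = None, t
    for vi, vj in itertools.combinations(variables, 2):
        p_i = sum(1 for s in samples if s[vi]) / n
        p_j = sum(1 for s in samples if s[vj]) / n
        p_ij = sum(1 for s in samples if s[vi] and s[vj]) / n
        b = benefit(p_ij, p_i, p_j)
        if b > best_b:
            best, best_b = (vi, vj), b
    return best
```

The full learner would repeat this step, re-express the sample with the chosen pair, recompute the statistics, and stop once no pair clears the threshold.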
[0077] The pair expression learning or inclusion mechanism of Λ minimizes the naïve entropy of the re-expressed system and forms an optimal language-statistics pair (L', β') for the prediction work. In practice, these aims are basically different sides of the same goal; they both benefit from the reduction in the naïve system entropy. However, any arbitrary amount of pair expressions cannot be applied. When compressing, unlimited introduction of expressions would bloat the serialized language definition horribly, while the consequence for predicting is overfitting.
[0078] When a pair expression E = <vi, vj> is applied to a system, the naïve entropy is reduced by a negative delta

ΔH = −p(vi ∧ vj) log2 ( p(vi ∧ vj) / (p(vi) p(vj)) )

and the language definition will also increase by some ΔHL. To decrease the size of an entire data sample, the following must apply:

b(<vi, vj>) > ΔHL / n
[0079] If this equation is used for pair expression inclusion logic, when the amount of samples approaches infinity the right hand threshold reduces to zero and the effect on the system entropy starts to dominate the pair expression inclusion.
[0080] For individual state combinations, statistical dependency of the state combination will inevitably predict mutual information I(Vi; Vj) between the variables. This is simply because mutual information vanishes when p(vi ∧ vj) = p(vi) p(vj) applies for each state combination vi, vj. Thus, when finding out elevated mutual information, it is sufficient to concentrate on the deviation in the dependency value:

d(vi ∧ vj) = p(vi ∧ vj) / ( p(vi) p(vj) )
[0081] If this value is above or below one for any state combination, the variables Vi, Vj are dependent. The problem with the dependency value is that the relevance of the state combination is heavily dependent on the state combination probability, i.e. the bigger the state combination probability is, the more relevant the state combination is for mutual information, and actually for both compression and prediction work altogether. For this reason, a better indicator has been determined to be the bit amount ideally saved by re-expressing the state combination:

p(vi ∧ vj) log2 ( p(vi ∧ vj) / (p(vi) p(vj)) )
[0082] In this case, it is possible to compare the bit amounts of encoding the state with two codewords (one for vi and another for vj) against encoding the state with a codeword of length −log2 p(vi ∧ vj). The state combination is true with probability p(vi ∧ vj), which then results in the above equation. When compressing n samples, the amount of information saved by the state re-expression in total can be compared to the amount of adding a pair expression entry into the language definition. If one ignores that the addition of the pair expression will also change the encoding of the remaining states, the entity consisting of pair expressed data and language is compressed if

p(vi ∧ vj) log2 ( p(vi ∧ vj) / (p(vi) p(vj)) ) > HL / n
[0083] Herein HL is the entropy of the pair expression entry, i.e. it expresses the growth of the language definition in bits. When the sample amount n approaches infinity, the right side of the equation will approach zero. For all variable pairs it applies that there is a state combination that will fulfill this condition and therefore will cause a pair expression to be formed. After the pair expression is introduced, the same condition will again apply to the starved variable pairs, as well as to pairs of other variables and the formed pair expressions. Eventually all original variables become entirely eliminated by starvation, and for each system state there will be a pair expression which is true only for that system state and false for all other states. In this sense, with infinite samples there will be an equivalent pair expression expressing each system state; the naïve system entropy becomes equal to the true system entropy (of the sample SD), and the encoding transforms into system state encoding, thus producing optimized encoding.
[0084] From the equation the benefit function can be defined:

b(<vi, vj>) = p(vi ∧ vj) log2 ( p(vi ∧ vj) / (p(vi) p(vj)) ) , (14)

while the threshold is defined as t = 0.
[0085] In case only a small sample of states {s1, s2, ..., sn} = SD ⊆ S of the original system is known, the inclusion logic presented in the previous chapter may not provide optimal expressions for S, but for system SD, which contains only the small sample. It is noted that an expression designed for SD may not always provide a good compression rate when applied to S. If the sample is small, random variation present in SD can be recognized as patterns, and additional expressions may be formed to reflect random noise. Similarly, if the pair expression language formed for SD is used to make predictions with S, the pair expression language may suffer from overfitting, which will drive error into the predictions.

[0086] There are some ways to try to avoid this kind of error. One simplistic method is simply introducing a threshold t > 0 so that for a pair expression b(E) ≥ t applies.
[0087] Another way is to try to minimize the potential error in the predictions. If the rate p(a ∧ b) / (p(a) p(b)) is elevated, it may cause a systematic bias of size b = p(a ∧ b) − p(a) p(b) in the calculation. In a prediction situation where the measurement error eN of p(a) p(b) plus the systematic error b are both smaller than the measurement error eS of p(a ∧ b), it is beneficial to use the approximate p(a) p(b) instead of p(a ∧ b). This results in a comparison

eN < eS − b
[0088] which can be transformed into form b < eS − eN,
[0089] where all errors e are of form

e = sqrt( p (1 − p) / n )
[0090] in case the approximate is simply the calculated average of n samples of a binary variable following the Bernoulli equation. Assuming that probabilities p are equally likely, for n samples, where in k samples X is true, we get the distribution

p(p | N=n ∧ K=k) = (n + 1) C(n, k) p^k (1 − p)^(n−k)
[0091] Assuming a flat a priori distribution for p,

[0092] the expectation value of the above distribution can be used as an estimate

p̂ = E[p | N=n ∧ K=k] = (k + 1) / (n + 2)
[0093] The standard deviation of the distribution can be used as the error:

e = sqrt( (k + 1)(n − k + 1) / ( (n + 2)^2 (n + 3) ) )
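The estimate and its error above are the mean and standard deviation of a Beta(k+1, n−k+1) posterior, and can be computed directly. This is a small illustrative sketch; function names are invented here.

```python
import math

def laplace_estimate(k, n):
    """Posterior mean of a Bernoulli parameter under a flat prior:
    E[p | k successes in n trials] = (k + 1) / (n + 2)."""
    return (k + 1) / (n + 2)

def laplace_error(k, n):
    """Posterior standard deviation: square root of the
    Beta(k+1, n-k+1) variance."""
    a, b = k + 1, n - k + 1
    return math.sqrt(a * b / ((n + 2) ** 2 * (n + 3)))
```

With no observations the estimate is 0.5 with a large error; as n grows, the error shrinks roughly as 1/sqrt(n), which is what drives the inclusion thresholds toward zero in the limit.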
[0094] The pair expression inclusion equation based on prediction error will meet the system entropy based equation when the samples are big enough. When n approaches infinity, the errors eAB, eA, eB will reduce to zero and the equation will gain form b > 0, which equals p(A ∧ B) > p(A) p(B) and p(A ∧ B) / (p(A) p(B)) > 1 and log ( p(A ∧ B) / (p(A) p(B)) ) > 0 and

p(A ∧ B) log ( p(A ∧ B) / (p(A) p(B)) ) > 0
[0095] A further way to approach the problem is to have the naïve assumption of statistical independence as the base assumption, and use a pair expression only in the situation when the likelihood that the variables are not independent is significantly higher than the likelihood that the variables are independent. This can be expressed through the following equation:

p(K=k | N=n ∧ p(A ∧ B) = p̂(A ∧ B)) / p(K=k | N=n ∧ p(A ∧ B) = p(A) p(B)) > t

where t is a threshold and where the equation can be derived into form

p̂(A ∧ B)^k (1 − p̂(A ∧ B))^(n−k) / ( (p(A) p(B))^k (1 − p(A) p(B))^(n−k) ) > t
[0096] If the variables are independent, the left side of the equation equals 1. Putting the equation inside a logarithm, approximating

p̂(A ∧ B) ≈ k / n

and applying

−H(A ∧ B) = p(A ∧ B) log p(A ∧ B) + (1 − p(A ∧ B)) log (1 − p(A ∧ B))

provides an equation:

n ( −H(A ∧ B) − p(A ∧ B) log p(A) p(B) − (1 − p(A ∧ B)) log (1 − p(A) p(B)) ) > log t
[0096] The above equation is true if re-expression of a state s = A ∧ B will save in total over log t bits. The equation contains saved bits for the state negation information:

p(A ∧ B) log ( p(A ∧ B) / (p(A) p(B)) ) + (1 − p(A ∧ B)) log ( (1 − p(A ∧ B)) / (1 − p(A) p(B)) ) > t0 / n

where t0 = log t. This equation may apply even in situations where the following dependency value is less than one:

d(A ∧ B) = p(A ∧ B) / ( p(A) p(B) )
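The inclusion criterion above, including the negation term, can be sketched directly. The left-hand side is the Kullback-Leibler divergence between a Bernoulli(p(A∧B)) and a Bernoulli(p(A)p(B)) variable, so it is non-negative and zero exactly at independence. Function names are invented for this sketch.

```python
import math

def saved_bits(p_ab, p_a, p_b):
    """Bits saved per sample by re-expressing state A ∧ B, including the
    state-negation term from the derivation above."""
    q = p_a * p_b
    total = 0.0
    if 0.0 < p_ab and 0.0 < q:
        total += p_ab * math.log2(p_ab / q)
    if p_ab < 1.0 and q < 1.0:
        total += (1 - p_ab) * math.log2((1 - p_ab) / (1 - q))
    return total

def include_expression(p_ab, p_a, p_b, n, t0=0.0):
    """Inclusion test: total saved bits over n samples must exceed t0 = log t."""
    return n * saved_bits(p_ab, p_a, p_b) > t0
```

Note that `saved_bits` is positive for both above-one and below-one dependency values, matching the remark that the criterion may apply even when d(A ∧ B) < 1.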
[0097] Figure 5 illustrates in more detail elements of an exemplary pair expression enhanced prediction process of the predicting engine 202. Predicting can usually be expressed as requesting the probability for variable Y when the states x1, x2, ..., xn of variables X1, X2, ..., Xn are known. This can be expressed as a conditional probability p(Y | x1 ∧ x2 ∧ ... ∧ xn), which can be given in form

p(Y | x1 ∧ x2 ∧ ... ∧ xn) = p(Y ∧ x1 ∧ x2 ∧ ... ∧ xn) / p(x1 ∧ x2 ∧ ... ∧ xn)
[0098] where the numerator is equivalent with p(Y ∧ x1 ∧ x2 ∧ ... ∧ xn), which can be processed with conditional probability into form

p(Y ∧ x1 ∧ x2 ∧ ... ∧ xn) = p(x1 | Y ∧ x2 ∧ ... ∧ xn) p(x2 | Y ∧ x3 ∧ ... ∧ xn) ... p(xn | Y) p(Y)
[0099] If a naïve independence assumption is made, the conditional probabilities can be marked

p(xi | Y ∧ xi+1 ∧ ... ∧ xn) = p(xi | Y)

which brings the original equation into form:

p(Y | x1 ∧ x2 ∧ ... ∧ xn) = p(Y) p(x1 | Y) p(x2 | Y) ... p(xn | Y) / p(x1 ∧ x2 ∧ ... ∧ xn)
[0100] Unfortunately, the naïve assumption relatively rarely holds, which introduces into the equation a systematic bias whose magnitude corresponds with the level of dependency between the variables Xi. In fact, if the assumption does not hold, the result of the previous equation may be over 1, which makes it invalid. In addition to this, p(x1 ∧ x2 ∧ ... ∧ xn) may be difficult to estimate. Consequently, another form of the equation is used in the embodied predicting engine 202. Understanding that p(Y) + p(¬Y) = 1, the original Bayesian equation can be written for a condition C in form

p(C) = p(C | Y) p(Y) + p(C | ¬Y) p(¬Y)

resulting in

p(Y | C) = p(C | Y) p(Y) / p(C) = p(C | Y) p(Y) / ( p(C | Y) p(Y) + p(C | ¬Y) p(¬Y) )
[0101] Combining the previous equations yields

p(Y | x1 ∧ x2 ∧ ... ∧ xn) ≈ α / (α + β)

where

α = p(Y) p(x1 | Y) p(x2 | Y) ... p(xn | Y)

and equivalently

β = p(¬Y) p(x1 | ¬Y) p(x2 | ¬Y) ... p(xn | ¬Y)
[0102] This series of equations provides an approximation p̂(Y | x1 ∧ x2 ∧ ... ∧ xn), which is accurate under the naïve assumption. When the naïve assumption does not hold, it is anyhow well behaving in the sense that its value is always inside the range [0, 1].
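The normalized form of the naïve Bayesian estimate in [0101]-[0102] can be sketched in a few lines. The function name and the likelihood-list representation are invented for this sketch.

```python
def predict(prior_y, likelihoods):
    """Normalized naive Bayes estimate of p(Y | x1 ∧ ... ∧ xn).
    likelihoods is a list of (p(xi|Y), p(xi|not Y)) pairs. The result
    stays within [0, 1] even when the independence assumption fails."""
    alpha = prior_y         # alpha = p(Y) * product of p(xi|Y)
    beta = 1.0 - prior_y    # beta  = p(not Y) * product of p(xi|not Y)
    for p_given_y, p_given_not_y in likelihoods:
        alpha *= p_given_y
        beta *= p_given_not_y
    return alpha / (alpha + beta)
```

Because the result is a ratio of the two branch products, it cannot exceed 1 the way the unnormalized form in [0099] can.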
[0103] The advantage of using pair expressions for predicting follows from the original system re-expression S' = L(S) using the pair expression language function L. In system

S' = {X1|C1, X2|C2, ..., Xk|Ck}

where conditional variables can also be referred to with notation Xi' = Xi|Ci, the conditional variables are either exclusive (Xi' → ¬Xj'), or for each two conditional variable states vi', vj' the statistical dependency d(vi'; vj') is constrained to be close to 1. Both of these conditions reduce the systematic error that results from the naïve assumption.
[0104] In a typical prediction situation there is given some condition C, which describes the state of some set of variables in system S. The task is to know the probability of an unknown variable Y under this condition C. In a very simplistic way of making the prediction, given the original system S, there is usually some set of states SC ⊆ S for which the condition applies and which can be re-expressed using the language function so that SC' = L(SC). After deriving the matching states, in principle, the calculation of probability p(Y|C) should be simple in the sense that:

p(Y|C) = Σ{s' ∈ SC' : Y true in s'} p(s') / Σ{s' ∈ SC'} p(s')

where the sigma in the numerator sums up the probabilities of all re-expressed system states where Y is true.
[0105] A problem with this model is that the state probabilities are difficult to approximate. Here one may replace p(s') with p(s'|Y) p(Y), which can be approximated under the naïve assumption with p(s'|Y) = p(v1'|Y) p(v2'|Y) ... p(vk'|Y), where v1', v2', ..., vk' are all the explicit variable states for system state s'. In case the states of all other variables are known, except for a predicted variable Y, there will be in fact only two different re-expressed states, which are s'y = L(C ∧ Y) and s'¬y = L(C ∧ ¬Y), and the estimate p̂(Y|C) becomes

p̂(Y|C) = p(s'y|Y) p(Y) / ( p(s'y|Y) p(Y) + p(s'¬y|¬Y) p(¬Y) )
[0106] It is noted that variables of form V' are used as a notation for conditional variables of form V|C. This means that when this notation is removed, the previous naïve approximate equation changes its form to

p(s'|Y) = p(V1|C1 ∧ Y) p(V2|C2 ∧ Y) ... p(Vk|Ck ∧ Y)

where all variable contexts of form Ci are true for state s', because variables with Ci false are excluded from the equation.
[0107] Let us consider a very simple system with variables Y, A and B. If the system is re-expressed with expression E = <A, B>, the system is in form S' = {Y, A|¬E ∧ ¬B, B|¬E, E}. Now, if we predict Y when both A and B are true, the naive equation turns into form p(s'|Y) = p(E|Y) = p(A ∧ B|Y), which in fact doesn't have systematic error, so that p̂(Y|C) = p(Y|C). Similarly, for the state where B is true and A is false, the estimate is

p(s'|Y) = p(B|¬E ∧ Y) p(¬E|Y)

that can be turned to the following form by remembering that B ∧ ¬E ∧ Y = ¬A ∧ B ∧ Y:

p(B|¬E ∧ Y) p(¬E|Y) = p(B ∧ ¬E|Y) = p(¬A ∧ B|Y)
[0108] Finally the last estimate for ®A.-.S! where variable state a can be A or -■A is
S Λ f j » ( ~B HE A Yipl ~E A ¥ )
[0109] Remembering that - BA -*EA ¥=-B Λ ¥ , the last kind of estimate is also without bias
Figure imgf000029_0002
[0110] This means that the pair expression mechanism with conditional variables works to remove the bias driven into the predictions by the naïve assumption. A conditional probability p(Y | x1 ∧ x2 ∧ ... ∧ xn) approximated by p̂(Y | x1 ∧ x2 ∧ ... ∧ xn) can be written in a form whose numerator is the joint probability p(a1 ∧ a2 ∧ ... ∧ an) of variable states ai, which can be derived into form

p(a1 ∧ a2 ∧ ... ∧ an) = p(a1) − p(a1 ∧ ¬a2) − p(a1 ∧ a2 ∧ ¬a3) − ... − p(a1 ∧ ... ∧ an−1 ∧ ¬an)

[0111] Under the independence assumption this probability is approximated as

p̂(a1 ∧ a2 ∧ ... ∧ an) = p(a1) p(a2) ... p(an)

[0112] Using the difference between the real value and the approximation, one can determine the inherent systematic bias of the approximation as

b(a1 ∧ a2 ∧ ... ∧ an) = p(a1 ∧ a2 ∧ ... ∧ an) − p(a1) p(a2) ... p(an)
[0113] By modifying the original equation, p(Y | x1 ∧ x2 ∧ ... ∧ xn) may be expressed as

p(Y | x1 ∧ x2 ∧ ... ∧ xn) = p(x1 ∧ x2 ∧ ... ∧ xn | Y) p(Y) / p(x1 ∧ x2 ∧ ... ∧ xn)

from where we can separate the relative bias component by replacing the context variables with ci = xi+1 ∧ xi+2 ∧ ... ∧ xn, so that

p(xi ∧ ci | Y) = (1 + ei) p(xi | Y) p(ci | Y)

where the relative error component ei is

ei = b(xi ∧ ci | Y) / ( p(xi | Y) p(ci | Y) )

which can also be expressed as

ei = p(xi ∧ ci | Y) / ( p(xi | Y) p(ci | Y) ) − 1
[0114] The relative bias is then proportional to values of form

b(xi ∧ ci | Y) / ( p(xi | Y) p(ci | Y) )

meaning that driving b(...) to zero will neutralize the relative bias by driving the factor (1 + ei) to 1, and that b(...) can be re-expressed as

b(xi ∧ ci | Y) = p(xi ∧ ci | Y) − p(xi | Y) p(ci | Y)
[0115] This means that the relative error can be further neutralized by driving equations of form

p(xi ∧ ci | Y) − p(xi | Y) p(ci | Y)

to zero. A familiar component in the equation is the measure for statistical dependency

d(xi; ci | Y) = p(xi ∧ ci | Y) / ( p(xi | Y) p(ci | Y) )
[0116] The error of the naïve Bayesian prediction is by far determined by how much this value differs from 1. Accordingly, after pair expression forming, all statistical dependency values should be close to 1 (or exactly 0).
[0117] Let us consider a situation where probability p(Y | x1 ∧ x2 ∧ ... ∧ xn) is replaced by its expressed form p(Y | v1 ∧ v2 ∧ ... ∧ vk), where each vi may be either a pair expression, a variable or a starved variable. Since all variables for which the benefit condition

p(vi ∧ vj) log2 ( p(vi ∧ vj) / (p(vi) p(vj)) ) > HL / n

[0118] holds have been turned into starved variables and pair expressions, for each two variables their statistical dependency is limited from the top by

d(vi; vj) ≤ 2^( HL / (n p(vi ∧ vj)) )

[0119] where the right side is always above 1 and will approach 1 from the positive side when n approaches infinity. This gives protection against greater-than-one dependency values between two variables.
[0120] Against below-one dependency values there is also protection, which comprises two parts:

1. A below-one dependency value for state combination vi, vj predicts above-one dependency values in one of d(¬vi; vj), d(vi; ¬vj) and d(¬vi; ¬vj). This is not a weak indication, as it is known (e.g. by definition of entropy) that below-one dependencies must be balanced by above-one dependencies in other variable states. Let wi, wj be a 'balancing' state with the highest above-one dependency value. When n approaches infinity, the balancing state dependency value will be limited from above by 1, which means that the balancing positive state is eventually eliminated by pair expression formation. When the 'balancing' positive state is removed, the statistical properties of the variable change, which brings the original below-one state dependency value closer to 1. As long as the state's dependency value remains below one, the variable combination will be subject to starvation as n approaches infinity, until there are no state combinations with either below-one or above-one dependency values left (instead the expression reduces to expressing individual system states with exclusive individual expressions). It is also worth noting that the lower the original below-one dependency value is, the higher the balancing state combination's dependency value will be, and the earlier the below-one dependency value will be neutralized by pair expression formation.
2. Furthermore, variable state combinations with below-one dependency values are very rare; the lower the dependency value, the lower the probability that both states co-occur. E.g. for dependency value 0, the probability that both states are present equals zero. Now, let us consider a situation where the need for knowing probability p(x|c) can be weighted with p(c). This is the situation for example for a heuristic function p(will win | game situation), where only game situations that actually happen (or may happen) in the game need to be evaluated. In this situation the expectation value for the square error is

E[e(p̂(x|c))] = ∫ (p(x|c) − p̂(x|c))² p(c) dc

and the effect on the average error that is caused by bias in p̂(x|c) is proportional to p(c). The more negative the dependency value is, the less likely p(c) becomes, and the more meaningless the resulting bias becomes. While this observation doesn't affect the actual bias, it means (with the given assumptions) that the bigger the bias (for below-one dependencies) is, the less relevant it actually becomes.
[0121] Figure 5 illustrates elements of a prediction process associated with the learning process of Figures 3 and 4. In the beginning of the prediction process, sample information si 501 on a target entity is input to the predicting engine. This sample information si is transformed with the language preprocessor Lp 502 into a binary format s 503 that was applied also in the learning engine 201. The sample information si may originate from the same one or more sources as the training data st used by the learning engine and be in the same format as the training data sets. The sample information si may alternatively originate from other sources and/or be in another format than the training data sets. In such a case the language preprocessor Lp 502 needs to comprise conversion algorithms for both formats of si and st, or a conversion algorithm for converting between the formats of si and st.
[0122] The preprocessed information s 503 is then forwarded to the re-expression phase 504, where it is recursively re-expressed with the pair expression language L that was generated by the learning engine 201 on the basis of the training data st. This results in re-expressed sample information s' 505. This re-expressed sample s' comprises one or more unavailable variables that need to be estimated by prediction. In the present embodiment, this prediction is made with a naïve Bayesian predictor 506 that outputs the re-expressed sample sf' 507 complemented with probabilities p for the unavailable parameters. Operations of the Bayesian predictor 506 do not require any adjustments due to the re-expression. The Bayesian predictor 506 uses the statistical information β that was generated by the learning engine 201 on the basis of the training data st. The predicting engine then uses one or more of these probability values as control indication ci to the mapping engine 102.
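For concreteness, the naïve Bayesian prediction step can be sketched as below. This is a minimal illustrative sketch, not the patented implementation: the function name, the Laplace smoothing choice and the use of the six 8-bit customer vectors from the bank-loan example of Figure 6 below are my own assumptions.

```python
import math

def naive_bayes_prob(samples, target_idx, known):
    """Estimate p(target = 1 | known bits) under the naive
    (conditional independence) assumption, with Laplace smoothing."""
    score = {}
    for t in (0, 1):
        rows = [r for r in samples if r[target_idx] == t]
        # smoothed prior p(target = t)
        logp = math.log((len(rows) + 1) / (len(samples) + 2))
        for idx, bit in known.items():
            match = sum(1 for r in rows if r[idx] == bit)
            # smoothed per-variable likelihood, multiplied naively
            logp += math.log((match + 1) / (len(rows) + 2))
        score[t] = math.exp(logp)
    return score[1] / (score[0] + score[1])

# The six 8-bit customer vectors (jmslnhpe) of the Figure 6 example:
data = [[int(b) for b in v] for v in
        ["01000101", "00101011", "01001011",
         "01001000", "10010010", "10010000"]]
p = naive_bayes_prob(data, 7, {5: 1})  # p(eligible | high income)
```

With these vectors the estimate comes out above one half, reflecting that the single high-income customer in the sample paid back the loan.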
[0123] Figure 6 illustrates an exemplary process where the apparatus is used in a system that inputs complex sampled information collected from one or more information sources on a target entity (a person or a legal entity) and implements a control operation by outputting a classification indication for the target entity. The exemplary classification relates to the capability of the target entity to fully pay back a bank loan granted to him. The processing engine inputs sampled data records associated to the target entity, and computes a control indication that now indicates a probability p for the situation that the target entity will not pay back his loan. The mapping engine comprises a mapping between control indications and classes of classification, where one class corresponds with a level of recommendation on whether to grant the loan or not. Accordingly, the mapping engine maps the received probability p to a class, and generates a control operation call for an output function with that particular class. In response to the received control operation call, the operations engine of the apparatus outputs information including the determined class, e.g. through the user interface of the apparatus.
[0124] The basic situation is thus rather familiar: a person or a company is asking for a loan, and a bank official has to consider, according to his best judgment, whether the loan should be granted or not. Conventionally, such judgment has been based on the official's good understanding of customers, their character and loans, and the understanding has been based on the wits and - even more - on the experience of the bank officer. In order to support the decision, banks have provided their officers with views to one or more databases and created guidelines for interpreting the information in the databases in a methodical way. However, this is still a relatively simple and heuristic way for such an important operative decision in the financial industry.
[0125] Personal information given with loan applications provides a considerable amount of background information, and thus describes many features of the target entity. In addition, the banking industry is very well recorded, and historical data on the payback performance of earlier customers is known in detail. These pieces of information in combination provide a promising basis for finding dependencies between them, and they may be used as historical information on the basis of which subsequent target entities may be evaluated. However, in order to achieve the necessary accuracy, significant amounts of complex data need to be processed. It is clear that this type of estimation requires use of information processing systems that are able to make the computations without simultaneous support operations by the operator of the system.
[0126] Accordingly, the situation can be described such that there is some piece of information on the customer, and the user of the system needs to create control information on whether the customer should be granted a loan or not. In order to do this, the system first gains the necessary experience to make estimations, i.e. it forms an understanding of the context on the basis of earlier experiences with one or more existing customers. The information that comprises data on features of a number of training targets and experiences gained from their payback behavior is denoted with I.
[0127] The process of Figure 6 begins at a stage where the apparatus is switched on and operative to process data, as will be described in the following. Recorded information I = (i1, i2, i3, ...) is input (600) to the system. In this simplistic example, historical information is available on six known customers. The historical information comprises the age and annual income of the customer, a recorded impression by a bank official, and an experimental indication on whether an earlier loan was paid or not. This means that one knows four parameters (age group, income level, recorded impression type, experience), of which the last parameter is the one that is then typically unavailable in later estimations. The recorded information can be described in the following way:
1. Jane, 47 (middle), high, negative, paid
2. John, 68 (senior), normal, positive, paid
3. Martha, 51 (middle), normal, positive, paid
4. Donald, 43 (middle), normal, negative, unpaid
5. Sally, 21 (junior), low, positive, unpaid
6. Jack, 27 (junior), low, negative, unpaid
[0128] This information may be in various formats; for example, one part may be in an Excel file and another part in a text file, perhaps exported from a database. The input data is first translated (602) into binary form to form a system S of bit vectors.
[0129] To translate information from / to S the method applies a language preprocessor Lp, where a bit at each specific location is adjusted to have a specific meaning. For example a bit at the end of a vector may be associated with the experience, i.e. be true, when the customer did pay back his loan and false when the customer for some reason did not pay back his loan. Another bit may describe a defined personal property of the customer. For example, bit number one may be true, when the customer is young. Bit number four may describe whether the customer is with limited means and so forth. When predicting bits based on bits, one in fact assigns probabilities for Boolean variables based on a set of known Boolean variables. This association between meanings and bits forms a language for this specific context.
[0130] The information may comprise discrete categorical properties, such as whether the customer paid the granted loan or not. Such properties are easy to bring into binary format; a defined bit may be assigned to indicate whether the customer paid or not. The age of the customer represents another type of property. Allocating one bit for each age would be ineffective. Instead, age could be turned into a binary format by first dividing customers into three different age groups, such as junior (less than 35 years), middle (35-55 years) and senior (55+ years). One needs one bit for each of these three categories, and exactly one bit is always true while the others are not. For example, the junior bit will be true only for bit vectors associated to young customers below 35 years of age, and for those the other bits (middle, senior) are false. A similar technique can be used for coding e.g. the customer's monthly income into bits. There may be one bit for customers with low income (<2000€ per month), one for normal income customers (2000€-4000€) and one for high income customers (4000+€). The information may also comprise some other type of information, like the bank official's personal evaluation of the customer.
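A sketch of this one-bit-per-category coding, using the thresholds given above (the helper functions themselves are illustrative assumptions):

```python
def age_bits(age):
    # junior (<35), middle (35-55), senior (55+); exactly one bit is true
    return [int(age < 35), int(35 <= age < 55), int(age >= 55)]

def income_bits(monthly_eur):
    # low (<2000 EUR/month), normal (2000-4000), high (4000+)
    return [int(monthly_eur < 2000),
            int(2000 <= monthly_eur < 4000),
            int(monthly_eur >= 4000)]

# e.g. age_bits(21) == [1, 0, 0] and age_bits(68) == [0, 0, 1]
```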
[0131] After the translation, the input information I has been brought into a binary form S = (s1, s2, s3, ...). After this preprocessing phase, the binary information undergoes a language learning phase (604) where a pair expression language L based on system properties is recursively constructed, the system is re-expressed and statistics from the re-expressed system are calculated. General stages of this phase have been described in detail in connection with Figures 3 and 4 above.
[0132] In this example, the information I can be transformed into binary form so that 3 bits are allocated for the age group (junior/middle/senior), 3 bits are allocated for the income level (low/normal/high), 1 bit for the evaluated impression (positive/negative) and 1 bit for loan eligibility. We then derive the following binary vectors:
j = junior, m = middle, s = senior, l = low, n = normal, h = high, p = positive, e = eligible

jmslnhpe
1. 01000101
2. 00101011
3. 01001011
4. 01001000
5. 10010010
6. 10010000
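The listing above can be reproduced mechanically; the helper below is an illustrative sketch of such a preprocessor for this example only (the table names and function are my own):

```python
# Pack one customer record into the 8-bit jmslnhpe layout.
AGE = {"junior": "100", "middle": "010", "senior": "001"}
INCOME = {"low": "100", "normal": "010", "high": "001"}

def to_bits(age, income, positive, paid):
    return AGE[age] + INCOME[income] + str(int(positive)) + str(int(paid))

records = [
    ("middle", "high",   False, True),   # 1. Jane
    ("senior", "normal", True,  True),   # 2. John
    ("middle", "normal", True,  True),   # 3. Martha
    ("middle", "normal", False, False),  # 4. Donald
    ("junior", "low",    True,  False),  # 5. Sally
    ("junior", "low",    False, False),  # 6. Jack
]
vectors = [to_bits(*r) for r in records]
# vectors == ["01000101", "00101011", "01001011",
#             "01001000", "10010010", "10010000"]
```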
[0133] The re-expression phase comprises a search for surprisingly common binary variable state pairs. This surprising commonness is evaluated in this embodiment with the following equation:
[Weighted dependency equation image not reproduced in the source text.]
[0133] The equation can be interpreted to represent a specially weighted statistical dependency value. With this kind of statistical analysis, the bit for juniority can be observed to be surprisingly common with the bit for the low income level. The previous equation gives this binary variable state pair a weighted dependency value of 3.17. The next phase is then to form a new expression to express this state combination. The expression is like any binary variable that is true whenever the juniority bit and the low income bit are true. Following pair expression notation, the expression can be marked as <junior, low>. Part of the pair expression philosophy is the avoidance of redundancy, and for this reason the states that are expressed by the new expression are then removed from the data. The 1-bits for junior and low are consequently replaced by '-' marks to indicate the points from which states were removed.
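The removal step can be sketched as follows. Note that this simplified sketch only handles pairs of positive states, whereas the method in general can pair arbitrary variable states; the function itself is an illustrative assumption:

```python
def apply_pair(rows, i, j):
    """Replace co-occurring 1-bits at positions i and j (i < j) with '-'
    and append the new expression bit (1 if the pair fired, else 0)."""
    out = []
    for r in rows:
        if r[i] == "1" and r[j] == "1":
            r = r[:i] + "-" + r[i + 1:j] + "-" + r[j + 1:] + "1"
        else:
            r += "0"
        out.append(r)
    return out

rows = ["01000101", "00101011", "01001011",
        "01001000", "10010010", "10010000"]
rows = apply_pair(rows, 0, 3)  # junior is bit 0, low is bit 3
# rows now matches the Exps1 listing below, e.g. rows[4] == "-00-00101"
```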
Exps1
jmslnhpe1
1. 010001010
2. 001010110
3. 010010110
4. 010010000
5. -00-00101
6. -00-00001
[0134] For several practical reasons, it is preferable not to include the values of removed states (marked with '-') in the statistics of the low and junior variables. This means that the statistics for low and junior will be gathered on the basis of 4 samples (instead of 6), and they will have different statistical properties when compared to the original low and junior variables. Indeed, they can be viewed as different variables, and to mark this difference one may name them reduced low and reduced junior and mark them with apostrophes, i.e. low' and junior'. Accordingly, when statistics are updated, the process continues by looking for new expressions. As a difference to the previous round, the statistics of the old variables have changed and there is also a new variable candidate (the <junior, low> expression) that can be targeted for re-expressing. In fact the next most promising expression is <middle, <junior, low>> with benefit value 2.00. After applying this expression, the system data again changes its form to
Exps2
jmslnhpe12
1. 0100010100
2. 0010101100
3. 0100101100
4. 0100100000
5. --0-0010-1
6. --0-0000-1
[0135] The next expression is then found with benefit value 2.00. It is <middle, senior>. After applying this expression, the data has reached the following form:
Exps3
jmslnhpe123
1. 01000101000
2. 0--01011001
3. 01001011000
4. 01001000000
5. --0-0010-10
6. --0-0000-10
[0136] As a consequence of these operations, the reduced junior, middle, senior and low variables, as well as the expression <junior, low>, no longer vary in the re-expressed data. For all recorded members, the reduced junior, senior, low and <junior, low> variables are zero and the reduced middle variable is one.
[0137] One consequence of this is the improved compressibility of the information, as the information that was originally stored in 4 original variables is now expressed with 2 variables. As a further consequence, the naïve entropy of the system has been reduced from 7.14 to 5.11, thus eliminating mutual information (a metric for regularity) worth 2.03 entropy units. This means that superficially an average system state can be encoded in 5.11 bits instead of 7.14. Also, after measuring statistics it appears that no two binary variable state pairs score over 2 (or even over 1.5) when measured with the weighted dependency value. It means that the operation has succeeded in reducing weighted statistical dependency between variables. This again predicts that naïve Bayesian predicting can be expected to be more accurate.
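The 7.14-bit figure can be checked directly: the naïve entropy is the sum of each binary variable's marginal entropy over the six samples. A back-of-envelope sketch (the handling of '-' marks needed for the re-expressed system's 5.11-bit figure is omitted here):

```python
import math

def h(p):
    # binary entropy in bits
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

rows = ["01000101", "00101011", "01001011",
        "01001000", "10010010", "10010000"]
naive_entropy = sum(h(sum(r[i] == "1" for r in rows) / len(rows))
                    for i in range(8))
# naive_entropy ≈ 7.14, matching the figure quoted above
```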
[0138] It is noted that the example is provided only to illustrate the process, and it naturally significantly simplifies the true procedure. In practice, a potential expression with a benefit less than twenty, fifty or seventy may not be worth applying, because of the risk that the learner will recognize noise as 'patterns' which apply only within the sample and nowhere else. The process typically comprises a threshold, and only potential expressions that yield a benefit above this threshold are introduced in the language. The threshold is advantageously set according to how well the sample is considered to represent the system. If a sample described the system perfectly, the threshold could be 0, but this typically cannot be assumed. A typical optimal threshold value is between 50 and 70.
[0139] As a further aspect, the example used 6 samples, but in practice hundreds or thousands of historical cases may need to be analyzed to give the potential patterns and the resulting predictions any statistical significance.
[0140] It is also worth noting that there are at least three different determination rules that can be used for eliminating redundancy by reducing variables, but only one of these rules was applied in this example. The used rule was the expressed state determination rule, which says that the expressed states are implicitly known and should be excluded (the ones marked with '-').
[0141] When the language L and the necessary statistics β are determined, the system is ready to process samples associated to individual customers. A sample i associated to a customer that applies for a loan from the bank is a data entry with one unavailable variable, i.e. the eligibility for a loan. The sample is input to the system (606), and again translated (608) into binary form s. The binary form sample is then re-expressed (610) with the language L learned in step 604 and fed into a naïve Bayesian predictor that estimates (612) a probability value for the unknown variable representing the customer's eligibility for a loan. This value is mapped into a recommendation that may then be output (614) through the user interface of the system to the user. The operative action may further or alternatively comprise implementing a data entry and/or changing the content of a data entry of the customer in a data repository where customers' background information is stored.
[0142] The apparatus is a special-purpose control device that predicts unavailable variables on the basis of statistically significant amounts of available training information and generates a control operation on the basis of this estimation. The example of Figure 6 shows how prediction based on earlier recorded payback experiences may be used to automatically provide a control value associated to an individual user and indicating his eligibility to pay back the loan. Use of pair expression based re-expression enhances the accuracy of the prediction without requiring larger samples and/or without essentially increasing the required amount of processing. Pre-processing of the samples with the pair expression based re-expression also improves compression of the data, so less data storage capacity is needed in the processing of the information. The apparatus provides an automatic control operation (visual, audio, change of state in a data record, entry of a data record, transmission of a message, etc.) on the basis of an individual sample and a prediction model created on the basis of earlier recorded data.
[0143] These or similar advantages are achievable in a variety of applications of this prediction-based apparatus. More generally, the apparatus may be used as a control device that provides a risk estimate for a target entity it receives information on. The risk estimate relates to the probability that a defined risk will materialize for that particular target entity. This risk estimate may then be used to initiate operative action(s) to reduce the probability that the risk materializes.
[0144] In an aspect, the apparatus may be used as a control device that samples various parameters of an industrial process, and monitors a statistically behaving fault in the process. If counteractions against the fault are expensive or slow down the process itself, it is preferable to trigger them only when the probability of the fault increases. On the basis of historical data on the appearance of the fault, the control device generates a pair-expression based model (L, β, S'). The control device then inputs samples at defined intervals, preprocesses them with the generated pair-expression based model and determines the probability of the appearance of the fault. Mapping in this context comprises comparing the computed probability against a predefined trigger level and selecting a control operation call to either initiate a counteractive operation in case the estimated probability exceeds a first predefined trigger level, or to end a counteractive operation when the probability is below another predefined trigger level. A more accurate estimation is achieved, which shows as improved performance of the control device. Counteractive measures are applied in the system controlled by the control device only when necessary, and therefore the throughput of the industrial process is improved.
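The two-threshold mapping described above amounts to simple hysteresis. A hedged sketch, with invented trigger levels (0.8 and 0.3 are assumptions, not values from the text):

```python
START_LEVEL = 0.8  # first predefined trigger level (assumed)
STOP_LEVEL = 0.3   # second predefined trigger level (assumed)

def next_state(active, fault_probability):
    """Return True if the counteractive operation should be running."""
    if not active and fault_probability > START_LEVEL:
        return True   # initiate the counteractive operation
    if active and fault_probability < STOP_LEVEL:
        return False  # end the counteractive operation
    return active     # otherwise keep the current state
```

The gap between the two levels prevents the counteraction from toggling on and off around a single threshold.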
[0145] As another example, let us consider that the target entity is a building and the risk relates to a defined structural defect, like mold buildup, that may have serious consequences if it is allowed to develop without interference. Structural details of buildings are well recorded, and a lot of environmental data is also available for analysis. Combining these data with experimental information on the occurrences of the defect enables estimating the risk of an individual building suffering from this defect. The information to be considered may be very complex, comprising structural details of the particular building, statistics on environmental data (temperatures, rain), and other location based information (industrial, urban or rural location etc.). Furthermore, a great number of observations needs to be analyzed first for the results to achieve any statistical relevance. However, by means of the pair expression based re-expression, such analysis is achievable with acceptable accuracy and with acceptable use of processing and memory capacity in the apparatus performing the analysis. On the basis of the analysis, the control device may be configured to map different probabilities of the structural defect to different reminder schemes and thereby allow generating and sending of reminders to owners of buildings with an increased risk level. The system may also be configured to map different probabilities of the structural defect to specific check periods to plan inspections against this particular defect. Different variables have different dependencies, so if experimental data for more than one defect is available, the control device may be configured to provide an adapted checking schedule for all these defects.
[0146] Corresponding risk assessments are possible in medical systems as well. For example, in an aspect, the control device may be used in a system that inputs complex information collected from one or more public information sources on a person and provides a risk classification for the person. The risk classification relates to the probability that the person will suffer from a defined statistically behaving symptom or disease. For example, let us consider that the risk relates to a defined disease. The medical data and history of a population is well recorded, and a lot of other information that may have an influence on the risk of developing the disease is also available. Combining these features with experimental information on the occurrences of the disease enables estimating the risk of an individual person acquiring this disease. This risk estimation may be mapped to a control operation, like the initiation of regular checks against that disease. The sampled information to be considered is again very complex and comprises a significant amount of different types of data. Furthermore, a mass of records needs to be analyzed first for the results to achieve any statistical relevance. However, by preprocessing the sampled information with the proposed pair expression based re-expression, such analyses are now achievable with acceptable accuracy but with reduced use of processing and memory capacity in the apparatus that performs the analysis.
[0147] Corresponding risk assessment is possible also in other technical applications, like communications. For example, the apparatus may be used in a control device that extracts various features (structural data, content, routing information) of an incoming message. The training information comprises such features of a group of earlier received messages and an indication of whether the message was detected to be an unwanted message (e.g. spam). By means of estimation enhanced with the proposed re-expression based preprocessing, the probability that a message is an unwanted message is effectively and more accurately estimated. By mapping the probability values to different levels of blocking operations, unwanted messages may be blocked more effectively.
[0148] In an aspect, the apparatus may be used in a control device that estimates an expected tendency in stock prices on the basis of successive, not necessarily regularly timed samples. The training information comprises defined information on defined companies and defined relevant events at one point of time, and experimental information on the associated change detected right after that point of time. This information may be used to develop a re-expression model that is then applied to pre-process a sample taken at some later point of time. The control operation comprises outputting an indication of the expected trend to the user of the apparatus, and/or changing a state of a record storing a value for the expected trend.
[0149] In an aspect, the apparatus may be used in a control device that estimates missing parts, or parts that are not relied on, in a defined type of information sequence. In such applications the control operation then comprises replacing the unavailable piece of information with the estimated piece of information. The unavailable piece of information may be, for example, a part in a DNA sequence, or an attenuated telecommunications signal.
[0150] The above aspects describe exemplary application fields of the embodiments of the control device. Other such application fields comprise computer vision, natural language processing, syntactic pattern recognition, speech and handwriting recognition, object recognition, machine perception, search engines, advertisements, adaptive websites, spam filtering, data mining, expert systems, brain-machine interfaces, gaming (heuristic, modeling), software engineering, robot locomotion, and agents (heuristic, modeling).
[0151] As discussed earlier, even if the examples given above are for conciseness made with pair expression, re-expression may be based on more than two variables. Let us have a further look into multi-variable dependency values of the form d(a1; a2; ...; an). Given a dependency value for more than two variables, one needs to consider whether it is possible to make the strong assumption that eventually also these values will all approach 1. Basically the answer is no, but there are limitations which may probabilistically drive the multi-variable dependency value closer to one. Such limitations comprise:
1. A below-one or above-one value of a given multi-variable dependency value d(a1; a2; ...; an) predicts below-one or above-one dependency values d(ai; aj) in state combinations (ai, aj).
2. For each state combination (ai, aj) with d(ai; aj) > 1 there is a sample amount n at which it will be re-expressed by pair expression. Equally, for each state combination (ai, aj) with d(ai; aj) < 1, there is a sample amount n at which it is neutralized through starvation. Overall, all multi-variable systems which have two-state combinations with non-one dependency values would eventually be re-expressed in a form where all variables are either statistically independent or exclusive. This means that, as n grows larger, multi-variable systems whose state combinations' dependency values reveal dependency will become increasingly rare, until they disappear as n approaches infinity.
3. The more variables in the multi-variable system, the lower the probability p(a1, a2, ..., an) that all the states co-occur, and therefore the lower the effect on the error.
[0152] Overall, the bias resulting from the naïve assumption is considered to be limited, because two-variable dependencies are limited with strong guarantees (as n grows large) and 3+ variable dependencies (of the original system) are limited with weak guarantees. This limits the error of naïve predicting.
Accordingly, there are some aspects to how pair expression behaves with samples of varying sizes from the perspective of approximation error and measurement error. Typically, there exists some kind of balance between
a) approximation errors, which relate to how fine grained or clumsy models or rules are constructed to express the knowledge, and
b) measurement errors, which originate from errors in measurements and the limited supply of observations.
[0153] The approximation error can be reduced by increasing the granularity of modeling, but this typically leads to greater measurement errors.

[0154] Now, with small samples, pair expression behaves like naïve Bayesian prediction and it tries to minimize measurement error (which dominates small sample problems), simply because in the lack of expressions the approximation error is high. As the sample count increases, the granularity is increased around state combinations which are surprisingly common. This reduces approximation error with minimal cost to measurement error (because measurement error is reduced for surprisingly common state combinations). As the sample count approaches infinity, pair expression forms one expression for each existing system state, and reduces to an ordinary system state trace, thus minimizing approximation error. In this sense the algorithm is well behaved, as it works to find a proper compromise between the two errors.
Use of pair expression in an information retrieval server
[0155] As briefly stated under the heading "Summary of the Invention", the inventive method can be complemented by pair expression processing. This technique may be utilized to improve the operating efficiency of an information retrieval server. A side benefit is that the operating efficiency of the network that interconnects the information retrieval server ("IR server") and client terminals is also improved.
[0156] Pair expression processing attempts to provide an answer to the question of how to provide people with the information they need. A conventional IR server responds to user queries by providing a list of candidate documents in a ranking order which is based on a match between the query and document contents, plus some global ranking statistics, i.e. statistics obtained from a large number of users. But the conventional IR server fails to take into account individual properties of a given user. The user must activate downloading of several candidate documents and inspect them for relevant content. Because the individual properties of a given user do not affect the order in which the IR server arranges the candidate documents, users must download a large number of candidate documents for visual inspection, until a sufficiently relevant document is discovered. This causes an excessive burden on the IR server, the interconnecting network architecture, the client terminals and their users.
[0157] This is a non-trivial problem, whose significance is steadily increasing as more and more information is becoming available online. The challenges are related to the scale of the problem, namely countless millions of users accessing millions or billions of candidate documents. The challenges are also related to the difficulties of obtaining user preferences.
[0158] A conventional procedure for finding a desired document first includes a step of making a query in a specialized information retrieval (IR) database and retrieving a list of documents matching the query. After or during the retrieval step, the relevance of the documents is estimated using a separate algorithm. The result is a list of matching documents sorted by their relevance, so that the most relevant document is presented to the client terminal at the top of the list. The present implementation example of the invention focuses on the use of pair expression for calculating document relevance.
[0159] In its basics, the problem of relevancy can be formalized in the equation p(s(d) | c).
[0160] Herein s(d) describes the event of document d being the desired document under some query context information c. The query context information may include for example query (e.g. 'eiffel') as well as some information on the user, time and location.
[0161] Predicting may be based on information obtained from the history, and the most relevant document for a user, under a given set of conditions, is most likely a document identical or similar to a document that the user has previously selected under similar conditions. To learn the patterns, the inventive IR server should have access to some sizable training data, which contains a large number of samples, such as thousands or millions of samples.
[0162] Now let us consider the following example. The IR server operator operates a service mainly aimed at travelers, for finding locations and web pages related to the locations the travelers are in. The service should also be accessible from the home countries of the travelers. The operation of the IR server according to the present embodiment is based on the assumption that the needs and interests of users follow certain patterns. In this illustrative but non-restrictive description, the assumption is made that young users are more interested in entertainment, senior users prefer tourist attractions, and adult users at working age typically travel for purposes of business and are interested in business-related information on various companies. We will further assume that all user groups are interested in buying new things. Accordingly, for the purposes of the present example, the IR server classifies documents into four categories: 1) business-related documents, 2) entertainment-related documents, 3) shopping-related documents and 4) tourism-related documents.
[0163] In addition to the exemplary four document categories, the IR server processes two kinds of relevance information. A first kind of relevance information is document-related relevance information that can be summarized as "what". This kind of relevance information is processed by every conventional IR server. In addition to the "what" information, an IR server of the present embodiment processes relevance information related to the query context. This kind of relevance information can be described as "who requested this information", "under what conditions (when, where...)".
[0164] In the present example, the query context data is organized into 6 different binary variables. The first three binary variables (Y, A, S) describe whether the user is classified as "young", "adult" or "senior". The two next binary variables describe whether the query was initiated during traditional working hours (W) or during a weekend (w). The sixth binary variable describes whether the user is abroad (a) or not. The document data has 4 variables, one for each category of 1) business (b), 2) entertainment (e), 3) shopping (s) and 4) tourism-related documents (t). A vector of bits is basically the state of a vector of binary variables. So a query context in which a young person made a query during a weekend, while travelling abroad, can be described with the following bitmap or vector:
YASWwa
100011
[0165] In the above example, the bits in order are young, adult, senior, work hours, weekend, abroad. Similarly, a document related to entertainment (e) can be described with:

best
0100
[0166] Let us next assume that the IR server needs to describe an event in which a young (Y) person, when abroad (a) during a weekend (w), found a document related to entertainment (e). The IR server can describe it with the following sequence:

YASWwabestS
10001101001
[0167] Here the last bit indicates that the document was selected. For example, a business (b) document that was not selected can be described by the following bit vector:
YASWwabestS
10001110000
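To make the encoding concrete, the bit vectors above can be produced with a short sketch. The field names and the helper function are illustrative assumptions, not part of the described embodiment:

```python
# Hypothetical encoder for the 11-bit vectors YASWwabestS used in the example.
CONTEXT_FIELDS = ["young", "adult", "senior", "workhours", "weekend", "abroad"]
DOC_FIELDS = ["business", "entertainment", "shopping", "tourism"]

def encode(context, doc_category, selected):
    """Return the bit string YASWwabestS for one observed event."""
    bits = ["1" if f in context else "0" for f in CONTEXT_FIELDS]
    bits += ["1" if f == doc_category else "0" for f in DOC_FIELDS]
    bits.append("1" if selected else "0")
    return "".join(bits)

# A young user abroad on a weekend who selected an entertainment document:
print(encode({"young", "weekend", "abroad"}, "entertainment", True))  # 10001101001
# A business document that was not selected in the same context:
print(encode({"young", "weekend", "abroad"}, "business", False))      # 10001110000
```

The two printed vectors reproduce the sequences shown in paragraphs [0166] and [0167].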
[0168] Let us consider a sample of 9 selected documents. The non-selected documents can be ignored at this point, as they can be implicitly deduced from the selected document sample:
Age
| Time
| | Abroad
| | | Document category
| | | | Selected
| | | | |
YASWwabestS
1. 01010110001
2. 01010010001
3. 01001100101
4. 10001100101
5. 10001001001
6. 10000101001
7. 00100100011
8. 00101100011
9. 00100100101

[0169] Herein the bits labeled "Y", "A" and "S" again indicate whether the user is classified as young, adult or senior, respectively. The document category portion of the bitmap (the bits labeled "best") indicates whether or not the document belongs to one or more of the four categories, namely business, entertainment, shopping and tourism. It is self-evident that the number of categories is purely exemplary and kept low in the interest of brevity and clarity, and an actual IR server will be capable of processing immensely larger numbers of categories, and accordingly immensely larger bitmaps.
[0170] For instance, a document context bitmap of 00101100011 indicates that the user is a senior person traveling abroad, and that the user, who initiated the query during a weekend, was interested in tourism-related documents. The last bit of '1' indicates that the user selected a candidate document. In this exemplary set there are three young users, three adults and three seniors (three rows with a '1' bit in each of the Y, A and S columns), and the exemplary set intentionally exhibits the patterns presented before.
[0171] As stated above, the object of the present embodiment is to provide users with documents they consider relevant under their currently prevailing conditions. Finding relevant documents automatically, instead of forcing the users to find the needle in the proverbial haystack, reduces the number of irrelevant documents that the IR server must retrieve and transmit over the interconnecting network.
[0172] In a normal situation, the patterns of human interest are not known a priori or indicated explicitly, and the patterns have to be learned instead. To learn what is considered relevant by a user of a given category under any given set of conditions, the IR server has to analyze information on past behavior of a large number of prior users. Here the pair expression technique can be used for detecting the patterns. The IR server can re-express all input variables (but not the selected variable). By running the described algorithm with a threshold of 2, the algorithm will form the following expressions for re-expressing the data:
<workhours, business>'
<young, entertainment>
<senior, tourism>
<adult, <workhours, business>>

[0173] The apostrophe indicates that <workhours, business>' is a reduced expression. After the re-expression, the original bit set has obtained the following form:
Age
| Time
| | Abroad
| | | Document
| | | | Selected
| | | | |
| | | | | Expressions
| | | | | |
YASWwabestS1234
1. 0-0-01-0001-001
2. 0-0-00-0001-001
3. 010011001010000
4. 100011001010000
5. -000100-0010100
6. -000010-0010100
7. 00-001000-10010
8. 00-011000-10010
9. 001001001010000
[0174] Herein the bits labeled 1, 2, 3 and 4 are bits for expression variables. The re-expression algorithm adds new expressions. The bit labeled '1' corresponds to expression <workhours, business>', bit '2' corresponds to expression <young, entertainment>, bit '3' to expression <senior, tourism>, and bit '4' to <adult, <workhours, business>>.
[0175] The algorithm has effectively classified six out of the total of nine different query contexts. The described patterns are visible in the constructed language and they help in figuring out the usage patterns. To be able to use them in prediction, the IR server needs to gather statistics between the input variables (query context, document and expressions) and the one output variable (selected).
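The counting core of the pair-detection step can be sketched as follows. This is a deliberately simplified illustration of the threshold idea: it only counts co-occurring '1' states across the samples and keeps the pairs reaching the threshold of 2, whereas the described algorithm additionally evaluates statistical dependencies and works recursively (which is how the nested expression <adult, <workhours, business>> arises):

```python
from itertools import combinations

LABELS = list("YASWwabest")  # the ten input variables; 'selected' is excluded

def frequent_pairs(vectors, threshold=2):
    """Count co-occurring '1' states over the input-variable bits and
    return the pairs whose occurrence count reaches the threshold."""
    counts = {}
    for vec in vectors:
        ones = [i for i, bit in enumerate(vec[:len(LABELS)]) if bit == "1"]
        for i, j in combinations(ones, 2):
            counts[(i, j)] = counts.get((i, j), 0) + 1
    return {(LABELS[i], LABELS[j]): n
            for (i, j), n in counts.items() if n >= threshold}

# The nine selected-document samples from the text:
samples = [
    "01010110001", "01010010001", "01001100101",
    "10001100101", "10001001001", "10000101001",
    "00100100011", "00101100011", "00100100101",
]
pairs = frequent_pairs(samples)
# Among the returned pairs are the ones behind the listed expressions:
# ('W', 'b') for <workhours, business>, ('Y', 'e') for <young, entertainment>,
# ('S', 't') for <senior, tourism>.
```

With a threshold this low, many other pairs qualify as well; a practical implementation would rank candidates by their statistical dependency rather than by raw counts alone.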
[0176] In the above description of the context-aware information retrieval, the output variable does not vary because the above description has not considered the documents that were not selected for each query. The context-aware IR server according to this embodiment can formulate a proper data set by creating entries for each document for each query, and by marking which document was actually selected by the user. This process will result in a total of 36 samples, of which the first 12 samples are described here:
Age
| Time
| | Abroad
| | | Document
| | | | Selected
| | | | | Expressions
YASWwabestS1234
1. 0-0-01-0001-001
2. -10101010000000
3. 010101001000000
4. 01-101000100000
5. 0-0-00-0001-001
6. -10100010000000
7. 010100001000000
8. 01-100000100000
9. 010-11100000000
10. -10011010000000
11. 010011001010000
12. 01-011000100000

[0177] In the above example, it is apparent that whenever expression "4" (that is <adult, <workhours, business>>) was true, the document was also selected. An information retrieval server which operates purely on the basis of a naïve prediction mechanism and only examines the relationship between input variables and the single output variable cannot find relationships like this without analyzing the re-expressed variables. It is true that within the first 12 samples the business variable seems to correlate with selection. This correlation disappears on examination of all the 36 samples, however. After performing the re-expression, the inventive IR server that implements the pair expression can improve its capability to make accurate predictions.
[0178] In the following description the term O(N) stands for the complexity of the prediction algorithm, wherein N indicates the number of input variables. By pre-examining the statistical dependencies between variable selections, it is possible to identify a subset of K input variables of the entire set of N input variables, wherein the subset of K input variables carries most information of the predicted variable. In our example the K significant variables are the 4 expression variables, all of which significantly contribute to accurate identification of the most relevant document given the present context. Restricting the process to the subset of K input variables, out of the total of N input variables, reduces the algorithmic complexity to O(K), K < N, where in real-life examples K may be smaller than N by several orders of magnitude.
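The pre-selection idea can be sketched as ranking candidate input variables by a weighted dependency with the predicted variable and keeping the top K. The weighting p(x, s)·|log d(x; s)| used below is an assumption chosen to mirror the dependency measure appearing elsewhere in the text; the actual selection criterion may differ:

```python
import math

def dep_weight(xs, ys):
    """Weighted statistical dependency between two binary columns:
    p(x, y) * |log( p(x, y) / (p(x) p(y)) )|.  Zero when either
    column (or the overlap) never takes the value 1."""
    n = len(xs)
    px, py = sum(xs) / n, sum(ys) / n
    pxy = sum(1 for a, b in zip(xs, ys) if a and b) / n
    if 0 in (px, py, pxy):
        return 0.0
    return pxy * abs(math.log(pxy / (px * py)))

def select_k(columns, output, k):
    """columns: dict name -> list of 0/1; output: list of 0/1 (predicted)."""
    return sorted(columns, key=lambda c: dep_weight(columns[c], output),
                  reverse=True)[:k]

# Made-up illustration: one column tracks the output, one is noise.
cols = {
    "noise":  [0, 1, 0, 1, 0, 1],
    "signal": [1, 1, 1, 0, 0, 0],
}
out = [1, 1, 1, 0, 0, 0]          # perfectly tracks "signal"
print(select_k(cols, out, 1))     # ['signal']
```

Prediction restricted to the selected K columns then runs in O(K) instead of O(N).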
[0179] In connection with the context-aware IR server, the benefits of pair expression include an improved ability to detect context-related patterns in the data. The inventive context-aware IR server can use the detected patterns to provide an extremely fast and powerful scoring mechanism. In addition, the use of pair expression in connection with an IR server reduces the computational burden and thus shortens response times. The speed increase is possible because in connection with IR servers what matters is not the actual probability that any given document will be selected from a client terminal, but rather the mutual order or ranking of those probabilities. If, say, a probability for a document to be selected is formed from components p(x) = v(x)*Y / (A + Y), wherein A is a constant and Y is the same for all documents x, it is possible to set s(x) = v(x). When comparing actual probabilities for document selection and the ranking of those probabilities, it can be seen that the probability-based ranking of the documents is the same but calculating the ranking imposes a smaller computational burden compared with calculating the actual probabilities.

[0180] Because pair expression can be coupled with any of a wide variety of statistical prediction algorithms, pair expression can be combined with tailored statistical scoring algorithms in order to provide better results.
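The ranking argument of paragraph [0179] is easy to verify numerically: with A and Y fixed across documents, the mapping v → v·Y/(A+Y) is monotone, so sorting by the cheap score s(x) = v(x) reproduces the probability ordering. The constants and per-document values below are arbitrary illustrations:

```python
# A and Y are shared by all documents; v holds made-up per-document components.
A, Y = 3.0, 2.0
v = {"doc1": 0.9, "doc2": 0.4, "doc3": 0.7}

# Full probability p(x) = v(x) * Y / (A + Y) versus the cheap score s(x) = v(x):
p = {doc: val * Y / (A + Y) for doc, val in v.items()}

ranking_by_p = sorted(p, key=p.get, reverse=True)
ranking_by_v = sorted(v, key=v.get, reverse=True)
print(ranking_by_p == ranking_by_v)  # True: same order, less computation
```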
[0181] After gathering the statistics from the re-expressed data, the context-aware IR server has stored a bitmap or bit vector that describes the states of the input variables. Next, the IR server can proceed to scoring. In a first step, the IR server may re-express the bit vector and proceed to calculating the score. One exemplary but not restrictive scoring method of reasonable simplicity is simplified Bayesian scoring. This scoring method estimates the likelihood of document selection with the following formula:
p(s | x1, x2, x3, ...) = p(s) d(s; x1) d(s; x2) d(s; x3) ...
[0182] Herein, d(a; b) is the statistical dependency p(a & b) / (p(a) p(b)). In some implementations the IR server can further develop the scores by utilizing a logarithmic version of the above formula. An effect of the logarithmic version of the formula is that this version of the formula outputs the same units that the industry-standard TF-IDF algorithm does:
score(s | x1, x2, x3, ...) = log p(s) + log d(s; x1) + log d(s; x2) + log d(s; x3) + ...
[0183] Herein, it is again possible to utilize the expressions of the re-expression technique by inserting them in the formula. To make the example more realistic, we can also insert the score based on TF-IDF weighting. In the field of information retrieval technology and text mining, the quantity "term frequency-inverse document frequency", abbreviated "TF-IDF", is a commonly used weight. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. We could also include the non-expression variables in the equation, but they are excluded herein for the sake of clarity and simplicity.
score(s | x1, x2, x3, ...) = TF-IDF(s, query) + log d(s; e1) + log d(s; e2) + log d(s; e3) + log d(s; e4)
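The simplified Bayesian scoring of paragraphs [0181] and [0182] can be sketched in log form as follows. Probability smoothing and the TF-IDF term are omitted, and the row format and column names are assumptions made for illustration:

```python
import math

def log_score(rows, s_col, x_cols):
    """score(s | x1..xn) = log p(s) + sum_i log d(s; xi),
    with d(a; b) = p(a & b) / (p(a) p(b)) estimated from binary rows.
    No smoothing: assumes all involved probabilities are non-zero."""
    n = len(rows)
    p_s = sum(r[s_col] for r in rows) / n
    score = math.log(p_s)
    for x in x_cols:
        p_x = sum(r[x] for r in rows) / n
        p_sx = sum(1 for r in rows if r[s_col] and r[x]) / n
        score += math.log(p_sx / (p_s * p_x))  # log d(s; x)
    return score

# Tiny made-up sample: 'e1' is a re-expression variable, 's' the selection bit.
rows = [
    {"s": 1, "e1": 1}, {"s": 1, "e1": 1},
    {"s": 0, "e1": 0}, {"s": 0, "e1": 1},
]
print(round(log_score(rows, "s", ["e1"]), 3))  # -0.405, i.e. log(2/3)
```

A production implementation would add smoothing for zero counts and, as in the formula above, sum in the TF-IDF component per query.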
[0184] After the scores have been calculated, the documents can be sorted in order from the highest-scored document to the lowest-scored. Those skilled in the art will understand that the example described herein is a very much simplified version in comparison with real-world examples. The description of a simplified version permits a human reader to follow the example at the level of individual bits. In a real-world implementation, the use of pair expression with sample sizes as small as this and with a threshold of two would lead to overlearning with severely harmful results. With realistically large samples, the re-expression operation and the scoring technique of the present embodiment can be used to produce superior results with very good performance. This means that an IR server implementing this embodiment of the present invention is capable of utilizing information on usage patterns to adjust relevance scoring of documents in a context-aware manner. A benefit of the context-aware relevance ranking of documents is that depending on current context, users of the IR server will be provided with documents whose relevance is supposedly optimal given the current context. Thus the search for the optimally, or at least sufficiently, relevant document is changed from a brute-force exhaustive search between IR server and user to an intelligent search, most of which is performed in the context-aware IR server itself. As a result, the burden on the IR server's storage subsystem and the interconnecting network is diminished.
Exemplary hardware construction
[0185] Figure 7 schematically shows a block diagram of an information retrieval server IRS. The information retrieval server comprises one or more central processing units CP1 ... CPn, generally denoted by reference numeral 705. Embodiments comprising multiple processing units 705 are preferably provided with a load balancing unit 715 that balances processing load among the multiple processing units 705. The multiple processing units CP1 ... CPn may be implemented as separate processor components or as physical processor cores or virtual processors within a single component case. The information retrieval server IRS also comprises a network interface 720 for communicating with one or more client terminals CT1 through CTn via data networks DN, such as the internet. In a typical but non-restrictive scenario, the client terminals CT1 ... CTn may be conventional data processing devices with internet browsing capabilities, such as desktop or laptop computers, smart telephones, entertainment devices or the like. The information retrieval server IRS also comprises or utilizes input-output circuitry 725, which constitutes a user interface of the information retrieval server IRS and comprises an input circuitry 730 and an output circuitry 735. The nature of the user interface depends on which kind of computer is used to implement the information retrieval server IRS. If the information retrieval server IRS is a dedicated computer, it may not need a local user interface, such as a keyboard and display, and the user interface may be a remote interface, in which case the information retrieval server IRS is managed remotely, such as from a web browser over the internet, for example. Such remote management may be accomplished via the same network interface 720 that the information retrieval server utilizes for traffic between itself and the client terminals CT1 ... CTn, mobile devices and service providers, or a separate management interface may be utilized. In addition, the user interface may be utilized for obtaining traffic statistics.
[0186] The information retrieval server IRS also comprises memory 750 for storing program instructions, operating parameters and variables. Reference numeral 760 denotes a program suite for the information retrieval server IRS. The program suite 760 comprises program code for instructing the processor to execute the steps of the inventive method, namely:
• inputting a sample of a system;
• translating the sample into a binary form that comprises a plurality of system variables;
• using the model to re-express the determined system variables through the re-expressed variables;
• using the re-expressed variables to predict a system variable that is not available through the sample; and
• implementing a control operation determined according to the predicted system variable.
[0187] In the exemplary embodiment, wherein the apparatus for implementing the invention is an information retrieval server, the control operation includes control of an information retrieval server. For instance, a conventional information retrieval server receives a query from a client, wherein the query contains a list of keywords. The conventional information retrieval server retrieves documents matching the keywords and presents that list in order of relevance, wherein the relevance is computed on the basis of document-based properties and, optionally, some global preference ratings. Examples of the document-based properties include the number of occurrences of the query keywords in the documents, mutual proximity of the keywords in the documents, publication date, modification date, or the like. Examples of the global preference ratings include voting results of the documents, number of internet links referring to the documents, or the like. As stated earlier, a problem with such conventional information retrieval servers is that the retrieved documents and their mutual ranking do not take into account parameters of the user who initiated the query.
[0188] To solve this problem plaguing conventional information retrieval servers, the embodiment shown in Figure 7 contains a content database 710 and a profile database 712, both of which are accessible to the central processing units CP1 ... CPn. The content database 710 can be similar to the ones used by conventional information retrieval servers. In other words, the content database 710 comprises documents and/or abstracts of documents and, optionally, global ranking information, such as numbers of links originating from or pointing to the documents, voting results from previous users, or the like.
[0189] The profile database 712 contains user profile data. An example, albeit a simplistic one, of such user profile data was described under the heading "Use of pair expression in an information retrieval server". Reference numeral 780 denotes an area of the memory 750 used to store parameters and variables.
[0190] The computer programs may be stored on a computer program distribution medium that is readable by a computer or a processor. The computer program medium may be, but is not limited to, an electric, a magnetic, an optical, an infrared or a semiconductor system, device or transmission medium. The computer program medium may include at least one of the following media: a computer readable medium, a program storage medium, a record medium, a computer readable memory, a random access memory, an erasable programmable read-only memory, a computer readable software distribution package, a computer readable signal, a computer readable telecommunications signal, computer readable printed matter, and a computer readable compressed software package.
[0191] It will be apparent to a person skilled in the art that, as the technology advances, the inventive concept can be implemented in various ways. The invention and its embodiments are not limited to the examples described above but may vary within the scope of the claims.

Claims

1. A method, comprising the following steps performed by a programmed data processing system:
defining for a system a model for re-expressing system variables through re-expressed variables, a re-expressed variable expressing a state combination of two or more other variables;
inputting a sample of a system;
translating the sample into a binary form that comprises a plurality of system variables;
using the model to re-express the determined system variables through the re-expressed variables;
using the re-expressed variables to predict a system variable that is not available through the sample;
implementing a control operation determined according to the predicted system variable.
2. The method according to claim 1, wherein the step of defining the model comprises:
inputting a plurality of samples of a system;
translating the samples into a set of binary vectors that comprise a plurality of the system variables;
assuming statistical independence between the system variables;
searching at least one condition for states of different system variables, where the occurrence level of the condition among the binary vectors exceeds a predefined level;
creating an expression that is true when the condition is true and not true when the condition is not true;
adding the expression to the set of binary vectors;
eliminating redundancies from the set of binary vectors.
3. The method according to claim 2, further comprising evaluating the occurrence level of the condition by minimizing:

p(v_i, v_j) log [ p(v_i, v_j) / ( p(v_i) p(v_j) ) ]

where v_i and v_j are system variable states.
4. The method according to claim 3, further comprising:
computing a weighted statistical dependency value between at least two system variables;
creating an expression if the weighted statistical dependency value exceeds a predefined level.
5. The method according to any one of claims 2 to 4, further comprising: recursively adding the expression to the set of binary vectors, and eliminating redundancies from the set of binary vectors.
6. The method according to any one of claims 1 to 5, further comprising:
storing statistical information during the re-expression;
using the statistical information in the prediction.
7. The method according to any one of claims 1 to 6, further comprising:
defining for potential re-expressions a benefit function that indicates the benefit of adding a potential re-expression;
continuing the recursive operations if the value of the benefit function exceeds a predefined threshold.
8. The method according to claim 7, wherein only potential expressions that yield a benefit between 50 and 70 are introduced in the language.
9. The method according to any one of claims 1 to 8, further comprising eliminating redundancy by:
developing rules for identifying redundantly declared variable states;
marking redundantly declared variable states implicit;
omitting the redundantly declared variable states.
10. The method according to any one of the preceding claims, wherein the programmed data processing system comprises an information retrieval server (IRS), which also performs the following steps:
receiving document queries from one or more client terminals (CT1 - CTn), the document queries indicating potentially relevant documents based on content of the potentially relevant documents;
collecting context-related selection information from multiple previous document queries, the context-related selection information indicating previous selections of the potentially relevant documents made from the one or more client terminals;
determining a context of a current document query from one of the client terminals;
utilizing the current document query, the context of the current document query and the collected context-related selection information to generate a relevance ranking for the potentially relevant documents;
presenting the potentially relevant documents in an order determined by the relevance ranking to the client terminal that initiated the current document query.
11. The method according to any one of the preceding claims, further comprising allocating different processing steps and/or similar processing steps of multiple parallel processes to multiple processing units (705, CP1 - CPn) by a load balancing unit (715).
12. A programmed data processing system (IRS), comprising memory (750) for storing program instructions (760) and data (780) and one or more processing units (705, CP1 - CPn);
wherein the program instructions stored in the memory comprise instructions for performing the following steps by the one or more processing units:
defining for a system a model for re-expressing system variables through re-expressed variables, a re-expressed variable expressing a state combination of two or more other variables;
inputting a sample of a system;
translating the sample into a binary form that comprises a plurality of system variables;
using the model to re-express the determined system variables through the re-expressed variables;
using the re-expressed variables to predict a system variable that is not available through the sample;
implementing a control operation determined according to the predicted system variable.
13. A computer-readable physical storage medium, encoding computer program instructions, wherein the computer program instructions, when executed by a data processing system, cause the data processing system to carry out the steps of the method according to any one of claims 1 to 11.
PCT/FI2011/050333 2010-04-14 2011-04-14 Method and apparatus for a control device WO2011128512A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FI20105390 2010-04-14
FI20105390A FI20105390A0 (en) 2010-04-14 2010-04-14 Method and apparatus for the control device

Publications (2)

Publication Number Publication Date
WO2011128512A2 true WO2011128512A2 (en) 2011-10-20
WO2011128512A3 WO2011128512A3 (en) 2012-01-19

Family

ID=42133235





Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 11768515; Country of ref document: EP; Kind code of ref document: A2)
NENP Non-entry into the national phase in: DE
122 Ep: pct application non-entry in european phase (Ref document number: 11768515; Country of ref document: EP; Kind code of ref document: A2)