WO2016203446A1 - System for extrapolation and statistical processing of data that can be acquired from one or more data sources

Info

Publication number
WO2016203446A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
acquired
value
sources
request
Prior art date
Application number
PCT/IB2016/053619
Other languages
French (fr)
Inventor
Giulio GATTI
Original Assignee
Gatti Giulio
Priority date
Filing date
Publication date
Application filed by Gatti Giulio
Priority to CH01544/17A (patent CH712818B1)
Publication of WO2016203446A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Definitions

  • the present invention relates to the field of information extraction and analysis, in particular to a system for the extrapolation and statistical processing of data that can be acquired from one or more data sources.
  • these acquired data can be for example structured data, unstructured data, quantitative data, qualitative data, text data, opinion data, or web data, and they can have a value also of the time and/or space type.
  • the aim of the present invention is to overcome the limitations of the background art noted above, devising a new system capable of allowing to acquire information from a plurality of data and information sources and of processing them so as to allow a rapid and effective analysis thereof.
  • an object of the present invention is to select, among publicly accessible information, that which is most relevant for a given topic.
  • a further object of the invention is to allow the correlation of information acquired from various media channels.
  • Another object of the present invention is to have a structure that is simple, relatively easy to provide in practice, safe in use and effective in operation, and relatively low in cost.
  • the system according to the invention allows to process and store large volumes of data, preferably of the text type, acquiring them from various media channels, for example newspapers, social networks, blogs.
  • the system according to the invention allows to provide data that are useful for performing statistical studies on the way in which an information item propagates over information channels such as online newspapers, social networks, blogs, the media and the various data sources that are present on the Internet.
  • the system according to the invention allows to perform a web data search, but it can also reprocess the information in real time with historicized information, optionally enhancing it with other information that arrives from other data sources, such as for example the data that originate from electromechanical tools that send data in real time to the system, which then historicizes them.
  • the system according to the invention allows to study the way in which an information item spreads geographically.
  • the system according to the invention allows to study financial markets by historicizing the main indexes and stocks, creating appropriate statistical indexes.
  • Figure 1 is a block diagram of an embodiment of the system for extrapolation and statistical processing of data that can be acquired from one or more data sources according to the present invention
  • Figure 2 is a block diagram showing a detail of the embodiment of the system for extrapolation and statistical processing of data that can be acquired from one or more data sources, shown in Figure 1 ;
  • Figure 3 is a flowchart showing the operation of an embodiment of the system for extrapolation and statistical processing of data that can be acquired from one or more data sources according to the present invention
  • Figure 4 is a view of a first aspect of an embodiment of the system for extrapolation and statistical processing of data that can be acquired from one or more data sources according to the present invention
  • Figure 5 is a view of a second aspect, in particular a word cloud, of the system for extrapolation and statistical processing of data that can be acquired from one or more data sources according to the present invention
  • Figure 6 is a block diagram of the Query Creator of an embodiment of the system for extrapolation and statistical processing of data that can be acquired from one or more data sources according to the present invention
  • Figure 7 is a block diagram of the process for creation of the topic lists, or Probes_Lists, of an embodiment of the system for extrapolation and statistical processing of data that can be acquired from one or more data sources according to the present invention
  • Figure 8 is a block diagram showing the module that comprises the dictionaries, or Dictionary_MultiIdioms, of an embodiment of the system for extrapolation and statistical processing of data that can be acquired from one or more data sources according to the present invention
  • Figure 9 is a block diagram showing a query in a system of a known type
  • Figure 10 is a block diagram of a query in an embodiment of the system for extrapolation and statistical processing of data that can be acquired from one or more data sources according to the present invention.
  • the system comprises a station of a user 1, a server 2 and a telecommunications network 4, such as for example the Internet, which connects the station 1 to the server 2 and the server 2 to a plurality of data sources 3.
  • the station of a user 1 comprises a terminal or terminal computer that allows to process information generated for example upon input or request of the user or operator and is capable of executing software instructions, in particular a software application for entering a request to be forwarded to the server 2.
  • Such user station 1 is capable of communicating, preferably by means of such software application, with the server 2.
  • the station of a user 1 comprises a computer which comprises various hardware and software resources and among other things allows to provide data to the user, to receive input from the user, to reprocess such input and send it to the server 2.
  • the user of the station 1 formulates this request in the form of a text data item.
  • a user of the station 1 might seek information related to a given topic, related to a news event, a fashion event, a financial event.
  • the user of the station 1 formulates this request by identifying a list of relevant words or search keys and sends this request to the server 2.
  • the server 2 comprises known hardware architectures and is arranged geographically in a position that is preferably remote with respect to the station 1, with which it is capable of communicating by means of the telecommunications network 4.
  • the server 2 receives data from the station 1 and in particular requests to query sources 3 related to a given field or topic.
  • the server 2 comprises at least the means 20, 21, 22, 23, 24. Such means are preferably implemented as modules of a software application that can be executed by the server 2.
  • the server 2 comprises further storage means ("Bigdata Architecture" block 56 in Figure 6) adapted to store all the information and data related to the system for extrapolation and statistical processing of data that can be acquired from one or more data sources.
  • the storage means comprise a database that is stored on appropriately sized storage media.
  • the means 20 are adapted to store data acquired from one or more sources 3 on the basis of the content of the request forwarded from the station 1.
  • the server 2 receives a request that is formulated by a user of the station 1 and generates a query intended for one or more of the sources 3 in order to obtain a plurality of information items from the network.
  • the choice of the sources 3 to be queried can be indicated in the request but it can also be determined on the basis of different criteria.
  • the means 20 can therefore perform this search by using the main search engines, such as, by way of nonlimiting example, Google and Yahoo, query the main online newspapers (local, regional, national, European and global), blogs, social networks, websites, for example financial websites, public or private data banks and the like.
  • the means 20 collect these acquired data obtained by means of this search and store them in the storage means of the server 2 or in storage means that are in any case accessible from the server 2.
  • the server 2 comprises further a Query Creator which is adapted to generate, as a consequence of the request formulated by the user, the optimized query intended for one or more of the sources 3 for acquiring the data.
  • the Query Creator uses customized probes, shown in the block diagram of Figure 6 by the "Probes" block 50.
  • the Probes 50 perform that work of searching and assessing the data on the basis both of preset criteria and of data reprocessing outputs, and do so in a totally automatic manner without any human intervention.
  • the Probes 50 in order to perform their activities and therefore assess the data, require topic lists, also known as "Probes Lists" 52 and 54.
  • the Probes_Lists 52, 54 define the behavior of the Probes 50, i.e.: what they have to search for, how to search the data, and where they have to perform the data search.
  • the Probes_Lists 52, 54 therefore define the activities that the probes must perform.
  • the process for the search and acquisition of data performed by the server 2 comprises: a. public data that are present on the Internet/social networks/blogs; b. data that are present in shared and/or remote folders and/or on remote filesystems; c. data originating from streams of information originated from remote architectures.
  • the Probes 50 used in the system according to the invention search for data not only on the Internet like others using a third-party web search engine (for example Google or Yahoo), but also in shared and/or remote folders, remote servers and/or internal databases, such as for example hospital patient records, company databases, shared public folders, or electromechanical probes, in order to search for as much data as possible related to a single search topic defined by the query.
  • the block diagram of Figure 7 shows in detail the process with which the Probes_Lists 52, 54 are created.
  • the data are assessed in the system according to the invention on the basis of the source 3 from which they originate and the cluster to which they belong. Studying a phenomenon by means of the originating source and the cluster to which it belongs allows to delve into the learning of social phenomena. Furthermore, it is possible to assign to each type of data item the correct placement within one or more categories; for example, a data item or information item can belong to News and to Finance and to Fashion.
  • if the system according to the invention is short on information for the study of a particular phenomenon, it is capable of generating internally a query for searching for data and of passing it to the probes, which in turn search for the data proper at the various available data sources 3.
  • if the system lacks information, it automatically creates the queries without any human intervention.
  • the means 21 for attributing to each one of such acquired data items a value, associate with such acquired and stored data items a given value that is defined as ranking of the data item.
  • such means 21 detect the frequency, for example the statistical frequency, with which a given data item is present in the acquired data.
  • the means 21 can identify the words that occur most frequently in both texts. Conveniently, these words to be identified may also not be indicated in the request entered by the user of the station 1 : in this manner, advantageously, it is possible to identify relevant words that were not indicated expressly by the operator or cannot be traced back to the search keys contained in the operator request.
  • the means 21 allow to find in such acquired information additional data that can be important for the operator.
  • the means 21 therefore assign to the words that are present most frequently a given value (for example, assuming that the value that can be assigned is comprised in a range of integers from 1 to 5, assigning the value 4 or 5) which is greater than the value assigned to the words that occur less frequently or are not present at all (to which for example the value 1 or 2 might be associated).
  • the means 21 assign to all the data a value that is comprised in a range of allowable values.
  • this range comprises a range of positive integers.
  • all the data are initially assigned a ranking value that is equal to a preset or default value, for example 1, and this value is subsequently modified by the means 21 on the basis of the calculated frequency.
  • the default value is the lowest of the allowable values.
  • the means 21 initially assign to each data item a default value which is the lowest among the ones allowed and is preferably further modified by the means 21 on the basis of the calculated frequency and again, with methods that are more clearly described hereinafter, by the means 23.
  • the acquired data are analyzed by means of a naive Bayes classifier, a classifier based on Bayes' theorem, such that it is necessary to know in advance the conditional probabilities related to the problem.
  • the independence of the features is assumed, i.e., it is assumed that the presence or absence of a feature in a data set is not correlated with the presence or absence of other features.
  • N is the total number of words and tokens in Tj, including duplicates
  • the means 22 for grouping the acquired data allow to identify correlations among the acquired data and to divide these data into groups or clusters.
  • the means 22 comprise a clustering engine that receives in input any type of clustering algorithm.
  • this correlation is provided by means of a partitional clustering algorithm of the K-mean or K-medoids type.
  • These means 22 might therefore place in a same group data to which the means 21 have assigned different values. For example, a data item to which a default value has been given might share the same group as data to which the maximum allowed value has been given instead.
  • the means 23 for calculating a statistical index allow to attribute a given value, termed group statistical index, to each one of the groups identified by the means 22.
  • the means 23 comprise a statistical engine that receives in input any type of statistical algorithm.
  • this value is given by the ratio between the sum of the values assigned by the means 21 to each one of the data items comprised in the group to which they belong, assigned by the means 22, and the number of data items contained in the same group.
  • the means 23 are further suitable to modify the value associated with a given data item by the means 21 on the basis of the value associated with the group to which said data item has been assigned.
  • for the data to which the means 21 have associated a data item ranking equal to the default ranking, for example 1, the means 23 assign as a replacement a value that is equal to or based on the statistical index of the group, for example 4 or another value.
  • it is possible to modulate the value assigned initially by the means 21 adapting it to the value associated with the data that have a same affinity, for example data to which the same group or cluster has been assigned.
  • the data can have a value and therefore a relevance that does not depend exclusively on the repetition frequency of the data item in the information (for example text articles) acquired from the queried sources 3.
  • the means 24 allow to sort the acquired data according to a given criterion and to organize these data in a data structure, for example a priority list or a sorted tree.
  • the data structure can be sorted according to various criteria, for example on the basis of the statistical group index, on the value or ranking of the data item, or on both.
  • An example of this sorted tree is shown in Figure 4.
  • the means 24 are adapted to display visual information associated with such list, in particular which shows in a highlighted manner the data that are considered priority and are therefore more relevant than the ones considered less important.
  • An example of this visualization in the form of a word cloud is shown in Figure 5.
  • a word cloud is a visual representation of keywords used within the acquired mass of data.
  • the word cloud is presented in alphabetical order, with the particular characteristic of assigning a larger font to the most important words: it is therefore a weighted list.
  • the weight of the words which is rendered with characters of different sizes, is understood for example as frequency of use within the acquired data. The larger the character, the higher the frequency of the keyword.
  • the server 2 comprises further a module adapted for Analysis of Human Language Data, therefore specific for text analysis, known as Idioms.
  • the Idioms module allows to study the statistical outliers of the assessed words.
  • this module performs: a. recognition of new words, and therefore study of neologisms; b. learning of new ways of communicating, linked for example to technological and social evolution, which leads every day to new words or abbreviations.
  • the Idioms module is connected to a module 40 known as "Dictionary_Multi-Idioms", which comprises preferably both multilingual translation dictionaries 42 and thematic dictionaries 44.
  • the dictionaries that constitute the Dictionary_Multi-Idioms 40 are the element on which the text analysis process is based.
  • the thematic dictionaries 44 also known as "Topic_Dictionary", are dictionaries in the language or idiom of the text, which are specific for the topic being dealt with (for example a medical dictionary, a computer technology dictionary, and so forth), comprise an indication of the ranking of each word and aid the system in understanding the specific search topic and allow to concentrate analysis only on words that are pertinent to the subject of the search and/or of the statistical study.
  • the stop word process is performed after the system according to the invention has recognized the language or idiom with which the text being processed has been written, and only after loading into the system the appropriate thematic dictionary 44.
  • the multilingual translation dictionaries 42 and the thematic dictionaries 44 can be extended both by self-learning, which occurs simultaneously with the analysis of the texts, and manually, by means of the intervention of operators, as well as by means of connections to foreign language Universities.
  • by way of example, it is assumed that the search topic is a news event related to the Bardo museum in Tunis and that the source 3 is Twitter.
  • in step 30, the user of the station 1 defines the search keys to be entered in a request to be sent to the server 2.
  • these words can be Twitter hashtags such as #museum; #Tunis or #Tunisia; #attack; #victims; "world" and other words comprised in the set of words shown in Figure 4. It is assumed that the word "Bardo" is not comprised within the search keys.
  • the server 2 is capable of translating these terms into other languages by means of appropriate multi-language translation dictionaries 42, also using self-learning mechanisms, and of automatically extending the search to sources 3 in a language other than that of the search keys. Therefore, in the example, the server 2 is capable of also finding user comments (tweets) that were written in a language other than the language of the search keys.
  • in step 31, the server 2 extrapolates from the request the search keys, conveniently in addition to other parameters (for example the time that can be used for the search), and performs the search.
  • the search step performed by the server 2 provides for the preliminary creation of a script for example based on the Knime software.
  • in step 32, the server 2 obtains in response from the queried source 3, after a given period of time, a plurality of data items, for example comments published by various users on their Twitter profiles.
  • the data obtained are subjected to a preprocessing step, so as to make analysis easier, for example by eliminating punctuation, filtering certain terms, converting the characters into a format, for example from uppercase to lowercase and in general performing stemming operations by using adapted computer technology aids.
  • in step 33, the server 2 assigns to each obtained data item a given data item ranking or value; this operation is performed by the means 21, which attribute for example the data ranking or value on the basis of the frequency with which a given data item occurs in the acquired mass of data.
  • the word "attack", which is assumed to be present with high frequency, is assigned the value 5
  • the word "world", which is assumed to be present less frequently, is assigned the value 2
  • the word "Bardo", which is not present in the search keys but is in any case present in the acquired data (for example with a frequency that exceeds a preset threshold), is assigned the default value 1.
  • the means 22 assign each word to a given group.
  • the word "Bardo" might be assigned to the same group to which the word "attack" belongs. It is understood that these values are indicated merely for exemplifying purposes.
  • the means 21 modify the values of the words to which such means 21 initially assigned a default value, for example on the basis of the value of the words assigned to the same group.
  • Other criteria can be used to modify the data item rankings or values associated with the data, in particular the data with which a default value is associated initially, for example, one might assign the corresponding statistical index of the group or cluster to which they belong.
  • in step 34, the means 24 show a word cloud, i.e., visual information in which the words that have a larger value, optionally modified as indicated in step 33, are shown in a highlighted manner with respect to the others.
  • Figure 5 shows an example of this display, wherein for example the word "museum" appears more emphasized than the others (some of the words might appear in a truncated form or might be modified on the basis of the pre-processing operation).
  • the system thus conceived allows to overcome the qualitative limitations of the background art, allowing to make searches easier, to facilitate the processing of the data and to reduce the efforts for selection of the most pertinent data.
  • the system advantageously allows the recovery of information related to various topical areas and to focus attention on the most important data, assigning to these data an appropriate weight/value.
  • the resulting visual information allows for example an operator to concentrate the writing of an article or the creation of reports ("report") by assigning a greater weight to the most relevant words found.
  • the system thus obtained allows to study social phenomena linked to the spread of a news item, to study the ways in which propagation of an information item occurs and on the basis of these studies perform analysis related to the reactions caused by the diffusion of the news proper.
  • the system can also be used to provide statistical models that allow to anticipate fluctuations of the main stock exchange indexes, to perform risk management studies, to write science articles easily and to provide tools for aiding scientific research.
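
By way of nonlimiting illustration, the ranking attributed by the means 21, the group statistical index calculated by the means 23 and the sorting performed by the means 24, as described in the list above, might be sketched in Python as follows. The function names, the linear scaling of frequency onto the example range 1 to 5, the choice of scaling only the search-key words and the rounding of the group index are assumptions made for this sketch; the description does not prescribe them.

```python
from collections import Counter, defaultdict

DEFAULT_RANK = 1  # lowest allowable value, used as the initial ranking
MAX_RANK = 5      # upper bound of the example range 1..5


def rank_by_frequency(tokens, search_keys):
    """Give every distinct word the default rank, then raise the rank of
    search-key words in proportion to their frequency (assumed scaling)."""
    counts = Counter(tokens)
    ranks = {word: DEFAULT_RANK for word in counts}
    key_counts = [counts[w] for w in search_keys if w in counts]
    if not key_counts:
        return ranks
    top = max(key_counts)
    for word in search_keys:
        if word in counts:
            # Integer ceiling of MAX_RANK * count / top.
            scaled = (MAX_RANK * counts[word] + top - 1) // top
            ranks[word] = max(DEFAULT_RANK, scaled)
    return ranks


def group_indexes(ranks, groups):
    """Group statistical index: sum of the ranks in a group divided by
    the number of data items in that group."""
    totals, sizes = defaultdict(int), defaultdict(int)
    for word, rank in ranks.items():
        totals[groups[word]] += rank
        sizes[groups[word]] += 1
    return {g: totals[g] / sizes[g] for g in totals}


def lift_defaults(ranks, groups, indexes):
    """Replace default ranks with a value based on the group index
    (here: the index rounded to an integer, an assumption)."""
    return {
        word: round(indexes[groups[word]]) if rank == DEFAULT_RANK else rank
        for word, rank in ranks.items()
    }


def priority_list(ranks, groups, indexes):
    """Sort words by group index, then by rank, then alphabetically."""
    return sorted(ranks, key=lambda w: (-indexes[groups[w]], -ranks[w], w))
```

In this sketch the mapping `groups` (word to group identifier) stands in for the output of the clustering performed by the means 22, which is not reproduced here. A word such as "Bardo", absent from the search keys, starts at the default value and is then raised on the basis of the index of the group it shares with "attack".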
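
The naive Bayes classification mentioned in the list above can be illustrated, purely as an example, with a minimal multinomial sketch. The category labels News and Finance are borrowed from the categories named earlier; the training texts, the add-one smoothing and the function names are assumptions of this sketch and are not taken from the description. The conditional probabilities are estimated in advance from labelled texts, and the features (words) are treated as independent.

```python
import math
from collections import Counter, defaultdict


def train(labelled_texts):
    """labelled_texts: iterable of (tokens, category) pairs."""
    category_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, category in labelled_texts:
        category_counts[category] += 1
        word_counts[category].update(tokens)
        vocabulary.update(tokens)
    return category_counts, word_counts, vocabulary


def classify(tokens, model):
    """Return the category with the highest posterior log-probability."""
    category_counts, word_counts, vocabulary = model
    total_texts = sum(category_counts.values())
    best_category, best_score = None, -math.inf
    for category in category_counts:
        # Total number of words in the texts of this category,
        # duplicates included.
        n = sum(word_counts[category].values())
        score = math.log(category_counts[category] / total_texts)
        for token in tokens:
            # Add-one smoothing (an assumption) avoids log(0) for
            # words never seen in this category.
            likelihood = (word_counts[category][token] + 1) / (n + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_category, best_score = category, score
    return best_category
```

Working in log-probabilities keeps the product of many small conditional probabilities from underflowing; this is a common design choice rather than something stated in the description.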
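
The preprocessing step and the word cloud described in the list above might likewise be sketched as follows. The stop-word set, the point sizes and the function names are hypothetical; stemming, which the description mentions, is omitted from this sketch. The output is a weighted list in alphabetical order in which the most frequent words receive the largest character size.

```python
import string
from collections import Counter

# Hypothetical stop-word list; in the described system this would depend
# on the recognized language and the loaded thematic dictionary.
STOP_WORDS = {"a", "an", "and", "at", "in", "of", "the"}


def preprocess(text):
    """Lowercase the text, strip ASCII punctuation (including '#') and
    drop stop words."""
    table = str.maketrans("", "", string.punctuation)
    words = text.lower().translate(table).split()
    return [w for w in words if w not in STOP_WORDS]


def word_cloud(tokens, min_pt=10, max_pt=40):
    """Return (word, font size) pairs in alphabetical order, with the
    size scaled linearly between min_pt and max_pt by frequency."""
    counts = Counter(tokens)
    if not counts:
        return []
    lowest, highest = min(counts.values()), max(counts.values())
    span = highest - lowest
    cloud = []
    for word in sorted(counts):
        if span == 0:
            size = max_pt  # all words equally frequent
        else:
            size = min_pt + (max_pt - min_pt) * (counts[word] - lowest) // span
        cloud.append((word, size))
    return cloud
```

For instance, `word_cloud(preprocess(text))` applied to a batch of collected comments yields the sizes that a rendering layer could use to draw the cloud; the rendering itself is outside the scope of this sketch.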

Abstract

A system for extrapolation and statistical processing of data that can be acquired from one or more data sources, comprising a station of a user, provided with means for generating a request to query data sources, and a server, comprising means adapted to receive the request and to query one or more sources on the basis of the content of the request; the system having the particularity of organizing the acquired data in groups and of assigning statistical indexes and values to the data and the groups; the system being further adapted to generate a visual information comprising the plurality of data acquired. The invention also relates to a method and a software application that are consistent with the system.

Description

SYSTEM FOR EXTRAPOLATION AND STATISTICAL PROCESSING OF DATA THAT CAN BE ACQUIRED FROM ONE OR MORE DATA SOURCES
The present invention relates to the field of information extraction and analysis, in particular to a system for the extrapolation and statistical processing of data that can be acquired from one or more data sources. Within the scope of the present invention, these acquired data can be for example structured data, unstructured data, quantitative data, qualitative data, text data, opinion data, or web data, and they can have a value also of the time and/or space type.
Systems and studies are known which allow the systematic collection, preservation and analysis of data from a plurality of sources with the goal of aiding operators in making certain decisions, for example on the subject of product distribution, advertising effectiveness, assessment of risks linked to an investment or user appreciation of a given product or service. Although access to information sources usually is not a problem (consider for example the large quantity of opinions and comments expressed by consumers on the Internet), it is rather complicated to extrapolate from the enormous mass of available data the information that is actually useful, i.e., to select the information that has a certain usefulness for a specific sector. These searches, which usually assume the choice of a statistical sample of population to be analyzed and the use of given search keys, often have unsatisfactory results. Excessively generic searches can obtain a lot of data, which must then be screened by operators with a great expenditure of time, while excessively specific searches, on the other hand, can be inadequate or lead to misleading results. The choice of the search keys might in fact be incorrect, might not include relevant words because they have not been considered or are even unknown to the operator, or might assign to certain words a relatively lower importance with respect to others.
The aim of the present invention is to overcome the limitations of the background art noted above, devising a new system capable of allowing to acquire information from a plurality of data and information sources and of processing them so as to allow a rapid and effective analysis thereof.
Within this aim, an object of the present invention is to select, among publicly accessible information, that which is most relevant for a given topic.
A further object of the invention is to allow the correlation of information acquired from various media channels.
Another object of the present invention is to have a structure that is simple, relatively easy to provide in practice, safe in use and effective in operation, and relatively low in cost.
This aim and these and other objects that will become better apparent hereinafter are achieved by a system according to appended claim 1, by a method according to appended claim 9 and by a computer program according to appended claim 10.
Advantageously, the system according to the invention allows to process and store large volumes of data, preferably of the text type, acquiring them from various media channels, for example newspapers, social networks, blogs.
Conveniently, the system according to the invention allows to provide data that are useful for performing statistical studies on the way in which an information item propagates over information channels such as online newspapers, social networks, blogs, the media and the various data sources that are present on the Internet.
Advantageously, the system according to the invention allows to perform a web data search, but it can also reprocess the information in real time with historicized information, optionally enhancing it with other information that arrives from other data sources, such as for example the data that originate from electromechanical tools that send data in real time to the system, which then historicizes them.
Advantageously, the system according to the invention allows to study the way in which an information item spreads geographically.
Validly, the system according to the invention allows to study financial markets by historicizing the main indexes and stocks, creating appropriate statistical indexes.
Further characteristics and advantages of the invention will become better apparent from the following detailed description, given by way of nonlimiting example, accompanied by the corresponding figures, wherein:
Figure 1 is a block diagram of an embodiment of the system for extrapolation and statistical processing of data that can be acquired from one or more data sources according to the present invention;
Figure 2 is a block diagram showing a detail of the embodiment of the system for extrapolation and statistical processing of data that can be acquired from one or more data sources, shown in Figure 1 ;
Figure 3 is a flowchart showing the operation of an embodiment of the system for extrapolation and statistical processing of data that can be acquired from one or more data sources according to the present invention;
Figure 4 is a view of a first aspect of an embodiment of the system for extrapolation and statistical processing of data that can be acquired from one or more data sources according to the present invention;
Figure 5 is a view of a second aspect, in particular a word cloud, of the system for extrapolation and statistical processing of data that can be acquired from one or more data sources according to the present invention;
Figure 6 is a block diagram of the Query Creator of an embodiment of the system for extrapolation and statistical processing of data that can be acquired from one or more data sources according to the present invention;
Figure 7 is a block diagram of the process for creation of the topic lists, or Probes_Lists, of an embodiment of the system for extrapolation and statistical processing of data that can be acquired from one or more data sources according to the present invention;
Figure 8 is a block diagram showing the module that comprises the dictionaries, or Dictionary_MultiIdioms, of an embodiment of the system for extrapolation and statistical processing of data that can be acquired from one or more data sources according to the present invention;
Figure 9 is a block diagram showing a query in a system of a known type;
Figure 10 is a block diagram of a query in an embodiment of the system for extrapolation and statistical processing of data that can be acquired from one or more data sources according to the present invention.
An exemplifying architecture of the system according to the present invention is summarized in the block diagram of Figure 1.
The system comprises a station of a user 1, a server 2 and a telecommunications network 4, such as for example the Internet, which connects the station 1 to the server 2 and the server 2 to a plurality of data sources 3.
The station of a user 1 comprises a terminal or terminal computer that processes information generated for example upon input or request of the user or operator and is capable of executing software instructions, in particular a software application for entering a request to be forwarded to the server 2. Such user station 1 is capable of communicating, preferably by means of such software application, with the server 2. In the preferred embodiment the station of a user 1 comprises a computer which comprises various hardware and software resources and which, among other things, provides data to the user, receives input from the user, reprocesses such input and sends it to the server 2. In the preferred embodiment, the user of the station 1 formulates this request in the form of a text data item. For example, a user of the station 1 might seek information related to a given topic, for example a news event, a fashion event or a financial event. Preferably, the user of the station 1 formulates this request by identifying a list of relevant words or search keys and sends this request to the server 2.
The server 2 comprises known hardware architectures and is arranged geographically in a position that is preferably remote with respect to the station 1, with which it is capable of communicating by means of the telecommunications network 4. The server 2 receives data from the station 1 and in particular requests to query sources 3 related to a given field or topic.
The server 2 comprises at least the means 20, 21, 22, 23, 24. Such means are preferably implemented as modules of a software application that can be executed by the server 2.
The server 2 comprises further storage means ("Bigdata Architecture" block 56 in Figure 6) adapted to store all the information and data related to the system for extrapolation and statistical processing of data that can be acquired from one or more data sources. In a preferred embodiment of the system 10 according to the invention, the storage means comprise a database that is stored on appropriately sized storage media.
The means 20 are adapted to store data acquired from one or more sources 3 on the basis of the content of the request forwarded from the station 1. In particular, the server 2 receives a request that is formulated by a user of the station 1 and generates a query intended for one or more of the sources 3 in order to obtain a plurality of information items from the network. The choice of the sources 3 to be queried can be indicated in the request but it can also be determined on the basis of different criteria. For example, the means 20 can therefore perform this search by using the main search engines, such as, by way of nonlimiting example, Google and Yahoo, query the main online newspapers (local, regional, national, European and global), blogs, social networks, websites, for example financial websites, public or private data banks and the like. Furthermore, the means 20 collect these acquired data obtained by means of this search and store them in the storage means of the server 2 or in storage means that are in any case accessible from the server 2.
In a preferred embodiment of the invention, the server 2 comprises further a Query Creator which is adapted to generate, as a consequence of the request formulated by the user, the optimized query intended for one or more of the sources 3 for acquiring the data.
The Query Creator uses customized probes, shown in the block diagram of Figure 6 by the "Probes" block 50. The Probes 50 perform the work of searching and assessing the data on the basis both of preset criteria and of data reprocessing outputs, and do so in a totally automatic manner without any human intervention.
The Probes 50, in order to perform their activities and therefore assess the data, require topic lists, also known as "Probes Lists" 52 and 54. The Probes_Lists 52, 54 define the behavior of the Probes 50, i.e.: what they have to search for, how to search the data, and where they have to perform the data search. The Probes_Lists 52, 54 therefore define the activities that the probes must perform.
Within the scope of the query generated by the server 2, it is possible to define as content of the statistical search: a. a sentence (understood as a set of words ordered according to semantic logic criteria); b. a single alphanumeric word; or c. a set of bytes that compose the entire image or a part of such image.
The process for the search and acquisition of data performed by the server 2 comprises: a. public data that are present on the Internet/social networks/blogs; b. data that are present in shared and/or remote folders and/or on remote filesystems; c. data originating from streams of information originated from remote architectures.
It should be noted that the Probes 50 used in the system according to the invention search for data not only on the Internet, as other systems do by using a third-party web search engine (for example Google or Yahoo), but also in shared and/or remote folders, remote servers and/or internal databases, such as for example hospital patient records, company databases, shared public folders, or electromechanical probes, in order to search for as much data as possible related to a single search topic defined by the query.
The block diagram of Figure 7 shows in detail the process with which the Probes_Lists 52, 54 are created. The data are assessed in the system according to the invention on the basis of the source 3 from which they originate and the cluster to which they belong. Studying a phenomenon by means of the originating source and the cluster to which it belongs allows a deeper understanding of social phenomena. Furthermore, it is possible to assign to each type of data item the correct placement within one or more categories; for example, a data item or information item can belong to News and to Finance and to Fashion.
For example, if the system according to the invention is short on information for the study of a particular phenomenon, it is capable of generating internally a query for searching for data and of passing it to the probes, which in turn deal with the search for the data item proper at the various available data sources 3. In practice, if the system lacks information, it automatically creates the queries without any human intervention.
The means 21, for attributing a value to each one of such acquired data items, associate with such acquired and stored data items a given value that is defined as the ranking of the data item. In particular, such means 21 detect the frequency, for example the statistical frequency, with which a given data item is present in the acquired data. For example, if the data are two newspaper articles related to a given news event and therefore these data are of the text type, i.e., words, the means 21 can identify the words that occur most frequently in both texts. Conveniently, these words to be identified need not be indicated in the request entered by the user of the station 1: in this manner, advantageously, it is possible to identify relevant words that were not indicated expressly by the operator or cannot be traced back to the search keys contained in the operator request. In other words, while the search keys entered by the operator make it possible to find relevant information, the means 21 make it possible to find in such acquired information additional data that can be important for the operator. The means 21 therefore assign to the words that are present most frequently a given value (for example, assuming that the value that can be assigned is comprised in a range of integers from 1 to 5, the value 4 or 5) which is greater than the value assigned to the words that occur less frequently or are not present at all (with which for example the value 1 or 2 might be associated). Preferably, the means 21 assign to all the data a value that is comprised in a range of allowable values. Preferably, this range comprises a range of positive integers. In one embodiment, all the data are initially assigned a ranking value that is equal to a preset or default value, for example 1, and this value is subsequently modified by the means 21 on the basis of the calculated frequency. Preferably, the default value is the lowest of the allowable values.
In one embodiment, the means 21 initially assign to each data item a default value which is the lowest among the ones allowed and is preferably further modified by the means 21 on the basis of the calculated frequency and again, with methods that are more clearly described hereinafter, by the means 23.
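By way of nonlimiting illustration, the frequency-based ranking described above can be sketched as follows. The function names and the linear scaling of the frequency onto the range 1 to 5 are assumptions made for this sketch; the description only requires that words occurring more frequently receive a higher value than words occurring less frequently.

```python
from collections import Counter

def repetition_frequencies(tokens):
    """Occurrences of each token divided by the total number of tokens."""
    counts = Counter(tokens)
    return {word: count / len(tokens) for word, count in counts.items()}

def ranks_from_frequencies(frequencies, min_rank=1, max_rank=5):
    """Scale each frequency against the highest one onto min_rank..max_rank.

    The linear scaling is an assumption of this sketch, not a rule
    stated in the description.
    """
    top = max(frequencies.values())
    return {
        word: min_rank + round((freq / top) * (max_rank - min_rank))
        for word, freq in frequencies.items()
    }

# Hypothetical token counts, chosen only for illustration.
tokens = ["attack"] * 10 + ["museum"] * 7 + ["world"] * 3 + ["bardo"]
frequencies = repetition_frequencies(tokens)
ranks = ranks_from_frequencies(frequencies)
```

With these hypothetical counts, "attack" receives the value 5, "museum" the value 4, "world" the value 2 and "bardo" the lowest value 1.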
In a preferred embodiment of the system according to the invention, the acquired data are analyzed by means of a naive Bayes classifier, a classifier based on Bayes' theorem, which requires the conditional probabilities related to the problem to be known in advance. In the following exemplifying model of naive Bayes classifier, the independence of the features is assumed, i.e., it is assumed that the presence or absence of a feature in a data set is not correlated with the presence or absence of other features.
The model is thus summarized:
1. collect all the words and token elements that occur in the text;
2. create the Vocabulary = distinct words + token;
3. estimate P(vj) and P(Wk|vj), where:
a. P(Wk|vj) = probability of having the word k given the target value j;
b. P(vj) = probability of the target value j.
In pseudocode, given a Text T and a Vocabulary V, the model is as follows:
FOR EACH target value vj DO:
    Docj = subset of T where target value = vj
    P(vj) = |Docj| / |T|
    Tj = document created by concatenating Docj
    n = number of total words and tokens in Tj, including duplicates
    FOR EACH word or token Wk IN V DO:
        nk = frequency of Wk in Tj
        P(Wk|vj) = (nk + 1) / (n + |V|)
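By way of nonlimiting illustration, the training step of the model above can be sketched in Python. The function name, the representation of the text as a list of (tokens, target value) pairs and the sample data are assumptions made for this sketch; the estimate of P(Wk|vj) uses the add-one smoothing shown in the pseudocode.

```python
from collections import Counter

def train_naive_bayes(examples):
    """Estimate P(vj) and P(Wk|vj) from a list of (tokens, target) pairs."""
    vocabulary = {tok for tokens, _ in examples for tok in tokens}
    targets = {target for _, target in examples}
    priors, likelihoods = {}, {}
    for vj in targets:
        # Docj: the documents whose target value is vj.
        docs_j = [tokens for tokens, target in examples if target == vj]
        priors[vj] = len(docs_j) / len(examples)
        # Tj: all words and tokens of Docj, duplicates included.
        counts = Counter(tok for tokens in docs_j for tok in tokens)
        n = sum(counts.values())
        likelihoods[vj] = {
            wk: (counts[wk] + 1) / (n + len(vocabulary)) for wk in vocabulary
        }
    return priors, likelihoods, vocabulary

# Hypothetical training data, chosen only for illustration.
examples = [
    (["attack", "museum", "victims"], "news"),
    (["museum", "tourists"], "news"),
    (["stock", "index"], "finance"),
]
priors, likelihoods, vocabulary = train_naive_bayes(examples)
```

Because of the smoothing term, a word that never occurs with a given target value still receives a small non-zero probability, and the probabilities of all the words of the vocabulary for one target value sum to 1.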
The means 22 for grouping the acquired data identify correlations among the acquired data and divide these data into groups or clusters. In particular, for this purpose the means 22 comprise a clustering engine that receives in input any type of clustering algorithm. In the preferred embodiment, this correlation is provided by means of a partitional clustering algorithm of the K-means or K-medoids type. These means 22 therefore might place in a same group data to which the means 21 have assigned different values. For example, a data item to which a default value has been given might share the same group as data to which the maximum allowed value has been given instead.
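By way of nonlimiting illustration, a partitional clustering of the K-means type can be sketched as follows. The description does not specify how the data items are represented for clustering, so the numeric feature vectors used here are purely hypothetical, as are the function name and the choice of the first k points as initial centroids.

```python
def kmeans(points, k, iterations=20):
    """Basic K-means (Lloyd's algorithm) on tuples of floats.

    The initial centroids are simply the first k points, which is an
    assumption of this sketch.
    """
    centroids = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centroids

# Hypothetical two-dimensional feature vectors, one per data item.
points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0), (1.0, 0.0)]
assignment, centroids = kmeans(points, k=2)
```

The returned assignment gives, for each data item, the index of the group or cluster in which it has been placed, independently of the value assigned to that data item by the means 21.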
The means 23 for calculating a statistical index attribute a given value, termed group statistical index, to each one of the groups identified by the means 22. In particular, for this purpose the means 23 comprise a statistical engine that receives in input any type of statistical algorithm. In the preferred embodiment, this value is given by the ratio between the sum of the values assigned by the means 21 to each one of the data items comprised in the group to which they belong, assigned by the means 22, and the number of data items contained in the same group. In the preferred embodiment, the means 23 are further suitable to modify the value associated with a given data item by the means 21 on the basis of the value associated with the group to which said data item has been assigned. For example, the means 23 assign to the data with which the means 21 have associated a data item ranking equal to the default ranking, for example 1, as a replacement a value that is equal to or based on the statistical index of the group, for example 4 or another value. Advantageously, and if appropriate, it is possible to modulate the value assigned initially by the means 21, adapting it to the value associated with the data that have a same affinity, for example data to which the same group or cluster has been assigned. In this manner the data can have a value and therefore a relevance that does not depend exclusively on the repetition frequency of the data item in the information (for example text articles) acquired from the queried sources 3. In particular, it is possible further to modulate, partially modify or replace the ranking value of the data, in particular of the data to which a default value has been assigned, on the basis of the ranking value of the data of the assigned group or of the statistical index of the group.
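By way of nonlimiting illustration, the group statistical index and the replacement of default values can be sketched as follows. The function names, the sample rankings and the rounding of the index to the nearest integer are assumptions made for this sketch.

```python
from collections import defaultdict

DEFAULT_RANK = 1  # lowest allowable value, as in the example of the description

def group_indexes(ranks, groups):
    """Group statistical index: sum of member rankings / number of members."""
    members = defaultdict(list)
    for item, group in groups.items():
        members[group].append(ranks[item])
    return {group: sum(values) / len(values) for group, values in members.items()}

def replace_defaults(ranks, groups):
    """Give items that still hold the default rank a value based on their group index."""
    indexes = group_indexes(ranks, groups)
    return {
        item: round(indexes[groups[item]]) if rank == DEFAULT_RANK else rank
        for item, rank in ranks.items()
    }

# Hypothetical rankings and groups, chosen only for illustration.
ranks = {"attack": 5, "victims": 5, "tourists": 4, "Bardo": 1, "world": 2}
groups = {"attack": "A", "victims": "A", "tourists": "A", "Bardo": "A", "world": "B"}
indexes = group_indexes(ranks, groups)
adjusted = replace_defaults(ranks, groups)
```

In this sketch the index of group "A" is (5+5+4+1)/4 = 3.75, so the default value of "Bardo" is replaced by 4, while the values of the other words are left unchanged. Here the index is calculated on all the data items of the group, including the one that holds the default value; calculating it on part of the group only is an equally possible choice.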
The means 24 sort the acquired data according to a given criterion and organize these data in a data structure, for example a priority list or a sorted tree. The data structure can be sorted according to various criteria, for example on the basis of the statistical group index, on the value or ranking of the data item, or on both. An example of this sorted tree is shown in Figure 4. Furthermore, the means 24 are adapted to display visual information associated with such list, in particular information which shows in a highlighted manner the data that are considered priority and are therefore more relevant than the ones considered less important. An example of this visualization in the form of a word cloud is shown in Figure 5.
A word cloud is a visual representation of keywords used within the acquired mass of data. In general, the word cloud is presented in alphabetical order, with the particular characteristic of assigning a larger font to the most important words: it is therefore a weighted list. The weight of the words, which is rendered with characters of different sizes, is understood for example as frequency of use within the acquired data. The larger the character, the higher the frequency of the keyword.
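By way of nonlimiting illustration, the sorting of the data and the weighting of the word cloud can be sketched as follows. The function names, the point sizes and the sample values are assumptions made for this sketch; in particular, the linear mapping from weight to character size is only one possible choice.

```python
def priority_list(ranks, groups, indexes):
    """Sort items by group statistical index, then by their own ranking, highest first."""
    return sorted(ranks, key=lambda item: (-indexes[groups[item]], -ranks[item], item))

def font_sizes(weights, min_pt=10, max_pt=40):
    """Linearly scale weights to point sizes; entries are returned in alphabetical order."""
    low, high = min(weights.values()), max(weights.values())
    span = (high - low) or 1  # avoid division by zero when all weights are equal
    return [
        (word, min_pt + (weight - low) * (max_pt - min_pt) / span)
        for word, weight in sorted(weights.items(), key=lambda kv: kv[0].lower())
    ]

# Hypothetical rankings, groups and group indexes, chosen only for illustration.
ranks = {"world": 2, "attack": 5, "museum": 5, "Bardo": 4}
groups = {"world": "B", "attack": "A", "museum": "A", "Bardo": "A"}
indexes = {"A": 14 / 3, "B": 2.0}

ordered = priority_list(ranks, groups, indexes)
cloud = font_sizes(ranks)
```

The priority list places the words of the group with the highest index first, while the word cloud keeps the alphabetical order and conveys the weight only through the character size.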
In a preferred embodiment of the system according to the invention, the server 2 comprises further a module adapted for Analysis of Human Language Data, therefore specific for text analysis, known as Idioms.
The Idioms module allows the study of the statistical outliers of the assessed words. In particular, this module performs: a. recognition of new words, and therefore study of neologisms; b. learning of new ways of communicating, linked for example to technological and social evolution, which leads every day to new words or abbreviations.
In an even more preferred embodiment of the system according to the invention, the Idioms module is connected to a module 40 known as "Dictionary_Multi-Idioms", which comprises preferably both multilingual translation dictionaries 42 and thematic dictionaries 44. The dictionaries that constitute the Dictionary_Multi-Idioms 40 are the element on which the text analysis process is based. The thematic dictionaries 44, also known as "Topic_Dictionary", are dictionaries in the language or idiom of the text, which are specific for the topic being dealt with (for example a medical dictionary, a computer technology dictionary, and so forth) and comprise an indication of the ranking of each word. They aid the system in understanding the specific search topic and make it possible to concentrate analysis only on words that are pertinent to the subject of the search and/or of the statistical study.
The stop word process is performed after the system according to the invention has recognized the language or idiom with which the text being processed has been written, and only after loading into the system the appropriate thematic dictionary 44.
In an embodiment of the system according to the invention, the multilingual translation dictionaries 42 and the thematic dictionaries 44 can be extended both by self-learning, which occurs simultaneously with the analysis of the texts, and manually, by means of the intervention of operators, as well as by means of connections to foreign language Universities.
With reference to the flowchart of Figure 3, the operation of an embodiment of the system according to the invention is now illustrated. By way of nonlimiting example, it is assumed that the search topic is a news event related to the Bardo museum in Tunis and that the source 3 is Twitter.
In step 30 the user of the station 1 defines the search keys to be entered in a request to be sent to the server 2. For example, these words can be Twitter hashtags such as #museum; #Tunis or #Tunisia; #attack; #victims; "world" and other words comprised in the set of words shown in Figure 4. It is assumed that the word "Bardo" is not comprised within the search keys. Preferably, the server 2 is capable of translating these terms into other languages by means of appropriate multi-language translation dictionaries 42, also using self-learning mechanisms, and of automatically extending the search to sources 3 in a language other than that of the search keys. Therefore, in the example, the server 2 is capable of also finding user comments (tweets) that were written in a language other than the language of the search keys.
In step 31, the server 2 extrapolates from the request the search keys and, conveniently, other parameters (for example the time that can be used for the search) and performs the search. Preferably, the search step performed by the server 2 provides for the preliminary creation of a script, for example based on the Knime software.
In step 32, the server 2 obtains from the queried source 3, after a given period of time, a plurality of data items in response, for example comments published by various users on their Twitter profiles. The data obtained are subjected to a preprocessing step, so as to make analysis easier, for example by eliminating punctuation, filtering certain terms, converting the characters into a uniform format, for example from uppercase to lowercase, and in general performing stemming operations by using suitable software tools.
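By way of nonlimiting illustration, the preprocessing step can be sketched as follows. The function name, the list of terms to be filtered and the sample comment are hypothetical; the stemming operations mentioned above are omitted from this sketch, and the "#" character is deliberately preserved so that hashtags remain recognizable.

```python
import string

# Hypothetical list of terms to be filtered out.
STOP_WORDS = {"the", "a", "of", "in", "at"}

def preprocess(text):
    """Lowercase the text, strip punctuation (except '#') and filter stop words."""
    lowered = text.lower()
    # Delete every punctuation character except '#', so hashtags survive.
    table = str.maketrans("", "", string.punctuation.replace("#", ""))
    cleaned = lowered.translate(table)
    return [token for token in cleaned.split() if token not in STOP_WORDS]

# Hypothetical comment, chosen only for illustration.
tokens = preprocess("Attack at the Bardo museum in #Tunis, victims reported!")
```

The resulting list of tokens is the form in which the data can then be passed to the ranking and grouping operations of step 33.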
In step 33, the server 2 assigns to each obtained data item a given data item ranking or value; this operation is performed by the means 21, which attribute for example the data ranking or value on the basis of the frequency with which a given data item occurs in the acquired mass of data. For example, as exemplified by the word cloud in Figure 5, the word "attack", which is assumed to be present with high frequency, is assigned the value 5, the word "world", which is assumed to be present less frequently, is assigned the value 2, while the word "Bardo", which is not present in the search keys but is in any case present in the acquired data (for example with a frequency that exceeds a preset threshold), is assigned the default value 1. Then the means 22 assign each word to a given group. For example, the word "Bardo" might be assigned to the same group to which the word "attack" belongs. It is understood that these values are indicated merely for exemplifying purposes. Then, according to one embodiment, the means 21 modify the values of the words to which such means 21 initially assigned a default value, for example on the basis of the value of the words assigned to the same group. For example, the word "Bardo" might be assigned a value equal to the ratio between the sum of the data item rankings or values of the words, or of part of the words, of the group or cluster to which it belongs and the number of such words. For example, the ratio might be calculated by considering the words "tourists", "victims", "died", #news and #world, and the respective data item rankings or values as shown in Figure 4, as equal to (4+5+5+5+2)/5 = 4.2, which might be approximated to 4. This makes it possible to modulate the value on the basis of the value assigned to the correlated data.
Other criteria can be used to modify the data item rankings or values associated with the data, in particular the data with which a default value is associated initially; for example, one might assign the corresponding statistical index of the group or cluster to which they belong.
In step 34, the means 24 show a word cloud, i.e., visual information in which the words that have a larger value, optionally modified as indicated in step 33, are shown in a highlighted manner with respect to the others. Figure 5 shows an example of this display, wherein for example the word "museum" appears more emphasized than the others (some of the words might appear in a truncated form or might be modified on the basis of the pre-processing operation).
It has thus been shown that the method and system described achieve the intended aim and objects. In particular, it has been shown that the system thus conceived overcomes the qualitative limitations of the background art, making searches easier, facilitating the processing of the data and reducing the effort required to select the most pertinent data. In this manner, the system advantageously allows the recovery of information related to various topical areas and makes it possible to focus attention on the most important data, assigning to these data an appropriate weight/value. The resulting visual information allows for example an operator to concentrate on the writing of an article or the creation of reports by assigning a greater weight to the most relevant words found. The system thus obtained, by way also of the choice of sources 3 located in different places, makes it possible to study social phenomena linked to the spread of a news item, to study the ways in which propagation of an information item occurs and, on the basis of these studies, to perform analysis related to the reactions caused by the diffusion of the news proper.
Clearly, numerous modifications are evident and can be performed promptly by the person skilled in the art without abandoning the protective scope of the present invention.
For example, it is obvious for the person skilled in the art that the system can also be used to provide statistical models that make it possible to anticipate fluctuations of the main stock exchange indexes, to perform risk management studies, to write science articles easily and to provide tools for aiding scientific research.
Therefore, the scope of the protection of the claims must not be limited by the illustrations or preferred embodiments shown in the description by way of example, but rather the claims must comprise all the characteristics of patentable novelty that reside in the present invention, including all the characteristics that would be treated as equivalents by the person skilled in the art.
The disclosures in Italian Patent Application no. 102015000024569 (UB2015A001469), from which this application claims priority, are incorporated herein by reference.
Where technical features mentioned in any claim are followed by reference signs, those reference signs have been included for the sole purpose of increasing the intelligibility of the claims and accordingly such reference signs do not have any limiting effect on the interpretation of each element identified by way of example by such reference signs.

Claims

1. A system for extrapolation and statistical processing of data that can be acquired from one or more data sources, comprising:
- a station (1) of a user, provided with means for generating a request to query data sources (3), said request comprising one or more search keys;
- a server (2), comprising means adapted to receive said request and to query one or more of said sources (3) on the basis of the content of said request;
said system being characterized in that said server (2) comprises furthermore:
- means (20) adapted to store data acquired from one or more of said sources (3) in response to said query;
- means (21) for attributing to each one of said acquired data a data item ranking or value;
- means (22) for grouping said acquired data into groups or clusters;
- means (23) for calculating, for each of said groups, a value, defined as statistical group index, said value being given, for each one of such groups, by the ratio between the sum of the values of the respective data comprised in the group and the number of said data items comprised in the group;
- means (24) for sorting and organizing said acquired data and for displaying visual information associated with said acquired data; said data being sorted on the basis of the calculated statistical indexes of said groups or on the basis of said data rankings or values or both.
2. The system according to claim 1, characterized in that said statistical group index is comprised in a range that has a minimum value and a maximum value, said data item ranking or value being comprised within said range.
3. The system according to claim 1 or 2, characterized in that said means (21) for assigning a value are adapted to assign a value or ranking equal to, or based on, a repetition frequency, said repetition frequency being calculated as the ratio between the number of repetitions of said data item within the acquired data and the total number of data acquired in response to said query.
4. The system according to claim 1 or 2 or 3, characterized in that said search keys and said acquired data are of the text type, said assignment means (21) being further adapted to assign a predefined value to each acquired data item that lacks a match in said search keys.
5. The system according to one or more of the preceding claims, characterized in that said assignment means (21) are adapted further to assign to each acquired data item that lacks a match in said search keys a value that is equal to, or based on, the statistical index of the group to which said data item belongs.
6. The system according to one or more of the preceding claims, characterized in that said station (1) of a user comprises an interface for entering a request, said request comprising, in addition to said one or more search keywords, one or more of the following information items: a time data item, equal to the time that can be used by the server (2) to acquire said data; one or more indications of sources (3) to be queried; one or more geographical locations in which said sources (3) to be queried are located.
7. The system according to one or more of the preceding claims, characterized in that said acquired data are analyzed by means of a naive Bayes classifier.
8. The system according to one or more of the preceding claims, characterized in that said means (22) for grouping said acquired data group said data into groups or clusters by means of a partitional clustering algorithm preferably of the K-means or K-medoids type.
9. A method for extrapolation and statistical processing of data that can be acquired from one or more sources, comprising the steps of:
- generating a request to query data sources (3) that comprises one or more search keys;
- querying one or more of said data sources (3) on the basis of the content of said request;
- storing data acquired from said one or more sources (3) in response to said query;
- assigning to each one of said acquired data a value or ranking;
- grouping said acquired data into groups or clusters preferably by means of a partitional clustering algorithm of the K-means or K-medoids type;
- calculating, for each one of said groups, a value, defined as statistical group index, said value being constituted, for each of said groups, by the ratio between the sum of the values of the respective data comprised in the group and the number of said data items comprised within the group;
- sorting and organizing said acquired data and showing visual information associated with said acquired data; said data being sorted on the basis of the calculated statistical indexes of said groups or on the basis of said data rankings or values or both.
10. A computer program adapted to be stored on a data processing medium and comprising instructions of the software type adapted to implement the steps of the method according to claim 9.
PCT/IB2016/053619 2015-06-17 2016-06-17 System for extrapolation and statistical processing of data that can be acquired from one or more data sources WO2016203446A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CH01544/17A CH712818B1 (en) 2015-06-17 2016-06-17 System and method for the extrapolation and statistical processing of data obtained from one or more data sources.

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IT102015000024569 2015-06-17
ITUB20151469 2015-06-17

Publications (1)

Publication Number Publication Date
WO2016203446A1 (en) 2016-12-22

Family

ID=55409935

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2016/053619 WO2016203446A1 (en) 2015-06-17 2016-06-17 System for extrapolation and statistical processing of data that can be acquired from one or more data sources

Country Status (2)

Country Link
CH (1) CH712818B1 (en)
WO (1) WO2016203446A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6795820B2 (en) * 2001-06-20 2004-09-21 Nextpage, Inc. Metasearch technique that ranks documents obtained from multiple collections
WO2015000083A1 (en) * 2013-07-05 2015-01-08 Anysolution, Inc. System and method for ranking online content

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ADINA LIPAI: "World Wide Web Metasearch Clustering Algorithm", INFORMATICA ECONOMICA, 2008, VOL. XII, ISSUE 2, 25 November 2008 (2008-11-25), pages 5 - 11, XP055267225, Retrieved from the Internet <URL:http://revistaie.ase.ro/content/46/Adina%20Lipai.pdf> [retrieved on 20160420] *
ZAMIR O ET AL: "Grouper: a dynamic clustering interface to Web search results", COMPUTER NETWORKS, ELSEVIER SCIENCE PUBLISHERS B.V., AMSTERDAM, NL, vol. 31, no. 11-16, 17 May 1999 (1999-05-17), pages 1361 - 1374, XP004304560, ISSN: 1389-1286, DOI: 10.1016/S1389-1286(99)00054-7 *

Also Published As

Publication number Publication date
CH712818B1 (en) 2020-12-30

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16744857

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 10201700001544

Country of ref document: CH

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16744857

Country of ref document: EP

Kind code of ref document: A1