WO2011145036A1

WO2011145036A1 - System and method for detecting network contents, computer program product therefor

Info

Publication number: WO2011145036A1
Application number: PCT/IB2011/052125
Authority: WO
Inventors: Giuseppe Provera
Original assignee: Convey S.R.L.
Priority date: 2010-05-18
Filing date: 2011-05-16
Publication date: 2011-11-24
Also published as: IT1399704B1; ITTO20100413A1

Abstract

A system for detecting the contents of web pages, for example for detecting improper contents related to counterfeiting, IPR piracy and similar illegal behaviour, comprising at least a user terminal (10) equipped with a browser to surf web pages of a network (N). Coupled to the browser is a contents detection module (TB) configured to detect the contents of pages opened via said browser with a contents detection action and real-time feed-back driven via said browser.

Description

"System and method for detecting network contents, computer program product therefor"

* * * *

Technical field

The present disclosure relates to techniques for detecting network contents, for example on networks such as the Internet.

The present disclosure has been devised by paying attention to the possible use in detecting improper contents, for instance for protecting owners of industrial property rights (IPRs) from counterfeiting, piracy and similar illegal behaviour.

Technological background

Over recent years, various solutions have been devised to monitor the Internet and its contents, for example in order to protect owners of industrial property rights from counterfeiting, piracy, abuse, improper usage, etc. This is done both to respond to the increasing need to protect rights of use/economic exploitation and with the aim of protecting the image of the subject being cited.

These techniques are identified with various designations such as "Internet Intelligence", "Web Monitoring", "Web Mining", "Competitive Intelligence", "Social Network Analysis", "Web 2.0 Analytics", or, with more relevance to intellectual property aspects, with designations such as "Brand Monitoring", "Brand Protection", "TradeMarks Intelligence", "Copyright Protection", "Internet IP Protection & Management" and so forth.

The related literature is quite extensive, as witnessed, by way of example only, by documents such as US-B-6 401 118, US-A-6 983 320, WO-A-0108382 and WO-A-2007/047871.

Various detection toolbars provide for primarily collecting browsing data from the browser of a "remote" user (see, for example, http://www.alexa.com/toolbar), for carrying out checks (e.g. antiphishing) based on pre-existing databases on a central server and providing the results to the user (see, for example, http://toolbar.netcraft.com/), or for supporting information sharing between user communities (see, for example, http://www.google.com/intl/it/toolbar/ie/index.html). In any case, these solutions are based on the assumption that an established centralized computer system exists, characterized by significant web page acquisition capabilities.

Object and summary

The inventor has observed that the existing technological solutions have tackled the monitoring problem by establishing centralized computer systems, characterized by significant web page acquisition capabilities, by resorting to website contents crawling/spidering technologies and high-level contents analysis capabilities, for example via complex text analysis algorithms based on "NLP - Natural Language Processing" methodologies.

The inventor has observed that such centralized approach presents numerous drawbacks.

For instance, as regards web page acquisition capability, the available acquisition power, no matter how large, is inevitably constrained by existing technical limitations (available Internet download bandwidth, or the concentrated download of high volumes of pages and their storage) and by the limitations stemming from the browsing/acquisition logic of the network crawler (limited in fact to a finite number of search strategies, which are essentially repetitive over time, as they are conceived and developed by a few "thinking subjects" based on their own experience, their own field knowledge and/or the specific goals of the monitoring activity).

Also, as regards content analysis ability, irrespective of the degree of sophistication of the textual analysis algorithms and "NLP - Natural Language Processing" methodologies applied, certain situations of linguistic ambiguity may arise in different contexts and/or where textual and graphic contents strongly interact or are mixed; these frequently elude correct analysis and a correct appreciation of the level of "danger" in the page as regards protecting IPRs from the point of view of the owner (of text, images, brand, multimedia contents, etc.).

Moreover, the functional architecture and its characteristic impact on users always entail a very high cost of the technological systems, whose management is (almost) never entrusted to those subjects that have true knowledge of the intangible good (for example, an IP asset) subject to monitoring/protection. The typical production process is almost always outsourced to expert computer engineers who, for the final report, interface with the Legal/IP departments of the owner of the intellectual property rights (IPRs). The owner's organization hardly develops a deep knowledge/awareness of the nature of the process and cannot provide the subjects who manage the intelligence activity with any experience and feedback. In other words, the daily, intense and differentiated use of the Internet which characterizes the actual work of any person or organization does not bring any appreciable improvement in terms of protecting IPRs and/or image on the network.

The object of various embodiments is to provide a solution for detecting contents in a network capable of overcoming the drawbacks outlined above.

In various embodiments, that object is achieved thanks to a system having the characteristics specifically recited in the claims that follow. The invention also relates to a corresponding method, as well as a computer program product, loadable in the memory of at least one computer and including software code portions for implementing the steps of the method when the product is run on at least one computer. As used herein, reference to such a computer program product is intended to be equivalent to reference to a computer-readable medium containing instructions for controlling a processing system to coordinate the implementation of the method according to the invention. Reference to "at least one computer" is evidently intended to highlight the possibility that the present invention may be implemented in a modular and/or distributed form.

The claims form an integral part of the technical disclosure of the invention as provided herein.

Various embodiments comprise at least one user terminal equipped with a browser to surf web pages in a network such as the Internet; the browser is coupled with a contents detection module configured for detecting the contents of the pages opened from time to time by the user via the browser, thus performing a contents detection action driven via the browser.

In various embodiments, the above-mentioned contents detection module is configured for supplying the user terminal with feed-back information on the contents of said pages.

In various embodiments, the above-mentioned contents detection module may be configured to interact with a centralized server subsystem in order to send to such a centralized server subsystem contents information on the above-mentioned pages with a view to possibly storing such data in a repository in the centralized server subsystem.

Various embodiments are able to use, for example in relation to IPRs and related areas, the successful and widespread model of antivirus systems, which are normally established in an organized context, with a server-side application of high technical profile and client applications distributed over all the workstations included in the Intranet/Extranet to complete an active protection framework, with high "center/remote" synergies.

Various embodiments introduce in an IPR protection model the concept of a remote application possibly adapted to interact with a central application, which may not just be activated within the framework of an organization (professional, company, body), but is also applicable to subjects grouped at different levels (e.g. inter-company, consortium, category, territory, community, etc.).

Various embodiments are based on an innovative application capable of providing, in real time, a specialized analysis of a situation of interest (e.g. use/abuse of an intangible IPR/asset in a web page) at the very moment when a person is surfing a page while working daily on a workstation. Such a person is immediately enabled to understand/construe the signals (at different levels as appropriate) that the application produces in real time; it is then left to the person, and to his/her knowledge/understanding of the "IP asset" involved, to decide whether the signal/output received should be subjected to further and more in-depth centralized analysis, as mentioned above.

Various embodiments exploit the fact that the technical/computer science intelligence for the analysis of critical situations present in the contents of web pages (e.g. in terms of integrity of IPRs, or associations which may be dangerous for the image of an individual/company), when provided by a remote station, presents various advantages over (and also complementarity with) a centralized solution, in particular:

- it is included in the browser, that is in the main instrument for Internet surfing, namely a software feature present in any computer capable of connecting to the network;

- it acts in real time, providing feed-back information immediately;

- it is activated by several subjects (theoretically all those subjects capable of using the network), without distinctions in terms of role/activity and without time constraints and/or predefined modes that might affect usage behaviour;

- it is oriented and characterized by network exploration logics as numerous as the subjects that use it, the activities that these carry out, the problems that, from time to time, these must solve on the Internet when working every day in their respective tasks and responsibilities (technical, commercial, financial, organizational, relational, etc.);

- it may benefit from immediate human supervision, as it is made available to subjects that have intelligence, creativity, memory and relationship ability at a very high degree in comparison with the management capabilities of software algorithms of centralized systems, which are predefined and limited;

- it may be integrated with centralized technical and methodological components, which are capable of cross-checking and verifying situations emerging from the action of plural persons at the periphery, who may have detected similar/identical results following even quite different paths/logics, thus making it possible to investigate to a deeper extent (not in real time) particularly complex analysis aspects thanks to the possible presence of more powerful computing capabilities;

- it has a very small unitary cost, hardly comparable to the very high cost of centralized intelligence systems, thus making extended usage possible;

- it facilitates spreading an effective culture of "IP asset" and protection of a company/product in any organized context and adopting a widespread monitoring/protection practice of intangible assets (e.g. IPRs and/or Brand reputation) also in small enterprises.

Various embodiments support the user while currently surfing the Internet, for whatever purpose, by supplying real-time information on the presence in each page visited of one or more elements of interest precisely selected in an initial configuration stage, irrespective of whether these are of a textual type, or distinctive signs, or images.

Various embodiments may be integrated in any Internet surfing browser (e.g. Internet Explorer) and may be implemented, for instance, as a toolbar capable of calling adequate function libraries to execute differentiated and complementary operations on text (and/or images) present on the web page viewed by the user.

Various embodiments lend themselves to an implementation represented by the search/analysis, within the HTML text making up a page, of a brand or a logo, in order to determine specific use/abuse situations, not only in terms of IPRs, but also, for example, for the purpose of detecting aspects/elements connected to brand image and its reputation (the so-called "word-of-mouth" in Social Media/Social Networks).

In various embodiments, analysis is performed in real-time while a web page is being surfed and the user is notified that it has been completed by an acoustic and/or visual signal.

In various embodiments, the analysis results may be reviewed immediately and, if deemed interesting, the user may authorize that these are sent to a central server for data collection.

In various embodiments, the data present in the central server may be possibly subject to further processing and/or integrated with further information acquired from the Internet (e.g. server locations hosting the page) .

In various embodiments, the central server data may be made available to the user that originated them and, possibly, also to other authorized subjects (e.g. Consortiums, Associations, public Authorities, etc.) in the form of statistical reports and/or summary charts, thanks to different views being processed.

In various embodiments, a toolbar includes a user authentication mechanism that prevents inappropriate or unauthorized use of the application (e.g. the monitoring/analysis of a third party brand; extension of use of the application beyond the end of a trial period or the expiry of a service contract/subscription, etc.).

In various embodiments, plural users within a same organization (enterprise, consortium, association, etc.) may use the toolbar by sharing the same basic configuration (e.g. for the "Fashion" field) but with different specific targets (e.g. different brands) .

In various embodiments, the toolbar configuration is, indeed, server-side, or rather is held within a special database and is communicated to the toolbar solely after (positive) access credential verification.

In various embodiments, the toolbar configuration may be modified at a single central point connected to the Internet, by making it immediately available for a subsequent start-up.

In various embodiments, the toolbar is equipped with a control on the most recent release available and is capable of self-updating the analysis components without the need for complete re-installation.

In various embodiments, the user may decide what analysis options are to be enabled/disabled within the framework of service menus offered by the toolbar.

In various embodiments, the toolbar may also include a configuration including customized stoplists, i.e. lists of terms that, if present in a web page, must not be taken into account during the analysis, and also black/white lists containing URLs that must/must not be taken into account in the analysis.
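
A minimal sketch of how stoplists and URL lists of this kind might gate the analysis. The disclosure does not specify a configuration format or matching rule, so `FilterConfig`, `should_analyze` and `filter_terms` are hypothetical names, and matching URLs by host name is an assumption made purely for illustration.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class FilterConfig:
    # Hypothetical fields; the disclosure does not define a configuration format.
    stoplist: set = field(default_factory=set)        # terms ignored during analysis
    never_analyze: set = field(default_factory=set)   # hosts excluded from analysis
    always_analyze: set = field(default_factory=set)  # hosts analyzed in any case


def should_analyze(url: str, config: FilterConfig, enabled: bool = True) -> bool:
    """Decide whether a page is analyzed (assumption: lists match on host name)."""
    host = (urlparse(url).hostname or "").lower()
    if host in config.never_analyze:
        return False
    if host in config.always_analyze:
        return True
    return enabled


def filter_terms(terms: list, config: FilterConfig) -> list:
    """Drop stoplisted terms, comparing case-insensitively."""
    stop = {t.lower() for t in config.stoplist}
    return [t for t in terms if t.lower() not in stop]
```

In this sketch the exclusion list takes precedence over the inclusion list; a real configuration could just as well resolve conflicts the other way round.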

Brief description of the attached figures

The invention will be now described, purely by non-limiting example, with reference to the attached figures, wherein:

- figure 1 represents a general system architecture,

- figure 2 is a more detailed block diagram of an embodiment, and

- figures 3, 4 and 5 are flow diagrams representative of operation of embodiments.

Detailed description of embodiments

In the following description, various specific details are illustrated, aimed at an in-depth understanding of the embodiments. The embodiments may be realized without one or more of the specific details, or with other methods, components, materials, etc. In other cases, known structures, materials or operations are not shown or described in detail, to avoid obscuring the various aspects of the embodiments.

The reference to "an embodiment" within the scope of this description denotes that a particular configuration, structure or feature described in connection with the embodiment is included in at least one embodiment. Therefore, phrases such as "in an embodiment", possibly present in different places of this description, do not necessarily refer to the same embodiment. In addition, particular shapes, structures or features may be combined in a suitable manner in one or more embodiments.

The references used herein are only for convenience and therefore do not define the scope of protection or range of the embodiments.

Figure 1 is a block diagram of a system as described herein, whose components may be divided from the logical viewpoint into client components CC and server components CS, that will be assumed to be connected over a network N such as the Internet.

In the exemplary embodiments considered herein, the client components CC include in general Personal Computers or PCs 10 with a browser equipped with a toolbar TB as better described in the following.

As schematically illustrated, the PCs 10 may be both individual PCs with a "stand-alone" configuration (like the single PC depicted below in the left portion of figure 1), and PCs included in a corporate network (like the group of PCs connected to a firewall F depicted on top in the left portion of figure 1) .

These are PCs that generally represent the user terminal or end user, operating at a location remote with respect to the centralized authentication/processing system CS.

As already indicated, the browser in the PCs considered herein will be assumed to be equipped with a toolbar TB properly installed and configured so as to operate according to modes better described in the following.

As is well known, a toolbar is a component (widget) used in many user interfaces. It is typically a box or a horizontal or vertical bar containing icons that represent links to various system functions.

In various embodiments, at least some of the user terminals 10 may include, instead of PCs, computer devices such as PDAs or advanced portable terminals (such as iPhone® or iPad® terminals, etc.) capable of supporting a browser equipped with the toolbar TB.

In the exemplary embodiments considered herein, the components on the server side may be organized in a server network coming down to the firewall FW.

In a possible configuration, the configuration on the server side may include a data collect server 20 that is entrusted with receiving incoming data from the applications (toolbar) active remotely (that is, on the client side) . The function of the server 20 is to verify the syntactic accuracy of the input data and to provide for sending them to a database server 22 for subsequent storage within a memory or main repository 24.

In various embodiments, this component may be implemented as a web service or web application that presents outside a series of public methods (or interfaces) to be recalled by the toolbar TB (and possibly by other client applications) .

In various embodiments, the data collect server 20 may likewise communicate with an authentication server 26 in order to avoid undesired access to the remote system. The authentication server 26 verifies the user identity associated with the request originating from a remote client, with the dual purpose of supplying credential validation at toolbar start-up and toolbar configuration services.

In various embodiments, an RDBMS system may be installed on the database server 22 to manage the main database 24 containing the analysis results originating from the remote systems. In various embodiments, the database 24 may comprise appropriate tables, stored procedures, views, triggers, etc.

In various embodiments, a processing server 28 may then be present, consisting of one or more servers that process in the background the data saved in the main repository 24.

In various embodiments, one or more servers 30 may then be present with the role of report servers, in order to take care of supplying to a caller the views of interest on data present in the repository 24. A possible implementation includes a web application that submits a graphical report to the end-user.

As schematized in figure 2, in various embodiments, the toolbar TB (below indicated by the specific reference 100) may be integrated in the browser 102 of the terminals 10 as a plug-in (or addon) . The integration mechanism may be different as a function of the browser typology adopted, since each browser supplies different APIs (Application Programming Interfaces) and interaction instruments.

To carry out its task, the toolbar 100 communicates with analysis libraries 104 installed on the user terminal 10. In various embodiments, in order to mask the implementation of the analysis libraries, the toolbar 100 may be integrated in a specific interface 106 that represents a common entry-point. In such a way, by keeping the interface 106 unaltered, it is possible to replace, as required, the underlying libraries 104, for example when these have been developed following a specific protocol.
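
The idea of a common entry point that hides replaceable libraries can be sketched as follows. This is only an illustration under stated assumptions: `AnalysisLibrary`, `AnalysisInterface` and the two toy libraries are hypothetical names, and the disclosure does not define the actual protocol the libraries 104 follow.

```python
from typing import Protocol


class AnalysisLibrary(Protocol):
    """Hypothetical contract an analysis library 104 is assumed to follow."""

    name: str

    def analyze(self, html: str, config: dict) -> float:
        ...


class AnalysisInterface:
    """Stand-in for interface 106: the single entry point the toolbar calls."""

    def __init__(self) -> None:
        self._libraries = {}

    def register(self, library: AnalysisLibrary) -> None:
        # Registering under an existing name replaces the underlying library.
        self._libraries[library.name] = library

    def analyze(self, html: str, config: dict) -> dict:
        # The caller only sees named results, never the library implementations.
        return {name: lib.analyze(html, config) for name, lib in self._libraries.items()}


class SubstringStatistics:
    name = "statistical"

    def analyze(self, html: str, config: dict) -> float:
        # Toy indicator: 1.0 if the configured brand appears anywhere, else 0.0.
        return 1.0 if config["brand"].lower() in html.lower() else 0.0


class OccurrenceStatistics:
    name = "statistical"

    def analyze(self, html: str, config: dict) -> float:
        # Alternative toy indicator: raw occurrence count of the brand.
        return float(html.lower().count(config["brand"].lower()))
```

Swapping `SubstringStatistics` for `OccurrenceStatistics` changes the result returned under the key `"statistical"` while the toolbar-side call `iface.analyze(page, config)` stays the same.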

In various embodiments, the analysis libraries 104 may be used each for a different purpose, and the values resulting from their respective processing may be used in real time for a composite and weighted calculation of the final result returned to the toolbar.
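
One simple way to combine per-library values into a single result is a weighted mean. The disclosure does not state which formula is used, so the function below is an assumed example; `composite_score`, the library names and the weights are hypothetical.

```python
def composite_score(values: dict, weights: dict) -> float:
    """Weighted mean of per-library values (assumed formula, not from the source).

    Libraries with no configured weight contribute nothing to the result.
    """
    total_weight = sum(weights.get(name, 0.0) for name in values)
    if total_weight <= 0:
        raise ValueError("at least one library needs a positive weight")
    weighted_sum = sum(value * weights.get(name, 0.0) for name, value in values.items())
    return weighted_sum / total_weight
```

For instance, values of 0.8 (statistical) and 0.4 (semantic) with weights 1 and 3 give (0.8 + 1.2) / 4 = 0.5, up to floating-point rounding.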

In various embodiments, the libraries 104 may be identified based on the type of analysis they perform, that is:

- statistical analysis: this analyses the HTML page code yielding as a result a set of quantitative values related to the presence of a content searched (for instance, a brand) within the HTML code in specific technical positions (for example, in Tags);

- classification: this analyses the actual page content, by returning one or more respective classifications (e.g. the predominant language; the contents typology traced back to specific categories, differentiated as a function of the application sector, etc.) which may be used to properly "weigh" the indicators/parameters/values detected by other libraries ;

- semantic analysis: this locates and analyses in the page specific textual combinations in which a searched text and elements that are significant and/or specific for each application field are simultaneously present (e.g. it searches for the brand together with textual elements that indicate improper, abusive and/or parasitic uses of the same). This component provides for different configurations of the "semantic analysis" process in relation to the application field/sector (e.g. fashion, luxury, agri-food, pharmaceutical, software, services, etc.);

- image analysis: this locates and analyses the images present in the page and determines the degree of similarity with respect to one or more "sample" images of interest for the user, loaded in the application (or residing in the remote server) at the time of its initial configuration and/or at subsequent times, on request of the user;

- hidden text analysis: this analyses the HTML text of the page searching for parts/elements of interest, not visible to the end user within the browser .
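
As an illustration of the statistical analysis item in the list above, the sketch below counts occurrences of a searched term per enclosing tag and per attribute, using Python's standard `html.parser`. The class `BrandCounter`, the function `count_term` and the `tag@attribute` key format are hypothetical choices, not taken from the disclosure.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements without a closing tag; they are never pushed on the tag stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class BrandCounter(HTMLParser):
    """Counts case-insensitive occurrences of a term per enclosing tag."""

    def __init__(self, term: str):
        super().__init__(convert_charrefs=True)
        self.term = term.lower()
        self.stack = []
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Occurrences inside attribute values (e.g. meta content, img alt).
        for name, value in attrs:
            if value:
                hits = value.lower().count(self.term)
                if hits:
                    self.counts[f"{tag}@{name}"] += hits
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Tolerate unbalanced markup: pop back to the nearest matching tag.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        hits = data.lower().count(self.term)
        if hits:
            self.counts[self.stack[-1] if self.stack else "(root)"] += hits


def count_term(html: str, term: str) -> Counter:
    parser = BrandCounter(term)
    parser.feed(html)
    parser.close()
    return parser.counts
```

The resulting counter gives one quantitative value per technical position (title, heading, meta attribute and so on), which other components could then weigh differently.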

All the above-cited analysis typologies (with the possible exception of image analysis) may be focused, during configuration, to also search for and analyze "distorted textual dictions" and dictions similar to the content of interest (e.g. a brand) present in the page under inspection, generated by resemblance of "sounding" (for example the so-called "Italian sounding" for a fashion brand, etc.) and/or due to typing errors (so-called "misspelling" or "typosquatting").

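A rough sketch of how near-miss spellings of a brand might be picked up, using the character-level similarity ratio from Python's standard `difflib`. This covers typing-error variants only; it is not a phonetic ("sounding") comparison, and the function name `similar_terms` and the 0.75 threshold are assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def similar_terms(text: str, brand: str, threshold: float = 0.75) -> list:
    """Return words that resemble the brand without being identical to it."""
    brand = brand.lower()
    found = []
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if token == brand:
            continue  # exact hits are left to the ordinary search
        if SequenceMatcher(None, brand, token).ratio() >= threshold:
            found.append(token)
    return found
```

A lower threshold catches more distorted variants at the price of more false positives, so in practice it would be tuned per brand and per application sector.
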
Figure 3 schematically represents an example of a user authentication stage within the framework of the system previously described.

Specifically, the flow diagram of figure 3 assumes that in step 1000 the user will open the browser, performing the authentication on the toolbar 100 by entering access username and password. The toolbar 100 requests a verification of the credentials from the remote server (authentication server 26) .

In the affirmative (positive outcome of step 1004), in a step 1006 the authentication server 26 returns the configuration to the toolbar 100. At this point, the toolbar 100 is active and may start listening to what is occurring in the browser 102 and to provide analysis of the pages opened by the user from time to time.

In the case of a negative outcome of step 1004, the authentication server 26 will return a value representative of verification failure, whereby the user will have to authenticate again, restarting from step 1000.
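
The authentication flow of figure 3 can be sketched as a loop that repeats until the server accepts the credentials and returns a configuration. The names `make_verifier` and `start_toolbar` are hypothetical, the in-memory account table stands in for the authentication server 26, and the unsalted SHA-256 digest is used only to keep the example short (a real deployment would rely on a proper salted password hash).

```python
import hashlib
import hmac


def make_verifier(accounts: dict):
    """accounts maps username -> (sha256 hex digest of password, configuration)."""

    def verify(username: str, password: str):
        record = accounts.get(username)
        if record is None:
            return None
        digest, config = record
        supplied = hashlib.sha256(password.encode("utf-8")).hexdigest()
        # Constant-time comparison of the stored and supplied digests.
        return config if hmac.compare_digest(digest, supplied) else None

    return verify


def start_toolbar(attempts, verify):
    """Return the configuration of the first accepted attempt, else None."""
    for username, password in attempts:      # step 1000: user enters credentials
        config = verify(username, password)  # check by the authentication server
        if config is not None:               # positive outcome of step 1004
            return config                    # step 1006: configuration returned
    return None                              # every attempt failed
```

Returning the configuration only on success mirrors the text: the toolbar has nothing to analyze with until the credentials have been verified.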

Figure 4 schematically represents an example of page analysis within the framework of the system previously described.

Step 2000 is generally representative of the toolbar waiting for a new page to analyze, in particular within the framework of a monitoring phase 2002. Such a phase involves the periodic execution of a step 2004 to check whether the user has opened a new page in the browser. In the case of a negative outcome of step 2004, the toolbar simply returns to the phase 2002, to then repeat the step 2004 after a standby interval.

In the presence of a positive outcome of the step 2004, which will indicate that the user has opened a new page, in a step 2006 the toolbar calls the analysis libraries and requests real time analysis of the page, based on the configuration current at that moment.

At the end of the analysis of each page, concise results are returned to the toolbar 100. These may be represented by a numeric indicator and/or by its graphical representation (e.g. a colour scale) that, in real time, may highlight, for example in the case of a brand, a determined danger/risk index for the abuse of the same or for the presence of concepts/terminology negative/immoral for the image of the cited subject.
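
A numeric indicator can be mapped to a colour scale with a few thresholds. The sketch below assumes a score normalized to the range 0 to 1 and three bands; the function name `risk_colour`, the band limits and the colours are all hypothetical.

```python
def risk_colour(score: float, bands=((0.33, "green"), (0.66, "yellow"))) -> str:
    """Map a score in [0, 1] to a colour; anything above the last band is red."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie between 0 and 1")
    for upper_limit, colour in bands:
        if score < upper_limit:
            return colour
    return "red"
```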

If - in view of the positive outcome of a check performed in a step 2008 - results are available (absolute and/or beyond a threshold value of the numeric/graphical indicator mentioned previously) , in a step 2010 the toolbar advises the user via a visual and/or acoustic alert message (alert) .

In the presence of a negative outcome of the check performed in the step 2008, the system returns to the monitoring phase 2002.
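
The monitoring cycle of figure 4 can be sketched as follows. To keep the example deterministic, the periodic checks are modelled as an iterable of observed URLs rather than a timed loop; `monitor`, `analyze` and `alert` are hypothetical names, and the threshold comparison is one possible reading of step 2008.

```python
def monitor(observed_urls, analyze, alert, threshold: float) -> None:
    """Run the monitoring phase over a sequence of periodic browser checks."""
    last_url = None
    for url in observed_urls:         # each item is one execution of step 2004
        if url is None or url == last_url:
            continue                  # no new page: back to monitoring phase 2002
        last_url = url
        score = analyze(url)          # step 2006: real-time analysis of the page
        if score >= threshold:        # step 2008: are there results to report?
            alert(url, score)         # step 2010: visual and/or acoustic alert
```

In a real toolbar the iterable would be replaced by browser events or by a poll followed by a standby interval, as described for phase 2002.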

If in the step 2008 the existence of results is detected, the user may immediately obtain an "x-ray image" of the page visited, with the specific indication of the elements of interest positively detected and save such file in a repository prearranged by the toolbar 100 in the workstation.

At this point, the toolbar has notified the user in real time that the analysis results related to the page viewed are ready, with values that exceed the predefined threshold.

For example, in the presence of a page with evidence of particular seriousness, and as schematized in figure 5, starting from a step 3000, in a step 3002 the user may authorize immediate transmission of the data to the remote server (data collect server 20). The toolbar 100 may provide appropriate configurations to make it easier for the user to add concise information of particular interest for subsequent analysis in the central repository 24.

Sending of data to the server in question is represented in figure 5 by the step 3004, while the negative outcome of the step 3002 leads back to step 3000.

If the verification of data consistency on the server side (step 3006) yields a positive outcome (with a positive outcome of a verification step 3008), in a step 3010 the data are stored in the central repository (database) 24. Otherwise (negative outcome of the step 3008), in a step 3012 the system notifies the toolbar 100 of a data inconsistency error.
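
The server-side branch of figure 5 can be sketched as a consistency check followed by either storage or an error reply. The record fields (`url`, `score`, `user`), the reply format and the in-memory list standing in for the repository 24 are assumptions made for illustration; the disclosure does not define the data format.

```python
from urllib.parse import urlparse

# Hypothetical record schema: field name -> accepted type(s).
REQUIRED_FIELDS = {"url": str, "score": (int, float), "user": str}


def check_consistency(record: dict) -> list:
    """Return a list of problems found in the record (empty if consistent)."""
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            errors.append(f"bad type for {name}")
    if not errors and urlparse(record["url"]).scheme not in ("http", "https"):
        errors.append("url must use http or https")
    return errors


def collect(record: dict, repository: list) -> dict:
    errors = check_consistency(record)                # step 3006
    if errors:                                        # negative outcome of step 3008
        return {"status": "error", "errors": errors}  # step 3012: notify the toolbar
    repository.append(record)                         # step 3010: store the data
    return {"status": "stored"}
```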

Of course, without prejudice to the principle of the invention, the implementation details and the embodiments may vary, even significantly, with respect to what has been set forth here by way of non-limiting example only, without departing from the scope of the invention as defined by the annexed claims.

Claims

1. A system for detecting the contents of web pages, the system including at least one user terminal (10) equipped with a browser to surf web pages in a net (N) , wherein to said browser a contents detection module (TB) is coupled configured to detect (2006) the contents of pages opened (2004) via said browser with a contents detection action driven via said browser.

2. The system of claim 1, wherein said contents detection module (TB) is configured to provide (2010) to the user of said at least one user terminal (10) feedback information on the contents of said pages.

3. The system of claim 1 or claim 2, wherein said contents detection module (TB) is configured to interact with a centralized server subsystem (CS) to send (3004) to said centralized server subsystem (CS) information on the contents of said pages in view of possible storing (3010) of said information in said centralized server subsystem (CS) .

4. The system of any of previous claims, wherein said contents detection module (TB) is organized in analysis libraries (104) having functions selected out of:

- statistical analysis, to provide quantitative values indicative of the presence of a given content;

- classification, to provide indicators indicative of said contents belonging to specific classification categories ;

- semantic analysis, to detect the presence of a given text and meaningful and/or specific elements for each application sector;

- image analysis, to detect the presence of images identical or similar to sample images;

- hidden text analysis, to analyze the HTML text of a page by searching parts/elements not visible in the browser.

5. A method of detecting the contents of web pages via at least one user terminal (10) equipped with a browser to surf web pages in a net (N) , the method including coupling to said browser a contents detection module (TB) configured to detect (2006) the contents of pages opened (2004) via said browser with a contents detection action driven via said browser.

6. The method of claim 5, including providing (2010) to the user of said at least one user terminal (10), via said contents detection module (TB), feedback information on the contents of said pages.

7. The method of claim 5 or claim 6, including sending to a centralized server subsystem (CS) from said at least one user terminal (10) information on the contents of said pages in view of possible storing (3010) of said information in said centralized server subsystem (CS) .

8. The method of any of the preceding claims 5 to 7, wherein said detecting includes functions selected out of:

- statistical analysis, to provide quantitative values indicative of the presence of a given content;

9. A computer program product, loadable in the memory of at least one computer and including software code portions to perform the method of any of claims 5 to 8 when the product is run on at least one computer.