WO2016095487A1 - Human-computer interaction-based method for parsing high-level semantics of image


Info

Publication number
WO2016095487A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
human
computer interaction
semantics
semantic
Prior art date
Application number
PCT/CN2015/082908
Other languages
French (fr)
Chinese (zh)
Inventor
林格
罗甜
罗笑南
Original Assignee
中山大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中山大学
Publication of WO2016095487A1 publication Critical patent/WO2016095487A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/20 - Scenes; Scene-specific elements in augmented reality scenes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/35 - Categorising the entire scene, e.g. birthday party or wedding scene

Abstract

A human-computer interaction-based method for parsing the high-level semantics of an image, comprising: scanning a source image with a portable scanning device; identifying targets in the source image; filtering and parsing the content of the source image and extracting effective knowledge; and organizing the semantics so as to deliver the image content to a user in speech form. The method is aimed at visually impaired people and people with weak self-learning ability: with nothing more than a simple scan, and without relying on the visual system, the computer describes the image, helping these disadvantaged groups experience a different world; the process can also serve as part of their entertainment. The method is simple to operate and highly portable.

Description

A Method for High-Level Semantic Parsing of Images Based on Human-Computer Interaction
Technical Field
The present invention relates to the technical field of human-computer interaction, and in particular to a method for high-level semantic parsing of images based on human-computer interaction.
Background Art
With the spread of the Internet and the rapid development of storage technology, multimedia technology and database technology, the demands placed on image applications keep growing. The physics community holds that the three kinds of information peculiar to human beings are language, symbols and images; the spread of information depends to a large extent on vision, at least 80% of external information is obtained through visual perception, and vision is the most important sense for humans and animals. The semantic information contained in an image is very rich, but not every group of people has normal visual function or good comprehension, so parsing images automatically with the help of a computer is a meaningful and challenging task. Obtaining accurate semantic parsing and expression ultimately requires the computer to annotate images automatically.
Research on image semantics has mainly focused on classification and retrieval based on the semantics of each image layer, the extraction of low-level semantic features, and the description of mid-level object semantics. From the 1990s onwards, content-based image retrieval (CBIR) became a research hotspot and a key technology in major research projects such as multimedia databases and digital libraries. CBIR overcomes the limitations of text-based image retrieval to a certain extent: it matches images by computing the similarity between visual features (such as color, texture and shape) and replaces text-based retrieval with visual query methods, achieving retrieval over visual content features such as color, texture, shape and region, and a leap to the "query by image" retrieval mode. Content-based image retrieval combines knowledge from image understanding, pattern recognition, information technology and other fields, and is a synthesis of several high technologies. Some researchers have concentrated on extracting and representing low-level visual features of images and have achieved certain results. In practical applications, however, the retrieval results of traditional CBIR systems are often unsatisfactory and cannot meet people's need to retrieve images by semantics. This is mainly because users usually think of the desired images only in terms of high-level concepts (such as vacation, city or portrait) concerning the objects, events and expressed emotions the images describe; what the user needs is a query over image semantics, not over the low-level visual features of the image. The "meaning" of an image referred to here is its high-level semantic feature, which embodies people's understanding of the image content; this understanding is judged according to human cognitive knowledge and cannot be obtained directly from the low-level features of the image. This gives rise to the "semantic gap" problem in content-based image retrieval systems, i.e., the huge difference between human understanding of image content and the visual features extracted automatically by the computer.
Since the beginning of the 21st century, image retrieval research has revolved around image semantics. Its aim is to bring the computer's ability to retrieve images up to the level of human understanding, to provide a natural and concise query method closer to the user's comprehension, and to improve the precision of image retrieval. Semantic-based image retrieval (SBIR) starts from the semantic features of images and studies how to map the low-level visual features of an image to its high-level semantics, and how to describe those high-level semantics. With the introduction in September 2001 and gradual refinement of the MPEG-7 "Multimedia Content Description Interface" standard, digital images gain uniform visual feature description parameters and a description definition language for expressing complex semantic relationships, which helps semantic-based image retrieval technology make breakthroughs and move toward practical, general use. Automatic semantic annotation of images is the key link in semantic-based image retrieval and has become a research hotspot in image retrieval. Automatic semantic annotation adds keywords to an image to represent its semantic content; it converts the visual features of an image into annotation keywords, inheriting the high efficiency of keyword retrieval while avoiding the time and labor cost of manual annotation. Such algorithms generally have two stages: first, statistical learning is performed on the set of low-level features of all images annotated with the same semantic, yielding a trained model for that semantic class; second, for an image to be annotated, the same low-level features are extracted and, according to the trained models of the semantic classes, the probability that the image belongs to each semantic is obtained, so that the probability of occurrence of every semantic concept, i.e., text keyword, in the image to be annotated can be computed. The semantic probabilities are then ranked, and the several keywords with the highest probability are chosen as the semantic labels of the image. As a hotspot of image retrieval research, automatic semantic annotation of images has broad application prospects, including medical image classification, the construction and management of digital libraries, the retrieval and management of digital photos, video retrieval, and satellite remote sensing image processing.
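To make the two-stage annotation scheme just described concrete, the following is a minimal sketch assuming feature vectors have already been extracted elsewhere; the choice of one diagonal-covariance Gaussian mixture per keyword is illustrative rather than prescribed by the literature surveyed here.

```python
# Two-stage automatic annotation sketch: fit one generative model per
# semantic keyword on training features, then rank keywords for a new
# image by model likelihood and keep the top few as labels.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_keyword_models(features_by_keyword, n_components=3):
    """features_by_keyword: dict mapping keyword -> (n_images, dim) feature array."""
    models = {}
    for word, feats in features_by_keyword.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        gmm.fit(feats)
        models[word] = gmm
    return models

def annotate(image_features, models, top_k=5):
    """Rank keywords for one image by average log-likelihood under each model."""
    scores = {w: m.score(image_features.reshape(1, -1)) for w, m in models.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```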
In image semantic description, image content has the hierarchical containment relationship "pixel - region - target - scene", and the essence of semantic description is a process of vocabulary encoding and annotation using reasonable word formation. This process is closely related to the description of each layer of image content: pixel and region information is driven by low- and mid-level data, and labeling pixels (regions) according to the similarity of structured data can provide effective low-level entity correspondences for high-level semantic encoding. The mid-level categorization characteristics of targets and scenes also have obvious encoding properties; each category can be regarded as a simple semantic description, providing a good prototype description for extending multi-semantic analysis.
Describing the different attributes of an image, such as low-level features like color, texture, edges or shape, has become an important topic in computer vision, and recognizing such information in an image provides useful information in most practical applications. However, this is by no means the level at which humans communicate with the visual world, nor the kind of description that should be offered to visually impaired groups. What is needed is not only to recognize many individual targets in a scene, but also to distinguish different environments and to perceive the complex activities and social relationships taking place. This is the high-level semantic recognition of image understanding; Figure 1 is a schematic diagram of the image understanding process.
Human-computer interaction (HCI) is the study of the interaction between systems and their users; it is the platform through which people and computer systems communicate, the interface for man-machine dialogue. Human-centered, natural and efficient interaction is the main goal of the new generation of human-computer interaction technology. The development of human-computer interaction technology has gone through three stages. The third-generation interface, the multi-modal user interface, builds on the multimedia interface and adopts new technologies such as speech recognition, gaze tracking and gesture input, so that users can interact through multiple modalities or channels in a natural, parallel and collaborative way; by integrating precise and imprecise information from multiple channels, the system quickly captures the user's intention, effectively improving the naturalness and efficiency of human-computer interaction.
According to the development of image annotation methods, the methods in the current literature for bridging the "semantic gap" can be roughly divided into three categories by their emphasis: machine-learning-based methods, relevance-feedback-based methods, and ontology-based methods.
(1) Machine-learning-based methods
Current automatic semantic annotation of images using machine learning and statistical models can be broadly divided into supervised and unsupervised semantic annotation. Supervised classification first learns an image semantic classifier by training on a given set of semantically annotated sample images, and then uses the classifier to assign unlabeled or uncategorized images to a semantic class. The most commonly used supervised learning techniques are Bayesian classifiers and support vector machines (SVM). Unsupervised semantic annotation clusters the images (or image regions) in a library into meaningful sets according to image content, so that images within the same cluster are as similar as possible while images in different clusters are as dissimilar as possible; a statistical method then attaches a class label to each cluster to obtain the semantic information of each image cluster. Simply put, its goal is to organize or cluster the input data reasonably and effectively. This approach places low demands on manually annotated training sets, and both the training data and the semantic concepts are scalable. Strictly speaking, however, pure image clustering cannot produce an explicit semantic label for a new image; it must be combined with other techniques to achieve automatic semantic annotation, fully exploit its efficiency, and reach high retrieval precision.
(2) Relevance-feedback-based methods
The basic idea of relevance feedback (RF) is that, during retrieval, the user adjusts the existing query with weights according to previous retrieval results so as to give the retrieval system more, and more direct, information, enabling the system to better satisfy the user's needs. Simply put, feedback is an interactive process between the user and the retrieval system: based on the user's evaluation of the current retrieval results, the system adjusts the user's initial query and the parameters of the matching model, thereby optimizing the retrieval results. Relevance feedback is in essence a learning process whose method resembles human learning; it is a valuable approach to studying semantic mapping and can achieve good retrieval results at both the visual feature level and the semantic level. It works with few samples and under strong real-time requirements, but it may suffer from overly long retrieval times and oscillating results.
(3) Object-ontology-based methods
Ontologies are widely used in text information retrieval but got a late start in image retrieval. An ontology is a recognized conceptual representation of the objects (actual objects and logical objects) of a specific domain and of their relationships. It holds that the different objects in an image can be defined by sets of simple descriptors; for example, "sky" can be defined as an "upper, uniform, blue" region. By discretizing low-level features such as color, position, size and shape and mapping them to these simple semantics, object semantics can finally be obtained. For image libraries of a fairly uniform type, ontology-based methods give good results, but for large image databases this approach does not work well. Figure 2 shows an example of current computer-automated annotation.
At present, most researchers in computer vision concentrate on target recognition and target classification, and many models have also been proposed for classifying scene environments, but there is very little research on recognizing events in a static image. Moreover, content-based retrieval and image annotation are mostly carried out separately, without coherently combining these tasks. How a computer should describe an image and organize that description in language to feed back to the user is therefore of great research value.
Summary of the Invention
The object of the present invention is to overcome the deficiencies of the prior art: the method for high-level semantic parsing of images based on human-computer interaction proposed by the present invention can help such disadvantaged groups experience another, different world, and can also serve as part of their entertainment.
In order to solve the above problems, the present invention provides a method for high-level semantic parsing of images based on human-computer interaction, comprising:
scanning a source image with a portable scanning device;
identifying targets in the source image;
filtering and parsing the content of the source image, and extracting effective knowledge;
organizing the semantics and delivering the image content to the user in speech form.
The scanning of a source image with a portable scanning device comprises:
scanning the source image with an ARM-based portable scanning device.
The identifying of targets in the source image comprises:
extracting image features with SIFT local feature extraction, combined with HOG features and GIST global features, so that image information can be captured more comprehensively.
The filtering and parsing of the content of the source image and extraction of effective knowledge comprises:
extracting effective knowledge with a bag-of-words image classification method.
The bag-of-words image classification method comprises:
detecting feature points by image segmentation, random sampling or similar means;
extracting local features from the image and generating descriptors;
clustering the descriptors of these feature points, each cluster center being one visual word;
counting the frequency of occurrence of each visual word into a visual word histogram.
The organizing of the semantics and delivering of the image content to the user in speech form comprises:
delivering the image content to the user in speech form using latent semantic extraction technology.
By implementing the embodiments of the present invention, which is aimed mainly at visually impaired groups and groups with weak self-learning ability, nothing more than a simple scan is needed: without relying on the visual system, the computer describes the image, helping such disadvantaged groups experience another, different world; it can also serve as part of their entertainment. The operation is simple and the portability good.
Brief Description of the Drawings
In order to illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Figure 1 is a flow chart of an image processing procedure in the prior art;
Figure 2 is an example of automatic image annotation in the prior art;
Figure 3 is a flow chart of the human-computer interaction-based method for high-level semantic parsing of images in an embodiment of the present invention;
Figure 4 is a schematic structural diagram of the scanning device in an embodiment of the present invention.
Detailed Description of the Embodiments
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
For any image (color or black-and-white), the present invention performs an overall scan with a hand-held portable scanning device so that the source image information is entered into the system; the system recognizes the targets in the image, filters and parses its content, extracts effective knowledge, organizes the semantics, and delivers the image content to the user in speech form. For example, given an image of boating on the water, the system recognizes targets such as a person, a boat, a lake, a fishing rod, the sky and trees, performs target analysis and organizes the image semantics, and finally outputs through a speech device: a person is fishing on the lake. The main purpose of the invention is to help visually impaired patients (people with amblyopia, blind people, etc.), illiterate elderly people and preschool children identify image content effectively without human assistance, so that these groups can learn about an outside world they cannot otherwise reach. This human-computer interaction-based high-level semantic parsing system has good compatibility and portability and is convenient to operate. The workflow of the system is shown in Figure 3 and sketched in code below.
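As an illustration of this workflow, the following is a minimal sketch that strings the four stages together. The helpers scan_image, detect_objects, filter_knowledge and compose_sentence are hypothetical placeholders for components the text describes but does not name, and pyttsx3 merely stands in for whatever speech-output module the device actually uses.

```python
# Sketch of the scan -> recognize -> parse -> speak pipeline (Figure 3).
# scan_image, detect_objects, filter_knowledge and compose_sentence are
# hypothetical placeholders; pyttsx3 is an off-the-shelf text-to-speech engine.
import pyttsx3

def describe_image(device):
    image = scan_image(device)          # source image from the portable scanner
    objects = detect_objects(image)     # e.g. person, boat, lake, fishing rod
    facts = filter_knowledge(objects)   # keep the salient targets and relations
    sentence = compose_sentence(facts)  # e.g. "A person is fishing on the lake"
    engine = pyttsx3.init()
    engine.say(sentence)                # deliver the description as speech
    engine.runAndWait()
```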
(1) ARM-based portable scanning device (hardware)
The hardware layer consists mainly of the system core, the scanning part and the human-machine interface; in addition, some expansion interfaces are reserved to extend its functions and suit a variety of applications. The microprocessor is the widely used Samsung S3C2410X chip, whose core is an ARM9TDMI core with a 16 KB data cache and a 16 KB instruction cache running at 203 MHz. Storage consists of 64 MB of NAND flash and 64 MB of SDRAM. The scanning part uses an SDIO palm-sized scanning card; based on micro-linear CMOS imaging technology, this SDIO ISC scanning card can scan all mainstream linear barcodes. The human-machine interface uses Samsung's LTV350QV-F05 3.5-inch TFT touch screen together with a touch panel, providing both display and keyboard functions and helping to reduce the size of the device. The Ethernet port is used for data transmission and download. USB, RS232 and other interfaces are reserved to facilitate functional expansion of the device.
(2) Feature extraction technology
Since SIFT features are invariant to illumination, scale and the like, image feature extraction uses SIFT local feature extraction, combined with HOG features and GIST global features, so that image information can be captured more comprehensively.
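A minimal sketch of this combined extraction using OpenCV is given below; note that GIST has no built-in OpenCV implementation, so it is assumed to come from a third-party library and is not computed here.

```python
# Combined feature extraction: SIFT local descriptors plus a global HOG
# descriptor via OpenCV. GIST is assumed to come from a third-party
# implementation and is omitted from this sketch.
import cv2

def extract_features(path):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    keypoints, sift_desc = sift.detectAndCompute(gray, None)  # local, scale/illumination invariant
    hog = cv2.HOGDescriptor()                                 # default 64x128 detection window
    hog_desc = hog.compute(cv2.resize(gray, (64, 128)))       # global shape statistics
    return sift_desc, hog_desc.ravel()
```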
(3) Description of the bag-of-words (BoW) model
With the wide application of local features in computer vision, image classification and recognition methods based on local features have also received broader attention. Because the number of feature points detected per image during local feature extraction is not uniform, machine training is difficult to get started; moreover, such methods match on feature points, and their heavy computational cost makes them unable to keep up with ever-larger image databases. To overcome these problems, Li Fei-Fei and other scholars at Stanford University first applied the bag-of-words model as a feature representation in computer image processing. The bag-of-words image classification method not only solves the problem of non-uniform local features well, but is also simple to represent and fast to train and classify, and has therefore developed greatly. Inspired by text retrieval methods, the bag-of-words model has attracted increasing attention from scholars at home and abroad because of its high performance, and it has been widely applied in image classification and retrieval.
The main steps of bag-of-words model generation are as follows (a code sketch follows the list):
① Detect feature points by image segmentation, random sampling or similar means.
② Extract local features (SIFT) from the image and generate descriptors.
③ Cluster the descriptors of these feature points (typically with K-means) to form a visual vocabulary, each cluster center being one visual word.
④ Count the frequency of occurrence of each visual word into a visual word histogram.
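The steps above map almost line for line onto standard library calls. The following is a minimal sketch, assuming SIFT descriptors have already been extracted as in the previous snippet; the vocabulary size k = 500 is an illustrative choice, not a value taken from the text.

```python
# Bag-of-visual-words: cluster pooled SIFT descriptors into a visual
# vocabulary with K-means, then histogram each image's descriptors
# over that vocabulary (steps 3 and 4 above).
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(all_descriptors, k=500):
    """all_descriptors: (n_descriptors, 128) SIFT descriptors pooled over training images."""
    return KMeans(n_clusters=k, n_init=4).fit(all_descriptors)

def bow_histogram(image_descriptors, vocabulary):
    words = vocabulary.predict(image_descriptors)           # nearest visual word per descriptor
    hist = np.bincount(words, minlength=vocabulary.n_clusters)
    return hist / max(hist.sum(), 1)                        # normalized word frequencies
```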
(4) Latent semantic extraction technology
Many applications of natural language processing (NLP) need to explore the meaning hidden behind characters and words; simple literal matching hardly works, and the key lies in handling synonymy and polysemy. Latent semantic analysis (LSA) provides a partial solution: it uses singular value decomposition (SVD) to map the high-dimensional word-document co-occurrence matrix into a low-dimensional latent semantic space, so that words that look unrelated on the surface reveal deep connections. Probabilistic latent semantic analysis (PLSA), a variant of LSA, has a more solid mathematical foundation and an easy-to-use generative data model, and it has been shown to provide better lexical matching for information extraction. Given a document collection $D = \{d_1, d_2, \ldots, d_M\}$, a word collection $W = \{w_1, w_2, \ldots, w_N\}$ and a document-word co-occurrence frequency matrix $\mathbf{N} \equiv (n_{ij})$, let $n(d_i, w_j)$ denote the frequency with which word $w_j$ occurs in document $d_i$. Let $Z = \{z_1, z_2, \ldots, z_K\}$ denote the set of latent semantics, where $K$ is a manually specified constant. PLSA assumes that "document-word" pairs are conditionally independent and that the distributions of the latent semantics over documents and over words are also conditionally independent. Under these assumptions, the "document-word" probability can be expressed by the following formula:
$$P(d_i, w_j) = P(d_i) \sum_{k=1}^{K} P(w_j \mid z_k)\, P(z_k \mid d_i) \quad (1)$$

In equation (1), $P(w_j \mid z_k)$ is the distribution of the latent semantic over words, which can also be interpreted as the contribution of the word to the latent semantic, and $P(z_k \mid d_i)$ denotes the distribution of latent semantics within the document, i.e., the probability that the document carries the corresponding latent semantic. Following the maximum-likelihood principle, probabilistic latent semantic analysis computes the parameters of PLSA by maximizing the following log-likelihood function:

$$\mathcal{L} = \sum_{i=1}^{M} \sum_{j=1}^{N} n(d_i, w_j) \log P(d_i, w_j) \quad (2)$$

In a model with hidden variables, the standard procedure for maximum-likelihood estimation is the expectation-maximization (EM) algorithm, which alternates between two steps until convergence.

In the E-step, the posterior probability of the hidden variables is computed with the current estimates of the parameters:

$$P(z_k \mid d_i, w_j) = \frac{P(w_j \mid z_k)\, P(z_k \mid d_i)}{\sum_{l=1}^{K} P(w_j \mid z_l)\, P(z_l \mid d_i)} \quad (3)$$

In the M-step, the expectations of the previous step are used to maximize the current parameter estimates:

$$P(w_j \mid z_k) = \frac{\sum_{i=1}^{M} n(d_i, w_j)\, P(z_k \mid d_i, w_j)}{\sum_{m=1}^{N} \sum_{i=1}^{M} n(d_i, w_m)\, P(z_k \mid d_i, w_m)} \quad (4)$$

$$P(z_k \mid d_i) = \frac{\sum_{j=1}^{N} n(d_i, w_j)\, P(z_k \mid d_i, w_j)}{n(d_i)} \quad (5)$$

$$n(d_i) = \sum_{j=1}^{N} n(d_i, w_j) \quad (6)$$
Compared with the SVD decomposition used in latent semantic analysis, the EM algorithm has a linear convergence rate and is simple to implement, and it drives the likelihood function to a local optimum.
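The EM iteration of equations (3) to (6) can be sketched directly in NumPy. This is a minimal illustration, assuming small vocabularies: the dense (documents x words x topics) posterior array below is a simplification, not an optimized implementation.

```python
# PLSA via EM. counts is the (docs x words) co-occurrence matrix,
# K the number of latent semantics. Pw_z holds P(w|z), Pz_d holds P(z|d).
import numpy as np

def plsa(counts, K, iters=100, seed=0):
    counts = np.asarray(counts, dtype=float)
    rng = np.random.default_rng(seed)
    M, N = counts.shape
    Pw_z = rng.random((K, N)); Pw_z /= Pw_z.sum(axis=1, keepdims=True)  # P(w|z)
    Pz_d = rng.random((M, K)); Pz_d /= Pz_d.sum(axis=1, keepdims=True)  # P(z|d)
    for _ in range(iters):
        # E-step, eq. (3): posterior P(z|d,w), shape (M, N, K)
        post = Pz_d[:, None, :] * Pw_z.T[None, :, :]
        post /= post.sum(axis=2, keepdims=True) + 1e-12
        weighted = counts[:, :, None] * post            # n(d,w) * P(z|d,w)
        # M-step, eq. (4): re-estimate P(w|z)
        Pw_z = weighted.sum(axis=0).T
        Pw_z /= Pw_z.sum(axis=1, keepdims=True) + 1e-12
        # M-step, eqs. (5)-(6): re-estimate P(z|d), dividing by n(d)
        Pz_d = weighted.sum(axis=1)
        Pz_d /= counts.sum(axis=1, keepdims=True) + 1e-12
    return Pw_z, Pz_d
```

For the folding-in used in the inference stage described below, the same loop is run with the P(w|z) update of equation (4) skipped.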
Having constructed the BoW description of the image regions, PLSA can be used to discover the latent semantics of the regions. Each region of an image is treated as a separate document, denoted $d$; the visual words are treated as the words of the document, denoted $w$; the region latent semantics of the image are denoted $z$; and $n(d_i, w_j)$ denotes the frequency with which visual word $w_j$ occurs in region $d_i$.
Region latent semantic extraction based on the PLSA method can be divided into two steps:
Learning stage: PLSA is applied for training to the set of all image regions generated from the training images, iterating equations (3), (4), (5) and (6) with the EM algorithm until convergence, which yields $P(w_j \mid z_k)$. This is in fact the region latent semantic model: it describes how the visual words are distributed when a latent semantic appears in an image region.
Inference stage: for all regions of a test image, $P(w_j \mid z_k)$ is kept fixed, and equations (3), (5) and (6) are likewise iterated with the EM algorithm until convergence, yielding for each block region its $P(z_k \mid d_i)$, i.e., the probability that the block region carries latent semantic $z_k$.
Suppose the number of region latent semantics is defined as $T$, and that an $L$-level spatial pyramid partition yields $N = (4^L - 1)/3$ regions. For each block region $d_i$ we then obtain a $T$-dimensional feature vector $f_{d_i} = \big(P(z_1 \mid d_i), \ldots, P(z_T \mid d_i)\big)$. Since the spatial distribution of the regions' latent semantics also contributes to image scene classification, the $T$-dimensional feature vectors of all block regions of the image are finally concatenated into a single vector $F = (f_{d_1}, \ldots, f_{d_N})$. This is the region latent semantic feature of the image defined here. Once the region latent semantic features have been obtained, an SVM classifier model can be built to classify images into scenes.
The present invention is aimed mainly at visually impaired groups and groups with weak self-learning ability: with nothing more than a simple scan, and without relying on the visual system, the computer describes the image, helping such disadvantaged groups experience another, different world; it can also serve as part of their entertainment. The operation is simple and the portability good.
Those of ordinary skill in the art will understand that all or part of the steps of the various methods of the above embodiments can be carried out by a program instructing the relevant hardware, and that the program can be stored in a computer-readable storage medium, which can include a read-only memory (ROM), a random-access memory (RAM), a magnetic disk, an optical disk, and the like.
The human-computer interaction-based method for high-level semantic parsing of images provided by the embodiments of the present invention has been described in detail above. Specific examples are used herein to set out the principles and implementation of the invention, and the description of the above embodiments is only intended to help in understanding the method of the invention and its core idea. Meanwhile, those of ordinary skill in the art may, following the idea of the invention, make changes to the specific embodiments and to the scope of application. In summary, the contents of this specification should not be construed as limiting the invention.

Claims (6)

  1. A method for high-level semantic parsing of images based on human-computer interaction, characterized in that it comprises:
    scanning a source image with a portable scanning device;
    identifying targets in the source image;
    filtering and parsing the content of the source image, and extracting effective knowledge;
    organizing the semantics and delivering the image content to the user in speech form.
  2. The method for high-level semantic parsing of images based on human-computer interaction according to claim 1, characterized in that the scanning of a source image with a portable scanning device comprises:
    scanning the source image with an ARM-based portable scanning device.
  3. The method for high-level semantic parsing of images based on human-computer interaction according to claim 2, characterized in that the identifying of targets in the source image comprises:
    extracting image features with SIFT local feature extraction, combined with HOG features and GIST global features, so that image information can be captured more comprehensively.
  4. The method for high-level semantic parsing of images based on human-computer interaction according to claim 3, characterized in that the filtering and parsing of the content of the source image and extraction of effective knowledge comprises:
    extracting effective knowledge with a bag-of-words image classification method.
  5. The method for high-level semantic parsing of images based on human-computer interaction according to claim 4, characterized in that the bag-of-words image classification method comprises:
    detecting feature points by image segmentation, random sampling or similar means;
    extracting local features from the image and generating descriptors;
    clustering the descriptors of these feature points, each cluster center being one visual word;
    counting the frequency of occurrence of each visual word into a visual word histogram.
  6. The method for high-level semantic parsing of images based on human-computer interaction according to claim 5, characterized in that the organizing of the semantics and delivering of the image content to the user in speech form comprises:
    delivering the image content to the user in speech form using latent semantic extraction technology.
PCT/CN2015/082908 2014-12-17 2015-06-30 Human-computer interaction-based method for parsing high-level semantics of image WO2016095487A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410790684.X 2014-12-17
CN201410790684.XA CN104484666A (en) 2014-12-17 2014-12-17 Advanced image semantic parsing method based on human-computer interaction

Publications (1)

Publication Number Publication Date
WO2016095487A1

Family

ID=52759207

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/082908 WO2016095487A1 (en) 2014-12-17 2015-06-30 Human-computer interaction-based method for parsing high-level semantics of image

Country Status (2)

Country Link
CN (1) CN104484666A (en)
WO (1) WO2016095487A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109191379A (en) * 2018-07-26 2019-01-11 北京纵目安驰智能科技有限公司 A kind of semanteme marking method of panoramic mosaic, system, terminal and storage medium
CN109857884A (en) * 2018-12-20 2019-06-07 郑州轻工业学院 A kind of automated graphics semantic description method
CN109902714A (en) * 2019-01-18 2019-06-18 重庆邮电大学 A kind of multi-modality medical image search method based on more figure regularization depth Hash
CN110119701A (en) * 2019-04-30 2019-08-13 东莞恒创智能科技有限公司 The coal mine fully-mechanized mining working unsafe acts recognition methods of view-based access control model relationship detection
CN112001380A (en) * 2020-07-13 2020-11-27 上海翎腾智能科技有限公司 Method and system for recognizing Chinese meaning phrases based on artificial intelligence realistic scene
CN112650852A (en) * 2021-01-06 2021-04-13 广东泰迪智能科技股份有限公司 Event merging method based on named entity and AP clustering
CN113986431A (en) * 2021-10-27 2022-01-28 武汉戴维南科技有限公司 Visual debugging method and system for automatic robot production line

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104484666A (en) * 2014-12-17 2015-04-01 中山大学 Advanced image semantic parsing method based on human-computer interaction
TWI553494B (en) * 2015-11-04 2016-10-11 創意引晴股份有限公司 Multi-modal fusion based Intelligent fault-tolerant video content recognition system and recognition method
CN105426447B (en) * 2015-11-09 2019-02-01 北京工业大学 A kind of related feedback method based on the learning machine that transfinites
US11514244B2 (en) 2015-11-11 2022-11-29 Adobe Inc. Structured knowledge modeling and extraction from images
CN105740402B (en) 2016-01-28 2018-01-02 百度在线网络技术(北京)有限公司 The acquisition methods and device of the semantic label of digital picture
CN106777125B (en) * 2016-12-16 2020-10-23 广东顺德中山大学卡内基梅隆大学国际联合研究院 Image description generation method based on neural network and image attention point
CN109040693B (en) * 2018-08-31 2020-11-10 上海赛特斯信息科技股份有限公司 Intelligent alarm system and method
CN109275027A (en) * 2018-09-26 2019-01-25 Tcl海外电子(惠州)有限公司 Speech output method, electronic playback devices and the storage medium of video
CN110046271B (en) * 2019-03-22 2021-06-22 中国科学院西安光学精密机械研究所 Remote sensing image description method based on voice guidance
CN110399519B (en) * 2019-07-29 2021-06-18 吉林大学 Extensible multi-semantic image correlation feedback method
JP7467999B2 (en) * 2020-03-10 2024-04-16 セイコーエプソン株式会社 Scan system, program, and method for generating scan data for a scan system
CN115187996B (en) * 2022-09-09 2023-01-06 中电科新型智慧城市研究院有限公司 Semantic recognition method and device, terminal equipment and storage medium
CN116758591B (en) * 2023-08-18 2023-11-21 厦门瑞为信息技术有限公司 Station special passenger recognition and interaction system and method based on image semantic recognition

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020018594A1 (en) * 2000-07-06 2002-02-14 Mitsubishi Electric Research Laboratories, Inc. Method and system for high-level structure analysis and event detection in domain specific videos
CN103077625A (en) * 2013-01-30 2013-05-01 中国盲文出版社 Blind electronic reader and blind assistance reading method
CN103745200A (en) * 2014-01-02 2014-04-23 哈尔滨工程大学 Facial image identification method based on word bag model
CN104142995A (en) * 2014-07-30 2014-11-12 中国科学院自动化研究所 Social event recognition method based on visual attributes
CN104484666A (en) * 2014-12-17 2015-04-01 中山大学 Advanced image semantic parsing method based on human-computer interaction

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001125897A (en) * 1999-10-28 2001-05-11 Sony Corp Device and method for learning language
CN102054178B (en) * 2011-01-20 2016-08-17 北京联合大学 A kind of image of Chinese Painting recognition methods based on local semantic concept
CN102831482A (en) * 2012-08-01 2012-12-19 浙江兴旺宝明通网络有限公司 Heuristic inquiry system based on intelligent question answering aimed at pump valve industry
CN203433526U (en) * 2013-01-22 2014-02-12 华东师范大学 Two-dimensional code electronic reader and application system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020018594A1 (en) * 2000-07-06 2002-02-14 Mitsubishi Electric Research Laboratories, Inc. Method and system for high-level structure analysis and event detection in domain specific videos
CN103077625A (en) * 2013-01-30 2013-05-01 中国盲文出版社 Blind electronic reader and blind assistance reading method
CN103745200A (en) * 2014-01-02 2014-04-23 哈尔滨工程大学 Facial image identification method based on word bag model
CN104142995A (en) * 2014-07-30 2014-11-12 中国科学院自动化研究所 Social event recognition method based on visual attributes
CN104484666A (en) * 2014-12-17 2015-04-01 中山大学 Advanced image semantic parsing method based on human-computer interaction

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109191379A (en) * 2018-07-26 2019-01-11 北京纵目安驰智能科技有限公司 A kind of semanteme marking method of panoramic mosaic, system, terminal and storage medium
CN109857884A (en) * 2018-12-20 2019-06-07 郑州轻工业学院 A kind of automated graphics semantic description method
CN109857884B (en) * 2018-12-20 2023-02-07 郑州轻工业学院 Automatic image semantic description method
CN109902714A (en) * 2019-01-18 2019-06-18 重庆邮电大学 A kind of multi-modality medical image search method based on more figure regularization depth Hash
CN110119701A (en) * 2019-04-30 2019-08-13 东莞恒创智能科技有限公司 The coal mine fully-mechanized mining working unsafe acts recognition methods of view-based access control model relationship detection
CN110119701B (en) * 2019-04-30 2023-04-07 东莞恒创智能科技有限公司 Visual relationship detection-based coal mine fully mechanized coal mining face unsafe behavior identification method
CN112001380A (en) * 2020-07-13 2020-11-27 上海翎腾智能科技有限公司 Method and system for recognizing Chinese meaning phrases based on artificial intelligence realistic scene
CN112001380B (en) * 2020-07-13 2024-03-26 上海翎腾智能科技有限公司 Recognition method and system for Chinese meaning phrase based on artificial intelligence reality scene
CN112650852A (en) * 2021-01-06 2021-04-13 广东泰迪智能科技股份有限公司 Event merging method based on named entity and AP clustering
CN113986431A (en) * 2021-10-27 2022-01-28 武汉戴维南科技有限公司 Visual debugging method and system for automatic robot production line
CN113986431B (en) * 2021-10-27 2024-02-02 武汉戴维南科技有限公司 Visual debugging method and system for automatic robot production line

Also Published As

Publication number Publication date
CN104484666A (en) 2015-04-01

Similar Documents

Publication Publication Date Title
WO2016095487A1 (en) Human-computer interaction-based method for parsing high-level semantics of image
Miech et al. Learning a text-video embedding from incomplete and heterogeneous data
Datta et al. Content-based image retrieval: approaches and trends of the new age
Su et al. Improving image classification using semantic attributes
Liu et al. Image retagging using collaborative tag propagation
Wang et al. Combining global, regional and contextual features for automatic image annotation
Lan et al. Image retrieval with structured object queries using latent ranking svm
Wang et al. Active SVM-based relevance feedback using multiple classifiers ensemble and features reweighting
Alemu et al. Image retrieval in multimedia databases: A survey
Zhang et al. Sentiment analysis on microblogging by integrating text and image features
Mishra et al. Image mining in the context of content based image retrieval: a perspective
Moumtzidou et al. ITI-CERTH participation to TRECVID 2012.
Sharma et al. Evolution of visual data captioning Methods, Datasets, and evaluation Metrics: A comprehensive survey
Feng et al. Graph-based multi-space semantic correlation propagation for video retrieval
Liu et al. LIRIS-Imagine at ImageCLEF 2011 Photo Annotation Task.
Abdulbaqi et al. A sketch based image retrieval: a review of literature
Tang et al. An efficient concept detection system via sparse ensemble learning
Seddati et al. DeepSketch2Image: deep convolutional neural networks for partial sketch recognition and image retrieval
Lin et al. Image auto-annotation via tag-dependent random search over range-constrained visual neighbours
Li et al. Automatic image annotation with continuous PLSA
Sousa et al. Geometric matching for clip-art drawing retrieval
Deljooi et al. A Novel Semantic Statistical Model for Automatic Image Annotation Using the Relationship between the Regions Based on Multi-Criteria Decision Making.
Zheng et al. Discovering discriminative patches for free-hand sketch analysis
Zhao et al. Relevance topic model for unstructured social group activity recognition
Bhandari et al. Ontology based image recognition: A review

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15868999

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15868999

Country of ref document: EP

Kind code of ref document: A1