WO2017070858A1 - A method and a system for face recognition

Info

Publication number: WO2017070858A1
Application number: PCT/CN2015/093031
Authority: WO
Inventors: Yi Sun; Xiaogang Wang; Xiaoou Tang
Original assignee: Beijing Sensetime Technology Development Co., Ltd; Shenzhen Sensetime Technology Co., Ltd; Sensetime Group Limited
Priority date: 2015-10-28
Filing date: 2015-10-28
Publication date: 2017-05-04
Also published as: CN108496174B; CN108496174A

Abstract

Disclosed is an apparatus for face recognition, comprising: a feature extraction unit configured to extract features from input face images with a plurality of deep feature extraction hierarchies; and a recognition unit configured to calculate distances between facial features of different face images extracted by the feature extraction unit, either to determine whether two face images are from the same identity for face verification, or to determine whether one of the input images, as a probe face image, belongs to the same identity as one of the gallery face images among the input images for face identification, wherein each of the deep feature extraction hierarchies contains a plurality of cascaded convolution layers, local-connection layers, pooling layers, and full-connection layers, and wherein neurons in the full-connection layers are only connected to a part of neurons in a previous layer thereof, while neurons in the convolution layers and the local-connection layers are only connected to a part of neurons in local regions in a previous layer thereof.

Description

A METHOD AND A SYSTEM FOR FACE RECOGNITION

Technical Field

The present application relates to a method for face recognition and a system thereof.

Background

The number of parameters in a deep neural network is restricted by the amount of training data. Weight sparsifying algorithms help to reduce model parameters and improve the generalization ability of deep models.

The idea of reducing neural connections has been taken in designing GoogLeNet, which achieved great success on the ImageNet challenge. GoogLeNet reduced neural connections by using very small convolution kernels of sizes 1x1, 3x3, and 5x5.

The Hebbian rule of “neurons that fire together wire together” suggests that connections between strongly correlated neurons are more important than those between weakly correlated neurons. Moreover, neurons in the previous layer which are more correlated (either positively or negatively) to a given neuron in the current layer are more helpful to predict the activities of the latter.

Removing unimportant parameters in deep neural networks was studied by LeCun et al. in their seminal work Optimal Brain Damage. They took a second derivative-related criterion for removing parameters. They reduced model parameters by a factor of eight without loss of the prediction ability of the original model.

Summary

In one aspect of the present application, disclosed is an apparatus for face recognition. The apparatus may comprise an extractor having a plurality of deep neural networks with sparsified neural connections to extract facial features from multiple face regions of face images for face recognition; and a recognizer in electronic communication with the extractor and recognizing face identities of the input face images based on the extracted facial features.

Guided by the Hebbian rule that “neurons that fire together wire together”, more neural connections between weakly correlated neurons are pruned than between strongly correlated neurons, wherein the correlation between two connected neurons is defined by the magnitude of the correlation between their neural activations.

In one embodiment of the present application, a baseline deep neural network is first trained, and then neural connections are pruned layer by layer, from the last layer back toward the earlier layers. Each time, only one additional layer is sparsified and the entire model is re-trained. The previously trained models are used to calculate the neural correlations and to initialize the subsequent sparser models.

In one embodiment of the present application, the baseline deep neural network is similar to VGG net, with every two convolutional layers followed by one max-pooling layer. One major difference is that the last two convolutional layers are replaced by two locally-connected layers. The aim is to learn different features in different face regions, since a face is a structured object, and local connections increase the model fitting ability. The second locally-connected layer is followed by a multi-dimensional fully-connected layer. The feature representation in the fully-connected layer is used for the subsequent face recognition.

In one embodiment of the present application, connections in the baseline model are deleted in a layer-wise fashion, from the last fully-connected layer to the previous locally-connected and convolutional layers. Let N₀ denote a well-trained baseline model. When a layer L_m is sparsified, a new model N_m is re-trained, initialized by its previous model N_{m-1}. Therefore, a sequence of models {N₁, ..., N_M} with fewer and fewer connections is trained, and N_M is the final sparse ConvNet obtained. During the whole training process, the previously learned model is used to calculate the neural correlations and guide the connection dropping procedure. The weights learned by the denser model N_{m-1} also serve as a good initialization for the sparser model N_m to be further trained.

In some embodiments, a trainer may be in electronic communication with the extractor to add supervisory signals on the deep neural networks during training, so as to learn sparse structures in the convolution, local-connection, and full-connection layers, as well as to adjust neural weights in these layers.

In one embodiment of the present application, a joint identification-verification supervisory signal is added to the last fully-connected layer. The same supervisory signal is also added to a few previous layers to enhance the supervision in earlier feature learning stages. The supervisory signals comprise one identification supervisory signal and one verification supervisory signal, wherein the identification supervisory signal is generated by classifying features in any of the layers extracted from an input face region into one of N identities in a training dataset, and taking a classification error as the supervisory signal, and wherein the verification supervisory signal is generated by comparing features in any of the layers extracted from two input face images respectively for determining if they are from the same person, and taking a verification error as the supervisory signal.

Neural weights and neural connections are updated alternately and iteratively. First, neural connections are fixed while neural weights are adjusted by back-propagating supervisory signals through the deep neural networks. These supervisory signals are aggregated to adjust neural weights in each of the convolution, local-connection, and full-connection layers during training. Then neural weights are fixed while neural connections are pruned according to correlations between neural activations of connected neurons. The majority of connections between weakly correlated neurons are pruned. Given the sparser deep model, neural weights are updated again with the neural connections fixed, and so forth.

In a further aspect of the present application, disclosed is a method for face recognition, comprising: configuring a plurality of deep feature extraction hierarchies such that each of the deep feature extraction hierarchies contains a plurality of cascaded convolution layers, local-connection layers, pooling layers and full-connection layers, and neurons in the full-connection layers are only connected to a part of neurons in a previous layer thereof, while neurons in the convolution layers and the local-connection layers are only connected to a part of neurons in local regions in a previous layer thereof; training the configured deep feature extraction hierarchies to learn neural connections in the convolution layers, the local-connection layers, and the full-connection layers, and to adjust neural weights in these layers; extracting features from input face images by the trained deep feature extraction hierarchies; and recognizing face identities of the input face images based on the extracted facial features.

Brief Description of the Drawing

Exemplary non-limiting embodiments of the present invention are described below with reference to the attached drawings. The drawings are illustrative and generally not to an exact scale. The same or similar elements on different figures are referenced with the same reference numbers.

Fig. 1 is a schematic diagram illustrating an apparatus for face recognition consistent with some disclosed embodiments.

Fig. 2 is a schematic diagram illustrating an apparatus for face recognition when it is implemented in software, consistent with some disclosed embodiments.

Fig. 3 is a schematic diagram illustrating an example of deep neural networks with sparsified layers in the extractor as shown in Fig. 1.

Fig. 4 is a schematic flowchart illustrating the trainer as shown in Fig. 1 consistent with some disclosed embodiments.

Fig. 5 is a schematic flowchart illustrating the extractor as shown in Fig. 1 consistent with some disclosed embodiments.

Fig. 6 is a schematic flowchart illustrating the recognizer as shown in Fig. 1 consistent with some disclosed embodiments.

Detailed Description

Reference will now be made in detail to some specific embodiments of the invention including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. In other instances, well-known process operations have not been described in detail in order not to unnecessarily obscure the present invention.

As will be appreciated by one skilled in the art, the present invention may be embodied as a system, method or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium.

In the case that the apparatus 1000 as disclosed below is implemented with software, the apparatus 1000 may include a general purpose computer, a computer cluster, a mainstream computer, a computing device dedicated for providing online contents, or a computer network comprising a group of computers operating in a centralized or distributed fashion. As shown in Fig. 2, the apparatus 1000 may include one or more processors (processors 102, 104, 106, etc.), a memory 112, a storage device 116, a communication interface 114, and a bus to facilitate information exchange among various components of apparatus 1000. Processors 102-106 may include a central processing unit (“CPU”), a graphic processing unit (“GPU”), or other suitable information processing devices. Depending on the type of hardware being used, processors 102-106 can include one or more printed circuit boards, and/or one or more microprocessor chips. Processors 102-106 can execute sequences of computer program instructions to perform various methods or run the modules that will be explained in greater detail below.

Memory 112 can include, among other things, a random access memory (“RAM”) and a read-only memory (“ROM”). Computer program instructions can be stored, accessed, and read from memory 112 for execution by one or more of processors 102-106. For example, memory 112 may store one or more software applications. Further, memory 112 may store an entire software application or only a part of a software application that is executable by one or more of processors 102-106 to carry out the functions as disclosed below for the apparatus 1000. It is noted that although only one block is shown in Fig. 2, memory 112 may include multiple physical devices installed on a central computing device or on different computing devices.

Referring to Fig. 1 again, where the apparatus 1000 is implemented in hardware, it may comprise an extractor 10 and a recognizer 20. The extractor 10 is configured with a plurality of deep neural networks with sparsified neural connections (referred to as sparse deep neural networks) to extract facial features from face regions of input face images. The recognizer 20 is in electronic communication with the extractor 10 and recognizes face identities of the input face images based on the extracted facial features. As will be discussed in detail below, each of the sparse deep neural networks comprises a plurality of sparse convolutional layers, sparse local-connection layers, sparse full-connection layers, and pooling layers. A first one of the sparse convolutional layers extracts local facial features from input face images, and the subsequent sparse convolutional layers and sparse local-connection layers extract further local features from the features output by a previous layer. Each of the sparse full-connection layers extracts global features from the features output by a previous layer. Each of the pooling layers receives features from a previous layer and reduces the dimensions of the received features. The features obtained from all the sparse deep neural networks are concatenated into a feature vector as said facial features for face recognition.

In addition, the apparatus 1000 may further comprise a trainer 30 used to learn sparse neural connections (referred to as sparse structures) as well as weights on sparse connections of the sparse deep neural networks.

The Extractor 10

Fig. 5 is a schematic flowchart illustrating the feature extraction process 50 in the extractor 10, which contains three steps. In step S501, the extractor 10 forward propagates face regions of an input face image through the sparse deep neural networks. Then in step S502, the extractor 10 takes neural activations in the last layers of the sparse deep neural networks as facial features. Finally in step S503, it concatenates the facial features of all the sparse deep neural networks.
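The three steps of Fig. 5 can be sketched in a few lines of Python. This is only an illustrative sketch: the names `extract_features` and `make_toy_network` are hypothetical, and each "network" here is a stand-in (one random linear map followed by ReLU) rather than the sparse deep neural network of Fig. 3.

```python
import numpy as np

def make_toy_network(in_dim, out_dim, rng):
    """Hypothetical stand-in for one trained sparse deep neural network."""
    w = rng.standard_normal((out_dim, in_dim))
    # Returns the "last-layer activations" for a given face region.
    return lambda region: np.maximum(w @ region.ravel(), 0.0)

def extract_features(face_regions, networks):
    """S501: forward each face region through its own network.
    S502: take the last-layer activations as features.
    S503: concatenate the features of all networks."""
    features = [net(region) for net, region in zip(networks, face_regions)]
    return np.concatenate(features)
```

For example, with two face regions and two stand-in networks producing 5 and 3 activations respectively, the concatenated feature vector has 8 entries.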

Given an input face image, the sparse deep neural network in the extractor 10 starts by extracting local facial features, with every two sparse convolutional layers followed by one pooling layer. The last pooling layer (pooling layer 12) is followed by two sparse local-connection layers to further extract local facial features and one sparse full-connection layer to extract global facial features. In particular, in the sparse deep neural network, neurons in the full-connection layers are only connected to a part of the neurons in a previous layer thereof, while neurons in the convolution layers and the local-connection layers are only connected to a part of the neurons in local regions in a previous layer thereof. This is why these layers are called "sparse" layers.

As will be discussed later in reference to the trainer 30, the sparse deep neural network in the extractor 10 shall be trained. In one embodiment of the present application, connections in the baseline model are deleted in a layer-wise fashion, from the last fully-connected layer to the previous locally-connected and convolutional layers. Let N₀ denote a well-trained baseline model. When a layer L_m is sparsified, a new model N_m is re-trained, initialized by its previous model N_{m-1}. Therefore, a sequence of models {N₁, ..., N_M} with fewer and fewer connections is trained, and N_M is the final sparse deep neural network obtained (also referred to as a sparse ConvNet, since the deep neural network contains convolutional layers). During the whole training process, the previously learned model is used to calculate the neural correlations and guide the connection dropping procedure. The weights learned by the denser model N_{m-1} also serve as a good initialization for the sparser model N_m to be further trained.

With the above constructions, by reducing the model/layer parameters (i.e., neural weights on connections) of the original non-sparse layers, these sparse layers help to improve the generalization ability of the learned features, i.e., features learned on the training face images generalize well to test face images and distinguish the test face images by their identities. In addition, the sparse layers reduce the sizes (numbers of parameters) of the neural networks, making them easier to store on mobile phones or other devices with limited memory.

Fig. 3 illustrates an example of sparse deep neural networks in the extractor 10 according to one embodiment of the present application. The extractor 10 contains a plurality of sparse deep neural networks. Each of the sparse deep neural networks may comprise a plurality of sparse convolution-pooling modules 301, 302, 303 and 304, and a connection module 305, as illustrated in Fig. 3. It will be appreciated that more or fewer sparse convolution-pooling modules may be used as required, although Fig. 3 illustrates four sparse convolution-pooling modules as an example.

As shown, each of the sparse convolution-pooling modules 301, 302, 303 and 304 is a cascade of two sparse convolutional layers and a pooling layer. For example, the convolution-pooling module 301 may comprise a sparse convolution layer 1, a sparse convolution layer 2 and a pooling layer 3, which are cascaded sequentially. All the sparse convolution-pooling modules 301, 302, 303 and 304 are cascaded sequentially, and are then cascaded to the connection module 305, which is further configured with two sparse local-connection layers 13 and 14 and a sparse full-connection layer 15. Compared to convolutional layers, local-connection layers 13 and 14 help to extract more diverse features, which have proved helpful in later feature extraction stages in deep neural networks. Neural activations in sparse full-connection layer 15 are used as facial features for face recognition.

Sparse convolutional layers, sparse local-connection layers, and sparse full-connection layers are convolutional layers, local-connection layers, and full-connection layers with sparsified neural connections, respectively. Given the degree of sparsity S (0 < S < 1), the present application samples S·|W| weights from the total number of weights |W| in a given sparse layer. Neural connections which correspond to the sampled weights are reserved. Otherwise, they are pruned from the current sparse deep neural network. The number of connections is proportional to the number of weights for all types of sparse layers in sparse deep neural networks.

Convolutional layers are configured to extract local facial features from input feature maps (which are the output feature maps of a previous layer) to form output feature maps of the current layer. In particular, each convolutional layer performs convolution operations on the input feature maps to form output feature maps of the current layer, and the formed output feature maps are input to the next layer.

Each feature map is a certain kind of feature organized in 2D. Features in the same output feature map are extracted from input feature maps with the same set of neural connection weights. The convolution operation in each convolutional layer may be expressed as

$$y^j = \max\Big(0,\ b^j + \sum_i k^{ij} * x^i\Big)$$

where

xⁱ and y^j are the i-th input feature map and the j-th output feature map, respectively;

k^ij is the convolution kernel between the i-th input feature map and the j-th output feature map;

* denotes convolution;

b^j is the bias of the j-th output feature map; and

the ReLU nonlinearity y = max(0, ·) is used for neurons.

Neural weights in convolutional layers are parameters in convolution kernels k^ij. In sparse convolutional layers, a portion of kernel parameters are sampled, according to the degree of sparsity S. A sampled parameter corresponds to a set of neural connections which share the same parameter. These neural connections are reserved in sparse convolutional layers. Other neural connections which take the unsampled kernel parameters as weights are pruned from a sparse convolutional layer.
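A sparse convolutional layer of this kind can be sketched with a binary mask over the kernel parameters: a masked-out parameter removes every connection that shares it. The following is a minimal NumPy sketch under stated assumptions, not the implementation of the present application. The function name `sparse_conv_layer`, the array layouts, and the "valid" (no padding, stride 1) output size are all assumptions made for illustration.

```python
import numpy as np

def sparse_conv_layer(x, k, b, mask):
    """x: (I, H, W) input feature maps.
    k: (I, J, kh, kw) convolution kernels k^ij.
    b: (J,) biases b^j.
    mask: same shape as k; 1 keeps a kernel parameter, 0 prunes it."""
    n_in, height, width = x.shape
    _, n_out, kh, kw = k.shape
    k_sparse = k * mask  # pruned parameters contribute nothing
    out = np.zeros((n_out, height - kh + 1, width - kw + 1))
    for j in range(n_out):
        for i in range(n_in):
            # Flip the kernel so the sliding dot product is a true convolution.
            flipped = k_sparse[i, j, ::-1, ::-1]
            for r in range(out.shape[1]):
                for c in range(out.shape[2]):
                    out[j, r, c] += np.sum(x[i, r:r + kh, c:c + kw] * flipped)
        out[j] += b[j]
    return np.maximum(out, 0.0)  # ReLU
```

Because the mask is applied to the shared kernel rather than to individual connections, one zero in `mask` removes that weight at every spatial position of the output feature map.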

Local-connection layers are also configured to extract local facial features from input feature maps (which are the output feature maps of a previous layer) to form output feature maps of the current layer. Unlike convolutional layers, local-connection layers do not share neural weights across neurons on the same output feature map. The operation in each local-connection layer may be expressed as

$$y^{jr} = \max\Big(0,\ b^j + \sum_i k^{ijr} \cdot x^{ir}\Big)$$

where x^ir is the neural activations in a local region r in the i-th feature map of a previous layer, y^jr is the r-th (single) neural activation in the j-th output feature map of the current layer, k^ijr is the neural weights on local connections between y^jr and x^ir, b^j is the bias of the j-th output feature map, and y = max(0, ·) is the ReLU nonlinearity. In sparse local-connection layers, a portion of neural weights (i.e., k^ijr for all i, j, and r) are sampled, according to the degree of sparsity S. A sampled neural weight corresponds to a single neural connection since weights are unshared between different neural connections. These sampled neural connections are reserved in sparse local-connection layers. Other unsampled neural connections are pruned from sparse local-connection layers.

The goal of cascading (sparse) convolutional layers and (sparse) local-connection layers is to extract hierarchical local features (i.e., features extracted from local regions of the input images or the input features), wherein features extracted by higher convolutional/local-connection layers have a larger effective receptive field on input images and more complex non-linearity.

Pooling layers are configured to pool local facial features from input feature maps from a previous layer to form output feature maps of the current layer. The pooling operation forms more invariant features. For max-pooling, it may be formulated as

$$y^i_{j,k} = \max_{0 \le m < M,\ 0 \le n < N} x^i_{j \cdot s + m,\ k \cdot s + n}$$

where each neuron in the i-th output feature map yⁱ pools over an M×N local region in the i-th input feature map xⁱ, with s as the step size.
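As an illustration of pooling over an M×N local region with step size s, the sketch below assumes max-pooling (the type named for the baseline network in the Summary) on a single 2D feature map. The function name `max_pool` and the no-padding boundary handling are assumptions.

```python
import numpy as np

def max_pool(x, M, N, s):
    """x: (H, W) input feature map; pools M x N regions with step size s."""
    H, W = x.shape
    out_h = (H - M) // s + 1
    out_w = (W - N) // s + 1
    y = np.empty((out_h, out_w), dtype=x.dtype)
    for j in range(out_h):
        for k in range(out_w):
            # Each output neuron takes the maximum over its local region.
            y[j, k] = x[j * s:j * s + M, k * s:k * s + N].max()
    return y
```

With a 4×4 input holding the values 0 to 15 in row-major order and M = N = s = 2, the output is the 2×2 map [[5, 7], [13, 15]].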

Full-connection layers in deep neural networks are configured to extract global features (features extracted from the entire region of input feature maps) from a previous layer. Full-connection layers also serve as interfaces for receiving supervisory signals during training, which will be discussed later. Like pooling layers, full-connection layers also reduce feature dimensions, by restricting the number of neurons in them. A full-connection layer is formulated as

$$y_j = \max\Big(0,\ \sum_i w_{i,j}\, x_i\Big)$$

where

x denotes neural activations from a previous layer,

y denotes neural activations in the current full-connection layer, and

w denotes neural weights on connections between the current full-connection layer and a previous layer.

Neurons in full-connection layers linearly combine neural activations of all neurons in a previous layer, followed by ReLU non-linearity. In sparse full-connection layers, a portion of neural weights (i.e., w_{i,j} for all i and j) are sampled, according to the degree of sparsity S. A sampled neural weight corresponds to a single neural connection since weights are unshared between different neural connections. These sampled neural connections are reserved in sparse full-connection layers. Other unsampled neural connections are pruned from sparse full-connection layers.
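A sparse full-connection layer can be sketched as an ordinary linear layer whose weight matrix is multiplied elementwise by a binary mask, so that each zero in the mask removes exactly one connection. The name `sparse_fc_layer` and the array layout are assumptions. No bias term is included because the text defines only x, y, and w; a real layer may well have one.

```python
import numpy as np

def sparse_fc_layer(x, w, mask):
    """x: (D_in,) activations of the previous layer.
    w: (D_in, D_out) weights w_{i,j}.
    mask: (D_in, D_out) binary; 1 keeps a connection, 0 prunes it."""
    return np.maximum(x @ (w * mask), 0.0)  # linear combination, then ReLU
```

For instance, with x = [1, 2], w = [[1, -1], [1, 1]] and the connection from input 2 to output 1 pruned, the first output drops from 3 to 1 while the second is unchanged.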

Neural activations of neurons in the highest layer of sparse deep neural networks are used as facial features for face recognition. These facial features are global and can capture highly non-linear mappings from input face images to their identities. In one embodiment of the present application, neural activations of neurons in sparse full-connection layer 15 of the sparse deep neural network shown in Fig. 3 are used as facial features for face recognition. An extractor contains a plurality of sparse deep neural networks. Facial features extracted by all sparse deep neural networks are concatenated into a long feature vector as a final feature representation for face recognition.

The Recognizer 20

The recognizer 20 operates to calculate distances between facial features of different face images extracted by the extractor 10, either to determine whether two face images are from the same identity for face verification, or to determine whether one of the input images, as a probe face image, belongs to the same identity as one of the gallery face images among the input images for face identification. Fig. 6 is a schematic flowchart illustrating the recognition process 60 in the recognizer 20. In step S601, the recognizer 20 calculates distances between facial features extracted from different face images by the extractor 10. Then in step S602, the recognizer 20 determines whether two face images are from the same identity for face verification, or, alternatively, in step S603, it determines whether one of the input images, as a probe face image, belongs to the same identity as one of the gallery face images for face identification.

In the recognizer 20, two face images are determined to belong to the same identity if their feature distance is smaller than a threshold, or the probe face image is determined to belong to the same identity as one of gallery face images if their feature distance is the smallest compared to feature distances of the probe face image to all the other gallery face images, wherein feature distances determined by the recognizer 20 could be Euclidean distances, Joint Bayesian distances, cosine distances, Hamming distances, or any other distances.
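The two decision rules above (a threshold for verification, the nearest gallery image for identification) can be sketched as follows. Euclidean distance is used only as one of the distances listed; the function names and the choice to return a gallery index are assumptions.

```python
import numpy as np

def euclidean(f1, f2):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(np.asarray(f1, dtype=float) - np.asarray(f2, dtype=float)))

def verify(f1, f2, threshold):
    """Face verification: same identity if the distance is below the threshold."""
    return euclidean(f1, f2) < threshold

def identify(probe, gallery):
    """Face identification: index of the gallery feature nearest to the probe."""
    distances = [euclidean(probe, g) for g in gallery]
    return int(np.argmin(distances))
```

Swapping in a cosine or Hamming distance would only require replacing `euclidean`; the threshold itself would have to be chosen on held-out data, which this sketch does not cover.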

The Trainer 30

The trainer 30 is used to learn the sparse structures (i.e., neural connections) of the sparse deep neural networks as well as the neural weights on connections of the sparse deep neural networks in the extractor 10. As illustrated in Fig. 4, in step S401 the trainer 30 first trains an initial dense neural network N₀ with network structure T₀. For example, the initial structure T₀ could be the one shown in Fig. 3 with the sparse convolutional layers, the sparse local-connection layers, and the sparse full-connection layers replaced by conventional (dense) convolutional layers, local-connection layers, and full-connection layers, respectively. Then, given a sequence of layers L₁, L₂, ..., L_M to be sparsified and the corresponding pre-specified degrees of sparsity S₁, S₂, ..., S_M, the trainer 30 iteratively prunes (sparsifies) the neural connections as illustrated in step S402 and learns the neural weights on reserved neural connections as illustrated in step S403. In the m-th iteration (for m = 1, 2, ..., M), the trainer 30 first, in step S402, prunes the neural connections in layer L_m according to the neural correlations in network N_{m-1} and the pre-specified sparsity degree S_m. Let T_m be the sparser structure after pruning. The trainer 30 then, in step S403, trains a sparser network N_m with structure T_m, wherein weights on connections of network N_m are initialized by those of network N_{m-1}. After iterative connection pruning (sparsifying) and weight updating (once m ≥ M, at step S404), the trainer 30 finally, in step S405, outputs the sparsified and well-trained neural network N_M.
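The control flow of Fig. 4 can be sketched as a short loop. This is a structural sketch only: `sparsify_layerwise`, `prune_layer`, and `train` are hypothetical names, and the pruning and training routines are passed in as callables because their details depend on the network framework used.

```python
def sparsify_layerwise(baseline, layers, sparsities, prune_layer, train):
    """baseline: the trained dense network N_0 (result of step S401).
    layers: the layers L_1 .. L_M to sparsify, in order.
    sparsities: the matching degrees of sparsity S_1 .. S_M.
    prune_layer(model, layer, s): returns the sparser structure T_m,
        using neural correlations measured in the previous model.
    train(structure, init): trains a network with that structure,
        initialised from the weights of the previous model."""
    model = baseline
    for layer, s in zip(layers, sparsities):
        structure = prune_layer(model, layer, s)  # step S402: T_m from N_{m-1}
        model = train(structure, init=model)      # step S403: N_m from N_{m-1}
    return model                                  # step S405: N_M
```

Only one additional layer is sparsified per iteration, and each retraining starts from the previous, denser model rather than from scratch.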

Given a layer L_m to be pruned (sparsified) and the pre-specified degree of sparsity S_m (0 < S_m < 1) for the given layer, the present application samples S_m·|W| weights from the total number of weights |W| of layer L_m. The number of connections is proportional to the number of weights for all types of sparse layers (including sparse convolutional layers, sparse local-connection layers, and sparse full-connection layers) in the sparse deep neural network. The sampling is based on neural correlations, following the principle of keeping connections (and the corresponding weights) between highly correlated neurons and dropping connections between weakly correlated neurons. This is because neurons in one layer which have stronger correlations to neurons in the upper layer have stronger predictive power for the activities of the latter. Note that neurons with strong negative correlations are also useful in predicting neural activations. If a neuron is viewed as a detector of a certain visual pattern, its positively correlated neurons in the lower layer provide evidence for the visual pattern, while its negatively correlated neurons help to reduce false alarms. In practice, the present application may also keep a small portion of connections between weakly correlated neurons, because predictions from weakly correlated neurons are complementary to those from highly correlated neurons.

First, the full-connection layers and local-connection layers, in which weights are not shared, are considered. Weights and connections are one-to-one mapped in these layers. Given a neuron a_i in the current layer and its K connected neurons b_i1, b_i2, ..., b_iK in the previous layer, the correlation coefficient between a_i and each of b_ik for k = 1, 2, ..., K is (for simplicity, when the present application refers to a neuron, it also means its neural activations)

$$r_{ik} = \frac{E\big[(a_i - \mu_{a_i})(b_{ik} - \mu_{b_{ik}})\big]}{\sigma_{a_i}\,\sigma_{b_{ik}}}$$

where μ_{a_i}, μ_{b_ik} and σ_{a_i}, σ_{b_ik} denote the mean and standard deviation of a_i and b_ik, respectively, which are evaluated on a separate training set. Since both positively and negatively correlated neurons are helpful for the predictions, the corresponding connections are considered separately. From all r_ik for k = 1, 2, ..., K, first take out all positive coefficients and sort them in descending order, denoted r⁺_ik for k = 1, 2, ..., K⁺. Then randomly sample λSK⁺ and (1-λ)SK⁺ coefficients from the coefficients ranked in the first and the second half of the sorted correlation coefficients, respectively. Weights/connections corresponding to the sampled coefficients are reserved while others are deleted. The present application takes λ = 0.75. In other words, three times as many connections are kept from the half with higher correlations as from the half with lower correlations. The total number of kept connections/weights is SK⁺, which depends on the degree of sparsity S.

The negative coefficients are processed in a similar way, except that the absolute values of the coefficients are considered, and more coefficients (and the corresponding connections/weights) of higher absolute values are kept. The total number of sampled negative coefficients is SK⁻, given K⁻ negative coefficients among r_ik for k = 1, 2, ..., K. Connections from each of the output neurons a_i are processed in the same way. Suppose there are N output neurons a_i for i = 1, 2, ..., N. Then the total number of sampled weights/connections is SKN.
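The sampling rule for one output neuron in a layer with unshared weights can be sketched as below. This is a sketch under stated assumptions: `sample_connections` is a hypothetical name, the counts λSK⁺ and (1-λ)SK⁺ are rounded to the nearest integer (the text does not say how fractional counts are handled), and neurons with zero variance are not handled.

```python
import numpy as np

def sample_connections(a, b, S, lam=0.75, rng=None):
    """a: (T,) activations of one output neuron a_i over T samples.
    b: (T, K) activations of its K connected input neurons b_ik.
    Returns a boolean mask of length K; True marks a kept connection."""
    rng = np.random.default_rng() if rng is None else rng
    # Pearson correlation r_ik between a and every column of b.
    a_c = a - a.mean()
    b_c = b - b.mean(axis=0)
    r = (a_c @ b_c) / (len(a) * a.std() * b.std(axis=0))
    keep = np.zeros(b.shape[1], dtype=bool)
    for sign in (1.0, -1.0):  # positive coefficients, then negative ones
        idx = np.flatnonzero(sign * r > 0)
        idx = idx[np.argsort(-np.abs(r[idx]))]  # descending magnitude
        half = len(idx) // 2
        top, bottom = idx[:half], idx[half:]
        n_top = min(len(top), int(round(lam * S * len(idx))))
        n_bottom = min(len(bottom), int(round((1 - lam) * S * len(idx))))
        # Random sampling within each half of the sorted coefficients.
        keep[rng.permutation(top)[:n_top]] = True
        keep[rng.permutation(bottom)[:n_bottom]] = True
    return keep
```

With K⁺ = 8 positive coefficients, S = 0.5 and λ = 0.75, this keeps 3 connections from the more correlated half and 1 from the less correlated half, 4 in total. Applying the function to every output neuron of the layer and stacking the masks gives the layer's full connection mask.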

For convolutional layers, the set of correlation coefficients between neurons with shared connecting weights are jointly considered to determine whether a weight (or a set of connections with shared weights) should be reserved or deleted. Leta_imbe the m-th neuron in the i-th feature map of the current layer, and it is connected to K neuronsb_mk in the previous layer for k ＝ 1, 2, ..., K. (K equals the filter size, e.g., 3x3, times the number of input channels. ) The set of K neurons b_mk are determined by the position m. There are a total of M neurons in the i-th output feature map asa_im for m ＝ 1, 2, ..., M. They all share the same set of K weights, although connected to different sets of neurons in the previous layerb_mk for m ＝ 1, 2, ..., M. Weights between a_im and b_mk are shared for m ＝ 1, 2, ..., M. We calculate the mean magnitude of the correlation coefficients betweena_im andb_mk for m ＝ 1, 2, ..., M as

Similar to the case of the full-connection layers and local-connection layers, given the degree of sparsity S, SK mean correlation coefficients (and the corresponding weights) are selected from the set of K coefficients r_ik for k = 1, 2, ..., K. The coefficients r_ik are sorted in descending order. λSK coefficients are randomly chosen from the first half, with higher values, and (1-λ)SK coefficients are randomly chosen from the second half, with lower values. Again, λ is set to 0.75 in the present application. The set of K coefficients r_ik for k = 1, 2, ..., K (and the corresponding weights) is processed in the same way for all i = 1, 2, ..., N (given N feature maps in the current layer). The total number of sampled weights is SKN.
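The mean correlation magnitude for one output feature map can be sketched in Python with NumPy as follows. The array layout (T samples, M positions, K inputs per position), the function name `mean_abs_correlation`, and the small constant added to the denominator are assumptions made for this illustration.

```python
import numpy as np

def mean_abs_correlation(a, b):
    """a: (T, M) activations of the M neurons a_im of one feature map over T samples.
    b: (T, M, K) activations of the K input neurons b_mk seen at each position m.
    Returns r of shape (K,), the mean over m of |corr(a_im, b_mk)|."""
    a_c = a - a.mean(axis=0)
    b_c = b - b.mean(axis=0)
    # Covariance between a_im and b_mk for every position m and input k.
    cov = np.einsum("tm,tmk->mk", a_c, b_c) / a.shape[0]
    # Product of standard deviations; the epsilon guards against constant activations.
    denom = a.std(axis=0)[:, None] * b.std(axis=0) + 1e-12
    return np.abs(cov / denom).mean(axis=0)

rng = np.random.default_rng(0)
b = rng.normal(size=(200, 5, 9))   # toy data: T=200, M=5, K=9
a = -b[:, :, 0]                    # perfectly (negatively) correlated with input k=0
r = mean_abs_correlation(a, b)
print(r.round(2))
```

Because magnitudes are averaged, the resulting K values are non-negative, so a single descending sort followed by the λ split over the two halves suffices for the convolutional case.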

During the phase of weight updating, identification and verification supervisory signals in the trainer 30 are simultaneously added to each of the supervised layers (e.g., sparse full-connection layer 15 in Fig. 3) of each of the sparse deep neural networks in the extractor 10, and respectively back-propagated to the input face image, so as to update neural weights on reserved neural connections of sparse convolutional layers, sparse local-connection layers, and sparse full-connection layers of the sparse deep neural networks.
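A weight update restricted to the reserved connections can be sketched as a masked gradient step. This is only an illustration under stated assumptions: the function name `masked_update`, the plain gradient-descent rule, the learning rate `lr`, and the relative weight `verif_weight` between the two supervisory signals are hypothetical and are not specified in the present application.

```python
import numpy as np

def masked_update(weights, grad_ident, grad_verif, mask, lr=0.01, verif_weight=1.0):
    """One gradient step that only touches reserved connections.
    weights, grad_ident, grad_verif, mask: arrays of the same shape;
    mask is 1 for a reserved connection and 0 for a pruned one."""
    # The two supervisory signals are added before the step (the weighting is an assumption).
    grad = grad_ident + verif_weight * grad_verif
    # Multiplying by the mask keeps pruned connections at exactly zero.
    return (weights - lr * grad) * mask

weights = np.ones((2, 2))
mask = np.array([[1.0, 0.0], [0.0, 1.0]])
updated = masked_update(weights, np.full((2, 2), 2.0), np.full((2, 2), 3.0), mask, lr=0.1)
print(updated)
```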

The identification supervisory signals are generated in the trainer 30 by classifying each of the supervised layer representations (the supervised layers being the layers selected for supervision, e.g., sparse full-connection layer 15 in Fig. 3) into one of N identities, wherein the classification errors are used as the identification supervisory signals.
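One common way to measure such a classification error is the cross-entropy of a softmax classifier over the N identities. The sketch below assumes that choice, which the text above does not state; the function name `identification_error` and the classifier parameters `W` and `bias` are hypothetical.

```python
import numpy as np

def identification_error(features, W, bias, label):
    """Softmax cross-entropy of one feature vector against its identity label.
    features: (D,), W: (D, N), bias: (N,), label: integer in [0, N)."""
    logits = features @ W + bias
    logits = logits - logits.max()                     # subtract the max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())  # log-softmax
    return -log_probs[label]

# With an uninformative classifier (all-zero parameters) the error equals log(N).
loss = identification_error(np.ones(4), np.zeros((4, 5)), np.zeros(5), label=2)
print(loss)
```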

The verification supervisory signals in the trainer 30 are generated by verifying the supervised layer representations of two compared face images, respectively, in each of the feature extraction modules, to determine if the two compared face images belong to the same identity, wherein the verification errors are used as the verification supervisory signals. Given a pair of training face images, the extractor 10 extracts two feature vectors f_i and f_j from the two face images respectively in each of the feature extraction modules. The verification error is

(1/2) ||f_i - f_j||_2^2

if f_i and f_j are features of face images of the same identity, or

(1/2) max(0, m - ||f_i - f_j||_2)^2

if f_i and f_j are features of face images of different identities, where ||f_i - f_j||_2 is the Euclidean distance between the two feature vectors and m is a positive constant value. There are errors if f_i and f_j are dissimilar for the same identity, or if f_i and f_j are similar for different identities.
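A verification error of this kind can be sketched in Python as a margin-based (contrastive) loss on the Euclidean distance. The exact form used here, half the squared distance for a same-identity pair and half the squared margin violation for a different-identity pair, is an assumption chosen to match the description of the distance and the positive constant m; the function name `verification_error` is hypothetical.

```python
import numpy as np

def verification_error(f_i, f_j, same_identity, m=1.0):
    """Margin-based verification error for a pair of feature vectors.
    m is the positive constant (margin) applied to different-identity pairs."""
    d = np.linalg.norm(f_i - f_j)          # Euclidean distance between the two features
    if same_identity:
        return 0.5 * d ** 2                # penalize dissimilar features of one identity
    return 0.5 * max(0.0, m - d) ** 2      # penalize different identities closer than m

f_i = np.array([0.0, 0.0])
f_j = np.array([3.0, 4.0])                 # distance 5
print(verification_error(f_i, f_j, same_identity=True))
print(verification_error(f_i, f_j, same_identity=False, m=6.0))
```

Pairs of different identities whose distance already exceeds m contribute no error, so only pairs that are too close are pushed apart.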

Although the preferred examples of the present invention have been described, those skilled in the art can make variations or modifications to these examples upon knowing the basic inventive concept. The appended claims are intended to be considered as comprising the preferred examples and all the variations or modifications falling within the scope of the present invention.

Obviously, those skilled in the art can make variations or modifications to the present invention without departing from the spirit and scope of the present invention. As such, if these variations or modifications belong to the scope of the claims and equivalent techniques, they may also fall within the scope of the present invention.

The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

Claims

An apparatus for face recognition, comprising:

a feature extraction unit configured to extract features from input face images with a plurality of deep feature extraction hierarchies; and

a recognition unit configured to calculate distances between facial features of different face images extracted by the extractor to determine if two face images are from the same identity for face verification or determine if one of the input images, as a probe face image, is belonging to a same identity as one of gallery face images consisting of the input images for face identification,

wherein each of the deep feature extraction hierarchies contains a plurality of cascaded convolution layers, local-connection layers, pooling layers, and full-connection layers, and

wherein neurons in the full-connection layers are only connected to a part of neurons in a previous layer thereof, while neurons in the convolution layers and the local-connection layers are only connected to a part of neurons in local regions in a previous layer thereof.
An apparatus of claim 1, further comprising:

a training unit configured to add supervisory signals on the feature extraction unit during training so as to learn neural connections in the convolution layers, the local-connection layers, and the full-connection layers, and to adjust neural weights in these layers.
An apparatus of claim 1, wherein neural connections in the convolution layers, the local-connection layers, and the full-connection layers and neural weights on the neural connections are learned iteratively.
An apparatus of claim 3, wherein in one iteration, the neural weights on neural connections are adjusted by fixing neural connections in the convolution, local-connection and full-connection layers, and then,

the neural connections in one or more of the convolution, local-connection and full-connection layers are pruned while fixing neural weights.
An apparatus of claim 3, wherein the neural connections are pruned according to correlations between neural activations of connected neurons, wherein a majority of connections between weakly correlated neurons are pruned while a majority of connections between strongly correlated neurons are reserved.
An apparatus of claim 3, wherein before the first iteration, neurons in the full-connection layers are connected to all neurons in a previous layer thereof, while neurons in the sparse convolution modules and the sparse local-connection modules are connected to all neurons in local regions in a previous layer thereof, respectively.
An apparatus of claim 3, wherein for the second and the following iterations, the neuron connections are taken from reserved neuron connections in the previous iteration and the neuron weights on these neuron connections are initialized by neuron weights learned in the previous iteration.
An apparatus of claim 7, wherein the neuron weights on the reserved neuron connections from the previous iteration are adjustable according to joint identification-verification supervisory signals.
An apparatus of claim 8, wherein the joint identification-verification supervisory signals comprise an identification supervisory signal and a verification supervisory signal,

wherein,

the identification supervisory signal is generated by classifying features extracted from an input face region into one of N identities in a training dataset, and taking the classification error as the supervisory signal, while the verification signal is generated by comparing features extracted from two input face images respectively to tell if they are from the same person, and taking the verification error as the supervisory signal.
An apparatus of claim 1, wherein features extracted by a plurality of deep feature extraction hierarchies in the feature extraction unit are concatenated for face recognition.
An apparatus of claim 10, wherein distances between the concatenated features extracted from two input face images are compared to a threshold to determine if the two input face images are from the same person for face verification, or distances between features of an input query face image to features of each of face images in a face image database are computed to determine which identity in the face image database the input query face image belongs to for face identification.
A method for face recognition, comprising:

configuring a plurality of deep feature extraction hierarchies such that each of the deep feature extraction hierarchies contains a plurality of cascaded convolution layers, local-connection layers, pooling layers and full-connection layers, and neurons in the full-connection layers are only connected to a part of neurons in a previous layer thereof, while neurons in the convolution layers and the local-connection layers are only connected to a part of neurons in local regions in a previous layer thereof;

training the configured deep feature extraction hierarchies to learn neural connections in the convolution layers, the sparse local-connection layers, and the full-connection layers, and to adjust neural weights in these layers;

extracting features from input face images by the trained deep feature extraction hierarchies; and

recognizing faces based on features extracted from each input face image by the feature extraction unit.
A method of claim 12, wherein the training further comprises:

learning iteratively neural connections in the convolution layers, the local-connection layers, and the full-connection layers and neural weights on the neural connections.
A method of claim 13, wherein in one iteration, the training comprises:

adjusting the neural weights on neural connections by fixing neural connections in the convolution, local-connection and full-connection layers, and,

pruning the neural connections in one or more of the convolution, local-connection and full-connection layers while fixing the neural weights thereof.
A method of claim 14, wherein the neural connections are pruned according to correlations between neural activations of connected neurons, wherein a majority of connections between weakly correlated neurons are pruned while a majority of connections between strongly correlated neurons are reserved.
A method of claim 13, wherein before the first iteration, neurons in the full-connection layers are connected to all neurons in a previous layer thereof, while neurons in the sparse convolution modules and the sparse local-connection modules are connected to all neurons in local regions in a previous layer thereof, respectively.
A method of claim 13, wherein for the second and the following iterations, the neuron connections are taken from reserved neuron connections in the previous iteration and the neuron weights on these neuron connections are initialized by neuron weights learned in the previous iteration.
A method of claim 17, wherein the neuron weights on the reserved neuron connections from the previous iteration are adjustable according to joint identification-verification supervisory signals.
A method of claim 18, wherein the joint identification-verification supervisory signals comprise an identification supervisory signal and a verification supervisory signal,

wherein,

the identification supervisory signal is generated by classifying features extracted from an input face region into one of N identities in a training dataset, and taking the classification error as the supervisory signal, while the verification signal is generated by comparing features extracted from two input face images respectively to tell if they are from the same person, and taking the verification error as the supervisory signal.
A method of claim 12, wherein features extracted by a plurality of deep feature extraction hierarchies in the feature extraction unit are concatenated for face recognition.
A system for face recognition, comprising:

a memory that stores executable components; and

a processor electrically coupled to the memory to execute the executable components to:

configure a plurality of deep feature extraction hierarchies such that each of the deep feature extraction hierarchies contains a plurality of cascaded convolution layers, local-connection layers, pooling layers and full-connection layers, and neurons in the full-connection layers are only connected to a part of neurons in a previous layer thereof, while neurons in the convolution layers and the local-connection layers are only connected to a part of neurons in local regions in a previous layer thereof;

train the configured deep feature extraction hierarchies;

extract features from input face images with the trained deep feature extraction hierarchies; and

recognize faces based on features extracted from each input face image by the feature extraction unit.
A system of claim 21, wherein the processor is further configured to execute the executable components for training the configured deep feature extraction hierarchies by adding supervisory signals on them so as to learn neural connections in the convolution layers, the local-connection layers, and the full-connection layers, and to adjust neural weights in these layers.
A system of claim 22, wherein the training further comprises:

learning iteratively neural connections in the convolution layers, the local-connection layers, and the full-connection layers and neural weights on the neural connections.