WO2017070858A1 - A method and a system for face recognition - Google Patents

A method and a system for face recognition

Info

Publication number
WO2017070858A1
Authority
WO
WIPO (PCT)
Prior art keywords
layers
neurons
neural
connections
local
Prior art date
Application number
PCT/CN2015/093031
Other languages
French (fr)
Inventor
Yi Sun
Xiaogang Wang
Xiaoou Tang
Original Assignee
Beijing Sensetime Technology Development Co., Ltd
Shenzhen Sensetime Technology Co., Ltd
Sensetime Group Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co., Ltd, Shenzhen Sensetime Technology Co., Ltd, Sensetime Group Limited
Priority to CN201580085498.9A (patent CN108496174B)
Priority to PCT/CN2015/093031 (patent WO2017070858A1)
Publication of WO2017070858A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • the present application relates to a method for face recognition and a system thereof.
  • the number of parameters in a deep neural network is restricted by the amount of training data. Weight sparsifying algorithms help to reduce model parameters and improve the generalization ability of deep models.
  • GoogLeNet reduced neural connections by using very small convolution kernels of sizes 1x1, 3x3, and 5x5.
  • the Hebbian rule of “neurons that fire together wire together” suggests that connections between strongly correlated neurons are more important than those between weakly correlated neurons. Moreover, neurons in the previous layer which are more correlated (either positively or negatively) to a given neuron in the current layer are more helpful to predict the activities of the latter.
  • the apparatus may comprise an extractor having a plurality of deep neural networks with sparsified neural connections to extract facial features from multiple face regions of face images for face recognition; and a recognizer being electronically communicated with the extractor and recognizing face identities of the input face images based on the extracted facial features.
  • a baseline deep neural network is first trained, and then neural connections are pruned layer by layer, from the last layer back to the earlier layers; each time, only one additional layer is sparsified and the entire model is re-trained.
  • the previously trained models are used to calculate the neural correlations and initialize the subsequent sparser models.
  • the baseline deep neural network is similar to VGG net, with every two convolutional layers followed by one max-pooling layer.
  • One major difference is that the last two convolutional layers are replaced by two locally-connected layers.
  • the aim is to learn different features in different face regions, since a face is a structured object, and local connections increase the model fitting ability.
  • the second locally-connected layer is followed by a multi-dimensional fully-connected layer.
  • the feature representation in the fully-connected layer is used for the subsequent face recognition.
  • connections in the baseline model are deleted in a layer-wise fashion, from the last fully-connected layer to the previous locally-connected and convolutional layers.
  • Let N_0 denote a well-trained baseline model.
  • a new model N_m is re-trained, initialized by its previous model N_{m-1}. Therefore, a sequence of models {N_1, ..., N_M} with fewer and fewer connections is trained, and N_M is the final sparse ConvNet obtained.
  • the previously learned model is used to calculate the neural correlations and guide the connection dropping procedure.
  • the weights learned by the denser model N_{m-1} are also a good initialization of the sparser model N_m to be further trained.
  • a trainer may be electronically communicated with the extractor to add supervisory signals on the deep neural networks during training so as to learn sparse structures in convolution, local-connection, and full-connection layers, as well as adjusting neural weights in these layers.
  • joint identification-verification supervisory signal is added to the last fully-connected layer.
  • the same supervisory signal is also added to a few previous layers to enhance the supervision in previous feature learning stages.
  • the supervisory signals comprise one identification supervisory signal and one verification supervisory signal, wherein the identification supervisory signal is generated by classifying features in any of the layers extracted from an input face region into one of N identities in a training dataset, and taking a classification error as the supervisory signal, and wherein the verification signal is generated by comparing features in any of the layers extracted from two input face images respectively for determining if they are from the same person, and taking a verification error as the supervisory signal.
  • Neural weights and neural connections are updated alternately and iteratively. Firstly, neural connections are fixed while neural weights are adjusted by back-propagating supervisory signals through the deep neural networks. These supervisory signals are aggregated to adjust neural weights in each of convolution, local-connection, and full-connection layers during training. Then neural weights are fixed while neural connections are pruned according to correlations between neural activations of connected neurons. The majority of connections between weakly correlated neurons are pruned. Given the sparser deep model, neural weights are updated again by fixing neural connections, and so forth.
  • a method for face recognition comprising: configuring a plurality of deep feature extraction hierarchies such that each of the deep feature extraction hierarchies contains a plurality of cascaded convolution layers, local-connection layers, pooling layers and full-connection layers, and neurons in the full-connection layers are only connected to a part of neurons in a previous layer thereof, while neurons in the convolution layers and the local-connection layers are only connected to a part of neurons in local regions in a previous layer thereof; training the configured deep feature extraction hierarchies to learn neural connections in the convolution layers, the local-connection layers, and the full-connection layers, and to adjust neural weights in these layers; extracting features from input face images by the trained deep feature extraction hierarchies; and recognizing face identities of the input face images based on the extracted facial features.
  • Fig. 1 is a schematic diagram illustrating an apparatus for face recognition consistent with some disclosed embodiments.
  • Fig. 2 is a schematic diagram illustrating an apparatus for face recognition when it is implemented in software, consistent with some disclosed embodiments.
  • Fig. 3 is a schematic diagram illustrating an example of deep neural networks with sparsified layers in the extractor as shown in Fig. 1.
  • Fig. 4 is a schematic flowchart illustrating the trainer as shown in Fig. 1 consistent with some disclosed embodiments.
  • Fig. 5 is a schematic flowchart illustrating the extractor as shown in Fig. 1 consistent with some disclosed embodiments.
  • Fig. 6 is a schematic flowchart illustrating the recognizer as shown in Fig. 1 consistent with some disclosed embodiments.
  • the present invention may be embodied as a system, method or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium.
  • the apparatus 1000 may include a general purpose computer, a computer cluster, a mainframe computer, a computing device dedicated for providing online contents, or a computer network comprising a group of computers operating in a centralized or distributed fashion.
  • the apparatus 1000 may include one or more processors (processors 102, 104, 106, etc.), a memory 112, a storage device 116, a communication interface 114, and a bus to facilitate information exchange among various components of apparatus 1000.
  • Processors 102-106 may include a central processing unit (“CPU”), a graphics processing unit (“GPU”), or other suitable information processing devices.
  • processors 102-106 can include one or more printed circuit boards, and/or one or more microprocessor chips. Processors 102-106 can execute sequences of computer program instructions to perform various methods or run the modules that will be explained in greater detail below.
  • Memory 112 can include, among other things, a random access memory (“RAM”) and a read-only memory (“ROM”).
  • Computer program instructions can be stored, accessed, and read from memory 112 for execution by one or more of processors 102-106.
  • memory 112 may store one or more software applications.
  • memory 112 may store an entire software application or only a part of a software application that is executable by one or more of processors 102-106 to carry out the functions as disclosed below for the apparatus 1000. It is noted that although only one block is shown in Fig. 1, memory 112 may include multiple physical devices installed on a central computing device or on different computing devices.
  • the apparatus 1000 may comprise an extractor 10 and a recognizer 20.
  • the extractor 10 is configured with a plurality of deep neural networks with sparsified neural connections (referred to as sparse deep neural networks) to extract facial features from face regions of input face images.
  • the recognizer 20 is electronically communicated with the extractor 10 and recognizes face identities of the input face images based on the extracted facial features.
  • each of the sparse deep neural networks comprises a plurality of sparse convolutional layers, sparse local-connection layers, sparse full-connection layers, and pooling layers.
  • a first one of the sparse convolutional layers extracts local facial features from input face images, and the subsequent sparse convolutional layers and sparse local-connection layers extract further local features from the extracted features outputted from a previous layer.
  • Each of the sparse full-connection layers extracts global features from the extracted features outputted from a previous layer.
  • Each of the pooling layers receives features from a previous layer and reduces dimensions of the received features. The features obtained from all the sparse deep neural networks are concatenated into a feature vector as said facial features for face recognition.
  • the apparatus 1000 may further comprise a trainer 30 used to learn sparse neural connections (referred to as sparse structures) as well as weights on sparse connections of the sparse deep neural networks.
  • Fig. 5 is a schematic flowchart illustrating the feature extraction process 50 in the extractor 10, which contains three steps.
  • In step S501, the extractor 10 forward propagates face regions of an input face image through deep neural networks with sparsified connections (referred to as sparse deep neural networks).
  • In step S502, the extractor 10 takes neural activations in the last layers of the sparse deep neural networks as facial features.
  • In step S503, it concatenates the facial features of all sparse deep neural networks.
  • the sparse deep neural network in the extractor 10 starts by extracting local facial features, with every two sparse convolutional layers followed by one pooling layer.
  • the last pooling layer (pooling layer 12) is followed by two sparse local-connection layers to further extract local facial features and one sparse full-connection layer to extract global facial features.
  • In the sparse deep neural network, neurons in the full-connection layers are only connected to a part of neurons in a previous layer thereof, while neurons in the convolution layers and the local-connection layers are only connected to a part of neurons in local regions in a previous layer thereof. This is why these layers are called "sparse" layers.
  • N_M is the final sparse deep neural network (also referred to as sparse ConvNet since the deep neural network contains convolutional layers) obtained.
  • these sparse layers help to improve the generalization ability of the learned features, i.e., features learned on the training face images can be well generalized to test face images to distinguish the test face images well by their identities.
  • the sparse layers reduce the sizes (numbers of parameters) of the neural networks, making them easier to store on mobile phones or other devices with limited memory.
  • Fig. 3 illustrates an example of sparse deep neural networks in the extractor 10 according to one embodiment of the present application.
  • the extractor 10 contains a plurality of sparse deep neural networks.
  • Each of the sparse deep neural networks may comprise a plurality of sparse convolution-pooling modules 301, 302, 303 and 304, and comprises a connection module 305, as illustrated in Fig. 3. It will be appreciated that there may be more or fewer sparse convolution-pooling modules as required, although Fig. 3 illustrates 4 sparse convolution-pooling modules as an example.
  • each of the sparse convolution-pooling modules 301, 302, 303 and 304 is a cascade of two sparse convolutional layers and a pooling layer.
  • the convolution-pooling module 301 may comprise a sparse convolution layer 1, a sparse convolution layer 2 and a pooling layer 3, which are cascaded sequentially. All the sparse convolution-pooling modules 301, 302, 303 and 304 are cascaded sequentially, and are then cascaded to the connection module 305, which is further configured with two sparse local-connection layers 13 and 14, and a sparse full-connection layer 15.
  • local-connection layers 13 and 14 help to extract more diverse features, which have proved helpful in later feature extraction stages in deep neural networks. Neural activations in sparse full-connection layer 15 are used as facial features for face recognition.
  • Sparse convolutional layers, sparse local-connection layers, and sparse full-connection layers are convolutional layers, local-connection layers, and full-connection layers with sparsified neural connections, respectively.
  • the degree of sparsity S (0 < S < 1)
  • the present application samples Sy
  • Convolutional layers are configured to extract local facial features from input feature maps (which are the output feature maps of a previous layer) to form output feature maps of the current layer.
  • each convolutional layer performs convolution operations on the input feature maps to form output feature maps of the current layer, and the formed output feature maps will be input to a next layer.
  • Each feature map is a certain kind of feature organized in 2D.
  • Features in the same output feature map are extracted from input feature maps with the same set of neural connection weights.
  • the convolution operation in each convolutional layer may be expressed as
  • x_i and y_j are the i-th input feature map and the j-th output feature map, respectively;
  • k_ij is the convolution kernel between the i-th input feature map and the j-th output feature map
  • b_j is the bias of the j-th output feature map
  • Neural weights in convolutional layers are parameters in convolution kernels k_ij.
  • a portion of kernel parameters are sampled, according to the degree of sparsity S.
  • a sampled parameter corresponds to a set of neural connections which share the same parameter. These neural connections are reserved in sparse convolutional layers. Other neural connections which take the unsampled kernel parameters as weights are pruned from a sparse convolutional layer.
  • Local-connection layers are also configured to extract local facial features from input feature maps (which are the output feature maps of a previous layer) to form output feature maps of the current layer. Unlike convolutional layers, local-connection layers do not share neural weights across neurons on the same output feature map. The operation in each local-connection layer may be expressed as
  • x_ir denotes the neural activations in a local region r in the i-th feature map of a previous layer.
  • y_jr is the r-th (single) neural activation in the j-th output feature map of the current layer.
  • k_ijr denotes the neural weights on local connections between y_jr and x_ir.
  • b_j is the bias of the j-th output feature map.
  • a portion of neural weights (i.e., k_ijr for all i, j, and r)
  • a sampled neural weight corresponds to a single neural connection since weights are unshared between different neural connections. These sampled neural connections are reserved in sparse local-connection layers. Other unsampled neural connections are pruned from sparse local-connection layers.
  • the goal of cascading (sparse) convolutional layers and (sparse) local-connection layers is to extract hierarchical local features (i.e., features extracted from local regions of the input images or the input features), wherein features extracted by higher convolutional/local-connection layers have larger effective receptive fields on input images and more complex non-linearity.
  • Pooling layers are configured to pool local facial features from input feature maps from a previous layer to form output feature maps of the current layer. Pooling operation forms more invariant features, which is formulated as
  • each neuron in the i-th output feature map y_i pools over an M×N local region in the i-th input feature map x_i, with s as the step size.
  • Full-connection layers in deep neural networks are configured to extract global features (features extracted from the entire region of input feature maps) from a previous layer. Full-connection layers also serve as interfaces for receiving supervisory signals during training, which will be discussed later. Like pooling layers, full-connection layers also reduce feature dimensions, in their case by restricting the number of neurons in them.
  • a full-connection layer is formulated as
  • x denotes neural activations from a previous layer
  • y denotes neural activations in the current full-connection layer
  • w denotes neural weights on connections between the current full-connection layer and a previous layer. Neurons in full-connection layers linearly combine neural activations of all neurons in a previous layer, followed by ReLU non-linearity. In sparse full-connection layers, a portion of neural weights (i.e., w_{i,j} for all i and j) are sampled, according to the degree of sparsity S. A sampled neural weight corresponds to a single neural connection since weights are unshared between different neural connections. These sampled neural connections are reserved in sparse full-connection layers. Other unsampled neural connections are pruned from sparse full-connection layers.
  • Neural activations of neurons in the highest layer of sparse deep neural networks are used as facial features for face recognition. These facial features are global and can capture highly non-linear mappings from input face images to their identities.
  • neural activations of neurons in sparse full-connection layer 15 of the sparse deep neural network shown in Fig. 3 are used as facial features for face recognition.
  • An extractor contains a plurality of sparse deep neural networks. Facial features extracted by all sparse deep neural networks are concatenated into a long feature vector as a final feature representation for face recognition.
  • the recognizer 20 operates to calculate distances between facial features of different face images extracted by the extractor 10 to determine if two face images are from the same identity for face verification, or to determine if one of the input images, as a probe face image, belongs to the same identity as one of the gallery face images consisting of the input images for face identification.
  • Fig. 6 is a schematic flowchart illustrating the recognition process 60 in the recognizer 20. In step S601, the recognizer 20 calculates distances between facial features extracted from different face images by the extractor 10.
  • In step S602, the recognizer 20 determines if two face images are from the same identity for face verification, or, alternatively, in step S603, it determines whether one of the input images, as a probe face image, belongs to the same identity as one of the gallery face images consisting of the input images for face identification.
  • two face images are determined to belong to the same identity if their feature distance is smaller than a threshold, or the probe face image is determined to belong to the same identity as one of gallery face images if their feature distance is the smallest compared to feature distances of the probe face image to all the other gallery face images, wherein feature distances determined by the recognizer 20 could be Euclidean distances, Joint Bayesian distances, cosine distances, Hamming distances, or any other distances.
  • The Trainer 30
  • the trainer 30 is used to learn the sparse structures (i.e., neural connections) of the sparse deep neural networks as well as neural weights on connections of the sparse deep neural networks in the extractor 10.
  • the trainer 30 first trains an initial dense neural network N_0 with network structure T_0.
  • the initial structure T_0 could be the one shown in Fig. 3 with the sparse convolutional layers, the sparse local-connection layers, and the sparse full-connection layers replaced by conventional convolutional layers, conventional local-connection layers, and conventional full-connection layers, respectively.
  • the trainer 30 iteratively prunes (sparsifies) the neural connections as illustrated in step S402 and learns the neural weights on reserved neural connections as illustrated in step S403.
  • the trainer 30 first in step S402 prunes the neural connections in layer L_m according to the neural correlations in network N_{m-1} and the pre-specified sparsity degree S_m.
  • Let T_m be the sparser structure after pruning.
  • the trainer 30 then in step S403 trains a sparser network N_m with structure T_m, wherein weights on connections of network N_m are initialized by those of network N_{m-1}.
  • the present application samples S m ⁇
  • the number of connections is proportional to the number of weights for all types of sparse layers (including sparse convolutional layers, sparse local-connection layers, and sparse full-connection layers) in the sparse deep neural network.
  • the sampling is based on neural correlations: connections (and the corresponding weights) between highly correlated neurons are kept, while connections between weakly correlated neurons are dropped.
  • neurons in one layer which have stronger correlations to neurons in the upper layer have stronger predictive power for the activities of the latter.
  • neurons with strong negative correlations are also useful in predicting neural activations. If a neuron is viewed as a detector of a certain visual pattern, its positively correlated neurons in the lower layer provide evidence on the visual pattern, while its negatively correlated neurons help to reduce false alarms. In practice, the present application may also keep a small portion of connections between weakly correlated neurons because predictions from weakly correlated neurons are complementary to those from highly correlated neurons.
  • weights and connections are one-to-one mapped in these layers.
  • the negative coefficients are processed in a similar way, except that the absolute values of the coefficients are considered and more coefficients (and the corresponding connections/weights) of higher absolute values are kept.
  • the set of correlation coefficients between neurons with shared connecting weights is jointly considered to determine whether a weight (or a set of connections with shared weights) should be reserved or deleted.
  • the set of K neurons b_mk is determined by the position m.
  • There are a total of M neurons in the i-th output feature map, denoted a_im for m = 1, 2, ..., M.
  • the total sampled weights are SKN.
  • identification and verification supervisory signals in the trainer 30 are simultaneously added to each of the supervised layers (e.g., sparse full-connection layer 15 in Fig. 3) of each of the sparse deep neural networks in the extractor 10, and respectively back-propagated to the input face image, so as to update neural weights on reserved neural connections of sparse convolutional layers, sparse local-connection layers, and sparse full-connection layers of the sparse deep neural networks.
  • the identification supervisory signals are generated in the trainer 30 by classifying all of the supervised layer (layers selected for supervision, e.g., sparse full-connection layer 15 in Fig. 3) representations into one of N identities, wherein the classification errors are used as the identification supervisory signals.
  • the verification supervisory signals in the trainer 30 are generated by verifying the supervised layer representations of two compared face images, respectively, in each of the feature extraction modules, to determine if the two compared face images belong to the same identity, wherein the verification errors are used as the verification supervisory signals.
  • Given a pair of training face images, the extractor 10 extracts two feature vectors f_i and f_j from the two face images respectively in each of the feature extraction modules.
  • the verification error takes one form if f_i and f_j are features of face images of the same identity and another if they are features of face images of different identities; it is defined in terms of the Euclidean distance between the two feature vectors and a positive constant m. There are errors if f_i and f_j are dissimilar for the same identity, or if f_i and f_j are similar for different identities.
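The verification error described above has the general shape of a contrastive, margin-based loss. The sketch below assumes that standard form: half the squared Euclidean distance for a same-identity pair, and half the squared shortfall below the margin m for a different-identity pair. The exact expression is not reproduced in this text, so the 1/2 factors, the default margin, and the function name `verification_error` are assumptions made for illustration.

```python
import numpy as np

def verification_error(f_i, f_j, same_identity, m=1.0):
    """Contrastive-style verification error for one feature pair (assumed form).

    f_i, f_j:      feature vectors of the two compared face images.
    same_identity: True if both images show the same person.
    m:             positive margin constant (default value is hypothetical).
    """
    diff = np.asarray(f_i, dtype=float) - np.asarray(f_j, dtype=float)
    d = np.linalg.norm(diff)  # Euclidean distance between the features
    if same_identity:
        # Same identity: any distance between the features is an error.
        return 0.5 * d ** 2
    # Different identities: error only if the features are closer than m.
    return 0.5 * max(0.0, m - d) ** 2
```

Under this form, a same-identity pair is penalized for being far apart, and a different-identity pair is penalized only while its distance is still below m, which matches the two error cases named in the text.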

Abstract

Disclosed is an apparatus for face recognition, comprising: a feature extraction unit configured to extract features from input face images with a plurality of deep feature extraction hierarchies; and a recognition unit configured to calculate distances between facial features of different face images extracted by the extractor to determine if two face images are from the same identity for face verification or determine if one of the input images, as a probe face image, belongs to the same identity as one of the gallery face images consisting of the input images for face identification, wherein each of the deep feature extraction hierarchies contains a plurality of cascaded convolution layers, local-connection layers, pooling layers, and full-connection layers, and wherein neurons in the full-connection layers are only connected to a part of neurons in a previous layer thereof, while neurons in the convolution layers and the local-connection layers are only connected to a part of neurons in local regions in a previous layer thereof.

Description

A METHOD AND A SYSTEM FOR FACE RECOGNITION
Technical Field
 The present application relates to a method for face recognition and a system thereof.
Background
 The number of parameters in a deep neural network is restricted by the amount of training data. Weight sparsifying algorithms help to reduce model parameters and improve the generalization ability of deep models.
 The idea of reducing neural connections was adopted in the design of GoogLeNet, which achieved great success on the ImageNet challenge. GoogLeNet reduced neural connections by using very small convolution kernels of sizes 1x1, 3x3, and 5x5.
 The Hebbian rule of “neurons that fire together wire together” suggests that connections between strongly correlated neurons are more important than those between weakly correlated neurons. Moreover, neurons in the previous layer which are more correlated (either positively or negatively) to a given neuron in the current layer are more helpful to predict the activities of the latter.
 Removing unimportant parameters in deep neural networks was studied by LeCun et al. in their seminal work Optimal Brain Damage. They used a criterion based on second derivatives for removing parameters. They reduced model parameters by a factor of eight without loss of the prediction ability of the original model.
Summary
 In one aspect of the present application, disclosed is an apparatus for face recognition. The apparatus may comprise an extractor having a plurality of deep  neural networks with sparsified neural connections to extract facial features from multiple face regions of face images for face recognition; and a recognizer being electronically communicated with the extractor and recognizing face identities of the input face images based on the extracted facial features.
 Guided by the Hebbian rule that “neurons that fire together wire together”, more neural connections between weakly correlated neurons than between strongly correlated neurons are pruned, wherein the correlation between two connected neurons is defined by the magnitude of the correlation between their neural activations.
 In one embodiment of the present application, a baseline deep neural network is first trained, and then neural connections are pruned layer by layer, from the last layer to the previous layers. Each time, only one additional layer is sparsified and the entire model is re-trained. The previously trained models are used to calculate the neural correlations and initialize the subsequent sparser models.
 In one embodiment of the present application, the baseline deep neural network is similar to VGG net, with every two convolutional layers followed by one max-pooling layer. One major difference is that the last two convolutional layers are replaced by two locally-connected layers. The aim is to learn different features in different face regions, since a face is a structured object, and local connections increase the model fitting ability. The second locally-connected layer is followed by a multi-dimensional fully-connected layer. The feature representation in the fully-connected layer is used for the following face recognition.
 In one embodiment of the present application, connections in the baseline model are deleted in a layer-wise fashion, from the last fully-connected layer to the previous locally-connected and convolutional layers. Let N0 denote a well-trained baseline model. When a layer Lm is sparsified, a new model Nm is re-trained, initialized by its previous model Nm-1. Therefore, a sequence of models {N1, ..., NM} with fewer and fewer connections is trained, and NM is the final sparse ConvNet obtained. During the whole training process, the previously learned model is used to calculate the neural correlations and guide the connection dropping procedure. The weights learned by the denser model Nm-1 are also a good initialization of the sparser model Nm to be further trained.
 In some embodiments, a trainer may be electronically communicated with the extractor to add supervisory signals on the deep neural networks during training so as to learn sparse structures in convolution, local-connection, and full-connection layers, as well as to adjust neural weights in these layers.
 In one embodiment of the present application, a joint identification-verification supervisory signal is added to the last fully-connected layer. The same supervisory signal is also added to a few previous layers to enhance the supervision in previous feature learning stages. The supervisory signals comprise one identification supervisory signal and one verification supervisory signal, wherein the identification supervisory signal is generated by classifying features in any of the layers extracted from an input face region into one of N identities in a training dataset, and taking a classification error as the supervisory signal, and wherein the verification signal is generated by comparing features in any of the layers extracted from two input face images respectively for determining if they are from the same person, and taking a verification error as the supervisory signal.
 Neural weights and neural connections are updated alternately and iteratively. Firstly, neural connections are fixed while neural weights are adjusted by back-propagating supervisory signals through the deep neural networks. These supervisory signals are aggregated to adjust neural weights in each of the convolution, local-connection, and full-connection layers during training. Then neural weights are fixed while neural connections are pruned according to correlations between neural activations of connected neurons. The majority of connections between weakly correlated neurons are pruned. Given the sparser deep model, neural weights are updated again with the neural connections fixed, and so forth.
 In a further aspect of the present application, disclosed is a method for face recognition, comprising: configuring a plurality of deep feature extraction hierarchies such that each of the deep feature extraction hierarchies contains a plurality of cascaded convolution layers, local-connection layers, pooling layers and full-connection layers, and neurons in the full-connection layers are only connected to a part of neurons in a previous layer thereof, while neurons in the convolution layers and the local-connection layers are only connected to a part of neurons in local regions in a previous layer thereof; training the configured deep feature extraction hierarchies to learn neural connections in the convolution layers, the local-connection layers, and the full-connection layers, and to adjust neural weights in these layers; extracting features from input face images by the trained deep feature extraction hierarchies; and recognizing face identities of the input face images based on the extracted facial features.
Brief Description of the Drawing
 Exemplary non-limiting embodiments of the present invention are described below with reference to the attached drawings. The drawings are illustrative and generally not to an exact scale. The same or similar elements on different figures are referenced with the same reference numbers.
 Fig. 1 is a schematic diagram illustrating an apparatus for face recognition consistent with some disclosed embodiments.
 Fig. 2 is a schematic diagram illustrating an apparatus for face recognition when it is implemented in software, consistent with some disclosed embodiments.
 Fig. 3 is a schematic diagram illustrating an example of deep neural networks with sparsified layers in the extractor as shown in Fig. 1.
 Fig. 4 is a schematic flowchart illustrating the trainer as shown in Fig. 1 consistent with some disclosed embodiments.
 Fig. 5 is a schematic flowchart illustrating the extractor as shown in Fig. 1 consistent with some disclosed embodiments.
 Fig. 6 is a schematic flowchart illustrating the recognizer as shown in Fig. 1 consistent with some disclosed embodiments.
Detailed Description
 Reference will now be made in detail to some specific embodiments of the invention including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. The present invention may be practiced without some or all of these specific details. In other instances, well-known process operations have not been described in detail in order not to unnecessarily obscure the present invention.
 As will be appreciated by one skilled in the art, the present invention may be embodied as a system, method or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer-usable program code embodied in the medium.
 In the case that the apparatus 1000 as disclosed below is implemented with software, the apparatus 1000 may include a general purpose computer, a computer cluster, a mainframe computer, a computing device dedicated for providing online contents, or a computer network comprising a group of computers operating in a centralized or distributed fashion. As shown in Fig. 2, the apparatus 1000 may include one or more processors (processors 102, 104, 106, etc.), a memory 112, a storage device 116, a communication interface 114, and a bus to facilitate information exchange among various components of apparatus 1000. Processors 102-106 may include a central processing unit (“CPU”), a graphic processing unit (“GPU”), or other suitable information processing devices. Depending on the type of hardware being used, processors 102-106 can include one or more printed circuit boards, and/or one or more microprocessor chips. Processors 102-106 can execute sequences of computer program instructions to perform various methods or run the modules that will be explained in greater detail below.
 Memory 112 can include, among other things, a random access memory (“RAM”) and a read-only memory (“ROM”). Computer program instructions can be stored, accessed, and read from memory 112 for execution by one or more of processors 102-106. For example, memory 112 may store one or more software applications. Further, memory 112 may store an entire software application or only a part of a software application that is executable by one or more of processors 102-106 to carry out the functions as disclosed below for the apparatus 1000. It is noted that although only one block is shown in Fig. 2, memory 112 may include multiple physical devices installed on a central computing device or on different computing devices.
 Referring to Fig. 1 again, where the apparatus 1000 is implemented in hardware, it may comprise an extractor 10 and a recognizer 20. The extractor 10 is configured with a plurality of deep neural networks with sparsified neural connections (referred to as sparse deep neural networks) to extract facial features from face regions of input face images. The recognizer 20 is electronically communicated with the extractor 10 and recognizes face identities of the input face images based on the extracted facial features. As will be discussed in detail below, each of the sparse deep neural networks comprises a plurality of sparse convolutional layers, sparse local-connection layers, sparse full-connection layers, and pooling layers. A first one of the sparse convolutional layers extracts local facial features from input face images, and the subsequent sparse convolutional layers and sparse local-connection layers extract further local features from the extracted features outputted from a previous layer. Each of the sparse full-connection layers extracts global features from the extracted features outputted from a previous layer. Each of the pooling layers receives features from a previous layer and reduces dimensions of the received features. The features obtained from all the sparse deep neural networks are concatenated into a feature vector as said facial features for face recognition.
 In addition, the apparatus 1000 may further comprise a trainer 30 used to learn sparse neural connections (referred to as sparse structures) as well as weights on sparse connections of the sparse deep neural networks.
The Extractor 10
 Fig. 5 is a schematic flowchart illustrating the feature extraction process 50 in the extractor 10, which contains three steps. In step S501, the extractor 10 forward propagates face regions of an input face image through deep neural networks with  sparsified connections (referred to as sparse deep neural networks) . Then in step S502, the extractor 10 takes neural activations in last layers of sparse deep neural networks as facial features. Finally in step S503, it concatenates facial features of all sparse deep neural networks.
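 The three steps S501 to S503 can be sketched as follows. This is only an illustrative sketch: it assumes one sparse deep neural network per face region, and it models each network as a plain Python callable that returns the activations of its last layer. The function name `extract_features` is hypothetical and does not appear in the present application.

```python
def extract_features(regions, networks):
    """Concatenate last-layer activations of several networks.

    regions  -- list of face regions (any objects the networks accept)
    networks -- list of callables, one per region (assumption), each
                returning a list of last-layer activations
    """
    features = []
    for region, net in zip(regions, networks):
        # S501: forward propagate the region; S502: take last-layer activations
        activations = net(region)
        # S503: concatenate into one long feature vector
        features.extend(activations)
    return features
```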
 Given an input face image, the sparse deep neural network in the extractor 10 starts by extracting local facial features, with every two sparse convolutional layers followed by one pooling layer. The last pooling layer (pooling layer 12) is followed by two sparse local-connection layers to further extract local facial features and one sparse full-connection layer to extract global facial features. In particular, in the sparse deep neural network, neurons in the full-connection layers are only connected to a part of the neurons in a previous layer thereof, while neurons in the convolution layers and the local-connection layers are only connected to a part of the neurons in local regions in a previous layer thereof. This is why these layers are called "sparse" layers.
 As will be discussed later in reference to the trainer 30, the sparse deep neural network in the extractor 10 shall be trained. In one embodiment of the present application, connections in the baseline model are deleted in a layer-wise fashion, from the last fully-connected layer to the previous locally-connected and convolutional layers. Let N0 denote a well-trained baseline model. When a layer Lm is sparsified, a new model Nm is re-trained, initialized by its previous model Nm-1. Therefore, a sequence of models {N1, ..., NM} with fewer and fewer connections is trained, and NM is the final sparse deep neural network (also referred to as sparse ConvNet since the deep neural network contains convolutional layers) obtained. During the whole training process, the previously learned model is used to calculate the neural correlations and guide the connection dropping procedure. The weights learned by the denser model Nm-1 are also a good initialization of the sparser model Nm to be further trained.
 With the above constructions, by reducing model/layer parameters (i.e., neural weights on connections) of the original non-sparse layers, these sparse layers help to improve the generalization ability of the learned features, i.e., features learned on the training face images generalize well to test face images and distinguish the test face images well by their identities. In addition, the sparse layers reduce the sizes (numbers of parameters) of the neural networks, making them easier to store on mobile phones or other devices with limited memory.
 Fig. 3 illustrates an example of sparse deep neural networks in the extractor 10 according to one embodiment of the present application. The extractor 10 contains a plurality of sparse deep neural networks. Each of the sparse deep neural networks may comprise a plurality of sparse convolution-pooling modules 301, 302, 303 and 304, and comprises a connection module 305, as illustrated in Fig. 3. It will be appreciated that there may be more or fewer sparse convolution-pooling modules as required, although Fig. 3 illustrates 4 sparse convolution-pooling modules as an example.
 As shown, each of the sparse convolution-pooling modules 301, 302, 303 and 304 is a cascade of two sparse convolutional layers and a pooling layer. For example, the convolution-pooling module 301 may comprise a sparse convolution layer 1, a sparse convolution layer 2 and a pooling layer 3, which are cascaded sequentially. All the sparse convolution-pooling modules 301, 302, 303 and 304 are cascaded sequentially, and then are cascaded to the connection module 305, which is further configured with two sparse local-connection layers 13 and 14, and a sparse full-connection layer 15. Compared to convolutional layers, local-connection layers 13 and 14 help to extract more diverse features, which have proved to be helpful in later feature extraction stages in deep neural networks. Neural activations in sparse full-connection layer 15 are used as facial features for face recognition.
 Sparse convolutional layers, sparse local-connection layers, and sparse full-connection layers are convolutional layers, local-connection layers, and full-connection layers with sparsified neural connections, respectively. Given the degree of sparsity S (0 < S < 1), the present application samples S·|W| weights from the total number of weights |W| in a given sparse layer. Neural connections which correspond to the sampled weights are reserved; the others are pruned from the current sparse deep neural network. The number of connections is proportional to the number of weights for all types of sparse layers in sparse deep neural networks.
 Convolutional layers are configured to extract local facial features from input feature maps (which are the output feature maps of a previous layer) to form output feature maps of the current layer. In particular, each convolutional layer performs convolution operations on the input feature maps to form output feature maps of the current layer, and the formed output feature maps will be input to a next layer.
 Each feature map is a certain kind of features organized in 2D. Features in the same output feature map are extracted from input feature maps with the same set of neural connection weights. The convolution operation in each convolutional layer may be expressed as
$$y_j = \max\Big(0,\; b_j + \sum_i k_{ij} * x_i\Big)$$
where
$x_i$ and $y_j$ are the i-th input feature map and the j-th output feature map, respectively;
$k_{ij}$ is the convolution kernel between the i-th input feature map and the j-th output feature map;
$*$ denotes convolution;
$b_j$ is the bias of the j-th output feature map; and
the ReLU nonlinearity $y = \max(0, \cdot)$ is used for neurons.
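 As a minimal sketch of this operation, assuming it takes the form y_j = max(0, b_j + Σ_i k_ij * x_i) implied by the definitions above, the following pure-Python function computes the output feature maps. Several details are assumptions made for illustration and are not taken from the present application: no padding, a stride of 1, un-flipped kernel indexing (cross-correlation, as is common in ConvNet implementations), and the optional `mask` argument, which marks pruned kernel parameters with 0.

```python
def conv_layer(x, k, b, mask=None):
    """Convolution followed by ReLU.

    x[i]       -- i-th input feature map, an H x W list of lists
    k[i][j]    -- kh x kw kernel from input map i to output map j
    b[j]       -- bias of output map j
    mask[i][j] -- optional 0/1 array shaped like k[i][j]; 0 = pruned
    """
    n_in, n_out = len(x), len(b)
    H, W = len(x[0]), len(x[0][0])
    kh, kw = len(k[0][0]), len(k[0][0][0])
    out = []
    for j in range(n_out):
        fmap = []
        for r in range(H - kh + 1):
            row = []
            for c in range(W - kw + 1):
                s = b[j]
                for i in range(n_in):
                    for u in range(kh):
                        for v in range(kw):
                            keep = 1 if mask is None else mask[i][j][u][v]
                            s += keep * k[i][j][u][v] * x[i][r + u][c + v]
                row.append(max(0.0, s))  # ReLU
            fmap.append(row)
        out.append(fmap)
    return out
```

 Because one kernel parameter is reused at every output position, zeroing a single mask entry removes a whole set of connections, which matches the description of sparse convolutional layers.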
 Neural weights in convolutional layers are parameters in convolution kernels kij. In sparse convolutional layers, a portion of kernel parameters are sampled,  according to the degree of sparsity S. A sampled parameter corresponds to a set of neural connections which share the same parameter. These neural connections are reserved in sparse convolutional layers. Other neural connections which take the unsampled kernel parameters as weights are pruned from a sparse convolutional layer.
 Local-connection layers are also configured to extract local facial features from input feature maps (which are the output feature maps of a previous layer) to form output feature maps of the current layer. Unlike convolutional layers, local-connection layers do not share neural weights across neurons on the same output feature map. The operation in each local-connection layer may be expressed as
$$y_{jr} = \max\Big(0,\; b_j + \sum_i k_{ijr} \cdot x_{ir}\Big)$$
where $x_{ir}$ denotes the neural activations in a local region r in the i-th feature map of a previous layer, $y_{jr}$ is the r-th (single) neural activation in the j-th output feature map of the current layer, $k_{ijr}$ denotes the neural weights on local connections between $y_{jr}$ and $x_{ir}$, $b_j$ is the bias of the j-th output feature map, and $y = \max(0, \cdot)$ is the ReLU nonlinearity. In sparse local-connection layers, a portion of neural weights (i.e., $k_{ijr}$ for all i, j, and r) are sampled, according to the degree of sparsity S. A sampled neural weight corresponds to a single neural connection since weights are unshared between different neural connections. These sampled neural connections are reserved in sparse local-connection layers. Other unsampled neural connections are pruned from sparse local-connection layers.
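 The difference from a convolutional layer can be illustrated with a short sketch, assuming the operation has the form y_jr = max(0, b_j + Σ_i k_ijr · x_ir) implied by the definitions above. Each local region r has its own weight vector, so a 0 in the hypothetical `mask` argument removes exactly one connection. The flattened-region data layout is an assumption chosen to keep the example short.

```python
def local_layer(x_regions, k, b, mask=None):
    """Local-connection layer with unshared weights, followed by ReLU.

    x_regions[i][r] -- flattened activations of local region r in input map i
    k[i][j][r]      -- weight vector for that region (same length)
    b[j]            -- bias of output map j
    mask[i][j][r]   -- optional 0/1 vector; 0 = pruned connection
    """
    n_in, n_out = len(x_regions), len(b)
    n_reg = len(x_regions[0])
    y = []
    for j in range(n_out):
        row = []
        for r in range(n_reg):
            s = b[j]
            for i in range(n_in):
                for t, xv in enumerate(x_regions[i][r]):
                    keep = 1 if mask is None else mask[i][j][r][t]
                    s += keep * k[i][j][r][t] * xv
            row.append(max(0.0, s))  # ReLU
        y.append(row)
    return y
```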
 The goal of cascading (sparse) convolutional layers and (sparse) local-connection layers is to extract hierarchical local features (i.e. features extracted from local regions of the input images or the input features) , wherein features extracted by higher convolutional/local-connection layers have larger effective receptive field on input images and more complex non-linearity.
 Pooling layers are configured to pool local facial features from input feature maps from a previous layer to form output feature maps of the current layer. The pooling operation forms more invariant features, and is formulated as
$$y_i(j, k) = \max_{0 \le m < M,\; 0 \le n < N} x_i(j \cdot s + m,\; k \cdot s + n)$$
where each neuron in the i-th output feature map $y_i$ pools over an M×N local region in the i-th input feature map $x_i$, with s as the step size.
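 A pooling step over one feature map can be sketched as below. The sketch assumes max-pooling (the baseline network described in the Summary uses max-pooling layers) and assumes that only windows lying entirely inside the input map are pooled. The function name `max_pool` is hypothetical.

```python
def max_pool(x, M, N, s):
    """Max-pool one feature map x (H x W list of lists) over M x N
    windows with step size s; windows must fit inside the map."""
    H, W = len(x), len(x[0])
    return [
        [
            max(x[j * s + m][k * s + n] for m in range(M) for n in range(N))
            for k in range((W - N) // s + 1)
        ]
        for j in range((H - M) // s + 1)
    ]
```

 With M = N = s the windows do not overlap and each spatial dimension shrinks by a factor of s, which is how pooling reduces the dimensions of the received features.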
Full-connection layers in deep neural networks are configured to extract global features (features extracted from the entire region of input feature maps) from a previous layer. Full-connection layers also serve as interfaces for receiving supervisory signals during training, which will be discussed later. Full-connection layers also have the function of feature dimension reduction as pooling layers by restricting the number of neurons in them. A full-connection layer is formulated as
$$y_j = \max\Big(0,\; \sum_i x_i \cdot w_{i,j}\Big)$$
where
x denotes neural activations from a previous layer,
y denotes neural activations in the current full-connection layer, and
w denotes neural weights on connections between the current full-connection layer and a previous layer.
Neurons in full-connection layers linearly combine neural activations of all neurons in a previous layer, followed by the ReLU non-linearity. In sparse full-connection layers, a portion of neural weights (i.e., $w_{i,j}$ for all i and j) are sampled, according to the degree of sparsity S. A sampled neural weight corresponds to a single neural connection since weights are unshared between different neural connections. These sampled neural connections are reserved in sparse full-connection layers. Other unsampled neural connections are pruned from sparse full-connection layers.
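A sparse full-connection layer can be sketched as an ordinary full-connection layer whose weight matrix is multiplied element-wise by a binary mask. This is one possible way to realize pruned connections and is not stated in the present application. The names `sparse_fc` and `mask` are hypothetical, and bias terms are omitted because the text above does not define one for this layer.

```python
def sparse_fc(x, w, mask):
    """Full-connection layer with pruned connections, followed by ReLU.

    x          -- activations of the previous layer
    w[i][j]    -- weight between input neuron i and output neuron j
    mask[i][j] -- 1 if the connection is reserved, 0 if pruned
    """
    n_out = len(w[0])
    return [
        max(0.0, sum(mask[i][j] * w[i][j] * x[i] for i in range(len(x))))
        for j in range(n_out)
    ]
```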
Neural activations of neurons in the highest layer of sparse deep neural networks are used as facial features for face recognition. These facial features are global and can capture highly non-linear mappings from input face images to their identities. In one embodiment of the present application, neural activations of neurons  in sparse full-connection layer 15 of the sparse deep neural network shown in Fig. 3 are used as facial features for face recognition. An extractor contains a plurality of sparse deep neural networks. Facial features extracted by all sparse deep neural networks are concatenated into a long feature vector as a final feature representation for face recognition.
The recognizer 20
The recognizer 20 operates to calculate distances between facial features of different face images extracted by the extractor 10, in order to determine whether two face images are from the same identity for face verification, or to determine whether one of the input images, as a probe face image, belongs to the same identity as one of the gallery face images consisting of the input images for face identification. Fig. 6 is a schematic flowchart illustrating the recognition process 60 in the recognizer 20. In step S601, the recognizer 20 calculates distances between facial features extracted from different face images by the extractor 10. Then in step S602, the recognizer 20 determines whether two face images are from the same identity for face verification, or, alternatively, in step S603, it determines whether one of the input images, as a probe face image, belongs to the same identity as one of the gallery face images consisting of the input images for face identification.
In the recognizer 20, two face images are determined to belong to the same identity if their feature distance is smaller than a threshold, or the probe face image is determined to belong to the same identity as one of gallery face images if their feature distance is the smallest compared to feature distances of the probe face image to all the other gallery face images, wherein feature distances determined by the recognizer 20 could be Euclidean distances, Joint Bayesian distances, cosine distances, Hamming distances, or any other distances.
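The two decision rules can be sketched as follows, using the Euclidean distance as one of the distances named above. The function names and the dictionary-based gallery are assumptions made for illustration only.

```python
import math


def euclidean(f, g):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)))


def verify(f, g, threshold):
    """Face verification: same identity if the distance is below a threshold."""
    return euclidean(f, g) < threshold


def identify(probe, gallery):
    """Face identification: return the gallery identity whose feature
    vector is closest to the probe. gallery maps identity -> features."""
    return min(gallery, key=lambda name: euclidean(probe, gallery[name]))
```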
The Trainer 30
The trainer 30 is used to learn the sparse structures (i.e., neural connections) of the sparse deep neural networks as well as neural weights on connections of the sparse deep neural networks in the extractor 10. As illustrated in Fig. 4, in step S401 the trainer 30 first trains an initial dense neural network N0 with network structure T0. For example, the initial structure T0 could be the one shown in Fig. 3 with the sparse convolutional layers, the sparse local-connection layers, and the sparse full-connection layers replaced by conventional convolutional layers, conventional local-connection layers, and conventional full-connection layers, respectively. Then, given a sequence of layers L1, L2, ..., LM to be sparsified and the corresponding pre-specified degrees of sparsity S1, S2, ..., SM, the trainer 30 iteratively prunes (sparsifies) the neural connections as illustrated in step S402 and learns the neural weights on reserved neural connections as illustrated in step S403. In the m-th iteration (for m = 1, 2, ..., M), the trainer 30 first in step S402 prunes the neural connections in layer Lm according to the neural correlations in network Nm-1 and the pre-specified sparsity degree Sm. Let Tm be the sparser structure after pruning. The trainer 30 then in step S403 trains a sparser network Nm with structure Tm, wherein weights on connections of network Nm are initialized by those of network Nm-1. After iterative connection pruning (sparsifying) and weight updating (m>=M, at step S404), the trainer 30 finally in step S405 outputs the sparsified and well-trained neural network NM.
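The control flow of steps S402 to S405 can be sketched as a simple loop. The sketch is deliberately abstract: `prune` and `train` are hypothetical callables supplied by the caller, standing in for correlation-based pruning and weight training, and the model and structure objects may be of any type.

```python
def sparsify_layerwise(n0, layers, sparsities, prune, train):
    """Iteratively prune one more layer and re-train.

    n0         -- trained dense baseline model N0 (result of S401)
    layers     -- layers L1..LM to sparsify, in order
    sparsities -- degrees of sparsity S1..SM
    prune      -- prune(model, layer, s) -> sparser structure Tm   (S402)
    train      -- train(structure, init=model) -> trained model Nm (S403)
    Returns the final model NM and the list [N0, N1, ..., NM].
    """
    model = n0
    history = [n0]
    for layer, s in zip(layers, sparsities):
        structure = prune(model, layer, s)     # uses correlations in N(m-1)
        model = train(structure, init=model)   # initialized from N(m-1)
        history.append(model)
    return model, history                      # S405: output NM
```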
Given a layer Lm to be pruned (sparsified) and the pre-specified degree of sparsity Sm (0 < Sm < 1) for the given layer, the present application samples Sm·|W| weights from the total number of weights |W| of layer Lm. The number of connections is proportional to the number of weights for all types of sparse layers (including sparse convolutional layers, sparse local-connection layers, and sparse full-connection layers) in the sparse deep neural network. The sampling is based on neural correlations: connections (and the corresponding weights) between highly correlated neurons are kept, while connections between weakly correlated neurons are dropped. This is because neurons in one layer which have stronger correlations to neurons in the upper layer have stronger predictive power for the activities of the latter. Note that neurons with strong negative correlations are also useful in predicting neural activations. If a neuron is viewed as a detector of a certain visual pattern, its positively correlated neurons in the lower layer provide evidence on the visual pattern, while its negatively correlated neurons help to reduce false alarms. In practice, the present application may also keep a small portion of connections between weakly correlated neurons, because predictions from weakly correlated neurons are complementary to those from highly correlated neurons.
First, the full-connection layers and local-connection layers, in which weights are not shared, are considered. Weights and connections are mapped one-to-one in these layers. Given a neuron $a_i$ in the current layer and its K connected neurons $b_{i1}, b_{i2}, \ldots, b_{iK}$ in the previous layer, the correlation coefficient between $a_i$ and each $b_{ik}$ for k = 1, 2, ..., K is (for simplicity, when the present application refers to a neuron, it also means its neural activations)
$$r_{ik} = \frac{E\big[(a_i - \mu_{a_i})(b_{ik} - \mu_{b_{ik}})\big]}{\sigma_{a_i}\,\sigma_{b_{ik}}}$$
where $\mu_{a_i}$, $\mu_{b_{ik}}$ and $\sigma_{a_i}$, $\sigma_{b_{ik}}$ denote the means and standard deviations of $a_i$ and $b_{ik}$, respectively, which are evaluated on a separate training set. Since both positively and negatively correlated neurons are helpful for the predictions, the corresponding connections are considered separately. From all $r_{ik}$ for k = 1, 2, ..., K, first take out all positive coefficients and sort them in descending order, denoted $r^{+}_{ik}$ for k = 1, 2, ..., $K^{+}$. Then randomly sample $\lambda S K^{+}$ and $(1-\lambda) S K^{+}$ coefficients from the first and the second half of the sorted correlation coefficients, respectively. Weights/connections corresponding to the sampled coefficients are reserved while the others are deleted. The present application takes λ = 0.75. In other words, three times as many connections are kept from the half with higher correlations as from the half with lower correlations. The total number of kept connections/weights is $S K^{+}$, which depends on the degree of sparsity S.
The negative coefficients are processed in a similar way, except that the absolute values of the coefficients are considered and more coefficients (and the corresponding connections/weights) of higher absolute values are kept. The total number of sampled negative coefficients is $S K^{-}$, given $K^{-}$ negative coefficients among $r_{ik}$ for k = 1, 2, ..., K. Connections from each of the output neurons $a_i$ are processed in the same way. Suppose there are N output neurons $a_i$ for i = 1, 2, ..., N. Then the total number of sampled weights/connections is $S K N$.
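The selection rule for one output neuron can be sketched as follows. The sketch computes a sample Pearson correlation and then samples reserved connections separately from the positive and the negative coefficients, taking a share λ of the quota from the stronger half and 1 − λ from the weaker half. Several details are assumptions not specified in the present application: counts are rounded to the nearest integer, the middle element of an odd-length list goes to the stronger half, coefficients that are exactly zero are dropped, and a fixed seed is used for reproducibility. All function names are hypothetical.

```python
import math
import random


def pearson(a, b):
    """Sample Pearson correlation of two activation sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b)) / n
    sa = math.sqrt(sum((u - ma) ** 2 for u in a) / n)
    sb = math.sqrt(sum((v - mb) ** 2 for v in b) / n)
    return cov / (sa * sb) if sa > 0 and sb > 0 else 0.0


def sample_half_split(indices_sorted, S, lam, rng):
    """indices_sorted: connection indices ordered from strongest to weakest."""
    K = len(indices_sorted)
    half = (K + 1) // 2
    top, bottom = indices_sorted[:half], indices_sorted[half:]
    n_top = min(len(top), round(lam * S * K))
    n_bot = min(len(bottom), round((1 - lam) * S * K))
    return set(rng.sample(top, n_top)) | set(rng.sample(bottom, n_bot))


def select_connections(r, S, lam=0.75, seed=0):
    """r[k]: correlation between an output neuron and its k-th input.
    Returns the set of input indices whose connections are reserved."""
    rng = random.Random(seed)
    pos = sorted((k for k, v in enumerate(r) if v > 0), key=lambda k: -r[k])
    neg = sorted((k for k, v in enumerate(r) if v < 0), key=lambda k: r[k])
    return sample_half_split(pos, S, lam, rng) | sample_half_split(neg, S, lam, rng)
```

With 8 positive and 8 negative coefficients, S = 0.5 and λ = 0.75, the sketch reserves 3 connections from the stronger half and 1 from the weaker half of each sign, 8 connections in total.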
For convolutional layers, the set of correlation coefficients between neurons with shared connecting weights are jointly considered to determine whether a weight (or a set of connections with shared weights) should be reserved or deleted. Let $a_{im}$ be the m-th neuron in the i-th feature map of the current layer, connected to K neurons $b_{mk}$ in the previous layer for k = 1, 2, ..., K. (K equals the filter size, e.g., 3x3, times the number of input channels.) The set of K neurons $b_{mk}$ is determined by the position m. There are a total of M neurons in the i-th output feature map, $a_{im}$ for m = 1, 2, ..., M. They all share the same set of K weights, although they are connected to different sets of neurons $b_{mk}$ in the previous layer for m = 1, 2, ..., M. Weights between $a_{im}$ and $b_{mk}$ are shared for m = 1, 2, ..., M. We calculate the mean magnitude of the correlation coefficients between $a_{im}$ and $b_{mk}$ for m = 1, 2, ..., M as
$$r_{ik} = \frac{1}{M}\sum_{m=1}^{M}\left|\frac{E\big[(a_{im} - \mu_{a_{im}})(b_{mk} - \mu_{b_{mk}})\big]}{\sigma_{a_{im}}\,\sigma_{b_{mk}}}\right|$$
Similar to the case in the full-connection layers and local-connection layers, given the degree of sparsity S, SK mean correlation coefficients (and the corresponding weights) from the set of K coefficients $r_{ik}$ for k = 1, 2, ..., K are selected. The coefficients $r_{ik}$ are sorted in descending order. λSK coefficients are randomly chosen from the first half with higher values and (1-λ)SK coefficients are randomly chosen from the second half with lower values. Again, λ is set to 0.75 in the present application. The set of K coefficients $r_{ik}$ for k = 1, 2, ..., K (and the corresponding weights) are processed in the same way for all i = 1, 2, ..., N (given N feature maps in the current layer). The total number of sampled weights is $S K N$.
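For a shared convolutional weight, the score used for ranking is the mean magnitude of the correlations over all M positions of the output feature map. The following sketch computes that score for one weight, under the assumption that activations are available as lists of per-sample values. The helper `pearson` and the function name `mean_abs_correlation` are hypothetical.

```python
import math


def pearson(a, b):
    """Sample Pearson correlation of two activation sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b)) / n
    sa = math.sqrt(sum((u - ma) ** 2 for u in a) / n)
    sb = math.sqrt(sum((v - mb) ** 2 for v in b) / n)
    return cov / (sa * sb) if sa > 0 and sb > 0 else 0.0


def mean_abs_correlation(a_im, b_mk):
    """Mean magnitude of correlation for one shared weight k.

    a_im[m] -- activations (over samples) of output neuron m in map i
    b_mk[m] -- activations of the input neuron tied to a_im[m] via weight k
    """
    M = len(a_im)
    return sum(abs(pearson(a_im[m], b_mk[m])) for m in range(M)) / M
```

Because the magnitudes are averaged, the resulting scores are non-negative, so no separate treatment of positive and negative coefficients is needed for convolutional layers.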
During the phase of weight updating, identification and verification supervisory signals in the trainer 30 are simultaneously added to each of the supervised layers (e.g., sparse full-connection layer 15 in Fig. 3) of each of the sparse deep neural networks in the extractor 10, and respectively back-propagated to the input face image, so as to update neural weights on reserved neural connections of sparse convolutional layers, sparse local-connection layers, and sparse full-connection layers of the sparse deep neural networks.
The identification supervisory signals are generated in the trainer 30 by classifying all of the supervised layer (layers selected for supervision, e.g., sparse full-connection layer 15 in Fig. 3) representations into one of N identities, wherein the classification errors are used as the identification supervisory signals.
The verification supervisory signals in the trainer 30 are generated by verifying the supervised layer representations of two compared face images, respectively, in each of the feature extraction modules, to determine if the two compared face images belong to the same identity, wherein the verification errors are used as the verification supervisory signals. Given a pair of training face images, the extractor 10 extracts two feature vectors fi and fj from the two face images respectively in each of the feature extraction modules. The verification error is
$$\frac{1}{2}\left\| f_i - f_j \right\|_2^2$$
if fi and fj are features of face images of the same identity, or
$$\frac{1}{2}\max\left(0,\; m - \left\| f_i - f_j \right\|_2\right)^2$$
if fi and fj are features of face images of different identities, where $\left\| f_i - f_j \right\|_2$ is the Euclidean distance between the two feature vectors and m is a positive constant. There are errors if fi and fj are dissimilar for the same identity, or if fi and fj are similar for different identities.
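The behaviour described above (an error when features of the same identity are far apart, or when features of different identities are closer than a margin m) matches a contrastive-style error. The sketch below assumes that form, namely half the squared Euclidean distance for a same-identity pair and half the squared margin shortfall for a different-identity pair; the function name and the default m = 1.0 are illustrative assumptions.

```python
import numpy as np

def verification_error(f_i, f_j, same_identity, m=1.0):
    """Contrastive-style verification error for one pair of feature vectors.

    f_i, f_j      : feature vectors extracted from the two face images
    same_identity : True if both images show the same person
    m             : positive margin constant (default value is an assumption)
    """
    d = float(np.linalg.norm(np.asarray(f_i, dtype=float) - np.asarray(f_j, dtype=float)))
    if same_identity:
        return 0.5 * d ** 2                  # penalize dissimilar same-identity features
    return 0.5 * max(0.0, m - d) ** 2        # penalize different identities closer than m
```

For example, with fi = (0, 0) and fj = (3, 4) the distance is 5, so a same-identity pair has error 12.5, while a different-identity pair has zero error for m = 1 and error 2.0 for m = 7.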
Although the preferred examples of the present invention have been described, those skilled in the art can make variations or modifications to these examples upon knowing the basic inventive concept. The appended claims are intended to be construed as comprising the preferred examples and all the variations or modifications that fall within the scope of the present invention.
Obviously, those skilled in the art can make variations or modifications to the present invention without departing from the spirit and scope of the present invention. As such, if these variations or modifications fall within the scope of the claims and equivalent techniques, they also fall within the scope of the present invention.
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

Claims (23)

  1. An apparatus for face recognition, comprising:
    a feature extraction unit configured to extract features from input face images with a plurality of deep feature extraction hierarchies; and
    a recognition unit configured to calculate distances between facial features of different face images extracted by the feature extraction unit, so as to determine if two face images are from the same identity for face verification, or to determine if one of the input images, as a probe face image, belongs to the same identity as one of gallery face images consisting of the input images for face identification,
    wherein each of the deep feature extraction hierarchies contains a plurality of cascaded convolution layers, local-connection layers, pooling layers, and full-connection layers, and
    wherein neurons in the full-connection layers are only connected to a part of neurons in a previous layer thereof, while neurons in the convolution layers and the local-connection layers are only connected to a part of neurons in local regions in a previous layer thereof.
  2. An apparatus of claim 1, further comprising:
    a training unit configured to add supervisory signals on the feature extraction unit during training so as to learn neural connections in the convolution layers, the local-connection layers, and the full-connection layers, and to adjust neural weights in these layers.
  3. An apparatus of claim 1, wherein neural connections in the convolution layers, the local-connection layers, and the full-connection layers and neural weights on the neural connections are learned iteratively.
  4. An apparatus of claim 3, wherein in one iteration, the neural weights on neural connections are adjusted by fixing neural connections in the convolution, local-connection and full-connection layers, and then,
    the neural connections in one or more of the convolution, local-connection and full-connection layers are pruned while fixing neural weights.
  5. An apparatus of claim 3, wherein the neural connections are pruned according to correlations between neural activations of connected neurons, wherein a majority of connections between weakly correlated neurons are pruned while a majority of connections between strongly correlated neurons are reserved.
  6. An apparatus of claim 3, wherein before the first iteration, neurons in the full-connection layers are connected to all neurons in a previous layer thereof, while neurons in the sparse convolution modules and the sparse local-connection modules are connected to all neurons in local regions in a previous layer thereof, respectively.
  7. An apparatus of claim 3, wherein for the second and the following iterations, the neuron connections are taken from reserved neuron connections in the previous iteration and the neuron weights on these neuron connections are initialized by neuron weights learned in the previous iteration.
  8. An apparatus of claim 7, wherein the neuron weights on the reserved neuron connections from the previous iteration are adjustable according to joint identification-verification supervisory signals.
  9. An apparatus of claim 8, wherein the joint identification-verification supervisory signals comprise an identification supervisory signal and a verification supervisory signal,
    wherein,
    the identification supervisory signal is generated by classifying features extracted from an input face region into one of N identities in a training dataset, and taking the classification error as the supervisory signal, while the verification supervisory signal is generated by comparing features extracted from two input face images respectively to tell if they are from the same person, and taking the verification error as the supervisory signal.
  10. An apparatus of claim 1, wherein features extracted by a plurality of deep feature extraction hierarchies in the feature extraction unit are concatenated for face recognition.
  11. An apparatus of claim 10, wherein distances between the concatenated features extracted from two input face images are compared to a threshold to determine if the two input face images are from the same person for face verification, or distances between features of an input query face image to features of each of face images in a face image database are computed to determine which identity in the face image database the input query face image belongs to for face identification.
  12. A method for face recognition, comprising:
    configuring a plurality of deep feature extraction hierarchies such that each of the deep feature extraction hierarchies contains a plurality of cascaded convolution layers, local-connection layers, pooling layers and full-connection layers, and neurons in the full-connection layers are only connected to a part of neurons in a previous layer thereof, while neurons in the convolution layers and the local-connection layers are only connected to a part of neurons in local regions in a previous layer thereof;
    training the configured deep feature extraction hierarchies to learn neural connections in the convolution layers, the local-connection layers, and the full-connection layers, and to adjust neural weights in these layers;
    extracting features from input face images by the trained deep feature extraction hierarchies; and
    recognizing faces based on features extracted from each input face image by the feature extraction unit.
  13. A method of claim 12, wherein the training further comprises:
    learning iteratively neural connections in the convolution layers, the local-connection layers, and the full-connection layers and neural weights on the neural connections.
  14. A method of claim 13, wherein in one iteration, the training comprises:
    adjusting the neural weights on neural connections by fixing neural connections in the convolution, local-connection and full-connection layers, and,
    pruning the neural connections in one or more of the convolution, local-connection and full-connection layers while fixing the neural weights thereof.
  15. A method of claim 14, wherein the neural connections are pruned according to correlations between neural activations of connected neurons, wherein a majority of connections between weakly correlated neurons are pruned while a majority of connections between strongly correlated neurons are reserved.
  16. A method of claim 13, wherein before the first iteration, neurons in the full-connection layers are connected to all neurons in a previous layer thereof, while neurons in the sparse convolution modules and the sparse local-connection modules are connected to all neurons in local regions in a previous layer thereof, respectively.
  17. A method of claim 13, wherein for the second and the following iterations, the neuron connections are taken from reserved neuron connections in the previous iteration and the neuron weights on these neuron connections are initialized by neuron weights learned in the previous iteration.
  18. A method of claim 17, wherein the neuron weights on the reserved neuron connections from the previous iteration are adjustable according to joint identification-verification supervisory signals.
  19. A method of claim 18, wherein the joint identification-verification supervisory signals comprise an identification supervisory signal and a verification supervisory signal,
    wherein,
    the identification supervisory signal is generated by classifying features extracted from an input face region into one of N identities in a training dataset, and taking the classification error as the supervisory signal, while the verification signal is generated by comparing features extracted from two input face images respectively to tell if they are from the same person, and taking the verification error as the supervisory signal.
  20. A method of claim 12, wherein features extracted by a plurality of deep feature extraction hierarchies in the feature extraction unit are concatenated for face recognition.
  21. A system for face recognition, comprising:
    a memory that stores executable components; and
    a processor electrically coupled to the memory to execute the executable components to:
    configure a plurality of deep feature extraction hierarchies such that each of the deep feature extraction hierarchies contains a plurality of cascaded convolution layers, local-connection layers, pooling layers and full-connection layers, and neurons in the full-connection layers are only connected to a part of neurons in a previous layer thereof, while neurons in the convolution layers and the local-connection layers are only connected to a part of neurons in local regions in a previous layer thereof;
    train the configured deep feature extraction hierarchies;
    extract features from input face images with the trained deep feature extraction hierarchies; and
    recognize faces based on features extracted from each input face image by the feature extraction unit.
  22. A system of claim 21, wherein the processor is further configured to execute the executable components for training the configured deep feature extraction hierarchies by adding supervisory signals on them so as to learn neural connections in the convolution layers, the local-connection layers, and the full-connection layers, and to adjust neural weights in these layers.
  23. A system of claim 22, wherein the training further comprises:
    learning iteratively neural connections in the convolution layers, the local-connection layers, and the full-connection layers and neural weights on the neural connections.
PCT/CN2015/093031 2015-10-28 2015-10-28 A method and a system for face recognition WO2017070858A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201580085498.9A CN108496174B (en) 2015-10-28 2015-10-28 Method and system for face recognition
PCT/CN2015/093031 WO2017070858A1 (en) 2015-10-28 2015-10-28 A method and a system for face recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2015/093031 WO2017070858A1 (en) 2015-10-28 2015-10-28 A method and a system for face recognition

Publications (1)

Publication Number Publication Date
WO2017070858A1 true WO2017070858A1 (en) 2017-05-04

Family

ID=58629770

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/093031 WO2017070858A1 (en) 2015-10-28 2015-10-28 A method and a system for face recognition

Country Status (2)

Country Link
CN (1) CN108496174B (en)
WO (1) WO2017070858A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107346423A (en) * 2017-06-30 2017-11-14 重庆科技学院 The face identification method of autoassociative memories based on cell neural network
WO2022016278A1 (en) * 2020-07-21 2022-01-27 Royal Bank Of Canada Facial recognition tokenization

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344731B (en) * 2018-09-10 2022-05-03 电子科技大学 Lightweight face recognition method based on neural network
CN109815814B (en) * 2018-12-21 2023-01-24 天津大学 Face detection method based on convolutional neural network

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7646894B2 (en) * 2006-02-14 2010-01-12 Microsoft Corporation Bayesian competitive model integrated with a generative classifier for unspecific person verification
US7668346B2 (en) * 2006-03-21 2010-02-23 Microsoft Corporation Joint boosting feature selection for robust face recognition
US7684651B2 (en) * 2006-08-23 2010-03-23 Microsoft Corporation Image-based face search
US8218880B2 (en) * 2008-05-29 2012-07-10 Microsoft Corporation Linear laplacian discrimination for feature extraction
CN103530657A (en) * 2013-09-26 2014-01-22 华南理工大学 Deep learning human face identification method based on weighting L2 extraction
WO2015154206A1 (en) * 2014-04-11 2015-10-15 Xiaoou Tang A method and a system for face verification

Also Published As

Publication number Publication date
CN108496174B (en) 2020-02-11
CN108496174A (en) 2018-09-04

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15906924

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15906924

Country of ref document: EP

Kind code of ref document: A1