WO2016026064A1 - A method and a system for estimating facial landmarks for face image - Google Patents

A method and a system for estimating facial landmarks for face image

Info

Publication number
WO2016026064A1
WO2016026064A1 · PCT/CN2014/000785 · CN2014000785W
Authority
WO
WIPO (PCT)
Prior art keywords
face image
image dataset
annotations
landmark
type
Prior art date
Application number
PCT/CN2014/000785
Other languages
French (fr)
Inventor
Xiaoou Tang
Shizhan ZHU
Cheng Li
Chen Change Loy
Original Assignee
Xiaoou Tang
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiaoou Tang
Priority to PCT/CN2014/000785 (WO2016026064A1)
Priority to CN201480082760.XA (CN107004136B)
Publication of WO2016026064A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G06V40/165 Detection; Localisation; Normalisation using facial parts and geometric relationships
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/755 Deformable models or variational models, e.g. snakes or active contours
    • G06V10/7553 Deformable models or variational models, e.g. snakes or active contours, based on shape, e.g. active shape models [ASM]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/757 Matching configurations of points or features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships


Abstract

Disclosed are a method for estimating facial landmarks for a face image, and a system for estimating facial landmarks for a face image. The method may comprise: retrieving a first face image dataset with first type landmark annotations and a second face image dataset with second type landmark annotations; transferring the first type landmark annotations from the first face image dataset to the second face image dataset to obtain pseudo first type annotations for the second face image dataset; and combining the second face image dataset with the pseudo first type landmark annotations and the first face image dataset to make the second face image dataset have the first type landmark annotations.

Description

A Method and a System for Estimating Facial Landmarks for Face Image
Technical Field
[0001] The present application relates to a method for estimating facial landmarks for a face image, and a system for estimating facial landmarks for a face image.
Background
[0002] Face alignment is a critical component of various face analyses, such as face verification and expression classification. Various benchmark datasets have been released, each containing large quantities of labeled images. Although these databases were collected with the goal of being as rich and diverse as possible, inherent bias across datasets is unavoidable in practice.
[0003] The bias manifests itself as different characteristics and distributions across datasets. For instance, one set mainly contains white Caucasian males with mostly frontal faces, while another set consists of challenging samples with various poses or severe occlusions. In addition, the proportion of profile views can differ by more than 10% across datasets. Clearly, forcing a model to train on one dataset would easily lead to over-fitting and cause poor performance on unseen domains. To improve generalization, it is of practical interest to combine different databases so as to leverage the characteristics and distributions of multiple sources. This approach, however, is hindered by annotation gaps, which require huge effort to standardize before database fusion is possible.
Summary of invention
[0004] In one aspect of the present application, there is disclosed a method for estimating facial landmarks for a face image, comprising:
retrieving a first face image dataset with first type landmark annotations and a second face image dataset with second type landmark annotations;
transferring the first type landmark annotations from the first face image dataset to the second face image dataset to obtain pseudo first type annotations for the second face image dataset; and
combining the second face image dataset with the pseudo first type landmark annotations and the first face image dataset to make the second face image dataset have the first type landmark annotations.
[0005] In another aspect of the present application, there is disclosed a system for estimating facial landmarks for a face image, comprising:
a transductive alignment device configured to retrieve a first face image dataset with first type landmark annotations and a second face image dataset with second type landmark annotations, and transfer the first type landmark annotations from the first face image dataset to the second face image dataset to obtain pseudo first type annotations for the second face image dataset; and
a data augmentation device configured to combine the second face image dataset with the pseudo first type landmark annotations and the first face image dataset to make the second face image dataset have the first type landmark annotations.
Brief Description of the Drawing
[0006] Exemplary non-limiting embodiments of the present invention are described below with reference to the attached drawings. The drawings are illustrative and generally not to an exact scale. The same or similar elements on different figures are referenced with the same reference numbers.
[0007] Fig. 1 is a schematic diagram illustrating an exemplary system 100 for transferring face landmark annotations according to one embodiment of the present application.
[0008] Fig. 2 is a schematic diagram illustrating an exemplary block diagram of the transductive alignment device 10 according to one embodiment of the present application.
[0009] Fig. 3 illustrates a flow chart of the process 300 showing how the units 101-106 cooperate to obtain pseudo S-type annotations for the new training set.
[0010] Fig. 4 is a schematic flowchart illustrating a detailed process of the transductive model training unit consistent with some disclosed embodiments of the present application.
[0011] Fig. 5 illustrates a flow chart of the process of the data augmentation device consistent with another disclosed embodiment of the present application.
[0012] Fig. 6 is a schematic diagram illustrating an exemplary system for determining face landmarks according to one embodiment of the present application.
[0013] Fig. 7 illustrates a flow chart of the process by which the training device trains the predicting device according to one embodiment of the present application.
[0014] Fig. 8 illustrates a flow chart of the detailed process of the predicting device according to one embodiment of the present application.
Detailed Description
[0015] Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When appropriate, the same reference numbers are used throughout the drawings to refer to the same or like parts.
[0016] Fig. 1 is a schematic diagram illustrating an exemplary system 100 for transferring face landmark annotations according to one embodiment of the present application. As illustrated in Fig. 1, the system 100 for transferring face landmark annotations may comprise a transductive alignment device 10 and a data augmentation device 20.
[0017] The transductive alignment device 10 is configured to retrieve a first (original) training set for a first face image with S-type landmark annotations (hereinafter also referred to as "Set 1") and a second (new) training set with T-type landmark annotations (hereinafter also referred to as "Set 2"), and to transfer the S-type landmark annotations from the original face image dataset (training dataset) to the new training dataset so as to obtain pseudo S-type annotations for the new training set. In the embodiments of the present application, the landmark annotations may comprise facial landmark points on a given face image, such as eyes, nose, and mouth corners. The data augmentation device 20 is then configured to combine the new training set with the pseudo S-type landmark annotations and the original training set into an augmented training dataset, i.e., to provide the new training set with S-type landmark annotations. According to some embodiments of the present application, S-type annotations might be denser, with a plurality of (for example, 194 or more) landmarks and even the outer face contour annotated, while T-type annotations might be sparse, with only a few (for example, 5) landmarks on the eyes and mouth corners.
[0018] The transductive alignment device 10 can predict S-type annotations on the new training dataset only when T-type annotations on the new training set are provided. The goal of the present application, however, is to predict S-type annotations for an arbitrary input face image, such that no T-type annotations are needed to predict the landmark annotations. Since more diversified training samples from the new training dataset are included, a more robust model for predicting S-type landmarks of facial images can be obtained.
[0019] In one embodiment of the present application, the transductive alignment device is further configured to determine a transductive model {M_PCA,k, M_reg,k} from a common landmark index between the first type landmark annotations and the second type landmark annotations, initial first-type annotations, and the first face image dataset; and to transfer, based on the transductive model, the first type landmark annotations from the first face image dataset to the second face image dataset to obtain pseudo first type annotations for the second face image dataset. Fig. 2 is a schematic diagram illustrating an exemplary block diagram of the transductive alignment device 10 according to one embodiment of the present application. As shown in Fig. 2, the transductive alignment device 10 may comprise a common landmarks determination unit 101, a mapping unit 102, a first annotation estimated unit 103, a transductive model training unit 104, a second annotation estimated unit 105 and a pseudo annotation determination unit 106.
[0020] Fig. 3 illustrates a flow chart of the process 300 showing how the units 101-106 cooperate to obtain pseudo S-type landmark annotations for the new training dataset.
[0021] At step S301, the common landmark determination unit 101 operates to retrieve a first training dataset {I_1, x_s, B_1} for the first face image with S-type landmark annotations (Set 1) and a second training set {I_2, x_T, B_2} with T-type landmark annotations (Set 2), wherein the first and the second training datasets include the bounding boxes B_1 and B_2 of each face in the images I_1 and I_2, respectively, where I_i represents the face images from the training image set with index i, x represents landmark locations (in x-y coordinates), and B_1 and B_2 represent the bounding boxes of the images I_1 and I_2, respectively. The common landmarks determination unit 101 then determines a plurality of common landmark indexes (x_s)_common for the two types of annotations, i.e. the S-type landmark annotations in data Set 1 and the T-type landmark annotations in data Set 2. In the embodiment, the common landmarks (x_s)_common exist across data Set 1 and data Set 2. Common landmark annotations are defined as facial landmarks that are well labeled with a decisive semantic definition across different datasets, such as left and right eye corners, mouth corners and pupil centers.
[0022] At step S302, the mapping unit 102 operates to learn a mapping matrix T from the common landmark annotations (x_s)_common to the S-type landmarks x_s in the original training dataset, i.e., Set 1. In order to learn the mapping, simple linear regression may be used, and the general learning scheme is T = (x_sc^T x_sc)^(-1) x_sc^T x_s, in which x_sc is short for (x_s)_common, and '*' in '(x_s)_common * T' means matrix multiplication, not convolution.
[0023] At step S303, the first annotation estimated unit 103 operates to compute the initial or estimated S-type annotations x on data Set 1 based on the common landmarks (x_s)_common obtained from step S301 and the mapping T obtained from step S302 by rule of
x = (x_s)_common * T.   (1)
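For illustration only, a minimal numpy sketch of steps S302-S303 follows; the function names, array shapes and the use of np.linalg.lstsq are assumptions rather than the literal implementation described in the patent.

```python
import numpy as np

# Assumed shapes: x_common is (N, 2*C) and x_s is (N, 2*S), holding the flattened
# x-y coordinates of the C common landmarks and the S S-type landmarks for the
# N images of data Set 1.

def learn_mapping(x_common, x_s):
    """Step S302: least-squares mapping T = (x_sc^T x_sc)^-1 x_sc^T x_s."""
    T, _, _, _ = np.linalg.lstsq(x_common, x_s, rcond=None)
    return T                      # shape (2*C, 2*S)

def initial_annotation(x_common, T):
    """Equation (1): x = x_common * T (matrix multiplication)."""
    return x_common @ T           # shape (N, 2*S)
```

Applying initial_annotation to the common landmarks taken from the T-type annotations of Set 2 yields the initialization of equation (2) below.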
[0024] At step S304, the transductive model training unit 104 operates to determine a transductive model from the common landmark indexes (x_s)_common from step S301, the initial S-type annotations x, and the first training dataset {I_1, x_s, B_1} with S-type landmark annotations (i.e., data Set 1), which will be discussed later in reference to Fig. 4.
[0025] At step S305, the second annotation estimated unit 105 receives the new training dataset (i.e. Set 2, with T-type annotations {I_2, x_T, B_2}) and uses the mapping T obtained from step S302 and the common landmark indexes (x_T)_common obtained from step S301 to get the initialized/estimated annotation x for the new training dataset (data Set 2) by rule of
x = (x_T)_common * T.   (2)
[0026] At step S306, for each of the K iterations, the pseudo annotation determination unit 106 operates to extract local appearance information φ(x) for data Set 2, and the feature Jacobian φ(x*) − φ(x) only for the common landmarks (x_s)_common, and then concatenates the local appearance information φ(x) and the feature Jacobian as features f by rule of
f(x) = [(φ(x*) − φ(x))_common, φ(x)_private]   (3)
where [] means matrix concatenation, and
φ(x) extracts local SIFT (Scale Invariant Feature Transform) features according to the coordinates x; SIFT is treated as a black box.
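As a hedged sketch of equation (3), the snippet below extracts SIFT descriptors at the landmark coordinates with OpenCV (SIFT treated as a black box, as stated above) and concatenates the common-landmark feature Jacobian with the appearance of the remaining (private) landmarks. The names phi, build_features and patch_size are illustrative assumptions; the ground-truth coordinates x* are only needed at the common landmarks, where they are available from the T-type annotations of Set 2.

```python
import numpy as np
import cv2  # OpenCV; SIFT is used here as a black-box local descriptor

def phi(image, coords, patch_size=32.0):
    """Extract one SIFT descriptor per (x, y) landmark coordinate."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) if image.ndim == 3 else image
    kps = [cv2.KeyPoint(float(x), float(y), patch_size) for x, y in coords]
    _, desc = cv2.SIFT_create().compute(gray, kps)
    return desc                                   # shape (num_landmarks, 128)

def build_features(image, x_est, x_common_gt, common_idx):
    """Equation (3): f = [(phi(x*) - phi(x))_common, phi(x)_private]."""
    private_idx = [i for i in range(len(x_est)) if i not in set(common_idx)]
    jacobian = phi(image, x_common_gt) - phi(image, x_est[common_idx])
    appearance = phi(image, x_est[private_idx])
    return np.concatenate([jacobian.ravel(), appearance.ravel()])
```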
[0027] The pseudo annotation determination unit 106 then operates to calculate an estimated annotation error Δx based on the transductive model by rule of:
Δx = M_reg(M_PCA(f))   (4)
where M_PCA transforms the original features into PCA (Principal Component Analysis) features, and M_reg transforms the PCA features into a regression displacement target.
[0028] The pseudo annotation determination unit 106 then updates the current estimated annotation x by rule of formula (5) and outputs x from the last iteration, i.e. the pseudo annotation {I_2, x_s, B_2}:
x = x + Δx   (5)
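Putting equations (4) and (5) together, one iteration of the transfer could be sketched as below; mean, P, W and b are assumed to be the mean vector, projection coefficients, regression coefficient and bias of the current iteration of the transductive model (see the training sketch after paragraph [0035]).

```python
def transfer_step(image, x_est, x_common_gt, common_idx, mean, P, W, b):
    """One iteration of step S306: delta_x = M_reg(M_PCA(f)); x = x + delta_x."""
    f = build_features(image, x_est, x_common_gt, common_idx)
    f_pca = (f - mean) @ P            # M_PCA: subtract the mean, then project
    delta_x = f_pca @ W + b           # M_reg: linear map to a displacement
    return x_est + delta_x.reshape(x_est.shape)
```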
[0029] Hereinafter, the detailed process for the transductive model training unit 104 will be further discussed in reference to Fig. 4.
[0030] At step S3041, the training dataset is prepared by the transductive model training unit 104. To be specific, the transductive model training unit 104 receives the first training dataset {I_1, x_s} for a first face image with S-type landmark annotations (data Set 1), prepares the following data, and then begins to train for k iterations:
1) common landmark index (x_s)_common;
2) face images I = I_1;
3) initialized/estimated annotation x;
4) ground truth annotation x* = x_s.
[0031] At step S3042, the transductive model training unit 104 operates to extract: (1) local appearance information φ(x) for data Set 1, and (2) the feature Jacobian φ(x*) − φ(x) only for the common landmarks (x_s)_common, and then concatenates these two parts (1) and (2) as features f by rule of formula (3) stated above.
[0032] At step S3043, the transductive model training unit 104 computes the dissimilarity between the estimated current shape x and the ground truth shape x* by rule of Δx = x* − x.
[0033] At step S3044, the transductive model training unit 104 gets a PCA projection model M_PCA by performing PCA analysis on the features f, and gets a mapping M_reg via ridge regression from the PCA-projected features to the dissimilarity. In one embodiment of the present application, for the purpose of training, principal component analysis (PCA) is conducted using singular value decomposition, which outputs a PCA projection model M_PCA containing a mean vector and projection coefficients. In the testing stage, the PCA-projected features are obtained by first subtracting the mean vector from the original features, and then performing matrix multiplication with the projection coefficients. Ridge regression yields a mapping function containing a coefficient and a bias, which will be used to obtain Δx as shown in Equation (4).
[0034] At step S3045, the transductive model training unit 104 operates to determine whether the estimated shape converges to the ground truth shape. If yes, at step S3046, the transductive model training unit 104 determines the transductive model M_T (containing the PCA (principal component analysis) projection model and the mapping function for each iteration) by rule of
M_T = {M_PCA,k, M_reg,k}, ∀k = 1, 2, ....   (6)
[0035] Otherwise, at step S3047, the estimated annotation is updated as x = x + M_reg(M_PCA(f)) and fed back to step S3041.
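A minimal numpy sketch of the PCA and ridge-regression fitting described in paragraphs [0033]-[0035] is given below; fit_pca, fit_ridge, n_components and lam are illustrative assumptions. The transductive model M_T of equation (6) is then simply the list of the (mean, P) and (W, b) pairs collected over the k iterations, with the estimate updated by x = x + M_reg(M_PCA(f)) between iterations.

```python
import numpy as np

def fit_pca(F, n_components):
    """M_PCA: PCA via SVD, returning the mean vector and projection coefficients."""
    mean = F.mean(axis=0)
    _, _, Vt = np.linalg.svd(F - mean, full_matrices=False)
    P = Vt[:n_components].T                      # projection coefficients
    return mean, P

def fit_ridge(F_pca, delta_x, lam=1.0):
    """M_reg: ridge regression from PCA-projected features to the dissimilarity
    delta_x = x* - x, returning a coefficient W and a bias b."""
    f_mean, y_mean = F_pca.mean(axis=0), delta_x.mean(axis=0)
    Fc, Yc = F_pca - f_mean, delta_x - y_mean
    W = np.linalg.solve(Fc.T @ Fc + lam * np.eye(Fc.shape[1]), Fc.T @ Yc)
    b = y_mean - f_mean @ W
    return W, b
```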
[0036] Hereinafter, the data augmentation device 20 will be discussed in detail. As mentioned, the data augmentation device 20 is configured to combine the new training set with the pseudo S-type landmark annotations and the original training set into an augmented training dataset. The S-type landmark annotations for the new training set might be inaccurate, which is why they are called "pseudo S-type annotations", and the subsequent data augmentation process is thus necessary to remove the error introduced by the pseudo S-type annotations.
[0037] Fig. 5 illustrates a flow chart 500 of the process of the data augmentation device 20. In particular, at step S501, the data augmentation device 20 operates to filter erroneously transferred annotations from the pseudo S-type landmark annotations in the new training dataset by comparing the estimated common landmarks x_s and the ground truth common landmarks, so as to get a cleaned training set {I_2', x_s', B_2'}. At step S502, the data augmentation device 20 receives the original training set (data Set 1, with S-type landmark annotations {I_1, x_s, B_1}) and then combines the cleaned new training set with the original training set to obtain {I_A, x_s, B}.
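A sketch of the filtering and merging of Fig. 5 follows; the per-sample dictionary keys, the (up, down, left, right) bounding-box format and the 0.1 threshold are assumptions made only for illustration.

```python
import numpy as np

def augment(set1, set2, threshold=0.1):
    """Steps S501-S502: drop Set-2 samples whose transferred common landmarks
    deviate too much from the ground-truth common landmarks (error normalized
    by the bounding-box size), then merge the cleaned set with Set 1."""
    cleaned = []
    for sample in set2:
        err = np.linalg.norm(sample["pseudo_common"] - sample["gt_common"], axis=1)
        up, down, left, right = sample["box"]
        if err.mean() / max(down - up, right - left) < threshold:
            cleaned.append(sample)
    return list(set1) + cleaned        # the augmented training set {I_A, x_s, B}
```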
[0038] Fig. 6 is a schematic diagram illustrating an exemplary system 1000 for determining face landmarks according to one embodiment of the present application. As shown in Fig. 6, besides the transductive alignment device 10 and the data augmentation device 20, the system 1000 may further comprise a training device 30 and a predicting device 40. The operations of the transductive alignment device 10 and the data augmentation device 20 in the system 1000 are the same as those in the system 100, and thus a detailed description thereof is omitted hereinafter.
[0039] The combined dataset generated by the data augmentation device 20 may be treated as the predetermined training set for the training device 30 to train the predicting device 40.
[0040] Fig. 7 illustrates a flow chart 700 of the process by which the training device 30 trains the predicting device 40. At step S701, the training device 30 receives from the data augmentation device 20 the augmented training set with bounding boxes of images {I_A, x_s, B} and then learns an initializing function init(B) that estimates the relation between the initial landmarks and the bounding box B, so as to get initialized landmarks x according to the bounding box B and the learned init(B). The function init may be determined intuitively. For example, it may generate initial landmarks relative to the bounding box: to locate the initial left eye center, the relative position is averaged over all training samples, and it may be found that the left eye center lies at 0.25 of the box height from the top and 0.3 of the box width from the left. If the bounding box of a testing sample is up: 100, down: 200, left: 500, right: 600, then the initial coordinates of the left eye center would be x = 530, y = 125. The present application always uses 0.25 and 0.3 for all samples with respect to the left eye center, and other landmarks are handled in the same way.
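The initializing function init(B) could be sketched as below, assuming the bounding box is given as (up, down, left, right); learn_init and init_landmarks are hypothetical names. The worked example from the paragraph above (offsets 0.3 and 0.25, box up=100, down=200, left=500, right=600) indeed gives x = 500 + 0.3·100 = 530 and y = 100 + 0.25·100 = 125.

```python
import numpy as np

def learn_init(boxes, landmarks):
    """Average each landmark's position relative to its bounding box.
    boxes: list of (up, down, left, right); landmarks: list of (L, 2) arrays."""
    rel = []
    for (up, down, left, right), pts in zip(boxes, landmarks):
        h, w = down - up, right - left
        rel.append(np.column_stack([(pts[:, 0] - left) / w, (pts[:, 1] - up) / h]))
    return np.mean(rel, axis=0)                  # learned relative offsets, (L, 2)

def init_landmarks(box, rel):
    """init(B): place the averaged relative offsets inside a new bounding box."""
    up, down, left, right = box
    h, w = down - up, right - left
    return np.column_stack([left + rel[:, 0] * w, up + rel[:, 1] * h])
```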
[0041] At step S702, the training dataset is prepared. To be specific, the training device 30 receives the first training set {I_1, x_s} for the first face image with S-type landmark annotations (data Set 1), prepares the following data, and then begins to train for k iterations:
face images I = I_A,
initialized/estimated annotation x,
ground truth annotation x* = x_s.
[0042] At step S703, the training device 30 operates to extract local appearance information φ(x) for the augmented training set {I_A, x_s, B} and represents the extracted local appearance information as features f.
[0043] At step S704, the training device 30 operates to compute the dissimilarity Δx between the estimated current shape x and the ground truth shape x* by rule of Δx = x* − x.
[0044] At step S705, the training device 30 gets a PCA (principal component analysis) projection model M_PCA,k by performing PCA analysis on the features f, and gets a mapping M_reg,k via ridge regression from the PCA-projected features to the dissimilarity.
[0045] At step S706, the training device 30 operates to determine whether the estimated shape converges to the ground truth shape. If yes, at step S707, the training device 30 determines a model M = {M_PCA,k, M_reg,k}, ∀k = 1, 2, .... (containing the PCA projection model and the mapping function for each iteration).
[0046] Otherwise, at step S708, the estimated annotation is updated as x = x + M_reg(M_PCA(f)) and fed back to step S702 to repeat steps S703-S708, so as to obtain a robust trained model M and the initializing function init(B).
[0047] Referring to Fig. 6 again, the predicting device 40 is configured to receive a face image with a pre-detected bounding box B and predict the facial landmark positions, i.e. the estimated 2D coordinates (x and y) of the facial landmarks of the received face image. The detailed process of the predicting device 40 will be further discussed with reference to Fig. 8.
[0048] At step S801, the predicting device 40 gets the initializing function init(B) from the training device 30 and gets initialized landmarks x according to the bounding box B and init(B) for the received face image. At step S802, the predicting device 40 gets the robust trained model M from the training device 30, and then, for each iteration, the predicting device 40 computes the local appearance information φ(x) as features f and calculates the estimated Δx by rule of Δx = M_reg(M_PCA(f)). The predicting device 40 then updates the landmarks x by rule of x = x + Δx. Finally, the predicting device 40 outputs x from the last of the K iterations.
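The prediction stage of Fig. 8 could then be sketched as follows, reusing the phi, init_landmarks, fit_pca and fit_ridge sketches above; model is assumed to be the per-iteration list of ((mean, P), (W, b)) pairs produced during training. Unlike the transfer stage, only φ(x) is used as the feature here, since no ground-truth landmarks are available at test time.

```python
def predict_landmarks(image, box, rel, model):
    """Steps S801-S802: initialize from the bounding box, then run the cascade."""
    x = init_landmarks(box, rel)                 # step S801: x = init(B)
    for (mean, P), (W, b) in model:              # step S802: K iterations
        f = phi(image, x).ravel()                # features f = phi(x)
        delta_x = ((f - mean) @ P) @ W + b       # delta_x = M_reg(M_PCA(f))
        x = x + delta_x.reshape(x.shape)         # x = x + delta_x
    return x                                     # estimated 2-D landmark coordinates
```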
[0049] In the above, the systems 10 and 100 have been discussed in the case where they are implemented using certain hardware or a combination of hardware and software. It shall be appreciated that the systems 10 and 100 may also be implemented using software. In addition, the embodiments of the present invention may be adapted to a computer program product embodied on one or more computer readable storage media (comprising but not limited to disk storage, CD-ROM, optical memory and the like) containing computer program codes.
[0050] In the case that the systems 10 and 100 are implemented with software, these systems may run on a general purpose computer, a computer cluster, a mainstream computer, a computing device dedicated for providing online contents, or a computer network comprising a group of computers operating in a centralized or distributed fashion.
[0051] Although the preferred examples of the present invention have been described, those skilled in the art can make variations or modifications to these examples upon knowing the basic inventive concept. The appended claims are intended to be construed as comprising the preferred examples and all variations or modifications falling within the scope of the present invention.
[0052] Obviously, those skilled in the art can make variations or modifications to the present invention without departing from the spirit and scope of the present invention. As such, if these variations or modifications belong to the scope of the claims and the equivalent technique, they may also fall within the scope of the present invention.

Claims

What is claimed is:
1. A method for estimating facial landmarks for a face image, comprising:
retrieving a first face image dataset with first type landmark annotations and a second face image dataset with second type landmark annotations;
transferring the first type landmark annotations from the first face image dataset to the second face image dataset to obtain pseudo first type annotations for the second face image dataset; and
combining the second face image dataset with the pseudo first type landmark annotations and the first face image dataset to make the second face image dataset have the first type landmark annotations.
2. A method according to claim 1, wherein the first type landmark annotations comprise S-type landmark annotations; and the second type landmark annotations comprise T-type landmark annotations.
3. A method according to claim 1, wherein the transferring further comprises: determining a transductive model from a common landmark index between the first type landmark annotations and the second type landmark annotations, initial first-type annotations, and the first face image dataset; and
transferring, based on the transductive model, the first type landmark annotations from the first face image dataset to the second face image dataset to obtain pseudo first type landmark annotations for the second face image dataset.
4. A method according to claim 3, wherein the determining further comprises:
1) determining a plurality of common landmark indexes for the first type landmark annotations and the second type landmark annotations;
2) learning a mapping matrix from the determined common landmark indexes (x_s)_common to the first type landmark annotations;
3) determining initial/estimated first-type annotations for the second face image dataset based on the common landmark indexes and the mapping matrix; and
4) determining the transductive model {M_PCA,k, M_reg,k} from the common landmark indexes, the initial first type annotations, and the first face image dataset.
5. A method according to claim 4, wherein the transferring further comprises:
5) determining an estimated annotation x for the second face image dataset from the mapping matrix and the common landmark indexes;
6) determining an estimated error Δx based on the transductive model, local appearance information φ(x) for the first face image dataset, and the feature Jacobian φ(x*) − φ(x) for the common landmark indexes (x_s)_common; and
7) updating the current estimated annotation x by rule of x = x + Δx so as to obtain the pseudo landmark annotations,
where x* represents a ground truth annotation for x, and
B_1 and B_2 represent a bounding box of an image for the first face image dataset and the second face image dataset, respectively.
6. A method according to claim 5, wherein the step 6) further comprises:
extracting local appearance information φ(x) for the first face image dataset, and the feature Jacobian φ(x*) − φ(x) for the common landmark indexes (x_s)_common;
concatenating the local appearance information and the feature Jacobian; and
determining, from the concatenation of the local appearance information and the feature Jacobian, an estimated error Δx based on the transductive model.
7. A method according to claim 5, wherein the step 4) further comprises:
a) extracting local appearance information for the first face image dataset, and the feature Jacobian for the common landmark indexes;
b) concatenating the local appearance information and the feature Jacobian;
c) computing a dissimilarity Δx between an estimated current shape x and a ground truth shape x*;
d) getting a PCA projection model M_PCA by performing PCA analysis on features f, where f represents the concatenation of the local appearance information and the feature Jacobian;
e) getting a mapping model M_reg via ridge regression from the PCA-projected features to the dissimilarity; and
f) determining whether the estimated shape converges to the ground truth shape;
if yes, determining the transductive model {M_PCA, M_reg};
otherwise, updating the estimated annotation by rule of x = x + M_reg(M_PCA(f)) and then repeating the above steps a)-f) with the updated annotation.
8. A method according to claim 1, wherein the combining further comprises:
comparing the estimated common landmark indexes x_s and the ground truth common landmark indexes to identify erroneously transferred annotations among the pseudo first type landmark annotations in the second face image dataset;
filtering out the erroneously transferred annotations so as to get a cleaned face image dataset {I_2', x_s', B_2'};
receiving the first face image dataset {I_1, x_s, B_1}; and
combining the cleaned new face image dataset with the first face image dataset to obtain an augmented face image dataset {I_A, x_s, B}.
9. A method according to claim 8, further comprising:
receiving the augmented face image dataset with bounding boxes of images {I_A, x_s, B}, where B represents a bounding box of an image in the augmented face image dataset, x_s represents landmark annotations and I_A represents the face images of the augmented face image dataset, and estimating the relation between initial landmarks and the bounding box B, so as to get initialized landmarks x according to the bounding box B.
10. A method according to claim 9, further comprising:
receiving the first face image dataset {I_1, x_s}, preparing the following data, and then beginning to train for k iterations:
face images I = I_A,
initialized/estimated annotation x,
ground truth annotation x* = x_s,
extracting local appearance information φ(x) for the augmented face image dataset {I_A, x_s, B} and representing the extracted local appearance information as features f;
computing a dissimilarity Δx between the estimated current shape x and the ground truth shape x*;
determining a PCA projection model M_PCA,k by performing PCA analysis on the features f;
determining a mapping M_reg,k via ridge regression from the PCA-projected features to the dissimilarity;
determining whether the estimated shape converges to the ground truth shape;
if yes, determining a model M = {M_PCA,k, M_reg,k}, ∀k = 1, 2, ....;
otherwise, updating the estimated annotation as x = x + M_reg(M_PCA(f)) and repeating the above steps so as to obtain a robust trained model M.
11. A method according to claim 10, further comprising:
receiving a face image with a pre-detected bounding box B; and
predicting facial landmarks positions of facial landmarks of the received face image.
12. A method according to claim 11, wherein the predicting further comprises: getting initialized landmarks x according to the bounding box B for the received face image;
computing local appearance information for the received face image;
calculating an estimated error Δx by rule of Δx = M_reg(M_PCA(f)), where f represents the local appearance information; and
updating the landmarks x by rule of x = x + Δx.
13. A system for estimating facial landmarks for a face image, comprising:
a transductive alignment device configured to retrieve a first face image dataset with first type landmark annotations and a second face image dataset with second type landmark annotations, and transfer the first type landmark annotations from the first face image dataset to the second face image dataset to obtain pseudo first type annotations for the second face image dataset; and
a data augmentation device configured to combine the second face image dataset with the pseudo first type landmark annotations and the first face image dataset to make the second face image dataset have the first type landmark annotations.
14. A system according to claim 13, wherein the first type landmark annotations comprise S-type landmark annotations; and the second type landmark annotations comprise T-Type landmark annotations.
15. A system according to claim 13, wherein the transductive alignment device is further configured to determine a transductive model from common landmark indexes between the first type landmark annotations and the second type landmark annotations, initial first-type annotations, and the first face image dataset, and to transfer, based on the transductive model, the first type landmark annotations from the first face image dataset to the second face image dataset to obtain pseudo first type landmark annotations for the second face image dataset.
16. A system according to claim 13, wherein the transductive alignment device further comprises:
a common landmarks determination unit configured to determine a plurality of common landmark indexes for the first type landmark annotations and the second type landmark annotations;
a mapping unit configured to learn a mapping matrix from the determined common landmark indexes to the first type landmark annotations;
a first annotation estimated unit configured to determine initial/estimated first-type annotations for the second face image dataset based on the common landmark indexes and the mapping matrix;
a transductive model training unit configured to determine the transductive model from the common landmark indexes, the initial first type annotations, and the first face image dataset.
17. A system according to claim 16, wherein the transductive alignment device further comprises:
a second annotation estimated unit configured to determine an estimated annotation x for the second face image dataset from the mapping matrix and the common landmark indexes;
a pseudo annotation determination unit configured to determine an estimated error Δx based on the transductive model, local appearance information φ(x) for the first face image dataset, and the feature Jacobian φ(x*) − φ(x) for the common landmark indexes, and then to update the current estimated annotation x by rule of x = x + Δx so as to obtain the pseudo annotations,
where x* represents a ground truth annotation for x, and B_1 and B_2 represent a bounding box of an image for the first face image dataset and the second face image dataset, respectively.
18. A system according to claim 17, wherein the pseudo annotation determination unit is further configured to determine the estimated error Δx by:
extracting local appearance information φ(x) for the first face image dataset, and the feature Jacobian φ(x*) − φ(x) for the common landmark indexes (x_s)_common;
concatenating the local appearance information and the feature Jacobian; and
determining, from the concatenation of the local appearance information and the feature Jacobian, an estimated error Δx based on the transductive model.
19. A system according to claim 17, wherein the pseudo annotation determination unit is further configured to obtain the pseudo annotation by:
a) extracting local appearance information for the first face image dataset, and the feature Jacobian for the common landmark indexes;
b) concatenating the local appearance information and the feature Jacobian;
c) computing a dissimilarity Δx between an estimated current shape x and a ground truth shape x*;
d) getting a PCA projection model M_PCA,k by performing PCA analysis on features f, where f represents the concatenation of the local appearance information and the feature Jacobian;
e) getting a mapping model M_reg,k via ridge regression from the PCA-projected features to the dissimilarity; and
f) determining whether the estimated shape converges to the ground truth shape;
if yes, determining the transductive model {M_PCA,k, M_reg,k}, ∀k = 1, 2, ....;
otherwise, updating the estimated annotation by rule of x = x + M_reg(M_PCA(f)) and then repeating the above steps a)-f) with the updated annotation.
20. A system according to claim 13, wherein the data augmentation device is further configured to:
compare the estimated common landmark indexes x_s and the ground truth common landmark indexes to identify the erroneously transferred annotations among the pseudo first type landmark annotations in the second face image dataset;
filter out the erroneously transferred annotations so as to get a cleaned face image dataset {I_2', x_s', B_2'};
receive the first face image dataset {I_1, x_s, B_1}; and
combine the cleaned new face image dataset with the first face image dataset to obtain an augmented face image dataset {I_A, x_s, B}.
21. A system according to claim 20, further comprising:
a training device configured to receive the augmented face image dataset with bounding boxes of images {I_A, x_s, B}, where B represents a bounding box of an image in the augmented face image dataset, x_s represents landmark annotations and I_A represents the face images of the augmented face image dataset, and
wherein the training device estimates the relation between initial landmarks and the bounding box B, so as to get initialized landmarks x according to the bounding box B.
22. A system according to claim 21, wherein the training device is further configured to train a robust trained model by:
receiving the first face image dataset {I_1, x_s}, preparing the following data, and then beginning to train for k iterations:
face images I = I_A,
initialized/estimated annotation x,
ground truth annotation x* = x_s,
extracting local appearance information φ(x) for the augmented face image dataset {I_A, x_s, B} and representing the extracted local appearance information as features f;
computing a dissimilarity Δx between the estimated current shape x and the ground truth shape x*;
determining a PCA projection model M_PCA,k by performing PCA analysis on the features f;
determining a mapping M_reg,k via ridge regression from the PCA-projected features to the dissimilarity;
determining whether the estimated shape converges to the ground truth shape;
if yes, determining a model M = {M_PCA,k, M_reg,k}, ∀k = 1, 2, ....;
otherwise, updating the estimated annotation as x = x + M_reg(M_PCA(f)) and repeating the above steps so as to obtain a robust trained model.
23. A system according to claim 21, further comprising:
a predicting device configured to receive a face image with a pre-detected bounding box B, and to predict the facial landmark positions of the received face image.
24. A system according to claim 23, wherein the predicting device is further configured to predict the facial landmark positions by:
getting initialized landmarks x according to the bounding box B and init(B) for the received face image;
computing local appearance information for the received face image;
calculating an estimated error Δx by rule of Δx = M_reg(M_PCA(f)), where f represents the local appearance information; and
updating the landmarks x by rule of x = x + Δx.
PCT/CN2014/000785 2014-08-20 2014-08-20 A method and a system for estimating facial landmarks for face image WO2016026064A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2014/000785 WO2016026064A1 (en) 2014-08-20 2014-08-20 A method and a system for estimating facial landmarks for face image
CN201480082760.XA CN107004136B (en) 2014-08-20 2014-08-20 Method and system for the face key point for estimating facial image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2014/000785 WO2016026064A1 (en) 2014-08-20 2014-08-20 A method and a system for estimating facial landmarks for face image

Publications (1)

Publication Number Publication Date
WO2016026064A1 true WO2016026064A1 (en) 2016-02-25

Family

ID=55350057

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/000785 WO2016026064A1 (en) 2014-08-20 2014-08-20 A method and a system for estimating facial landmarks for face image

Country Status (2)

Country Link
CN (1) CN107004136B (en)
WO (1) WO2016026064A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109858382A (en) * 2019-01-04 2019-06-07 广东智媒云图科技股份有限公司 A method of portrait is drawn according to dictation
WO2021246821A1 (en) * 2020-06-05 2021-12-09 주식회사 픽스트리 Method and device for improving facial image
JP7445785B2 (en) 2020-07-24 2024-03-07 深▲チェン▼市富途網絡科技有限公司 Information processing methods, devices, electronic devices and storage media

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113192162B (en) * 2021-04-22 2022-12-02 清华珠三角研究院 Method, system, device and storage medium for driving image by voice

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1701339A (en) * 2002-09-19 2005-11-23 汤晓鸥 Portrait-photo recognition
US20060008149A1 (en) * 2004-07-12 2006-01-12 The Board Of Trustees Of The University Of Illinois Method of performing shape localization
CN103268623A (en) * 2013-06-18 2013-08-28 西安电子科技大学 Static human face expression synthesizing method based on frequency domain analysis
US20130287294A1 (en) * 2012-04-30 2013-10-31 Cywee Group Limited Methods for Generating Personalized 3D Models Using 2D Images and Generic 3D Models, and Related Personalized 3D Model Generating System

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009067560A1 (en) * 2007-11-20 2009-05-28 Big Stage Entertainment, Inc. Systems and methods for generating 3d head models and for using the same
CN102436668A (en) * 2011-09-05 2012-05-02 上海大学 Automatic Beijing Opera facial mask making-up method
US8977012B2 (en) * 2012-10-31 2015-03-10 Google Inc. Image denoising system and method
US20140185924A1 (en) * 2012-12-27 2014-07-03 Microsoft Corporation Face Alignment by Explicit Shape Regression
CN103390282B (en) * 2013-07-30 2016-04-13 百度在线网络技术(北京)有限公司 Image labeling method and device thereof

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1701339A (en) * 2002-09-19 2005-11-23 汤晓鸥 Portrait-photo recognition
US20060008149A1 (en) * 2004-07-12 2006-01-12 The Board Of Trustees Of The University Of Illinois Method of performing shape localization
US20130287294A1 (en) * 2012-04-30 2013-10-31 Cywee Group Limited Methods for Generating Personalized 3D Models Using 2D Images and Generic 3D Models, and Related Personalized 3D Model Generating System
CN103268623A (en) * 2013-06-18 2013-08-28 西安电子科技大学 Static human face expression synthesizing method based on frequency domain analysis

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109858382A (en) * 2019-01-04 2019-06-07 广东智媒云图科技股份有限公司 A method of portrait is drawn according to dictation
WO2021246821A1 (en) * 2020-06-05 2021-12-09 주식회사 픽스트리 Method and device for improving facial image
JP7445785B2 (en) 2020-07-24 2024-03-07 深▲チェン▼市富途網絡科技有限公司 Information processing methods, devices, electronic devices and storage media

Also Published As

Publication number Publication date
CN107004136A (en) 2017-08-01
CN107004136B (en) 2018-04-17

Similar Documents

Publication Publication Date Title
CN110969250B (en) Neural network training method and device
CN108027878B (en) Method for face alignment
CN113449857B (en) Data processing method and data processing equipment
JP7360497B2 (en) Cross-modal feature extraction method, extraction device, and program
US20190301861A1 (en) Method and apparatus for binocular ranging
WO2016026063A1 (en) A method and a system for facial landmark detection based on multi-task
WO2016026135A1 (en) Face alignment with shape regression
JPWO2005119507A1 (en) High-speed and high-precision singular value decomposition method, program and apparatus for matrix
CN104679818A (en) Video keyframe extracting method and video keyframe extracting system
Conroy et al. Fast, exact model selection and permutation testing for l2-regularized logistic regression
WO2016026064A1 (en) A method and a system for estimating facial landmarks for face image
Steedly et al. Spectral Partitioning for Structure from Motion.
WO2023151237A1 (en) Face pose estimation method and apparatus, electronic device, and storage medium
CN112767230A (en) GPU graph neural network optimization method and device
Kumar et al. Modeling latent variable uncertainty for loss-based learning
CN114913330B (en) Point cloud component segmentation method and device, electronic equipment and storage medium
WO2015035593A1 (en) Information extraction
CN104463864B (en) Multistage parallel key frame cloud extracting method and system
WO2012091539A1 (en) A semantic similarity matching system and a method thereof
CN112650869B (en) Image retrieval reordering method and device, electronic equipment and storage medium
Jo et al. Ransac versus cs-ransac
CN113935387A (en) Text similarity determination method and device and computer readable storage medium
JP5754306B2 (en) Image identification information addition program and image identification information addition device
Qin et al. Larger receptive field based RGB visual relocalization method using convolutional network
CN106157289B (en) Line detecting method and equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14900018

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14900018

Country of ref document: EP

Kind code of ref document: A1