WO2015184619A1 - Method and apparatus for estimating recessive character distribution of users

Info

Publication number
WO2015184619A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
feature
website
dominant
recessive
Prior art date
Application number
PCT/CN2014/079258
Other languages
French (fr)
Chinese (zh)
Inventor
陈宽
Original Assignee
深圳市推想大数据信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市推想大数据信息技术有限公司
Priority to CN201480000467.4A (CN104205100B)
Priority to PCT/CN2014/079258 (WO2015184619A1)
Publication of WO2015184619A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 Advertisements
    • G06Q30/0277 Online advertisement

Abstract

A method and apparatus for estimating the distribution of users' recessive features. The method comprises: acquiring the users who use a website and the dominant features of those users; acquiring feature information of the whole population from a population database, the feature information comprising dominant features and recessive features; and calculating the distribution of the users' recessive features from the feature information of the whole population, the users who use the website, and the dominant features of the users, using a Bayesian algorithm. With this method, estimates of users' recessive features are more accurate.

Description

Method and apparatus for estimating the distribution of users' recessive features
[Technical Field]
The present invention relates to the field of network technologies, and in particular to a method and apparatus for estimating the distribution of users' recessive features.
[Background]
Typically, to use a website a person must register as a user of that website, and registration requires filling in information such as a user name and an ID number.
If a website operator wants to run precisely targeted advertising and push different advertisements to different users, the registration information alone is not sufficient; more user information is needed. Further information about a user can be inferred from what the user has already registered. For example, knowing a user's name, one may wish to estimate the user's age, ethnicity, gender, and so on.
In the prior art, recessive features are estimated from known dominant features by means of the Bayesian equation, as follows:
Let $x$ be the recessive feature of the user that we want to estimate, and let $t$ be the dominant feature of the user that we can observe. To estimate $x$, the Bayesian equation is:
$$P(x \mid t) = \frac{P(t \mid x)\,P(x)}{P(t)}$$
Here the sample space of the Bayesian equation is the national population data. For example, if $t$ is the user's name and $x$ is the user's gender, the national population data give the probability $P(t \mid x)$ that the name $t$ occurs within each gender $x$, the probability $P(x)$ of each gender $x$, and the probability $P(t)$ that the name $t$ occurs, from which $P(x \mid t)$ can be calculated.
Note, however, that the sample space of the Bayesian equation above is the national population, whereas the makeup of a website's users often differs greatly from that of the national population. For example, most Sina Weibo users are young university students, and most Renren users are students still in school. Applying the Bayesian equation as is in such cases gives estimates of recessive features with large errors, as the following example shows:
Suppose a user of some website F is observed to have the user name Jo (the dominant feature $t$), and we want to estimate Jo's age group. Let age group A be 0 to 50 years and age group B be 50 to 100 years, each making up half of the population, so that $P(A) = P(B) = 0.5$. Assume that nobody in the 50 to 100 age group uses website F, so that $P(f \mid B) = 0$, where $f$ denotes using website F. Suppose the population database shows that, among people named Jo, 1 person is in the 0 to 50 age group and 99 people are in the 50 to 100 age group. Then
$$\frac{P(A \mid t)}{P(B \mid t)} = \frac{P(t \mid A)\,P(A)/P(t)}{P(t \mid B)\,P(B)/P(t)} = \frac{P(t \mid A)\,P(A)}{P(t \mid B)\,P(B)} = \frac{P(t \mid A) \times 0.5}{P(t \mid B) \times 0.5} = \frac{P(t \mid A)}{P(t \mid B)} = \frac{1}{99}$$
According to the Bayesian equation, the probability that Jo is in the 0 to 50 age group is 1%, and the probability that Jo is in the 50 to 100 age group is 99%. In reality, the probability that Jo is in the 0 to 50 age group is 100% and the probability for the 50 to 100 age group is 0%. The sample space formed by the users of website F differs in makeup from the national population, yet the calculation uses national population data; this mismatch of sample spaces causes the severe deviation in the result. In general, every website has its own characteristics and attracts its own kind of audience, whose makeup usually differs from that of the national population. Estimating users' recessive features on the sample space of the national population data therefore inevitably introduces errors.
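The Jo example can be checked numerically. The following is a minimal sketch, not part of the patent text: the function names are hypothetical, the usage rate for age group A is an arbitrary assumed value (only the zero rate for group B comes from the example), and `site_posterior` assumes that, within each age group, using the site is independent of the name.

```python
from fractions import Fraction


def naive_posterior(counts_with_t):
    """P(x | t) estimated on the whole-population sample space."""
    total = sum(counts_with_t.values())
    return {x: Fraction(n, total) for x, n in counts_with_t.items()}


def site_posterior(counts_with_t, site_usage_rate):
    """P(x | t, f): the same counts, weighted by each group's site usage rate.

    Assumes site usage is independent of the name within each group.
    """
    weighted = {x: n * site_usage_rate[x] for x, n in counts_with_t.items()}
    total = sum(weighted.values())
    return {x: w / total for x, w in weighted.items()}


# People named Jo in the population database, by age group (from the example).
jo_counts = {"A": 1, "B": 99}
# Share of each age group using website F; the value for A is an assumption.
usage = {"A": Fraction(1, 2), "B": Fraction(0)}

naive = naive_posterior(jo_counts)        # A: 1/100, B: 99/100
actual = site_posterior(jo_counts, usage)  # A: 1, B: 0
print(naive, actual)
```

The naive estimate puts 99% of the probability on the group that cannot contain any user of the site, which is the deviation the text describes.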
[Summary of the Invention]
To at least partly solve the above problem, the present invention provides a method and apparatus for estimating the distribution of users' recessive features, so that estimates of users' recessive features are more accurate.
To solve the above technical problem, one technical solution adopted by the present invention is a method for estimating the distribution of users' recessive features, comprising: acquiring the users who use a website and the dominant features of those users; acquiring feature information of the whole population from a population database, the feature information comprising dominant features and recessive features; and calculating the distribution of the users' recessive features from the feature information of the whole population, the users who use the website, and the dominant features of the users, in combination with a Bayesian algorithm.
The step of calculating the distribution of the users' recessive features from the feature information of the whole population, the users who use the website, and the dominant features of the users, in combination with a Bayesian algorithm, is specifically: if, given any recessive feature of a user, the probabilistic independence condition between the user using the website and the user having the dominant feature holds, the recessive features of the user are calculated according to the following formula:
$$P(x_1 \cap \dots \cap x_L \mid t \cap f) = \frac{P(t \mid x_1 \cap \dots \cap x_L)\,P(f \mid x_1 \cap \dots \cap x_L)\,P(x_1 \cap \dots \cap x_L)}{P(t \cap f)}$$
where $L$ is an integer greater than or equal to 1, $x$ is a recessive feature of the user, $t$ is the dominant feature of the user, and $f$ denotes the user using the website.
The method further comprises determining whether, given any recessive feature of a user, the probabilistic independence condition between the user using the website and the user having the dominant feature holds. The determination comprises:
calculating, for any user, the value of $P_1$ from the feature information of the whole population, the users who use the website, and the dominant features of the users, where
$$P_1 = P(t \cap f \mid x_1 \cap \dots \cap x_L)$$
calculating, for any user, the value of $P_2$ from the feature information of the whole population, where
$$P_2 = P(t \mid x_1 \cap \dots \cap x_L)\,P(f \mid x_1 \cap \dots \cap x_L)$$
and, if $P_1$ and $P_2$ are equal for every user, concluding that the probabilistic independence condition holds.
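The comparison of $P_1$ with $P_2$ can be sketched as a counting procedure. This is an illustrative sketch only: the record layout `(x, t, uses_site)`, the function name, and the tolerance are assumptions, and real data would call for a statistical test rather than an exact equality check.

```python
from collections import Counter
from math import isclose


def independence_holds(records, tol=1e-9):
    """Check P(t and f | x) == P(t | x) * P(f | x) for every observed (x, t).

    records: iterable of (x, t, uses_site) tuples, one per person
    (hypothetical layout, where x is the recessive feature value).
    """
    n_x, n_xt, n_xf, n_xtf = Counter(), Counter(), Counter(), Counter()
    for x, t, uses_site in records:
        n_x[x] += 1
        n_xt[x, t] += 1
        if uses_site:
            n_xf[x] += 1
            n_xtf[x, t] += 1
    for (x, t), n in n_xt.items():
        p1 = n_xtf[x, t] / n_x[x]               # P1 = P(t and f | x)
        p2 = (n / n_x[x]) * (n_xf[x] / n_x[x])  # P2 = P(t | x) * P(f | x)
        if not isclose(p1, p2, abs_tol=tol):
            return False
    return True


independent = [("A", "Jo", True), ("A", "Jo", False),
               ("A", "Al", True), ("A", "Al", False)]
dependent = [("A", "Jo", True), ("A", "Jo", True),
             ("A", "Al", False), ("A", "Al", False)]
print(independence_holds(independent), independence_holds(dependent))
```

In the second data set everyone named Jo uses the site and nobody named Al does, so the name and site usage are not independent within group A and the check fails.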
The method further comprises analyzing the users' behavioral habits according to the dominant features and recessive features of the users.
To solve the above technical problem, another technical solution adopted by the present invention is an apparatus for estimating the distribution of users' recessive features, comprising: a first acquisition module, configured to acquire the users who use a website and the dominant features of those users; a second acquisition module, configured to acquire feature information of the whole population from a national population database, the feature information comprising dominant features and recessive features; and a calculation module, configured to calculate the distribution of the users' recessive features from the feature information of the whole population, the users who use the website, and the dominant features of the users, in combination with a Bayesian algorithm.
If, given any recessive feature of a user, the probabilistic independence condition between the user using the website and the user having the dominant feature holds, the recessive features of the user are calculated according to the following formula:
$$P(x_1 \cap \dots \cap x_L \mid t \cap f) = \frac{P(t \mid x_1 \cap \dots \cap x_L)\,P(f \mid x_1 \cap \dots \cap x_L)\,P(x_1 \cap \dots \cap x_L)}{P(t \cap f)}$$
where $L$ is an integer greater than or equal to 1, $x$ is a recessive feature of the user, $t$ is the dominant feature of the user, and $f$ denotes the user using the website.
The apparatus further comprises a determination module. The determination module is configured to calculate, for any user, the value of $P_1$ from the feature information of all users, the users who use the website, and the dominant features of the users, where
$$P_1 = P(t \cap f \mid x_1 \cap \dots \cap x_L)$$
to calculate, for any user, the value of $P_2$ from the feature information of all users, the users who use the website, and the dominant features of the users, where
$$P_2 = P(t \mid x_1 \cap \dots \cap x_L)\,P(f \mid x_1 \cap \dots \cap x_L)$$
and to determine whether $P_1$ and $P_2$ are equal for any user; if they are equal, the probabilistic independence condition holds.
The apparatus further comprises an analysis module, configured to analyze the users' behavioral habits according to the dominant features and recessive features of the users.
To solve the above technical problem, yet another technical solution adopted by the present invention is an apparatus for estimating the distribution of users' recessive features, the apparatus comprising a processor. The processor is configured to acquire the users who use a website and the dominant features of those users; to acquire feature information of the whole population from a population database, the feature information comprising dominant features and recessive features; and to calculate the distribution of the users' recessive features from the feature information of the whole population, the users who use the website, and the dominant features of the users, in combination with a Bayesian algorithm.
The step in which the processor calculates the distribution of the users' recessive features is specifically: if, given any recessive feature of a user, the probabilistic independence condition between the user using the website and the user having the dominant feature holds, the processor calculates the recessive features of the user according to the following formula:
$$P(x_1 \cap \dots \cap x_L \mid t \cap f) = \frac{P(t \mid x_1 \cap \dots \cap x_L)\,P(f \mid x_1 \cap \dots \cap x_L)\,P(x_1 \cap \dots \cap x_L)}{P(t \cap f)}$$
where $L$ is an integer greater than or equal to 1, $x$ is a recessive feature of the user, $t$ is the dominant feature of the user, and $f$ denotes the user using the website.
The processor is further configured to determine whether, given any recessive feature of a user, the probabilistic independence condition between the user using the website and the user having the dominant feature holds. The determination comprises:
calculating, for any user, the value of $P_1$ from the feature information of the whole population, the users who use the website, and the dominant features of the users, where
$$P_1 = P(t \cap f \mid x_1 \cap \dots \cap x_L)$$
calculating, for any user, the value of $P_2$ from the feature information of the whole population, where
$$P_2 = P(t \mid x_1 \cap \dots \cap x_L)\,P(f \mid x_1 \cap \dots \cap x_L)$$
and, if $P_1$ and $P_2$ are equal for every user, concluding that the probabilistic independence condition holds.
The processor is further configured to analyze the users' behavioral habits according to the dominant features and recessive features of the users.
The beneficial effect of the present invention is as follows. In contrast to the prior art, the present invention brings in data on the users of the website when calculating users' recessive features. The probability that a person with the dominant feature also has the recessive feature is thus calculated with the website's user base as the sample space, rather than the national population data. The sample spaces no longer differ, so the error caused by that difference is removed and the calculation result is corrected and made more accurate.
[Description of the Drawings]
FIG. 1 is a flowchart of an embodiment of the method of the present invention for estimating the distribution of users' recessive features;
FIG. 2 is a schematic diagram of the distribution of dominant features and recessive features in the sample space, in an embodiment of the method of the present invention;
FIG. 3 is a schematic diagram of the distribution in the sample space of users who have a recessive feature and use the website, in an embodiment of the method of the present invention;
FIG. 4 is a schematic diagram of the corrected distribution of dominant features and recessive features in the sample space, in an embodiment of the method of the present invention;
FIG. 5 is a schematic structural diagram of a first embodiment of the apparatus of the present invention for estimating the distribution of users' recessive features;
FIG. 6 is a schematic structural diagram of a second embodiment of the apparatus of the present invention for estimating the distribution of users' recessive features.
[Detailed Description]
The present invention is described in detail below with reference to the drawings and embodiments.
Referring to FIG. 1, the method comprises:
Step S201: acquiring the users who use the website and the dominant features of those users.
The website records information about its users, such as registration information and access information. This information is usually stored in the website's backend statistics, from which it can be determined who uses the website. For example, if the statistics record that Zhang San and Li Si have registered as users of the website, the statistics show that Zhang San and Li Si use the website. The user information is of course required to be genuine, for example a real name and a real age.
A dominant feature of a user is a feature that can be obtained directly. For example, if the statistics record the real names of registered users, the name is a dominant feature of the user.
A recessive feature of a user is a feature that cannot be obtained directly. For example, if the statistics do not record the ethnicity of registered users, so that ethnicity cannot be obtained directly from the statistics, ethnicity is a recessive feature of the user.
Step S202: acquiring feature information of the whole population from a population database, the feature information comprising dominant features and recessive features.
The population database records in detail the feature information of the whole population, for example each person's name, gender, age, and so on. Note that the feature information in the population database comprises dominant features and recessive features, where the dominant features correspond to the users' dominant features and the recessive features correspond to the users' recessive features. For example, if a user's name is a dominant feature, the name in the population database is a dominant feature; if a user's ethnicity is a recessive feature, the ethnicity in the population database is a recessive feature.
In embodiments of the present invention, the population database may be a population database published by a national authority and obtainable through public channels.
Step S203: calculating the distribution of the users' recessive features from the feature information of the whole population, the users who use the website, and the dominant features of the users, in combination with a Bayesian algorithm.
Before the distribution of the users' recessive features is calculated with the Bayesian algorithm, it must be verified whether, given any recessive feature of a user, the probabilistic independence condition between the user using the website and the user having the dominant feature holds. Step S203 may therefore be specified as: if, given any recessive feature of a user, the probabilistic independence condition between the user using the website and the user having the dominant feature holds, the recessive features of the user are calculated according to the following formula:
$$P(x_1 \cap \dots \cap x_L \mid t \cap f) = \frac{P(t \mid x_1 \cap \dots \cap x_L)\,P(f \mid x_1 \cap \dots \cap x_L)\,P(x_1 \cap \dots \cap x_L)}{P(t \cap f)} \qquad \text{(Formula 1)}$$
where $L$ is an integer greater than or equal to 1, $x$ is a recessive feature of the user, $t$ is the dominant feature of the user, and $f$ denotes the user using the website.
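Formula 1 can be evaluated directly once the three factors are known for every recessive-feature value. The sketch below is one possible reading, not the patent's own implementation: the function name and the dictionary layout are assumptions, the listed values of $x$ are assumed mutually exclusive and exhaustive, and $P(t \cap f)$ is obtained by summing the numerator over them, which is valid only when the independence condition holds. The numbers in the usage example are illustrative.

```python
from fractions import Fraction


def corrected_posterior(p_t_given_x, p_f_given_x, p_x):
    """Return P(x | t and f) for each recessive-feature value x (Formula 1).

    Keys may be single values or tuples (x_1, ..., x_L) for several
    recessive features; all three dicts must share the same keys.
    """
    joint = {x: p_t_given_x[x] * p_f_given_x[x] * p_x[x] for x in p_x}
    evidence = sum(joint.values())  # P(t and f) under the independence condition
    if evidence == 0:
        raise ValueError("P(t and f) is zero: no site user has dominant feature t")
    return {x: j / evidence for x, j in joint.items()}


# Illustrative numbers in the spirit of the Jo example (assumed, not measured).
posterior = corrected_posterior(
    p_t_given_x={"A": Fraction(1, 1000), "B": Fraction(99, 1000)},
    p_f_given_x={"A": Fraction(3, 10), "B": Fraction(0)},
    p_x={"A": Fraction(1, 2), "B": Fraction(1, 2)},
)
print(posterior)
```

Because group B never uses the site, its numerator is zero and all of the probability mass moves to group A, whatever usage rate is assumed for A.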
The origin of Formula 1 is explained below for the case $L = 1$. As described in the background, the makeup of a website's users differs from that of the national population, so applying the Bayesian equation as is produces errors in the result. To avoid such errors the sample space must be corrected by bringing the users of the website into the Bayesian equation. The corrected Bayesian equation is:
$$P(x_1 \mid t \cap f) = \frac{P(t \cap f \mid x_1)\,P(x_1)}{P(t \cap f)} \qquad \text{(Formula 2)}$$
If the probabilistic independence condition holds, then $P(t \cap f \mid x_1) = P(t \mid x_1)\,P(f \mid x_1)$, and therefore
$$P(x_1 \mid t \cap f) = \frac{P(t \mid x_1)\,P(f \mid x_1)\,P(x_1)}{P(t \cap f)} \qquad \text{(Formula 3)}$$
Formula 3 reduces a probability problem involving three conditions jointly to one involving the conditions pairwise, which eases the requirements on the data.
Further, comparing Formula 3 with Formula 2 shows that the simplified Bayesian equation requires the probabilistic independence condition to hold. The reason is illustrated by the following example.
As shown in FIG. 2, suppose the recessive feature x can take only two values, A and B, shown in the figure as the two regions A and B, and let a and b be the areas of A and B respectively. Suppose the observable dominant feature t is represented by the small rectangle in the middle; its intersections with the two regions of the recessive feature are TA and TB, with areas ta and tb respectively. The problem to be solved is to find the ratio of the areas of TA and TB; normalizing so that the two sum to 1 then gives the two probabilities.
If A and B together cover the entire population sample space, the simple Bayesian equation is:
$$\frac{P(A \mid t)}{P(B \mid t)} = \frac{P(t \mid A)\,P(A)/P(t)}{P(t \mid B)\,P(B)/P(t)} = \frac{P(t \mid A)\,P(A)}{P(t \mid B)\,P(B)}$$
Expressed as a ratio of areas, this is:
$$\frac{\mathit{ta}}{\mathit{tb}} = \frac{\dfrac{\mathit{ta}}{a} \cdot \dfrac{a}{a+b}}{\dfrac{\mathit{tb}}{b} \cdot \dfrac{b}{a+b}}$$
The sample spaces on the two sides of this area equation agree, both being A+B, so the equation necessarily holds.
If the sample spaces on the two sides do not agree, the area equation breaks down. As shown in FIG. 3, suppose that only part of the population B uses website F; call this part B', with area b', and call the intersection of the dominant feature t with B' TB', with area tb'. The value we are actually interested in then becomes
[Figure imgf000009_0001]
The sample space on the left-hand side of the equation is now A+B'. If we go on applying the Bayesian equation as is, the right-hand side remains:
$$\frac{P(t \mid A)\,P(A)}{P(t \mid B)\,P(B)}$$
and the sample space on the right-hand side is still A+B.
Expressed in areas, the left-hand side of the Bayesian equation is $\dfrac{\mathit{ta}}{\mathit{tb}'}$, and the right-hand side is
$$\frac{\dfrac{\mathit{ta}}{a} \cdot \dfrac{a}{a+b}}{\dfrac{\mathit{tb}}{b} \cdot \dfrac{b}{a+b}} = \frac{\mathit{ta}}{\mathit{tb}}$$
Clearly $\dfrac{\mathit{ta}}{\mathit{tb}'} \neq \dfrac{\mathit{ta}}{\mathit{tb}}$, so the left-hand side does not equal the right-hand side. The two sides of the Bayesian equation are unequal, and applying it as is produces errors in the result.
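A small numeric check makes the mismatch concrete. The areas below are illustrative values, and the share of B that uses the site is an assumption; tb' is taken to shrink in the same proportion as b', which is the situation the text goes on to describe for FIG. 4.

```python
from fractions import Fraction

# Illustrative areas (assumed values): regions A and B, and the parts
# TA and TB of each region that carry the dominant feature t.
a, b = 50, 50
ta, tb = 1, 99

# Assumed share of B that uses website F, applied equally to TB.
site_share_b = Fraction(1, 10)
tb_prime = tb * site_share_b

# Right-hand side: Bayesian equation on the full sample space A+B.
rhs = (Fraction(ta, a) * Fraction(a, a + b)) / (Fraction(tb, b) * Fraction(b, a + b))
# Left-hand side: the ratio actually of interest, on the sample space A+B'.
lhs = ta / tb_prime

print(rhs, lhs)
```

The right-hand side reduces to ta/tb = 1/99, whereas the ratio of interest is ta/tb' = 10/99, so the two sides differ by exactly the factor b'/b.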
Clearly, the error arises because the sample spaces on the two sides of the equation are inconsistent. The sample space must therefore be corrected so that the two sides agree.
As shown in FIG. 4, suppose the makeup of the sample space of TA is the same as that of A, and the makeup of TB is the same as that of B. Then, when the people in B who use website F form B',
$$\frac{\mathit{tb}'}{\mathit{tb}} = \frac{b'}{b}, \qquad \frac{\mathit{ta}'}{\mathit{ta}} = \frac{a'}{a}$$
When the sample space is corrected so that the makeup of TA is the same as that of A and the makeup of TB is the same as that of B, the Bayesian equation is:
[Figure imgf000010_0001]
Expressed in areas:
$$\frac{\mathit{tb}'}{\mathit{ta}+\mathit{tb}'} = \frac{\dfrac{\mathit{tb}'}{b'} \cdot \dfrac{b'}{a+b}}{\dfrac{\mathit{ta}+\mathit{tb}'}{a+b}} = \frac{\dfrac{\mathit{tb}}{b} \cdot \dfrac{b'}{a+b}}{\dfrac{\mathit{ta}+\mathit{tb}'}{a+b}} = \frac{\dfrac{\mathit{tb}}{b} \cdot \dfrac{b'}{b} \cdot \dfrac{b}{a+b}}{\dfrac{\mathit{ta}+\mathit{tb}'}{a+b}}$$
The corrected Bayesian equation can therefore be written as:
$$P(B' \mid t \cap f) = \frac{P(t \mid B)\,P(f \mid B)\,P(B)}{P(t \cap f)}$$
The population distribution data can then be obtained from the population database: for example, for each recessive feature value $x_i$, how many people also have the observed dominant feature value $t$, and what proportion of the total population they make up. A sufficiently detailed database (such as the US Census data) lets us identify each person together with that person's dominant and recessive features. Suppose the database holds data on $N$ people in total, the data of the $v$-th person being $(x_v, t_v)$, and let $I(\cdot)$ be the event indicator function. The probabilities in the bias-corrected Bayesian equation can then be calculated as follows:
$$P(t \mid x) = \frac{\sum_{v=1}^{N} I(x_v = x,\ t_v = t)}{\sum_{v=1}^{N} I(x_v = x)}, \qquad P(x) = \frac{\sum_{v=1}^{N} I(x_v = x)}{N}$$
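Estimating these probabilities from a per-person database comes down to counting records that satisfy an indicator condition. The following is a minimal sketch under assumed conventions: the function name and the `(x_v, t_v)` record layout are hypothetical, and the estimates are plain relative frequencies with no smoothing.

```python
def estimate_from_records(records, x, t):
    """Estimate P(t | x) and P(x) by counting.

    records: list of (x_v, t_v) pairs, one per person in the population
    database (hypothetical layout: recessive value, dominant value).
    """
    n_total = len(records)
    # Sum of the indicator I(x_v == x) over all people.
    n_x = sum(1 for x_v, _ in records if x_v == x)
    # Sum of the indicator I(x_v == x and t_v == t) over all people.
    n_xt = sum(1 for x_v, t_v in records if x_v == x and t_v == t)
    p_x = n_x / n_total
    p_t_given_x = n_xt / n_x if n_x else 0.0
    return p_t_given_x, p_x


people = [("A", "Jo"), ("A", "Al"), ("B", "Jo"), ("B", "Jo")]
print(estimate_from_records(people, "A", "Jo"))
print(estimate_from_records(people, "B", "Jo"))
```

For a database too large to hold in memory, the same counts could be produced by a grouped aggregation query; the in-memory loop is used here only to keep the sketch self-contained.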
We also need $P(f \mid x_i)$, that is, the share of people with recessive feature $x_i$ who use website F (for example, how many people aged 12-19 use the website). The website's back-end statistics normally record the relevant user data, so the required values can be read from those statistics. Further, if the recessive feature takes $n$ possible values in total, then

$$\sum_{i=1}^{n} P(x_i \mid t \cap f) = 1$$

Therefore,

Figure imgf000010_0002

The above takes a single recessive feature as an example. In the same way, it can be extended to multiple recessive features, in which case the corrected Bayesian equation is:

Figure imgf000011_0001
where L is an integer greater than or equal to 1, x is a recessive feature of the user, t is a dominant feature of the user, and f denotes a user of the website.
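As a rough illustration of the corrected calculation, the sketch below multiplies $P(t \mid x)$, $P(f \mid x)$ and $P(x)$ for each recessive value and normalizes so that the results sum to 1. It assumes the conditional independence of t and f given x discussed in the text; the function name `corrected_posterior` and all numbers are hypothetical, and the equation images themselves are not reproduced here.

```python
def corrected_posterior(p_x, p_t_given_x, p_f_given_x):
    """Return P(x | t ∩ f) for one observed dominant feature value t.

    All three arguments are dicts keyed by the recessive value x; for
    several recessive features, x can be a tuple (x_1, ..., x_L).
    """
    weights = {x: p_t_given_x[x] * p_f_given_x[x] * p_x[x] for x in p_x}
    z = sum(weights.values())          # normalizing constant
    if z == 0:
        raise ValueError("observed feature has zero probability under every x")
    return {x: w / z for x, w in weights.items()}


# Illustrative numbers only.
p_x = {"18-29": 0.5, "65+": 0.5}
p_t_given_x = {"18-29": 0.75, "65+": 0.25}
p_f_given_x = {"18-29": 0.83, "65+": 0.40}

posterior = corrected_posterior(p_x, p_t_given_x, p_f_given_x)
# Setting P(f | x) = 1 for every x recovers the uncorrected Bayesian estimate.
uncorrected = corrected_posterior(p_x, p_t_given_x, {x: 1.0 for x in p_x})
print(posterior)
print(uncorrected)
```

With these inputs the corrected probability of the "65+" group is lower than the uncorrected one, because that group is less likely to use the website.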
It should be noted that when the sample space is corrected so that the sample-space composition of TA is the same as that of A, and the sample-space composition of TB is the same as that of B, the probability independence condition is necessarily satisfied. Conversely, when the probability independence condition is satisfied, the sample-space composition of TA is the same as that of A, and that of TB is the same as that of B. Therefore, before the corrected Bayesian equation is used, it can first be verified whether the probability independence condition is satisfied. The method then further includes:

Determining whether, under any recessive feature of the user, the probability independence condition between the user using the website and the user having the dominant feature holds. The determination specifically includes:

Calculating the value of $P_1$ for any user according to the feature information of the entire population, the users of the website and the users' dominant features, where $P_1$ is calculated as follows:

$$P_1 = P(t \cap f \mid x_1 \cap \dots \cap x_L)$$

Calculating the value of $P_2$ for any user according to the feature information of the entire population, where $P_2$ is calculated as follows:

$$P_2 = P(t \mid x_1 \cap \dots \cap x_L)\,P(f \mid x_1 \cap \dots \cap x_L)$$

If $P_1$ and $P_2$ are equal for every user, the probability independence condition holds.

Here L is an integer greater than or equal to 1 (L = 1 corresponds to a single recessive feature), x is a recessive feature of the user, t is a dominant feature of the user, and f denotes a user of the website.
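The check described above can be sketched as follows, reading $P_1$ as the joint probability $P(t \cap f \mid x)$ and $P_2$ as the product $P(t \mid x)\,P(f \mid x)$. The text asks for exact equality; the tolerance used here is an assumption made because estimated probabilities are floating-point values, and the function name `independence_holds` is hypothetical.

```python
import math


def independence_holds(p_tf_given_x, p_t_given_x, p_f_given_x, rel_tol=1e-9):
    """Return True if P1 equals P2 (within tolerance) for every recessive value x.

    P1 = P(t ∩ f | x) is read from `p_tf_given_x`;
    P2 = P(t | x) * P(f | x) is formed from the other two dicts.
    """
    for x, p1 in p_tf_given_x.items():
        p2 = p_t_given_x[x] * p_f_given_x[x]
        if not math.isclose(p1, p2, rel_tol=rel_tol, abs_tol=1e-12):
            return False
    return True


# Illustrative numbers only.
p_t_given_x = {"18-29": 0.75, "65+": 0.25}
p_f_given_x = {"18-29": 0.8, "65+": 0.4}
joint_independent = {"18-29": 0.6, "65+": 0.1}   # equals the products
joint_dependent = {"18-29": 0.6, "65+": 0.2}     # "65+" deviates from 0.1

print(independence_holds(joint_independent, p_t_given_x, p_f_given_x))
print(independence_holds(joint_dependent, p_t_given_x, p_f_given_x))
```

In practice the tolerance would have to be chosen with the sampling noise of the estimates in mind; the default above only absorbs rounding error.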
Further, once the user's dominant and recessive features have been obtained, the user's behavior on the website can be analyzed on the basis of those features, so that an advertising strategy can be formulated according to the user's habits, or suitable value-added services can be pushed to the user, and so on. With both dominant and recessive features available, the user's habits can be determined more accurately, which makes the resulting advertising strategy or pushed value-added services more appropriate and raises the success rate.
The present invention corrects the sample-space bias that arises in recessive-feature estimation, so that the estimate is closer to the correct theoretical value. The stronger the sample-space bias, the greater the need for this correction. The user bases of many currently popular websites are strongly biased. For example, data from 2012 show that 83% of people aged 18-29 used the foreign social media website Facebook, whereas only 40% of people over 65 did. Without the correction, relative to the probability for the 18-29 group, the ordinary maximum-probability method and the Bayesian algorithm would inflate the probability that a given user is over 65 to more than twice its correct value. This would significantly affect all subsequent calculations and analyses built on it and could seriously distort the final result.
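The factor of two quoted above can be checked with a short calculation. This is a sketch under the assumption that the uncorrected estimate is $P(x \mid t) \propto P(t \mid x)\,P(x)$ and the corrected one is $P(x \mid t \cap f) \propto P(t \mid x)\,P(f \mid x)\,P(x)$; the labels $x_{65+}$ and $x_{18\text{-}29}$ are introduced here for the two age groups. Dividing the uncorrected odds by the corrected odds cancels every term except the usage rates:

```latex
\frac{P(x_{65+} \mid t) \,/\, P(x_{18\text{-}29} \mid t)}
     {P(x_{65+} \mid t \cap f) \,/\, P(x_{18\text{-}29} \mid t \cap f)}
= \frac{P(f \mid x_{18\text{-}29})}{P(f \mid x_{65+})}
= \frac{0.83}{0.40}
\approx 2.08
```

So, relative to the 18-29 group, the uncorrected estimate overstates the over-65 probability by a factor of about 2.08, consistent with "more than twice".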
In embodiments of the present invention, the data of the website's users is included when the user's recessive features are calculated. As a result, the probability of a recessive feature among those website users who have the dominant feature is computed with the website's user population as the sample space rather than the national population data. The sample-space discrepancy, and the calculation error it causes, is thereby removed and the result is corrected.
The present invention also provides a first embodiment of an apparatus for estimating the recessive feature distribution of users. As shown in FIG. 5, the apparatus includes a first obtaining module 301, a second obtaining module 302 and a calculation module 304.

The first obtaining module 301 obtains the users of the website and the users' dominant features. The second obtaining module 302 obtains feature information of the entire population from the national population database, where the feature information includes dominant features and recessive features.

The calculation module 304 is configured to calculate the users' recessive feature distribution using the Bayesian algorithm, according to the feature information of the entire population, the users of the website and the users' dominant features. Specifically, the calculation module 304 applies the Bayesian algorithm when, under any recessive feature of the user, the probability independence condition between the user using the website and the user having the dominant feature holds; in that case it calculates the user's recessive features according to the following formula:

Figure imgf000012_0001
where L is an integer greater than or equal to 1, x is a recessive feature of the user, t is a dominant feature of the user, and f denotes a user of the website. For the derivation of this formula, see the method embodiment for estimating the recessive feature distribution of users; it is not repeated here.

The apparatus may further include a determination module 303 and an analysis module 305. The determination module 303 is configured to calculate the value of $P_1$ for any user according to the feature information of all users, the users of the website and the users' dominant features, where $P_1$ is calculated as follows:

$$P_1 = P(t \cap f \mid x_1 \cap \dots \cap x_L)$$

and

to calculate the value of $P_2$ for any user according to the feature information of all users, the users of the website and the users' dominant features, where $P_2$ is calculated as follows:

$$P_2 = P(t \mid x_1 \cap \dots \cap x_L)\,P(f \mid x_1 \cap \dots \cap x_L)$$

and

to determine whether $P_1$ and $P_2$ are equal for any user; if they are equal, the probability independence condition holds.

The analysis module 305 analyzes the user's behavior according to the user's dominant and recessive features, so that an advertising strategy can be formulated according to the user's habits, or suitable value-added services can be pushed to the user, and so on. With both dominant and recessive features available, the user's habits can be determined more accurately, which makes the resulting advertising strategy or pushed value-added services more appropriate and raises the success rate.
In embodiments of the present invention, the calculation module 304 includes the data of the website's users when calculating the user's recessive features. As a result, the probability of a recessive feature among those website users who have the dominant feature is computed with the website's user population as the sample space rather than the national population data. The sample-space discrepancy, and the calculation error it causes, is thereby removed and the result is corrected.
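One way to picture how modules 301 to 304 fit together is the Python sketch below. It is only an illustration under stated assumptions: the class name `RecessiveFeatureEstimator`, the dictionary keys and the idea that each obtaining module is a callable returning precomputed probability tables are all hypothetical and not taken from the patent.

```python
import math
from typing import Callable, Dict, Hashable

Dist = Dict[Hashable, float]


class RecessiveFeatureEstimator:
    """Hypothetical wiring of the obtaining, determination and calculation modules."""

    def __init__(self,
                 site_stats: Callable[[], Dict[str, Dist]],
                 population_stats: Callable[[], Dict[str, Dist]]):
        self.site_stats = site_stats              # stands in for module 301
        self.population_stats = population_stats  # stands in for module 302

    @staticmethod
    def independent(site, pop, rel_tol=1e-6):     # stands in for module 303
        # Compare P(t ∩ f | x) with P(t | x) * P(f | x) for every x.
        return all(
            math.isclose(site["p_tf_given_x"][x],
                         pop["p_t_given_x"][x] * site["p_f_given_x"][x],
                         rel_tol=rel_tol)
            for x in pop["p_x"]
        )

    def estimate(self) -> Dist:                   # stands in for module 304
        site, pop = self.site_stats(), self.population_stats()
        if not self.independent(site, pop):
            raise ValueError("probability independence condition does not hold")
        weights = {x: pop["p_t_given_x"][x] * site["p_f_given_x"][x] * pop["p_x"][x]
                   for x in pop["p_x"]}
        z = sum(weights.values())
        if z == 0:
            raise ValueError("observed feature has zero probability under every x")
        return {x: w / z for x, w in weights.items()}


# Illustrative tables only.
site = {"p_f_given_x": {"18-29": 0.8, "65+": 0.4},
        "p_tf_given_x": {"18-29": 0.6, "65+": 0.1}}
pop = {"p_x": {"18-29": 0.5, "65+": 0.5},
       "p_t_given_x": {"18-29": 0.75, "65+": 0.25}}

estimator = RecessiveFeatureEstimator(lambda: site, lambda: pop)
distribution = estimator.estimate()
print(distribution)
```

Passing the two data sources in as callables keeps the calculation step independent of where the website statistics and population data actually come from.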
The present invention also provides a second embodiment of an apparatus for estimating the recessive feature distribution of users. As shown in FIG. 6, the apparatus includes a processor 401, a memory 402 and a bus 403. The processor 401 and the memory 402 are both connected to the bus 403.

The processor 401 is configured to obtain the users of the website and the users' dominant features, to obtain feature information of the entire population from a population database, where the feature information includes dominant features and recessive features, and to calculate the users' recessive feature distribution using the Bayesian algorithm, according to the feature information of the entire population, the users of the website and the users' dominant features.

Further, the step in which the processor 401 calculates the users' recessive feature distribution is specifically as follows: if, under any recessive feature of the user, the probability independence condition between the user using the website and the user having the dominant feature holds, the user's recessive features are calculated according to the following formula:

Figure imgf000014_0001
where L is an integer greater than or equal to 1, x is a recessive feature of the user, t is a dominant feature of the user, and f denotes a user of the website. Determining whether, under any recessive feature of the user, the probability independence condition between the user using the website and the user having the dominant feature holds specifically includes:

Calculating the value of $P_1$ for any user according to the feature information of the entire population, the users of the website and the users' dominant features, where $P_1$ is calculated as follows:

$$P_1 = P(t \cap f \mid x_1 \cap \dots \cap x_L)$$

Calculating the value of $P_2$ for any user according to the feature information of the entire population, where $P_2$ is calculated as follows:

$$P_2 = P(t \mid x_1 \cap \dots \cap x_L)\,P(f \mid x_1 \cap \dots \cap x_L)$$

If $P_1$ and $P_2$ are equal for every user, the probability independence condition holds.

The processor 401 is further configured to analyze the user's behavior according to the user's dominant and recessive features.
It should be noted that the users of the website and the users' dominant features can be obtained from the website's back-end statistics and stored in the memory 402, from which the processor 401 retrieves them. The contents of the population database can likewise be obtained in advance by the website's back end from public sources and stored in the memory 402, to be retrieved from the memory 402 when the population database is needed; alternatively, they can be fetched from public sources when needed.

In embodiments of the present invention, the processor 401 includes the data of the website's users when calculating the user's recessive features. As a result, the probability of a recessive feature among those website users who have the dominant feature is computed with the website's user population as the sample space rather than the national population data. The sample-space discrepancy, and the calculation error it causes, is thereby removed and the result is corrected.

The above are merely embodiments of the present invention and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims

1. A method for estimating the recessive feature distribution of users, characterized in that the method comprises:
obtaining users of a website and the users' dominant features;
obtaining feature information of the entire population from a population database, wherein the feature information includes dominant features and recessive features;
calculating the users' recessive feature distribution using a Bayesian algorithm, according to the feature information of the entire population, the users of the website and the users' dominant features.
2. The method according to claim 1, characterized in that
the step of calculating the users' recessive feature distribution using a Bayesian algorithm, according to the feature information of the entire population, the users of the website and the users' dominant features, is specifically:
if, under any recessive feature of the user, the probability independence condition between the user using the website and the user having the dominant feature holds, calculating the user's recessive features according to the following formula:

Figure imgf000016_0001

where L is an integer greater than or equal to 1, x is a recessive feature of the user, t is a dominant feature of the user, and f denotes a user of the website.
3. The method according to claim 2, characterized by further comprising determining whether, under any recessive feature of the user, the probability independence condition between the user using the website and the user having the dominant feature holds, wherein the determining specifically includes:
calculating the value of $P_1$ for any user according to the feature information of the entire population, the users of the website and the users' dominant features, where $P_1$ is calculated as follows:

$$P_1 = P(t \cap f \mid x_1 \cap \dots \cap x_L)$$

calculating the value of $P_2$ for any user according to the feature information of the entire population, where $P_2$ is calculated as follows:

$$P_2 = P(t \mid x_1 \cap \dots \cap x_L)\,P(f \mid x_1 \cap \dots \cap x_L)$$

if $P_1$ and $P_2$ are equal for every user, the probability independence condition holds.
4. The method according to any one of claims 1 to 3, characterized in that the method further comprises:
analyzing the user's behavior according to the user's dominant and recessive features.
5. An apparatus for estimating the recessive feature distribution of users, characterized by comprising:
a first obtaining module, configured to obtain users of a website and the users' dominant features;
a second obtaining module, configured to obtain feature information of the entire population from a national population database, wherein the feature information includes dominant features and recessive features;
a calculation module, configured to calculate the users' recessive feature distribution using a Bayesian algorithm, according to the feature information of the entire population, the users of the website and the users' dominant features.
6. The apparatus according to claim 5, characterized in that the calculation module is specifically configured to:
if, under any recessive feature of the user, the probability independence condition between the user using the website and the user having the dominant feature holds, calculate the user's recessive features according to the following formula:

Figure imgf000017_0001

where L is an integer greater than or equal to 1, x is a recessive feature of the user, t is a dominant feature of the user, and f denotes a user of the website.
7. The apparatus according to claim 6, characterized in that the apparatus further comprises a determination module;
the determination module is configured to calculate the value of $P_1$ for any user according to the feature information of all users, the users of the website and the users' dominant features, where $P_1$ is calculated as follows:

$$P_1 = P(t \cap f \mid x_1 \cap \dots \cap x_L)$$

and

to calculate the value of $P_2$ for any user according to the feature information of all users, the users of the website and the users' dominant features, where $P_2$ is calculated as follows:

$$P_2 = P(t \mid x_1 \cap \dots \cap x_L)\,P(f \mid x_1 \cap \dots \cap x_L)$$

and

to determine whether $P_1$ and $P_2$ are equal for any user; if they are equal, the probability independence condition holds.
8. The apparatus according to any one of claims 5 to 7, characterized in that the apparatus further comprises an analysis module;
the analysis module is configured to analyze the user's behavior according to the user's dominant and recessive features.
9. An apparatus for estimating the recessive feature distribution of users, characterized in that the apparatus comprises a processor;
the processor is configured to obtain users of a website and the users' dominant features; to obtain feature information of the entire population from a population database, wherein the feature information includes dominant features and recessive features; and to calculate the users' recessive feature distribution using a Bayesian algorithm, according to the feature information of the entire population, the users of the website and the users' dominant features.
10. The apparatus according to claim 9, characterized in that
the step in which the processor calculates the users' recessive feature distribution using a Bayesian algorithm, according to the feature information of the entire population, the users of the website and the users' dominant features, is specifically:
the processor is configured to, if under any recessive feature of the user the probability independence condition between the user using the website and the user having the dominant feature holds, calculate the user's recessive features according to the following formula:

Figure imgf000018_0001

where L is an integer greater than or equal to 1, x is a recessive feature of the user, t is a dominant feature of the user, and f denotes a user of the website.
11. The apparatus according to claim 10, characterized in that
the processor is further configured to determine whether, under any recessive feature of the user, the probability independence condition between the user using the website and the user having the dominant feature holds, wherein the determining specifically includes:
calculating the value of $P_1$ for any user according to the feature information of the entire population, the users of the website and the users' dominant features, where $P_1$ is calculated as follows:

$$P_1 = P(t \cap f \mid x_1 \cap \dots \cap x_L)$$

calculating the value of $P_2$ for any user according to the feature information of the entire population, where $P_2$ is calculated as follows:

$$P_2 = P(t \mid x_1 \cap \dots \cap x_L)\,P(f \mid x_1 \cap \dots \cap x_L)$$

if $P_1$ and $P_2$ are equal for every user, the probability independence condition holds.
12. The apparatus according to any one of claims 9 to 11, characterized in that
the processor is further configured to analyze the user's behavior according to the user's dominant and recessive features.
PCT/CN2014/079258 2014-06-05 2014-06-05 Method and apparatus for estimating recessive character distribution of users WO2015184619A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201480000467.4A CN104205100B (en) 2014-06-05 2014-06-05 A kind of method and device for the recessive character distribution for estimating user
PCT/CN2014/079258 WO2015184619A1 (en) 2014-06-05 2014-06-05 Method and apparatus for estimating recessive character distribution of users

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2014/079258 WO2015184619A1 (en) 2014-06-05 2014-06-05 Method and apparatus for estimating recessive character distribution of users

Publications (1)

Publication Number Publication Date
WO2015184619A1 true WO2015184619A1 (en) 2015-12-10

Family

ID=52088179

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/079258 WO2015184619A1 (en) 2014-06-05 2014-06-05 Method and apparatus for estimating recessive character distribution of users

Country Status (2)

Country Link
CN (1) CN104205100B (en)
WO (1) WO2015184619A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108765362B (en) * 2017-04-20 2023-04-11 优信数享(北京)信息技术有限公司 Vehicle detection method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060010029A1 (en) * 2004-04-29 2006-01-12 Gross John N System & method for online advertising
CN103164470A (en) * 2011-12-15 2013-06-19 盛大计算机(上海)有限公司 Directional application method based on user gender distinguished results and system thereof
CN103744917A (en) * 2013-12-27 2014-04-23 东软集团股份有限公司 Mixed recommendation method and system
CN103778555A (en) * 2014-01-21 2014-05-07 北京集奥聚合科技有限公司 User attribute mining method and system based on user tags

Also Published As

Publication number Publication date
CN104205100B (en) 2018-02-02
CN104205100A (en) 2014-12-10

Similar Documents

Publication Publication Date Title
Bärnighausen et al. Correcting HIV prevalence estimates for survey nonparticipation using Heckman-type selection models
US20210192565A1 (en) Methods and apparatus to correct misattributions of media impressions
US20220167037A1 (en) Methods and apparatus to correct for deterioration of a demographic model to associate demographic information with media impression information
AU2013204865B2 (en) Methods and apparatus to share online media impressions data
CA2810264C (en) Methods and apparatus to determine media impressions
US8768768B1 (en) Visitor profile modeling
US20130151311A1 (en) Prediction of consumer behavior data sets using panel data
US20230083206A1 (en) Methods and apparatus to generate audience metrics using third-party privacy-protected cloud environments
US20140129321A1 (en) Combination of Social Networking Data with Other Data Sets for Estimation of Viewership Statistics
US10825030B2 (en) Methods and apparatus to determine weights for panelists in large scale problems
Chirinda et al. Comparative study of disability‐free life expectancy across six low‐and middle‐income countries
US9866454B2 (en) Generating anonymous data from web data
US20220198493A1 (en) Methods and apparatus to reduce computer-generated errors in computer-generated audience measurement data
US9420442B2 (en) Ping compensation factor for location updates
Gimbrone et al. Associations between COVID-19 mobility restrictions and economic, mental health, and suicide-related concerns in the US using cellular phone GPS and Google search volume data
US20160034915A1 (en) Document performance indicators based on referral context
McGovern et al. Using interviewer random effects to remove selection bias from HIV prevalence estimates
WO2015184619A1 (en) Method and apparatus for estimating recessive character distribution of users
US20140297404A1 (en) Obtaining Metrics for Online Advertising Using Multiple Sources of User Data
CN112435070A (en) Method, device and equipment for determining user age and storage medium
US20200202370A1 (en) Methods and apparatus to estimate misattribution of media impressions
US10349102B2 (en) Distributing embedded content within videos hosted by an online system
US9600825B2 (en) Estimating probability of spreading information by users on micro-weblogs
Guan et al. Semiparametric maximum likelihood inference for nonignorable nonresponse with callbacks
CN115668234A (en) Efficient privacy enhancement of servers in federated learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14894051

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14894051

Country of ref document: EP

Kind code of ref document: A1