Original article:
https://www.eff.org/deeplinks/2010/01/primer-information-theory-and-privacy
If we ask whether a single fact can identify a person, the answer is not a simple yes or no. If all we know is a person's ZIP code, or their date of birth, or their gender, we cannot tell exactly who they are.
But if we combine all three of these facts, we can very likely deduce their identity.
There is a mathematical quantity that lets us measure how close a fact comes to revealing somebody's identity. That quantity is called "entropy", and it is usually measured in bits.
Intuitively, you can think of entropy as a generalization of the number of possible outcomes of a random variable: if there are two possibilities, there is 1 bit of entropy; if there are four possibilities, there are 2 bits of entropy, and so on. Adding one more bit of entropy doubles the number of possibilities.[1]
Because there are around 7 billion people on the planet, the identity of a random, unknown person contains just under 33 bits of entropy (2 to the 33rd power is 8 billion).
When we learn a new fact about a person, that fact reduces the entropy of their identity by a certain amount. There is a formula for how much:
$$\Delta S = -\log_{2} \Pr(X = x)$$
where ΔS is the reduction in entropy, measured in bits,[2] and Pr(X = x) is the probability that the fact would be true of a random person. Let's apply the formula to a few examples, just for fun:
Starsign: ΔS = -log2 Pr(STARSIGN = Capricorn) = -log2 (1/12) = 3.58 bits of information
Birthday: ΔS = -log2 Pr(DOB = 2nd of January) = -log2 (1/365) = 8.51 bits of information
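As a quick check of the two examples above, the formula is easy to evaluate in a few lines of Python (a sketch of mine, not part of the original article; the function name surprisal is my own):

```python
import math

def surprisal(probability: float) -> float:
    """Bits of identity revealed by a fact that holds for a random
    person with the given probability: delta-S = -log2 Pr(X = x)."""
    return -math.log2(probability)

print(surprisal(1 / 12))   # starsign: ~3.58 bits
print(surprisal(1 / 365))  # birthday: ~8.51 bits
```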
Note that combining facts does not always yield new information. If you have already told me your birthday, telling me your starsign adds nothing I did not already know.[3]
In the examples above, every starsign and birthday was assumed to be equally likely.[4] The calculation also works for facts with non-uniform probabilities. For instance, the likelihood that an unknown person's ZIP code is 90210 (Beverly Hills, California) is different from the likelihood that it is 40203 (part of Louisville, Kentucky). According to 2007 data, 21,733 people lived in the 90210 area, only 452 in 40203, and the planet as a whole held about 6.6 billion people.
ZIP code 90210: ΔS = -log2 (21,733/6,625,000,000) = 18.21 bits
ZIP code 40203: ΔS = -log2 (452/6,625,000,000) = 23.81 bits
Living in Moscow: ΔS = -log2 (10,524,400/6,625,000,000) = 9.30 bits
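The same calculation is easy to reproduce; each fact's probability is simply the share of the world's population it covers. A short self-contained sketch (the population figures are the 2007 numbers quoted above; Moscow's is taken from the original article):

```python
import math

WORLD_POPULATION = 6_625_000_000  # 2007 estimate

for fact, population in [
    ("ZIP code 90210", 21_733),
    ("ZIP code 40203", 452),
    ("Living in Moscow", 10_524_400),
]:
    bits = -math.log2(population / WORLD_POPULATION)
    print(f"{fact}: {bits:.2f} bits")
```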
How much entropy is needed to identify someone?
Based on 2007 data, identifying someone out of the entire population of the planet required:

$$S = \log_{2}(6{,}625{,}000{,}000) \approx 32.6 \text{ bits}$$
To be a little more rigorous, we should round this up to 33 bits.
Returning to our example: if we know someone's birthday and that their ZIP code is 40203, we have 8.51 + 23.81 = 32.32 bits. That is almost, but perhaps not quite, enough to know who they are: a couple of people might share those characteristics. Add their gender, for 33.32 bits, and we can probably say exactly who the person is.[5]
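As a rough check of the arithmetic in this section, here is a minimal Python sketch (my illustration, not from the original article), assuming the three facts are independent so that their bits simply add, as footnote 3 explains:

```python
import math

WORLD_POPULATION = 6_625_000_000           # 2007 estimate used above

bits_needed = math.log2(WORLD_POPULATION)  # ~32.6 bits to single someone out

birthday  = -math.log2(1 / 365)                  # ~8.51 bits
zip_40203 = -math.log2(452 / WORLD_POPULATION)   # ~23.81 bits
gender    = -math.log2(1 / 2)                    #  1.00 bit

known = birthday + zip_40203 + gender            # ~33.32 bits
print(f"needed: {bits_needed:.2f} bits, known: {known:.2f} bits")
```

With about 33.32 bits known against roughly 32.6 bits needed, the person is very likely identified uniquely.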
An Application to Web Browsers
So how does this discussion apply to web browsers? It turns out that, in addition to the characteristics commonly used to "identify" a browser, such as IP addresses and tracking cookies, there are subtler differences between browsers that can be used to tell them apart.
One conspicuous example is the User-Agent string, which contains the browser's name, operating system, and precise version number, and which is sent to every web server you visit. A typical User-Agent string looks like this:
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6
As you can see, there is quite a lot of "stuff" in there. That "stuff" turns out to be very useful for telling people apart on the web. In another post, we report that, on average, a User-Agent string contains about 10.5 bits of identifying information, meaning that if you pick a random person's browser, only about one in 1,500 other Internet users will share their User-Agent string.
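To see how 10.5 bits maps onto "one in about 1,500", note that n bits of identifying information divides the set of indistinguishable users by a factor of 2^n. A one-line sketch:

```python
# 10.5 bits of identifying information shrinks the anonymity set
# by a factor of 2**10.5, i.e. roughly one in 1,500 browsers.
print(2 ** 10.5)  # ~1448.2
```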
EFF's Panopticlick project is a privacy research effort to measure how much identifying information is conveyed by other browser characteristics. Visit Panopticlick to see how identifiable your browser is, and to help us with our research.
1. Entropy is actually a generalization of counting the number of possibilities, one that accounts for the fact that some possibilities are more likely than others. You can find a pretty version of the formula in the Wikipedia article on information entropy (under "Definition").
2. This quantity is called the "self-information" or "surprisal" of the observation, because it is a measure of how "surprising" or unexpected the new piece of information is. It is really measured with respect to the random variable being observed (perhaps a person's age, or where they live), and a new, reduced entropy for their identity can be calculated in light of this observation.
3. What happens when facts are combined depends on whether the facts are independent. For instance, if you know someone's birthday and gender, you have 8.51 + 1 = 9.51 bits of information about their identity, because the probability distributions of birthday and gender are independent. But the same isn't true for birthdays and starsigns: if I know someone's birthday, then I already know their starsign, and being told their starsign doesn't increase my information at all. We want to calculate the change in the entropy of the person's identity conditional on all the observed variables, and we can do that by making the probabilities for new facts conditional on all the facts we already know. Hence we see ΔS = -log2 Pr(Gender = Female | DOB = 2nd of January) = -log2 (1/2) = 1, and ΔS = -log2 Pr(Starsign = Capricorn | DOB = 2nd of January) = -log2 (1) = 0. In-between cases are also possible: if I knew that someone was born in December and then learn that they are a Capricorn, I still gain some new bits of information, but not as much as I would have if I hadn't known their month of birth: ΔS = -log2 Pr(Starsign = Capricorn | month of birth = December) = -log2 (10/31) = 1.63 bits.
4. Actually, in the birthday example, we should have accounted for the possibility that someone was born on the 29th of February during a leap year. Accounting for leap years, an ordinary birthday has probability 1/365.25, so ΔS = -log2 (1/365.25), while the 29th of February itself has probability 1/1461, giving ΔS = -log2 (1/1461) ≈ 10.51 bits.
5. If you're paying close attention, you might have said, "Hey, that doesn't sound right; sometimes there will be only one person in ZIP code 40203 who has a given birthday, in which case you don't need gender to identify them, and it's possible (but unlikely) that ten people in 40203 were all born on the 2nd of January." The correct way to formalize these issues would be to use the real frequency distribution of birthdays in the 40203 ZIP code.