Privacy Mechanisms

This is the second post of my data privacy series. You may also want to check out this post for an introduction to data privacy and differential privacy. 🙂

For convenience, let’s repeat the definition of differential privacy from the previous post.

Definition: \epsilon-Differential Privacy

The query K(\cdot) is \epsilon-differentially private if and only if for any subset S of the range of K(\cdot),

(1)   \begin{align*} Pr(K(D_1) \in S) \le e^\epsilon Pr(K(D_2) \in S), \end{align*}

where D_1 and D_2 are any two data sets that differ by at most one record, with one data set being a subset of the other.

In the literature, a “private” way to disclose the statistics of a data set is commonly referred to as a privacy mechanism. To ensure differential privacy, two common mechanisms are considered: the Laplace mechanism and the exponential mechanism. But before discussing them, let’s define the sensitivity of a function applied to a data set as follows.

Definition: Sensitivity of a function

The sensitivity of a function on a data set, \Delta f, is defined by

(2)   \begin{align*} \Delta f = \max_{D_1,D_2} ||f(D_1) - f(D_2)||, \end{align*}

where D_1 and D_2 are any two data sets that differ by at most one record, with one data set being a subset of the other.

As one can see from the definition, sensitivity measures, in the worst case, how much a single record can change the value of a query. Intuitively, a malicious user has a better chance of inferring private information from a more sensitive query than from a less sensitive one. Therefore, more protection may be needed for a more sensitive query.
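As a small illustration (the query and data here are hypothetical), consider a counting query: adding or removing a single record changes the count by at most one, so its sensitivity is \Delta f = 1.

```python
def count_over_threshold(data, threshold):
    """A counting query: how many records exceed the threshold?"""
    return sum(1 for x in data if x > threshold)

# Adding one record changes any count by at most 1,
# so the sensitivity of a counting query is Delta_f = 1.
d1 = [30, 45, 52, 61]
d2 = d1 + [70]  # d2 differs from d1 by exactly one record
change = abs(count_over_threshold(d2, 50) - count_over_threshold(d1, 50))
print(change)  # -> 1
```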

Laplace Mechanism

Instead of returning the true computation f(D), the Laplace mechanism returns the query result plus Laplacian noise with standard deviation \sqrt{2} \Delta f / \epsilon (denoted as Lap(\Delta f/\epsilon)). In other words, it returns K(D) = f(D) + Lap(\Delta f/\epsilon).
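A minimal sketch of the Laplace mechanism in Python (the function name and parameters are my own; numpy’s Laplace sampler takes the scale b = \Delta f / \epsilon directly, and a Laplace distribution with scale b has standard deviation \sqrt{2} b):

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Return f(D) plus Lap(sensitivity / epsilon) noise."""
    rng = np.random.default_rng() if rng is None else rng
    scale = sensitivity / epsilon  # b = Delta_f / epsilon; std dev is sqrt(2) * b
    return true_value + rng.laplace(loc=0.0, scale=scale)

# Example: privatize a counting query (sensitivity 1) with epsilon = 0.5.
noisy_count = laplace_mechanism(true_value=42, sensitivity=1.0, epsilon=0.5)
```

Note that a smaller \epsilon (stronger privacy) or a larger sensitivity both increase the noise scale, matching the intuition above.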

It is not difficult to show that (1) (and hence differential privacy) is satisfied. Note that (1) holds for all subsets S if it holds for every singleton set \{ r \} (interpreting the probabilities as densities, since the Laplacian noise is continuous). Therefore we can restrict ourselves to the latter case. For any r, note that

(3)   \begin{align*} \frac{Pr(K(D_1) \in \{ r \})} {Pr(K(D_2) \in \{ r \})} &= \frac{\exp \left(-|f(D_1) -r | \frac{\epsilon}{\Delta f}\right) }{\exp\left(-|f(D_2) -r | \frac{\epsilon} {\Delta f}\right) }\\ &= \exp \left((|f(D_2) -r |-|f(D_1) -r |) \frac{\epsilon}{\Delta f}\right) \\ & \overset{(a)}{\le} \exp\left(|f(D_2) - f(D_1)| \frac{\epsilon}{\Delta f} \right) \\ & \overset{(b)}{\le} \exp \epsilon, \end{align*}

where (a) is due to the triangle inequality (|a| + |b| \ge |a + b| \Rightarrow |d| + |c - d| \ge |c| \Rightarrow |c - d| \ge |c| - |d|) together with the exponential function being monotonically increasing, and (b) follows from the definition of sensitivity.

Now, if the system allows more queries, intuitively more protection (and hence more distortion of the query results) will be needed, since there is a larger chance for a malicious user to identify a record owner by analyzing the additional information.

Consider the case when we have two query functions f_1(\cdot) and f_2(\cdot). To ensure that a malicious user can’t take advantage of the joint statistics of the queries to identify a record, we should modify the differential privacy definition in (1) to

(4)   \begin{align*} Pr((K_1(D_1), K_2(D_1)) \in S) \le e^\epsilon Pr((K_1(D_2),K_2(D_2)) \in S), \end{align*}

where K_1(\cdot) and K_2(\cdot) are the modified query output functions for f_1(\cdot) and f_2(\cdot), respectively. Again, D_1 and D_2 are any two data sets that differ by at most one record, with one data set being a subset of the other.

As in the single query case, we can construct K_1(\cdot) and K_2(\cdot) by simply adding Laplacian noise to the true results. However, as mentioned earlier, we will require larger noise. It turns out that Laplacian noise Lap(\frac{\Delta f_1 + \Delta f_2}{\epsilon}) is sufficient. In other words, we may choose K_1(D) = f_1(D) + Lap(\frac{\Delta f_1 + \Delta f_2}{\epsilon}) and K_2(D) = f_2(D) + Lap(\frac{\Delta f_1 + \Delta f_2}{\epsilon}).
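The two-query construction can be sketched as follows (the function name and example queries are my own): both answers are perturbed with noise scaled to the combined sensitivity, so a single \epsilon budget covers the joint release.

```python
import numpy as np

def two_query_laplace(f1_value, f2_value, delta_f1, delta_f2, epsilon, rng=None):
    """Answer two queries jointly: each gets Lap((delta_f1 + delta_f2) / epsilon) noise."""
    rng = np.random.default_rng() if rng is None else rng
    scale = (delta_f1 + delta_f2) / epsilon  # combined sensitivity sets the noise scale
    return (f1_value + rng.laplace(scale=scale),
            f2_value + rng.laplace(scale=scale))

# Example: a count (sensitivity 1) and a sum of values capped at 100
# (sensitivity 100), released together under a single epsilon = 1 budget.
noisy_count, noisy_sum = two_query_laplace(42, 3100.0, 1.0, 100.0, 1.0)
```

Note how the higher-sensitivity query inflates the noise added to both answers; later posts on composition discuss finer-grained ways to split the budget.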

Similar to the single query case, to show that (4) is true for all S, it is sufficient to show that it is true for any singleton set \{ (r_1, r_2) \}. Then,

(5)   \begin{align*} & \frac{Pr((K_1(D_1),K_2(D_1)) \in \{ (r_1,r_2) \})} {Pr((K_1(D_2),K_2(D_2)) \in \{ (r_1,r_2) \})} \\ = & \frac{\exp \left(-[|f_1(D_1) -r_1 | +|f_2(D_1) -r_2 |] \frac{\epsilon}{\Delta f_1 + \Delta f_2}\right) }{\exp\left(-[|f_1(D_2) -r_1 |+|f_2(D_2) -r_2 |] \frac{\epsilon} {\Delta f_1 + \Delta f_2}\right) } \\ \overset{(a)}{\le} & \exp \left([|f_2(D_2) - f_2(D_1)| + |f_1(D_2) - f_1(D_1)|] \frac{\epsilon} {\Delta f_1 + \Delta f_2}\right) \overset{(b)}{\le} \exp \epsilon, \end{align*}

where (a) is due to triangle inequality and (b) is again from the definition of sensitivity.

Exponential Mechanism

When an attribute is categorical, we cannot simply add noise to it, so the Laplace mechanism does not apply in that case. To get around this, we introduce the exponential mechanism, which essentially adds noise to the distribution of the attribute rather than to the attribute itself.

Recall that the core idea of a privacy mechanism is to introduce randomness into the query outcome. Therefore, even for a categorical query, the output should not be deterministic. For a data set D and a particular query, we may specify a utility function u(D,x) to control the preference of the output. Essentially, the larger u(D,x) is, the more likely x will be the output.

Definition: Exponential Mechanism

Given a utility function u(D,x), we output x with probability proportional to \exp\left(\epsilon \frac{u(D,x)}{\Delta u}\right), where \Delta u is the sensitivity of u(\cdot,\cdot) given by

(6)   \begin{align*} \Delta u = \max_{D_1, D_2,y} ||u(D_1,y) - u(D_2,y)|| \end{align*}

and D_1 and D_2 are any two adjacent data sets.

Note that this definition of sensitivity differs from the previous one: it measures the worst-case change of the utility over adjacent data sets, with the candidate output y held fixed.

Exponential Mechanism is differentially private

The exponential mechanism described above is differentially private. Intuitively, one cannot gather information of an individual record out of the distribution of the observed outcomes.

Let K(D) denote the output of the exponential mechanism. The probability of outputting y for data set D can then be written as

(7)   \begin{align*} Pr(K(D)=y)=\frac{ \exp\left( \epsilon \frac{u(D, y)}{\Delta u} \right)} { \sum_{y'} \exp\left( \epsilon \frac{u(D, y')}{\Delta u} \right)}. \end{align*}

Now shift from D to an adjacent data set D'. By the definition of \Delta u, the numerator of (7) changes by a factor of at most \exp(\epsilon), and each term in the denominator (and hence the denominator as a whole) also changes by a factor of at most \exp(\epsilon). Overall, the ratio Pr(K(D)=y)/Pr(K(D')=y) is bounded by \exp(2\epsilon), so the mechanism is 2\epsilon-differentially private.
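A sketch of the exponential mechanism as defined above (the function and the color example are my own illustration): each candidate is scored by the utility, and one is sampled with probability proportional to \exp(\epsilon u / \Delta u).

```python
import numpy as np

def exponential_mechanism(data, candidates, utility, delta_u, epsilon, rng=None):
    """Sample x from candidates with probability proportional to
    exp(epsilon * u(D, x) / delta_u)."""
    rng = np.random.default_rng() if rng is None else rng
    scores = np.array([utility(data, x) for x in candidates], dtype=float)
    # Shifting scores by their max leaves the normalized probabilities
    # unchanged but avoids overflow in exp().
    weights = np.exp(epsilon * (scores - scores.max()) / delta_u)
    probs = weights / weights.sum()
    return candidates[rng.choice(len(candidates), p=probs)]

# Example: privately release the most common color. The utility is the
# count of each candidate, which has sensitivity delta_u = 1.
data = ["red"] * 5 + ["blue"] * 2
winner = exponential_mechanism(data, ["red", "blue"],
                               lambda d, x: d.count(x), delta_u=1.0, epsilon=1.0)
```

The true winner is returned with high probability, but the less common answer still occurs occasionally, which is exactly the randomness that protects individual records.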

What next?

In the next post, we will discuss the effect of composition on privacy mechanisms.
