Summary: On Sampling, Anonymization, and Differential Privacy: Or, k-Anonymization Meets Differential Privacy

Flexudy Education
3 min read · Jul 6, 2020

Authors: Ninghui Li, Wahbeh Qardaji, Dong Su. arXiv:1101.2604v2, June 2011.

The summary below was automatically created by Flexudy. Feel free to download the app from the Google Play Store or the Apple App Store.

Have fun reading!

The main result of the paper is that k-anonymization, when done "safely" and when preceded by a random sampling step, satisfies (ε, δ)-differential privacy with reasonable parameters. The setting is one where a trusted curator obtains a dataset by gathering private information from a large number of respondents and then wants to make use of the dataset, for instance by learning and releasing statistical facts about the underlying population, while protecting the privacy of the respondents.

Many k-anonymization methods have been developed over the last decades, and the notion has also been applied extensively to other problems such as location privacy [14]. k-anonymity requires that, when only certain attributes known as quasi-identifiers (QIDs) are considered, each tuple in a k-anonymized dataset appears at least k times. The authors define classes of k-anonymization algorithms that are "strongly-safe" and "ε-safe", which avoid the privacy vulnerabilities of existing k-anonymization algorithms.

The notion of differential privacy was introduced by Dwork et al.: an algorithm A satisfies ε-DP if, for any two neighboring datasets D and D′, the distributions of A(D) and A(D′) differ by at most a multiplicative factor of e^ε. A relaxed version, denoted (ε, δ)-DP, additionally allows an error probability bounded by δ: for every set S of possible outputs, Pr[A(D) ∈ S] ≤ e^ε · Pr[A(D′) ∈ S] + δ. Satisfying differential privacy ensures that even an adversary who has full knowledge of the values of a tuple t, who knows exactly what other tuples are in the dataset, and who is uncertain only about whether t is in the input dataset, cannot tell whether t is in the dataset beyond a certain confidence level. In most data publishing scenarios, however, the adversary is unlikely to have precise information about all other tuples in the dataset.

An algorithm is said to satisfy differential privacy under sampling if the algorithm, preceded by a random sampling step, satisfies differential privacy. Sometimes this sampling step is implicit: because the respondents are randomly selected, the dataset can be viewed as the result of sampling. Let A_β denote the algorithm that first samples with probability β (i.e., includes each tuple of the input dataset independently with probability β) and then applies A to the sampled dataset; A satisfies (β, ε, δ)-differential privacy under sampling if and only if β > δ and A_β gives (ε, δ)-DP. The paper proves that a safe k-anonymization algorithm, when preceded by such a random sampling step, provides (ε, δ)-differential privacy with reasonable parameters.

In contrast, the authors show that k-anonymity by itself (even when all attributes are treated as QIDs) does not provide adequate protection, and neither do existing k-anonymization algorithms. The notion of k-anonymity implicitly assumes a one-to-one relation g between input tuples and output tuples: given input D, the output dataset is {g(t) | t ∈ D}. If any individual's tuple is published, there must exist at least k − 1 other tuples in the input database that are the same under the recoding scheme; furthermore, the recoding scheme does not depend on the dataset, and one sees only the results of the recoding. If two neighboring datasets D and D′ differ in a tuple t, then A(D) and A(D′) contain different numbers of copies of g(t), so such an algorithm alone cannot satisfy differential privacy; random sampling is what introduces the needed uncertainty.

Some notation: f(j; n, β) denotes the probability mass function of the binomial distribution, i.e., the probability of getting exactly j successes in n trials where each trial succeeds with probability β, and F(j; n, β) denotes the cumulative distribution function, F(j; n, β) = Σ_{i=0}^{j} f(i; n, β).
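To make the sample-then-anonymize pipeline A_β and the binomial notation concrete, here is a minimal Python sketch. It is an illustration under assumptions, not the paper's implementation: the names A_beta, f, and F are ours, and the anonymization step is a simple suppression-based one; the property that matters is that the recoding scheme g is fixed in advance and does not depend on the data, which is what "strongly-safe" requires.

```python
import random
from collections import Counter
from math import comb

def f(j, n, beta):
    """Binomial pmf f(j; n, beta): probability of exactly j successes
    in n independent trials, each succeeding with probability beta."""
    return comb(n, j) * beta**j * (1 - beta)**(n - j)

def F(j, n, beta):
    """Binomial cdf: F(j; n, beta) = sum of f(i; n, beta) for i = 0..j."""
    return sum(f(i, n, beta) for i in range(j + 1))

def A_beta(dataset, g, k, beta):
    """Sample each tuple independently with probability beta, then apply
    a strongly-safe k-anonymization: recode each sampled tuple with the
    data-independent scheme g and suppress any recoded value occurring
    fewer than k times (an illustrative choice, not the paper's algorithm)."""
    sample = [t for t in dataset if random.random() < beta]
    counts = Counter(g(t) for t in sample)
    return [gt for gt, c in counts.items() if c >= k for _ in range(c)]
```

For instance, g might map each record to generalized quasi-identifier values (an age bucket and a ZIP-code prefix, say); the paper's result then says that, for suitable k and β, this sample-then-anonymize pipeline satisfies (ε, δ)-DP for corresponding ε and δ.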
The δ in this guarantee is obtained by maximizing a binomial expression over the group size n. Unfortunately, due to the discrete nature of the binomial distribution, the maximum value may not be reached at n_m, but instead at one of the next few local maxima.

In the literature, k-anonymization and differential privacy have been viewed as very different privacy guarantees: k-anonymization achieves only weak, syntactic privacy, while differential privacy provides strong, semantic privacy guarantees. The paper takes the approach of starting from both k-anonymization and differential privacy and trying to meet in the middle. Throughout, Λ_β denotes the process of binomially sampling the dataset D with probability β.
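Numerically, that search can be done by scanning a window of candidate values of n, reusing the binomial cdf F from the sketch above. The helper below is hypothetical: d stands in for the paper's actual expression for δ as a function of n, which this summary does not reproduce.

```python
def max_over_window(d, n_m, window=100):
    """Scan n = n_m, ..., n_m + window - 1 for the maximum of d(n).
    Because the binomial distribution is discrete, d(n) need not peak
    exactly at n_m; the maximum may sit at one of the next few local
    maxima past it.  The fixed window is a heuristic cut-off: in
    practice one should check that d keeps decreasing beyond it."""
    best_n = max(range(n_m, n_m + window), key=d)
    return best_n, d(best_n)

# Hypothetical usage: some_delta_expression would encode the paper's
# bound, e.g. built from the binomial cdf F defined earlier.
# n_star, delta = max_over_window(lambda n: some_delta_expression(n), n_m)
```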

Did you enjoy reading? Follow us on Medium and give us feedback to help us improve our Summarizer.

