Differential privacy, as described by Google, ensures that no individual's data has a significant impact on a model's predictions.
The goal is to measure and bound how much the output changes when a single data item is included or excluded, so that analysis results remain nearly identical whether or not a specific individual's data is present.
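In the standard formulation (the usual definition, stated here rather than quoted from the sources below), a randomized mechanism M is (ε, δ)-differentially private if, for all neighboring datasets D and D′ differing in a single record and every measurable output set S:

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S] + \delta
```

Smaller ε and δ mean the two output distributions are harder to tell apart, which is exactly the "nearly identical with or without one individual" guarantee above.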
Privacy Auditing with One (1) Training Run
We propose a scheme for auditing differentially private machine learning systems with a single training run. This exploits the parallelism of being able to add or remove multiple training examples...
https://openreview.net/forum?id=f38EY21lBw
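A minimal toy sketch of the one-run idea, with a hypothetical memorizing "model" standing in for real training; the paper's exact confidence-interval test is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins (not from the paper): a "model" that simply
# memorizes its training set, and a loss that is low on memorized examples.
def train(dataset):
    return set(dataset)

def loss(model, example):
    return 0.0 if example in model else 1.0

m = 1000
canaries = [f"canary-{i}" for i in range(m)]
included = rng.random(m) < 0.5          # each canary is in/out with prob 1/2

# One training run on the randomly included canaries (a real audit would
# mix them into the actual training data).
model = train([c for c, keep in zip(canaries, included) if keep])

# Score all m canaries in parallel from that single run, then guess
# "included" for the k most confident (lowest-loss) canaries.
scores = np.array([loss(model, c) for c in canaries])
k = 100
guesses = np.argsort(scores)[:k]
correct = int(included[guesses].sum())

# Under ε-DP the expected guessing accuracy is bounded, so accuracy far
# above chance certifies a lower bound on the true ε (the paper derives
# the exact confidence intervals for this test).
print(f"{correct}/{k} guesses correct")
```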

Differential Privacy Summary
As machine learning technology advances, enormous amounts of data are being sent to servers.
This is because training models requires large amounts of data and computing resources.
Users, however, are understandably reluctant to send their personal information and data to servers.
Hackers could intercept sensitive information in transit, and, as the recent Iruda (이루다) incident showed, there is no guarantee that personal data is handled properly.
This post summarizes Differential Privacy (DP), one approach to this problem.
Of course, DP is not specific to machine learning; it is a security concept that can be applied in many other areas.
https://zzaebok.github.io/machine_learning/DP/
VaultGemma
Provides mathematically defined privacy guarantees that bound data leakage.
VaultGemma: The world's most capable differentially private LLM
Amer Sinha, Software Engineer, and Ryan McKenna, Research Scientist, Google Research
https://link.alphasignal.ai/PFGVYS

DP can prevent reconstruction attacks, and its guarantee is stronger when viewed through RDP than under the traditional (ε, δ)-DP interpretation. Previously, DP guarantees were mainly interpreted in terms of membership inference, but this paper provides direct guarantees against reconstruction/extraction attacks. In particular, using the probability preservation property of RDP (Rényi Differential Privacy), the paper argues that even for the same DP-SGD mechanism, one can obtain tighter leakage bounds than the standard ε interpretation in the existing literature.
Even a privacy budget too large to prevent membership inference can still make full reconstruction of rare secrets difficult. In other words, if an attacker already knows a secret, they can determine to some extent whether it was in the training set; extracting the secret from the model without prior knowledge, however, is much harder.
The smaller the prior probability p₀ of a secret (i.e., the rarer and higher entropy it is), the more limited the overall leakage is, even when the posterior p₁ increases due to training. Therefore, the longer and rarer a secret is, the more protected it is. As the paper states, "secret length itself serves as protection."
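The core tool is the probability preservation property of Rényi divergence; in standard notation (a sketch of the mechanism behind the paper's bound, assuming the training algorithm satisfies (α, ε)-RDP):

```latex
% Probability preservation: if M satisfies (α, ε)-RDP, then for any
% event S and neighboring datasets D, D',
\Pr[M(D) \in S] \;\le\; \left(e^{\varepsilon}\,\Pr[M(D') \in S]\right)^{\frac{\alpha-1}{\alpha}}

% Applied to secret extraction: if p_0 is the attacker's success
% probability without the secret in training (the prior), the success
% probability p_1 with it satisfies
p_1 \;\le\; \left(e^{\varepsilon}\, p_0\right)^{\frac{\alpha-1}{\alpha}}
```

Since the bound scales with p₀^(α−1)/α, a tiny prior p₀ keeps p₁ small even when e^ε is large, which is why long, high-entropy secrets stay protected even under budgets that do not stop membership inference.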
https://arxiv.org/pdf/2202.07623
Opacus
The goal is to let DP-SGD be applied by adding only about two lines to existing training code; it is much faster than micro-batching approaches such as PyVacy (see the sketch below).
https://arxiv.org/pdf/2109.12298
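A minimal sketch of the two-line integration (the toy model, data, and hyperparameter values are illustrative; `PrivacyEngine.make_private` is the Opacus 1.x API):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy model and data for illustration.
model = nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
data_loader = DataLoader(data, batch_size=8)

# The "~2 lines": wrap the model, optimizer, and data loader so that
# per-sample gradient clipping and Gaussian noise happen inside step().
privacy_engine = PrivacyEngine()
model, optimizer, data_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=data_loader,
    noise_multiplier=1.0,   # noise std relative to the clipping norm
    max_grad_norm=1.0,      # per-sample gradient clipping bound C
)

# The training loop itself is unchanged.
criterion = nn.CrossEntropyLoss()
for x, y in data_loader:
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()        # DP-SGD step: clip per-sample grads, add noise
```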

Seonglae Cho