Title: Multimethod synthetic data generation for confidentiality and measurement of disclosure risk
Authors: Michael D. Larsen; Jennifer C. Huckett
Addresses: Biostatistics Center; Department of Statistics, George Washington University, 6110 Executive Blvd., Ste 750, Rockville, MD 20852, USA. Battelle Memorial Institute, 505 King Ave., Columbus, OH 43201, USA
Abstract: Government agencies must simultaneously maintain the confidentiality of individual records and disseminate useful microdata. We propose a method for creating synthetic data that combines quantile regression, hot deck imputation, and rank swapping. The procedure produces a releasable dataset containing original values for a few key variables, synthetic quantile regression predictions for several variables, and imputed and perturbed values for the remaining variables. To measure the disclosure risk in the resulting synthetic dataset, we extend existing probabilistic risk measures that imitate an intruder attempting to match a record in the released data with information previously available on a target respondent.
Keywords: disclosure control; hot deck imputation; quantile regression; rank swapping; simulation; statistical disclosure limitation; SDL; synthetic data; disclosure avoidance; disclosure risk; confidentiality; privacy; security.
DOI: 10.1504/IJIPSI.2012.046132
International Journal of Information Privacy, Security and Integrity, 2012 Vol.1 No.2/3, pp.184 - 204
Published online: 23 Aug 2014