The Balance between Data Redaction and Data Utility

Megaputer Intelligence
Oct 3, 2019

How to make use of data while respecting privacy

What is Data Anonymization?

Data anonymization, also known as data redaction, is the process of removing or concealing information that identifies individuals (i.e., personal data) so that the data can be used more widely. Oversight bodies such as Institutional Review Boards (IRBs) and the European Medicines Agency (EMA) require researchers and companies to anonymize their data before sharing or publishing it, in order to protect the privacy of their data subjects.

Article 3(1) of Regulation (EU) 2018/1725 of the European Parliament and of the Council defines personal data and data subjects as follows:

“‘[P]ersonal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person[.]”

Of course, the anonymization requirements may change depending on the context of the data release. For example, internal data sharing typically has less strict anonymization requirements than the public release of a dataset.

According to the EMA, effective anonymization solutions may be evaluated on three criteria:

  1. Possibility to single out an individual.
  2. Possibility to link records relating to an individual.
  3. Whether information can be inferred with regard to an individual.

If an anonymization solution fails any of the three criteria, then the risks of re-identification must be evaluated.
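
As a rough illustration of the first criterion, the sketch below counts how many records share each combination of quasi-identifier values. A combination held by a single record means that person could be singled out. The records, the field names, and the helper functions are hypothetical, and this group-size check (in the spirit of k-anonymity) is only one possible way to probe singling-out risk, not a method prescribed by the EMA.

```python
from collections import Counter

# Hypothetical toy records; field names and values are illustrative only.
records = [
    {"zip": "47401", "birth_year": 1985, "sex": "F"},
    {"zip": "47401", "birth_year": 1985, "sex": "F"},
    {"zip": "47401", "birth_year": 1990, "sex": "M"},
    {"zip": "47408", "birth_year": 1972, "sex": "M"},
]


def group_sizes(rows, quasi_identifiers):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)


def smallest_group_size(rows, quasi_identifiers):
    """Return the size of the smallest group; 1 means someone can be singled out."""
    return min(group_sizes(rows, quasi_identifiers).values())


k = smallest_group_size(records, ["zip", "birth_year", "sex"])
print(k)
```

Dropping or coarsening quasi-identifiers makes the groups larger, which lowers the singling-out risk but also removes detail from the data.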

What are the challenges?

There are several challenges associated with effective data anonymization, the main one being to balance anonymization against readability. Anonymized data is often used to train models that predict certain characteristics, behaviors, or outcomes in a variety of fields. However, a dataset with a large part of its data redacted may not be very useful for further analysis and may degrade the performance of those models. Consequently, there is always a trade-off between privacy and model performance to consider.

With the rise of Big Data and the Semantic Web, another challenge has appeared: the risk of re-identification due to linked databases. The Semantic Web, or Web 3.0, allows linking diverse information about individuals across databases that may be used for Artificial Intelligence processes. However, this makes it harder to anonymize personal data effectively, since pieces of identifiable information sit in multiple locations. Even if we anonymize personal data in one database, an individual may still be re-identified from the linked identifiable information in a different database.
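
The linkage risk described above can be sketched with two toy datasets. In this hypothetical example, names have been removed from the released data, but joining it with a second dataset on the remaining quasi-identifiers recovers them. All records, names, and the link helper are invented for illustration.

```python
# Hypothetical "anonymized" release: names removed, quasi-identifiers kept.
released = [
    {"zip": "47401", "birth_year": 1985, "sex": "F", "diagnosis": "asthma"},
    {"zip": "47408", "birth_year": 1972, "sex": "M", "diagnosis": "diabetes"},
]

# Hypothetical second dataset that holds names and the same quasi-identifiers.
public = [
    {"name": "Alice Example", "zip": "47401", "birth_year": 1985, "sex": "F"},
    {"name": "Bob Example", "zip": "47408", "birth_year": 1972, "sex": "M"},
    {"name": "Carol Example", "zip": "47408", "birth_year": 1990, "sex": "F"},
]

KEYS = ("zip", "birth_year", "sex")


def link(released_rows, public_rows, keys=KEYS):
    """Join the two datasets on shared quasi-identifiers.

    A released row counts as re-identified when exactly one public row matches it.
    """
    matches = []
    for r in released_rows:
        candidates = [p for p in public_rows if all(p[k] == r[k] for k in keys)]
        if len(candidates) == 1:
            matches.append((candidates[0]["name"], r["diagnosis"]))
    return matches


reidentified = link(released, public)
print(reidentified)
```

Both released rows are matched to a name here, even though the release itself contains no direct identifier.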

Finally, there are challenges in identifying and extracting personal data properly, and in making the anonymization reversible. Many companies use manual annotation systems that add tags to the information, or entities, that need to be anonymized. The volume of data available nowadays makes this process extremely time-consuming and labor-intensive. A good anonymization solution should also use secure encryption to store the original information and allow the anonymization to be reversed: the de-anonymization of data. For these reasons, there is increased demand for automated systems that efficiently and accurately identify different types of personal data in unstructured text.
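
To make the idea of reversibility concrete, here is a minimal sketch that replaces values with random tokens and keeps a lookup table for reversing them. Note that this swaps in a simple tokenization vault for the encryption mentioned above: the TokenVault class and its method names are hypothetical, and a real system would store the mapping encrypted and behind access controls rather than in a plain in-memory dict.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory vault mapping random tokens to original values.

    Assumption: a production system would keep this mapping encrypted and
    access-controlled; a plain dict is used here only for brevity.
    """

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def pseudonymize(self, value, label="PERSON"):
        # Reuse the same token for repeated values so records stay linkable.
        if value not in self._value_to_token:
            token = f"[{label}_{secrets.token_hex(4)}]"
            self._value_to_token[value] = token
            self._token_to_value[token] = value
        return self._value_to_token[value]

    def reidentify(self, token):
        # Only someone holding the vault can reverse the substitution.
        return self._token_to_value[token]


vault = TokenVault()
token = vault.pseudonymize("Jane Doe")
text = f"{token} visited the clinic."
print(text)
print(vault.reidentify(token))
```

The shared text carries only the token; whoever holds the vault can restore the original value, which is what makes the process reversible.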

How is it done?

Anonymization may be approached from a utility or a privacy viewpoint. The utility approach focuses on preserving the utility of the data as much as possible and allows for some loss of privacy, whereas the privacy approach focuses on implementing methods that offer the highest privacy while sacrificing some of the data utility. Depending on the data release context, we may choose one or the other; however, an ideal anonymization solution should strike a balance between privacy and utility.

The two main steps in a data redaction task are data preprocessing and anonymization.

Data Preprocessing

During the data preprocessing step, we make sure that our data is formatted appropriately and as clean as possible. For example, spelling mistakes may decrease accuracy when identifying the personal information that needs to be anonymized. The identification of information, or entities, that need to be anonymized is also part of the data preprocessing step, and it is achieved either via manual tagging or an automated approach; of course, the latter is preferred. Once this information is identified and tagged, it is classified into direct identifiers and quasi-identifiers based on replicability, distinguishability, and knowability.
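
An automated tagging step might look roughly like the sketch below, which uses regular expressions to find a few entity types in free text and labels each as a direct identifier or a quasi-identifier. The patterns, the label names, and the direct/quasi assignments are illustrative assumptions; production systems usually combine such rules with trained named-entity recognition models.

```python
import re

# Hypothetical patterns; they cover only a few simple formats.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "ZIP": re.compile(r"\b\d{5}\b"),
}

# Illustrative assignment of entity types to identifier classes.
IDENTIFIER_CLASS = {
    "EMAIL": "direct",
    "PHONE": "direct",
    "DATE": "quasi",
    "ZIP": "quasi",
}


def tag_entities(text):
    """Return (start, end, label, identifier_class, matched_text) tuples in text order."""
    found = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append((m.start(), m.end(), label, IDENTIFIER_CLASS[label], m.group()))
    return sorted(found)


sample = "Contact jane.doe@example.com or 812-555-0100; admitted 2019-10-03, ZIP 47401."
entities = tag_entities(sample)
for start, end, label, id_class, value in entities:
    print(label, id_class, value)
```

The character offsets in each tuple let a later anonymization step replace exactly the tagged spans.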

Anonymization

There are several considerations that we need to take into account when moving on to the anonymization step. First, we need to identify any possible attacks on our data and the entities behind them. Where will the data be released? Who will have access to it? Could it be used in a malicious way? If so, how? The answers to these questions help us evaluate and decide on the balance between low re-identification risk and data usability.

Next, we need to choose an anonymization methodology that is appropriate for our selected privacy-utility balance and data goals. The simplest anonymization technique is the complete and irreversible removal of any personal information using data redaction software. While this offers a very low risk of re-identification, it scores very low in data utility since the de-identified data is not very readable.
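
The readability cost of complete removal can be seen in a small sketch. The redact function below irreversibly replaces every match with a generic mask; the optional keep_label flag is a hypothetical variation, not something described above, that keeps the entity type so the text stays somewhat more readable. The patterns stand in for a real entity recognizer and are assumptions for illustration.

```python
import re

# Hypothetical patterns standing in for a real entity recognizer.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def redact(text, keep_label=False):
    """Irreversibly replace every match; the original values are not stored anywhere."""
    for label, pattern in PATTERNS.items():
        replacement = f"[{label}]" if keep_label else "[REDACTED]"
        text = pattern.sub(replacement, text)
    return text


note = "Call 812-555-0100 or write to jane.doe@example.com."
full = redact(note)
typed = redact(note, keep_label=True)
print(full)
print(typed)
```

With the generic mask, a reader can no longer tell a phone number from an email address; keeping the label preserves that distinction while still removing the values themselves.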

Originally published at https://www.megaputer.com on October 3, 2019.

Megaputer Intelligence

A data and text analysis firm that specializes in natural language processing