Introduction
For a Data Scientist in the corporate world like me, data privacy is a tremendous challenge: data is the source of all the powerful things we can do with analytics or machine learning, but since it is often personal and sensitive, we cannot simply use it, as our responsibility is to preserve the privacy of individuals. This is covered by legislation, such as the GDPR in Europe, which also requires us to obtain proper consent from customers to use their data in these kinds of applications. In many cases we do have some consent documented, but it doesn't cover our specific use case, as that use case wasn't known at the time the consent was requested.
One option to make use of all the data, apart from getting new consent from the customers, would be to use an anonymized version of the data that does not allow individuals to be identified. Still, doing anonymization in a secure, lawful way that actually protects your customers' rights isn't that easy, and as if that weren't enough, anonymized data is often considered useless for Data Science purposes.
Thankfully, the SAP HANA data anonymization team stepped up to take the first challenge and they asked me to take the second one.
So here is the plan: we will take the publicly available, well-known adult data set, put it into an SAP HANA database, run the SAP HANA data anonymization feature on it, and then see how much loss in prediction accuracy we get compared to working with the original data set.
A few words on data anonymization
To put it plainly, anonymization is more than simply removing IDs and names from a dataset. Lots of research has shown that doing only this does not properly protect the privacy of the data. Simply think of a 19-year-old HON Circle member in your company's travel data: chances are you would recognize that person without any names or IDs. To preserve this employee's privacy (the travel details), you might remove the leaking information, such as age or frequent-traveler status, which we refer to as quasi-identifiers, or "blur" the actual sensitive data (the travel details).
Research provides us with a variety of well-known algorithms to achieve this, each with different privacy and utility guarantees. SAP HANA data anonymization comes with three of them: k-Anonymity, l-Diversity, and Differential Privacy.
For this post we will stick with k-Anonymity; the others might be part of future blog posts. k-Anonymity ensures that every record is indistinguishable from at least k-1 other records with respect to its quasi-identifiers. For our frequent traveler, this might mean a group of employees that all share the characteristics "age between 18 and 25" and "has frequent traveler status".
Getting started
So, let's get started. The adult data set is one of the most well-known publicly available data sets and is regularly used for Data Science demos and presentations. It contains US census data with a set of personal characteristics and information on the income level, divided into two groups: below or above $50k. The question then typically is how well certain algorithms can predict the individual income class based on those characteristics.
Obviously, the income data is highly sensitive, yet it may be revealed through the quasi-identifiers (age, occupation, marital status, etc.). Assume you know a widowed white female, born in Guatemala, who only completed preschool: if our data set weren't just a small extract, that would be far more than enough to find out her salary. As we don't want this to happen, we need to protect this information before working with the data.
To do so, we will require our anonymized data set to be k-anonymous with k = 10, meaning any individual record shares its quasi-identifiers (age, occupation, marital status, etc.) with at least 9 other records.
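To make that guarantee concrete, here is a minimal Python sketch (an illustration of the property, not the HANA implementation) of what k-anonymity checks: every combination of quasi-identifier values must occur at least k times.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """Check that every combination of quasi-identifier values
    occurs at least k times in the data set."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())

# Toy example: two equivalence classes of size 2 each,
# so the data is 2-anonymous but not 3-anonymous.
rows = [
    {"age_bucket": "[30-39]", "gender": "Female", "salary": "<=50K"},
    {"age_bucket": "[30-39]", "gender": "Female", "salary": ">50K"},
    {"age_bucket": "[40-49]", "gender": "Male",   "salary": ">50K"},
    {"age_bucket": "[40-49]", "gender": "Male",   "salary": "<=50K"},
]
print(is_k_anonymous(rows, ["age_bucket", "gender"], 2))  # True
print(is_k_anonymous(rows, ["age_bucket", "gender"], 3))  # False
```

Note that the sensitive attribute (salary) may still vary freely within a group; protecting its diversity is what l-Diversity adds on top.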
SAP HANA data anonymization achieves this by means of an anonymization view. This is a specialized view, built on top of any (HANA) data source, that specifies both the anonymization parameters and information on how to handle the contained columns.
In the case of k-Anonymity, we need to specify the levels to which each column may be generalized. There are three ways to do that:
1. Embedded hierarchies: directly specified in the view definition.
2. Hierarchy functions: SQLScript functions defined anywhere in the database that compute the applicable values at runtime.
3. External hierarchies: hierarchies defined in the SAP HANA system.
We will make use of the first two options.
First, we'll define functions to group the numeric values into buckets. We use two separate functions because the value ranges of age, working hours, and years of education obviously vary; only the function for the smaller numeric ranges is shown here.
CREATE OR REPLACE FUNCTION HIERARCHYFUNCTION_SMALL_NUM(value VARCHAR(255), level INTEGER)
RETURNS outValue VARCHAR(255)
AS
BEGIN
    DECLARE double_value DOUBLE;
    DECLARE rangefrom INTEGER;
    DECLARE interval INTEGER;
    IF (:level > 0) THEN
        -- bucket width doubles with every generalization level
        double_value := TO_DOUBLE(:value);
        interval := POWER(2, :level);
        rangefrom := FLOOR(:double_value / :interval) * :interval;
        outValue := '[' || :rangefrom || '-' || (:rangefrom + :interval - 1) || ']';
    ELSE
        outValue := :value;
    END IF;
END;
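The bucketing behavior of this function can be illustrated with an equivalent Python sketch; the logic mirrors the SQLScript above, nothing here is HANA-specific:

```python
def hierarchy_bucket(value: int, level: int) -> str:
    """At generalization level L, the value falls into a bucket of width 2**L."""
    if level <= 0:
        return str(value)
    interval = 2 ** level
    range_from = (value // interval) * interval
    return f"[{range_from}-{range_from + interval - 1}]"

# Age 37 at increasing generalization levels:
for level in range(4):
    print(level, hierarchy_bucket(37, level))
# 0 37, 1 [36-37], 2 [36-39], 3 [32-39]
```

The doubling bucket width gives the anonymization algorithm a range of coarseness options to pick from per column.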
Second, we'll define embedded hierarchies for the categorical variables. Unfortunately, this is quite cumbersome if a quasi-identifier contains many distinct values, so for productive (re-)use we would want to create external hierarchies instead.
COLUMN AGE PARAMETERS '{"is_quasi_identifier":true, "hierarchy":{"schema":"JUPYTER","function":"HIERARCHYFUNCTION_NUM_2","levels":5}}'
COLUMN EDUCATION PARAMETERS '{"is_quasi_identifier":true,"hierarchy":{"embedded":[["Preschool","Primary School","Primary education"],["1st-4th","Primary School","Primary education"],["5th-6th","Primary School","Primary education"],["7th-8th","High School","Secondary education"],["9th","High School","Secondary education"],["10th","High School","Secondary education"],["11th","High School","Secondary education"],["12th","High School","Secondary education"],["HS-grad","High School","Secondary education"],["Assoc-voc","Professional Education","Higher education"],["Assoc-acdm","Professional Education","Higher education"],["Prof-school","Professional Education","Higher education"],["Some-college","Undergraduate","Higher education"],["Bachelors","Undergraduate","Higher education"],["Masters","Graduate","Higher education"],["Doctorate","Graduate","Higher education"]]}}'
COLUMN MARITAL_STATUS PARAMETERS '{"is_quasi_identifier":true,"hierarchy":{"embedded":[["Married-civ-spouse","spouse present"],["Divorced","spouse not present"],["Never-married","spouse not present"],["Separated","spouse not present"],["Widowed","spouse not present"],["Married-spouse-absent","spouse not present"],["Married-AF-spouse","spouse present"]]}}'
COLUMN HOURS_PER_WEEK PARAMETERS '{"is_quasi_identifier":true, "hierarchy":{"schema":"JUPYTER","function":"HIERARCHYFUNCTION_NUM_2","levels":5}}'
COLUMN GENDER PARAMETERS '{"is_quasi_identifier":true,"hierarchy":{"embedded":[["Male"],["Female"]]}}'
COLUMN WORKCLASS PARAMETERS '{"is_quasi_identifier":true, "hierarchy":{"embedded": [["Private", "Private"],["Federal-gov","Government"],["Local-gov","Government"],["State-gov","Government"],["Self-emp-inc","Self-Employed"],["Self-emp-not-inc","Self-Employed"],["Never-worked","Other"],["Without-pay","Other"],["?","Other/Unknown"],["Unknown","Other/Unknown"]]}}'
COLUMN EDUCATION_NUM PARAMETERS '{"is_quasi_identifier":true, "hierarchy":{"schema":"JUPYTER","function":"HIERARCHYFUNCTION_SMALL_NUM","levels":5}}'
COLUMN OCCUPATION PARAMETERS '{"is_quasi_identifier":true, "hierarchy":{"embedded": [["Adm-clerical", "White-Collar"],["Craft-repair", "Blue-Collar"],["Exec-managerial", "White-Collar"],["Farming-fishing", "Blue-Collar"],["Handlers-cleaners", "Blue-Collar"],["Machine-op-inspct", "Blue-Collar"],["Other-service", "Service"],["Priv-house-serv", "Service"],["Prof-specialty", "Professional"],["Protective-serv", "Service"],["Tech-support", "Service"],["Transport-moving", "Blue-Collar"],["Armed-Forces", "Other/Unknown"],["?", "Other/Unknown"],["Sales", "Other/Unknown"]]}}'
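To see how such an embedded hierarchy behaves, here is a small Python sketch of the level semantics: each inner list holds the original value followed by its generalizations, one per level. (The fallback to "*" for levels beyond the hierarchy depth is an assumption for illustration, mirroring the full suppression seen later.)

```python
# Excerpt of the EDUCATION hierarchy from the view definition:
# level 0 = original value, level 1 and 2 = increasingly coarse groupings.
EDUCATION_HIERARCHY = [
    ["Preschool", "Primary School", "Primary education"],
    ["HS-grad", "High School", "Secondary education"],
    ["Bachelors", "Undergraduate", "Higher education"],
    ["Doctorate", "Graduate", "Higher education"],
]

def generalize(hierarchy, value, level):
    lookup = {row[0]: row for row in hierarchy}
    row = lookup[value]
    # Levels beyond the deepest defined generalization mean full suppression.
    return row[level] if level < len(row) else "*"

print(generalize(EDUCATION_HIERARCHY, "Bachelors", 1))  # Undergraduate
print(generalize(EDUCATION_HIERARCHY, "Bachelors", 2))  # Higher education
print(generalize(EDUCATION_HIERARCHY, "Bachelors", 3))  # *
```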
With that, we are nearly good to go. The last missing piece is the syntax of the actual anonymization view.
CREATE VIEW ADULT_VIEW_SAMPLE_1 (ID, AGE, EDUCATION, MARITAL_STATUS, GENDER, HOURS_PER_WEEK, WORKCLASS, EDUCATION_NUM, OCCUPATION ,CAPITAL_GAIN, CAPITAL_LOSS, SALARY) AS
SELECT ID, TO_VARCHAR(AGE) AS "AGE", EDUCATION, MARITAL_STATUS, GENDER, TO_VARCHAR(HOURS_PER_WEEK) AS "HOURS_PER_WEEK", WORKCLASS, TO_VARCHAR(EDUCATION_NUM) AS EDUCATION_NUM, OCCUPATION, CAPITAL_GAIN, CAPITAL_LOSS, SALARY FROM ANON_DEMO.ADULT_DATA_TRAIN
WITH ANONYMIZATION (
ALGORITHM 'K-ANONYMITY' PARAMETERS '{"data_change_strategy": "restricted", "k":10}'
)
Putting it all together gives us the definition below.
CREATE VIEW ADULT_VIEW_SAMPLE_1 (ID, AGE, EDUCATION, MARITAL_STATUS, GENDER, HOURS_PER_WEEK, WORKCLASS, EDUCATION_NUM, OCCUPATION ,CAPITAL_GAIN, CAPITAL_LOSS, SALARY) AS
SELECT ID, TO_VARCHAR(AGE) AS "AGE", EDUCATION, MARITAL_STATUS, GENDER, TO_VARCHAR(HOURS_PER_WEEK) AS "HOURS_PER_WEEK", WORKCLASS, TO_VARCHAR(EDUCATION_NUM) AS EDUCATION_NUM, OCCUPATION, CAPITAL_GAIN, CAPITAL_LOSS, SALARY FROM ANON_DEMO.ADULT_DATA_TRAIN
WITH ANONYMIZATION (
ALGORITHM 'K-ANONYMITY' PARAMETERS '{"data_change_strategy": "restricted", "k":10}'
COLUMN ID PARAMETERS '{"is_sequence":true}'
COLUMN AGE PARAMETERS '{"is_quasi_identifier":true, "hierarchy":{"schema":"JUPYTER","function":"HIERARCHYFUNCTION_NUM_2","levels":5}}'
COLUMN EDUCATION PARAMETERS '{"is_quasi_identifier":true,"hierarchy":{"embedded":[["Preschool","Primary School","Primary education"],["1st-4th","Primary School","Primary education"],["5th-6th","Primary School","Primary education"],["7th-8th","High School","Secondary education"],["9th","High School","Secondary education"],["10th","High School","Secondary education"],["11th","High School","Secondary education"],["12th","High School","Secondary education"],["HS-grad","High School","Secondary education"],["Assoc-voc","Professional Education","Higher education"],["Assoc-acdm","Professional Education","Higher education"],["Prof-school","Professional Education","Higher education"],["Some-college","Undergraduate","Higher education"],["Bachelors","Undergraduate","Higher education"],["Masters","Graduate","Higher education"],["Doctorate","Graduate","Higher education"]]}}'
COLUMN MARITAL_STATUS PARAMETERS '{"is_quasi_identifier":true,"hierarchy":{"embedded":[["Married-civ-spouse","spouse present"],["Divorced","spouse not present"],["Never-married","spouse not present"],["Separated","spouse not present"],["Widowed","spouse not present"],["Married-spouse-absent","spouse not present"],["Married-AF-spouse","spouse present"]]}}'
COLUMN HOURS_PER_WEEK PARAMETERS '{"is_quasi_identifier":true, "hierarchy":{"schema":"JUPYTER","function":"HIERARCHYFUNCTION_NUM_2","levels":5}}'
COLUMN GENDER PARAMETERS '{"is_quasi_identifier":true,"hierarchy":{"embedded":[["Male"],["Female"]]}}'
COLUMN WORKCLASS PARAMETERS '{"is_quasi_identifier":true, "hierarchy":{"embedded": [["Private", "Private"],["Federal-gov","Government"],["Local-gov","Government"],["State-gov","Government"],["Self-emp-inc","Self-Employed"],["Self-emp-not-inc","Self-Employed"],["Never-worked","Other"],["Without-pay","Other"],["?","Other/Unknown"],["Unknown","Other/Unknown"]]}}'
COLUMN EDUCATION_NUM PARAMETERS '{"is_quasi_identifier":true, "hierarchy":{"schema":"JUPYTER","function":"HIERARCHYFUNCTION_SMALL_NUM","levels":5}}'
COLUMN OCCUPATION PARAMETERS '{"is_quasi_identifier":true, "hierarchy":{"embedded": [["Adm-clerical", "White-Collar"],["Craft-repair", "Blue-Collar"],["Exec-managerial", "White-Collar"],["Farming-fishing", "Blue-Collar"],["Handlers-cleaners", "Blue-Collar"],["Machine-op-inspct", "Blue-Collar"],["Other-service", "Service"],["Priv-house-serv", "Service"],["Prof-specialty", "Professional"],["Protective-serv", "Service"],["Tech-support", "Service"],["Transport-moving", "Blue-Collar"],["Armed-Forces", "Other/Unknown"],["?", "Other/Unknown"],["Sales", "Other/Unknown"]]}}'
)
Please note that the core definition of the view must be followed by a REFRESH statement, which is necessary to actually run the anonymization calculation. If the defined criteria do not allow a solution to the anonymization problem, an error is thrown at that point.
Querying the new view gives us the following result:
As you can see, many of the columns have been completely masked with a '*'. This happens for two reasons:
1. We did not put any restrictions on the choice of levels, so the algorithm computes the mathematically optimal solution, which might not be the semantically optimal one.
2. By default, anonymization runs with the recoding scheme "global full domain". Recoding schemes define how levels are chosen across the various groups. "Global full domain" means that the worst level required by any group is applied to all other groups as well. So if a single group cannot be solved without completely hiding a specific column, that column will be suppressed for all other groups too.
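A minimal Python sketch of the "global full domain" idea (the group keys and per-group level requirements below are made up for illustration):

```python
# Each group of records needs some generalization level per column to reach
# k-anonymity; global full domain applies the worst level found anywhere.
required_levels = {
    "group_a": {"AGE": 1, "EDUCATION": 0},
    "group_b": {"AGE": 2, "EDUCATION": 3},  # one hard-to-solve group ...
}

global_levels = {}
for levels in required_levels.values():
    for column, level in levels.items():
        global_levels[column] = max(global_levels.get(column, 0), level)

print(global_levels)  # {'AGE': 2, 'EDUCATION': 3} -- the hard group dominates
```

This is why a single problematic group can degrade the whole data set under global recoding, and why per-group (multi-dimensional) recoding preserves more information.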
Let's give it a second shot with the "multi_dimensional_strict" recoding scheme, which was recently introduced with SAP HANA Cloud and allows each group to have an individual level of generalization for each column.
As you can see, much more information is preserved, while the privacy guarantee still holds.
For the sake of this post, we unfortunately need to make the information loss a little worse again, as we want to see how the algorithm performs under this suboptimal condition. To do so, we will define min and max values for the hierarchy level of each column. That forces the algorithm to generalize some columns even where it is not technically necessary.
The definition now looks like this:
CREATE VIEW ADULT_VIEW_TRAIN (ID, AGE, EDUCATION, MARITAL_STATUS, GENDER, HOURS_PER_WEEK, WORKCLASS, EDUCATION_NUM, OCCUPATION ,CAPITAL_GAIN, CAPITAL_LOSS, SALARY) AS
SELECT ID, TO_VARCHAR(AGE) AS "AGE", EDUCATION, MARITAL_STATUS, GENDER, TO_VARCHAR(HOURS_PER_WEEK) AS "HOURS_PER_WEEK", WORKCLASS, TO_VARCHAR(EDUCATION_NUM) AS EDUCATION_NUM, OCCUPATION, CAPITAL_GAIN, CAPITAL_LOSS, SALARY FROM ANON_DEMO.ADULT_DATA_TRAIN
WITH ANONYMIZATION (
ALGORITHM 'K-ANONYMITY' PARAMETERS '{"data_change_strategy": "restricted", "k":10,"recoding":"multi_dimensional_strict", "loss":0.025}'
COLUMN ID PARAMETERS '{"is_sequence":true}'
COLUMN AGE PARAMETERS '{"is_quasi_identifier":true, "min":1, "max": 3 ,"hierarchy":{"schema":"JUPYTER","function":"HIERARCHYFUNCTION_NUM_2","levels":5}}'
COLUMN EDUCATION PARAMETERS '{"is_quasi_identifier":true,"hierarchy":{"embedded":[["Preschool","Primary School","Primary education"],["1st-4th","Primary School","Primary education"],["5th-6th","Primary School","Primary education"],["7th-8th","High School","Secondary education"],["9th","High School","Secondary education"],["10th","High School","Secondary education"],["11th","High School","Secondary education"],["12th","High School","Secondary education"],["HS-grad","High School","Secondary education"],["Assoc-voc","Professional Education","Higher education"],["Assoc-acdm","Professional Education","Higher education"],["Prof-school","Professional Education","Higher education"],["Some-college","Undergraduate","Higher education"],["Bachelors","Undergraduate","Higher education"],["Masters","Graduate","Higher education"],["Doctorate","Graduate","Higher education"]]}}'
COLUMN MARITAL_STATUS PARAMETERS '{"is_quasi_identifier":true,"hierarchy":{"embedded":[["Married-civ-spouse","spouse present"],["Divorced","spouse not present"],["Never-married","spouse not present"],["Separated","spouse not present"],["Widowed","spouse not present"],["Married-spouse-absent","spouse not present"],["Married-AF-spouse","spouse present"]]}}'
COLUMN HOURS_PER_WEEK PARAMETERS '{"is_quasi_identifier":true, "min":1, "max": 4 ,"hierarchy":{"schema":"JUPYTER","function":"HIERARCHYFUNCTION_NUM_2","levels":5}}'
COLUMN GENDER PARAMETERS '{"is_quasi_identifier":true,"hierarchy":{"embedded":[["Male"],["Female"]]}}'
COLUMN WORKCLASS PARAMETERS '{"is_quasi_identifier":true, "min": 1, "hierarchy":{"embedded": [["Private", "Private"],["Federal-gov","Government"],["Local-gov","Government"],["State-gov","Government"],["Self-emp-inc","Self-Employed"],["Self-emp-not-inc","Self-Employed"],["Never-worked","Other"],["Without-pay","Other"],["?","Other/Unknown"],["Unknown","Other/Unknown"]]}}'
COLUMN EDUCATION_NUM PARAMETERS '{"is_quasi_identifier":true, "min": 1, "max":4 ,"hierarchy":{"schema":"JUPYTER","function":"HIERARCHYFUNCTION_SMALL_NUM","levels":5}}'
COLUMN OCCUPATION PARAMETERS '{"is_quasi_identifier":true, "min": 1, "hierarchy":{"embedded": [["Adm-clerical", "White-Collar"],["Craft-repair", "Blue-Collar"],["Exec-managerial", "White-Collar"],["Farming-fishing", "Blue-Collar"],["Handlers-cleaners", "Blue-Collar"],["Machine-op-inspct", "Blue-Collar"],["Other-service", "Service"],["Priv-house-serv", "Service"],["Prof-specialty", "Professional"],["Protective-serv", "Service"],["Tech-support", "Service"],["Transport-moving", "Blue-Collar"],["Armed-Forces", "Other/Unknown"],["?", "Other/Unknown"],["Sales", "Other/Unknown"]]}}'
)
The result then looks like this:
You might have noticed that I secretly introduced a new parameter called "loss". It allows a certain percentage of the data to be ignored during anonymization. Technically, those rows are dropped, which should allow for anonymization with lower generalization levels, at the price of losing some rows. Of course, this only applies to our training data; the test data remains untouched for a proper comparison with the original data.
With that, we are good to go to the Data Science / Machine Learning part.
Machine Learning training & results
Since my data resides in an SAP HANA database, it makes perfect sense to use SAP HANA's ML library (PAL) for the ML part. From there we'll choose the fairly new Hybrid Gradient Boosting Tree algorithm for classification. As it wasn't the focus of this post, I didn't spend too much time on optimizing the training: some simple grid search on the core parameters and we are good to go. Thanks to the comfortable Python API, we don't even need to leave our Python environment.
For a fair comparison, both the original and the anonymized data set are trained with the same model definition, while the grid search allows parameters to be selected individually for each.
Obviously, you can see differences in the influencing factors between the models, which reflect the information loss in the anonymized data set.
Still, looking at the overall results, the differences are fairly small: we see 85.8% prediction accuracy on the anonymized data vs. 87.1% on the original data. Checking the confusion matrix as well as standard performance indicators such as recall, precision, and the F1 score also indicates comparable performance with only minor differences.
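For reference, all the indicators mentioned can be derived from the confusion matrix. The counts in this small Python sketch are made up for illustration, not the actual numbers from this experiment:

```python
def metrics(tp, fp, fn, tn):
    """Derive standard indicators from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative (made-up) confusion counts for two models:
for name, cm in [("original", (1500, 300, 200, 6000)),
                 ("anonymized", (1450, 350, 250, 5950))]:
    acc, prec, rec, f1 = metrics(*cm)
    print(f"{name}: accuracy={acc:.3f} precision={prec:.3f} "
          f"recall={rec:.3f} f1={f1:.3f}")
```

Comparing the full set of indicators, not just accuracy, matters here because the adult data set is imbalanced toward the below-$50k class.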
Overall, we can hence conclude that our forecast would work on the anonymized data at a comparable accuracy level, while preserving individual privacy.
Of course, this might not hold true for each and every case, and you might argue that the adult data set isn't even one of the hardest to solve. But it definitely indicates that it is worth taking a closer look at the individual case and properly investigating whether anonymized data can be used for your ML scenario, especially since, in most situations where personal data is involved, anonymization is the only option to get data at all.
Consideration regarding productive use and recoding schemes
If you stayed with me until here and carefully followed my explanations, you might have noticed one little gap in my story. Did you?
Using the multi-dimensional recoding scheme allowed me to get nicer-looking data, but since every group in the data set may have its individual level of generalization, chances are that these levels will differ between training and test data. That would result in test data characteristics that my model has not seen during training.
To get around this issue there are two approaches you might think of:
1. Anonymize training data with the global_strict scheme, build a “dictionary” from this and apply it to the test data.
2. Use the multi-dimensional scheme for the training data set and apply the found anonymization rules to the test set by using the same anonymization view.
I put some sample code for both approaches in the notebook.
For the dictionary approach, the idea is basically to run the global_strict anonymization on the training set, then extract which original values map to which anonymized values (they are static across the different groups of k individuals), and apply these transformations to the test data and any data you use in production.
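A minimal Python sketch of this dictionary idea (the column names and values here are illustrative, not taken from the notebook):

```python
# Pair each original quasi-identifier value with its anonymized counterpart
# from the global_strict training output, then replay the mapping on new data.
train_original   = [{"AGE": "37"}, {"AGE": "41"}]
train_anonymized = [{"AGE": "[32-39]"}, {"AGE": "[40-47]"}]

dictionary = {}
for orig, anon in zip(train_original, train_anonymized):
    for column, value in orig.items():
        dictionary.setdefault(column, {})[value] = anon[column]

def apply_dictionary(row):
    # Unseen values pass through unchanged; in production you would
    # decide how to handle them (e.g. suppress them).
    return {c: dictionary.get(c, {}).get(v, v) for c, v in row.items()}

print(apply_dictionary({"AGE": "41"}))  # {'AGE': '[40-47]'}
```

This works precisely because global_strict guarantees one fixed generalization per value across all groups; under a multi-dimensional scheme, the same value could map to different buckets.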
The second approach is a little more complex. It is based on an anonymization view with the multi-dimensional strict scheme for the training data. Once this is calculated, you append the test data (and any future data) to the training data table, keeping track of the respective row numbers. The next step is to refresh the view, which applies the found rules to all data in the table, including your added test/production data. Then simply select the test/production data by their row numbers and continue working with them.
Both approaches will most likely overfulfill your anonymization request, with more than k individuals being indistinguishable. But that is not a bad thing, right?
I tried both ways, and the prediction results were pretty much the same, with differences not worth noting. So pick what best fits your case or your system landscape (e.g., the second approach is not supported on-premise).