Sensitive & Compliant Data Access – A key Self-Service Requirement in 2020
Enabled in SAP Data Warehouse Cloud through SAP HANA data anonymization
A growing number of people are making use of the latest technologies. Their power and increasing accessibility show in personal apps that ultimately generate data. For companies that build solutions on new technologies and retrieve information about their users, it is key to use methodologies that allow handling sensitive data with care. In order to derive insights from data, adherence to data privacy laws and regulations needs to be ensured. These considerations are necessary not only with regard to external customer or supplier data but also when working with internal data, as we did in the case outlined in this blog article.
In times of a global pandemic, with many people working from home, SAP started an initiative called “Move2Donate”, which motivated employees to exercise for a higher purpose: SAP donated an amount of money per kilometer run or ridden, while employees decided which charity organization the donation went to. In just 100 days, SAP employees in Germany accomplished over 300,000 km together.
For the tracking of activities, an internal SAP fitness app was developed on an SAP Data Warehouse Cloud backend and rolled out via smartphones to all interested employees. While this is a great example of the power of technology, the use case also came with data privacy challenges when analyzing the above-mentioned fitness data: the data might hold sensitive information, and we need to preserve each individual’s privacy when working with the data or even granting access to it internally. If we cannot overcome these data privacy challenges, we lose access to valuable data and, with it, the opportunity to unleash its value. For instance, one feasible use would be to increase the competition amongst participants by sharing leaderboards, encouraging participants to do more sports and ultimately collect more donations.
In the following paragraphs, we will explain the exact use case and how SAP HANA can automatically safeguard the data in such scenarios. To enable you to rebuild this demo within your own SAP Data Warehouse Cloud tenant, we derived a sample data set from the original data, with scrambled, but realistic values. You can find the datasets & sample code right here: https://github.com/OliverHuth/Move2Donate
The challenge: Even without user information, data can be sensitive
Let’s start with the original dataset and illustrate the problem of working with it directly, without any data privacy mechanisms in place. We prepared some simple analyses of different aspects, like the number of activities per day of the week or the aggregated distance covered per start location. If you look at the data from a high-level perspective, comparing all participants across countries, no sensitive data is exposed.
While aggregates give a nice overview of the data, we are often interested in the details of the data as well. In our case, it might be interesting to investigate the distances accomplished per city. Using the embedded hierarchy from SAP Data Warehouse Cloud we can drill down from countries to regions and to cities ultimately. Here, we might select any of the cities for further analysis. We will choose “Ely” now, where only 41 km were run in total.
As you can see in the screenshot, there are only a few activities recorded in Ely, which are fairly evenly distributed over the weekdays and the hours of the day. Now, imagine that you happen to know one of our participants in Ely and he tells you about the amazing run he had on Monday. If you apply this information to the dashboard by filtering on Mondays only, the single data point from this specific participant remains, uncovering that he covered only 3 km that day, a piece of information that he may not have wanted to share with you.
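The effect described above can be reproduced in a few lines of Python. This is only a sketch: the records are invented for illustration (they are not taken from the Move2Donate data set), and the field names and the helper `filter_records` are hypothetical.

```python
# Hypothetical activity records for one small city; all values are invented.
activities = [
    {"city": "Ely", "type": "Run", "weekday": "Mon", "distance_km": 3.0},
    {"city": "Ely", "type": "Run", "weekday": "Wed", "distance_km": 8.5},
    {"city": "Ely", "type": "Bike", "weekday": "Wed", "distance_km": 21.0},
    {"city": "Ely", "type": "Run", "weekday": "Sat", "distance_km": 8.5},
]


def filter_records(records, **criteria):
    """Return the records that match every key/value pair in criteria."""
    return [r for r in records if all(r[k] == v for k, v in criteria.items())]


# Background knowledge: "my colleague from Ely went for a run on Monday".
matches = filter_records(activities, city="Ely", type="Run", weekday="Mon")

if len(matches) == 1:
    # Only one record is left, so the filter has singled out one person.
    print("Re-identified distance:", matches[0]["distance_km"], "km")
```

No name or user ID is stored in these records, yet combining three harmless-looking attributes is enough to isolate a single person's activity.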
Since we obviously do not want to create dashboards that reveal this kind of personal information, a solution is needed to help us make responsible use of data and adhere to privacy regulations – and of course, further motivate participants to increase donations. This is where SAP HANA data anonymization comes into play. The feature allows us to configure the database to automatically apply anonymization techniques in constellations where individual privacy is at risk.
The foundation: Your assets on SAP Data Warehouse Cloud
The source data for this demo can be found in the GitHub repository. You will need to create a table in SAP Data Warehouse Cloud and upload the data.
For the drill-down analysis into start locations, an additional hierarchy is needed containing the relation of cities, regions and countries. This can be found in the repository as well – ready for upload into a table in SAP Data Warehouse Cloud. It is important to also create a dimension view on top of this table with the hierarchy feature configured for later use.
Together with the source data table, this hierarchy view can be used to create an analytical view of the original data to be consumed in the SAP Data Warehouse Cloud Story Builder. In order to also use the view later on the database level (which is needed for the anonymization part), you need to enable the external consumption by activating the “exposing” flag in the Data Builder.
In the following steps, we will switch to the database layer to create an “anonymization view” on top of the data. As the core element of SAP HANA data anonymization, it calculates in real-time whether to display original values or an anonymized version, based on the defined privacy requirements. In doing so, it always strives to retain as much of the valuable information as possible for our analysis.
As a first step, we need to switch to the Space Management of SAP Data Warehouse Cloud and create a Database User, checking both the “Enable HDI Consumption” option and the “Enable Data Consumption with Grant Option” setting. We do this to create a database user that allows us to interact seamlessly and bidirectionally with both the data warehouse layer and the database layer. (Side note: In order to do that, you need to run a Data Warehouse Cloud tenant on Wave 22 or higher.)
After the user is created, the database access link can be found directly in the same place: In the “Space Management” section, you can find the user that we just created. By selecting the user, the “Open Database Explorer” option is activated, which directly takes us down to the database level.
From now onwards, we will work with the SAP HANA Database Explorer, which you might recognize from previous work on SAP HANA Cloud or SAP WebIDE. This means that we directly interact with the SAP HANA Cloud instance that comes with SAP Data Warehouse Cloud. The license directly entitles you to leverage the native database features, like data anonymization, with your SAP Data Warehouse Cloud data assets.
But before diving hands-on into the required SQL commands, let’s quickly pause for a moment and summarize the most important aspects of anonymization with SAP HANA Cloud.
The background: A few words on data anonymization
Data anonymization is more than simply removing IDs and names from a dataset. A lot of research has shown that doing so does not properly protect the privacy of the data. In turn, research also provides us with a variety of well-known algorithms that protect private information better, each with different privacy and utility guarantees. SAP HANA data anonymization comes with three of these algorithms: k-anonymity, l-diversity and Differential Privacy.
In this case, we will apply k-anonymity.
The main idea behind k-anonymity is to “hide” individuals in a group by aggregating (generalizing) their quasi-identifiers until at least k individuals share the same details. Quasi-identifiers are pieces of information about an individual that do not directly identify the person but allow for re-identification. This can happen if quasi-identifiers are combined, filtered, or selected in a suitable way, just like in the example with the people from Ely shown above. In this case, individuals could be re-identified by their activity type and the respective start location, so we need to make sure that they are indistinguishable with regard to these attributes. We decided to define “indistinguishable” as at least 10 individuals sharing the same attribute values, but this may vary from case to case.
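To make the definition concrete, here is a minimal Python sketch of a k-anonymity check. It is an illustration of the concept only, not of how SAP HANA implements it. The records are invented, the function names are hypothetical, and the sketch assumes one record per individual, so it counts records rather than people.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    return all(n >= k for n in group_sizes(records, quasi_identifiers).values())


# Invented records: 12 bike rides and 3 runs starting in the same city.
records = [{"type": "Bike", "city": "Ely"}] * 12 + [{"type": "Run", "city": "Ely"}] * 3

print(is_k_anonymous(records, ["type", "city"], k=10))  # False: only 3 runs

# Suppressing the activity type merges both groups into one group of 15.
suppressed = [{**r, "type": "*"} for r in records]
print(is_k_anonymous(suppressed, ["type", "city"], k=10))  # True
```

The second call shows the trade-off behind the approach: the data becomes safe to share, but only because some detail (here, the activity type) has been given up.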
The magic: SAP HANA data anonymization in action
In SAP HANA, we achieve anonymization by means of an anonymization view. This is a specialized view, built on top of any (SAP HANA) data source, that specifies both the anonymization parameters and how to handle the contained columns.
It is essentially a regular SQL view, extended by the “WITH ANONYMIZATION” clause and the required parameters. Please note that here we specify the algorithm “K-ANONYMITY” and “k: 10”, as planned earlier.
CREATE VIEW "M2D_ANONYMIZATION#HANA"."ANON_RESULT_VIEW" ( "ID", "AGGREGATEDTYPE", "STARTLOCATION", "STARTREGION", "STARTTIME", "MONEYEARNED", "DISTANCE" ) AS
SELECT ID, AGGREGATEDTYPE, STARTLOCATION, STARTREGION, TO_VARCHAR(STARTTIME) AS STARTTIME, MONEYEARNED, DISTANCE
FROM "M2D_ANONYMIZATION"."EU_RESULT_VIEW" WHERE STARTTIME IS NOT NULL
WITH ANONYMIZATION ( ALGORITHM 'K-ANONYMITY' PARAMETERS '{"k": 10, "data_change_strategy":"static", "recoding":"multi_dimensional_relaxed"}'
In the case of k-anonymity, we also need to define the levels to which each quasi-identifier column may be aggregated. There are three ways to specify them:
1. Embedded hierarchies: directly specified in the view definition.
2. Hierarchy functions: SQLScript functions defined anywhere in the database that compute the applicable values at runtime.
3. External hierarchies: Hierarchies defined in the SAP HANA System.
We chose the first and the third one here.
COLUMN "STARTLOCATION" PARAMETERS '{"is_quasi_identifier": true, "hierarchy":{"view": "ANON_CITYHIERARCHY"}}'
COLUMN "AGGREGATEDTYPE" PARAMETERS '{"is_quasi_identifier":true, "hierarchy":{"embedded": [["Bike"], ["Run"]]}}');
Since there are no meaningful ways to aggregate the activity types (called “AGGREGATEDTYPE” in our data), we simply use the embedded (aka “hard-coded”) way to specify that either the actual values should be shown or the information should be fully suppressed.
The case with the start locations is more complex, as cities belong to regions, which belong to countries, and so on. Also, we have dozens of cities, so we used an existing hierarchy, called ANON_CITYHIERARCHY.
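The role of such a hierarchy can be sketched in Python. Note the assumptions: the hierarchy excerpt and the counts are invented, the function names are hypothetical, and the sketch raises the level for all records at once (a simple full-domain generalization). That is deliberately simpler than the “multi_dimensional_relaxed” recoding configured in the view, which this sketch does not try to reproduce.

```python
from collections import Counter

# Hypothetical excerpt of a city hierarchy: level 0 = city, 1 = region, 2 = country.
CITY_HIERARCHY = {
    "Ely": ["Ely", "England", "United Kingdom"],
    "Cambridge": ["Cambridge", "England", "United Kingdom"],
    "Cardiff": ["Cardiff", "Wales", "United Kingdom"],
}
SUPPRESSED = "*"  # top of the hierarchy: the value is hidden completely


def generalize(city, level):
    """Return the city's value at the given hierarchy level, or '*' above the top."""
    path = CITY_HIERARCHY[city]
    return path[level] if level < len(path) else SUPPRESSED


def anonymize_locations(cities, k):
    """Raise the level for all records until every resulting value occurs >= k times."""
    max_level = max(len(path) for path in CITY_HIERARCHY.values())
    for level in range(max_level + 1):
        result = [generalize(c, level) for c in cities]
        if all(n >= k for n in Counter(result).values()):
            break  # lowest level that satisfies k; stop to keep as much detail as possible
    return level, result


# Invented start locations of 23 activities.
cities = ["Ely"] * 4 + ["Cambridge"] * 8 + ["Cardiff"] * 11
level, result = anonymize_locations(cities, k=10)
print(level, Counter(result))
```

With these numbers, the city level fails because Ely and Cambridge each have fewer than 10 entries, so the sketch settles on the region level. The number of records is unchanged, which is why totals on higher aggregation levels stay the same.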
With those details in place, we are good to go. The final result is shown below:
Do not forget to refresh the anonymization view before selecting it, to trigger the actual computation of anonymization levels.
REFRESH VIEW ANON_RESULT_VIEW ANONYMIZATION;
SELECT * FROM ANON_RESULT_VIEW;
The usage: Consume results back in SAP Data Warehouse Cloud
As SAP Data Warehouse Cloud is an end-to-end data management platform, the Data Modeler can directly see the created anonymization view among their sources in the Data Builder. Hence, we can directly use it, with drag-and-drop, to create the analytical data set that serves as the basis for the Anonymized Story in the Story Builder.
Adding the anonymization view to the Story Builder (depicted as the orange bar chart), you can directly see that on a high aggregation level, in this case the regions of the United Kingdom, the totals are the same for the original and the anonymized data. For example, the kilometers ridden by bike in England are reported as 119,516 km in both versions.
The interesting part comes now: When you drill down the hierarchy of the anonymized view into the cities of England, you realize that sensitive buckets are no longer accessible during data discovery (again, while the totals remain correct).
Take a look at our case from the beginning: While the original view shows both the “Run” and the “Bike” bucket for the City of Ely, the anonymization view shows only the “Bike” bucket where more than 10 activities have been recorded and our privacy requirement is fulfilled. This is calculated live based on the current set of records in the database table.
If you are happy with your anonymized view and want to make it broadly accessible in your company – knowing that it automatically hides sensitive data – you can use the SAP Data Warehouse Cloud Cross Space Sharing functionality to make it available in a Self-Service Space. This is a crucial architectural element of a Self-Service Data Platform.
A Line of Business or IT user in the target Space, in this case, the “Self-Service Ride Insights” Space, can go into the Data Builder and consume the shared anonymization view with confidence. The data consumer can now work with an Analytical Data Set that hides sensitive data, even when drilling down into more levels of detail.
The summary: Mission accomplished.
We started with a data privacy issue that limits the confident consumption of a company’s data assets due to the potential exposure of sensitive data. With the capabilities of SAP Data Warehouse Cloud and its underlying database, SAP HANA Cloud, we enabled self-service data modeling and consumption by stacking an anonymization view on top of the existing data and using cross-space sharing to socialize the data asset across the company. We showed a way to use sensitive data while respecting individuals’ privacy. This is of course not limited to Move2Donate but applies in many areas where personal data is involved. So, next time you work with employee data, customer data, or the like, you know what to do.
Feel free to reach out to us if you or your client faces similar challenges and would like to evaluate the use case with the technology of SAP Data Warehouse Cloud and SAP HANA Cloud.