There is a basic tension between securing data and keeping it readily accessible, and this tension has to be addressed whenever big data is shared, both within an organisation and outside it, says Ted Dunning.
Often the key problem is not necessarily the prevention of data leaks. Instead, the challenge lies in sharing the data, or some characteristic of the data, without sharing more than we intended. Anyone can ensure data is hidden, but you need to balance security with safe access. Organisations commonly fail to get the most out of the big data they have by being either too tight or too loose with access. What's the use of data if you can't mine it to make better and more accurate business decisions?
Opening up data
We're increasingly seeing data being opened up and shared with the general public in the hope that it will yield insights and deliver benefits to wider society. A good example of this in the UK is the data.gov.uk website, where the government has released public data to help people understand how government works and how policies are made. Publicly available data ranges from the large to the small, from legislative documents down to databases of all the trees in a city.
However, this information often has to be anonymised to protect the identity and sensitive data of individuals referenced in publicly accessible data. This is challenging, as it is very difficult to open data up while at the same time anonymising it effectively for security reasons. In some cases and for some purposes, however, we can be very open, with essentially no chance of the data being de-anonymised.
Using the right tools
One way of addressing the challenge of sharing data securely is to ensure that each user gains access only to a partial view of the data. One tool that provides this is Apache Drill, an open source system that allows interactive analysis of large-scale datasets and makes it convenient to create views of data that limit what each user can access.
Big data settings commonly call for set views to manage secure access: for example, allowing employees in a retail business to see storage information but not necessarily the revenue related to certain employees. With a reliable tool that lets you easily control access, you can specify the particular subset of data for each person working on a project, protecting the security of the larger dataset.
Granular control such as that provided by Drill works well in situations where you also limit access to people you trust enough not to try to de-anonymise the data. That is definitely not the case in many situations.
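To make the idea of a partial view concrete, here is a minimal sketch in Python using the standard library's sqlite3 module rather than Drill itself. The table, column names and figures are hypothetical, invented purely for illustration. SQLite has no per-user permissions, so the sketch only shows how a view narrows what is visible; in a real deployment, the system's own access controls would have to grant users the view and withhold the underlying table.

```python
import sqlite3

# Hypothetical retail table; all names and figures are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE store_records ("
    "employee TEXT, item TEXT, units_in_storage INTEGER, revenue REAL)"
)
conn.executemany(
    "INSERT INTO store_records VALUES (?, ?, ?, ?)",
    [
        ("alice", "kettle", 40, 1250.0),
        ("bob", "toaster", 15, 830.5),
    ],
)

# The view exposes storage information only; the employee and revenue
# columns are deliberately left out.
conn.execute(
    "CREATE VIEW storage_view AS "
    "SELECT item, units_in_storage FROM store_records"
)


def storage_rows(connection):
    """Return the column names and rows visible through the restricted view."""
    cursor = connection.execute(
        "SELECT item, units_in_storage FROM storage_view ORDER BY item"
    )
    columns = [description[0] for description in cursor.description]
    return columns, cursor.fetchall()


columns, rows = storage_rows(conn)
print(columns)
print(rows)
```

Anyone querying only through storage_view can see stock levels but never the revenue attributed to an employee, which is the kind of separation described above.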
The value of fake data for security
Ironically, another way to improve access to big data is to use synthetic or fake data. This very much sounds like a version of a famous contradiction: the only way to share the data is to destroy it. The fact is, however, that recently developed tools allow you to synthesise data that mimics important characteristics of your actual data without revealing any information that might be used in a de-anonymisation attack. By using fake data, you can freely share data with outsiders to work on so that they can find bugs, build analytical systems, or find insights that will generalise to your actual data. This allows you to gain useful insights from external consultants without compromising the overall security of your system. Synthetic data like this can be a powerful tool for dealing with analytics safely when the data of interest lives behind a security perimeter.
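As a rough illustration of the principle, and not of any particular tool, the sketch below fits a few aggregate statistics to a handful of hypothetical records and then samples entirely new records from those statistics. The field names, values and the choice of a normal distribution are all assumptions made for the example. A generator this naive preserves only simple per-column characteristics and carries no formal privacy guarantee; purpose-built tools do considerably more.

```python
import random
import statistics
from collections import Counter


def fit_profile(records):
    """Summarise the real records as aggregate statistics only."""
    amounts = [record["amount"] for record in records]
    region_counts = Counter(record["region"] for record in records)
    regions = sorted(region_counts)
    return {
        "amount_mean": statistics.mean(amounts),
        "amount_stdev": statistics.stdev(amounts),
        "regions": regions,
        "region_weights": [region_counts[name] for name in regions],
    }


def synthesise(profile, n, seed=0):
    """Draw n fake records from the profile; no real record is copied."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Assumption: amounts are roughly normal; negative draws are clipped to 0.
        amount = max(0.0, rng.gauss(profile["amount_mean"], profile["amount_stdev"]))
        region = rng.choices(profile["regions"], weights=profile["region_weights"])[0]
        rows.append({"amount": round(amount, 2), "region": region})
    return rows


# Hypothetical "real" data, including an identifying field that is never exported.
real_records = [
    {"customer": "C001", "amount": 120.0, "region": "north"},
    {"customer": "C002", "amount": 95.0, "region": "south"},
    {"customer": "C003", "amount": 150.0, "region": "north"},
    {"customer": "C004", "amount": 110.0, "region": "east"},
    {"customer": "C005", "amount": 130.0, "region": "north"},
]

profile = fit_profile(real_records)
fake_records = synthesise(profile, n=1000)
print(fake_records[:3])
```

Only the profile, not the underlying records, is needed to generate the fake data, so an outside consultant could build and test an analytical pipeline against fake_records without ever seeing a customer identifier.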
In conclusion, data is a tremendously powerful business asset. However, it often has to be shared and distributed to get the most out of it. Doing so securely is a huge challenge. It's a fine balance to strike, but by using the right tools, you can get the benefits of sharing without leaking data.
Contributed by Ted Dunning, chief application architect, MapR