Security for Data Analytics - Gaining a Grip on the Two-Edged Sword
In the information age, data is power, and the ability to mine it can boost business profits, save patients’ lives and aid government security.
But how much information is too much? To protect privacy or proprietary interests, data owners impose limits on access to the data.
Of course, restrictions can leave a veil in place, with insights buried in the data. It’s a classic two-edged sword, and a flexible but secure solution that assures both privacy and deep data access has been difficult to find.
The 1996 HIPAA patient protection provisions limit access to and analysis of patient data without consent in the U.S. This year, the European Union’s new General Data Protection Regulation, or GDPR, will define similar kinds of boundaries throughout the European Union.
The two sets of rules differ. Mismatches, overlaps and conflicts in data security restrictions also commonly crop up between collaborating businesses, or between the government and the entities it deals with.
In the case of medical records, security protocols to assure compliance can essentially lock down analysis, says Noah Johnson, a graduate student in electrical engineering and computer science (EECS) with both a research background and front-line experience in designing practical security tools.
“If an organization can’t distinguish between what is allowable and what is overstepping security boundaries, the only safe solution is to block all access to the data.”
In addition, a typical organization uses many different database engines tuned to specific requirements, he says — for example, one type for real-time queries and another for queries on larger datasets. With current approaches, a full data security solution requires a hodgepodge of different security tools.
Supported by the Signatures Innovation Fellows program, Johnson and Dawn Song, a professor in EECS and a leader in the cybersecurity field, are taking an entirely new approach that lets organizations follow tight data security and privacy policies while still giving analysts flexible data analysis and machine learning.
Along with colleagues, they developed a prototype of the technology and tested it in the real world. Working with Uber, they ran their system on a dataset of eight million queries written by the company’s data analysts. The system is currently being integrated into Uber’s internal data analytics platform.
With help from the Signatures Innovation Fellows program, they are advancing the system to provide the same level of security and flexibility for a broad range of data analysis and machine learning, whether needed in basic and medical research or business analytics.
Typically, data security systems comply with security safeguards by restricting access to the data or “sanitizing” it so that no single user in the dataset can be identified — a strategy that is actually vulnerable to attack, Johnson says.
The technology that he and Song developed looks upstream — analyzing the data queries, not the data itself.
For starters, the data owner’s specific security requirements are encoded into the system — whether they are specific limits imposed by a company or a broader range of privacy rules, such as HIPAA regulations.
“Data owners can specify what types of access are permitted for each dataset,” Song says.
The system’s analysis engine can determine the security and privacy impact of a program before it is executed. Crucially, a rewriting engine can automatically modify the query to ensure compliance with specified security policies — sparing the data analyst from navigating what might otherwise be a maze of security constraints imposed on the data.
“We sit between the data and the analysts,” Song says. “Our system can reason about — and modify — the query before it runs on the data. It can ‘understand’ queries written for any standard database, so it can enforce policies uniformly across different databases in an organization.”
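To make the idea concrete, here is a minimal sketch of checking and rewriting a query before it ever touches the data. It is an illustration only, not the researchers' system: the policy format, the `Query` class and the function names `analyze`, `rewrite` and `to_sql` are all hypothetical, and the only rule modeled is a per-dataset list of columns that analysts may not read.

```python
from dataclasses import dataclass, replace

# Hypothetical policy set by the data owner: for each dataset,
# the columns that analysts are not allowed to read.
POLICY = {
    "patients": {"blocked_columns": {"name", "ssn"}},
    "trips": {"blocked_columns": {"rider_id"}},
}


@dataclass(frozen=True)
class Query:
    """A deliberately tiny stand-in for a parsed SELECT query."""
    table: str
    columns: tuple


def analyze(query, policy):
    """Return the requested columns the policy forbids, without running the query."""
    blocked = policy.get(query.table, {}).get("blocked_columns", set())
    return [col for col in query.columns if col in blocked]


def rewrite(query, policy):
    """Return a compliant copy of the query with forbidden columns removed."""
    violations = analyze(query, policy)
    allowed = tuple(col for col in query.columns if col not in violations)
    if not allowed:
        # Nothing the analyst asked for is permitted, so refuse outright.
        raise PermissionError(f"no permitted columns requested from {query.table}")
    return replace(query, columns=allowed)


def to_sql(query):
    """Render the simplified query as SQL text for whichever database runs it."""
    return f"SELECT {', '.join(query.columns)} FROM {query.table}"


if __name__ == "__main__":
    requested = Query("patients", ("name", "age", "diagnosis"))
    print(analyze(requested, POLICY))          # columns that would violate the policy
    print(to_sql(rewrite(requested, POLICY)))  # the query that is actually executed
```

In this sketch the analyst writes an ordinary query and the layer in between decides what reaches the database. A real system would need to parse full SQL and handle far richer policies than a list of blocked columns.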
Since the system enforces the privacy guarantees, the data owner no longer has to trust each analyst to figure out what is fair game and what is not.
At the same time, the Signatures Innovation Fellows program supports their plans to work with companies across different sectors, including finance, health care and technology, to determine the common set of requirements needed for the secure data analysis and machine learning system to be widely adopted.
The Signatures Innovation Fellows Program supports innovative research by UC Berkeley faculty and researchers in the data science and software areas with a special focus on projects that hold commercial promise. The application deadline for the 2018/19 cohort is Sunday, February 25, 2018.
For more information, see http://vcresearch.berkeley.edu/signatures/about.