The AMP Lab Stands Up to Big Data
The iPhone is fully charged, but the car battery isn’t. Siri to the rescue. In a flash, she pinpoints the nearest towing service. Siri taps into a nearly limitless world of data by asking thousands of clustered computers to come up with a quick, reliable answer.
Behind the curtain, thousands of servers perform different functions to get Siri’s job done. A software system developed at UC Berkeley called Mesos coordinates their action, assuring that each server has the resources it needs at the right moment.
“Mesos helps prevent the machines from stepping on each other,” says Mike Franklin, chair of UC Berkeley’s computer science division. “It provides near-optimum access to needed resources in a cluster of computers.”
A sample of Mesos users attests to its powers: Amazon, Twitter, Airbnb, Alibaba, NASA-JPL, Intel, Google.
Mesos is one of several blockbuster, open-source systems developed in the AMPLab at UC Berkeley, a four-year-old team effort in the computer science division. The highly collaborative enterprise tackles some of the biggest problems in the world of Big Data.
AMP stands for Algorithms, Machines and People. “Algorithms” means machine learning and statistical methods; “Machines” are large computer clusters and cloud computing infrastructure; “People” refers to crowdsourcing and human computation. AMPLab’s faculty founders focus on integrating this trinity in novel ways to increase the power and sophistication of data analysis.
“We’re not going to address the biggest challenges in data science simply by making machines do more of what they’ve been doing,” Franklin says. “It’s important to develop a system that can incorporate these very different resources — algorithms, machines and crowds — and flexibly blend them to take on different problems. We believe that will be the breakthrough.”
Franklin and several other computer science professors launched the AMPLab in 2011. The lab now engages a dozen faculty and more than 50 graduate students, postdocs and other researchers. Faculty and student lab members sit in an open space environment that looks something like a start-up. And in a way it is.
Instead of venture capital, the lab was launched with funding from major players in the information industry, including Amazon Web Services, Google, IBM and SAP. Currently, 30 companies actively sponsor the lab. AMPLab’s solutions are open source, made available free to users. “If we patented platforms we’ve developed, I don’t think we would have the outsized impact we’ve had,” Franklin says. “To make the quantum leaps that AMPLab members hope for, our technology must get wide use.”
The AMPLab’s most influential invention, by far, is a software system called Spark, now used by hundreds of companies, government agencies and researchers to process and analyze exponentially increasing volumes of data.
Spark supports scalability, allowing data analysis to be spread over hundreds, even thousands of computers running in parallel in the cloud. This allows systems to process and analyze increasing volumes of data without losing speed, or alternatively, to manage a constant amount of data 10 to 100 times faster than previous big data platforms.
Spark provides data analysis for Autodesk, Novartis, and Microsoft, among many others. At a recent “Spark Summit” in New York, attended by more than 1,000 data scientists, developers, and Big Data professionals, a managing director at Goldman Sachs called Spark “the lingua franca of big data analysis.”
“What’s exciting is that this started in the lab with a grad student here, and now it has gone around the world. It’s almost unprecedented for a computer science research project to have such direct and immediate impact,” Franklin says.
In 2012, the White House announced its Big Data research initiative to support university research aimed at developing new methods to derive knowledge from data and “new infrastructure to manage, curate and serve data communities.”
The mission was right up the AMPLab’s alley, and the lab received a highly competitive National Science Foundation Expeditions in Computing Award of $10 million to continue its research.
The NSF was drawn to specific data-intensive applications that the AMPLab was targeting, including cancer genomics, traffic monitoring and prediction, and opinion gathering through crowdsourcing. The projects reflect big data’s presence in almost all areas of study and application now, from biology and economics to business and government.
Independence Blue Cross, the largest Philadelphia health insurer, has turned to Spark to get new claims and benefit applications up and running faster. Closer to home, AMPLab researchers have shown that genomic data analysis can be carried out nearly 30 times faster using Spark and other parts of the UC Berkeley Analytics Stack, which integrates analysis programs to tackle big data.
“We had high expectations for AMP,” Franklin says, “and it has exceeded our expectations. It’s great to solve big problems in Big Data, and it’s been a lot of fun.”