Big Thinking About Big Data
To Michael Jordan, the smart way to extract and analyze the key information embedded in mountains of “Big Data” is to ignore most of it. Instead, zero in on small collections of relevant data.
“You may be swamped with data,” he says, “but you’re not swamped with relevant data. Even in a huge database, there is often only a small amount of it that is relevant.”
A modest example: “To choose a restaurant in a city I’m visiting, I don’t need to know about all the restaurants that are out there and everyone’s experiences with them. I only need to know about the restaurants that I’m likely to be interested in and reviews by people who share my tastes. The beauty of ‘Big Data’ is that the chance of finding a relevant subset of data for any particular person is high. The challenge is finding that relevant subset.”
To this end, Jordan has worked on ideas that span computer science and statistics. The two fields have historically proceeded along separate paths, but he sees them as natural allies. “Big Data,” he says, “is the phenomenon that has drawn them together.”
Classical statistics has paid little attention to how long a data analysis will take, yet with very large data sets, useful statistical inferences may take too long to obtain, he says. Classical computer science, on the other hand, has paid close attention to how long algorithms take to run, but has shown little interest in the special kinds of algorithms needed in data analysis.
“Bringing the two perspectives together is a win-win,” he says.
An example of Jordan’s blend of computer science and statistics is his work with colleagues on the “Bag of Little Bootstraps.” Here the goal is not merely to answer a query to a large-scale database, but to place an “error bar” around the answer — an estimate of the accuracy of that answer. And it’s crucial to compute the error bar quickly.
“The only way to know if the output from a data analysis is solid enough to provide a basis for a decision is to accompany that output with an error bar,” Jordan says. If medical tests indicate that an underlying condition warrants surgery, a cautious, or statistically minded, patient would want to know the error bar on that analysis. The size of the error bar would help determine if it’s time for a second opinion and other tests.
The same logic applies in commerce. Vendors need to assess the likelihood that a purchase was made on a stolen credit card. Error bars provide control over the number of “false positives” — the annoying (to the customer!) situation in which a legitimate purchase is flagged as fraudulent. Using all of the data from all customers and all transactions can in principle bring down the number of false positives, but the computational cost of such an analysis may be prohibitive, leading to overly slow decisions.
“The Bag of Little Bootstraps demonstrates a core idea in computer science — that of ‘divide and conquer,’” Jordan says. “We divide a large data set into small subsets, compute error bars on each subset, and combine the error bars to obtain an overall error bar.”
Statistical thinking is needed at each step to ensure that the overall error bar is actually correct. “We’ve found that in practice this approach can be hundreds of times faster than classical approaches to computing error bars.”
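The sketch below illustrates that divide-and-conquer idea in Python. It is not the AMP Lab’s implementation; the estimator (a weighted mean), the subset size exponent, and the resampling counts are illustrative assumptions chosen only to show the shape of the procedure.

```python
import numpy as np

def bag_of_little_bootstraps(data, estimator, num_subsets=10, gamma=0.6,
                             num_resamples=100, seed=0):
    """Estimate an error bar (standard error) for `estimator` on `data`
    using a Bag-of-Little-Bootstraps-style divide-and-conquer scheme."""
    rng = np.random.default_rng(seed)
    n = len(data)
    b = int(n ** gamma)          # each small subset has b << n points

    subset_errors = []
    for _ in range(num_subsets):
        # Divide: draw a small subset of the data without replacement.
        subset = rng.choice(data, size=b, replace=False)

        # Conquer: mimic full-size (size-n) bootstrap resamples of the
        # subset by drawing multinomial counts, so each of the b points
        # gets a weight instead of being copied n/b times.
        estimates = []
        for _ in range(num_resamples):
            weights = rng.multinomial(n, np.full(b, 1.0 / b))
            estimates.append(estimator(subset, weights))

        # Error bar for this subset: spread of the resampled estimates.
        subset_errors.append(np.std(estimates))

    # Combine: average the per-subset error bars into an overall error bar.
    return np.mean(subset_errors)

def weighted_mean(values, weights):
    return np.average(values, weights=weights)

if __name__ == "__main__":
    data = np.random.default_rng(1).normal(loc=5.0, scale=2.0, size=100_000)
    err = bag_of_little_bootstraps(data, weighted_mean)
    print(f"mean = {data.mean():.3f} +/- {err:.3f}")
```

The speedup in this kind of scheme comes from the fact that each resample touches only the small subset: the full data set enters only through the weights, so memory and computation scale with the subset size rather than with the size of the whole database.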
For the past few years, Jordan has been developing this and other strategies in UC Berkeley’s Algorithms, Machines and People (AMP) Lab. He’s a founder and co-director of the three-year-old lab, a small, highly collaborative team of a half dozen computer and statistical scientists and their graduate students.
The team has selected and tackled a few big challenges that had blocked access to efficient, accurate analysis of data in the real world. AMP Lab software platforms have already been adopted by hundreds of companies as well as researchers — from neuroscientists to Netflix.
“I’m a real fan of our model,” he says.
A New Definition of Literacy
The growing avalanche of data in almost all fields has convinced many educators that a truly literate student should graduate with at least a fundamental understanding of the modern tools for collecting and analyzing big data.
Responding to a swell of student interest, Jordan has spearheaded the design of a new freshman-level course entitled “Computational and Inferential Thinking: Foundations of Data Science.” A prototype version of the course is being offered this semester, Fall 2015. The plan is to start small but to ramp up quickly so that the course can soon be offered each year to a significant proportion of the incoming freshman class.
“Many students will eventually be working in a field that involves data analysis, and all students will be living in a world in which their own personal decision-making will involve data analysis,” Jordan says.
“We think that data literacy should be an essential component of the Berkeley educational experience.”
________________________________
Michael Jordan is the Pehong Chen Distinguished Professor in Electrical Engineering and Computer Science as well as Statistics.