[OCT 2014] Beyond Data Lakes: The DataHub

Organizations are scrambling to capitalize on Big Data for business analytics. By necessity, they’ve developed some creative new data management ideas. The data lake is one of these.

Cheap commodity servers, clustering, cloud, Hadoop and distributed data storage methods have laid the foundation for data lakes. In creating a data lake, you put all data – structured, semi-structured, unstructured – into a central pool. Data stays in its source formats. All users have access to the data in the pool, and each figures out how to use it for specific analytic needs.

The idea is that you can save all the data for future use, without most of the management overhead typically associated with storing data in relational databases and other traditional systems.

The Challenge with Data Lakes

Data lakes have attracted fans and foes.

The ultimate goal of data lakes is just-in-time, ubiquitous self-serve analytics. But just pooling data doesn’t solve the hard problems in achieving this goal. There are several problems and related technology gaps:

  • How do you find the datasets you need and quickly understand what they contain using visualization and sampling?

  • How do you get datasets into a format that allows people to operate on them?  How do you combine and integrate related datasets together?

  • How do you maintain the relationship between datasets over time and track their evolution as users enhance, merge, extend and correct them?  Some kind of version control system is needed.

  • How do you share datasets with others inside and outside of your organization and maintain integrity as users continually add or delete records?

These problems affect business users, researchers, and scientists alike.

DataHub: A GitHub for Big Data Management

Today, data scientists want to collect, analyze and collaborate on datasets to get just-in-time insights, distill knowledge and make discoveries. Collaboration often happens ad hoc and iteratively, with lots of trial and error as users experiment to find the best data-access and visualization tools for the task at hand.

Inspired by software version control tools like Git and GitHub, a research team spanning MIT CSAIL, the University of Maryland, and the University of Illinois at Urbana-Champaign has proposed a solution: DataHub. The solution delivers the advantages of data lakes (data and analytics liberation) while overcoming their challenges (poor user interface, lack of governance, potentially “bad”/out-of-sync data).

The solution consists of two tightly integrated systems: (1) a Dataset Version Control System (DVCS), which gives users the ability to create, branch, merge, difference and search large, divergent collections of datasets, and (2) a DataHub platform, built on that version control, which gives users the ability to perform collaborative data analysis.

The Dataset Version Control System provides multi-version dataset management. Its goal is to provide a common substrate that enables data scientists to capture their modifications, minimize storage costs, use a declarative language to reason about versions, identify differences between versions, and share datasets with other scientists.(1)
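To make the version-control operations above concrete, here is a minimal Python sketch of committing, branching, differencing and merging dataset versions. It is a toy illustration only, not DataHub's implementation or API: the names ToyDatasetRepo, Version, commit, branch, diff and merge are hypothetical, every version is stored as a full snapshot (a real system would likely store deltas to reduce storage costs), and the merge is a naive union of records that cannot propagate deletions.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Iterable, Tuple

Record = Tuple[str, ...]  # one row of a dataset (hypothetical representation)


@dataclass(frozen=True)
class Version:
    """An immutable snapshot of a dataset plus pointers to its parent versions."""
    vid: int
    parents: Tuple[int, ...]
    records: FrozenSet[Record]


@dataclass
class ToyDatasetRepo:
    """Hypothetical in-memory version graph; not the DataHub API."""
    versions: Dict[int, Version] = field(default_factory=dict)
    branches: Dict[str, int] = field(default_factory=dict)

    def commit(self, branch: str, records: Iterable[Record],
               extra_parents: Tuple[int, ...] = ()) -> int:
        """Store a new snapshot and advance the branch head to it."""
        head = self.branches.get(branch)
        parents = ((head,) if head is not None else ()) + tuple(extra_parents)
        vid = len(self.versions) + 1
        self.versions[vid] = Version(vid, parents, frozenset(records))
        self.branches[branch] = vid
        return vid

    def branch(self, new_branch: str, from_branch: str) -> None:
        """Start a new branch at the current head of an existing one."""
        self.branches[new_branch] = self.branches[from_branch]

    def diff(self, old_vid: int, new_vid: int):
        """Return (added, removed) records between two versions."""
        old = self.versions[old_vid].records
        new = self.versions[new_vid].records
        return new - old, old - new

    def merge(self, target: str, source: str) -> int:
        """Naive merge: union of the records at both branch heads."""
        source_head = self.branches[source]
        merged = (self.versions[self.branches[target]].records
                  | self.versions[source_head].records)
        return self.commit(target, merged, extra_parents=(source_head,))


repo = ToyDatasetRepo()
v1 = repo.commit("main", [("alice", "30"), ("bob", "25")])
repo.branch("cleanup", "main")
v2 = repo.commit("cleanup", [("alice", "30"), ("bob", "26")])
added, removed = repo.diff(v1, v2)  # one corrected record in, one stale record out
v3 = repo.merge("main", "cleanup")  # v3 records both v1 and v2 as parents
```

Because each version keeps its parent pointers, the history forms a directed graph that can be walked to see how a dataset evolved as users corrected and combined it.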

The DataHub platform is a hosted platform for organizing, managing, sharing, collaborating, and making sense of data. It provides an efficient platform and easy-to-use tools/interfaces for:

  • Publishing your own data (hosting, sharing, collaboration)

  • Using others’ data (querying, linking)

  • Making sense of data (analysis, visualization)

The platform provides access to the data via a scalable, parallel, SQL-based analytic data processing engine optimized for extremely low-latency operation on large datasets. This engine is designed to overcome the limitations that earlier approaches to handling Big Data, such as indexing, main-memory databases, column-oriented DBMSs, and MapReduce, run into.

The DataHub system is currently being tested in situ on a wide “lake” of diverse datasets, including in the MIT Big Data Living Lab, a data-sharing platform for MIT researchers, where it supports a dynamic set of analytics.

Jump Over the Lake, Not Into It

DataHub overcomes the problems of data lakes, while liberating data so that data scientists and researchers can engage productively and effectively in large-scale collaborative data analytics.

To read more about the DataHub system, download the paper, “DataHub: Collaborative Data Science & Dataset Version Management at Scale,” and the VLDB 2013 keynote presentation by MIT CSAIL Professor Sam Madden, “The DataHub: A Collaborative Data Analytics and Visualization Platform.” A paper on DataHub will appear at the Conference on Innovative Data Systems Research (CIDR) in January 2015.

(1) “DataHub: Collaborative Data Science & Dataset Version Management at Scale,” arXiv:1409.0798 [cs.DB]