What is a data lake?

Our zatz-its are wise. Our zatz-its are dispersed. We have questions to ask the zatz-its, so we corral them in a warehouse. Now everyone can see the zatz-its and ask any question that the zatz-it can answer. But buyer beware, our zatz-its are data, try not to stare.

We have more snuvs than we can handle. They flow down the mountains. They trickle through the ground. They stream into our snuv lake. Everyone who wants snuvs may wade in our lake. But be careful what is in store, those snuvs are not snuvs, they are data and more.

Data Lake

Data is diverse. The categories of structured, semi-structured, and unstructured data do not even cover it. Even this list is incomplete.

  • POS
  • CRM
  • Financial
  • Loyalty card
  • Incident ticket
  • Email
  • PDF
  • Spreadsheet
  • Word processing
  • GPS
  • Log
  • Images
  • Social media
  • XML/JSON
  • Click stream
  • Forums
  • Blogs
  • Web content
  • RSS feed
  • Audio
  • Transcripts

Of the types of data listed above, only the bolded items can easily be stored in a data warehouse, without being transformed into something better suited for it.

Data lake, a term originally coined by James Dixon, the founder and CTO of Pentaho, describes a data store that can scale to extremely large sizes in an affordable manner. A data lake is also designed to store raw data in its original format, so it can be used immediately, rather than after weeks of waiting for the IT department to massage it into a format that the data warehouse can accept or use effectively.

The data lake concept always includes the capability to scale to an enormous size. However, you do not need petabytes of data to find a data lake useful. It can be used as cheap storage for long-term archival data. It can be used to transform data before ingesting it into a data warehouse, with the convenience of retaining both the original and transformed versions of the data. It can also be used as the centralized staging location for ingestion into the data warehouse, simplifying the loading processes.
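To make the "retain the original and transformed versions" idea concrete, here is a minimal sketch in Python. The `raw/` and `transformed/` folder names, the JSON-lines input format, and the `transform_keep_original` helper are all assumptions made for illustration; they are not part of any particular data lake product.

```python
import csv
import json
import tempfile
from pathlib import Path


def transform_keep_original(raw_file: Path, lake_root: Path) -> Path:
    """Write a CSV copy under transformed/ and leave the raw file untouched.

    Assumes (hypothetically) that raw files live under <lake_root>/raw/ and
    contain one JSON object per line.
    """
    relative = raw_file.relative_to(lake_root / "raw")
    out = (lake_root / "transformed" / relative).with_suffix(".csv")
    out.parent.mkdir(parents=True, exist_ok=True)

    records = [
        json.loads(line)
        for line in raw_file.read_text().splitlines()
        if line.strip()
    ]
    # Use the union of all keys so records with missing fields still fit.
    fields = sorted({key for record in records for key in record})

    with out.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        writer.writerows(records)
    return out


# Demo in a throwaway directory standing in for the lake.
with tempfile.TemporaryDirectory() as tmp:
    lake = Path(tmp)
    raw = lake / "raw" / "pos" / "sales.jsonl"
    raw.parent.mkdir(parents=True)
    raw.write_text('{"store": "042", "amount": 19.99}\n')

    out = transform_keep_original(raw, lake)
    demo_result = (
        out.relative_to(lake).as_posix(),  # where the transformed copy landed
        raw.exists(),                      # the original is still there
        out.read_text().splitlines(),      # what the warehouse loader would see
    )
```

The transformed copy mirrors the path of the raw file, so either version can be found from the other, and the warehouse load can read from one staging location.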

Data lakes are meant to break down information silos. They sit at the opposite end of the spectrum from data marts and cubes, which are subsets of the company’s data and tend to be specific to a business area.

My favorite use of the data lake is to defer design and integration work. With well-defined folder structures, data can be accepted before reporting requirements are known, before the data warehouse table schema is designed or implemented, and before transformation and loading work has begun. The semi-structured or unstructured data now has a structured storage location, and analytic tools can query that data for immediate use. Exploring the data in order to build those report requirements is also an option. This drop-now, design-later method closely mimics the schema-on-read concept of querying a data lake.
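As a rough illustration of both halves of that idea, the sketch below builds a predictable folder path for an incoming file and then applies a schema only at read time. The zone/source/dataset/date layout, the field names, and the helper functions `landing_path` and `read_with_schema` are hypothetical choices for this example, not a standard.

```python
import json
from datetime import date
from pathlib import PurePosixPath


def landing_path(zone: str, source: str, dataset: str,
                 day: date, filename: str) -> PurePosixPath:
    """Build a predictable folder location for a file (hypothetical layout)."""
    return PurePosixPath(
        zone, source, dataset,
        f"year={day.year:04d}", f"month={day.month:02d}", f"day={day.day:02d}",
        filename,
    )


def read_with_schema(lines, schema):
    """Schema-on-read: pick fields and cast types only when the data is queried.

    `schema` maps a field name to a callable that converts the raw value.
    Fields missing from a record come back as None; extra fields are ignored.
    """
    for line in lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if record.get(field) is not None else None
            for field, cast in schema.items()
        }


# The file is dropped now, with no table design in place.
path = landing_path("raw", "pos", "sales", date(2017, 3, 9), "store-042.json")
print(path)  # raw/pos/sales/year=2017/month=03/day=09/store-042.json

# The schema is decided later, by whoever asks the question.
raw_lines = ['{"store": "042", "amount": "19.99", "note": "gift"}']
rows = list(read_with_schema(raw_lines, {"store": str, "amount": float}))
print(rows)  # [{'store': '042', 'amount': 19.99}]
```

Nothing about the stored file changes when the schema does. A different analyst could read the same lines with a different set of fields and types.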

Clarity of the water

Data lakes are not data warehouse replacements, nor are they data warehouse v2.0. Data lakes address the cost of scaling and, in theory, expose data earlier so that insights can be derived faster. However, they run the risk of becoming data swamps.

Similar or related information from various sources can be exposed differently. For example, sales data from one source may label a data point one way, while another source labels the exact same type of data differently. This may cause us humans, who are analyzing the data, to misinterpret the meaning of the information. Maybe we fail to notice a relationship, or we infer an incorrect one. We may also incorrectly identify the calculation that was used to derive a value.
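A small sketch of the labeling problem, in Python. The source names, the field labels, and the `harmonize` helper are invented for illustration. The point is only that someone has to record which labels mean the same thing.

```python
# Hypothetical alias table: per source, the local label and the shared label.
FIELD_ALIASES = {
    "pos": {"net_amt": "net_sales", "str_id": "store_id"},
    "ecommerce": {"revenue": "net_sales", "shop": "store_id"},
}


def harmonize(source: str, record: dict) -> dict:
    """Rename a record's fields to the shared labels; unknown labels pass through."""
    aliases = FIELD_ALIASES.get(source, {})
    return {aliases.get(key, key): value for key, value in record.items()}


print(harmonize("pos", {"net_amt": 10.0, "str_id": 7}))
# {'net_sales': 10.0, 'store_id': 7}
print(harmonize("ecommerce", {"revenue": 12.5, "shop": 7, "coupon": "SPRING"}))
# {'net_sales': 12.5, 'store_id': 7, 'coupon': 'SPRING'}
```

Renaming alone does not prove that the two sources calculate the value the same way. Whether `revenue` really is net of returns, for instance, is something a person still has to confirm and write down.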

This brings us to the analogy of a data lake vs. a data swamp. A data lake is not necessarily a placid lake without bumps or choppy waters. In fact, a data lake is full of character and diversity. A data lake’s waters, however, are clear and the users can see what exists. The definition of data within the lake also must continue to improve as experts discover relationships and pull new pieces of data into the existing landscape.

A data swamp has dirty water which is difficult to see through. It may lack a data catalog or fail to continue improving upon it. Attributes are misunderstood and the use of the data swamp is overwhelming to the users.

Different users, different tools

This brings us back to my earlier point. Data lakes are not a replacement for data warehouses. Data warehouses are like water towers of pre-filtered water. Those towers supply clean water to factories that produce bottles of water, also known as data marts and cubes, some flavored to suit specific business areas. Even the cleanest data lake is still harder to use than a data warehouse, and it is harder to deliver a structured message to amateur users with a single, consistent voice. That is why we have data warehouses. Data lakes, at this point in their maturity, are best suited for data scientists and advanced data analysts.

It is important to guide diverse types of users, depending upon their needs and skill, to the data lake, data warehouse, and/or data marts / cubes.
