Data Lake

A data lake is a repository, typically a large one, for storing data of many types. Data lakes are more flexible (less structured) than their predecessors, data warehouses. At their crudest, they are little more than raw storage with an organizational structure and perhaps a catalog. At their most sophisticated, they can become an entire data management infrastructure.
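To make the "crudest" end of that spectrum concrete, here is a minimal sketch in Python of raw storage with an organizational structure plus a catalog. Everything in it is an assumption made for illustration: the `raw/<source>/<date>/<name>` layout, the `catalog.json` file, and the `ingest` and `find` helpers are hypothetical and do not come from any particular data lake product.

```python
import json
import tempfile
from datetime import date
from pathlib import Path

CATALOG_NAME = "catalog.json"  # hypothetical catalog file at the lake root


def ingest(lake_root: Path, source: str, name: str, payload: bytes, day: date) -> Path:
    """Store raw bytes under raw/<source>/<YYYY-MM-DD>/<name> and record them in the catalog."""
    target = lake_root / "raw" / source / day.isoformat() / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)  # the data is stored as-is, with no schema imposed

    # The catalog is just a JSON list of entries describing what was stored where.
    catalog_path = lake_root / CATALOG_NAME
    catalog = json.loads(catalog_path.read_text()) if catalog_path.exists() else []
    catalog.append({
        "path": target.relative_to(lake_root).as_posix(),
        "source": source,
        "ingested": day.isoformat(),
        "bytes": len(payload),
    })
    catalog_path.write_text(json.dumps(catalog, indent=2))
    return target


def find(lake_root: Path, source: str) -> list[str]:
    """Return the catalogued paths for one source system."""
    catalog_path = lake_root / CATALOG_NAME
    if not catalog_path.exists():
        return []
    entries = json.loads(catalog_path.read_text())
    return [entry["path"] for entry in entries if entry["source"] == source]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        ingest(root, "crm", "contacts.csv", b"id,name\n1,Ada\n", date(2024, 1, 15))
        ingest(root, "weblogs", "access.log", b"GET /index.html 200\n", date(2024, 1, 15))
        print(find(root, "crm"))  # ['raw/crm/2024-01-15/contacts.csv']
```

Note that nothing in this sketch stops a file from being written into the lake without a catalog entry. In a setup like this, keeping data findable depends entirely on the conventions being followed.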

The flexibility of the data lake concept is both its advantage and its limitation: almost any data architecture that involves collecting organizational data together could be described as a data lake.

At a practical level, this flexibility can turn data lakes into data swamps: the lack of structure often limits the usability of the lake, because data cannot be found or is not of adequate quality. As ThoughtWorks notes: "Many enterprises failed to generate a return on their investment because they had quality issues with the data in their lakes or had invested significant sums in creating their lakes before identifying use cases."^1

Schematic overview of a Data Lake Architecture