The Data Warehouse
The original approach to data management is known as the “Data Warehouse” model. Imagine a warehouse for a moment: neatly spaced rows of pallets, organized by type and size. Ask for a particular pallet, and the manager can tell you which aisle it is in, how far down you have to go, and what is inside. Trucks bring new boxes into the warehouse and deliver from the warehouse to customers. The warehouse holds pallets. It does not house livestock, serve as a garage for cars, or provide a home for people. This is how the “Data Warehouse” model works. In this analogy, the information is the pallets, the database is the warehouse manager, and the users are its customers. It is designed to store very specific data; unrelated information is discarded or never allowed inside the warehouse at all.
Now imagine a different warehouse. Among the pallets, sheep wander the aisles, cars are parked wherever they fit, and makeshift homes have been set up. If you ask for a specific pallet, the warehouse manager can point you to where it should be. Ask for a specific animal, and you will be told where it was last seen (a month ago). What about where a certain car is parked? You might be told it is usually parked in aisle 4, but that might not be right. The delivery of pallets is slowed down by all the obstructions from cars and sheep. This warehouse is a mess, and the mess came from asking it to do what it was never meant to do. Too many different types of things are stored within this warehouse, and a warehouse is not the correct way to store some of them. This is what can happen when a “Data Warehouse” is misused. Random tables are strewn about, it is a pain to use, and it is hard to find what you need.
The two most important features for a data warehouse are that it must be:
The Data Lake
In a Data Lake, these features are turned on their head. There is only a high-level structure, used to search for whatever data is useful right now. The data that comes back could be of any type. Every source that can feed in data is used. What will be important in the future is unknown, so the lake avoids the risk of missing out on data that could matter later on.
Imagine now a large lake. Fish swim inside the lake, boats float atop it, houses dot the shoreline. Instead of a warehouse manager, you have some old salt who knows everything there is to know about this lake. Need to find a certain house? They know the way. Where is the best spot to locate a certain type of fish? They know that too.
The key difference between a Data Lake and a Data Warehouse comes down to what they are used for. The Data Warehouse method of data storage is used when the user knows everything that goes in and everything that goes out. It is designed to handle files and data quickly and efficiently. The Data Lake method is used for discovery. Analysts can use it to glean new insights into your organization. Reports can analyze things going back years, even though nobody knew in advance that the data would be used later on. And of course, Machine Learning and Artificial Intelligence benefit from having access to the colossal sum of knowledge recorded within.
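To make the contrast concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular product works: the Warehouse and Lake classes, the ORDER_SCHEMA fields, and the sample records are all hypothetical names made up for this example. The warehouse accepts only records that match its fixed schema, while the lake keeps everything and tags it with nothing more than its source.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical fixed schema for the warehouse: field name -> required type.
ORDER_SCHEMA = {"order_id": int, "customer": str, "total": float}


@dataclass
class Warehouse:
    """Stores only records that match the schema exactly; rejects the rest."""
    schema: dict
    rows: list = field(default_factory=list)

    def load(self, record: Any) -> bool:
        fits = (
            isinstance(record, dict)
            and record.keys() == self.schema.keys()
            and all(isinstance(record[name], kind)
                    for name, kind in self.schema.items())
        )
        if fits:
            self.rows.append(record)
        return fits


@dataclass
class Lake:
    """Keeps every record as-is, tagged only with the source it came from."""
    items: list = field(default_factory=list)

    def load(self, source: str, record: Any) -> None:
        self.items.append((source, record))

    def search(self, source: str) -> list:
        # The only structure is the high-level tag; results can be any type.
        return [record for tagged, record in self.items if tagged == source]


# Made-up sample data: one clean order, one email, one malformed order.
incoming = [
    ("orders", {"order_id": 1, "customer": "Acme", "total": 99.5}),
    ("email", "Hi, can you send a quote for 40 units?"),
    ("orders", {"order_id": 2, "customer": "Birch", "note": "call back"}),
]

warehouse = Warehouse(ORDER_SCHEMA)
lake = Lake()
for source, record in incoming:
    warehouse.load(record)     # anything that does not fit is turned away
    lake.load(source, record)  # everything is kept for later discovery

print(len(warehouse.rows), "rows in the warehouse")
print(len(lake.items), "items in the lake")
```

In this sketch the warehouse ends up holding only the one well-formed order, which keeps it fast and predictable. The lake holds all three items, including the email and the malformed order, so an analyst can still go looking through them later.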
The importance of proper data storage to an organization cannot be overstated. Consider the valuable things that can be done if the information is there. Internal company messages could be analyzed to figure out which departments are at risk of losing valuable employees. Emails and calls from your salespeople could be used to help determine which leads are most likely to buy your product. Receipt data could be used to root out fraud. None of this can happen if the data doesn’t exist. With a Data Lake, everything is recorded so that it can be used in the future. Although any given data point might not be useful on its own, the more data you have, the more value it can give back.
If your organization is facing difficulties with unstructured data or with putting your data to its best use, contact our data experts, who specialize in developing solutions that keep your organization compliant and competitive.