By Justin Hofer
The titans of technology believe they have seen the future, and they have changed the course of their companies to follow it. Google and Facebook, for example, have made sweeping changes to their platforms in the name of gathering and better employing data. Algorithms backed by gargantuan stockpiles of data can act with an intelligence that matches or surpasses our own. Seemingly inconsequential information today can enable amazing advances tomorrow. These corporations amass as much data as they can so they can reap the benefits of that knowledge and use it to further improve and evolve their work. However, with great data comes great responsibility: the proper storage and management of that data. In today’s article, we explore the best methods for managing large and complex data sets.
The Evolution of Data Storage
There has been a steady evolution of the technologies used in data storage:
Computer File: The first iteration was the humble computer file. The file itself imposed no structure; any meaning had to be worked out by whatever program (or person) read it.
Schema Databases: Then various structured databases came about, using extensive rules to catalog, store, and read out data on demand. Every record conformed to a predetermined set of attributes, which allowed records to be easily queried, sorted, and updated as needed.
Schemaless Databases: New advances then allowed for databases that did not require defining up front what would be recorded. These “unstructured” databases let new information be added to future records without having to migrate the old ones.
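The contrast can be sketched in a few lines of Python. This is only an illustration: the `customers` table and its fields are hypothetical, with the standard-library `sqlite3` module standing in for a schema database and plain JSON documents standing in for a schemaless one.

```python
import sqlite3
import json

# Schema database: the columns are fixed up front; anything else is rejected.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute("INSERT INTO customers (id, name, email) VALUES (1, 'Ada', 'ada@example.com')")

try:
    # A column that was never defined cannot be written.
    conn.execute("INSERT INTO customers (id, name, phone) VALUES (2, 'Bob', '555-0100')")
except sqlite3.OperationalError as err:
    print("rejected:", err)

# Schemaless store: each record is free to carry its own fields.
records = [
    {"id": 1, "name": "Ada", "email": "ada@example.com"},
    {"id": 2, "name": "Bob", "phone": "555-0100", "tags": ["new"]},  # new fields, no migration
]
print(json.dumps(records[1]))
```

The schema side refuses the row with the unknown `phone` column outright; the schemaless side simply stores it alongside records with a different shape.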
Throughout its history, data storage has been in a state of constant evolution, each new approach carrying some decisive advantage over the old and making the data easier to access and use. A managed file system ensures that information written to a hard drive can be retrieved in the future, is safe from being accidentally overwritten, and carries metadata that tells a computer how to read it. With schema-based databases, specific records can be searched for and changed as needed, without hunting down specific files and editing them directly. Schemaless databases opened this up further, allowing the data in one record to be completely unlike the data in another.
Old vs New
Traditional methods for storing data are highly structured, built around the information you know is important. What is being stored, manipulated, and retrieved is known before anything is entered. When that is the case, those methods work very well, but when those assumptions are taken away, things start to break down. Tangled databases of random tables and entries of unrecorded origin and content form a tumorous mass that even the database administrator might not be able to fully understand.
Enter the Data Lake, a design built around the assumption that users do not know what will be important in the future, or what else might be added to it. Properly designed and maintained, it can hold many varieties of information: traditionally structured database tables, more open-ended JSON data, even images, video, recordings, and emails. All of these files are stored in a way that can be easily searched later, so that techniques like neural networks and fuzzy clustering can put that data to work.
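As a rough sketch of the idea, here is what a minimal “land everything as-is” ingestion step might look like. The `lake/raw` layout, the `ingest` helper, and the sidecar metadata format are all hypothetical, invented for this example rather than any standard; the point is that structured rows, free-form JSON, and opaque binary all land the same way, with just enough metadata to find them again later.

```python
import json
import time
from pathlib import Path

LAKE = Path("lake/raw")  # hypothetical landing ("raw") zone of the lake

def ingest(source: str, payload: bytes, extension: str) -> Path:
    """Land any payload as-is, partitioned by source and date, with a metadata sidecar."""
    day = time.strftime("%Y-%m-%d")
    target_dir = LAKE / source / day
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{int(time.time() * 1000)}.{extension}"
    target.write_bytes(payload)
    # A sidecar file makes the blob discoverable later without knowing its format.
    sidecar = target.parent / (target.name + ".meta.json")
    sidecar.write_text(json.dumps({"source": source, "ingested": day, "bytes": len(payload)}))
    return target

# Structured rows, free-form JSON, and opaque binary all land the same way.
ingest("crm", b"id,name\n1,Ada\n", "csv")
ingest("support", json.dumps({"ticket": 42, "text": "hello"}).encode(), "json")
ingest("scans", b"\x89PNG...", "png")
```

Nothing is filtered or transformed at entry; interpreting each payload is deferred until someone actually needs it.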
The Data Lake is not a perfect solution. With improper design or lazy execution, it could end up just as tangled as the messy database mentioned earlier (a so-called “Data Swamp”). However, when done properly, it can serve as a stepping stone to becoming one of the machine learning and artificial intelligence powerhouses that will dominate the rapidly changing corporate world.
The Data Warehouse
The original data management method is what is referred to as the “Data Warehouse” model. Imagine for a moment a warehouse. Neat, evenly spaced rows of pallets, organized by type and size. Ask for a particular pallet, and the manager can point you to a specific aisle, tell you how far down to go, and tell you what is inside. Trucks bring new boxes into the warehouse and deliver from the warehouse to customers. The warehouse holds pallets. It does not house livestock, act as a garage for cars, or serve as a home for people. This is how the “Data Warehouse” model works. In this analogy, the information is the pallets, the database is the warehouse manager, and the users are its customers. It is designed to store very specific data, with unrelated information discarded, or never allowed inside the warehouse at all.
Now imagine a different warehouse. Amongst the pallets, sheep wander the aisles, cars are parked wherever they fit, and makeshift homes are set up. If you ask for a specific pallet, the warehouse manager can point you to where it should be. Ask for a specific animal, and you will be told where it was last seen (a month ago). What about where a certain car is parked? You might be told it is usually in aisle 4, but that might not be right. The delivery of pallets is slowed by all the obstructions from cars and sheep. This warehouse is a mess, and it got that way by being asked to do things it shouldn’t. Too many different kinds of things are stored within it, and a warehouse is not the right way to store some of them. This is what can happen when a “Data Warehouse” is misused: random tables are strewn about, it is a pain to use, and it is hard to find what you need.
The two most important features of a data warehouse are that it must have:
Highly Structured – Everything has a place and purpose
Strict Entry – Only data that is relevant right now is allowed to be stored
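Those two rules are what database constraints express. A minimal sketch, reusing the warehouse analogy: the `pallets` table below is hypothetical, built with Python’s standard-library `sqlite3`, and rows that break the declared rules are refused at the door.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Every column has a declared type and a rule: highly structured, strict entry.
conn.execute("""
    CREATE TABLE pallets (
        pallet_id INTEGER PRIMARY KEY,
        aisle     INTEGER NOT NULL CHECK (aisle BETWEEN 1 AND 40),
        contents  TEXT    NOT NULL
    )
""")
conn.execute("INSERT INTO pallets VALUES (1, 12, 'canned goods')")

try:
    # Aisle 99 does not exist in this warehouse, so the row never gets in.
    conn.execute("INSERT INTO pallets VALUES (2, 99, 'livestock')")
except sqlite3.IntegrityError as err:
    print("rejected:", err)
```

The constraint check is what keeps the first warehouse orderly; the second, messy warehouse is what you get when data is stored with no such rules.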
The Data Lake
In a Data Lake, these features are turned on their heads. There is only a high-level structure, used to search for the data that is useful right now, and what comes back could be of any type. And every source that can feed in data is used: what will be important in the future is unknown, so don’t risk missing out on data that could matter later on.
Imagine now a large lake. Fish swim inside the lake, boats float atop it, houses dot the shoreline. Instead of a warehouse manager, you have some old salt who knows everything there is to know about this lake. Need to find a certain house? They know the way. Where is the best spot to locate a certain type of fish? They know that too.
The key difference between a Data Lake and a Data Warehouse comes down to what they are used for. The Data Warehouse is used when the user knows everything that goes in and everything that comes out; it is designed to handle files and data quickly and efficiently. The Data Lake is used for discovery. Analysts can use it to glean new insights into your organization. Reports can analyze trends going back years, without anyone having known in advance that the data would be needed. And of course, machine learning and artificial intelligence benefit from having access to the colossal sum of knowledge recorded within.
The importance of proper data storage to an organization cannot be overstated. Consider what can be done when the information is there. Internal company messages could be analyzed to figure out which departments are at risk of losing valuable employees. Emails and calls from your salespeople could help determine which leads are most likely to buy. Receipt data could be used to root out fraud. None of this can happen if the data doesn’t exist. With a Data Lake, everything is recorded so that it can be used in the future. Although any given data point might not be useful on its own, the more data you have, the more value it can give back.
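As a toy illustration of the receipt-fraud idea: the records below and the one-standard-deviation threshold are invented for this example (a real fraud model would be far more sophisticated), but even a crude statistical check is only possible because the receipts were kept in the first place.

```python
from statistics import mean, stdev

# Hypothetical receipt records accumulated in the lake over time.
receipts = [
    {"employee": "ada", "amount": 42.10},
    {"employee": "ada", "amount": 39.95},
    {"employee": "ada", "amount": 41.00},
    {"employee": "bob", "amount": 38.50},
    {"employee": "bob", "amount": 912.00},  # outlier worth a second look
]

amounts = [r["amount"] for r in receipts]
mu, sigma = mean(amounts), stdev(amounts)

# Crude rule: flag anything more than one standard deviation above the mean.
flagged = [r for r in receipts if r["amount"] > mu + sigma]
print(flagged)
```

The $912 receipt is the only one flagged; with years of such data on hand, the same scan can run retroactively over periods nobody thought to monitor at the time.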
If your organization is struggling with unstructured data or with putting your data to its best use, our data experts specialize in developing solutions that keep your organization compliant and competitive. Contact us to explore solutions.
At Positronic we take a four-step approach to implementing a new data science project. First, we go through our DISCOVER process, where we visually navigate the available data looking for patterns and correlations, applying advanced analytics and visualizations as a guide to building hypotheses about what sorts of predictive models may fall out of the data. Second, we test the chosen hypotheses through TEST.