There are many different ways to store, manipulate and retrieve data in a system. Many different software are available for such tasks. These software, should enable, based on the 5 Vs of data (volume, velocity, variety and value), a better user experience for the appropriate task. For that, we need to understand the difference among those 4 correlated objects (databases, datawarehouse, datamart and data lake).
Databases
Evolution of Database
Databases have been through years of evolution. Earlier, people stored their data in flat-file systems, which had several drawbacks. The file-based database was introduced in 1968. However, it required extensive programming in a third-generation language.
During 1968-1980, Hierarchical Databases became prominent. IBM introduced their first DBMS, called IMS (Information Management System). It worked with a parent/child data structure.
Today, Modern Databases are very flexible. They can be used as companies want them to. Thus, they are much more appealing to the enterprise-side as well as personal users.
Databases are simply the software that stores data. Any data per say. So, to be a little bit old school, we can say that an office drawer is a database. It is a database of files. If well organized, a user would be able to “quickly” go through it. Probably not quick enough, hence the need for software that handles that type of demand.
If we try to modernize a little bit, we can come to Excel, which is still very popular, for very small data. If we are talking about large volumes of data, we have to start thinking of different systems with query capabilities so that we are able to quickly deal with data.
So, What is a database?
A Database is an organized collection of data that can be stored and accessed easily. It can be organized into tables, columns, rows, and indexes. This way, you can find and access relevant data in no time.
Databases are designed to perform basic operations, including retrieval, modification, storage, and deletion. A database supports the electronic manipulation of data.
A database can be stored as a file or a set of files. Those are binary files not readable by humans and cand only be read by the DBMS system itself.
The main types of databases
- Relational Databases
- NoSQL Databases
Each type requires a different DBMS to operate. The most common is Relational Database. It uses SQL (Structured Query Language) to manipulate, define, access, store, and delete data. The most popular databases in RDBMS are Oracle, MSSQL Server, MySQL, PostgreSQL and DB2.
NoSQL databases are getting a lot of popularity nowadays due to the Big data evolution\revolution. NoSQL, mean not only SQL. The most popular databases here are MongoDB, Cassandra and Neo4J. Each of these databases operates in different ways.
Hadoop, even though, many people say it is a database. I would classify more a file system storage with high processing power for big data.
In few words, the database is purely the software that manages storage, manipulation and retrieval. Nothing else!
What is a Data Warehouse?
A Data Warehouse is a “large” collection of data that live inside of a database software. So the database in this case is the car. The warehouse is how we drive that car.
In the database world, there are so many ways to use it and warehouse is just one of them. To know more about other ways, check on this article about OLTP vs OLAP.
A Data Warehouse holds data from heterogeneous sources. It is structured data, and should have been well thought the type of data that it holds and the purpose of the data. If well, designed and with purpose in mind, datawarehouses provide meaningful business insights. It is used to make important organizational decisions.
It is the core of the BI System, built for reporting and data analysis. A Data Warehouse is a blend of different technologies and components. Its main purpose is to help gain strategic benefit for an organization.
A Data Warehouse electronically stores large amounts of information for a business. It can be considered as a process of transforming data into information. Thus, users can make decisions based on the available data and insights.
History of Data Warehouses
The first concept of Data Warehouse was coined in 1983. Tera Deta Corporation released a DBMS that was designed specifically for making decisions. However, in the late 1980s, IBM developed the first Business Data Warehouse. Although it had some similarities to the modern Data warehouses, it was still not the same.
The first real concept was introduced by Inmon Bill. He is considered the father of data warehouses. His concept included the usage, building, and maintenance of a data warehouse.
Types of Data Warehouse
There are two main types of Data Warehouse:
1. Enterprise Data Warehouse (EDW): It is a centralized warehouse that provides various services for an enterprise.
2. Operational Data Store: Operational Data Store is a simple datastore that refreshes in real-time.
In few words, Datawarehouses are the concept used inside databases to store large amounts of data from all different sources to use for reporting, analytical and insights purposes.
What is a Datamart?
Datamart is a subset of a Datawarehouse. In a Datawarehouse, we may have data from different channels and departments in one centralized place. In an enterprise level this centralization is good. However, for the usage of different departments, it may be too slow and hard to manage security. Datamart comes into place in this situation. In this case Datamart would be a smaller Datawarehouse for a specific department for example.
Although Data Mart is also considered a subset of Data warehouse, it has some differences that we will go through.
A Data Mart is a subject-oriented database, often used in an enterprise Data Warehouse (EDW). It holds the data related to different departments like finance, marketing, and sales independently.
The main purpose of data marts is to accelerate the business process by allowing access to the relevant information quickly. The reason it does that is because the Datawarehouse is usually huge in data. Queries may take minutes sometimes to execute. However, if a Datamart, with exactly same data from the Datawarehouse, is used for sales data only in the sales department. The query will take much faster to execute.
A Data Mart offers actionable insights quickly and easily since it only contains the data applicable to a specific business area. Thus, Data Marts make up a cost-effective approach to accessing enterprise insights.
The other usage would be to create a wall in between data that other departments should not be able to see. For example, if the sales department should not see some of the data in the financial department or vice and versa, Datamarts for an specific department would be able hold only what is needed and allowed to see in that particular set.
Types of Data Marts
There are three types of Data Marts:
Dependent Data Mart: It is a centralized data mart that enables sourcing an organization’s stored data.
Independent Data Mart: This type operates without a central data warehouse and is ideal for small businesses.
Hybrid Data Mart: It combines input from data sources that are not available in the data warehouse.
What is a Data Lake?
A Data Lake is a system of repository data stored in its natural format. It usually stores blobs or files of all different types. It can be video files, image files, JSON files, other database files or anything else. If some type of data is yet to be explored, this is a good place to put it in. It should be ale to hold structured, semi-structured and unstructured data.
Datawarehouse’s are sometimes created out of data from data lakes. There are times where sometype of data is not yet known on how to be used. So, after clarification of its usage, a database structured can be defined and that data may now be part of a Datawarehouse.
You can perform several tasks with a data lake, such as visualization, reporting, machine learning, and advanced analytics. However, those tasks are usually time consuming due to the lack of patterns and structured data.
A data lake can be established in the cloud or on-premises. While it has many benefits for a business, there are some downsides as well. For instance, a data lake can not establish a robust data ingestion process or good communication.
History for Data Lake
The term Data Lake was first coined in 2011 by James Dixon, the chief technology officer at Pentaho. To promote this new concept, Dixon highlighted the problems that users face in data marts. The goal of Data Lake is to provide views from different types of information in quick way for future data modeling. Data Scientists works quite a bit with this type of data. They look for patterns in these files to improve the organization insights.
Other early studies on Data Lakes indicated that they were starting to extract and place data for analytics into a single repository.
Today, many huge companies offer data lakes. These include Google, Oracle, Microsoft, Amazon, among others.
Main Differences among Database, Datawarehouse, Data Mart and Data Lake
Database | Data Warehouse | Data Mart | Data Lake |
Databases are used to store and access structured and semi-structured data. | A Data Warehouse only holds structured processed data. | A Data Mart can hold the data from one or more functional area of an organization. | A Data Lake can embrace and retain all types of data, including text, images, sensors, structured, or unstructured. |
Databases perform the best when there’s a single source. | Data warehouses access the source system for data analysis. | Data Marts is designed to hold only a specific type of data. | A Data Lake accesses the reports and analyzes the given data “quickly”. |
A Database can come in different sizes, ranging from personal to enterprise use. | A data warehouse can take a lot of storage to process and prepare the data. | A data Mart contains only a subset of data stored in a Data Warehouse. | A Data Lake can take high storage and still works “efficiently” as it stores in raw form. |
A Database can scale up and down based on requirements. | A data warehouse is structured. | A data Mart is only a condensed version of a Data warehouse. | Depending on the requirements, a Data Lake can also be scaled up and down. |
Updating the data or system in a database may be costly. | A data warehouse also has low accessibility and fidelity of updating the system. | A Data Mart is smaller and size, but it is more flexible as compared to a data warehouse. | A Data lake allows users to make changes easily and quickly. |
A database comes in a variety of types. Each has its own functions and features. | A Data Warehouse is mainly of two types and can fulfill different functions. | A data mart has different types, each designed for specific use. | A data lake has no types. That’s because it adopts Hadoop for its unstructured and real-time data streams. |
A database can be used for personal, business, and enterprise usage. | A data warehouse is designed specifically to access the processed data of a business. | A data mart is designed to fit the needs of a specific department, unit, or group of users. | A Data Lake is also designed specifically for business usage. It offers actionable insights that can be used to make important organizational decisions. |
Workloads
All these systems manage different types of workloads. For instance, a database can handle both transactional and operational workloads. It means that a database can be used to make transactions by queries. Similarly, you can also perform several operations in a database, such as create, read, delete, modify, etc.
On the other hand, a data warehouse is only restricted to analytical workload. Since a data mart is a subset of a data warehouse, it also manages the same.
A data lake also manages only analytical workloads. It means that businesses use this system to get important analyses. They can use these to make important decisions and strategies.
Schema
A database has a rigid or flexible schema. However, it also depends on the type of database.
A data lake has no schema definition. It only requires schema on reading. A data warehouse, however, has a fixed and pre-defined schema definition for ingesting. It requires schema on reading and writing.
Users
A database is used by application developers. But, it has a large variety of use cases. A Datawarehouse and Data Mart is only suitable for application developers, business analysts, and data scientists.
A data lake is aimed at business analysts and data scientists, thanks to its nature of storing raw and analytics-based data.
Examples
Some notable examples of different database types include:
Document Database: MongoDB and CouchDB
Relational Database: Oracle, MySQL, Microsoft SQL Server, and PostgreSQL
Graph Database: Amazon Neptune and Neo4j
Wide-column stores: HBase and Cassandra
Key-value database: Redis and DynamoDB
Some prominent examples of data warehouse and data mart include:
Amazon Redshift
IBM Db2 Warehouse
Microsoft Azure Synapse
Snowflake
Teradata Vantage
Oracle Autonomous Data Warehouse
Google BigQuery
The following are examples of technology that offer storage for building data lakes:
AWS S3
Google Cloud Storage
Azure Data Lake Storage Gen2
AWS Athena
Presto
MongoDB Atlas Data Lake
Databricks SQL Analytics
Starburst
Conclusion
The list of differences among databases, data warehouses, data marts, and data lakes is vast. Each system can be used for a specific purpose. As an organization, it depends on their requirements to pick the system they want.
Almost all new applications require a database to store their current data. Similarly, companies want to hold current and historical data to analyze their applications.
Therefore, they either have to choose a database, data warehouse, data mart, or a data lake. However, many companies also use several management systems to make the best use.
All of them hold and access different types of data. Plus, they have various uses due to their abilities.
A database stores the current data that is required to power an application, while a data warehouse can store both current and historic data from multiple systems.
A data mart stores only a subset of data that a data warehouse can contain. On the other hand, a data lake can store the same data type but in its raw form.