The terms data mart, data lake, data repository and data warehouse are often used interchangeably, as if they described the same system. That’s not accurate.
Each system has its own unique properties. For those working in health informatics, understanding the differences is important. Here’s a closer look at these four terms and what exactly they mean.
Data Lake
A data lake is typically considered a kind of dumping ground for data, because everything goes in and, in many cases, not a lot comes back out. Essentially, it’s used by organizations that have massive amounts of data to store but no current plan for analyzing it.
A data lake accepts data in any form, including unstructured data such as data feeds, emails, chat logs, images and videos. A data lake is not necessarily something an organization wants, but many have one because the ways to collect data have outrun the ways to analyze it.
Data Warehouse
Typically, a data warehouse is also filled with massive amounts of data. However, it is data that has been structured and is easier to both access and analyze.
Unlike a data mart, though, a data warehouse does not separate its data by business unit. For example, data that marketing and sales would be interested in (customer behavior online, certain demographic indicators) is not separated from other data.
The advantage is that data from across an entire operation is accessible. That can help in healthcare projects, for example, which often require overlapping data from different corners of the operation.
Data Mart
A data mart is essentially a subset of a data warehouse. In most cases, it is created to provide information for one department within the overall organization. It walls off other types of data: a data mart for patient billing in a hospital, for example, will not include information from maintenance, procurement or clinical departments. The advantage is that it is easier to provide security for that specific subset of information, as well as to allow people to access it without affecting work in other departments.
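The idea of a data mart as a department-specific subset of a warehouse can be sketched in a few lines of Python using the standard-library sqlite3 module. This is only an illustration under stated assumptions: the table name warehouse_records, the view name billing_mart, the column names and the sample rows are all hypothetical, and a database view stands in for what would usually be a separately managed store in a real deployment.

```python
import sqlite3


def build_warehouse():
    """Create an in-memory 'warehouse' holding rows from several departments.

    All table, view, and column names here are hypothetical examples.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE warehouse_records (
            id INTEGER PRIMARY KEY,
            department TEXT NOT NULL,
            description TEXT NOT NULL,
            amount REAL
        )
        """
    )
    # Made-up sample rows: structured data from across the whole operation.
    conn.executemany(
        "INSERT INTO warehouse_records (department, description, amount) "
        "VALUES (?, ?, ?)",
        [
            ("billing", "Invoice for outpatient visit", 120.0),
            ("maintenance", "Elevator inspection", 450.0),
            ("billing", "Invoice for lab work", 75.5),
            ("procurement", "Surgical gloves order", 300.0),
        ],
    )
    # The 'data mart': a view exposing only the billing subset.
    conn.execute(
        """
        CREATE VIEW billing_mart AS
        SELECT id, description, amount
        FROM warehouse_records
        WHERE department = 'billing'
        """
    )
    return conn


def billing_rows(conn):
    """Return the rows visible through the billing data mart."""
    query = "SELECT description, amount FROM billing_mart ORDER BY id"
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    connection = build_warehouse()
    for description, amount in billing_rows(connection):
        print(description, amount)
```

Someone querying billing_mart sees only billing rows, while the warehouse table still holds every department's data. A view by itself does not enforce access control in SQLite, so the security benefit described above would come from how the mart is hosted and permissioned, not from this sketch.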
Data Repository
A data repository is to a data mart what a data lake is to a data warehouse. It collects unstructured data for a specific business unit within a healthcare operation. For example, a data repository could contain detailed patient healthcare records, including demographic information, test results, video images and diagnoses. However, the data is not in a state where it is ready for the application of data analytics.
Each of these four data collection approaches offers certain advantages, although typically a healthcare operation strives to have data warehouses and data marts. Both allow for extracting valuable information that can be analyzed, either across an entire operation or within a specific department.