DataOps frequently utilizes the data lake design pattern, supporting the fast creation of robust data marts and data warehouses.

Throw Your Data in a Lake

DataKitchen · Published in data-ops · May 23, 2017

Many enterprises organize data in disparate silos, making it much more challenging to ask questions that require data to be combined and synthesized in new ways. DataOps breaks down the barriers between data silos using simple storage and data lakes.

The Benefits of Simple Storage

The declining cost of storage reduces the incentive to architect data and workflow processes to minimize storage costs. Enterprises can now cost-effectively design the data analytics pipeline to maximize flexibility and responsiveness to users and customers. The dream of “saving everything” is now an affordable reality.

The industry uses the term simple storage to describe this dynamic. To be clear, simple storage refers to more than a single technology or vendor, and different applications can effectively utilize different approaches. Simple storage can be implemented using HDFS, AWS S3, other distributed file systems, or one of many other storage solutions. We use simple storage here as an agnostic term.
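As a concrete illustration, the following minimal Python sketch writes the same raw extract to local disk and to S3 through one interface. It assumes the fsspec library (writing to s3:// additionally requires s3fs), and the bucket and paths are hypothetical:

# Storage-agnostic writes: only the URL scheme changes between backends.
# fsspec is an assumption; bucket and path names here are hypothetical.
from pathlib import Path

import fsspec

def save_raw_extract(data: bytes, url: str) -> None:
    # Write a raw extract to whatever simple storage the URL names.
    with fsspec.open(url, "wb") as f:
        f.write(data)

Path("/tmp/lake/raw").mkdir(parents=True, exist_ok=True)  # local zone for the sketch
extract = b"order_id,amount\n1001,250.00\n"
save_raw_extract(extract, "file:///tmp/lake/raw/orders.csv")  # local disk
save_raw_extract(extract, "s3://example-lake/raw/orders.csv")  # AWS S3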

Using Data Lakes to Simplify Transformations

The low-cost availability of simple storage enables enterprises to make increasing use of data lakes. A data lake utilizes simple storage to retain the organization’s critical data. Data analysts commonly understand data lakes as repositories for raw data, but processed data can also be deposited into a data lake, allowing it to be more easily combined with other data.
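As a sketch of this idea, the same extract can land in the lake twice: once exactly as the source system produced it, and once in a processed form that is easier to combine with other data. Pandas is an assumption here, and the lake root, zone layout, and source file are hypothetical conventions:

# Land raw and processed copies of an extract in separate lake zones.
from pathlib import Path

import pandas as pd

LAKE = Path("/tmp/lake")  # could equally be an s3:// or hdfs:// location
(LAKE / "raw").mkdir(parents=True, exist_ok=True)
(LAKE / "processed").mkdir(parents=True, exist_ok=True)

# Raw zone: keep the extract exactly as the source system produced it.
raw = pd.read_csv("crm_export.csv")  # hypothetical source extract
raw.to_csv(LAKE / "raw" / "customers.csv", index=False)

# Processed zone: a cleaned copy that is easier to join with other data.
processed = raw.rename(columns=str.lower).drop_duplicates("customer_id")
processed.to_parquet(LAKE / "processed" / "customers.parquet")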

The figure below illustrates the data lake design pattern. (For more information on why DataKitchen views the data lake as a design pattern, please see our previous blog on this topic.) On the left, data originates from many different sources. In its native and isolated form, this data is difficult to access. Imagine a new analytics project that needs to work with data from a series of repositories: a CRM, an ERP, syndicated data, sales channel data, and so on. Accessing data in each of these repositories is time consuming and requires authorization and specific skills. Colocating all the data in one place makes it much easier to work with. The data lake serves as the common repository for the various data sources, greatly simplifying the job of transformation.
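A sketch of that colocation might pull a relational CRM table and a flat sales channel file into the same lake. The sqlite3 and pandas libraries are assumptions, and the database, table, and file names are hypothetical:

# Copy several isolated sources into one lake so they can be joined later
# without touching the source systems again.
import sqlite3
from pathlib import Path

import pandas as pd

LAKE = Path("/tmp/lake/raw")
LAKE.mkdir(parents=True, exist_ok=True)

# CRM data lives in a relational database...
with sqlite3.connect("crm.db") as conn:
    pd.read_sql("SELECT * FROM accounts", conn).to_parquet(LAKE / "crm_accounts.parquet")

# ...while sales channel data arrives as flat files.
pd.read_csv("channel_sales.csv").to_parquet(LAKE / "channel_sales.parquet")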

Data lakes consolidate data from different sources and feed into an automated data analytics pipeline that creates data warehouses and data marts.

Creating Data Warehouses from Data Lakes

The data lake provides easier access but lacks the optimizations needed for visualization or modeling. For example, data often enters the data lake in the format of the source system rather than in a star schema, a format that facilitates analysis. Data marts and data warehouses better address these analytics-specific requirements.

Data transforms create a data mart or data warehouse from a data lake. Data transforms are just code: scripts, source code, algorithms, or other types of files. When an enhancement to the data mart or data warehouse is needed, it is best to update the transform files and create a new data mart or data warehouse from the original data lake.
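To make this concrete, here is a minimal sketch of such a transform file, assuming pandas; the lake and mart paths and the column names are hypothetical. Because the whole mart is regenerated from the lake, an enhancement means editing this file and re-running it:

# A transform-as-code file: read raw data from the lake and rebuild a
# star-schema data mart. Paths and columns are hypothetical.
import pandas as pd

def build_sales_mart(lake_root: str, mart_root: str) -> None:
    orders = pd.read_parquet(f"{lake_root}/raw/orders.parquet")

    # Dimension table: one row per customer.
    dim_customer = (orders[["customer_id", "customer_name", "region"]]
                    .drop_duplicates("customer_id"))

    # Fact table: one row per order, keyed to the customer dimension.
    fact_sales = orders[["order_id", "customer_id", "order_date", "amount"]]

    dim_customer.to_parquet(f"{mart_root}/dim_customer.parquet")
    fact_sales.to_parquet(f"{mart_root}/fact_sales.parquet")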

For several reasons, data marts and data warehouses are difficult to update. For example, in a non-DataOps environment, the source code that created a data mart is often not in a revision control system; it might be scattered across different folders or on someone’s personal hard drive. There may not be a ready suite of tests that can validate the production worthiness of the new data mart. The workflow might involve error-prone manual steps. Also, modifying a data mart in active use can impact user productivity. These and other factors add risk to the task of updating an existing data mart or data warehouse.

Improving Response Time to Requests for New Analytics

The data lake pattern frequently used in DataOps simplifies the process of creating a modified data mart or data warehouse. This enables data analytics professionals to respond rapidly to user requests for new analytics. Let’s look at an example.

Imagine that a data mart needs to be updated with an improved transformation of some raw data. We will illustrate how to make this change using DataOps. As we have written previously, DataOps can be implemented in seven simple steps; we’ll note how each action in our example maps to one of those steps (and provide links back to our blog series).

Example: Rapidly Creating an Improved Data Mart

· Use a version control system — The developer checks out of version control the data transform code that originally created the data mart from the data lake.

· Branch and merge — In version control terms, the above step creates a branch off the main trunk.

· Use multiple environments — The developer creates a development environment, including a copy of the data that she needs, using simple storage. Some DataOps enterprises implement this step by keeping development environments on hand for new development work.

· Add data and logic tests — The developer updates the data transform code with new functionality and adds tests to validate the changes (a sketch of such tests appears after this list).

· Reuse and containerize — The updated transform now interfaces with a legacy system. To make it easier for others to reuse, the developer places it in a container.

· Parameterize your processing — The developer adds a parameter to the application enabling it to redirect output to the development environment, making it easier to switch between the development and production environments (see the parameterization sketch after this list).

· Branch and merge — When complete, the developer checks the code back into version control, merging changes back into the main trunk. She also checks in the newly developed tests.

· Orchestrate two data journeys — The automated data analytics pipeline builds, tests, and runs the new data transform code, creates a new data mart, and pushes it out to production. New tests run in production as data flows through (a sketch of this orchestration follows the list).
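To illustrate the testing step, here is a minimal sketch of data and logic tests, assuming pytest and pandas; the mart location, table names, and specific checks are hypothetical:

# Data and logic tests for the rebuilt mart. MART_ROOT is hypothetical and
# points at whichever environment's output we want to validate.
import os

import pandas as pd

MART_ROOT = os.environ.get("MART_ROOT", "/tmp/mart-dev")

def test_fact_sales_not_empty():
    # Data test: the transform should always produce at least one fact row.
    fact = pd.read_parquet(f"{MART_ROOT}/fact_sales.parquet")
    assert len(fact) > 0

def test_no_orphan_customers():
    # Logic test: every fact row must join to a customer dimension row.
    fact = pd.read_parquet(f"{MART_ROOT}/fact_sales.parquet")
    dim = pd.read_parquet(f"{MART_ROOT}/dim_customer.parquet")
    assert fact["customer_id"].isin(dim["customer_id"]).all()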
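The parameterization step might look like the following sketch, using argparse from the Python standard library. The environment names, paths, and the sales_mart module are hypothetical, reusing the transform sketched earlier:

# Redirect the transform's output by parameter rather than by editing code.
import argparse

from sales_mart import build_sales_mart  # the transform sketched earlier (hypothetical module)

TARGETS = {
    "dev": "/tmp/mart-dev",  # development environment
    "prod": "/data/mart",    # production environment
}

parser = argparse.ArgumentParser()
parser.add_argument("--env", choices=TARGETS, default="dev",
                    help="which environment receives the rebuilt mart")
args = parser.parse_args()

build_sales_mart(lake_root="/data/lake", mart_root=TARGETS[args.env])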
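Finally, the orchestration step could be sketched as follows; the script names, paths, and promotion logic are hypothetical, and real pipelines typically hand this sequencing to an orchestration tool:

# Build and test the mart in development, then repeat against production.
import os
import subprocess

TARGETS = {"dev": "/tmp/mart-dev", "prod": "/data/mart"}

for env in ("dev", "prod"):
    # Rebuild the mart for this environment from the lake...
    subprocess.run(["python", "sales_mart.py", "--env", env], check=True)
    # ...then run the test suite against its output; check=True stops the
    # run (and thus the promotion to production) if anything fails.
    subprocess.run(["pytest", "tests"], check=True,
                   env={**os.environ, "MART_ROOT": TARGETS[env]})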

The data lake pattern makes it much easier for the data analytics team to respond to requests for enhancements. The DataOps enterprise designs its processes so that requests for new analytics can be implemented quickly and robustly. Better and faster analytics support creativity and improved decision-making, and the ability to bring new ideas to implementation faster than competitors provides a sustainable competitive advantage for DataOps enterprises.

DataKitchen is leading the DataOps movement to incorporate Agile Software Development, DevOps, and manufacturing-based statistical process control into analytics and data management. We provide the world’s first DataOps platform for data-driven enterprises, enabling them to support data analytics that can be quickly and robustly adapted to meet evolving requirements.
