The growing visibility of data as a strategic asset and a source of competitive advantage has accelerated the demand for processing and analysis capacity in companies. The explosion in data volume and the plurality of systems within organizations create an ecosystem in which an architecture for data collection, structuring, storage, and processing is fundamental. These architectures can be highly complex depending on the size of the company and of its data, so no single solution or piece of software fits every scenario.
Among the main challenges in this area is the collection and storage of raw data from source systems such as ERPs, APIs, and manual inputs like Excel and Access files. This collection must meet several requirements: high data volume, scalability, low latency, the ability to handle data in a variety of formats and at different levels of structure, and low cost.
There are currently many solutions on the market that seek to meet the demand for data collection and storage. Generically referred to as Data Lakes, these systems consist of a repository capable of storing raw or unstructured data while meeting volume, reliability, and cost requirements. However, implementing a Data Lake that generates results does not depend only on choosing a good tool. A successful Data Lake depends on technical work combined with a change of mindset in the company.
The four pillars of a successful Data Lake:
Although a Data Lake is built and maintained by IT as a unified and governed data repository, it will only be properly structured with the participation of all areas of the company and the engagement of senior management. Synergy between IT and business areas is important to ensure greater efficiency in generating results and better strategic alignment with the company's directives and goals. Below we list the four pillars of a successful Data Lake:
1 — Correct platform: there are many platform options on the market, including open-source alternatives. The chosen platform must meet requirements for data volume, cost, data variety, and scalability. It is very common for tools to be undersized or oversized, resulting in unnecessary costs or in insufficient performance to meet the customer's requirements.
2 — Organization: the purpose of a Data Lake is to store the largest possible volume of data in its natural form, avoiding pre-processing that can be costly and time-consuming to implement depending on the data in question. Additionally, the platform must be transparent, so that users know exactly how to obtain the data they need.
3 — Interface: building a successful Data Lake means not only storing a huge volume of data, but also providing it to users in the format most appropriate to their use case and expertise. Take the commercial area of a company as an example. An analyst in that area will be interested in the average sales volume of a particular product. A data scientist, on the other hand, may look for the correlation between the daily sales volume of product x and that of product y as a function of the arrangement of those products on the shelves (close together or separate). Both users will access the Data Lake in search of sales information, but each will require the data at a different level of processing.
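As an illustration, the two access patterns above can be sketched over a small, entirely hypothetical sales dataset (the product names and figures are invented for the example):

```python
import pandas as pd

# Hypothetical daily sales extract, as it might sit in the curated zone.
sales = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=6, freq="D").repeat(2),
    "product": ["x", "y"] * 6,
    "units": [10, 7, 12, 9, 8, 6, 15, 11, 9, 7, 11, 8],
})

# The analyst's view: average daily sales volume per product.
avg_by_product = sales.groupby("product")["units"].mean()

# The data scientist's view: correlation between the daily sales
# of product x and product y.
daily = sales.pivot(index="date", columns="product", values="units")
corr_xy = daily["x"].corr(daily["y"])

print(avg_by_product)
print(round(corr_xy, 3))
```

The same underlying records serve both users: one consumes a simple aggregate, the other a reshaped table suitable for statistical analysis.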
4 — People: it is essential to have people and teams trained to design the Data Lake. As mentioned, it is not just about choosing the right tool; intense technical work is needed to define the solution that best suits each company.
In general, IT teams need in-depth technical knowledge in addition to knowledge of business processes. Business areas, in turn, need to know not only their processes but also, at least minimally, the technology that will be applied. This interdisciplinarity fosters synergy between areas, which is fundamental to the success not only of the Data Lake but of any other system implemented in the company.
Real gains from implementing a Data Lake
The correct implementation of a Data Lake can bring gains to the company at different levels:
- Governance: centralized data storage allows the creation and maintenance of policies and regulations that permeate the entire company. One of the main points here is information security, which is covered in greater detail in a later section.
- NegĂłcio: gains in storage scalability allow companies to store historical data with significantly less granularity. This increases the accuracy of the data, in addition to creating space for analytics initiatives that in turn generate more value to the business.
- Cost reduction: it is possible to optimize storage and processing costs by means of a data architecture that uses the Data Lake to offload data stored in other, more expensive technologies, such as a data warehouse.
Real examples of gains from implementing a Data Lake
Financial sector
Financial services giant American Express used its cloud architecture to improve its fraud detection algorithm. Modeling methods must consult a variety of data sources, from basic credit card information to spending details and merchant information, in order to block fraudulent transactions. At the same time, it should allow legitimate transactions to proceed quickly.
Fraud detection systems must flag suspicious events in advance and make decisions within a few milliseconds against a vast set of data. In this context, the use of cloud architecture associated with machine learning provided an improvement over traditional linear regression methods, raising the accuracy of predictions to a new level.
Automotive sector
The luxury car brand Mercedes-Benz reduced its weekly engine testing cycle by one day by implementing a cloud architecture. A typical engine has about 300 sensors and generates about 30,000 data points every second. Before the cloud architecture, engineers had to wait for about an hour of testing and then analyze the data manually in search of anomalies. With the cloud architecture, engine data collected in real time is correlated with historical test data, making it possible to detect performance problems caused by engine failure almost instantly.
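The idea of checking live readings against historical test data can be sketched as a simple baseline comparison; the sensor values and the 3-sigma threshold below are illustrative assumptions, not Mercedes-Benz's actual method:

```python
from statistics import mean, stdev

# Hypothetical historical readings for one engine sensor (illustrative values).
historical = [72.1, 71.8, 72.4, 72.0, 71.9, 72.3, 72.2, 72.0]
baseline_mean = mean(historical)
baseline_std = stdev(historical)

def is_anomaly(reading: float, threshold: float = 3.0) -> bool:
    """Flag a live reading that deviates more than `threshold`
    standard deviations from the historical baseline."""
    return abs(reading - baseline_mean) > threshold * baseline_std

print(is_anomaly(72.1))  # normal reading -> False
print(is_anomaly(75.0))  # far outside the historical band -> True
```

In a real pipeline this comparison would run continuously over streaming sensor data, but the principle is the same: the historical data stored in the lake provides the baseline that makes near-instant detection possible.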
Security and legal considerations
Information security is one of the main concerns when it comes to the implementation of Data Lakes and the democratization of data. The question is: how can access to data be guaranteed while corporate and government regulations are followed?
As discussed earlier, a well-structured Data Lake allows the creation of centralized data governance. However, for the policies to actually be followed, the structure of the Data Lake must be prepared for it. This requires an architecture with permission rules, in addition to an organized directory tree. In general, the main tools available on the market already provide the resources needed for such an organization; it is up to corporate governance to be properly structured to make it work.
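A minimal sketch of what zone-level permission rules might look like; the roles are hypothetical, and the zone names follow the directory structure recommended below:

```python
# Hypothetical mapping of top-level zones to the roles allowed to read them.
ZONE_PERMISSIONS = {
    "transient": {"data_engineer"},
    "raw": {"data_engineer"},
    "curated": {"data_engineer", "data_scientist", "analyst"},
    "lab": {"data_engineer", "data_scientist"},
}

def can_read(role: str, path: str) -> bool:
    """Grant access only if the user's role is allowed in the
    top-level zone of the requested path."""
    zone = path.split("/", 1)[0].lower()
    return role in ZONE_PERMISSIONS.get(zone, set())

print(can_read("analyst", "curated/sales/2024/orders.parquet"))  # True
print(can_read("analyst", "raw/erp/2024/orders.csv"))            # False
```

In practice these rules would be enforced by the storage platform itself (object-store policies, directory ACLs, and the like); the point is that an organized directory tree is what makes such rules simple to express.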
The following is a recommended structure for a Data Lake.
Recommended structure of a Data Lake
The ideal structure for a company may vary depending on its size. In general, Visagio recommends the following organization:
- Transient: Storage location for temporary files used as staging. Deleted regularly.
- Raw: Storage location where the raw files are kept; all files extracted from source systems are saved in this folder in their original format.
- Curated: Storage location where the data is cleaned and refined for data consumption. It is comprised of two subdirectories:
- Converted: Storage location with files converted by ETL (Extract, Transform and Load, the basic data preparation process) in formats optimized to be read and processed by the tools.
- Enriched: Storage location with files enriched with new data, cross-referenced information, calculations, or analytical models.
- Lab Zone: Folder for data scientists to carry out experimentation and exploratory activities.
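The zone layout above could be bootstrapped with a short script; the folder names mirror the recommendation and are illustrative, not a fixed standard:

```python
from pathlib import Path
import tempfile

# Zone directories based on the recommended structure above.
ZONES = [
    "transient",
    "raw",
    "curated/converted",
    "curated/enriched",
    "lab",
]

def create_lake_skeleton(root: str) -> list:
    """Create the recommended zone directories under `root`
    and return the created paths."""
    created = []
    for zone in ZONES:
        path = Path(root) / zone
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created

# Usage: build the skeleton in a temporary directory.
root = tempfile.mkdtemp()
paths = create_lake_skeleton(root)
print([p.relative_to(root).as_posix() for p in paths])
```

On a cloud object store the "directories" would be key prefixes rather than folders, but the same layout applies.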
Another good practice concerns naming. A single standard should be adopted that avoids ambiguity and is easily recognized. Starting file names with the extraction date in YYYYMMDD format, and using only uppercase letters, without accents, special characters, or spaces, are examples of good practices that we recommend.
Example: a .csv file extracted directly from an ERP database on a given date would be stored at RAW/[ERP_NAME]/[BASE_NAME]/[YEAR]/[MONTH]/[DAY]/[YEAR][MONTH][DAY]-[BASE_NAME].csv
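A sketch of how such a naming convention might be enforced in code; the ERP and table names used in the example are hypothetical:

```python
from datetime import date
import re
import unicodedata

def normalize(name: str) -> str:
    """Uppercase, strip accents, and replace spaces and special
    characters with underscores, per the naming convention."""
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    return re.sub(r"[^A-Z0-9]+", "_", ascii_name.upper()).strip("_")

def raw_path(erp: str, base: str, extraction: date) -> str:
    """Build the raw-zone path for an extracted .csv file:
    RAW/[ERP]/[BASE]/[YEAR]/[MONTH]/[DAY]/[YYYYMMDD]-[BASE].csv"""
    erp, base = normalize(erp), normalize(base)
    y, m, d = f"{extraction:%Y}", f"{extraction:%m}", f"{extraction:%d}"
    return f"RAW/{erp}/{base}/{y}/{m}/{d}/{y}{m}{d}-{base}.csv"

print(raw_path("Protheus", "notas fiscais", date(2024, 3, 5)))
# → RAW/PROTHEUS/NOTAS_FISCAIS/2024/03/05/20240305-NOTAS_FISCAIS.csv
```

Centralizing the convention in a single function like this avoids the ambiguity that arises when each ingestion job names files by hand.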
Conclusion
Data organization is a key part of digital business transformation. In addition to implementing new technologies, an organizational change and a change of mindset are necessary so that real value can be extracted from the data.
In other words, the data area must be structured with the support of the company's top management.
With regard to technology, a well-defined data architecture is necessary. In this context, the use of a Data Lake following best practices is fundamental, and can represent a major step in the digital transformation of companies.
About the authors
José Suen is a consultant at Visagio, working on Technology projects focusing on BI and Analytics since 2019.
Julio Batista is a consultant at Visagio, working on Technology projects focusing on BI and Analytics since 2017.