The more energy, the faster the bits flip. Earth, air, fire, and water in the end are all made of energy, but the different forms they take are determined by information. To do anything requires energy. To specify what is done requires information. – Seth Lloyd, Professor of Mechanical Engineering and Physics at MIT

The term ‘big data’ has become commonplace in the data analytics world, and deservedly so. Big data technology has revolutionized the way we consume data and plays an outsized role in every aspect of our lives – almost everything we do, or that is done to us, is now driven by this disruptive technology.

Two developments brought about the transformative changes of big data. The first is the emergence of data science – a branch of statistical science that brings the power of statistical methods to data analytics. The second is data engineering, the discipline of building data infrastructure.

Data Science

Although some form of prescriptive statistics has always been present in data analytics, statistical methods were common only in niche areas of research and development. Statistics were also common in academic disciplines such as social science, economics, and the esoteric world of statistical and quantum physics; however, the necessary infrastructure, tools, and data science practices as a discipline were not available to apply to the operational processes of the academic world.

The implementation of data science has become possible in more recent years due to the changes brought forth by the enhancements in storage and scalable computing to mobilize, simulate, and transform the data at near real-time speed. Changes in this industry over the last decade have resulted in more options, making data science more affordable for all organizations.

Although the more affordable price point enables more organizations to engage in data analytics, the need for skilled data practitioners remains. Implementing the rigor of statistical science on modern data infrastructure requires two new skill sets: a data scientist to infer knowledge from data by building models, and an AI analytics/data engineer to build the infrastructure for that data.

Data Engineering

Data engineering evolved primarily from software engineering. It adopted the same fundamental coding principles and processes but applied them to building a framework and infrastructure for data storage and movement. This framework is known as extract, transform, and load (ETL). Today, data engineering skills have become imperative for building modern data infrastructures.
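To make the three ETL stages concrete, here is a minimal sketch using only the Python standard library. The sample data, the `orders` table, and the function names are hypothetical and chosen purely for illustration; a production pipeline would read from real source systems and usually run on a dedicated ETL or orchestration tool.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file, an API,
# or a source system.
RAW_CSV = """order_id,customer,amount
1,alice,19.99
2,bob,
3,carol,5.00
"""


def extract(text):
    """Extract: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with a missing amount, cast types, tidy names."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append(
            (int(row["order_id"]), row["customer"].title(), float(row["amount"]))
        )
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(text, conn):
    """Run the three stages in order."""
    load(transform(extract(text)), conn)


# An in-memory SQLite database stands in for the target system.
conn = sqlite3.connect(":memory:")
run_etl(RAW_CSV, conn)
```

Keeping each stage in its own function mirrors the way ETL tools separate sources, transformations, and targets, so any one stage can be swapped out without touching the others.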

Historically, the extraction, transformation, and loading (ETL) of data was accomplished either with shell scripts or with an enterprise ETL technology. The largest players among the graphical tools were Informatica, IBM DataStage, and Microsoft SSIS. These enterprise-level technologies worked very well for structured/tabular data where data volume was small by today’s standards. They relied on predefined sources for data extraction and transformation, and the transformed data was traditionally pushed to a database or file system.

In the last decade we have witnessed dramatic changes in the way data is generated and consumed. Companies are seeing massive growth in the velocity, variety, and volume (3V) of data. The 3V explosion triggered innovation in data ingestion and transformation processes at scale, which effectively revolutionized the data world with big data technology. This 3V expansion drove the evolution of data warehouses and gave rise to new branches of analytics platforms such as data lakes.

Data Lakes and Data Warehouses

A data lake is a repository of raw data: structured, semi-structured, and unstructured. All this raw data is pulled together and stored in a single repository where it is easily accessible. In contrast, a data warehouse is a repository of structured, filtered, and possibly aggregated data that has a defined purpose. A data warehouse is a single source of historical and current information about business transactions and processes. A data lake is often used heavily by data scientists to build scoring or prediction models for advanced analytics, whereas a data warehouse is built for business organizations to gain actionable insights into business processes.
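The contrast can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a reference architecture: the `lake` list, the record shapes, and the `sales_by_store` table are all hypothetical, and the "interpret on read" framing is one common way to describe the difference rather than something defined in this article.

```python
import json
import sqlite3

# Data lake (sketch): raw records of any shape are kept as-is.
lake = [
    json.dumps({"type": "sale", "store": "north", "amount": 120.0}),
    json.dumps({"type": "sale", "store": "north", "amount": 80.0}),
    json.dumps({"type": "sale", "store": "south", "amount": 50.0}),
    json.dumps({"type": "sensor", "device": "press-7", "temp_c": 71.3}),
    "free-text note: customer called about a late delivery",
]


def read_sales(raw_records):
    """Interpret raw lake records only at the moment they are consumed."""
    for raw in raw_records:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # unstructured text stays in the lake but is skipped here
        if isinstance(record, dict) and record.get("type") == "sale":
            yield record["store"], record["amount"]


# Data warehouse (sketch): a fixed schema holding filtered, aggregated data
# with a defined purpose (here, total sales per store).
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales_by_store (store TEXT PRIMARY KEY, total REAL)"
)

totals = {}
for store, amount in read_sales(lake):
    totals[store] = totals.get(store, 0.0) + amount

warehouse.executemany(
    "INSERT INTO sales_by_store VALUES (?, ?)", sorted(totals.items())
)
warehouse.commit()
```

The lake still holds every original record, including the sensor reading and the free-text note, while the warehouse table contains only the filtered and aggregated view that a business report would query.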

Superior Analytics Environments

Ultimately, what an organization is striving for is a superior analytics environment. I believe that superior analytics environments have the flexibility to adapt to business needs. Analyzing current business processes can be accomplished with a traditional data warehouse or a data lake, which can be leveraged to build predictive models for making strategic business decisions. Having both a data warehouse and a data lake strengthens an organization’s ability to efficiently access useful information from its disparate data sources and differing data structures.

Which Platform is Right For You?

For an organization new to data analytics, choosing whether to implement a data lake or a data warehouse, or both, is an important first step in the data analytics journey. This decision depends on the type of organization you belong to and the data consumption needs of your organization. The decision also depends on your data strategy and the direction your company may want to take with data analytics.

Examples of data needs specific to an organization abound in both the healthcare and manufacturing industries. The healthcare industry has leveraged data warehousing for many years. Improvements in modern healthcare technologies in recent years, however, have produced a huge amount of unstructured image data. Combined with the need to store data in a single repository to prevent data silos, this has made adopting a data lake practically a no-brainer for healthcare organizations.

The manufacturing industry typically has diverse types of data, such as data from IoT sensors and other devices used in manufacturing, and transaction data stored in OLTP/OLAP databases. This data is typically large in both variety and volume. To store it, it is better to implement both a data lake and a data warehouse.

Your organization’s culture drives change. When considering implementing a data strategy, keep in mind the data culture of your organization. This culture includes the decision-making process and the willingness to adopt or improve a data-driven strategy. There are two ways to develop your strategy around the culture. One path is to build a data lake where all forms of data can be stored. Using modern technologies, the data can be queried directly from the data lake. The other path is to establish a traditional data warehouse, which involves a more process-oriented, waterfall approach. There are also scenarios where you may need both: a data lake for scalable and accessible storage, integrated with a data warehouse for data analysis.

Building a data lake is the more agile approach, and I recommend it over building a traditional data warehouse. Keep in mind, however, that a data lake is not free from the complexity and difficulties of maintaining a cohesive data platform.


Abhi is a Senior Data Consultant with proven expertise providing high-value data analytics and predictive solutions. He brings more than 10 years of cross-industry experience in business intelligence, data visualization, and data science. Abhi is also well versed in the subjects and concepts fundamental to data science and predictive modeling. Prior to his consulting work in data analytics, he obtained a Ph.D. in Physics from Lehigh University. He has a wealth of experience designing research experiments and applying the scientific method in data analysis to derive causal relationships. In recent years, Abhi has combined his research and IT consulting experience in data analytics to deliver high-impact predictive solutions for niche business needs.