As data becomes one of the most valuable resources, attention to data engineering has grown dramatically. When I first entered the data engineering realm, I was overwhelmed by the boom of data-focused technologies. After years of experience in this domain, I realized we could only touch the surface if we fixate on those evolving technologies but ignore the core issue. This article was written to help people understand the essence of data engineering so we can focus on the core of the problem instead of trying to win a never-ending race. Innumerable technologies and vendor products rise and fall, but the essence of data engineering remains the same – the core concepts and principles can be applied to any relevant technology. I hope this article’s ideas and principles will stand the test of time. Besides, data engineering is not only a job for data engineers; everyone who works with data can and should understand it.
The Essence of Data Engineering
In the Big Data era, the whole purpose of collecting as much data as possible is to gain value from that data, and data engineering is the method for doing so. At the core, the purpose of data engineering is to make data usable and valuable—to transform source data into a form suitable for a data use case that can extract value from it.
A typical data engineering workflow includes gathering data from source systems, transforming the collected data, and gaining value from the processed data. Below is a high-level overview of data engineering.
The essence of data engineering contains three operations (Data Collection, Data Transformation, and Data Utilization) and one component (Storage).
The Data Engineering Operations
- Data Collection involves gathering data from various sources into a centralized storage system to ensure the data is ready for subsequent processing and analysis.
- Data Transformation is the process of converting raw data into a usable format or structure. The process may involve cleaning, conversion, aggregation, normalization, or any other method that makes the data usable.
- Data Utilization refers to effectively using data and involves delivering the processed data to end-users or applications, enabling analysis, decision-making, modeling, and actionable insights. Data utilization overlaps with the territory of data analytics and data science. However, the boundary between data engineering, data analytics, and data science is blurry.
The Storage Component
Data needs a place to be stored. A storage system is the backbone that supports the entire data lifecycle. It securely and efficiently manages the influx of raw, processed, and used data.
Data Utilization
Although starting with data collection is more intuitive, knowing how data is used helps ensure we fully align with the purpose of data engineering—making data usable and valuable. This foresight avoids unnecessary data collection and ensures that the collected data fits the end goals, improving efficiency and relevance. Among all possible data use cases, data analytics and machine learning are the two most predominant.
Analytics
Data analytics includes discovering useful information, drawing conclusions, and supporting decision-making. Analytics uses statistical methods, reporting tools, and business intelligence tools to extract value from the data. Although the goals of analytics vary by situation, the data used for analytics needs to be accurate and timely (i.e., fresh enough) so that an analytics report is trustworthy and delivered on time.
Machine Learning
Machine learning is a subfield of Artificial Intelligence. As its name implies, it teaches machines to learn, and the essence of machine learning is learning from data. Since the beginning of the AI boom, machine learning has become one of the biggest data consumers and one of the primary use cases in the data world. Understanding how machine learning works helps us serve it better. The following diagram shows a typical machine-learning workflow.
Data must be transformed into the particular format a learning algorithm expects before the model can be trained. Most machine learning algorithms need a large amount of data to train, especially deep learning. However, unlike data analytics, machine learning usually has a higher tolerance for data inaccuracies, and the data sometimes doesn’t need to be served on time (i.e., historical data may be sufficient for training).
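To make the format requirement concrete, below is a minimal sketch (assuming pandas and scikit-learn are installed) that encodes a categorical column and standardizes a numeric one before training a model. The tiny in-memory dataset, column names, and model choice are all made up for illustration.

```python
# A minimal sketch: transform raw records into the numerical format a learning
# algorithm expects, then train a model. All names and values are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical raw records mixing categorical and numeric fields.
raw = pd.DataFrame({
    "plan": ["basic", "pro", "basic", "pro"],
    "usage": [10.0, 250.0, 30.0, 400.0],
    "churned": [0, 1, 0, 1],
})

# Encoding: convert the categorical column into numerical dummy columns.
features = pd.get_dummies(raw[["plan", "usage"]], columns=["plan"])

# Standardization: rescale the numeric column to mean 0 and standard deviation 1.
features[["usage"]] = StandardScaler().fit_transform(features[["usage"]])

# Train a simple model on the transformed data.
model = LogisticRegression().fit(features, raw["churned"])
print(model.predict(features))
```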
(To learn more about machine learning, please refer to the Machine Learning Basics Series)
Data Transformation
The raw data that a source system generates is usually unsuitable for analytics and machine learning. Therefore, it needs to be transformed into a usable form. Any method that makes data usable is considered a data transformation. Below are some common transformation methods.
- Cleaning: removing duplicates or invalid data, correcting errors, and filling in missing values to improve data quality.
- Filtering: selecting a subset of data based on specific criteria or conditions.
- Aggregation: summarizing data, such as calculating averages, sums, or counts.
- Deduplication: removing duplicate data.
- Merge & Join: combining data from different sources or tables to create a unified dataset.
- Flatten: converting a complex data type (e.g., map or struct) into multiple columns of plain data types (e.g., integer and string).
- Normalization: scaling data to a standard range, typically 0 to 1, to ensure consistency.
- Encoding: converting categorical data into numerical formats.
- Standardization: rescaling data to have a mean of 0 and a standard deviation of 1.
- Smoothing: reducing noise in the data to highlight trends, often using techniques like moving averages.
Although we can implement data transformation functions from scratch, handling transformations is challenging when the data volume is enormous. Fortunately, many open-source libraries have been built for this purpose; Apache Spark and Pandas are arguably the two most popular. For example, with Spark, a transformation can be as simple as a single line of code, and Spark handles the rest (see Apache Spark examples).
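As an illustration, here is a minimal PySpark sketch that chains a few of the transformations listed above. It assumes a local Spark installation; the file path and column names are hypothetical.

```python
# Filtering, deduplication, and aggregation expressed as a few chained Spark calls.
# The orders.csv file and its columns are placeholders for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transform-example").getOrCreate()

orders = spark.read.csv("orders.csv", header=True, inferSchema=True)

daily_revenue = (
    orders
    .filter(F.col("status") == "completed")  # filtering: keep only valid records
    .dropDuplicates(["order_id"])            # deduplication
    .groupBy("order_date")                   # aggregation: revenue per day
    .agg(F.sum("amount").alias("revenue"))
)

daily_revenue.show()
```

Spark distributes each of these steps across the cluster, so the same code works whether the file holds a thousand rows or billions.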
These transformations help prepare the data for analytics, machine learning, and other use cases, ensuring the data is clean, consistent, and in a proper format. Data transformation can also be applied at any step of data engineering. For example, when collecting data from a source system, the collection process can filter out invalid data before storing it, so only valid data reaches the storage system.
Data Collection
Data collection is the first step of the data engineering workflow. It involves gathering raw data from various sources such as databases, APIs, logs, and IoT devices. The data is then processed, cleaned, transformed, and ingested into storage systems, which is why this step is also called data ingestion or ETL (extract, transform, load).
Depending on the source system, there are several ways to gather data. The list below includes some common methods most source systems provide.
- Database Query: directly querying databases using SQL or other query languages to retrieve specific data.
- File Transfer: collecting data from flat files such as CSV, JSON, or log files. These files can be transferred via FTP, SFTP, or cloud storage services.
- Streaming: collecting real-time data from sources like IoT devices, sensors, or social media using technologies like Apache Kafka or Amazon Kinesis.
- API (Application Programming Interface): APIs allow applications to communicate and share data. Many web services and applications provide APIs for accessing their data.
After receiving the source data, we can transform it into a desired format and load it into a storage system for further use.
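As a sketch of this flow, the snippet below pulls records from a hypothetical REST API and lands the raw payload in object storage for later transformation. The endpoint, bucket name, and key layout are placeholders, and it assumes the requests and boto3 libraries with valid AWS credentials.

```python
# Collect data from a (hypothetical) API and store the raw payload for later use.
import json
from datetime import date

import boto3
import requests

# Extract: call the source system's API.
response = requests.get("https://api.example.com/v1/orders", timeout=30)
response.raise_for_status()
records = response.json()

# Load: write the raw payload to object storage, partitioned by date.
s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-raw-data-bucket",
    Key=f"orders/raw/{date.today().isoformat()}.json",
    Body=json.dumps(records).encode("utf-8"),
)
```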
Storage
Storage is the cornerstone of data engineering—data must persist throughout its lifecycle. Besides, the storage stage frequently touches on other data engineering stages, such as collection, processing, and even data generation. Therefore, storing data in the context of data engineering is not as simple as saving it to a disk for personal use, especially when the quantity of data is significant.
Traditionally, data is stored in an on-prem storage system in a data center, where an abstraction layer (i.e., software) manages persistent storage media such as disks and SSDs and provides an interface to access them.
With the popularity of the cloud, cloud storage solutions (e.g., Amazon S3) are becoming the new norm. A cloud storage solution adds another abstraction layer on top of data centers across multiple regions. However, the data is still stored on persistent media (e.g., disks) in a data center somewhere in the world.
(To learn more about data centers and storage technologies, please refer to the Brief Introduction of Data Center Technologies article)
Storage Abstractions
Managing data manually and directly on persistent media such as SSDs is tedious and not scalable, so modern storage solutions, whether on-prem or cloud, provide an abstraction layer to simplify and standardize our interactions with data storage systems. A database is a typical example of storage abstraction. Many newer abstractions, such as the data warehouse and the data lake, have also been created; a short sketch after the list below contrasts their schema approaches.
- Data warehouse
A data warehouse is a central data hub used for reporting and analysis. Its data is typically highly formatted and structured for analytics. The data stored in a data warehouse is organized into tables with rows and columns and has a predefined schema, meaning the data structure is defined before the data is stored. Typical data warehouses include Amazon Redshift, Google BigQuery, and Snowflake.
- Data Lake
A data lake is a centralized repository that allows us to store structured, semi-structured, and unstructured data at any scale. Data is stored in its native format, and the schema is defined only when the data is read, not when it is written. This provides the flexibility to store any type of data. Amazon S3, a popular cloud storage service, is widely used to build data lakes.
- Data Lakehouse
A data lakehouse is a newer innovation that combines aspects of the data warehouse and the data lake, offering unified storage with schema flexibility, data management, ACID transactions, and scalability. This hybrid architecture allows users to perform analytics and machine learning on all data, regardless of its structure. Delta Lake and Apache Iceberg are two technologies commonly used to build a data lakehouse.
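To make the schema-on-write versus schema-on-read distinction concrete, here is a minimal PySpark sketch. The paths, column names, and formats are hypothetical, and a real warehouse or lakehouse adds far more (catalogs, transactions, governance) on top.

```python
# Schema-on-write vs. schema-on-read, illustrated with PySpark. Paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("storage-example").getOrCreate()

# Warehouse-style (schema-on-write): declare the table structure up front,
# then persist the data in that structure.
sales_schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", DoubleType()),
])
sales = spark.read.schema(sales_schema).json("lake/raw/sales/")
sales.write.mode("overwrite").parquet("warehouse/sales/")

# Lake-style (schema-on-read): store files in their native format and let Spark
# infer the structure only when the data is read back.
raw_events = spark.read.json("lake/raw/events/")
raw_events.printSchema()
```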
Putting Everything Together
Combining everything described in the previous sections forms a data pipeline. A data pipeline is not a one-time process; it usually needs to run continuously. As a result, a complete data pipeline includes a series of automated processes that move and transform data from various sources to a destination where the outcome can be analyzed and utilized.
A well-defined data pipeline ensures that every aspect—ingestion, processing, storage, orchestration, and monitoring—is carefully planned and integrated. However, like designing software, building a data pipeline is full of challenges and trade-offs. A holistic approach needs to take many situations into account.
Things to Consider
Building a data pipeline involves more than the pipeline itself; we also have to care for the data it produces and manages.
Data Security
A data pipeline generates data that needs to be managed. Because of this, good data security practices must be applied.
- Access must follow the principle of least privilege, which means granting a user or service only the permissions needed to access the data and resources essential to its operations.
- The other side of data security is data privacy. We must respect people’s privacy and comply with regulations such as GDPR and CCPA. As a result, sensitive data must be masked, especially PII (Personally Identifiable Information). Only a privileged person can view the unmasked sensitive data; everyone else sees only the masked value (see the sketch after this list). Following this practice, even if a non-privileged person’s workstation is compromised, sensitive data will not be leaked.
- When granting permission, we must avoid giving permanent permission. All permissions should have a lifespan (i.e., time-to-live or TTL), so the given permission will be revoked after it expires.
- The ability to share data is one of the most significant contributors to data leakage. As a result, the privilege of sharing data must be limited, and the data-sharing activities must be monitored.
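Below is a minimal sketch of the masking idea described above: non-privileged readers get hashed or partially hidden values, while privileged readers see the raw record. The field names and masking rules are illustrative only.

```python
# Mask PII fields for non-privileged readers. Field names and rules are made up.
import hashlib

def mask_email(email: str) -> str:
    """Hide the local part behind a stable hash; keep the domain for analytics."""
    local, _, domain = email.partition("@")
    digest = hashlib.sha256(local.encode("utf-8")).hexdigest()[:10]
    return f"{digest}@{domain}"

def mask_record(record: dict, privileged: bool) -> dict:
    if privileged:
        return record  # privileged users see the raw values
    masked = dict(record)
    masked["email"] = mask_email(record["email"])
    masked["phone"] = "***-***-" + record["phone"][-4:]
    return masked

print(mask_record({"email": "jane.doe@example.com", "phone": "555-123-4567"},
                  privileged=False))
```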
Data Quality
Data quality refers to the condition of a dataset, which has the following aspects:
- Accuracy: data should be correct and free of errors.
- Completeness: all required data should be present without any gaps.
- Consistency: data should be uniform and compatible across different datasets.
- Reliability: data should be trustworthy and reliable.
- Relevance: data should be relevant and applicable to the task.
Good data quality is crucial for any data use case, and data quality checks can be applied to ensure it.
Running data quality checks against datasets is similar to running software unit tests. For example, if we want to ensure the completeness of the ID field in a dataset, we can iterate through the dataset and check whether the ID field is empty. Although we can implement data quality checks ourselves, many tools, such as deequ and Great Expectations, have been created to make this easier.
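Here is a minimal sketch of the ID-completeness check described above, written with pandas; in practice, tools like deequ or Great Expectations let us declare the same rules instead of hand-coding them. The column names and sample data are made up.

```python
# A hand-rolled completeness check: fail the pipeline if any ID is missing or empty.
import pandas as pd

def check_completeness(df: pd.DataFrame, column: str) -> None:
    missing = df[column].isna() | (df[column].astype(str).str.strip() == "")
    if missing.any():
        raise ValueError(f"{int(missing.sum())} record(s) have an empty '{column}' field")

orders = pd.DataFrame({"id": ["a1", "", "a3"], "amount": [10.0, 5.0, None]})
check_completeness(orders, "id")  # raises ValueError: 1 record(s) have an empty 'id' field
```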
Trade-off
When building software, we run unit tests every time something changes to ensure the change does not break anything. Similarly, we can run data quality checks every time data is collected or generated. However, doing so could significantly slow down the data pipeline, because running data quality checks is expensive. Imagine a data pipeline that needs to process one million records every time it runs. If we add a check to ensure there is no empty field, the check must iterate over one million records on every run; that is one additional pass over the data. If we add more checks, more extra passes are performed, and the overall runtime of the pipeline grows significantly. Therefore, data quality checks need to be applied wisely.
Data Integration
A data pipeline may often have multiple data sources. Data integration is the process of combining data from different sources to provide a unified and comprehensive view. For example, a retail company has data stored in multiple systems:
- Sales data in an ERP (Enterprise Resource Planning) system.
- Customer data in a CRM (Customer Relationship Management) system.
- Inventory data in a warehouse management system.
Integrating data from these disparate sources into a single data warehouse allows the company to analyze sales performance, understand customer behavior, and optimize inventory management, all from one place. To do so, data must be consistent, accurate, and usable across the entire dataset. For instance, the user ID format may differ across the source systems, so when integrating the data, the user ID fields from all sources must be transformed into a consistent format.
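The sketch below shows the ID-normalization idea with pandas: two sources use different user ID formats, so both are converted to a canonical form before joining. The DataFrames, formats, and column names are made up for illustration.

```python
# Integrate two sources whose user IDs use different formats.
import pandas as pd

# Hypothetical extracts from a CRM and an ERP system.
crm = pd.DataFrame({"customer_id": ["CUST-001", "CUST-002"], "segment": ["retail", "wholesale"]})
erp = pd.DataFrame({"cust": ["1", "2"], "total_sales": [1200.0, 5400.0]})

# Normalize both ID columns to the same canonical form before joining.
crm["user_id"] = crm["customer_id"].str.replace("CUST-", "", regex=False).str.lstrip("0")
erp["user_id"] = erp["cust"].str.lstrip("0")

unified = crm.merge(erp, on="user_id", how="inner")
print(unified[["user_id", "segment", "total_sales"]])
```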
Data Lifecycle
Data destruction is usually not a concern for a data project. However, regulations like GDPR and CCPA require companies to actively manage data destruction to respect customers’ “right to be forgotten.” As a result, we must know what consumer data we retain and have procedures to destroy it in response to requests and compliance requirements. Besides, removing unnecessary data also reduces storage costs.
Data Lineage
Data lineage refers to recording an audit trail of data’s origins, movements, and transformations from its source to its destination. It provides a detailed record of how data flows through its lifecycle. Data lineage helps with error tracking and debugging of both the data and the services that process it.
In addition, having lineage for customer data allows us to trace where a customer’s data is stored and its dependencies, which is necessary to comply with regulations like GDPR and CCPA.
Scalability
In the real world, the volume of source data a pipeline collects is rarely constant; most of the time, the data a source system generates only grows. Therefore, when designing a data pipeline, we must consider future growth and ensure the infrastructure can handle increased data volume and complexity without compromising performance.
Monitoring and Alerting
Like any software service, a data pipeline runs continuously, so its status needs to be monitored, and an alert needs to be triggered if an operation fails. Monitoring and alerting keep our data processes reliable, efficient, and secure.
Cost Control
With the popularity of cloud solutions, more services are being built and run on the cloud. Most cloud providers use a pay-as-you-go pricing model, or some variant of it, which means the more resources we consume, the more we pay. Therefore, cost and resource consumption must be considered when designing a data pipeline.
Resources can be categorized as computing and storage. Computing refers to the resources on which software executes, such as CPU and memory. Nowadays, most cloud providers offer two types of computing resources: server and serverless. With server-type computing, a virtual machine is allocated with a predefined configuration (e.g., CPU type and memory size), and we run our software on it. With serverless computing, we don’t specify a configuration; our software runs on whatever computing resources the cloud provides. In both cases, we are charged based on the size of the computing resources and the duration of use. For that reason, we should size computing resources appropriately and enable auto-scaling if the option is available.
A data pipeline generates data, usually a lot of it, and that data needs to be stored. Like computing resources on the cloud, most cloud storage services charge based on size and duration. Setting up proper data lifecycle policies, moving cold data to a colder tier, and cleaning up temporary data can save storage costs.
(For more details about optimizing storage costs, please refer to A Guide for Optimizing AWS S3 Storage Cost. Although the article is written for Amazon S3, its ideas can be applied to any other cloud storage platform.)
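As one example of such a policy, the sketch below uses boto3 to move aging processed data to colder S3 storage classes and expire temporary files. The bucket name, prefixes, and day counts are placeholders to adapt to your own retention requirements.

```python
# Define an S3 lifecycle policy: tier down aging data and expire temporary files.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-pipeline-data-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-processed-data",
                "Filter": {"Prefix": "processed/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
            },
            {
                "ID": "expire-temp-data",
                "Filter": {"Prefix": "temp/"},
                "Status": "Enabled",
                "Expiration": {"Days": 7},
            },
        ]
    },
)
```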
Trade-off
Making software development decisions involves trade-offs, and the same is true when optimizing the costs of a data pipeline. For example, using a bigger cluster costs more, but the pipeline may finish faster, so the overall cost might be lower. Similarly, storing precomputed data increases the storage footprint, but having that data available could significantly reduce computing costs; since storage is usually cheaper than computing, the overall cost could still be lower. To sum up, when evaluating costs, every factor of a data pipeline needs to be considered.
Conclusion
All data problems become problems because of the gigantic quantity of data. Even the dumbest approach can sort one hundred records well, but sorting billions of records becomes a complex problem. Numerous new technologies have been invented to solve data problems. However, no matter how many new technologies emerge, the core problem remains the same: making data usable and valuable. Data engineering is the method to achieve that goal, regardless of the tools or technologies used.