Batch Processing
Charles
December 21, 2024
Motivation
Martin Kleppmann’s Designing Data-Intensive Applications is a widely known book in the software community, covering the design of distributed software systems in substantial depth and breadth.
When I originally bought the book a few years ago, I read the first 75% of it and stopped short of the chapters on batch processing and stream processing (chapters 10 and 11).
This was unfortunate for me, because those chapters turned out to be quite relevant to the work I was doing at the time.
In this post, I want to review some of the concepts I learned from recently reading those chapters, and share how they relate to past projects I’ve worked on.
Batch Processing
Batch processing is the offline, asynchronous processing of data: reading input data, processing it, and producing outputs.
The main performance aspect is throughput (i.e. how much data can be processed within a given timeframe). It’s typically not on the path of a user-facing operation (i.e. no user on the website is waiting for a batch processing system to complete).
Use cases extend beyond OLAP business domains such as generating business analytics and reports. Batch processing can also be used for processing big data to train machine learning models, creating a search index, running a payroll system, etc.
MapReduce
MapReduce is a programming model that implements data processing by running computations in parallel across a cluster of machines. The main benefit is that, instead of running data processing on a single machine (bounded by its memory), we can perform the processing across multiple, distributed machines.
The main concepts in MapReduce are the mapper and the reducer. The mapper typically resolves a row in a dataset to a key and value. This key becomes the partition key, which acts like an “address” to send the row to the partition that will run the reducer. The reducer performs some accumulation function (e.g. count, sum, etc.) to reduce the data.
For example, if we need to count the words in an essay, we can divide the essay into lines, count the words in each line, and then bring all the word counts together and sum them.
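To make this concrete, here is a minimal, single-machine sketch of the idea in Python (not the Hadoop API; the mapper/reducer functions and the sample essay are made up for illustration):

```python
from collections import defaultdict

def mapper(line):
    """Map step: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    """Reduce step: sum all counts emitted for the same word."""
    return word, sum(counts)

def word_count(lines):
    # Shuffle step (simplified): group all emitted values by key.
    groups = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            groups[word].append(count)
    return dict(reducer(word, counts) for word, counts in groups.items())

essay = ["the quick brown fox", "the lazy dog"]
print(word_count(essay))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

In a real cluster, the shuffle step is what moves each key to the machine that runs its reducer; here it is just a dictionary on one machine.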
Hadoop, HDFS, YARN
Like me, you may be curious what Hadoop, HDFS, and YARN mean.
MapReduce is part of the larger Hadoop ecosystem, which includes other open source modules, namely the Hadoop Distributed File System (HDFS) and YARN. HDFS is a distributed file system, similar to Amazon S3, which can store files of various formats. YARN, short for “Yet Another Resource Negotiator”, is the job scheduling and resource management system. YARN is interesting in its own right, but we won’t get to it here.
Batch Processing Problems
There are two main problems in batch processing:
- Partitioning (i.e. how to partition data and bring it to the same machine to process efficiently, without incurring too much network cost by shuffling data)
- Fault tolerance (i.e. how to recover from single node failures and resume processing)
In the case of partitioning, we want to avoid shuffling, especially when data is not mapped or partitioned correctly. There can also be data skew, where one partition has to handle an abnormally large amount of data, resulting in “straggler tasks” that delay the overall job.
Join Algorithms
The join algorithms mentioned in the book are sort-merge join, broadcast hash join, and partitioned hash join. Specific details are not included here to avoid going too deep into these topics, though a small sketch of the simplest one follows to give the flavor.
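As a rough intuition for the broadcast hash join, here is a minimal in-memory sketch in Python (the table contents and field names are hypothetical, and this is not how any particular framework implements it): the smaller dataset is loaded into a hash table, and the larger dataset is streamed past it.

```python
def broadcast_hash_join(small_table, large_table, key):
    """Join two lists of dicts on `key`, assuming small_table fits in memory.

    In a real cluster, `lookup` would be broadcast to every worker, and each
    worker would stream its own partition of `large_table` through it.
    """
    lookup = {row[key]: row for row in small_table}  # build the hash table once
    for row in large_table:                          # stream the big side
        match = lookup.get(row[key])
        if match is not None:
            yield {**match, **row}

users = [{"user_id": 1, "name": "Ada"}, {"user_id": 2, "name": "Bob"}]
events = [{"user_id": 1, "event": "click"}, {"user_id": 1, "event": "view"}]
print(list(broadcast_hash_join(users, events, "user_id")))
```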
Apache Spark, Apache Flink, and Others
While MapReduce is easy to understand from a theoretical point of view, it is apparently difficult to build applications with directly.
For instance, one has to write the join algorithms from scratch. As a result, the industry built abstractions on top of it, such as Hive, Cascading, Pig, etc. However, even with these abstractions, some performance problems of MapReduce remained, and they were addressed by newer frameworks such as Apache Spark and Apache Flink.
Some of MapReduce’s problems are:
- Full materialization of intermediate state (i.e. data that is produced as output and consumed as input only once is still written entirely to disk, taking up unnecessary storage).
- A default sorting operation between the map and reduce stages, even when sorting is not needed.
- Unnecessary mappers that could have been folded into the previous reducer.
- Storing intermediate state on disk rather than in memory.
- Having to wait for a job to finish all of its stages before the next stage can start.
Apache Spark and Apache Flink are called “dataflow engines”, as they model the flow of data in a directed acyclic graph (DAG), using function operators. Function operators are a more generalized form of map and reduce. Instead of writing multiple MapReduce jobs, one can write one Spark/Flink job that encompasses an entire workflow of data processing.
This leads to some remarkable benefits, such as: scheduling optimizations (the job scheduler can decide to reuse a partition rather than recompute the same data and copy it over the network), fewer I/O writes to disk, less time blocked on previous jobs (i.e. operators can start as soon as their inputs are ready), and reusing JVM processes to reduce startup time (whereas MapReduce starts a new JVM for each task), among other benefits.
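As a hedged sketch of what this looks like in practice (assuming PySpark is available locally; the input and output paths are hypothetical), a whole map/filter/reduce workflow can be expressed as a single Spark job, which Spark compiles into one DAG of operators:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-count-dag").getOrCreate()
sc = spark.sparkContext

# One job expresses the whole workflow; Spark builds a DAG of operators from it.
counts = (
    sc.textFile("essays/*.txt")                      # hypothetical input path
      .flatMap(lambda line: line.lower().split())    # line -> words
      .filter(lambda word: word.isalpha())           # drop punctuation/numbers
      .map(lambda word: (word, 1))                   # word -> (word, 1)
      .reduceByKey(lambda a, b: a + b)               # sum counts per word
)

counts.saveAsTextFile("output/word_counts")          # hypothetical output path
spark.stop()
```

The equivalent MapReduce workflow would likely be several chained jobs, each writing its full intermediate output to HDFS before the next could start.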
Fault Tolerance
What happens when a task fails?
In MapReduce, since intermediate data is saved to disk (HDFS), a failed task can be restarted and re-computed from its input data.
In Spark or Flink, failures are handled through re-computation. Spark keeps track of the inputs and the operators applied to the data by tracking the data’s ancestry (lineage) in a resilient distributed dataset (RDD). Flink, on the other hand, performs checkpoints of operator state.
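For illustration, here is a minimal PySpark sketch (reusing the `sc` context from the earlier example; the transformations themselves are arbitrary) that inspects the lineage Spark records; if a partition is lost, Spark re-applies just these transformations to its source partition to rebuild it:

```python
rdd = (
    sc.parallelize(range(1_000_000))     # source data
      .map(lambda x: x * 2)              # transformation 1
      .filter(lambda x: x % 3 == 0)      # transformation 2
)

# The lineage (ancestry) Spark tracks for fault tolerance.
lineage = rdd.toDebugString()
print(lineage.decode() if isinstance(lineage, bytes) else lineage)
```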
What I’ve Worked On
I’ve been fortunate enough to work on a few projects that focus on batch processing.
- Batch processing of images for annotation purposes (i.e. find bounding box in this image).
- Batch processing to create a non-personalized news feed.
- Batch processing of generating content or assets for creating short videos (i.e. call lots of APIs to composite a video).
- Batch processing of training data used for training machine learning models for product recommendations (i.e. refresh daily datasets and create features)
A common thread between these projects:
- Using a workflow orchestration framework, such as AWS Step Functions or Apache Airflow, to orchestrate a series of tasks (see the sketch after this list).
- Using a distributed file system, such as Amazon S3, to store file inputs/outputs.
- Using a NoSQL or SQL database, such as Amazon DynamoDB or Amazon Redshift, to store data that is queried by online services supporting customer-facing websites.
- Using a notification mechanism, such as AWS SNS, to signal a “new update” to downstream consumers that need to consume the new dataset.
- Using a task queue, such as AWS SQS, to decouple services or allow jobs to be processed asynchronously.
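To make the orchestration piece concrete, here is a hedged sketch of an Apache Airflow DAG (assuming Airflow 2.4+ where the `schedule` parameter is available; the DAG id, task names, and task bodies are hypothetical) that chains a daily ingest → process → publish pipeline:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical task bodies; in practice these would read from S3, run the
# batch job, write results to DynamoDB/Redshift, and publish an SNS message.
def ingest():
    print("download today's raw data from S3")

def process():
    print("run the batch transformation")

def publish():
    print("write outputs and notify downstream consumers via SNS")

with DAG(
    dag_id="daily_batch_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",       # run once per day
    catchup=False,           # don't backfill past runs
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    process_task = PythonOperator(task_id="process", python_callable=process)
    publish_task = PythonOperator(task_id="publish", python_callable=publish)

    ingest_task >> process_task >> publish_task   # declare task ordering
```

The `>>` operator only declares ordering between tasks; the scheduler decides when each task actually runs.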
As for data processing libraries, there are many flavors of frameworks in this domain, and I’ve used a handful of them explicitly in projects. The domain continues to evolve rapidly, so I would not be surprised if these tools are no longer in trend.
Conclusion
Overall, batch processing seems like a common approach to solving many business use cases.
Whenever we need to ingest new data and create new datasets at a regular cadence (e.g. daily, monthly, yearly), batch processing is likely the solution. If you need to support more frequent updates or near real-time use cases, check out stream processing, which is a close cousin of batch processing but deserves a topic of its own. Perhaps I can cover it in another post.
I hope you were able to learn something new about batch processing. And if, like me, you stopped reading the book before these chapters, this post hopefully summarized some of the content so you can decide whether you want to gather more details from the book itself!