Leaky abstractions in Apache Spark
Why understanding Spark's internals is crucial for using it optimally.
Apache Spark has become the de-facto data processing engine for Big Data use cases. Although Spark provides great abstractions, there are times when an understanding of its internals is essential for using it optimally. This post lists ways Spark exposes leaky abstractions.
Law of leaky abstractions
The law of leaky abstractions is a principle in software engineering which states that all non-trivial abstractions are leaky. It is impossible to create a perfect abstraction: even if your software is flawless, it is not immune to natural disasters, electric-grid failures, and the like. Still, users have reasonable expectations of any abstraction, e.g., useful error messages on failure, reasonable performance, and correctness.
Programming Model
Spark started as a replacement for Hadoop MapReduce. It solved the problems of disk-based data exchange and limited expressivity, and provided fault tolerance without data redundancy¹. The solution was an abstraction called the Resilient Distributed Dataset (RDD): an in-memory, distributed (partitioned across machines), immutable collection of records.
RDDs were designed to support the common functional operations (map, reduce, filter, etc.) found on collections, so it is easy to think of them as just another collection (List, Array).
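As a minimal sketch of that parallel, assuming a spark-shell session that provides a SparkContext named sc (the values and variable names are made up for illustration):

```scala
// Assumes a spark-shell session, which provides a SparkContext named `sc`.
val words = List("spark", "leaky", "abstraction", "rdd")

// Local collection: operations run eagerly inside the driver JVM.
val longLocal = words.map(_.length).filter(_ > 4)

// RDD: the same operators, but lazy and partitioned across the cluster.
val wordsRdd = sc.parallelize(words)
val longDistributed = wordsRdd.map(_.length).filter(_ > 4) // nothing executes yet
longDistributed.collect()                                  // the action triggers a job
```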
For example, consider checking whether a collection is empty in Scala: myList.size == 0 and myList.isEmpty perform about the same. But rdd.count() == 0 and rdd.isEmpty do not. count() runs a job over every partition because an RDD has no pre-computed size, while isEmpty only needs to find a single record.
This is a leaky abstraction: rdd.count() == 0 and rdd.isEmpty are logically equivalent, so the natural expectation is comparable performance, yet Spark's distributed, lazy execution makes one far more expensive than the other.
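A hedged sketch of the difference, assuming an existing SparkContext sc and a placeholder input path:

```scala
// Assumes an existing SparkContext `sc`; the input path is a placeholder.
val rdd = sc.textFile("/data/large-input")

// Runs a full job over every partition just to produce a number.
val emptyViaCount = rdd.count() == 0

// Only needs to find one record, so it can stop early
// (internally roughly a take(1) check on the first partition(s)).
val emptyViaIsEmpty = rdd.isEmpty()
```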
As the project evolved, the DataFrame API was created, which paved the way for Spark SQL. Spark SQL abstracts away many details, but it has similar problems. Here is my previous article with an example of equivalent queries that perform differently.
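The linked article has its own example; as a stand-in, here is a hedged sketch of two logically equivalent DataFrame filters that produce different physical plans, assuming a hypothetical Parquet dataset at /data/events with a status column. A built-in expression lets Catalyst push the predicate into the scan, while a Scala UDF is opaque to the optimizer:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().appName("equivalent-queries").getOrCreate()
val events = spark.read.parquet("/data/events") // hypothetical dataset

// Built-in expression: Catalyst can push the predicate down into the Parquet scan.
val q1 = events.filter(col("status") === "ERROR")

// Scala UDF: logically the same filter, but opaque to the optimizer,
// so rows are fully read and deserialized before the predicate runs.
val isError = udf((s: String) => s == "ERROR")
val q2 = events.filter(isError(col("status")))

q1.explain() // compare the physical plans
q2.explain()
```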
Configuration
Configuring a Spark job can be complex because of the sheer number of settings. At a minimum, you typically have to size the memory of the worker JVMs (executors) and choose the number of tasks that run concurrently on each executor.
Optimal values for these settings depend on:
The code being executed
Number of input files (which determines the number of partitions)
Size of the files (i.e., the volume of data)
File format (CSV, JSON, Parquet)
Data distribution
In addition, there are other non-trivial considerations like compression, serialization, and observability.
Spark does not hide the complexity of understanding and tuning these configurations. Developers need at least a mental model of Spark's internals to utilize the cluster fully.
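As an illustrative sketch only (the numbers are placeholders, not recommendations; the right values depend on the factors listed above), these knobs typically end up as SparkSession configuration like this:

```scala
import org.apache.spark.sql.SparkSession

// Placeholder values; tune based on the code, input size/format, and data distribution.
val spark = SparkSession.builder()
  .appName("tuned-job")
  .config("spark.executor.memory", "8g")          // heap for each executor JVM
  .config("spark.executor.cores", "4")            // concurrent tasks per executor
  .config("spark.sql.shuffle.partitions", "400")  // tasks per shuffle stage
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()
```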
Error Handling
Spark errors are difficult to debug if you do not understand the internals. Here is an excellent post on the unintuitive errors a Spark application can throw.
Code performance, cluster configuration, and error handling are three aspects of leaky abstractions in Spark.
Comparison with other SQL-based data systems
Now that we have reviewed the challenges with Spark, let's contrast it with other SQL-based systems. SQL databases such as PostgreSQL, Oracle DB, and Snowflake have been around for a long time. Note that Spark SQL is only a processing engine, while SQL databases are both processing and storage engines.
SQL databases offer better abstractions for performance precisely because they are less flexible. For example, in a SQL database, users have little say over the physical layout of the data, and the database engine is tied to its catalog. In contrast, Spark can read file formats like CSV, JSON, and Parquet, and supports any Hive-compatible metastore (catalog).
Less flexibility -> fewer configurations -> better abstractions
Examples of these abstractions are indexes and partitions, which structure the data optimally for the expected query patterns.
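To make the contrast concrete, here is a hedged sketch (hypothetical paths and column names, assuming an existing SparkSession spark): in Spark the user picks the physical layout explicitly at write time, whereas a SQL database hides that choice behind declarative DDL such as CREATE INDEX.

```scala
// Assumes an existing SparkSession `spark`; paths and columns are hypothetical.
// In Spark, the physical layout is the user's responsibility: the on-disk
// partitioning is chosen explicitly at write time.
val orders = spark.read.parquet("/data/orders_raw")
orders.write
  .partitionBy("order_date") // directory-level partitioning to serve date filters
  .parquet("/data/orders_by_date")

// A SQL database hides the same decision behind declarative DDL, e.g.
// "CREATE INDEX idx_orders_date ON orders (order_date)".
```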
SQL databases do have some challenges similar to Spark SQL's: different ways of writing a query can lead to different query plans, resulting in different performance. Example
There are no perfect abstractions.
Support my work by subscribing. Thanks.
Resources
RDD Research Paper: https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
Spark errors: https://medium.com/@yhoso/resolving-weird-spark-errors-f34324943e1c
The Law of Leaky Abstractions: https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/
What Are Abstractions in Software Engineering with Examples? https://thevaluable.dev/abstraction-type-software-example/