Apache Spark Optimization: Don’t count your chickens
“Don’t count your chickens before they hatch” is not just wise advice for life; it is also a great recommendation for writing performant Spark SQL jobs.
Let’s take an example where we need to count the number of managers, engineers, and sales professionals in an employee dataset. We could do something like this:
employees_df = spark.sql("select * from employees")
num_managers = employees_df.filter("employee_type = 'manager'").count()
num_engg = employees_df.filter("employee_type = 'engineer'").count()
num_sales = employees_df.filter("employee_type = 'sales_professional'").count()
However, this leads to three separate Spark jobs, and your data is essentially scanned three times, once per employee type.
There has to be a better way!
So… Spark can be lazy. Spark has a good reason to be, unlike me 😀
The lowest level of abstraction that Apache Spark provides is the RDD (Resilient Distributed Dataset). Spark SQL code is ultimately converted into RDD operations.
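You can peek at this yourself: every DataFrame exposes the RDD it is built on (a quick sketch, assuming the employees_df from the example above):

# employees_df is just a structured view over an underlying RDD of Row objects
print(employees_df.rdd)  # prints something like MapPartitionsRDD[4] at javaToPython ...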
RDDs support two types of operations:
Transformations: Operations that transform one RDD into another. In our example, that is the RDD code generated for the SQL query and the filter() calls.
Actions: Operations that trigger the transformations to actually execute, in order to compute a statistic like a count or a sum, or to show the data. In our example, those are the count() calls (see the snippet below).
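To make the distinction concrete, here is a minimal sketch (reusing the employees_df defined earlier): the filter() returns immediately without touching any data, while count() kicks off an actual Spark job.

managers_df = employees_df.filter("employee_type = 'manager'")  # transformation: returns instantly, nothing runs yet
num_managers = managers_df.count()  # action: triggers a job that scans the data and applies the filter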
Spark does not process any transformations until it sees an action, because it's lazy. This makes sense: by the time the action is called, Spark knows everything you wish to compute and can optimize the execution of all the transformations together. The later you call an action, the better (i.e. the later you count your chickens, the better).
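One way to watch this laziness: explain() prints the plan Spark has built from your transformations without executing any of them (again assuming the employees_df from above).

# Chaining transformations only builds up a query plan...
plan = employees_df.filter("employee_type = 'manager'").select("employee_type")
# ...and explain() prints that plan without reading a single row
plan.explain()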
The above example can now be rewritten as below to reduce the number of actions to one.
spark.sql("select count_if(employee_type = manager) as mgr_count,
count_if(employee_type = engineer) as engg_count,
count_if(employee_type = sales_professional) as sales_count
from employees").first()
With the above code, the only action executed is the call to first(). This query gives Spark enough information to do just one pass over your data and calculate all of the counts you need.
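If you prefer the DataFrame API over SQL strings, the same single-pass aggregation can be expressed with the classic conditional-count pattern; here is a sketch reusing the employees_df from earlier (count() ignores nulls, and when() without otherwise() yields null for non-matching rows, so each count() only tallies its own employee type):

from pyspark.sql import functions as F

counts = employees_df.agg(
    F.count(F.when(F.col("employee_type") == "manager", 1)).alias("mgr_count"),
    F.count(F.when(F.col("employee_type") == "engineer", 1)).alias("engg_count"),
    F.count(F.when(F.col("employee_type") == "sales_professional", 1)).alias("sales_count"),
).first()

Either way, Spark scans the employee data just once and maintains three conditional counters as it goes.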