Apache Spark is an open-source unified analytics and data processing engine for big data. Its capabilities include near real-time and batch computations distributed across a cluster of machines.
Simply put, a Spark Job is a single unit of computation that gets instantiated to complete a Spark Action. A Spark Action, in turn, is a single operation requested by a given Spark Driver. Finally, the Spark Driver is the complete data-processing application for a specific use case, which orchestrates the processing and distributes it across the cluster. Each Job is divided into “stages” that produce intermediate results, and each stage is divided into one or more tasks.
In other words, a Spark Job is simply the unit of execution Spark uses to give you the maximum possible control over cluster affinity and the parallelization of resources.
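A minimal PySpark sketch of that hierarchy (the application name and data are illustrative): the driver builds the execution plan, the action submits a job, the shuffle introduced by the wide transformation splits it into stages, and each stage runs one task per partition.

```python
from pyspark.sql import SparkSession

# The driver program: it builds the plan and coordinates the cluster.
spark = SparkSession.builder.appName("job-hierarchy-demo").getOrCreate()

df = spark.range(1_000_000)                                   # plan only, nothing runs yet
grouped = df.groupBy((df.id % 10).alias("bucket")).count()    # wide transformation adds a stage

grouped.collect()   # the action: one job is submitted, split into stages and per-partition tasks

spark.stop()
```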
By default, Spark’s scheduler executes these Jobs in FIFO (first in, first out) order: the first job gets priority on all cluster resources. In recent versions of Spark, it is also possible to configure fair sharing, which assigns tasks between jobs in a round-robin fashion so that all jobs get an approximately equal share of cluster computing resources.
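A minimal sketch, assuming PySpark, of enabling fair sharing instead of the default FIFO mode; the same option can also be passed to spark-submit as --conf spark.scheduler.mode=FAIR.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("fair-scheduling-demo")
    .config("spark.scheduler.mode", "FAIR")   # round-robin sharing between jobs instead of FIFO
    .getOrCreate()
)
```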
Spark can run on different cluster technologies such as Apache Mesos, Hadoop YARN, or Kubernetes. Through a cluster manager, Spark takes care of distributing a Job’s workload among the cluster nodes.
Most cloud storage providers also offer Spark integration through dedicated connectors: Amazon S3 has the “s3a” connector, Azure Storage the “wasb” connector, and Google Cloud Storage the “gs” connector, making Spark an all-round computational engine for modern distributed data-warehousing systems.
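Given an existing SparkSession named `spark`, reading through these connectors only changes the URI scheme. The bucket and container names below are hypothetical, and the matching connector JARs and credentials must already be configured.

```python
# Amazon S3 through the s3a connector (hypothetical bucket).
events_s3 = spark.read.parquet("s3a://my-bucket/events/")

# Google Cloud Storage through the gs connector (hypothetical bucket).
events_gcs = spark.read.parquet("gs://my-bucket/events/")

# Azure Storage through the wasb(s) connector (hypothetical container and account).
events_az = spark.read.parquet("wasbs://container@account.blob.core.windows.net/events/")
```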
Advantages
Spark, and Spark Jobs specifically, offer many advantages, mostly related to speed and ease of use. Beyond those, Spark Jobs provide the following:
- Lazy Evaluation: Spark triggers processing only when a specific Spark Action is run. Operations on data are classified into “transformations” and “actions”, which lets Spark optimize overall performance by evaluating transformations only when an action actually needs their results (see the first sketch after this list).
- Easy Parallelism: Spark Jobs are easily configured to run in parallel. Data can be split into several partitions so that they are processed independently and in parallel (see the first sketch after this list).
- Caching of Intermediate Results: Spark can reuse intermediate results: if it detects that the output of a previous computation is still available, it skips re-executing that part of the workload. It is also possible to cache or persist intermediate results explicitly, with different storage levels, so that several jobs can share them (see the caching sketch after this list).
- Highly Configurable Computational Tasks: We have seen that Jobs are executed in FIFO order by default. When the fair scheduler is enabled, jobs can also be grouped into pools, each with its own weight and scheduling options, an approach modeled after the Hadoop Fair Scheduler. This makes it possible to create a pool of “high-priority” jobs whose analyses must be computed first, or to group each user’s jobs together (see the pool sketch after this list).
- Third-Layer Application Servers Available: Despite the great parallelization Spark Jobs offer for a single workload, Spark itself is not well suited to serving many concurrent requests. Multiple concurrent requests for data analysis should instead be handled by interposing a third-layer application server such as Apache Livy or Apache Hadoop between the clients and the Spark context.
- Resiliency: The distribution of Jobs among nodes relies on resilient distributed datasets (RDDs), which are designed to tolerate the failure of any worker node in the cluster and keep data loss close to zero.
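A minimal sketch of the lazy-evaluation and parallelism points above, assuming an existing SparkSession named `spark`: the transformations are only recorded, the action triggers the actual job, and the number of partitions controls how many tasks run in parallel.

```python
rdd = spark.sparkContext.parallelize(range(1_000_000), numSlices=8)  # 8 partitions

doubled = rdd.map(lambda x: x * 2)               # transformation: recorded, not executed
filtered = doubled.filter(lambda x: x % 3 == 0)  # still nothing has run on the cluster

total = filtered.sum()                           # action: 8 tasks execute in parallel
print(filtered.getNumPartitions(), total)
```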
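A sketch of explicit caching; the input path and the `value` column are hypothetical. Persisting an intermediate result lets two downstream actions reuse it instead of recomputing the whole lineage.

```python
from pyspark import StorageLevel

cleaned = (
    spark.read.json("s3a://my-bucket/raw/")      # hypothetical input
    .dropna()
    .persist(StorageLevel.MEMORY_AND_DISK)       # keep the intermediate result around
)

print(cleaned.count())                           # first action materialises and caches the data
print(cleaned.where("value > 10").count())       # second action reads from the cache

cleaned.unpersist()                              # release the cached blocks when done
```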
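A sketch of assigning jobs to a named fair-scheduler pool; the pool name and data path are hypothetical, the pool’s weight would be declared in a fairscheduler.xml file referenced by spark.scheduler.allocation.file, and fair scheduling is assumed to be enabled as shown earlier.

```python
# Jobs submitted from this thread now land in the "high-priority" pool.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "high-priority")

report = spark.read.parquet("s3a://my-bucket/sales/").groupBy("region").count()
report.collect()  # this job is scheduled inside the high-priority pool

# Clear the property to fall back to the default pool.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", None)
```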
Challenges
Adopting a distributed unified analytics and data processing engine comes with different challenges that need to be considered carefully.
- Optimization: Optimizing a Spark Job means taking care of many aspects of processing, which can be difficult. Splitting datasets into multiple partitions so that computation can run in parallel is not always easy, since individual records are often heavily correlated and cannot be broken apart. Caching intermediate results should also be weighed carefully, as caching huge amounts of data can consume a lot of cluster resources.
- Harder to Read and Write: When integrating Apache Spark with distributed data storage, it’s important to remember that object stores do not behave like filesystems; they are significantly different, and reading and writing data can be significantly slower than working with a normal filesystem.
- Out of Memory Issues: Memory issues are among the most frequent problems when designing Spark applications. Driver memory can be exhausted by large collects or broadcasts, while executor memory issues are typically caused by processing very large partitions or by configuring very high concurrency (see the sizing sketch after this list).
- Configuring the Number of Executors: When designing an Apache Spark cluster, particular attention should be paid to how a job’s workload is divided among executors. Too many executors with little memory each can easily lead to out-of-memory problems, while too few executors make it impossible to parallelize workloads properly.
- Consistency: Even though the Amazon S3, Google Cloud, and Microsoft Azure object stores are consistent (meaning that a file can be read immediately after it has been written), none of the store connectors guarantees how its clients cope with objects that are overwritten while a stream is reading them. Spark can therefore produce inconsistent results if the cloud provider’s objects are overwritten mid-read.
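A hedged sketch of sizing the driver and executors from application code; the values are illustrative, and in practice these options are usually set through spark-submit (--driver-memory, --num-executors, --executor-memory, --executor-cores) or the cluster manager, since driver memory in particular must be fixed before the driver JVM starts.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("sizing-demo")
    .config("spark.driver.memory", "4g")       # guards against driver OOM from large collects/broadcasts
    .config("spark.executor.instances", "6")   # too few executors limits parallelism
    .config("spark.executor.memory", "8g")     # too little memory per executor risks OOM on big partitions
    .config("spark.executor.cores", "4")       # very high per-executor concurrency also strains memory
    .getOrCreate()
)
```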
With all this underlying, fine-grained division among computational units, even a simple job execution can become genuinely hard to configure. Spark Jobs are not well suited to simple data-intensive computations, as the configurability that comes with them is excessive for the vast majority of such use cases.