Databricks Certified Associate Developer for Apache Spark 3.0 Exam
Certified Associate Developer for Apache Spark

Updated On: 26-Jan-2026

Which of the following describes Spark actions?

  A. Writing data to disk is the primary purpose of actions.
  B. Actions are Spark's way of exchanging data between executors.
  C. The driver receives data upon request by actions.
  D. Stage boundaries are commonly established by actions.
  E. Actions are Spark's way of modifying RDDs.

Answer(s): C

Explanation:

The driver receives data upon request by actions.
Correct! Actions trigger the distributed execution of tasks on executors which, upon task completion, transfer result data back to the driver.
Actions are Spark's way of exchanging data between executors.
No. In Spark, data is exchanged between executors via shuffles.
Writing data to disk is the primary purpose of actions.
No. The primary purpose of actions is to access data that is stored in Spark's RDDs and return the data, often in aggregated form, back to the driver.
Actions are Spark's way of modifying RDDs.
Incorrect. Firstly, RDDs are immutable – they cannot be modified. Secondly, Spark generates new RDDs via transformations and not actions.
Stage boundaries are commonly established by actions.
Wrong. A stage boundary is commonly established by a shuffle, for example caused by a wide transformation.
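
To make the distinction concrete, here is a minimal PySpark sketch (the session name, app name, and DataFrame are made up for illustration): a transformation only declares work, while an action such as collect() or count() actually runs it and returns data to the driver.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("actions-demo").getOrCreate()

df = spark.range(100)  # DataFrame with a single "id" column

# Transformation: lazily declares the computation; nothing is executed yet.
doubled = df.selectExpr("id * 2 AS doubled")

# Actions: trigger execution on the executors and return results to the driver.
rows = doubled.collect()   # list of Row objects, materialized on the driver
total = doubled.count()    # a single number, returned to the driver
```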



Which of the following statements about executors is correct?

  A. Executors are launched by the driver.
  B. Executors stop upon application completion by default.
  C. Each node hosts a single executor.
  D. Executors store data in memory only.
  E. An executor can serve multiple applications.

Answer(s): B

Explanation:

Executors stop upon application completion by default.
Correct. Executors only persist during the lifetime of an application.
A notable exception to that is when Dynamic Resource Allocation is enabled (which it is not by default). With Dynamic Resource Allocation enabled, executors are terminated when they are idle, independent of whether the application has been completed or not.
An executor can serve multiple applications.
Wrong. An executor is always specific to the application. It is terminated when the application completes (exception see above).
Each node hosts a single executor.
No. Each node can host one or more executors.
Executors store data in memory only.
No. Executors can store data in memory or on disk.
Executors are launched by the driver.
Incorrect. Executors are launched by the cluster manager on behalf of the driver.
More info: Job Scheduling - Spark 3.1.2 Documentation; How Applications are Executed on a Spark Cluster | Anatomy of a Spark Application | InformIT; and Spark Jargon for Starters (Mageswaran D, Medium)
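
As a hedged sketch of the Dynamic Resource Allocation behaviour mentioned above, the feature can be switched on via standard Spark properties when the session is created (the app name, timeout, and executor counts below are arbitrary example values):

```python
from pyspark.sql import SparkSession

# Dynamic Resource Allocation is disabled by default. With it enabled,
# idle executors can be released before the application has completed.
spark = (
    SparkSession.builder
    .appName("dynamic-allocation-demo")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")  # example value
    .config("spark.dynamicAllocation.minExecutors", "1")           # example value
    .config("spark.dynamicAllocation.maxExecutors", "10")          # example value
    .config("spark.shuffle.service.enabled", "true")  # usually needed so shuffle data survives executor removal
    .getOrCreate()
)
```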



Which of the following describes a valid concern about partitioning?

  A. A shuffle operation returns 200 partitions if not explicitly set.
  B. Decreasing the number of partitions reduces the overall runtime of narrow transformations if there are more executors available than partitions.
  C. No data is exchanged between executors when coalesce() is run.
  D. Short partition processing times are indicative of low skew.
  E. The coalesce() method should be used to increase the number of partitions.

Answer(s): A

Explanation:

A shuffle operation returns 200 partitions if not explicitly set.

Correct. 200 is the default value for the Spark property spark.sql.shuffle.partitions. This property determines how many partitions Spark uses when shuffling data for joins or aggregations.
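A small sketch of this default (assuming an active SparkSession named spark, and adaptive query execution not coalescing the shuffle partitions):

```python
# The default number of partitions produced by a shuffle.
spark.conf.get("spark.sql.shuffle.partitions")   # '200' unless explicitly set

df = spark.range(1000)
aggregated = df.groupBy((df.id % 10).alias("bucket")).count()  # groupBy causes a shuffle
aggregated.rdd.getNumPartitions()                # 200 with the default setting
```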
The coalesce() method should be used to increase the number of partitions.
Incorrect. The coalesce() method can only be used to decrease the number of partitions.
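For illustration, a quick sketch contrasting coalesce() and repartition() on a made-up DataFrame (again assuming an active SparkSession named spark):

```python
df = spark.range(1000).repartition(8)        # start with 8 partitions
df.rdd.getNumPartitions()                    # 8

df.coalesce(2).rdd.getNumPartitions()        # 2  -- coalesce() can decrease
df.coalesce(16).rdd.getNumPartitions()       # 8  -- coalesce() cannot increase
df.repartition(16).rdd.getNumPartitions()    # 16 -- repartition() can increase (full shuffle)
```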
Decreasing the number of partitions reduces the overall runtime of narrow transformations if there are more executors available than partitions.
No. For narrow transformations, fewer partitions usually result in a longer overall runtime when more executors are available than partitions.
A narrow transformation does not include a shuffle, so no data needs to be exchanged between executors. Shuffles are expensive and can be a bottleneck for executing Spark workloads.
Narrow transformations, however, are executed on a per-partition basis, occupying one executor per partition. So it matters how many executors are available to perform work in parallel relative to the number of partitions. If the number of executors is greater than the number of partitions, some executors sit idle while others process the partitions. On the flip side, if the number of executors is smaller than the number of partitions, the entire operation can only finish after some executors have processed multiple partitions, one after the other. To minimize the overall runtime, one would want the number of partitions to equal the number of executors (but not more).
So, for the scenario at hand, increasing the number of partitions reduces the overall runtime of narrow transformations if there are more executors available than partitions.
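A hedged sketch of that tuning idea, using the cluster's default parallelism as a stand-in for the number of available executor slots (assuming an active SparkSession named spark):

```python
# Roughly the total number of cores available across the executors.
parallelism = spark.sparkContext.defaultParallelism

# Match the partition count to the available parallelism so that every slot
# is busy and no partition has to wait for a free executor.
df = spark.range(1_000_000).repartition(parallelism)
```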
No data is exchanged between executors when coalesce() is run.
No. While coalesce() avoids a full shuffle, it may still cause a partial shuffle, resulting in data exchange between executors.
Short partition processing times are indicative of low skew.
Incorrect. Data skew means that data is distributed unevenly over the partitions of a dataset. Low skew therefore means that data is distributed evenly.
Partition processing time, the time that executors take to process partitions, can be indicative of skew if some executors take a long time to process a partition but others do not. However, a short processing time is not per se indicative of low skew: it may simply be short because the partition is small.
A situation indicative of low skew may be when all executors finish processing their partitions in the same timeframe. High skew may be indicated by some executors taking much longer to finish their partitions than others. But the answer does not make any comparison – so by itself it does not provide enough information to make any assessment about skew.
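One rough way to check for skew along these lines (assuming a DataFrame named df; glom() materializes each partition as a list, so keep this to small or sampled data):

```python
# Per-partition record counts; widely differing values suggest skew.
sizes = df.rdd.glom().map(len).collect()
print(min(sizes), max(sizes))
```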
More info: Spark Repartition & Coalesce - Explained, and Performance Tuning - Spark 3.1.2 Documentation



Which of the following is a problem with using accumulators?

  A. Only unnamed accumulators can be inspected in the Spark UI.
  B. Only numeric values can be used in accumulators.
  C. Accumulator values can only be read by the driver, but not by executors.
  D. Accumulators do not obey lazy evaluation.
  E. Accumulators are difficult to use for debugging because they will only be updated once, independent of whether a task has to be re-run due to hardware failure.

Answer(s): C

Explanation:

Accumulator values can only be read by the driver, but not by executors.
Correct. So, for example, you cannot use an accumulator variable for coordinating workloads between executors. The typical, canonical, use case of an accumulator value is to report data, for example for debugging purposes, back to the driver. For example, if you wanted to count values that match a specific condition in a UDF for debugging purposes, an accumulator provides a good
way to do that.
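The canonical pattern looks roughly like this (a minimal sketch assuming an active SparkSession named spark; the variable names are made up):

```python
sc = spark.sparkContext
even_count = sc.accumulator(0)

# Executors add to the accumulator inside an action; only the driver reads it.
sc.parallelize(range(100)).foreach(lambda x: even_count.add(1) if x % 2 == 0 else None)

print(even_count.value)   # 50 -- reading .value inside a task would raise an error
```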
Only numeric values can be used in accumulators.
No. While PySpark's Accumulator only supports numeric values (think int and float), you can define accumulators for custom types via the AccumulatorParam interface (documentation linked below).
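A hedged sketch of such a custom accumulator, here one that collects values into a list (the class and variable names are made up for the example; spark is assumed to be an active SparkSession):

```python
from pyspark import AccumulatorParam

class ListParam(AccumulatorParam):
    """Accumulator parameter that merges partial lists from tasks."""
    def zero(self, initial_value):
        return []

    def addInPlace(self, v1, v2):
        return v1 + v2

seen = spark.sparkContext.accumulator([], ListParam())
spark.sparkContext.parallelize([1, 2, 3]).foreach(lambda x: seen.add([x]))
print(seen.value)   # e.g. [1, 2, 3] -- order across partitions is not guaranteed
```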
Accumulators do not obey lazy evaluation.
Incorrect – accumulators do obey lazy evaluation. This has implications in practice: When an accumulator is encapsulated in a transformation, that accumulator will not be modified until a subsequent action is run.
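That behaviour can be seen in a short sketch (assuming an active SparkSession named spark): the accumulator stays at zero until an action forces the transformation to run.

```python
acc = spark.sparkContext.accumulator(0)

# map() is a transformation, so nothing executes here.
rdd = spark.sparkContext.parallelize(range(10)).map(lambda x: acc.add(1) or x)

print(acc.value)   # 0 -- the transformation has not run yet
rdd.count()        # an action triggers execution
print(acc.value)   # 10 -- the accumulator is updated only now
```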
Accumulators are difficult to use for debugging because they will only be updated once, independent of whether a task has to be re-run due to hardware failure.
Wrong. A concern with accumulators is in fact the opposite: under certain conditions their updates can be applied more than once for the same task. For example, if a hardware failure occurs after an accumulator variable has been increased but before the task has finished, Spark relaunches the task on a different worker, and the increases that were already executed are repeated.
Only unnamed accumulators can be inspected in the Spark UI.
No. Currently, in PySpark, no accumulators can be inspected in the Spark UI. In the Scala interface of Spark, only named accumulators can be inspected in the Spark UI.
More info: Aggregating Results with Spark Accumulators | Sparkour, RDD Programming Guide – Spark 3.1.2 Documentation, pyspark.Accumulator — PySpark 3.1.2 documentation, and pyspark.AccumulatorParam — PySpark 3.1.2 documentation



Which of the following describes Spark's standalone deployment mode?

  A. Standalone mode uses a single JVM to run Spark driver and executor processes.
  B. Standalone mode means that the cluster does not contain the driver.
  C. Standalone mode is how Spark runs on YARN and Mesos clusters.
  D. Standalone mode uses only a single executor per worker per application.
  E. Standalone mode is a viable solution for clusters that run multiple frameworks, not only Spark.

Answer(s): D

Explanation:

Standalone mode uses only a single executor per worker per application.
This is correct and a limitation of Spark's standalone mode.
Standalone mode is a viable solution for clusters that run multiple frameworks, not only Spark.
Incorrect. A limitation of standalone mode is that Apache Spark must be the only framework running on the cluster. If you would want to run multiple frameworks on the same cluster in parallel, for example Apache Spark and Apache Flink, you would consider the YARN deployment mode.
Standalone mode uses a single JVM to run Spark driver and executor processes.
No. This is what local mode does.
Standalone mode is how Spark runs on YARN and Mesos clusters.
No. YARN and Mesos modes are two deployment modes that are different from standalone mode. These modes allow Spark to run alongside other frameworks on a cluster. When Spark is run in standalone mode, only the Spark framework can run on the cluster.
Standalone mode means that the cluster does not contain the driver.
Incorrect. Whether the cluster contains the driver depends on the deploy mode, not on standalone mode: in client deploy mode the driver runs outside the cluster, while in cluster deploy mode it runs on a node inside the cluster.
More info: Learning Spark, 2nd Edition, Chapter 1
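
For reference, connecting an application to a standalone cluster is mostly a matter of pointing the master URL at the standalone master; a hedged sketch (host, port, app name, and memory setting are placeholders):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("standalone-demo")
    .master("spark://master-host:7077")     # standalone master URL (placeholder host)
    .config("spark.executor.memory", "2g")  # example resource setting
    .getOrCreate()
)
```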


