Databricks Certified Associate Developer for Apache Spark 3.0 Exam

Updated On: 26-Jan-2026

The code block displayed below contains an error. The code block should return a DataFrame in which column predErrorAdded contains the results of the Python function add_2_if_geq_3 as applied to the numeric and nullable column predError in DataFrame transactionsDf. Find the error.

Code block:
def add_2_if_geq_3(x):
    if x is None:
        return x
    elif x >= 3:
        return x+2
    return x

add_2_if_geq_3_udf = udf(add_2_if_geq_3)

transactionsDf.withColumnRenamed("predErrorAdded", add_2_if_geq_3_udf(col("predError")))

  1. The operator used to add the column does not add column predErrorAdded to the DataFrame.
  2. Instead of col("predError"), the actual DataFrame with the column needs to be passed, like so transactionsDf.predError.
  3. The udf() method does not declare a return type.
  4. UDFs are only available through the SQL API, but not in the Python API as shown in the code block.
  5. The Python function is unable to handle null values, resulting in the code block crashing on execution.

Answer(s): A

Explanation:

Correct code block:

def add_2_if_geq_3(x):
    if x is None:
        return x
    elif x >= 3:
        return x+2
    return x

add_2_if_geq_3_udf = udf(add_2_if_geq_3)

transactionsDf.withColumn("predErrorAdded", add_2_if_geq_3_udf(col("predError"))).show()

Instead of withColumnRenamed, you should use the withColumn operator.
The udf() method does not declare a return type.
It is fine that the udf() method does not declare a return type; this is not a required argument. However, the default return type is StringType, which may not be ideal for numeric, nullable data. Nevertheless, the code will run without a specified return type.
The Python function is unable to handle null values, resulting in the code block crashing on execution.
The Python function is able to handle null values; this is exactly what the statement if x is None does.
UDFs are only available through the SQL API, but not in the Python API as shown in the code block.
No, they are available through the Python API. The code in the code block that concerns UDFs is correct.
Instead of col("predError"), the actual DataFrame with the column needs to be passed, like so transactionsDf.predError.
You may choose to use the transactionsDf.predError syntax, but the col("predError") syntax is fine.
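
For reference, a minimal sketch of how the return type could be declared explicitly, assuming an active SparkSession and a transactionsDf whose predError column holds integers (use DoubleType() instead if it holds floats):

from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

def add_2_if_geq_3(x):
    # Returning None unchanged keeps the column nullable.
    if x is None:
        return x
    elif x >= 3:
        return x + 2
    return x

# Passing an explicit return type avoids the StringType default.
add_2_if_geq_3_udf = udf(add_2_if_geq_3, IntegerType())

transactionsDf.withColumn("predErrorAdded", add_2_if_geq_3_udf(col("predError"))).show()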



Which of the following describes a narrow transformation?

  1. A narrow transformation is an operation in which data is exchanged across partitions.
  2. A narrow transformation is a process in which data from multiple RDDs is used.
  3. A narrow transformation is a process in which 32-bit float variables are cast to smaller float variables, like 16-bit or 8-bit float variables.
  4. A narrow transformation is an operation in which data is exchanged across the cluster.
  5. A narrow transformation is an operation in which no data is exchanged across the cluster.

Answer(s): E

Explanation:

A narrow transformation is an operation in which no data is exchanged across the cluster.
Correct! In narrow transformations, no data is exchanged across the cluster, since these transformations do not require any data from outside of the partition they are applied on. Typical narrow transformations include filter, drop, and coalesce.
A narrow transformation is an operation in which data is exchanged across partitions.
No, that would be one definition of a wide transformation, but not of a narrow transformation. Wide transformations typically cause a shuffle, in which data is exchanged across partitions, executors, and the cluster.
A narrow transformation is an operation in which data is exchanged across the cluster.
No, see the explanation just above this one.
A narrow transformation is a process in which 32-bit float variables are cast to smaller float variables, like 16-bit or 8-bit float variables.
No, type conversion has nothing to do with narrow transformations in Spark.
A narrow transformation is a process in which data from multiple RDDs is used.
No. A resilient distributed dataset (RDD) can be described as a collection of partitions. In a narrow transformation, no data is exchanged between partitions, and thus no data is exchanged between RDDs.
One could say, though, that a narrow transformation, and in fact any transformation, results in a new RDD being created. A transformation is applied to an existing RDD (RDDs are the foundation of other Spark data structures, like DataFrames), but since RDDs are immutable, a new RDD must be created to reflect the result of the transformation.
More info: Spark Transformation and Action: A Deep Dive | by Misbah Uddin | CodeX | Medium
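
As an illustration, here is a minimal sketch contrasting narrow and wide transformations, assuming a DataFrame transactionsDf with hypothetical columns storeId and value:

from pyspark.sql.functions import col

# Narrow transformations: each output partition depends on a single input
# partition, so no data is exchanged across the cluster.
narrow = transactionsDf.filter(col("value") > 0).drop("storeId").coalesce(2)

# Wide transformation: grouping requires a shuffle, so data is exchanged
# across partitions and executors.
wide = transactionsDf.groupBy("storeId").count()

# The physical plan of the wide transformation contains an Exchange node.
wide.explain()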



Which of the following statements about stages is correct?

  1. Different stages in a job may be executed in parallel.
  2. Stages consist of one or more jobs.
  3. Stages ephemerally store transactions, before they are committed through actions.
  4. Tasks in a stage may be executed by multiple machines at the same time.
  5. Stages may contain multiple actions, narrow, and wide transformations.

Answer(s): D

Explanation:

Tasks in a stage may be executed by multiple machines at the same time.
This is correct. Within a single stage, tasks do not depend on each other. Executors on multiple machines may therefore execute tasks belonging to the same stage at the same time, each on the partitions it holds.

Different stages in a job may be executed in parallel.
No. Different stages in a job depend on each other and cannot be executed in parallel. The nuance is that every task within a stage may be executed in parallel by multiple machines.
For example, if a job consists of Stage A and Stage B, tasks belonging to those two stages may not be executed in parallel with each other. However, tasks from Stage A may be executed on multiple machines at the same time, with each machine running its task on a different partition of the same dataset. Then, afterwards, tasks from Stage B may be executed on multiple machines at the same time.
Stages may contain multiple actions, narrow, and wide transformations.
No, stages may not contain multiple wide transformations. Wide transformations mean that shuffling is required. Shuffling typically terminates a stage though, because data needs to be exchanged across the cluster. This data exchange often causes partitions to change and rearrange, making it impossible to perform tasks in parallel on the same dataset.
Stages ephemerally store transactions, before they are committed through actions.
No, this does not make sense. Stages do not "store" any data. Transactions are not "committed" in Spark.
Stages consist of one or more jobs.
No, it is the other way around: jobs consist of one or more stages.
More info: Spark: The Definitive Guide, Chapter 15.
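
To make this concrete, a minimal sketch assuming an active SparkSession named spark; the shuffle introduced by groupBy() ends one stage and starts another, and the action triggers the job:

df = spark.range(0, 1000000)

result = (df
    .withColumn("bucket", df.id % 10)  # narrow: stays within the same stage
    .groupBy("bucket")                 # wide: shuffle boundary starts a new stage
    .count())

result.collect()  # the action triggers a job with two stages; tasks within each stage run in parallel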



Which of the following describes tasks?

  1. A task is a command sent from the driver to the executors in response to a transformation.
  2. Tasks transform jobs into DAGs.
  3. A task is a collection of slots.
  4. A task is a collection of rows.
  5. Tasks get assigned to the executors by the driver.

Answer(s): E

Explanation:

Tasks get assigned to the executors by the driver.
Correct! Or, in other words: Executors take the tasks that they were assigned by the driver, run them over partitions, and report their outcomes back to the driver.
Tasks transform jobs into DAGs.
No, this statement confuses the order of elements in the Spark hierarchy. The Spark driver transforms jobs into DAGs. Each job consists of one or more stages. Each stage contains one or more tasks.
A task is a collection of rows.
Wrong. A partition is a collection of rows. Tasks have little to do with a collection of rows. If anything, a task processes a specific partition.
A task is a command sent from the driver to the executors in response to a transformation.
Incorrect. The Spark driver does not send anything to the executors in response to a transformation, since transformations are evaluated lazily. The Spark driver sends tasks to executors only in response to actions.
A task is a collection of slots.
No. Executors have one or more slots to process tasks and each slot can be assigned a task.
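
As a rough illustration, a minimal sketch assuming an active SparkSession named spark; the number of tasks in a stage corresponds to the number of partitions that stage processes:

df = spark.range(0, 100, numPartitions=8)

print(df.rdd.getNumPartitions())  # 8: an action on df starts a stage with 8 tasks

df.count()  # the driver assigns one task per partition to the executors' slots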



Which of the following code blocks generally causes a great amount of network traffic?

  1. DataFrame.select()
  2. DataFrame.coalesce()
  3. DataFrame.collect()
  4. DataFrame.rdd.map()
  5. DataFrame.count()

Answer(s): C

Explanation:

DataFrame.collect() sends all data in a DataFrame from executors to the driver, so this generally causes a great amount of network traffic in comparison to the other options listed.
DataFrame.coalesce() just reduces the number of partitions and generally aims to reduce network traffic in comparison to a full shuffle.
DataFrame.select() is evaluated lazily and, unless followed by an action, does not cause significant network traffic.
DataFrame.rdd.map() is evaluated lazily and therefore does not cause great amounts of network traffic.
DataFrame.count() is an action. While it does cause some network traffic, for the same DataFrame, collecting all data in the driver would generally be considered to cause a greater amount of network traffic.
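
As an illustration, a minimal sketch assuming a large DataFrame transactionsDf; only collect() ships the full dataset over the network to the driver:

rows = transactionsDf.collect()     # every row travels from the executors to the driver

n = transactionsDf.count()          # each executor sends back only a partial count

fewer = transactionsDf.coalesce(1)  # lazy; reduces partitions without a full shuffle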


