Databricks Certified Associate Developer for Apache Spark 3.0
Free Practice Exam Questions (page: 3)
Updated On: 2-Jan-2026

Which of the following is one of the big performance advantages that Spark has over Hadoop?

  1. Spark achieves great performance by storing data in the DAG format, whereas Hadoop can only use parquet files.
  2. Spark achieves higher resiliency for queries since, different from Hadoop, it can be deployed on Kubernetes.
  3. Spark achieves great performance by storing data and performing computation in memory, whereas large jobs in Hadoop require a large amount of relatively slow disk I/O operations.
  4. Spark achieves great performance by storing data in the HDFS format, whereas Hadoop can only use parquet files.
  5. Spark achieves performance gains for developers by extending Hadoop's DataFrames with a user-friendly API.

Answer(s): C

Explanation:

Spark achieves great performance by storing data in the DAG format, whereas Hadoop can only use parquet files.
Wrong, there is no "DAG format". DAG stands for "directed acyclic graph". The DAG is a means of representing computational steps in Spark. However, it is true that Hadoop does not use a DAG.

The introduction of the DAG in Spark was a response to a limitation of Hadoop's MapReduce framework, in which data had to be written to and read from disk between processing steps.

(Reference: Graph DAG in Apache Spark - DataFlair)
Spark achieves great performance by storing data in the HDFS format, whereas Hadoop can only use parquet files.
No. Spark can certainly store data in HDFS (and in other storage systems), but this is not a key performance advantage over Hadoop. Hadoop can use multiple file formats, not only parquet.
Spark achieves higher resiliency for queries since, different from Hadoop, it can be deployed on Kubernetes.
No, resiliency is not what the question asks about. The question is about performance advantages. Both Hadoop and Spark can be deployed on Kubernetes.
Spark achieves performance gains for developers by extending Hadoop's DataFrames with a user-friendly API.
No. DataFrames are a concept in Spark, but not in Hadoop.
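To illustrate the in-memory aspect of the correct answer, here is a minimal sketch (it assumes an existing SparkSession named `spark`; the example data is purely illustrative): an intermediate result is cached in memory and reused across several actions instead of being recomputed or re-read from slow storage.

```python
df = spark.range(0, 1_000_000)

# Keep the intermediate result in memory across actions.
filtered = df.filter(df.id % 2 == 0).cache()

filtered.count()   # the first action materializes the cache
filtered.count()   # later actions reuse the in-memory data instead of recomputing it
```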



Which of the following is the deepest level in Spark's execution hierarchy?

  1. Job
  2. Task
  3. Executor
  4. Slot
  5. Stage

Answer(s): B

Explanation:

The hierarchy is, from top to bottom: Job, Stage, Task.
Executors and slots facilitate the execution of tasks, but they are not directly part of the hierarchy. Executors are launched by the driver on worker nodes for the purpose of running a specific Spark application. Slots help Spark parallelize work. An executor can have multiple slots which enable it to process multiple tasks in parallel.
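As a rough sketch of this hierarchy (a local setup; the application name and partition counts are assumptions for illustration): one action submits one job, each shuffle boundary starts a new stage, and each stage runs one task per partition.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("hierarchy-sketch").getOrCreate()

df = spark.range(0, 1_000_000, numPartitions=8)   # 8 partitions -> 8 tasks in the first stage

# groupBy introduces a shuffle, so the plan splits into (at least) two stages.
counts = df.groupBy((df.id % 10).alias("bucket")).count()

counts.collect()   # the action: submits one job, made up of stages, made up of tasks

# The Jobs and Stages tabs of the Spark UI (default http://localhost:4040) show the breakdown.
```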



Which of the following statements about garbage collection in Spark is incorrect?

  1. Garbage collection information can be accessed in the Spark UI's stage detail view.
  2. Optimizing garbage collection performance in Spark may limit caching ability.
  3. Manually persisting RDDs in Spark prevents them from being garbage collected.
  4. In Spark, using the G1 garbage collector is an alternative to using the default Parallel garbage collector.
  5. Serialized caching is a strategy to increase the performance of garbage collection.

Answer(s): C

Explanation:

Manually persisting RDDs in Spark prevents them from being garbage collected.
This statement is incorrect, and thus the correct answer to the question. Spark's garbage collector will remove even persisted objects, albeit in an "LRU" fashion. LRU stands for least recently used.
So, during a garbage collection run, the objects that were used the longest time ago will be garbage collected first.
See the linked StackOverflow post below for more information.
Serialized caching is a strategy to increase the performance of garbage collection.
This statement is correct. The more Java objects Spark needs to collect during garbage collection, the longer it takes. Storing a collection of many Java objects, such as a DataFrame with a complex schema, through serialization as a single byte array thus increases performance. This means that garbage collection takes less time on a serialized DataFrame than an unserialized DataFrame.
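A hedged sketch of both points above (it assumes an existing SparkSession named `spark`): persisted data remains subject to LRU eviction, and a serialized storage level such as MEMORY_ONLY_SER is how one would reduce GC pressure in Scala/Java, whereas PySpark always stores cached RDD partitions in serialized (pickled) form.

```python
from pyspark import StorageLevel

rdd = spark.sparkContext.parallelize(range(1_000_000), numSlices=8)

# persist() is only a hint: cached blocks can still be evicted in LRU order
# when memory runs short, and the evicted objects are then garbage collected.
rdd.persist(StorageLevel.MEMORY_ONLY)

rdd.count()      # the first action materializes the cached partitions
rdd.count()      # later actions reuse whatever is still cached

rdd.unpersist()  # explicitly release the blocks once they are no longer needed
```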
Optimizing garbage collection performance in Spark may limit caching ability.
This statement is correct. A full garbage collection run slows down a Spark application. When talking about "tuning" garbage collection, we mean reducing the frequency or duration of these slowdowns.
A full garbage collection run is triggered when the Old generation of the Java heap space is almost full. (If you are unfamiliar with this concept, check out the link to the Garbage Collection Tuning docs below.) Thus, one measure to avoid triggering a garbage collection run is to prevent the Old generation share of the heap space from becoming almost full.
To achieve this, one may decrease its size. Objects larger than the Old generation space will then be discarded instead of being cached (stored) there, so they do not help fill it up.
This decreases the number of full garbage collection runs, increasing overall performance. The discarded objects will, however, need to be recomputed when they are needed again. So, this mechanism only works well when a Spark application needs to reuse cached data as little as possible.
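One concrete, hedged way to act on this trade-off is to shrink the share of the heap that Spark may use for caching, so less cached data accumulates and triggers garbage collection. The values below are purely illustrative, not recommendations.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gc-tuning-sketch")
    # Fraction of the heap used for execution and storage (default 0.6).
    .config("spark.memory.fraction", "0.5")
    # Fraction of that region set aside for cached (storage) blocks (default 0.5).
    .config("spark.memory.storageFraction", "0.3")
    .getOrCreate()
)
```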
Garbage collection information can be accessed in the Spark UI's stage detail view.
This statement is correct. The task table in the Spark UI's stage detail view has a "GC Time" column, indicating the garbage collection time needed per task.
In Spark, using the G1 garbage collector is an alternative to using the default Parallel garbage collector.
This statement is correct. The G1 garbage collector, also known as garbage first garbage collector, is an alternative to the default Parallel garbage collector.
While the default Parallel garbage collector divides the heap into a few static regions, the G1 garbage collector divides the heap into many small regions that are created dynamically. The G1 garbage collector has certain advantages over the Parallel garbage collector which improve performance particularly for Spark workloads that require high throughput and low latency.
The G1 garbage collector is not enabled by default, and you need to explicitly pass an argument to Spark to enable it. For more information about the two garbage collectors, check out the Databricks article linked below.
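For illustration, a minimal sketch of how G1 can be requested for the executors (the flag is a standard JVM option; driver-side JVM options usually have to be passed at launch time, e.g. via spark-submit, rather than set in code):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("g1-gc-sketch")
    # G1 is not the default collector; it must be enabled explicitly via a JVM flag.
    .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
    .getOrCreate()
)
```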



Which of the following describes characteristics of the Dataset API?

  1. The Dataset API does not support unstructured data.
  2. In Python, the Dataset API mainly resembles Pandas' DataFrame API.
  3. In Python, the Dataset API's schema is constructed via type hints.
  4. The Dataset API is available in Scala, but it is not available in Python.
  5. The Dataset API does not provide compile-time type safety.

Answer(s): D

Explanation:

The Dataset API is available in Scala, but it is not available in Python.
Correct. The Dataset API uses fixed typing and is typically used for object-oriented programming. It is available when Spark is used with the Scala programming language, but not with Python. In Python, you use the DataFrame API, which is based on the Dataset API.
The Dataset API does not provide compile-time type safety.
No. In fact, depending on the use case, the compile-time type safety that the Dataset API provides is an advantage.
The Dataset API does not support unstructured data.
Wrong, the Dataset API supports both structured and unstructured data.
In Python, the Dataset API's schema is constructed via type hints.
No, this is not applicable, since the Dataset API is not available in Python.
In Python, the Dataset API mainly resembles Pandas' DataFrame API.
No. The Dataset API does not exist in Python, only in Scala and Java.
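To make the distinction concrete, here is a hedged sketch (the application name and sample data are assumptions): in Python there is no typed Dataset, so you work with the untyped DataFrame API and, if needed, declare the schema explicitly at runtime rather than through compile-time types.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("dataframe-sketch").getOrCreate()

# In Scala, a case class would give this structure compile-time types via Dataset[T];
# in Python, the schema is attached to an untyped DataFrame at runtime instead.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])

df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], schema=schema)
df.show()
```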





