Databricks Certified Associate Developer for Apache Spark 3.0 Exam
Certified Associate Developer for Apache Spark

Updated On: 26-Jan-2026

Which of the following describes Spark actions?

  A. Writing data to disk is the primary purpose of actions.
  B. Actions are Spark's way of exchanging data between executors.
  C. The driver receives data upon request by actions.
  D. Stage boundaries are commonly established by actions.
  E. Actions are Spark's way of modifying RDDs.

Answer(s): C

Explanation:

The driver receives data upon request by actions.
Correct! Actions trigger the distributed execution of tasks on executors which, upon task completion, transfer result data back to the driver.
Actions are Spark's way of exchanging data between executors.
No. In Spark, data is exchanged between executors via shuffles.
Writing data to disk is the primary purpose of actions.
No. The primary purpose of actions is to access data that is stored in Spark's RDDs and return the data, often in aggregated form, back to the driver.
Actions are Spark's way of modifying RDDs.
Incorrect. Firstly, RDDs are immutable – they cannot be modified. Secondly, Spark generates new RDDs via transformations and not actions.
Stage boundaries are commonly established by actions.
Wrong. A stage boundary is commonly established by a shuffle, for example caused by a wide transformation.
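
To make the distinction concrete, here is a minimal PySpark sketch (the session name, app name, and DataFrame are made up for illustration): a transformation only declares work, while an action such as collect() or count() actually runs it and returns data to the driver.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("actions-demo").getOrCreate()

df = spark.range(100)  # DataFrame with a single "id" column

# Transformation: lazily declares the computation; nothing is executed yet.
doubled = df.selectExpr("id * 2 AS doubled")

# Actions: trigger execution on the executors and return results to the driver.
rows = doubled.collect()   # list of Row objects, materialized on the driver
total = doubled.count()    # a single number, returned to the driver
```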



Which of the following statements about executors is correct?

  A. Executors are launched by the driver.
  B. Executors stop upon application completion by default.
  C. Each node hosts a single executor.
  D. Executors store data in memory only.
  E. An executor can serve multiple applications.

Answer(s): B

Explanation:

Executors stop upon application completion by default.
Correct. Executors only persist during the lifetime of an application.
A notable exception to that is when Dynamic Resource Allocation is enabled (which it is not by default). With Dynamic Resource Allocation enabled, executors are terminated when they are idle, independent of whether the application has been completed or not.
An executor can serve multiple applications.
Wrong. An executor is always specific to the application. It is terminated when the application completes (exception see above).
Each node hosts a single executor.
No. Each node can host one or more executors.
Executors store data in memory only.
No. Executors can store data in memory or on disk.
Executors are launched by the driver.
Incorrect. Executors are launched by the cluster manager on behalf of the driver.
More info: Job Scheduling - Spark 3.1.2 Documentation; How Applications are Executed on a Spark Cluster | Anatomy of a Spark Application | InformIT; and Spark Jargon for Starters (Mageswaran D, Medium)
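
As a hedged sketch of the Dynamic Resource Allocation behaviour mentioned above, the feature can be switched on via standard Spark properties when the session is created (the app name, timeout, and executor counts below are arbitrary example values):

```python
from pyspark.sql import SparkSession

# Dynamic Resource Allocation is disabled by default. With it enabled,
# idle executors can be released before the application has completed.
spark = (
    SparkSession.builder
    .appName("dynamic-allocation-demo")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")  # example value
    .config("spark.dynamicAllocation.minExecutors", "1")           # example value
    .config("spark.dynamicAllocation.maxExecutors", "10")          # example value
    .config("spark.shuffle.service.enabled", "true")  # usually needed so shuffle data survives executor removal
    .getOrCreate()
)
```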



Which of the following describes a valid concern about partitioning?

  A. A shuffle operation returns 200 partitions if not explicitly set.
  B. Decreasing the number of partitions reduces the overall runtime of narrow transformations if there are more executors available than partitions.
  C. No data is exchanged between executors when coalesce() is run.
  D. Short partition processing times are indicative of low skew.
  E. The coalesce() method should be used to increase the number of partitions.

Answer(s): A

Explanation:

A shuffle operation returns 200 partitions if not explicitly set.

Correct. 200 is the default value for the Spark property spark.sql.shuffle.partitions. This property determines how many partitions Spark uses when shuffling data for joins or aggregations.
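A small sketch of this default (assuming an active SparkSession named spark, and adaptive query execution not coalescing the shuffle partitions):

```python
# The default number of partitions produced by a shuffle.
spark.conf.get("spark.sql.shuffle.partitions")   # '200' unless explicitly set

df = spark.range(1000)
aggregated = df.groupBy((df.id % 10).alias("bucket")).count()  # groupBy causes a shuffle
aggregated.rdd.getNumPartitions()                # 200 with the default setting
```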
The coalesce() method should be used to increase the number of partitions.
Incorrect. The coalesce() method can only be used to decrease the number of partitions.
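For illustration, a quick sketch contrasting coalesce() and repartition() on a made-up DataFrame (again assuming an active SparkSession named spark):

```python
df = spark.range(1000).repartition(8)        # start with 8 partitions
df.rdd.getNumPartitions()                    # 8

df.coalesce(2).rdd.getNumPartitions()        # 2  -- coalesce() can decrease
df.coalesce(16).rdd.getNumPartitions()       # 8  -- coalesce() cannot increase
df.repartition(16).rdd.getNumPartitions()    # 16 -- repartition() can increase (full shuffle)
```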
Decreasing the number of partitions reduces the overall runtime of narrow transformations if there are more executors available than partitions.
No. For narrow transformations, fewer partitions usually result in a longer overall runtime when more executors are available than partitions.
A narrow transformation does not include a shuffle, so no data needs to be exchanged between executors. Shuffles are expensive and can be a bottleneck for executing Spark workloads.
Narrow transformations, however, are executed on a per-partition basis, occupying one executor per partition. So it matters how many executors are available to perform work in parallel relative to the number of partitions. If the number of executors is greater than the number of partitions, some executors sit idle while others process the partitions. On the flip side, if the number of executors is smaller than the number of partitions, the entire operation can only finish after some executors have processed multiple partitions, one after the other. To minimize the overall runtime, one would want the number of partitions to equal the number of executors (but not more).
So, for the scenario at hand, increasing the number of partitions reduces the overall runtime of narrow transformations if there are more executors available than partitions.
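A hedged sketch of that tuning idea, using the cluster's default parallelism as a stand-in for the number of available executor slots (assuming an active SparkSession named spark):

```python
# Roughly the total number of cores available across the executors.
parallelism = spark.sparkContext.defaultParallelism

# Match the partition count to the available parallelism so that every slot
# is busy and no partition has to wait for a free executor.
df = spark.range(1_000_000).repartition(parallelism)
```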
No data is exchanged between executors when coalesce() is run.
No. While coalesce() avoids a full shuffle, it may still cause a partial shuffle, resulting in data exchange between executors.
Short partition processing times are indicative of low skew.
Incorrect. Data skew means that data is distributed unevenly over the partitions of a dataset. Low skew therefore means that data is distributed evenly.
Partition processing time, the time that executors take to process partitions, can be indicative of skew if some executors take a long time to process a partition but others do not. However, a short processing time is not per se indicative of low skew: it may simply be short because the partition is small.
A situation indicative of low skew may be when all executors finish processing their partitions in the same timeframe. High skew may be indicated by some executors taking much longer to finish their partitions than others. But the answer does not make any comparison – so by itself it does not provide enough information to make any assessment about skew.
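One rough way to check for skew along these lines (assuming a DataFrame named df; glom() materializes each partition as a list, so keep this to small or sampled data):

```python
# Per-partition record counts; widely differing values suggest skew.
sizes = df.rdd.glom().map(len).collect()
print(min(sizes), max(sizes))
```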
More info: Spark Repartition & Coalesce - Explained, and Performance Tuning - Spark 3.1.2 Documentation



Which of the following is a problem with using accumulators?

  A. Only unnamed accumulators can be inspected in the Spark UI.
  B. Only numeric values can be used in accumulators.
  C. Accumulator values can only be read by the driver, but not by executors.
  D. Accumulators do not obey lazy evaluation.
  E. Accumulators are difficult to use for debugging because they will only be updated once, independent of whether a task has to be re-run due to hardware failure.

Answer(s): C

Explanation:

Accumulator values can only be read by the driver, but not by executors.
Correct. So, for example, you cannot use an accumulator variable for coordinating workloads between executors. The typical, canonical, use case of an accumulator value is to report data, for example for debugging purposes, back to the driver. For example, if you wanted to count values that match a specific condition in a UDF for debugging purposes, an accumulator provides a good
way to do that.
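The canonical pattern looks roughly like this (a minimal sketch assuming an active SparkSession named spark; the variable names are made up):

```python
sc = spark.sparkContext
even_count = sc.accumulator(0)

# Executors add to the accumulator inside an action; only the driver reads it.
sc.parallelize(range(100)).foreach(lambda x: even_count.add(1) if x % 2 == 0 else None)

print(even_count.value)   # 50 -- reading .value inside a task would raise an error
```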
Only numeric values can be used in accumulators.
No. While PySpark's Accumulator only supports numeric values (think int and float), you can define accumulators for custom types via the AccumulatorParam interface (documentation linked below).
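A hedged sketch of such a custom accumulator, here one that collects values into a list (the class and variable names are made up for the example; spark is assumed to be an active SparkSession):

```python
from pyspark import AccumulatorParam

class ListParam(AccumulatorParam):
    """Accumulator parameter that merges partial lists from tasks."""
    def zero(self, initial_value):
        return []

    def addInPlace(self, v1, v2):
        return v1 + v2

seen = spark.sparkContext.accumulator([], ListParam())
spark.sparkContext.parallelize([1, 2, 3]).foreach(lambda x: seen.add([x]))
print(seen.value)   # e.g. [1, 2, 3] -- order across partitions is not guaranteed
```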
Accumulators do not obey lazy evaluation.
Incorrect – accumulators do obey lazy evaluation. This has implications in practice: When an accumulator is encapsulated in a transformation, that accumulator will not be modified until a subsequent action is run.
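That behaviour can be seen in a short sketch (assuming an active SparkSession named spark): the accumulator stays at zero until an action forces the transformation to run.

```python
acc = spark.sparkContext.accumulator(0)

# map() is a transformation, so nothing executes here.
rdd = spark.sparkContext.parallelize(range(10)).map(lambda x: acc.add(1) or x)

print(acc.value)   # 0 -- the transformation has not run yet
rdd.count()        # an action triggers execution
print(acc.value)   # 10 -- the accumulator is updated only now
```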
Accumulators are difficult to use for debugging because they will only be updated once, independent of whether a task has to be re-run due to hardware failure.
Wrong. A concern with accumulators is in fact the opposite: under certain conditions their updates can be applied more than once for the same task. For example, if a hardware failure occurs after an accumulator variable has been increased but before the task has finished, Spark relaunches the task on a different worker, and the increases that were already executed are repeated.
Only unnamed accumulators can be inspected in the Spark UI.
No. Currently, in PySpark, no accumulators can be inspected in the Spark UI. In the Scala interface of Spark, only named accumulators can be inspected in the Spark UI.
More info: Aggregating Results with Spark Accumulators | Sparkour, RDD Programming Guide – Spark 3.1.2 Documentation, pyspark.Accumulator — PySpark 3.1.2 documentation, and pyspark.AccumulatorParam — PySpark 3.1.2 documentation



Which of the following describes Spark's standalone deployment mode?

  A. Standalone mode uses a single JVM to run Spark driver and executor processes.
  B. Standalone mode means that the cluster does not contain the driver.
  C. Standalone mode is how Spark runs on YARN and Mesos clusters.
  D. Standalone mode uses only a single executor per worker per application.
  E. Standalone mode is a viable solution for clusters that run multiple frameworks, not only Spark.

Answer(s): D

Explanation:

Standalone mode uses only a single executor per worker per application.
This is correct and a limitation of Spark's standalone mode.
Standalone mode is a viable solution for clusters that run multiple frameworks, not only Spark.
Incorrect. A limitation of standalone mode is that Apache Spark must be the only framework running on the cluster. If you would want to run multiple frameworks on the same cluster in parallel, for example Apache Spark and Apache Flink, you would consider the YARN deployment mode.
Standalone mode uses a single JVM to run Spark driver and executor processes.
No. This is what local mode does.
Standalone mode is how Spark runs on YARN and Mesos clusters.
No. YARN and Mesos modes are two deployment modes that are different from standalone mode. These modes allow Spark to run alongside other frameworks on a cluster. When Spark is run in standalone mode, only the Spark framework can run on the cluster.
Standalone mode means that the cluster does not contain the driver.
Incorrect. Whether the cluster contains the driver depends on the deploy mode, not on standalone mode: in client deploy mode the driver runs outside the cluster, while in cluster deploy mode it runs on a node inside the cluster.
More info: Learning Spark, 2nd Edition, Chapter 1
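
For reference, connecting an application to a standalone cluster is mostly a matter of pointing the master URL at the standalone master; a hedged sketch (host, port, app name, and memory setting are placeholders):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("standalone-demo")
    .master("spark://master-host:7077")     # standalone master URL (placeholder host)
    .config("spark.executor.memory", "2g")  # example resource setting
    .getOrCreate()
)
```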


