Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Exam
Databricks Certified Associate Developer for Apache Spark 3.5 - Python (Page 3)

Updated On: 26-Jan-2026

A data scientist is analyzing a large dataset and has written a PySpark script that includes several transformations and actions on a DataFrame. The script ends with a collect() action to retrieve the results.

How does Apache Spark's execution hierarchy process the operations when the data scientist runs this script?

  1. The script is first divided into multiple applications, then each application is split into jobs, stages, and finally tasks.
  2. The entire script is treated as a single job, which is then divided into multiple stages, and each stage is further divided into tasks based on data partitions.
  3. The collect() action triggers a job, which is divided into stages at shuffle boundaries, and each stage is split into tasks that operate on individual data partitions.
  4. Spark creates a single task for each transformation and action in the script, and these tasks are grouped into stages and jobs based on their dependencies.

Answer(s): C

Explanation:

In Apache Spark, the execution hierarchy is structured as follows:

Application: The highest-level unit, representing the user program built on Spark.

Job: Triggered by an action (e.g., collect(), count()). Each action corresponds to a job.

Stage: A job is divided into stages based on shuffle boundaries. Each stage contains tasks that can be executed in parallel.

Task: The smallest unit of work, representing a single operation applied to a partition of the data.

When the collect() action is invoked, Spark initiates a job. This job is then divided into stages at points where data shuffling is required (i.e., wide transformations). Each stage comprises tasks that are distributed across the cluster's executors, operating on individual data partitions.

This hierarchical execution model allows Spark to efficiently process large-scale data by parallelizing tasks and optimizing resource utilization.
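For illustration, a minimal sketch of where these boundaries fall (assuming an existing SparkSession named spark; the DataFrame and expressions are hypothetical):

df = spark.range(1_000_000)                         # lazily defines a DataFrame; nothing runs yet
doubled = df.withColumn("value", df.id * 2)         # narrow transformation: still no job
grouped = doubled.groupBy(doubled.id % 10).count()  # wide transformation: introduces a shuffle boundary
results = grouped.collect()                         # action: triggers one job, typically split into two
                                                    # stages at the shuffle, with one task per partition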



A developer is trying to join two tables, sales.purchases_fct and sales.customer_dim, using the following code:



fact_df = purch_df.join(cust_df, F.col('customer_id') == F.col('custid'))

The developer has discovered that customers in the purchases_fct table that do not exist in the customer_dim table are being dropped from the joined table.

Which change should be made to the code to stop these customer records from being dropped?

  1. fact_df = purch_df.join(cust_df, F.col('customer_id') == F.col('custid'), 'left')
  2. fact_df = cust_df.join(purch_df, F.col('customer_id') == F.col('custid'))
  3. fact_df = purch_df.join(cust_df, F.col('cust_id') == F.col('customer_id'))
  4. fact_df = purch_df.join(cust_df, F.col('customer_id') == F.col('custid'), 'right_outer')

Answer(s): A

Explanation:

In Spark, the default join type is an inner join, which returns only the rows with matching keys in both DataFrames. To retain all records from the left DataFrame (purch_df) and include matching records from the right DataFrame (cust_df), a left outer join should be used.

By specifying the join type as 'left', the modified code ensures that all records from purch_df are preserved, and matching records from cust_df are included. Records in purch_df without a corresponding match in cust_df will have null values for the columns from cust_df.

This approach is consistent with standard SQL join operations and is supported in PySpark's DataFrame API.
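A minimal sketch of the corrected join, reusing the column names from the question (purch_df and cust_df are assumed to be DataFrames already loaded from sales.purchases_fct and sales.customer_dim):

from pyspark.sql import functions as F

# 'left' (equivalently 'left_outer') keeps every row of purch_df;
# purchases with no matching customer get nulls for the cust_df columns
fact_df = purch_df.join(cust_df, F.col('customer_id') == F.col('custid'), 'left')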



A data engineer is reviewing a Spark application that applies several transformations to a DataFrame but notices that the job does not start executing immediately.

Which two characteristics of Apache Spark's execution model explain this behavior?

Choose 2 answers:

  1. The Spark engine requires manual intervention to start executing transformations.
  2. Only actions trigger the execution of the transformation pipeline.
  3. Transformations are executed immediately to build the lineage graph.
  4. The Spark engine optimizes the execution plan during the transformations, causing delays.
  5. Transformations are evaluated lazily.

Answer(s): B,E

Explanation:

Apache Spark employs a lazy evaluation model for transformations. This means that when transformations (e.g., map(), filter()) are applied to a DataFrame, Spark does not execute them immediately. Instead, it builds a logical plan (lineage) of transformations to be applied.

Execution is deferred until an action (e.g., collect(), count(), save()) is called. At that point, Spark's Catalyst optimizer analyzes the logical plan, optimizes it, and then executes the physical plan to produce the result.

This lazy evaluation strategy allows Spark to optimize the execution plan, minimize data shuffling, and improve overall performance by reducing unnecessary computations.
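A small sketch of this behavior (the path and column names are made up for illustration; spark is an existing SparkSession):

df = spark.read.parquet("/data/events")       # defines the read; the data is not scanned yet
active = df.filter(df.status == "active")     # transformation: only recorded in the lineage
ids = active.select("user_id")                # transformation: still nothing executed
n = ids.count()                               # action: Catalyst optimizes the plan and a job runs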



A developer needs to produce a Python dictionary using data stored in a small Parquet table with two columns, region and region_id (for example, AFRICA -> 0, AMERICA -> 1, ASIA -> 2).

The resulting Python dictionary must contain a mapping of region -> region_id for the smallest 3 region_id values.

Which code fragment meets the requirements?

  1. regions = dict(
    regions_df
    .select('region', 'region_id')
    .sort('region_id')
    .take(3)
    )
  2. regions = dict(
    regions_df
    .select('region_id', 'region')
    .sort('region_id')
    .take(3)
    )
  3. regions = dict(
    regions_df
    .select('region_id', 'region')
    .limit(3)
    .collect()
    )
  4. regions = dict(
    regions_df
    .select('region', 'region_id')
    .sort(desc('region_id'))
    .take(3)
    )

Answer(s): A

Explanation:

The question requires creating a dictionary where keys are region values and values are the corresponding region_id integers. Furthermore, it asks to retrieve only the smallest 3 region_id values.

Key observations:

.select('region', 'region_id') puts the columns in the order expected by dict(): the first column becomes the key and the second the value.

.sort('region_id') ensures sorting in ascending order so the smallest IDs are first.

.take(3) retrieves exactly 3 rows.

Wrapping the result in dict(...) correctly builds the required Python dictionary: { 'AFRICA': 0, 'AMERICA': 1, 'ASIA': 2 }.

Incorrect options:

Option B flips the order to region_id first, resulting in a dictionary with integer keys, which is not what is asked for.

Option C uses .limit(3) without sorting, which leads to non-deterministic rows based on partition layout.

Option D sorts in descending order, giving the largest rather than smallest region_ids.

Hence, Option A meets all the requirements precisely.
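A sketch of why Option A produces the desired dictionary (regions_df is assumed to be the DataFrame read from the Parquet table):

rows = (regions_df
        .select('region', 'region_id')   # key column first, value column second
        .sort('region_id')               # ascending sort puts the smallest ids first
        .take(3))                        # list of Row objects, e.g. Row(region='AFRICA', region_id=0)
regions = dict(rows)                     # each Row unpacks like a (key, value) tuple
# regions == {'AFRICA': 0, 'AMERICA': 1, 'ASIA': 2}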



An engineer has a large ORC file located at /file/test_data.orc and wants to read only specific columns to reduce memory usage.

Which code fragment will select only the columns col1 and col2 during the reading process?

  1. spark.read.orc("/file/test_data.orc").filter("col1 = 'value' ").select("col2")
  2. spark.read.format("orc").select("col1", "col2").load("/file/test_data.orc")
  3. spark.read.orc("/file/test_data.orc").selected("col1", "col2")
  4. spark.read.format("orc").load("/file/test_data.orc").select("col1", "col2")

Answer(s): D

Explanation:

The correct way to load specific columns from an ORC file is to first load the file using .load() and then apply .select() on the resulting DataFrame. This works with .read.format("orc") as well as the shortcut .read.orc().

df = spark.read.format("orc").load("/file/test_data.orc").select("col1", "col2")

Why others are incorrect:

A filters rows and then selects only col2, so it does not return both requested columns (col1 and col2).

B incorrectly tries to use .select() before .load(), which is invalid.

C uses a non-existent .selected() method.

D correctly loads and then selects.
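
The shortcut form mentioned above is equivalent (a sketch, using the path and column names from the question); in both cases Spark's column pruning pushes the projection down so only col1 and col2 are read from the file:

df = spark.read.orc("/file/test_data.orc").select("col1", "col2")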


Reference:

Apache Spark SQL API - ORC Format





