Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Exam
Databricks Certified Associate Developer for Apache Spark 3.5 - Python

Updated On: 26-Jan-2026

A data engineer is asked to build an ingestion pipeline for a set of Parquet files delivered by an upstream team on a nightly basis. The data is stored in a directory structure with a base path of "/path/events/data". The upstream team drops daily data into the underlying subdirectories following the convention year/month/day.

A few examples of the directory structure are:

/path/events/data/2023/01/01/
/path/events/data/2023/01/02/
/path/events/data/2023/02/01/
Which of the following code snippets will read all the data within the directory structure?

  1. df = spark.read.option("inferSchema", "true").parquet("/path/events/data/")
  2. df = spark.read.option("recursiveFileLookup", "true").parquet("/path/events/data/")
  3. df = spark.read.parquet("/path/events/data/*")
  4. df = spark.read.parquet("/path/events/data/")

Answer(s): B

Explanation:

To read all files recursively within a nested directory structure, Spark requires the recursiveFileLookup option to be explicitly enabled. According to Databricks official documentation, when dealing with deeply nested Parquet files in a directory tree (as shown in this example), you should set:

df = spark.read.option("recursiveFileLookup", "true").parquet("/path/events/data/")

This ensures that Spark searches through all subdirectories under /path/events/data/ and reads any Parquet files it finds, regardless of the folder depth.

Option A is incorrect because inferSchema has no effect on Parquet reads (Parquet files carry their own schema) and does not enable recursive file lookup.

Option C is incorrect because a single wildcard only expands one directory level (the year folders); files nested deeper under the month/day subdirectories are not matched.

Option D is incorrect because it will only read files directly within /path/events/data/ and not subdirectories like /2023/01/01.

Databricks documentation reference:

"To read files recursively from nested folders, set the recursiveFileLookup option to true. This is useful when data is organized in hierarchical folder structures" -- Databricks documentation on Parquet files ingestion and options.



Given a DataFrame df that has 10 partitions, after running the code:

result = df.coalesce(20)

How many partitions will the result DataFrame have?

  1. 10
  2. Same number as the cluster executors
  3. 1
  4. 20

Answer(s): A

Explanation:

The .coalesce(numPartitions) function is used to reduce the number of partitions in a DataFrame. It does not increase the number of partitions. If the specified number of partitions is greater than the current number, it will not have any effect.

From the official Spark documentation:

"coalesce() results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim one or more of the current partitions."
However, if you try to increase partitions using coalesce (e.g., from 10 to 20), the number of partitions remains unchanged.

Hence, df.coalesce(20) will still return a DataFrame with 10 partitions.
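
A quick sketch illustrating this behavior (the example DataFrame is hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1000).repartition(10)      # start with 10 partitions
print(df.rdd.getNumPartitions())            # 10

result = df.coalesce(20)                    # coalesce cannot increase the count
print(result.rdd.getNumPartitions())        # still 10

increased = df.repartition(20)              # repartition shuffles and can increase
print(increased.rdd.getNumPartitions())     # 20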


Reference:

Apache Spark 3.5 Programming Guide: RDD and DataFrame Operations, coalesce()



Given the following code snippet in my_spark_app.py:



What is the role of the driver node?

  1. The driver node orchestrates the execution by transforming actions into tasks and distributing them to worker nodes
  2. The driver node only provides the user interface for monitoring the application
  3. The driver node holds the DataFrame data and performs all computations locally
  4. The driver node stores the final result after computations are completed by worker nodes

Answer(s): A

Explanation:

In the Spark architecture, the driver node is responsible for orchestrating the execution of a Spark application. It converts user-defined transformations and actions into a logical plan, optimizes it into a physical plan, and then splits the plan into tasks that are distributed to the executor nodes.

As per Databricks and Spark documentation:

"The driver node is responsible for maintaining information about the Spark application, responding to a user's program or input, and analyzing, distributing, and scheduling work across the executors."

This means:

Option A is correct because the driver schedules and coordinates the job execution.

Option B is incorrect because the driver does more than just UI monitoring.

Option C is incorrect since data and computations are distributed across executor nodes.

Option D is incorrect; results are returned to the driver but not stored long-term by it.
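
A minimal sketch of this division of labor; the file path and column name are hypothetical, since the original snippet is not shown:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("my_spark_app").getOrCreate()

# Transformations are only recorded by the driver as a logical plan;
# no data is read or moved yet.
df = spark.read.parquet("/data/events/")          # hypothetical path
active = df.filter(df["status"] == "active")      # hypothetical column

# The action below causes the driver to optimize the plan, split it into
# tasks, and schedule those tasks on the executors, which do the actual work.
print(active.count())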


Reference:

Databricks Certified Developer Spark 3.5 Documentation: Spark Architecture, Driver vs. Executors.



A Spark developer wants to improve the performance of an existing PySpark UDF that runs a hash function that is not available in the standard Spark functions library. The existing UDF code is:



import hashlib
import pyspark.sql.functions as sf
from pyspark.sql.types import StringType

def shake_256(raw):
    return hashlib.shake_256(raw.encode()).hexdigest(20)

shake_256_udf = sf.udf(shake_256, StringType())

The developer wants to replace this existing UDF with a Pandas UDF to improve performance. The developer changes the definition of shake_256_udf to this:

shake_256_udf = sf.pandas_udf(shake_256, StringType())

However, the developer receives the error:

What should the signature of the shake_256() function be changed to in order to fix this error?

  1. def shake_256(df: pd.Series) -> str:
  2. def shake_256(df: Iterator[pd.Series]) -> Iterator[pd.Series]:
  3. def shake_256(raw: str) -> str:
  4. def shake_256(df: pd.Series) -> pd.Series:

Answer(s): D

Explanation:

When converting a standard PySpark UDF to a Pandas UDF for performance optimization, the function must operate on a Pandas Series as input and return a Pandas Series as output.

In this case, the original function signature, def shake_256(raw: str) -> str, is scalar (it operates on one value at a time) and is therefore not compatible with a Series-to-Series Pandas UDF.

According to the official Spark documentation:

"Pandas UDFs operate on pandas.Series and return pandas.Series. The function definition should be:

def my_udf(s: pd.Series) -> pd.Series:

and it must be registered using pandas_udf(...)."

Therefore, to fix the error:

The function should be updated to:

def shake_256(df: pd.Series) -> pd.Series:
    return df.apply(lambda x: hashlib.shake_256(x.encode()).hexdigest(20))

This will allow Spark to efficiently execute the Pandas UDF in vectorized form, improving performance compared to standard UDFs.
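
Putting it together, a self-contained sketch of the corrected Series-to-Series Pandas UDF might look like this; the sample DataFrame and column names are illustrative:

import hashlib

import pandas as pd
import pyspark.sql.functions as sf
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

def shake_256(df: pd.Series) -> pd.Series:
    # Hash each value in the incoming batch element-wise.
    return df.apply(lambda x: hashlib.shake_256(x.encode()).hexdigest(20))

shake_256_udf = sf.pandas_udf(shake_256, StringType())

# Illustrative usage: hash a string column in vectorized batches on the executors.
input_df = spark.createDataFrame([("a",), ("b",)], ["event_id"])
input_df.withColumn("event_hash", shake_256_udf("event_id")).show(truncate=False)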


Reference:

Apache Spark 3.5 Documentation: User-Defined Functions, Pandas UDFs



A developer is working with a pandas DataFrame containing user behavior data from a web application.
Which approach should be used for executing a groupBy operation in parallel across all workers in Apache Spark 3.5?

  1. Use the applyInPandas API:
    df.groupby("user_id").applyInPandas(mean_func, schema="user_id long, value double").show()
  2. Use the mapInPandas API:
    df.mapInPandas(mean_func, schema="user_id long, value double").show()
  3. Use a regular Spark UDF:
    from pyspark.sql.functions import mean
    df.groupBy("user_id").agg(mean("value")).show()
  4. Use a Pandas UDF:
    @pandas_udf("double")
    def mean_func(value: pd.Series) -> float:
    return value.mean()
    df.groupby("user_id").agg(mean_func(df["value"])).show()

Answer(s): A

Explanation:

The correct approach to perform a parallelized groupBy operation across Spark worker nodes using Pandas API is via applyInPandas. This function enables grouped map operations using Pandas logic in a distributed Spark environment. It applies a user-defined function to each group of data represented as a Pandas DataFrame.

As per the Databricks documentation:

"applyInPandas() allows for vectorized operations on grouped data in Spark. It applies a user-defined function to each group of a DataFrame and outputs a new DataFrame. This is the recommended approach for using Pandas logic across grouped data with parallel execution."

Option A is correct and achieves this parallel execution.

Option B (mapInPandas) applies the function to batches of the whole DataFrame; it is not a grouped operation.

Option C uses built-in aggregation functions, which are efficient but not customizable with Pandas logic.

Option D defines an aggregating Pandas UDF that collapses each group to a single value; it computes an aggregate but does not apply an arbitrary group-wise Pandas transformation.

Therefore, to run a groupBy with parallel Pandas logic on Spark workers, Option A using applyInPandas is the only correct answer.
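
For reference, a minimal runnable sketch of the applyInPandas pattern; the sample data, the mean_func body, and the schema string are illustrative:

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 10.0), (1, 20.0), (2, 30.0)],
    ["user_id", "value"],
)

def mean_func(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each group arrives on an executor as a complete Pandas DataFrame,
    # so arbitrary Pandas logic can run here in parallel across groups.
    return pd.DataFrame({
        "user_id": [pdf["user_id"].iloc[0]],
        "value": [pdf["value"].mean()],
    })

df.groupby("user_id").applyInPandas(
    mean_func, schema="user_id long, value double"
).show()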


Reference:

Apache Spark 3.5 Documentation: Pandas API on Spark, Grouped Map Pandas UDFs (applyInPandas)


