Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Exam
Databricks Certified Associate Developer for Apache Spark 3.5 - Python (Page 4)

Updated On: 26-Jan-2026

Given the code fragment:



import pyspark.pandas as ps

psdf = ps.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

Which method is used to convert a Pandas API on Spark DataFrame (pyspark.pandas.DataFrame) into a standard PySpark DataFrame (pyspark.sql.DataFrame)?

  A. psdf.to_spark()
  B. psdf.to_pyspark()
  C. psdf.to_pandas()
  D. psdf.to_dataframe()

Answer(s): A

Explanation:

Pandas API on Spark (pyspark.pandas) allows interoperability with PySpark DataFrames. To convert a pyspark.pandas.DataFrame to a standard PySpark DataFrame, you use .to_spark().

Example:

df = psdf.to_spark()

This is the officially supported method as per Databricks Documentation.

Incorrect options:

B, D: Invalid or nonexistent methods.

C: Converts to a local pandas DataFrame, not a PySpark DataFrame.
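
For completeness, here is a minimal round-trip sketch (assuming an active Spark session; pandas_api() requires Spark 3.2+, and the column names are only illustrative):

import pyspark.pandas as ps

# Pandas API on Spark DataFrame
psdf = ps.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# Convert to a standard PySpark DataFrame
sdf = psdf.to_spark()
sdf.printSchema()

# Convert back to a pandas-on-Spark DataFrame if needed
psdf_again = sdf.pandas_api()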



A Spark engineer is troubleshooting a Spark application that has been encountering out-of-memory errors during execution. By reviewing the Spark driver logs, the engineer notices multiple "GC overhead limit exceeded" messages.

Which action should the engineer take to resolve this issue?

  A. Optimize the data processing logic by repartitioning the DataFrame.
  B. Modify the Spark configuration to disable garbage collection.
  C. Increase the memory allocated to the Spark Driver.
  D. Cache large DataFrames to persist them in memory.

Answer(s): C

Explanation:

The message "GC overhead limit exceeded" typically indicates that the JVM is spending too much time in garbage collection with little memory recovery. This suggests that the driver or executor is under-provisioned in memory.

The most effective remedy is to increase the driver memory using:

--driver-memory 4g

This is confirmed in Spark's official troubleshooting documentation:

"If you see a lot of GC overhead limit exceeded errors in the driver logs, it's a sign that the driver is running out of memory."
-- Spark Tuning Guide

Why others are incorrect:

A may help but does not directly address the driver memory shortage.

B is not a valid action; GC cannot be disabled.

D increases memory usage, worsening the problem.
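
As a sketch, the driver memory could also be raised when the session is created (the 4g value and app name are illustrative; in client mode this setting must be supplied before the driver JVM starts, e.g. via spark-submit --driver-memory):

from pyspark.sql import SparkSession

# Illustrative only: raise driver memory at session creation.
# In client mode the driver JVM is already running when this executes,
# so prefer spark-submit --driver-memory 4g or spark-defaults.conf.
spark = (
    SparkSession.builder
    .appName("finance-app")
    .config("spark.driver.memory", "4g")
    .getOrCreate()
)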



A DataFrame df has columns name, age, and salary. The developer needs to sort the DataFrame by age in ascending order and salary in descending order.

Which code snippet meets the requirement of the developer?

  1. df.orderBy(col("age").asc(), col("salary").asc()).show()
  2. df.sort("age", "salary", ascending=[True, True]).show()
  3. df.sort("age", "salary", ascending=[False, True]).show()
  4. df.orderBy("age", "salary", ascending=[True, False]).show()

Answer(s): D

Explanation:

To sort a PySpark DataFrame by multiple columns with mixed sort directions, the correct usage is:

df.orderBy("age", "salary", ascending=[True, False])

age will be sorted in ascending order; salary will be sorted in descending order.

The orderBy() and sort() methods in PySpark accept a list of booleans to specify the sort direction for each column.
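
A short runnable sketch of the two equivalent ways to express this mixed-direction sort (assuming an existing df with these columns and an active SparkSession):

from pyspark.sql.functions import col

# Boolean-list form: age ascending, salary descending
df.orderBy("age", "salary", ascending=[True, False]).show()

# Equivalent Column-expression form
df.orderBy(col("age").asc(), col("salary").desc()).show()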

Reference:

PySpark API - DataFrame.orderBy



What is the difference between df.cache() and df.persist() for a Spark DataFrame?

  A. Both cache() and persist() can be used to set the default storage level (MEMORY_AND_DISK_SER)
  B. Both functions perform the same operation. The persist() function provides improved performance as its default storage level is DISK_ONLY.
  C. persist() - Persists the DataFrame with the default storage level (MEMORY_AND_DISK_SER) and cache() - Can be used to set different storage levels to persist the contents of the DataFrame.
  D. cache() - Persists the DataFrame with the default storage level (MEMORY_AND_DISK) and persist() - Can be used to set different storage levels to persist the contents of the DataFrame.

Answer(s): D

Explanation:

df.cache() is shorthand for df.persist(StorageLevel.MEMORY_AND_DISK)

df.persist() allows specifying any storage level such as MEMORY_ONLY, DISK_ONLY, MEMORY_AND_DISK_SER, etc.

By default, persist() uses MEMORY_AND_DISK, unless specified otherwise.
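
A minimal sketch contrasting the two calls (the DataFrames below are illustrative; the counts are triggered only to materialize the persisted data):

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-demo").getOrCreate()  # illustrative

df1 = spark.range(1_000_000)
df1.cache()                            # shorthand for the default storage level

df2 = spark.range(1_000_000)
df2.persist(StorageLevel.DISK_ONLY)    # explicit storage level

df1.count()   # actions materialize the persisted data
df2.count()

df1.unpersist()
df2.unpersist()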


Reference:

Spark Programming Guide - Caching and Persistence



A data analyst builds a Spark application to analyze finance data and performs the following operations: filter, select, groupBy, and coalesce.

Which operation results in a shuffle?

  A. groupBy
  B. filter
  C. select
  D. coalesce

Answer(s): A

Explanation:

The groupBy() operation causes a shuffle because it requires all values for a specific key to be brought together, which may involve moving data across partitions.

In contrast:

filter() and select() are narrow transformations and do not cause shuffles.

coalesce() reduces the number of partitions by merging existing partitions locally, so it avoids a full shuffle (unlike repartition()). See the sketch below.
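
A small sketch to observe this in the physical plan (the region and amount columns are illustrative; the groupBy stage appears as an Exchange node in the explain() output):

from pyspark.sql import functions as F

# Narrow transformations: no shuffle
narrow = df.filter(F.col("amount") > 0).select("region", "amount")

# Wide transformation: groupBy requires a shuffle
totals = narrow.groupBy("region").agg(F.sum("amount").alias("total_amount"))
totals.explain()

# coalesce() reduces partitions without a full shuffle
fewer = totals.coalesce(1)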


Reference:

Apache Spark - Understanding Shuffle


