Databricks Certified Associate Developer for Apache Spark 3.0 Exam (Page 6)

Updated On: 26-Jan-2026

Which of the following code blocks reads in the parquet file stored at location filePath, given that all columns in the parquet file contain only whole numbers and are stored in the most appropriate format for this kind of data?

  1. spark.read.schema(
       StructType(
         StructField("transactionId", IntegerType(), True),
         StructField("predError", IntegerType(), True)
       )).load(filePath)
  2. spark.read.schema([
       StructField("transactionId", NumberType(), True),
       StructField("predError", IntegerType(), True)
     ]).load(filePath)
  3. spark.read.schema(
       StructType([
         StructField("transactionId", StringType(), True),
         StructField("predError", IntegerType(), True)]
       )).parquet(filePath)
  4. spark.read.schema(
       StructType([
         StructField("transactionId", IntegerType(), True),
         StructField("predError", IntegerType(), True)]
       )).format("parquet").load(filePath)
  5. spark.read.schema([
       StructField("transactionId", IntegerType(), True),
       StructField("predError", IntegerType(), True)
     ]).load(filePath, format="parquet")

Answer(s): D

Explanation:

The schema passed into schema() should be of type StructType or a string, so all options in which a plain list is passed are incorrect.
In addition, since all columns contain only whole numbers, IntegerType() is the correct data type here. NumberType() is not a valid data type, and StringType() would fail: the parquet file is stored in the "most appropriate format for this kind of data", so the columns are most likely stored as integers, and Spark does not convert data types when a schema is provided.
Also note that StructType accepts only a single argument, a list of StructFields, so passing multiple StructField arguments directly to StructType is invalid.
Finally, Spark needs to know which format the file is in. All of the format specifications used in the options are valid, though, since Spark assumes parquet by default when no file format is explicitly passed.
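
A minimal, runnable sketch of the correct pattern (the explicit schema plus format("parquet").load()). The file location and the tiny DataFrame written to it first are hypothetical, added only so the read has something to load:

  from pyspark.sql import SparkSession
  from pyspark.sql.types import StructType, StructField, IntegerType

  spark = SparkSession.builder.getOrCreate()

  schema = StructType([
      StructField("transactionId", IntegerType(), True),
      StructField("predError", IntegerType(), True),
  ])

  # Hypothetical location; a small parquet file is written here first so
  # the read below has something to load.
  filePath = "/tmp/transactions.parquet"
  spark.createDataFrame([(1, 3), (2, 6)], schema).write.mode("overwrite").parquet(filePath)

  # Explicit format("parquet") followed by load(), as in the correct option;
  # spark.read.schema(schema).parquet(filePath) behaves the same way.
  df = spark.read.schema(schema).format("parquet").load(filePath)
  df.printSchema()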
More info: pyspark.sql.DataFrameReader.schema — PySpark 3.1.2 documentation and StructType — PySpark 3.1.2 documentation



Which of the following code blocks returns a DataFrame that is an inner join of DataFrame itemsDf and DataFrame transactionsDf, on columns itemId and productId, respectively and in which every itemId just appears once?

  1. itemsDf.join(transactionsDf, "itemsDf.itemId==transactionsDf.productId").distinct("itemId")
  2. itemsDf.join(transactionsDf, itemsDf.itemId==transactionsDf.productId).dropDuplicates(["itemId"])
  3. itemsDf.join(transactionsDf, itemsDf.itemId==transactionsDf.productId).dropDuplicates("itemId")
  4. itemsDf.join(transactionsDf, itemsDf.itemId==transactionsDf.productId, how="inner").distinct(["itemId"])
  5. itemsDf.join(transactionsDf, "itemsDf.itemId==transactionsDf.productId", how="inner").dropDuplicates(["itemId"])

Answer(s): B

Explanation:

Filtering out duplicate rows based on specific columns is achieved with the dropDuplicates method, not with the distinct method, which does not take any arguments.
The second argument of the join() method only accepts strings if they are column names. The SQL-like string "itemsDf.itemId==transactionsDf.productId" is therefore invalid; the join condition has to be passed as a Column expression instead.
In addition, it is not necessary to specify how="inner", since inner is already the default join type.
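
A minimal sketch of the correct option, using hypothetical toy data in place of the DataFrames described in the question:

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  # Hypothetical stand-ins for itemsDf and transactionsDf.
  itemsDf = spark.createDataFrame([(1, "shirt"), (2, "hat")], ["itemId", "itemName"])
  transactionsDf = spark.createDataFrame(
      [(1, 10.0), (1, 12.0), (2, 3.5)], ["productId", "value"]
  )

  # Inner join (the default) on a Column expression, then keep each itemId once.
  joinedDf = (itemsDf
              .join(transactionsDf, itemsDf.itemId == transactionsDf.productId)
              .dropDuplicates(["itemId"]))
  joinedDf.show()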

More info: pyspark.sql.DataFrame.join — PySpark 3.1.2 documentation



Which of the following code blocks produces the following output, given DataFrame transactionsDf?

  1. transactionsDf.schema.print()
  2. transactionsDf.rdd.printSchema()
  3. transactionsDf.rdd.formatSchema()
  4. transactionsDf.printSchema()
  5. print(transactionsDf.schema)

Answer(s): D

Explanation:

The output is the typical output of a DataFrame.printSchema() call. The DataFrame's RDD representation does not have a printSchema or formatSchema method (the available methods are listed in the RDD documentation linked below). The output of print(transactionsDf.schema) is this: StructType(List(StructField(transactionId,IntegerType,true),StructField(predError,IntegerType,true),StructField(value,IntegerType,true),StructField(storeId,IntegerType,true),StructField(productId,IntegerType,true),StructField(f,IntegerType,true))). It includes the same information as the nicely formatted original output, but is not nicely formatted itself. Lastly, the DataFrame's schema attribute does not have a print() method.
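
A minimal sketch of the correct call, using a hypothetical empty DataFrame built from the columns listed in the explanation above, so the kind of output the question refers to can be reproduced:

  from pyspark.sql import SparkSession
  from pyspark.sql.types import StructType, StructField, IntegerType

  spark = SparkSession.builder.getOrCreate()

  # Hypothetical stand-in for transactionsDf with the schema described above.
  schema = StructType([
      StructField("transactionId", IntegerType(), True),
      StructField("predError", IntegerType(), True),
      StructField("value", IntegerType(), True),
      StructField("storeId", IntegerType(), True),
      StructField("productId", IntegerType(), True),
      StructField("f", IntegerType(), True),
  ])
  transactionsDf = spark.createDataFrame([], schema)

  # Prints the indented tree, e.g.:
  # root
  #  |-- transactionId: integer (nullable = true)
  #  |-- predError: integer (nullable = true)
  #  ...
  transactionsDf.printSchema()

  # By contrast, this only prints the raw StructType representation.
  print(transactionsDf.schema)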

More info:
- pyspark.RDD: pyspark.RDD — PySpark 3.1.2 documentation
- DataFrame.printSchema(): pyspark.sql.DataFrame.printSchema — PySpark 3.1.2 documentation




Which of the following code blocks returns a copy of DataFrame transactionsDf where the column storeId has been converted to string type?

  1. transactionsDf.withColumn("storeId", convert("storeId", "string"))
  2. transactionsDf.withColumn("storeId", col("storeId", "string"))
  3. transactionsDf.withColumn("storeId", col("storeId").convert("string"))
  4. transactionsDf.withColumn("storeId", col("storeId").cast("string"))
  5. transactionsDf.withColumn("storeId", convert("storeId").as("string"))

Answer(s): D

Explanation:

This question tests your knowledge of the cast syntax. cast is a method of the Column class. It is worth noting that a column's type could also be converted using the Column.astype() method, which is just an alias for cast.
Find more info in the documentation linked below.
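
A minimal sketch of the correct option, assuming a hypothetical transactionsDf with an integer storeId column:

  from pyspark.sql import SparkSession
  from pyspark.sql.functions import col

  spark = SparkSession.builder.getOrCreate()

  # Hypothetical stand-in for transactionsDf.
  transactionsDf = spark.createDataFrame([(1, 25), (2, 25), (3, 3)], ["transactionId", "storeId"])

  # Replace storeId with a string-typed version of itself.
  convertedDf = transactionsDf.withColumn("storeId", col("storeId").cast("string"))

  # Column.astype is an alias for cast, so this is equivalent:
  # transactionsDf.withColumn("storeId", col("storeId").astype("string"))
  convertedDf.printSchema()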
More info: pyspark.sql.Column.cast — PySpark 3.1.2 documentation



Which of the following code blocks applies the boolean-returning Python function evaluateTestSuccess to column storeId of DataFrame transactionsDf as a user-defined function?

  1. from pyspark.sql import types as T
     evaluateTestSuccessUDF = udf(evaluateTestSuccess, T.BooleanType())
     transactionsDf.withColumn("result", evaluateTestSuccessUDF(col("storeId")))
  2. evaluateTestSuccessUDF = udf(evaluateTestSuccess)
     transactionsDf.withColumn("result", evaluateTestSuccessUDF(storeId))
  3. from pyspark.sql import types as T
     evaluateTestSuccessUDF = udf(evaluateTestSuccess, T.IntegerType())
     transactionsDf.withColumn("result", evaluateTestSuccess(col("storeId")))
  4. evaluateTestSuccessUDF = udf(evaluateTestSuccess)
     transactionsDf.withColumn("result", evaluateTestSuccessUDF(col("storeId")))
  5. from pyspark.sql import types as T
     evaluateTestSuccessUDF = udf(evaluateTestSuccess, T.BooleanType())
     transactionsDf.withColumn("result", evaluateTestSuccess(col("storeId")))

Answer(s): A

Explanation:

Recognizing that the UDF declaration requires a return type unless it is a string (the default) is important for solving this question. In addition, you should make sure that the generated UDF (evaluateTestSuccessUDF), and not the plain Python function (evaluateTestSuccess), is applied to column storeId.
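
A minimal, runnable sketch of the correct option. The DataFrame contents and the body of evaluateTestSuccess are hypothetical, standing in for the objects named in the question:

  from pyspark.sql import SparkSession
  from pyspark.sql import types as T
  from pyspark.sql.functions import udf, col

  spark = SparkSession.builder.getOrCreate()

  # Hypothetical stand-in for transactionsDf.
  transactionsDf = spark.createDataFrame([(1, 25), (2, 3)], ["transactionId", "storeId"])

  def evaluateTestSuccess(storeId):
      # Hypothetical boolean-returning rule.
      return storeId > 10

  # Register the Python function as a UDF, declaring the boolean return type
  # (string would otherwise be assumed).
  evaluateTestSuccessUDF = udf(evaluateTestSuccess, T.BooleanType())

  # Apply the UDF (not the plain Python function) to column storeId.
  transactionsDf.withColumn("result", evaluateTestSuccessUDF(col("storeId"))).show()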

More info: pyspark.sql.functions.udf — PySpark 3.1.2 documentation





