Databricks Certified Data Engineer Professional Exam Questions
Certified Data Engineer Professional (Page 5)

Updated On: 23-Apr-2026

A Spark job is taking longer than expected. Using the Spark UI, a data engineer notes that, for tasks in a particular stage, the minimum and median task durations are roughly the same, but the maximum duration is roughly 100 times as long as the minimum.

Which situation is causing increased duration of the overall job?

  1. Task queueing resulting from improper thread pool assignment.
  2. Spill resulting from attached volume storage being too small.
  3. Network latency due to some cluster nodes being in different regions from the source data
  4. Skew caused by more data being assigned to a subset of Spark partitions.
  5. Credential validation errors while pulling data from an external system.

Answer(s): D
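Once skew is confirmed, one common mitigation is Spark's Adaptive Query Execution, which can split oversized partitions at runtime during sort-merge joins. A minimal PySpark sketch (the configuration keys are standard Spark 3.x settings; the table and join key names are hypothetical):

    # Enable AQE and its skew-join handling so oversized partitions are
    # split at runtime instead of producing a few very long tasks.
    spark.conf.set("spark.sql.adaptive.enabled", "true")
    spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

    # Hypothetical join that previously showed one task ~100x slower than the rest.
    orders = spark.table("orders")
    customers = spark.table("customers")
    joined = orders.join(customers, "customer_id")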



Each cluster configuration below has the same totals: 400 GB of RAM, 160 cores, and exactly one executor per VM.

Given a job with at least one wide transformation, which of the following cluster configurations will result in maximum performance?

  1. · Total VMs: 1
    · 400 GB per Executor
    · 160 Cores / Executor
  2. · Total VMs: 8
    · 50 GB per Executor
    · 20 Cores / Executor
  3. · Total VMs: 16
    · 25 GB per Executor
    · 10 Cores / Executor
  4. · Total VMs: 4
    · 100 GB per Executor
    · 40 Cores / Executor
  5. · Total VMs: 2
    · 200 GB per Executor
    · 80 Cores / Executor

Answer(s): B



A junior data engineer has implemented the following code block.



The view new_events contains a batch of records with the same schema as the events Delta table. The event_id field serves as a unique key for this table.

When this query is executed, what will happen with new records that have the same event_id as an existing record?

  1. They are merged.
  2. They are ignored.
  3. They are updated.
  4. They are inserted.
  5. They are deleted.

Answer(s): B
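The original code block is not reproduced above. A minimal sketch consistent with the stated answer (a merge that defines only an insert clause for unmatched keys, so rows whose event_id already exists are ignored) might look like the following; the exact statement in the source question may differ:

    # Hypothetical insert-only merge: records in new_events whose event_id
    # already exists in events satisfy the ON condition, and because no
    # WHEN MATCHED clause is defined, those records are simply ignored.
    spark.sql("""
        MERGE INTO events
        USING new_events
        ON events.event_id = new_events.event_id
        WHEN NOT MATCHED THEN INSERT *
    """)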



A junior data engineer seeks to leverage Delta Lake's Change Data Feed functionality to create a Type 1 table representing all of the values that have ever been valid for all rows in a bronze table created with the property delta.enableChangeDataFeed = true. They plan to execute the following code as a daily job:



Which statement describes the execution and results of running the above query multiple times?

  1. Each time the job is executed, newly updated records will be merged into the target table, overwriting previous values with the same primary keys.
  2. Each time the job is executed, the entire available history of inserted or updated records will be appended to the target table, resulting in many duplicate entries.
  3. Each time the job is executed, the target table will be overwritten using the entire history of inserted or updated records, giving the desired result.
  4. Each time the job is executed, the differences between the original and current versions are calculated; this may result in duplicate entries for some records.
  5. Each time the job is executed, only those records that have been inserted or updated since the last execution will be appended to the target table, giving the desired result.

Answer(s): B
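The daily job itself is not reproduced above. A sketch consistent with the stated answer, where every run performs a batch read of the change feed from the earliest version and appends it to the target, might look like the following (the table names and starting version are assumptions):

    # Hypothetical daily job: each execution re-reads the entire change feed
    # from version 0 and appends it, so previously loaded changes are written
    # again and the target accumulates duplicates.
    (spark.read
        .format("delta")
        .option("readChangeFeed", "true")
        .option("startingVersion", 0)
        .table("bronze")
        .write
        .mode("append")
        .saveAsTable("target"))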



A new data engineer notices that a critical field was omitted from an application that writes its Kafka source to Delta Lake, even though the field was present in the Kafka source. As a result, the field is also missing from the data written to dependent, long-term storage. The retention threshold on the Kafka service is seven days, and the pipeline has been in production for three months.

Which describes how Delta Lake can help to avoid data loss of this nature in the future?

  1. The Delta log and Structured Streaming checkpoints record the full history of the Kafka producer.
  2. Delta Lake schema evolution can retroactively calculate the correct value for newly added fields, as long as the data was in the original source.
  3. Delta Lake automatically checks that all fields present in the source data are included in the ingestion layer.
  4. Data can never be permanently dropped or deleted from Delta Lake, so data loss is not possible under any circumstance.
  5. Ingesting all raw data and metadata from Kafka to a bronze Delta table creates a permanent, replayable history of the data state.

Answer(s): E
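The pattern behind this answer is to land the unparsed Kafka payload and its metadata in a bronze Delta table before any field selection, so that data older than Kafka's seven-day retention can still be replayed. A hedged Structured Streaming sketch (the broker, topic, and checkpoint path are placeholders):

    # Hypothetical bronze-layer ingestion: persisting the raw value plus Kafka
    # metadata in Delta creates a permanent, replayable history, so a field
    # dropped downstream can later be recovered from bronze.
    (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "<broker-host:9092>")
        .option("subscribe", "<topic>")
        .load()   # columns include key, value, topic, partition, offset, timestamp
        .writeStream
        .option("checkpointLocation", "<checkpoint-path>")
        .toTable("bronze"))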



A nightly job ingests data into a Delta Lake table using the following code:



The next step in the pipeline requires a function that returns an object that can be used to manipulate the new records that have not yet been processed into the next table in the pipeline.

Which code snippet completes this function definition?

def new_records():

  1. return spark.readStream.table("bronze")
  2. return spark.readStream.load("bronze")
  3. return spark.read.option("readChangeFeed", "true").table("bronze")

Answer(s): A
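Assuming the nightly job appends to a Delta table named bronze, the completed function based on the selected option reads that table as a stream; combined with a checkpoint on the downstream write, only records not yet processed are delivered. A sketch (the silver table name and checkpoint path are hypothetical):

    def new_records():
        # A streaming read of the Delta table; the downstream writer's
        # checkpoint tracks which records have already been processed.
        return spark.readStream.table("bronze")

    # Hypothetical downstream usage:
    (new_records()
        .writeStream
        .option("checkpointLocation", "<checkpoint-path>")
        .trigger(availableNow=True)
        .toTable("silver"))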



A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON structure.

The silver_device_recordings table will be used downstream to power several production monitoring dashboards and a production model. At present, 45 of the 100 fields are being used in at least one of these applications.

The data engineer is trying to determine the best approach for dealing with schema declaration, given the highly nested structure of the data and the numerous fields.

Which of the following accurately presents information about Delta Lake and Databricks that may impact their decision-making process?

  1. The Tungsten encoding used by Databricks is optimized for storing string data; newly-added native support for querying JSON strings means that string types are always most efficient.
  2. Because Delta Lake uses Parquet for data storage, data types can be easily evolved by just modifying file footer information in place.
  3. Human labor in writing code is the largest cost associated with data engineering workloads; as such, automating table declaration logic should be a priority in all migration workloads.
  4. Because Databricks will infer schema using types that allow all observed data to be processed, setting types manually provides greater assurance of data quality enforcement.
  5. Schema inference and evolution on Databricks ensure that inferred types will always accurately match the data types used by downstream systems.

Answer(s): D
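To illustrate the selected answer, declaring the schema explicitly (rather than relying on inference) means writes that do not conform to the declared types fail instead of being silently accepted under a wider inferred type. A minimal sketch with hypothetical column names:

    # Hypothetical explicit declaration for the silver table: device_id must be
    # a BIGINT and reading_time a TIMESTAMP; non-conforming writes are rejected
    # instead of widening to whatever type inference would have chosen.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS silver_device_recordings (
            device_id BIGINT,
            reading_time TIMESTAMP,
            heartrate DOUBLE
        ) USING DELTA
    """)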



The data engineering team maintains the following code:



Assuming that this code produces logically correct results and the data in the source tables has been de-duplicated and validated, which statement describes what will occur when this code is executed?

  1. A batch job will update the enriched_itemized_orders_by_account table, replacing only those rows that have different values than the current version of the table, using accountID as the primary key.
  2. The enriched_itemized_orders_by_account table will be overwritten using the current valid version of data in each of the three tables referenced in the join logic.
  3. An incremental job will leverage information in the state store to identify unjoined rows in the source tables and write these rows to the enriched_itemized_orders_by_account table.
  4. An incremental job will detect if new rows have been written to any of the source tables; if new rows are detected, all results will be recalculated and used to overwrite the enriched_itemized_orders_by_account table.
  5. No computation will occur until enriched_itemized_orders_by_account is queried; upon query materialization, results will be calculated using the current valid version of data in each of the three tables referenced in the join logic.

Answer(s): B
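The maintained code is not reproduced above. A sketch consistent with the stated answer, a batch job that joins the current versions of the three source tables and overwrites the target, might look like the following (the source table and column names are assumptions):

    # Hypothetical batch job: every run recomputes the join against the current
    # version of each source table and overwrites the target with the result.
    accounts = spark.table("accounts")
    orders = spark.table("orders")
    items = spark.table("items")

    enriched = (orders
        .join(items, "item_id")
        .join(accounts, "account_id"))

    (enriched.write
        .mode("overwrite")
        .saveAsTable("enriched_itemized_orders_by_account"))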




