A production workload incrementally applies updates from an external Change Data Capture (CDC) feed to a Delta Lake table as an always-on Structured Streaming job. When the data for this table was initially migrated, OPTIMIZE was executed and most data files were resized to 1 GB. Auto Optimize and Auto Compaction are both enabled for the streaming production job. A recent review of the data files shows that most are under 64 MB, although each partition in the table contains at least 1 GB of data and the total table size is over 10 TB.
Which of the following most likely explains these smaller file sizes?
Answer(s): A
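For context, Auto Optimize and Auto Compaction target much smaller files than a manual OPTIMIZE run: Auto Compaction compacts toward roughly 128 MB files rather than 1 GB, which is consistent with the sizes observed above. A minimal sketch of how these features are enabled as table properties (the table name is hypothetical):

```python
# Enabling Auto Optimize on an existing Delta table (hypothetical table name).
# Auto Compaction targets ~128 MB files, well below the 1 GB files produced
# by a manual OPTIMIZE.
spark.sql("""
    ALTER TABLE bronze_cdc SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")
```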
Which statement regarding stream-static joins and static Delta tables is correct?
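For reference, a stream-static join re-evaluates the static side on every microbatch, so the latest available version of a static Delta table is used as each batch is processed. A minimal sketch, assuming hypothetical table and column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Static side: a Delta dimension table (hypothetical name).
static_df = spark.read.table("dim_customers")

# Streaming side: a Delta-backed streaming source (hypothetical name).
stream_df = spark.readStream.table("bronze_orders")

# The static Delta table is re-read as each microbatch is processed,
# so updates to it are picked up without restarting the stream.
joined = stream_df.join(static_df, on="customer_id", how="inner")
```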
A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device.
The streaming DataFrame df has the following schema:
"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"
Code block:
Choose the response that correctly fills in the blank within the code block to complete this task.
Answer(s): B
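The pattern being tested is a tumbling (non-overlapping) window aggregation. A minimal sketch of what the completed code block likely looks like; the watermark duration and output column names are assumptions:

```python
from pyspark.sql import functions as F

agg_df = (
    df.withWatermark("event_time", "10 minutes")     # assumed watermark duration
      .groupBy(F.window("event_time", "5 minutes"))  # one duration => tumbling windows
      .agg(
          F.avg("humidity").alias("avg_humidity"),
          F.avg("temp").alias("avg_temp"),
      )
)
```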
A data architect has designed a system in which two Structured Streaming jobs will concurrently write to a single bronze Delta table. Each job subscribes to a different topic from an Apache Kafka source, but they will write data with the same schema. To keep the directory structure simple, a data engineer has decided to nest a checkpoint directory to be shared by both streams.
The proposed directory structure is displayed below:
Which statement describes whether this checkpoint directory structure is valid for the given scenario, and why?
Answer(s): E
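Structured Streaming requires a dedicated checkpoint location per query; checkpoints cannot be shared between two streaming writes, even when they target the same table. A minimal sketch with separate checkpoint paths (broker address, topic names, and paths are hypothetical):

```python
# Two concurrent streams appending to one bronze Delta table, each with
# its OWN checkpoint directory (checkpoints must never be shared).
for topic in ("topic_a", "topic_b"):
    (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", topic)
        .load()
        .writeStream
        .format("delta")
        .option("checkpointLocation", f"/checkpoints/bronze/{topic}")
        .toTable("bronze"))
```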
A Structured Streaming job deployed to production has been experiencing delays during peak hours of the day. At present, during normal execution, each microbatch of data is processed in less than 3 seconds. During peak hours, execution time for each microbatch becomes very inconsistent, sometimes exceeding 30 seconds. The streaming write is currently configured with a trigger interval of 10 seconds.
Holding all other variables constant, and assuming records need to be processed in less than 10 seconds, which adjustment will meet the requirement?
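For illustration, the trigger interval is set on the streaming write; shortening it yields smaller, more frequent microbatches, which can keep per-batch processing under the latency target. A sketch assuming a 5-second interval and hypothetical path/table names:

```python
(df.writeStream
   .format("delta")
   .option("checkpointLocation", "/checkpoints/silver_events")  # hypothetical path
   .trigger(processingTime="5 seconds")  # assumed adjusted interval
   .toTable("silver_events"))
```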
Which statement describes Delta Lake Auto Compaction?
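As background: Auto Compaction runs after a write to a Delta table completes, executing synchronously on the same cluster that performed the write, and combines small files toward a target of roughly 128 MB. Besides table properties, it can be enabled session-wide; a minimal sketch (config key per Databricks documentation):

```python
# Session-scoped enablement (applies to Delta writes from this Spark session).
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
```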
Which statement characterizes the general programming model used by Spark Structured Streaming?
Answer(s): D
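Structured Streaming's model treats a live data stream as an unbounded table that is continuously appended to; queries are written with the same DataFrame API as batch queries, and Spark executes them incrementally as new rows arrive. A minimal sketch (table names and paths are hypothetical):

```python
events = spark.readStream.table("bronze_events")

# Written exactly like a batch aggregation; executed incrementally.
counts = events.groupBy("event_type").count()

(counts.writeStream
    .format("delta")
    .outputMode("complete")
    .option("checkpointLocation", "/checkpoints/event_counts")
    .toTable("event_counts"))
```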
Which configuration parameter directly affects the size of a Spark partition upon ingestion of data into Spark?
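For reference, spark.sql.files.maxPartitionBytes is the setting commonly cited here: it caps how many bytes of file data are packed into each Spark partition when files are read (128 MB by default). A minimal sketch of setting it:

```python
# Cap the bytes of file data per Spark partition at ingestion (default 128 MB).
spark.conf.set("spark.sql.files.maxPartitionBytes", "134217728")
```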