Free Certified Data Engineer Professional Exam Braindumps (page: 6)


A Structured Streaming job deployed to production has been experiencing delays during peak hours of the day. At present, during normal execution, each microbatch of data is processed in less than 3 seconds. During peak hours of the day, execution time for each microbatch becomes very inconsistent, sometimes exceeding 30 seconds. The streaming write is currently configured with a trigger interval of 10 seconds.

Holding all other variables constant and assuming records need to be processed in less than 10 seconds, which adjustment will meet the requirement?

  A. Decrease the trigger interval to 5 seconds; triggering batches more frequently allows idle executors to begin processing the next batch while longer-running tasks from previous batches finish.
  B. Increase the trigger interval to 30 seconds; setting the trigger interval near the maximum execution time observed for each batch is always best practice to ensure no records are dropped.
  C. The trigger interval cannot be modified without modifying the checkpoint directory; to maintain the current stream state, increase the number of shuffle partitions to maximize parallelism.
  D. Use the trigger once option and configure a Databricks job to execute the query every 10 seconds; this ensures all backlogged records are processed with each batch.
  E. Decrease the trigger interval to 5 seconds; triggering batches more frequently may prevent records from backing up and large batches from causing spill.

Answer(s): E
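
Answer E works because a shorter trigger interval starts micro-batches more often, so fewer records accumulate between batches and each batch stays small. A minimal PySpark sketch of setting the trigger interval, using a hypothetical rate source and placeholder sink/checkpoint paths in place of the production stream:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("trigger-interval-demo").getOrCreate()

# Hypothetical rate source standing in for the production input stream.
events = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

# Triggering every 5 seconds starts micro-batches more frequently, so fewer
# records back up between batches and individual batches stay small.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "/tmp/demo/rate_events")                     # hypothetical sink path
    .option("checkpointLocation", "/tmp/demo/checkpoints/rate")  # hypothetical checkpoint path
    .trigger(processingTime="5 seconds")                         # the adjustment in answer E
    .start()
)
```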



Which statement describes Delta Lake Auto Compaction?

  A. An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an OPTIMIZE job is executed toward a default of 1 GB.
  B. Before a Jobs cluster terminates, OPTIMIZE is executed on all tables modified during the most recent job.
  C. Optimized writes use logical partitions instead of directory partitions; because partition boundaries are only represented in metadata, fewer small files are written.
  D. Data is queued in a messaging bus instead of committing data directly to memory; all data is committed from the messaging bus in one batch once the job is complete.
  E. An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an OPTIMIZE job is executed toward a default of 128 MB.

Answer(s): E
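
For context, Auto Compaction can be enabled per session or per table. A minimal sketch, assuming the Databricks settings `spark.databricks.delta.autoCompact.enabled` and the `delta.autoOptimize.autoCompact` table property, an active `spark` session, and a hypothetical table name:

```python
# Session-level switch: compact small files asynchronously after writes in this
# session (Databricks-specific configuration).
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

# Table-level switch for a single Delta table ("sales_bronze" is a hypothetical name).
spark.sql("""
    ALTER TABLE sales_bronze
    SET TBLPROPERTIES (delta.autoOptimize.autoCompact = true)
""")
```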



Which statement characterizes the general programming model used by Spark Structured Streaming?

  A. Structured Streaming leverages the parallel processing of GPUs to achieve highly parallel data throughput.
  B. Structured Streaming is implemented as a messaging bus and is derived from Apache Kafka.
  C. Structured Streaming uses specialized hardware and I/O streams to achieve sub-second latency for data transfer.
  D. Structured Streaming models new data arriving in a data stream as new rows appended to an unbounded table.
  E. Structured Streaming relies on a distributed network of nodes that hold incremental state values for cached stages.

Answer(s): D
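
Answer D describes the unbounded-table model: a streaming DataFrame is queried with the same API as a static table, and each arriving record is treated as a new row appended to that table. A minimal PySpark sketch, using the built-in rate source and an in-memory sink as stand-ins for real sources and sinks:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("unbounded-table-demo").getOrCreate()

# Each record arriving on the stream is modeled as a new row appended to an
# unbounded input table; the rate source is a stand-in for any real source.
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# The same DataFrame operations used on static tables apply; Spark incrementally
# updates this windowed count as new rows are appended.
counts = stream_df.groupBy(F.window("timestamp", "1 minute")).count()

query = (
    counts.writeStream
    .outputMode("complete")
    .format("memory")               # in-memory sink for illustration only
    .queryName("per_minute_counts")
    .start()
)
```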



Which configuration parameter directly affects the size of a Spark partition upon ingestion of data into Spark?

  A. spark.sql.files.maxPartitionBytes
  B. spark.sql.autoBroadcastJoinThreshold
  C. spark.sql.files.openCostInBytes
  D. spark.sql.adaptive.coalescePartitions.minPartitionNum
  E. spark.sql.adaptive.advisoryPartitionSizeInBytes

Answer(s): A
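
`spark.sql.files.maxPartitionBytes` (128 MB by default) caps how many bytes of file data go into each partition when Spark reads files, so it directly controls partition size at ingestion. A minimal sketch, with a hypothetical input path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("max-partition-bytes-demo").getOrCreate()

# Lower the cap from the 128 MB default to 64 MB so large files are split into
# more, smaller partitions at read time.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(64 * 1024 * 1024))

# Hypothetical Parquet directory; the resulting partition count reflects the new cap.
df = spark.read.parquet("/mnt/raw/events/")
print(df.rdd.getNumPartitions())
```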





