Free Certified Data Engineer Professional Exam Braindumps (page: 17)


Two of the most common data locations on Databricks are the DBFS root storage and external object storage mounted with dbutils.fs.mount().

Which of the following statements is correct?

  A. DBFS is a file system protocol that allows users to interact with files stored in object storage using syntax and guarantees similar to Unix file systems.
  B. By default, both the DBFS root and mounted data sources are only accessible to workspace administrators.
  C. The DBFS root is the most secure location to store data, because mounted storage volumes must have full public read and write permissions.
  D. Neither the DBFS root nor mounted storage can be accessed when using %sh in a Databricks notebook.
  E. The DBFS root stores files in ephemeral block volumes attached to the driver, while mounted directories will always persist saved data to external storage between sessions.

Answer(s): A
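
As a quick illustration of answer A, a notebook cell can interact with both the DBFS root and a mount using familiar file-system-style syntax. A minimal sketch (the bucket name, mount point, and paths below are placeholders):

    # List the DBFS root, analogous to `ls /` on a Unix system.
    display(dbutils.fs.ls("/"))

    # Mount external object storage; auth settings are omitted for brevity.
    dbutils.fs.mount(
        source="s3a://example-bucket",     # hypothetical bucket
        mount_point="/mnt/example",
        extra_configs={}
    )

    # Files under the mount are addressable by path, like any directory.
    df = spark.read.json("dbfs:/mnt/example/landing/")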



The following code has been migrated to a Databricks notebook from a legacy workload:
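(The original cell is not reproduced here; based on the answer options, it was a %sh cell along the lines of the sketch below, with the repository URL, script name, and paths as placeholders.)

    %sh
    git clone https://github.com/example-org/legacy-etl.git
    cd legacy-etl
    python run.py                        # extracts ~1 GB to the driver's local disk
    mv output/* /dbfs/mnt/raw/landing/   # single-threaded copy into mounted storage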

The code executes successfully and produces logically correct results; however, it takes over 20 minutes to extract and load around 1 GB of data.

Which statement is a possible explanation for this behavior?

  A. %sh triggers a cluster restart to collect and install Git. Most of the latency is related to cluster startup time.
  B. Instead of cloning, the code should use %sh pip install so that the Python code can get executed in parallel across all nodes in a cluster.
  C. %sh does not distribute file moving operations; the final line of code should be updated to use %fs instead.
  D. Python will always execute slower than Scala on Databricks. The run.py script should be refactored to Scala.
  E. %sh executes shell code on the driver node. The code does not take advantage of the worker nodes or Databricks optimized Spark.

Answer(s): E
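
Because %sh runs only on the driver, a faster approach is to let Spark perform the extract and load in parallel across the cluster. A minimal sketch (the source path, file format, and target location are assumptions):

    # Reading with Spark distributes the work across the worker nodes instead of
    # funneling everything through the driver's shell.
    df = (spark.read
          .format("csv")
          .option("header", "true")
          .load("dbfs:/mnt/raw/landing/"))   # hypothetical mounted source

    df.write.format("delta").mode("overwrite").save("dbfs:/mnt/raw/bronze/")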



The data science team has requested assistance in accelerating queries on free-form text from user reviews. The data is currently stored in Parquet with the schema below:

item_id INT, user_id INT, review_id INT, rating FLOAT, review STRING

The review column contains the full text of the review left by the user. Specifically, the data science team is looking to identify whether any of 30 keywords exist in this field.

A junior data engineer suggests that converting this data to Delta Lake will improve query performance.
Which response to the junior data engineer's suggestion is correct?

  A. Delta Lake statistics are not optimized for free text fields with high cardinality.
  B. Text data cannot be stored with Delta Lake.
  C. ZORDER ON review will need to be run to see performance gains.
  D. The Delta log creates a term matrix for free text fields to support selective filtering.
  E. Delta Lake statistics are only collected on the first 4 columns in a table.

Answer(s): A
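
Delta Lake collects per-file min/max statistics that enable file skipping for selective predicates, but min/max values on a high-cardinality free-text column cannot prune files for a substring match, so a keyword search still scans every file. A hedged sketch of the kind of query involved (the keywords and table path are placeholders):

    from pyspark.sql import functions as F

    keywords = ["refund", "broken", "excellent"]   # 3 of the 30 assumed keywords

    # Build an OR of contains() predicates over the review text.
    predicate = F.lit(False)
    for kw in keywords:
        predicate = predicate | F.col("review").contains(kw)

    reviews = spark.read.format("delta").load("/mnt/reviews")  # hypothetical path
    hits = reviews.filter(predicate)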



Assuming that the Databricks CLI has been installed and configured correctly, which Databricks CLI command can be used to upload a custom Python wheel to object storage mounted to DBFS for use with a production job?

  A. configure
  B. fs
  C. jobs
  D. libraries
  E. workspace

Answer(s): B
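
The fs command group exposes DBFS file operations, including copying local files into mounted storage. A hedged example (the wheel file name and target directory are placeholders):

    databricks fs cp ./dist/my_project-1.0-py3-none-any.whl dbfs:/mnt/prod/wheels/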





