Free Professional Data Engineer Exam Braindumps (page: 35)


You are implementing security best practices on your data pipeline. Currently, you are manually executing jobs as the Project Owner. You want to automate these jobs by taking nightly batch files containing non-public information from Google Cloud Storage, processing them with a Spark Scala job on a Google Cloud Dataproc cluster, and depositing the results into Google BigQuery.

How should you securely run this workload?

  A. Restrict the Google Cloud Storage bucket so only you can see the files.
  B. Grant the Project Owner role to a service account, and run the job with it.
  C. Use a service account with the ability to read the batch files and to write to BigQuery.
  D. Use a user account with the Project Viewer role on the Cloud Dataproc cluster to read the batch files and write to BigQuery.

Answer(s): C
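
The sketch below illustrates the least-privilege pattern behind option C: a dedicated service account that can only read the input bucket and write to the results dataset. It is not part of the exam material; the project, bucket, key file, and dataset names are placeholders, and the roles mentioned in the comments are one reasonable choice, not the only one.

```python
# Hypothetical least-privilege setup for the nightly batch job (option C).
# Grant the service account roles/storage.objectViewer on the input bucket and
# roles/bigquery.dataEditor on the results dataset -- nothing project-wide.
from google.oauth2 import service_account
from google.cloud import bigquery

# Key file for the dedicated job service account (placeholder path).
creds = service_account.Credentials.from_service_account_file("batch-job-sa.json")

bq_client = bigquery.Client(project="my-project", credentials=creds)

# Load the processed nightly output from Cloud Storage into BigQuery using the
# same narrowly scoped identity (the Spark job on Dataproc would produce this file).
load_job = bq_client.load_table_from_uri(
    "gs://nightly-batch-bucket/2024-01-01/results.csv",
    "my-project.results_dataset.nightly_results",
    job_config=bigquery.LoadJobConfig(autodetect=True),
)
load_job.result()  # wait for the load to finish
```

In practice the Dataproc cluster itself would be created with this service account attached, so the Spark Scala job inherits the same limited permissions instead of running as the Project Owner.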



You are using Google BigQuery as your data warehouse. Your users report that the following simple query is running very slowly, no matter when they run the query:

SELECT country, state, city FROM [myproject:mydataset.mytable] GROUP BY country

You check the query plan for the query and see the following output in the Read section of Stage 1:

[Query plan output image omitted.]

What is the most likely cause of the delay for this query?

  A. Users are running too many concurrent queries in the system.
  B. The [myproject:mydataset.mytable] table has too many partitions.
  C. Either the state or the city columns in the [myproject:mydataset.mytable] table have too many NULL values.
  D. Most rows in the [myproject:mydataset.mytable] table have the same value in the country column, causing data skew.

Answer(s): D
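
One way to confirm the skew described in option D is to count rows per country value; if a single value dominates, the GROUP BY workers handling that value become the bottleneck regardless of when the query runs. This is a hedged sketch, not exam material: it assumes the standard SQL form of the same table name.

```python
# Hypothetical skew check: count rows per country and look for one dominant value.
from google.cloud import bigquery

client = bigquery.Client(project="myproject")

query = """
    SELECT country, COUNT(*) AS row_count
    FROM `myproject.mydataset.mytable`
    GROUP BY country
    ORDER BY row_count DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(f"{row.country}: {row.row_count}")
```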



Your globally distributed auction application allows users to bid on items. Occasionally, users place identical bids at nearly identical times, and different application servers process those bids. Each bid event contains the item, amount, user, and timestamp. You want to collate those bid events into a single location in real time to determine which user bid first.
What should you do?

  A. Create a file on a shared file system and have the application servers write all bid events to that file. Process the file with Apache Hadoop to identify which user bid first.
  B. Have each application server write the bid events to Cloud Pub/Sub as they occur. Push the events from Cloud Pub/Sub to a custom endpoint that writes the bid event information into Cloud SQL.
  C. Set up a MySQL database for each application server to write bid events into. Periodically query each of those distributed MySQL databases and update a master MySQL database with bid event information.
  D. Have each application server write the bid events to Google Cloud Pub/Sub as they occur. Use a pull subscription to pull the bid events using Google Cloud Dataflow. Give the bid for each item to the user in the bid event that is processed first.

Answer(s): B
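
The key point in option B is that Pub/Sub delivery order is not guaranteed, so the winner must be decided by the bid timestamp stored in Cloud SQL, not by which event happens to be processed first. The sketch below is illustrative only; the project, topic, table, and field names are placeholders.

```python
# Hypothetical publisher side of option B: each app server publishes bid events
# to a Pub/Sub topic as they occur. A push subscription delivers them to an
# endpoint that inserts the events into Cloud SQL.
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "bid-events")

event = {
    "item": "lot-42",
    "amount": 150.00,
    "user": "alice",
    "timestamp": "2024-01-01T12:00:00.123456Z",  # bid time recorded by the app server
}
publisher.publish(topic_path, json.dumps(event).encode("utf-8")).result()

# The "first bidder" is then a query over the stored timestamps, e.g.:
#   SELECT user FROM bids WHERE item = 'lot-42' ORDER BY bid_timestamp ASC LIMIT 1;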



Your organization has been collecting and analyzing data in Google BigQuery for 6 months. The majority of the data analyzed is placed in a time-partitioned table named events_partitioned. To reduce the cost of queries, your organization created a view called events, which queries only the last 14 days of data. The view is described in legacy SQL. Next month, existing applications will be connecting to BigQuery to read the events data via an ODBC connection. You need to ensure the applications can connect.
Which two actions should you take? (Choose two.)

  A. Create a new view over events using standard SQL
  B. Create a new partitioned table using a standard SQL query
  C. Create a new view over events_partitioned using standard SQL
  D. Create a service account for the ODBC connection to use for authentication
  E. Create a Google Cloud Identity and Access Management (Cloud IAM) role for the ODBC connection and shared "events"

Answer(s): C,D
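
A rough sketch of options C and D follows: define a standard SQL view directly over events_partitioned (ODBC drivers work with standard SQL, and a standard SQL view cannot be layered on the legacy SQL events view), and authenticate the ODBC connection with a dedicated service account. The dataset, view, and service account names are placeholders, not from the exam.

```python
# Hypothetical standard SQL view over the partitioned table, limited to the
# last 14 days (the same window the legacy "events" view covered).
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

view = bigquery.Table("my-project.mydataset.events_std")
view.view_query = """
    SELECT *
    FROM `my-project.mydataset.events_partitioned`
    WHERE _PARTITIONTIME >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 14 DAY)
"""
client.create_table(view)

# The ODBC DSN would then authenticate as a dedicated service account, e.g.:
#   gcloud iam service-accounts create odbc-reader
#   gcloud projects add-iam-policy-binding my-project \
#       --member="serviceAccount:odbc-reader@my-project.iam.gserviceaccount.com" \
#       --role="roles/bigquery.dataViewer"
```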





