Free Professional Data Engineer Exam Braindumps (page: 1)

Page 1 of 68

Your company uses a proprietary system to send inventory data every 6 hours to a data ingestion service in the cloud. The transmitted data includes a payload of several fields and the timestamp of the transmission. If there are any concerns about a transmission, the system re-transmits the data. How should you deduplicate the data most efficiently?

  A. Assign global unique identifiers (GUIDs) to each data entry.
  B. Compute the hash value of each data entry, and compare it with all historical data.
  C. Store each data entry as the primary key in a separate database and apply an index.
  D. Maintain a database table to store the hash value and other metadata for each data entry.

Answer(s): D
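
A minimal sketch of the idea behind the chosen answer, in plain Java. The payload field values and the in-memory set (standing in for the metadata table) are hypothetical:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

public class DedupSketch {
    // Stands in for the database table of previously seen hashes and metadata.
    private static final Set<String> seenHashes = new HashSet<>();

    // Hash only the payload fields, not the transmission timestamp,
    // so a re-transmission of the same record produces the same hash.
    static String hashPayload(String... payloadFields) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        for (String field : payloadFields) {
            digest.update(field.getBytes(StandardCharsets.UTF_8));
            digest.update((byte) 0); // field separator
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    // Returns true if the entry is new; false if it is a duplicate re-transmission.
    static boolean acceptIfNew(String... payloadFields) throws Exception {
        return seenHashes.add(hashPayload(payloadFields));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(acceptIfNew("sku-123", "42", "warehouse-7")); // true
        System.out.println(acceptIfNew("sku-123", "42", "warehouse-7")); // false (duplicate)
    }
}
```

Hashing only the payload means a re-transmission maps to the same hash and can be discarded with a single lookup against the stored metadata, instead of comparing each entry against all historical data.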



Your company is performing data preprocessing for a learning algorithm in Google Cloud Dataflow. Numerous data logs are being generated during this step, and the team wants to analyze them. Due to the dynamic nature of the campaign, the data is growing exponentially every hour.

The data scientists have written the following code to read the data for new key features in the logs.

```java
BigQueryIO.Read
    .named("ReadLogData")
    .from("clouddataflow-readonly:samples.log_data")
```

You want to improve the performance of this data read.
What should you do?

  A. Specify the TableReference object in the code.
  B. Use the .fromQuery operation to read specific fields from the table.
  C. Use both the Google BigQuery TableSchema and TableFieldSchema classes.
  D. Call a transform that returns TableRow objects, where each element in the PCollection represents a single row in the table.

Answer(s): D
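
For context, the snippet in the question uses the legacy Dataflow 1.x Java SDK. A minimal sketch of the same read styles in the current Apache Beam Java SDK is shown below; the column names in the query are hypothetical, and a query read additionally needs a GCS temp location configured on the pipeline options:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class ReadLogDataSketch {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Read the whole table; each PCollection element is one TableRow (one row of the table).
        PCollection<TableRow> allRows =
            p.apply("ReadLogData",
                BigQueryIO.readTableRows()
                    .from("clouddataflow-readonly:samples.log_data"));

        // Alternatively, restrict the read to just the key fields with a query
        // (hypothetical column names shown here).
        PCollection<TableRow> keyFields =
            p.apply("ReadKeyFields",
                BigQueryIO.readTableRows()
                    .fromQuery("SELECT timestamp, log_level FROM [clouddataflow-readonly:samples.log_data]"));

        p.run().waitUntilFinish();
    }
}
```

readTableRows() yields a PCollection of TableRow objects with one element per row, while fromQuery() lets the read be narrowed to only the fields the analysis actually needs.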



You are creating a model to predict housing prices. Due to budget constraints, you must run it on a single resource-constrained virtual machine.
Which learning algorithm should you use?

  A. Linear regression
  B. Logistic classification
  C. Recurrent neural network
  D. Feedforward neural network

Answer(s): A
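
A minimal sketch of why ordinary linear regression fits on a resource-constrained VM, using hypothetical size/price pairs and plain gradient descent with no ML framework:

```java
public class LinearRegressionSketch {
    public static void main(String[] args) {
        // Hypothetical training data: house size in square metres -> price in thousands.
        double[] x = {50, 80, 100, 120, 150};
        double[] y = {150, 230, 290, 350, 440};

        // Model: price = w * size + b, fitted with batch gradient descent.
        double w = 0.0, b = 0.0, lr = 0.0001;
        int n = x.length;
        for (int epoch = 0; epoch < 100_000; epoch++) {
            double gradW = 0.0, gradB = 0.0;
            for (int i = 0; i < n; i++) {
                double err = (w * x[i] + b) - y[i];
                gradW += err * x[i];
                gradB += err;
            }
            w -= lr * gradW / n;
            b -= lr * gradB / n;
        }
        System.out.printf("price ≈ %.2f * size + %.2f%n", w, b);
    }
}
```

The entire model is two parameters updated with a few multiplications per example, so memory and compute stay tiny compared with training a recurrent or feedforward neural network.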



You need to store and analyze social media postings in Google BigQuery at a rate of 10,000 messages per minute in near real-time. You initially designed the application to use streaming inserts for individual postings. The application also performs data aggregations right after the streaming inserts. You discover that the queries after the streaming inserts do not exhibit strong consistency, and reports from the queries might miss in-flight data. How can you adjust your application design?

  A. Re-write the application to load accumulated data every 2 minutes.
  B. Convert the streaming insert code to batch load for individual messages.
  C. Load the original message to Google Cloud SQL, and export the table every hour to BigQuery via streaming inserts.
  D. Estimate the average latency for data availability after streaming inserts, and always run queries after waiting twice as long.

Answer(s): D

Explanation:

Streamed rows first land in BigQuery's streaming buffer and are only later written to managed storage. Queries that run while rows are still in the buffer can miss that in-flight data, which causes the inconsistency described above. Once the rows have been flushed to storage the problem disappears, so the application should estimate the typical buffer latency and delay its aggregation queries until roughly twice that long has passed.
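
A minimal sketch of the adjusted flow with the google-cloud-bigquery Java client; the dataset, table, row fields, and latency estimate are hypothetical:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.InsertAllRequest;
import com.google.cloud.bigquery.InsertAllResponse;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableId;
import java.util.Map;
import java.util.concurrent.TimeUnit;

public class StreamingInsertSketch {
    public static void main(String[] args) throws Exception {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        TableId table = TableId.of("social_media", "postings"); // hypothetical dataset and table

        // Streaming insert of a single posting (hypothetical fields).
        InsertAllResponse response = bigquery.insertAll(
            InsertAllRequest.newBuilder(table)
                .addRow(Map.of("user", "alice", "message", "hello", "ts", "2023-06-16T12:00:00Z"))
                .build());
        if (response.hasErrors()) {
            System.err.println("Insert errors: " + response.getInsertErrors());
        }

        // Wait roughly twice the measured streaming-buffer latency before aggregating,
        // so the query does not miss rows that are still in flight.
        long estimatedLatencySeconds = 30; // hypothetical measured average
        TimeUnit.SECONDS.sleep(2 * estimatedLatencySeconds);

        bigquery.query(QueryJobConfiguration.of(
            "SELECT COUNT(*) AS postings FROM social_media.postings"));
    }
}
```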


