Free Professional Data Engineer Exam Braindumps (page: 25)


You are planning to use Google's Dataflow SDK to analyze customer data such as the records displayed below. Your project requirement is to extract only the customer name from the data source and then write it to an output PCollection.

Tom,555 X street

Tim,553 Y street

Sam, 111 Z street

Which operation is best suited for the above data processing requirement?

  A. ParDo
  B. Sink API
  C. Source API
  D. Data extraction

Answer(s): A

Explanation:

In the Google Cloud Dataflow SDK, you can use a ParDo transform to extract only the customer name from each element in your PCollection.


Reference:

https://cloud.google.com/dataflow/model/par-do
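For illustration, here is a minimal Beam Java sketch of this pattern (the class name and the inline sample records are hypothetical, chosen to mirror the data above): a ParDo applies a DoFn to each comma-separated record and keeps only the name field.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;

    public class ExtractNames {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Sample input records, matching the data shown in the question.
        PCollection<String> records =
            p.apply(Create.of("Tom,555 X street", "Tim,553 Y street", "Sam, 111 Z street"));

        // ParDo runs the DoFn on every element independently; here it emits only
        // the name field of each comma-separated record.
        PCollection<String> names =
            records.apply(ParDo.of(new DoFn<String, String>() {
              @ProcessElement
              public void processElement(@Element String line, OutputReceiver<String> out) {
                out.output(line.split(",")[0].trim());
              }
            }));

        p.run().waitUntilFinish();
      }
    }

Because each element is handled independently, ParDo is the natural fit for this kind of element-wise extraction, as opposed to the Source and Sink APIs, which deal with reading from and writing to external storage.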



Which Cloud Dataflow / Beam feature should you use to aggregate data in an unbounded data source every hour based on the time when the data entered the pipeline?

  A. An hourly watermark
  B. An event time trigger
  C. The withAllowedLateness method
  D. A processing time trigger

Answer(s): D

Explanation:

When collecting and grouping data into windows, Beam uses triggers to determine when to emit the aggregated results of each window.

Processing time triggers. These triggers operate on the processing time: the time when the data element is processed at any given stage in the pipeline.

Event time triggers. These triggers operate on the event time, as indicated by the timestamp on each data element. Beam's default trigger is event time-based.


Reference:

https://beam.apache.org/documentation/programming-guide/#triggers
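As a sketch of the idea (the events PCollection, class name, and method name are hypothetical), the Beam Java snippet below uses a processing-time trigger so that a pane is emitted roughly one hour after the first element arrives at this stage, regardless of the elements' event timestamps.

    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
    import org.apache.beam.sdk.transforms.windowing.GlobalWindows;
    import org.apache.beam.sdk.transforms.windowing.Repeatedly;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;
    import org.joda.time.Duration;

    public class HourlyProcessingTimeCounts {
      // Aggregates an unbounded stream by processing time: the trigger fires one hour
      // after the first element of a pane is seen, then repeats indefinitely.
      static PCollection<KV<String, Long>> countHourly(PCollection<String> events) {
        return events
            .apply(Window.<String>into(new GlobalWindows())
                .triggering(Repeatedly.forever(
                    AfterProcessingTime.pastFirstElementInPane()
                        .plusDelayOf(Duration.standardHours(1))))
                .withAllowedLateness(Duration.ZERO)   // event-time lateness is irrelevant here
                .discardingFiredPanes())
            .apply(Count.perElement());
      }
    }

An event time trigger, by contrast, would fire based on the timestamps carried by the elements themselves rather than on when the data entered the pipeline.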



Which of the following is NOT true about Dataflow pipelines?

  A. Dataflow pipelines are tied to Dataflow, and cannot be run on any other runner
  B. Dataflow pipelines can consume data from other Google Cloud services
  C. Dataflow pipelines can be programmed in Java
  D. Dataflow pipelines use a unified programming model, so can work both with streaming and batch data sources

Answer(s): A

Explanation:

Dataflow pipelines can also run on alternative runners such as Apache Spark and Apache Flink, because they are built using the Apache Beam SDK.


Reference:

https://cloud.google.com/dataflow/
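To sketch this portability (the class name is hypothetical, and the corresponding runner dependency must be on the classpath), the same Beam pipeline code can target different runners purely through pipeline options passed at launch time.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class RunnerChoice {
      public static void main(String[] args) {
        // The runner is selected at launch time, e.g.:
        //   --runner=DirectRunner (local), --runner=DataflowRunner,
        //   --runner=FlinkRunner, --runner=SparkRunner
        PipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline p = Pipeline.create(options);
        // ... the same transforms are applied here regardless of the chosen runner ...
        p.run().waitUntilFinish();
      }
    }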



You are developing a software application using Google's Dataflow SDK, and want to use conditionals, for loops, and other complex programming structures to create a branching pipeline.
Which component will be used for the data processing operation?

  A. PCollection
  B. Transform
  C. Pipeline
  D. Sink API

Answer(s): B

Explanation:

In the Dataflow SDK, a transform represents a data processing operation. You can use conditionals, for loops, and other complex programming structures to build a branching pipeline.


Reference:

https://cloud.google.com/dataflow/model/programming-model
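As a minimal Beam Java sketch (the lines input, the prefixes, and the class name are hypothetical), the snippet below shows how ordinary Java control flow at pipeline-construction time attaches several transform branches to the same PCollection.

    import java.util.Arrays;
    import java.util.List;
    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.Filter;
    import org.apache.beam.sdk.values.PCollection;

    public class BranchingSketch {
      // Control flow runs while the pipeline graph is being built; each apply() call
      // adds a node, so the loop below creates one independent branch per prefix.
      static void addBranches(PCollection<String> lines, boolean includeCounts) {
        List<String> prefixes = Arrays.asList("ERROR", "WARN");
        for (String prefix : prefixes) {
          PCollection<String> branch =
              lines.apply("Filter" + prefix, Filter.by(line -> line.startsWith(prefix)));
          if (includeCounts) {
            branch.apply("Count" + prefix, Count.globally());
          }
        }
      }
    }

The branching itself comes from applying multiple transforms to the same PCollection; the conditionals and loops only decide which transforms get applied.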





