Free Professional Data Engineer Exam Braindumps (page: 31)

Page 31 of 68

Which row keys are likely to cause a disproportionate number of reads and/or writes on a particular node in a Bigtable cluster (select 2 answers)?

  1. A sequential numeric ID
  2. A timestamp followed by a stock symbol
  3. A non-sequential numeric ID
  4. A stock symbol followed by a timestamp

Answer(s): 1, 2

Explanation:

Using a timestamp as the first element of a row key causes problems: when a row key for a time series begins with a timestamp, all of your writes target a single node, fill that node, and then move on to the next node in the cluster, resulting in hotspotting.
Similarly, suppose your system assigns a sequential numeric ID to each of your application's users. You might be tempted to use that ID as the row key for your table. However, since new users are more likely to be active users, this approach is likely to push most of your traffic to a small number of nodes. [https://cloud.google.com/bigtable/docs/schema-design]


Reference:

https://cloud.google.com/bigtable/docs/schema-design-time-series#ensure_that_your_row_key_avoids_hotspotting
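The contrast between the two key designs can be sketched in a few lines. This is an illustrative toy, not the Bigtable client API; the separator and the sample symbols are assumptions for demonstration.

```python
# Hypothetical row-key builders illustrating answer choices 2 and 4.
def timestamp_first_key(symbol, ts):
    # Anti-pattern: every write at a given moment shares the same key
    # prefix, so concurrent writes all land on the same node (hotspotting).
    return f"{ts}#{symbol}"

def symbol_first_key(symbol, ts):
    # Better: keys for different symbols start with different prefixes,
    # spreading concurrent writes across the keyspace.
    return f"{symbol}#{ts}"

ts = 1700000000
bad = [timestamp_first_key(s, ts) for s in ("GOOG", "AAPL", "MSFT")]
good = [symbol_first_key(s, ts) for s in ("GOOG", "AAPL", "MSFT")]

# The timestamp-first keys collapse onto one prefix; symbol-first keys spread out.
print(len({k.split("#")[0] for k in bad}))   # 1 distinct prefix
print(len({k.split("#")[0] for k in good}))  # 3 distinct prefixes
```

The same logic explains answer choice 1: a sequential numeric ID behaves like a timestamp, concentrating writes for the newest (and most active) users on one node.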



Which of these is NOT a way to customize the software on Dataproc cluster instances?

  1. Set initialization actions
  2. Modify configuration files using cluster properties
  3. Configure the cluster using Cloud Deployment Manager
  4. Log into the master node and make changes from there

Answer(s): 3

Explanation:

Each of the other options is a supported way to customize cluster software. You can access the master node of the cluster by clicking the SSH button next to it in the Cloud Console and make changes from there. You can use the --properties option of the gcloud dataproc clusters create command in the Google Cloud SDK to modify many common configuration files when creating a cluster. And when creating a Cloud Dataproc cluster, you can specify initialization actions (executables or scripts) that Cloud Dataproc will run on all nodes immediately after the cluster is set up. Cloud Deployment Manager provisions infrastructure; it is not a mechanism for customizing the software on cluster instances. [https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions]


Reference:

https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/cluster-properties
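The two supported customization hooks from the answer choices can be sketched as a cluster definition. The dict below is an illustrative fragment mirroring the general shape of a Dataproc cluster config, not a runnable API call; the bucket path, script name, and property value are assumptions.

```python
# Sketch of a Dataproc cluster definition showing two customization hooks:
# cluster properties and initialization actions. Paths are hypothetical.
cluster = {
    "cluster_name": "example-cluster",
    "config": {
        # Equivalent to `--properties` at cluster-creation time; the
        # "prefix:key" form selects the config file (here spark-defaults.conf).
        "software_config": {
            "properties": {"spark:spark.executor.memory": "4g"},
        },
        # Scripts run on every node right after the cluster is set up.
        "initialization_actions": [
            {"executable_file": "gs://my-bucket/install-deps.sh"},
        ],
    },
}

print(cluster["config"]["software_config"]["properties"])
```

Note that nothing in this structure involves Cloud Deployment Manager, which operates at the infrastructure-provisioning layer rather than on the software inside cluster instances.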



Which of the following are feature engineering techniques? (Select 2 answers)

  1. Hidden feature layers
  2. Feature prioritization
  3. Crossed feature columns
  4. Bucketization of a continuous feature

Answer(s): 3, 4

Explanation:

Selecting and crafting the right set of feature columns is key to learning an effective model. Bucketization is a process of dividing the entire range of a continuous feature into a set of consecutive bins/buckets, and then converting the original numerical feature into a bucket ID (as a categorical feature) depending on which bucket that value falls into. Using each base feature column separately may not be enough to explain the data. To learn the differences between different feature combinations, we can add crossed feature columns to the model.


Reference:

https://www.tensorflow.org/tutorials/wide#selecting_and_engineering_features_for_the_model
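Both techniques from the answer can be sketched without any ML framework. The bucket boundaries and feature values below are made up for illustration; a crossed feature is simply the joint value of two base features.

```python
import numpy as np

# Bucketization: map a continuous feature (age) to a bucket ID.
ages = np.array([3, 17, 25, 42, 80])
boundaries = [18, 35, 65]                    # assumed bucket edges
age_bucket = np.digitize(ages, boundaries)   # bucket ID per value
print(age_bucket.tolist())  # [0, 0, 1, 2, 3]

# Feature cross: combine two base features into one joint categorical value,
# letting a linear model learn per-combination weights.
cities = ["NYC", "SF", "NYC"]
buckets = [1, 2, 2]
crossed = [f"{c}_x_{b}" for c, b in zip(cities, buckets)]
print(crossed)  # ['NYC_x_1', 'SF_x_2', 'NYC_x_2']
```

The bucket ID is then treated as a categorical feature, and each distinct crossed value gets its own model weight, which is what lets the model capture interactions that the base features alone cannot explain.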



How can you get a neural network to learn about relationships between categories in a categorical feature?

  1. Create a multi-hot column
  2. Create a one-hot column
  3. Create a hash bucket
  4. Create an embedding column

Answer(s): 4

Explanation:

There are two problems with one-hot encoding. First, it has high dimensionality, meaning that instead of having just one value, like a continuous feature, it has many values, or dimensions. This makes computation more time-consuming, especially if a feature has a very large number of categories. The second problem is that it doesn't encode any relationships between the categories. They are completely independent from each other, so the network has no way of knowing which ones are similar to each other.

Both of these problems can be solved by representing a categorical feature with an embedding column. The idea is that each category has a smaller vector with, let's say, 5 values in it. But unlike a one-hot vector, the values are not usually 0. The values are weights, similar to the weights that are used for basic features in a neural network. The difference is that each category has a set of weights (5 of them in this case). You can think of each value in the embedding vector as a feature of the category. So, if two categories are very similar to each other, then their embedding vectors should be very similar too.


Reference:

https://cloudacademy.com/google/introduction-to-google-cloud-machine-learning-engine-course/a-wide-and-deep-model.html
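The idea above can be sketched with a toy embedding table. The vocabulary, the 5-dimensional vectors, and their values are all made up for illustration; in a real network these weights are learned during training.

```python
import numpy as np

# Toy embedding column: each category ID maps to a 5-dimensional vector.
# Values are hand-picked so that "cat" and "kitten" are deliberately close.
vocab = {"cat": 0, "kitten": 1, "truck": 2}
embeddings = np.array([
    [0.90, 0.10, 0.30, 0.70, 0.20],   # cat
    [0.85, 0.15, 0.25, 0.65, 0.20],   # kitten: similar to cat
    [-0.40, 0.80, -0.90, 0.10, 0.50], # truck: dissimilar to both
])

def lookup(word):
    # An embedding column is just a lookup of the category's weight vector.
    return embeddings[vocab[word]]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Similar categories have similar vectors, so the network can generalize
# between them, something a one-hot encoding cannot express.
print(cosine(lookup("cat"), lookup("kitten")) > cosine(lookup("cat"), lookup("truck")))  # True
```

A one-hot encoding of these three words would give every pair a similarity of zero, which is exactly the "no relationships between categories" problem the explanation describes.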





