QUESTION: 25

Which of the following tools can be used to distribute large-scale feature engineering without the use of a UDF or pandas Function API for machine learning pipelines?

Keras
pandas
PyTorch
Spark ML
Scikit-learn

Answer(s): D

Explanation:

Spark ML (Machine Learning Library) is designed specifically for handling large-scale data processing and machine learning tasks directly within Apache Spark. It provides tools and APIs for large-scale feature engineering without the need to rely on user-defined functions (UDFs) or pandas Function API, allowing for more scalable and efficient data transformations directly distributed across a Spark cluster. Unlike Keras, pandas, PyTorch, and scikit-learn, Spark ML operates natively in a distributed environment suitable for big data scenarios.
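As a rough illustration of the point above, the sketch below chains built-in Spark ML feature transformers into a Pipeline, so each step runs as a distributed DataFrame operation with no UDF involved. The column names (category, age, income) and both function names are hypothetical, chosen only for this example. The PySpark imports sit inside the function so that the file still loads where PySpark is not installed.

```python
# Hypothetical column names, chosen only for this illustration.
CATEGORICAL_COL = "category"
NUMERIC_COLS = ["age", "income"]


def assembler_input_cols():
    """Columns merged into the feature vector: numerics plus the encoded category."""
    return NUMERIC_COLS + [CATEGORICAL_COL + "_vec"]


def build_feature_pipeline():
    """Return an unfitted Spark ML Pipeline made only of built-in feature stages.

    Requires an active SparkSession, because the stages wrap JVM objects.
    """
    # Imported lazily so this module loads even where PySpark is absent.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import (
        OneHotEncoder,
        StandardScaler,
        StringIndexer,
        VectorAssembler,
    )

    # Map category strings to numeric indices; unseen values get an extra index.
    indexer = StringIndexer(
        inputCol=CATEGORICAL_COL,
        outputCol=CATEGORICAL_COL + "_idx",
        handleInvalid="keep",
    )
    # Turn each index into a sparse one-hot vector.
    encoder = OneHotEncoder(
        inputCols=[CATEGORICAL_COL + "_idx"],
        outputCols=[CATEGORICAL_COL + "_vec"],
    )
    # Concatenate numeric columns and the one-hot vector into one vector column.
    assembler = VectorAssembler(
        inputCols=assembler_input_cols(),
        outputCol="raw_features",
    )
    # Scale the assembled vector to unit standard deviation.
    scaler = StandardScaler(inputCol="raw_features", outputCol="features")
    return Pipeline(stages=[indexer, encoder, assembler, scaler])
```

With an active SparkSession and a DataFrame df holding those columns, build_feature_pipeline().fit(df).transform(df) would add a features vector column.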

Reference:

Spark MLlib documentation (Feature Engineering with Spark ML).


QUESTION: 26

A data scientist has developed a linear regression model using Spark ML and computed the predictions in a Spark DataFrame preds_df with the following schema:
prediction DOUBLE
actual DOUBLE
Which of the following code blocks can be used to compute the root mean-squared-error of the model according to the data in preds_df and assign it to the rmse variable?

A)

B)

C)

D)

E)

Option A
Option B
Option C
Option D
Option E

Answer(s): C

Explanation:

In Spark ML, the root mean-squared error (RMSE) of a regression model is computed with the RegressionEvaluator class, with metricName set to "rmse". Because preds_df has the columns prediction and actual, the evaluator must be configured with predictionCol="prediction" and labelCol="actual". Calling the evaluator's evaluate method on preds_df then returns the RMSE, which is assigned to rmse. Option C is the code block that uses RegressionEvaluator in this way.
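Since the option code blocks are not reproduced here, the following is only a sketch of the approach the explanation describes, not the text of Option C. The function names are hypothetical. A plain-Python helper is included as a cross-check of what the "rmse" metric measures, and the PySpark import sits inside the Spark function so the helper works without a Spark install.

```python
import math


def rmse_reference(pairs):
    """Plain-Python RMSE over (prediction, actual) pairs, as a cross-check."""
    squared_errors = [(pred - actual) ** 2 for pred, actual in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def rmse_with_spark(preds_df):
    """Compute RMSE on a Spark DataFrame with `prediction` and `actual` columns."""
    # Imported lazily so rmse_reference is usable where PySpark is absent.
    from pyspark.ml.evaluation import RegressionEvaluator

    evaluator = RegressionEvaluator(
        predictionCol="prediction",  # column holding the model output
        labelCol="actual",           # column holding the ground truth
        metricName="rmse",
    )
    return evaluator.evaluate(preds_df)
```

With the DataFrame from the question, the assignment would read rmse = rmse_with_spark(preds_df).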

Reference:

Spark ML documentation (Using RegressionEvaluator to Compute RMSE).


QUESTION: 27

A machine learning engineer wants to parallelize the training of group-specific models using the Pandas Function API. They have developed the train_model function, and they want to apply it to each group of DataFrame df.

They have written the following incomplete code block:

Which of the following pieces of code can be used to fill in the above blank to complete the task?

applyInPandas
mapInPandas
predict
train_model
groupedApplyIn

Answer(s): A

Explanation:

applyInPandas, called on the result of groupby, passes each group to the supplied function as a separate pandas DataFrame and combines the returned DataFrames into a new Spark DataFrame. The code block uses groupby and the task is to apply train_model to each group of df, so applyInPandas fills the blank.
mapInPandas, by contrast, is a DataFrame method that applies a function to an iterator of pandas DataFrame batches without regard to grouping, so it does not guarantee that train_model receives all rows of one group together.
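The original code block is not reproduced above, so the following is a minimal sketch of the groupby plus applyInPandas pattern rather than the exam's exact code. The column names (group, label), the result schema, and the body of train_model are assumptions made for illustration; a real train_model would fit an actual model per group.

```python
# Hypothetical output schema for the per-group results (DDL string form).
RESULT_SCHEMA = "group string, n_rows long, mean_label double"


def train_model(pdf):
    """Stand-in trainer: receives ALL rows of one group as a pandas DataFrame.

    Here the "model" is just the mean of the label column, to keep the
    sketch short; it returns one summary row per group.
    """
    # Imported lazily so this module loads even where pandas is absent.
    import pandas as pd

    return pd.DataFrame(
        {
            "group": [pdf["group"].iloc[0]],
            "n_rows": [len(pdf)],
            "mean_label": [pdf["label"].mean()],
        }
    )


def train_per_group(df):
    """Run train_model once per group, in parallel across the cluster."""
    # groupBy(...) yields GroupedData; applyInPandas calls train_model per group
    # and stitches the returned pandas DataFrames into one Spark DataFrame.
    return df.groupBy("group").applyInPandas(train_model, schema=RESULT_SCHEMA)
```

The schema argument is required because Spark cannot infer the shape of the pandas DataFrame that the function returns.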
Reference:

PySpark documentation on applying functions to grouped data:
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.GroupedData.applyInPandas.html


QUESTION: 28

Which of the following statements describes a Spark ML estimator?

An estimator is a hyperparameter grid that can be used to train a model
An estimator chains multiple algorithms together to specify an ML workflow
An estimator is a trained ML model which turns a DataFrame with features into a DataFrame with predictions
An estimator is an algorithm which can be fit on a DataFrame to produce a Transformer
An estimator is an evaluation tool to assess the quality of a model

Answer(s): D

Explanation:

In the context of Spark MLlib, an estimator refers to an algorithm which can be "fit" on a DataFrame to produce a model (referred to as a Transformer), which can then be used to transform one DataFrame into another, typically adding predictions or model scores. This is a fundamental concept in machine learning pipelines in Spark, where the workflow includes fitting estimators to data to produce transformers.
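The fit-then-transform relationship described above can be sketched with LinearRegression as the estimator. The function name and the column names (features, label) are assumptions for illustration, and the inputs are expected to be Spark DataFrames with an active SparkSession. The PySpark imports sit inside the function so the file loads without a Spark install.

```python
def fit_and_predict(train_df, test_df):
    """Fit an Estimator on train_df, then use the resulting Transformer on test_df."""
    # Imported lazily so this module loads even where PySpark is absent.
    from pyspark.ml import Estimator, Transformer
    from pyspark.ml.regression import LinearRegression

    # LinearRegression is an Estimator: an algorithm that has not seen data yet.
    lr = LinearRegression(featuresCol="features", labelCol="label")
    assert isinstance(lr, Estimator)

    # fit() learns from the DataFrame and returns a model, which is a Transformer.
    model = lr.fit(train_df)
    assert isinstance(model, Transformer)

    # transform() maps one DataFrame to another, adding a `prediction` column.
    return model.transform(test_df)
```

The same contract applies to feature stages such as StringIndexer, whose fit method also returns a model that transforms DataFrames.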
Reference:
Spark MLlib Documentation: https://spark.apache.org/docs/latest/ml-pipeline.html#estimators


Free Databricks-Machine-Learning-Associate Exam Braindumps (page: 7)
