Free Databricks Databricks-Machine-Learning-Associate Practice Exam with Questions & Answers

Name: How to Pass Databricks-Machine-Learning-Associate Exams
Brand: Examstrack
SKU: databricks-machine-learning-associate
Price: 36.75 USD
Availability: InStock

Questions 1

Which of the following tools can be used to distribute large-scale feature engineering without the use of a UDF or pandas Function API for machine learning pipelines?

Options:

Keras

pandas

PvTorch

Spark ML

Scikit-learn

Databricks Databricks-Machine-Learning-Associate Premium Access

Gavin

06-Sep-2025

The testing engine on examstrack.com made my Databricks-Machine-Learning-Associate journey smoother.

Questions 2

An organization is developing a feature repository and is electing to one-hot encode all categorical feature variables. A data scientist suggests that the categorical feature variables should not be one-hot encoded within the feature repository.

Which of the following explanations justifies this suggestion?

Options:

One-hot encoding is not supported by most machine learning libraries.

One-hot encoding is dependent on the target variable's values which differ for each application.

One-hot encoding is computationally intensive and should only be performed on small samples of training sets for individual machine learning problems.

One-hot encoding is not a common strategy for representing categorical feature variables numerically.

One-hot encoding is a potentially problematic categorical variable strategy for some machine learning algorithms.

Questions 3

A data scientist wants to parallelize the training of trees in a gradient boosted tree to speed up the training process. A colleague suggests that parallelizing a boosted tree algorithm can be difficult.

Which of the following describes why?

Options:

Gradient boosting is not a linear algebra-based algorithm which is required for parallelization

Gradient boosting requires access to all data at once which cannot happen during parallelization.

Gradient boosting calculates gradients in evaluation metrics using all cores which prevents parallelization.

Gradient boosting is an iterative algorithm that requires information from the previous iteration to perform the next step.

Answer:

Explanation:

Gradient boosting is fundamentally an iterative algorithm where each new tree is built based on the errors of the previous ones. This sequential dependency makes it difficult to parallelize the training of trees in gradient boosting, as each step relies on the results from the preceding step. Parallelization in this context would undermine the core methodology of the algorithm, which depends on sequentially improving the model'sperformance with each iteration.References:

Machine Learning Algorithms (Challenges with Parallelizing Gradient Boosting).

Gradient boosting is an ensemble learning technique that builds models in a sequential manner. Each new model corrects the errors made by the previous ones. This sequential dependency means that each iteration requires the results of the previous iteration to make corrections. Here is a step-by-step explanation of why this makes parallelization challenging:

Sequential Nature: Gradient boosting builds one tree at a time. Each tree is trained to correct the residual errors of the previous trees. This requires the model to complete one iteration before starting the next.
Dependence on Previous Iterations: The gradient calculation at each step depends on the predictions made by the previous models. Therefore, the model must wait until the previous tree has been fully trained and evaluated before starting to train the next tree.
Difficulty in Parallelization: Because of this dependency, it is challenging to parallelize the training process. Unlike algorithms that process data independently in each step (e.g., random forests), gradient boosting cannot easily distribute the work across multiple processors or cores for simultaneous execution.

This iterative and dependent nature of the gradient boosting process makes it difficult to parallelize effectively.

References

Gradient Boosting Machine Learning Algorithm
Understanding Gradient Boosting Machines

Questions 4

Which of the Spark operations can be used to randomly split a Spark DataFrame into a training DataFrame and a test DataFrame for downstream use?

Options:

TrainValidationSplit

DataFrame.where

CrossValidator

TrainValidationSplitModel

DataFrame.randomSplit

Questions 5

A machine learning engineer has identified the best run from an MLflow Experiment. They have stored the run ID in the run_id variable and identified the logged model name as "model". They now want to register that model in the MLflow Model Registry with the name "best_model".

Which lines of code can they use to register the model associated with run_id to the MLflow Model Registry?

Options:

mlflow.register_model(run_id, "best_model")

mlflow.register_model(f"runs:/{run_id}/model”, "best_model”)

millow.register_model(f"runs:/{run_id)/model")

mlflow.register_model(f"runs:/{run_id}/best_model", "model")

Questions 6

A machine learning engineering team has a Job with three successive tasks. Each task runs a single notebook. The team has been alerted that the Job has failed in its latest run.

Which of the following approaches can the team use to identify which task is the cause of the failure?

Options:

Run each notebook interactively

Review the matrix view in the Job's runs

Migrate the Job to a Delta Live Tables pipeline

Change each Task’s setting to use a dedicated cluster

Questions 7

A data scientist is developing a machine learning pipeline using AutoML on Databricks Machine Learning.

Which of the following steps will the data scientist need to perform outside of their AutoML experiment?

Options:

Model tuning

Model evaluation

Model deployment

Exploratory data analysis

Questions 8

Which of the following tools can be used to parallelize the hyperparameter tuning process for single-node machine learning models using a Spark cluster?

Options:

MLflow Experiment Tracking

Spark ML

Autoscaling clusters

Delta Lake

Answer:

Explanation:

Spark ML (part of Apache Spark's MLlib) is designed to handle machine learning tasks across multiple nodes in a cluster, effectively parallelizing tasks like hyperparameter tuning. It supports various machine learning algorithms that can be optimized over a Spark cluster, making it suitable for parallelizing hyperparameter tuning for single-node machine learning models when they are adapted to run on Spark.

References

Apache Spark MLlib Guide:https://spark.apache.org/docs/latest/ml-guide.html

Spark ML is a library within Apache Spark designed for scalable machine learning. It provides tools to handle large-scale machine learning tasks, including parallelizing the hyperparameter tuning process for single-node machine learning models using a Spark cluster. Here’s a detailed explanation of how Spark ML can be used:

Hyperparameter Tuning with CrossValidator: Spark ML includes theCrossValidatorandTrainValidationSplitclasses, which are used for hyperparameter tuning. These classes can evaluate multiple sets of hyperparameters in parallel using a Spark cluster.

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Define the model

model = ...

# Create a parameter grid

paramGrid = ParamGridBuilder() \

addGrid(model.hyperparam1, [value1, value2]) \

addGrid(model.hyperparam2, [value3, value4]) \

build()

# Define the evaluator

evaluator = BinaryClassificationEvaluator()

# Define the CrossValidator

crossval = CrossValidator(estimator=model,

estimatorParamMaps=paramGrid,

evaluator=evaluator,

numFolds=3)

Parallel Execution: Spark distributes the tasks of training models with different hyperparameters across the cluster’s nodes. Each node processes a subset of the parameter grid, which allows multiple models to be trained simultaneously.
Scalability: Spark ML leverages the distributed computing capabilities of Spark. This allows for efficient processing of large datasets and training of models across many nodes, which speeds up the hyperparameter tuning process significantly compared to single-node computations.

References

Apache Spark MLlib Documentation
Hyperparameter Tuning in Spark ML

Questions 9

A data scientist has defined a Pandas UDF function predict to parallelize the inference process for a single-node model:

Databricks-Machine-Learning-Associate Question 9

They have written the following incomplete code block to use predict to score each record of Spark DataFramespark_df:

Databricks-Machine-Learning-Associate Question 9

Which of the following lines of code can be used to complete the code block to successfully complete the task?

Options:

predict(*spark_df.columns)

mapInPandas(predict)

predict(Iterator(spark_df))

mapInPandas(predict(spark_df.columns))

predict(spark_df.columns)

Questions 10

A data scientist has developed a linear regression model using Spark ML and computed the predictions in a Spark DataFrame preds_df with the following schema:

prediction DOUBLE

actual DOUBLE

Which of the following code blocks can be used to compute the root mean-squared-error of the model according to the data in preds_df and assign it to the rmse variable?

Databricks-Machine-Learning-Associate Question 10