Free Databricks Databricks-Machine-Learning-Associate Practice Exam with Questions & Answers | Set: 2

Questions 11

A data scientist is using MLflow to track their machine learning experiment. As a part of each of their MLflow runs, they are performing hyperparameter tuning. The data scientist would like to have one parent run for the tuning process with a child run for each unique combination of hyperparameter values. All parent and child runs are being manually started with mlflow.start_run.

Which of the following approaches can the data scientist use to accomplish this MLflow run organization?

Options:
A.

They can turn on Databricks Autologging

B.

They can specify nested=True when starting the child run for each unique combination of hyperparameter values

C.

They can start each child run inside the parent run's indented code block using mlflow.start_run()

D.

They can start each child run with the same experiment ID as the parent run

E.

They can specify nested=True when starting the parent run for the tuning process

Questions 12

A data scientist wants to explore summary statistics for Spark DataFrame spark_df. The data scientist wants to see the count, mean, standard deviation, minimum, maximum, and interquartile range (IQR) for each numerical feature.

Which of the following lines of code can the data scientist run to accomplish the task?

Options:
A.

spark_df.summary()

B.

spark_df.stats()

C.

spark_df.describe().head()

D.

spark_df.printSchema()

E.

spark_df.toPandas()
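
For reference, the statistics named in the question can be sketched locally in plain Python. This is only an illustration on a hypothetical list `values`, using the standard `statistics` module; it is not the Spark DataFrame API, and a distributed engine may compute percentiles approximately rather than exactly.

```python
import statistics


def summarize(values):
    """Return count, mean, sample stddev, min, quartiles, max, and IQR for a list of numbers."""
    # The three cut points that split the sorted data into four equal parts.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(values),
        "iqr": q3 - q1,  # interquartile range = Q3 - Q1
    }


values = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical numerical feature
stats = summarize(values)
```

The IQR is not a separate statistic to look up: it follows directly from the 25% and 75% quartiles.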

Questions 13

A data scientist has developed a machine learning pipeline with a static input data set using Spark ML, but the pipeline is taking too long to process. They increase the number of workers in the cluster to get the pipeline to run more efficiently. They notice that the number of rows in the training set after reconfiguring the cluster is different from the number of rows in the training set prior to reconfiguring the cluster.

Which of the following approaches will guarantee a reproducible training and test set for each model?

Options:
A.

Manually configure the cluster

B.

Write out the split data sets to persistent storage

C.

Set a seed in the data splitting operation

D.

Manually partition the input data
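
The scenario can be illustrated with a toy model in plain Python. The sketch assumes, purely for illustration, that each partition draws from its own random generator derived from the seed; `partitioned_split` is a hypothetical helper and not Spark's actual implementation. It shows why a split that depends on partitioning can change when the cluster changes, and why a split that has been written out and reloaded cannot.

```python
import json
import os
import random
import tempfile


def partitioned_split(rows, num_partitions, seed, train_fraction=0.8):
    """Toy split: every partition uses its own RNG seeded with (seed + partition index)."""
    train, test = [], []
    for p in range(num_partitions):
        rng = random.Random(seed + p)
        # Round-robin assignment of rows to partitions (an assumption of this toy model).
        for row in rows[p::num_partitions]:
            (train if rng.random() < train_fraction else test).append(row)
    return sorted(train), sorted(test)


rows = list(range(1000))  # hypothetical static input data set

# Same seed, different number of partitions: the resulting training sets differ.
train_2, test_2 = partitioned_split(rows, num_partitions=2, seed=42)
train_8, test_8 = partitioned_split(rows, num_partitions=8, seed=42)

# Writing the split out once and reloading it yields the same rows every time.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.json")
    with open(path, "w") as f:
        json.dump(train_2, f)
    with open(path) as f:
        reloaded_train = json.load(f)
```

In this toy model the seed alone makes the split repeatable only while the partitioning stays fixed; the persisted copy is independent of both.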

Questions 14

A data scientist has created a linear regression model that uses log(price) as a label variable. Using this model, they have performed inference and the predictions and actual label values are in Spark DataFrame preds_df.

They are using the following code block to evaluate the model:

regression_evaluator.setMetricName("rmse").evaluate(preds_df)

Which of the following changes should the data scientist make to evaluate the RMSE in a way that is comparable with price?

Options:
A.

They should exponentiate the computed RMSE value

B.

They should take the log of the predictions before computing the RMSE

C.

They should evaluate the MSE of the log predictions to compute the RMSE

D.

They should exponentiate the predictions before computing the RMSE
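
The effect of the label's scale on RMSE can be checked with a small plain-Python sketch. The prices below are hypothetical, and `rmse` is a helper defined here for illustration rather than the Spark evaluator used in the question.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical labels and predictions, both on the log(price) scale.
log_actual = [math.log(100.0), math.log(200.0), math.log(400.0)]
log_pred = [math.log(110.0), math.log(180.0), math.log(400.0)]

# RMSE computed directly on these values is in log(price) units.
rmse_log_scale = rmse(log_actual, log_pred)

# Transforming both columns back with exp() first gives an RMSE in price units.
price_actual = [math.exp(v) for v in log_actual]
price_pred = [math.exp(v) for v in log_pred]
rmse_price_scale = rmse(price_actual, price_pred)
```

Here the price-scale errors are 10, 20, and 0, so `rmse_price_scale` is about 12.9, while `rmse_log_scale` stays below 0.1. Applying `math.exp` to `rmse_log_scale` afterwards does not recover the price-scale value.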

Questions 15

A machine learning engineer is trying to scale a machine learning pipeline by distributing its feature engineering process.

Which of the following feature engineering tasks will be the least efficient to distribute?

Options:
A.

One-hot encoding categorical features

B.

Target encoding categorical features

C.

Imputing missing feature values with the mean

D.

Imputing missing feature values with the true median

E.

Creating binary indicator features for missing values
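
The difference between the mean and the exact median under distribution can be sketched in plain Python. The three lists below stand in for hypothetical data partitions held by separate workers; nothing here uses Spark itself.

```python
import statistics

# Hypothetical feature values spread across three partitions.
partitions = [[1.0, 9.0], [2.0, 8.0, 3.0], [100.0]]

# Mean: each partition only needs to report (sum, count), and the results combine exactly.
partials = [(sum(p), len(p)) for p in partitions]
total, count = map(sum, zip(*partials))
mean = total / count

# Median: combining per-partition medians does not give the exact answer...
median_of_medians = statistics.median(statistics.median(p) for p in partitions)

# ...so the exact median needs a view of all values in global order.
all_values = [v for p in partitions for v in p]
true_median = statistics.median(all_values)
```

With these values the mean is 20.5 from partial aggregates alone, while the median of partition medians (5.0) disagrees with the exact median (5.5).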

Questions 16

A data scientist is using the following code block to tune hyperparameters for a machine learning model:

Databricks-Machine-Learning-Associate Question 16

Which change can they make to the above code block to improve the likelihood of a more accurate model?

Options:
A.

Increase num_evals to 100

B.

Change fmin() to fmax()

C.

Change SparkTrials() to Trials()

D.

Change tpe.suggest to random.suggest

Questions 17

A data scientist is utilizing MLflow Autologging to automatically track their machine learning experiments. After completing a series of runs for the experiment experiment_id, the data scientist wants to identify the run_id of the run with the best root-mean-square error (RMSE).

Which of the following lines of code can be used to identify the run_id of the run with the best RMSE in experiment_id?

A)

Databricks-Machine-Learning-Associate Question 17

B)

Databricks-Machine-Learning-Associate Question 17

C)

Databricks-Machine-Learning-Associate Question 17

D)

Databricks-Machine-Learning-Associate Question 17

Options:
A.

Option A

B.

Option B

C.

Option C

D.

Option D

Questions 18

A machine learning engineer wants to parallelize the training of group-specific models using the Pandas Function API. They have developed the train_model function, and they want to apply it to each group of DataFrame df.

They have written the following incomplete code block:

Databricks-Machine-Learning-Associate Question 18

Which of the following pieces of code can be used to fill in the above blank to complete the task?

Options:
A.

applyInPandas

B.

mapInPandas

C.

predict

D.

train_model

E.

groupedApplyIn

Questions 19

A data scientist has replaced missing values in their feature set with each respective feature variable’s median value. A colleague suggests that the data scientist is throwing away valuable information by doing this.

Which of the following approaches can they take to include as much information as possible in the feature set?

Options:
A.

Impute the missing values using each respective feature variable's mean value instead of the median value

B.

Refrain from imputing the missing values in favor of letting the machine learning algorithm determine how to handle them

C.

Remove all feature variables that originally contained missing values from the feature set

D.

Create a binary feature variable for each feature that contained missing values indicating whether each row's value has been imputed

E.

Create a constant feature variable for each feature that contained missing values indicating the percentage of rows from the feature that was originally missing
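
Median imputation combined with a per-row missingness flag can be sketched in plain Python. The column `age` and the helper `impute_with_indicator` are hypothetical names used only for this illustration.

```python
import statistics


def impute_with_indicator(column):
    """Fill None entries with the median of the observed values and flag which rows were filled."""
    observed = [v for v in column if v is not None]
    median = statistics.median(observed)
    imputed = [median if v is None else v for v in column]
    was_missing = [1 if v is None else 0 for v in column]  # 1 = value was imputed
    return imputed, was_missing


age = [30.0, None, 50.0, 40.0, None]  # hypothetical feature with missing values
age_imputed, age_was_missing = impute_with_indicator(age)
```

After imputation alone, the filled rows are indistinguishable from rows that truly held the median; the extra column keeps the fact that a value was missing available to the model.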

Questions 20

A data scientist has been given an incomplete notebook from the data engineering team. The notebook uses a Spark DataFrame spark_df on which the data scientist needs to perform further feature engineering. Unfortunately, the data scientist has not yet learned the PySpark DataFrame API.

Which of the following blocks of code can the data scientist run to be able to use the pandas API on Spark?

Options:
A.

import pyspark.pandas as ps

df = ps.DataFrame(spark_df)

B.

import pyspark.pandas as ps

df = ps.to_pandas(spark_df)

C.

spark_df.to_pandas()

D.

import pandas as pd

df = pd.DataFrame(spark_df)

Certification Provider: Databricks
Exam Name: Databricks Certified Machine Learning Associate Exam
Last Update: Jul 12, 2025
Questions: 74