Free Databricks Databricks-Certified-Professional-Data-Scientist Practice Exam with Questions & Answers | Set: 3

Questions 21

Which of the following are true with regard to the K-Means clustering algorithm?

Options:
A.

Labels are not pre-assigned to each object in the cluster.

B.

Labels are pre-assigned to each object in the cluster.

C.

It classifies the data based on the labels.

D.

It discovers the center of each cluster.

E.

It finds which cluster each object falls into.
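As a rough illustration of what K-Means does with unlabeled data, here is a minimal sketch of the standard assign-then-update iteration. The function name `kmeans`, the toy points, and the starting centers are all made up for this example; it is not taken from the exam material.

```python
import math

def kmeans(points, centers, iterations=10):
    """Plain K-Means iteration on unlabeled points (illustrative sketch)."""
    centers = list(centers)
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        assignment = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old center if a cluster is empty
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment

# Made-up toy data: only coordinates are supplied, no labels.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, assignment = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)
print(assignment)
```

The input carries no labels; the algorithm returns both the cluster centers and the cluster index assigned to each point.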

Questions 22

You are building a text classification model to determine whether a book written by HadoopExam Learning Resources is about Hadoop or cloud computing. To cut down the size of the feature space, you use the mutual information of each word with the label (hadoop or cloud) to select the 1000 best features as input to a Naive Bayes model. When you compare the performance of a model built with the 250 best features to a model built with the 1000 best features, you notice that the model with only 250 features performs slightly better on your test data.

What would help you choose better features for your model?

Options:
A.

Include low mutual information with other selected features as a feature selection criterion

B.

Include the number of times each of the words appears in the book in your model

C.

Decrease the size of your training data

D.

Evaluate a model that only includes the top 100 words
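The selection step described in this question can be sketched as follows: score each word by its mutual information with the class label, then keep the top-scoring words. The corpus, the helper name `mutual_information`, and the choice of k = 2 are invented for illustration; a real feature space would be far larger.

```python
import math
from collections import Counter

def mutual_information(docs, labels, word):
    """Mutual information (in bits) between presence of `word` and the label."""
    n = len(docs)
    joint = Counter((word in doc, label) for doc, label in zip(docs, labels))
    word_counts = Counter(word in doc for doc in docs)
    label_counts = Counter(labels)
    mi = 0.0
    for (present, label), count in joint.items():
        p_joint = count / n
        p_word = word_counts[present] / n
        p_label = label_counts[label] / n
        mi += p_joint * math.log2(p_joint / (p_word * p_label))
    return mi

# Made-up toy corpus: each document is represented as a set of words.
docs = [
    {"hdfs", "mapreduce", "cluster"},
    {"hdfs", "yarn", "data"},
    {"aws", "cluster", "data"},
    {"azure", "aws", "storage"},
]
labels = ["hadoop", "hadoop", "cloud", "cloud"]

vocabulary = sorted(set().union(*docs))
scores = {w: mutual_information(docs, labels, w) for w in vocabulary}
top_k = sorted(vocabulary, key=lambda w: scores[w], reverse=True)[:2]
print(top_k)
```

Note that this criterion scores each word against the label independently, so two highly ranked words may carry nearly the same information as each other.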

Questions 23

Refer to the exhibit.

[Exhibit: regression results for Question 23 (image not reproduced)]

You are asked to write a report on how specific variables impact your client's sales using a data set provided to you by the client. The data includes 15 variables that the client views as directly related to sales, and you are restricted to these variables only. After a preliminary analysis of the data, the following findings were made:

1. Multicollinearity is not an issue among the variables.

2. Only three variables (A, B, and C) have significant correlation with sales.

You build a linear regression model on the dependent variable of sales with the independent variables A, B, and C. The results of the regression are seen in the exhibit. You cannot request additional data. What is a way that you could try to increase the R2 of the model without artificially inflating it?

Options:
A.

Create clusters based on the data and use them as model inputs

B.

Force all 15 variables into the model as independent variables

C.

Create interaction variables based only on variables A, B, and C

D.

Break variables A, B, and C into their own univariate models
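For readers unfamiliar with the term, an interaction variable is simply the product of two existing predictors, added as an extra column. A minimal sketch, using a hypothetical helper `add_interactions` and made-up values for A, B, and C:

```python
from itertools import combinations

def add_interactions(rows, names):
    """Append the pairwise product of every two named columns to each row."""
    pairs = list(combinations(names, 2))
    new_names = list(names) + [f"{a}*{b}" for a, b in pairs]
    new_rows = [
        {**row, **{f"{a}*{b}": row[a] * row[b] for a, b in pairs}}
        for row in rows
    ]
    return new_rows, new_names

# Made-up observations of the three predictors.
rows = [{"A": 2.0, "B": 3.0, "C": 0.5}, {"A": 1.0, "B": 4.0, "C": 2.0}]
expanded, columns = add_interactions(rows, ["A", "B", "C"])
print(columns)      # ['A', 'B', 'C', 'A*B', 'A*C', 'B*C']
print(expanded[0])
```

The expanded rows could then be passed to any linear regression routine in place of the original three columns.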

Questions 24

In which phase of the analytic lifecycle would you expect to spend most of the project time?

Options:
A.

Discovery

B.

Data preparation

C.

Communicate Results

D.

Operationalize

Questions 25

Let's say you have the following two cases for movie ratings:

1. You recommend to a user a movie with four stars and he really doesn't like it (he'd rate it two stars).

2. You recommend a movie with three stars but the user loves it (he'd rate it five stars).

Which statement correctly applies?

Options:
A.

In both cases, the contribution to the RMSE is the same

B.

In both cases, the contribution to the RMSE is different

C.

In both cases, the contribution to the RMSE could vary

D.

None of the above
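The arithmetic behind the two cases can be checked directly. RMSE squares each error before averaging, so the sign of the error (over- or under-prediction) disappears. A short sketch using only the ratings given in the question:

```python
import math

# (predicted rating, actual rating) for the two cases in the question
cases = [(4, 2), (3, 5)]

squared_errors = [(predicted - actual) ** 2 for predicted, actual in cases]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))

print(squared_errors)  # [4, 4]
print(rmse)            # 2.0
```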

Questions 26

Suppose you have been given a relatively high-dimensional set of independent variables and are asked to come up with a model that predicts one of two possible outcomes, such as "YES" or "NO". Which of the following techniques fits best?

Options:
A.

Support vector machines

B.

Naive Bayes

C.

Logistic regression

D.

Random decision forests

E.

All of the above

Questions 27

Which of the following metrics are useful in measuring the accuracy and quality of a recommender system?

Options:
A.

Cluster Density

B.

Support Vector Count

C.

Mean Absolute Error

D.

Sum of Absolute Errors
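As background for this question, mean absolute error averages the absolute difference between predicted and actual ratings, so its value does not grow just because more ratings are evaluated. A minimal sketch with made-up ratings (the function name and data are illustrative only):

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and actual ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

# Made-up star ratings for four recommendations.
predicted = [4, 3, 5, 2]
actual = [2, 5, 5, 3]

mae = mean_absolute_error(predicted, actual)
print(mae)  # 1.25
```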

Questions 28

Which of the following describes a true property of the logistic regression method?

Options:
A.

It handles missing values well.

B.

It works well with discrete variables that have many distinct values.

C.

It is robust with redundant variables and correlated variables.

D.

It works well with variables that affect the outcome in a discontinuous way.

Questions 29

A researcher is interested in how variables such as GRE (Graduate Record Exam) scores, GPA (grade point average), and prestige of the undergraduate institution affect admission into graduate school. The response variable, admit/don't admit, is a binary variable.

The above is an example of:

Options:
A.

Linear Regression

B.

Logistic Regression

C.

Recommendation system

D.

Maximum likelihood estimation

E.

Hierarchical linear models
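To make the scenario concrete, the sketch below shows how a model with a binary response can map GRE, GPA, and prestige to a probability between 0 and 1 through the logistic (sigmoid) function. The coefficients are hypothetical values chosen for illustration, not fitted to any data, and the encoding of prestige as a numeric rank is an assumption.

```python
import math

def admit_probability(gre, gpa, prestige_rank, coef):
    """Logistic function applied to a linear combination of the predictors."""
    z = (
        coef["intercept"]
        + coef["gre"] * gre
        + coef["gpa"] * gpa
        + coef["prestige_rank"] * prestige_rank
    )
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, NOT estimated from real admissions data.
coef = {"intercept": -4.0, "gre": 0.002, "gpa": 0.8, "prestige_rank": -0.5}

p = admit_probability(gre=600, gpa=3.5, prestige_rank=2, coef=coef)
print(round(p, 3))
```

Whatever the inputs, the output stays strictly between 0 and 1, which is what makes it usable as an admit/don't admit probability.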

Questions 30

In which of the following scenarios can we use the Naive Bayes theorem for classification?

Options:
A.

Classify whether a given person is a male or a female based on the measured features. The features include height, weight and foot size.

B.

To classify whether an email is spam or not spam

C.

To identify whether a fruit is an orange or not based on features like diameter, color and shape
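As one illustration of Naive Bayes classification, the sketch below applies it to the spam scenario using word counts with add-one (Laplace) smoothing. The messages, labels, and function names are invented for this example; the other scenarios with numeric features would typically use a Gaussian likelihood instead.

```python
import math
from collections import Counter

def train(messages, labels):
    """Count words per class and messages per class."""
    word_counts = {label: Counter() for label in set(labels)}
    class_counts = Counter(labels)
    for words, label in zip(messages, labels):
        word_counts[label].update(words)
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def predict(words, model):
    """Return the class with the highest log posterior (up to a constant)."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        # Log prior plus log likelihoods with add-one (Laplace) smoothing.
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training messages.
messages = [
    ["win", "cash", "now"],
    ["cheap", "cash", "offer"],
    ["meeting", "at", "noon"],
    ["lunch", "at", "noon"],
]
labels = ["spam", "spam", "ham", "ham"]

model = train(messages, labels)
print(predict(["cash", "offer"], model))
print(predict(["lunch", "noon"], model))
```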