Free Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Practice Exam with Questions & Answers | Set: 3

Questions 21

42 of 55.

A developer needs to write the output of a complex chain of Spark transformations to a Parquet table called events.liveLatest.

Consumers of this table query it frequently with filters on both year and month of the event_ts column (a timestamp).

The current code:

from pyspark.sql import functions as F

final = df.withColumn("event_year", F.year("event_ts")) \
    .withColumn("event_month", F.month("event_ts"))

final.write \
    .bucketBy(42, ["event_year", "event_month"]) \
    .saveAsTable("events.liveLatest")

However, consumers report poor query performance.

Which change will enable efficient querying by year and month?

Options:
A.

Replace .bucketBy() with .partitionBy("event_year", "event_month")

B.

Change the bucket count (42) to a lower number

C.

Add .sortBy() after .bucketBy()

D.

Replace .bucketBy() with .partitionBy("event_year") only
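Note on the partitioning options: a writer that partitions by columns stores the data in one directory per distinct value combination (for example event_year=2024/event_month=5), so a filter on those columns only needs to open the matching directories. The sketch below imitates that layout in plain Python, with no Spark dependency. The sample rows and the helper names partition_rows and read_with_pruning are hypothetical, and this is an illustration of the idea rather than how Spark implements it.

```python
from collections import defaultdict

# Hypothetical sample rows: (event_year, event_month, payload)
rows = [
    (2023, 12, "a"),
    (2024, 1, "b"),
    (2024, 5, "c"),
    (2024, 5, "d"),
]

def partition_rows(rows):
    """Group rows under a Hive-style partition path built from year and month."""
    layout = defaultdict(list)
    for year, month, payload in rows:
        layout[f"event_year={year}/event_month={month}"].append(payload)
    return dict(layout)

def read_with_pruning(layout, year, month):
    """Return the rows of the matching partition and how many partitions were opened."""
    wanted = f"event_year={year}/event_month={month}"
    opened = [path for path in layout if path == wanted]
    data = [payload for path in opened for payload in layout[path]]
    return data, len(opened)

layout = partition_rows(rows)
data, partitions_opened = read_with_pruning(layout, 2024, 5)
print(sorted(layout))           # the three partition directories
print(data, partitions_opened)  # rows for May 2024, one partition opened
```

With a filter on year alone, every month directory under that year would still have to be read, which is why the choice of partition columns matters for the query pattern described in the question.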

Questions 22

41 of 55.

A data engineer is working on the DataFrame df1 and wants the Name with the highest count to appear first (descending order by count), followed by the next highest, and so on.

The DataFrame has columns:

id | Name    | count | timestamp
---------------------------------
1  | USA     | 10    |
2  | India   | 20    |
3  | England | 50    |
4  | India   | 50    |
5  | France  | 20    |
6  | India   | 10    |
7  | USA     | 30    |
8  | USA     | 40    |

Which code fragment should the engineer use to sort the data in the Name and count columns?

Options:
A.

df1.orderBy(col("count").desc(), col("Name").asc())

B.

df1.sort("Name", "count")

C.

df1.orderBy("Name", "count")

D.

df1.orderBy(col("Name").desc(), col("count").asc())
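Note on multi-column ordering: the requirement is "highest count first", with Name as a natural tie-breaker. The sketch below reproduces that ordering in plain Python on the rows from the question's table, as an analogue of a two-key sort and not as Spark code. The tuple layout (id, Name, count) is an assumption made for the example; the timestamp column is omitted because the question shows no values for it.

```python
# Rows copied from the question's table: (id, Name, count)
rows = [
    (1, "USA", 10), (2, "India", 20), (3, "England", 50), (4, "India", 50),
    (5, "France", 20), (6, "India", 10), (7, "USA", 30), (8, "USA", 40),
]

# Sort by count descending (negated key), then by Name ascending to break ties.
ordered = sorted(rows, key=lambda r: (-r[2], r[1]))

for row in ordered:
    print(row)
```

Sorting by Name first and count second would instead group the rows alphabetically, so the highest count would no longer be guaranteed to appear at the top.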

Questions 23

1 of 55. A data scientist wants to ingest a directory full of plain text files so that each record in the output DataFrame contains the entire contents of a single file and the full path of the file the text was read from.

The first attempt does read the text files, but each record contains a single line. This code is shown below:

from pyspark.sql.functions import input_file_name

txt_path = "/datasets/raw_txt/*"
df = spark.read.text(txt_path)  # one row per line by default
df = df.withColumn("file_path", input_file_name())  # add full path

Which code change would meet the data scientist's requirements?

Options:
A.

Add the option wholetext to the text() function.

B.

Add the option lineSep to the text() function.

C.

Add the option wholetext=False to the text() function.

D.

Add the option lineSep=", " to the text() function.

Questions 24

Which Spark configuration controls the number of tasks that can run in parallel on the executor?

Options:
A.

spark.executor.cores

B.

spark.task.maxFailures

C.

spark.driver.cores

D.

spark.executor.memory
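Note on task parallelism: the number of tasks an executor can run at the same time follows from two settings, spark.executor.cores and spark.task.cpus (which defaults to 1). The sketch below is plain arithmetic on those two values. The function name task_slots and the sample numbers are hypothetical.

```python
def task_slots(executor_cores: int, task_cpus: int = 1) -> int:
    """Concurrent task slots on one executor: cores divided by CPUs reserved per task."""
    if executor_cores < 1 or task_cpus < 1:
        raise ValueError("both values must be positive")
    return executor_cores // task_cpus

print(task_slots(4))       # 4 cores, 1 CPU per task
print(task_slots(8, 2))    # 8 cores, 2 CPUs per task
```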

Questions 25

An engineer notices a significant increase in the job execution time during the execution of a Spark job. After some investigation, the engineer decides to check the logs produced by the Executors.

How should the engineer retrieve the Executor logs to diagnose performance issues in the Spark application?

Options:
A.

Locate the executor logs on the Spark master node, typically under the /tmp directory.

B.

Use the command spark-submit with the --verbose flag to print the logs to the console.

C.

Use the Spark UI to select the stage and view the executor logs directly from the stages tab.

D.

Fetch the logs by running a Spark job with the spark-sql CLI tool.

Questions 26

A Spark developer is building an app to monitor task performance. They need to track the maximum task processing time per worker node and consolidate it on the driver for analysis.

Which technique should be used?

Options:
A.

Use an RDD action like reduce() to compute the maximum time

B.

Use an accumulator to record the maximum time on the driver

C.

Broadcast a variable to share the maximum time among workers

D.

Configure the Spark UI to automatically collect maximum times

Questions 27

4 of 55.

A developer is working on a Spark application that processes a large dataset using SQL queries. Despite having a large cluster, the developer notices that the job is underutilizing the available resources. Executors remain idle for most of the time, and logs reveal that the number of tasks per stage is very low. The developer suspects that this is causing suboptimal cluster performance.

Which action should the developer take to improve cluster utilization?

Options:
A.

Increase the value of spark.sql.shuffle.partitions

B.

Reduce the value of spark.sql.shuffle.partitions

C.

Enable dynamic resource allocation to scale resources as needed

D.

Increase the size of the dataset to create more partitions
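Note on tasks per stage: a stage runs one task per partition, so a stage with fewer tasks than the cluster has task slots leaves the remaining slots idle. The sketch below is a rough back-of-the-envelope calculation in plain Python. The function name busy_fraction and the cluster numbers are hypothetical, and it ignores effects such as data skew or adaptive query execution changing the partition count at runtime.

```python
def busy_fraction(num_tasks: int, total_slots: int) -> float:
    """Upper bound on the share of task slots a single stage can keep busy at once."""
    if total_slots < 1:
        raise ValueError("total_slots must be positive")
    return min(num_tasks, total_slots) / total_slots

# Hypothetical cluster with 400 task slots.
print(busy_fraction(100, 400))   # few partitions: most slots sit idle
print(busy_fraction(800, 400))   # more partitions than slots: all slots can be busy
```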

Questions 28

Given:

spark.sparkContext.setLogLevel("")

Which set contains only valid log level values for the Spark driver?

Options:
A.

ALL, DEBUG, FAIL, INFO

B.

ERROR, WARN, TRACE, OFF

C.

WARN, NONE, ERROR, FATAL

D.

FATAL, NONE, INFO, DEBUG
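Note on log levels: the SparkContext.setLogLevel documentation lists the valid levels as ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE and WARN. The sketch below checks each option set from the question against that list in plain Python. The helper name invalid_levels is hypothetical and no Spark session is involved.

```python
# Valid values documented for SparkContext.setLogLevel.
VALID_LEVELS = {"ALL", "DEBUG", "ERROR", "FATAL", "INFO", "OFF", "TRACE", "WARN"}

# Option sets copied from the question.
option_sets = {
    "A": ["ALL", "DEBUG", "FAIL", "INFO"],
    "B": ["ERROR", "WARN", "TRACE", "OFF"],
    "C": ["WARN", "NONE", "ERROR", "FATAL"],
    "D": ["FATAL", "NONE", "INFO", "DEBUG"],
}

def invalid_levels(levels):
    """Return the entries that are not recognised log levels."""
    return [level for level in levels if level not in VALID_LEVELS]

for name, levels in option_sets.items():
    print(name, invalid_levels(levels))
```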

Questions 29

A data engineer uses a broadcast variable to share a DataFrame containing millions of rows across executors for lookup purposes. What will be the outcome?

Options:
A.

The job may fail if the memory on each executor is not large enough to accommodate the DataFrame being broadcasted

B.

The job may fail if the executors do not have enough CPU cores to process the broadcasted dataset

C.

The job will hang indefinitely as Spark will struggle to distribute and serialize such a large broadcast variable to all executors

D.

The job may fail because the driver does not have enough CPU cores to serialize the large DataFrame

Questions 30

47 of 55.

A data engineer has written the following code to join two DataFrames df1 and df2:

df1 = spark.read.csv("sales_data.csv", header=True)  # header row supplies product_id
df2 = spark.read.csv("product_data.csv", header=True)
df_joined = df1.join(df2, df1.product_id == df2.product_id)

The DataFrame df1 contains ~10 GB of sales data, and df2 contains ~8 MB of product data.

Which join strategy will Spark use?

Options:
A.

Shuffle join, as the size difference between df1 and df2 is too large for a broadcast join to work efficiently.

B.

Shuffle join, because AQE is not enabled, and Spark uses a static query plan.

C.

Shuffle join because no broadcast hints were provided.

D.

Broadcast join, as df2 is smaller than the default broadcast threshold.
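Note on the broadcast threshold: spark.sql.autoBroadcastJoinThreshold defaults to 10485760 bytes (10 MB), and a join side whose estimated size is below that value is a candidate for automatic broadcast. The sketch below only compares the sizes from the question against that default in plain Python. The function name fits_default_threshold is hypothetical, and Spark's real decision depends on the planner's size estimate, which can differ from the size on disk.

```python
# Documented default of spark.sql.autoBroadcastJoinThreshold, in bytes (10 MB).
DEFAULT_BROADCAST_THRESHOLD = 10 * 1024 * 1024

def fits_default_threshold(size_bytes: int,
                           threshold: int = DEFAULT_BROADCAST_THRESHOLD) -> bool:
    """True when the given size is below the broadcast threshold."""
    return size_bytes < threshold

df2_size = 8 * 1024 * 1024           # ~8 MB of product data
df1_size = 10 * 1024 * 1024 * 1024   # ~10 GB of sales data

print(fits_default_threshold(df2_size))
print(fits_default_threshold(df1_size))
```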