Free Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Practice Exam with Questions & Answers | Set: 2

Question 11

You have two DataFrames to join:

DataFrame A: 128 GB of transactions

DataFrame B: 1 GB user lookup table

Which strategy is correct for broadcasting?

Options:
A.

DataFrame B should be broadcasted because it is smaller and will eliminate the need for shuffling itself

B.

DataFrame B should be broadcasted because it is smaller and will eliminate the need for shuffling DataFrame A

C.

DataFrame A should be broadcasted because it is larger and will eliminate the need for shuffling DataFrame B

D.

DataFrame A should be broadcasted because it is smaller and will eliminate the need for shuffling itself

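For reference, a minimal PySpark sketch of this scenario (the file paths and the user_id join key are illustrative assumptions, not part of the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

# Stand-ins for the 128 GB transactions DataFrame (A) and the 1 GB user
# lookup table (B); the paths are illustrative only.
transactions_df = spark.read.parquet("/data/transactions")
users_df = spark.read.parquet("/data/users")

# Broadcasting the small lookup table copies it to every executor, so the
# large transactions DataFrame can be joined in place instead of being
# shuffled across the network.
joined = transactions_df.join(broadcast(users_df), on="user_id", how="inner")
joined.explain()  # the physical plan should show a BroadcastHashJoin
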
Question 12

What is the risk associated with converting a large Pandas API on Spark DataFrame back to a pandas DataFrame?

Options:
A.

The conversion will automatically distribute the data across worker nodes

B.

The operation will fail if the Pandas DataFrame exceeds 1000 rows

C.

Data will be lost during conversion

D.

The operation will load all data into the driver's memory, potentially causing memory overflow

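A short sketch of the conversion in question, assuming an illustrative pandas-on-Spark DataFrame (the path and column names are made up):

import pyspark.pandas as ps

# Illustrative pandas-on-Spark DataFrame.
psdf = ps.read_parquet("/data/large_table")

# to_pandas() collects every row onto the driver as a plain pandas DataFrame,
# so a sufficiently large dataset can overflow driver memory.
pdf = psdf.to_pandas()

# Safer pattern: shrink the data (aggregate, filter, or sample) before converting.
summary_pdf = psdf.groupby("country").sum().to_pandas()
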
Question 13

A developer is running Spark SQL queries and notices underutilization of resources. Executors are idle, and the number of tasks per stage is low.

What should the developer do to improve cluster utilization?

Options:
A.

Increase the value of spark.sql.shuffle.partitions

B.

Reduce the value of spark.sql.shuffle.partitions

C.

Increase the size of the dataset to create more partitions

D.

Enable dynamic resource allocation to scale resources as needed

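A sketch of the two tuning knobs the options refer to (the partition value 800 is only an example, not a recommendation):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# spark.sql.shuffle.partitions controls how many tasks each shuffle stage
# produces (default 200). Raising it creates more, smaller tasks so idle
# executor cores receive work.
spark.conf.set("spark.sql.shuffle.partitions", "800")

# Dynamic allocation is the other lever: it lets Spark add or remove executors
# to match the workload, and is normally enabled at submit time, e.g.
#   --conf spark.dynamicAllocation.enabled=true
#   --conf spark.shuffle.service.enabled=true
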
Question 14

Given a DataFrame df that has 10 partitions, after running the code:

result = df.coalesce(20)

How many partitions will the result DataFrame have?

Options:
A.

10

B.

Same number as the cluster executors

C.

1

D.

20

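A quick sketch that can be run to confirm the coalesce() behaviour (the DataFrame here is synthetic):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000).repartition(10)
print(df.rdd.getNumPartitions())        # 10

# coalesce() can only merge partitions, never split them, so requesting 20
# from a 10-partition DataFrame leaves the count unchanged.
result = df.coalesce(20)
print(result.rdd.getNumPartitions())    # still 10

# Increasing the partition count requires a full shuffle via repartition().
print(df.repartition(20).rdd.getNumPartitions())   # 20
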
Question 15

Which UDF implementation calculates the length of strings in a Spark DataFrame?

Options:
A.

df.withColumn("length", spark.udf("len", StringType()))

B.

df.select(length(col("stringColumn")).alias("length"))

C.

spark.udf.register("stringLength", lambda s: len(s))

D.

df.withColumn("length", udf(lambda s: len(s), StringType()))

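For comparison, a small sketch showing both the built-in length() function and an equivalent Python UDF (the sample data is made up):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, length, udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("spark",), ("databricks",)], ["stringColumn"])

# Built-in column function: no Python UDF overhead, usually the better choice.
df.select(length(col("stringColumn")).alias("length")).show()

# Equivalent Python UDF: len() returns an int, so IntegerType is the matching
# return type, and the UDF object is applied to the column like a function.
string_length = udf(lambda s: len(s) if s is not None else None, IntegerType())
df.withColumn("length", string_length(col("stringColumn"))).show()
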
Question 16

A developer is building a Spark application to monitor task performance across a cluster.

One requirement is to track the maximum processing time for tasks on each worker node and consolidate this information on the driver for further analysis.

Which technique should the developer use?

Options:
A.

Broadcast a variable to share the maximum time among workers.

B.

Configure the Spark UI to automatically collect maximum times.

C.

Use an RDD action like reduce() to compute the maximum time.

D.

Use an accumulator to record the maximum time on the driver.

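A minimal sketch of computing maxima on the executors and consolidating them on the driver (the worker names and timings are invented):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Hypothetical (worker, task_time_ms) records.
timings = sc.parallelize([("w1", 120), ("w2", 340), ("w1", 95), ("w3", 210)])

# reduce() is an action: each executor computes a partial maximum and the
# final maximum is returned to the driver.
overall_max = timings.map(lambda t: t[1]).reduce(max)

# Per-worker maxima, also consolidated on the driver by an action.
per_worker_max = timings.reduceByKey(max).collect()

print(overall_max, per_worker_max)
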
Question 17

An engineer has two DataFrames — df1 (small) and df2 (large). To optimize the join, the engineer uses a broadcast join:

from pyspark.sql.functions import broadcast

df_result = df2.join(broadcast(df1), on="id", how="inner")

What is the purpose of using broadcast() in this scenario?

Options:
A.

It increases the partition size for df1 and df2.

B.

It ensures that the join happens only when the id values are identical.

C.

It reduces the number of shuffle operations by replicating the smaller DataFrame to all nodes.

D.

It filters the id values before performing the join.

Question 18

A data scientist is working on a project that requires processing large amounts of structured data, performing SQL queries, and applying machine learning algorithms. The data scientist is considering using Apache Spark for this task.

Which combination of Apache Spark modules should the data scientist use in this scenario?

Options:
A.

Spark DataFrames, Structured Streaming, and GraphX

B.

Spark SQL, Pandas API on Spark, and Structured Streaming

C.

Spark Streaming, GraphX, and Pandas API on Spark

D.

Spark DataFrames, Spark SQL, and MLlib

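A compact sketch combining DataFrames, Spark SQL, and MLlib in one pipeline (the dataset path and column names are assumptions):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.getOrCreate()

# DataFrames + Spark SQL for the structured data and the queries.
sales = spark.read.parquet("/data/sales")
sales.createOrReplaceTempView("sales")
training = spark.sql("SELECT price, quantity, revenue FROM sales WHERE revenue IS NOT NULL")

# MLlib for the machine learning step.
assembler = VectorAssembler(inputCols=["price", "quantity"], outputCol="features")
model = LinearRegression(featuresCol="features", labelCol="revenue").fit(assembler.transform(training))
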
Question 19

The following code fragment results in an error:

@F.udf(T.IntegerType())
def simple_udf(t: str) -> str:
    return answer * 3.14159

Which code fragment should be used instead?

Options:
A.

@F.udf(T.IntegerType())
def simple_udf(t: int) -> int:
    return t * 3.14159

B.

@F.udf(T.DoubleType())
def simple_udf(t: float) -> float:
    return t * 3.14159

C.

@F.udf(T.DoubleType())
def simple_udf(t: int) -> int:
    return t * 3.14159

D.

@F.udf(T.IntegerType())
def simple_udf(t: float) -> float:
    return t * 3.14159

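A runnable sketch of the DoubleType/float variant, which avoids the type mismatch in the original fragment (the sample DataFrame is illustrative):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T

spark = SparkSession.builder.getOrCreate()

# Multiplying by 3.14159 always yields a float, so the declared Spark return
# type has to be DoubleType; an IntegerType UDF would come back as null.
@F.udf(T.DoubleType())
def simple_udf(t: float) -> float:
    return t * 3.14159

df = spark.createDataFrame([(1.0,), (2.0,)], ["t"])
df.withColumn("scaled", simple_udf(F.col("t"))).show()
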
Question 20

A Spark application is experiencing performance issues in client mode due to the driver being resource-constrained.

How should this issue be resolved?

Options:
A.

Switch the deployment mode to cluster mode.

B.

Add more executor instances to the cluster.

C.

Increase the driver memory on the client machine.

D.

Switch the deployment mode to local mode.
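
A brief note on deploy modes; the spark-submit flags below are shown only as illustration:

from pyspark.sql import SparkSession

# In client mode the driver runs on the submitting machine; if that machine is
# resource-constrained, running the driver inside the cluster removes the limit.
# Illustrative spark-submit invocations (flags for comparison only):
#   spark-submit --deploy-mode client  --driver-memory 4g app.py
#   spark-submit --deploy-mode cluster --driver-memory 8g app.py

spark = SparkSession.builder.getOrCreate()
print(spark.conf.get("spark.submit.deployMode", "client"))  # check the active mode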