
Free Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Practice Exam with Questions & Answers

Question 1


A data engineer is investigating a Spark cluster that is experiencing underutilization during scheduled batch jobs.

After checking the Spark logs, they noticed that tasks are often killed by timeout errors and that the logs contain several warnings about insufficient resources.

Which action should the engineer take to resolve the underutilization issue?

Options:
A.

Set the spark.network.timeout property to allow tasks more time to complete without being killed.

B.

Increase the executor memory allocation in the Spark configuration.

C.

Reduce the size of the data partitions to improve task scheduling.

D.

Increase the number of executor instances to handle more concurrent tasks.
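
For context, a minimal sketch of how such settings are applied when building a SparkSession. The configuration keys are standard Spark properties, but the values and the application name are illustrative assumptions, not a prescribed fix.

# Illustrative session setup for a scheduled batch job on an underutilized cluster.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("scheduled-batch-job")            # hypothetical application name
    # More executor instances let the cluster run more tasks concurrently,
    # which is the usual lever when resources sit idle.
    .config("spark.executor.instances", "8")
    # Raising the network timeout only hides the killed-task symptom;
    # it does not add capacity.
    .config("spark.network.timeout", "300s")
    .getOrCreate()
)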

Question 2


A data analyst is working with employees_df and needs to add a new column in which a 10% tax is calculated on the salary.

Additionally, the DataFrame contains the column age, which is not needed.

Which code fragment adds the tax column and removes the age column?

Options:
A.

employees_df = employees_df.withColumn("tax", col("salary") * 0.1).drop("age")

B.

employees_df = employees_df.withColumn("tax", lit(0.1)).drop("age")

C.

employees_df = employees_df.dropField("age").withColumn("tax", col("salary") * 0.1)

D.

employees_df = employees_df.withColumn("tax", col("salary") + 0.1).drop("age")

Question 3

A data scientist at a financial services company is working with a Spark DataFrame containing transaction records. The DataFrame has millions of rows and includes columns for transaction_id, account_number, transaction_amount, and timestamp. Due to an issue with the source system, some transactions were accidentally recorded multiple times with identical information across all fields. The data scientist needs to remove rows with duplicates across all fields to ensure accurate financial reporting.

Which approach should the data scientist use to deduplicate these transaction records using PySpark?

Options:
A.

df = df.dropDuplicates()

B.

df = df.groupBy("transaction_id").agg(F.first("account_number"), F.first("transaction_amount"), F.first("timestamp"))

C.

df = df.filter(F.col("transaction_id").isNotNull())

D.

df = df.dropDuplicates(["transaction_amount"])

Question 4

Which command overwrites an existing JSON file when writing a DataFrame?

Options:
A.

df.write.mode("overwrite").json("path/to/file")

B.

df.write.overwrite.json("path/to/file")

C.

df.write.json("path/to/file", overwrite=True)

D.

df.write.format("json").save("path/to/file", mode="overwrite")

Question 5

What is the benefit of Adaptive Query Execution (AQE)?

Options:
A.

It allows Spark to optimize the query plan before execution but does not adapt during runtime.

B.

It enables the adjustment of the query plan during runtime, handling skewed data, optimizing join strategies, and improving overall query performance.

C.

It optimizes query execution by parallelizing tasks and does not adjust strategies based on runtime metrics like data skew.

D.

It automatically distributes tasks across nodes in the clusters and does not perform runtime adjustments to the query plan.
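
For reference, a minimal sketch of the AQE-related settings (AQE is enabled by default in Spark 3.5); the keys are standard Spark SQL configurations, shown here only to illustrate what the feature controls at runtime.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aqe-example").getOrCreate()

# Re-optimize the physical plan at runtime using shuffle statistics.
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Split skewed join partitions detected during execution.
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# Coalesce small shuffle partitions after a stage completes.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")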

Question 6

Given this code:


.withWatermark("event_time", "10 minutes")

.groupBy(window("event_time", "15 minutes"))

.count()

What happens to data that arrives after the watermark threshold?

Options:
A.

Records that arrive later than the watermark threshold (10 minutes) will automatically be included in the aggregation if they fall within the 15-minute window.

B.

Any data arriving more than 10 minutes after the watermark threshold will be ignored and not included in the aggregation.

C.

Data arriving more than 10 minutes after the latest watermark will still be included in the aggregation but will be placed into the next window.

D.

The watermark ensures that late data arriving within 10 minutes of the latest event_time will be processed and included in the windowed aggregation.
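
To place the fragment in context, a minimal runnable sketch of a full streaming query; the rate source, the renamed event_time column, and the console sink are assumptions added for illustration.

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("watermark-example").getOrCreate()

# The built-in rate source emits a timestamp column, renamed here to event_time.
events = (
    spark.readStream.format("rate").option("rowsPerSecond", 5).load()
    .withColumnRenamed("timestamp", "event_time")
)

# Records whose event_time is more than 10 minutes older than the latest
# event_time seen so far are treated as too late for the 15-minute window counts.
counts = (
    events
    .withWatermark("event_time", "10 minutes")
    .groupBy(window(col("event_time"), "15 minutes"))
    .count()
)

query = counts.writeStream.outputMode("update").format("console").start()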

Question 7

A developer is trying to join two tables, sales.purchases_fct and sales.customer_dim, using the following code:


fact_df = purch_df.join(cust_df, F.col('customer_id') == F.col('custid'))

The developer has discovered that customers in the purchases_fct table that do not exist in the customer_dim table are being dropped from the joined table.

Which change should be made to the code to stop these customer records from being dropped?

Options:
A.

fact_df = purch_df.join(cust_df, F.col('customer_id') == F.col('custid'), 'left')

B.

fact_df = cust_df.join(purch_df, F.col('customer_id') == F.col('custid'))

C.

fact_df = purch_df.join(cust_df, F.col('cust_id') == F.col('customer_id'))

D.

fact_df = purch_df.join(cust_df, F.col('customer_id') == F.col('custid'), 'right_outer')
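
A minimal sketch of the join on toy data; the sample rows and extra columns are invented, while the key columns customer_id and custid follow the question.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("join-example").getOrCreate()
purch_df = spark.createDataFrame([(1, 50.0), (2, 75.0)], ["customer_id", "amount"])
cust_df = spark.createDataFrame([(1, "Ada")], ["custid", "name"])

# A left outer join keeps every purchases_fct row, even when no matching
# customer_dim record exists; unmatched dimension columns come back as null.
fact_df = purch_df.join(cust_df, F.col("customer_id") == F.col("custid"), "left")
fact_df.show()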

Question 8

A data engineer needs to persist a file-based data source to a specific location. However, by default, Spark writes to the warehouse directory (e.g., /user/hive/warehouse). To override this, the engineer must explicitly define the file path.

Which line of code ensures the data is saved to a specific location?

Options:
A.

users.write(path="/some/path").saveAsTable("default_table")

B.

users.write.saveAsTable("default_table").option("path", "/some/path")

C.

users.write.option("path", "/some/path").saveAsTable("default_table")

D.

users.write.saveAsTable("default_table", path="/some/path")

Question 9


A developer is creating a Spark application that performs multiple DataFrame transformations and actions. The developer wants to maintain optimal performance by properly managing the SparkSession.

How should the developer handle the SparkSession throughout the application?

Options:
A.

Use a single SparkSession instance for the entire application.

B.

Avoid using a SparkSession and rely on SparkContext only.

C.

Create a new SparkSession instance before each transformation.

D.

Stop and restart the SparkSession after each action.
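
A minimal sketch of the single-session pattern; the application name and the toy transformations are assumptions for illustration.

from pyspark.sql import SparkSession

# getOrCreate() returns the already-active session if one exists, so repeated
# calls throughout the application never spin up competing sessions.
spark = SparkSession.builder.appName("my-app").getOrCreate()

df = spark.range(100)
df = df.withColumnRenamed("id", "value")   # transformations reuse the same session
print(df.count())                          # actions reuse it as well

spark.stop()                               # stop once, when the application exits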

Question 10


A data scientist at an e-commerce company is working with user data obtained from its subscriber database and has stored the data in a DataFrame df_user.

Before further processing, the data scientist wants to create another DataFrame df_user_non_pii and store only the non-PII columns.

The PII columns in df_user are name, email, and birthdate.

Which code snippet can be used to meet this requirement?

Options:
A.

df_user_non_pii = df_user.drop("name", "email", "birthdate")

B.

df_user_non_pii = df_user.dropFields("name", "email", "birthdate")

C.

df_user_non_pii = df_user.select("name", "email", "birthdate")

D.

df_user_non_pii = df_user.remove("name", "email", "birthdate")
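
A minimal sketch of dropping the PII columns on toy data; the non-PII columns user_id and country, plus the sample row, are assumptions for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pii-example").getOrCreate()
df_user = spark.createDataFrame(
    [(1, "Ada", "ada@example.com", "1990-01-01", "SE")],
    ["user_id", "name", "email", "birthdate", "country"],
)

# drop() accepts multiple column names and returns a DataFrame without them.
df_user_non_pii = df_user.drop("name", "email", "birthdate")
df_user_non_pii.show()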