
Free Databricks Databricks-Certified-Associate-Developer-for-Apache-Spark-3.5 Practice Exam with Questions & Answers | Set: 4

Question 31

A data engineer is working with a large JSON dataset containing order information. The dataset is stored in a distributed file system and needs to be loaded into a Spark DataFrame for analysis. The data engineer wants to ensure that the schema is correctly defined and that the data is read efficiently.

Which approach should the data engineer use to efficiently load the JSON data into a Spark DataFrame with a predefined schema?

Options:
A.

Use spark.read.json() to load the data, then use DataFrame.printSchema() to view the inferred schema, and finally use DataFrame.cast() to modify column types.

B.

Use spark.read.json() with the inferSchema option set to true

C.

Use spark.read.format("json").load() and then use DataFrame.withColumn() to cast each column to the desired data type.

D.

Define a StructType schema and use spark.read.schema(predefinedSchema).json() to load the data.
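
For reference, a minimal sketch of loading JSON with an explicitly defined schema (the path and column names below are illustrative, not taken from the question):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("orders").getOrCreate()

# Defining the schema up front avoids the extra pass over the data that schema inference requires.
order_schema = StructType([
    StructField("order_id", StringType(), nullable=False),
    StructField("amount", DoubleType(), nullable=True),
    StructField("order_ts", TimestampType(), nullable=True),
])

orders_df = spark.read.schema(order_schema).json("/data/orders/")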

Question 32

Given the schema:


event_ts TIMESTAMP,

sensor_id STRING,

metric_value LONG,

ingest_ts TIMESTAMP,

source_file_path STRING

The goal is to deduplicate the data based on the columns event_ts, sensor_id, and metric_value. Which approach should be used?

Options:
A.

dropDuplicates on all columns (wrong criteria)

B.

dropDuplicates with no arguments (removes based on all columns)

C.

groupBy without aggregation (invalid use)

D.

dropDuplicates on the exact matching fields
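
For reference, a minimal sketch of deduplicating on a subset of columns, assuming a DataFrame events_df with the schema above:

# Keep one row per (event_ts, sensor_id, metric_value) combination;
# ingest_ts and source_file_path do not influence which rows count as duplicates.
deduped_df = events_df.dropDuplicates(["event_ts", "sensor_id", "metric_value"])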

Question 33

The following code fragment results in an error:

[Code fragment image not shown]

Which code fragment should be used instead?

A)

[Code fragment image not shown]

B)

[Code fragment image not shown]

C)

[Code fragment image not shown]

D)

[Code fragment image not shown]

Question 34

A developer wants to test Spark Connect with an existing Spark application.

What are the two alternative ways the developer can start a local Spark Connect server without changing their existing application code? (Choose 2 answers)

Options:
A.

Execute their pyspark shell with the option --remote "https://localhost"

B.

Execute their pyspark shell with the option --remote "sc://localhost"

C.

Set the environment variable SPARK_REMOTE="sc://localhost" before starting the pyspark shell

D.

Add .remote("sc://localhost") to their SparkSession.builder calls in their Spark code

E.

Ensure the Spark property spark.connect.grpc.binding.port is set to 15002 in the application code
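
For context, a Spark Connect client addresses the server with an sc:// URI, and 15002 is the default gRPC port. Below is a minimal sketch of the in-code connection style that option D refers to, assuming a local Spark Connect server is already running:

from pyspark.sql import SparkSession

# Connects this PySpark session to a local Spark Connect server over gRPC.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
print(spark.range(5).count())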

Question 35

A data engineer needs to write a DataFrame df to a Parquet file, partitioned by the column country, and overwrite any existing data at the destination path.

Which code should the data engineer use to accomplish this task in Apache Spark?

Options:
A.

df.write.mode("overwrite").partitionBy("country").parquet("/data/output")

B.

df.write.mode("append").partitionBy("country").parquet("/data/output")

C.

df.write.mode("overwrite").parquet("/data/output")

D.

df.write.partitionBy("country").parquet("/data/output")
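
For reference, a minimal sketch of writing and re-reading a partitioned Parquet dataset (the path is illustrative, and spark and df are assumed to already exist):

# Write the DataFrame as Parquet, one subdirectory per distinct country value,
# replacing anything already at the destination path.
df.write.mode("overwrite").partitionBy("country").parquet("/data/output")

# Reading the output back; the country column is reconstructed from the partition directories.
restored_df = spark.read.parquet("/data/output")
restored_df.show()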

Question 36

Which feature of Spark Connect should be considered when designing an application that interacts with a Spark cluster remotely?

Options:
A.

It provides a way to run Spark applications remotely in any programming language

B.

It can be used to interact with any remote cluster using the REST API

C.

It allows for remote execution of Spark jobs

D.

It is primarily used for data ingestion into Spark from external sources

Question 37

A data analyst is working with the DataFrame sensor_df, which contains two columns:

Which code fragment returns a DataFrame that splits the record column into separate columns and has one array item per row?

A)

[Code fragment image not shown]

B)

[Code fragment image not shown]

C)

[Code fragment image not shown]

D)

[Code fragment image not shown]

Options:
A.

exploded_df = sensor_df.withColumn("record_exploded", explode("record"))

exploded_df = exploded_df.select("record_datetime", "sensor_id", "status", "health")

B.

exploded_df = exploded_df.select(

"record_datetime",

"record_exploded.sensor_id",

"record_exploded.status",

"record_exploded.health"

)

exploded_df = sensor_df.withColumn("record_exploded", explode("record"))

C.

exploded_df = exploded_df.select(

"record_datetime",

"record_exploded.sensor_id",

"record_exploded.status",

"record_exploded.health"

)

exploded_df = sensor_df.withColumn("record_exploded", explode("record"))

D.

exploded_df = exploded_df.select("record_datetime", "record_exploded")
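
For reference, explode() on an array-of-structs column yields one row per array element, and the struct fields can then be flattened with dot notation. A minimal sketch, assuming record is an array of structs with fields sensor_id, status, and health:

from pyspark.sql.functions import explode

# One output row per element of the record array.
exploded_df = sensor_df.withColumn("record_exploded", explode("record"))

# Pull the struct fields out into top-level columns.
exploded_df = exploded_df.select(
    "record_datetime",
    "record_exploded.sensor_id",
    "record_exploded.status",
    "record_exploded.health",
)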

Question 38


What is the behavior of the function date_sub(start, days) if a negative value is passed into the days parameter?

Options:
A.

The number of days specified will be added to the start date.

B.

An error message of an invalid parameter will be returned.

C.

The same start date will be returned.

D.

The number of days specified will be removed from the start date.
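
For context, date_sub(start, days) subtracts the given number of days, so a negative days value shifts the date forward. A quick sketch (spark is assumed to be an active SparkSession):

from pyspark.sql import functions as F

df = spark.createDataFrame([("2024-01-10",)], ["d"]).select(F.col("d").cast("date").alias("d"))

# Subtracting -3 days moves the date forward to 2024-01-13.
df.select(F.date_sub("d", -3).alias("shifted")).show()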

Question 39

What is the relationship between jobs, stages, and tasks during execution in Apache Spark?

Options:
A.

A job contains multiple stages, and each stage contains multiple tasks.

B.

A job contains multiple tasks, and each task contains multiple stages.

C.

A stage contains multiple jobs, and each job contains multiple tasks.

D.

A stage contains multiple tasks, and each task contains multiple jobs.
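
For context, a rough sketch of how a single action maps onto jobs, stages, and tasks (spark is assumed to be an active SparkSession):

# The collect() action launches one job. The shuffle introduced by groupBy splits
# that job into two stages, and each stage runs many parallel tasks
# (roughly one task per partition).
df = spark.range(0, 1_000_000, numPartitions=8)
df.groupBy((df.id % 10).alias("bucket")).count().collect()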

Question 40

An engineer wants to join two DataFrames df1 and df2 on the respective employee_id and emp_id columns:

df1: employee_id INT, name STRING

df2: emp_id INT, department STRING

The engineer uses:

result = df1.join(df2, df1.employee_id == df2.emp_id, how='inner')

What is the behaviour of the code snippet?

Options:
A.

The code fails to execute because the column names employee_id and emp_id do not match automatically

B.

The code fails to execute because it must use on='employee_id' to specify the join column explicitly

C.

The code fails to execute because PySpark does not support joining DataFrames with a different structure

D.

The code works as expected because the join condition explicitly matches employee_id from df1 with emp_id from df2
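
For reference, a minimal sketch of joining on differently named key columns (the sample rows are illustrative, and spark is assumed to be an active SparkSession):

df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["employee_id", "name"])
df2 = spark.createDataFrame([(1, "Engineering"), (3, "Finance")], ["emp_id", "department"])

# The explicit equality condition handles the differing column names.
result = df1.join(df2, df1.employee_id == df2.emp_id, how="inner")

# Both key columns appear in the result; drop one if only a single key column is wanted.
result.drop("emp_id").show()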