Pre-Summer Sale Limited Time 70% Discount Offer - Ends in 0d 00h 00m 00s - Coupon code: 70track

Free Databricks Databricks-Certified-Professional-Data-Engineer Practice Exam with Questions & Answers

Questions 1

The data architect has mandated that all tables in the Lakehouse should be configured as external (also known as " unmanaged " ) Delta Lake tables.

Which approach will ensure that this requirement is met?

Options:
A.

When a database is being created, make sure that the LOCATION keyword is used.

B.

When configuring an external data warehouse for all table storage, leverage Databricks for all ELT.

C.

When data is saved to a table, make sure that a full file path is specified alongside the Delta format.

D.

When tables are created, make sure that the EXTERNAL keyword is used in the CREATE TABLE statement.

E.

When the workspace is being configured, make sure that external cloud object storage has been mounted.

Databricks Databricks-Certified-Professional-Data-Engineer Premium Access
Questions 2

When evaluating the Ganglia Metrics for a given cluster with 3 executor nodes, which indicator would signal proper utilization of the VM ' s resources?

Options:
A.

The five Minute Load Average remains consistent/flat

B.

Bytes Received never exceeds 80 million bytes per second

C.

Network I/O never spikes

D.

Total Disk Space remains constant

E.

CPU Utilization is around 75%

Questions 3

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Incremental state information should be maintained for 10 minutes for late-arriving data.

Streaming DataFrame df has the following schema:

" device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT "

Code block:

Databricks-Certified-Professional-Data-Engineer Question 3

Choose the response that correctly fills in the blank within the code block to complete this task.

Options:
A.

withWatermark( " event_time " , " 10 minutes " )

B.

awaitArrival( " event_time " , " 10 minutes " )

C.

await( " event_time + ‘10 minutes ' " )

D.

slidingWindow( " event_time " , " 10 minutes " )

E.

delayWrite( " event_time " , " 10 minutes " )

Questions 4

Which REST API call can be used to review the notebooks configured to run as tasks in a multi-task job?

Options:
A.

/jobs/runs/list

B.

/jobs/runs/get-output

C.

/jobs/runs/get

D.

/jobs/get

E.

/jobs/list

Questions 5

Which approach demonstrates a modular and testable way to use DataFrame.transform for ETL code in PySpark?

Options:
A.

class Pipeline:

def transform(self, df):

return df.withColumn( " value_upper " , upper(col( " value " )))

pipeline = Pipeline()

assertDataFrameEqual(pipeline.transform(test_input), expected)

B.

def upper_value(df):

return df.withColumn( " value_upper " , upper(col( " value " )))

def filter_positive(df):

return df.filter(df[ " id " ] > 0)

pipeline_df = df.transform(upper_value).transform(filter_positive)

C.

def upper_transform(df):

return df.withColumn( " value_upper " , upper(col( " value " )))

actual = test_input.transform(upper_transform)

assertDataFrameEqual(actual, expected)

D.

def transform_data(input_df):

# transformation logic here

return output_df

test_input = spark.createDataFrame([(1, " a " )], [ " id " , " value " ])

assertDataFrameEqual(transform_data(test_input), expected)

Questions 6

Which statement describes Delta Lake Auto Compaction?

Options:
A.

An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an optimize job is executed toward a default of 1 GB.

B.

Before a Jobs cluster terminates, optimize is executed on all tables modified during the most recent job.

C.

Optimized writes use logical partitions instead of directory partitions; because partition boundaries are only represented in metadata, fewer small files are written.

D.

Data is queued in a messaging bus instead of committing data directly to memory; all data is committed from the messaging bus in one batch once the job is complete.

E.

An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an optimize job is executed toward a default of 128 MB.

Questions 7

Which Python variable contains a list of directories to be searched when trying to locate required modules?

Options:
A.

importlib.resource path

B.

,sys.path

C.

os-path

D.

pypi.path

E.

pylib.source

Questions 8

Which statement regarding stream-static joins and static Delta tables is correct?

Options:
A.

Each microbatch of a stream-static join will use the most recent version of the static Delta table as of each microbatch.

B.

Each microbatch of a stream-static join will use the most recent version of the static Delta table as of the job ' s initialization.

C.

The checkpoint directory will be used to track state information for the unique keys present in the join.

D.

Stream-static joins cannot use static Delta tables because of consistency issues.

E.

The checkpoint directory will be used to track updates to the static Delta table.

Questions 9

A data governance team at a large enterprise is improving data discoverability across its organization. The team has hundreds of tables in their Databricks Lakehouse with thousands of columns that lack proper documentation. Many of these tables were created by different teams over several years, with missing context about column meanings and business logic. The data governance team needs to quickly generate comprehensive column descriptions for all existing tables to meet compliance requirements and improve data literacy across the organization. They want to leverage modern capabilities to automatically generate meaningful descriptions rather than manually documenting each column, which would take months to complete.

Which approach should the team use in Databricks to automatically generate column comments and descriptions for existing tables?

Options:
A.

Navigate to the table in Databricks Catalog Explorer, select the table schema view, and use the AI Generate option which leverages artificial intelligence to automatically create meaningful column descriptions based on column names, data types, sample values, and data patterns.

B.

Use Delta Lake’s DESCRIBE HISTORY command to analyze table evolution and infer column purposes from historical changes.

C.

Use the DESCRIBE TABLE command to extract existing schema information and manually write descriptions based on column names and data types.

D.

Write custom PySpark code using df.describe() and df.schema to programmatically generate basic statistical descriptions for each column.

Questions 10

A CHECK constraint has been successfully added to the Delta table named activity_details using the following logic:

Databricks-Certified-Professional-Data-Engineer Question 10

A batch job is attempting to insert new records to the table, including a record where latitude = 45.50 and longitude = 212.67.

Which statement describes the outcome of this batch insert?

Options:
A.

The write will fail when the violating record is reached; any records previously processed will be recorded to the target table.

B.

The write will fail completely because of the constraint violation and no records will be inserted into the target table.

C.

The write will insert all records except those that violate the table constraints; the violating records will be recorded to a quarantine table.

D.

The write will include all records in the target table; any violations will be indicated in the boolean column named valid_coordinates.

E.

The write will insert all records except those that violate the table constraints; the violating records will be reported in a warning log.