
Free Databricks Databricks-Certified-Professional-Data-Engineer Practice Exam with Questions & Answers | Set: 6

Question 51

A streaming video analytics team ingests billions of events daily into a Unity Catalog-managed Delta table video_events . Analysts run ad-hoc point-lookup queries on columns like user_id, campaign_id, and region. The team manually runs OPTIMIZE video_events ZORDER BY (user_id, campaign_id, region), but still sees poor performance on recent data and dislikes the operational overhead. The team wants a hands-off way to keep hot columns co-located as query patterns evolve.

Options:
A.

Schedule OPTIMIZE/ZORDER to run after each job to improve recent file performance.

B.

Enable Delta caching.

C.

Utilize Liquid Clustering (CLUSTER BY AUTO) and Predictive Optimization.

D.

Enable auto-compaction (optimizeWrite and autoCompact).

Question 52

A data engineer is implementing Unity Catalog governance for a multi-team environment. Data scientists need interactive clusters for basic data exploration tasks, while automated ETL jobs require dedicated processing.

How should the data engineer configure cluster isolation policies to enforce least privilege and ensure Unity Catalog compliance?

Options:
A.

Use only DEDICATED access mode for both interactive workloads and automated jobs to maximize security isolation.

B.

Allow all users to create any cluster type and rely on manual configuration to enable Unity Catalog access modes.

C.

Configure all clusters with NO ISOLATION_SHARED access mode since Unity Catalog works with any cluster configuration.

D.

Create compute policies with STANDARD access mode for interactive workloads and DEDICATED access mode for automated jobs.

Question 53

What is a method of installing a Python package scoped at the notebook level to all nodes in the currently active cluster?

Options:
A.

Use %pip install in a notebook cell

B.

Run source env/bin/activate in a notebook setup script

C.

Install libraries from PyPi using the cluster UI

D.

Use %sh install in a notebook cell

Question 54

A senior data engineer is planning large-scale data workflows. The task is to identify the considerations that form a foundation for creating scalable data models for managing large datasets. The team has listed Delta Lake capabilities and wants to determine which feature should not be considered a core factor.

Which key feature can be ignored while evaluating Delta Lake?

Options:
A.

Delta Lake’s ability to process data in both batch and streaming modes seamlessly, providing flexibility in ingestion and processing.

B.

Delta Lake works with various data formats (Parquet, JSON, CSV) and integrates well with Spark and Databricks tools.

C.

Delta Lake optimizes metadata handling, efficiently managing billions of files and facilitating scalability to petabyte-scale datasets.

D.

Delta Lake provides limited support for monitoring and troubleshooting data pipelines, so relevant partner tools have to be identified and set up for enhanced operational efficiency.

Question 55

A data engineer is using Auto Loader to read incoming JSON data as it arrives. They have configured Auto Loader to quarantine invalid JSON records but notice that over time, some records are being quarantined even though they are well-formed JSON.

The code snippet is:

df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("badRecordsPath", "/tmp/somewhere/badRecordsPath")
    .schema("a int, b int")
    .load("/Volumes/catalog/schema/raw_data/"))

What is the cause of the missing data?

Options:
A.

At some point, the upstream data provider switched everything to multi-line JSON.

B.

The badRecordsPath location is accumulating many small files.

C.

The source data is valid JSON but does not conform to the defined schema in some way.

D.

The engineer forgot to set the option cloudFiles.quarantineMode = "rescue".
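The distinction behind the schema-mismatch cause can be shown without Spark. Below is a minimal pure-Python sketch (the SCHEMA dict and classify helper are illustrative, not Auto Loader's actual internals) of how a record can be valid JSON yet still fail the declared schema a int, b int:

```python
import json

# Illustrative stand-in for the declared Auto Loader schema "a int, b int".
SCHEMA = {"a": int, "b": int}

def classify(record: str) -> str:
    """Distinguish malformed JSON from valid JSON that violates the schema."""
    try:
        parsed = json.loads(record)
    except json.JSONDecodeError:
        return "malformed JSON"
    if not all(isinstance(parsed.get(field), typ) for field, typ in SCHEMA.items()):
        return "valid JSON, schema mismatch"
    return "ok"

print(classify('{"a": 1, "b": 2}'))      # ok
print(classify('{"a": "one", "b": 2}'))  # valid JSON, schema mismatch
print(classify('{"a": 1, "b": 2'))       # malformed JSON
```

With badRecordsPath configured, records in either failure category can land in the quarantine location, which is why well-formed JSON may still appear there.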

Question 56

A workspace admin has created a new catalog called finance_data and wants to delegate permission management to a finance team lead without giving them full admin rights.

Which privilege should be granted to the finance team lead?

Options:
A.

ALL PRIVILEGES on the finance_data catalog.

B.

Make the finance team lead a metastore admin.

C.

GRANT OPTION privilege on the finance_data catalog.

D.

MANAGE privilege on the finance_data catalog.

Question 57

A data engineer is building a Lakeflow Declarative Pipelines pipeline to process healthcare claims data. A metadata JSON file defines data quality rules for multiple tables, including:

{
  "claims": [
    {"name": "valid_patient_id", "constraint": "patient_id IS NOT NULL"},
    {"name": "non_negative_amount", "constraint": "claim_amount >= 0"}
  ]
}

The pipeline must dynamically apply these rules to the claims table without hardcoding the rules.

How should the data engineer achieve this?

Options:
A.

Load the JSON metadata, loop through its entries, and apply expectations using dlt.expect_all.

B.

Invoke an external API to validate records against the metadata rules.

C.

Reference each expectation with @dlt.expect decorators in the table declaration.

D.

Use a SQL CONSTRAINT block referencing the JSON file path.
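The load-and-loop approach can be sketched in a few lines. This is a hedged example, not the exam's reference solution: the build_expectations helper is invented, the metadata is inlined rather than read from the file, and the dlt portion is commented out because dlt is only importable inside a pipeline:

```python
import json

# Metadata as in the snippet above; in practice it would be loaded
# from the JSON file (e.g. with open(path) and json.load).
metadata = json.loads('''
{
  "claims": [
    {"name": "valid_patient_id", "constraint": "patient_id IS NOT NULL"},
    {"name": "non_negative_amount", "constraint": "claim_amount >= 0"}
  ]
}
''')

def build_expectations(rules):
    """Loop through the metadata entries and build the
    {name: constraint} dict that dlt.expect_all accepts."""
    return {rule["name"]: rule["constraint"] for rule in rules}

expectations = build_expectations(metadata["claims"])

# Inside the pipeline (not runnable outside one):
# import dlt
# @dlt.table
# @dlt.expect_all(expectations)
# def claims():
#     return spark.readStream.table("raw_claims")
```

Because the dict is built at runtime, new rules added to the JSON file apply on the next pipeline update without touching the table code.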

Question 58

A junior developer complains that the code in their notebook isn't producing the correct results in the development environment. A shared screenshot reveals that while they're using a notebook versioned with Databricks Repos, they're using a personal branch that contains old logic. The desired branch named dev-2.3.9 is not available from the branch selection dropdown.

Which approach will allow this developer to review the current logic for this notebook?

Options:
A.

Use Repos to make a pull request, then use the Databricks REST API to update the current branch to dev-2.3.9

B.

Use Repos to pull changes from the remote Git repository and select the dev-2.3.9 branch.

C.

Use Repos to checkout the dev-2.3.9 branch and auto-resolve conflicts with the current branch

D.

Merge all changes back to the main branch in the remote Git repository and clone the repo again

E.

Use Repos to merge the current branch and the dev-2.3.9 branch, then make a pull request to sync with the remote repository

Question 59

Incorporating unit tests into a PySpark application requires upfront attention to the design of your jobs, or a potentially significant refactoring of existing code.

Which statement describes a main benefit that offsets this additional effort?

Options:
A.

Improves the quality of your data

B.

Validates a complete use case of your application

C.

Troubleshooting is easier since all steps are isolated and tested individually

D.

Yields faster deployment and execution times

E.

Ensures that all steps interact correctly to achieve the desired end result
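The isolation benefit depends on factoring job logic into small functions that can be tested on their own. A minimal sketch in plain Python (a real PySpark job would apply the same pattern to functions that take and return DataFrames; normalize_amount is an invented example, not code from the exam):

```python
def normalize_amount(record: dict) -> dict:
    """One isolated pipeline step: coerce 'amount' to float,
    defaulting missing or empty values to 0.0."""
    out = dict(record)
    out["amount"] = float(out.get("amount") or 0.0)
    return out

def test_normalize_amount():
    # A failing assertion points at this one step, not the whole job.
    assert normalize_amount({"amount": "3.5"})["amount"] == 3.5
    assert normalize_amount({})["amount"] == 0.0

test_normalize_amount()
```

Because each step is exercised individually, a failure localizes to one function rather than requiring the whole end-to-end job to be debugged.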

Question 60

A task orchestrator has been configured to run two hourly tasks. First, an outside system writes Parquet data to a directory mounted at /mnt/raw_orders/. After this data is written, a Databricks job containing the following code is executed:

(spark.readStream
    .format("parquet")
    .load("/mnt/raw_orders/")
    .withWatermark("time", "2 hours")
    .dropDuplicates(["customer_id", "order_id"])
    .writeStream
    .trigger(once=True)
    .table("orders")
)

Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order, and that the time field indicates when the record was queued in the source system. If the upstream system is known to occasionally enqueue duplicate entries for a single order hours apart, which statement is correct?

Options:
A.

The orders table will not contain duplicates, but records arriving more than 2 hours late will be ignored and missing from the table.

B.

The orders table will contain only the most recent 2 hours of records and no duplicates will be present.

C.

All records will be held in the state store for 2 hours before being deduplicated and committed to the orders table.

D.

Duplicate records enqueued more than 2 hours apart may be retained and the orders table may contain duplicate records with the same customer_id and order_id.
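The risk in the last option comes from state eviction: with a 2-hour watermark, dropDuplicates only remembers a key while it is within the watermark window, so a duplicate arriving after its key's state has been evicted looks new. A simplified pure-Python model (single stream, events processed in order; this approximates, rather than reproduces, Spark's state management):

```python
from datetime import datetime, timedelta

WATERMARK = timedelta(hours=2)

def dedupe_with_watermark(events):
    """Toy model of watermark-bounded deduplication.
    events: list of (event_time, key) in arrival order."""
    state = {}                     # key -> event_time last seen
    max_event_time = datetime.min  # drives the watermark threshold
    output = []
    for t, key in events:
        max_event_time = max(max_event_time, t)
        threshold = max_event_time - WATERMARK
        # Evict state older than the watermark threshold.
        state = {k: v for k, v in state.items() if v >= threshold}
        if key not in state:
            output.append((t, key))  # first (or re-)appearance is emitted
            state[key] = t
    return output

dup_close = [(datetime(2024, 1, 1, 0, 0), "order-1"),
             (datetime(2024, 1, 1, 1, 0), "order-1")]  # 1h apart -> deduped
dup_far   = [(datetime(2024, 1, 1, 0, 0), "order-2"),
             (datetime(2024, 1, 1, 3, 0), "order-2")]  # 3h apart -> both kept

print(len(dedupe_with_watermark(dup_close)))  # 1
print(len(dedupe_with_watermark(dup_far)))    # 2
```

The 3-hours-apart duplicate survives because the first occurrence's state was evicted before the second arrived, which is exactly why the orders table may contain duplicate customer_id/order_id pairs.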