
Free Databricks Databricks-Certified-Professional-Data-Engineer Practice Exam with Questions & Answers | Set: 4

Question 31

A Delta Lake table representing metadata about content posts from users has the following schema:

user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE

This table is partitioned by the date column. A query is run with the following filter:

longitude < 20 & longitude > -20

Which statement describes how data will be filtered?

Options:
A.

Statistics in the Delta Log will be used to identify partitions that might include files in the filtered range.

B.

No file skipping will occur because the optimizer does not know the relationship between the partition column and the longitude.

C.

The Delta Engine will use row-level statistics in the transaction log to identify the files that meet the filter criteria.

D.

Statistics in the Delta Log will be used to identify data files that might include records in the filtered range.

E.

The Delta Engine will scan the parquet file footers to identify each row that meets the filter criteria.
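
A quick sketch of the distinction the options draw (registering the table as user_posts is an assumption; the schema follows the question):

# The date partitioning cannot help here: the predicate is on longitude.
# Delta Lake still records per-file min/max column statistics in the
# transaction log, so data files whose longitude range lies entirely
# outside (-20, 20) can be skipped without being read.
df = spark.read.table("user_posts")
filtered = df.where("longitude < 20 AND longitude > -20")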

Question 32

Given the following error traceback (produced by display(df.select(3*"heartrate"))), which shows AnalysisException: cannot resolve 'heartrateheartrateheartrate', which statement describes the error being raised?

Options:
A.

There is a type error because a DataFrame object cannot be multiplied.

B.

There is a syntax error because the heartrate column is not correctly identified as a column.

C.

There is no column in the table named heartrateheartrateheartrate.

D.

There is a type error because a column object cannot be multiplied.

Question 33

A new data engineer notices that a critical field was omitted from an application that writes its Kafka source to Delta Lake. The field was present in the Kafka source, yet it is missing from the data written to dependent, long-term storage. The retention threshold on the Kafka service is seven days, and the pipeline has been in production for three months.

Which describes how Delta Lake can help to avoid data loss of this nature in the future?

Options:
A.

The Delta log and Structured Streaming checkpoints record the full history of the Kafka producer.

B.

Delta Lake schema evolution can retroactively calculate the correct value for newly added fields, as long as the data was in the original source.

C.

Delta Lake automatically checks that all fields present in the source data are included in the ingestion layer.

D.

Data can never be permanently dropped or deleted from Delta Lake, so data loss is not possible under any circumstance.

E.

Ingesting all raw data and metadata from Kafka to a bronze Delta table creates a permanent, replayable history of the data state.
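
A minimal sketch of the bronze-ingestion pattern in option E (broker address, topic name, and checkpoint path are assumptions): landing the raw Kafka payload and its metadata unparsed in a Delta table preserves a replayable history well beyond Kafka's seven-day retention.

raw = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")  # hypothetical broker
    .option("subscribe", "events")                    # hypothetical topic
    .load())

# key, value, topic, partition, offset, and timestamp are all persisted,
# so any downstream table can be rebuilt by replaying the bronze table.
(raw.writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/bronze_events")  # hypothetical path
    .table("bronze_events"))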

Question 34

A nightly job ingests data into a Delta Lake table using the following code:

The next step in the pipeline requires a function that returns an object that can be used to manipulate new records that have not yet been processed to the next table in the pipeline.

Which code snippet completes this function definition?

def new_records():

Options:
A.

return spark.readStream.table("bronze")

B.

return spark.readStream.load("bronze")

C.

D.

return spark.read.option("readChangeFeed", "true").table ("bronze")

E.
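
The ingestion snippet and the bodies of options C and E are not reproduced in this set. For contrast with the other options, a sketch of the incremental-read pattern from option A (assuming bronze is the Delta table the nightly job writes):

def new_records():
    # A streaming read of a Delta table is incremental by default: each
    # micro-batch sees only files committed since the last checkpointed
    # offset, i.e. records not yet processed into the next table.
    return spark.readStream.table("bronze")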

Question 35

A table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.

The churn prediction model used by the ML team is fairly stable in production. The team is only interested in making predictions on records that have changed in the past 24 hours.

Which approach would simplify the identification of these changed records?

Options:
A.

Apply the churn model to all rows in the customer_churn_params table, but implement logic to perform an upsert into the predictions table that ignores rows where predictions have not changed.

B.

Convert the batch job to a Structured Streaming job using the complete output mode; configure a Structured Streaming job to read from the customer_churn_params table and incrementally predict against the churn model.

C.

Calculate the difference between the previous model predictions and the current customer_churn_params on a key identifying unique customers before making new predictions; only make predictions on those customers not in the previous predictions.

D.

Modify the overwrite logic to include a field populated by calling pyspark.sql.functions.current_timestamp() as data are being written; use this field to identify records written on a particular date.

E.

Replace the current overwrite logic with a merge statement to modify only those records that have changed; write logic to make predictions on the changed records identified by the change data feed.
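
A sketch of the change data feed half of option E (the table property must be enabled before changes are tracked; the starting timestamp is a placeholder):

spark.sql("""
    ALTER TABLE customer_churn_params
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

changes = (spark.read
    .option("readChangeFeed", "true")
    .option("startingTimestamp", "2024-01-01 00:00:00")  # placeholder for "last 24 hours"
    .table("customer_churn_params"))
# the _change_type column distinguishes the inserts and updates the merge produced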

Question 36

Given the following error traceback:

AnalysisException: cannot resolve 'heartrateheartrateheartrate' given input columns:

[spark_catalog.database.table.device_id, spark_catalog.database.table.heartrate,

spark_catalog.database.table.mrn, spark_catalog.database.table.time]

The code snippet was:

display(df.select(3*"heartrate"))

Which statement describes the error being raised?

Options:
A.

There is a type error because a DataFrame object cannot be multiplied.

B.

There is a syntax error because the heartrate column is not correctly identified as a column.

C.

There is no column in the table named heartrateheartrateheartrate.

D.

There is a type error because a column object cannot be multiplied.
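
The root cause is Python string repetition: 3*"heartrate" evaluates to the string 'heartrateheartrateheartrate' before Spark ever sees it, so the analyzer looks up a column by that name. A minimal contrast (assuming df has the columns listed in the traceback):

from pyspark.sql.functions import col

3 * "heartrate"                           # 'heartrateheartrateheartrate' (string repetition)
display(df.select(3 * col("heartrate")))  # Column arithmetic: what was intended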

Question 37

The data engineering team has configured a Databricks SQL query and alert to monitor the values in a Delta Lake table. The recent_sensor_recordings table contains an identifying sensor_id alongside the timestamp and temperature for the most recent 5 minutes of recordings.

The below query is used to create the alert:

The query is set to refresh each minute and always completes in less than 10 seconds. The alert is set to trigger when mean(temperature) > 120. Notifications are configured to be sent at most once every minute.

If this alert raises notifications for 3 consecutive minutes and then stops, which statement must be true?

Options:
A.

The total average temperature across all sensors exceeded 120 on three consecutive executions of the query

B.

The recent_sensor_recordings table was unresponsive for three consecutive runs of the query

C.

The source query failed to update properly for three consecutive minutes and then restarted

D.

The maximum temperature recording for at least one sensor exceeded 120 on three consecutive executions of the query

E.

The average temperature recordings for at least one sensor exceeded 120 on three consecutive executions of the query
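
The alert query itself is not reproduced in this set. A hypothetical reconstruction consistent with the options would aggregate per sensor, so a single sensor's mean can trip the threshold:

# Hypothetical reconstruction; the original query image is not shown.
spark.sql("""
    SELECT sensor_id, MEAN(temperature) AS mean_temperature
    FROM recent_sensor_recordings
    GROUP BY sensor_id
""")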

Question 38

A junior member of the data engineering team is exploring the language interoperability of Databricks notebooks. The intended outcome of the below code is to register a view of all sales that occurred in countries on the continent of Africa that appear in the geo_lookup table.

Before executing the code, running SHOW TABLES on the current database indicates the database contains only two tables: geo_lookup and sales.

Which statement correctly describes the outcome of executing these command cells in order in an interactive notebook?

Options:
A.

Both commands will succeed. Executing SHOW TABLES will show that countries_af and sales_af have been registered as views.

B.

Cmd 1 will succeed. Cmd 2 will search all accessible databases for a table or view named countries_af; if this entity exists, Cmd 2 will succeed.

C.

Cmd 1 will succeed and Cmd 2 will fail. countries_af will be a Python variable representing a PySpark DataFrame.

D.

Both commands will fail. No new variables, tables, or views will be created.

E.

Cmd 1 will succeed and Cmd 2 will fail. countries_af will be a Python variable containing a list of strings.
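
The two command cells are not reproduced in this set. A hedged reconstruction consistent with the options (the column names country and continent are assumptions; Cmd 2 was originally a SQL cell, shown here as comments):

# Cmd 1 (Python): countries_af becomes a Python list of strings, visible
# only to the Python interpreter and never registered in the metastore.
countries_af = [row["country"] for row in
                spark.table("geo_lookup").filter("continent = 'AF'").collect()]

# Cmd 2 (%sql): fails to resolve countries_af, because SQL cannot see
# Python variables and no table or view by that name exists.
#   CREATE VIEW sales_af AS
#   SELECT * FROM sales WHERE country IN countries_af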

Question 39

A user wants to use DLT expectations to validate that a derived table named report contains all records from the source table, validation_copy.

The user attempts and fails to accomplish this by adding an expectation to the report table definition.

Which approach would allow using DLT expectations to validate all expected records are present in this table?

Options:
A.

Define a SQL UDF that performs a left outer join on two tables, and check if this returns null values for report key values in a DLT expectation for the report table.

B.

Define a function that performs a left outer join on validation_copy and report, and check against the result in a DLT expectation for the report table.

C.

Define a temporary table that performs a left outer join on validation_copy and report, and define an expectation that no report key values are null.

D.

Define a view that performs a left outer join on validation_copy and report, and reference this view in DLT expectations for the report table.
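
A sketch of option D in DLT Python (dataset and key column names are assumptions): a helper view materializes the left outer join so an expectation on a downstream table can assert that no source keys are missing.

import dlt

@dlt.view
def report_completeness():
    v = dlt.read("validation_copy").selectExpr("key AS source_key")
    r = dlt.read("report").selectExpr("key AS report_key")
    # a null report_key marks a source record that is missing from report
    return v.join(r, v.source_key == r.report_key, "left_outer")

@dlt.table
@dlt.expect_all({"all_source_records_present": "report_key IS NOT NULL"})
def report_validation():
    return dlt.read("report_completeness")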

Question 40

A data engineer has created a transactions Delta table on Databricks that should be used by the analytics team. The analytics team wants to use the table with another tool that requires Apache Iceberg format.

What should the data engineer do?

Options:
A.

Require the analytics team to use a tool that supports Delta tables.

B.

Enable UniForm with 'iceberg' on the transactions table so that the table can be read as an Iceberg table.

C.

Create an Iceberg copy of the transactions Delta table which can be used by the analytics team.

D.

Convert the transactions Delta table to Iceberg and enable UniForm so that the table can be read as a Delta table.
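
A sketch of enabling UniForm on the existing table (property names follow the Delta UniForm documentation; recent runtimes also require IcebergCompatV2):

spark.sql("""
    ALTER TABLE transactions SET TBLPROPERTIES (
      'delta.enableIcebergCompatV2' = 'true',
      'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
# The table stays a Delta table; Iceberg metadata is generated alongside it,
# so Iceberg-compatible clients can read transactions without maintaining a copy.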