
Free Databricks Databricks-Certified-Professional-Data-Engineer Practice Exam with Questions & Answers

Question 1

A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings. The source data contains 100 unique fields in a highly nested JSON structure.

The silver_device_recordings table will be used downstream for highly selective joins on a number of fields, and will also be leveraged by the machine learning team to filter on a handful of relevant fields. In total, 15 fields have been identified that will often be used for filter and join logic.

The data engineer is trying to determine the best approach for dealing with these nested fields before declaring the table schema.

Which of the following accurately presents information about Delta Lake and Databricks that may impact their decision-making process?

Options:
A.

Because Delta Lake uses Parquet for data storage, Dremel encoding information for nesting can be directly referenced by the Delta transaction log.

B.

Tungsten encoding used by Databricks is optimized for storing string data: newly-added native support for querying JSON strings means that string types are always most efficient.

C.

Schema inference and evolution on Databricks ensure that inferred types will always accurately match the data types used by downstream systems.

D.

By default Delta Lake collects statistics on the first 32 columns in a table; these statistics are leveraged for data skipping when executing selective queries.
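
For context on the statistics behavior referenced in option D: Delta Lake collects file-level min/max statistics only on the first 32 columns by default, and that window is configurable. A minimal sketch, assuming the table already exists and that its schema was declared with the 15 join/filter fields first (the property name is a real Delta Lake table property; the value shown is illustrative):

# Limit statistics collection to the 15 leading columns; declaring the
# frequently filtered and joined fields first keeps them all covered.
spark.sql("""
ALTER TABLE silver_device_recordings
SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '15')
""")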

Question 2

An external object storage container has been mounted to the location /mnt/finance_eda_bucket.

The following logic was executed to create a database for the finance team:
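
The executed snippet does not appear in this copy; a plausible sketch, assuming a database named database_finance (the database name and exact syntax are assumptions):

# Create a database whose default location sits inside the mounted bucket.
spark.sql("""
CREATE DATABASE IF NOT EXISTS database_finance
LOCATION '/mnt/finance_eda_bucket/database_finance.db'
""")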

After the database was successfully created and permissions configured, a member of the finance team runs the following code:
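
Again a plausible sketch only, assuming a source table named sales and the database defined above:

# CTAS with no explicit LOCATION, run against the finance database.
spark.sql("USE database_finance")
spark.sql("""
CREATE TABLE tx_sales AS
SELECT * FROM sales WHERE state = 'TX'
""")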

If all users on the finance team are members of the finance group, which statement describes how the tx_sales table will be created?

Options:
A.

A logical table will persist the query plan to the Hive Metastore in the Databricks control plane.

B.

An external table will be created in the storage container mounted to /mnt/finance_eda_bucket.

C.

A logical table will persist the physical plan to the Hive Metastore in the Databricks control plane.

D.

A managed table will be created in the storage container mounted to /mnt/finance_eda_bucket.

E.

A managed table will be created in the DBFS root storage container.

Question 3

A Delta table of weather records is partitioned by date and has the following schema:

date DATE, device_id INT, temp FLOAT, latitude FLOAT, longitude FLOAT

To find all the records from within the Arctic Circle, you execute a query with the following filter:

latitude > 66.3

Which statement describes how the Delta engine identifies which files to load?

Options:
A.

All records are cached to an operational database and then the filter is applied

B.

The Parquet file footers are scanned for min and max statistics for the latitude column

C.

All records are cached to attached storage and then the filter is applied

D.

The Delta log is scanned for min and max statistics for the latitude column

E.

The Hive metastore is scanned for min and max statistics for the latitude column
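
To make the scenario concrete, a minimal sketch, assuming the table is registered as weather_records (the name is an assumption). The filter is pushed down before files are read, so any file whose recorded statistics for latitude exclude values above 66.3 can be skipped entirely:

df = spark.table("weather_records").filter("latitude > 66.3")
df.explain()  # prints the physical plan, including the pushed-down filter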

Question 4

A table named user_ltv is being used to create a view that will be used by data analysts on various teams. Users in the workspace are configured into groups, which are used for setting up data access using ACLs.

The user_ltv table has the following schema:

email STRING, age INT, ltv INT

The following view definition is executed:
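
The view definition itself does not appear in this copy; a plausible sketch, assuming the common is_member redaction pattern (is_member is a built-in Databricks SQL function that checks the current user's group membership):

spark.sql("""
CREATE OR REPLACE VIEW email_ltv AS
SELECT
  CASE WHEN is_member('marketing') THEN email ELSE 'REDACTED' END AS email,
  ltv
FROM user_ltv
""")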

An analyst who is not a member of the marketing group executes the following query:

SELECT * FROM email_ltv

Which statement describes the results returned by this query?

Options:
A.

Three columns will be returned, but one column will be named "redacted" and contain only null values.

B.

Only the email and ltv columns will be returned; the email column will contain all null values.

C.

The email and ltv columns will be returned with the values in user_ltv.

D.

The email, age, and ltv columns will be returned with the values in user_ltv.

E.

Only the email and ltv columns will be returned; the email column will contain the string "REDACTED" in each row.

Question 5

When evaluating the Ganglia Metrics for a given cluster with 3 executor nodes, which indicator would signal proper utilization of the VMs' resources?

Options:
A.

The five-minute Load Average remains consistent/flat

B.

Bytes Received never exceeds 80 million bytes per second

C.

Network I/O never spikes

D.

Total Disk Space remains constant

E.

CPU Utilization is around 75%

Question 6

The data engineering team maintains the following code:
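
The maintained code does not appear in this copy; a plausible sketch consistent with the answer choices, assuming three source tables named orders, order_items, and accounts (every name except the target table is an assumption):

# Static batch reads, joined and written with overwrite semantics.
orders = spark.table("orders")
order_items = spark.table("order_items")
accounts = spark.table("accounts")

(orders
  .join(order_items, "order_id")
  .join(accounts, "account_id")
  .write
  .mode("overwrite")
  .saveAsTable("enriched_itemized_orders_by_account"))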

Assuming that this code produces logically correct results and the data in the source tables has been de-duplicated and validated, which statement describes what will occur when this code is executed?

Options:
A.

A batch job will update the enriched_itemized_orders_by_account table, replacing only those rows that have different values than the current version of the table, using accountID as the primary key.

B.

The enriched_itemized_orders_by_account table will be overwritten using the current valid version of data in each of the three tables referenced in the join logic.

C.

An incremental job will leverage information in the state store to identify unjoined rows in the source tables and write these rows to the enriched_itemized_orders_by_account table.

D.

An incremental job will detect if new rows have been written to any of the source tables; if new rows are detected, all results will be recalculated and used to overwrite the enriched_itemized_orders_by_account table.

E.

No computation will occur until enriched_itemized_orders_by_account is queried; upon query materialization, results will be calculated using the current valid version of data in each of the three tables referenced in the join logic.

Question 7

A data engineer is performing a join operation to combine values from a static userLookup table with a streaming DataFrame streamingDF.

Which code block attempts to perform an invalid stream-static join?

Options:
A.

userLookup.join(streamingDF, ["user_id"], how="inner")

B.

streamingDF.join(userLookup, ["user_id"], how="outer")

C.

streamingDF.join(userLookup, ["user_id"], how="left")

D.

streamingDF.join(userLookup, ["user_id"], how="inner")

E.

userLookup.join(streamingDF, ["user_id"], how="right")
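
For reference, a minimal sketch of a supported stream-static join, assuming a streaming source table named events (illustrative). Inner joins are supported with the stream on either side, and a left join is supported with the streaming DataFrame on the left; a full outer join between a stream and a static table is not supported:

streamingDF = spark.readStream.table("events")  # streaming side
userLookup = spark.table("user_lookup")         # static side

joined = streamingDF.join(userLookup, ["user_id"], how="inner")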

Question 8

An upstream source writes Parquet data as hourly batches to directories named with the current date. A nightly batch job runs the following code to ingest all data from the previous day as indicated by the date variable:
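
The ingest code does not appear in this copy; a plausible sketch of the conventional pattern for this question, assuming a raw path of /mnt/raw_orders (the path and target table name are assumptions):

(spark.read
  .parquet(f"/mnt/raw_orders/{date}")
  .dropDuplicates(["customer_id", "order_id"])  # dedupe within this batch only
  .write
  .mode("append")
  .saveAsTable("orders"))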

Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order.

If the upstream system is known to occasionally produce duplicate entries for a single order hours apart, which statement is correct?

Options:
A.

Each write to the orders table will only contain unique records, and only those records without duplicates in the target table will be written.

B.

Each write to the orders table will only contain unique records, but newly written records may have duplicates already present in the target table.

C.

Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, these records will be overwritten.

D.

Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, the operation will fail.

E.

Each write to the orders table will run deduplication over the union of new and existing records, ensuring no duplicate records are present.

Question 9

The Databricks workspace administrator has configured interactive clusters for each of the data engineering groups. To control costs, clusters are set to terminate after 30 minutes of inactivity. Each user should be able to execute workloads against their assigned clusters at any time of the day.

Assuming users have been added to a workspace but not granted any permissions, which of the following describes the minimal permissions a user would need to start and attach to an already configured cluster?

Options:
A.

"Can Manage" privileges on the required cluster

B.

Workspace Admin privileges, cluster creation allowed, "Can Attach To" privileges on the required cluster

C.

Cluster creation allowed, "Can Attach To" privileges on the required cluster

D.

"Can Restart" privileges on the required cluster

E.

Cluster creation allowed, "Can Restart" privileges on the required cluster

Question 10

Which statement describes Delta Lake Auto Compaction?

Options:
A.

An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an optimize job is executed toward a default of 1 GB.

B.

Before a Jobs cluster terminates, optimize is executed on all tables modified during the most recent job.

C.

Optimized writes use logical partitions instead of directory partitions; because partition boundaries are only represented in metadata, fewer small files are written.

D.

Data is queued in a messaging bus instead of committing data directly to memory; all data is committed from the messaging bus in one batch once the job is complete.

E.

An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an optimize job is executed toward a default of 128 MB.
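
For context, a minimal sketch of enabling Auto Compaction on an existing Delta table (reusing the table name from Question 1 purely for illustration); it can also be enabled session-wide via the Spark conf spark.databricks.delta.autoCompact.enabled:

spark.sql("""
ALTER TABLE silver_device_recordings
SET TBLPROPERTIES ('delta.autoOptimize.autoCompact' = 'true')
""")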