Databricks-Certified-Data-Engineer-Professional Free Exam Questions: "Databricks Certified Data Engineer Professional" Certification
A user wants to use DLT expectations to validate that a derived table named report contains all records from the source, which are also included in the table validation_copy.
The user attempts and fails to accomplish this by adding an expectation to the report table definition.
Which approach would allow using DLT expectations to validate that all expected records are present in this table?
Correct answer: B
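Since the answer options are not listed here, the following is only an illustrative sketch of the general pattern involved: a DLT expectation can only evaluate rows of the dataset it decorates, so a completeness check against a source is usually expressed by defining an additional DLT dataset that joins validation_copy to report and attaching the expectation there. The join column key is a hypothetical name.

```python
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Rows in validation_copy that have no match in report")
@dlt.expect_or_fail("all_records_present", "report_key IS NOT NULL")
def report_completeness_check():
    validation = dlt.read("validation_copy")
    report = dlt.read("report").select(F.col("key").alias("report_key"))
    # Hypothetical join column "key"; a left join leaves report_key NULL for any
    # source record missing from report, which trips the expectation above.
    return validation.join(report, validation["key"] == report["report_key"], "left")
```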
All records from an Apache Kafka producer are being ingested into a single Delta Lake table with the following schema:
key BINARY, value BINARY, topic STRING, partition LONG, offset LONG, timestamp LONG

There are 5 unique topics being ingested. Only the "registration" topic contains Personally Identifiable Information (PII). The company wishes to restrict access to PII. It also wishes to retain records containing PII in this table for only 14 days after initial ingestion; non-PII records, however, should be retained indefinitely.
Which of the following solutions meets the requirements?
Correct answer: B
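The answer options are not shown here; as a rough illustration only, one common way to satisfy both requirements is to land the "registration" topic in its own table, where access permissions and a 14-day delete can be applied without affecting the indefinitely retained non-PII topics. A hedged PySpark sketch, assuming a Databricks spark session and hypothetical table names:

```python
from pyspark.sql import functions as F

# Hedged sketch: "kafka_bronze", "registration_pii", and "other_topics" are
# hypothetical table names; only the column names come from the question's schema.
bronze = spark.read.table("kafka_bronze")

# Land the PII-bearing topic in its own table so access controls and retention
# can be managed separately from the non-PII topics.
(bronze.filter(F.col("topic") == "registration")
       .write.mode("append").saveAsTable("registration_pii"))

(bronze.filter(F.col("topic") != "registration")
       .write.mode("append").saveAsTable("other_topics"))

# Enforce the 14-day retention on the PII table only; the Kafka timestamp column
# is assumed to hold epoch milliseconds.
spark.sql("""
    DELETE FROM registration_pii
    WHERE timestamp < unix_millis(current_timestamp() - INTERVAL 14 DAYS)
""")
```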
A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Incremental state information should be maintained for 10 minutes for late-arriving data.
Streaming DataFrame df has the following schema:
"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"
Code block:

Choose the response that correctly fills in the blank within the code block to complete this task.
Correct answer: D
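Because the original code block image is not reproduced here, the following is a hedged sketch of the complete aggregation the question describes: a 10-minute watermark on event_time for late-arriving data, and a non-overlapping five-minute tumbling window over which temp and humidity are averaged.

```python
from pyspark.sql import functions as F

# Hedged sketch, not the exam's hidden code block: 5-minute tumbling windows
# with a 10-minute watermark bounding state kept for late-arriving records.
agg_df = (
    df.withWatermark("event_time", "10 minutes")
      .groupBy(F.window("event_time", "5 minutes"))
      .agg(
          F.avg("temp").alias("avg_temp"),
          F.avg("humidity").alias("avg_humidity"),
      )
)
```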
A Delta Lake table in the Lakehouse named customer_parsams is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting the table with the current valid values derived from upstream data sources.
Immediately after each update succeeds, the data engineering team would like to determine the difference between the new version and the previous version of the table. Given the current implementation, which method can be used?
Correct answer: A
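As an illustrative aside (the answer options are not listed here), Delta Lake's transaction log makes this kind of comparison possible through time travel: DESCRIBE HISTORY identifies the version numbers before and after the nightly overwrite, and the two versions can then be queried side by side. The version numbers below are hypothetical.

```python
# Hedged sketch: versions 11 and 12 are hypothetical; in practice they come
# from DESCRIBE HISTORY after the nightly overwrite completes.
spark.sql("DESCRIBE HISTORY customer_parsams").show()

# Rows present in the new version but not in the previous one.
new_vs_previous = spark.sql("""
    SELECT * FROM customer_parsams VERSION AS OF 12
    EXCEPT
    SELECT * FROM customer_parsams VERSION AS OF 11
""")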
The data governance team is reviewing code used for deleting records for compliance with GDPR. They note the following logic is used to delete records from the Delta Lake table named users.

Assuming that user_id is a unique identifying key and that delete_requests contains all users that have requested deletion, which statement describes whether successfully executing the above logic guarantees that the records to be deleted are no longer accessible and why?

Correct answer: A
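The code block referenced in the question is not reproduced above. What follows is a hedged reconstruction of the kind of delete logic typically discussed, and of why the guarantee usually hinges on physical file retention rather than on the DELETE statement itself.

```python
# Hedged reconstruction of the kind of logic the question describes (the actual
# code block is not shown): remove every user that appears in delete_requests.
spark.sql("""
    DELETE FROM users
    WHERE user_id IN (SELECT user_id FROM delete_requests)
""")

# A successful DELETE rewrites the affected data files, but the old files are
# still referenced by earlier table versions and remain readable via time travel
# until VACUUM physically removes them after the retention window.
spark.sql("VACUUM users RETAIN 168 HOURS")
```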
The data science team has requested assistance in accelerating queries on free-form text from user reviews. The data is currently stored in Parquet with the below schema:
item_id INT, user_id INT, review_id INT, rating FLOAT, review STRING
The review column contains the full text of the review left by the user. Specifically, the data science team is looking to identify if any of 30 key words exist in this field.
A junior data engineer suggests that converting this data to Delta Lake will improve query performance.
Which response to the junior data engineer's suggestion is correct?
Correct answer: C
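For context on why statistics-based file skipping is of limited help here: the workload is a substring search over free text, so every file containing reviews still has to be read regardless of the storage format. A hedged sketch of the kind of query involved, with hypothetical keywords and source path:

```python
from functools import reduce
from pyspark.sql import functions as F

# Hypothetical sample of the 30 key words; the real list is not given.
keywords = ["refund", "damaged", "excellent"]

reviews = spark.read.parquet("/data/reviews")  # hypothetical source path
keyword_hit = reduce(lambda a, b: a | b,
                     [F.col("review").contains(kw) for kw in keywords])

# Min/max column statistics on a free-text column cannot skip files for
# substring predicates like these, so the filter implies a full scan.
flagged = reviews.filter(keyword_hit)
```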
A new data engineer notices that a critical field was omitted from an application that writes its Kafka source to Delta Lake. This happened even though the critical field was present in the Kafka source.
The field was consequently also missing from data written to dependent, long-term storage. The retention threshold on the Kafka service is seven days, and the pipeline has been in production for three months.
Which describes how Delta Lake can help to avoid data loss of this nature in the future?
Correct answer: A
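As a hedged illustration of the bronze-layer pattern this question points toward (the answer options are not listed here): persisting the complete, unparsed Kafka records in Delta Lake keeps them available long after Kafka's seven-day retention, so an omitted field can later be backfilled by re-parsing the retained raw data.

```python
# Hedged sketch: broker address, topic, checkpoint path, and table name are
# hypothetical. Landing the full Kafka record, including the raw value bytes,
# in a bronze Delta table means a field dropped by downstream parsing can be
# recovered later instead of being lost once Kafka's retention expires.
raw = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "events")
            .load())

(raw.writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/bronze_events")
    .toTable("bronze_events"))
```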