Databricks-Certified-Professional-Data-Engineer Free Exam Questions: "Databricks Certified Professional Data Engineer Certification"

When scheduling Structured Streaming jobs for production, which configuration automatically recovers from query failures and keeps costs low?

Explanation: (visible to GoShiken members only)
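The configuration usually recommended for this scenario is to run the query as a Databricks Job on a new job cluster with retries, rather than on an always-on all-purpose cluster. Below is a minimal sketch of such a job definition submitted through the Jobs API 2.1; the workspace URL, token, notebook path, and cluster sizing are placeholders, not values from the question.

import requests

HOST = "https://<workspace-url>"       # placeholder workspace URL
TOKEN = "<personal-access-token>"      # placeholder token

job_spec = {
    "name": "streaming-ingest",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/prod/streaming_ingest"},
            "new_cluster": {                       # ephemeral job cluster, billed at the jobs rate
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
            "max_retries": -1,                     # retry indefinitely so query failures auto-recover
        }
    ],
}

resp = requests.post(
    f"{HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
print(resp.json())   # {"job_id": ...}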
Review the following error traceback:

Which statement describes the error being raised?

Explanation: (visible to GoShiken members only)
A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Incremental state information should be maintained for 10 minutes for late-arriving data.
Streaming DataFrame df has the following schema:
"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"
Code block:

Choose the response that correctly fills in the blank within the code block to complete this task.

Explanation: (visible to GoShiken members only)
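For reference, the blank in the preceding question is typically completed with a 10-minute watermark on event_time and a non-overlapping 5-minute tumbling window. The sketch below shows that shape; the alias names are chosen for illustration only.

from pyspark.sql.functions import window, avg

result = (
    df.withWatermark("event_time", "10 minutes")       # keep state for late data up to 10 minutes
    .groupBy(window("event_time", "5 minutes"))        # non-overlapping 5-minute tumbling windows
    .agg(
        avg("temp").alias("avg_temp"),
        avg("humidity").alias("avg_humidity"),
    )
)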
Which statement characterizes the general programming model used by Spark Structured Streaming?

Explanation: (visible to GoShiken members only)
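Structured Streaming's programming model treats a live data stream as an unbounded input table to which new rows are continuously appended, so the same DataFrame operations used in batch apply incrementally. A minimal sketch, with a hypothetical source path and schema:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

events = (
    spark.readStream                                     # streaming DataFrame: an unbounded table
    .schema("device_id INT, event_time TIMESTAMP")       # hypothetical schema
    .json("/data/events")                                # hypothetical source path
)

counts = events.groupBy("device_id").count()             # written exactly like a batch aggregation

query = (
    counts.writeStream
    .outputMode("complete")                              # full result table emitted each trigger
    .format("memory")
    .queryName("device_counts")
    .start()
)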
A user wants to use DLT expectations to validate that a derived table, report, contains all records from the source, which is captured in the table validation_copy.
The user attempts and fails to accomplish this by adding an expectation to the report table definition.
Which approach would allow using DLT expectations to validate that all expected records are present in this table?

Explanation: (visible to GoShiken members only)
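Because an expectation can only evaluate the rows of the dataset it is attached to, a common way to check completeness is to define a separate validation dataset that left-joins validation_copy against report and asserts that every source key found a match. The sketch below assumes both tables live in the same DLT pipeline and share a customer_id key; the key and dataset names are illustrative.

import dlt
from pyspark.sql.functions import col

@dlt.view(name="report_validation")
@dlt.expect_or_fail("all_records_present", "report_key IS NOT NULL")
def report_validation():
    # Every row of validation_copy should find a matching key in report;
    # unmatched rows surface as NULL report_key and fail the expectation.
    source = dlt.read("validation_copy").select(col("customer_id"))
    target = dlt.read("report").select(col("customer_id").alias("report_key"))
    return source.join(target, source["customer_id"] == target["report_key"], "left")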
A junior data engineer is working to implement logic for a Lakehouse table named silver_device_recordings.
The source data contains 100 unique fields in a highly nested JSON structure.
The silver_device_recordings table will be used downstream to power several production monitoring dashboards and a production model. At present, 45 of the 100 fields are being used in at least one of these applications.
The data engineer is trying to determine the best approach for dealing with schema declaration given the highly nested structure of the data and the numerous fields.
Which of the following accurately presents information about Delta Lake and Databricks that may impact their decision-making process?

Explanation: (visible to GoShiken members only)
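One pragmatic pattern for this kind of table is to keep the raw nested record in silver (the schema is recorded in the Delta transaction log and can be evolved later) while also exposing the heavily used fields as typed columns. A minimal sketch, with hypothetical table and field names:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

bronze = spark.read.table("bronze_device_recordings")      # hypothetical bronze table

silver = bronze.select(
    col("value"),                                           # keep the full nested struct
    col("value.device_id").cast("int").alias("device_id"),  # promote commonly used fields
    col("value.temp").cast("float").alias("temp"),
    col("value.humidity").cast("float").alias("humidity"),
)

(silver.write
    .format("delta")
    .option("mergeSchema", "true")      # allow additional columns to be added later
    .mode("append")
    .saveAsTable("silver_device_recordings"))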
The Databricks CLI is used to trigger a run of an existing job by passing the job_id parameter. The response indicating that the job run request has been submitted successfully includes a field named run_id.
Which statement describes what the number alongside this field represents?

Explanation: (visible to GoShiken members only)
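For context, the same run can be triggered through the Jobs REST API that the CLI wraps; the run_id in the response identifies that specific run of the job, as opposed to the job_id of the job definition. A minimal sketch with placeholder host, token, and job_id:

import requests

HOST = "https://<workspace-url>"       # placeholder workspace URL
TOKEN = "<personal-access-token>"      # placeholder token

resp = requests.post(
    f"{HOST}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"job_id": 123},              # placeholder job_id of the existing job
)
print(resp.json()["run_id"])           # identifier of this newly triggered run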
Spill occurs as a result of executing various wide transformations. However, diagnosing spill requires proactively looking for key indicators.
Where in the Spark UI are two of the primary indicators that a partition is spilling to disk?

The data architect has mandated that all tables in the Lakehouse should be configured as external Delta Lake tables.
Which approach will ensure that this requirement is met?

Explanation: (visible to GoShiken members only)
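What makes a Delta Lake table external is supplying an explicit storage path (a LOCATION) when the table is created, so the data lives outside the metastore-managed directory. A minimal PySpark sketch, with a hypothetical source table and storage path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.table("source_events")     # hypothetical source table

(df.write
    .format("delta")
    .option("path", "abfss://lake@account.dfs.core.windows.net/tables/events")  # hypothetical external path
    .saveAsTable("events"))                # registered in the metastore; data stays at the external path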
The view updates represents an incremental batch of all newly ingested data to be inserted or updated in the customers table.
The following logic is used to process these records.
MERGE INTO customers
USING (
  SELECT updates.customer_id AS merge_key, updates.*
  FROM updates
  UNION ALL
  SELECT NULL AS merge_key, updates.*
  FROM updates
  JOIN customers ON updates.customer_id = customers.customer_id
  WHERE customers.current = true AND updates.address <> customers.address
) staged_updates
ON customers.customer_id = merge_key
WHEN MATCHED AND customers.current = true AND customers.address <> staged_updates.address THEN
  UPDATE SET current = false, end_date = staged_updates.effective_date
WHEN NOT MATCHED THEN
  INSERT (customer_id, address, current, effective_date, end_date)
  VALUES (staged_updates.customer_id, staged_updates.address, true, staged_updates.effective_date, null)

Which statement describes this implementation?
* The customers table is implemented as a Type 2 table; old values are maintained but marked as no longer current, and new values are inserted.

Explanation: (visible to GoShiken members only)