An upstream source writes Parquet data as hourly batches to directories named with the current date. A nightly batch job runs the following code to ingest all data from the previous day as indicated by the date variable:
Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order.
If the upstream system is known to occasionally produce duplicate entries for a single order hours apart, which statement is correct?
- Each write to the orders table will only contain unique records, and only those records without duplicates in the target table will be written.
- Each write to the orders table will only contain unique records, but newly written records may have duplicates already present in the target table.
- Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, these records will be overwritten.
- Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, the operation will fail.
- Each write to the orders table will run deduplication over the union of new and existing records, ensuring no duplicate records are present.
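
The composite key named in the question can be made concrete with a small sketch. The job's own listing is not reproduced here, so this is a plain-Python illustration rather than that code: it models only what "unique records" means within a single day's batch when uniqueness is defined by customer_id and order_id together. The helper name dedupe_batch, the sample rows, the hour field, and the choice to keep the first record seen are all assumptions made for illustration. It does not model the target table or how the write interacts with rows already stored there.

```python
def dedupe_batch(records, key_fields=("customer_id", "order_id")):
    """Return one record per composite key from a single batch.

    Hypothetical helper: keeps the first record seen for each key,
    which is an arbitrary choice made for this illustration.
    """
    seen = set()
    unique = []
    for rec in records:
        # The key is the pair of fields taken together, not either one alone.
        key = tuple(rec[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Made-up rows standing in for one day's worth of hourly files.
batch = [
    {"customer_id": 1, "order_id": 100, "hour": 3},
    {"customer_id": 1, "order_id": 100, "hour": 9},  # same order, emitted again hours later
    {"customer_id": 1, "order_id": 101, "hour": 4},
    {"customer_id": 2, "order_id": 100, "hour": 5},  # same order_id, different customer
]

unique_batch = dedupe_batch(batch)
```

Here the two rows for customer 1, order 100 collapse into one, while customer 2's order 100 is kept because the key is the pair of fields, not order_id alone.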