A Spark job is taking longer than expected. In the Spark UI, a data engineer notes that, for tasks in a particular stage, the Min and Median task durations are roughly the same, but the Max task duration is roughly 100 times as long as the minimum.
Which situation is causing the increased duration of the overall job?
- Task queueing resulting from improper thread pool assignment.
- Spill resulting from attached volume storage being too small.
- Network latency due to some cluster nodes being in different regions from the source data.
- Skew caused by more data being assigned to a subset of Spark partitions.
- Credential validation errors while pulling data from an external system.
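The duration pattern described (min and median nearly equal, max roughly 100x longer) is the classic signature of a few partitions carrying far more data than the rest. As an illustrative sketch only, assuming PySpark 3.x and a hypothetical input path, the per-partition row counts below can confirm whether a small subset of partitions dominates, and Spark's adaptive query execution can be enabled to split skewed shuffle partitions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id, count

spark = SparkSession.builder.appName("skew-check").getOrCreate()

# Hypothetical dataset path; replace with the skewed data under investigation.
df = spark.read.parquet("/path/to/events")

# Count rows per partition: a handful of partitions holding far more rows
# than the rest corroborates the long-tail task durations seen in the Spark UI.
(df.groupBy(spark_partition_id().alias("partition_id"))
   .agg(count("*").alias("rows"))
   .orderBy("rows", ascending=False)
   .show(20))

# On Spark 3.x, adaptive query execution can split skewed shuffle partitions
# at join time, reducing the impact of straggler tasks.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```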