Which of the following is a viable way to improve Spark's performance when dealing with large amounts of data, given that there is only a single application running on the cluster?
- Increase values for the properties spark.default.parallelism and spark.sql.shuffle.partitions
- Decrease values for the properties spark.default.parallelism and spark.sql.partitions
- Increase values for the properties spark.sql.parallelism and spark.sql.partitions
- Increase values for the properties spark.sql.parallelism and spark.sql.shuffle.partitions
- Increase values for the properties spark.dynamicAllocation.maxExecutors, spark.default.parallelism, and spark.sql.shuffle.partitions
Answer(s): A
Explanation:
Increase values for the properties spark.default.parallelism and spark.sql.shuffle.partitions
Correct. Raising these values increases the number of partitions, so the work is split into more tasks that can run in parallel across the cluster's executors, which typically improves throughput on large datasets (see the configuration sketch below).
Decrease values for the properties spark.default.parallelism and spark.sql.partitions
No. To improve parallelism, these values would need to be increased, not decreased. In addition, there is no property spark.sql.partitions; the correct name is spark.sql.shuffle.partitions.
Increase values for the properties spark.sql.parallelism and spark.sql.partitions
Wrong: neither spark.sql.parallelism nor spark.sql.partitions is an actual Spark property.
Increase values for the properties spark.sql.parallelism and spark.sql.shuffle.partitions
Wrong: there is no property spark.sql.parallelism (see above).
Increase values for the properties spark.dynamicAllocation.maxExecutors, spark.default.parallelism, and spark.sql.shuffle.partitions
The property spark.dynamicAllocation.maxExecutors only takes effect if dynamic allocation is enabled via the spark.dynamicAllocation.enabled property, which is disabled by default. Dynamic allocation can be useful when running multiple applications on the same cluster in parallel. However, in this case there is only a single application running on the cluster, so enabling dynamic allocation would not yield a performance benefit.
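For reference, here is a minimal PySpark sketch of how the two properties from the correct answer could be set. The application name and the value 200 are illustrative placeholders, not values taken from the question; tune them to your cluster.

```python
from pyspark.sql import SparkSession

# Build a session with higher parallelism settings. The value 200 is an
# arbitrary placeholder; a common rule of thumb is 2-3 tasks per CPU core
# in the cluster, so tune it to your cluster size and data volume.
spark = (
    SparkSession.builder
    .appName("parallelism-tuning-example")  # hypothetical app name
    # Default number of partitions for RDD operations such as join and reduceByKey
    .config("spark.default.parallelism", "200")
    # Number of partitions used when shuffling data for DataFrame/SQL joins and aggregations
    .config("spark.sql.shuffle.partitions", "200")
    .getOrCreate()
)

# spark.sql.shuffle.partitions can also be adjusted at runtime:
spark.conf.set("spark.sql.shuffle.partitions", "400")
# spark.default.parallelism cannot: it is read once when the SparkContext starts.
```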
More info: Practical Spark Tips For Data Scientists | Experfy.com and Basics of Apache Spark Configuration Settings | by Halil Ertan | Towards Data Science (https://bit.ly/3gA0A6w, https://bit.ly/2QxhNTr)