I’m currently working on optimizing a PySpark job that involves a couple of aggregations across large datasets. I’m fairly new to processing large-scale data and am encountering issues with disk usage and job efficiency. Here are the details:
Chosen cluster:
• Worker Nodes: 6
• Cores per Worker: 48
• Memory per Worker: 384 GB
Data:
• Table A: 158 GB
• Table B: 300 GB
• Table C: 32 MB
Process:
1. Read the DataFrames from the Delta tables.
2. Perform a broadcast join between Table B and the small Table C.
3. Join the resulting DataFrame with Table A on three columns: id, family, part_id.
4. Finally, upsert the result into the destination table.
5. The destination table is partitioned by id, family, *date* (a simplified sketch of the pipeline is below).
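For reference, this is roughly the shape of the job. The table paths, the B–C join key (`c_key`), and the exact merge condition are placeholders rather than the real production values:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# 1. Read the source DataFrames from Delta tables (paths are placeholders)
df_a = spark.read.format("delta").load("/mnt/data/table_a")  # ~158 GB
df_b = spark.read.format("delta").load("/mnt/data/table_b")  # ~300 GB
df_c = spark.read.format("delta").load("/mnt/data/table_c")  # ~32 MB, small enough to broadcast

# 2. Broadcast join: Table C is tiny, so it gets shipped to every executor
df_bc = df_b.join(broadcast(df_c), on="c_key", how="left")  # "c_key" is a placeholder join key

# 3. Join the result with Table A on the three key columns
joined = df_bc.join(df_a, on=["id", "family", "part_id"], how="inner")

# 4. Upsert (MERGE) into the destination Delta table, which is partitioned by id, family, date
target = DeltaTable.forPath(spark, "/mnt/data/destination")
(
    target.alias("t")
    .merge(
        joined.alias("s"),
        "t.id = s.id AND t.family = s.family AND t.date = s.date",  # assumed merge keys
    )
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```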
The only thing that comes to mind is switching the cluster to more disk-optimized instances. My question is: how should I interpret the Storage tab in the Spark UI, and how can I use it to understand where to optimize this job?