I’m currently working on optimizing a PySpark job that involves a couple of aggregations across large datasets. I’m fairly new to processing large-scale data and am encountering issues with disk usage and job efficiency. Here are the details:
Chosen cluster:
• Worker Nodes: 6
• Cores per Worker: 48
• Memory per Worker: 384 GB
Data:
• Table A: 158 GB
• Table B: 300 GB
• Table C: 32 MB
Process:
1. Read the DataFrames from the Delta tables.
2. Perform a broadcast join between Table B and the small Table C.
3. Join the resulting DataFrame with Table A on three columns: id, family, part_id.
4. Finally, upsert the result into the destination table.
5. The destination table is partitioned by id, family, *date* (a simplified sketch of the pipeline is below).
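For reference, this is roughly the shape of the job. The table paths, the B–C join key (`c_key`), and the exact merge condition are placeholders rather than the real production values:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# 1. Read the source DataFrames from Delta tables (paths are placeholders)
df_a = spark.read.format("delta").load("/mnt/data/table_a")  # ~158 GB
df_b = spark.read.format("delta").load("/mnt/data/table_b")  # ~300 GB
df_c = spark.read.format("delta").load("/mnt/data/table_c")  # ~32 MB, small enough to broadcast

# 2. Broadcast join: Table C is tiny, so it gets shipped to every executor
df_bc = df_b.join(broadcast(df_c), on="c_key", how="left")  # "c_key" is a placeholder join key

# 3. Join the result with Table A on the three key columns
joined = df_bc.join(df_a, on=["id", "family", "part_id"], how="inner")

# 4. Upsert (MERGE) into the destination Delta table, which is partitioned by id, family, date
target = DeltaTable.forPath(spark, "/mnt/data/destination")
(
    target.alias("t")
    .merge(
        joined.alias("s"),
        "t.id = s.id AND t.family = s.family AND t.date = s.date",  # assumed merge keys
    )
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```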
The only thing that comes to mind is switching the cluster to more disk-optimized instances. My question is: how should I interpret the Storage tab in the Spark UI, and how can I use it to understand where to optimize this job?