apache spark - Pyspark java UDF java.lang.OutOfMemoryError: Requested array size exceeds VM limit. SQLSTATE: 39000 - Stack Overflow

I am running PySpark code with a Java UDF in Databricks. I have an r6id.xlarge (32 GB) driver and r6id.4xlarge (128 GB) worker nodes. I am reading only one file, and my Java UDF just calls an open-source X12 Java library to parse the file as a whole. Sample code is below; it works for files that are smaller than 100 MB.

from pyspark.sql.functions import expr
import pyspark.sql.types as pst

df = spark.read.format('text').option('wholetext', True).load("s3://xxxx/xxxxxxx")
spark.udf.registerJavaFunction("x12_parser", "com.abc", pst.StringType())
df.select(expr("x12_parser(value)"))  # 'value' is the single column produced by the text reader

Whenever I parse a big file (just one file), I get the error java.lang.OutOfMemoryError: Requested array size exceeds VM limit. When I parse this file locally, it works if I increase my heap size to 20 GB; otherwise I get the same error. But my worker node is far larger than that. (I am on Databricks, so there is no need to configure executor memory, and setting -Xmx is not permitted.) I also tried calling my function directly, like below:

import boto3
s3 = boto3.client('s3')
bucket_name = 'xxxx'
key = 'xxxxxxx'
response = s3.get_object(Bucket=bucket_name, Key=key)
contents = response['Body'].read().decode('utf-8')

# Instantiate the parser directly in the driver JVM via Py4J and parse on the driver
parser_class = spark._jvm.abc.x12_parser()
output = parser_class.call(contents)

This works fine even though my driver is four times smaller than the worker, without touching the Java heap size. I tried playing with some Spark settings such as the network timeout and spark.executor.extraJavaOptions -Xms20g -XX:+UseCompressedOops, but none of them worked.

I can't explain why, with a huge worker, I can't process the same file that I can process on a much smaller driver or on my local machine.

asked Feb 4 at 18:09 by milton

2 Answers


The problem is that the Java UDF is executed on the executors, not the driver. Executors process data in parallel, and despite the large total memory of your workers, the heap available to each executor may not be enough to handle such a large file. Furthermore, Spark's wholetext option loads a whole file as one row, which makes the in-memory footprint even worse when the UDF operates on it.
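
A quick way to see this (a sketch only, reusing the placeholder S3 path from the question): with wholetext=True each file becomes exactly one row, so the Java UDF receives the entire file as a single string value.

# Illustration only: with wholetext=True every file is read as a single row,
# so the UDF gets the whole file content in one string.
df = spark.read.format("text").option("wholetext", True).load("s3://xxxx/xxxxxxx")
print(df.count())                                   # number of files, not number of lines
df.selectExpr("length(value) AS chars_per_file").show()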

Also, if you use boto3, the read happens directly on the driver, so it won't give you distributed execution. I would suggest you:

  • use a broadcast variable for the file content, e.g. file_content = spark.sparkContext.broadcast(contents), then create a DataFrame from it and use that in your UDF (see the sketch after this list)

  • set spark executor memory and memory overhead appropriately, with 4-5 cores per executor
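
A minimal PySpark sketch of the broadcast idea, using the placeholder bucket/key and the x12_parser registration from the question; it mirrors the suggestion above rather than being a guaranteed fix for the array-size limit:

import boto3
from pyspark.sql.functions import expr

# Read the raw file once on the driver (placeholders from the question).
s3 = boto3.client('s3')
contents = s3.get_object(Bucket='xxxx', Key='xxxxxxx')['Body'].read().decode('utf-8')

# Broadcast the content so it is shipped to executors once rather than
# shuffled around as a huge single row.
file_content = spark.sparkContext.broadcast(contents)

# Build a one-row DataFrame from the broadcast value and apply the Java UDF
# (assumes "x12_parser" was registered with registerJavaFunction as in the question).
df = spark.createDataFrame([(file_content.value,)], ["value"])
result = df.select(expr("x12_parser(value)"))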

It seems like there is some internal UDF memory limitation. I stopped using the UDF and changed the code to do the map from the Java side instead.
