python - Apache Spark: "java.net.SocketException: Connection reset" Error on Windows - Stack Overflow


I'm trying to set up Apache Spark on Windows 10, but I keep getting errors when running Spark from VSCode:

# Imports
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.appName('PySpark Sample DataFrame').getOrCreate()

# Define Schema
col_schema = ["Language", "Version"]

# Prepare Data
Data = (("Jdk","17.0.12"), ("Python", "3.11.9"), ("Spark", "3.5.1"),   \
    ("Hadoop", "3.3 and later"), ("Winutils", "3.6"),  \
  )

# Create DataFrame
df = spark.createDataFrame(data = Data, schema = col_schema)
df.printSchema()
df.show(5,truncate=False)

So far I've:

  • Switched to Python 3.11.8: https://stackoverflow.com/a/78157930/5653263

  • Downgraded to JDK 17: https://stackoverflow.com/a/76432417/5653263

I still get the following error:

PS C:\...> & c:/.../python.exe "c:/.../test.py"
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/11/16 16:40:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
root
 |-- Language: string (nullable = true)
 |-- Version: string (nullable = true) 

24/11/16 16:40:42 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.net.SocketException: Connection reset
        at java.base/sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:328)
        at java.base/sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:355)
        at java.base/sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:808)
        at java.base/java.net.Socket$SocketInputStream.read(Socket.java:966)
        at java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:244)
        at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:263)
        at java.base/java.io.DataInputStream.readInt(DataInputStream.java:393)
        at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:774)
        at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:766)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:525)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:893)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:893)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
        at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
        at org.apache.spark.scheduler.Task.run(Task.scala:141)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
        at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
        at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:842)
24/11/16 16:40:42 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (Laptop executor driver): java.net.SocketException: Connection reset
        at java.base/sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:328)
        at java.base/sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:355)
        at java.base/sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:808)
        at java.base/java.net.Socket$SocketInputStream.read(Socket.java:966)
        at java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:244)
        at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:263)
        at java.base/java.io.DataInputStream.readInt(DataInputStream.java:393)
        at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:774)
        at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:766)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:525)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:893)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:893)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
        at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
        at org.apache.spark.scheduler.Task.run(Task.scala:141)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
        at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
        at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:842)

24/11/16 16:40:42 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
Traceback (most recent call last):
  File "c:\...\test.py", line 18, in <module>
    df.show(5,truncate=False)
  File "C:\...\Python\Python312\Lib\site-packages\pyspark\sql\dataframe.py", line 947, in show
    print(self._show_string(n, truncate, vertical))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\...\Python\Python312\Lib\site-packages\pyspark\sql\dataframe.py", line 978, in _show_string
    return self._jdf.showString(n, int_truncate, vertical)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\...\Python\Python313\Lib\site-packages\py4j\java_gateway.py", line 1322, in __call__
    return_value = get_return_value(
                   ^^^^^^^^^^^^^^^^^
  File "C:\...\Python\Python312\Lib\site-packages\pyspark\errors\exceptions\captured.py", line 179, in deco
    return f(*a, **kw)
           ^^^^^^^^^^^
  File "C:\...\Python\Python312\Lib\site-packages\py4j\protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o42.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (Laptop executor driver): java.net.SocketException: Connection reset
        at java.base/sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:328)
        at java.base/sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:355)
        at java.base/sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:808)
        at java.base/java.net.Socket$SocketInputStream.read(Socket.java:966)
        at java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:244)
        at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:263)
        at java.base/java.io.DataInputStream.readInt(DataInputStream.java:393)
        at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:774)
        at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:766)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:525)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:893)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:893)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
        at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
        at org.apache.spark.scheduler.Task.run(Task.scala:141)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
        at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
        at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:842)

Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2856)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2792)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2791)
        at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
        at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2791)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1247)
        at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1247)
        at scala.Option.foreach(Option.scala:407)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1247)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3060)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2994)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2983)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:989)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2393)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2414)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:2433)
        at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:530)
        at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:483)
        at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:61)
        at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:4333)
        at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:3316)
        at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4323)
        at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:546)
        at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:4321)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201)
        at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108)
        at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66)
        at org.apache.spark.sql.Dataset.withAction(Dataset.scala:4321)
        at org.apache.spark.sql.Dataset.head(Dataset.scala:3316)
        at org.apache.spark.sql.Dataset.take(Dataset.scala:3539)
        at org.apache.spark.sql.Dataset.getRows(Dataset.scala:280)
        at org.apache.spark.sql.Dataset.showString(Dataset.scala:315)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:568)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
        at py4j.Gateway.invoke(Gateway.java:282)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
        at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
        at java.base/java.lang.Thread.run(Thread.java:842)
Caused by: java.net.SocketException: Connection reset
        at java.base/sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:328)
        at java.base/sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:355)
        at java.base/sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:808)
        at java.base/java.net.Socket$SocketInputStream.read(Socket.java:966)
        at java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:244)
        at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:263)
        at java.base/java.io.DataInputStream.readInt(DataInputStream.java:393)
        at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:774)
        at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:766)
        at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:525)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
        at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:893)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:893)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:367)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:331)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)
        at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)
        at org.apache.spark.scheduler.Task.run(Task.scala:141)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)
        at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
        at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        ... 1 more

PS C:\...\> SUCCESS: The process with PID 24468 (child process of PID 3840) has been terminated.
SUCCESS: The process with PID 3840 (child process of PID 29208) has been terminated.
SUCCESS: The process with PID 29208 (child process of PID 29332) has been terminated.

Software Versions:

  • Windows: 10.0.19045.5011 x64

  • JDK: 17.0.12

  • Python: 3.11.8

  • Spark: 3.5.3

  • Hadoop (Winutils): 3.3.6 (https://github.com/cdarlint/winutils/tree/master/hadoop-3.3.6/bin)
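
For reference, here is a quick check of which interpreter PySpark actually picks up (a minimal diagnostic sketch; sys.executable is the driver-side interpreter, and pythonVer is the Python version the session records):

import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print("driver interpreter:", sys.executable)                    # the Python running this script
print("session Python version:", spark.sparkContext.pythonVer)  # e.g. "3.11"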


1 Answer

Your output shows File "C:\...\Python\Python312\Lib\site-packages\pyspark\sql\dataframe.py", line 947, which means you are still running the code with Python 3.12; that is what causes the problem.
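
If you don't want to recreate your environment, you can also pin the interpreter Spark launches before the session is created, via the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables. A minimal sketch, assuming a Python 3.11 install at a placeholder path (substitute your own):

# Pin the driver and worker interpreters before the SparkSession is built.
import os
os.environ["PYSPARK_PYTHON"] = r"C:\Python311\python.exe"         # workers; placeholder path
os.environ["PYSPARK_DRIVER_PYTHON"] = r"C:\Python311\python.exe"  # driver; placeholder path

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('PySpark Sample DataFrame').getOrCreate()
print(spark.sparkContext.pythonVer)  # should now report 3.11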

I used conda to create a Python 3.11.8 environment to test your code; it works as shown below:

(base) R:\>conda create -n connection-rest2 python=3.11.8

(base) R:\>conda activate connection-rest2

(connection-rest2) R:\>python -V
Python 3.11.8

(connection-rest2) R:\>pip install pyspark

(connection-rest2) R:\>python test.py
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/11/30 21:24:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
root
 |-- Language: string (nullable = true)
 |-- Version: string (nullable = true)

+--------+-------------+
|Language|Version      |
+--------+-------------+
|Jdk     |17.0.12      |
|Python  |3.11.9       |
|Spark   |3.5.1        |
|Hadoop  |3.3 and later|
|Winutils|3.6          |
+--------+-------------+

By the way, you may also get a "Missing Python executable 'python3'" warning; to fix that, copy python.exe to python3.exe in the 'connection-rest2' virtual environment directory.
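
For example, from the activated prompt (a sketch assuming CONDA_PREFIX, which conda sets on activation, points at the environment root):

(connection-rest2) R:\>copy "%CONDA_PREFIX%\python.exe" "%CONDA_PREFIX%\python3.exe"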
