I'm trying to run a PySpark job on Google Cloud Dataproc that reads data from BigQuery, processes it, and writes it back. However, the job keeps failing with the following error:
java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider com.google.cloud.spark.bigquery.BigQueryRelationProvider could not be instantiated
Caused by: java.lang.IllegalStateException: This connector was made for Scala null, it was not meant to run on Scala 2.12
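For context, the read section of my script looks roughly like this (a minimal sketch; the project, dataset, and table names below are placeholders, not my real ones):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-read-write").getOrCreate()

# The .load() below is the call that raises the ServiceConfigurationError.
df = (
    spark.read.format("bigquery")
    .option("table", "my-project.my_dataset.my_table")  # placeholder table
    .load()
)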
What I Have Tried So Far:
1️⃣ Verified Dataproc Version and Spark Version
I checked the Dataproc image version and Scala/Spark version using:
gcloud dataproc clusters describe cluster-be90 --region us-central1 | grep imageVersion
It returned:
imageVersion: 2.2.43-debian12
And the cluster uses:
- Spark 3.5.1 / Scala 2.12.18
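To confirm the Scala version of the JVM that Spark actually runs (rather than trusting the release notes), I understand you can query it from a PySpark shell on the cluster via the py4j gateway; this is a debugging sketch that relies on the internal _jvm handle:

# Run inside `pyspark` on the cluster master node; `spark` is the active session.
print(spark.version)  # expected: 3.5.1
print(spark.sparkContext._jvm.scala.util.Properties.versionString())  # expected: "version 2.12.18"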
2️⃣ Used Correct BigQuery Connector JAR
Initially, I used:
gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12.jar
This JAR was missing, so I tried:
gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.41.1.jar
Still getting the same error.
3️⃣ Manually Uploaded the JAR to Cloud Storage
I tried downloading and uploading the JAR manually:
wget https://storage.googleapis.com/spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.41.1.jar
gsutil cp spark-bigquery-with-dependencies_2.12-0.41.1.jar gs://my-bucket/libs/
Then ran the job using:
gcloud dataproc jobs submit pyspark gs://my-bucket/dataproc_bigquery.py \
--cluster=cluster-be90 \
--region=us-central1 \
--jars=gs://my-bucket/libs/spark-bigquery-with-dependencies_2.12-0.41.1.jar
Still no luck.
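To rule out a different JAR being picked up than the one passed on the command line, one thing that might help (a debugging sketch using the SparkConf the driver actually received) is printing the jar-related settings at the top of the failing script:

# Debugging aid: show which jars the driver thinks it was given.
conf = spark.sparkContext.getConf()
print("spark.jars =", conf.get("spark.jars", ""))
print("spark.jars.packages =", conf.get("spark.jars.packages", ""))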
4️⃣ Verified IAM Permissions for Dataproc Service Account
I granted the necessary roles to my service account, yet the error persists.
What Else Can I Try?
I've tried everything I could think of. Could it be an issue with Scala 2.12 or Dataproc image compatibility? Or is there a different way to integrate BigQuery with Dataproc?
Any help would be greatly appreciated!
My log:
25/01/29 16:27:46 INFO SparkEnv: Registering MapOutputTracker
25/01/29 16:27:46 INFO SparkEnv: Registering BlockManagerMaster
25/01/29 16:27:46 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
25/01/29 16:27:46 INFO SparkEnv: Registering OutputCommitCoordinator
25/01/29 16:27:46 INFO MetricsConfig: Loaded properties from hadoop-metrics2.properties
25/01/29 16:27:46 INFO MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
25/01/29 16:27:46 INFO MetricsSystemImpl: google-hadoop-file-system metrics system started
25/01/29 16:27:47 INFO DataprocSparkPlugin: Registered 188 driver metrics
25/01/29 16:27:47 INFO DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at my-cluster.local./10.128.0.2:8032
25/01/29 16:27:47 INFO AHSProxy: Connecting to Application History server at my-cluster.local./10.128.0.2:10200
25/01/29 16:27:48 INFO Configuration: resource-types.xml not found
25/01/29 16:27:48 INFO ResourceUtils: Unable to find 'resource-types.xml'.
25/01/29 16:27:49 INFO YarnClientImpl: Submitted application application_1234567890123_0009
25/01/29 16:27:50 INFO DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at my-cluster.local./10.128.0.2:8030
25/01/29 16:27:51 INFO GhfsGlobalStorageStatistics: periodic connector metrics: {gcs_api_client_non_found_response_count=1, gcs_api_client_side_error_count=1, gcs_api_time=316, gcs_api_total_request_count=2, gcs_connector_time=398, gcs_list_file_request=1, gcs_list_file_request_duration=158, gcs_list_file_request_max=158, gcs_list_file_request_mean=158, gcs_list_file_request_min=158, gcs_metadata_request=1, gcs_metadata_request_duration=158, gcs_metadata_request_max=158, gcs_metadata_request_mean=158, gcs_metadata_request_min=158, gs_filesystem_create=3, gs_filesystem_initialize=2, op_get_file_status=1, op_get_file_status_duration=398, op_get_file_status_max=398, op_get_file_status_mean=398, op_get_file_status_min=398, uptimeSeconds=6} [CONTEXT ratelimit_period="5 MINUTES" ]
25/01/29 16:27:51 INFO GoogleCloudStorageImpl: Ignoring exception of type GoogleJsonResponseException; verified object already exists with desired state.
25/01/29 16:27:52 INFO GoogleHadoopOutputStream: hflush(): No-op due to rate limit (RateLimiter[stableRate=0.2qps]): readers will *not* yet see flushed data for gs://my-temp-bucket/dataproc-job-history/application_1234567890123_0009.inprogress [CONTEXT ratelimit_period="1 MINUTES" ]
Traceback (most recent call last):
File "/tmp/1b7be59758994608b4125ab846fb6826/submitting_job.py", line 14, in <module>
.load()
^^^^^^
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 314, in load
File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 179, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o72.load.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider com.google.cloud.spark.bigquery.BigQueryRelationProvider could not be instantiated
at java.base/java.util.ServiceLoader.fail(ServiceLoader.java:582)
at java.base/java.util.ServiceLoader$ProviderImpl.newInstance(ServiceLoader.java:804)
at java.base/java.util.ServiceLoader$ProviderImpl.get(ServiceLoader.java:722)
...
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.newInstance(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
at java.base/java.util.ServiceLoader$ProviderImpl.newInstance(ServiceLoader.java:780)
... 31 more
25/01/29 16:27:54 INFO DataprocSparkPlugin: Shutting down driver plugin.
1 Answer
Quick Fix: Got the answer from a member of the Google Dataproc organization:

Submit the job without --jars:
gcloud dataproc jobs submit pyspark gs://your-bucket/script.py \
--cluster=your-cluster --region=your-region
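This works because recent Dataproc 2.x images (2.1 and later, per the Dataproc docs) already ship the Spark BigQuery connector pre-installed, so passing another copy via --jars puts duplicate connector classes on the classpath; the confusing "Scala null" message appears to come from the provider failing to initialize in that situation. With the built-in connector, a read/process/write roughly like the following should run as-is (a sketch; the table names and the staging bucket are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-read-write").getOrCreate()

# Read from BigQuery with the connector that ships on the Dataproc image.
df = (
    spark.read.format("bigquery")
    .option("table", "my-project.my_dataset.my_table")  # placeholder
    .load()
)

result = df.groupBy("some_column").count()  # placeholder transformation

# Write back; the default (indirect) write method needs a GCS staging bucket.
(
    result.write.format("bigquery")
    .option("table", "my-project.my_dataset.my_output_table")  # placeholder
    .option("temporaryGcsBucket", "my-staging-bucket")         # placeholder
    .mode("overwrite")
    .save()
)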