2023-10-10

Dataproc Serverless does not seem to use the Spark property for connecting to an external Hive metastore

I have a GCP PostgreSQL instance that serves as the external Hive metastore for a Dataproc cluster, and I would like to use that same metastore from Dataproc Serverless jobs. By experimenting with Serverless and following the documentation, I am already able to do the following (a submission sketch combining these settings follows the list):

  • leverage the service account and subnetwork URI to access project resources
  • connect to the Persistent History Server (PHS) associated with the Dataproc cluster
  • build and push a custom image to Container Registry to be pulled by Spark jobs
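
For context, a serverless batch combining these settings can be submitted roughly as in the sketch below, using the google-cloud-dataproc Python client. All project, bucket, image, cluster, and service-account names are placeholders, and the exact field names should be verified against the installed client-library version.

    from google.cloud import dataproc_v1

    project, region = "my-project", "us-central1"  # placeholders
    parent = f"projects/{project}/locations/{region}"

    client = dataproc_v1.BatchControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    batch = dataproc_v1.Batch()
    batch.pyspark_batch.main_python_file_uri = "gs://my-bucket/job.py"
    # Custom image pulled from Container Registry.
    batch.runtime_config.container_image = "gcr.io/my-project/custom-spark:latest"
    # Service account and subnetwork used to reach project resources.
    batch.environment_config.execution_config.service_account = (
        "spark-sa@my-project.iam.gserviceaccount.com"
    )
    batch.environment_config.execution_config.subnetwork_uri = (
        f"projects/{project}/regions/{region}/subnetworks/default"
    )
    # Persistent History Server associated with the Dataproc cluster.
    batch.environment_config.peripherals_config.spark_history_server_config.dataproc_cluster = (
        f"projects/{project}/regions/{region}/clusters/my-phs-cluster"
    )
    # The property I expected to route the job to the cluster's metastore.
    batch.runtime_config.properties = {
        "spark.hadoop.hive.metastore.uris": "thrift://cluster-master-node:9083"
    }

    operation = client.create_batch(parent=parent, batch=batch, batch_id="metastore-test")
    print(operation.result().state)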

I thought the Spark property "spark.hadoop.hive.metastore.uris" would let serverless Spark jobs connect to the Thrift server used by the Dataproc cluster, but the job does not even seem to attempt the connection and instead fails with:

Required table missing : "DBS" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables"

By contrast, Spark jobs on the non-serverless Dataproc cluster log:

INFO hive.metastore: Trying to connect to metastore with URI thrift://cluster-master-node:9083

as they successfully make the connection.
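
For what it's worth, a minimal PySpark session that should pick up this property looks roughly like the sketch below (the Thrift host is a placeholder for the cluster's master node). The DataNucleus error above suggests the serverless job is instead initializing a local embedded metastore and validating its schema rather than contacting the configured Thrift URI.

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("serverless-metastore-check")
        # Point the Hive client at the cluster's Thrift metastore;
        # the same value can be passed as a batch property at submit time.
        .config("spark.hadoop.hive.metastore.uris", "thrift://cluster-master-node:9083")
        .enableHiveSupport()
        .getOrCreate()
    )

    # If the Thrift connection is used, this should list the databases
    # registered in the external metastore.
    spark.sql("SHOW DATABASES").show()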


