Vertex AI Pipelines. Batch Prediction 'Error state: 5.'
I have been trying to run a Vertex AI pipeline using Kubeflow Pipelines and the google-cloud-pipeline-components library. The pipeline consists entirely of custom container components, with the exception of the batch prediction step.
The code for my pipeline is of the following form:
# GCP infrastructure resources
from google.cloud import aiplatform, storage
from google_cloud_pipeline_components import aiplatform as gcc_aip

# kubeflow resources
import kfp
from kfp.v2 import dsl, compiler
from kfp.v2.dsl import component, pipeline

train_container_uri = '<insert custom docker image in gcr for training code>'

@pipeline(name="<pipeline name>", pipeline_root=pipeline_root_path)
def my_ml_pipeline():
    # run the preprocessing workflow using a custom kfp component and get the outputs
    preprocess_op = preprocess_component()
    train_path, test_path = preprocess_op.outputs['Train Data'], preprocess_op.outputs['Test Data']

    # path to string for gcs uri containing train data
    train_path_text = preprocess_op.outputs['Train Data GCS Path']

    # create training dataset on Vertex AI from the preprocessing outputs
    train_set_op = gcc_aip.TabularDatasetCreateOp(
        project='<insert gcs project id>',
        display_name='<insert display name>',
        location='us-west1',
        gcs_source=train_path_text
    )
    train_set = train_set_op.outputs['dataset']

    # custom training op
    training_op = gcc_aip.CustomContainerTrainingJobRunOp(
        project='<insert gcp project id>',
        display_name='<insert display name>',
        location='us-west1',
        dataset=train_set,
        container_uri=train_container_uri,
        staging_bucket=bucket_name,
        model_serving_container_image_uri='us-docker.pkg.dev/vertex-ai/prediction/tf2-gpu.2-11:latest',
        model_display_name='<insert model name>',
        machine_type='n1-standard-4')
    model_output = training_op.outputs['model']

    # batch prediction op
    batch_prediction_op = gcc_aip.ModelBatchPredictOp(
        project='<insert gcp project id>',
        job_display_name='<insert name of job>',
        location='us-west1',
        model=model_output,
        gcs_source_uris=['gs://<bucket name>/<directory>/name_of_file.csv'],
        instances_format='csv',
        gcs_destination_output_uri_prefix='gs://<bucket name>/<directory>/',
        machine_type='n1-standard-4',
        accelerator_count=2,
        accelerator_type='NVIDIA_TESLA_P100')
(For security and non-disclosure reasons, I can't share any specific paths or GCP projects; please trust that I entered those correctly.)
My initial preprocessing and training components seem to work fine: the training job succeeded, the model is uploaded to the registry, and the preprocessed data appears in the GCS buckets as expected. However, the pipeline fails when it reaches the batch prediction phase.
The error log terminates with the following error:
ValueError: Job failed with value error in error state: 5.
Additionally, the traceback in the logs only contains references to the google-cloud-pipeline-components library, none of my own code.
This error is presumably raised inside the ModelBatchPredictOp() component.
I don't even know where to begin, but could anyone give any pointers as to what error state 5 means? Since it's a ValueError, I assume an invalid value was passed either to the component or to the model. I've run the model on the exact same dataset locally, so I suspect an invalid input to the component, yet I have checked every input to ModelBatchPredictOp(). Has anyone gotten this error state before? Any help is appreciated.
Using google-cloud-pipeline-components==1.0.42, google-cloud-aiplatform==1.24.1, kfp==1.8.18. My model is trained on TensorFlow 2.11.1, Python 3.10 in both my custom docker images and the script used to run the pipeline. Thank you in advance!
Edit 1 (2023-05-10):
I looked this up in the GitHub repo, and it seems that error state 5 has the following description:
// Some requested entity (e.g., file or directory) was not found.
//
// Note to server developers: if a request is denied for an entire class
// of users, such as gradual feature rollout or undocumented allowlist,
// `NOT_FOUND` may be used. If a request is denied for some users within
// a class of users, such as user-based access control, `PERMISSION_DENIED`
// must be used.
//
// HTTP Mapping: 404 Not Found
NOT_FOUND = 5;
(The error code is detailed here https://github.com/googleapis/googleapis/blob/master/google/rpc/code.proto)
(The exception leading to my error message was surfaced here https://github.com/kubeflow/pipelines/blob/master/components/google-cloud/google_cloud_pipeline_components/container/v1/gcp_launcher/job_remote_runner.py)
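To make the lookup concrete, here is a minimal sketch of how I mapped the number in the message to a status name. It assumes the integer after "error state:" is a google.rpc.Code value, which is how I read the launcher source but have not confirmed. The names RpcCode and describe_error_state are my own, not part of any library.

```python
import re
from enum import IntEnum


class RpcCode(IntEnum):
    """Canonical status codes as listed in google/rpc/code.proto."""
    OK = 0
    CANCELLED = 1
    UNKNOWN = 2
    INVALID_ARGUMENT = 3
    DEADLINE_EXCEEDED = 4
    NOT_FOUND = 5
    ALREADY_EXISTS = 6
    PERMISSION_DENIED = 7
    RESOURCE_EXHAUSTED = 8
    FAILED_PRECONDITION = 9
    ABORTED = 10
    OUT_OF_RANGE = 11
    UNIMPLEMENTED = 12
    INTERNAL = 13
    UNAVAILABLE = 14
    DATA_LOSS = 15
    UNAUTHENTICATED = 16


def describe_error_state(message: str) -> str:
    """Return the status name for the integer following 'error state:'.

    Assumption: the integer is a google.rpc.Code value.
    """
    match = re.search(r"error state:\s*(\d+)", message)
    if match is None:
        raise ValueError(f"no error state found in: {message!r}")
    return RpcCode(int(match.group(1))).name


print(describe_error_state("Job failed with value error in error state: 5."))
```

For my log line this prints NOT_FOUND, which matches the proto excerpt above.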
Now the question is: which file or directory referenced by my ModelBatchPredictOp() is missing? I've checked that all of the GCS paths I passed in are correct and lead to the expected locations. Any further thoughts?
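For anyone who wants to double-check paths the same way, this is a rough sketch of the helper I would use to split each gs:// URI into a bucket name and an object path before looking it up. The function name split_gcs_uri and the example bucket and directory names are hypothetical; only the file name comes from my pipeline code.

```python
from typing import Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/path/to/object' into (bucket, object_path).

    The object path may be empty or end with '/' for a directory-style prefix.
    """
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a GCS URI: {uri!r}")
    bucket, _, object_path = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name in: {uri!r}")
    return bucket, object_path


# Hypothetical bucket and directory names, for illustration only.
print(split_gcs_uri("gs://my-bucket/my-directory/name_of_file.csv"))
print(split_gcs_uri("gs://my-bucket/my-directory/"))
```

The resulting pair can then be checked with the google-cloud-storage client, for example storage.Client().bucket(bucket).blob(object_path).exists(), ideally run with the same service account the pipeline uses, since a path that exists for my user account may not be visible to the pipeline.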
Edit 2 (2023-05-10):
I noticed that the output of the ModelBatchPredictOp() component includes a JSON document describing an error. This is the error body:
{
"error": {
"code": 401,
"message": "Request is missing required authentication credential. Expected OAuth 2 access token, login cookie or other valid authentication credential. See https://developers.google.com/identity/sign-in/web/devconsole-project.",
"status": "UNAUTHENTICATED",
"details": [
{
"@type": "type.googleapis.com/google.rpc.ErrorInfo",
"reason": "CREDENTIALS_MISSING",
"domain": "googleapis.com",
"metadata": {
"service": "aiplatform.googleapis.com",
"method": "google.cloud.aiplatform.v1.JobService.GetBatchPredictionJob"
}
}
]
}
}
However, I have granted the necessary IAM roles to every relevant service agent/account (according to this: https://cloud.google.com/vertex-ai/docs/general/access-control), so I'm still trying to figure out where my pipeline is missing credentials.
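To narrow this down, I pulled the relevant fields out of the error body with a small script. This is only a sketch: summarize_error is my own helper name, and the body below is the one above with the message shortened. What stands out to me is that the failing method is GetBatchPredictionJob, so the missing credential seems to affect the call that reads the job status rather than the call that creates the job, though I can't confirm that from the JSON alone.

```python
import json

# Error body from the component output (message shortened for readability).
ERROR_BODY = """
{
  "error": {
    "code": 401,
    "message": "Request is missing required authentication credential.",
    "status": "UNAUTHENTICATED",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.ErrorInfo",
        "reason": "CREDENTIALS_MISSING",
        "domain": "googleapis.com",
        "metadata": {
          "service": "aiplatform.googleapis.com",
          "method": "google.cloud.aiplatform.v1.JobService.GetBatchPredictionJob"
        }
      }
    ]
  }
}
"""


def summarize_error(body: str) -> dict:
    """Collect the HTTP code, status, reason, service and method from an error body."""
    error = json.loads(body)["error"]
    summary = {"code": error["code"], "status": error["status"]}
    for detail in error.get("details", []):
        # Only ErrorInfo entries carry the reason and the service/method metadata.
        if detail.get("@type", "").endswith("google.rpc.ErrorInfo"):
            summary["reason"] = detail.get("reason")
            summary.update(detail.get("metadata", {}))
    return summary


for key, value in summarize_error(ERROR_BODY).items():
    print(f"{key}: {value}")
```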