Kubeflow on GCP
install gcloud, kubectl, docker
Example Project - MNIST DEPLOYMENT
wget https://github.com/kubeflow/examples/archive/master.zip
Set ENV
// available project ids can be listed with the following command:
// gcloud projects list
PROJECT_ID=<YOUR_CHOSEN_PROJECT_ID>
gcloud config set project $PROJECT_ID
ZONE=us-central1-a
DEPLOYMENT_NAME=mnist-deployment
cd ./mnist
WORKING_DIR=$(pwd)
# install kustomize from GitHub
# Enable GKE API
GUI Set Up Kubeflow Cluster
deploy.kubeflow.cloud
Check GCP Deployment
https://console.cloud.google.com/dm
Set up Kubectl
gcloud container clusters get-credentials \
$DEPLOYMENT_NAME --zone $ZONE --project $PROJECT_ID
kubectl config set-context $(kubectl config current-context) --namespace=kubeflow
kubectl get all
Training
- After
model.py
training completes, it will upload model to a path - i.e. create and use Cloud Storage bucket
// bucket name can be anything, but must be unique across all projects
BUCKET_NAME=${DEPLOYMENT_NAME}-${PROJECT_ID}
// create the GCS bucket
gsutil mb gs://${BUCKET_NAME}/
# Test new container image locally
docker run -it $IMAGE_PATH
# Successful log then push to Google Container Registry
//allow docker to access our GCR registry
gcloud auth configure-docker --quiet
//push container to GCR
docker push $IMAGE_PATH
Train on Cluster
# Run training job
cd $WORKING_DIR/training/GCS
# kustomize to config YAML manifests
kustomize edit add configmap mnist-map-training \
--from-literal=name=my-train-1
# some default training params
kustomize edit add configmap mnist-map-training \
--from-literal=trainSteps=200
kustomize edit add configmap mnist-map-training \
--from-literal=batchSize=100
kustomize edit add configmap mnist-map-training \
--from-literal=learningRate=0.01
# Config manifests to use custom bucket and training image
kustomize edit set image training-image=$IMAGE_PATH
kustomize edit add configmap mnist-map-training \
--from-literal=modelDir=gs://${BUCKET_NAME}/my-model
kustomize edit add configmap mnist-map-training \
--from-literal=exportDir=gs://${BUCKET_NAME}/my-model/export
# Beware of training code need permissions to R/W to storage bucket - kubeflow solves it by creating a service account within Project as part of deployment: verify
gcloud --project=$PROJECT_ID iam service-accounts list | grep $DEPLOYMENT_NAME
# This service should be auto-granted to R/W to storage bucket; kubeflow also added Kubernetes Secrets called 'user-gcp-sa' to cluster, containing credentials needed to authenticate as this service account within cluster:
kubectl describe secret user-gcp-sa
# Access storage bucket from inside training conatiner, set credential env to point to json file contained in secret
kustomize edit add configmap mnist-map-training \
--from-literal=secretName=user-gcp-sa
kustomize edit add configmap mnist-map-training \
--from-literal=secretMountPath=/var/secrets
kustomize edit add configmap mnist-map-training \
--from-literal=GOOGLE_APPLICATION_CREDENTIALS=/var/secrets/user-gcp-sa.json
# Kustomize to build new customized YAML files:
kustomize build . | kubectl apply -f -
# pipe to deploy to cluster
# Now a new tf-job on cluster called my-train-1-chief-0
kubectl describe tfjob
# python log
kubectl logs -f my-train-1-chief-0
# once train done, query bucket's data
gsutil ls -r gs://${BUCKET_NAME}/my-model/export
Note: The model is actually saving two outputs:
a set of checkpoints to resume training later if desired
A directory called export, which holds the model in a format that can be read by a TensorFlow Serving component
Serving
cd $WORKING_DIR/serving/GCS
# TF Serving files in manifests, simply point the compoenent to bucket where model data is stored - will spin up a server to handle requests - unlike tf-job, no custom container needed for server process - instead all info server needs stored in the model file
# set name for service
kustomize edit add configmap mnist-map-serving \
--from-literal=name=mnist-service
# point server at trained model in bucket
kustomize edit add configmap mnist-map-serving \
--from-literal=modelBasePath=gs://${BUCKET_NAME}/my-model/export
# deploy
kustomize build . | kubectl apply -f -
# check
kubectl describe service mnist-service
Deploying UI
So far:
- Trained model in bucket
- TF Server hosting it
- Deploy final piece of system: web interface (web-ui)
a simple flask server hosting HTML/CSS/JavaScript files using mnist_client.py having following code through gRPC:
from grpc.beta import implementations
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2 as psp
# create gRPC stub
channel = implementations.insecure_channel(server_host, server_port)
stub = psp.beta_create_PredictionService_stub(channel)
# build request
request = predict_pb2.PredictRequest()
request.model_spec.name = server_name
request.model_spec.signature_name = 'serving_default'
request.inputs['x'].CopyFrom(
tf.contrib.util.make_tensor_proto(image, shape=image.shape))
# retrieve results
result = stub.Predict(request, timeout)
resultVal = result.outputs["classes"].int_val[0]
scores = result.outputs['predictions'].float_val
version = result.outputs["classes"].int_val[0]
# deploy web ui
cd $WORKING_DIR/front
# no customisation required, deploy directly
kustomize build . | kubectl apply -f -
# Service added to ClusterIP, meaning it cannot be accessed from outside the cluster! Need to set up direct connection to the cluster
kubectl port-forward svc/web-ui 8080:80
# Cloud Shell 'Preview on port 8080'
Web interface a simple HTML/JS wrapper around the TF Serving component doing actual predictions -
Clean Up
gcloud deployment-manager deployments delete $DEPLOYMENT_NAME
gsutil rm -r gs://$BUCKET_NAME
gcloud container images delete us.gcr.io/$PROJECT_ID/kubeflow-train
gcloud container images delete us.gcr.io/$PROJECT_ID/kubeflow-web-ui