
Kubeflow on GCP

End to End Official Doc


Install gcloud, kubectl, and docker

Example Project - MNIST DEPLOYMENT

wget https://github.com/kubeflow/examples/archive/master.zip


# Available project IDs can be listed with the following command:
# gcloud projects list

gcloud config set project $PROJECT_ID

cd ./mnist
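
The commands in these notes reference several shell variables that are never defined here. A minimal sketch of how they might be set follows. Every value is a placeholder assumption (the project ID, zone, deployment name, and bucket naming scheme are hypothetical); only the image path follows a pattern from the notes themselves, namely the us.gcr.io/$PROJECT_ID/kubeflow-train image deleted in the Clean Up step.

```shell
# Placeholder values: replace each with your own (all hypothetical).
PROJECT_ID="my-project"             # an ID from `gcloud projects list`
ZONE="us-central1-a"                # zone chosen when creating the cluster
DEPLOYMENT_NAME="mnist-deployment"  # name given to the Kubeflow deployment

# Assumed naming scheme: prefixing with the project ID helps keep the
# bucket name unique.
BUCKET_NAME="${PROJECT_ID}-kubeflow-mnist"

# Matches the training image name deleted in the Clean Up step.
IMAGE_PATH="us.gcr.io/${PROJECT_ID}/kubeflow-train"

# Assumes this is run from inside the mnist example directory.
WORKING_DIR="$(pwd)"

echo "bucket: gs://${BUCKET_NAME}/"
echo "image:  ${IMAGE_PATH}"
```

Setting these once in the same shell session keeps the later gcloud, gsutil, docker, and kustomize commands consistent with each other.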

# install kustomize from GitHub

# Enable GKE API

GUI Set Up Kubeflow Cluster


Check GCP Deployment


Set up Kubectl

gcloud container clusters get-credentials \
    $DEPLOYMENT_NAME --zone $ZONE --project $PROJECT_ID
kubectl config set-context $(kubectl config current-context) --namespace=kubeflow

kubectl get all


# The bucket name can be anything, but it must be unique across all projects

# Create the GCS bucket
gsutil mb gs://${BUCKET_NAME}/

# Test new container image locally 
docker run -it $IMAGE_PATH

# If the local run logs successfully, push the image to Google Container Registry
# Allow docker to access the GCR registry
gcloud auth configure-docker --quiet

# Push the container image to GCR
docker push $IMAGE_PATH

Train on Cluster


# Run training job 
cd $WORKING_DIR/training/GCS

# Use kustomize to configure the YAML manifests
kustomize edit add configmap mnist-map-training \
# some default training params
kustomize edit add configmap mnist-map-training \
kustomize edit add configmap mnist-map-training \
kustomize edit add configmap mnist-map-training \
# Configure the manifests to use the custom bucket and training image
kustomize edit set image training-image=$IMAGE_PATH
kustomize edit add configmap mnist-map-training \
kustomize edit add configmap mnist-map-training \
# The training code needs permission to read and write the storage bucket. Kubeflow handles this by creating a service account within the project as part of the deployment. To verify:
gcloud --project=$PROJECT_ID iam service-accounts list | grep $DEPLOYMENT_NAME

# This service account should automatically be granted read/write access to the storage bucket. Kubeflow also adds a Kubernetes secret called 'user-gcp-sa' to the cluster, containing the credentials needed to authenticate as this service account from within the cluster:
kubectl describe secret user-gcp-sa

# To access the storage bucket from inside the training container, set the credentials environment variable to point to the JSON file contained in the secret
kustomize edit add configmap mnist-map-training \
kustomize edit add configmap mnist-map-training \
kustomize edit add configmap mnist-map-training \
# Kustomize to build new customized YAML files:
kustomize build . | kubectl apply -f -
# pipe to deploy to cluster
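
The `kustomize edit add configmap` commands above are cut off after the trailing backslash, so the actual parameters cannot be recovered from these notes. As a sketch of the general pattern only: `kustomize edit add configmap` accepts `--from-literal=KEY=VALUE` arguments. The helper name `set_training_param`, the `DRY_RUN` switch, and the `exampleKey`/`exampleValue` pair below are hypothetical placeholders, not the keys the training manifests expect.

```shell
# Hypothetical helper: adds one literal to the mnist-map-training configmap.
# With DRY_RUN=1 (the default here) it only prints the command, so the
# sketch can be tried without kustomize installed.
set_training_param() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "kustomize edit add configmap mnist-map-training --from-literal=$1=$2"
  else
    kustomize edit add configmap mnist-map-training "--from-literal=$1=$2"
  fi
}

# exampleKey/exampleValue are placeholders, not real training parameters.
set_training_param exampleKey exampleValue
```

Running with `DRY_RUN=0` from the directory containing kustomization.yaml would execute the command instead of printing it.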

# The cluster should now have a new tf-job called my-train-1-chief-0
kubectl describe tfjob

# Follow the Python logs from the training container
kubectl logs -f my-train-1-chief-0

# Once training is done, list the exported model data in the bucket
gsutil ls -r gs://${BUCKET_NAME}/my-model/export

Note: The model is actually saving two outputs:

  1. A set of checkpoints, to resume training later if desired

  2. A directory called export, which holds the model in a format that can be read by a TensorFlow Serving component



Serve on Cluster

cd $WORKING_DIR/serving/GCS

# The TF Serving manifests only need to point the component at the bucket where the model data is stored; it then spins up a server to handle requests. Unlike the tf-job, no custom container is needed for the server process: everything the server needs is stored in the model file.

# set name for service
kustomize edit add configmap mnist-map-serving \

# point server at trained model in bucket
kustomize edit add configmap mnist-map-serving \
# deploy
kustomize build . | kubectl apply -f -

# check
kubectl describe service mnist-service

Deploying UI


So far:

The web UI is a simple Flask server hosting HTML/CSS/JavaScript files. It uses mnist_client.py, which talks to the TF Serving component through gRPC with the following code:

import tensorflow as tf  # needed for tf.contrib.util.make_tensor_proto below
from grpc.beta import implementations
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2 as psp

# create gRPC stub
channel = implementations.insecure_channel(server_host, server_port)
stub = psp.beta_create_PredictionService_stub(channel)

# build request
request = predict_pb2.PredictRequest()
request.model_spec.name = server_name
request.model_spec.signature_name = 'serving_default'
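# NOTE: the input key 'x' is an assumption; the line opening this call is missing from these notes
request.inputs['x'].CopyFrom(
    tf.contrib.util.make_tensor_proto(image, shape=image.shape))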

# retrieve results
result = stub.Predict(request, timeout)
resultVal = result.outputs["classes"].int_val[0]
scores = result.outputs['predictions'].float_val
version = result.outputs["classes"].int_val[0]

# Deploy the web UI

cd $WORKING_DIR/front

# no customisation required, deploy directly
kustomize build . | kubectl apply -f -

# The service is of type ClusterIP, meaning it cannot be accessed from outside the cluster. Set up a direct connection to the cluster:
kubectl port-forward svc/web-ui 8080:80

# In Cloud Shell, open 'Preview on port 8080' to view the UI

The web interface is a simple HTML/JS wrapper around the TF Serving component, which does the actual predictions.

Clean Up

gcloud deployment-manager deployments delete $DEPLOYMENT_NAME

gsutil rm -r gs://$BUCKET_NAME

gcloud container images delete us.gcr.io/$PROJECT_ID/kubeflow-train
gcloud container images delete us.gcr.io/$PROJECT_ID/kubeflow-web-ui
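
Because these clean-up commands are destructive, an unset variable can be dangerous: with BUCKET_NAME empty, the bucket path collapses to gs://. A minimal guard might check the variables first. The helper name `require_vars` is hypothetical (it is not part of gcloud or gsutil), and the values assigned below are placeholders for illustration.

```shell
# Hypothetical helper: fail if any named variable is unset or empty.
require_vars() {
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      return 1
    fi
  done
  return 0
}

# Placeholder values for illustration only.
DEPLOYMENT_NAME="mnist-deployment"
BUCKET_NAME="my-project-kubeflow-mnist"
PROJECT_ID="my-project"

if require_vars DEPLOYMENT_NAME BUCKET_NAME PROJECT_ID; then
  echo "all variables set; safe to run the clean-up commands"
fi
```

Calling the guard right before the delete commands means a fresh shell session with missing variables stops early instead of issuing a malformed delete.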

