Merge branch 'main' into Ishaan/nit-image-params

kaito-project · Jan 14, 2024 · 64050d2
2 parents f958500 + f38c8b1
commit 64050d2
Showing 5 changed files with 121 additions and 37 deletions.
diff --git a/.gitignore b/.gitignore
@@ -29,3 +29,5 @@ hack/tools/bin/*
 .DS_Store
 /coverage.txt
 
+# values override file for helm chart installation
+values.override.yaml
diff --git a/README.md b/README.md
@@ -33,10 +33,42 @@ Note that the *gpu-provisioner* is not an open sourced component. It can be repl
 
 
 ## Installation 
-The following guidance assumes **Azure Kubernetes Service(AKS)** is used to host the Kubernetes cluster .
+
+The following guidance assumes **Azure Kubernetes Service(AKS)** is used to host the Kubernetes cluster.
+
+Before you begin, ensure you have the following tools installed:
+
+- [Azure CLI](https://learn.microsoft.com/cli/azure/install-azure-cli) to provision Azure resources
+- [Helm](https://helm.sh) to install this operator
+- [kubectl](https://kubernetes.io/docs/tasks/tools/) to view Kubernetes resources
+- [git](https://git-scm.com/downloads) to clone this repo locally
+
+If you do not already have an AKS cluster, run the following Azure CLI commands to create one:
+
+```bash
+export RESOURCE_GROUP="myResourceGroup"
+export MY_CLUSTER="myCluster"
+export LOCATION="eastus"
+az group create --name $RESOURCE_GROUP --location $LOCATION
+az aks create --resource-group $RESOURCE_GROUP --name $MY_CLUSTER --enable-oidc-issuer --enable-workload-identity --enable-managed-identity --generate-ssh-keys
+```
+
+Connect to the AKS cluster.
+
+```bash
+az aks get-credentials --resource-group $RESOURCE_GROUP --name $MY_CLUSTER
+```
+
+If you do not have `kubectl` installed locally, you can install it using the following Azure CLI command.
+
+```bash
+az aks install-cli
+```
 
 #### Enable Workload Identity and OIDC Issuer features
-The *gpu-provisioner* controller requires the [workload identity](https://learn.microsoft.com/en-us/azure/aks/workload-identity-overview?tabs=dotnet) feature to acquire the access token to the AKS cluster.
+The *gpu-provisioner* controller requires the [workload identity](https://learn.microsoft.com/azure/aks/workload-identity-overview?tabs=dotnet) feature to acquire the access token to the AKS cluster. 
+
+> Run the following commands only if your AKS cluster does not already have the Workload Identity and OIDC issuer features enabled.
 
 ```bash
 export RESOURCE_GROUP="myResourceGroup"
@@ -47,39 +79,65 @@ az aks update -g $RESOURCE_GROUP -n $MY_CLUSTER --enable-oidc-issuer --enable-wo
 #### Create an identity and assign permissions
 The identity `kaitoprovisioner` is created for the *gpu-provisioner* controller. It is assigned Contributor role for the managed cluster resource to allow changing `$MY_CLUSTER` (e.g., provisioning new nodes in it).
 ```bash
-export SUBSCRIPTION="mySubscription"
-az identity create --name kaitoprovisioner -g $RESOURCE_GROUP
-export IDENTITY_PRINCIPAL_ID=$(az identity show --name kaitoprovisioner -g $RESOURCE_GROUP --subscription $SUBSCRIPTION --query 'principalId' | tr -d '"')
-export IDENTITY_CLIENT_ID=$(az identity show --name kaitoprovisioner -g $RESOURCE_GROUP --subscription $SUBSCRIPTION --query 'clientId' | tr -d '"')
+export SUBSCRIPTION=$(az account show --query id -o tsv)
+export IDENTITY_NAME="kaitoprovisioner"
+az identity create --name $IDENTITY_NAME -g $RESOURCE_GROUP
+export IDENTITY_PRINCIPAL_ID=$(az identity show --name $IDENTITY_NAME -g $RESOURCE_GROUP --subscription $SUBSCRIPTION --query 'principalId' -o tsv)
+export IDENTITY_CLIENT_ID=$(az identity show --name $IDENTITY_NAME -g $RESOURCE_GROUP --subscription $SUBSCRIPTION --query 'clientId' -o tsv)
 az role assignment create --assignee $IDENTITY_PRINCIPAL_ID --scope /subscriptions/$SUBSCRIPTION/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.ContainerService/managedClusters/$MY_CLUSTER  --role "Contributor"
-
 ```
 
 #### Install helm charts
 Two charts will be installed in `$MY_CLUSTER`: `gpu-provisioner` chart and `workspace` chart.
+
+> Be sure you've cloned this repo and connected to your AKS cluster before attempting to install the Helm charts.
+
+Install the Workspace controller.
+
 ```bash
 helm install workspace ./charts/kaito/workspace
+```
 
-export NODE_RESOURCE_GROUP=$(az aks show -n $MY_CLUSTER -g $RESOURCE_GROUP --query nodeResourceGroup | tr -d '"')
-export LOCATION=$(az aks show -n $MY_CLUSTER -g $RESOURCE_GROUP --query location | tr -d '"')
-export TENANT_ID=$(az account show | jq -r ".tenantId")
-yq -i '(.controller.env[] | select(.name=="ARM_SUBSCRIPTION_ID"))       .value = env(SUBSCRIPTION)'        ./charts/kaito/gpu-provisioner/values.yaml
-yq -i '(.controller.env[] | select(.name=="LOCATION"))                  .value = env(LOCATION)'            ./charts/kaito/gpu-provisioner/values.yaml
-yq -i '(.controller.env[] | select(.name=="ARM_RESOURCE_GROUP"))        .value = env(RESOURCE_GROUP)'      ./charts/kaito/gpu-provisioner/values.yaml
-yq -i '(.controller.env[] | select(.name=="AZURE_NODE_RESOURCE_GROUP")) .value = env(NODE_RESOURCE_GROUP)' ./charts/kaito/gpu-provisioner/values.yaml
-yq -i '(.controller.env[] | select(.name=="AZURE_CLUSTER_NAME"))        .value = env(MY_CLUSTER)'          ./charts/kaito/gpu-provisioner/values.yaml
-yq -i '(.settings.azure.clusterName)                                           = env(MY_CLUSTER)'          ./charts/kaito/gpu-provisioner/values.yaml
-yq -i '(.workloadIdentity.clientId)                                            = env(IDENTITY_CLIENT_ID)'  ./charts/kaito/gpu-provisioner/values.yaml
-yq -i '(.workloadIdentity.tenantId)                                            = env(TENANT_ID)'           ./charts/kaito/gpu-provisioner/values.yaml
-helm install gpu-provisioner ./charts/kaito/gpu-provisioner 
-
+Install the Node provisioner controller.
+```bash
+# get additional values for helm chart install
+export NODE_RESOURCE_GROUP=$(az aks show -n $MY_CLUSTER -g $RESOURCE_GROUP --query nodeResourceGroup -o tsv)
+export LOCATION=$(az aks show -n $MY_CLUSTER -g $RESOURCE_GROUP --query location -o tsv)
+export TENANT_ID=$(az account show --query tenantId -o tsv)
+
+# create a local values override file
+cat << EOF > values.override.yaml
+controller:
+  env:
+  - name: ARM_SUBSCRIPTION_ID
+    value: $SUBSCRIPTION
+  - name: LOCATION
+    value: $LOCATION
+  - name: AZURE_CLUSTER_NAME
+    value: $MY_CLUSTER
+  - name: AZURE_NODE_RESOURCE_GROUP
+    value: $NODE_RESOURCE_GROUP
+  - name: ARM_RESOURCE_GROUP
+    value: $RESOURCE_GROUP
+  - name: LEADER_ELECT
+    value: "false"
+workloadIdentity:
+  clientId: $IDENTITY_CLIENT_ID
+  tenantId: $TENANT_ID
+settings:
+  azure:
+    clusterName: $MY_CLUSTER
+EOF
+
+# install gpu-provisioner using values override file
+helm install gpu-provisioner ./charts/kaito/gpu-provisioner -f values.override.yaml
 ```
 
 #### Create the federated credential
 The federated identity credential between the managed identity `kaitoprovisioner` and the service account used by the *gpu-provisioner* controller is created.
 ```bash
-export AKS_OIDC_ISSUER=$(az aks show -n $MY_CLUSTER -g $RESOURCE_GROUP --subscription $SUBSCRIPTION --query "oidcIssuerProfile.issuerUrl" | tr -d '"')
-az identity federated-credential create --name kaito-federatedcredential --identity-name kaitoprovisioner -g $RESOURCE_GROUP --issuer $AKS_OIDC_ISSUER --subject system:serviceaccount:"gpu-provisioner:gpu-provisioner" --audience api://AzureADTokenExchange --subscription $SUBSCRIPTION
+export AKS_OIDC_ISSUER=$(az aks show -n $MY_CLUSTER -g $RESOURCE_GROUP --subscription $SUBSCRIPTION --query "oidcIssuerProfile.issuerUrl" -o tsv)
+az identity federated-credential create --name kaito-federatedcredential --identity-name $IDENTITY_NAME -g $RESOURCE_GROUP --issuer $AKS_OIDC_ISSUER --subject system:serviceaccount:"gpu-provisioner:gpu-provisioner" --audience api://AzureADTokenExchange --subscription $SUBSCRIPTION
 ```
 Then the *gpu-provisioner* can access the managed cluster using a trust token with the same permissions of the `kaitoprovisioner` identity.
 Note that before finishing this step, the *gpu-provisioner* controller pod will constantly fail with the following message in the log:
@@ -88,6 +146,36 @@ panic: Configure azure client fails. Please ensure federatedcredential has been
 ```
 The pod will reach running state once the federated credential is created.
 
+#### Verify installation
+You can run the following commands to verify that the controllers were installed successfully.
+
+Check status of the Helm chart installations.
+
+```bash
+helm list -n default
+```
+
+Check status of the `workspace`.
+
+```bash
+kubectl describe deploy workspace -n workspace
+```
+
+Check status of the `gpu-provisioner`.
+
+```bash
+kubectl describe deploy gpu-provisioner -n gpu-provisioner
+```
+
+#### Troubleshooting 
+If you see that the `gpu-provisioner` deployment is not running after some time, it's possible that some values are incorrect in your `values.override.yaml`.
+
+Run the following command to check `gpu-provisioner` pod logs for additional details.
+
+```bash
+kubectl logs --selector=app.kubernetes.io\/name=gpu-provisioner -n gpu-provisioner
+```
+
 #### Clean up
 
 ```bash
@@ -128,8 +216,8 @@ $ kubectl get svc workspace-falcon-7b
 NAME                  TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)            AGE
 workspace-falcon-7b   ClusterIP   <CLUSTERIP>  <none>        80/TCP,29500/TCP   10m
 
-$ kubectl run -it --rm --restart=Never curl --image=curlimages/curl sh
-~ $ curl -X POST http://<CLUSTERIP>/chat -H "accept: application/json" -H "Content-Type: application/json" -d "{\"prompt\":\"YOUR QUESTION HERE\"}"
+$ export CLUSTERIP=$(kubectl get svc workspace-falcon-7b -o jsonpath="{.spec.clusterIPs[0]}")
+$ kubectl run -it --rm --restart=Never curl --image=curlimages/curl -- curl -X POST http://$CLUSTERIP/chat -H "accept: application/json" -H "Content-Type: application/json" -d "{\"prompt\":\"YOUR QUESTION HERE\"}"
 ```
 
 ## Usage
@@ -154,7 +242,7 @@ contact [[email protected]](mailto:[email protected]) with any additio
 
 ## Trademarks
 This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
-trademarks or logos is subject to and must follow [Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
+trademarks or logos is subject to and must follow [Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/legal/intellectualproperty/trademarks/usage/general).
 Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
 Any use of third-party trademarks or logos are subject to those third-party's policies.
 

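Note on the README change above: the new `values.override.yaml` heredoc uses an unquoted `EOF` delimiter, so the shell substitutes variables such as `$SUBSCRIPTION` and `$TENANT_ID` when the file is written. Any variable that is unset at that point silently becomes an empty value, which is one way to end up with the incorrect values mentioned under Troubleshooting. The following is a minimal sketch of a guard, assuming the variable names exported in the README; the helper name `check_required_vars` is hypothetical and not part of the repo.

```bash
# Hypothetical helper (not part of the kaito repo): report every variable
# name passed as an argument whose value is unset or empty.
check_required_vars() {
  _missing=0
  for _name in "$@"; do
    # Look up the value of the variable whose name is stored in $_name.
    # Names are expected to be literal identifiers supplied by the caller.
    eval "_value=\${$_name:-}"
    if [ -z "$_value" ]; then
      echo "missing or empty: $_name" >&2
      _missing=1
    fi
  done
  # Returns 0 when all variables are set, 1 otherwise.
  return "$_missing"
}
```

One possible use is to run `check_required_vars SUBSCRIPTION LOCATION MY_CLUSTER NODE_RESOURCE_GROUP RESOURCE_GROUP IDENTITY_CLIENT_ID TENANT_ID` immediately before the `cat << EOF > values.override.yaml` step and to continue only if it returns success.
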
diff --git a/presets/models/falcon/README.md b/presets/models/falcon/README.md
@@ -1,5 +1,3 @@
-# Falcon
-
 ## Supported Models
 |Model name| Model source | Sample workspace|Kubernetes Workload|Distributed inference|
 |----|:----:|:----:| :----: |:----: |

diff --git a/presets/models/llama2/README.md b/presets/models/llama2/README.md
@@ -1,5 +1,3 @@
-# llama2
-
 ## Supported Models
 |Model name| Model source | Sample workspace|Kubernetes Workload|Distributed inference|
 |----|:----:|:----:| :----: |:----: |
@@ -12,7 +10,7 @@
 
 ### Build llama2 private images
 
-#### 1. Clone Kaito Repository
+#### 1. Clone kaito repository
 ```
 git clone https://github.com/Azure/kaito.git
 ```
@@ -32,8 +30,8 @@ Use the following command to build the llama2 inference service image from the r
 ```
 docker build \
   --file docker/presets/llama-2/Dockerfile \
-  --build-arg LLAMA_WEIGHTS=$LLAMA_WEIGHTS_PATH \
-  --build-arg SRC_DIR=presets/llama2 \
+  --build-arg WEIGHTS_PATH=$LLAMA_WEIGHTS_PATH \
+  --build-arg MODEL_PRESET_PATH=presets/models/llama2 \
   -t $LLAMA_MODEL_NAME:latest .
 ```
 

diff --git a/presets/models/llama2chat/README.md b/presets/models/llama2chat/README.md
@@ -1,5 +1,3 @@
-# llama2chat
-
 ## Supported Models
 |Model name| Model source | Sample workspace|Kubernetes Workload|Distributed inference|
 |----|:----:|:----:| :----: |:----: |
@@ -12,7 +10,7 @@
 
 ### Build llama2chat private images
 
-#### 1. Clone Kaito Repository
+#### 1. Clone kaito repository
 ```
 git clone https://github.com/Azure/kaito.git
 ```
@@ -32,8 +30,8 @@ Use the following command to build the llama2chat inference service image from t
 ```
 docker build \
   --file docker/presets/llama-2/Dockerfile \
-  --build-arg LLAMA_WEIGHTS=$LLAMA_WEIGHTS_PATH \
-  --build-arg SRC_DIR=presets/llama2chat \
+  --build-arg WEIGHTS_PATH=$LLAMA_WEIGHTS_PATH \
+  --build-arg MODEL_PRESET_PATH=presets/models/llama2chat \
   -t $LLAMA_MODEL_NAME:latest .
 ```