## Fine-Tuning with Kaito

[Kaito](https://github.com/Azure/kaito) is an operator that automates AI/ML inference model deployment in a Kubernetes cluster. Compared to most mainstream model deployment methodologies built on virtual machine infrastructures, Kaito differs in the following key ways:

- Manages model files using container images. An HTTP server is provided to perform inference calls using the model library.
- Avoids tuning deployment parameters to fit GPU hardware by providing preset configurations.
- Auto-provisions GPU nodes based on model requirements.
- Hosts large model images in the public Microsoft Container Registry (MCR) if the license allows.

Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.

## Architecture

Kaito follows the classic Kubernetes Custom Resource Definition (CRD)/controller design pattern. Users manage a `workspace` custom resource that describes the GPU requirements and the inference specification. Kaito controllers automate the deployment by reconciling the `workspace` custom resource.

<div align="left">
  <img src="https://github.com/kaito-project/kaito/raw/main/docs/img/arch.png" width=80% title="Kaito architecture" alt="Kaito architecture">
</div>

The figure above presents an overview of the Kaito architecture. Its major components are:

- **Workspace controller**: Reconciles the `workspace` custom resource, creates `machine` (explained below) custom resources to trigger node auto-provisioning, and creates the inference workload (`deployment` or `statefulset`) based on the model preset configurations.
- **Node provisioner controller**: Named *gpu-provisioner* in the [gpu-provisioner helm chart](https://github.com/Azure/gpu-provisioner/tree/main/charts/gpu-provisioner). It uses the `machine` CRD originated from [Karpenter](https://sigs.k8s.io/karpenter) to interact with the workspace controller, and it integrates with Azure Kubernetes Service (AKS) APIs to add new GPU nodes to the AKS cluster.

> Note: The [*gpu-provisioner*](https://github.com/Azure/gpu-provisioner) is an open-source component. It can be replaced by other controllers that support the [Karpenter-core](https://sigs.k8s.io/karpenter) APIs.

## Overview video

[Watch the Kaito Demo](https://www.youtube.com/embed/pmfBSg7L6lE?si=b8hXKJXb1gEZcmAe)

## Installation

Please check the installation guidance [here](https://github.com/Azure/kaito/blob/main/docs/installation.md).

## Quick start

After installing Kaito, one can try the following commands to start a fine-tuning service.
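Before applying a workspace, it can help to confirm that the Kaito controllers are running. A minimal sanity check is sketched below; the `kaito-workspace` and `gpu-provisioner` namespaces are assumptions based on the default helm chart values, so adjust them to match your installation.

```sh
# Assumed default namespace for the workspace controller helm chart
kubectl get pods -n kaito-workspace

# Assumed default namespace for the gpu-provisioner helm chart
kubectl get pods -n gpu-provisioner
```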
```sh
$ cat examples/fine-tuning/kaito_workspace_tuning_phi_3.yaml
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-tuning-phi-3
resource:
  instanceType: "Standard_NC6s_v3"
  labelSelector:
    matchLabels:
      app: tuning-phi-3
tuning:
  preset:
    name: phi-3-mini-128k-instruct
  method: qlora
  input:
    urls:
      - "https://huggingface.co/datasets/philschmid/dolly-15k-oai-style/resolve/main/data/train-00000-of-00001-54e3756291ca09c6.parquet?download=true"
  output:
    image: "ACR_REPO_HERE.azurecr.io/IMAGE_NAME_HERE:0.0.1" # Tuning Output ACR Path
    imagePushSecret: ACR_REGISTRY_SECRET_HERE

$ kubectl apply -f examples/fine-tuning/kaito_workspace_tuning_phi_3.yaml
```

The workspace status can be tracked by running the following command. When the WORKSPACEREADY column becomes `True`, the model has been deployed successfully.

```sh
$ kubectl get workspace workspace-tuning-phi-3

NAME                     INSTANCE           RESOURCEREADY   INFERENCEREADY   WORKSPACEREADY   AGE
workspace-tuning-phi-3   Standard_NC6s_v3   True            True             True             10m
```

Next, one can find the inference service's cluster IP and use a temporary `curl` pod to test the service endpoint in the cluster.

```sh
$ kubectl get svc workspace-tuning-phi-3

NAME                     TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)            AGE
workspace-tuning-phi-3   ClusterIP   <CLUSTERIP>   <none>        80/TCP,29500/TCP   10m

$ export CLUSTERIP=$(kubectl get svc workspace-tuning-phi-3 -o jsonpath="{.spec.clusterIPs[0]}")
$ kubectl run -it --rm --restart=Never curl --image=curlimages/curl -- curl -X POST http://$CLUSTERIP/chat -H "accept: application/json" -H "Content-Type: application/json" -d "{\"prompt\":\"YOUR QUESTION HERE\"}"
```
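Once the tuning job finishes and the adapter image has been pushed to the registry specified under `output`, that image can be consumed by an inference workspace. The manifest below is a sketch assuming the `adapters` field from Kaito's published adapter examples; the workspace name, label, adapter name, and `strength` value are illustrative and should be checked against the CRD version you installed.

```yaml
apiVersion: kaito.sh/v1alpha1
kind: Workspace
metadata:
  name: workspace-phi-3-mini-adapter   # illustrative name
resource:
  instanceType: "Standard_NC6s_v3"
  labelSelector:
    matchLabels:
      app: phi-3-adapter               # illustrative label
inference:
  preset:
    name: phi-3-mini-128k-instruct
  adapters:
    # The adapter name and strength are illustrative; the image should match
    # the tuning output pushed by the workspace in the quick start above.
    - source:
        name: "phi-3-adapter"
        image: "ACR_REPO_HERE.azurecr.io/IMAGE_NAME_HERE:0.0.1"
      strength: "1.0"
```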