Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add KF 1.9 release blog post v1 #150

Merged
Merged
Show file tree
Hide file tree
Changes from 14 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
230 changes: 230 additions & 0 deletions _posts/2024-07-22-kubeflow-1.9-release.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,230 @@
---
StefanoFioravanzo marked this conversation as resolved.
Show resolved Hide resolved
title: "Kubeflow 1.9: New Tools for Model Management and Training Optimization"
layout: post
toc: false
comments: true
image: images/logo.png
hide: false
categories: [release]
permalink: /kubeflow-1.9-release/
author: "Kubeflow 1.9 Release Team, Stefano Fioravanzo"
---

Kubeflow 1.9 significantly simplifies the development, tuning and management of secure machine learning models and LLMs. Highlights include:
StefanoFioravanzo marked this conversation as resolved.
Show resolved Hide resolved

- **Model Registry**: Centralized management for ML models, versions, and artifacts.
- **Fine-Tune APIs for LLMs**: Simplifies fine-tuning of LLMs with custom datasets.
- **Pipelines**: Consolidation of Tekton and Argo Workflows backends for improved flexibility.
- **Security Enhancements**: Network policies, Oauth2-proxy, and CVE scanning.
- **Integration Upgrades**: Improved integrations with Ray, Seldon, BentoML, and KServe for LLM GPU optimizations.
- **Installation and Documentation**: Streamlined installation, updated platform dependencies, and enhanced documentation.

These updates aim to simplify workflows, improve integration dependencies, and provide Kubernetes-native operational efficiencies for enterprise scale, security, and isolation.


## Model Registry

A model registry provides a central catalog for ML model developers to index and manage models, versions, and ML artifacts metadata. It fills a gap between model experimentation and production activities. It provides a central interface for all stakeholders in the ML lifecycle to collaborate on ML models. Model registry has been [asked by the community](https://blog.kubeflow.org/kubeflow-user-survey-2023/#:~:text=lifecycle%2C%20followed%20by-,model%20registry%20(44%25),-and%20initial%20setup) for a long time and we are delighted to introduce it to the Kubeflow ecosystem.

This initial release includes REST APIs and a Python SDK to track model artifacts and model metadata with a standardized format that can be reused across Kubeflow components, such as to deploy Inference Servers. You can get started by following the [Model Registry tutorial on the Kubeflow website](https://www.kubeflow.org/docs/components/model-registry/overview/), or see a short [demo video](https://www.youtube.com/watch?v=JVxUTkAKsMU) of the Model Registry in action.

We are just getting started. This is an Alpha version and we look forward to feedback. The [model registry working group](https://docs.google.com/document/d/1DmMhcae081SItH19gSqBpFtPfbkr9dFhSMCgs-JKzNo/edit) meets biweekly: you can provide feedback by joining the meeting or directly on the [repository](https://github.com/kubeflow/model-registry/issues).

## Fine-Tune APIs for LLMs

In the rapidly evolving ML/AI landscape, the ability to fine-tune pre-trained models represents a significant leap towards achieving custom solutions with less effort and time. Fine-tuning with custom datasets allows practitioners to adapt large language models (LLMs) to their specific needs.

However, fine-tuning tasks often require extensive manual intervention, including the configuration of training environments and the distribution of data across nodes. The new [Fine-Tune API](https://www.kubeflow.org/docs/components/training/user-guides/fine-tuning/) aims to simplify this process, offering an easy-to-use Python interface that abstracts away the complexity involved in setting up and executing fine-tuning tasks on distributed systems.

By providing this API, Training Operator not only simplifies the user experience for ML practitioners but also leverages its existing infrastructure for distributed training. You can take advantage of Kubernetes’ ability to dynamically schedule GPU, thus saving on compute resources and cost. Training Operator also gives you fault tolerant guarantees, to save your training procedures from cluster node failures.

## Pipelines

### v1 Feature Parity

We made significant progress towards KFPv1 feature parity by adding more Kubernetes resources to the Pipelines code with the new kfp-kubernetes 1.2.0 Python package. We encourage every KFP user to test the new V2 functionality and plan your migration from V1 to V2. We still have some outstanding features that need to be ported over to V2, please help us to identify what's missing by openining a new issue in the KFP repository.

### Argo Workflows and Tekton Backends Consolidation

The Pipelines Tekton backend has been [merged](https://github.com/kubeflow/pipelines/pull/10678) into the main Kubeflow Pipelines repository. You can now choose what workflow engine to use from the same Pipelines version. This proves the extensibility and flexibility of the KFP v2 architecture, which encourages other contributors to bring support for other workflow engines.

Both Argo Workflows and Tekton provide unique advantages. Argo Workflows is known for its simplicity and ease of use, making it a popular choice for many users. Tekton offers extensive customization options with its pipeline definitions and reusable components, which can be advantageous for integrating into various CI/CD systems. Depending on your specific requirements and preferences, you can leverage the strengths of either Argo Workflows or Tekton to optimize your machine learning workflows.

In this [blog post](https://developer.ibm.com/blogs/awb-tekton-optimizations-for-kubeflow-pipelines-2-0/), you can find more details about the benefits of running KFP with either Tekton or Argo Workflows.

### Argo Workflows Upgrade

Kubeflow Pipelines’s Argo Workflows backend is [upgraded to 3.4.16](https://github.com/kubeflow/pipelines/issues/10469). This upgrade moves the supported version closer to the latest upstream version and brings lots of CVE resolutions. The previous minor version was no longer being patched by the Argo community, so lots of security issues had accumulated over time.

## Katib

Kubeflow 1.9 ships with Katib 0.17, which brings [official support](https://github.com/kubeflow/katib/pull/2315) for ARM64, getting us one step closer to full ARM64 coverage.
StefanoFioravanzo marked this conversation as resolved.
Show resolved Hide resolved

For Data Scientists who submit training jobs with the Python SDK, you can now set the [algorithm settings](https://github.com/kubeflow/katib/pull/2227) and [environment variables](https://github.com/kubeflow/katib/pull/2235) from the tune method. Previously, you had to rely directly on Kubernetes CRD submission for these. You can also take advantage of the latest features from TensorFlow 2.16 and PyTorch 2.2. The team also worked to resolve [environmental conflicts](https://github.com/kubeflow/katib/issues/2346) that prevented the Katib Python SDK to be installed alongside the Kubeflow Python SDK.

There are tons of additional improvements and bug fixes. Check out the full changelog [here](https://github.com/kubeflow/katib/blob/master/CHANGELOG.md).

## Central Dashboard

This release bring several improvements to the Kubeflow Central Dashboard, including:

- Styling improvements to the sidebar, including [grouping](https://github.com/kubeflow/kubeflow/pull/7583) all Kubeflow Pipelines links to reduce clutter
- Significant [improvements](https://github.com/kubeflow/kubeflow/pull/7582) to the “manage contributors” page, including the ability to manage contributors for all profiles that you are the owner of, and see which profiles you have access to, even when you are not the owner
- Allow external services to [parse](https://github.com/kubeflow/kubeflow/pull/7138) the current profile (namespace) by sending the namespace selector value to non-iframed applications
- Significant [updates](https://github.com/kubeflow/kubeflow/pull/7578) to dependencies to reduce CVEs

![Kubeflow notebook images](../images/2024-07-22-kubeflow-1.9-release/dashboard.png)

## Notebooks

With this release, we provide [significant updates](https://github.com/kubeflow/kubeflow/pull/7590) to all example notebook images including PyTorch 2.3.0, Tensorflow 2.15.1 and many other library updates. While you can continue to use the old images, we recommend updating to use the greatest and latest ML libraries.
StefanoFioravanzo marked this conversation as resolved.
Show resolved Hide resolved

Additionally, notebooks images now run with a [non-root SecurityContext](https://github.com/kubeflow/kubeflow/pull/7622), allowing for an improved security.

Take a look at the [changelog](https://github.com/kubeflow/kubeflow/releases/tag/v1.9.0) for a full list of bug fixes and improvements.

While this release was light on new Notebooks features, the Working Group is hard at work on an exciting new project: we are actively developing Notebooks V2, with contributions from various companies, in the [new repository](https://github.com/kubeflow/notebooks/tree/notebooks-v2). Take a look [here](https://github.com/kubeflow/kubeflow/issues/7156) and join our Working Group meetings to get involved!


## Kubeflow Platform (Security and Manifests)

### Security

#### Network Policies

Network policies are enabled for the Kubeflow core services as a second layer of defense before Istio authorization policies. This gives administrators a better network overview and segmentation while also enforcing common enterprise security guidelines.
You can read more about the current implementation and architecture [here](https://github.com/kubeflow/manifests/tree/master/common/networkpolicies).

#### Authentication

Oauth2-proxy replaces oidc-authservice, which brings improved token-based authentication. Machine Learning engineers can now use tokens instead of insecure passwords for CI/CD automation of Kubeflow deployment and maintenance (e.g. using GitHub actions).
You can read more about the current implementation and architecture [here](https://github.com/kubeflow/manifests/tree/master/common/oidc-client/oauth2-proxy).

#### CVE Scanning

With this release we are introducing [automated CVE scanning](https://github.com/kubeflow/manifests/blob/master/hack/trivy_scan.py) with [Trivy](https://github.com/aquasecurity/trivy) on the manifests [master branch](https://github.com/kubeflow/manifests/blob/master/.github/workflows/trivy.yaml). We appreciate contributions to reduce the number of CVEs, the Security Working Group needs help to build a more secure platform. You can find more details about our security scanning process and disclosure policy here. Here are is a summary from June 25th:

![Kubeflow notebook images](../images/2024-07-22-kubeflow-1.9-release/CVE_table.png)

You can find a detailed Security WG roadmap [here](https://github.com/kubeflow/manifests/issues/2598).

### Manifests

#### Installation and documentation improvements

The [documentation](https://github.com/kubeflow/manifests?tab=readme-ov-file#upgrading-and-extending) has been improved and now contains guidelines for upgrading and extending the Kubeflow Platform for administrators. New users can now install kubeflow on their laptop in just a few minutes

Platform dependencies updates:

<table>
<tr>
<td><strong>Component</strong>
</td>
<td><a href="https://kubernetes.io/releases/">Kubernetes</a>
</td>
<td><a href="https://github.com/kubernetes-sigs/kustomize/releases">Kustomize</a>
</td>
<td><a href="https://istio.io/latest/news/releases/">Istio</a>
</td>
<td><a href="https://github.com/dexidp/dex/releases">Dex</a>
</td>
<td><a href="https://cert-manager.io/docs/installation/supported-releases/">Cert-Manager</a>
</td>
<td><a href="https://knative.dev/docs/reference/relnotes/">Knative</a>
</td>
</tr>
<tr>
<td><strong>KF 1.9 Version</strong>
</td>
<td>1.27 - 1.29
</td>
<td>5.2.1+
</td>
<td>1.22.1
</td>
<td>2.39.1
</td>
<td>1.14.5
</td>
<td>1.12.4
</td>
</tr>
</table>


> **NOTE** Kubernetes 1.30+ is also expected to work, but was not officially tested.

> **NOTE** Kustomize 5.2.1+ support with way less warnings. So platform engineers have a modern and supported installation tool chain.

#### Integration with third-party ML tools

We have updated the third-party components in /contrib to provide integration with the broader ML ecosystem.

- [BentoML](https://github.com/kubeflow/manifests/tree/master/contrib/bentoml) 1.2.28 and 1.1.21
- [Seldon](https://github.com/kubeflow/manifests/tree/master/contrib/seldon) 1.18.1
- [Ray](https://github.com/kubeflow/manifests/tree/master/contrib/ray) 2.23 and Kuberay 1.1.1

Find a detailed Manifests WG roadmap [here](https://github.com/kubeflow/manifests/issues/2592).

## KServe

We upgraded to [KServe 0.13](https://kserve.github.io/website/0.13/blog/articles/2024-05-15-KServe-0.13-release/). This release includes:

- **Enhanced Hugging Face Runtime Support**: Hugging Face models are supported out-of-the-box, implementing a [KServe Hugging Face Serving Runtime](https://github.com/kserve/kserve/tree/master/python/huggingfaceserver). Currently supported tasks include sequence classification, token classification, fill mask, text generation, and text to text generation.
- **vLLM Support**: Dedicated runtime support for [vLLM](https://docs.vllm.ai/en/latest/) is now included, streamlining the deployment process for LLMs.
- **OpenAI Schema Integration**: KServe now supports endpoints for generative transformer models, following the OpenAI protocol. This enables KServe to be used directly with OpenAI’s client libraries or third-party tools like LangChain and LlamaIndex.

## Documentation

StefanoFioravanzo marked this conversation as resolved.
Show resolved Hide resolved
MLOps is a complex subject and users have asked for clear, up-to-date and comprehensive documentation. We are happy to announce that we started a restructuring process to better align the various components’ docs to have a similar structure. We are revamping our docs to better align with user expectations and how you expect technical docs to be organized. We will continue to improve the quality and completeness of our docs, by adding new user guides, tutorials, and reference architecture topics.

We are looking for new members who can help us craft complete and high quality documentation. Please get involved by [opening and reviewing PRs](https://github.com/kubeflow/website/pulls) in the Kubeflow website.

## Honorable Mentions

### Google Spark Operator migration to Kubeflow

We’re excited to announce the migration of Google’s Spark Operator to the Kubeflow Spark Operator, marking the launch of a significant addition to the Kubeflow ecosystem. The Kubeflow Spark Operator simplifies the deployment and management of Apache Spark applications on Kubernetes. This announcement isn’t just about a new piece of technology, it’s about building a stronger, open-governed, and more collaborative community around Spark on Kubernetes.

Kubeflow Spark Operator is not yet officially included in the Kubeflow release, but you can install it by following the instructions [here](https://www.kubeflow.org/docs/components/spark-operator/getting-started/).

Read more about Kubeflow Spark operator in the announcement [blog post](https://blog.kubeflow.org/operators/2024/04/15/kubeflow-spark-operator.html).

### Google Summer of Code

This year, Kubeflow was excited to participate in Google Summer of Code (GSoC), attracting a wave of enthusiastic students! Over 250 students joined our Slack channel, eager to learn about contributing to the Kubeflow community and crafting impactful proposals.

We were also fortunate to have a dedicated group of 20 mentors ready to guide these talented individuals. From a pool of nearly 70 proposals, we have selected 10 proposals and were awarded with 9 outstanding students by Google. They are now actively contributing to various Kubeflow features, making a real difference in various Kubeflow components.

StefanoFioravanzo marked this conversation as resolved.
Show resolved Hide resolved
We'll be following their progress and sharing their accomplishments through a series of blog posts in the future, so stay tuned! A big thank you to all the mentors and students who are making Kubeflow's GSoC 2024 a huge success!

## What’s next

The community continues to see growth, especially with the ever growing interest from the CloudNative community in AI topics. We have recently elected the first Kubeflow Steering Committee. This is the first step towards a more mature governance structure and a democratic and open community.

If you want to take a peek into the Kubeflow 1.10 roadmap planning and contribute with your ideas, see [Notebooks](https://github.com/kubeflow/kubeflow/issues/7459), [Manifests & Security](https://github.com/kubeflow/manifests/issues/2763), [Pipelines](https://github.com/kubeflow/pipelines/discussions/10908), [Model Registry](https://github.com/kubeflow/model-registry/issues/175), [Katib](https://github.com/kubeflow/katib/issues/2386), [Training Operator](https://github.com/kubeflow/training-operator/issues/2169).

StefanoFioravanzo marked this conversation as resolved.
Show resolved Hide resolved
## How to get started with 1.9

Head over to our [Getting Started](https://www.kubeflow.org/docs/started/) page to learn more about installation, architecture and quick start examples.
StefanoFioravanzo marked this conversation as resolved.
Show resolved Hide resolved

## Join the Community

We would like to thank everyone for their contribution to Kubeflow 1.9, especially Ricardo Martinelli De Oliveira for his work as the v1.9 Release Manager, all the release team and the working group leads, who relentlessly dedicate their time to this great project.
StefanoFioravanzo marked this conversation as resolved.
Show resolved Hide resolved

*Release team members*: Ricardo Martinelli De Oliveira, Stefano Fioravanzo, Helber Belmiro, Diego Lovison, Ajay Nagar, Mathew Wicks, Steven Irvin, Milos Grubjesic, Andrew Scribner, Julius von Kohout.

*Working Group leads*: Andrey Velichkevich, Ce Gao, Chaoran Yu, Chen Sun, Christian Kadner, Ilias Katsakioris, James Liu, James Wu, Johnu George, Julius von Kohout, Kimonas Sotirchos, Mathew Wicks, Matteo Mortari, Ramesh Reddy, Stefano Fioravanzo, Tommy Li, Vara Bonthu, Yannis Zarkadas, Yuan Tang, Yuki Iwai.

*Kubeflow Steering Committee*: Andrey Velichkevich, Johnu George, Josh Bottum, James Wu, Yuan Tang.

*Participating Distributions*: Charmed Kubeflow (Canonical), IBM IKS, Nutanix, OpenShift AI (RedHat), Oracle Cloud Infrastructure, DeployKF, VMWare, QBO. You can find more details about Kubeflow distributions [here](https://www.kubeflow.org/docs/started/installing-kubeflow/#packaged-distributions).

### Want to help?

The Kubeflow community Working Groups hold open meetings and are always looking for more volunteers and users to unlock the potential of machine learning. If you’re interested in becoming a Kubeflow contributor, please feel free to check out the resources below. We look forward to working with you!

- Visit our [Kubeflow website](https://www.kubeflow.org/docs/about/community/) or Kubeflow GitHub Page
- Join the [Kubeflow Slack channel](https://www.kubeflow.org/docs/about/community/)
- Join the [kubeflow-discuss](https://groups.google.com/g/kubeflow-discuss) mailing list
- Attend our weekly [community meeting](https://www.kubeflow.org/docs/about/community/#kubeflow-community-call)
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.