This project showcases how you can work up from a simple RAG pipeline to a more complex setup that involves finetuning embeddings, reranking retrieved documents, and even finetuning the LLM itself. We'll do this all for a use case relevant to ZenML: a question answering system that can provide answers to common questions about ZenML. This will help you understand how to apply the concepts covered in this guide to your own projects.
Contained within this project is all the pipeline and step code needed to run the full pipelines. You can follow along in our guide to understand the decisions and tradeoffs behind the code, and you'll build a solid understanding of how to leverage LLMs in your MLOps workflows using ZenML, enabling you to build powerful, scalable, and maintainable LLM-powered applications. You'll need a PostgreSQL database to store the embeddings; full instructions for setting that up are provided below.
We've recently been holding some webinars about this repository and project. Watch the videos below if you want an introduction and context around the code and ideas covered here.
This project showcases production-ready pipelines, so we use some cloud infrastructure to manage the assets. You can run the pipelines locally using a local PostgreSQL database, but we encourage you to use a cloud database for production use cases.
Make sure you're running from a Python 3.8+ environment. Set up a virtual environment and install the dependencies using the following command:

```bash
pip install -r requirements.txt
```
Depending on your hardware, you may run into some issues when running the `pip install` command with the `flash_attn` package. In that case, running `FLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE pip install flash-attn --no-build-isolation` could help you. You might also need to install torch separately.
In order to use the default LLM for queries, you'll need an OpenAI account and an API key, specified as a ZenML secret:

```bash
zenml secret create llm-complete --openai_api_key=<your-openai-api-key>
export ZENML_PROJECT_SECRET_NAME=llm-complete
```
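For a sense of how pipeline code can read that key back at runtime, here is a minimal sketch using ZenML's `Client` API; the `get_openai_api_key` helper is hypothetical, but the secret and key names match the command above:

```python
from zenml.client import Client

def get_openai_api_key() -> str:
    # Fetch the secret created above from the active ZenML server.
    secret = Client().get_secret("llm-complete")
    return secret.secret_values["openai_api_key"]
```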
Supabase is a cloud provider that offers a PostgreSQL database. It's simple to use and has a free tier that should be sufficient for this project. Once you've created a Supabase account and organization, you'll need to create a new project.
You'll want to save the Supabase connection details (password, user, host, and port) as a ZenML secret so that they aren't stored in plaintext. You can do this by running the following command:

```bash
zenml secret update llm-complete -v '{"supabase_password": "YOUR_PASSWORD", "supabase_user": "YOUR_USER", "supabase_host": "YOUR_HOST", "supabase_port": "YOUR_PORT"}'
```
You can get the user, host and port for this database instance by getting the connection string from the Supabase dashboard.
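As a rough sketch of how those values turn into a live connection (assuming the `psycopg2` driver; `postgres` is Supabase's default database name, and the key names match the secret above):

```python
import psycopg2
from zenml.client import Client

# Pull the connection details back out of the ZenML secret.
secret = Client().get_secret("llm-complete").secret_values

conn = psycopg2.connect(
    user=secret["supabase_user"],
    password=secret["supabase_password"],
    host=secret["supabase_host"],
    port=secret["supabase_port"],
    dbname="postgres",  # Supabase's default database name
)
```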
In case Supabase is not an option for you, you can use a different database as the backend.
To run the pipeline, you can use the `run.py` script. This script will allow you to run the pipelines in the correct order. You can run the script with the following command:

```bash
python run.py rag
```
This will run the basic RAG pipeline, which scrapes the ZenML documentation and stores the embeddings in the Supabase database.
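To make the retrieval side concrete, here is a minimal sketch of the kind of similarity query that runs against the stored embeddings. It assumes the `pgvector` extension, an `embeddings` table with `content` and `embedding` columns, and the `conn` object from the connection sketch above; the actual schema used by the pipeline may differ:

```python
# Embed the user's question with the same model used to index the docs.
query_embedding = [0.1] * 384  # placeholder; use your embedding model here

# pgvector's <=> operator orders rows by cosine distance.
with conn.cursor() as cur:
    cur.execute(
        "SELECT content FROM embeddings "
        "ORDER BY embedding <=> %s::vector LIMIT 5",
        (str(query_embedding),),
    )
    top_chunks = [row[0] for row in cur.fetchall()]
```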
Once the pipeline has run successfully, you can query the assets in the Supabase database using the `--query` flag as well as passing in the model you'd like to use for the LLM. When you're ready to make the query, run the following command:

```bash
python run.py query "how do I use a custom materializer inside my own zenml steps? i.e. how do I set it? inside the @step decorator?" --model=gpt4
```
Alternative options for LLMs to use include:

- `gpt4`
- `gpt35`
- `claude3`
- `claudehaiku`

Note that Claude will require a different API key from Anthropic. See the `litellm` docs on how to set this up.
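For context, the model shorthands above are routed through `litellm`, whose calling convention looks roughly like this (the mapping from shorthand to provider model string is an assumption here; check `constants.py` for the actual mapping):

```python
from litellm import completion

# litellm routes to OpenAI or Anthropic based on the model string and
# reads OPENAI_API_KEY / ANTHROPIC_API_KEY from the environment.
response = completion(
    model="gpt-4",  # e.g. what the "gpt4" shorthand might map to
    messages=[{"role": "user", "content": "What is ZenML?"}],
)
print(response.choices[0].message.content)
```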
You'll need to update and add some secrets to make this work with your Hugging Face account. To get your ZenML service account API token and store URL, you can first create a new service account:
```bash
zenml service-account create <SERVICE_ACCOUNT_NAME>
```
For more information on this part of the process, please refer to the ZenML documentation.
Once you have your service account API token and store URL (the URL of your deployed ZenML tenant), you can update the secrets with the following command:
```bash
zenml secret update llm-complete --zenml_api_token=<YOUR_ZENML_SERVICE_ACCOUNT_API_TOKEN> --zenml_store_url=<YOUR_ZENML_STORE_URL>
```
To set the Hugging Face user space that gets used for the Gradio app deployment, you should set environment variables with the following commands:

```bash
export ZENML_HF_USERNAME=<YOUR_HF_USERNAME>
export ZENML_HF_SPACE_NAME=<YOUR_HF_SPACE_NAME> # optional, defaults to "llm-complete-guide-rag"
```
To deploy the RAG pipeline, you can use the following command:

```bash
python run.py --deploy
```
Alternatively, you can run the basic RAG pipeline and deploy it in one go:

```bash
python run.py --rag --deploy
```
This will open a Hugging Face space in your browser where you can interact with the RAG pipeline.
To run the evaluation pipeline, you can use the following command:

```bash
python run.py evaluation
```
You'll need to have run the RAG pipeline first so that the database contains the necessary assets to evaluate.
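As a loose illustration of the retrieval half of the evaluation (not the project's exact logic in `steps/eval_retrieval.py`; the helper signature and the expected-URL convention are assumptions), the core check looks roughly like this:

```python
def retrieval_accuracy(questions, expected_urls, retrieve) -> float:
    """Fraction of questions whose expected source URL shows up
    among the URLs of the retrieved documents."""
    hits = 0
    for question, expected in zip(questions, expected_urls):
        retrieved_urls = retrieve(question)  # your retrieval function
        if expected in retrieved_urls:
            hits += 1
    return hits / len(questions)
```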
For embeddings finetuning, we first generate synthetic data and then finetune the embeddings. Both of these pipelines are described in the LLMOps guide, and instructions for running them are provided below.
To run the `distilabel` synthetic data generation pipeline, you can use the following commands:

```bash
pip install -r requirements-argilla.txt # special requirements
python run.py synthetic
```
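Stripped of the `distilabel` machinery, the core idea of this step is to ask an LLM to invent questions for each documentation chunk. A minimal sketch of that idea, using `litellm` directly rather than the actual `distilabel` pipeline (the prompt and helper are illustrative only):

```python
from litellm import completion

def generate_question(chunk: str) -> str:
    """Ask an LLM for a question that the given documentation chunk answers."""
    prompt = (
        "Write one question that the following documentation "
        f"excerpt answers:\n\n{chunk}"
    )
    response = completion(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```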
You will also need to have set up and connected to an Argilla instance for this to work. Please follow the instructions in the Argilla documentation to set up and connect to an Argilla instance on the Hugging Face Hub. ZenML's Argilla integration documentation will guide you through the process of connecting to your instance as a stack component.
Please use the secret from above to track all the secrets; here we are also setting a Hugging Face write token. To make the rest of the pipeline work for you, you will need to change the Hugging Face repository URLs to a space you have permissions for.

```bash
zenml secret update llm-complete -v '{"argilla_api_key": "YOUR_ARGILLA_API_KEY", "argilla_api_url": "YOUR_ARGILLA_API_URL", "hf_token": "YOUR_HF_TOKEN"}'
```
As with the previous pipeline, you will need to have set up and connected to an Argilla instance for this to work; see the Argilla setup instructions above.
The pipeline assumes that your Argilla secret is stored within a ZenML secret called `argilla_secrets`.
To run the pipeline for finetuning the embeddings, you can use the following commands:

```bash
pip install -r requirements-argilla.txt # special requirements
python run.py embeddings
```
Credit to Phil Schmid for his tutorial on embeddings finetuning with the Matryoshka loss function, which we adapted for this project.
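For a sense of what Matryoshka finetuning involves, here is a minimal sketch using `sentence-transformers` (the base model and dimension list are illustrative; the project's actual training configuration lives in the embeddings pipeline):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import (
    MatryoshkaLoss,
    MultipleNegativesRankingLoss,
)

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# MatryoshkaLoss wraps an inner loss so that truncated embeddings
# (e.g. just the first 64 dimensions) stay useful after training.
inner_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(model, inner_loss, matryoshka_dims=[384, 256, 128, 64])
```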
The basic RAG pipeline will run using a local stack, but if you want to improve the speed of the embeddings step you might want to consider using a cloud orchestrator. Please follow the instructions in the documentation on popular integrations (currently available for AWS and GCP) to learn how you can run the pipelines on a remote stack.
If you run the pipeline using a cloud artifact store, logs from all the steps as well as assets like the visualizations will all be shown in the ZenML dashboard.
If you run the pipeline using ZenML Pro you'll have access to the managed dashboard which will allow you to get started quickly. We offer a free trial so you can try out the platform without any cost. Visit the ZenML Pro dashboard to get started.
You can also self-host the ZenML dashboard. Instructions are available in our documentation.
The project loosely follows the recommended ZenML project structure:
```
.
├── LICENSE                        # License file
├── README.md                      # Project documentation
├── __init__.py
├── constants.py                   # Constants used throughout the project
├── materializers
│   ├── __init__.py
│   └── document_materializer.py   # Document materialization logic
├── most_basic_eval.py             # Basic evaluation script
├── most_basic_rag_pipeline.py     # Basic RAG pipeline script
├── notebooks
│   └── visualise_embeddings.ipynb # Notebook to visualize embeddings
├── pipelines
│   ├── __init__.py
│   ├── generate_chunk_questions.py # Pipeline to generate chunk questions
│   ├── llm_basic_rag.py           # Basic RAG pipeline using LLM
│   └── llm_eval.py                # Pipeline for LLM evaluation
├── requirements.txt               # Project dependencies
├── run.py                         # Main script to run the project
├── steps
│   ├── __init__.py
│   ├── eval_e2e.py                # End-to-end evaluation step
│   ├── eval_retrieval.py          # Retrieval evaluation step
│   ├── eval_visualisation.py      # Evaluation visualization step
│   ├── populate_index.py          # Step to populate the index
│   ├── synthetic_data.py          # Step to generate synthetic data
│   ├── url_scraper.py             # Step to scrape URLs
│   ├── url_scraping_utils.py      # Utilities for URL scraping
│   └── web_url_loader.py          # Step to load web URLs
├── structures.py                  # Data structures used in the project
├── tests
│   ├── __init__.py
│   └── test_url_scraping_utils.py # Tests for URL scraping utilities
└── utils
    ├── __init__.py
    └── llm_utils.py               # Utilities related to the LLM
```
The RAG pipeline relies on code from this Timescale blog that showcased using PostgreSQL as a vector database. We adapted it for our use case and made it work with Supabase.