This project tracks, ranks, and evaluates DevOps AI Assistants across knowledge domains.
📅 Book a time on my calendar or email [email protected] to chat about these benchmarks.
AWS Services (dataset)
Name | Accuracy | Median Duration (s) | Created At |
---|---|---|---|
OpsTower.ai | 92% 🏆 | 29 | 2023-09-17 |
ReleaseAI | 72% | 11 | 2023-09-17 |
AWS CloudWatch Metrics (dataset)
Name | Accuracy | Median Duration (s) | Created At |
---|---|---|---|
OpsTower.ai | 89% 🏆 | 42 | 2023-09-17 |
ReleaseAI | 56% | 20 | 2023-09-18 |
AWS Billing (dataset)
Name | Accuracy | Median Duration (s) | Created At |
---|---|---|---|
OpsTower.ai | 91% 🏆 | 53 | 2023-09-18 |
ReleaseAI | 73% | 23 | 2023-09-18 |
kubectl (dataset)
Name | Accuracy | Median Duration (s) | Created At |
---|---|---|---|
abhishek-ch/kubectl-GPT | 83% 🏆 | 5 | 2023-09-19 |
devinjeon/kubectl-gpt | 50% | 1 | 2023-09-19 |
mico | 17% | 1 | 2023-09-19 |
Metrics:
Accuracy
: The percentage of questions that the DevOps AI Assistant answered correctly.

Median Duration
: The median time, in seconds, that the DevOps AI Assistant took to answer a question.
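As a rough illustration of how these two metrics could be computed from per-question results, here is a minimal sketch. The record fields (`correct`, `duration_s`) and the `summarize` helper are assumptions made for this example and are not taken from the project's code.

```python
from statistics import median


def summarize(results):
    """Compute leaderboard-style metrics from per-question results.

    Each result is assumed (hypothetically) to be a dict with:
      - "correct": bool, whether the answer was judged correct
      - "duration_s": float, seconds taken to answer
    """
    if not results:
        raise ValueError("no results to summarize")
    correct = sum(1 for r in results if r["correct"])
    return {
        # Percentage of questions answered correctly, rounded to a whole percent.
        "accuracy_pct": round(100 * correct / len(results)),
        # Median (not mean) duration, so one slow answer does not dominate.
        "median_duration_s": median(r["duration_s"] for r in results),
    }


# Made-up sample data, for illustration only.
sample = [
    {"correct": True, "duration_s": 4.0},
    {"correct": True, "duration_s": 6.0},
    {"correct": False, "duration_s": 5.0},
    {"correct": True, "duration_s": 9.0},
]
summary = summarize(sample)
```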
A DevOps AI Assistant is an LLM-backed autonomous agent that helps DevOps engineers perform their daily tasks. These assistants connect to external systems like AWS and Kubernetes to perform actions on behalf of the user.
This list only includes assistants that can be invoked from the command line or via a REST API, are functional, and are available for immediate use (not in private beta).
Name | Focus | Evaluated? |
---|---|---|
aiac | Terraform, kubectl, AWS | No - code generation only |
aiws | AWS | No - does not decipher command output |
Aptible AI | ? | No |
Argon | Kubernetes | No |
cloud copilot | Azure | No - does not decipher command output |
k8sgpt | Kubernetes | Planned |
kubectl-GPT | kubectl | ✅ |
kubectl-gpt | kubectl | ✅ |
KubeCtl-ai | Kubernetes manifests | No - code generation only |
mico | kubectl | ✅ |
OpsTower.ai | AWS | ✅ |
ReleaseAI | AWS, kubectl | ✅ |
Terraform AI | Terraform | No - code generation only |
tfgpt | Terraform | No - code generation only |
Open a PR and submit a DevOps AI Assistant for automated evaluation. To be evaluated, the agent must meet the following criteria:
- Can be invoked from the command line or via a REST API.
- Is not in private beta.
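To make the command-line criterion concrete, the sketch below shows one way an evaluation harness might call a CLI assistant and time its answer. The `ask_via_cli` helper and the convention of passing the question as the last argument are assumptions for illustration. A tiny Python one-liner stands in for a real assistant so the example runs anywhere.

```python
import subprocess
import sys
import time


def ask_via_cli(command, question, timeout_s=120):
    """Run a CLI assistant with the question as its final argument.

    Returns (answer_text, duration_in_seconds). The calling convention
    is hypothetical; real assistants may expect flags or stdin instead.
    """
    started = time.monotonic()
    completed = subprocess.run(
        [*command, question],
        capture_output=True,  # collect stdout/stderr instead of printing
        text=True,            # decode bytes to str
        timeout=timeout_s,    # fail rather than hang on a stuck assistant
        check=True,           # raise if the assistant exits non-zero
    )
    return completed.stdout.strip(), time.monotonic() - started


# Stand-in "assistant": echoes the question back. Not a real tool.
stand_in = [sys.executable, "-c", "import sys; print('echo: ' + sys.argv[1])"]
answer, duration_s = ask_via_cli(
    stand_in, "How many pods are currently running in the default namespace?"
)
```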
See the datasets/ directory for the question datasets. Each dataset CSV file has 3 columns:

question
: The question to ask the DevOps AI Assistant.

answer_format
: The expected answer in natural language.

reference_functions
: The reference functions that the DevOps AI Assistant should call to answer the question.
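A dataset with these three columns could be read as follows. This is a minimal sketch: the `load_questions` helper is hypothetical, and the sample row's `answer_format` and `reference_functions` values are invented for the example, since the exact encoding used in the real CSV files is not described here.

```python
import csv
import io

EXPECTED_COLUMNS = ["question", "answer_format", "reference_functions"]


def load_questions(csv_file):
    """Read dataset rows from an open CSV file and check the header."""
    reader = csv.DictReader(csv_file)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"dataset is missing columns: {missing}")
    return list(reader)


# In-memory stand-in for a file such as datasets/kubectl.csv.
# Only the question text comes from the dataset list; the other two
# values are hypothetical placeholders.
sample_csv = io.StringIO(
    "question,answer_format,reference_functions\n"
    '"How many pods are currently running in the default namespace?",'
    '"A count of running pods","count_running_pods"\n'
)
rows = load_questions(sample_csv)
```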
List of datasets:
Name | Example Question |
---|---|
aws_cloudwatch_metrics.csv | Were there any Lambda invocations that lasted over 30 seconds in the last day? |
aws_services.csv | Do our EC2 instances have any unexpected reboots or terminations over the past 7 days? |
aws_billing.csv | Which region has the highest AWS expenses for me over the past 3 months? |
kubectl.csv | How many pods are currently running in the default namespace? |
- Iterate over each question in the dataset and store:
  - the answer from the DevOps AI Assistant
  - the truth answer, derived by running the human-evaluated reference functions and using a prompt to summarize their results into an answer.
- Iterate over the stored answers, using the dynamic eval prompt to compare the DevOps AI Assistant's answer to the truth answer. This generates a confidence score and a short explanation of the score.
- Store the results in the results/ directory.
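The steps above can be sketched as a two-pass loop. Everything named here (`evaluate`, the three callables, the record fields, and the `results.json` file name) is an assumption for illustration and not the project's actual implementation. The assistant, the truth-answer builder, and the grader are passed in as plain functions so the sketch runs without any external service.

```python
import json
import tempfile
import time
from pathlib import Path


def evaluate(questions, ask_assistant, build_truth_answer, grade, results_dir):
    """Run the evaluation passes and write the records to results_dir."""
    records = []
    # Pass 1: collect the assistant's answer and the truth answer per question.
    for row in questions:
        started = time.monotonic()
        answer = ask_assistant(row["question"])
        duration_s = time.monotonic() - started
        records.append({
            "question": row["question"],
            "assistant_answer": answer,
            "truth_answer": build_truth_answer(row),
            "duration_s": duration_s,
        })
    # Pass 2: compare each answer to the truth answer.
    for record in records:
        confidence, explanation = grade(
            record["assistant_answer"], record["truth_answer"]
        )
        record["confidence"] = confidence
        record["explanation"] = explanation
    # Final step: persist the results (file name is hypothetical).
    out_dir = Path(results_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "results.json").write_text(json.dumps(records, indent=2))
    return records


# Demo with stub callables standing in for the assistant and the LLM prompts.
questions = [
    {"question": "How many pods are currently running in the default namespace?"}
]
with tempfile.TemporaryDirectory() as tmp:
    demo = evaluate(
        questions,
        ask_assistant=lambda q: "3 pods are running.",
        build_truth_answer=lambda row: "There are 3 running pods.",
        grade=lambda answer, truth: (1.0, "Both report 3 pods."),
        results_dir=tmp,
    )
    wrote_file = (Path(tmp) / "results.json").exists()
```

Keeping the grading in a second pass means the stored answers can be re-graded later without calling the assistant again.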
A critical component of the evaluation process is the dynamic evaluation. It's not feasible to provide a static answer for most questions because the correct answer is environment-specific. For example, the answer to "What is the average CPU utilization across my EC2 instances?" depends on the current state of the EC2 instances.
To solve this, I've stored a set of human-evaluated functions that generate the data needed to answer each question correctly. Then, I use an LLM prompt to generate a natural language answer from that data. This would be a poor evaluation process if the LLM provided an incorrect answer based on the returned data, but I have yet to observe significant errors in the LLM's reasoning about the function output.
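A minimal sketch of this idea follows, assuming a registry that maps function names to callables. The reference function, its fake return data, the prompt wording, and the `summarize_with_llm` callable are all hypothetical. A real reference function would query the live environment (for example, CloudWatch) instead of returning constants.

```python
import json


def cpu_utilization_by_instance():
    """Hypothetical reference function returning fake, fixed data."""
    return {"i-aaa": 12.5, "i-bbb": 37.5}


# Hypothetical registry of human-evaluated reference functions.
REFERENCE_FUNCTIONS = {
    "cpu_utilization_by_instance": cpu_utilization_by_instance,
}


def build_summary_prompt(question, answer_format, function_names):
    """Run the reference functions and embed their output in a prompt."""
    data = {name: REFERENCE_FUNCTIONS[name]() for name in function_names}
    return (
        "Answer the question using only the data below.\n"
        f"Question: {question}\n"
        f"Expected answer format: {answer_format}\n"
        f"Data: {json.dumps(data, sort_keys=True)}\n"
    )


def truth_answer(question, answer_format, function_names, summarize_with_llm):
    """Produce the truth answer; the LLM call is injected by the caller."""
    prompt = build_summary_prompt(question, answer_format, function_names)
    return summarize_with_llm(prompt)


prompt = build_summary_prompt(
    "What is the average CPU utilization across my EC2 instances?",
    "A percentage",
    ["cpu_utilization_by_instance"],
)
# Stub in place of a real LLM call: it just reports the prompt length.
stub_answer = truth_answer(
    "What is the average CPU utilization across my EC2 instances?",
    "A percentage",
    ["cpu_utilization_by_instance"],
    summarize_with_llm=lambda p: f"stub answer from {len(p)} chars",
)
```

Because the data is fetched at evaluation time, the truth answer reflects the same environment state the assistant sees, which is what makes the comparison fair.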
Please submit a PR if you believe a reference function is incorrect.
Reach out to [email protected] if you have general questions about this leaderboard.