opstower-ai/devops-ai-open-leaderboard

DevOps AI Assistant Open Leaderboard

This project tracks, ranks, and evaluates DevOps AI Assistants across knowledge domains.

📅 Book a time on my calendar or email [email protected] to chat about these benchmarks.

🏆 Current Leaderboard

AWS Services (dataset)

| Name | Accuracy | Median Duration (s) | Created At |
| --- | --- | --- | --- |
| OpsTower.ai | 92% 🏆 | 29 | 2023-09-17 |
| ReleaseAI | 72% | 11 | 2023-09-17 |

AWS CloudWatch Metrics (dataset)

| Name | Accuracy | Median Duration (s) | Created At |
| --- | --- | --- | --- |
| OpsTower.ai | 89% 🏆 | 42 | 2023-09-17 |
| ReleaseAI | 56% | 20 | 2023-09-18 |

AWS Billing (dataset)

| Name | Accuracy | Median Duration (s) | Created At |
| --- | --- | --- | --- |
| OpsTower.ai | 91% 🏆 | 53 | 2023-09-18 |
| ReleaseAI | 73% | 23 | 2023-09-18 |

kubectl (dataset)

| Name | Accuracy | Median Duration (s) | Created At |
| --- | --- | --- | --- |
| abhishek-ch/kubectl-GPT | 83% 🏆 | 5 | 2023-09-19 |
| devinjeon/kubectl-gpt | 50% | 1 | 2023-09-19 |
| mico | 17% | 1 | 2023-09-19 |

Metrics:

  • Accuracy: The percent of questions that the DevOps AI Assistant answered correctly.
  • Median Duration: The median duration in seconds that it took the DevOps AI Assistant to answer a question.
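
As a rough illustration of how these two metrics could be computed from per-question results, here is a minimal Python sketch. The `summarize` function and the sample data are hypothetical and are not taken from this repository's code; the sketch also assumes accuracy is rounded to a whole percent, as the leaderboard tables suggest.

```python
from statistics import median


def summarize(results):
    """Summarize a list of (is_correct, duration_seconds) tuples.

    Returns (accuracy as a whole percent, median duration in seconds).
    """
    if not results:
        raise ValueError("no results to summarize")
    correct = sum(1 for is_correct, _ in results if is_correct)
    accuracy_pct = round(100 * correct / len(results))
    median_duration = median(duration for _, duration in results)
    return accuracy_pct, median_duration


# Made-up sample: 3 of 4 questions answered correctly.
sample = [(True, 4.0), (True, 6.5), (False, 5.0), (True, 30.0)]
accuracy, duration = summarize(sample)
print(f"Accuracy: {accuracy}%  Median duration: {duration}s")
```

The median is less sensitive than the mean to a single slow answer, such as the 30-second outlier in the sample.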

What is a DevOps AI Assistant?

A DevOps AI Assistant is an LLM-backed autonomous agent that helps DevOps engineers perform their daily tasks. These assistants connect to external systems like AWS and Kubernetes to perform actions on behalf of the user.

List of DevOps AI Assistants

This list includes only assistants that can be invoked from the command line or via a REST API, are functional, and are available for immediate use (not in private beta).

| Name | Focus | Evaluated? |
| --- | --- | --- |
| aiac | Terraform, kubectl, AWS | No - code generation only |
| aiws | AWS | No - does not decipher command output |
| Aptible AI | ? | No |
| Argon | Kubernetes | No |
| cloud copilot | Azure | No - does not decipher command output |
| k8sgpt | Kubernetes | Planned |
| kubectl-GPT | kubectl | |
| kubectl-gpt | kubectl | |
| KubeCtl-ai | Kubernetes manifests | No - code generation only |
| mico | kubectl | |
| OpsTower.ai | AWS | |
| ReleaseAI | AWS, Kubectl | |
| Terraform AI | Terraform | No - code generation only |
| tfgpt | Terraform | No - code generation only |

Submit a DevOps AI Assistant for evaluation

Open a PR and submit a DevOps AI Assistant for automated evaluation. To be evaluated, the agent must meet the following criteria:

  1. Can be invoked from the command line or via a REST API.
  2. Is not in private beta.
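
To make the command-line criterion concrete, here is a minimal sketch of how a harness might invoke a CLI assistant and time its answer. It assumes the assistant accepts the question as its final argument and prints the answer to stdout; real assistants may differ. `ask_cli_assistant` is a hypothetical helper, and the "assistant" below is a stand-in Python one-liner rather than a real tool.

```python
import subprocess
import sys
import time


def ask_cli_assistant(command, question, timeout_s=120):
    """Run a CLI assistant with the question as the last argument.

    Returns (stdout text, wall-clock duration in seconds).
    """
    start = time.perf_counter()
    completed = subprocess.run(
        [*command, question],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    duration = time.perf_counter() - start
    return completed.stdout.strip(), duration


# Stand-in "assistant" that just echoes the question back (assumption for the demo).
fake_assistant = [sys.executable, "-c", "import sys; print('You asked: ' + sys.argv[1])"]
answer, seconds = ask_cli_assistant(
    fake_assistant,
    "How many pods are currently running in the default namespace?",
)
print(answer)
print(f"took {seconds:.2f}s")
```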

Question Datasets

See the datasets/ directory for the question datasets. Each dataset CSV file has three columns:

  1. question: The question to ask the DevOps AI Assistant
  2. answer_format: The expected answer in natural language.
  3. reference_functions: The reference functions that the DevOps AI Assistant should call to answer the question.
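
For illustration, a dataset with these three columns could be read with Python's standard `csv` module, as sketched below. The `load_questions` helper is hypothetical, and the `answer_format` and `reference_functions` values in the sample row are invented for the demo; only the column names and the question text come from this README.

```python
import csv
import io

EXPECTED_COLUMNS = ("question", "answer_format", "reference_functions")


def load_questions(csv_file):
    """Read dataset rows from an open CSV file and check the expected columns exist."""
    reader = csv.DictReader(csv_file)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"dataset is missing columns: {missing}")
    return [{c: row[c] for c in EXPECTED_COLUMNS} for row in reader]


# In-memory sample; the last two field values are made up for this sketch.
sample_csv = io.StringIO(
    "question,answer_format,reference_functions\n"
    '"How many pods are currently running in the default namespace?",'
    '"A count of running pods","count_running_pods"\n'
)
questions = load_questions(sample_csv)
print(questions[0]["question"])
```

With a real file, the same function could be called as `load_questions(open("datasets/kubectl.csv", newline=""))`.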

List of datasets:

| Name | Example Question |
| --- | --- |
| aws_cloudwatch_metrics.csv | Were there any Lambda invocations that lasted over 30 seconds in the last day? |
| aws_services.csv | Do our ec2 instances have any unexpected reboots or terminations over the past 7 days? |
| aws_billing.csv | Which region has the highest AWS expenses for me over the past 3 months? |
| kubectl.csv | How many pods are currently running in the default namespace? |

Evaluation Process

  1. Iterate over each question in the dataset and store:
     • the answer from the DevOps AI Assistant
     • the truth answer, derived by running the human-evaluated reference functions and using a prompt to summarize their results into an answer.
  2. Iterate over the stored answers, using the dynamic eval prompt to compare the DevOps AI Assistant's answer to the truth answer. This generates a confidence score and a short explanation of the score.
  3. Store the results in the results/ directory.
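
The steps above can be sketched as a two-pass loop. This is an illustrative outline, not the repository's actual code: `evaluate` and its callable parameters (`ask_assistant`, `build_truth`, `compare`) are hypothetical names, the JSON output format is an assumption, and the lambdas in the demo stand in for the real assistant, reference functions, and eval prompt.

```python
import json
import tempfile
from pathlib import Path


def evaluate(questions, ask_assistant, build_truth, compare, results_path):
    """Collect answers, score them against truth answers, and write the results."""
    # Pass 1: store the assistant's answer and the truth answer for each question.
    answers = []
    for q in questions:
        answers.append({
            "question": q["question"],
            "assistant_answer": ask_assistant(q["question"]),
            "truth_answer": build_truth(q),
        })

    # Pass 2: compare each pair to get a confidence score and an explanation.
    results = []
    for a in answers:
        confidence, explanation = compare(a["assistant_answer"], a["truth_answer"])
        results.append({**a, "confidence": confidence, "explanation": explanation})

    # Final step: persist the results (JSON is an assumption of this sketch).
    results_path = Path(results_path)
    results_path.parent.mkdir(parents=True, exist_ok=True)
    results_path.write_text(json.dumps(results, indent=2))
    return results


# Demo with stand-in callables and a temporary results/ directory.
questions = [{"question": "How many pods are currently running in the default namespace?"}]
out = Path(tempfile.mkdtemp()) / "results" / "kubectl.json"
results = evaluate(
    questions,
    ask_assistant=lambda question: "3 pods",
    build_truth=lambda q: "3 pods",
    compare=lambda got, truth: (1.0 if got == truth else 0.0, "exact match (stand-in for the eval prompt)"),
    results_path=out,
)
print(results[0]["confidence"], results[0]["explanation"])
```

Passing the assistant, truth builder, and comparator in as callables keeps the loop independent of any particular assistant or LLM provider.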

A note on dynamic evaluation

A critical component of the evaluation process is dynamic evaluation. For most questions, a static answer is not feasible because the correct answer is environment-specific. For example, the answer to "What is the average CPU utilization across my EC2 instances?" depends on the current state of those instances.

To solve this, I've stored a set of human-evaluated reference functions that generate the data needed for a correct answer. An LLM prompt then turns that data into a natural language answer. This would be a poor evaluation process if the LLM gave an incorrect answer based on the returned data, but I have yet to observe significant errors in the LLM's reasoning about the function output.
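
A minimal sketch of this idea follows, under stated assumptions: `build_truth_answer`, `average_cpu_reference`, and the prompt wording are all hypothetical, and `fake_llm` is a deterministic stand-in so the example runs without any LLM provider. A real reference function would query the live environment instead of returning hard-coded values.

```python
import json


def build_truth_answer(question, reference_function, llm):
    """Run a reference function, then ask an LLM to phrase its data as an answer."""
    data = reference_function()
    prompt = (
        "Answer the question using only the data below.\n"
        f"Question: {question}\n"
        f"Data: {json.dumps(data, sort_keys=True)}\n"
        "Answer:"
    )
    return llm(prompt)


def average_cpu_reference():
    # Hypothetical reference function: hard-coded per-instance CPU percentages.
    return {"i-aaa": 20.0, "i-bbb": 40.0}


def fake_llm(prompt):
    # Stand-in for an LLM call: parse the data back out and average it.
    data = json.loads(prompt.split("Data: ", 1)[1].split("\nAnswer:", 1)[0])
    avg = sum(data.values()) / len(data)
    return f"The average CPU utilization is {avg:.1f}%."


truth = build_truth_answer(
    "What is the average CPU utilization across my EC2 instances?",
    average_cpu_reference,
    fake_llm,
)
print(truth)
```

Because the data comes from the reference function at evaluation time, the truth answer tracks the current state of the environment rather than a fixed string.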

Please submit a PR if you believe a reference function is incorrect.

Contact Info

Reach out to [email protected] if you have general questions about this leaderboard.
