Add documentation for Prometheus metrics in Training Operator (#3894)

* Add Prometheus metrics guide for Training Operator
  Signed-off-by: Sophie Hsu <[email protected]>
* Correct formatting in Label description
  Co-authored-by: Andrey Velichkevich <[email protected]>
  Signed-off-by: Sophie Hsu <[email protected]>
* Incorporate feedback:
  1. Add configuring metrics port section
  2. Remove duplicate sentence
  3. Use Note template for the consistent style
  4. Move the doc under the user-guides directory
  Signed-off-by: Sophie Hsu <[email protected]>
* Clarify labels information interpretation
  Co-authored-by: Helber Belmiro <[email protected]>
  Signed-off-by: Sophie Hsu <[email protected]>
* Remove redundant space
  Co-authored-by: Yuki Iwai <[email protected]>
  Signed-off-by: Sophie Hsu <[email protected]>
* Update the argument explanation for restricting IP address
  Signed-off-by: Sophie Hsu <[email protected]>
---------
Signed-off-by: Sophie Hsu <[email protected]>
Signed-off-by: Sophie Hsu <[email protected]>
Co-authored-by: Andrey Velichkevich <[email protected]>
Co-authored-by: Helber Belmiro <[email protected]>
Co-authored-by: Yuki Iwai <[email protected]>
kubeflow · Oct 21, 2024 · ec7c132
1 parent 987c75e
commit ec7c132
Showing 1 changed file with 72 additions and 0 deletions.
diff --git a/content/en/docs/components/training/user-guides/prometheus.md b/content/en/docs/components/training/user-guides/prometheus.md
@@ -0,0 +1,72 @@
++++
+title = "Prometheus Monitoring"
+description = "Prometheus Metrics for the Training Operator"
+weight = 70
++++
+
+This guide explains how to monitor Kubeflow training jobs using Prometheus metrics. The Training Operator exposes these metrics, providing essential insights into the status of distributed machine learning workloads.
+
+{{< note >}}
+Metrics are only generated in response to specific events. For example, job creation metrics will only appear after a job has been created. If a metric is not visible, it may be because the corresponding event has not occurred yet.
+{{< /note >}}
+
+## Prometheus Metrics for Training Operator
+The Training Operator includes a built-in `/metrics` endpoint that exposes Prometheus metrics. This feature is enabled by default and requires no additional configuration for basic use.
+
+### Configuring Metrics Port
+By default, metrics are exposed on port 8080 and can be scraped from any IP address. 
+
+To change the default metrics port, or to restrict the IP address on which the metrics endpoint listens, add the `--metrics-bind-address` argument to the Training Operator container.
+
+**For example**:
+```yaml
+# deployment.yaml for the Training Operator
+spec:
+    containers:
+    - command:
+      - /manager
+      image: kubeflow/training-operator
+      name: training-operator
+      ports:
+      - containerPort: 8080
+      - containerPort: 9443
+        name: webhook-server
+        protocol: TCP
+      args:
+      - "--metrics-bind-address=192.168.1.100:8082"  # listen on 192.168.1.100, port 8082
+```
+
+**Explanation:**
+
+`--metrics-bind-address=192.168.1.100:8082` makes the metrics endpoint listen on **port 8082** and only on the IP address **192.168.1.100**, so the metrics are reachable only through that address. To listen on all interfaces instead, use **0.0.0.0:8082**.
+
+
+### Accessing the Metrics
+The method to access these metrics may vary depending on your Kubernetes setup and environment. For example, use the following command for local environments:
+```shell
+kubectl port-forward -n kubeflow deployment/training-operator 8080:8080
+```
+
+You can then view the metrics at `http://localhost:8080/metrics`, in the following format:
+```
+# HELP training_operator_jobs_created_total Counts number of jobs created
+# TYPE training_operator_jobs_created_total counter
+training_operator_jobs_created_total{framework="tensorflow",job_namespace="kubeflow"} 7
+```
+
+## List of Job Metrics
+
+| Metric name                               | Description                     | Labels                       |
+|-------------------------------------------|---------------------------------|------------------------------|
+| `training_operator_jobs_created_total`    | Total number of jobs created    | `job_namespace`, `framework` |
+| `training_operator_jobs_deleted_total`    | Total number of jobs deleted    | `job_namespace`, `framework` |
+| `training_operator_jobs_successful_total` | Total number of successful jobs | `job_namespace`, `framework` |
+| `training_operator_jobs_failed_total`     | Total number of failed jobs     | `job_namespace`, `framework` |
+| `training_operator_jobs_restarted_total`  | Total number of restarted jobs  | `job_namespace`, `framework` |
+
+The labels can be interpreted as follows:
+| Label name      | Description                                                     |
+|-----------------|-----------------------------------------------------------------|
+| `job_namespace` | The Kubernetes namespace where the job is running               |
+| `framework`     | The machine learning framework used (e.g. TensorFlow, PyTorch)  |
+
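
Not part of the commit: the guide added above shows the text format served at `/metrics`, and that format can be read with a few lines of code. The following is a minimal sketch in Python, assuming the sample lines look like the one in the guide (metric name, optional `{label="value"}` pairs, then a numeric value). The `pytorch` line in `SAMPLE` and the helper names `parse_counters` and `total_by_label` are hypothetical and exist only for illustration. The parser is deliberately simplified: it skips lines with a trailing timestamp and does not handle escaped quotes or `}` inside label values, so a real Prometheus client library is the safer choice for anything beyond quick inspection.

```python
import re
from collections import defaultdict

# Sample in the Prometheus text format. The first three lines are copied from
# the guide; the "pytorch" line is a hypothetical addition for illustration.
SAMPLE = """\
# HELP training_operator_jobs_created_total Counts number of jobs created
# TYPE training_operator_jobs_created_total counter
training_operator_jobs_created_total{framework="tensorflow",job_namespace="kubeflow"} 7
training_operator_jobs_created_total{framework="pytorch",job_namespace="kubeflow"} 3
"""

# metric_name{label="value",...} value   (the label block is optional)
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)
LABEL_PAIR = re.compile(r'(\w+)="([^"]*)"')


def parse_counters(text):
    """Return {(metric name, sorted (label, value) tuple): float value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # simplified parser: ignore anything it does not understand
        labels = tuple(sorted(LABEL_PAIR.findall(match.group("labels") or "")))
        samples[(match.group("name"), labels)] = float(match.group("value"))
    return samples


def total_by_label(samples, metric, label):
    """Sum one metric's samples, grouped by the value of a single label."""
    totals = defaultdict(float)
    for (name, labels), value in samples.items():
        if name == metric:
            totals[dict(labels).get(label, "")] += value
    return dict(totals)


if __name__ == "__main__":
    parsed = parse_counters(SAMPLE)
    print(total_by_label(parsed, "training_operator_jobs_created_total", "framework"))
```

To try this against a running operator, the text returned by `http://localhost:8080/metrics` (after the `kubectl port-forward` command shown in the guide) could be passed to `parse_counters` in place of `SAMPLE`.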