New Time Slice SLO #20888

Merged
merged 42 commits into from
Dec 18, 2023
Changes from 18 commits
9eedf24
Add time slice to left nav
estherk15 Dec 4, 2023
82c8767
Add time slice instructions and images
estherk15 Dec 4, 2023
7538e96
Add uptime calculations page
estherk15 Dec 4, 2023
1d64886
Add uptime calculations to left nav
estherk15 Dec 4, 2023
c7ea2a3
Merge branch 'master' into esther/docs-6808-time-slice-slo
estherk15 Dec 4, 2023
adf9578
Merge branch 'master' into esther/docs-6808-time-slice-slo
estherk15 Dec 6, 2023
27eb837
Standardize use of Time Slice SLO
estherk15 Dec 6, 2023
be1fe17
Remove duplicate file
estherk15 Dec 6, 2023
c38a9cd
Merge branch 'esther/docs-6808-time-slice-slo' of github.com:DataDog/…
estherk15 Dec 6, 2023
a0b5ba4
Merge uptime with time slice
estherk15 Dec 7, 2023
f4a67c0
Merge branch 'master' into esther/docs-6808-time-slice-slo
estherk15 Dec 11, 2023
1294f9d
Add SLO comparison chart
estherk15 Dec 11, 2023
d7a35d4
Merge branch 'master' into esther/docs-6808-time-slice-slo
estherk15 Dec 11, 2023
049744e
Apply code review suggestions
estherk15 Dec 11, 2023
7100145
Merge branch 'esther/docs-6808-time-slice-slo' of github.com:DataDog/…
estherk15 Dec 11, 2023
4d251f7
Merge branch 'esther/docs-6808-time-slice-slo' of github.com:DataDog/…
estherk15 Dec 11, 2023
79a973f
Merge branch 'esther/docs-6808-time-slice-slo' of github.com:DataDog/…
estherk15 Dec 11, 2023
95ab6d2
Update content/en/service_management/service_level_objectives/_index.md
estherk15 Dec 11, 2023
6389e6c
Apply suggestions from code review
estherk15 Dec 12, 2023
efc364f
Apply suggestions from code review, removed commented examples
estherk15 Dec 12, 2023
a47fdc7
Add time slice to left nav
estherk15 Dec 4, 2023
54d4b96
Add time slice instructions and images
estherk15 Dec 4, 2023
490229a
Add uptime calculations page
estherk15 Dec 4, 2023
35bc038
Add uptime calculations to left nav
estherk15 Dec 4, 2023
34abbfc
Standardize use of Time Slice SLO
estherk15 Dec 6, 2023
6a277e6
Remove duplicate file
estherk15 Dec 6, 2023
3daa809
Merge uptime with time slice
estherk15 Dec 7, 2023
15b3754
Add SLO comparison chart
estherk15 Dec 11, 2023
24aa777
Apply code review suggestions
estherk15 Dec 11, 2023
43033ad
Update content/en/service_management/service_level_objectives/_index.md
estherk15 Dec 11, 2023
b7548a8
Apply suggestions from code review
estherk15 Dec 12, 2023
62b88dc
Apply suggestions from code review, removed commented examples
estherk15 Dec 12, 2023
36c20d3
Add examples with images
estherk15 Dec 12, 2023
24d2e8e
Resolve merge conflicts
estherk15 Dec 13, 2023
ee5b98b
Merge branch 'master' into esther/docs-6808-time-slice-slo
estherk15 Dec 13, 2023
ecd395f
minor changes
roxanne-moslehi Dec 13, 2023
99ed33a
API info comparison chart
roxanne-moslehi Dec 13, 2023
a9ccc35
update comparison chart
roxanne-moslehi Dec 14, 2023
772793a
update comparison chart again
roxanne-moslehi Dec 14, 2023
eff235b
fix status correction info
roxanne-moslehi Dec 14, 2023
1764a80
update SLO definitions
roxanne-moslehi Dec 15, 2023
61b31a6
calendar view info
roxanne-moslehi Dec 15, 2023
9 changes: 7 additions & 2 deletions config/_default/menus/menus.en.yaml
@@ -1328,16 +1328,21 @@ main:
       parent: slos
       identifier: slos_metric
       weight: 2
+    - name: Time Slice SLOs
+      url: service_management/service_level_objectives/time_slice/
+      parent: slos
+      identifier: slos_time_slice
+      weight: 3
     - name: Error Budget Alerts
       url: service_management/service_level_objectives/error_budget/
       parent: slos
       identifier: error_budget
-      weight: 3
+      weight: 4
     - name: Burn Rate Alerts
       url: service_management/service_level_objectives/burn_rate/
       parent: slos
       identifier: burn_rate
-      weight: 4
+      weight: 5
     - name: Guides
       url: service_management/service_level_objectives/guide/
       parent: slos
104 changes: 60 additions & 44 deletions content/en/service_management/service_level_objectives/_index.md

Large diffs are not rendered by default.

@@ -7,6 +7,7 @@ disable_toc: true

 {{< whatsnext desc="General guides:">}}
 {{< nextlink href="/service_management/service_level_objectives/guide/slo-checklist" >}}SLO Checklist{{< /nextlink >}}
+{{< nextlink href="/service_management/service_level_objectives/guide/slo_types_comparison" >}}SLO Type Comparison{{< /nextlink >}}
 {{< /whatsnext >}}

 {{< whatsnext desc="Dashboard guides:">}}
@@ -0,0 +1,54 @@
---
title: SLO Type Comparison
kind: Guide
further_reading:
- link: "/service_management/service_level_objectives/"
  tag: "Documentation"
  text: "Overview of Service Level Objectives"
- link: "/service_management/service_level_objectives/metric/"
  tag: "Documentation"
  text: "Metric-based SLOs"
- link: "/service_management/service_level_objectives/monitor/"
  tag: "Documentation"
  text: "Monitor-based SLOs"
---

## Overview

When creating SLOs, you can choose from the following types:

**Metric-based SLOs**: Use these for count-based data streams. The SLI is the sum of good events divided by the sum of total events.


**Time Slice SLOs** or **Monitor-based SLOs**: Use these for time-based data sets. The SLI is the amount of time your system exhibits good behavior divided by the total time.

- Time Slice SLOs: Do not require a Datadog monitor. You can try out different metric filters and thresholds and instantly explore downtime during SLO creation.
- Monitor-based SLOs: Must be based on a new or existing Datadog monitor. Any adjustments must be made to the underlying monitor and cannot be made during SLO creation.
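
The two SLI definitions above can be sketched as follows. This is a minimal illustration, not Datadog's implementation; the function names and the sample numbers are assumptions made for the example.

```python
def event_based_sli(good_events: float, total_events: float) -> float:
    """Metric-based SLI: good events divided by total events, as a percentage."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return 100.0 * good_events / total_events


def time_based_sli(good_minutes: float, total_minutes: float) -> float:
    """Time Slice or Monitor-based SLI: good time divided by total time, as a percentage."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return 100.0 * good_minutes / total_minutes


# Hypothetical inputs: 999,500 good requests out of 1,000,000,
# and 10 minutes of bad behavior in a 7-day (10,080-minute) window.
print(round(event_based_sli(999_500, 1_000_000), 3))  # 99.95
print(round(time_based_sli(10_080 - 10, 10_080), 3))  # 99.901
```

The event-based figure weights every request equally, while the time-based figure weights every minute equally, regardless of how much traffic arrived in that minute.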


## Comparison chart

| | **Metric-based SLO** | **Monitor-based SLO** | **Time Slice SLO** |
|---|---|---|---|
| **Supported data types** | Metrics with type of count, rate, or distribution | Metric Monitor types, Synthetic Monitors, and Service Checks | All metric types (including gauge metrics) |
| **Group functionality** | Unlimited groups per SLO | Limited to 20 monitor groups per SLO | Up to 5,000 groups per SLO |
| **SLO details side panel History (up to 90 days of historical data)** | Can set custom time windows to view SLO info | Cannot set custom time windows to view SLO info | Can set custom time windows to view SLO info |
| **SLO alerting ([Error Budget][1] or [Burn Rate][2] Alerts)** | Available for all metric-based SLOs | Available for SLOs based on Metric Monitor types only (not available for Synthetic Monitors or Service Checks) | Not yet available for Time Slice SLOs |
| [**SLO Status Corrections**][3] | Correction periods are excluded from SLO status and error budget calculations | Correction periods are excluded from SLO status and error budget calculations | Correction periods are counted as uptime in SLO status and error budget calculations |
| [**SLO Widgets**][4] | Available for Metric-based SLOs | Available for all Monitor-based SLO types | Available for Time Slice SLOs |
| [**SLO Data Source**][5] | Available for this SLO type (up to 15 months of historical data available) | Not yet available for this SLO type | Not yet available for Time Slice SLOs |
| **Handling missing data in the SLO calculation** | Missing data is ignored in SLO status and error budget calculations | Missing data is handled based on the [underlying Monitor's configuration][6] | Missing data is treated as uptime in SLO status and error budget calculations |
| [**Uptime Calculations**][7] | N/A | Uptime calculations are based on the underlying Monitor<br><br>If groups are present, overall uptime requires *all* groups to have uptime | Uptime is calculated by looking at discrete time chunks, not rolling time windows<br><br>If groups are present, overall uptime requires *all* groups to have uptime |
| **Calendar View on SLO Manage Page** | Available | Not available | Available |



## Further reading

{{< partial name="whats-next/whats-next.html" >}}

[1]: https://docs.datadoghq.com/service_management/service_level_objectives/error_budget/
[2]: https://docs.datadoghq.com/service_management/service_level_objectives/burn_rate/
[3]: https://docs.datadoghq.com/service_management/service_level_objectives/#slo-status-corrections
[4]: https://docs.datadoghq.com/service_management/service_level_objectives/#slo-widgets
[5]: https://docs.datadoghq.com/dashboards/guide/slo_data_source/
[6]: https://docs.datadoghq.com/service_management/service_level_objectives/monitor/#missing-data
[7]: /service_management/service_level_objectives/time_slice/#uptime-calculations
@@ -0,0 +1,97 @@
---
title: Time Slice SLOs
kind: documentation
further_reading:
- link: "service_management/service_level_objectives/"
  tag: "Documentation"
  text: "Overview of Service Level Objectives"
---

{{< jqmath-vanilla >}}

## Overview

Time Slice SLOs allow you to measure reliability using a custom definition of uptime. You define uptime as a condition over a metric timeseries. For example, you can create a latency SLO by defining uptime as whenever p95 latency is less than 1 second.

Time Slice SLOs are a convenient alternative to Monitor-based SLOs. You can create an uptime SLO without going through a monitor, so you don't have to create and maintain both a monitor and an SLO.


## Create a Time Slice SLO


You can create a Time Slice SLO in the following ways:
- [Create an SLO from the create page](#create-an-slo-from-the-create-page)
- [Export an existing Monitor-based SLO](#export-an-existing-monitor-slo)
- [Import from a monitor](#import-from-a-monitor)

### Create an SLO from the create page

{{< img src="service_management/service_level_objectives/time_slice/create_and_configuration.png" alt="Configuration options to create a Time Slice SLO" style="width:100%;" >}}

1. Navigate to **Service Management > SLOs**.
1. Click **+ New SLO** to open the Create SLO page.
1. Select **By Time Slices** to define your SLO measurement.
1. Define your uptime condition by choosing a metric query, comparator, and threshold. For example, define uptime as whenever p95 latency is less than 1 second. Alternatively, you can [import the uptime from a monitor](#import-from-a-monitor).
1. Choose your timeframe and target.
1. Name and tag your SLO.
1. Click **Create**.
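
The choices made in the steps above can be pictured as a small record. The sketch below is purely illustrative: `TimeSliceSLODraft`, its field names, the comparator set, and the query string are hypothetical and do not represent the Datadog API or its payload format.

```python
import operator
from dataclasses import dataclass, field
from typing import List

# Comparators an uptime condition might use (an assumption for this sketch).
COMPARATORS = {
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


@dataclass
class TimeSliceSLODraft:
    """Hypothetical container for the choices made on the Create SLO page."""
    name: str
    query: str            # metric query (placeholder syntax)
    comparator: str       # one of COMPARATORS
    threshold: float      # value the query result is compared against
    timeframe_days: int   # SLO timeframe
    target: float         # target uptime percentage
    tags: List[str] = field(default_factory=list)

    def validate(self) -> None:
        """Reject drafts that could not describe a meaningful SLO."""
        if self.comparator not in COMPARATORS:
            raise ValueError(f"unsupported comparator: {self.comparator}")
        if not 0 < self.target < 100:
            raise ValueError("target must be between 0 and 100 (exclusive)")
        if self.timeframe_days <= 0:
            raise ValueError("timeframe_days must be positive")

    def is_uptime(self, value: float) -> bool:
        """Evaluate the uptime condition for a single metric value."""
        return COMPARATORS[self.comparator](value, self.threshold)


draft = TimeSliceSLODraft(
    name="Checkout latency",
    query="p95:example.checkout.latency{*}",  # placeholder, not a real metric
    comparator="<",
    threshold=1.0,
    timeframe_days=7,
    target=99.9,
    tags=["team:checkout"],
)
draft.validate()
print(draft.is_uptime(0.4), draft.is_uptime(1.3))  # True False
```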

### Export an existing monitor SLO

<div class="alert alert-info">Only metric monitor SLOs can be exported. Non-metric monitor SLOs and multi-monitor SLOs cannot be exported.</div>

Create a Time Slice SLO by exporting an existing Monitor-based SLO. From a Monitor-based SLO, click **Export to Time Slice SLO**.

{{< img src="service_management/service_level_objectives/time_slice/export_monitor_slo.png" alt="On a Monitor-based SLO detail side panel, the button to Export to Time Slice is highlighted" style="width:90%;" >}}

### Import from a monitor

<div class="alert alert-info">Only metric monitor SLOs appear in the monitor selection for import. </div>

From the Create or Edit SLO page, under *Define your SLI*, click **Import from Monitor**. Then select a monitor from the dropdown or search for one in the monitor selector.


**Note**: Time Slice SLOs do not support rolling periods. Rolling periods do not transfer from a monitor query to a Time Slice query.

{{< img src="service_management/service_level_objectives/time_slice/import_from_monitor.png" alt="Highlighted option to Import From Monitor in the Define your SLI section of an SLO configuration" style="width:90%;" >}}

## Uptime calculations

To calculate the uptime percentage for a Time Slice SLO, Datadog cuts the timeseries into equal-duration intervals, called "slices". Each slice is 5 minutes long, and this length is not configurable. The space and time aggregation are determined by the metric query. For more information on time and space aggregation, see the [metrics][1] documentation.
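
As a rough illustration of the fixed 5-minute slicing, the sketch below counts the slices in a timeframe and finds the slice that contains a timestamp. Aligning slices to multiples of 5 minutes since the Unix epoch is an assumption made for the example; the source does not state how slice boundaries are chosen.

```python
SLICE_MINUTES = 5  # fixed slice length stated above


def slices_in_timeframe(days: int) -> int:
    """Number of 5-minute slices in a timeframe of the given length."""
    return days * 24 * 60 // SLICE_MINUTES


def slice_start(epoch_seconds: int) -> int:
    """Start of the slice containing a timestamp.

    Assumes slices align to multiples of 5 minutes since the Unix epoch.
    """
    slice_seconds = SLICE_MINUTES * 60
    return epoch_seconds - epoch_seconds % slice_seconds


print(slices_in_timeframe(7))      # 2016
print(slices_in_timeframe(30))     # 8640
print(slice_start(1_700_000_123))  # 1700000100
```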

For each slice, there is a single value for the timeseries, and the uptime condition (such as `value < 1`) is evaluated for each slice. If the condition is met, the slice is considered uptime; otherwise, it is considered downtime.
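
The per-slice evaluation can be sketched as below. This is a simplified model under stated assumptions: `uptime_percentage` is a hypothetical helper, each list entry stands for the single value of one slice, and the sample values are invented.

```python
from typing import Callable, Sequence


def uptime_percentage(
    slice_values: Sequence[float],
    condition: Callable[[float], bool],
) -> float:
    """Share of slices whose value satisfies the uptime condition, as a percentage."""
    if not slice_values:
        raise ValueError("at least one slice is required")
    good_slices = sum(1 for value in slice_values if condition(value))
    return 100.0 * good_slices / len(slice_values)


# One hour of hypothetical data: 12 slices of 5 minutes each.
# Two slices (1.2 and 1.0) fail the condition `value < 1`.
values = [0.4, 0.5, 0.6, 1.2, 0.7, 0.3, 0.9, 1.0, 0.8, 0.5, 0.4, 0.6]
print(round(uptime_percentage(values, lambda value: value < 1), 2))  # 83.33
```

Because the condition is evaluated once per slice, a slice is either entirely uptime or entirely downtime; in this model there is no partial credit within a slice.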

### Groups and overall uptime

Time Slice SLOs allow you to track uptime for individual groups, where groups are defined in the "group by" portion of the metric query.

When groups are present, uptime is calculated for each individual group. Overall uptime works differently: to match existing Monitor-based SLO functionality, Time Slice SLOs use the same definition of overall uptime. When **all** groups have uptime, it is considered overall uptime. Conversely, if **any** group has downtime, it is considered overall downtime. Overall uptime is therefore never greater than the uptime of any individual group.

<!-- In the example above, environment "staging" has 5 minutes of downtime over a 24-hour period, resulting in approximately 99.653% uptime.

$$ (1440-5)/1440 *100 = ~99.653% $$

Environment "dev" also had 5 minutes of downtime, resulting in the same uptime. That means that overall downtime (when either staging or dev had downtime) was 10 minutes, since there is no overlap. This results in approximately 99.306% overall uptime.

$$ (1440-10)/1440 *100 = ~99.306% $$ -->
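
The "all groups must be up" rule can be sketched as follows. The group names and per-slice states are invented for illustration, and the helper names are hypothetical.

```python
from typing import Dict, List


def overall_uptime(group_slices: Dict[str, List[bool]]) -> List[bool]:
    """A slice is overall uptime only if every group is up in that slice."""
    return [all(states) for states in zip(*group_slices.values())]


def percentage(slices: List[bool]) -> float:
    """Share of uptime slices, as a percentage."""
    return 100.0 * sum(slices) / len(slices)


# Hypothetical groups with one downtime slice each, at different times.
groups = {
    "env:staging": [True, False, True, True],
    "env:dev": [True, True, True, False],
}
overall = overall_uptime(groups)

for name, slices in groups.items():
    print(name, percentage(slices))  # 75.0 for each group
print("overall", percentage(overall))  # 50.0
```

Because the downtime slices do not overlap, they add up in the overall figure, which is why overall uptime can be lower than the uptime of every individual group.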

### Corrections

Time Slice SLOs count correction periods as uptime in all calculations. Since the total time remains constant, the error budget is always a fixed amount of time as well. This is a significant simplification and improvement over how corrections are handled for monitor-based SLOs.

For monitor-based SLOs, corrections are periods that are removed from the calculation. If a one-day-long correction is added to a 7-day SLO, 1 hour of downtime counts as 0.7% of the total time:

$$ 60/8640 *100 = ~0.7% $$

instead of 0.6%:

$$ 60/10080 *100 = ~0.6% $$

The effects on error budget can be unusual. Removing time from an uptime SLO causes time dilation, where each minute of downtime represents a larger fraction of the total time.
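
The arithmetic above can be reproduced with a short sketch. The function names are hypothetical, the 99.9% target is an invented example, and the downtime is assumed to fall outside the correction period.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_share_monitor_based(
    downtime_minutes: float, window_days: int, correction_days: int
) -> float:
    """Monitor-based SLOs: correction periods are removed from the total time."""
    total_minutes = (window_days - correction_days) * MINUTES_PER_DAY
    return 100.0 * downtime_minutes / total_minutes


def downtime_share_time_slice(downtime_minutes: float, window_days: int) -> float:
    """Time Slice SLOs: corrections count as uptime, so total time is unchanged."""
    return 100.0 * downtime_minutes / (window_days * MINUTES_PER_DAY)


def error_budget_minutes(target_percent: float, window_days: int) -> float:
    """Fixed error budget, in minutes, when the total time stays constant."""
    return (100.0 - target_percent) / 100.0 * window_days * MINUTES_PER_DAY


print(round(downtime_share_monitor_based(60, 7, 1), 3))  # 0.694
print(round(downtime_share_time_slice(60, 7), 3))        # 0.595
print(round(error_budget_minutes(99.9, 7), 2))           # 10.08
```

With the total time held constant, the same hour of downtime always consumes the same share of the window, whether or not a correction is present.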

### Missing data

In Time Slice SLOs, missing data is always treated as uptime. On the timeline visualization, however, slices with missing data appear in gray.
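
A minimal sketch of this rule, assuming a missing slice is represented as `None` (a modeling choice for the example, not something the source specifies) and using a `value < threshold` uptime condition:

```python
from typing import Optional, Sequence, Tuple


def uptime_with_missing_data(
    slice_values: Sequence[Optional[float]], threshold: float
) -> Tuple[float, int]:
    """Return (uptime percentage, number of missing slices).

    Slices with no data (None) are counted as uptime.
    """
    if not slice_values:
        raise ValueError("at least one slice is required")
    missing = sum(1 for value in slice_values if value is None)
    good = sum(1 for value in slice_values if value is None or value < threshold)
    return 100.0 * good / len(slice_values), missing


# Five hypothetical slices: two have no data, one violates `value < 1`.
uptime, missing = uptime_with_missing_data([0.4, None, None, 1.5, 0.6], threshold=1)
print(uptime, missing)  # 80.0 2
```

Tracking the missing count separately mirrors the timeline, where such slices are shown in gray even though they count toward uptime.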

## Further reading

{{< partial name="whats-next/whats-next.html" >}}

[1]: /metrics/#time-and-space-aggregation