Bug 1683042 - Document Perfherder's data retention policy
ionutgoldan authored Jan 13, 2021
1 parent bd74f86 commit df7fe18
Showing 3 changed files with 56 additions and 2 deletions.
51 changes: 51 additions & 0 deletions docs/data_cycling.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
# Data retention policies

## On Perfherder

On a daily basis, Perfherder expires data for several reasons:

* data provides less value as it grows older
* data accumulates very fast (>1 million new data points are ingested daily)
* query latency degrades over time
* the database is rather limited (in terms of storage capacity & scalability)

To ensure persistence of the most relevant performance data, Perfherder's cycling algorithm takes a more aggressive approach towards less relevant data. It employs multiple expiring strategies, each specialized in deleting a specific set of data.

In other words, not all data is deleted in the same way: some data sets are kept longer than others.

Data targeted for removal includes:

* data points
* series (AKA performance signatures; they group data points sharing the same characteristics)
* alerts
* alert summaries

Generally, the daily cycling starts by removing data points, applying every defined strategy. It then removes series, alerts & alert summaries using a garbage collection approach.

### Cycling strategies

All of the following strategies target the `performance_datum` table, which stores the performance data points.

#### Generic

Removes data points older than 1 year.

#### Try data

Removes data points from try pushes that are older than 6 weeks.

#### Not actively sheriffed

Removes data points older than 6 months from repositories other than autoland, mozilla-central, mozilla-beta, fenix & reference-browser.

#### Stalled data

Removes data points from series that haven't ingested any new data points in the last 4 months.
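
The four strategies above can be pictured as a single decision per data point. The following is only a minimal sketch, not Perfherder's actual implementation: the `DataPoint` class, its field names, the `expiring_strategy` function, the repository name `"try"` and the day counts used for "1 year", "6 months" and "4 months" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Retention windows taken from the text above. The exact day counts used
# for "1 year", "6 months" and "4 months" are assumptions.
GENERIC_MAX_AGE = timedelta(days=365)
TRY_MAX_AGE = timedelta(weeks=6)
UNSHERIFFED_MAX_AGE = timedelta(days=180)
STALLED_SERIES_MAX_IDLE = timedelta(days=120)

# Repositories exempt from the "not actively sheriffed" strategy.
SHERIFFED_REPOS = frozenset(
    {"autoland", "mozilla-central", "mozilla-beta", "fenix", "reference-browser"}
)


@dataclass(frozen=True)
class DataPoint:
    """Hypothetical, simplified stand-in for a `performance_datum` row."""

    repository: str
    push_time: datetime
    series_last_updated: datetime  # when the parent series last ingested data


def expiring_strategy(point: DataPoint, now: datetime) -> Optional[str]:
    """Return the name of the first strategy that expires `point`, or None."""
    age = now - point.push_time
    if age > GENERIC_MAX_AGE:
        return "generic"
    if point.repository == "try" and age > TRY_MAX_AGE:
        return "try data"
    if point.repository not in SHERIFFED_REPOS and age > UNSHERIFFED_MAX_AGE:
        return "not actively sheriffed"
    if now - point.series_last_updated > STALLED_SERIES_MAX_IDLE:
        return "stalled data"
    return None
```

A data point survives the daily cycle only if the function returns `None`, that is, if no strategy claims it.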

### Garbage collection

Removes performance signatures which no longer have any data points linked to them. This cascades to the linked alerts, as they don't make sense without a parent series.

Removes alert summaries which no longer have any alerts linked to them.

These kinds of data pertain to the `performance_signature`, `performance_alert` & `performance_alert_summary` tables, respectively.
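
The garbage collection pass can be illustrated with plain in-memory collections. This is a rough sketch under stated assumptions: `collect_garbage` and its parameters are hypothetical names, and the real cleanup operates on database tables rather than Python sets and dicts.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical alert representation: alert id -> (signature id, summary id).
Alerts = Dict[int, Tuple[int, int]]


def collect_garbage(
    signature_ids: Set[int],
    data_point_signature_ids: Iterable[int],
    alerts: Alerts,
    summary_ids: Set[int],
) -> Tuple[Set[int], Alerts, Set[int]]:
    """Return the signatures, alerts and alert summaries that survive."""
    # 1. Keep only signatures that still have at least one data point.
    live_signatures = signature_ids & set(data_point_signature_ids)

    # 2. Removing a signature cascades to the alerts linked to it.
    live_alerts = {
        alert_id: (signature_id, summary_id)
        for alert_id, (signature_id, summary_id) in alerts.items()
        if signature_id in live_signatures
    }

    # 3. Keep only summaries that still have at least one alert.
    summaries_with_alerts = {summary_id for _, summary_id in live_alerts.values()}
    live_summaries = summary_ids & summaries_with_alerts

    return live_signatures, live_alerts, live_summaries
```

The order matters: summaries are checked last, because an alert summary can lose its final alert as a side effect of a signature being removed.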
3 changes: 2 additions & 1 deletion docs/infrastructure/architecture.md
@@ -72,7 +72,7 @@ There are also tasks that are run on a schedule, triggered via either:
The tasks it currently runs are:

- `update_bugscache` (hourly)
- - `cycle_data` (daily)
+ - [cycle_data] (daily)
- `run_intermittents_commenter` (daily)

2. The `celery_scheduler` dyno
@@ -91,6 +91,7 @@ There are also tasks that are run on a schedule, triggered via either:
[adjusting scheduled tasks]: administration.md#adjusting-scheduled-tasks
[one-off dynos]: https://devcenter.heroku.com/articles/one-off-dynos
[deps of bug 1176492]: https://bugzilla.mozilla.org/showdependencytree.cgi?id=1176492&hide_resolved=1
+ [cycle_data]: ../data_cycling.md

## Deployment lifecycle

4 changes: 3 additions & 1 deletion mkdocs.yml
@@ -44,7 +44,9 @@ nav:
- Architecture: 'infrastructure/architecture.md'
- Administration: 'infrastructure/administration.md'
- Troubleshooting: 'infrastructure/troubleshooting.md'
- - Accessing data: 'accessing_data.md'
+ - Data policies:
+   - Accessing data: 'accessing_data.md'
+   - Data retention: 'data_cycling.md'
- Submitting data: 'submitting_data.md'
- SETA: 'seta.md'
- Manual test cases: 'testcases.md'
