The web-platform-tests project has delegated the task of results
collection from Firefox and Chrome to the Taskcluster service. It is no
longer necessary for this project to collect results from those
browsers.
Decommission the infrastructure that was dedicated solely to collection
from Firefox and Chrome, and retain the machines used to collect from
Edge and Safari via Sauce Labs. Ensure that the Buildbot master does not
schedule any further collection from Firefox and Chrome.
One of the Ecosystem Infra team's goals for Q4 of 2018 was to streamline and generally improve the infrastructure by which we collect test results for wpt.fyi. This patch advances a key result toward that goal, "Complete migration of Chrome/Firefox runs to Taskcluster." Specifically:
In the process of conducting this research, I recognized that "authenticity" is necessary but not sufficient. In order to discontinue the results produced by this project, we also need confidence in the availability of results from Taskcluster.
The following report describes where we stand in terms of both metrics. We can use that data to decide whether we are ready to decommission this collection (by merging this pull request), and if we are not, we can have a more concrete discussion about the preconditions for doing so.
Report: Viability of full commitment to Taskcluster for results collection in WPT
The wpt/results-collector project has been uploading test results to https://wpt.fyi for over a year. During the summer, WPT integrated with the Taskcluster service to collect results for Chrome and Firefox. Shortly after, wpt.fyi began ingesting those results.
Our goal is to decommission the collection from the results-collector project, but before we do, we would like some assurance that this will not result in a degradation of service. Specifically, that means:
This report defines metrics for both of these traits, reports the current values of those metrics, and documents the process by which the metrics were designed and gathered.
Accuracy
Metric: percentage of tests which have the same status and subtest results in Buildbot and Taskcluster (for the same WPT revision and browser release).
Score: 89.95%
The results for the same browsers and the same revision of WPT do indeed differ, as can be seen using the "diff" visualization on wpt.fyi:
https://wpt.fyi/results/?product=firefox[buildbot,stable]@e94ae4b34e&product=firefox[taskcluster,stable]@e94ae4b34e&diff
https://wpt.fyi/results/?product=firefox[buildbot,experimental]@e94ae4b34e&product=firefox[taskcluster,experimental]@e94ae4b34e&diff
https://wpt.fyi/results/?product=chrome[buildbot,stable]@e94ae4b34e&product=chrome[taskcluster,stable]@e94ae4b34e&diff
https://wpt.fyi/results/?product=chrome[buildbot,experimental]@e94ae4b34e&product=chrome[taskcluster,experimental]@e94ae4b34e&diff
Many of these discrepancies can be explained by two differences in the way each system runs the tests:
In gh-13013, we're discussing the trade-offs of configuring Taskcluster to restart as aggressively as results-collection does. As for test scheduling: ideally, execution order should not affect test outcomes. The fact that it does is a problem that is largely test-specific and orthogonal to any comparison of the two systems.
The discrepancies resulting from those two distinctions are therefore less relevant for the purposes of this evaluation. To help focus on the other potential differences, I configured Taskcluster to mimic the results-collection project's approach on a dedicated branch of wpt. Here are the comparisons of results produced by results-collection and that altered Taskcluster configuration:
https://staging.wpt.fyi/results/?products=firefox[stable,buildbot]@e94ae4b34e,firefox[stable,jugglinmike]&diff
(184 tests with differing statuses, 34 tests with differing subtest results)
https://staging.wpt.fyi/results/?products=firefox[experimental,buildbot]@e94ae4b34e,firefox[experimental,jugglinmike]&diff
(830 tests with differing statuses, 32 tests with differing subtest results)
https://staging.wpt.fyi/results/?products=chrome[stable,buildbot]@e94ae4b34e,chrome[stable,jugglinmike]&diff
(32 tests with differing statuses, 35 tests with differing subtest results)
https://staging.wpt.fyi/results/?products=chrome[experimental,buildbot]@e94ae4b34e,chrome[experimental,jugglinmike]&diff
(36 tests with differing statuses, 61 tests with differing subtest results)
These numbers were used to calculate the score reported above:
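As a rough sketch of how such a score can be derived, the snippet below treats accuracy as the share of compared tests whose statuses and subtest results agree. The `total` values are hypothetical placeholders (not the actual number of tests compared), so the output will not reproduce the 89.95% figure, and the real calculation may weight runs differently:

```js
// Sketch only: accuracy as the share of compared tests whose results agree
// between Buildbot and Taskcluster. The `total` values are hypothetical
// placeholders, not the actual counts from these runs, so this will not
// reproduce the 89.95% reported above.
const comparisons = [
  { name: 'firefox-stable',       differing: 184 + 34, total: 30000 },
  { name: 'firefox-experimental', differing: 830 + 32, total: 30000 },
  { name: 'chrome-stable',        differing: 32 + 35,  total: 30000 },
  { name: 'chrome-experimental',  differing: 36 + 61,  total: 30000 },
];

const differing = comparisons.reduce((sum, c) => sum + c.differing, 0);
const total = comparisons.reduce((sum, c) => sum + c.total, 0);

console.log(`Accuracy: ${(100 * (1 - differing / total)).toFixed(2)}%`);
```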
Note that a few days passed between collecting the original results-collection dataset and the Taskcluster-mimicking-results-collection dataset. For the experimental channels (i.e. Firefox Nightly and Chrome Dev), the browser under test was therefore not held constant across those trials, making direct comparison somewhat suspect. I've focused on the stable releases for this reason.
The largest disparities come from `html/` in Firefox and `css/` in both browsers. The `html/` differences are partially due to one particularly memory-intensive test (the limited memory of the Buildbot workers causes them to time out inconsistently; see issue 616 in the results-collection project). The `css/` disparities can be at least partially attributed to subtle differences in graphics rendering; my best guess is that this comes from differences in hardware. That makes the affected tests flaky. Like the tests that are influenced by execution order, these are problems, but not ones that should block this effort. An additional source of discrepancy appears to be caused by some aspect of Docker; we're investigating that via issue 14485 in the web-platform-tests repository.

I've been researching each of the differences, and so far, they have all been due to flakiness. In other words, they are not caused by any aspect of either system:

- [resource-timing] Test behavior with cached assets
- [resource-timing] Avoid race condition

There are still more discrepancies to investigate. It's difficult to predict whether a given discrepancy indicates an environmental error, but there are diminishing returns to resolving one-off stability problems.
All of these caveats suggest that the 89.95% score is a lower bound and that the actual accuracy is substantially higher.
Reliability
Metric: percentage of commits landed via non-fast-forward merges for which there are Taskcluster-computed results on https://wpt.fyi.
Score: 93.71%
I wrote a Node.js script (included below) to inspect commits made to WPT's `master` branch. For each commit it encounters, the script queries wpt.fyi for results uploaded by Taskcluster for Chrome ("dev" channel) and Firefox ("Nightly" channel). It labels each commit based on the results it finds.

Some commits are landed to `master` via a fast-forward merge. Such commits will not trigger work in Taskcluster, and they will not be considered by the wpt.fyi "revision announcer". A lack of results for those commits is therefore not cause for concern, so the script does not consider them when calculating reliability.

For the 500 commits between 2018-11-07 and 2018-12-12 (4052654..c8c9db4):
This indicates a failure rate of ~14.14%.
In the following visualizations, each commit is rendered on a time scale. A value of "1" indicates that results are available, a value of "0.5" indicates that results are not available because the commit was introduced via a fast-forward merge, and a value of "0" indicates that results are not available for an unknown reason.
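As a sketch, the mapping from each commit to its plotted value is roughly the following; the property names are illustrative, not the script's actual data model:

```js
// Sketch: the value plotted for each commit in the visualizations below.
// Property names are illustrative, not the script's actual data model.
function plottedValue(commit) {
  if (commit.hasTaskclusterResults) {
    return 1;    // results are available on wpt.fyi
  }
  if (commit.landedViaFastForward) {
    return 0.5;  // no results expected, so not counted as a failure
  }
  return 0;      // results missing for an unknown reason
}
```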
Over the course of this quarter, we collaborated with the Taskcluster maintainers to diagnose some reliability issues with that system's GitHub.com integration (see, e.g., bug 1499576 and bug 1507254). The last of those issues was resolved on 2018-11-28. If we limit the calculation to the smaller set of commits merged to `master` since that date, the score improves significantly.

For the 192 commits between 2018-11-28 and 2018-12-12 (4052654..09972ca):
This indicates a failure rate of ~6.29%.
Reliability measurement script
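In outline, the approach looks something like the following minimal sketch. The wpt.fyi query parameters, the heuristic for recognizing fast-forward landings, and the commit-range handling are assumptions for illustration and may differ from the actual script:

```js
'use strict';
// Sketch only: counts commits in a WPT checkout that have (or lack)
// Taskcluster-uploaded results on wpt.fyi. Assumes Node.js >= 8,
// a local WPT checkout, and that /api/runs accepts `sha`, `product`,
// and `label` query parameters -- the real script may differ.
const { execSync } = require('child_process');
const https = require('https');

// Fetch a URL and resolve with the parsed JSON body, or null on a non-200 response.
function getJson(url) {
  return new Promise((resolve, reject) => {
    https.get(url, (res) => {
      let body = '';
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => {
        resolve(res.statusCode === 200 ? JSON.parse(body) : null);
      });
    }).on('error', reject);
  });
}

// List commits in the given range. This sketch *assumes* that commits landed
// via fast-forward can be recognized as non-merge commits (a single parent);
// the real script's heuristic may differ.
function listCommits(range) {
  const out = execSync(`git log --format="%H %P" ${range}`, { encoding: 'utf8' });
  return out.trim().split('\n').map((line) => {
    const [sha, ...parents] = line.split(' ');
    return { sha, isMergeCommit: parents.length > 1 };
  });
}

async function main() {
  const commits = listCommits('4052654..c8c9db4');
  let withResults = 0;
  let withoutResults = 0;
  let skipped = 0;

  for (const commit of commits) {
    if (!commit.isMergeCommit) {
      skipped += 1;  // treated as a fast-forward landing (assumption)
      continue;
    }
    // wpt.fyi historically identified runs by the first 10 characters of the SHA.
    const base = `https://wpt.fyi/api/runs?sha=${commit.sha.slice(0, 10)}&label=taskcluster`;
    const chrome = await getJson(`${base}&product=chrome[dev]`);
    const firefox = await getJson(`${base}&product=firefox[nightly]`);
    if (chrome && chrome.length && firefox && firefox.length) {
      withResults += 1;
    } else {
      withoutResults += 1;
    }
  }

  const considered = withResults + withoutResults;
  console.log(`Results available:   ${withResults}`);
  console.log(`Results missing:     ${withoutResults}`);
  console.log(`Fast-forward merges: ${skipped}`);
  console.log(`Failure rate: ${(100 * withoutResults / considered).toFixed(2)}%`);
}

main().catch((err) => { console.error(err); process.exit(1); });
```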