The web-platform-tests project has delegated the task of results
collection from Firefox and Chrome to the Taskcluster service. It is no
longer necessary for this project to collect results from those
browsers.
Decommission the infrastructure that was dedicated solely to collection
from Firefox and Chrome, and retain the machines used to collect from
Edge and Safari via Sauce Labs. Ensure that the Buildbot master does not
schedule any further collection from Firefox and Chrome.
One of the Ecosystem Infra team's goals for Q4 of 2018 was to streamline and generally improve the infrastructure by which we collect test results for wpt.fyi. This patch advances a key result toward that goal, "Complete migration of Chrome/Firefox runs to Taskcluster." Specifically:
In the process of conducting this research, I recognized that "authenticity" is necessary but not sufficient. In order to discontinue the results produced by this project, we also need confidence in the availability of results from Taskcluster.
The following report describes where we stand in terms of both metrics. We can use that data to decide whether we are ready to decommission this collection (by merging this pull request), and if we are not, we can have a more concrete discussion about the preconditions for doing so.
Report: Viability of full commitment to Taskcluster for results collection in WPT
The wpt/results-collector project has been uploading test results to https://wpt.fyi for over a year. During the summer, WPT integrated with the Taskcluster service to collect results for Chrome and Firefox. Shortly after, wpt.fyi began ingesting those results.
Our goal is to decommission the collection from the results-collector project, but before we do, we would like some assurance that this will not result in a degradation of service. Specifically, that means:
This report defines metrics for both of these traits, reports the current values of those metrics, and documents the process by which the metrics were designed and gathered.
Accuracy
Metric: percentage of tests which have the same status and subtest results in Buildbot and Taskcluster (for the same WPT revision and browser release).
Score: 89.95%
The results for the same browsers and the same revision of WPT do indeed differ, as can be seen using the "diff" visualization on wpt.fyi:
https://wpt.fyi/results/?product=firefox[buildbot,stable]@e94ae4b34e&product=firefox[taskcluster,stable]@e94ae4b34e&diff
https://wpt.fyi/results/?product=firefox[buildbot,experimental]@e94ae4b34e&product=firefox[taskcluster,experimental]@e94ae4b34e&diff
https://wpt.fyi/results/?product=chrome[buildbot,stable]@e94ae4b34e&product=chrome[taskcluster,stable]@e94ae4b34e&diff
https://wpt.fyi/results/?product=chrome[buildbot,experimental]@e94ae4b34e&product=chrome[taskcluster,experimental]@e94ae4b34e&diff
Many of these discrepancies can be explained by two differences in the way each system runs the tests:
In gh-13013, we're discussing the trade-offs of configuring Taskcluster to restart as aggressively as results-collection does. As for test scheduling: ideally, execution order should not affect test outcomes. The fact that it does is a problem that is largely test-specific and orthogonal to any comparison of the two systems.
The discrepancies resulting from those two distinctions are therefore less relevant for the purposes of this evaluation. To help focus on the other potential differences, I configured Taskcluster to mimic the results-collection project's approach on a dedicated branch of wpt. Here are the comparisons of results produced by results-collection and that altered Taskcluster configuration:
https://staging.wpt.fyi/results/?products=firefox[stable,buildbot]@e94ae4b34e,firefox[stable,jugglinmike]&diff
(184 tests with differing statuses, 34 tests with differing subtest results)
https://staging.wpt.fyi/results/?products=firefox[experimental,buildbot]@e94ae4b34e,firefox[experimental,jugglinmike]&diff
(830 tests with differing statuses, 32 tests with differing subtest results)
https://staging.wpt.fyi/results/?products=chrome[stable,buildbot]@e94ae4b34e,chrome[stable,jugglinmike]&diff
(32 tests with differing statuses, 35 tests with differing subtest results)
https://staging.wpt.fyi/results/?products=chrome[experimental,buildbot]@e94ae4b34e,chrome[experimental,jugglinmike]&diff
(36 tests with differing statuses, 61 tests with differing subtest results)
These numbers were used to calculate the score reported above:
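As a rough sketch of how such a score can be derived, the snippet below treats accuracy as the share of compared tests whose statuses and subtest results agree. The `total` values are hypothetical placeholders (not the actual number of tests compared), so the output will not reproduce the 89.95% figure, and the real calculation may weight runs differently:

```js
// Sketch only: accuracy as the share of compared tests whose results agree
// between Buildbot and Taskcluster. The `total` values are hypothetical
// placeholders, not the actual counts from these runs, so this will not
// reproduce the 89.95% reported above.
const comparisons = [
  { name: 'firefox-stable',       differing: 184 + 34, total: 30000 },
  { name: 'firefox-experimental', differing: 830 + 32, total: 30000 },
  { name: 'chrome-stable',        differing: 32 + 35,  total: 30000 },
  { name: 'chrome-experimental',  differing: 36 + 61,  total: 30000 },
];

const differing = comparisons.reduce((sum, c) => sum + c.differing, 0);
const total = comparisons.reduce((sum, c) => sum + c.total, 0);

console.log(`Accuracy: ${(100 * (1 - differing / total)).toFixed(2)}%`);
```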
Note that a few days passed between collecting the original results-collection dataset and the Taskcluster-mimicking-results-collection dataset. For the experimental channels (i.e. Firefox Nightly and Chrome Dev), the browser under test was therefore not held constant across those trials, making direct comparison somewhat suspect. I've focused on the stable releases for this reason.
The largest disparities come from `html/` in Firefox and `css/` in both browsers. The `html/` differences are partially due to one particularly memory-intensive test (the limited memory of the Buildbot workers causes them to time out inconsistently; see issue 616 in the results-collection project). The `css/` disparities can be at least partially attributed to subtle differences in graphics rendering; my best guess is that this comes from differences in hardware. That makes the affected tests flaky. Like the tests that are influenced by execution order, these are problems, but not ones that should block this effort. An additional source of discrepancy appears to be caused by some aspect of Docker; we're investigating that via issue 14485 in the web-platform-tests repository.

I've been researching each of the differences, and so far, they have all been due to flakiness. In other words, they are not caused by any aspect of either system:

- [resource-timing] Test behavior with cached assets
- [resource-timing] Avoid race condition

There are still more discrepancies to investigate. It's difficult to predict whether a given discrepancy indicates an environmental error, but there are diminishing returns to resolving one-off stability problems.
All of these caveats suggest that the 89.95% score is a lower bound and that the actual accuracy is substantially higher.
Reliability
Metric: percentage of commits landed via non-fast-forward merges for which there are Taskcluster-computed results on https://wpt.fyi.
Score: 93.71%
I wrote a Node.js script (included below) to inspect commits made to WPT's `master` branch. For each commit it encounters, the script queries wpt.fyi for results uploaded by Taskcluster for Chrome ("dev" channel) and Firefox ("Nightly" channel). It labels each commit based on the results it finds.

Some commits are landed to `master` via a fast-forward merge. Such commits will not trigger work in Taskcluster, and they will not be considered by the wpt.fyi "revision announcer". A lack of results for those commits is therefore not cause for concern, so the script does not consider them when calculating reliability.

For the 500 commits between 2018-11-07 and 2018-12-12 (4052654..c8c9db4):
This indicates a failure rate of ~14.14%.
In the following visualizations, each commit is rendered on a time scale. A value of "1" indicates that results are available, a value of "0.5" indicates that results are not available because the commit was introduced via a fast-forward merge, and a value of "0" indicates that results are not available for an unknown reason.
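As a sketch, the mapping from each commit to its plotted value is roughly the following; the property names are illustrative, not the script's actual data model:

```js
// Sketch: the value plotted for each commit in the visualizations below.
// Property names are illustrative, not the script's actual data model.
function plottedValue(commit) {
  if (commit.hasTaskclusterResults) {
    return 1;    // results are available on wpt.fyi
  }
  if (commit.landedViaFastForward) {
    return 0.5;  // no results expected, so not counted as a failure
  }
  return 0;      // results missing for an unknown reason
}
```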
Over the course of this quarter, we collaborated with the Taskcluster maintainers to diagnose some reliability issues with that system's GitHub.com integration (see, e.g., bug 1499576 and bug 1507254). The last of those issues was resolved on 2018-11-28. If we limit the calculation to the smaller set of commits merged to `master` since that date, the score improves significantly.

For the 192 commits between 2018-11-28 and 2018-12-12 (4052654..09972ca):
This indicates a failure rate of ~6.29%.
Reliability measurement script
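In outline, the approach looks something like the following minimal sketch. The wpt.fyi query parameters, the heuristic for recognizing fast-forward landings, and the commit-range handling are assumptions for illustration and may differ from the actual script:

```js
'use strict';
// Sketch only: counts commits in a WPT checkout that have (or lack)
// Taskcluster-uploaded results on wpt.fyi. Assumes Node.js >= 8,
// a local WPT checkout, and that /api/runs accepts `sha`, `product`,
// and `label` query parameters -- the real script may differ.
const { execSync } = require('child_process');
const https = require('https');

// Fetch a URL and resolve with the parsed JSON body, or null on a non-200 response.
function getJson(url) {
  return new Promise((resolve, reject) => {
    https.get(url, (res) => {
      let body = '';
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => {
        resolve(res.statusCode === 200 ? JSON.parse(body) : null);
      });
    }).on('error', reject);
  });
}

// List commits in the given range. This sketch *assumes* that commits landed
// via fast-forward can be recognized as non-merge commits (a single parent);
// the real script's heuristic may differ.
function listCommits(range) {
  const out = execSync(`git log --format="%H %P" ${range}`, { encoding: 'utf8' });
  return out.trim().split('\n').map((line) => {
    const [sha, ...parents] = line.split(' ');
    return { sha, isMergeCommit: parents.length > 1 };
  });
}

async function main() {
  const commits = listCommits('4052654..c8c9db4');
  let withResults = 0;
  let withoutResults = 0;
  let skipped = 0;

  for (const commit of commits) {
    if (!commit.isMergeCommit) {
      skipped += 1;  // treated as a fast-forward landing (assumption)
      continue;
    }
    // wpt.fyi historically identified runs by the first 10 characters of the SHA.
    const base = `https://wpt.fyi/api/runs?sha=${commit.sha.slice(0, 10)}&label=taskcluster`;
    const chrome = await getJson(`${base}&product=chrome[dev]`);
    const firefox = await getJson(`${base}&product=firefox[nightly]`);
    if (chrome && chrome.length && firefox && firefox.length) {
      withResults += 1;
    } else {
      withoutResults += 1;
    }
  }

  const considered = withResults + withoutResults;
  console.log(`Results available:   ${withResults}`);
  console.log(`Results missing:     ${withoutResults}`);
  console.log(`Fast-forward merges: ${skipped}`);
  console.log(`Failure rate: ${(100 * withoutResults / considered).toFixed(2)}%`);
}

main().catch((err) => { console.error(err); process.exit(1); });
```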