This document describes how to implement the MLPerf Training Suite using an ML framework and how to use that implementation to measure the performance of an ML software framework or hardware.
There are separate rules for the submission, review, and publication process for all MLPerf benchmarks here.
The MLPerf name and logo are trademarks. In order to refer to a result using the MLPerf name, the result must conform to the letter and spirit of the rules specified in this document. The MLPerf organization reserves the right to solely determine if a use of its name or logo is acceptable.
The following definitions are used throughout this document:
Performance always refers to execution speed.
Quality always refers to a model’s ability to produce “correct” outputs.
A system consists of a defined set of hardware resources such as processors, memories, disks, and interconnect. It also includes specific versions of all software, such as the operating system, compilers, libraries, and drivers, that significantly influence the running time of a benchmark, excluding the ML framework.
A framework is a specific version of a software library or set of related libraries, possibly with an associated offline compiler, for training ML models using a system. Examples include specific versions of Caffe2, MXNet, PaddlePaddle, PyTorch, or TensorFlow.
A benchmark is an abstract problem that can be solved using ML by training a model based on a specific dataset or simulation environment to a target quality level.
A suite is a specific set of benchmarks.
A division is a set of rules for implementing benchmarks from a suite to produce a class of comparable results.
A reference implementation is a specific implementation of a benchmark provided by the MLPerf organization.
A benchmark implementation is an implementation of a benchmark in a particular framework by a user under the rules of a specific division.
A submission implementation set is a set of benchmark implementations for one or more benchmarks from a suite under the rules of a specific division using the same framework.
A run is a complete execution of an implementation on a system, training a model from initialization to the quality target.
A run result is the wallclock time required for a run.
A reference result is the result provided by the MLPerf organization for each reference implementation.
A benchmark result is the mean of a benchmark-specific number of run results, dropping the highest and lowest results. The result is then normalized to the reference result for that benchmark. Normalization is of the form (reference result / benchmark result) such that a better benchmark result produces a higher number.
A submission result set is one benchmark result for each benchmark implementation in a submission implementation set.
A submission is a submission implementation set and a corresponding submission result set.
A custom summary result is the weighted geometric mean of an arbitrary set of results from a specific submission. MLPerf endorses this methodology for computing custom summary results but does not endorse any official summary result.
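To make the arithmetic concrete, here is a minimal sketch (illustrative numbers and function names only, not part of the rules) of normalizing a benchmark result and combining several normalized results into a custom weighted geometric mean:

```python
import math

def normalized_result(reference_seconds, benchmark_seconds):
    # Normalization is (reference result / benchmark result): faster runs score higher.
    return reference_seconds / benchmark_seconds

def weighted_geometric_mean(results, weights):
    # exp of the weighted average of the logs; the weights are chosen by whoever builds the summary.
    total = sum(weights)
    return math.exp(sum(w * math.log(r) for r, w in zip(results, weights)) / total)

# Purely illustrative values: two normalized benchmark results, weighted equally.
scores = [normalized_result(1000.0, 250.0), normalized_result(2000.0, 800.0)]
print(weighted_geometric_mean(scores, [1.0, 1.0]))  # ~3.16
```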
The following rules apply to all benchmark implementations.
Benchmarking should be conducted to measure the framework and system performance as fairly as possible. Ethics and reputation matter.
The same system and framework must be used for a submission result set. Note that the reference implementations do not all use the same framework.
The framework and system should not detect and behave differently for benchmarks.
Unless part of the definition of a benchmark, the implementation should not encode any information about the content of the dataset or a successful model’s state in any form. High-level statistical information about the dataset, such as distribution of sizes, may be used.
For benchmarks which are defined as starting from a fixed set of weights, such as a checkpoint or backbone, the implementation should start from the weights provided in the benchmark reference definition, or, if that is not possible, provide information and code sufficient for reproducing how those starting weights were obtained. For v0.7, sets of weights used in v0.6 are allowed.
The benchmark suite consists of the benchmarks shown in the following table.
Area | Problem | Dataset |
---|---|---|
Vision | Image classification | ImageNet |
Vision | Image segmentation (medical) | KiTS19 |
Vision | Object detection (light weight) | COCO |
Vision | Object detection (heavy weight) | COCO |
Language | Speech recognition | LibriSpeech |
Language | NLP | Wikipedia 2020/01/01 |
Commerce | Recommendation | 1TB Click Logs |
Research | Reinforcement learning | Go |
The MLPerf organization provides a reference implementation of each benchmark, which includes the following elements:
Code that implements the model in a framework.
A plain text “README.md” file that describes:
- Problem
  - Dataset/Environment
  - Publication/Attribution
  - Data preprocessing
  - Training and test data separation
  - Training data order
  - Test data order
  - Simulation environment (RL models only)
  - Steps necessary for reproducing the initial set of weights, if an initial set of non-standard weights is used. For v0.7, weights from v0.6 may be used without this information.
- Model
  - Publication/Attribution
  - List of layers
  - Weight and bias initialization
  - Loss function
  - Optimizer
- Quality
  - Quality metric
  - Quality target
  - Evaluation frequency (training items between quality evaluations)
  - Evaluation thoroughness (test items per quality evaluation)
- Directions
  - Steps to configure machine
  - Steps to download and verify data
  - Steps to run and time
A “download_dataset” script that downloads the dataset.
A “verify_dataset” script that verifies the dataset against the checksum.
A “run_and_time” script that executes the benchmark and reports the wall-clock time.
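The exact scripts differ per benchmark, but a "verify_dataset" step typically recomputes file checksums and compares them to a published manifest. A minimal sketch, assuming a hypothetical checksums.txt of "<md5> <filename>" pairs:

```python
import hashlib
import sys

def md5sum(path, chunk_size=1 << 20):
    # Stream the file so large archives do not need to fit in memory.
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(manifest_path):
    ok = True
    with open(manifest_path) as manifest:
        for line in manifest:
            expected, name = line.split()
            actual = md5sum(name)
            if actual != expected:
                print(f"MISMATCH: {name} expected {expected} got {actual}")
                ok = False
    return ok

if __name__ == "__main__":
    sys.exit(0 if verify("checksums.txt") else 1)
```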
There are two divisions of the benchmark suite, the Closed division and the Open division.
The Closed division requires using the same preprocessing, model, training method, and quality target as the reference implementation.
The Closed division models and quality targets are:
Area | Problem | Model | Target |
---|---|---|---|
Vision | Image classification | ResNet-50 v1.5 | 75.90% classification |
Vision | Image segmentation (medical) | U-Net3D | 0.908 Mean DICE score |
Vision | Object detection (light weight) | SSD | 23.0% mAP |
Vision | Object detection (heavy weight) | Mask R-CNN | 0.377 Box min AP and 0.339 Mask min AP |
Language | Speech recognition | RNN-T | 0.058 Word Error Rate |
Language | NLP | BERT | 0.712 Mask-LM accuracy |
Commerce | Recommendation | DLRM | 0.8025 AUC |
Research | Reinforcement learning | Mini Go (based on Alpha Go paper) | 50% win rate vs. checkpoint |
Closed division benchmarks must be referred to using the benchmark name plus the term Closed, e.g. “for the Image Classification Closed benchmark, the system achieved a result of 7.2.”
The Open division allows using arbitrary training data, preprocessing, model, and/or training method. However, the Open division still requires using supervised or reinforcement machine learning in which a model is iteratively improved based on training data, simulation, or self-play.
Open division benchmarks must be referred to using the benchmark name plus the term Open, e.g. “for the Image Classification Open benchmark, the system achieved a result of 7.2.”
CLOSED: Random numbers must be generated using stock random number generators.
Random number generators may be seeded from the following sources:
- Clock
- System source of randomness, e.g. /dev/random or /dev/urandom
- Another random number generator initialized with an allowed seed
Random number generators may be initialized repeatedly in multiple processes or threads. For a single run, the same seed may be shared across multiple processes or threads.
OPEN: Any random number generation may be used.
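For example, a Closed-division implementation might seed a stock generator from one of the allowed sources and share that seed across workers; this is an illustrative sketch, not a required pattern:

```python
import os
import random
import time

# Allowed seed sources: the clock or a system source of randomness.
seed = int(time.time())                            # clock-based seed
# seed = int.from_bytes(os.urandom(8), "little")   # or /dev/urandom via os.urandom

# A stock generator seeded from an allowed source; the same seed may be
# shared by every process or thread of a single run.
rng = random.Random(seed)

# Another stock generator may be initialized from this one (an allowed seed source).
worker_rng = random.Random(rng.getrandbits(64))
print(seed, worker_rng.random())
```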
CLOSED: The numerical formats fp64, fp32, tf32, fp16, bfloat16, Graphcore FLOAT 16.16, int8, uint8, int4, and uint4 are pre-approved for use. Additional formats require explicit approval. Scaling may be added where required to compensate for different precision.
OPEN: Any format and scaling may be used.
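As an example of the kind of scaling referred to above, fp16 training commonly uses loss scaling so that small gradients remain representable. A rough PyTorch sketch (assumes a CUDA device; not a required pattern):

```python
import torch

model = torch.nn.Linear(128, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling keeps fp16 gradients representable

data = torch.randn(32, 128, device="cuda")
target = torch.randint(0, 10, (32,), device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():       # forward pass runs in reduced precision where safe
    loss = torch.nn.functional.cross_entropy(model(data), target)

scaler.scale(loss).backward()         # backward on the scaled loss
scaler.step(optimizer)                # unscales gradients, then applies the update
scaler.update()                       # adjusts the scale factor for the next iteration
```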
CLOSED: Each reference implementation includes a script to download the input dataset and script to verify the dataset using a checksum. The data must then be preprocessed in a manner consistent with the reference implementation, excepting any transformations that must be done for each run (e.g. random transformations). The data may also be reformatted for the target system provided that the reformatting does not introduce new information or introduce duplicate copies of data.
OPEN: Any public dataset may be used for training the model; however, the evaluation data must be drawn from the benchmark dataset in a manner consistent with the reference.
You must flush the cache or restart the system prior to benchmarking. Data can start on any durable storage system such as local disks and cloud storage systems. This explicitly excludes RAM.
Only preprocessing that must be done for each run (e.g. random transformations) must be included in the timed portion of the run.
CLOSED: The same preprocessing steps as the reference implementation must be used.
OPEN: Any preprocessing steps are allowed for training data. However, each datum must be preprocessed individually in a manner that is not influenced by any other data. The evaluation data must be preprocessed in a manner consistent with the reference.
CLOSED: Images must have the same size as in the reference implementation. Mathematically equivalent padding of images is allowed.
CLOSED: For benchmarks with sequence inputs, you may choose a length N and either truncate all examples to length N or throw out all examples which exceed length N. This must be done uniformly for all examples. This may only be done on the training set and not the evaluation set.
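For illustration, the two permitted length policies might look like this (a sketch; `examples` and `MAX_LEN` are hypothetical names, and the policy applies to the training set only):

```python
MAX_LEN = 512  # the chosen N; must be applied uniformly to all training examples

def truncate_to_n(examples, n=MAX_LEN):
    # Option 1: truncate every training example to length n.
    return [seq[:n] for seq in examples]

def drop_longer_than_n(examples, n=MAX_LEN):
    # Option 2: discard every training example longer than n.
    return [seq for seq in examples if len(seq) <= n]

# The evaluation set must be left untouched.
```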
CLOSED: Two ways to represent the Mask R-CNN mask are permitted. One is a polygon and the other is a scalable bitmask.
OPEN: The Closed division data representation restrictions apply only at the start of the run. Data may be represented in an arbitrary fashion during the run.
Input encoding data, such as a language vocabulary or the set of possible labels, may be used during preprocessing or execution without counting as "touching the training data" for timing purposes.
CLOSED: If applicable, the dataset must be separated into training and test sets in the same manner as the reference implementation.
OPEN: If applicable, the test dataset must be extracted in the same manner as the reference implementation. The training dataset may not contain data that appears in the test set.
CLOSED: The training and test data must be traversed in the same conceptual order as the reference implementation. For instance, the data might be traversed sequentially or randomly with uniform distribution. Batch size, shard size, and the random number generator will affect order.
Where data pipelines randomly order data, arbitrary sharding, batching, and packing are allowed provided that (1) the data is still overall randomly ordered and not ordered to improve convergence and (2) each datum still appears exactly once.
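A minimal sketch of a compliant shuffle-then-shard scheme (illustrative only): one global random permutation per epoch keeps the overall order random, and a contiguous split of that permutation guarantees each datum appears exactly once.

```python
import random

def shard_indices(num_samples, num_shards, epoch, seed):
    # One global random permutation per epoch: order stays random overall and is
    # not crafted to improve convergence.
    order = list(range(num_samples))
    random.Random(seed + epoch).shuffle(order)
    # Contiguous split of the permutation: every datum lands in exactly one shard.
    per_shard = (num_samples + num_shards - 1) // num_shards
    return [order[i * per_shard:(i + 1) * per_shard] for i in range(num_shards)]

shards = shard_indices(num_samples=1000, num_shards=8, epoch=0, seed=1234)
assert sorted(i for shard in shards for i in shard) == list(range(1000))
```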
For DLRM the submissions are allowed to use a preshuffled dataset and are not obligated to shuffle the data once more during training. However, the reference implementation uses both preshuffled data and an approximate "batch shuffle" performed on-the-fly. Reference runs should also use a different seed in each run, so that the order of the training batches in each reference run is different. Even though the submissions are allowed to not shuffle the data on-the-fly, they are obligated to match the convergence behavior of the reference which does perform on-the-fly "batch-shuffle". Using a preshuffled dataset with a hand-crafted, advantageous data ordering is disallowed.
OPEN: The training data may be traversed in any order. The test data must be traversed in the same order as the reference implementation.
CLOSED: The implementation must use the same RL algorithm and simulator or game as the reference implementation, with the same parameters.
OPEN: The implementation may use a different RL algorithm but must use the same simulator or game with the same parameters. If the reference implementation generates all data online, the Open division implementation must also generate all data online.
It is allowed and encouraged to parallelize and otherwise optimize (e.g. by implementing in a compiled language) the RL environment provided that the semantics are preserved.
CLOSED: The benchmark implementation must use the same model as the reference implementation, as defined by the remainder of this section.
OPEN: The benchmark implementation may use a different model.
CLOSED: Each of the current frameworks has a graph that describes the operations performed during the forward propagation of training. The frameworks automatically infer and execute the corresponding back-propagation computations from this graph. Benchmark implementations must use the same graph as the reference implementation.
CLOSED: Weights and biases must be initialized using the same constant or random value distribution as the reference implementation, unless a pre-trained set of weights, such as a checkpoint or backbone, is used by the reference.
OPEN: Weights and biases must be initialized using a consistent constant or random value distribution.
CLOSED: Frameworks are free to optimize the non-weight parts of the computation graph provided that the changes are mathematically equivalent. So optimizations and graph / code transformations of the flavor of dead code elimination, common subexpression elimination, loop-invariant code motion, and recomputation of node state are entirely allowed.
OPEN: Frameworks are free to alter the graph.
CLOSED:
By default, the hyperparameters must be the same as the reference.
Hyperparameters include the optimizer used and values like the regularization norms and weight decays.
The implementation of the optimizer must match the optimizer specified in the Appendix: Allowed Optimizers. The Appendix lists which optimizers in the popular deep learning frameworks are compliant by default. If a submission uses an alternate implementation, the submitter must describe the optimizer’s equation and demonstrate equivalence with the approved optimizers on that list.
The following table lists the tunable hyperparameters for each allowed model/optimizer combination. The value of each tunable hyperparameter must meet the listed constraint (a sketch illustrating this style of constraint check appears after the table).
The MLPerf verifier script checks all hyperparameters except those whose names are marked with asterisks. If a hyperparameter is marked with one asterisk, it must be checked manually. If a hyperparameter is marked with two asterisks, it is also not logged and must be checked manually in the code. If the verifier and the constraints in this table differ, the verifier (specifically, the version on the date of submission unless otherwise decided by the review committee) is the source of truth.
Model | Optimizer | Name | Constraint | Definition | Reference Code |
---|---|---|---|---|---|
bert | lamb | global_batch_size | unconstrained | The global batch size for training. | --train_batch_size |
bert | lamb | opt_base_learning_rate | unconstrained | The base learning rate. | --learning_rate |
bert | lamb | opt_epsilon | unconstrained | adam epsilon | |
bert | lamb | opt_learning_rate_training_steps | unconstrained | Step at which you reach the lowest learning rate | |
bert | lamb | opt_learning_rate_warmup_steps | unconstrained | "num_warmup_steps" | |
bert | lamb | num_warmup_steps | unconstrained | Number of steps for linear warmup. | --num_warmup_steps |
bert | lamb | start_warmup_step | unconstrained | --start_warmup_step | --start_warmup_step |
bert | lamb | opt_lamb_beta_1 | unconstrained | adam beta1 | |
bert | lamb | opt_lamb_beta_2 | unconstrained | adam beta2 | |
bert | lamb | opt_lamb_weight_decay_rate | unconstrained | Weight decay | |
dlrm | sgd | global_batch_size | unconstrained | global batch size | |
dlrm | sgd | opt_base_learning_rate | unconstrained | base learning rate, this should be the learning rate after warm up and before decay | |
dlrm | sgd | opt_learning_rate_warmup_steps | unconstrained | Number of steps to go from 0 to sgd_opt_base_learning_rate with a linear warmup | See PR (From Intel and NV, TODO Link) |
dlrm | sgd | lr_decay_start_steps | unconstrained | step at which you start poly decay | See PR (From Intel and NV, TODO Link) |
dlrm | sgd | sgd_opt_base_learning_rate | unconstrained | learning rate at the start of poly decay | See PR (From Intel and NV, TODO Link) |
dlrm | sgd | sgd_opt_learning_rate_decay_poly_power | 2 | power of the poly decay | See PR (From Intel and NV, TODO Link) |
dlrm | sgd | sgd_opt_learning_rate_decay_steps | unconstrained | the step at which you reach the end learning rate | See PR (From Intel and NV, TODO Link) |
maskrcnn | sgd | global_batch_size | arbitrary constant | global version of reference SOLVER.IMS_PER_BATCH | |
maskrcnn | sgd | opt_learning_rate_decay_factor* | fixed to reference (0.1) | learning rate decay factor | |
maskrcnn | sgd | opt_learning_rate_decay_steps* | (60000, 80000) * (1 + K / 10) * 16 / global_batch_size where K is integer | Steps at which learning rate is decayed | |
maskrcnn | sgd | opt_base_learning_rate | 0.02 * K for any integer K | base learning rate, this should be the learning rate after warm up and before decay | |
maskrcnn | sgd | max_image_size* | fixed to reference | Maximum size of the longer side | |
maskrcnn | sgd | min_image_size* | fixed to reference | Maximum size of the shorter side | |
maskrcnn | sgd | num_image_candidates* | 1000 or 1000 * batches per chip | tunable number of region proposals for given batch size | |
maskrcnn | sgd | opt_learning_rate_warmup_factor | unconstrained | the constant factor applied at learning rate warm up | |
maskrcnn | sgd | opt_learning_rate_warmup_steps | unconstrained | number of steps for learning rate to warm up | |
maskrcnn | sgd | num_image_candidates* | (1000 or 2000) or (1000 * batches per chip) | tunable number of region proposals for given batch size | |
minigo | sgd | train_batch_size | integer > 0 | Batch size to use for training | |
minigo | sgd | lr_boundaries | unconstrained | The number of steps at which the learning rate will decay | |
minigo | sgd | lr_rates | unconstrained | The different learning rates | |
minigo | sgd | actual_selfplay_games_per_generation | integer >= 8192 (min_selfplay_games_per_generation) | "NOT A HYPERPARAMETER, CANNOT BE 'BORROWED' during review" Implicit (LOG ONLY) - total number of games played per epoch; many parameters can impact this, varies per iteration | N/A |
minigo | sgd | min_selfplay_games_per_generation* | fixed to reference (8192) | Minimum number of games to play for each training iteration | |
resnet | lars | lars_opt_base_learning_rate | arbitrary constant | Base "plr" in the PR linked. | |
resnet | lars | lars_opt_end_learning_rate* | fixed to reference | end learning rate for polynomial decay, implied mathematically from other HPs | N/A |
resnet | lars | lars_opt_learning_rate_decay_poly_power* | fixed to reference | power of polynomial decay, no link needed since not tunable | N/A |
resnet | lars | lars_epsilon* | fixed to reference | epsilon in reference | |
resnet | lars | lars_opt_learning_rate_warmup_epochs | arbitrary constant | w_epochs in PR | |
resnet | lars | lars_opt_momentum | 0.9 for batch<32k, otherwise arbitrary constant | momentum in reference | |
resnet | lars | lars_opt_weight_decay | (0.0001 * 2 ^ N) where N is any integer | weight_decay in reference | |
resnet | lars | lars_opt_learning_rate_decay_steps | unconstrained | num_epochs in reference | |
resnet | lars | global_batch_size | unconstrained | global batch size in reference | |
resnet | lars | label smoothing** | 0 or 0.1 | TODO | TODO |
resnet | lars | truncated norm initialization** | boolean | TODO | TODO |
resnet | sgd | global_batch_size | arbitrary constant | reference --batch_size | See LARS |
resnet | sgd | sgd_opt_base_learning_rate | 0.001 * k where k is an integer | the learning rate | See LARS |
resnet | sgd | sgd_opt_end_learning_rate | 10^-4 | end learning rate for polynomial decay, implied mathematically from other HPs | See LARS |
resnet | sgd | sgd_opt_learning_rate_decay_poly_power | 2 | power of polynomial decay, no link needed since not tunable | See LARS |
resnet | sgd | sgd_opt_learning_rate_decay_steps | integer >= 0 | num_epochs in reference | See LARS |
resnet | sgd | sgd_opt_weight_decay | (0.0001 * 2 ^ N) where N is any integer | Weight decay, same as LARS. | See LARS |
resnet | sgd | sgd_opt_momentum | 0.9 | Momentum for SGD. | See LARS |
resnet | sgd | model_bn_span | arbitrary constant | number of samples whose statistics a given BN layer uses to normalize a training minibatch (may be just the portion of global_batch_size per device, but also may be aggregated over several devices) | See LARS |
resnet | sgd | opt_learning_rate_warmup_epochs | integer >= 0 | number of epochs needed for learning rate warmup | See LARS |
resnet | sgd | label smoothing** | 0 or 0.1 | TODO | TODO |
resnet | sgd | truncated norm initialization** | boolean | TODO | TODO |
resnet | lars/sgd | opt_name | "lars" or "sgd" | The optimizer that was used. | |
rnnt | lamb | global_batch_size | unconstrained | reference --batch_size | See reference code |
rnnt | lamb | opt_name | "lamb" | The optimizer that was used. | See reference code |
rnnt | lamb | opt_base_learning_rate | unconstrained | base learning rate, this should be the learning rate after warm up and before decay | See reference code |
rnnt | lamb | opt_lamb_epsilon | 1e-9 | LAMB epsilon | See reference code |
rnnt | lamb | opt_lamb_learning_rate_decay_poly_power | unconstrained | Exponential decay rate | See reference code |
rnnt | lamb | opt_lamb_learning_rate_hold_epochs | unconstrained | Number of epochs when LR schedule keeps the base learning rate value | See reference code |
rnnt | lamb | opt_learning_rate_warmup_epochs | unconstrained | Number of epochs when LR linearly increases from 0 to base learning rate | See reference code |
rnnt | lamb | opt_weight_decay | 1e-3 | L2 weight decay | See reference code |
rnnt | lamb | opt_lamb_beta_1 | unconstrained | LAMB beta 1 | See reference code |
rnnt | lamb | opt_lamb_beta_2 | unconstrained | LAMB beta 2 | See reference code |
rnnt | lamb | opt_gradient_clip_norm | 1 or inf | Gradients are clipped above this norm threshold. | See reference code |
rnnt | lamb | opt_gradient_accumulation_steps | unconstrained | Number of fwd/bwd steps between optimizer steps. | See reference code |
rnnt | lamb | opt_learning_rate_alt_decay_func | True | whether to use alternative learning rate decay function | See reference code |
rnnt | lamb | opt_learning_rate_alt_warmup_func | True | whether to use alternative learning rate warmup function | See reference code |
rnnt | lamb | opt_lamb_learning_rate_min | 1e-5 | LR schedule doesn’t set LR values below this threshold | See reference code |
rnnt | lamb | train_samples | unconstrained | Number of training samples after filtering out samples longer than data_train_max_duration | See reference code |
rnnt | lamb | eval_samples | 2703 | Number of evaluation samples | See reference code |
rnnt | lamb | data_train_max_duration | unconstrained | Samples longer than this number of seconds are not included in the training dataset | See reference code |
rnnt | lamb | data_train_num_buckets | 6 | Training dataset is split into this number of buckets | See reference code |
rnnt | lamb | data_train_speed_perturbation_min | 0.85 | Input audio is resampled to a random sample rate not less than this fraction of the original sample rate. | See reference code |
rnnt | lamb | data_train_speed_perturbation_max | 1.15 | Input audio is resampled to a random sample rate not greater than this fraction of the original sample rate. | See reference code |
rnnt | lamb | data_spec_augment_freq_n | 2 | Number of masks for frequency bands | See reference code |
rnnt | lamb | data_spec_augment_freq_min | 0 | Minimum number of frequencies in a single mask | See reference code |
rnnt | lamb | data_spec_augment_freq_max | 20 | Maximum number of frequencies in a single mask | See reference code |
rnnt | lamb | data_spec_augment_time_n | 10 | Number of masks for time band | See reference code |
rnnt | lamb | data_spec_augment_time_min | 0 | Minimum number of masked time steps as a fraction of all steps | See reference code |
rnnt | lamb | data_spec_augment_time_max | 0.03 | Maximum number of masked time steps as a fraction of all steps | See reference code |
rnnt | lamb | model_eval_ema_factor | unconstrained | Smoothing factor for Exponential Moving Average | See reference code |
rnnt | lamb | model_weights_initialization_scale | unconstrained | After random initialization of weight and bias tensors, all are scaled with this factor | See reference code |
ssd | sgd | global_batch_size | arbitrary constant | reference --batch-size | |
ssd | sgd | model_bn_span | integer >= 1 | number of samples whose statistics a given BN layer uses to normalize a training minibatch (may be just the portion of global_batch_size per device, but also may be aggregated over several devices) | |
ssd | sgd | opt_learning_rate_warmup_factor | integer >= 0 | the constant factor applied at learning rate warm up | |
ssd | sgd | opt_learning_rate_warmup_steps | integer >= 1 | number of steps for learning rate to warm up | |
ssd | sgd | opt_weight_decay | arbitrary constant | L2 weight decay | |
ssd | sgd | opt_base_learning_rate | unconstrained | base learning rate, this should be the learning rate after warm up and before decay | |
ssd | sgd | max_samples | 1 or 50 | maximum number of samples attempted when generating a training patch for a given IoU choice | |
ssd | sgd | opt_learning_rate_decay_boundary_epochs | [40, 50] * (1 + k/10) for some integer k | Epochs at which the learning rate decays | |
unet3d | sgd | global_batch_size | unconstrained | global batch size | reference --batch_size |
unet3d | sgd | opt_base_learning_rate | unconstrained | base learning rate | reference --learning_rate |
unet3d | sgd | opt_momentum | unconstrained | SGD momentum | reference --momentum |
unet3d | sgd | opt_learning_rate_warmup_steps | unconstrained | number of epochs needed for learning rate warmup | reference --lr_warmup_epochs |
unet3d | sgd | opt_initial_learning_rate | unconstrained | initial learning rate (for LR warm up) | reference --init_learning_rate |
unet3d | sgd | opt_learning_rate_decay_steps | unconstrained | epochs at which the learning rate decays | reference --lr_decay_epochs |
unet3d | sgd | opt_learning_rate_decay_factor | unconstrained | factor used for learning rate decay | reference --lr_decay_factor |
unet3d | sgd | opt_weight_decay | unconstrained | L2 weight decay | reference --weight_decay |
unet3d | sgd | training_oversampling | fixed to reference | training oversampling | reference --oversampling |
unet3d | sgd | training_input_shape | fixed to reference | training input shape | reference --input_shape |
unet3d | sgd | evaluation_overlap | fixed to reference | evaluation sliding window overlap | reference --overlap |
unet3d | sgd | evaluation_input_shape | fixed to reference | evaluation input shape | reference --val_input_shape |
unet3d | sgd | data_train_samples | fixed to reference | number of training samples | N/A |
unet3d | sgd | data_eval_samples | fixed to reference | number of evaluation samples | N/A |
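To make the constraint notation above concrete, a check such as “(0.0001 * 2 ^ N) where N is any integer” (used for the ResNet weight decay) could be verified along the following lines. This is an illustrative sketch, not the MLPerf verifier:

```python
import math

def is_allowed_weight_decay(value, base=1e-4, tolerance=1e-9):
    # Accept values of the form base * 2**N for some integer N.
    if value <= 0:
        return False
    exponent = math.log2(value / base)
    return abs(exponent - round(exponent)) < tolerance

assert is_allowed_weight_decay(0.0001)       # N = 0
assert is_allowed_weight_decay(0.0004)       # N = 2
assert not is_allowed_weight_decay(0.0003)
```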
OPEN: Hyperparameters and optimizer may be freely changed.
Submitters are expected to use their best efforts to submit with optimal hyperparameters for their system. The intent of Hyperparameter Borrowing is to allow a submitter to update their submission to reflect what they would have submitted had they known about more optimal hyperparameters before submitting, without knowing any other information (i.e., the performance of other submissions).
During the review period as described in the Submission Rules, a submitter may replace the hyperparameters, once per benchmark entry, in their implementation of a benchmark with hyperparameters from another submitter’s implementation of the same benchmark. By default, they may change batch size (local batch size, global batch size, batchnorm span), but must replace all other hyperparameters as a group.
With evidence that the resulting model, using the same batch size as the other submitter’s implementation, converges worse in terms of epochs required, the submitter may make a minimum number of additional hyperparameter changes for the purpose of improving convergence and achieving comparable, but not better, convergence in epochs compared to the other submitter’s implementation, but preserving any difference in convergence that may exist due to precision choices. In this situation, the other submitter’s implementation is considered the reference, and the new submitter must match the convergence behavior of the other submitter in a similar way as we compare any submission to the reference.
A resubmission of a benchmark with borrowed hyperparameters must use the same software (with the exceptions listed in the Software Adoption section of this document), system and system configuration (accelerators, NICs etc) as the original submission. The largest scale submission for a benchmark from a given system may be resubmitted with borrowed hyperparameters using a change of scale on that system, but only if the new scale is either larger, or enables the resubmission to achieve a faster run result. In addition, the new scale must not be larger than the largest scale used in an original submission of at least one of the benchmarks on that system in this round.
CLOSED: The same loss function used in the reference implementation must be used.
OPEN: Any loss function may be used. Do not confuse the loss function with the target quality measure.
Each run must reach a target quality level on the reference implementation quality measure. By default, the time to evaluate the quality is included in the wallclock time. However, if the reference implementation generates timestamped checkpoints and evaluates the quality after the clock has been stopped, then an implementation may either perform evaluation on-the-clock or generate timestamped checkpoints, evaluate them after the clock has been stopped, and update the clock stopped time to the timestamp of the first passing checkpoint. The checkpoint timestamp may be any time after the last weight value included in the checkpoint is updated.
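As a sketch of the checkpoint-based option (function and variable names here are hypothetical), the reported stop time is simply the timestamp of the earliest checkpoint that meets the target when checkpoints are evaluated after the clock has stopped:

```python
def adjusted_stop_time(checkpoints, evaluate, target):
    """checkpoints: list of (timestamp_seconds, checkpoint_path) in training order.
    evaluate: function run after the clock has stopped, returning the quality metric.
    Returns the clock-stop time to report, or None if the target was never reached."""
    for timestamp, path in checkpoints:
        if evaluate(path) >= target:
            # The run ends at the timestamp of the first passing checkpoint.
            return timestamp
    return None
```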
CLOSED: The same quality measure as the reference implementation must be used. The quality measure must be evaluated at least as frequently (in terms of number of training items between test sets) and at least as thoroughly (in terms of number of tests per set) as in the reference implementation. Typically, a test consists of comparing the output of one forward pass through the network with the desired output from the test set.
Area | Problem | Model | Evaluation frequency |
---|---|---|---|
Vision | Image classification | ResNet-50 v1.5 | Every 4 epochs with offset 0, 1, 2, or 3 |
Vision | Image segmentation (medical) | U-Net3D | Starting at |
Vision | Object detection (light weight) | SSD | Fixed at epochs 40, 50, 55, 60, 65, 70, 75, 80 |
Vision | Object detection (heavy weight) | Mask R-CNN | Every 1 epoch |
Language | Speech recognition | RNN-T | Every 1 epoch |
Language | NLP | BERT | Starting at 3M samples, then every 500K samples |
Commerce | Recommendation | DLRM | Every 102400 samples |
Research | Reinforcement learning | Mini Go | Every 1 epoch |
OPEN: An arbitrary stopping criteria may be used, including but not limited to the closed quality measure, a different quality measure, the number of epochs, or a fixed time. However, the reported results must include the geometric mean of the final quality as measured by the closed quality measure.
Checkpoints may be created at the discretion of the submitter. No checkpoints are required to be produced or retained.
The CLOSED division allows limited exemptions to mathematical equivalence between implementations for pragmatic purposes, including:
- Different methods can be used to add color jitter as long as the methods are of a similar distribution and magnitude to the reference.
- If the dataset size is not evenly divisible by the batch size, one of several techniques may be used. The last batch in an epoch may be composed of the remaining samples in the epoch, may be padded, or may be a mixed batch composed of samples from the end of one epoch and the start of the next. If the mixed batch technique is used, quality for the ending epoch must be evaluated after the mixed batch. If the padding technique is used, the first batch may be padded instead of the last batch.
- Values introduced for padding purposes may be reflected in batch norm computations.
- Adam optimizer implementations may use the very small value epsilon to maintain mathematical stability in slightly different ways, provided that the methods are reviewed and approved in advance. One such method involves squaring the value of epsilon and moving epsilon inside the square root in the parameter update equation (see the sketch after this list).
- Distributed batch normalization is allowed.
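For reference, the two epsilon treatments mentioned in the Adam exemption differ only in where the stabilizing constant enters the parameter update; a sketch with illustrative tensors (not a compliant optimizer implementation):

```python
import torch

def adam_step_reference(w, m_hat, v_hat, lr, eps):
    # Standard Adam update: epsilon added outside the square root.
    return w - lr * m_hat / (torch.sqrt(v_hat) + eps)

def adam_step_variant(w, m_hat, v_hat, lr, eps):
    # Pre-approved variant: epsilon squared and moved inside the square root.
    return w - lr * m_hat / torch.sqrt(v_hat + eps**2)
```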
Additional exemptions need to be explicitly requested and approved in advance. In general, exemptions may be approved for techniques that are common industry practice, introduce small differences that would be difficult to engineer around relative to their significance, and do not substantially decrease the required computation. Over time, MLPerf should seek to help the industry converge on standards and remove exemptions.
The OPEN division does not restrict mathematical equivalence.
For a given round of MLPerf, the "canonical version" of a software component shall be defined as the public version as of 14 days before submission. If the software is open source, the canonical version shall be the one compiled with the default compilation options. If a system software provider submits with a component whose version is other than the canonical version, then other submitters using the same component are allowed to update their submission to use that version. Those other submitters must resubmit with the updated system software before the resubmission deadline during the review period. Software adoption applies only to system software, only to the version used by the software provider’s submission, and explicitly does not cover benchmark implementations. Benchmark implementations should be borrowed as a whole only if the software provider’s submission introduces new APIs.
A run result consists of a wall-clock timing measurement for a contiguous period that includes any model initialization in excess of the maximum initialization time, any data preprocessing required to be on the clock, training the model on the dataset, and quality evaluation unless specified otherwise for the benchmark.
Prior to starting the clock, a system may use a maximum of 20 minutes of model initialization time. Model initialization time begins when the system first begins to construct or execute the model. This maximum initialization time is intended to ensure that model initialization is not disproportionate on large systems intended to run much larger models, and may be adjusted in the future with sufficient evidence.
The clock must start before any part of the system touches the dataset or when the maximum model initialization time is exceeded. The clock may be stopped as soon as any part of the system determines target accuracy has been reached. The clock may not be paused during the run.
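Taken together, the timing rules reduce to roughly the following skeleton (illustrative only; actual submissions report these events through MLPerf logging):

```python
import time

MAX_INIT_SECONDS = 20 * 60  # maximum untimed model initialization

init_start = time.time()
# ... construct the model, compile graphs, allocate memory (untimed, up to 20 minutes) ...

# The clock starts before any part of the system touches the dataset,
# or when the initialization budget is exhausted, whichever comes first.
run_start = min(time.time(), init_start + MAX_INIT_SECONDS)

# ... on-the-clock preprocessing, training, and quality evaluation ...
target_reached_at = time.time()  # earliest point at which the target accuracy is known to be reached

run_result_seconds = target_reached_at - run_start
print(f"run result: {run_result_seconds:.1f} s")
```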
Each benchmark result is based on a set of run results. The number of results for each benchmark is based on a combination of the variance of the benchmark result, the cost of each run, and the likelihood of convergence.
Area | Problem | Number of Runs |
---|---|---|
Vision | Image classification | 5 |
Vision | Image segmentation (medical) | 40 |
Vision | Object detection (light weight) | 5 |
Vision | Object detection (heavy weight) | 5 |
Language | NLP | 10 |
Language | Speech recognition | 10 |
Commerce | Recommendation | 5 |
Research | Reinforcement learning | 10 |
Each benchmark result is computed by dropping the fastest and slowest runs, then taking the mean of the remaining times. For this purpose, a single non-converging run may be treated as the slowest run and dropped. A benchmark result is invalid if there is more than one non-converging run.
In the case of UNET3D, due to large variance, 40 runs are required. Out of the 40 runs, the 4 fastest and 4 slowest are dropped. There can be a maximum of 4 non-converging runs. A run is classified as non-converged if the target quality metric is not reached within CEILING(10000*168/samples_in_epoch) epochs.
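A minimal sketch of this scoring rule (not the official results scripts): sort the run times, drop the extremes, treat at most the allowed number of non-converging runs as slowest, and average what remains. For U-Net3D the same logic is applied with 4 runs dropped from each end and up to 4 non-converging runs allowed.

```python
import math

def benchmark_result(run_seconds, drop_each_end=1, max_nonconverged=1):
    """run_seconds: run results in seconds; non-converging runs are passed as math.inf."""
    if sum(1 for t in run_seconds if math.isinf(t)) > max_nonconverged:
        raise ValueError("invalid benchmark result: too many non-converging runs")
    ordered = sorted(run_seconds)            # non-converging runs sort to the slow end
    kept = ordered[drop_each_end:len(ordered) - drop_each_end]
    return sum(kept) / len(kept)             # mean of the remaining run results

def unet3d_nonconvergence_epoch_limit(samples_in_epoch):
    # A U-Net3D run is non-converged if the target is not reached within this many epochs.
    return math.ceil(10000 * 168 / samples_in_epoch)

# Example: five runs, one of which failed to converge.
print(benchmark_result([310.0, 295.0, 305.0, 300.0, math.inf]))  # mean of 300, 305, 310
```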
Each benchmark result should be normalized by dividing the reference result for the corresponding reference implementation by the benchmark result. This normalization produces higher numbers for better results, which better aligns with human intuition.
- ResNet
  - ResNet may have 1000 or 1001 classes, where the 1001st is "I don’t know".

Analysis to support this can be found in the document "MLPerf Optimizer Review" in the MLPerf Training document area.
Benchmark | Algorithm | Framework | Allowed Optimizer |
---|---|---|---|
RN50 | LARS | PyTorch | [No compliant implementation] |
RN50 | LARS | TensorFlow | MLPERF_LARSOptimizer |
RN50 | LARS | MXNet | SGDwFASTLARS |
RN50 | SGD with Momentum | PyTorch | apex.optimizers.FusedSGD |
RN50 | SGD with Momentum | PyTorch | torch.optim.SGD |
RN50 | SGD with Momentum | TensorFlow | tf.train.MomentumOptimizer |
RN50 | SGD with Momentum | MXNet | [No compliant implementation] |
Minigo | SGD with Momentum | PyTorch | apex.optimizers.FusedSGD |
Minigo | SGD with Momentum | PyTorch | torch.optim.SGD |
Minigo | SGD with Momentum | TensorFlow | tf.train.MomentumOptimizer |
Mask-RCNN | SGD with Momentum | PyTorch | apex.optimizers.FusedSGD |
Mask-RCNN | SGD with Momentum | PyTorch | torch.optim.SGD |
Mask-RCNN | SGD with Momentum | TensorFlow | tf.train.MomentumOptimizer |
SSD | SGD with Momentum | PyTorch | apex.optimizers.FusedSGD |
SSD | SGD with Momentum | PyTorch | torch.optim.SGD |
SSD | SGD with Momentum | TensorFlow | tf.train.MomentumOptimizer |
BERT | LAMB | PyTorch | apex.optimizers.FusedLAMB |
BERT | LAMB | TensorFlow | tf.optimizers.LAMB |
RNN-T | LAMB | PyTorch | apex.optimizers.FusedLAMB |
RNN-T | LAMB | TensorFlow | tf.optimizers.LAMB |
DLRM | SGD | PyTorch | torch.optim.SGD |
DLRM | SGD | TensorFlow | tf.train.MomentumOptimizer |
UNET3D | SGD with Momentum | PyTorch | torch.optim.SGD |
UNET3D | SGD with Momentum | TensorFlow | tf.train.MomentumOptimizer |
UNET3D | SGD with Momentum | MXNet | mx.optimizer.NAG |