FPnew is a parametric floating-point unit which supports standard RISC-V operations as well as transprecision formats, written in SystemVerilog.
The top-level module of the FPU is `fpnew_top`; its interface is described in this section.
FPnew uses a synchronous handshaking interface to transfer data into and out of the FPU.
All array types are packed due to poor support of unpacked arrays in some EDA tools.
SystemVerilog `interface`s are not used due to poor support in some EDA tools.
The configuration parameters use data types defined in `fpnew_pkg`, which are structs containing multi-dimensional arrays of custom enumeration types.
For more in-depth explanations on how to configure the unit and the layout of the types used, please refer to the Configuration Section.
Parameter Name | Description |
---|---|
`Features` | Specifies the features of the FPU, such as the set of supported formats and operations. |
`Implementation` | Controls how the above features are implemented, such as the number of pipeline stages and the architecture of subunits. |
`DivSqrtSel` | Chooses among the three supported DivSqrt units. |
`TagType` | The SystemVerilog data type of the operation tag. |
`TrueSIMDClass` | If enabled, the result of a classify operation in vectorial mode will be RISC-V compliant if each output has at least 10 bits. |
`EnableSIMDMask` | Enables the RISC-V floating-point status flag masking of inactive vectorial lanes. When disabled, `simd_mask_i` is inactive. |
`StochasticRndImplementation` | Enables stochastic rounding support for `SDOTP` and defines the LFSR bitwidth and the number of trailing bits considered for the SR decision. |
`CompressedVecCmpResult` | Compresses the result of a vector compare into the LSBs, conceived for RV32FD cores. |
Many ports use custom types and enumerations from `fpnew_pkg` to improve code structure internally (see Data Types).
As the width of some input/output signals is defined by the configuration, it is denoted `W` in the following table.
Port Name | Direction | Type | Description |
---|---|---|---|
`clk_i` | in | `logic` | Clock, synchronous, rising-edge triggered |
`rst_ni` | in | `logic` | Asynchronous reset, active low |
`hart_id_i` | in | `logic [31:0]` | Core ID, used only when stochastic rounding is enabled |
`operands_i` | in | `logic [2:0][W-1:0]` | Operands, henceforth referred to as `op[i]` |
`rnd_mode_i` | in | `roundmode_e` | Floating-point rounding mode |
`op_i` | in | `operation_e` | Operation select |
`op_mod_i` | in | `logic` | Operation modifier |
`src_fmt_i` | in | `fp_format_e` | Source FP format |
`dst_fmt_i` | in | `fp_format_e` | Destination FP format |
`int_fmt_i` | in | `int_format_e` | Integer format |
`vectorial_op_i` | in | `logic` | Vectorial operation select |
`tag_i` | in | `TagType` | Operation tag input |
`simd_mask_i` | in | `MaskType` | Vector mask input for the status flags |
`in_valid_i` | in | `logic` | Input data valid (see Handshake) |
`in_ready_o` | out | `logic` | Input interface ready (see Handshake) |
`flush_i` | in | `logic` | Synchronous pipeline reset |
`result_o` | out | `logic [W-1:0]` | Result |
`status_o` | out | `status_t` | RISC-V floating-point status flags `fflags` |
`tag_o` | out | `TagType` | Operation tag output |
`out_valid_o` | out | `logic` | Output data valid (see Handshake) |
`out_ready_i` | in | `logic` | Output interface ready (see Handshake) |
`busy_o` | out | `logic` | FPU operation in flight |
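As an illustration of the port list above, the following is a minimal instantiation sketch, not a reference integration. It assumes that all parameters other than `TagType` are left at their package defaults and that the default datapath width is 64 bits, so that `W = 64`. The wrapper name `fpu_wrapper_sketch`, the 5-bit tag, and the tie-off values are hypothetical choices made for this example; the snippet needs the FPnew sources (`fpnew_pkg`, `fpnew_top`) to compile.

```systemverilog
// Hypothetical wrapper around fpnew_top; scalar operations only.
module fpu_wrapper_sketch (
  input  logic                  clk_i,
  input  logic                  rst_ni,
  input  logic [2:0][63:0]      operands_i,   // assumes W = 64 (default Width)
  input  fpnew_pkg::roundmode_e rnd_mode_i,
  input  fpnew_pkg::operation_e op_i,
  input  logic                  op_mod_i,
  input  fpnew_pkg::fp_format_e fmt_i,        // used as source and destination format
  input  logic [4:0]            tag_i,        // hypothetical 5-bit tag
  input  logic                  in_valid_i,
  output logic                  in_ready_o,
  output logic [63:0]           result_o,
  output fpnew_pkg::status_t    status_o,
  output logic [4:0]            tag_o,
  output logic                  out_valid_o,
  input  logic                  out_ready_i
);

  fpnew_top #(
    .TagType ( logic [4:0] )                  // all other parameters keep their defaults
  ) i_fpnew (
    .clk_i,
    .rst_ni,
    .hart_id_i      ( '0               ),     // only relevant for stochastic rounding
    .operands_i,
    .rnd_mode_i,
    .op_i,
    .op_mod_i,
    .src_fmt_i      ( fmt_i            ),
    .dst_fmt_i      ( fmt_i            ),
    .int_fmt_i      ( fpnew_pkg::INT32 ),     // arbitrary choice for this sketch
    .vectorial_op_i ( 1'b0             ),     // scalar operations only
    .tag_i,
    .simd_mask_i    ( '1               ),     // no lane masked
    .in_valid_i,
    .in_ready_o,
    .flush_i        ( 1'b0             ),
    .result_o,
    .status_o,
    .tag_o,
    .out_valid_o,
    .out_ready_i,
    .busy_o         ( /* unused */     )
  );

endmodule
```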
The following custom data types and enumerations are used in ports of the FPU and are defined in `fpnew_pkg`.
Default values from the package are listed.
`roundmode_e` is an enumeration of type `logic [2:0]` holding the available rounding modes, encoded for use in RISC-V cores:
Enumerator | Value | Rounding Mode |
---|---|---|
`RNE` | `3'b000` | To nearest, tie to even (default) |
`RTZ` | `3'b001` | Toward zero |
`RDN` | `3'b010` | Toward negative infinity |
`RUP` | `3'b011` | Toward positive infinity |
`RMM` | `3'b100` | To nearest, tie away from zero |
`ROD` | `3'b101` | To odd |
`RSR` | `3'b110` | Stochastic rounding (available only on `SDOTP` operations) |
`DYN` | `3'b111` | RISC-V dynamic RM, invalid if passed to operations |
`operation_e` is an enumeration of type `logic [3:0]` holding the FP operation.
The operation modifier `op_mod_i` can change the operation carried out.
Unless noted otherwise, the first operand `op[0]` is used for the operation.
Enumerator | Modifier | Operation |
---|---|---|
`FMADD` | `0` | Fused multiply-add (`(op[0] * op[1]) + op[2]`) |
`FMADD` | `1` | Fused multiply-subtract (`(op[0] * op[1]) - op[2]`) |
`FNMSUB` | `0` | Negated fused multiply-subtract (`-(op[0] * op[1]) + op[2]`) |
`FNMSUB` | `1` | Negated fused multiply-add (`-(op[0] * op[1]) - op[2]`) |
`ADD` | `0` | Addition (`op[1] + op[2]`), note the operand indices |
`ADD` | `1` | Subtraction (`op[1] - op[2]`), note the operand indices |
`MUL` | `0` | Multiplication (`op[0] * op[1]`) |
`SDOTP` | `0` | Sum of dot product |
`VSUM` | `0` | Vector inner sum |
`DIV` | `0` | Division (`op[0] / op[1]`) |
`SQRT` | `0` | Square root |
`SGNJ` | `0` | Sign injection, operation encoded in rounding mode<br>`RNE`: `op[0]` with `sign(op[1])`<br>`RTZ`: `op[0]` with `~sign(op[1])`<br>`RDN`: `op[0]` with `sign(op[0]) ^ sign(op[1])`<br>`RUP`: `op[0]` (passthrough) |
`SGNJ` | `1` | As above, but the result is sign-extended instead of NaN-boxed |
`MINMAX` | `0` | Minimum / maximum, operation encoded in rounding mode<br>`RNE`: `minimumNumber(op[0], op[1])`<br>`RTZ`: `maximumNumber(op[0], op[1])` |
`CMP` | `0` | Comparison, operation encoded in rounding mode<br>`RNE`: `op[0] <= op[1]`<br>`RTZ`: `op[0] < op[1]`<br>`RDN`: `op[0] == op[1]` |
`CLASSIFY` | `0` | Classification, returns RISC-V classification block |
`F2F` | `0` | FP to FP cast, formats given by `src_fmt_i` and `dst_fmt_i` |
`F2I` | `0` | FP to signed integer cast, formats given by `src_fmt_i` and `int_fmt_i` |
`F2I` | `1` | FP to unsigned integer cast, formats given by `src_fmt_i` and `int_fmt_i` |
`I2F` | `0` | Signed integer to FP cast, formats given by `int_fmt_i` and `dst_fmt_i` |
`I2F` | `1` | Unsigned integer to FP cast, formats given by `int_fmt_i` and `dst_fmt_i` |
`CPKAB` | `0` | Cast-and-pack `op[0]` and `op[1]` to entries 0, 1 of vector `op[2]` |
`CPKAB` | `1` | Cast-and-pack `op[0]` and `op[1]` to entries 2, 3 of vector `op[2]` |
`CPKCD` | `0` | Cast-and-pack `op[0]` and `op[1]` to entries 4, 5 of vector `op[2]` |
`CPKCD` | `1` | Cast-and-pack `op[0]` and `op[1]` to entries 6, 7 of vector `op[2]` |
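To make the operand-index convention for `ADD` concrete, the sketch below shows how a single-precision subtraction `a - b` could be encoded for a 64-bit datapath. The module name `fsub_request_sketch` and its ports are hypothetical; only the enumerators and the `op[1]`/`op[2]` convention come from the table above. The FP32 operands are NaN-boxed by hand, assuming the default 64-bit width.

```systemverilog
// Hypothetical helper that encodes "a - b" in FP32 as an FPU request.
module fsub_request_sketch (
  input  logic [31:0]           a_i,
  input  logic [31:0]           b_i,
  output logic [2:0][63:0]      operands_o,
  output fpnew_pkg::operation_e op_o,
  output logic                  op_mod_o,
  output fpnew_pkg::roundmode_e rnd_mode_o,
  output fpnew_pkg::fp_format_e src_fmt_o,
  output fpnew_pkg::fp_format_e dst_fmt_o
);
  // ADD ignores op[0]; addition and subtraction use op[1] and op[2].
  assign operands_o[0] = '1;
  assign operands_o[1] = {32'hFFFF_FFFF, a_i};  // minuend, NaN-boxed to 64 bit
  assign operands_o[2] = {32'hFFFF_FFFF, b_i};  // subtrahend, NaN-boxed to 64 bit

  assign op_o       = fpnew_pkg::ADD;
  assign op_mod_o   = 1'b1;                     // modifier 1 turns ADD into subtraction
  assign rnd_mode_o = fpnew_pkg::RNE;
  assign src_fmt_o  = fpnew_pkg::FP32;
  assign dst_fmt_o  = fpnew_pkg::FP32;
endmodule
```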
`fp_format_e` is an enumeration of type `logic [2:0]` holding the supported FP formats.
Enumerator | Format | Width | Exp. Bits | Man. Bits |
---|---|---|---|---|
`FP32` | IEEE binary32 | 32 bit | 8 | 23 |
`FP64` | IEEE binary64 | 64 bit | 11 | 52 |
`FP16` | IEEE binary16 | 16 bit | 5 | 10 |
`FP8` | binary8 | 8 bit | 5 | 2 |
`FP16ALT` | binary16alt | 16 bit | 8 | 7 |
`FP8ALT` | binary8alt | 8 bit | 4 | 3 |
The following global parameters associated with FP formats are set in `fpnew_pkg`:
localparam int unsigned NUM_FP_FORMATS = 6;
localparam int unsigned FP_FORMAT_BITS = $clog2(NUM_FP_FORMATS);
`int_format_e` is an enumeration of type `logic [1:0]` holding the supported integer formats.
Enumerator | Width |
---|---|
`INT8` | 8 bit |
`INT16` | 16 bit |
`INT32` | 32 bit |
`INT64` | 64 bit |
The following global parameters associated with integer formats are set in `fpnew_pkg`:
localparam int unsigned NUM_INT_FORMATS = 4;
localparam int unsigned INT_FORMAT_BITS = $clog2(NUM_INT_FORMATS);
`status_t` is a packed struct containing the five FP status flags as `logic`, in order MSB to LSB:
Member | Description |
---|---|
`NV` | Invalid operation |
`DZ` | Division by zero |
`OF` | Overflow |
`UF` | Underflow |
`NX` | Inexact operation |
RISC-V mandates so-called NaN-boxing of all FP values in formats that are narrower than the widest available format in the system.
This means that all unused high-order bits of narrow formats must be set to `'1`, otherwise the value is considered invalid (a NaN).
Checks that input values are properly NaN-boxed are enabled by default but can be turned off (see Configuration). Narrow FP output values from the FPU are always NaN-boxed. Narrow integer output values from the FPU are sign-extended, even if unsigned.
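The NaN-boxing rule can be sketched with two small helper functions for a binary32 value held in a 64-bit word. This is only an illustration of the rule as stated above and not the check used inside FPnew; the module and function names are hypothetical.

```systemverilog
// Standalone illustration of NaN-boxing an FP32 value in a 64-bit datapath.
module nanbox_sketch;

  // Box a binary32 value: all unused high-order bits are set to 1.
  function automatic logic [63:0] nanbox_fp32(input logic [31:0] value);
    return {32'hFFFF_FFFF, value};
  endfunction

  // A word holds a properly boxed binary32 value only if bits 63:32 are all ones.
  function automatic logic is_boxed_fp32(input logic [63:0] word);
    return &word[63:32];
  endfunction

  initial begin
    logic [63:0] boxed;
    boxed = nanbox_fp32(32'h3F80_0000);  // 1.0 in binary32
    $display("boxed word      = %h, is_boxed = %b", boxed, is_boxed_fp32(boxed));
    // Zero-extended instead of boxed: would be treated as NaN when checks are enabled.
    $display("zero-extended   : is_boxed = %b", is_boxed_fp32(64'h0000_0000_3F80_0000));
  end

endmodule
```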
Both the input and output side of FPnew feature a `valid`/`ready` handshake interface which controls the flow of data into and out of the FPU.
The handshaking protocol is similar to ones used in common protocols such as AXI:
- An asserted `valid` signals that data on the corresponding interface is valid and stable.
- Once `valid` is asserted, it must not be deasserted until the handshake is complete.
- An asserted `ready` signals that the interface is capable of processing data on the following rising clock edge.
- Once `valid` and `ready` are asserted during a rising clock edge, the transaction is complete.
- After a completed transaction, `valid` may remain asserted to provide new data for the next transfer.
- The protocol direction is top-down. `ready` may depend on `valid`, but `valid` must not depend on `ready`.
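The second rule (no retraction of `valid` before the handshake completes) lends itself to a simple simulation check. The following is a minimal sketch of such a checker, assuming it is bound to one side of the interface (for example `in_valid_i`/`in_ready_o`); the module name and port names are hypothetical and the checker is not part of FPnew. It does not account for `flush_i`.

```systemverilog
// Simulation-only checker: valid must stay asserted until ready is seen.
module valid_stable_checker (
  input logic clk_i,
  input logic rst_ni,
  input logic valid_i,
  input logic ready_i
);
  // Set when valid was asserted in the previous cycle without a completed handshake.
  logic stalled_q;

  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) begin
      stalled_q <= 1'b0;
    end else begin
      if (stalled_q && !valid_i) begin
        $error("valid deasserted before the handshake completed");
      end
      stalled_q <= valid_i && !ready_i;
    end
  end
endmodule
```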
Operation tags are metadata accompanying an operation and can be used to link results back to the operations that produced them. Tags traverse the FPU without being modified, but always stay in sync with the operation they were issued with.
Tags are an optional feature of FPnew and can be controlled by setting the `TagType` parameter as needed (usually a packed vector of `logic`, but can be any type).
In order to disable the use of tags, set `TagType` to `logic` (the default value), and bind the `tag_i` port to a static value.
Furthermore, ensure that your synthesis tool removes static registers.
The `simd_mask_i` input is meant to be used in vectorial mode. The mask for the status flags is an input vector with `NumLanes` bits, and each bit can mask the status flags of a different FPU vectorial lane. This prevents status flags raised by inactive lanes from polluting the final output status flags.
If `simd_mask_i[n] == 1'b0`, the `n`th FPU lane will be masked for this operation and its resulting status flags will not be propagated to the final output status flags.
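The effect of the mask can be modeled as an OR-reduction over the flags of the unmasked lanes. The sketch below is a behavioral illustration under that assumption and does not claim to mirror the internal implementation; `NumLanes = 4`, the local copy of `status_t`, and the function name `merge_flags` are hypothetical.

```systemverilog
// Behavioral model of status-flag masking across vectorial lanes.
module simd_mask_sketch;

  localparam int unsigned NumLanes = 4;  // hypothetical lane count

  // Local copy of the flag struct, in the documented MSB-to-LSB order.
  typedef struct packed {
    logic NV;
    logic DZ;
    logic OF;
    logic UF;
    logic NX;
  } status_t;

  // OR together the flags of all lanes whose mask bit is set.
  function automatic status_t merge_flags(
    input status_t [NumLanes-1:0] lane_status,
    input logic    [NumLanes-1:0] simd_mask
  );
    status_t merged = '0;
    for (int unsigned i = 0; i < NumLanes; i++) begin
      if (simd_mask[i]) begin
        merged = status_t'(merged | lane_status[i]);
      end
    end
    return merged;
  endfunction

endmodule
```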
Main configuration of the FPU is done through parameters on the `fpnew_top` module.
A default selection of formats and features is defined in the package and can be controlled through these parameters.
Furthermore, the project package `fpnew_pkg` can be modified to provide even more custom formats to the FPU.
The `Features` parameter is used to configure the available formats and special features of the FPU.
It is of type `fpu_features_t`, which is defined as:
typedef struct packed {
int unsigned Width;
logic EnableVectors;
logic EnableNanBox;
fmt_logic_t FpFmtMask;
ifmt_logic_t IntFmtMask;
} fpu_features_t;
The fields of this struct behave as follows:
`Width` specifies the width of the FPU datapath and of the input and output data ports (`operands_i`/`result_o`).
It must be greater than or equal to the width of the widest enabled FP and integer format.
Default: `64`
`EnableVectors` controls the generation of packed-SIMD computation units in the FPU.
If set to `1`, vectorial execution units will be generated for all FP formats that are narrower than `Width` in order to fill up the datapath width.
For example, given `Width = 64`, there will be four execution units for every operation on 16-bit FP formats.
Default: `1'b1`
`EnableNanBox` controls whether input value NaN-boxing is enforced (see NaN-Boxing).
If set to `1`, all values of FP formats that are narrower than `Width` will be considered NaN unless all unused high-order bits are set to `'1`.
Output FP values are always NaN-boxed, regardless of this setting.
Default: `1'b1`
The `FpFmtMask` parameter is of type `fmt_logic_t`, which is an array holding one `logic` bit per FP format from `fp_format_e`, in ascending order.
typedef logic [0:NUM_FP_FORMATS-1] fmt_logic_t; // Logic indexed by FP format
If a bit in `FpFmtMask` is set, FPU hardware for the corresponding format is generated.
Otherwise, synthesis tools can optimize away any logic associated with this format, and operations on the format yield undefined results.
Default: `'1` (all enabled)
The `IntFmtMask` parameter is of type `ifmt_logic_t`, which is an array holding one `logic` bit per integer format from `int_format_e`, in ascending order.
typedef logic [0:NUM_INT_FORMATS-1] ifmt_logic_t; // Logic indexed by integer format
If a bit in `IntFmtMask` is set, FPU hardware for the corresponding format is generated.
Otherwise, synthesis tools can optimize away any logic associated with this format, and operations on the format yield undefined results.
Default: `'1` (all enabled)
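As a hedged example of how the fields fit together, the following sketch defines a 32-bit configuration that enables only FP32 and FP16 plus the 32-bit integer format. It assumes that the enumerator values of `fp_format_e` and `int_format_e` follow the order of the tables above (`FP32 = 0`, `FP64 = 1`, `FP16 = 2`, ... and `INT8 = 0` ... `INT64 = 3`), so that the leftmost bit of each mask literal corresponds to index 0. The package name `my_fpu_cfg_pkg` and the parameter name `MyFeatures` are hypothetical.

```systemverilog
package my_fpu_cfg_pkg;

  // Hypothetical feature set: 32-bit datapath with FP32 and 2-lane FP16 SIMD.
  localparam fpnew_pkg::fpu_features_t MyFeatures = '{
    Width:         32,         // widest enabled format (FP32, INT32) is 32 bit
    EnableVectors: 1'b1,       // FP16 fills the datapath with two lanes
    EnableNanBox:  1'b1,       // FP16 inputs must be NaN-boxed to 32 bit
    FpFmtMask:     6'b101000,  // index 0 (FP32) and index 2 (FP16), assumed order
    IntFmtMask:    4'b0010     // index 2 (INT32), assumed order
  };

endpackage
```

Such a parameter would then be passed to `fpnew_top` through its `Features` parameter.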
The FPU is divided into five operation groups, `ADDMUL`, `DIVSQRT`, `NONCOMP`, `CONV`, and `DOTP` (see Architecture: Top-Level).
The `Implementation` parameter controls the implementation of these operation groups.
It is of type `fpu_implementation_t`, which is defined as:
typedef struct packed {
opgrp_fmt_unsigned_t PipeRegs;
opgrp_fmt_unit_types_t UnitTypes;
pipe_config_t PipeConfig;
} fpu_implementation_t;
The fields of this struct behave as follows:
The `PipeRegs` parameter is of type `opgrp_fmt_unsigned_t`, which is an array of arrays holding an unsigned value for each operation group and each format, in ascending order.
typedef logic [0:NUM_FP_FORMATS-1][31:0] fmt_unsigned_t; // Array of unsigned indexed by FP format
typedef fmt_unsigned_t [0:NUM_OPGROUPS-1] opgrp_fmt_unsigned_t; // Array of format-specific unsigned indexed by operation group
This parameter sets the number of pipeline stages to be inserted into the computational units per operation group, per FP format. As such, latencies for different operations and different formats can be freely configured.
Default: `'{default: 0}` (no pipelining - all operations combinatorial)
The `UnitTypes` parameter is of type `opgrp_fmt_unit_types_t`, which is an array of arrays holding an enumeration value for each operation group and each format, in ascending order.
typedef unit_type_t [0:NUM_FP_FORMATS-1] fmt_unit_types_t; // Array of unit types indexed by format
typedef fmt_unit_types_t [0:NUM_OPGROUPS-1] opgrp_fmt_unit_types_t; // Array of format-specific unit types indexed by opgroup
The unit type `unit_type_t` is an enumeration of type `logic [1:0]` holding the following implementation options for a particular hardware unit:
Enumerator | Description |
---|---|
`DISABLED` | No hardware units will be generated for this format |
`PARALLEL` | One hardware unit per format will be generated |
`MERGED` | One combined multi-format hardware unit will be generated for all formats selecting `MERGED` |
The `UnitTypes` parameter controls the resources used for the FPU by either removing operation units for certain formats and operations, or merging multiple formats into one.
Currently, the following unit types are available for the FPU operation groups:
| | `ADDMUL` | `DIVSQRT` | `NONCOMP` | `CONV` | `DOTP` |
|---|---|---|---|---|---|
| `PARALLEL` | ✔️ | | ✔️ | | |
| `MERGED` | ✔️ | ✔️ | | ✔️ | ✔️ |
Default:
'{'{default: PARALLEL}, // ADDMUL
  '{default: MERGED},   // DIVSQRT
  '{default: PARALLEL}, // NONCOMP
  '{default: MERGED},   // CONV
  '{default: DISABLED}} // DOTP
(all formats within an operation group use the same type)
The `PipeConfig` parameter is of type `pipe_config_t` and controls register placement in operational units.
The requested number of registers is placed at predefined locations within the units according to the `PipeConfig` parameter.
For best results, we strongly encourage the use of automatic retiming options in synthesis tools to optimize the pre-placed pipeline registers.
The configuration `pipe_config_t` is an enumeration of type `logic [1:0]` holding the following implementation options for the pipelines in operational units:
Enumerator | Description |
---|---|
`BEFORE` | All pipeline registers are inserted at the inputs of the operational unit |
`AFTER` | All pipeline registers are inserted at the outputs of the operational unit |
`INSIDE` | All registers are inserted at roughly the middle of the operational unit (if not possible, `BEFORE`) |
`DISTRIBUTED` | Registers are evenly distributed to `INSIDE`, `BEFORE`, and `AFTER` (if no `INSIDE`, all `BEFORE`) |
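Putting the three fields together, an `Implementation` value could look like the sketch below. The latency numbers are arbitrary illustration values, the package and parameter names are hypothetical, and the nested assignment patterns assume five operation groups in the order `ADDMUL`, `DIVSQRT`, `NONCOMP`, `CONV`, `DOTP`, as in the default shown above.

```systemverilog
package my_fpu_impl_pkg;

  import fpnew_pkg::*;

  // Hypothetical implementation settings with per-group pipeline depths.
  localparam fpu_implementation_t MyImplementation = '{
    PipeRegs:   '{'{default: 2},          // ADDMUL:  2 stages for every format
                  '{default: 4},          // DIVSQRT: 4 stages
                  '{default: 0},          // NONCOMP: combinatorial
                  '{default: 1},          // CONV:    1 stage
                  '{default: 0}},         // DOTP:    unused, group is disabled below
    UnitTypes:  '{'{default: PARALLEL},   // ADDMUL
                  '{default: MERGED},     // DIVSQRT
                  '{default: PARALLEL},   // NONCOMP
                  '{default: MERGED},     // CONV
                  '{default: DISABLED}},  // DOTP
    PipeConfig: DISTRIBUTED               // spread registers across each unit
  };

endpackage
```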
The `DivSqrtSel` parameter is used to choose among the supported DivSqrt units.
It is of type `divsqrt_unit_t`, which is defined as:
typedef enum logic [1:0] {
  PULP,   // "PULP" instantiates the PULP DivSqrt unit, which supports FP64, FP32, FP16, FP16ALT, FP8 and SIMD operations
  TH32,   // "TH32" instantiates the E906 DivSqrt unit, which supports only FP32 (no SIMD support)
  THMULTI // "THMULTI" instantiates the C910 DivSqrt unit, which supports FP64, FP32, FP16, FP16ALT and SIMD operations
} divsqrt_unit_t;
The `StochasticRndImplementation` parameter is used to configure the RSR support.
It is of type `rsr_impl_t`, which is defined as:
typedef struct packed {
logic EnableRSR;
int unsigned RsrPrecision;
int unsigned LfsrInternalPrecision;
} rsr_impl_t;
The fields of this struct behave as follows:
`EnableRSR` enables stochastic rounding support in the `DOTP` operation group block. It instantiates an LFSR in the rounding module.
Default: `1'b0`
`RsrPrecision` specifies the number of trailing bits considered for the stochastic rounding decision.
Default: `12`
`LfsrInternalPrecision` specifies the LFSR internal bitwidth, thus controlling the pseudorandom number periodicity.
Default: `32`
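For instance, stochastic rounding could be switched on while keeping the documented default precision values. The package and parameter names in this short sketch are hypothetical.

```systemverilog
package my_fpu_rsr_pkg;

  // Hypothetical RSR settings: enabled, with the documented default precisions.
  localparam fpnew_pkg::rsr_impl_t MyRsrImplementation = '{
    EnableRSR:             1'b1,
    RsrPrecision:          12,  // trailing bits considered for the rounding decision
    LfsrInternalPrecision: 32   // LFSR bitwidth
  };

endpackage
```

The value would be passed to `fpnew_top` through the `StochasticRndImplementation` parameter; note that `hart_id_i` is only used when stochastic rounding is enabled.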
In order to add custom FP or integer formats to the FPU, it is necessary to make small changes to `fpnew_pkg`.
New formats can easily be added by extending the default list of available formats, and/or by changing or removing the defaults.
Namely, the following parameters and types shall be adapted:
// For FP formats:
localparam int unsigned NUM_FP_FORMATS
typedef enum logic [FP_FORMAT_BITS-1:0] {...} fp_format_e
localparam fp_encoding_t [0:NUM_FP_FORMATS-1] FP_ENCODINGS
localparam fmt_logic_t CPK_FORMATS
// For Int formats:
localparam int unsigned NUM_INT_FORMATS
typedef enum logic [INT_FORMAT_BITS-1:0] {...} int_format_e
Furthermore, the default configuration parameters shall be adjusted to match the dimensions of the modified format list.
No other changes should be necessary to the package or other source files of the FPU.
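When choosing exponent and mantissa sizes for a new format, the total width follows from one sign bit plus the two field widths, which is consistent with the format table (for example 1 + 8 + 23 = 32 for `FP32`). The standalone sketch below illustrates this relation; the field names `exp_bits` and `man_bits` of the local `fp_encoding_t` copy are an assumption about the package layout, and the package name is hypothetical.

```systemverilog
package fp_layout_sketch_pkg;

  // Local stand-in for an encoding descriptor; field names are assumed.
  typedef struct packed {
    int unsigned exp_bits;
    int unsigned man_bits;
  } fp_encoding_t;

  // Total storage width: sign bit + exponent bits + mantissa bits.
  function automatic int unsigned fp_width(input fp_encoding_t enc);
    return 1 + enc.exp_bits + enc.man_bits;
  endfunction

  // Encodings taken from the format table above.
  localparam fp_encoding_t FP32_ENC    = '{exp_bits: 8, man_bits: 23};
  localparam fp_encoding_t FP16ALT_ENC = '{exp_bits: 8, man_bits: 7};

  localparam int unsigned FP32_WIDTH    = fp_width(FP32_ENC);    // 1 + 8 + 23
  localparam int unsigned FP16ALT_WIDTH = fp_width(FP16ALT_ENC); // 1 + 8 + 7

endpackage
```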
The exact architecture of FPnew depends on the configuration through parameters. The main architectural traits as well as the effect of some parameters are described henceforth.
The design philosophy behind FPnew is that the "plumbing" of the architecture is quite regular and generic and the actual operations that handle the data are located at the lowest level. Handshaking is used to pass data through the hierarchy levels. As such, very fine-grained clock-gating can be applied to silence all parts of the architecture that are not actively contributing to useful work, significantly increasing energy efficiency.
The topmost level of hierarchy in FPnew is host to several operation group blocks as well as an output arbiter. The operation group is the highest level of grouping within FPnew and signifies a class of operations that can usually be executed on a single hardware unit - such as additions and multiplications usually being mapped to an FMA unit.
There are currently five operation groups in FPnew, which are enumerated in `opgroup_e` as outlined in the following table:
Enumerator | Description | Associated Operations |
---|---|---|
`ADDMUL` | Addition and Multiplication | `FMADD`, `FNMSUB`, `ADD`, `MUL` |
`DIVSQRT` | Division and Square Root | `DIV`, `SQRT` |
`NONCOMP` | Non-Computational Operations like Comparisons | `SGNJ`, `MINMAX`, `CMP`, `CLASSIFY` |
`CONV` | Conversions | `F2I`, `I2F`, `F2F`, `CPKAB`, `CPKCD` |
`DOTP` | Dot Products | `SDOTP`, `EXVSUM`, `VSUM` |
Most architectural decisions for FPnew are made at very fine granularity.
The big exception to this is the generation of vectorial hardware, which is decided at top level through the `EnableVectors` parameter.
Each operation group is implemented in its own operation group block, each generating slices.
A unit type is selected for each format according to the settings in the `Implementation` parameter.
Formats can be implemented either in a format-specific `PARALLEL` slice or in a multi-format `MERGED` slice.
`PARALLEL` and `MERGED` slices can coexist within an operation group block when some formats select `PARALLEL` and others select `MERGED`.
In a parallel slice, operational units capable of processing exactly one format are generated.
If `EnableVectors` is set, operational units are duplicated into vectorial lanes in order to fill up the width of the datapath.
Results from all lanes are collected and assembled at the output of the slice.
Implementing units as parallel slices usually yields the best format-specific latency but costs more area.
In a merged slice, operational units capable of processing multiple formats are generated.
If `EnableVectors` is set, operational units for narrow formats are duplicated into vectorial lanes in order to fill up the width of the datapath.
To facilitate vectorial conversions that update an input vector, the third operand is pipelined along with the operation in the `CONV` block.
Results from all lanes are collected and assembled at the output of the slice.
Implementing units as merged slices usually yields the best total area but costs more per-format latency.
When the `ADDMUL` block is implemented using the `MERGED` implementation, multi-format FMA (multiplication done in `src_format`, accumulation in `dst_format`) is automatically supported among all formats using `MERGED`.
Pipeline registers are inserted into the operational units directly, according to the settings in the `Implementation` parameter.
As such, each slice in the system can have a different latency.
Merged slices are bound to have the largest latency of the included formats.
All pipeline registers are inserted as shift registers at predefined locations in the FPU. For optimal mapping, the retiming functionality of your synthesis tools should be used to balance the paths between registers.
Data traverses the pipeline stages within the operational units using the same handshaking mechanism that is also present at the top-level FPU interface. An individual pipeline stage is only stalled if its successor stage is stalled and cannot proceed in the following cycle. In general, different operations can overtake each other in the FPU if their latencies differ or significant backpressure exists in one of the paths. Hence, the use of operation tags is required to identify the exiting data if more than one operation is allowed to enter the FPU.
There are round-robin arbiters located at the outputs of slices as well as the outputs of operation group blocks that resolve contention for the output port of the FPU. Arbitration is fair, i.e. a unit cannot write the outputs twice in a row if other units are also contending for the output.