- Yuze Chi, Jason Cong. Exploiting Computation Reuse for Stencil Accelerators. In DAC, 2020. [PDF] [Code]
- Yuze Chi, Jason Cong, Peng Wei, Peipei Zhou. SODA: Stencil with Optimized Dataflow Architecture. In ICCAD, 2018. (Best Paper Candidate) [PDF] [Slides] [Code]
# comments start with hashtag(#)
kernel: blur # the kernel name, will be used as the kernel name in HLS
burst width: 512 # I/O width in bits, for Xilinx platform 512 works the best
unroll factor: 16 # how many pixels are generated per cycle
# specify the dram bank, type, name, and dimension of the input tile
# the last dimension is not needed and a placeholder '*' must be given
# dram bank is optional
# multiple inputs can be specified but 1 and only 1 must specify the dimensions
input dram 0 uint16: input(2000, *)
# specify an intermediate stage of computation, may appear 0 or more times
local uint16: tmp(0, 0) = (input(-1, 0) + input(0, 0) + input(1, 0)) / 3
# specify the output
# dram bank is optional
output dram 1 uint16: output(0, 0) = (tmp(0, -1) + tmp(0, 0) + tmp(0, 1)) / 3
# how many times the whole computation is repeated (only works if input matches output)
iterate: 2
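To make the semantics of the example above concrete, here is a minimal pure-Python reference of one pass of the same two-stage blur (it does not model `iterate: 2`). It is only a sketch. The function name `blur_reference`, the list-of-rows layout, the reading of the first DSL index as the horizontal offset, and the choice to leave border pixels at 0 are assumptions made for illustration, not behavior documented by SODA.

```python
def blur_reference(image):
    """Two-stage blur on a list of rows of non-negative integers.

    Assumptions: image[y][x] layout, the first DSL index is x, and
    pixels without a full neighborhood are left as 0.
    """
    height = len(image)
    width = len(image[0])
    # Stage 1: tmp(0, 0) = (input(-1, 0) + input(0, 0) + input(1, 0)) / 3
    tmp = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(1, width - 1):
            tmp[y][x] = (image[y][x - 1] + image[y][x] + image[y][x + 1]) // 3
    # Stage 2: output(0, 0) = (tmp(0, -1) + tmp(0, 0) + tmp(0, 1)) / 3
    out = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            out[y][x] = (tmp[y - 1][x] + tmp[y][x] + tmp[y + 1][x]) // 3
    return out
```

A reference like this can be handy for checking a generated kernel on small inputs, provided the border handling is adjusted to match what the kernel actually does.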
- Python 3.6+ and corresponding `pip`
How to install Python 3.6+ on Ubuntu 16.04+ and CentOS 7?
sudo apt install software-properties-common python3-pip
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt install python3.6
sudo apt install python3 python3-pip
sudo yum install python3 python3-pip
python3 -m pip install --user --upgrade sodac
- Replace `python3` with a more specific Python version higher than or equal to `python3.6`, if necessary.
- Make sure `~/.local/bin` is in your `PATH`, or replace `sodac` with `python3 -m soda.sodac` below.
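When driving the compiler from a script, the `PATH` check above can be automated. The following is a minimal sketch, assuming a hypothetical helper named `sodac_command`; it uses the current interpreter (`sys.executable`) in place of a literal `python3` for the fallback.

```python
import shutil
import sys


def sodac_command():
    """Return an argv prefix for invoking the SODA compiler."""
    path = shutil.which("sodac")
    if path is not None:
        # The sodac entry point is reachable through PATH.
        return [path]
    # Otherwise fall back to running the module with this interpreter.
    return [sys.executable, "-m", "soda.sodac"]
```

The returned list can be extended with the arguments shown below and passed to `subprocess.run`.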
sodac tests/src/blur.soda --xocl-kernel blur_kernel.cpp
sodac tests/src/blur.soda --iocl-kernel blur_kernel.cl
Requires `vivado_hls`.
sodac tests/src/blur.soda --xocl-hw-xo blur_kernel.hw.xo
Requires `vivado_hls`.
sodac tests/src/blur.soda --xocl-hw-xo blur_kernel.hw.xo --xocl-interface axis
sodac tests/src/blur.soda --computation-reuse --xocl-kernel blur_kernel.cpp
- 5-point 2D Jacobi: `t0(0, 0) = (t1(0, 1) + t1(1, 0) + t1(0, 0) + t1(0, -1) + t1(-1, 0)) * 0.2f`
- tile size is `(2000, *)`
Each function in the code snippets below is synthesized into an RTL module.
Their arguments are all `hls::stream` FIFOs.
Without unrolling, a simple line-buffer pipeline is generated, producing 1 pixel per cycle.
With unrolling (here by a factor of 2), a SODA microarchitecture pipeline is generated, producing 2 pixels per cycle.
#pragma HLS dataflow
Module1Func(
/*output*/ &from_t1_offset_0_to_t1_offset_1999,
/*output*/ &from_t1_offset_0_to_t0_pe_0,
/* input*/ &from_super_source_to_t1_offset_0);
Module2Func(
/*output*/ &from_t1_offset_1999_to_t1_offset_2000,
/*output*/ &from_t1_offset_1999_to_t0_pe_0,
/* input*/ &from_t1_offset_0_to_t1_offset_1999);
Module3Func(
/*output*/ &from_t1_offset_2000_to_t1_offset_2001,
/*output*/ &from_t1_offset_2000_to_t0_pe_0,
/* input*/ &from_t1_offset_1999_to_t1_offset_2000);
Module3Func(
/*output*/ &from_t1_offset_2001_to_t1_offset_4000,
/*output*/ &from_t1_offset_2001_to_t0_pe_0,
/* input*/ &from_t1_offset_2000_to_t1_offset_2001);
Module4Func(
/*output*/ &from_t1_offset_4000_to_t0_pe_0,
/* input*/ &from_t1_offset_2001_to_t1_offset_4000);
Module5Func(
/*output*/ &from_t0_pe_0_to_super_sink,
/* input*/ &from_t1_offset_0_to_t0_pe_0,
/* input*/ &from_t1_offset_1999_to_t0_pe_0,
/* input*/ &from_t1_offset_2000_to_t0_pe_0,
/* input*/ &from_t1_offset_4000_to_t0_pe_0,
/* input*/ &from_t1_offset_2001_to_t0_pe_0);
In the above code snippet, `Module1Func` to `Module4Func` are forwarding modules; they constitute the line buffer.
The line buffer size is approximately two lines of pixels, i.e. 4000 pixels.
`Module5Func` is a computing module; it implements the computation kernel.
The whole design is fully pipelined; however, with only 1 computing module, it can produce only 1 pixel per cycle.
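The offsets in the FIFO names of the listing above (0, 1999, 2000, 2001, 4000) can be reproduced from the stencil taps and the tile width. The sketch below shows one way to derive them. It assumes pixels are streamed row by row with rows of `tile_width` pixels, and the name `reuse_offsets` is hypothetical; it illustrates the idea and is not the code SODA uses.

```python
def reuse_offsets(taps, tile_width):
    """Map each stencil tap (dx, dy) to its delay in the input stream.

    Assumption: pixels stream in row-major order with rows of
    tile_width pixels, so tap (dx, dy) sits at linear position
    dy * tile_width + dx relative to the output pixel.
    """
    linear = [dy * tile_width + dx for dx, dy in taps]
    newest = max(linear)  # the tap that arrives last has delay 0
    return [newest - pos for pos in linear]


# Taps of the 5-point 2D Jacobi, in the order they appear in the kernel.
JACOBI_TAPS = [(0, 1), (1, 0), (0, 0), (0, -1), (-1, 0)]
offsets = reuse_offsets(JACOBI_TAPS, 2000)
print(offsets)       # [0, 1999, 2000, 4000, 2001]
print(max(offsets))  # 4000
```

The largest delay, 4000, is the span the forwarding modules have to cover, which is where the figure of roughly two lines of pixels comes from.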
#pragma HLS dataflow
Module1Func(
/*output*/ &from_t1_offset_1_to_t1_offset_1999,
/*output*/ &from_t1_offset_1_to_t0_pe_0,
/* input*/ &from_super_source_to_t1_offset_1);
Module1Func(
/*output*/ &from_t1_offset_0_to_t1_offset_2000,
/*output*/ &from_t1_offset_0_to_t0_pe_1,
/* input*/ &from_super_source_to_t1_offset_0);
Module2Func(
/*output*/ &from_t1_offset_1999_to_t1_offset_2001,
/*output*/ &from_t1_offset_1999_to_t0_pe_1,
/* input*/ &from_t1_offset_1_to_t1_offset_1999);
Module3Func(
/*output*/ &from_t1_offset_2000_to_t1_offset_2002,
/*output*/ &from_t1_offset_2000_to_t0_pe_1,
/*output*/ &from_t1_offset_2000_to_t0_pe_0,
/* input*/ &from_t1_offset_0_to_t1_offset_2000);
Module4Func(
/*output*/ &from_t1_offset_2001_to_t1_offset_4001,
/*output*/ &from_t1_offset_2001_to_t0_pe_1,
/*output*/ &from_t1_offset_2001_to_t0_pe_0,
/* input*/ &from_t1_offset_1999_to_t1_offset_2001);
Module5Func(
/*output*/ &from_t1_offset_2002_to_t1_offset_4000,
/*output*/ &from_t1_offset_2002_to_t0_pe_0,
/* input*/ &from_t1_offset_2000_to_t1_offset_2002);
Module6Func(
/*output*/ &from_t1_offset_4001_to_t0_pe_0,
/* input*/ &from_t1_offset_2001_to_t1_offset_4001);
Module7Func(
/*output*/ &from_t0_pe_0_to_super_sink,
/* input*/ &from_t1_offset_1_to_t0_pe_0,
/* input*/ &from_t1_offset_2000_to_t0_pe_0,
/* input*/ &from_t1_offset_2001_to_t0_pe_0,
/* input*/ &from_t1_offset_4001_to_t0_pe_0,
/* input*/ &from_t1_offset_2002_to_t0_pe_0);
Module8Func(
/*output*/ &from_t1_offset_4000_to_t0_pe_1,
/* input*/ &from_t1_offset_2002_to_t1_offset_4000);
Module7Func(
/*output*/ &from_t0_pe_1_to_super_sink,
/* input*/ &from_t1_offset_0_to_t0_pe_1,
/* input*/ &from_t1_offset_1999_to_t0_pe_1,
/* input*/ &from_t1_offset_2000_to_t0_pe_1,
/* input*/ &from_t1_offset_4000_to_t0_pe_1,
/* input*/ &from_t1_offset_2001_to_t0_pe_1);
In the above code snippet, `Module1Func` to `Module6Func` and `Module8Func` are forwarding modules; they constitute the line buffers of the SODA microarchitecture.
Although unrolled, the line buffer size is still approximately two lines of pixels, i.e. 4000 pixels.
`Module7Func` is a computing module; it is instantiated twice.
The whole design is fully pipelined and can produce 2 pixels per cycle.
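The structure of the unrolled listing can be checked with a few lines of arithmetic. The sketch below starts from the offsets of the non-unrolled design (0, 1999, 2000, 4000, 2001) and assumes that PE `i` reads each of them shifted by `unroll_factor - 1 - i`, which is what the FIFO names in the listing suggest. The function names are hypothetical and the grouping rule is inferred from the listing rather than taken from SODA's source.

```python
def unrolled_chains(base_offsets, unroll_factor):
    """Group the offsets needed by all PEs into one chain per lane.

    Assumption: PE i reads every base offset shifted by
    (unroll_factor - 1 - i), matching the FIFO names in the listing.
    """
    needed = set()
    for pe in range(unroll_factor):
        shift = unroll_factor - 1 - pe
        needed.update(offset + shift for offset in base_offsets)
    # Offsets that are congruent modulo the unroll factor share a lane.
    chains = {lane: [] for lane in range(unroll_factor)}
    for offset in sorted(needed):
        chains[offset % unroll_factor].append(offset)
    return chains


def buffered_pixels(chains, unroll_factor):
    """Total pixels held between the first and last offset of each chain."""
    return sum((c[-1] - c[0]) // unroll_factor for c in chains.values() if c)


chains = unrolled_chains([0, 1999, 2000, 4000, 2001], 2)
print(chains[1])                   # [1, 1999, 2001, 4001]
print(chains[0])                   # [0, 2000, 2002, 4000]
print(buffered_pixels(chains, 2))  # 4000
```

The two chains contain eight offsets in total, one per forwarding module instance in the listing, and together they still hold about 4000 pixels.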
In general, the unroll factor can be set to any number that satisfies the throughput requirement.
- For a non-iterative stencil, the `unroll factor` should be determined by the DRAM bandwidth, i.e. chosen to saturate the external bandwidth, since resources are usually not the bottleneck.
- For an iterative stencil, whether to use more PEs in a single iteration or to implement more iterations is yet to be explored.
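As a rough illustration of the bandwidth-bound case, the number of elements a memory port can deliver per cycle is the burst width divided by the element width. The sketch below assumes one input bank and one output bank, one burst-width word transferred per kernel cycle, and a hypothetical helper name `bandwidth_bound_unroll`. It is a rule of thumb, not a formula taken from SODA.

```python
def bandwidth_bound_unroll(burst_width_bits, element_bits):
    """Elements one memory port can supply per cycle (assumed rule of thumb)."""
    if burst_width_bits % element_bits != 0:
        raise ValueError("burst width must be a multiple of the element width")
    return burst_width_bits // element_bits


print(bandwidth_bound_unroll(512, 16))  # 32
print(bandwidth_bound_unroll(512, 32))  # 16
```

The achievable DRAM bandwidth on a real board also depends on the clock frequency and the memory controller, so this number is best treated as an upper bound to start from.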
- Note that `2.0` will be a `double` number. To generate `float`, use `2.0f`. This may help reduce DSP usage.
- SODA is tiling-based and the size of the tile is specified in the `input` keyword. The last dimension is omitted because it is not needed in the reuse buffer generation.
- Licheng Guo, Jason Lau, Yuze Chi, Jie Wang, Cody Hao Yu, Zhe Chen, Zhiru Zhang, Jason Cong. Analysis and Optimization of the Implicit Broadcasts in FPGA HLS to Improve Maximum Frequency. In DAC, 2020. [PDF]
- Jiajie Li, Yuze Chi, Jason Cong. HeteroHalide: From Image Processing DSL to Efficient FPGA Acceleration. In FPGA, 2020. [PDF] [Slides]
- Young-kyu Choi, Yuze Chi, Jie Wang, Jason Cong. FLASH: Fast, ParalleL, and Accurate Simulator for HLS. In TCAD, 2020. [PDF]
- Yi-Hsiang Lai, Yuze Chi, Yuwei Hu, Jie Wang, Cody Hao Yu, Yuan Zhou, Jason Cong, Zhiru Zhang. HeteroCL: A Multi-Paradigm Programming Infrastructure for Software-Defined Reconfigurable Computing. In FPGA, 2019. (Best Paper Candidate) [PDF] [Slides] [Code]
- Yuze Chi, Young-kyu Choi, Jason Cong, Jie Wang. Rapid Cycle-Accurate Simulator for High-Level Synthesis. In FPGA, 2019. [PDF] [Slides]