Instruction selection error from loads and stores on CUDA #33

Open
tkf opened this issue Jun 5, 2022 · 9 comments

@tkf
Member

tkf commented Jun 5, 2022

(This is not really an Atomix issue but it's not clear where to track this.)

Currently, LLVM does not seem to be able to select instructions for atomic loads and stores on CUDA (see Examples below). Until this is fixed in LLVM, a short-term solution could be to use @device_override in CUDA.jl to emit appropriate device-specific instructions through the LLVM.Interop.atomic_pointer* APIs.

Ref:
JuliaGPU/CUDA.jl#1353
JuliaGPU/CUDA.jl#1393

Workaround

If you really need atomic ordering semantics for a load or a store, you can sometimes get them by using a stronger read-modify-write operation instead, such as:

x = Atomix.@atomic xs[i] += 0  # load (for numeric elements)
Atomix.@atomicswap xs[i] = x   # store
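
The workaround above can be tried on the CPU first. The following is a minimal sketch, assuming a plain `Vector{Int}` rather than a `CuDeviceArray`; the helper names `atomic_load_via_rmw` and `atomic_store_via_swap!` are hypothetical and not part of Atomix. It only illustrates the pattern and does not exercise the CUDA code path that fails in this issue.

```julia
using Atomix

# Hypothetical helper: an atomic "load" expressed as a read-modify-write
# that adds zero, so the stored value is left unchanged.
function atomic_load_via_rmw(xs, i)
    return Atomix.@atomic xs[i] += zero(eltype(xs))
end

# Hypothetical helper: an atomic "store" expressed as a swap whose
# returned old value is simply discarded.
function atomic_store_via_swap!(xs, i, x)
    Atomix.@atomicswap xs[i] = x
    return nothing
end

xs = zeros(Int, 4)
atomic_store_via_swap!(xs, 2, 3)   # store 3 into xs[2]
v = atomic_load_via_rmw(xs, 2)     # read xs[2] back through the RMW
```

Both helpers cost more than a true atomic load or store because each one performs a full read-modify-write, which is why this is only a stopgap.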

Examples

julia> function load(xs)
           Atomix.@atomic xs[1]
           nothing
       end;

julia> @cuda load(CUDA.zeros(1))
ERROR: LLVM error: Cannot select: 0x6b13098: f32,ch = AtomicLoad<(load seq_cst 4 from %ir.2, addrspace 1)> 0x704a6d8, 0x6b13b90, /home/tkf/.julia/packages/LLVM/YSJ2s/src/interop/base.jl:45 @[ /home/tkf/.julia/packages/LLVM/YSJ2s/src/interop/atomics.jl:166 @[ /home/tkf/.julia/packages/LLVM/YSJ2s/src/interop/atomics.jl:166 @[ /home/tkf/.julia/packages/UnsafeAtomicsLLVM/i4GMj/src/internal.jl:7 @[ /home/tkf/.julia/packages/Atomix/F9VIX/src/core.jl:5 @[ REPL[11]:2 ] ] ] ] ]
  0x6b13b90: i64,ch = CopyFromReg 0x704a6d8, Register:i64 %0, abstractarray.jl:656 @[ /home/tkf/.julia/packages/Atomix/F9VIX/src/references.jl:95 @[ REPL[11]:2 ] ]
    0x6b12fc8: i64 = Register %0

julia> function store!(xs)
           Atomix.@atomic xs[1] = zero(eltype(xs))
           nothing
       end;

julia> @cuda store!(CUDA.zeros(1))
ERROR: LLVM error: Cannot select: 0x854a080: ch = AtomicStore<(store seq_cst 4 into %ir.2, addrspace 1)> 0x863d288, 0x85273f0, ConstantFP:f32<0.000000e+00>, /home/tkf/.julia/packages/LLVM/YSJ2s/src/interop/base.jl:45 @[ /home/tkf/.julia/packages/LLVM/YSJ2s/src/interop/atomics.jl:245 @[ /home/tkf/.julia/packages/LLVM/YSJ2s/src/interop/atomics.jl:201 @[ /home/tkf/.julia/packages/UnsafeAtomicsLLVM/i4GMj/src/internal.jl:11 @[ /home/tkf/.julia/packages/Atomix/F9VIX/src/core.jl:14 @[ REPL[13]:2 ] ] ] ] ]
  0x85273f0: i64,ch = CopyFromReg 0x863d288, Register:i64 %0, abstractarray.jl:656 @[ /home/tkf/.julia/packages/Atomix/F9VIX/src/references.jl:95 @[ REPL[13]:2 ] ]
    0x854a220: i64 = Register %0
  0x8527ad8: f32 = ConstantFP<0.000000e+00>

(jl_4BkM8Y) pkg> st
      Status `/tmp/jl_4BkM8Y/Project.toml`
  [a9b6321e] Atomix v0.1.0
  [052768ef] CUDA v3.10.1
  [d80eeb9a] UnsafeAtomicsLLVM v0.1.0

julia> versioninfo()
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, broadwell)
@jgreener64

I am trying to create a minimal example for EnzymeAD/Enzyme.jl#511 but am struggling to get Atomix to work on the GPU.

The error I get for the above examples is different from the one shown. For example:

using Atomix, CUDA
function load(xs)
    Atomix.@atomic xs[1]
    nothing
end
@cuda load(CUDA.zeros(1))
ERROR: GPU compilation of kernel #load(CuDeviceVector{Float32, 1}) failed
KernelError: kernel returns a value of type `Union{}`

Make sure your kernel function ends in `return`, `return nothing` or `nothing`.
If the returned value is of type `Union{}`, your Julia code probably throws an exception.
Inspect the code with `@device_code_warntype` for more details.

Stacktrace:
  [1] check_method(job::GPUCompiler.CompilerJob)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/hi5Wg/src/validation.jl:41
  [2] macro expansion
    @ ~/.julia/packages/TimerOutputs/LHjFw/src/TimerOutput.jl:253 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/GPUCompiler/hi5Wg/src/driver.jl:152 [inlined]
  [4] emit_julia(job::GPUCompiler.CompilerJob)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/hi5Wg/src/utils.jl:68
  [5] cufunction_compile(job::GPUCompiler.CompilerJob, ctx::LLVM.Context)
    @ CUDA ~/.julia/packages/CUDA/Ey3w2/src/compiler/execution.jl:352
  [6] #224
    @ ~/.julia/packages/CUDA/Ey3w2/src/compiler/execution.jl:347 [inlined]
  [7] JuliaContext(f::CUDA.var"#224#225"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(load), Tuple{CuDeviceVector{Float32, 1}}}}})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/hi5Wg/src/driver.jl:76
  [8] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/Ey3w2/src/compiler/execution.jl:346
  [9] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/hi5Wg/src/cache.jl:90
 [10] cufunction(f::typeof(load), tt::Type{Tuple{CuDeviceVector{Float32, 1}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA ~/.julia/packages/CUDA/Ey3w2/src/compiler/execution.jl:299
 [11] cufunction(f::typeof(load), tt::Type{Tuple{CuDeviceVector{Float32, 1}}})
    @ CUDA ~/.julia/packages/CUDA/Ey3w2/src/compiler/execution.jl:292
 [12] top-level scope
    @ ~/.julia/packages/CUDA/Ey3w2/src/compiler/execution.jl:102
 [13] top-level scope
    @ ~/.julia/packages/CUDA/Ey3w2/src/initialization.jl:52

I also get a similar error with the suggestion at the top:

function load2(xs)
    x = Atomix.@atomic xs[1] += 0
    nothing
end
@cuda load2(CUDA.zeros(1))

If I try the following I get a different error:

function load3(xs)
    if threadIdx().x == 1
        Atomix.@atomic xs[1]
    end
    nothing
end
@cuda load3(CUDA.zeros(1))
ERROR: InvalidIRError: compiling kernel #load3(CuDeviceVector{Float32, 1}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to load)
Stacktrace:
 [1] get
   @ ~/.julia/packages/Atomix/F9VIX/src/core.jl:5
 [2] load3
   @ ./REPL[11]:3
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(load3), Tuple{CuDeviceVector{Float32, 1}}}}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/hi5Wg/src/validation.jl:141
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/hi5Wg/src/driver.jl:418 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/LHjFw/src/TimerOutput.jl:253 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/hi5Wg/src/driver.jl:416 [inlined]
  [5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/hi5Wg/src/utils.jl:68
  [6] cufunction_compile(job::GPUCompiler.CompilerJob, ctx::LLVM.Context)
    @ CUDA ~/.julia/packages/CUDA/Ey3w2/src/compiler/execution.jl:354
  [7] #224
    @ ~/.julia/packages/CUDA/Ey3w2/src/compiler/execution.jl:347 [inlined]
  [8] JuliaContext(f::CUDA.var"#224#225"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(load3), Tuple{CuDeviceVector{Float32, 1}}}}})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/hi5Wg/src/driver.jl:76
  [9] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/Ey3w2/src/compiler/execution.jl:346
 [10] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/hi5Wg/src/cache.jl:90
 [11] cufunction(f::typeof(load3), tt::Type{Tuple{CuDeviceVector{Float32, 1}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA ~/.julia/packages/CUDA/Ey3w2/src/compiler/execution.jl:299
 [12] cufunction(f::typeof(load3), tt::Type{Tuple{CuDeviceVector{Float32, 1}}})
    @ CUDA ~/.julia/packages/CUDA/Ey3w2/src/compiler/execution.jl:292
 [13] top-level scope
    @ ~/.julia/packages/CUDA/Ey3w2/src/compiler/execution.jl:102
 [14] top-level scope
    @ ~/.julia/packages/CUDA/Ey3w2/src/initialization.jl:52

I am on Julia 1.8.2, Atomix 0.1.0 and CUDA 3.12.1. System info:

Julia Version 1.8.2
Commit 36034abf260 (2022-09-29 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 36 × Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, cascadelake)
  Threads: 16 on 36 virtual cores
Environment:
  LD_LIBRARY_PATH = /usr/local/gromacs/lib
CUDA toolkit 11.7, artifact installation
NVIDIA driver 470.161.3, for CUDA 11.4
CUDA driver 11.7

Libraries: 
- CUBLAS: 11.10.1
- CURAND: 10.2.10
- CUFFT: 10.7.2
- CUSOLVER: 11.3.5
- CUSPARSE: 11.7.3
- CUPTI: 17.0.0
- NVML: 11.0.0+470.161.3
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:
- Julia: 1.8.2
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

2 devices:
  0: NVIDIA RTX A6000 (sm_86, 47.249 GiB / 47.544 GiB available)
  1: NVIDIA RTX A6000 (sm_86, 46.256 GiB / 47.541 GiB available)

@vchuravy
Member

To get the actual LLVM error, load UnsafeAtomicsLLVM as well:

julia> using CUDA, Atomix, UnsafeAtomicsLLVM

julia> function load3(xs)
           if threadIdx().x == 1
               Atomix.@atomic xs[1]
           end
           nothing
       end
load3 (generic function with 1 method)

julia> @cuda load3(CUDA.zeros(1))

Then you get:

ERROR: LLVM error: Cannot select: 0x51076a0: f32,ch = AtomicLoad<(load seq_cst (s32) from %ir.3, addrspace 1)> 0x47b3938, 0x450a590, /home/vchuravy/.julia/packages/LLVM/9gCXO/src/interop/base.jl:40 @[ /home/vchuravy/.julia/packages/LLVM/9gCXO/src/interop/atomics.jl:166 @[ /home/vchuravy/.julia/packages/LLVM/9gCXO/src/interop/atomics.jl:166 @[ /home/vchuravy/.julia/packages/UnsafeAtomicsLLVM/i4GMj/src/internal.jl:7 @[ /home/vchuravy/.julia/packages/Atomix/F9VIX/src/core.jl:5 @[ REPL[2]:3 ] ] ] ] ]
  0x450a590: i64,ch = CopyFromReg 0x47b3938, Register:i64 %0, REPL[2]:2
    0x450a5f8: i64 = Register %0
In function: _Z16julia_load3_185713CuDeviceArrayI7Float32Li1ELi1EE

One way around this would be JuliaGPU/CUDA.jl#1644

@jgreener64

One way around this would be JuliaGPU/CUDA.jl#1644

Is that expected to work currently? I get the same error when using that branch.

@vchuravy
Member

No, not fully yet, and I haven't done the work to integrate it with Atomix.

@vchuravy
Member

This works though:

julia> function load3(xs)
           if threadIdx().x == 1
               Atomix.@atomic :monotonic xs[1]
           end
           nothing
       end
load3 (generic function with 1 method)

julia> @cuda load3(CUDA.zeros(1))
CUDA.HostKernel{typeof(load3), Tuple{CuDeviceVector{Float32, 1}}}(load3, CuFunction(Ptr{Nothing} @0x0000000006513120, CuModule(Ptr{Nothing} @0x00000000065535c0, CuContext(0x00000000024c8b20, instance 1658ac4f1e894673))), CUDA.KernelState(Ptr{Nothing} @0x00007f9baf400000))
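
The only change from the failing kernel is the explicit `:monotonic` ordering; without it, the errors earlier in this thread show the operation being emitted as `seq_cst`. As a minimal sketch of the ordering syntax, assuming a plain CPU `Vector{Int}` (so it says nothing about what CUDA can or cannot compile):

```julia
using Atomix

xs = [10]

# Atomic load with an explicit (relaxed) ordering.
a = Atomix.@atomic :monotonic xs[1]

# Atomic read-modify-write with the same ordering, written as `+= -1`.
Atomix.@atomic :monotonic xs[1] += -1

# Load again to observe the update.
b = Atomix.@atomic :monotonic xs[1]
```

Note that `:monotonic` only guarantees atomicity of the access itself and does not order it against surrounding memory operations, so it is not a drop-in replacement where the stronger ordering is actually needed.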

@jgreener64

Works for me, I'll see how this plays with Enzyme gradients.

@vchuravy
Member

But the atomic decrement still fails:

julia> function load3(xs)
           if threadIdx().x == 1
               Atomix.@atomic :monotonic xs[1] -= 1.0
           end
           nothing
       end
load3 (generic function with 1 method)

julia> @cuda load3(CUDA.zeros(1))
ERROR: LLVM error: Cannot select: 0xfc25d28: f32,ch = <<Unknown DAG Node>><(load store monotonic (s32) on %ir.3, addrspace 1)> 0x885c938, 0xfc24cb0, ConstantFP:f32<1.000000e+00>, /home/vchuravy/.julia/packages/LLVM/9gCXO/src/interop/base.jl:40 @[ /home/vchuravy/.julia/packages/LLVM/9gCXO/src/interop/atomics.jl:270 @[ /home/vchuravy/.julia/packages/LLVM/9gCXO/src/interop/atomics.jl:270 @[ /home/vchuravy/.julia/packages/LLVM/9gCXO/src/interop/atomics.jl:374 @[ /home/vchuravy/.julia/packages/UnsafeAtomicsLLVM/i4GMj/src/internal.jl:18 @[ /home/vchuravy/.julia/packages/Atomix/F9VIX/src/core.jl:33 @[ REPL[33]:3 ] ] ] ] ] ]
  0xfc24cb0: i64,ch = CopyFromReg 0x885c938, Register:i64 %0, REPL[33]:2
    0xfc24d18: i64 = Register %0
  0xfc25cc0: f32 = ConstantFP<1.000000e+00>
In function: _Z16julia_load3_842013CuDeviceArrayI7Float32Li1ELi1EE

@vchuravy
Member

But ...

function load3(xs)
    if threadIdx().x == 1
        Atomix.@atomic :monotonic xs[1] += -1.0
    end
    nothing
end

@vchuravy
Member

This may be helped by JuliaGPU/GPUCompiler.jl#652
