Intermittent segfault in apparently memory safe code, perhaps FFTW related #48722

Closed

kbarros opened this issue Feb 19, 2023 · 11 comments


kbarros commented Feb 19, 2023

We are observing intermittent segfault behavior when running the tests of the Sunny.jl package. It sometimes shows up during GitHub CI testing of our simplified crash branch:

pkg> add Sunny#crash
pkg> test Sunny

It only crashes sometimes, however. On my Mac, for example, crashes are rare, but when they happen, it's in roughly the same code location. An example of the segfault output is shown from this CI action: https://github.com/SunnySuite/Sunny.jl/actions/runs/4214225988/jobs/7314550112

The segfault seems to always occur inside FFTW, but perhaps there is memory corruption happening prior to FFTW.

The branch Sunny#crash contains no @inbounds annotations or other "memory unsafe" operations, as far as we can tell (presumably the FFT package is intended to be memory safe?). Sunny does depend on external C libraries, which could of course corrupt memory.

I tried to bisect to a commit where the crash first appeared, and it seems to be one of these two:
SunnySuite/Sunny.jl@fb0a631 <- where crashes become very noticeable
SunnySuite/Sunny.jl@9f97b54 <- parent commit, seems suspicious to me

We recorded a log of the crash using --bug-report=rr and uploaded here:
https://julialang-dumps.s3.amazonaws.com/reports/2023-02-18T02-49-23-ddahlbom.tar.zst

Two example segfault outputs are below.

signal (11): Segmentation fault
in expression starting at /home/runner/work/Sunny.jl/Sunny.jl/test/test_energy_consistency.jl:77
unknown function (ip: 0x11b22230)
energy at /home/runner/work/Sunny.jl/Sunny.jl/src/System/Interactions.jl:194
test_delta at /home/runner/work/Sunny.jl/Sunny.jl/test/test_energy_consistency.jl:67
unknown function (ip: 0x7f9c2fdad002)
_jl_invoke at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2377 [inlined]
ijl_apply_generic at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/gf.c:2559
jl_apply at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/julia.h:1843 [inlined]
do_call at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/interpreter.c:126
eval_value at /cache/build/default-amdci4-2/julialang/julia-release-1-dot-8/src/interpreter.c:215
...

and

signal (11): Segmentation fault
in expression starting at /home/runner/work/Sunny.jl/Sunny.jl/test/test_energy_consistency.jl:77
unknown function (ip: 0x31)
unsafe_execute! at /home/runner/.julia/packages/FFTW/sfy1o/src/fft.jl:500 [inlined]
mul! at /home/runner/.julia/packages/FFTW/sfy1o/src/fft.jl:859 [inlined]
energy at /home/runner/work/Sunny.jl/Sunny.jl/src/System/Ewald.jl:125
Allocations: 294550517 (Pool: 294319097; Big: 231420); GC: 174
ERROR: LoadError: Package Sunny errored during testing (received signal: 11)
Stacktrace:
 [1] pkgerror(msg::String)
   @ Pkg.Types /opt/hostedtoolcache/julia/1.8.5/x64/share/julia/stdlib/v1.8/Pkg/src/Types.jl:67
 [2] test(ctx::Pkg.Types.Context, pkgs::Vector{Pkg.Types.PackageSpec}; coverage::Bool, julia_args::Cmd, test_args::Cmd, test_fn::Nothing, force_latest_compatible_version::Bool, allow_earlier_backwards_compatible_versions::Bool, allow_reresolve::Bool)
...
  1. The output of versioninfo()

We have observed the problem on multiple machines, all using Julia 1.8.5. It primarily appears on GitHub Actions CI using x64, but I have also seen it on my M1 Mac, which is:

julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.5.0)
  CPU: 8 × Apple M1 Pro
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 1 on 6 virtual cores
  2. How you installed Julia

GitHub Actions Julia installer with [Julia 1.8 - ubuntu-latest - x64], or juliaup for Mac.

Thank you.


oscardssmith commented Feb 19, 2023

Have you tried running this with --check-bounds=yes? Quite possibly not related, but it's always nice to be able to totally rule out bounds errors.


kbarros commented Feb 19, 2023

On my local machine (M1 Mac) the tests run cleanly with --check-bounds=yes. I am not sure how to enable that option on GitHub CI (x64 Ubuntu), which is where it usually segfaults.
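(Aside: one way to get bounds checking in CI without changing the runner configuration might be to forward the flag to the test subprocess through Pkg's julia_args keyword. This is a hedged sketch, not a tested CI setup; whether the Sunny CI workflow can be adapted to call Pkg.test directly is an assumption.)

```julia
# Pkg.test spawns a fresh Julia process for the test suite; julia_args
# (a Cmd or Vector{String}) is forwarded to that subprocess, so the
# flag applies where the tests actually run.
using Pkg
Pkg.test("Sunny"; julia_args=`--check-bounds=yes`)
```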


vtjnash commented Feb 19, 2023

Can you try running on the master branch of Julia and see if it still happens? I am currently getting a replay divergence with that trace, so are you recording on AMD by any chance? (Intel CPUs tend to be a bit more reliable at counting ticks, particularly since my replay machine is Intel.)


ddahlbom commented Feb 19, 2023

The recording was indeed made on an AMD machine.

Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × AMD Ryzen Threadripper PRO 3975WX 32-Cores
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, znver2)
  Threads: 1 on 64 virtual cores

Note also this recording was made on the main branch of Sunny (commit dc0a6b5b12a7c700d4f8b6a808bf9ef6c2c5f741). We can try to get a new recording on the #crash branch, but it would still be on the AMD machine listed above.

We can also see what happens on the master branch of Julia.


kbarros commented Feb 19, 2023

I can confirm a GitHub CI crash on the 1.10.0-DEV.637 nightly.
https://github.com/SunnySuite/Sunny.jl/actions/runs/4217218004/jobs/7320835373


kbarros commented Feb 20, 2023

We will attempt to produce an rr trace on an Intel machine.

Update: Despite some effort, we're having trouble creating another rr trace. Does the existing rr trace replay on an AMD machine?

@ddahlbom

We replayed the trace on the AMD machine on which it was originally recorded. After entering continue in gdb, we get the following (followed by the tail of the trace dump):

[FATAL src/ReplaySession.cc:1178:check_ticks_consistency()] 
 (task 5896 (rec:41649) at time 1082)
 -> Assertion `ticks_now == trace_ticks' failed to hold. ticks mismatch for 'SYSCALL: mmap'; expected 22197705, got 22197709

@vtjnash Is this what you were referring to? Does this indicate that the trace is unusable?

We will continue the effort to capture a trace on an Intel machine using the nightly. As noted above, we have seen the segfault on Intel, but so far it has only been on computers where we don't have the ability to use rr.


kbarros commented Feb 24, 2023

Also, any other hints to diagnose memory corruption are welcome. We tried running valgrind with a vanilla nightly using these instructions, but the output appears clean. Is it likely to be helpful to try a custom Julia build with CFLAGS = -DMEMDEBUG -DMEMDEBUG2?


vtjnash commented Feb 24, 2023

Yes, though perhaps there are settings you can use to relax the ticks checks, or maybe this is controlled by the perf_event_paranoid kernel setting?

MSAN can be very good for that, but can be a bit annoying to build (https://docs.julialang.org/en/v1/devdocs/sanitizers/#Sanitizer-support)

kbarros added a commit to SunnySuite/Sunny.jl that referenced this issue Feb 24, 2023
Unfortunately, the @inbounds marker seems unrelated to the crashing
behavior. Attempt another temporary workaround: disable dipole-dipole
in energy consistency test.

JuliaLang/julia#48722

kbarros commented Feb 27, 2023

On Slack, vtjnash and Gabriel Baraldi explained that deepcopy is memory unsafe, especially when used in combination with C libraries. Removing deepcopy from the Sunny codebase appears to fix the segfault. Perhaps calling deepcopy on an FFTW plan produced a new plan in an invalid state (I don't have the tools to investigate). Hopefully this issue can be considered resolved now. Many thanks!
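(For context, a minimal sketch of the suspect pattern; this is hypothetical and the actual reduction is the one linked below in the FFTW.jl issue. deepcopy duplicates the Julia-side plan object, but it cannot safely duplicate the C-side fftw_plan state the object wraps:)

```julia
using FFTW

x = rand(ComplexF64, 64)
p = plan_fft(x)     # wraps a raw pointer to a C fftw_plan
q = deepcopy(p)     # copies the Julia struct, but not the underlying C plan state
p * x               # fine: uses the original, valid plan
q * x               # the copy may reference invalid plan state and can segfault
```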


kbarros commented Mar 1, 2023

Indeed the problem was deepcopy of FFTW plans. A minimal example is here: JuliaMath/FFTW.jl#261.
