Loader corruption when running CTS with native cpu + level zero enabled #2511

Open
aarongreig opened this issue Dec 31, 2024 · 0 comments
Labels
level-zero L0 adapter specific issues

This issue only applies to the branch for #2479. I'm recording it here because it doesn't affect how our CI currently works, so it isn't necessarily a blocker to merging.

To reproduce, check out a branch containing the changes for #2479, enable and build the L0 and native cpu adapters, and run the test-enqueue CTS suite. The problem is intermittent, but it shouldn't take many attempts to see either a segfault in urQueueCreate or an error returned from it.

The pathology of this behaviour is that the context handle (importantly, it seems to be the loader handle, not the adapter handle) passed to urQueueCreate is somehow corrupted, so the wrong adapter's implementation of urQueueCreate gets called. Most commonly this happens during a native cpu test: the level zero implementation is called and returns UR_RESULT_ERROR_INVALID_DEVICE when it doesn't recognize the (native cpu) device. The problem doesn't respond well to debuggers, but I've instrumented various bits of loader and adapter code and have been able to observe the address of the urQueueCreate entry point changing from test to test when this occurs.
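To illustrate why a corrupted loader handle ends up in the wrong adapter, here is a minimal, self-contained model. It assumes the loader wraps each adapter handle together with a pointer to that adapter's dispatch table; every name in it (`LoaderContext`, `DispatchTable`, `loaderQueueCreate`, the `Result` values) is hypothetical and is not the actual loader code. The "corruption" is simulated by overwriting the table pointer explicitly.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical stand-ins for UR result codes; the values are illustrative only.
enum Result { SUCCESS = 0, ERROR_INVALID_DEVICE = 1 };

// Each adapter exposes a table of entry points (only one is modelled here).
struct DispatchTable {
    Result (*queueCreate)(void *adapterContext);
};

// Model adapters: one accepts the context, the other rejects it because it
// does not recognize the device the context belongs to.
static Result nativeCpuQueueCreate(void *) { return SUCCESS; }
static Result levelZeroQueueCreate(void *) { return ERROR_INVALID_DEVICE; }

static const DispatchTable kNativeCpuTable{nativeCpuQueueCreate};
static const DispatchTable kLevelZeroTable{levelZeroQueueCreate};

// Assumed shape of a loader-side handle: the adapter's own handle plus the
// dispatch table used to route calls made with this handle.
struct LoaderContext {
    void *adapterContext;
    const DispatchTable *dditable;
};

// Loader entry point: logs which table it is about to use, then forwards.
// Logging the table address is the kind of instrumentation that shows the
// target changing from call to call.
Result loaderQueueCreate(LoaderContext *ctx) {
    std::printf("dispatching via table %p\n",
                static_cast<const void *>(ctx->dditable));
    return ctx->dditable->queueCreate(ctx->adapterContext);
}

// Simulates the failure: the same loader handle is used twice, but its table
// pointer is overwritten in between, so the second call reaches the other
// adapter and fails.
Result simulateCorruptedDispatch() {
    LoaderContext ctx{nullptr, &kNativeCpuTable};
    Result first = loaderQueueCreate(&ctx);
    assert(first == SUCCESS);
    ctx.dditable = &kLevelZeroTable; // stand-in for the unexplained corruption
    return loaderQueueCreate(&ctx);
}
```

In this model, calling `simulateCorruptedDispatch()` returns the invalid-device error even though the context was created for the adapter that would have accepted it, which matches the symptom described above.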

This same issue is behind various other spooky behaviours in a few test suites. You can see problems running the test-queue suite, and in test-enqueue you'll sometimes get wrong results or a hang rather than what's described above.

Removing this line from the L0 urContextRelease implementation (effectively leaking all the contexts) makes the problem go away.

No problems are observed when only the native cpu + opencl adapters are enabled; strangely, the opencl adapter seems completely unaffected.

The issue isn't anything to do with a bad urEnqueue operation (initially I thought it might be related to a bad buffer operation or something). It can be reproduced in the test-queue suite running tests that only call the following entry points:

   ---> urAdapterGet
   ---> urAdapterGetInfo
   ---> urContextCreate
   ---> urContextRelease
   ---> urDeviceGet
   ---> urDeviceGetInfo
   ---> urPlatformGet
   ---> urPlatformGetInfo
   ---> urQueueCreate
   ---> urQueueGetInfo
   ---> urQueueRelease

Valgrind, UB sanitizer and address sanitizer have all come up empty-handed, although this must be some kind of memory corruption. As mentioned, it mostly doesn't reproduce while running in a debugger, so it isn't too surprising that these tools perturb execution enough to hide whatever's going on.

My current best guess is that something in the L0 adapter is retaining a reference to a data member of a context after the context is destroyed, and that the stale reference is later used in a way that causes bad memory accesses, although I haven't actually produced any evidence of this.
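One lightweight way to look for evidence of such a stale reference, without a debugger or sanitizer, could be to track live context handles inside the adapter and check every later use against that record. The sketch below is purely illustrative: `HandleTracker` and its methods are hypothetical names, not part of the adapter or the loader. The generation counter is there because a new context can be allocated at the same address as a destroyed one, and an address-only check would miss that case.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Hypothetical instrumentation helper: records which handles are alive and
// tags each creation with a unique generation number.
class HandleTracker {
public:
    // Record a newly created handle and return its generation. Whoever keeps
    // a reference to the handle (or to one of its members) stores this too.
    std::uint64_t onCreate(const void *handle) {
        std::lock_guard<std::mutex> lock(mutex_);
        std::uint64_t generation = ++nextGeneration_;
        live_[handle] = generation;
        return generation;
    }

    // Forget a handle when it is destroyed. Returns false if the handle was
    // not live, which would indicate a double release.
    bool onRelease(const void *handle) {
        std::lock_guard<std::mutex> lock(mutex_);
        return live_.erase(handle) == 1;
    }

    // True only if the handle is live AND was created with this generation,
    // so a reference that outlived its context is detected even when the
    // address has since been reused by a newer context.
    bool isLive(const void *handle, std::uint64_t generation) const {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = live_.find(handle);
        return it != live_.end() && it->second == generation;
    }

private:
    mutable std::mutex mutex_;
    std::unordered_map<const void *, std::uint64_t> live_;
    std::uint64_t nextGeneration_ = 0;
};
```

The intended use would be to call `onCreate` where a context is constructed, `onRelease` where it is destroyed, and to log or abort wherever `isLive` returns false for a stored (handle, generation) pair. Whether this perturbs timing enough to hide the bug, as the other tools did, is an open question.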

@aarongreig aarongreig added the level-zero L0 adapter specific issues label Dec 31, 2024