diff --git a/rfcs/proposed/numa_support/tbbbind-link-static-hwloc.org b/rfcs/proposed/numa_support/tbbbind-link-static-hwloc.org
new file mode 100755
index 0000000000..8628da284b
--- /dev/null
+++ b/rfcs/proposed/numa_support/tbbbind-link-static-hwloc.org
@@ -0,0 +1,125 @@
+# -*- fill-column: 80; -*-
+
+#+title: Link ~tbbbind~ with static HWLOC to improve predictability of NUMA support API
+
+*Note:* This is a sub-RFC of https://github.com/oneapi-src/oneTBB/pull/1535,
+specifically of its section about "Increased availability of NUMA support".
+
+* Introduction
+oneTBB has a soft dependency on several variants of ~tbbbind~, which are loaded
+by the library as part of its initialization stage. In turn, each ~tbbbind~ has
+a hard dependency on a concrete version of the HWLOC library [1, 2]. The soft
+dependency of oneTBB on ~tbbbind~ allows the library to continue its execution
+even if the system loader is unable to resolve the hard dependency on HWLOC for
+~tbbbind~. In this case, the HW topology is not discovered and the machine is
+seen as if all CPU cores were uniform, which is the default TBB behavior when
+NUMA constraints are not used. Thus, the following code returns values that do
+not reflect the real topology and therefore carry no useful information:
+
+#+begin_src C++
+std::vector<oneapi::tbb::numa_node_id> numa_nodes = oneapi::tbb::info::numa_nodes();
+std::vector<oneapi::tbb::core_type_id> core_types = oneapi::tbb::info::core_types();
+#+end_src
+
+This lack of valid HW topology data due to the absence of a third-party library
+is the major problem with the current oneTBB behavior. There are no diagnostics
+for the issue, so it is likely to go unnoticed by developers, and the code that
+uses oneTBB NUMA support facilities continues running but does not use NUMA as
+intended.
+
+Having a dependency on a shared HWLOC library has advantages:
+1. Code reuse, with all of its positive consequences: relying on the same code
+   that has been tested and debugged, and allowing the OS to share it among
+   different processes, which consequently improves cache locality and
+   reduces memory footprint. That is the primary purpose of shared
+   libraries.
+2. A drop-in replacement. Users are able to use their own version of HWLOC
+   without recompilation of oneTBB. This specific version of HWLOC could include
+   a hotfix to support particular and/or new hardware that a customer has, but
+   whose support is not yet upstreamed to the HWLOC project. It is also possible
+   that such support won't be upstreamed at all if that hardware is not going to
+   be widely available. It could also be a development version of HWLOC that
+   someone wants to test on their systems first. Of course, they can do it
+   with the static version as well, but that's more cumbersome as it
+   requires recompilation of every dependent component.
+
+The only disadvantage of depending on the HWLOC library dynamically is that
+developers who use oneTBB's NUMA support API need to make sure the library is
+available and can be found by oneTBB. Depending on the distribution model of a
+developer's code, this is achieved by either:
+1. Asking the end user to have the necessary version of the dependency
+   pre-installed.
+2. Bundling the necessary HWLOC version with other pieces of a product release.
+
+However, the requirement to fulfill one of the above steps for the NUMA API to
+start paying off may be considered an inconvenience and, more importantly, it
+is not always obvious that one of these steps is needed, especially given the
+silent behavior when the HWLOC library cannot be found in the
+environment.
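The silent fallback described above can still be detected from application code. The following is a minimal sketch, assuming the behavior documented in the oneTBB specification: when topology parsing fails, ~oneapi::tbb::info::numa_nodes()~ and ~oneapi::tbb::info::core_types()~ return a vector with a single element equal to ~oneapi::tbb::task_arena::automatic~ (-1). The names ~automatic_id~ and ~topology_looks_undiscovered~ are hypothetical and are not part of oneTBB. The constant is duplicated here only so that the sketch compiles without oneTBB headers.

```cpp
#include <vector>

// Assumption: mirrors oneapi::tbb::task_arena::automatic, which equals -1.
constexpr int automatic_id = -1;

// Hypothetical helper (not a oneTBB API): returns true when the list of IDs
// matches the fallback value reported when the HW topology was not parsed,
// that is, exactly one element equal to automatic_id.
inline bool topology_looks_undiscovered(const std::vector<int>& ids) {
    return ids.size() == 1 && ids.front() == automatic_id;
}
```

In an application that links oneTBB, the helper could be called as ~topology_looks_undiscovered(oneapi::tbb::info::numa_nodes())~, since ~numa_node_id~ and ~core_type_id~ are aliases of ~int~. A true result suggests that no ~tbbbind~ variant was loaded and that NUMA constraints will have no effect.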
+
+This proposal suggests reducing the effect of the disadvantage of depending on
+a dynamic version of the HWLOC library by linking HWLOC statically into one of
+the ~tbbbind~ libraries that are distributed together with oneTBB, while
+leaving the possibility to specify another version of the HWLOC library if
+users see the need.
+
+Since HWLOC 1.x is an old version of HWLOC and modern versions of operating
+systems install HWLOC 2.x by default, the probability that someone is
+constrained to using only HWLOC 1.x on their system is relatively small. Thus,
+the filename of the ~tbbbind~ library that is linked against HWLOC 1.x can be
+reused for the library that is linked against static HWLOC version 2.x.
+
+* Proposal
+1. Replace the dynamic link against HWLOC 1.x in the corresponding ~tbbbind~
+   library with a static link against HWLOC library version 2.x.
+2. Add loading of that ~tbbbind~ variant as the last attempt to resolve the
+   dependency on functionality provided by the ~tbbbind~ layer.
+3. Update the oneTBB documentation (see [[https://oneapi-src.github.io/oneTBB/search.html?q=tbb%3A%3Ainfo][these documentation pages]]) to
+   include steps for determining which variant of ~tbbbind~ is being used.
+
+** Advantages
+1. The proposed behavior provides a mechanism for resolving the dependency on
+   the HWLOC library in case it cannot be found in the environment, while still
+   preferring a user-provided version of HWLOC. As a result, the problematic use
+   of the oneTBB API mentioned above should work as expected, returning the
+   enumerated list of actual NUMA nodes and core types on the system the code is
+   running on, provided that the loaded HWLOC library works on that system and
+   that the application properly distributes all binaries of oneTBB and sets the
+   environment so that the necessary ~tbbbind~ variant can be found and loaded.
+2. Dropping support for HWLOC 1.x avoids introducing an additional variant of
+   the ~tbbbind~ library, while maintaining support for popular versions of
+   HWLOC.
+
+** Disadvantages
+By default, there are still no diagnostics if users fail to correctly set up
+the environment with their own version of the HWLOC library. However, setting
+the ~TBB_VERSION=1~ environment variable helps identify a setup issue quickly.
+
+* Alternative handling of inability to parse system topology
+An alternative behavior in case the HWLOC library cannot be found is to be more
+explicit about the missing component and to either issue a warning or refuse to
+work, requiring one of the ~tbbbind~ variants to be loaded (e.g., throw an
+exception).
+
+These alternative approaches compare to the proposed one as follows.
+** Common Advantages
+- Explicitly tells the user that the functionality being used is not going to
+  work instead of just being silent.
+- Does not require an additional variant of the ~tbbbind~ library to be
+  distributed along with the others.
+
+** Common Disadvantages
+- Requires an additional step from the user side to resolve the problem. In
+  other words, it does not provide a complete solution to the problem.
+
+** Disadvantages of Issuing a Warning
+- The warning may still not be visible, especially if standard streams are
+  closed.
+
+** Disadvantages of Throwing an Exception
+- May break existing code that does not expect an exception to be thrown.
+- Requires introduction of an additional exception hierarchy.
+
+* References
+1. [[https://www.open-mpi.org/projects/hwloc/][HWLOC project main page]]
+2. [[https://github.com/open-mpi/hwloc][HWLOC project repository on GitHub]]