Fix LTC build error in CI #3910

Open
vivekkhandelwal1 opened this issue Dec 9, 2024 · 7 comments

@vivekkhandelwal1
Collaborator

After the recent GH runner version upgrade, the Torch-MLIR build in CI is failing with an LTC-related error. The error can be found here: https://github.com/llvm/torch-mlir/actions/runs/12138622225/job/33891848916#step:6:1.

Since the error is not yet fixed and all PRs are blocked on it, I'm disabling the LTC build in CI. This issue will track progress on the fix.

CC: @antoniojkim @ke1337

vivekkhandelwal1 added a commit to vivekkhandelwal1/torch-mlir that referenced this issue Dec 9, 2024
This commit disables the LTC build from the Torch-MLIR CI since
after the recent GH runner version upgrade the Torch-MLIR build
in CI is failing with an LTC related error.

The tracking issue for the same can be found here: llvm#3910

Signed-off-by: Vivek Khandelwal <[email protected]>
@ke1337
Collaborator

ke1337 commented Dec 9, 2024

@vivekkhandelwal1 could you share the range of commits between the passing and failing CI runs? Is it from a PyTorch version update? That would help us narrow down the root cause.

The error appears to come from the link step, with undefined symbols:

  FAILED: tools/torch-mlir/python_packages/torch_mlir/torch_mlir/_mlir_libs/lib_torch_mlir_ltc.so 
  : && /usr/bin/clang++ -fPIC -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wc++98-compat-extra-semi -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -O3 -DNDEBUG  -fuse-ld=lld -Wl,--gdb-index -Wl,-z,defs -Wl,-z,nodelete -Wl,--color-diagnostics  -rdynamic  -Xlinker --dependency-file -Xlinker tools/torch-mlir/projects/ltc/csrc/base_lazy_backend/CMakeFiles/torch_mlir_ltc_backend.dir/link.d -shared -Wl,-soname,lib_torch_mlir_ltc.so -o tools/torch-mlir/python_packages/torch_mlir/torch_mlir/_mlir_libs/lib_torch_mlir_ltc.so tools/torch-mlir/projects/ltc/csrc/base_lazy_backend/CMakeFiles/torch_mlir_ltc_backend.dir/generated/LazyNativeFunctions.cpp.o tools/torch-mlir/projects/ltc/csrc/base_lazy_backend/CMakeFiles/torch_mlir_ltc_backend.dir/generated/RegisterLazy.cpp.o tools/torch-mlir/projects/ltc/csrc/base_lazy_backend/CMakeFiles/torch_mlir_ltc_backend.dir/generated/shape_inference.cpp.o tools/torch-mlir/projects/ltc/csrc/base_lazy_backend/CMakeFiles/torch_mlir_ltc_backend.dir/mlir_lowering_context.cpp.o tools/torch-mlir/projects/ltc/csrc/base_lazy_backend/CMakeFiles/torch_mlir_ltc_backend.dir/mlir_native_functions.cpp.o tools/torch-mlir/projects/ltc/csrc/base_lazy_backend/CMakeFiles/torch_mlir_ltc_backend.dir/mlir_node_lowering.cpp.o tools/torch-mlir/projects/ltc/csrc/base_lazy_backend/CMakeFiles/torch_mlir_ltc_backend.dir/shape_inference.cpp.o tools/torch-mlir/projects/ltc/csrc/base_lazy_backend/CMakeFiles/torch_mlir_ltc_backend.dir/backend_impl.cpp.o tools/torch-mlir/projects/ltc/csrc/base_lazy_backend/CMakeFiles/torch_mlir_ltc_backend.dir/dynamic_ir.cpp.o 
tools/torch-mlir/projects/ltc/csrc/base_lazy_backend/CMakeFiles/torch_mlir_ltc_backend.dir/mlir_node.cpp.o tools/torch-mlir/projects/ltc/csrc/base_lazy_backend/CMakeFiles/torch_mlir_ltc_backend.dir/tensor.cpp.o tools/torch-mlir/projects/ltc/csrc/base_lazy_backend/CMakeFiles/torch_mlir_ltc_backend.dir/ops/device_data.cpp.o tools/torch-mlir/projects/ltc/csrc/base_lazy_backend/CMakeFiles/torch_mlir_ltc_backend.dir/ops/generic.cpp.o tools/torch-mlir/projects/ltc/csrc/base_lazy_backend/CMakeFiles/torch_mlir_ltc_backend.dir/ops/index.cpp.o tools/torch-mlir/projects/ltc/csrc/base_lazy_backend/CMakeFiles/torch_mlir_ltc_backend.dir/ops/ivalue.cpp.o tools/torch-mlir/projects/ltc/csrc/base_lazy_backend/CMakeFiles/torch_mlir_ltc_backend.dir/ops/split.cpp.o tools/torch-mlir/projects/ltc/csrc/base_lazy_backend/CMakeFiles/torch_mlir_ltc_backend.dir/ops/unbind_int.cpp.o tools/torch-mlir/projects/ltc/csrc/base_lazy_backend/CMakeFiles/torch_mlir_ltc_backend.dir/utils/jit_utils.cpp.o tools/torch-mlir/projects/ltc/csrc/base_lazy_backend/CMakeFiles/torch_mlir_ltc_backend.dir/utils/tensor_utils.cpp.o -L/opt/python/cp311-cp311/lib/python3.11/site-packages/torch/lib   -L/lib/intel64   -L/lib/intel64_win   -L/lib/win-x64 -Wl,-rpath,/opt/python/cp311-cp311/lib/python3.11/site-packages/torch/lib:/lib/intel64:/lib/intel64_win:/lib/win-x64:/_work/torch-mlir/torch-mlir/build/tools/torch-mlir/python_packages/torch_mlir/torch_mlir/_mlir_libs  lib/lib_jit_ir_importer.a  /opt/python/cp311-cp311/lib/python3.11/site-packages/torch/lib/libtorch.so  /opt/python/cp311-cp311/lib/python3.11/site-packages/torch/lib/libc10.so  tools/torch-mlir/python_packages/torch_mlir/torch_mlir/_mlir_libs/libTorchMLIRAggregateCAPI.so  -Wl,--no-as-needed,"/opt/python/cp311-cp311/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so" -Wl,--as-needed  /opt/python/cp311-cp311/lib/python3.11/site-packages/torch/lib/libc10.so  -Wl,--no-as-needed,"/opt/python/cp311-cp311/lib/python3.11/site-packages/torch/lib/libtorch.so" 
-Wl,--as-needed && cd /_work/torch-mlir/torch-mlir/build/tools/torch-mlir/projects/ltc/csrc/base_lazy_backend && mkdir -p /_work/torch-mlir/torch-mlir/build/tools/torch-mlir/python_packages/torch_mlir/torch_mlir/base_lazy_backend/generated/ && cd /_work/torch-mlir/torch-mlir/build/tools/torch-mlir/projects/ltc/csrc/base_lazy_backend && cp /_work/torch-mlir/torch-mlir/projects/ltc/csrc/base_lazy_backend/*.h /_work/torch-mlir/torch-mlir/build/tools/torch-mlir/python_packages/torch_mlir/torch_mlir/base_lazy_backend/ && cd /_work/torch-mlir/torch-mlir/build/tools/torch-mlir/projects/ltc/csrc/base_lazy_backend && cp /_work/torch-mlir/torch-mlir/projects/ltc/csrc/base_lazy_backend/generated/*.h /_work/torch-mlir/torch-mlir/build/tools/torch-mlir/python_packages/torch_mlir/torch_mlir/base_lazy_backend/generated/ && cd /_work/torch-mlir/torch-mlir/build/tools/torch-mlir/projects/ltc/csrc/base_lazy_backend && mkdir -p /_work/torch-mlir/torch-mlir/build/tools/torch-mlir/python_packages/torch_mlir/torch_mlir/base_lazy_backend/ops/ && cd /_work/torch-mlir/torch-mlir/build/tools/torch-mlir/projects/ltc/csrc/base_lazy_backend && cp /_work/torch-mlir/torch-mlir/projects/ltc/csrc/base_lazy_backend/ops/*.h /_work/torch-mlir/torch-mlir/build/tools/torch-mlir/python_packages/torch_mlir/torch_mlir/base_lazy_backend/ops/ && cd /_work/torch-mlir/torch-mlir/build/tools/torch-mlir/projects/ltc/csrc/base_lazy_backend && mkdir -p /_work/torch-mlir/torch-mlir/build/tools/torch-mlir/python_packages/torch_mlir/torch_mlir/base_lazy_backend/utils/ && cd /_work/torch-mlir/torch-mlir/build/tools/torch-mlir/projects/ltc/csrc/base_lazy_backend && cp /_work/torch-mlir/torch-mlir/projects/ltc/csrc/base_lazy_backend/utils/*.h /_work/torch-mlir/torch-mlir/build/tools/torch-mlir/python_packages/torch_mlir/torch_mlir/base_lazy_backend/utils/
  ld.lld: error: undefined symbol: unsigned char const* at::TensorBase::const_data_ptr<unsigned char, 0>() const
  >>> referenced by LazyNativeFunctions.cpp
  >>>               tools/torch-mlir/projects/ltc/csrc/base_lazy_backend/CMakeFiles/torch_mlir_ltc_backend.dir/generated/LazyNativeFunctions.cpp.o:(torch::lazy::hash_t torch::lazy::Hash<at::Generator>(std::optional<at::Generator> const&))
  
  ld.lld: error: undefined symbol: c10::BFloat16 const* at::TensorBase::const_data_ptr<c10::BFloat16, 0>() const
  >>> referenced by LazyNativeFunctions.cpp
  >>>               tools/torch-mlir/projects/ltc/csrc/base_lazy_backend/CMakeFiles/torch_mlir_ltc_backend.dir/generated/LazyNativeFunctions.cpp.o:(torch::lazy::hash_t torch::lazy::Hash<at::Generator>(std::optional<at::Generator> const&))
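
For triaging logs like the one above, the undefined symbols can be pulled out of the linker output programmatically. The following is a minimal sketch, not part of the Torch-MLIR tooling: the function name `extract_undefined_symbols` and the excerpted `SAMPLE_LOG` are illustrative assumptions, and the patterns only cover the `ld.lld` message shape seen in this log.

```python
import re

# First line of an ld.lld undefined-symbol diagnostic.
UNDEFINED_RE = re.compile(r"^\s*ld\.lld: error: undefined symbol: (?P<symbol>.+?)\s*$")
# The ">>> referenced by <source file>" line that follows it.
REFERENCED_RE = re.compile(r"^\s*>>> referenced by (?P<source>\S+)")


def extract_undefined_symbols(log_text):
    """Return (symbol, referencing_source) pairs found in ld.lld output.

    Hypothetical helper for illustration; referencing_source is None when
    no "referenced by" line follows the diagnostic.
    """
    results = []
    pending = None  # symbol still waiting for its "referenced by" line
    for line in log_text.splitlines():
        match = UNDEFINED_RE.match(line)
        if match:
            if pending is not None:
                results.append((pending, None))
            pending = match.group("symbol")
            continue
        match = REFERENCED_RE.match(line)
        if match and pending is not None:
            results.append((pending, match.group("source")))
            pending = None
    if pending is not None:
        results.append((pending, None))
    return results


# Excerpt of the log above (object-file lines omitted for brevity).
SAMPLE_LOG = """\
  ld.lld: error: undefined symbol: unsigned char const* at::TensorBase::const_data_ptr<unsigned char, 0>() const
  >>> referenced by LazyNativeFunctions.cpp

  ld.lld: error: undefined symbol: c10::BFloat16 const* at::TensorBase::const_data_ptr<c10::BFloat16, 0>() const
  >>> referenced by LazyNativeFunctions.cpp
"""

if __name__ == "__main__":
    for symbol, source in extract_undefined_symbols(SAMPLE_LOG):
        print(f"{symbol}  <- {source}")
```

Both diagnostics here name `at::TensorBase::const_data_ptr` instantiations referenced from the generated `LazyNativeFunctions.cpp`, so a summary like this makes it easier to compare failing runs.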

@vivekkhandelwal1
Collaborator Author

> @vivekkhandelwal1 could you share the range of commits between the passing and failing CI runs? Is it from a PyTorch version update? That would help us narrow down the root cause.

Hi @ke1337, the error is not caused by a PyTorch version update. Any recent PR's CI run fails with this error, independent of the changes made in the PR. Also, the same PR's CI was passing before the GH runner version upgrade, but as soon as that update was done it started failing. Before that update, CI was not running at all: all the jobs were queued indefinitely.

@antoniojkim
Collaborator

> Also, the same PR's CI was passing before the GH runner version upgrade, but as soon as that update was done it started failing.

What was the purpose of the GH runner version upgrade? Could there be an issue with GH runner itself? Can we roll back the GH runner version and upgrade to a version that doesn't have this issue?

vivekkhandelwal1 added a commit that referenced this issue Dec 9, 2024
This commit disables the LTC build from the Torch-MLIR CI since after
the recent GH runner version upgrade the Torch-MLIR build in CI is
failing with an LTC related error.

The tracking issue for the same can be found here:
#3910

Signed-off-by: Vivek Khandelwal <[email protected]>
@vivekkhandelwal1
Collaborator Author

> Also, the same PR's CI was passing before the GH runner version upgrade, but as soon as that update was done it started failing.
>
> What was the purpose of the GH runner version upgrade? Could there be an issue with GH runner itself? Can we roll back the GH runner version and upgrade to a version that doesn't have this issue?

@saienduri can tell you about this.

@rahuls-cerebras
Collaborator

rahuls-cerebras commented Dec 19, 2024

@saienduri a gentle reminder: in addition to the queries above, could you also let us know whether any logs were generated during the GH runner upgrade?

@saienduri
Collaborator

Hello, we had to upgrade the GH runner version because the old one was deprecated. What logs are you looking for? The undefined symbol error linked above is the one we have to get past to enable LTC again.

@stellaraccident
Collaborator

stellaraccident commented Dec 19, 2024

Hi - the current CI bots are managed by my team, and we can only provide limited support for features we do not use (basically, we are happy to answer easy questions and, if it does not cause problems, to run some additional configs). But LTC has a complex relationship with PyTorch that is different from everything else in the repo. I would recommend bringing up your own runners if you need to support LTC.

As Sai says, GH forces runner upgrades, and they are not optional. Tracking down this kind of thing is very costly, and that cost needs to be borne by the folks who use LTC. LTC will need to be tolerant of breakages because we have to keep both the PyTorch and runner versions upgraded, and this seems to be a fragile integration.
