Error when precompiling in Windows #198

Open

yolhan83 opened this issue Oct 27, 2024 · 5 comments

@yolhan83
Hello,

I realize it might be too early for Windows support, but I didn't see an existing issue on this.
In case it hasn't been tested yet, I just wanted to point out that I encountered the following error on Windows during precompilation. It works fine in WSL, though.

Version:

Julia Version 1.11.1
Commit 8f5b7ca12a (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 20 × 12th Gen Intel(R) Core(TM) i7-12700H
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, alderlake)
Threads: 20 default, 0 interactive, 10 GC (on 20 virtual cores)

Error:

Precompiling Reactant...
Info Given Reactant was explicitly requested, output will be shown live
ERROR: LoadError: UndefVarError: `libReactantExtra` not defined in `Reactant_jll`
Stacktrace:
 [1] getproperty(x::Module, f::Symbol)
   @ Base .\Base.jl:42
 [2] top-level scope
   @ C:\Users\yolha\.julia\packages\Reactant\rRa4g\src\mlir\MLIR.jl:8
 [3] include(mod::Module, _path::String)
   @ Base .\Base.jl:557
 [4] include(x::String)
   @ Reactant C:\Users\yolha\.julia\packages\Reactant\rRa4g\src\Reactant.jl:1
 [5] top-level scope
   @ C:\Users\yolha\.julia\packages\Reactant\rRa4g\src\Reactant.jl:82
 [6] include
   @ .\Base.jl:557 [inlined]
 [7] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt128}}, source::Nothing)
   @ Base .\loading.jl:2790
 [8] top-level scope
   @ stdin:5
in expression starting at C:\Users\yolha\.julia\packages\Reactant\rRa4g\src\mlir\MLIR.jl:1
in expression starting at C:\Users\yolha\.julia\packages\Reactant\rRa4g\src\Reactant.jl:1
in expression starting at stdin:5
  ✗ Reactant
  0 dependencies successfully precompiled in 4 seconds. 79 already precompiled.

ERROR: The following 1 direct dependency failed to precompile:

Reactant

Failed to precompile Reactant [3c362404-f566-11ee-1572-e11a4b42c853] to "C:\\Users\\yolha\\.julia\\compiled\\v1.11\\Reactant\\jl_E4CF.tmp".
ERROR: LoadError: UndefVarError: `libReactantExtra` not defined in `Reactant_jll`
Stacktrace:
 [1] getproperty(x::Module, f::Symbol)
   @ Base .\Base.jl:42
 [2] top-level scope
   @ C:\Users\yolha\.julia\packages\Reactant\rRa4g\src\mlir\MLIR.jl:8
 [3] include(mod::Module, _path::String)
   @ Base .\Base.jl:557
 [4] include(x::String)
   @ Reactant C:\Users\yolha\.julia\packages\Reactant\rRa4g\src\Reactant.jl:1
 [5] top-level scope
   @ C:\Users\yolha\.julia\packages\Reactant\rRa4g\src\Reactant.jl:82
 [6] include
   @ .\Base.jl:557 [inlined]
 [7] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt128}}, source::Nothing)
   @ Base .\loading.jl:2790
 [8] top-level scope
   @ stdin:5
in expression starting at C:\Users\yolha\.julia\packages\Reactant\rRa4g\src\mlir\MLIR.jl:1
in expression starting at C:\Users\yolha\.julia\packages\Reactant\rRa4g\src\Reactant.jl:1
@mofeing (Collaborator)

mofeing commented Oct 27, 2024

ReactantExtra, the C API that wraps the XLA C++ API and includes Enzyme-JAX, doesn't support Windows yet. Actually, I fear that supporting Windows will be a headache...

Would you mind running it in Windows Subsystem for Linux? It might work there. I'm curious, but I don't have a Windows machine.
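
For reference, a quick way to check whether the binary artifact exists for your platform (a sketch, assuming Reactant_jll follows the standard JLLWrappers layout; this is what the UndefVarError above boils down to):

julia> using Reactant_jll

julia> Reactant_jll.is_available()  # standard JLLWrappers query; false when no artifact exists for the host platform
false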

@yolhan83 (Author)

Yes, as I said, it worked (precompilation at least) under WSL Ubuntu. I will set up my GPU on it and try things out, no problem.

@mofeing (Collaborator)

mofeing commented Oct 27, 2024

Oops, I didn't read that part. Nice to know.

@yolhan83 (Author)

yolhan83 commented Oct 27, 2024

Just tested on CPU; it works fine on 1.10 under WSL.

Version:

Julia Version 1.10.5
Commit 6f3fdf7b362 (2024-08-27 14:19 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 20 × 12th Gen Intel(R) Core(TM) i7-12700H
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, alderlake)
Threads: 1 default, 0 interactive, 1 GC (on 20 virtual cores)

Environment:

  [7da242da] Enzyme v0.13.12
  [b2108857] Lux v1.2.0
  [3c362404] Reactant v0.2.3

Code:

julia> using Lux,Reactant,Enzyme,Random;

julia> const dev = xla_device()
(::XLADevice) (generic function with 3 methods)

julia> model = Lux.Chain(Dense(3,10,relu),Dense(10,10,relu),Dense(10,1));

julia> ps,st = Lux.setup(Random.default_rng(1),model) |> dev;

julia> x = rand(Float32,3,1000) |> dev;

julia> y = sum(x,dims=1) |> dev;

julia> stlayer = Lux.StatefulLuxLayer{true}(model,nothing,st);

julia> loss(stlayer,ps,x,y) = sum(abs2,stlayer(x,ps).-y)/length(y);

julia> loss(stlayer,ps,x,y)
0.6765756f0

julia> dps = deepcopy(ps);

julia> grad = Enzyme.autodiff(Reverse,loss,Active,Const(stlayer),Duplicated(ps,dps),Const(x),Const(y));

julia> dps[1][1]
10×3 ConcreteRArray{Float32, 2}:
  0.905959   1.53576   -1.35888
  1.73563    1.80716    1.42986
 -0.273622  -0.401162  -1.86655
 -0.073245  -1.31546    0.438985
 -1.04398    1.96327    0.349859
 -0.225534  -1.01021    0.564287
 -1.0768    -1.55186   -1.36067
 -0.203418  -0.791012   1.53536
  0.935806   1.02484    0.214392
 -1.35451   -1.598      0.427091

There are still issues on 1.11, though, but those don't come from Reactant.

For GPU, it actually looks insanely good.

Environment:

  [6e4b80f9] BenchmarkTools v1.5.0
  [052768ef] CUDA v5.5.2
  [7da242da] Enzyme v0.13.12
  [b2108857] Lux v1.2.0
  [3c362404] Reactant v0.2.3

Code:

julia> using CUDA,Reactant,BenchmarkTools

julia> x= rand(100000000);

julia> xc = cu(x);

julia> Reactant.set_default_backend("gpu");

julia> const dev = xla_device();

julia> xc2 = dev(x);

julia> f(x) = sum(x);

julia> f_comp = @compile f(xc2);

julia> @btime f($x);
  41.367 ms (0 allocations: 0 bytes)

julia> @btime begin
           f($xc)
           CUDA.synchronize()
           end
  1.789 ms (96 allocations: 2.89 KiB)

julia> @btime begin
           Reactant.synchronize($f_comp($xc2))
           end
  29.175 μs (2 allocations: 48 bytes)

CUDA version:

CUDA runtime 12.6, artifact installation
CUDA driver 12.4
NVIDIA driver 552.12.0

CUDA libraries:
- CUBLAS: 12.3.4
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+550.73.1

Julia packages:
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.3+0
- CUDA_Runtime_jll: 0.15.3+0

Toolchain:
- Julia: 1.10.5
- LLVM: 15.0.7

1 device:
  0: NVIDIA GeForce RTX 4060 Laptop GPU (sm_89, 872.383 MiB / 7.996 GiB available)

I can't make Enzyme work on GPU with Reactant, but that may be a skill issue on my part. The common error I see is:

grad_comp = Enzyme.autodiff(Reverse,loss_comp,Const(stlayer),Duplicated(ps,dps),Const(x),Const(y));
ERROR:
No augmented forward pass found for XLAExecute
 at context:   call void @XLAExecute(i64 %102, i32 noundef 8, [8 x i64]* nocapture noundef nonnull readonly %inpa.i.i, [8 x i8]* nocapture noundef nonnull readonly %dona.i.i, i32 noundef 1, [1 x i64]* nocapture noundef nonnull writeonly %outa.i.i, i8* nocapture noundef nonnull writeonly %futa.i.i, [1 x i64]* nocapture noundef nonnull writeonly %futpa.i.i) #14, !dbg !58

where the only difference from the CPU code was:

loss_comp = @compile loss(stlayer,ps,x,y)

Update: I just saw that the gradient itself should be compiled, so a small change to the code makes everything work fine (at least on WSL Ubuntu):

julia> using Lux,Reactant,Enzyme,Random

julia> Reactant.set_default_backend("gpu");

julia> model = Lux.Chain(Dense(3,10,relu),Dense(10,10,relu),Dense(10,1));

julia> const dev = xla_device();

julia> x = rand(Float32,3,1000) |> dev;

julia> f(x) = sum(x,dims=1);

julia> f_comp = @compile f(x);

julia> y = f_comp(x)
1×1000 ConcreteRArray{Float32, 2}:
 1.91339  0.977359  1.40866  1.14727  1.35244  0.854568  …  2.66806  1.4641  1.9488  2.03938  1.75204  1.36932

julia> ps,st = Lux.setup(Random.default_rng(1),model) |> dev
((layer_1 = (weight = Float32[-1.1879814 0.5789728 0.78146553; -1.0979857 -0.7196951 -0.960644; … ; 0.38421702 -0.82885265 -1.7875335; -1.7717319 1.8281577 -0.5482931], bias = Float32[-0.55165297, -0.025069488, 0.12070516, -0.33363134, -0.27165163, -0.5426975, -0.02777502, -0.5143611, 0.06754502, 0.41636044]), layer_2 = (weight = Float32[-0.4120962 0.88357955 … 0.8667873 -0.60699934; 0.39998135 0.99203914 … -0.859783 0.14729756; … ; 0.57311285 1.0405946 … -0.9094574 0.4193144; -0.64737666 0.13689981 … -0.57182777 -0.76190686], bias = Float32[-0.19788045, -0.26863387, -0.03341088, -0.19565816, -0.19184135, -0.042680115, 0.18373081, 0.30809134, -0.2994386, 0.15902404]), layer_3 = (weight = Float32[0.033292364 0.3406409 … 0.20983447 -0.5079821], bias = Float32[-0.06395315])), (layer_1 = NamedTuple(), layer_2 = NamedTuple(), layer_3 = NamedTuple()))

julia> loss(stlayer,ps,x,y) = sum(abs2,stlayer(x,ps).-y)/length(y);

julia> stlayer = Lux.StatefulLuxLayer{true}(model,nothing,st);

julia> function gradloss(stlayer,ps,x,y)
       dps = Enzyme.make_zero(ps)
       _,res = Enzyme.autodiff(ReverseWithPrimal,loss,Active,Const(stlayer),Duplicated(ps,dps),Const(x),Const(y))
       return res,dps
       end
gradloss (generic function with 1 method)

julia> gradloss_comp = @compile gradloss(stlayer,ps,x,y)
Reactant.Compiler.Thunk{Symbol("##gradloss_reactant#471")}()

julia> gradloss_comp(stlayer,ps,x,y)
(fill(1.6686882f0), (layer_1 = (weight = Float32[-0.019157264 -0.052835472 -0.07052663; 0.0 0.0 0.0; … ; 0.008739114 0.001753656 0.00081436656; -0.28920138 -0.41064578 -0.3309787], bias = Float32[-0.08458641, 0.0, -0.49688935, 0.120346546, -0.1256084, 0.0, 0.0013638609, -0.47442067, 0.011482064, -0.60481024]), layer_2 = (weight = Float32[0.0 0.0 … -0.000109314075 -0.0012933938; 0.0 0.0 … 0.0 0.0; … ; -0.00888882 0.0 … 0.0 -0.08582606; 0.0001844522 0.0 … 0.0 7.595214f-5], bias = Float32[-0.04766386, 0.0, -0.0004994411, 0.0, -1.05122, 0.2795517, 0.26252118, -0.030470902, -0.10657983, 0.0051385043]), layer_3 = (weight = Float32[-0.56777835 0.0 … -0.072760016 -0.0005584934], bias = Float32[-2.335299])))

@mofeing (Collaborator)

mofeing commented Oct 27, 2024

Yay, great!

We run Enzyme.autodiff in a different way: we run Enzyme through the MLIR rather than the LLVM IR (i.e. we use Enzyme as an MLIR dialect plus a pass), so we need to overlay the method with one of our own. That's why you need to call Enzyme.autodiff inside the compiled function and not outside it.
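
In other words, the gradient has to be traced together with the function it differentiates. A minimal sketch of the two patterns, with hypothetical simple_loss, ps, and x mirroring the code above (and assuming the ConcreteRArray constructor shown in the earlier outputs):

using Reactant, Enzyme

simple_loss(ps, x) = sum(abs2, ps .* x)

ps = Reactant.ConcreteRArray(rand(Float32, 10))  # hypothetical data placed on the XLA device
x  = Reactant.ConcreteRArray(rand(Float32, 10))

# Fails: the compiled thunk is an opaque call into XLA (the XLAExecute in the
# error above), which Enzyme running on the Julia side cannot differentiate.
# loss_comp = @compile simple_loss(ps, x)
# Enzyme.autodiff(Reverse, loss_comp, Active, Duplicated(ps, Enzyme.make_zero(ps)), Const(x))

# Works: call Enzyme.autodiff inside the function that gets compiled, so the
# overlaid method handles the differentiation while the whole function is traced.
function gradloss(ps, x)
    dps = Enzyme.make_zero(ps)
    _, res = Enzyme.autodiff(ReverseWithPrimal, simple_loss, Active, Duplicated(ps, dps), Const(x))
    return res, dps
end

gradloss_comp = @compile gradloss(ps, x)
gradloss_comp(ps, x)  # returns (primal loss, gradient w.r.t. ps)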
