ProcessExitedException failure #484

Open
montyvesselinov opened this issue Jan 6, 2019 · 3 comments
montyvesselinov commented Jan 6, 2019

   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.0.3 (2018-12-18)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

[ Info: Recompiling stale cache file /Users/monty/.julia/compiled/v1.0/Revise/M1Qoh.ji for Revise [295af30f-e4ad-537b-8983-00126c2a3abe]

julia> import Distributed

julia> Distributed.addprocs()
12-element Array{Int64,1}:
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13

julia> train_step = train.minimize(train.AdamOptimizer(1e-4), cross_entropy)^C

julia> include(Pkg.dir("TensorFlow", "examples", "mnist_loader.jl"))
┌ Warning: `Pkg.dir(pkgname, paths...)` is deprecated; instead, do `import TensorFlow; joinpath(dirname(pathof(TensorFlow)), "..", paths...)`.
└ @ Pkg.API /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/Pkg/src/API.jl:480
load_test_set (generic function with 2 methods)

julia> loader = DataLoader()
DataLoader(1, [50062, 35819, 15309, 52246, 13718, 11203, 53366, 9419, 37928, 43193  …  45497, 35259, 25068, 1760, 42169, 51694, 55109, 45499, 25948, 11216])

julia> using TensorFlow

julia> sess = Session()
┌ Warning: `set_field!(obj::Any, fld::Symbol, val)` is deprecated, use `setproperty!(obj, fld, val)` instead.
│   caller = (::getfield(TensorFlow, Symbol("##Session#42#45")))(::Nothing, ::Bool, ::Graph, ::Nothing, ::Type) at core.jl:484
└ @ TensorFlow ~/.julia/packages/TensorFlow/A6TdG/src/core.jl:484
┌ Warning: `set_field!(obj::Any, fld::Symbol, val)` is deprecated, use `setproperty!(obj, fld, val)` instead.
│   caller = (::getfield(TensorFlow, Symbol("##Session#42#45")))(::Nothing, ::Bool, ::Graph, ::Nothing, ::Type) at core.jl:485
└ @ TensorFlow ~/.julia/packages/TensorFlow/A6TdG/src/core.jl:485
2019-01-06 09:38:33.346292: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.2 AVX AVX2 FMA
Session(Ptr{Nothing} @0x00000001289796d0)

julia> x = placeholder(Float32)
<Tensor placeholder:1 shape=unknown dtype=Float32>

julia> y_ = placeholder(Float32)
<Tensor placeholder_2:1 shape=unknown dtype=Float32>

julia> W = Variable(zeros(Float32, 784, 10))
Variable{Float32}(<Tensor node:1 shape=(784, 10) dtype=Float32>, <Tensor node/Assign:1 shape=(784, 10) dtype=Float32>)

julia> b = Variable(zeros(Float32, 10))
Variable{Float32}(<Tensor node_2:1 shape=(10) dtype=Float32>, <Tensor node_2/Assign:1 shape=(10) dtype=Float32>)

julia> run(sess, global_variables_initializer())

julia> y = nn.softmax(x*W + b)
<Tensor Softmax:1 shape=(?, 10) dtype=Float32>

julia> cross_entropy = reduce_mean(-reduce_sum(y_ .* log(y), axis=[2]))
<Tensor reduce_2:1 shape=unknown dtype=Float32>

julia> train_step = train.minimize(train.GradientDescentOptimizer(.00001), cross_entropy)
┌ Error: Fatal error on process 14
│   exception =
│    PyError ($(Expr(:escape, :(ccall(#= /Users/monty/.julia/packages/PyCall/0jMpb/src/pyfncall.jl:44 =# @pysym(:PyObject_Call), PyPtr, (PyPtr, PyPtr, PyPtr), o, pyargsptr, kw))))) <class 'TypeError'>
│    TypeError("can't pickle traceback objects",)
│
│    Stacktrace:
│     [1] pyerr_check at /Users/monty/.julia/packages/PyCall/0jMpb/src/exception.jl:60 [inlined]
│     [2] pyerr_check at /Users/monty/.julia/packages/PyCall/0jMpb/src/exception.jl:64 [inlined]
│     [3] macro expansion at /Users/monty/.julia/packages/PyCall/0jMpb/src/exception.jl:84 [inlined]
│     [4] __pycall!(::PyObject, ::Ptr{PyCall.PyObject_struct}, ::PyObject, ::Ptr{Nothing}) at /Users/monty/.julia/packages/PyCall/0jMpb/src/pyfncall.jl:44
│     [5] _pycall!(::PyObject, ::PyObject, ::Tuple{PyObject}, ::Int64, ::Ptr{Nothing}) at /Users/monty/.julia/packages/PyCall/0jMpb/src/pyfncall.jl:29
│     [6] #pycall#87 at /Users/monty/.julia/packages/PyCall/0jMpb/src/pyfncall.jl:11 [inlined]
│     [7] pycall at /Users/monty/.julia/packages/PyCall/0jMpb/src/pyfncall.jl:83 [inlined]
│     [8] serialize(::Distributed.ClusterSerializer{Sockets.TCPSocket}, ::PyObject) at /Users/monty/.julia/packages/PyCall/0jMpb/src/serialize.jl:18
│     [9] serialize_any(::Distributed.ClusterSerializer{Sockets.TCPSocket}, ::Any) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/Serialization/src/Serialization.jl:638
│     [10] serialize(::Distributed.ClusterSerializer{Sockets.TCPSocket}, ::Any) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/Serialization/src/Serialization.jl:617
│     [11] serialize(::Distributed.ClusterSerializer{Sockets.TCPSocket}, ::CapturedException) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/Distributed/src/clusterserialize.jl:208
│     [12] serialize_any(::Distributed.ClusterSerializer{Sockets.TCPSocket}, ::Any) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/Serialization/src/Serialization.jl:638
│     [13] serialize(::Distributed.ClusterSerializer{Sockets.TCPSocket}, ::Any) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/Serialization/src/Serialization.jl:617
│     [14] serialize_msg(::Distributed.ClusterSerializer{Sockets.TCPSocket}, ::Distributed.ResultMsg) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/Distributed/src/messages.jl:90
│     [15] #invokelatest#1 at ./essentials.jl:697 [inlined]
│     [16] invokelatest at ./essentials.jl:696 [inlined]
│     [17] send_msg_(::Distributed.Worker, ::Distributed.MsgHeader, ::Distributed.ResultMsg, ::Bool) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/Distributed/src/messages.jl:182
│     [18] send_msg_now at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/Distributed/src/messages.jl:130 [inlined]
│     [19] send_msg_now(::Sockets.TCPSocket, ::Distributed.MsgHeader, ::Distributed.ResultMsg) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/Distributed/src/messages.jl:125
│     [20] deliver_result(::Sockets.TCPSocket, ::Symbol, ::Distributed.RRID, ::RemoteException) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/Distributed/src/process_messages.jl:88
│     [21] macro expansion at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/Distributed/src/process_messages.jl:277 [inlined]
│     [22] (::getfield(Distributed, Symbol("##115#117")){Distributed.CallWaitMsg,Distributed.MsgHeader,Sockets.TCPSocket})() at ./task.jl:259
└ @ Distributed /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/Distributed/src/process_messages.jl:92
Worker 14 terminated.ERROR:
ProcessExitedException()
Stacktrace:
 [1] try_yieldto(::typeof(Base.ensure_rescheduled), ::Base.RefValue{Task}) at ./event.jl:196
 [2] wait() at ./event.jl:255
 [3] wait(::Condition) at ./event.jl:46
 [4] wait_impl at ./channels.jl:353 [inlined]
 [5] wait at ./channels.jl:349 [inlined]
 [6] fetch_buffered at ./channels.jl:292 [inlined]
 [7] fetch(::Channel{Any}) at ./channels.jl:290
 [8] #remotecall_wait#154(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Function, ::Distributed.Worker) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/Distributed/src/remotecall.jl:417
 [9] remotecall_wait(::Function, ::Distributed.Worker) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/Distributed/src/remotecall.jl:412
 [10] #remotecall_wait#157(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Function, ::Int64) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/Distributed/src/remotecall.jl:433
 [11] remotecall_wait(::Function, ::Int64) at /Users/osx/buildbot/slave/package_osx64/build/usr/share/julia/stdlib/v1.0/Distributed/src/remotecall.jl:433
 [12] top-level scope at /Users/monty/.julia/packages/TensorFlow/A6TdG/src/TensorFlow.jl:182
 [13] eval at ./boot.jl:319 [inlined]
 [14] eval at ./sysimg.jl:68 [inlined]
 [15] add_gradients_py(::Tensor{Float32}, ::Array{Any,1}, ::Nothing) at /Users/monty/.julia/packages/TensorFlow/A6TdG/src/core.jl:1545
 [16] gradients at /Users/monty/.julia/packages/TensorFlow/A6TdG/src/core.jl:1533 [inlined] (repeats 2 times)
 [17] compute_gradients(::TensorFlow.train.GradientDescentOptimizer, ::Tensor{Float32}, ::Nothing) at /Users/monty/.julia/packages/TensorFlow/A6TdG/src/train.jl:49
 [18] #minimize#1(::Nothing, ::Nothing, ::Nothing, ::Function, ::TensorFlow.train.GradientDescentOptimizer, ::Tensor{Float32}) at /Users/monty/.julia/packages/TensorFlow/A6TdG/src/train.jl:41
 [19] minimize(::TensorFlow.train.GradientDescentOptimizer, ::Tensor{Float32}) at /Users/monty/.julia/packages/TensorFlow/A6TdG/src/train.jl:38
 [20] top-level scope at none:0

julia> tf_versioninfo()
Wording: Please copy-paste the entirely of the below output into any bug reports.
Note that this may display some errors, depending upon on your configuration. This is fine.

----------------
Library Versions
----------------
Trying to evaluate ENV["TF_USE_GPU"] but got error: KeyError("TF_USE_GPU")
Trying to evaluate ENV["LIBTENSORFLOW"] but got error: KeyError("LIBTENSORFLOW")

tf_version(kind=:backend) = 1.10.0
Trying to evaluate tf_version(kind=:python) but got error: ProcessExitedException()
tf_version(kind=:julia) = 0.10.2

-------------
Python Status
-------------
PyCall.conda = false
ENV["PYTHON"] = /usr/local/bin/python3
PyCall.PYTHONHOME = /usr/local/Cellar/python/3.6.5_1/Frameworks/Python.framework/Versions/3.6:/usr/local/Cellar/python/3.6.5_1/Frameworks/Python.framework/Versions/3.6
String(read(#= /Users/monty/.julia/packages/TensorFlow/A6TdG/src/version.jl:104 =# @cmd("pip --version"))) = pip 18.1 from /Users/monty/Library/Python/3.6/lib/python/site-packages/pip (python 3.6)

String(read(#= /Users/monty/.julia/packages/TensorFlow/A6TdG/src/version.jl:105 =# @cmd("pip3 --version"))) = pip 18.1 from /Users/monty/Library/Python/3.6/lib/python/site-packages/pip (python 3.6)


------------
Julia Status
------------
Julia Version 1.0.3
Commit 099e826241 (2018-12-18 01:34 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i9-8950HK CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)

oxinabox (Collaborator) commented Jan 6, 2019

What seems to be happening is this:

TensorFlow.jl calls into Python to get gradients during the training step.

Something then goes wrong, and the Python code we call throws an exception:

    TypeError("can't pickle traceback objects",)

I am not sure (at this time) why that is happening,
and I probably won't have time to look into it soon.
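
For readers unfamiliar with this error, here is a minimal sketch in plain Python (no PyCall or TensorFlow involved) of how it can arise. It assumes the worker tries to serialize a caught Python exception with `pickle`, which the `serialize(..., ::PyObject)` frame in the stack trace suggests but does not prove. The `KeyError("W:0")` is only a stand-in error, and the exact wording of the `TypeError` message differs between Python versions.

```python
import pickle
import sys


def capture_exc_info():
    """Raise and catch a stand-in error; return the (type, value, traceback) triple."""
    try:
        raise KeyError("W:0")  # stand-in for whatever the Python side really raised
    except KeyError:
        return sys.exc_info()


exc_type, exc_value, exc_tb = capture_exc_info()

# The exception instance itself survives a pickle round trip...
restored = pickle.loads(pickle.dumps(exc_value))

# ...but the traceback object attached to it cannot be pickled.
try:
    pickle.dumps(exc_tb)
    pickle_error = None
except TypeError as err:
    pickle_error = err

print(type(restored).__name__, restored.args)
print(type(pickle_error).__name__, pickle_error)
```

If this reading is right, the `TypeError` in the log is a secondary failure that hides whatever exception Python raised first.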

montyvesselinov (Author) commented Jan 7, 2019 via email


quatrix commented Jan 7, 2019

I have the same issue.

The following code throws a ProcessExitedException() with a TypeError("can't pickle traceback objects",):

using TensorFlow

x = placeholder(Float32)
y_ = placeholder(Float32)

W = get_variable("W", [50, 10], Float64)
b = get_variable("b", [10], Float64)

println(W)
println(b)

y = nn.softmax(x*W + b)

cross_entropy = reduce_mean(-reduce_sum(y_ .* log(y), axis=[2]))
optimizer = train.AdamOptimizer()
minimize_op = train.minimize(optimizer, cross_entropy)

I added some debug prints to PyCall's serialize.jl, and this seems to be the actual exception the Python side throws (and then fails to pickle):

From worker 2:    debug: PyObject <type 'exceptions.KeyError'>
From worker 2:    debug: PyObject KeyError("The name 'W:0' refers to a Tensor which does not exist. The operation, 'W', does not exist in the graph.",)
From worker 2:    debug: PyObject <traceback object at 0x7f9c90d3c170>
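
The three debug lines look like the usual Python (type, value, traceback) triple. As a hedged illustration of one way such an error could be made transferable between processes, the sketch below renders the traceback as text before pickling. It is plain Python and not what PyCall or TensorFlow.jl actually do; `picklable_error` is a hypothetical helper, and the message string is shortened from the log above.

```python
import pickle
import traceback


def picklable_error(exc):
    """Hypothetical helper: describe exc as a plain dict with a text traceback."""
    return {
        "type": type(exc).__name__,
        "message": str(exc),
        # format_exception returns a list of strings; join them into one block
        "traceback": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
    }


try:
    raise KeyError("The name 'W:0' refers to a Tensor which does not exist.")
except KeyError as exc:
    payload = picklable_error(exc)

# A dict of strings pickles without trouble, unlike the raw traceback object.
round_tripped = pickle.loads(pickle.dumps(payload))
print(round_tripped["type"])
print(round_tripped["traceback"])
```

The receiving side loses the live traceback object but keeps the original error type and message, which is the information that was missing from the first report.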

When printing W I get the following:

Variable{Float64}(<Tensor W:1 shape=(50, 10) dtype=Float64>, <Tensor W/Assign:1 shape=(50, 10) dtype=Float64>)

It seems the variable is referred to as W:1 in the graph, while the exception above tries to refer to W:0?
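
One possible reading, not confirmed anywhere in this thread, is that the Julia side prints tensor names with a 1-based output index while Python uses a 0-based one. Under that assumption the two names would map to each other as sketched below. `julia_to_python_tensor_name` is a hypothetical helper written for illustration and is not part of TensorFlow.jl.

```python
def julia_to_python_tensor_name(name):
    """Convert 'op:k' (assumed 1-based output index) to 'op:(k-1)' (0-based).

    Hypothetical helper for illustration only; not a TensorFlow.jl function.
    """
    op_name, sep, index = name.rpartition(":")
    if not sep or not op_name or not index.isdigit() or int(index) < 1:
        raise ValueError(f"expected 'op:k' with k >= 1, got {name!r}")
    return f"{op_name}:{int(index) - 1}"


print(julia_to_python_tensor_name("W:1"))         # W:0
print(julia_to_python_tensor_name("W/Assign:1"))  # W/Assign:0
```

If that assumption holds, W:1 and W:0 would name the same tensor, and the index difference alone would not explain the KeyError. The message also says that the operation 'W' does not exist in the graph, which may be the more relevant clue.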

I'm not sure if it's comparable, but in Python it looks like this:

W = tf.Variable(tf.zeros([784,10]))
print(W)

outputs

<tf.Variable 'Variable:0' shape=(784, 10) dtype=float32_ref> <tf.Variable 'Variable_1:0' shape=(10,) dtype=float32_ref>

(Notice Variable:0)
