Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Giant refactor to move all state into a Kernel struct #1145

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

JamesWrigley
Copy link
Collaborator

@JamesWrigley JamesWrigley commented Feb 5, 2025

Pretty much what the title says 😛 This is not ready yet but I'm opening it for transparency. There are two reasons for the refactor:

  1. Easier testing, so that we can more confidently make changes like Make message receive and handling async #1140 and Jupyter messaging updates/correctness #1138.
  2. Allow for creating a temporary kernel that can be used in a precompilation workload for Add precompile statements to reduce startup latency #1074.

Overview:

  • Almost all global state has been moved into the new Kernel struct, and all relevant internal methods now require a kernel be passed to them. That's the bulk of the diff. I've tried to minimize other changes as much as possible, so there should 🤞 be no change in behaviour from master.

  • There are certain fields like ans and n which are documented as globals and must stay global, so setproperty!(::Kernel,...) is implemented such that it will copy changes to those fields to their corresponding global variable. Similarly there are certain public methods like IJulia.history() that must still work without a kernel argument, so a new _default_kernel global is defined for normal usage and that's used by default for all public methods.

    An edge-case are the dicts In and Out which are by default assigned to the corresponding fields of _default_kernel and will not contain changes from temporary test kernels. I don't think that's a particularly important discrepancy so I'm inclined to ignore it.

  • I implemented a basic test in waitloop.jl that spawns the IJulia waitloop and then makes a request to it using jupyter_client. This isn't working very well right now because blocking calls to the client causes hangs, I suspect there's a deadlock somewhere.

This will need a lot more work before it can be merged, but feel free to review the design already (cc @stevengj, @fredrikekre).

@MasonProtter, you should be able to do... something... with this for a precompilation workload. I would suggest creating a Kernel, connecting to some of the sockets with ZMQ.jl, and then sending some mock requests following the wire protocol: https://jupyter-client.readthedocs.io/en/latest/messaging.html#the-wire-protocol

@JamesWrigley JamesWrigley self-assigned this Feb 5, 2025
Copy link

codecov bot commented Feb 7, 2025

Codecov Report

Attention: Patch coverage is 88.33819% with 40 lines in your changes missing coverage. Please review.

Project coverage is 67.19%. Comparing base (4390ef4) to head (bb7cef4).

Files with missing lines Patch % Lines
src/handlers.jl 64.00% 9 Missing ⚠️
src/init.jl 87.03% 7 Missing ⚠️
src/IJulia.jl 93.10% 6 Missing ⚠️
src/inline.jl 57.14% 6 Missing ⚠️
src/stdio.jl 82.35% 6 Missing ⚠️
src/eventloop.jl 86.36% 3 Missing ⚠️
src/comm_manager.jl 90.00% 2 Missing ⚠️
src/execute_request.jl 96.29% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #1145       +/-   ##
===========================================
+ Coverage   10.21%   67.19%   +56.97%     
===========================================
  Files          14       15        +1     
  Lines         822      948      +126     
===========================================
+ Hits           84      637      +553     
+ Misses        738      311      -427     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@JamesWrigley
Copy link
Collaborator Author

Figured out the deadlock, we can now start writing full tests with jupyter_client :)

@JamesWrigley JamesWrigley force-pushed the kernel-struct branch 4 times, most recently from 64ec133 to b19c5aa Compare February 10, 2025 00:11
using ZMQ, JSON, SoftGlobalScope
import Base.invokelatest
import Base: invokelatest, RefValue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably using rather than import since we only use these, not extend them? Also invokelatest is exported automatically these days.

(This code dates back to the days before using Foo: bar)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, but I think I'd prefer to do those sorts of changes in a later PR. This one is already quite large 😅

src/IJulia.jl Outdated
setfield!(value, name, x)
end

function Base.close(kernel::Kernel)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be called by a finalizer?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately that wouldn't be easy because it wait's on the stdio tasks so it might yield. I think it's ok to rely on the do-constructor closing it in the tests.

# global counterparts.
if name (:ans, :n, :In, :Out, :inited, :verbose)
setproperty!(IJulia, name, x)
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aren't ans, In, Out global in Main (not IJulia)?

Maybe this kind of global should be copied to/from Main before/after each execute request?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whereas n and verbose at least can probably be moved in to the Kernel?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're quite right 🤔 I need to figure out a good way to test that, for now I've added a @test_broken test in d9acb6e.

whereas n and verbose at least can probably be moved in to the Kernel?

n is documented to be module-global (https://julialang.github.io/IJulia.jl/dev/library/public/#IJulia.n) so I don't think we can move that fully into Kernel, but verbose should be possible. I think only @vprintln/@verror_show will need tweaking to look at _default_kernel.verbose instead of the global.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the tests for ans/In/Out being propagated to kernel.current_module in 2fd607f, also moved verbose fully into Kernel.

src/stdio.jl Outdated
@@ -274,8 +260,8 @@ function oslibuv_flush()
end

import Base.flush
function flush(io::IJuliaStdio)
function flush(io::IJuliaStdio, kernel)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a user-visible function, e.g. if the user calls flush(stdout).

It would be better to store a reference to the kernel in the IJuliaStdio object.

Copy link
Collaborator Author

@JamesWrigley JamesWrigley Feb 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, fixed in d9acb6e and added a test for it.

@JamesWrigley
Copy link
Collaborator Author

JamesWrigley commented Feb 10, 2025

Progress update:

  • Moved the waitloop task into Kernel to make testing easier, and added a wait(::Kernel) method to wait for it to finish.
  • Added/expanded tests for autocompletion, MIME display, and clear_output() using jupyter_kernel_test. I think that covers everything we can use jupyter_kernel_test for.
  • Moved the calls to watch_stdio() and pushdisplay() into init(), now close(::Kernel) will ensure they're cleaned up properly.
  • Added a bunch of tests for some of the public API: clear_history(), load(), the hook system, and history().
  • Tweaked history() to print all the entries on new lines in 20dc591. From the docstring, I think that's what it should be doing?

To-do:

  • Address remaining review comments.
  • Add a test for kernel.jl to ensure that the kernel works outside of the test suite.
  • Figure out why the load() test is failing on Windows.
  • Go through the coverage reports to find more low-hanging fruit to test.

@JamesWrigley JamesWrigley force-pushed the kernel-struct branch 2 times, most recently from 212ff17 to 3c393ba Compare February 13, 2025 22:09
@JamesWrigley JamesWrigley marked this pull request as ready for review February 13, 2025 23:16
@JamesWrigley
Copy link
Collaborator Author

I believe this is ready to be reviewed now. One thorny issue is segfaults on windows that appear to be coming from ZMQ, using @threadcall for the heartbeat seemed to fix it: d491800
But I'm not convinced I understand the problem fully, will try to debug it a bit more before merging.

For reviewers, I've tried to keep the commits atomic so I would recommend reviewing each commit one-by-one.

@JamesWrigley
Copy link
Collaborator Author

Explicitly waiting for the heartbeat thread also seemed to fix it, which seems preferable to @threadcall in the interest of minimizing changes. But now mac is hanging...

@JamesWrigley JamesWrigley force-pushed the kernel-struct branch 2 times, most recently from 68e3717 to 0b6e7be Compare February 14, 2025 22:05
@JamesWrigley
Copy link
Collaborator Author

Back in the green with 0b6e7be 😅 I think this is in pretty good shape now.

@JamesWrigley
Copy link
Collaborator Author

I went ahead and added a simple precompilation workload in 9bb03a1. On my system it reduces the running time of getting the kernel info and executing 42 from 4s to 0.5s.

Not sure why the tests are failing, it seems to be related to CondaPkg 🤔 Will investigate later, but I believe this is ready for review anyway.

@JamesWrigley JamesWrigley linked an issue Feb 20, 2025 that may be closed by this pull request
This is the only change in 5.4, in previous versions it was sent on the shell
socket.
The Jupyter messaging docs state that all messages on the shell (request)
channel shall be ended with a `status: idle` message, and that: "This idle
status message indicates that IOPub messages associated with a given request
have all been received."

Previously we did not attempt to ensure this, which meant that in some cases the
`execute_result` message would be published after the `execute_reply` message
had been sent. It's still not exactly guaranteed by the `yield()`, but it's
something and it makes the `jupyter_kernel_test` tests pass.
Previously we were not ensuring that the heartbeat thread had stopped before
exiting, and were closing the context in an unsafe way. This was causing hangs
on Mac and segfaults on Windows.
Other notable changes:
- Moved `create_profile()` into init.jl so we can use it in `init()` and the
  precompilation workload.
- Added an option to not capture stdin, because during precompilation it's not
  possible to redirect stdin.
@JamesWrigley
Copy link
Collaborator Author

Back in the green, had to explicitly specify MicroMamba after the latest CondaPkg update.

@MasonProtter
Copy link
Contributor

Testing locally and installing a kernel, the kernel seems to die as soon as I attempt to connect

@JamesWrigley
Copy link
Collaborator Author

Is the path to kernel.jl in the kernel.json file correct? If precompilation succeeds but Jupyter can't connect to the kernel I think that points to a problem in kernel.json (I ran into that while testing locally).

Copy link
Contributor

@MasonProtter MasonProtter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Figured out the problem I was having, it was crashing at startup for me because comm_open, comm_close, and comm_msg had the wrong signature relative to the invokelatest call in the event loop.

Github isn't letting me give a suggestion for the comm_close and comm_msg functions, but they need to all have the same signature.


try
send_status("busy", kernel, msg)
invokelatest(get(handlers, msg.header["msg_type"], unknown_request), socket, kernel, msg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears to hit comm_open, comm_msg and comm_close, but those don't have the same signature as the one used here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in bb7cef4.

@@ -97,7 +97,7 @@ end

# handlers for incoming comm_* messages

function comm_open(sock, msg)
function comm_open(sock, msg, kernel)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
function comm_open(sock, msg, kernel)
function comm_open(sock, kernel, msg)

needed to match the signature from the event loop (I think?)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in bb7cef4.

@JamesWrigley
Copy link
Collaborator Author

JamesWrigley commented Feb 24, 2025

Ouch, that's quite a serious oversight 😬 I've added tests and hopefully fixed the comms code in bb7cef4, with the one change of making the :notarget Comm sent in the comm_open error case to not be a primary comm, so it won't send a comm_open message to the client:

send_ipython(IJulia.publish[],

One other thing I noticed is that close_comm() (closing a comm from the kernel side) doesn't remove the comm from the comms dictionary:

function close_comm(comm::Comm, data::Dict = Dict(),
metadata::Dict = Dict(); kwargs...)
msg = msg_comm(comm, IJulia.execute_msg, "comm_close", data,
metadata; kwargs...)
send_ipython(IJulia.publish[], msg)
end

@stevengj, should we be removing it like the comm_close() handler does?

@stevengj
Copy link
Member

@shashi, can you common on the close_comm question?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add precompile statements to reduce startup latency
3 participants