add uring_context #38
Conversation
Signed-off-by: Casey Bodley <[email protected]>
uring_context manages a single instance of io_uring which has an associated submission queue and completion queue

each async operation prepares a submission queue entry (sqe) but does not call io_uring_submit() to submit it (except for resume_at(), see comment). this allows sqes to be submitted in batches by run_one(), where a single system call in io_uring_submit_and_wait() can submit pending sqes and await the next completion queue entry (cqe)

io_uring_sqe_set_data() associates each sqe with its io_base pointer. run_one() calls io_uring_cqe_get_data() to retrieve that pointer and call its work() function to invoke complete()/cancel()/error() depending on the return code in cqe->res

unlike poll_context, cancel() and resume_at() are treated as their own async operations initiated with io_uring_prep_cancel() and io_uring_prep_timeout() respectively

Signed-off-by: Casey Bodley <[email protected]>
examples/CMakeLists.txt
Outdated
```cmake
if (BEMAN_NET_WITH_URING)
    target_compile_definitions(${EXAMPLE_TARGET} PRIVATE BEMAN_NET_USE_URING)
```
[pre-commit] reported by reviewdog 🐶
Suggested change:

```cmake
if(BEMAN_NET_WITH_URING)
    target_compile_definitions(
        ${EXAMPLE_TARGET}
        PRIVATE BEMAN_NET_USE_URING
    )
```
```cpp
#else
::std::unique_ptr<::beman::net::detail::context_base> d_owned{new ::beman::net::detail::poll_context()};
#endif
::beman::net::detail::context_base& d_context{*this->d_owned};
```
[pre-commit] reported by reviewdog 🐶
Suggested change (formatting only):

```cpp
::beman::net::detail::context_base& d_context{*this->d_owned};
```
for testing, this BEMAN_NET_USE_URING define is added to example targets when cmake option BEMAN_NET_WITH_URING is enabled

Signed-off-by: Casey Bodley <[email protected]>
with poll_context: when stop_source.request_stop() calls poll_context::cancel(), the target operation's io_base::cancel() is called inline. this calls sender_awaiter::stop() which transitions from stop_state::stopping to stopped. because request_stop() returns back to callback_t in stop_state::stopped, complete_stopped() is called there

with uring_context: uring_context::cancel() is asynchronous, so callback_t is still in stop_state::stopping when request_stop() returns. run_one() eventually sees the target operation complete with ECANCELED and calls io_base::cancel(). when that calls sender_awaiter::stop(), the state is still stop_state::stopping, so complete_stopped() never gets called

this causes beman.net.examples.client to exit before demo::scope gets the stop signal from the 2 coroutines:

> ERROR: scope destroyed with live jobs: 2

to address this, callback_t now handles this async case by transitioning back to stop_state::running after stop_source.request_stop() returns

Signed-off-by: Casey Bodley <[email protected]>
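As a hedged sketch of the fix this commit message describes (the state and function names come from the message itself; the exact shape of callback_t is an assumption, not the actual code):

```cpp
// hypothetical sketch of demo::task's stop callback after the fix
auto callback_t::operator()() -> void {
    state = stop_state::stopping;
    source.request_stop(); // poll_context cancels inline; uring_context doesn't
    if (state == stop_state::stopped) {
        // synchronous path: io_base::cancel() already ran inside request_stop()
        complete_stopped();
    } else {
        // asynchronous path (uring_context): the ECANCELED cqe hasn't arrived
        // yet; transition back to running so sender_awaiter::stop() completes
        // the coroutine itself when run_one() delivers the cancellation
        state = stop_state::running;
    }
}
```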
dietmarkuehl
left a comment
I haven’t looked at the actual implementation yet, but I hope to provide feedback on that soon, too (it may be early next week - I aim for 2025-11-12 at the latest).
Thank you very much for the contribution!
dietmarkuehl
left a comment
I'm really sorry for the belated response! I didn't manage to muster the energy to look at the pull request in detail. In the attempts I made, I mostly got tripped up by the get_sqe function - it felt wrong to potentially block. I didn't come up with a plausible way around that until quite recently, and it isn't clear whether that approach is reasonable. So, for now I consider potentially blocking to be fine.
It may be worth looking at what other implementations do (but I haven't done that thoroughly):
- `stdexec` seems to have a loop which `submit(...)`s tasks but also reaps completions, i.e., it would only block if there are no completions and it can't submit anything. It seems they use `io_uring` in a mode which doesn't require `io_uring_submit` (or I missed the call). They also don't use the wrapper library, it seems.
- Unifex likewise seems to `schedule_pending_io` which is then submitted by `io_uring_context::run_impl`.
I do want to get an io_uring implementation. It seems the current version is reasonably close to what I think is needed, i.e., I think I could land this pull request if that's the preference (I would probably apply some of the changes I outlined).
```cpp
while (sqe == nullptr) {
    // if the submission queue is full, flush and try again
    submit();
    sqe = ::io_uring_get_sqe(&ring);
}
```
This bit I tripped over the last time I tried to review the code thoroughly: this is, effectively, a busy wait for an indefinite amount of time. The assumption would be that it actually isn't hit (especially if the interface eventually gets enhanced to cope with sequences of results with just one submission), so it shouldn't be a problem (and I won't push back on this one).
I was thinking about alternative approaches. One idea is to keep track of unsubmitted work. The io_base object could be put into a [singly-linked] intrusive list of unsubmitted work which gets submitted on next "opportunity". Of course, this opportunity may not come unless there is something making sure it comes. A potential for that could be to submit an IORING_OP_NOP operation (io_uring_prep_nop) using the last empty slot in the submission queue to trigger processing of the unsubmitted work queue (if any).
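A rough sketch of that idea might look like the following. The `unsubmitted_head` list, the intrusive `io_base::next` pointer, and the `prepare` helper are hypothetical additions for illustration; only the liburing calls are real API:

```cpp
// hypothetical: park work when the submission queue is full, and spend
// the last free sqe on a NOP whose completion drains the parked list
auto enqueue(io_base* op) -> void {
    ::io_uring_sqe* sqe = ::io_uring_get_sqe(&ring);
    if (sqe == nullptr) {
        // queue is full: a NOP is already in flight (submitted below when
        // the last slot was taken), so just park the work
        op->next         = unsubmitted_head; // hypothetical intrusive list
        unsubmitted_head = op;
        return;
    }
    if (::io_uring_sq_space_left(&ring) == 0) {
        // this was the last free slot: turn it into a NOP so run_one()
        // is guaranteed an "opportunity" to submit the parked work
        ::io_uring_prep_nop(sqe);
        ::io_uring_sqe_set_data(sqe, nullptr); // assumed "drain" sentinel
        op->next         = unsubmitted_head;
        unsubmitted_head = op;
        return;
    }
    prepare(sqe, op); // assumed: fill in the real operation's sqe
}
```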
Also, as the submission is communication between the program and the kernel, this loop shouldn't block indefinitely in any case. I may be entirely considering the wrong problem.
in a single-threaded application, i am assuming that io_uring_submit() will reliably drain the submission queue and guarantee that the next io_uring_get_sqe() has something to return. if my assumption is wrong, we might avoid an infinite loop here by throwing after a single submit/retry fails
if many threads are submitting work, then it's possible for enough io_uring_get_sqe() calls (QUEUE_DEPTH=128) to occur between this thread's calls to io_uring_submit() and io_uring_get_sqe(). however, i've made no considerations for thread-safety in this design. can you please clarify the requirements here?
- can `io_context::run()` be called multiple times to establish a pool of worker threads? if so, we probably want a separate ring for each but this gets complicated
- can `accept()`/`connect()`/etc be called to submit work from multiple threads? if so, we would need to add their `io_base`s to a thread-safe queue so that `run_one()` alone is calling into liburing
```cpp
if (auto count = process_task(); count) {
    return count;
}
```
This special treatment of tasks gives the tasks priority over outstanding I/O. I would consider scheduling the tasks as an IORING_OP_NOP (`::io_uring_prep_nop`) which should achieve two things:
- There is no special case in `run_one` needed to process tasks.
- Even if there are always tasks ready to be scheduled, completions of I/O operations get a chance to run.
If using IORING_OP_NOP isn't a good idea (I'm not experienced enough with io_uring to tell), my inclination would be to still give some priority to I/O operations over scheduled tasks: I could imagine a strategy which makes sure that there is always something [hopefully short] to run instead of waiting on I/O to come in. However, if there is always a task scheduled, I think I/O never gets a chance. So, the alternative approach would be to:
- `::io_uring_submit` work if any is unsubmitted.
- Use `::io_uring_peek_cqe` to see if I/O is ready and, if so, process that.
- Otherwise, see if there is a task outstanding which can be scheduled.
- Otherwise (when there is no work which can be progressed) `wait()`.
If that is believed to give too much priority to I/O operations a variation alternating between preferring I/O or scheduled tasks could be used. Submitting IORING_OP_NOP items would kind of do that.
if these "tasks" are related to scheduling, is it safe to assume that process_task() is likely just calling some chained sender that will immediately start more i/o? if so, processing these tasks before io_uring_submit() has the advantage that we can batch those new sqes with any that were already pending to minimize the number of uring system calls
for example, consider 3 concurrent calls to net::resume_after(scheduler, duration). ideally, this would make 3 calls to io_uring_prep_timeout() then one io_uring_submit() for all 3 - whereas your alternative would issue a separate io_uring_submit() after each io_uring_prep_timeout()
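For illustration, a standalone liburing sketch of that ideal batching (not beman.net code; `ring` is assumed to be an initialized `io_uring` with room in its submission queue):

```cpp
// three timers are prepared back to back, then one io_uring_submit()
// hands all three sqes to the kernel in a single io_uring_enter() call
struct __kernel_timespec ts{.tv_sec = 1, .tv_nsec = 0};
for (int i = 0; i < 3; ++i) {
    ::io_uring_sqe* sqe = ::io_uring_get_sqe(&ring); // assumed non-null here
    ::io_uring_prep_timeout(sqe, &ts, 0, 0);
}
int submitted = ::io_uring_submit(&ring); // expected: 3
```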
it's probably worth prioritizing completions over io_uring_submit() too, as they may also prepare chained work that can be batched. there's also the possibility that the kernel has overflowed the completion queue, which would apparently cause io_uring_submit() to fail with EBUSY. using io_uring_peek_cqe() to drain completions first should help to avoid this
what do you think about the following order?
- if `io_uring_peek_cqe()` finds a ready completion, process it and return
- if there are pending tasks, process one and return
- `io_uring_submit()` if sqes are pending
- `io_uring_wait_cqe()` to wait for a completion to process
```cpp
    outstanding += r;
}

if (!outstanding) {
```
I don't think testing outstanding alone is correct: I don't see why `::io_uring_submit_and_wait` needs to return a non-zero result when it is called, i.e., submitting may be non-zero even though outstanding is zero. I entirely agree that it is very unlikely that `::io_uring_submit_and_wait` (or `io_uring_submit`) would return zero, though.
Of course, if outstanding just tracks the total outstanding work, this concern is addressed, too.
> Of course, if `outstanding` just tracks the total outstanding work, this concern is addressed, too.
i did apply this change to outstanding, but i don't think it fully addresses this concern with a 0-return from io_uring_submit_and_wait(). if we prepared some sqes but none were successfully submitted to the kernel, our call to io_uring_wait_cqe() could deadlock waiting for a completion that isn't ever coming
before the change to outstanding, a failure to submit any sqes could instead cause io_context::run() to return early
since this breaks our implementation either way, maybe it's best to assert or throw on an unexpected 0-return? if we do end up finding cases where this happens, the reproducer may give us more information to guide the design
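Something along these lines, say (a sketch; `submitting` is assumed to count prepared-but-unsubmitted sqes):

```cpp
int r = ::io_uring_submit_and_wait(&ring, 1);
if (r < 0) {
    throw ::std::system_error(-r, ::std::system_category(),
                              "io_uring_submit_and_wait");
}
if (r == 0 && submitting > 0) {
    // sqes were pending but none were submitted; waiting on a cqe now
    // could deadlock, so fail loudly and collect a reproducer instead
    throw ::std::runtime_error("io_uring_submit_and_wait() submitted no sqes");
}
```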
```cpp
// single system call. this allows io_uring_wait_cqe() below to be
// served directly from memory
unsigned wait_nr = 1;
int r = ::io_uring_submit_and_wait(&ring, wait_nr);
```
Why is this ::io_uring_submit_and_wait instead of just ::io_uring_submit? The wait() function will ::io_uring_wait_cqe, i.e., waiting will occur if necessary anyway (if no results are ready).
https://github.com/axboe/liburing/wiki/io_uring-and-networking-in-2023#batching recommends the use of io_uring_submit_and_wait() to minimize the number of io_uring_enter() system calls in "IO event loops" like ours
my understanding is that:

- `io_uring_submit()` calls `io_uring_enter()` to submit sqes to the kernel, and
- `io_uring_wait_cqe()` calls `io_uring_enter()` to wait only if there are no ready cqes in userspace memory

`io_uring_submit_and_wait()` does both in a single system call, guaranteeing that our later call to `io_uring_wait_cqe()` can be served from memory as if by `io_uring_peek_cqe()`
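In sketch form, the difference in system calls (assuming `ring` is an initialized `io_uring` with sqes pending):

```cpp
::io_uring_cqe* cqe{};

// up to two io_uring_enter() system calls:
::io_uring_submit(&ring);          // enter() to submit sqes
::io_uring_wait_cqe(&ring, &cqe);  // enter() again if no cqe is ready

// one io_uring_enter() system call:
::io_uring_submit_and_wait(&ring, 1); // submit sqes and wait for 1 cqe
::io_uring_wait_cqe(&ring, &cqe);     // now served from userspace memory
```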
```cpp
submitting -= r;
outstanding += r;
```
I wonder if this logic is guaranteed to be accurate! I had the impression that items can be picked up by the kernel without them actually getting submitted: once the queue is updated (i.e., the SQE pointer(s) is(are) set and the count is increased) work can be picked up. The call to `::io_uring_submit` is used to make sure things get picked up. On the other hand, the man page explicitly mentions that SQPOLL affects whether the value is correct. It also implies that the primary use of tracking submissions is to determine whether the data pointed to by the SQE needs to remain valid.
My inclination would be to update outstanding every time work gets added to the context and have it be independent of whether the work still needs to be submitted: every work item added needs to be completed, i.e., outstanding would indicate how much work there is still to do. The current value is how much submitted work is still to do, which seems to muddle two separate concerns. It is reasonable (possibly necessary) to separately track submitting to determine whether any work still needs to be submitted.
> I wonder if this logic is guaranteed to be accurate!
i am expecting this return value to be accurate because our io_uring_queue_init() call is not passing the IORING_SETUP_SQPOLL flag documented in io_uring_setup()
> My inclination would be to update `outstanding` every time work gets added to the context and have it be independent of whether the work still needs to be submitted.
thanks, this would make it easier to experiment with SQPOLL in the future - done
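In other words, something along these lines (a sketch of the bookkeeping; the enqueue hook is an assumed name):

```cpp
// every operation added to the context must eventually complete, so
// outstanding is bumped at enqueue time, independent of submission
auto enqueue(io_base* op) -> void {
    ++outstanding; // total pending completions
    ++submitting;  // sqes still waiting for io_uring_submit()
    // ... prepare the sqe for op ...
}
```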
```cpp
submitting -= r;
outstanding += r;
```
As mentioned previously, I think I wouldn't tie submitting and outstanding: I don't see a reason why outstanding can't just track how much outstanding work there is, independent of whether it was submitted.
```cpp
// read the next completion, waiting if necessary
auto [res, completion] = wait();

if (completion) {
```
I believe all operations are submitted with a completion set (in get_sqe; peeking ahead I see that cancel [currently] uses a nullptr but I don't think that is right), i.e., this check shouldn't be needed. If this check is needed, there may have been nothing completed but the function still returns 1u.
```cpp
// use io_uring_prep_cancel() for asynchronous cancellation of op.
// cancel_op, aka sender_state::cancel_callback, lives inside of op's
// operation state. op's completion may race with this cancellation,
// causing that sender_state and its cancel_callback to be destroyed.
// so we can't pass cancel_op to io_uring_sqe_set_data() and attach a
// cancel_op->work() function to handle its completion in run_one().
// instead, we just complete it here without waiting for the result
cancel_op->complete();
```
I don't think this is correct! Yes, cancellation and completion are racy! However, the senders should be (need to be) implemented such that the number of outstanding completions is accurately tracked! That is, if an operation gets cancelled, its count of outstanding completions needs to be incremented first and only then does any actual cancellation get submitted. Only once the count is incremented is the cancellation state created. I believe this is the relevant code: if outstanding was incremented beyond 1u, the operation wasn't completed yet, and with outstanding being 2u a non-cancellation completion won't complete the operation.
If the logic I tried to outline above doesn't work as described, it is a bug which needs fixing (this wouldn't be part of this pull request: it would be a separate issue).
Looking at the current implementation I'm not entirely happy with the behaviour: if cancellation was triggered but there was a successful completion before the cancellation completes (which is a potential in particular with io_uring but can't happen with poll, epoll, or kqueue, I think), the ultimate result is currently set_stopped(...) although it could be set_value(...).
> Only once the count is incremented is the cancellation state created. I believe this is the relevant code: if outstanding was incremented beyond 1u, the operation wasn't completed yet, and with outstanding being 2u a non-cancellation completion won't complete the operation.
thanks for the context! i added a squash commit that restores the asynchronous completion of uring_context::cancel(), and was able to reproduce the original crash
i believe the problem is that sender_state::complete() and error() both destroy this cancel_callback even if a cancellation is pending:
net/include/beman/net/detail/sender.hpp
Lines 115 to 118 in 3874d13
```cpp
auto complete() -> void override final {
    d_callback.reset();
    if (0 == --this->d_outstanding) {
        this->d_data.set_value(*this, ::std::move(this->d_receiver));
```
the crash no longer reproduces if i move the d_callback.reset() line inside that if-block. does that look right?
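that is, a sketch of the proposed change, using the same members as the excerpt above:

```cpp
auto complete() -> void override final {
    if (0 == --this->d_outstanding) {
        // only destroy the cancel_callback once the last outstanding
        // completion arrives, so a pending async cancellation can still
        // safely use its memory
        d_callback.reset();
        this->d_data.set_value(*this, ::std::move(this->d_receiver));
    }
}
```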
> If the logic I tried to outline above doesn't work as described, it is a bug which needs fixing (this wouldn't be part of this pull request: it would be a separate issue).
i added this change as an additional commit to this pr, but i'm happy to raise a separate pr too if you like. similar to the demo::task fix, it only applies to async cancellation which is specific to uring_context
```cpp
op->work = [](context_base& ctx, io_base* io) {
    auto res = *static_cast<int*>(io->extra.get());
    if (res == -ECANCELED) {
        io->cancel();
        return submit_result::ready;
    } else if (res < 0) {
        io->error(::std::error_code(-res, ::std::system_category()));
        return submit_result::error;
    }
    auto op = static_cast<accept_operation*>(io);
    // set socket
    ::std::get<2>(*op) = ctx.make_socket(res);
    io->complete();
    return submit_result::ready;
};

auto sqe = get_sqe(op);
auto fd = native_handle(op->id);
auto addr = ::std::get<0>(*op).data();
auto addrlen = &::std::get<1>(*op);
int flags = 0;
::io_uring_prep_accept(sqe, fd, addr, addrlen, flags);
return submit_result::submit;
```
It looks as if this code is stamped out a few times below (I haven't verified if the blocks are actually mostly identical). I think I would factor most of the logic into a function which may need to take two [lambda] functions (one wrapping the "prep" function and one producing the result) as arguments. Also, I probably wouldn't create variables out of the various op accesses but just pass them to the "prep" function. I do admit that it helps with readability spelling out what the various members are, though.
> Also, I probably wouldn't create variables out of the various op accesses but just pass them to the "prep" function. I do admit that it helps with readability spelling out what the various members are, though.
i do like to give names to function arguments. while it uses some extra vertical space, each of these functions still fits nicely on a single page in the editor 🤷
> It looks as if this code is stamped out a few times below (I haven't verified if the blocks are actually mostly identical).
each of the completion lambdas differ slightly in how they map error codes and store return values, if any. each of the prepare blocks differ significantly in the io_uring_prep_* function and arguments
> I think I would factor most of the logic into a function which may need to take two [lambda] functions (one wrapping the "prep" function and one producing the result) as arguments.
i don't really see the benefit of a helper function that combines the two, assuming it just looks something like this:
```cpp
auto prepare_sqe(io_base* op, auto prepare, auto complete) -> submit_result {
    op->work = std::move(complete);
    return prepare(op);
}
```

'outstanding' tracks all pending completions, not just the submitted ones
treat cancel() as a normal async operation whose completion is triggered by its corresponding cqe. remove the special case for this null completion in run_one()
if the cancel_callback() has incremented d_outstanding such that sender_state::complete() or error() cannot complete immediately, avoid destroying the cancel_callback with d_callback.reset(). where cancellation is asynchronous, that memory needs to remain intact until the cancellation completes

Signed-off-by: Casey Bodley <[email protected]>
@dietmarkuehl thanks very much for the review. i see no reason to rush the merge - as long as you're willing to give feedback, i'm happy to iterate on the design and implementation
introduce a uring backend, with cmake option BEMAN_NET_WITH_URING to enable the liburing dependency and #ifdef BEMAN_NET_USE_URING to select uring_context instead of poll_context
uring_context manages a single instance of io_uring which has an associated submission queue and completion queue
each async operation prepares a submission queue entry (sqe) but does not call io_uring_submit() to submit it (except for resume_at(), see comment). this allows sqes to be submitted in batches by run_one(), where a single system call in io_uring_submit_and_wait() can submit pending sqes and await the next completion queue entry (cqe)
io_uring_sqe_set_data() associates each sqe with its io_base pointer. run_one() calls io_uring_cqe_get_data() to retrieve that pointer and call its work() function to invoke complete()/cancel()/error() depending on the return code in cqe->res
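a hedged sketch of that dispatch, based on the shapes visible in the review excerpts above (the int stored in io_base::extra mirrors the accept example; the actual run_one() code may differ):

```cpp
// run_one(), after io_uring_wait_cqe() produced a cqe:
auto* io = static_cast<io_base*>(::io_uring_cqe_get_data(cqe));
*static_cast<int*>(io->extra.get()) = cqe->res; // stash the result for work()
::io_uring_cqe_seen(&ring, cqe);
io->work(*this, io); // maps cqe->res to complete()/cancel()/error()
```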
unlike poll_context, cancel() and resume_at() are treated as their own async operations initiated with io_uring_prep_cancel() and io_uring_prep_timeout() respectively
this async cancellation strategy was not compatible with demo::task, which did not call complete_stopped() unless the cancelled op completed inside of stop_source.request_stop() (see commit message for details). the proposed solution seems to work for both poll_context and uring_context
tested primarily with the client/server examples