
Conversation


@cbodley cbodley commented Nov 8, 2025

introduce a uring backend, with cmake option BEMAN_NET_WITH_URING to enable the liburing dependency and #ifdef BEMAN_NET_USE_URING to select uring_context instead of poll_context


uring_context manages a single instance of io_uring which has an associated submission queue and completion queue

each async operation prepares a submission queue entry (sqe) but does not call io_uring_submit() to submit it (except for resume_at(), see comment). this allows sqes to be submitted in batches by run_one(), where a single system call in io_uring_submit_and_wait() can submit pending sqes and await the next completion queue entry (cqe)

io_uring_sqe_set_data() associates each sqe with its io_base pointer. run_one() calls io_uring_cqe_get_data() to retrieve that pointer and call its work() function to invoke complete()/cancel()/error() depending on the return code in cqe->res

unlike poll_context, cancel() and resume_at() are treated as their own async operations initiated with io_uring_prep_cancel() and io_uring_prep_timeout() respectively
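in code, the prepare/submit/dispatch cycle looks roughly like this (a simplified sketch: the real io_base, the work() signature, and uring_context's members in this pr all differ):

#include <liburing.h>

// simplified stand-in for this pr's io_base; the real work() signature
// takes (context_base&, io_base*) and returns submit_result
struct io_base {
    void (*work)(io_base* self, int res) = nullptr;
};

// each async operation prepares an sqe but leaves submission to run_one()
void prep_op(::io_uring& ring, io_base* op, int fd, void* buf, unsigned len) {
    ::io_uring_sqe* sqe = ::io_uring_get_sqe(&ring);
    ::io_uring_prep_read(sqe, fd, buf, len, 0);
    ::io_uring_sqe_set_data(sqe, op); // associate the sqe with its io_base
}

void run_one(::io_uring& ring) {
    // one system call submits the whole batch and waits for the next cqe
    ::io_uring_submit_and_wait(&ring, 1);
    ::io_uring_cqe* cqe = nullptr;
    ::io_uring_wait_cqe(&ring, &cqe); // served from memory after the wait
    auto* op = static_cast<io_base*>(::io_uring_cqe_get_data(cqe));
    op->work(op, cqe->res); // negative res carries -errno
    ::io_uring_cqe_seen(&ring, cqe);
}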


this async cancellation strategy was not compatible with demo::task, which did not call complete_stopped() unless the cancelled op completed inside of stop_source.request_stop() (see commit message for details). the proposed solution seems to work for both poll_context and uring_context


tested primarily with the client/server examples

Comment on lines 23 to 24
if (BEMAN_NET_WITH_URING)
target_compile_definitions(${EXAMPLE_TARGET} PRIVATE BEMAN_NET_USE_URING)

[pre-commit] reported by reviewdog 🐶

Suggested change
if (BEMAN_NET_WITH_URING)
target_compile_definitions(${EXAMPLE_TARGET} PRIVATE BEMAN_NET_USE_URING)
if(BEMAN_NET_WITH_URING)
    target_compile_definitions(
        ${EXAMPLE_TARGET}
        PRIVATE BEMAN_NET_USE_URING
    )

#else
::std::unique_ptr<::beman::net::detail::context_base> d_owned{new ::beman::net::detail::poll_context()};
#endif
::beman::net::detail::context_base& d_context{*this->d_owned};

[pre-commit] reported by reviewdog 🐶

Suggested change
::beman::net::detail::context_base& d_context{*this->d_owned};
::beman::net::detail::context_base& d_context{*this->d_owned};

for testing, this BEMAN_NET_USE_URING define is added to example targets
when cmake option BEMAN_NET_WITH_URING is enabled

Signed-off-by: Casey Bodley <[email protected]>
with poll_context:
when stop_source.request_stop() calls poll_context::cancel(), the target
operation's io_base::cancel() is called inline. this calls
sender_awaiter::stop() which transitions from stop_state::stopping to
stopped. because request_stop() returns back to callback_t in
stop_state::stopped, complete_stopped() is called there

with uring_context:
uring_context::cancel() is asynchronous, so callback_t is still in
stop_state::stopping when request_stop() returns. run_one() eventually
sees the target operation complete with ECANCELED and calls
io_base::cancel(). when that calls sender_awaiter::stop(), the state is
still stop_state::stopping, so complete_stopped() never gets called

this causes beman.net.examples.client to exit before demo::scope gets
the stop signal from the 2 coroutines:
> ERROR: scope destroyed with live jobs: 2

to address this, callback_t now handles this async case by transitioning
back to stop_state::running after stop_source.request_stop() returns

Signed-off-by: Casey Bodley <[email protected]>
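A rough sketch of the shape of this fix (stop_state, request_stop(), and complete_stopped() come from the commit message above; every other name is an assumption, not the actual demo::task code):

#include <atomic>
#include <stop_token>

enum class stop_state { running, stopping, stopped };

struct callback_t { // hypothetical shape, for illustration only
    ::std::atomic<stop_state> state{stop_state::running};
    ::std::stop_source stop_source;

    auto trigger_stop() -> void { // assumed entry point
        state = stop_state::stopping;
        stop_source.request_stop(); // poll: cancels inline; uring: async
        auto expected = stop_state::stopping;
        if (!state.compare_exchange_strong(expected, stop_state::running)) {
            // synchronous case: sender_awaiter::stop() already moved the
            // state to stop_state::stopped, so complete here
            complete_stopped();
        }
        // asynchronous case: the state is back to stop_state::running;
        // when run_one() later sees ECANCELED, io_base::cancel() ->
        // sender_awaiter::stop() performs the stopped completion
    }

    auto complete_stopped() -> void; // defined elsewhere in demo::task
};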
@coveralls

Coverage Status

coverage: 3.39%. remained the same when pulling 2ec9cc5 on cbodley:wip-uring into 3874d13 on bemanproject:main.

Member

@dietmarkuehl dietmarkuehl left a comment


I haven’t looked at the actual implementation yet, but I hope to provide feedback on that soon, too (it may be early next week - I aim for 2025-11-12 at the latest).

Thank you very much for the contribution!

Member

@dietmarkuehl dietmarkuehl left a comment


I'm really sorry for the belated response! I didn't manage to muster the energy to look at the pull request in detail. In the attempts I made, I mostly got tripped up by the get_sqe function - it felt wrong to potentially block. I didn't come up with a plausible way around that until quite recently, and it isn't clear whether that approach is reasonable. So, for now, I consider potentially blocking to be fine.

It may be worth looking at what other implementations do (but I haven't done that thoroughly).

I do want to get an io_uring implementation. It seems the current version is reasonably close to what I think is needed, i.e., I think I could land this pull request if that's the preference (I would probably apply some of the changes I outlined).

Comment on lines +88 to +92
while (sqe == nullptr) {
    // if the submission queue is full, flush and try again
    submit();
    sqe = ::io_uring_get_sqe(&ring);
}
Member


This bit I tripped over the last time I tried to review the code thoroughly: this is, effectively, a busy wait for an indefinite amount of time. The assumption would be that it actually isn't hit (especially if the interface eventually gets enhanced to cope with sequences of results with just one submission), so it shouldn't be a problem (and I won't push back on this one).

I was thinking about alternative approaches. One idea is to keep track of unsubmitted work. The io_base object could be put into a [singly-linked] intrusive list of unsubmitted work which gets submitted at the next "opportunity". Of course, this opportunity may not come unless there is something making sure it comes. One way to guarantee that could be to submit an IORING_OP_NOP operation (io_uring_prep_nop) using the last empty slot in the submission queue to trigger processing of the unsubmitted work queue (if any), as sketched below.

Also, as the submission is communication between the program and the kernel, this loop shouldn't block indefinitely in any case. I may be entirely considering the wrong problem.
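A sketch of that idea as a replacement for the busy-wait inside get_sqe (the unsubmitted list, drain_marker sentinel, and the exact placement of the space check are all assumptions):

auto get_sqe(io_base* op) -> ::io_uring_sqe* {
    ::io_uring_sqe* sqe = ::io_uring_get_sqe(&ring);
    if (sqe == nullptr) {
        unsubmitted.push_back(*op); // assumed intrusive list of io_base;
        return nullptr;             // the op's sqe is prepared later
    }
    if (::io_uring_sq_space_left(&ring) == 1) {
        // the queue is nearly full: spend the last slot on a NOP so a
        // completion is guaranteed to come back and give run_one() an
        // opportunity to drain the unsubmitted list
        if (auto* nop = ::io_uring_get_sqe(&ring)) {
            ::io_uring_prep_nop(nop);
            ::io_uring_sqe_set_data(nop, &drain_marker); // assumed sentinel
        }
    }
    return sqe;
}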

Author


in a single-threaded application, i am assuming that io_uring_submit() will reliably drain the submission queue and guarantee that the next io_uring_get_sqe() has something to return. if my assumption is wrong, we might avoid an infinite loop here by throwing after a single submit/retry fails

if many threads are submitting work, then enough io_uring_get_sqe() calls (QUEUE_DEPTH=128) could occur between this thread's calls to io_uring_submit() and io_uring_get_sqe() to fill the queue again. however, i've made no considerations for thread-safety in this design. can you please clarify the requirements here?

  1. can io_context::run() be called multiple times to establish a pool of worker threads? if so, we probably want a separate ring for each but this gets complicated
  2. can accept()/connect()/etc be called to submit work from multiple threads? if so, we would need to add their io_bases to a thread-safe queue so that run_one() alone is calling into liburing

Comment on lines +116 to +118
if (auto count = process_task(); count) {
    return count;
}
Member


This special treatment of tasks gives the tasks priority over outstanding I/O. I would consider scheduling the tasks as an IORING_OP_NOP (::io_uring_prep_nop) which should achieve two things:

  1. There is no special case in run_one needed to process tasks.
  2. Even if there are always tasks ready to be scheduled, completions of I/O operations get a chance to run.

If using IORING_OP_NOP isn't a good idea (I'm not experienced enough with io_uring to tell), my inclination would be to still give some priority to I/O operations over scheduled tasks: I could imagine a strategy which makes sure that there is always something [hopefully short] to run instead of waiting on I/O to come in. However, if there is always a task scheduled, I think I/O never gets a chance. So, the alternative approach would be to:

  1. ::io_uring_submit work if any is unsubmitted.
  2. Use ::io_uring_peek_cqe to see if I/O is ready and, if so, process that.
  3. Otherwise, see if there is a task outstanding which can be scheduled.
  4. Otherwise (when there is no work which can be progressed) wait().

If that is believed to give too much priority to I/O operations, a variation alternating between preferring I/O or scheduled tasks could be used. Submitting IORING_OP_NOP items would kind of do that.
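A sketch of point 1, treating a scheduled task as just another io_base whose work() runs the task body (the schedule() shape is an assumption; get_sqe setting the user data follows the pattern elsewhere in this pull request):

auto schedule(io_base* task) -> void {
    ::io_uring_sqe* sqe = get_sqe(task); // associates 'task' as user data
    ::io_uring_prep_nop(sqe);            // completes immediately in-kernel
    // run_one() pops the NOP's cqe and invokes task->work() like any I/O,
    // so tasks and completions share a single dispatch path
}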

Author


if these "tasks" are related to scheduling, is it safe to assume that process_task() is likely just calling some chained sender that will immediately start more i/o? if so, processing these tasks before io_uring_submit() has the advantage that we can batch those new sqes with any that were already pending to minimize the number of uring system calls

for example, consider 3 concurrent calls to net::resume_after(scheduler, duration). ideally, this would make 3 calls to io_uring_prep_timeout() then one io_uring_submit() for all 3 - whereas your alternative would issue a separate io_uring_submit() after each io_uring_prep_timeout()

it's probably worth prioritizing completions over io_uring_submit() too, as they may also prepare chained work that can be batched. there's also the possibility that the kernel has overflowed the completion queue, which would apparently cause io_uring_submit() to fail with EBUSY. using io_uring_peek_cqe() to drain completions first should help to avoid this

what do you think about the following order? (a sketch follows the list)

  1. if io_uring_peek_cqe() finds a ready completion, process it and return
  2. if there are pending tasks, process one and return
  3. io_uring_submit() if sqes are pending
  4. io_uring_wait_cqe() to wait for a completion to process
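roughly, in code (process() here stands in for the cqe-handling logic in run_one(); submitting counts prepared-but-unsubmitted sqes as elsewhere in this pr):

auto run_one() -> ::std::size_t {
    ::io_uring_cqe* cqe = nullptr;
    if (0 == ::io_uring_peek_cqe(&ring, &cqe)) {
        return process(cqe); // 1. a completion is already ready: handle it
    }
    if (auto count = process_task(); count) {
        return count;        // 2. tasks may prepare sqes we can batch below
    }
    if (submitting) {
        ::io_uring_submit(&ring); // 3. flush pending sqes in one system call
    }
    ::io_uring_wait_cqe(&ring, &cqe); // 4. block for the next completion
    return process(cqe);
}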

    outstanding += r;
}

if (!outstanding) {
Member


I don't think testing outstanding alone is correct: I don't see why ::io_uring_submit_and_wait needs to return a non-zero result when it is called, i.e., submitting may be non-zero even though outstanding is zero. I entirely agree that it is very unlikely that ::io_uring_submit_and_wait (or io_uring_submit) would return zero, though.

Of course, if outstanding just tracks the total outstanding work, this concern is addressed, too.

Author


Of course, if outstanding just tracks the total outstanding work, this concern is addressed, too.

i did apply this change to outstanding, but i don't think it fully addresses this concern with a 0-return from io_uring_submit_and_wait(). if we prepared some sqes but none were successfully submitted to the kernel, our call to io_uring_wait_cqe() could deadlock waiting for a completion that isn't ever coming

before the change to outstanding, a failure to submit any sqes could instead cause io_context::run() to return early

since this breaks our implementation either way, maybe it's best to assert or throw on an unexpected 0-return? if we do end up finding cases where this happens, the reproducer may give us more information to guide the design
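for example (a sketch; 'submitting' as used elsewhere in this pr, throwing ::std::system_error from <system_error> and ::std::runtime_error from <stdexcept>):

unsigned wait_nr = 1;
int r = ::io_uring_submit_and_wait(&ring, wait_nr);
if (r < 0) {
    throw ::std::system_error(-r, ::std::system_category(),
                              "io_uring_submit_and_wait");
}
if (r == 0 && submitting) {
    // sqes were prepared but none reached the kernel; waiting on a cqe
    // now could deadlock, so surface the condition instead
    throw ::std::runtime_error("io_uring submitted no pending sqes");
}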

// single system call. this allows io_uring_wait_cqe() below to be
// served directly from memory
unsigned wait_nr = 1;
int r = ::io_uring_submit_and_wait(&ring, wait_nr);
Member


Why is this ::io_uring_submit_and_wait instead of just ::io_uring_submit? The wait() function will call ::io_uring_wait_cqe, i.e., waiting will occur if necessary anyway (if no results are ready).

Author


https://github.com/axboe/liburing/wiki/io_uring-and-networking-in-2023#batching recommends the use of io_uring_submit_and_wait() to minimize the number of io_uring_enter() system calls in "IO event loops" like ours

my understanding is that:

  • io_uring_submit() calls io_uring_enter() to submit sqes to the kernel, and
  • io_uring_wait_cqe() calls io_uring_enter() to wait only if there are no ready cqes in userspace memory

io_uring_submit_and_wait() does both in a single system call, guaranteeing that our later call to io_uring_wait_cqe() can be served from memory as if by io_uring_peek_cqe()

Comment on lines 82 to 83
submitting -= r;
outstanding += r;
Member


I wonder if this logic is guaranteed to be accurate! I had the impression that items can be picked up by the kernel without them actually getting submitted: once the queue is updated (i.e., the SQE pointer(s) are set and the count is increased) work can be picked up. The call to ::io_uring_submit is used to make sure things get picked up. On the other hand, the man page explicitly mentions that SQPOLL affects whether the value is correct. It also implies that the primary use of tracking submissions is to determine whether the data pointed to by the SQE needs to remain valid.

My inclination would be to update outstanding every time work gets added to the context and have it be independent of whether the work still needs to be submitted: every work item added needs to be completed, i.e., outstanding would indicate how much work there is still to do. The current value is how much submitted work is still to do, which seems to muddle two separate concerns. It is reasonable (possibly necessary) to separately track submitting to determine whether any work still needs to be submitted.
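For illustration, the suggested split might look like this (a sketch; only the counter names come from the diff hunks, and the full-queue handling is elided):

auto get_sqe(io_base* op) -> ::io_uring_sqe* {
    ::io_uring_sqe* sqe = ::io_uring_get_sqe(&ring);
    ::io_uring_sqe_set_data(sqe, op);
    ++outstanding; // all added work; decremented when its cqe is processed
    ++submitting;  // unsubmitted work; decremented by io_uring_submit()
    return sqe;
}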

Author


I wonder if this logic is guaranteed to be accurate!

i am expecting this return value to be accurate because our io_uring_queue_init() call is not passing the IORING_SETUP_SQPOLL flag documented in io_uring_setup()

My inclination would be to update outstanding every time work gets added to the context and have it be independent of whether the work still needs to be submitted.

thanks, this would make it easier to experiment with SQPOLL in the future - done

Comment on lines 130 to 131
submitting -= r;
outstanding += r;
Member


As mentioned previously, I think I wouldn't tie submitting and outstanding: I don't see a reason why outstanding can't just track how much outstanding work there is, independent of whether it was submitted.

// read the next completion, waiting if necessary
auto [res, completion] = wait();

if (completion) {
Member


I believe all operations are submitted with a completion set (in get_sqe; peeking ahead I see that cancel [currently] uses a nullptr but I don't think that is right), i.e., this check shouldn't be needed. If this check is needed, there may have been nothing completed but the function still returns 1u.

Comment on lines 156 to 163
// use io_uring_prep_cancel() for asynchronous cancellation of op.
// cancel_op, aka sender_state::cancel_callback, lives inside of op's
// operation state. op's completion may race with this cancellation,
// causing that sender_state and its cancel_callback to be destroyed.
// so we can't pass cancel_op to io_uring_sqe_set_data() and attach a
// cancel_op->work() function to handle its completion in run_one().
// instead, we just complete it here without waiting for the result
cancel_op->complete();
Member


I don't think this is correct! Yes, cancellation and completion are racy! However, the senders should be (need to be) implemented such that the number of outstanding completions is accurately tracked! That is, if an operation gets cancelled, its count of outstanding completions needs to be incremented first, and only then does any actual cancellation get submitted. Only once the count is incremented is the cancellation state created. I believe this is the relevant code: if outstanding was incremented beyond 1u, the operation wasn't completed, yet, and with outstanding being 2u a non-cancellation completion won't complete the operation.

If the logic I tried to outline above doesn't work as described, it is a bug which needs fixing (this wouldn't be part of this pull request: it would be a separate issue).

Looking at the current implementation I'm not entirely happy with the behaviour: if cancellation was triggered but there was a successful completion before the cancellation completes (which is a possibility in particular with io_uring but can't happen with poll, epoll, or kqueue, I think), the ultimate result is currently set_stopped(...) although it could be set_value(...).

Author


Only once the count is incremented is the cancellation state created. I believe this is the relevant code: if outstanding was incremented beyond 1u, the operation wasn't completed, yet, and with outstanding being 2u a non-cancellation completion won't complete the operation.

thanks for the context! i added a squash commit that restores the asynchronous completion of uring_context::cancel(), and was able to reproduce the original crash

i believe the problem is that sender_state::complete() and error() both destroy this cancel_callback even if a cancellation is pending:

auto complete() -> void override final {
    d_callback.reset();
    if (0 == --this->d_outstanding) {
        this->d_data.set_value(*this, ::std::move(this->d_receiver));

the crash no longer reproduces if i move the d_callback.reset() line inside that if-block. does that look right?
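i.e. something like this (a sketch, assuming the snippet above is otherwise unchanged):

auto complete() -> void override final {
    if (0 == --this->d_outstanding) {
        // with asynchronous cancellation the cancel_callback must stay
        // alive until the cancel's cqe arrives, so only destroy it once
        // no completions remain outstanding
        d_callback.reset();
        this->d_data.set_value(*this, ::std::move(this->d_receiver));
    }
}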

If the logic I tried to outline above doesn't work as described, it is a bug which needs fixing (this wouldn't be part of this pull request: it would be a separate issue).

i added this change as an additional commit to this pr, but i'm happy to raise a separate pr too if you like. similar to the demo::task fix, it only applies to async cancellation which is specific to uring_context

Comment on lines +182 to +204
op->work = [](context_base& ctx, io_base* io) {
    auto res = *static_cast<int*>(io->extra.get());
    if (res == -ECANCELED) {
        io->cancel();
        return submit_result::ready;
    } else if (res < 0) {
        io->error(::std::error_code(-res, ::std::system_category()));
        return submit_result::error;
    }
    auto op = static_cast<accept_operation*>(io);
    // set socket
    ::std::get<2>(*op) = ctx.make_socket(res);
    io->complete();
    return submit_result::ready;
};

auto sqe = get_sqe(op);
auto fd = native_handle(op->id);
auto addr = ::std::get<0>(*op).data();
auto addrlen = &::std::get<1>(*op);
int flags = 0;
::io_uring_prep_accept(sqe, fd, addr, addrlen, flags);
return submit_result::submit;
Member


It looks as if this code is stamped out a few times below (I haven't verified if the blocks are actually mostly identical). I think I would factor most of the logic into a function which may need to take two [lambda] functions (one wrapping the "prep" function and one producing the result) as arguments. Also, I probably wouldn't create variables out of the various op accesses but just pass them to the "prep" function. I do admit that it helps with readability spelling out what the various members are, though.

Author


Also, I probably wouldn't create variables out of the various op accesses but just pass them to the "prep" function. I do admit that it helps with readability spelling out what the various members are, though.

i do like to give names to function arguments. while it uses some extra vertical space, each of these functions still fits nicely on a single page in the editor 🤷

It looks as if this code is stamped out a few times below (I haven't verified if the blocks are actually mostly identical).

each of the completion lambdas differ slightly in how they map error codes and store return values, if any. each of the prepare blocks differ significantly in the io_uring_prep_* function and arguments

I think I would factor most of the logic into a function which may need to take two [lambda] functions (one wrapping the "prep" function and one producing the result) as arguments.

i don't really see the benefit of a helper function that combines the two, assuming it just looks something like this:

auto prepare_sqe(io_base* op, auto prepare, auto complete) -> submit_result {
    op->work = std::move(complete);
    return prepare(op);
}
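e.g. the accept operation above would then read something like this (a sketch of the hypothetical helper's call site, not code from this pr):

return prepare_sqe(op,
    [this](io_base* io) {
        auto op = static_cast<accept_operation*>(io);
        auto sqe = get_sqe(op);
        ::io_uring_prep_accept(sqe, native_handle(op->id),
                               ::std::get<0>(*op).data(),
                               &::std::get<1>(*op), 0);
        return submit_result::submit;
    },
    [](context_base& ctx, io_base* io) {
        // ...the same completion lambda as in the accept block above...
        return submit_result::ready;
    });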

'outstanding' tracks all pending completions, not just the submitted
ones

treat cancel() as a normal async operation whose completion is triggered
by its corresponding cqe. remove the special case for this null
completion in run_one()

if the cancel_callback() has incremented d_outstanding such that
sender_state::complete() or error() cannot complete immediately, avoid
destroying the cancel_callback with d_callback.reset(). where
cancellation is asynchronous, that memory needs to remain intact until
the cancellation completes

Signed-off-by: Casey Bodley <[email protected]>
@cbodley
Author

cbodley commented Jan 5, 2026

I do want to get an io_uring implementation. It seems the current version is reasonably close to what I think is needed, i.e., I think I could land this pull request if that's the preference (I would probably apply some of the changes I outlined).

@dietmarkuehl thanks very much for the review. i see no reason to rush the merge - as long as you're willing to give feedback, i'm happy to iterate on the design and implementation
