Conversation

@kernel-patches-daemon-bpf

Pull request for series with
subject: mm: BPF OOM
version: 2
url: https://patchwork.kernel.org/project/netdevbpf/list/?series=1016490

@kernel-patches-daemon-bpf
Author

Upstream branch: ff88079
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1016490
version: 2

@kernel-patches-review-bot

commit 264db34cd79e7c07b4b1e58e2efae3d632df9cb2
Author: Roman Gushchin <[email protected]>

sched: psi: implement bpf_psi struct ops

This patch implements a BPF struct ops-based mechanism to create PSI
triggers, attach them to cgroups or system-wide, and handle PSI events
in BPF.

> diff --git a/kernel/sched/bpf_psi.c b/kernel/sched/bpf_psi.c
> new file mode 100644
> index 000000000..c383a2011
> --- /dev/null
> +++ b/kernel/sched/bpf_psi.c

[ ... ]

> +void bpf_psi_cgroup_offline(struct cgroup *cgroup)
> +{
> +	struct psi_group *group = cgroup->psi;
> +	u64 cgrp_id = cgroup_id(cgroup);
> +	struct psi_trigger *t, *p;
> +	struct bpf_psi *bpf_psi;
> +	LIST_HEAD(to_destroy);
> +	int idx;
> +
> +	if (!group)
> +		return;
> +
> +	spin_lock(&group->bpf_triggers_lock);
> +	list_for_each_entry_safe(t, p, &group->bpf_triggers, bpf_group_node) {
> +		if (!t->pinned) {
> +			t->pinned = true;
> +			list_move(&t->bpf_group_node, &to_destroy);
> +		}
> +	}
> +	spin_unlock(&group->bpf_triggers_lock);
> +
> +	list_for_each_entry_safe(t, p, &to_destroy, bpf_group_node) {
> +		bpf_psi = READ_ONCE(t->bpf_psi);
> +
> +		idx = srcu_read_lock(&bpf_psi->srcu);
> +		if (bpf_psi->ops->handle_cgroup_offline)
> +			bpf_psi->ops->handle_cgroup_offline(bpf_psi, cgrp_id);
> +		srcu_read_unlock(&bpf_psi->srcu, idx);
> +
> +		spin_lock(&bpf_psi->lock);
> +		list_del(&t->bpf_psi_node);
> +		spin_unlock(&bpf_psi->lock);
> +
> +		WRITE_ONCE(t->bpf_psi, NULL);
> +		flush_workqueue(bpf_psi_wq);
> +		synchronize_srcu(&bpf_psi->srcu);
> +		psi_trigger_destroy(t);
> +	}
> +}

Can bpf_psi_cgroup_offline() race with bpf_psi_ops_unreg() and access
freed memory?

Consider this sequence:

Thread A (bpf_psi_cgroup_offline):
  - Pins triggers under group->bpf_triggers_lock
  - Reads bpf_psi = READ_ONCE(t->bpf_psi) outside any lock

Thread B (bpf_psi_ops_unreg):
  - Iterates bpf_psi->triggers, skips pinned triggers
  - Calls synchronize_srcu(&bpf_psi->srcu)
  - Calls kfree(bpf_psi)

Thread A:
  - Tries srcu_read_lock(&bpf_psi->srcu) on freed memory

The pinning mechanism prevents Thread B from NULLing t->bpf_psi when
Thread A has pinned the trigger. However, Thread B unconditionally
frees the bpf_psi object itself even when it didn't process any
triggers because they were all pinned by Thread A.

After Thread A pins all triggers and releases group->bpf_triggers_lock,
Thread B can complete bpf_psi_ops_unreg(), call synchronize_srcu() (no
readers yet), and free bpf_psi. Then Thread A accesses the freed
bpf_psi via srcu_read_lock() and spin_lock().

> +static void bpf_psi_ops_unreg(void *kdata, struct bpf_link *link)
> +{
> +	struct bpf_psi_ops *ops = kdata;
> +	struct bpf_psi *bpf_psi = ops->bpf_psi;
> +	struct psi_trigger *t, *p;
> +	LIST_HEAD(to_destroy);
> +
> +	spin_lock(&bpf_psi->lock);
> +	list_for_each_entry_safe(t, p, &bpf_psi->triggers, bpf_psi_node) {
> +		spin_lock(&t->group->bpf_triggers_lock);
> +		if (!t->pinned) {
> +			t->pinned = true;
> +			list_move(&t->bpf_group_node, &to_destroy);
> +			list_del(&t->bpf_psi_node);
> +
> +			WRITE_ONCE(t->bpf_psi, NULL);
> +		}
> +		spin_unlock(&t->group->bpf_triggers_lock);
> +	}
> +	spin_unlock(&bpf_psi->lock);
> +
> +	flush_workqueue(bpf_psi_wq);
> +	synchronize_srcu(&bpf_psi->srcu);
> +
> +	list_for_each_entry_safe(t, p, &to_destroy, bpf_group_node)
> +		psi_trigger_destroy(t);
> +
> +	if (!list_empty(&bpf_psi->node)) {
> +		mutex_lock(&bpf_psi_lock);
> +		list_del(&bpf_psi->node);
> +		mutex_unlock(&bpf_psi_lock);
> +		static_branch_dec(&bpf_psi_notify_key);
> +	}
> +
> +	cleanup_srcu_struct(&bpf_psi->srcu);
> +	kfree(bpf_psi);
> +}

When bpf_psi_ops_unreg() skips all triggers because they're pinned
(by bpf_psi_cgroup_offline()), does it still free bpf_psi too early?

The to_destroy list will be empty if all triggers were pinned, but
the function still proceeds to cleanup_srcu_struct() and kfree(bpf_psi)
while bpf_psi_cgroup_offline() may be using that bpf_psi pointer.
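
If this turns out to be a real race, one possible direction (purely illustrative; it assumes a hypothetical refcount_t member added to struct bpf_psi) is to make the offline path take a reference for each trigger it pins, and have both paths free the object only when the count drops to zero:

    /* Illustrative sketch only: "usage" and bpf_psi_free() are hypothetical. */

    /* bpf_psi_cgroup_offline(), while pinning under bpf_triggers_lock: */
    if (!t->pinned) {
        t->pinned = true;
        refcount_inc(&t->bpf_psi->usage);   /* keep bpf_psi alive for this path */
        list_move(&t->bpf_group_node, &to_destroy);
    }

    /* both bpf_psi_cgroup_offline() and bpf_psi_ops_unreg(), instead of kfree(): */
    if (refcount_dec_and_test(&bpf_psi->usage))
        bpf_psi_free(bpf_psi);   /* cleanup_srcu_struct() + kfree() */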



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: sched: psi: implement bpf_psi struct ops
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/18859027430

@kernel-patches-review-bot

commit b58d54ed56054e3a550e02c324982feb20a2671e
Author: Roman Gushchin <[email protected]>

libbpf: introduce bpf_map__attach_struct_ops_opts()

This commit introduces a new API function that extends the existing
bpf_map__attach_struct_ops() to accept additional options, specifically
allowing attachment of struct_ops maps to cgroups via a relative_fd.

> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index b90574f39..be56a5dee 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -13196,12 +13196,19 @@ static int bpf_link__detach_struct_ops(struct bpf_link *link)
>  	return close(link->fd);
>  }
>
> -struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
> +struct bpf_link *bpf_map__attach_struct_ops_opts(const struct bpf_map *map,
> +						 const struct bpf_struct_ops_opts *opts)
>  {
> +	DECLARE_LIBBPF_OPTS(bpf_link_create_opts, link_opts);
>  	struct bpf_link_struct_ops *link;
>  	__u32 zero = 0;
>  	int err, fd;
>
> +	if (!OPTS_VALID(opts, bpf_struct_ops_opts)) {
> +		pr_warn("map '%s': invalid opts\n", map->name);
> +		return libbpf_err_ptr(-EINVAL);
> +	}
> +
>  	if (!bpf_map__is_struct_ops(map)) {
>  		pr_warn("map '%s': can't attach non-struct_ops map\n", map->name);
>  		return libbpf_err_ptr(-EINVAL);

[ ... ]

>  	link = calloc(1, sizeof(*link));
>  	if (!link)
>  		return libbpf_err_ptr(-EINVAL);

Should this return -ENOMEM instead of -EINVAL? All other calloc failures
in libbpf.c return -ENOMEM (see lines 11006, 11802, 12288, 12746, 12829,
12891, 13089, 13129 in the same file).
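
A minimal fix might look like (illustrative):

    link = calloc(1, sizeof(*link));
    if (!link)
        return libbpf_err_ptr(-ENOMEM);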

[ ... ]

> -	fd = bpf_link_create(map->fd, 0, BPF_STRUCT_OPS, NULL);
> +	link_opts.cgroup.relative_fd = OPTS_GET(opts, relative_fd, 0);
> +
> +	fd = bpf_link_create(map->fd, 0, BPF_STRUCT_OPS, &link_opts);

Is the expected_revision field being propagated correctly? The new
bpf_struct_ops_opts structure includes an expected_revision field
(see libbpf.h), and bpf_link_create() with BPF_STRUCT_OPS reads
opts->cgroup.expected_revision (see bpf.c line 889), but this code only
sets relative_fd in link_opts. Should it also include:

    link_opts.cgroup.expected_revision = OPTS_GET(opts, expected_revision, 0);

> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index 5118d0a90..dc8489871 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -922,6 +922,20 @@ bpf_program__attach_cgroup_opts(const struct bpf_program *prog, int cgroup_fd,
>  struct bpf_map;
>
>  LIBBPF_API struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map);
> +
> +struct bpf_struct_ops_opts {
> +	/* size of this struct, for forward/backward compatibility */
> +	size_t sz;
> +	__u32 flags;
> +	__u32 relative_fd;
> +	__u64 expected_revision;
> +	size_t :0;
> +};
> +#define bpf_struct_ops_opts__last_field expected_revision
> +
> +LIBBPF_API struct bpf_link *
> +bpf_map__attach_struct_ops_opts(const struct bpf_map *map,
> +				const struct bpf_struct_ops_opts *opts);

[ ... ]



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: libbpf: introduce bpf_map__attach_struct_ops_opts()
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/18859027430

@kernel-patches-review-bot

commit 96abfe0549713749f00e32bff919a759db284a58
Author: Roman Gushchin <[email protected]>

mm: introduce BPF kfuncs to access memcg statistics and events

This patch adds four new BPF kfuncs for accessing memory cgroup statistics
and events, making them available to all BPF program types for implementing
OOM policies and accelerating memcg data access.

> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 39a6c7c87..b9e08dddd 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h

[ ... ]

> diff --git a/mm/bpf_memcontrol.c b/mm/bpf_memcontrol.c
> index 76c342318..387255b8a 100644
> --- a/mm/bpf_memcontrol.c
> +++ b/mm/bpf_memcontrol.c

[ ... ]

> @@ -75,6 +75,56 @@ __bpf_kfunc void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
>  	css_put(&memcg->css);
>  }
>
> +/**
> + * bpf_mem_cgroup_vm_events - Read memory cgroup's vm event counter
> + * @memcg: memory cgroup
> + * @event: event id
> + *
> + * Allows to read memory cgroup event counters.
> + */
> +__bpf_kfunc unsigned long bpf_mem_cgroup_vm_events(struct mem_cgroup *memcg,
> +						enum vm_event_item event)
> +{
> +	return memcg_events(memcg, event);
> +}
> +
> +/**
> + * bpf_mem_cgroup_usage - Read memory cgroup's usage
> + * @memcg: memory cgroup
> + *
> + * Returns current memory cgroup size in bytes.
> + */
> +__bpf_kfunc unsigned long bpf_mem_cgroup_usage(struct mem_cgroup *memcg)
> +{
> +	return page_counter_read(&memcg->memory);
> +}
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The documentation says this returns bytes, but page_counter_read() returns
pages. In mm/memcontrol-v1.c, the existing mem_cgroup_usage() function's
return value is multiplied by PAGE_SIZE to convert to bytes:

    return (u64)mem_cgroup_usage(memcg, false) * PAGE_SIZE;

Should this function multiply the result by PAGE_SIZE to match its
documentation, or should the documentation be updated to say it returns
pages?
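
If the documented bytes semantics is intended, a sketch of that variant (illustrative) would be:

    __bpf_kfunc unsigned long bpf_mem_cgroup_usage(struct mem_cgroup *memcg)
    {
        /* page_counter_read() returns pages; convert to bytes */
        return (unsigned long)page_counter_read(&memcg->memory) * PAGE_SIZE;
    }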

> +
> +/**
> + * bpf_mem_cgroup_page_state - Read memory cgroup's page state counter
> + * @memcg: memory cgroup
> + * @idx: counter idx
> + *
> + * Allows to read memory cgroup statistics. The output is in bytes.
> + */
> +__bpf_kfunc unsigned long bpf_mem_cgroup_page_state(struct mem_cgroup *memcg, int idx)
> +{
> +	if (idx < 0 || idx >= MEMCG_NR_STAT)
> +		return (unsigned long)-1;
> +
> +	return memcg_page_state_output(memcg, idx);
> +}

[ ... ]



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: mm: introduce BPF kfuncs to access memcg statistics and events
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/18859027430

@kernel-patches-review-bot

commit 31d238f36a0a6a56efa48c847cb776e67ba2f079
Author: Roman Gushchin <[email protected]>

bpf: initial support for attaching struct ops to cgroups

When a struct ops is being attached and a bpf link is created, allow to
pass a cgroup fd using bpf attr, so that struct ops can be attached to a
cgroup instead of globally. Attached struct ops doesn't hold a reference
to the cgroup, only preserves cgroup id.

> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index eae907218..7205b813e 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1849,6 +1849,7 @@ struct bpf_struct_ops_link {
>  	struct bpf_link link;
>  	struct bpf_map __rcu *map;
>  	wait_queue_head_t wait_hup;
> +	u64 cgroup_id;
>  };
>
>  struct bpf_link_primer {
> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index 45cc5ee19..58664779a 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -13,6 +13,7 @@
>  #include <linux/btf_ids.h>
>  #include <linux/rcupdate_wait.h>
>  #include <linux/poll.h>
> +#include <linux/cgroup.h>
>
>  struct bpf_struct_ops_value {
>  	struct bpf_struct_ops_common_value common;

[ ... ]

> @@ -1359,6 +1360,18 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
>  	}
>  	bpf_link_init(&link->link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops, NULL,
>  		      attr->link_create.attach_type);
> +#ifdef CONFIG_CGROUPS
> +	if (attr->link_create.cgroup.relative_fd) {
> +		struct cgroup *cgrp;
> +
> +		cgrp = cgroup_get_from_fd(attr->link_create.cgroup.relative_fd);
> +		if (IS_ERR(cgrp))
> +			return PTR_ERR(cgrp);
                        ^^^^^^^^^^^^^^^^^^^^^^

Can this leak the link and map? In bpf_struct_ops_link_create(), the link
is allocated with kzalloc(GFP_USER) and map is obtained with bpf_map_get()
before this point. If cgroup_get_from_fd() fails, the early return skips
the err_out label which does kfree(link) and bpf_map_put(map).

All other error paths in this function use "err = <error>; goto err_out;"
but this one returns directly.
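
A fix following that convention might look like (illustrative):

    cgrp = cgroup_get_from_fd(attr->link_create.cgroup.relative_fd);
    if (IS_ERR(cgrp)) {
        err = PTR_ERR(cgrp);
        goto err_out;
    }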

> +
> +		link->cgroup_id = cgroup_id(cgrp);
> +		cgroup_put(cgrp);
> +	}
> +#endif /* CONFIG_CGROUPS */
> +
>  	err = bpf_link_prime(&link->link, &link_primer);
>  	if (err)
>  		goto err_out;



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: bpf: initial support for attaching struct ops to cgroups
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/18859027430

@kernel-patches-review-bot

commit ec8e2e1769704b4f69261f36e6aaf035d47c5f0f
Author: Roman Gushchin <[email protected]>

mm: introduce BPF kfuncs to deal with memcg pointers

This commit introduces two new BPF kfuncs for converting css pointers
to memcg pointers with proper reference counting: bpf_get_mem_cgroup()
and bpf_put_mem_cgroup().

> diff --git a/mm/Makefile b/mm/Makefile
> index 051e88c69..2d8f9beb3 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -107,6 +107,7 @@ obj-$(CONFIG_MEMCG) += swap_cgroup.o
>  endif
>  ifdef CONFIG_BPF_SYSCALL
>  obj-y += bpf_oom.o
> +obj-$(CONFIG_MEMCG) += bpf_memcontrol.o
>  endif
>  obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
>  obj-$(CONFIG_GUP_TEST) += gup_test.o
> diff --git a/mm/bpf_memcontrol.c b/mm/bpf_memcontrol.c
> new file mode 100644
> index 000000000..1e4609774
> --- /dev/null
> +++ b/mm/bpf_memcontrol.c
> @@ -0,0 +1,88 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Memory Controller-related BPF kfuncs and auxiliary code
> + *
> + * Author: Roman Gushchin <[email protected]>
> + */
> +
> +#include <linux/memcontrol.h>
> +#include <linux/bpf.h>
> +
> +__bpf_kfunc_start_defs();
> +
> +/**
> + * bpf_get_mem_cgroup - Get a reference to a memory cgroup
> + * @css: pointer to the css structure
> + *
> + * Returns a pointer to a mem_cgroup structure after bumping
> + * the corresponding css's reference counter.
> + *
> + * It's fine to pass a css which belongs to any cgroup controller,
> + * e.g. unified hierarchy's main css.
> + *
> + * Implements KF_ACQUIRE semantics.
> + */
> +__bpf_kfunc struct mem_cgroup *
> +bpf_get_mem_cgroup(struct cgroup_subsys_state *css)
> +{
> +	struct mem_cgroup *memcg = NULL;
> +	bool rcu_unlock = false;
> +
> +	if (!root_mem_cgroup)
> +		return NULL;
> +
> +	if (root_mem_cgroup->css.ss != css->ss) {
                                            ^^^^^^^

Can this dereference a NULL css pointer?  The function checks css for
NULL at line 42 with "if (css && css_tryget(css))", which suggests css
can be NULL.  If a BPF program calls this function with a NULL css,
wouldn't it crash here before reaching that check?
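
If a NULL css can indeed reach this kfunc, an early check along these lines (illustrative) would avoid the dereference:

    if (!css || !root_mem_cgroup)
        return NULL;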

> +		struct cgroup *cgroup = css->cgroup;
> +		int ssid = root_mem_cgroup->css.ss->id;
> +
> +		rcu_read_lock();
> +		rcu_unlock = true;
> +		css = rcu_dereference_raw(cgroup->subsys[ssid]);
> +	}
> +
> +	if (css && css_tryget(css))
> +		memcg = container_of(css, struct mem_cgroup, css);
> +
> +	if (rcu_unlock)
> +		rcu_read_unlock();
> +
> +	return memcg;
> +}

[ ... ]

> +__bpf_kfunc_end_defs();
> +
> +BTF_KFUNCS_START(bpf_memcontrol_kfuncs)
> +BTF_ID_FLAGS(func, bpf_get_mem_cgroup, KF_ACQUIRE | KF_RET_NULL | KF_RCU)
> +BTF_ID_FLAGS(func, bpf_put_mem_cgroup, KF_RELEASE)
> +
> +BTF_KFUNCS_END(bpf_memcontrol_kfuncs)

[ ... ]



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: mm: introduce BPF kfuncs to deal with memcg pointers
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/18859027430

@kernel-patches-review-bot

commit 712137918bba1b0a9c7c670bab9200b0f0d6a7e8
Author: Roman Gushchin <[email protected]>

bpf: selftests: introduce read_cgroup_file() helper

This patch adds a new helper function read_cgroup_file() to read from
cgroup control files in BPF selftests. The function opens a file,
reads into a buffer, and returns the result.

> diff --git a/tools/testing/selftests/bpf/cgroup_helpers.c b/tools/testing/selftests/bpf/cgroup_helpers.c
> index 20cede4db..8fb02fe4c 100644
> --- a/tools/testing/selftests/bpf/cgroup_helpers.c
> +++ b/tools/testing/selftests/bpf/cgroup_helpers.c
> @@ -126,6 +126,45 @@ int enable_controllers(const char *relative_path, const char *controllers)
>  	return __enable_controllers(cgroup_path, controllers);
>  }
>
> +static size_t __read_cgroup_file(const char *cgroup_path, const char *file,
> +				 char *buf, size_t size)
> +{
> +	char file_path[PATH_MAX + 1];
> +	size_t ret;
> +	int fd;
> +
> +	snprintf(file_path, sizeof(file_path), "%s/%s", cgroup_path, file);
> +	fd = open(file_path, O_RDONLY);
> +	if (fd < 0) {
> +		log_err("Opening %s", file_path);
> +		return -1;

Can __read_cgroup_file() return -1 when its return type is size_t?

The function signature declares size_t (unsigned), but returns -1 here.
When -1 is implicitly converted to size_t, it becomes SIZE_MAX
(18446744073709551615 on 64-bit systems). Callers checking for errors
with "if (ret < 0)" will fail because size_t is unsigned, and callers
checking "if (ret == -1)" will also fail after the cast.

> +	}
> +
> +	ret = read(fd, buf, size);
> +	close(fd);
> +	return ret;

Does this handle read() errors correctly?

The read() syscall returns ssize_t which can be negative on error.
Assigning this to size_t (line 143) converts negative error codes into
large positive numbers. For example, read() returning -1 becomes SIZE_MAX
when stored in ret.

Should the return type be ssize_t instead of size_t to properly propagate
both success (byte count) and error (negative values) to callers?
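
An ssize_t-based variant (illustrative) that propagates both open() and read()
failures to callers could look like:

    static ssize_t __read_cgroup_file(const char *cgroup_path, const char *file,
                                      char *buf, size_t size)
    {
        char file_path[PATH_MAX + 1];
        ssize_t ret;
        int fd;

        snprintf(file_path, sizeof(file_path), "%s/%s", cgroup_path, file);
        fd = open(file_path, O_RDONLY);
        if (fd < 0) {
            log_err("Opening %s", file_path);
            return -1;
        }

        ret = read(fd, buf, size);   /* may be negative; propagated as-is */
        close(fd);
        return ret;
    }

read_cgroup_file() and its prototype in cgroup_helpers.h would need the same
ssize_t return type.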

> +}
> +
> +/**
> + * read_cgroup_file() - Read to a cgroup file
> + * @relative_path: The cgroup path, relative to the workdir
> + * @file: The name of the file in cgroupfs to read to
> + * @buf: Buffer to read from the file
> + * @size: Size of the buffer
> + *
> + * Read to a file in the given cgroup's directory.
> + *
> + * If successful, the number of read bytes is returned.
> + */
> +size_t read_cgroup_file(const char *relative_path, const char *file,
> +			char *buf, size_t size)
> +{
> +	char cgroup_path[PATH_MAX - 24];
> +
> +	format_cgroup_path(cgroup_path, relative_path);
> +	return __read_cgroup_file(cgroup_path, file, buf, size);
> +}
> +

[ ... ]

> diff --git a/tools/testing/selftests/bpf/cgroup_helpers.h b/tools/testing/selftests/bpf/cgroup_helpers.h
> index 3857304be..9f9bb6b5d 100644
> --- a/tools/testing/selftests/bpf/cgroup_helpers.h
> +++ b/tools/testing/selftests/bpf/cgroup_helpers.h
> @@ -11,6 +11,8 @@
>
>  /* cgroupv2 related */
>  int enable_controllers(const char *relative_path, const char *controllers);
> +size_t read_cgroup_file(const char *relative_path, const char *file,
> +			char *buf, size_t size);
>  int write_cgroup_file(const char *relative_path, const char *file,
>  		      const char *buf);

[ ... ]



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: bpf: selftests: introduce read_cgroup_file() helper
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/18859027430

@kernel-patches-review-bot

commit 70a4d534559f5708a811241190cd158e83451a42
Author: Roman Gushchin <[email protected]>

bpf: selftests: PSI struct ops test

This patch adds a BPF selftest for PSI struct ops functionality,
including cgroup lifecycle callbacks, PSI trigger creation, and
memory pressure event handling.

> diff --git a/tools/testing/selftests/bpf/prog_tests/test_psi.c b/tools/testing/selftests/bpf/prog_tests/test_psi.c
> new file mode 100644
> index 000000000..b294cea0a
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/test_psi.c

[ ... ]

> diff --git a/tools/testing/selftests/bpf/progs/test_psi.c b/tools/testing/selftests/bpf/progs/test_psi.c
> new file mode 100644
> index 000000000..4ddec7ec3
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/test_psi.c
> @@ -0,0 +1,82 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +#include "vmlinux.h"
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +
> +char _license[] SEC("license") = "GPL";
> +
> +#define PSI_FULL 0x80000000
> +
> +/* cgroup which will experience the high memory pressure */
> +u64 high_pressure_cgroup_id;
> +
> +/* cgroup which will be deleted */
> +u64 deleted_cgroup_id;
> +
> +/* cgroup which will be created */
> +u64 new_cgroup_id;
> +
> +/* cgroup which was deleted */
> +u64 deleted_cgroup_id;
>    ^^^^^^^^^^^^^^^^^^

Is deleted_cgroup_id intentionally declared twice here? This appears
to be a duplicate global variable declaration - the same variable is
declared at line 13 with comment "cgroup which will be deleted" and
again at line 19 with comment "cgroup which was deleted".

[ ... ]



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: bpf: selftests: PSI struct ops test
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/18859027430

@kernel-patches-review-bot

commit 5cd8d912032d2c2105c04660338cf4ad47d24697
Author: Roman Gushchin <[email protected]>

mm: allow specifying custom oom constraint for BPF triggers

This patch adds a new OOM constraint type CONSTRAINT_BPF and allows
BPF programs to specify a custom constraint name when calling
bpf_out_of_memory().

> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 3cbdcd013..704fc0e78 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -19,6 +19,7 @@ enum oom_constraint {
>  	CONSTRAINT_CPUSET,
>  	CONSTRAINT_MEMORY_POLICY,
>  	CONSTRAINT_MEMCG,
> +	CONSTRAINT_BPF,
>  };
>
>  enum bpf_oom_flags {

[ ... ]

> @@ -63,6 +64,9 @@ struct oom_control {
>
>  	/* Policy name */
>  	const char *bpf_policy_name;
> +
> +	/* BPF-specific constraint name */
> +	const char *bpf_constraint;
>  #endif
>  };

[ ... ]

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index d7fca4bf5..72a346261 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c

[ ... ]

> @@ -267,6 +281,9 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
>  	bool cpuset_limited = false;
>  	int nid;
>
> +	if (oc->constraint == CONSTRAINT_BPF)
> +		return CONSTRAINT_BPF;
> +
>  	if (is_memcg_oom(oc)) {
>  		oc->totalpages = mem_cgroup_get_max(oc->memcg) ?: 1;
>  		return CONSTRAINT_MEMCG;

[ ... ]

> @@ -1350,11 +1367,14 @@ __bpf_kfunc int bpf_oom_kill_process(struct oom_control *oc,
>   * Returns a negative value if an error occurred.
>   */
>  __bpf_kfunc int bpf_out_of_memory(struct mem_cgroup *memcg__nullable,
> -				  int order, u64 flags)
> +				  int order, u64 flags,
> +				  const char *constraint_text__nullable)
>  {
>  	struct oom_control oc = {
>  		.memcg = memcg__nullable,
>  		.order = order,
> +		.constraint = CONSTRAINT_BPF,
> +		.bpf_constraint = constraint_text__nullable,
>  	};
>  	int ret;

When CONSTRAINT_BPF is set in bpf_out_of_memory(), the early return in
constrained_alloc() prevents oc->totalpages from being initialized.  This
leaves totalpages at zero (from the designated initializer).

Later in the call chain out_of_memory()->select_bad_process()->
oom_evaluate_task()->oom_badness(), the code performs division by
totalpages at line 237:

    adj *= totalpages / 1000;

Can this cause a division by zero?  The path is reachable when a BPF
program calls bpf_out_of_memory() and either no BPF OOM handler is
registered or the handler fails to free memory, causing execution to fall
through to select_bad_process().
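
One illustrative way to avoid that is to initialize totalpages before the early
return, the same way the default path in constrained_alloc() does:

    if (oc->constraint == CONSTRAINT_BPF) {
        /* avoid a zero divisor in oom_badness() */
        oc->totalpages = totalram_pages() + total_swap_pages;
        return CONSTRAINT_BPF;
    }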



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: mm: allow specifying custom oom constraint for BPF triggers
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/18859027430

@kernel-patches-daemon-bpf
Author

Forwarding comment 3453808482 via email
In-Reply-To: [email protected]
Patch: https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/

@kernel-patches-daemon-bpf
Author

Forwarding comment 3453813193 via email
In-Reply-To: [email protected]
Patch: https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/

@kernel-patches-daemon-bpf
Author

Forwarding comment 3453815050 via email
In-Reply-To: [email protected]
Patch: https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/

@kernel-patches-daemon-bpf
Author

Forwarding comment 3453817751 via email
In-Reply-To: [email protected]
Patch: https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/

@kernel-patches-daemon-bpf
Author

Forwarding comment 3453819006 via email
In-Reply-To: [email protected]
Patch: https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/

@kernel-patches-daemon-bpf
Author

Forwarding comment 3453824885 via email
In-Reply-To: [email protected]
Patch: https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/

@kernel-patches-daemon-bpf
Author

Forwarding comment 3453825997 via email
In-Reply-To: [email protected]
Patch: https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/

@kernel-patches-daemon-bpf
Author

Forwarding comment 3453828296 via email
In-Reply-To: [email protected]
Patch: https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/

@kernel-patches-review-bot

commit 16c40fd36bbe32c63f44388b5613db55c6b27557
Author: Roman Gushchin <[email protected]>

mm: introduce BPF struct ops for OOM handling

This patch introduces a BPF struct ops mechanism for custom OOM handling
policies. It allows attaching bpf_oom_ops system-wide or per-memcg, with
handlers executed before the kernel OOM killer.

> diff --git a/mm/bpf_oom.c b/mm/bpf_oom.c
> new file mode 100644
> index 000000000..c4d09ed9d
> --- /dev/null
> +++ b/mm/bpf_oom.c

[ ... ]

> @@ -0,0 +1,272 @@

[ ... ]

> +static void bpf_oom_ops_unreg(void *kdata, struct bpf_link *link)
> +{
> +	struct bpf_struct_ops_link *ops_link = container_of(link, struct bpf_struct_ops_link, link);
> +	struct bpf_oom_ops **bpf_oom_ops_ptr = NULL;
> +	struct bpf_oom_ops *bpf_oom_ops = kdata;
> +	struct mem_cgroup *memcg = NULL;
> +
> +	if (IS_ENABLED(CONFIG_MEMCG) && ops_link->cgroup_id) {
> +		/* Detach from a memory cgroup? */
> +		memcg = mem_cgroup_get_from_ino(ops_link->cgroup_id);
> +		if (IS_ERR_OR_NULL(memcg))
> +			goto exit;
> +		bpf_oom_ops_ptr = bpf_oom_memcg_ops_ptr(memcg);
> +	} else {
> +		/* System-wide OOM handler */
> +		bpf_oom_ops_ptr = &system_bpf_oom;
> +	}
> +
> +	/* Hide bpf_oom_ops from new callers */
> +	if (!WARN_ON(READ_ONCE(*bpf_oom_ops_ptr) != bpf_oom_ops))
> +		WRITE_ONCE(*bpf_oom_ops_ptr, NULL);
> +
> +	mem_cgroup_put(memcg);
> +
> +exit:
> +	/* Release bpf_oom_ops after a srcu grace period */
> +	synchronize_srcu(&bpf_oom_srcu);
> +}
> +
> +#ifdef CONFIG_MEMCG
> +void bpf_oom_memcg_offline(struct mem_cgroup *memcg)
> +{
> +	struct bpf_oom_ops *bpf_oom_ops;
> +	struct bpf_oom_ctx exec_ctx;
> +	u64 cgrp_id;
> +	int idx;
> +
> +	/* All bpf_oom_ops structures are protected using bpf_oom_srcu */
> +	idx = srcu_read_lock(&bpf_oom_srcu);
> +
> +	bpf_oom_ops = READ_ONCE(memcg->bpf_oom);
> +	WRITE_ONCE(memcg->bpf_oom, NULL);

Can bpf_oom_ops_unreg() race with bpf_oom_memcg_offline()? If userspace
closes the BPF struct ops link FD while or after the cgroup is being
deleted, both functions could run concurrently on the same memcg:

  Thread 1 (unreg):                Thread 2 (offline):
  mem_cgroup_get_from_ino()
  [succeeds, cgroup exists]
                                   srcu_read_lock()
                                   READ_ONCE(memcg->bpf_oom) [gets ops]
                                   WRITE_ONCE(memcg->bpf_oom, NULL)
  READ_ONCE(*bpf_oom_ops_ptr)
  [sees NULL]
  WARN_ON() triggers

The WARN_ON in bpf_oom_ops_unreg() assumes *bpf_oom_ops_ptr matches
bpf_oom_ops, but bpf_oom_memcg_offline() can clear it first during
concurrent execution.
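
If concurrent clearing by the offline path is legitimate, the unreg side could
tolerate it instead of warning, e.g. with an atomic swap (illustrative):

    /* only clear the pointer if it still points at our ops */
    cmpxchg(bpf_oom_ops_ptr, bpf_oom_ops, NULL);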

> +
> +	if (bpf_oom_ops && bpf_oom_ops->handle_cgroup_offline) {
> +		cgrp_id = cgroup_id(memcg->css.cgroup);
> +		exec_ctx.cgroup_id = cgrp_id;
> +		bpf_oom_ops->handle_cgroup_offline(&exec_ctx, cgrp_id);
> +	}
> +
> +	srcu_read_unlock(&bpf_oom_srcu, idx);
> +}
> +#endif /* CONFIG_MEMCG */

[ ... ]



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: mm: introduce BPF struct ops for OOM handling
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/18859027430

@kernel-patches-review-bot

commit bce57b9b96e05731a14c7e10a364fdd9b733cda2
Author: Roman Gushchin <[email protected]>

mm: introduce bpf_out_of_memory() BPF kfunc

This patch introduces bpf_out_of_memory() BPF kfunc to allow BPF programs
to declare OOM events and trigger the kernel OOM handling mechanism. The
function accepts a nullable memcg pointer for system-wide or memcg-scoped
OOMs.

> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 721087952..3cbdcd013 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -21,6 +21,11 @@ enum oom_constraint {
>  	CONSTRAINT_MEMCG,
>  };
>
> +enum bpf_oom_flags {
> +	BPF_OOM_FLAGS_WAIT_ON_OOM_LOCK = 1 << 0,
> +	BPF_OOM_FLAGS_LAST = 1 << 1,
> +};
> +
>  /*
>   * Details of the page allocation that triggered the oom killer that are used to
>   * determine what should be killed.
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 3c86cd755..d7fca4bf5 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c

[ ... ]

> +/**
> + * bpf_out_of_memory - declare Out Of Memory state and invoke OOM killer
> + * @memcg__nullable: memcg or NULL for system-wide OOMs
> + * @order: order of page which wasn't allocated
> + * @flags: flags
> + * @constraint_text__nullable: custom constraint description for the OOM report
> + *
> + * Declares the Out Of Memory state and invokes the OOM killer.
> + *
> + * OOM handlers are synchronized using the oom_lock mutex. If wait_on_oom_lock
> + * is true, the function will wait on it. Otherwise it bails out with -EBUSY
> + * if oom_lock is contended.
> + *
> + * Generally it's advised to pass wait_on_oom_lock=false for global OOMs
> + * and wait_on_oom_lock=true for memcg-scoped OOMs.
> + *
> + * Returns 1 if the forward progress was achieved and some memory was freed.
> + * Returns a negative value if an error occurred.
> + */
> +__bpf_kfunc int bpf_out_of_memory(struct mem_cgroup *memcg__nullable,
> +				  int order, u64 flags)
> +{
> +	struct oom_control oc = {
> +		.memcg = memcg__nullable,
> +		.order = order,
> +	};

Can bpf_out_of_memory() work correctly for system-wide OOMs with
uninitialized gfp_mask?

Looking at out_of_memory():

    if (!(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
        return true;

With gfp_mask=0 and memcg=NULL (system-wide OOM), this evaluates to:
!(0 & __GFP_FS) && !false = true, causing an early return without
invoking the OOM killer. The comment in out_of_memory() explains this
check is about allocation context: "The OOM killer does not compensate
for IO-less reclaim."

All other callers of out_of_memory() initialize gfp_mask:
- mm/page_alloc.c: .gfp_mask = gfp_mask
- mm/memcontrol.c: .gfp_mask = gfp_mask

For memcg OOMs, the code works because is_memcg_oom(oc) is true,
bypassing the check. But for system-wide OOMs (NULL memcg), won't this
cause the function to return immediately without killing any process?
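
If system-wide BPF-declared OOMs are meant to proceed past that check,
initializing gfp_mask with a __GFP_FS-capable mask (illustrative) would do it:

    struct oom_control oc = {
        .memcg = memcg__nullable,
        .order = order,
        .gfp_mask = GFP_KERNEL,   /* includes __GFP_FS */
    };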

> +	int ret;
> +
> +	if (flags & ~(BPF_OOM_FLAGS_LAST - 1))
> +		return -EINVAL;
> +
> +	if (oc.order < 0 || oc.order > MAX_PAGE_ORDER)
> +		return -EINVAL;
> +
> +	if (flags & BPF_OOM_FLAGS_WAIT_ON_OOM_LOCK) {
> +		ret = mutex_lock_killable(&oom_lock);
> +		if (ret)
> +			return ret;
> +	} else if (!mutex_trylock(&oom_lock))
> +		return -EBUSY;
> +
> +	ret = out_of_memory(&oc);
> +
> +	mutex_unlock(&oom_lock);
> +	return ret;
> +}

[ ... ]



AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: mm: introduce bpf_out_of_memory() BPF kfunc
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/18859027430

@kernel-patches-daemon-bpf
Author

Forwarding comment 3453857259 via email
In-Reply-To: [email protected]
Patch: https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/

@kernel-patches-daemon-bpf
Author

Forwarding comment 3453857502 via email
In-Reply-To: [email protected]
Patch: https://patchwork.kernel.org/project/netdevbpf/patch/[email protected]/

rgushchin and others added 19 commits October 27, 2025 19:50
Move struct bpf_struct_ops_link's definition into bpf.h,
where other custom bpf links definitions are.

It's necessary to access its members from outside of generic
bpf_struct_ops implementation, which will be done by following
patches in the series.

Signed-off-by: Roman Gushchin <[email protected]>
When a struct ops is being attached and a bpf link is created,
allow to pass a cgroup fd using bpf attr, so that struct ops
can be attached to a cgroup instead of globally.

Attached struct ops doesn't hold a reference to the cgroup,
only preserves cgroup id.

Signed-off-by: Roman Gushchin <[email protected]>
Struct oom_control is used to describe the OOM context.
Its memcg field defines the scope of OOM: it's NULL for global
OOMs and a valid memcg pointer for memcg-scoped OOMs.
Teach bpf verifier to recognize it as trusted or NULL pointer.
It will provide the bpf OOM handler a trusted memcg pointer,
which for example is required for iterating the memcg's subtree.

Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Kumar Kartikeya Dwivedi <[email protected]>
mem_cgroup_get_from_ino() can be reused by the BPF OOM implementation,
but currently depends on CONFIG_SHRINKER_DEBUG. Remove this dependency.

Signed-off-by: Roman Gushchin <[email protected]>
To use memcg_page_state_output() in bpf_memcontrol.c move the
declaration from v1-specific memcontrol-v1.h to memcontrol.h.

Signed-off-by: Roman Gushchin <[email protected]>
Introduce a bpf struct ops for implementing custom OOM handling
policies.

It's possible to load one bpf_oom_ops for the system and one
bpf_oom_ops for every memory cgroup. In case of a memcg OOM, the
cgroup tree is traversed from the OOM'ing memcg up to the root and
corresponding BPF OOM handlers are executed until some memory is
freed. If no memory is freed, the kernel OOM killer is invoked.

The struct ops provides the bpf_handle_out_of_memory() callback,
which is expected to return 1 if it was able to free some memory and 0
otherwise. If 1 is returned, the kernel also checks the bpf_memory_freed
field of the oom_control structure, which is expected to be set by
kfuncs suitable for releasing memory. If both are set, OOM is
considered handled, otherwise the next OOM handler in the chain
(e.g. BPF OOM attached to the parent cgroup or the in-kernel OOM
killer) is executed.

The bpf_handle_out_of_memory() callback program is sleepable to enable
using iterators, e.g. cgroup iterators. The callback receives struct
oom_control as an argument, so it can determine the scope of the OOM
event: if this is a memcg-wide or system-wide OOM.

The callback is executed just before the kernel victim task selection
algorithm, so all heuristics and sysctls like panic on oom and
sysctl_oom_kill_allocating_task are respected.

BPF OOM struct ops provides the handle_cgroup_offline() callback
which is good for releasing struct ops if the corresponding cgroup
is gone.

The struct ops also has the name field, which allows to define a
custom name for the implemented policy. It's printed in the OOM report
in the oom_policy=<policy> format. "default" is printed if bpf is not
used or policy name is not specified.

[  112.696676] test_progs invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
               oom_policy=bpf_test_policy
[  112.698160] CPU: 1 UID: 0 PID: 660 Comm: test_progs Not tainted 6.16.0-00015-gf09eb0d6badc #102 PREEMPT(full)
[  112.698165] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014
[  112.698167] Call Trace:
[  112.698177]  <TASK>
[  112.698182]  dump_stack_lvl+0x4d/0x70
[  112.698192]  dump_header+0x59/0x1c6
[  112.698199]  oom_kill_process.cold+0x8/0xef
[  112.698206]  bpf_oom_kill_process+0x59/0xb0
[  112.698216]  bpf_prog_7ecad0f36a167fd7_test_out_of_memory+0x2be/0x313
[  112.698229]  bpf__bpf_oom_ops_handle_out_of_memory+0x47/0xaf
[  112.698236]  ? srso_alias_return_thunk+0x5/0xfbef5
[  112.698240]  bpf_handle_oom+0x11a/0x1e0
[  112.698250]  out_of_memory+0xab/0x5c0
[  112.698258]  mem_cgroup_out_of_memory+0xbc/0x110
[  112.698274]  try_charge_memcg+0x4b5/0x7e0
[  112.698288]  charge_memcg+0x2f/0xc0
[  112.698293]  __mem_cgroup_charge+0x30/0xc0
[  112.698299]  do_anonymous_page+0x40f/0xa50
[  112.698311]  __handle_mm_fault+0xbba/0x1140
[  112.698317]  ? srso_alias_return_thunk+0x5/0xfbef5
[  112.698335]  handle_mm_fault+0xe6/0x370
[  112.698343]  do_user_addr_fault+0x211/0x6a0
[  112.698354]  exc_page_fault+0x75/0x1d0
[  112.698363]  asm_exc_page_fault+0x26/0x30
[  112.698366] RIP: 0033:0x7fa97236db00

Signed-off-by: Roman Gushchin <[email protected]>
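
For reference, a minimal BPF-side skeleton of such a policy might look like the
following sketch (illustrative; the callback and field names come from the
description above, the program and map names are hypothetical, and the section
names follow common libbpf struct_ops conventions):

    SEC("struct_ops.s/handle_out_of_memory")
    int BPF_PROG(test_out_of_memory, struct oom_control *oc)
    {
        /* try to free memory, e.g. via bpf_oom_kill_process() */
        /* return 1 only if some memory was actually freed */
        return 0;
    }

    SEC(".struct_ops.link")
    struct bpf_oom_ops test_bpf_oom_ops = {
        .name = "bpf_test_policy",
        .handle_out_of_memory = (void *)test_out_of_memory,
    };
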
Introduce bpf_oom_kill_process() bpf kfunc, which is supposed
to be used by BPF OOM programs. It allows to kill a process
in exactly the same way the OOM killer does: using the OOM reaper,
bumping corresponding memcg and global statistics, respecting
memory.oom.group etc.

On success, it sets oom_control's bpf_memory_freed field to true,
enabling the bpf program to bypass the kernel OOM killer.

Signed-off-by: Roman Gushchin <[email protected]>
To effectively operate with memory cgroups in BPF there is a need
to convert css pointers to memcg pointers. A simple container_of
cast which is used in the kernel code can't be used in BPF because
from the verifier's point of view that's an out-of-bounds memory access.

Introduce helper get/put kfuncs which can be used to get
a refcounted memcg pointer from the css pointer:
  - bpf_get_mem_cgroup,
  - bpf_put_mem_cgroup.

bpf_get_mem_cgroup() can take both memcg's css and the corresponding
cgroup's "self" css. It allows it to be used with the existing cgroup
iterator which iterates over cgroup tree, not memcg tree.

Signed-off-by: Roman Gushchin <[email protected]>
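
A short usage sketch (illustrative; assumes a trusted css pointer is already
available in the program):

    struct mem_cgroup *memcg = bpf_get_mem_cgroup(css);

    if (memcg) {
        /* ... inspect the memory cgroup ... */
        bpf_put_mem_cgroup(memcg);
    }
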
Introduce a BPF kfunc to get a trusted pointer to the root memory
cgroup. It's very handy to traverse the full memcg tree, e.g.
for handling a system-wide OOM.

It's possible to obtain this pointer by traversing the memcg tree
up from any known memcg, but it's sub-optimal and makes BPF programs
more complex and less efficient.

bpf_get_root_mem_cgroup() has a KF_ACQUIRE | KF_RET_NULL semantics,
however in reality it's not necessary to bump the corresponding
reference counter - root memory cgroup is immortal, reference counting
is skipped, see css_get(). Once set, root_mem_cgroup is always a valid
memcg pointer. It's safe to call bpf_put_mem_cgroup() for the pointer
obtained with bpf_get_root_mem_cgroup(), it's effectively a no-op.

Signed-off-by: Roman Gushchin <[email protected]>
Introduce BPF kfuncs to conveniently access memcg data:
  - bpf_mem_cgroup_vm_events(),
  - bpf_mem_cgroup_usage(),
  - bpf_mem_cgroup_page_state(),
  - bpf_mem_cgroup_flush_stats().

These functions are useful for implementing BPF OOM policies, but
also can be used to accelerate access to the memcg data. Reading
it through cgroupfs is much more expensive, roughly 5x, mostly
because of the need to convert the data into the text and back.

JP Kobryn:
An experiment was setup to compare the performance of a program that
uses the traditional method of reading memory.stat vs a program using
the new kfuncs. The control program opens up the root memory.stat file
and for 1M iterations reads, converts the string values to numeric data,
then seeks back to the beginning. The experimental program sets up the
requisite libbpf objects and for 1M iterations invokes a bpf program
which uses the kfuncs to fetch all available stats for node_stat_item,
memcg_stat_item, and vm_event_item types.

The results showed a significant perf benefit on the experimental side,
outperforming the control side by a margin of 93%. In kernel mode,
elapsed time was reduced by 80%, while in user mode, over 99% of time
was saved.

control: elapsed time
real    0m38.318s
user    0m25.131s
sys     0m13.070s

experiment: elapsed time
real    0m2.789s
user    0m0.187s
sys     0m2.512s

control: perf data
33.43% a.out libc.so.6         [.] __vfscanf_internal
 6.88% a.out [kernel.kallsyms] [k] vsnprintf
 6.33% a.out libc.so.6         [.] _IO_fgets
 5.51% a.out [kernel.kallsyms] [k] format_decode
 4.31% a.out libc.so.6         [.] __GI_____strtoull_l_internal
 3.78% a.out [kernel.kallsyms] [k] string
 3.53% a.out [kernel.kallsyms] [k] number
 2.71% a.out libc.so.6         [.] _IO_sputbackc
 2.41% a.out [kernel.kallsyms] [k] strlen
 1.98% a.out a.out             [.] main
 1.70% a.out libc.so.6         [.] _IO_getline_info
 1.51% a.out libc.so.6         [.] __isoc99_sscanf
 1.47% a.out [kernel.kallsyms] [k] memory_stat_format
 1.47% a.out [kernel.kallsyms] [k] memcpy_orig
 1.41% a.out [kernel.kallsyms] [k] seq_buf_printf

experiment: perf data
10.55% memcgstat bpf_prog_..._query [k] bpf_prog_16aab2f19fa982a7_query
 6.90% memcgstat [kernel.kallsyms]  [k] memcg_page_state_output
 3.55% memcgstat [kernel.kallsyms]  [k] _raw_spin_lock
 3.12% memcgstat [kernel.kallsyms]  [k] memcg_events
 2.87% memcgstat [kernel.kallsyms]  [k] __memcg_slab_post_alloc_hook
 2.73% memcgstat [kernel.kallsyms]  [k] kmem_cache_free
 2.70% memcgstat [kernel.kallsyms]  [k] entry_SYSRETQ_unsafe_stack
 2.25% memcgstat [kernel.kallsyms]  [k] __memcg_slab_free_hook
 2.06% memcgstat [kernel.kallsyms]  [k] get_page_from_freelist

Signed-off-by: Roman Gushchin <[email protected]>
Co-developed-by: JP Kobryn <[email protected]>
Signed-off-by: JP Kobryn <[email protected]>
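
A hedged usage sketch of the stat-reading kfuncs whose signatures appear earlier
in this thread (illustrative; the chosen enum values are just examples):

    unsigned long usage, oom_kills, anon;

    usage = bpf_mem_cgroup_usage(memcg);
    oom_kills = bpf_mem_cgroup_vm_events(memcg, OOM_KILL);
    anon = bpf_mem_cgroup_page_state(memcg, NR_ANON_MAPPED);
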
Introduce BPF kfunc to access memory events, e.g.:
MEMCG_LOW, MEMCG_MAX, MEMCG_OOM, MEMCG_OOM_KILL etc.

Signed-off-by: JP Kobryn <[email protected]>
Add test coverage for the kfuncs that fetch memcg stats. Using some common
stats, test scenarios ensuring that the given stat increases by some
arbitrary amount. The stats selected cover the three categories represented
by the enums: node_stat_item, memcg_stat_item, vm_event_item.

Since only a subset of all stats are queried, use a static struct made up
of fields for each stat. Write to the struct with the fetched values when
the bpf program is invoked and read the fields in the user mode program for
verification.

Signed-off-by: JP Kobryn <[email protected]>
Introduce bpf_out_of_memory() bpf kfunc, which allows declaring
an out-of-memory event and triggering the corresponding kernel OOM
handling mechanism.

It takes a trusted memcg pointer (or NULL for system-wide OOMs)
as an argument, as well as the page order.

If the BPF_OOM_FLAGS_WAIT_ON_OOM_LOCK flag is not set, only one OOM
can be declared and handled in the system at once, so if the function
is called in parallel to another OOM handling, it bails out with -EBUSY.
This mode is suited for global OOMs: any concurrent OOMs will likely
do the job and release some memory. In a blocking mode (which is
suited for memcg OOMs) the execution will wait on the oom_lock mutex.

The function is declared as sleepable. It guarantees that it won't
be called from an atomic context. It's required by the OOM handling
code, which shouldn't be called from a non-blocking context.

Handling of a memcg OOM almost always requires taking of the
css_set_lock spinlock. The fact that bpf_out_of_memory() is sleepable
also guarantees that it can't be called with acquired css_set_lock,
so the kernel can't deadlock on it.

Please, note that this function will be inaccessible as of now.
Calling bpf_out_of_memory() from a random context is dangerous
because e.g. it's easy to deadlock the system on oom_lock.
The following commit in the series will provide one safe context
where this kfunc can be used.

Signed-off-by: Roman Gushchin <[email protected]>
Currently there is a hard-coded list of possible oom constraints:
NONE, CPUSET, MEMORY_POLICY & MEMCG. Add a new one: CONSTRAINT_BPF.
Also, add an ability to specify a custom constraint name
when calling bpf_out_of_memory(). If an empty string is passed
as an argument, CONSTRAINT_BPF is displayed.

The resulting output in dmesg will look like this:

[  315.224875] kworker/u17:0 invoked oom-killer: gfp_mask=0x0(), order=0, oom_score_adj=0
               oom_policy=default
[  315.226532] CPU: 1 UID: 0 PID: 74 Comm: kworker/u17:0 Not tainted 6.16.0-00015-gf09eb0d6badc #102 PREEMPT(full)
[  315.226534] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014
[  315.226536] Workqueue: bpf_psi_wq bpf_psi_handle_event_fn
[  315.226542] Call Trace:
[  315.226545]  <TASK>
[  315.226548]  dump_stack_lvl+0x4d/0x70
[  315.226555]  dump_header+0x59/0x1c6
[  315.226561]  oom_kill_process.cold+0x8/0xef
[  315.226565]  out_of_memory+0x111/0x5c0
[  315.226577]  bpf_out_of_memory+0x6f/0xd0
[  315.226580]  ? srso_alias_return_thunk+0x5/0xfbef5
[  315.226589]  bpf_prog_3018b0cf55d2c6bb_handle_psi_event+0x5d/0x76
[  315.226594]  bpf__bpf_psi_ops_handle_psi_event+0x47/0xa7
[  315.226599]  bpf_psi_handle_event_fn+0x63/0xb0
[  315.226604]  process_one_work+0x1fc/0x580
[  315.226616]  ? srso_alias_return_thunk+0x5/0xfbef5
[  315.226624]  worker_thread+0x1d9/0x3b0
[  315.226629]  ? __pfx_worker_thread+0x10/0x10
[  315.226632]  kthread+0x128/0x270
[  315.226637]  ? lock_release+0xd4/0x2d0
[  315.226645]  ? __pfx_kthread+0x10/0x10
[  315.226649]  ret_from_fork+0x81/0xd0
[  315.226652]  ? __pfx_kthread+0x10/0x10
[  315.226655]  ret_from_fork_asm+0x1a/0x30
[  315.226667]  </TASK>
[  315.239745] memory: usage 42240kB, limit 9007199254740988kB, failcnt 0
[  315.240231] swap: usage 0kB, limit 0kB, failcnt 0
[  315.240585] Memory cgroup stats for /cgroup-test-work-dir673/oom_test/cg2:
[  315.240603] anon 42897408
[  315.241317] file 0
[  315.241493] kernel 98304
...
[  315.255946] Tasks state (memory values in pages):
[  315.256292] [  pid  ]   uid  tgid total_vm      rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
[  315.257107] [    675]     0   675   162013    10969    10712      257         0   155648        0             0 test_progs
[  315.257927] oom-kill:constraint=CONSTRAINT_BPF_PSI_MEM,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/cgroup-test-work-dir673/oom_test/cg2,task_memcg=/cgroup-test-work-dir673/oom_test/cg2,task=test_progs,pid=675,uid=0
[  315.259371] Memory cgroup out of memory: Killed process 675 (test_progs) total-vm:648052kB, anon-rss:42848kB, file-rss:1028kB, shmem-rss:0kB, UID:0 pgtables:152kB oom_score_adj:0

Signed-off-by: Roman Gushchin <[email protected]>
Export tsk_is_oom_victim() helper as a BPF kfunc.
It's very useful to avoid redundant oom kills.

Signed-off-by: Roman Gushchin <[email protected]>
Introduce bpf_map__attach_struct_ops_opts(), an extended version of
bpf_map__attach_struct_ops(), which takes additional struct
bpf_struct_ops_opts argument.

struct bpf_struct_ops_opts has the relative_fd member, which allows
to pass an additional file descriptor argument. It can be used to
attach struct ops maps to cgroups.

Signed-off-by: Roman Gushchin <[email protected]>
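
A usage sketch based on the struct definition quoted earlier in this thread
(illustrative; the map and variable names are hypothetical):

    LIBBPF_OPTS(bpf_struct_ops_opts, opts,
        .relative_fd = cgroup_fd,
    );

    link = bpf_map__attach_struct_ops_opts(skel->maps.test_ops, &opts);
    if (!link)
        return -errno;   /* errno is set by libbpf */
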
Implement read_cgroup_file() helper to read from cgroup control files,
e.g. statistics.

Signed-off-by: Roman Gushchin <[email protected]>
Implement a kselftest for the OOM handling functionality.

The OOM handling policy which is implemented in BPF is to
kill all tasks belonging to the biggest leaf cgroup, which
doesn't contain unkillable tasks (tasks with oom_score_adj
set to -1000). Pagecache size is excluded from the accounting.

The test creates a hierarchy of memory cgroups, causes an
OOM at the top level, checks that the expected process will be
killed and checks memcg's oom statistics.

Please, note that the same BPF OOM policy is attached to a memory
cgroup and system-wide. In the first case the program does nothing
and returns false, so it's executed the second time, when it properly
handles the OOM.

Signed-off-by: Roman Gushchin <[email protected]>
Currently psi_trigger_create() does a lot of things:
parses the user text input, allocates and initializes
the psi_trigger structure and turns on the trigger.
It does this slightly differently for the two existing types
of psi_triggers: system-wide and cgroup-wide.

In order to support a new type of PSI triggers, which
will be owned by a BPF program and won't have a user's
text description, let's refactor psi_trigger_create().

1. Introduce psi_trigger_type enum:
   currently PSI_SYSTEM and PSI_CGROUP are valid values.
2. Introduce psi_trigger_params structure to avoid passing
   a large number of parameters to psi_trigger_create().
3. Move out the user's input parsing into the new
   psi_trigger_parse() helper.
4. Move out the capabilities check into the new
   psi_file_privileged() helper.
5. Stop relying on t->of for detecting trigger type.

This commit is a pure refactoring and doesn't bring any
functional changes.

Signed-off-by: Roman Gushchin <[email protected]>
@kernel-patches-daemon-bpf
Author

Upstream branch: f9db3a3
series: https://patchwork.kernel.org/project/netdevbpf/list/?series=1016490
version: 2

rgushchin and others added 4 commits October 27, 2025 19:50
This patch implements a BPF struct ops-based mechanism to create
PSI triggers, attach them to cgroups or system-wide, and handle
PSI events in BPF.

The struct ops provides 4 callbacks:
  - init() called once at load, handy for creating PSI triggers
  - handle_psi_event() called every time a PSI trigger fires
  - handle_cgroup_online() called when a new cgroup is created
  - handle_cgroup_offline() called if a cgroup with an attached
    trigger is deleted

A single struct ops can create a number of PSI triggers, both
cgroup-scoped and system-wide.

All 4 struct ops callbacks can be sleepable. handle_psi_event()
handlers are executed using a separate workqueue, so it won't
affect the latency of other PSI triggers.

Signed-off-by: Roman Gushchin <[email protected]>
Implement a new bpf_psi_create_trigger() BPF kfunc, which allows
creating new PSI triggers and attaching them to cgroups or making
them system-wide.

Created triggers exist as long as the struct ops is loaded and,
if attached to a cgroup, as long as the cgroup exists.

Due to a limitation of 5 arguments, the resource type and the "full"
bit are squeezed into a single u32.

Signed-off-by: Roman Gushchin <[email protected]>
Include CONFIG_PSI to allow dependent tests to build.

Suggested-by: Song Liu <[email protected]>
Signed-off-by: JP Kobryn <[email protected]>
Add a PSI struct ops test.

The test creates a cgroup with two child sub-cgroups, sets up
memory.high for one of those and puts there a memory hungry
process (initially frozen).

Then it creates 2 PSI triggers from within a init() BPF callback and
attaches them to these cgroups.  Then it deletes the first cgroup,
creates another one and runs the memory hungry task. From the cgroup
creation callback the test is creating another trigger.

The memory hungry task is creating a high memory pressure in one
memory cgroup, which triggers a PSI event. The PSI BPF handler
declares a memcg oom in the corresponding cgroup. Finally it checks
that both handle_cgroup_free() and handle_psi_event() handlers were
executed, the correct process was killed and oom counters were
updated.

Signed-off-by: Roman Gushchin <[email protected]>