nl80211_send() hangs due to ENOBUFS after scan #23

@TjeuKayim

Description

Calling require"iwinfo".type("phy1-ap0") can hang if it is called some time after scanlist() in the same Lua state, because the netlink multicast subscription stays active. I reproduced the issue on OpenWrt 23.05.3 and on a snapshot release from this month; other versions are probably affected as well. Most likely all nl80211 drivers are affected.

strace shows:

sendmsg(4, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=28, nlmsg_type=0x18 /* NLMSG_??? */, nlmsg_flags=NLM_F_REQUEST|NLM_F_ACK, nlmsg_seq=1747314795, nlmsg_pid=10774}, "\x05\x00\x00\x00\x08\x00\x03\x00\x2a\x00\x00\x00"], iov_len=28}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 28
recvmsg(4, {msg_namelen=12}, 0)         = -1 ENOBUFS (No buffer space available)

Afterwards, nl80211_send() keeps looping because err > 0. It may still receive multicast messages, but it keeps waiting for a reply to the original command. That reply never arrives, because the kernel discarded it when the receive buffer ran out of space. As a result, iwinfo_L_type never returns.
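
To make the failure mode concrete, here is a minimal, self-contained C model of such a wait loop. It is not the iwinfo code: recv_once_fn, wait_for_reply and always_enobufs are hypothetical names, and the fake receiver only stands in for a receive call that reports a buffer overrun. The sketch assumes the receive call returns a negative code on failure, and shows how checking that return value ends the wait instead of spinning while the pending flag stays positive.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a netlink receive call: returns 0 on
 * success or a negative error code on failure. */
typedef int (*recv_once_fn)(void *ctx);

/* Wait until a callback clears *pending (sets it <= 0), but give up
 * as soon as the receive call itself fails. Without the rc check the
 * loop would never end if the reply was dropped. */
int wait_for_reply(recv_once_fn recv_once, void *ctx, int *pending)
{
	while (*pending > 0) {
		int rc = recv_once(ctx);

		if (rc < 0)
			return rc; /* e.g. receive buffer overrun */
	}
	return *pending;
}

/* Fake receiver modelling a socket whose buffer overflowed: it counts
 * its invocations and always fails with -ENOBUFS. */
int always_enobufs(void *ctx)
{
	int *calls = ctx;

	(*calls)++;
	return -ENOBUFS;
}
```

Returning the error is only one possible policy; a caller could also resend the request after an overrun, since the original reply is lost either way.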

$ ./scripts/remote-gdb 127.0.0.1:9000 ./build_dir/target-mips_24kc_musl/lua-5.1.5/.pkgdir/lua/usr/bin/lua5.1
# I configured some breakpoints
(gdb) bt
#0  recvmsgs (cb=0x76d57560, sk=0x77518d50) at libnl-tiny-2023-07-27-bc92a280/nl.c:537
#1  nl_recvmsgs (sk=<optimized out>, cb=0x76d57560) at libnl-tiny-2023-07-27-bc92a280/nl.c:697
#2  0x774bfb5b in nl80211_send (cv=0x774dc3a8 <cv>, cb_func=<optimized out>, cb_arg=<optimized out>) at iwinfo_nl80211.c:509
#3  0x774bfe31 in nl80211_request (ifname=<optimized out>, cmd=<optimized out>, flags=<optimized out>, 
    cb_func=0x774c3c19 <nl80211_phyname_cb>, cb_arg=0x774dc360 <phy>) at iwinfo_nl80211.c:527
#4  0x774bfee1 in nl80211_ifname2phy (ifname=<optimized out>) at iwinfo_nl80211.c:779
#5  0x774bff13 in nl80211_probe (ifname=<optimized out>) at iwinfo_nl80211.c:1222
#6  0x774bee31 in iwinfo_backend (ifname=<optimized out>) at iwinfo_lib.c:413
#7  0x774bee57 in iwinfo_type (ifname=<optimized out>) at iwinfo_lib.c:401
#8  0x774e0c7b in iwinfo_L_type (L=0x77ea3820) at iwinfo_lua.c:26
#9  0x77db9dd7 in luaD_precall (L=0x77ea3820, func=0x76ee5690, nresults=<optimized out>) at ldo.c:320
# … the rest of the backtrace is only Lua functions
(gdb) c
Continuing.

Breakpoint 2, nl80211_send (cv=0x774dc3a8 <cv>, cb_func=<optimized out>, cb_arg=<optimized out>) at iwinfo_nl80211.c:508
508		while (err > 0)
(gdb) c
Continuing.

Breakpoint 4, recvmsgs (cb=0x76d57560, sk=0x77518d50) at libnl-tiny-2023-07-27-bc92a280/nl.c:537
537			if (cb->cb_set[NL_CB_MSG_IN])
(gdb) p *(struct genlmsghdr*)((unsigned char *) msg->nm_nlh + NLMSG_HDRLEN)
$37 = {cmd = 34 '"', version = 1 '\001', reserved = 0}
# meaning NL80211_CMD_NEW_SCAN_RESULTS

Script to reproduce:

#!/usr/bin/env lua
-- dependencies: `apk add libiwinfo-lua lua luaposix`
local device_interface = 'phy1-ap0' -- might differ on your router

local iwinfo = require 'iwinfo'
local p = require'posix'
local SOL_NETLINK = 270
local NETLINK_DROP_MEMBERSHIP = 2
local NETLINK_LIST_MEMBERSHIPS = 9

local ap_type = iwinfo.type(device_interface)

local function scan()
	print('scan', ap_type)
	for _, scan_entry in ipairs(iwinfo[ap_type].scanlist(device_interface)) do
		print('result', scan_entry.signal)
	end
end

-- Probe low file descriptors for the netlink socket that iwinfo opened.
local function find_iwinfo_fd()
	for i = 1, 90 do
		local addr = p.getsockname(i)
		if addr and addr.family == p.AF_NETLINK then
			return i
		end
	end
end

scan()
local iwinfo_fd = find_iwinfo_fd()

local scan_group = p.getsockopt(iwinfo_fd, SOL_NETLINK, NETLINK_LIST_MEMBERSHIPS)
assert(type(scan_group) == 'number' and scan_group > 0)
local rvcbuf = p.getsockopt(iwinfo_fd, p.SOL_SOCKET, p.SO_RCVBUF)
assert(type(rvcbuf) == 'number' and rvcbuf > 2048)

local function work_around_bug()
	local group_id = 1 + (math.log(scan_group) / math.log(2))
	local res, err = p.setsockopt(iwinfo_fd, SOL_NETLINK, NETLINK_DROP_MEMBERSHIP, group_id)
	assert(res == 0, err or res)
end

-- This simulates other processes that start scans, for example `iw dev phy1-ap0 scan`.
-- In reality, it will take a few minutes for the buffer to accumulate. This is quicker.
local simulation_fd = p.socket(p.AF_NETLINK, p.SOCK_RAW, p.NETLINK_GENERIC)
local function simulate_mcast(bytes)
	local kib = math.floor(bytes / 1024)
	print(string.format('sending %d messages of 1KiB', kib))
	for i = 1, kib do
		p.sendto(simulation_fd, string.rep('\0', 1024), {
			family = p.AF_NETLINK,
			pid = 0,
			groups = scan_group,
		})
	end
end

local function dbg_iwinfo_type()
	print('calling iwinfo.type')
	print('result: ', iwinfo.type(device_interface))
end

simulate_mcast(rvcbuf * .4)
dbg_iwinfo_type()

print("the workaround isn't necessary usually")
scan()
simulate_mcast(rvcbuf * .4)
dbg_iwinfo_type()

print("only when too many multicast messages arrive")
scan()
work_around_bug()
simulate_mcast(rvcbuf * .6)
dbg_iwinfo_type()

print("now repeated without the workaround, should reproduce the bug")
scan()
simulate_mcast(rvcbuf * .6)
dbg_iwinfo_type() -- expected this to print `"nl80211"`, but it actually hangs indefinitely

What didn't help:

  1. Calling iwinfo.__gc() after scanlist() makes iwinfo.type() return nil, which is also undesirable.
  2. Calling iwinfo.type() only once and reusing the result did not help either: other operations then got stuck in nl80211_send().

Potential solutions (which I have not yet attempted to implement in C):

  1. Unsubscribe at the end of __nl80211_wait, just like usteer does: https://github.com/openwrt/usteer/blob/e218150979b40a1b3c59ad0aaa3bbb943814db1e/nl80211.c#L404
  2. Alternatively, setting NETLINK_NO_ENOBUFS might help?
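
As a rough illustration of these two ideas, the following C sketch works at the raw socket level rather than through libnl. The helper names first_group_id, drop_membership and set_no_enobufs are hypothetical, and nothing here has been tried against iwinfo; in particular, whether NETLINK_NO_ENOBUFS actually prevents the hang is an open question. The first helper mirrors the bitmask-to-group-ID conversion from the Lua workaround, assuming bit n-1 of the mask stands for group n.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/netlink.h>

#ifndef SOL_NETLINK
#define SOL_NETLINK 270 /* fallback for older libc headers */
#endif

/* Hypothetical helper: 1-based group ID of the lowest set bit in a
 * membership bitmask (bit n-1 stands for group n); 0 if the mask is
 * empty. */
int first_group_id(uint32_t mask)
{
	int id;

	for (id = 1; id <= 32; id++) {
		if (mask & 1u)
			return id;
		mask >>= 1;
	}
	return 0;
}

/* Idea 1: leave a multicast group once the scan results are read. */
int drop_membership(int fd, int group_id)
{
	return setsockopt(fd, SOL_NETLINK, NETLINK_DROP_MEMBERSHIP,
			  &group_id, sizeof(group_id));
}

/* Idea 2: ask the kernel not to report ENOBUFS on this socket. */
int set_no_enobufs(int fd)
{
	int on = 1;

	return setsockopt(fd, SOL_NETLINK, NETLINK_NO_ENOBUFS,
			  &on, sizeof(on));
}
```

Inside iwinfo the file descriptor would have to come from the library's netlink socket, so this only shows which socket options are involved, not where the calls belong.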
