Description
The function `require"iwinfo".type("phy1-ap0")` can hang if it is called some time after `scanlist()` in the same Lua state, because the netlink multicast subscription stays active. I reproduced the issue on OpenWrt 23.05.3 and on a snapshot release from this month; other versions are probably affected as well. Likely all nl80211 drivers are affected.
strace shows:
sendmsg(4, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=28, nlmsg_type=0x18 /* NLMSG_??? */, nlmsg_flags=NLM_F_REQUEST|NLM_F_ACK, nlmsg_seq=1747314795, nlmsg_pid=10774}, "\x05\x00\x00\x00\x08\x00\x03\x00\x2a\x00\x00\x00"], iov_len=28}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 28
recvmsg(4, {msg_namelen=12}, 0) = -1 ENOBUFS (No buffer space available)
Afterwards, `nl80211_send()` keeps looping because `err > 0`. It may still receive multicast messages, but it keeps waiting for a reply to the original command. That reply never arrives: the kernel discarded it because the receive buffer had run out of space. As a result, `iwinfo_L_type` never returns.
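The hang described above can be modelled without a real netlink socket. The following is a simplified sketch, not the libiwinfo code: `fake_recv`, `fake_sock`, `STILL_WAITING` and both wait functions are hypothetical names, and it assumes that the waiting loop ignores the result of the receive call. It shows why a loop that only checks a "reply pending" flag cannot finish once the reply was dropped, while a loop that inspects the receive result can bail out.

```c
#include <errno.h>

/* Hypothetical model: what the fake socket delivers on each receive. */
enum event { EV_MULTICAST, EV_ENOBUFS, EV_ACK };

struct fake_sock {
    const enum event *events; /* scripted sequence of receive results */
    int count;                /* number of scripted events */
    int pos;                  /* next event to deliver */
};

#define STILL_WAITING 1 /* returned when the iteration cap is reached */

/* Stand-in for a receive call: returns 0 or a negative errno and clears
 * *pending when the reply to our request arrives. */
static int fake_recv(struct fake_sock *s, int *pending)
{
    /* Once the script is exhausted, only multicast traffic keeps arriving. */
    enum event ev = s->pos < s->count ? s->events[s->pos++] : EV_MULTICAST;

    if (ev == EV_ENOBUFS)
        return -ENOBUFS;
    if (ev == EV_ACK)
        *pending = 0;
    return 0;
}

/* Loop that ignores the receive result. max_iter only keeps the demo
 * finite; without the cap this would spin forever after a dropped reply. */
static int wait_ignoring_errors(struct fake_sock *s, int max_iter)
{
    int pending = 1;

    for (int n = 0; pending > 0 && n < max_iter; n++)
        (void)fake_recv(s, &pending);
    return pending > 0 ? STILL_WAITING : 0;
}

/* Variant that stops on the first receive error so the caller can retry. */
static int wait_checking_errors(struct fake_sock *s, int max_iter)
{
    int pending = 1;

    for (int n = 0; pending > 0 && n < max_iter; n++) {
        int r = fake_recv(s, &pending);
        if (r < 0)
            return r;
    }
    return pending > 0 ? STILL_WAITING : 0;
}
```

With the scripted sequence multicast, `ENOBUFS`, multicast, the first variant reaches the iteration cap while the second returns `-ENOBUFS` on the second receive. Whether returning an error or resending the request is the right reaction in libiwinfo is a separate design question.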
$ ./scripts/remote-gdb 127.0.0.1:9000 ./build_dir/target-mips_24kc_musl/lua-5.1.5/.pkgdir/lua/usr/bin/lua5.1
# I configured some breakpoints
(gdb) bt
#0 recvmsgs (cb=0x76d57560, sk=0x77518d50) at libnl-tiny-2023-07-27-bc92a280/nl.c:537
#1 nl_recvmsgs (sk=<optimized out>, cb=0x76d57560) at libnl-tiny-2023-07-27-bc92a280/nl.c:697
#2 0x774bfb5b in nl80211_send (cv=0x774dc3a8 <cv>, cb_func=<optimized out>, cb_arg=<optimized out>) at iwinfo_nl80211.c:509
#3 0x774bfe31 in nl80211_request (ifname=<optimized out>, cmd=<optimized out>, flags=<optimized out>,
cb_func=0x774c3c19 <nl80211_phyname_cb>, cb_arg=0x774dc360 <phy>) at iwinfo_nl80211.c:527
#4 0x774bfee1 in nl80211_ifname2phy (ifname=<optimized out>) at iwinfo_nl80211.c:779
#5 0x774bff13 in nl80211_probe (ifname=<optimized out>) at iwinfo_nl80211.c:1222
#6 0x774bee31 in iwinfo_backend (ifname=<optimized out>) at iwinfo_lib.c:413
#7 0x774bee57 in iwinfo_type (ifname=<optimized out>) at iwinfo_lib.c:401
#8 0x774e0c7b in iwinfo_L_type (L=0x77ea3820) at iwinfo_lua.c:26
#9 0x77db9dd7 in luaD_precall (L=0x77ea3820, func=0x76ee5690, nresults=<optimized out>) at ldo.c:320
# … the rest of the backtrace is only Lua functions
(gdb) c
Continuing.
Breakpoint 2, nl80211_send (cv=0x774dc3a8 <cv>, cb_func=<optimized out>, cb_arg=<optimized out>) at iwinfo_nl80211.c:508
508 while (err > 0)
(gdb) c
Continuing.
Breakpoint 4, recvmsgs (cb=0x76d57560, sk=0x77518d50) at libnl-tiny-2023-07-27-bc92a280/nl.c:537
537 if (cb->cb_set[NL_CB_MSG_IN])
(gdb) p *(struct genlmsghdr*)((unsigned char *) msg->nm_nlh + NLMSG_HDRLEN)
$37 = {cmd = 34 '"', version = 1 '\001', reserved = 0}
# meaning NL80211_CMD_NEW_SCAN_RESULTS
Script to reproduce:
#!/usr/bin/env lua
-- dependencies: `apk add libiwinfo-lua lua luaposix`
local device_interface = 'phy1-ap0' -- might differ on your router
local iwinfo = require 'iwinfo'
local p = require 'posix'
local SOL_NETLINK = 270
local NETLINK_DROP_MEMBERSHIP = 2
local NETLINK_LIST_MEMBERSHIPS = 9
local ap_type = iwinfo.type(device_interface)
local function scan()
  print('scan', ap_type)
  for _, scan_entry in ipairs(iwinfo[ap_type].scanlist(device_interface)) do
    print('result', scan_entry.signal)
  end
end
-- iwinfo does not expose its netlink socket, so probe low fd numbers for it
local function find_iwinfo_fd()
  for i = 1, 90 do
    local addr = p.getsockname(i)
    if addr and addr.family == p.AF_NETLINK then
      return i
    end
  end
end
scan()
local iwinfo_fd = find_iwinfo_fd()
assert(iwinfo_fd, 'no netlink socket found')
-- bitmask of the multicast groups the socket is still subscribed to
local scan_group = p.getsockopt(iwinfo_fd, SOL_NETLINK, NETLINK_LIST_MEMBERSHIPS)
assert(type(scan_group) == 'number' and scan_group > 0)
local rcvbuf = p.getsockopt(iwinfo_fd, p.SOL_SOCKET, p.SO_RCVBUF)
assert(type(rcvbuf) == 'number' and rcvbuf > 2048)
local function work_around_bug()
  -- bit (n - 1) of the mask stands for group n; round to guard against float error
  local group_id = 1 + math.floor(math.log(scan_group) / math.log(2) + 0.5)
  local res, err = p.setsockopt(iwinfo_fd, SOL_NETLINK, NETLINK_DROP_MEMBERSHIP, group_id)
  assert(res == 0, err or res)
end
-- This simulates other processes that start scans, for example `iw dev phy1-ap0 scan`.
-- In reality, it will take a few minutes for the buffer to accumulate. This is quicker.
local simulation_fd = p.socket(p.AF_NETLINK, p.SOCK_RAW, p.NETLINK_GENERIC)
local function simulate_mcast(bytes)
  local kib = math.floor(bytes / 1024)
  print(string.format('sending %d messages of 1KiB', kib))
  for i = 1, kib do
    p.sendto(simulation_fd, string.rep('\0', 1024), {
      family = p.AF_NETLINK,
      pid = 0,
      groups = scan_group,
    })
  end
end
local function dbg_iwinfo_type()
  print('calling iwinfo.type')
  print('result: ', iwinfo.type(device_interface))
end
simulate_mcast(rcvbuf * .4)
dbg_iwinfo_type()
print("the workaround isn't necessary usually")
scan()
simulate_mcast(rcvbuf * .4)
dbg_iwinfo_type()
print("only when too many multicast messages arrive")
scan()
work_around_bug()
simulate_mcast(rcvbuf * .6)
dbg_iwinfo_type()
print("now repeated without the workaround, should reproduce the bug")
scan()
simulate_mcast(rcvbuf * .6)
dbg_iwinfo_type() -- expected this to print `"nl80211"`, but it actually hangs indefinitely
What didn't help:
- Calling `iwinfo.__gc()` after `scanlist` makes `iwinfo.type()` return `nil`, which is also undesired.
- Calling `iwinfo.type()` only once and reusing the result caused other operations to get stuck in `nl80211_send()`.
Potential solutions (that I did not yet attempt to implement in C):
- Unsubscribe at the end of `__nl80211_wait`, just like https://github.com/openwrt/usteer/blob/e218150979b40a1b3c59ad0aaa3bbb943814db1e/nl80211.c#L404
- Or setting `NETLINK_NO_ENOBUFS` might help?
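At the socket level, both potential solutions come down to one `setsockopt()` call each. The sketch below is untested against libiwinfo and uses hypothetical helper names (`first_group_id`, `drop_membership`, `disable_enobufs`). It assumes the raw file descriptor of the netlink socket is available, for example via libnl's `nl_socket_get_fd()`. `first_group_id` mirrors the `1 + log2(mask)` computation in the Lua script using `ffs()`.

```c
#include <errno.h>
#include <stdint.h>
#include <strings.h>       /* ffs() */
#include <sys/socket.h>
#include <linux/netlink.h> /* NETLINK_DROP_MEMBERSHIP, NETLINK_NO_ENOBUFS */

#ifndef SOL_NETLINK
#define SOL_NETLINK 270
#endif

/* In the membership bitmask, bit (n - 1) stands for group n.
 * Returns the lowest subscribed group id, or 0 if the mask is empty. */
static int first_group_id(uint32_t mask)
{
    return ffs((int)mask);
}

/* Leave one multicast group so its messages no longer fill the buffer.
 * Returns 0 on success or a negative errno. */
static int drop_membership(int fd, int group_id)
{
    if (setsockopt(fd, SOL_NETLINK, NETLINK_DROP_MEMBERSHIP,
                   &group_id, sizeof(group_id)) < 0)
        return -errno;
    return 0;
}

/* Ask the kernel not to report ENOBUFS on this socket.
 * Returns 0 on success or a negative errno. */
static int disable_enobufs(int fd)
{
    int one = 1;

    if (setsockopt(fd, SOL_NETLINK, NETLINK_NO_ENOBUFS,
                   &one, sizeof(one)) < 0)
        return -errno;
    return 0;
}
```

For the first option, `drop_membership()` would run once the scan wait has finished. For the second, note that `NETLINK_NO_ENOBUFS` only suppresses the error report. As far as I understand, the kernel can still drop messages when the receive buffer is full, so this option may not be sufficient on its own.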