Description
Bug Report
Describe the bug
Fluent-bit 4.0.x and 4.1.x crash with a coredump when running on RHEL 10.
The bug seems to be related to the systemd input plugin. After starting, the agent works fine for a while before crashing with a coredump. Once this happens, every subsequent attempt to start the agent results in an immediate crash with another coredump.
If we delete the contents of storage.path (systemd.0/ and systemd.db), the agent starts without problems and runs for a while before crashing again with a coredump.
It seems to me that the systemd chunk file gets corrupted for some reason, and when this happens, the agent crashes.
This happens with the packages (4.0.13, 4.1.0, 4.1.1) from the almalinux repo at packages.fluentbit.io, on multiple servers. I have also compiled 4.1.0 and 4.1.1 from source to enable FLB_DEBUG, and I get the same problem.
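For anyone who needs the temporary workaround described above (clearing the systemd files under storage.path) but wants to keep the files for analysis instead of deleting them, here is a minimal sketch. The function name `quarantine_systemd_storage` and the backup location are hypothetical; the file names systemd.0/ and systemd.db come from this report. Matching `systemd.db*` is an assumption, made so that any SQLite sidecar files next to the database are moved along with it. Stop the fluent-bit service before running it.

```python
import shutil
import time
from pathlib import Path


def quarantine_systemd_storage(storage_path: str, backup_root: str) -> Path:
    """Move systemd.0/ and systemd.db* out of storage_path into a timestamped backup directory.

    Returns the backup directory. Other files in storage_path are left untouched.
    """
    storage = Path(storage_path)
    backup = Path(backup_root) / time.strftime("fluent-bit-storage-%Y%m%d-%H%M%S")
    backup.mkdir(parents=True, exist_ok=True)

    # Chunk directory of the systemd input, plus the cursor database
    # (and any sidecar files sharing its name prefix; that part is an assumption).
    candidates = [storage / "systemd.0"] + sorted(storage.glob("systemd.db*"))
    for src in candidates:
        if src.exists():
            shutil.move(str(src), str(backup / src.name))
    return backup
```

With the configuration in this report, the call would be `quarantine_systemd_storage("/var/lib/fluent-bit/storage", "/var/tmp")`, followed by starting the service again. This only hides the symptom until the next crash; it does not address the cause.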
To Reproduce
- Journald logs related to the crash:
Oct 24 15:11:22 hostname.domain fluent-bit[939013]: [2025/10/24 15:11:22] [engine] caught signal (SIGSEGV)
Oct 24 15:11:22 hostname.domain fluent-bit[939013]: #0 0x7f2e9f071608 in ???() at ???:0
Oct 24 15:11:22 hostname.domain fluent-bit[939013]: #1 0x7f2ea0057b08 in ???() at ???:0
Oct 24 15:11:22 hostname.domain fluent-bit[939013]: #2 0x7f2e9ffb4d32 in ???() at ???:0
Oct 24 15:11:22 hostname.domain fluent-bit[939013]: #3 0x5c899e in ???() at ???:0
Oct 24 15:11:22 hostname.domain fluent-bit[939013]: #4 0x55488e in ???() at ???:0
Oct 24 15:11:22 hostname.domain fluent-bit[939013]: #5 0x577e2b in ???() at ???:0
Oct 24 15:11:22 hostname.domain fluent-bit[939013]: #6 0x7f2e9f2bbb67 in ???() at ???:0
Oct 24 15:11:22 hostname.domain fluent-bit[939013]: #7 0x7f2e9f32c6bb in ???() at ???:0
Oct 24 15:11:22 hostname.domain fluent-bit[939013]: #8 0xffffffffffffffff in ???() at ???:0
Oct 24 15:11:22 hostname.domain systemd-coredump[946015]: Process 939013 (fluent-bit) of user 0 terminated abnormally with signal 6/ABRT, processing...
Oct 24 15:11:22 hostname.domain systemd[1]: Started [email protected] - Process Core Dump (PID 946015/UID 0).
Oct 24 15:11:22 hostname.domain systemd-coredump[946016]: Removed old coredump core.fluent-bit.0.dfd7beb07d594c77bef0090bd555891f.834839.1761112301000000.zst.
Oct 24 15:11:22 hostname.domain systemd-coredump[946016]: [🡕] Process 939013 (fluent-bit) of user 0 dumped core.
Module libzstd.so.1 from rpm zstd-1.5.5-9.el10.x86_64
Module libpcre2-8.so.0 from rpm pcre2-10.44-1.el10.3.x86_64
Module libcrypt.so.2 from rpm libxcrypt-4.4.36-10.el10.x86_64
Module libselinux.so.1 from rpm libselinux-3.8-2.el10_0.x86_64
Module libsasl2.so.3 from rpm cyrus-sasl-2.1.28-27.el10.x86_64
Module libevent-2.1.so.7 from rpm libevent-2.1.12-16.el10.x86_64
Module libkeyutils.so.1 from rpm keyutils-1.6.3-5.el10.x86_64
Module libkrb5support.so.0 from rpm krb5-1.21.3-8.el10_0.x86_64
Module libcom_err.so.2 from rpm e2fsprogs-1.47.1-3.el10.x86_64
Module libk5crypto.so.3 from rpm krb5-1.21.3-8.el10_0.x86_64
Module libkrb5.so.3 from rpm krb5-1.21.3-8.el10_0.x86_64
Module libgssapi_krb5.so.2 from rpm krb5-1.21.3-8.el10_0.x86_64
Module libz.so.1 from rpm zlib-ng-2.2.3-1.el10.x86_64
Module libcap.so.2 from rpm libcap-2.69-7.el10.x86_64
Module libcrypto.so.3 from rpm openssl-3.2.2-16.el10_0.4.x86_64
Module libssl.so.3 from rpm openssl-3.2.2-16.el10_0.4.x86_64
Module libsystemd.so.0 from rpm systemd-257-9.el10_0.1.x86_64
Module libyaml-0.so.2 from rpm libyaml-0.2.5-16.el10.x86_64
Stack trace of thread 939016:
#0 0x00007f2e9f2bd9dc __pthread_kill_implementation (libc.so.6 + 0x969dc)
#1 0x00007f2e9f267a96 raise (libc.so.6 + 0x40a96)
#2 0x00007f2e9f24f8fa abort (libc.so.6 + 0x288fa)
#3 0x00000000004b1498 n/a (n/a + 0x0)
#4 0x313a35312034322f n/a (n/a + 0x0)
ELF object binary architecture: AMD x86-64
Oct 24 15:11:22 hostname.domain systemd[1]: [email protected]: Deactivated successfully.
Oct 24 15:11:22 hostname.domain systemd[1]: [email protected]: Consumed 268ms CPU time, 106.8M memory peak.
Oct 24 15:11:22 hostname.domain systemd[1]: fluent-bit.service: Main process exited, code=dumped, status=6/ABRT
Oct 24 15:11:22 hostname.domain systemd[1]: fluent-bit.service: Failed with result 'core-dump'.
Oct 24 15:11:22 hostname.domain systemd[1]: fluent-bit.service: Consumed 22.833s CPU time, 48.4M memory peak.
Oct 24 15:11:22 hostname.domain systemd[1]: fluent-bit.service: Scheduled restart job, restart counter is at 1.
- Trace logs from fluent-bit before, during, and after the crash: fluent-bit.log.txt
- I was able to generate this backtrace from the coredump produced by a subsequent attempt to start the agent:
(gdb) backtrace
#0 0x00007fd5df4bd9dc in __pthread_kill_implementation () from /lib64/libc.so.6
#1 0x00007fd5df467a96 in raise () from /lib64/libc.so.6
#2 0x00007fd5df44f8fa in abort () from /lib64/libc.so.6
#3 0x00000000004ae47d in flb_signal_handler (signal=11) at /root/rhel10-test/fluent-bit-4.1.1/src/fluent-bit.c:636
#4 <signal handler called>
#5 0x00000000011d0e84 in ZSTD_freeDDict (ddict=0x1) at /root/rhel10-test/fluent-bit-4.1.1/lib/zstd-1.5.7/lib/decompress/zstd_ddict.c:215
#6 0x00007fd5df9ad609 in ZSTD_freeDCtx () from /lib64/libzstd.so.1
#7 0x00007fd5e01e5b09 in journal_file_data_payload.isra () from /lib64/libsystemd.so.0
#8 0x00007fd5e0142d33 in sd_journal_enumerate_data () from /lib64/libsystemd.so.0
#9 0x00000000007466ea in in_systemd_collect (ins=0x390d8d60, config=0x390a7490, in_context=0x7fd5d0001090) at /root/rhel10-test/fluent-bit-4.1.1/plugins/in_systemd/systemd.c:387
#10 0x0000000000746b07 in in_systemd_collect_archive (ins=0x390d8d60, config=0x390a7490, in_context=0x7fd5d0001090) at /root/rhel10-test/fluent-bit-4.1.1/plugins/in_systemd/systemd.c:512
#11 0x0000000000504fcd in input_collector_fd (fd=39, ins=0x390d8d60) at /root/rhel10-test/fluent-bit-4.1.1/src/flb_input_thread.c:166
#12 0x0000000000505b0a in engine_handle_event (fd=39, mask=1, ins=0x390d8d60, config=0x390a7490) at /root/rhel10-test/fluent-bit-4.1.1/src/flb_input_thread.c:181
#13 input_thread (data=0x7fd5d801f4e0) at /root/rhel10-test/fluent-bit-4.1.1/src/flb_input_thread.c:420
#14 0x000000000057ca77 in step_callback (data=0x7fd5d8024b70) at /root/rhel10-test/fluent-bit-4.1.1/src/flb_worker.c:43
#15 0x00007fd5df4bbb68 in start_thread () from /lib64/libc.so.6
#16 0x00007fd5df52c6bc in clone3 () from /lib64/libc.so.6
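As a reading aid for the backtrace above, here is a small sketch that parses gdb frames and lists caller/callee pairs where both functions share a prefix but one resolves to a source file (`at ...`) and the other to a shared object (`from ...`). The names `Frame`, `parse_backtrace`, and `mixed_origin_calls` are hypothetical, and the regular expression only covers the frame formats that appear in this report. On this trace it flags frame #6 (`ZSTD_freeDCtx` from /lib64/libzstd.so.1) calling frame #5 (`ZSTD_freeDDict` in the bundled lib/zstd-1.5.7 sources). Whether that mix is related to the crash is not established here.

```python
import re
from typing import List, NamedTuple, Tuple


class Frame(NamedTuple):
    index: int
    function: str
    kind: str    # "src" for "at <file>:<line>", "lib" for "from <shared object>", "" if neither
    origin: str  # the path gdb printed, or ""


# Matches lines such as:
#   #6 0x00007fd5df9ad609 in ZSTD_freeDCtx () from /lib64/libzstd.so.1
#   #13 input_thread (data=0x...) at /path/src/flb_input_thread.c:420
FRAME_RE = re.compile(
    r"^#(?P<idx>\d+)\s+"
    r"(?:0x[0-9a-fA-F]+\s+in\s+)?"
    r"(?P<func>[\w.]+)\s*\(.*?\)"
    r"(?:\s+(?P<kw>at|from)\s+(?P<origin>\S+))?\s*$"
)


def parse_backtrace(text: str) -> List[Frame]:
    """Parse gdb 'backtrace' output; lines that are not frames (e.g. '<signal handler called>') are skipped."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line.strip())
        if not m:
            continue
        kind = {"at": "src", "from": "lib"}.get(m.group("kw") or "", "")
        frames.append(Frame(int(m.group("idx")), m.group("func"), kind, m.group("origin") or ""))
    return frames


def mixed_origin_calls(frames: List[Frame], prefix: str) -> List[Tuple[Frame, Frame]]:
    """Return (caller, callee) pairs of directly adjacent frames that share a function-name
    prefix but differ in kind (one from a shared object, one from source with debug info)."""
    pairs = []
    for callee, caller in zip(frames, frames[1:]):
        adjacent = caller.index == callee.index + 1  # in gdb, frame N+1 called frame N
        same_family = caller.function.startswith(prefix) and callee.function.startswith(prefix)
        if adjacent and same_family and caller.kind and callee.kind and caller.kind != callee.kind:
            pairs.append((caller, callee))
    return pairs
```

Calling `mixed_origin_calls(parse_backtrace(trace_text), "ZSTD_")` on the pasted backtrace returns that single pair. The journald output earlier in this report lists the system library as zstd-1.5.5-9.el10.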
- Steps to reproduce the problem:
The first crash, which occurs after the agent has been working without problems for a while, appears to be random; I have not been able to identify its cause. After the first crash (which is probably when the chunk file gets corrupted), the crash is reproducible if you use the attached chunk/db files under storage.path/systemd.0/ and storage.path/systemd.db.
Expected behavior
The agent should not crash with a coredump.
Screenshots
Your Environment
- Version used:
fluent-bit 4.0.13, 4.1.0 and 4.1.1
- Configuration:
[SERVICE]
# Flush
# =====
# set an interval of seconds before to flush records to a destination
flush 10
# Daemon
# ======
# instruct Fluent Bit to run in foreground or background mode.
daemon Off
# Log_file
# ========
# Absolute path for an optional log file. By default all logs are
# redirected to the standard error interface (stderr).
log_file /var/log/fluent-bit/fluent-bit.log
# Log_Level
# =========
# Set the verbosity level of the service, values can be:
#
# - error
# - warning
# - info
# - debug
# - trace
#
# by default 'info' is set, that means it includes 'error' and 'warning'.
log_level trace
# Parsers File
# ============
# specify an optional 'Parsers' configuration file
parsers_file parsers.conf
# Plugins File
# ============
# specify an optional 'Plugins' configuration file to load external plugins.
plugins_file plugins.conf
# HTTP Server
# ===========
# Enable/Disable the built-in HTTP Server for metrics
http_server On
http_listen 127.0.0.1
http_port 2020
# Storage
# =======
# Fluent Bit can use memory and filesystem buffering based mechanisms
#
# - https://docs.fluentbit.io/manual/administration/buffering-and-storage
#
# storage metrics
# ---------------
# publish storage pipeline metrics in '/api/v1/storage'. The metrics are
# exported only if the 'http_server' option is enabled.
#
storage.metrics on
# storage.path
# ------------
# absolute file system path to store filesystem data buffers (chunks).
#
storage.path /var/lib/fluent-bit/storage
# storage.sync
# ------------
# configure the synchronization mode used to store the data into the
# filesystem. It can take the values normal or full.
#
storage.sync normal
# storage.checksum
# ----------------
# enable the data integrity check when writing and reading data from the
# filesystem. The storage layer uses the CRC32 algorithm.
#
storage.checksum off
# storage.backlog.mem_limit
# -------------------------
# if storage.path is set, Fluent Bit will look for data chunks that were
# not delivered and are still in the storage layer, these are called
# backlog data. This option configure a hint of maximum value of memory
# to use when processing these records.
#
storage.backlog.mem_limit 100M
# storage.max_chunks_up
# ---------------------
# If the input plugin has enabled filesystem storage type, this
# property sets the maximum number of chunks that can be up in
# memory. Use this setting to control memory usage when you enable
# storage.type filesystem.
#
storage.max_chunks_up 128
# storage.delete_irrecoverable_chunks
# -----------------------------------
# When enabled, irrecoverable chunks will be deleted during
# runtime, and any other irrecoverable chunk located in the
# configured storage path directory will be deleted when
# Fluent-Bit starts. Accepted values: 'Off, 'On.
#
storage.delete_irrecoverable_chunks on
# scheduler.base
# ---------------
# Set a base of exponential backoff in seconds.
scheduler.base 5
# scheduler.cap
# -------------
# Set a maximum retry time in seconds.
scheduler.cap 900
[INPUT]
Name systemd
Tag logs_5000_systemd
db /var/lib/fluent-bit/storage/systemd.db
db.Sync Normal
Mem_Buf_Limit 100MB
storage.type filesystem
storage.pause_on_chunks_overlimit on
Read_From_Tail On
Lowercase On
Threaded true
[FILTER]
Name modify
Match logs_5000_systemd
Add dataops.data_processor dataops-logs-systemd
Add event.module systemd
Add event.provider systemd
Add event.dataset systemd.journald
Add data_stream.namespace prod
Add data_stream.dataset systemd.journald
Add service.name linux-systemd
[FILTER]
Name nest
Match *
Operation nest
Wildcard dataops.*
Nest_under dataops
Remove_prefix dataops.
[FILTER]
Name nest
Match *
Operation nest
Wildcard event.*
Nest_under event
Remove_prefix event.
[FILTER]
Name nest
Match *
Operation nest
Wildcard data_stream.*
Nest_under data_stream
Remove_prefix data_stream.
[FILTER]
Name nest
Match *
Operation nest
Wildcard service.*
Nest_under service
Remove_prefix service.
[FILTER]
Name modify
Match *
Add agent.type fluent-bit
[FILTER]
Name sysinfo
Match *
Fluentbit_version_key agent.version
Os_name_key os.name
Os_version_key os.version
Kernel_version_key os.kernel
Hostname_key host.name
[FILTER]
Name nest
Match *
Operation nest
Wildcard agent.*
Nest_under agent
Remove_prefix agent.
[FILTER]
Name nest
Match *
Operation nest
Wildcard os.*
Nest_under os
Remove_prefix os.
[FILTER]
Name nest
Match *
Operation nest
Wildcard host.*
Wildcard os*
Nest_under host
Remove_prefix host.
[OUTPUT]
Name http
Match logs_5000_*
Host server-receiver.example.org
Port 5000
Format json
Workers 1
storage.total_limit_size 100M
Retry_Limit no_limits
tls On
tls.verify On
tls.ca_file /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
tls.crt_file /path/my.crt
tls.key_file /path/my.key
fluent-bit.conf.txt / fluent-bit-systemd.conf.txt
- Server type and version:
Linux 6.12.0-55.38.1.el10_0.x86_64 x86_64 GNU/Linux
- Operating System and version:
Red Hat Enterprise Linux release 10.0 (Coughlan)
- Filters and plugins:
systemd(input), modify, nest, sysinfo, http(output)
Additional context
- Fluent-bit systemd chunk file and systemd db after the crash: fluent-bit-storage_path_files.zip