Fluent-bit crashes with a coredump when running on RHEL10 #11068

@rafaelma

Bug Report

Describe the bug

Fluent-bit 4.0.x and 4.1.x crash with a coredump when running on RHEL10.

The bug seems to be related to the systemd input plugin. After starting, the agent works fine for a while and then crashes with a coredump. Once this happens, every subsequent attempt to start the agent crashes immediately with another coredump.

If we delete all contents of storage.path (systemd.0/ and systemd.db), the agent starts without problems and runs for a while before crashing again with a coredump.
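Since the buffer files are also the material needed to reproduce the crash, it may be preferable to move them aside instead of deleting them. The following is only a minimal sketch of that idea: the function name `quarantine_systemd_buffers` and the backup location are hypothetical, the `systemd.db*` glob assumes that SQLite sidecar files may sit next to `systemd.db`, and the fluent-bit service is assumed to be stopped before it runs.

```python
import shutil
import time
from pathlib import Path


def quarantine_systemd_buffers(storage_path, backup_root):
    """Move systemd.0/ and systemd.db* out of storage_path into a new
    timestamped directory under backup_root, and return the new paths.

    Hypothetical helper: run it only while fluent-bit is stopped.
    """
    storage = Path(storage_path)
    dest = Path(backup_root) / time.strftime("fluent-bit-storage-%Y%m%d-%H%M%S")
    dest.mkdir(parents=True, exist_ok=False)

    # Chunk directory of the systemd input, plus its database file and any
    # sidecar files (assumption: they share the "systemd.db" prefix).
    candidates = [storage / "systemd.0"] + sorted(storage.glob("systemd.db*"))

    moved = []
    for src in candidates:
        if src.exists():
            target = dest / src.name
            shutil.move(str(src), str(target))
            moved.append(target)
    return moved
```

With the configuration shown further down, this would be called as `quarantine_systemd_buffers("/var/lib/fluent-bit/storage", "/var/tmp")`, where `/var/tmp` is just an example destination.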

It seems to me that the systemd chunk file gets corrupted for some reason, and when this happens, the agent crashes.

This happens on multiple servers with the packages (4.0.13, 4.1.0, 4.1.1) from the almalinux repo at packages.fluentbit.io. I have also compiled 4.1.0 and 4.1.1 from source with FLB_DEBUG enabled, and I get the same problem.

To Reproduce

  • Journald logs related to the crash:
Oct 24 15:11:22 hostname.domain fluent-bit[939013]: [2025/10/24 15:11:22] [engine] caught signal (SIGSEGV)
Oct 24 15:11:22 hostname.domain fluent-bit[939013]: #0  0x7f2e9f071608      in  ???() at ???:0
Oct 24 15:11:22 hostname.domain fluent-bit[939013]: #1  0x7f2ea0057b08      in  ???() at ???:0
Oct 24 15:11:22 hostname.domain fluent-bit[939013]: #2  0x7f2e9ffb4d32      in  ???() at ???:0
Oct 24 15:11:22 hostname.domain fluent-bit[939013]: #3  0x5c899e            in  ???() at ???:0
Oct 24 15:11:22 hostname.domain fluent-bit[939013]: #4  0x55488e            in  ???() at ???:0
Oct 24 15:11:22 hostname.domain fluent-bit[939013]: #5  0x577e2b            in  ???() at ???:0
Oct 24 15:11:22 hostname.domain fluent-bit[939013]: #6  0x7f2e9f2bbb67      in  ???() at ???:0
Oct 24 15:11:22 hostname.domain fluent-bit[939013]: #7  0x7f2e9f32c6bb      in  ???() at ???:0
Oct 24 15:11:22 hostname.domain fluent-bit[939013]: #8  0xffffffffffffffff  in  ???() at ???:0
Oct 24 15:11:22 hostname.domain systemd-coredump[946015]: Process 939013 (fluent-bit) of user 0 terminated abnormally with signal 6/ABRT, processing...
Oct 24 15:11:22 hostname.domain systemd[1]: Started [email protected] - Process Core Dump (PID 946015/UID 0).
Oct 24 15:11:22 hostname.domain systemd-coredump[946016]: Removed old coredump core.fluent-bit.0.dfd7beb07d594c77bef0090bd555891f.834839.1761112301000000.zst.
Oct 24 15:11:22 hostname.domain systemd-coredump[946016]: [🡕] Process 939013 (fluent-bit) of user 0 dumped core.
                                                        
                                                        Module libzstd.so.1 from rpm zstd-1.5.5-9.el10.x86_64
                                                        Module libpcre2-8.so.0 from rpm pcre2-10.44-1.el10.3.x86_64
                                                        Module libcrypt.so.2 from rpm libxcrypt-4.4.36-10.el10.x86_64
                                                        Module libselinux.so.1 from rpm libselinux-3.8-2.el10_0.x86_64
                                                        Module libsasl2.so.3 from rpm cyrus-sasl-2.1.28-27.el10.x86_64
                                                        Module libevent-2.1.so.7 from rpm libevent-2.1.12-16.el10.x86_64
                                                        Module libkeyutils.so.1 from rpm keyutils-1.6.3-5.el10.x86_64
                                                        Module libkrb5support.so.0 from rpm krb5-1.21.3-8.el10_0.x86_64
                                                        Module libcom_err.so.2 from rpm e2fsprogs-1.47.1-3.el10.x86_64
                                                        Module libk5crypto.so.3 from rpm krb5-1.21.3-8.el10_0.x86_64
                                                        Module libkrb5.so.3 from rpm krb5-1.21.3-8.el10_0.x86_64
                                                        Module libgssapi_krb5.so.2 from rpm krb5-1.21.3-8.el10_0.x86_64
                                                        Module libz.so.1 from rpm zlib-ng-2.2.3-1.el10.x86_64
                                                        Module libcap.so.2 from rpm libcap-2.69-7.el10.x86_64
                                                        Module libcrypto.so.3 from rpm openssl-3.2.2-16.el10_0.4.x86_64
                                                        Module libssl.so.3 from rpm openssl-3.2.2-16.el10_0.4.x86_64
                                                        Module libsystemd.so.0 from rpm systemd-257-9.el10_0.1.x86_64
                                                        Module libyaml-0.so.2 from rpm libyaml-0.2.5-16.el10.x86_64
                                                        Stack trace of thread 939016:
                                                        #0  0x00007f2e9f2bd9dc __pthread_kill_implementation (libc.so.6 + 0x969dc)
                                                        #1  0x00007f2e9f267a96 raise (libc.so.6 + 0x40a96)
                                                        #2  0x00007f2e9f24f8fa abort (libc.so.6 + 0x288fa)
                                                        #3  0x00000000004b1498 n/a (n/a + 0x0)
                                                        #4  0x313a35312034322f n/a (n/a + 0x0)
                                                        ELF object binary architecture: AMD x86-64
Oct 24 15:11:22 hostname.domain systemd[1]: [email protected]: Deactivated successfully.
Oct 24 15:11:22 hostname.domain systemd[1]: [email protected]: Consumed 268ms CPU time, 106.8M memory peak.
Oct 24 15:11:22 hostname.domain systemd[1]: fluent-bit.service: Main process exited, code=dumped, status=6/ABRT
Oct 24 15:11:22 hostname.domain systemd[1]: fluent-bit.service: Failed with result 'core-dump'.
Oct 24 15:11:22 hostname.domain systemd[1]: fluent-bit.service: Consumed 22.833s CPU time, 48.4M memory peak.
Oct 24 15:11:22 hostname.domain systemd[1]: fluent-bit.service: Scheduled restart job, restart counter is at 1.
  • Trace logs from fluent-bit before, during, and after the crash: fluent-bit.log.txt

  • I have been able to generate this backtrace from the coredump produced by the subsequent attempts to start the agent:


(gdb) backtrace 
#0  0x00007fd5df4bd9dc in __pthread_kill_implementation () from /lib64/libc.so.6
#1  0x00007fd5df467a96 in raise () from /lib64/libc.so.6
#2  0x00007fd5df44f8fa in abort () from /lib64/libc.so.6
#3  0x00000000004ae47d in flb_signal_handler (signal=11) at /root/rhel10-test/fluent-bit-4.1.1/src/fluent-bit.c:636
#4  <signal handler called>
#5  0x00000000011d0e84 in ZSTD_freeDDict (ddict=0x1) at /root/rhel10-test/fluent-bit-4.1.1/lib/zstd-1.5.7/lib/decompress/zstd_ddict.c:215
#6  0x00007fd5df9ad609 in ZSTD_freeDCtx () from /lib64/libzstd.so.1
#7  0x00007fd5e01e5b09 in journal_file_data_payload.isra () from /lib64/libsystemd.so.0
#8  0x00007fd5e0142d33 in sd_journal_enumerate_data () from /lib64/libsystemd.so.0
#9  0x00000000007466ea in in_systemd_collect (ins=0x390d8d60, config=0x390a7490, in_context=0x7fd5d0001090) at /root/rhel10-test/fluent-bit-4.1.1/plugins/in_systemd/systemd.c:387
#10 0x0000000000746b07 in in_systemd_collect_archive (ins=0x390d8d60, config=0x390a7490, in_context=0x7fd5d0001090) at /root/rhel10-test/fluent-bit-4.1.1/plugins/in_systemd/systemd.c:512
#11 0x0000000000504fcd in input_collector_fd (fd=39, ins=0x390d8d60) at /root/rhel10-test/fluent-bit-4.1.1/src/flb_input_thread.c:166
#12 0x0000000000505b0a in engine_handle_event (fd=39, mask=1, ins=0x390d8d60, config=0x390a7490) at /root/rhel10-test/fluent-bit-4.1.1/src/flb_input_thread.c:181
#13 input_thread (data=0x7fd5d801f4e0) at /root/rhel10-test/fluent-bit-4.1.1/src/flb_input_thread.c:420
#14 0x000000000057ca77 in step_callback (data=0x7fd5d8024b70) at /root/rhel10-test/fluent-bit-4.1.1/src/flb_worker.c:43
#15 0x00007fd5df4bbb68 in start_thread () from /lib64/libc.so.6
#16 0x00007fd5df52c6bc in clone3 () from /lib64/libc.so.6
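One detail in the backtrace above may be worth a closer look: frame #5 (ZSTD_freeDDict) resolves to the zstd-1.5.7 sources built into the fluent-bit binary, while frame #6 (ZSTD_freeDCtx) lives in the system /lib64/libzstd.so.1. Whether this mix is related to the crash is not established here. The sketch below only makes the observation easy to repeat on other backtraces: it lists where each ZSTD_* frame comes from. The regular expression and the function name `zstd_frame_origins` are illustrative assumptions, written against the gdb output format shown above.

```python
import re

# Matches gdb frames such as:
#   #5  0x... in ZSTD_freeDDict (ddict=0x1) at /path/zstd_ddict.c:215
#   #6  0x... in ZSTD_freeDCtx () from /lib64/libzstd.so.1
FRAME_RE = re.compile(
    r"^#(?P<num>\d+)\s+(?:0x[0-9a-f]+ in )?(?P<func>[\w.]+) \(.*?\)"
    r"(?: from (?P<lib>\S+)| at (?P<src>\S+?):\d+)?"
)


def zstd_frame_origins(backtrace):
    """Return (function, origin) pairs for every ZSTD_* frame.

    The origin is the shared library path when gdb prints "from ...",
    or the source file path when gdb prints "at file:line".
    """
    origins = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match and match.group("func").startswith("ZSTD_"):
            origin = match.group("lib") or match.group("src")
            origins.append((match.group("func"), origin))
    return origins
```

If the two frames report different origins, as they do here, the process has ended up calling into two separate copies of zstd.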

  • Steps to reproduce the problem:

The first crash, which occurs after the agent has been working without problems for a while, appears to be random; I have not been able to identify what triggers it. After the first crash (which is probably when the chunk file gets corrupted), the crash is reproducible using the attached chunk/db files under storage.path/systemd.0/ and storage.path/systemd.db.

Expected behavior

The agent should not crash with a coredump.

Screenshots

Your Environment

  • Version used: fluent-bit 4.0.13, 4.1.0 and 4.1.1
  • Configuration:
[SERVICE]
    # Flush
    # =====
    # set the interval, in seconds, between flushes of records to a destination
    flush        10

    # Daemon
    # ======
    # instruct Fluent Bit to run in foreground or background mode.
    daemon       Off

    # Log_file
    # ========
    # Absolute path for an optional log file. By default all logs are
    # redirected to the standard error interface (stderr).
    log_file  /var/log/fluent-bit/fluent-bit.log

    # Log_Level
    # =========
    # Set the verbosity level of the service, values can be:
    #
    # - error
    # - warning
    # - info
    # - debug
    # - trace
    #
    # by default 'info' is set, that means it includes 'error' and 'warning'.
    log_level    trace

    # Parsers File
    # ============
    # specify an optional 'Parsers' configuration file
    parsers_file parsers.conf

    # Plugins File
    # ============
    # specify an optional 'Plugins' configuration file to load external plugins.
    plugins_file plugins.conf

    # HTTP Server
    # ===========
    # Enable/Disable the built-in HTTP Server for metrics
    http_server  On
    http_listen  127.0.0.1
    http_port    2020

    # Storage
    # =======
    # Fluent Bit can use memory and filesystem buffering based mechanisms
    #
    # - https://docs.fluentbit.io/manual/administration/buffering-and-storage
    #
    # storage metrics
    # ---------------
    # publish storage pipeline metrics in '/api/v1/storage'. The metrics are
    # exported only if the 'http_server' option is enabled.
    #
    storage.metrics on

    # storage.path
    # ------------
    # absolute file system path to store filesystem data buffers (chunks).
    #
    storage.path /var/lib/fluent-bit/storage


    # storage.sync
    # ------------
    # configure the synchronization mode used to store the data into the
    # filesystem. It can take the values normal or full.
    #
    storage.sync normal

    # storage.checksum
    # ----------------
    # enable the data integrity check when writing and reading data from the
    # filesystem. The storage layer uses the CRC32 algorithm.
    #
    storage.checksum off

    # storage.backlog.mem_limit
    # -------------------------
    # if storage.path is set, Fluent Bit will look for data chunks that were
    # not delivered and are still in the storage layer, these are called
    # backlog data. This option configures a hint of the maximum amount of memory
    # to use when processing these records.
    #
    storage.backlog.mem_limit 100M

    # storage.max_chunks_up
    # ---------------------
    # If the input plugin has enabled filesystem storage type, this
    # property sets the maximum number of chunks that can be up in
    # memory. Use this setting to control memory usage when you enable
    # storage.type filesystem.
    #
    storage.max_chunks_up 128

    # storage.delete_irrecoverable_chunks
    # -----------------------------------
    # When enabled, irrecoverable chunks will be deleted during
    # runtime, and any other irrecoverable chunk located in the
    # configured storage path directory will be deleted when
    # Fluent-Bit starts. Accepted values: 'Off', 'On'.
    #
    storage.delete_irrecoverable_chunks on

    # scheduler.base
    # ---------------
    # Set a base of exponential backoff in seconds. 
    scheduler.base 5

    # scheduler.cap
    # -------------
    # Set a maximum retry time in seconds.
    scheduler.cap 900
    
[INPUT]
    Name    systemd
    Tag     logs_5000_systemd
    db      /var/lib/fluent-bit/storage/systemd.db
    db.Sync   Normal

    Mem_Buf_Limit 100MB
    storage.type filesystem
    storage.pause_on_chunks_overlimit on

    Read_From_Tail On
    Lowercase On
    Threaded true

[FILTER]
    Name modify
    Match logs_5000_systemd

    Add dataops.data_processor dataops-logs-systemd

    Add event.module systemd
    Add event.provider systemd
    Add event.dataset systemd.journald

    Add data_stream.namespace prod
    Add data_stream.dataset systemd.journald

    Add service.name linux-systemd

[FILTER]
    Name nest
    Match *

    Operation nest
    Wildcard dataops.*
    Nest_under dataops
    Remove_prefix dataops.

[FILTER]
    Name nest
    Match *

    Operation nest
    Wildcard event.*
    Nest_under event
    Remove_prefix event.

[FILTER]
    Name nest
    Match *

    Operation nest
    Wildcard data_stream.*
    Nest_under data_stream
    Remove_prefix data_stream.

[FILTER]
    Name nest
    Match *

    Operation nest
    Wildcard service.*
    Nest_under service
    Remove_prefix service.
    
[FILTER]
    Name modify
    Match *
    Add agent.type fluent-bit

[FILTER]
    Name sysinfo
    Match *
    Fluentbit_version_key agent.version
    Os_name_key os.name
    Os_version_key os.version
    Kernel_version_key os.kernel
    Hostname_key host.name

[FILTER]
    Name nest
    Match *
    
    Operation nest
    Wildcard agent.*
    Nest_under agent
    Remove_prefix agent.

[FILTER]
    Name nest
    Match *
    
    Operation nest
    Wildcard os.*
    Nest_under os
    Remove_prefix os.

[FILTER]
    Name nest
    Match *

    Operation nest
    Wildcard host.*
    Wildcard os*
    Nest_under host
    Remove_prefix host.

[OUTPUT]
    Name   http
    Match  logs_5000_*
    
    Host   server-receiver.example.org
    Port   5000

    Format json
    Workers 1
    storage.total_limit_size  100M
    Retry_Limit no_limits
    
    tls On
    tls.verify On
    tls.ca_file /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
    tls.crt_file /path/my.crt
    tls.key_file /path/my.key

fluent-bit.conf.txt / fluent-bit-systemd.conf.txt

  • Server type and version: Linux 6.12.0-55.38.1.el10_0.x86_64 x86_64 GNU/Linux
  • Operating System and version: Red Hat Enterprise Linux release 10.0 (Coughlan)
  • Filters and plugins: systemd(input), modify, nest, sysinfo, http(output)

Additional context
