indexer errors using latest edge #5639

Open
mzupan opened this issue Jan 20, 2025 · 2 comments
mzupan commented Jan 20, 2025

I've been using the latest edge image and I'm starting to see errors on the indexers.

In particular, I'm seeing a lot of:

quickwit-indexer-4 quickwit 2025-01-20T19:26:37.105Z  WARN quickwit_ingest::ingest_v2::ingester: failed to persist records to ingester `quickwit-indexer-4`: write-ahead log memory buffer is full: capacity: 6.4 GB, usage: 6.4 GB, requested: 10.2 MB

which seems to trigger these:

quickwit-indexer-4 quickwit 2025-01-20T19:26:49.270Z  WARN quickwit_ingest::ingest_v2::router: failed to persist records on ingester `quickwit-indexer-2`: too many requests
quickwit-indexer-4 quickwit 2025-01-20T19:26:49.303Z ERROR quickwit_ingest::ingest_v2::router: failed to persist records on ingester `quickwit-indexer-1`: too many requests
quickwit-indexer-4 quickwit 2025-01-20T19:26:49.738Z  WARN quickwit_ingest::ingest_v2::router: failed to persist records on ingester `quickwit-indexer-1`: too many requests
quickwit-indexer-4 quickwit 2025-01-20T19:26:49.794Z  WARN quickwit_ingest::ingest_v2::router: failed to persist records on ingester `quickwit-indexer-2`: too many requests

I also see these related to the indexes:

quickwit-indexer-3 quickwit 2025-01-20T20:16:54.033Z  WARN quickwit_indexing::actors::merge_planner: Rebuilding the known split ids set ended up not halving its size. Please report. This is likely a bug, please report. known_split_ids_len_after=286 known_split_ids_len_before=355

and

quickwit-indexer-4 quickwit 2025-01-20T20:17:05.622Z ERROR quickwit_ingest::ingest_v2::fetch: ingester `quickwit-indexer-69` is unavailable: closing fetch stream client_id=indexer/quickwit-indexer-4/infra-logs:01JERD66B3SAQHD5BKJC51TFGZ/_ingest-source/01JJ2DXW528JNMBBPZMJ6CRPQY index_uid=infra-logs:01JERD66B3SAQHD5BKJC51TFGZ source_id=_ingest-source shard_id=01JJ2BG1XM4JZXHP6R2SZNWCT1

My configmap is:

apiVersion: v1
data:
  node.yaml: |-
    version: 0.8
    listen_address: 0.0.0.0
    gossip_listen_port: 7282
    data_dir: /quickwit/qwdata
    default_index_root_uri: s3://iterable-ue1-observability-quickwit-prod/indexes
    storage:
      s3:
        region: us-east-1
    ingest_api:
      max_queue_disk_usage: 30GiB
      max_queue_memory_usage: 6GiB
    searcher:
      fast_field_cache_capacity: 2G
    metastore:
      postgres:
        max_num_connections: 50
kind: ConfigMap

I have a bunch of indexes which are all basically the same, using dynamic mapping.

Doc mapping:

{
  "doc_mapping_uid": "00000000000000000000000000",
  "mode": "dynamic",
  "dynamic_mapping": {
    "indexed": true,
    "tokenizer": "raw",
    "record": "basic",
    "stored": true,
    "expand_dots": true,
    "fast": {
      "normalizer": "raw"
    }
  },
  "field_mappings": [
    {
      "name": "timestamp",
      "type": "datetime",
      "fast": true,
      "fast_precision": "seconds",
      "indexed": true,
      "input_formats": [
        "iso8601"
      ],
      "output_format": "unix_timestamp_secs",
      "stored": true
    },
    {
      "name": "message",
      "type": "text",
      "fast": false,
      "fieldnorms": false,
      "indexed": true,
      "record": "position",
      "stored": true,
      "tokenizer": "default"
    }
  ],
  "timestamp_field": "timestamp",
  "tag_fields": [],
  "max_num_partitions": 200,
  "index_field_presence": false,
  "store_document_size": false,
  "store_source": false,
  "tokenizers": []
}

Index settings:

{
  "commit_timeout_secs": 60,
  "docstore_compression_level": 8,
  "docstore_blocksize": 1000000,
  "split_num_docs_target": 10000000,
  "merge_policy": {
    "type": "limit_merge",
    "merge_factor": 10,
    "max_merge_factor": 12,
    "max_merge_ops": 3,
    "maturation_period": "2days"
  },
  "resources": {
    "heap_size": "2.0 GB"
  }
}

There are 6 indexes that get a decent amount of traffic: around 761 MB/s and 203,619 docs/s.

Are things too overloaded?

On the disk side, nothing jumps out suggesting the EBS volumes are I/O bound or hitting their max IOPS. I've tried provisioning more throughput and IOPS with no improvement.

I'm using this image:

    Image:         quickwit/quickwit@sha256:c873b6f9aa7f5ee20628d0304f4ce7044d6f297844e4c841af0bdf792a2ad375

rdettai (Collaborator) commented Jan 23, 2025

@mzupan if the WAL is full, it likely means that indexing cannot keep up with your ingestion rate. I don't think you mentioned the resource spec of your indexers (except that there are 6 of them), but 761 MB/s is a pretty high throughput 😅. As a rule of thumb, we usually estimate the indexing rate at around 7.5 MB/s per core (see the sizing docs).
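(Back-of-the-envelope from those numbers: 761 MB/s ÷ 7.5 MB/s/core ≈ 100 cores of indexing capacity in total, or roughly 17 cores per indexer across the 6 indexers, assuming traffic is spread evenly.)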

guilload (Member) commented

Our WAL implementation is such that we only accept records if we have space in memory and on disk.
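(That memory limit matches the max_queue_memory_usage: 6GiB setting in the configmap above; 6 GiB ≈ 6.4 GB, which is the capacity reported in the write-ahead log warning.)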

6 GB / (761 MB/s ÷ 6 indexers) ≈ 47 seconds on average to fill up the WAL memory buffer of an indexer, but your commit timeout is 60 seconds. Increase the WAL memory buffer size (max_queue_memory_usage) to 8 GB or decrease your commit timeout to 45s.
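For example, a minimal sketch of the two knobs against the configs already posted above (only the changed values are shown; units follow the existing node.yaml, which uses GiB):

    ingest_api:
      max_queue_disk_usage: 30GiB
      max_queue_memory_usage: 8GiB   # was 6GiB; at or above the suggested 8 GB

or, lowering just the commit timeout in the index settings:

    {
      "commit_timeout_secs": 45
    }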

Let us know how that goes.
