indexer errors using latest edge #5639

Open
mzupan opened this issue Jan 20, 2025 · 2 comments
mzupan commented Jan 20, 2025

I've been using the latest edge image and I'm starting to see errors on the indexers.

In particular, I'm seeing a lot of:

quickwit-indexer-4 quickwit 2025-01-20T19:26:37.105Z  WARN quickwit_ingest::ingest_v2::ingester: failed to persist records to ingester `quickwit-indexer-4`: write-ahead log memory buffer is full: capacity: 6.4 GB, usage: 6.4 GB, requested: 10.2 MB

which seems to trigger these:

quickwit-indexer-4 quickwit 2025-01-20T19:26:49.270Z  WARN quickwit_ingest::ingest_v2::router: failed to persist records on ingester `quickwit-indexer-2`: too many requests
quickwit-indexer-4 quickwit 2025-01-20T19:26:49.303Z ERROR quickwit_ingest::ingest_v2::router: failed to persist records on ingester `quickwit-indexer-1`: too many requests
quickwit-indexer-4 quickwit 2025-01-20T19:26:49.738Z  WARN quickwit_ingest::ingest_v2::router: failed to persist records on ingester `quickwit-indexer-1`: too many requests
quickwit-indexer-4 quickwit 2025-01-20T19:26:49.794Z  WARN quickwit_ingest::ingest_v2::router: failed to persist records on ingester `quickwit-indexer-2`: too many requests

I also see these related to the indexes:

quickwit-indexer-3 quickwit 2025-01-20T20:16:54.033Z  WARN quickwit_indexing::actors::merge_planner: Rebuilding the known split ids set ended up not halving its size. Please report. This is likely a bug, please report. known_split_ids_len_after=286 known_split_ids_len_before=355

and

quickwit-indexer-4 quickwit 2025-01-20T20:17:05.622Z ERROR quickwit_ingest::ingest_v2::fetch: ingester `quickwit-indexer-69` is unavailable: closing fetch stream client_id=indexer/quickwit-indexer-4/infra-logs:01JERD66B3SAQHD5BKJC51TFGZ/_ingest-source/01JJ2DXW528JNMBBPZMJ6CRPQY index_uid=infra-logs:01JERD66B3SAQHD5BKJC51TFGZ source_id=_ingest-source shard_id=01JJ2BG1XM4JZXHP6R2SZNWCT1

My configmap is:

apiVersion: v1
data:
  node.yaml: |-
    version: 0.8
    listen_address: 0.0.0.0
    gossip_listen_port: 7282
    data_dir: /quickwit/qwdata
    default_index_root_uri: s3://iterable-ue1-observability-quickwit-prod/indexes
    storage:
      s3:
        region: us-east-1
    ingest_api:
      max_queue_disk_usage: 30GiB
      max_queue_memory_usage: 6GiB
    searcher:
      fast_field_cache_capacity: 2G
    metastore:
      postgres:
        max_num_connections: 50
kind: ConfigMap

I have a bunch of indexes which are all basically the same, using dynamic mapping.

Doc mapping:

{
  "doc_mapping_uid": "00000000000000000000000000",
  "mode": "dynamic",
  "dynamic_mapping": {
    "indexed": true,
    "tokenizer": "raw",
    "record": "basic",
    "stored": true,
    "expand_dots": true,
    "fast": {
      "normalizer": "raw"
    }
  },
  "field_mappings": [
    {
      "name": "timestamp",
      "type": "datetime",
      "fast": true,
      "fast_precision": "seconds",
      "indexed": true,
      "input_formats": [
        "iso8601"
      ],
      "output_format": "unix_timestamp_secs",
      "stored": true
    },
    {
      "name": "message",
      "type": "text",
      "fast": false,
      "fieldnorms": false,
      "indexed": true,
      "record": "position",
      "stored": true,
      "tokenizer": "default"
    }
  ],
  "timestamp_field": "timestamp",
  "tag_fields": [],
  "max_num_partitions": 200,
  "index_field_presence": false,
  "store_document_size": false,
  "store_source": false,
  "tokenizers": []
}

Index settings:

{
  "commit_timeout_secs": 60,
  "docstore_compression_level": 8,
  "docstore_blocksize": 1000000,
  "split_num_docs_target": 10000000,
  "merge_policy": {
    "type": "limit_merge",
    "merge_factor": 10,
    "max_merge_factor": 12,
    "max_merge_ops": 3,
    "maturation_period": "2days"
  },
  "resources": {
    "heap_size": "2.0 GB"
  }
}

There are 6 indexes that get a decent amount of traffic: around 761 MB/s and 203,619 docs/s.

Are things too overloaded?

On the disk side, nothing jumps out suggesting the EBS volumes are I/O bound or hitting their max IOPS. I've tried provisioning more throughput and IOPS with no improvement.

I'm using this image:

    Image:         quickwit/quickwit@sha256:c873b6f9aa7f5ee20628d0304f4ce7044d6f297844e4c841af0bdf792a2ad375

rdettai (Collaborator) commented Jan 23, 2025

@mzupan if the WAL is full, it likely means that indexing cannot keep up with your ingestion rate. I don't think you mentioned the resource spec of your indexers (except that there are 6 of them), but 761 MB/s is a pretty high throughput 😅. As a rule of thumb, we usually estimate the indexing rate at around 7.5 MB/s per core (see the sizing docs).
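(Back-of-the-envelope from those numbers: 761 MB/s ÷ 7.5 MB/s/core ≈ 100 cores of indexing capacity in total, or roughly 17 cores per indexer across the 6 indexers, assuming traffic is spread evenly.)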

guilload (Member) commented

Our WAL implementation is such that we only accept records if we have space in memory and on disk.
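(That memory limit matches the max_queue_memory_usage: 6GiB setting in the configmap above; 6 GiB ≈ 6.4 GB, which is the capacity reported in the write-ahead log warning.)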

6 GB / (761 MB/s ÷ 6 indexers) ≈ 47 seconds on average to fill up the WAL memory buffer of an indexer, but your commit timeout is 60 seconds. Increase the WAL memory buffer size (max_queue_memory_usage) to 8 GB or decrease your commit timeout to 45s.
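For example, a minimal sketch of the two knobs against the configs already posted above (only the changed values are shown; units follow the existing node.yaml, which uses GiB):

    ingest_api:
      max_queue_disk_usage: 30GiB
      max_queue_memory_usage: 8GiB   # was 6GiB; at or above the suggested 8 GB

or, lowering just the commit timeout in the index settings:

    {
      "commit_timeout_secs": 45
    }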

Let us know how that goes.
