Db-sync Hang Causing Midnight Node Outages #1949

Open
Fentonhaslam opened this issue Mar 4, 2025 · 5 comments

Comments

@Fentonhaslam

Fentonhaslam commented Mar 4, 2025

We have observed multiple incidents where db-sync hangs, causing Midnight node outages due to a failure in importing blocks. This issue has occurred intermittently and has been temporarily resolved with pod restarts, but a more robust solution is needed to prevent manual intervention.

Problem:
db-sync hanging results in the node failing to import blocks, with the following error:

sync: :broken_heart: Verification failed for block 0xd0c306e29f09841635ae13ade6f9dce33e6ed2b3b565eac826154e23a89d475a received from (12D3KooWQF1x9ffPo73DRK8XKPw1Ev9BnJhNQc6QBke1tLnssumX): "Main chain state d23e68ee90dcc4677b2f67152daf8e08ebb3cf9507b9a587120882c280ed0c05 referenced in imported block at slot 290165258 with timestamp 1740991548000 not found"

Image

db-sync was 3 hours behind the Cardano tip when the problem was discovered. The Cardano node was synced and importing blocks, indicating that the issue is isolated to db-sync.
One db-sync pod entered this stuck state, requiring a manual restart to recover.
Another db-sync pod self-recovered without intervention, though logs suggest it may have undergone an automatic pod refresh.
We need a root cause analysis to determine why db-sync enters this state and a fix that eliminates the need for manual restarts.

Logs & Observations:
The last log message from db-sync showed a successful tip import before pausing indefinitely.
If db-sync stops receiving blocks, it does not necessarily throw an error, making detection and recovery more difficult.
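
One way to detect this kind of silent stall is to compare the timestamp of the newest block db-sync has written against the wall clock. The following is a minimal sketch, assuming the standard db-sync schema in which the `block` table has a `time` column stored in UTC. The function name, the 10-minute threshold, and the idea of wiring the check into a health probe are illustrative assumptions, not part of db-sync itself.

```python
from datetime import datetime, timedelta, timezone

# Query to run against the db-sync database (assumes the `block.time` column).
LATEST_BLOCK_SQL = "SELECT max(time) FROM block;"


def is_stalled(latest_block_time, now, max_lag=timedelta(minutes=10)):
    """Return True when the newest block in db-sync is older than max_lag.

    latest_block_time: result of LATEST_BLOCK_SQL as a naive UTC datetime,
                       or None if the block table is empty.
    now:               current time as a timezone-aware UTC datetime.
    max_lag:           hypothetical threshold; tune it for your network.
    """
    if latest_block_time is None:
        # No blocks at all is treated as "not healthy" in this sketch.
        return True
    lag = now - latest_block_time.replace(tzinfo=timezone.utc)
    return lag > max_lag


# Example: the newest block is 3 hours old, as in the incident described above.
now = datetime(2025, 3, 3, 12, 0, tzinfo=timezone.utc)
print(is_stalled(datetime(2025, 3, 3, 9, 0), now))    # True
print(is_stalled(datetime(2025, 3, 3, 11, 58), now))  # False
```

A check like this would also report a stall during an initial sync from genesis, so the threshold and any start-up grace period would need tuning before using it to trigger automatic restarts.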

@rdlrt
Contributor

rdlrt commented Mar 4, 2025

Not from the team, but just commenting on basics:

  • Version details for the infrastructure you're running are missing, as are the config options for dbsync (which have a high impact on system requirements)
  • It's unclear what you're trying to show with the screenshot; it doesn't (at least at a quick glance) highlight any issues
  • Where is the log message you pasted from? It does not look like a typical dbsync-formatted message. What do the dbsync logs show before and after the hang?
  • What are the infra specs where the mentioned container/pod is running?
  • If the pod was refreshed/restarted, it might point to infrastructure sizing

@sgillespie
Contributor

sgillespie commented Mar 4, 2025

Another db-sync pod self-recovered without intervention, though logs suggest it may have undergone an automatic pod refresh.

What do you mean it "refreshed"? Did Kubernetes restart it? If so, can you find out why it did? Also, are you saving off all the logs?

EDIT: Also worth noting, this is a preview node

@sgillespie
Contributor

sgillespie commented Mar 4, 2025

  • Where is the log message you pasted from? It does not look like a typical dbsync-formatted message. What do the dbsync logs show before and after the hang?

This looks like a midnight node message

@ozgb

ozgb commented Mar 10, 2025

I'm also on the Midnight node team - I can answer the questions here:

Version details for the infrastructure you're running are missing, as are the config options for dbsync (which have a high impact on system requirements)

This pod was running the ghcr.io/intersectmbo/cardano-db-sync:13.5.0.2 image - I checked the changelog to see if a fix was included in more recent versions - I didn't immediately spot any issue that might cause the hang, but I'll upgrade our images to the latest (13.6.0.4)

In terms of config, we have the following environment vars set:

Environment:
  NETWORK:                   preview
  CARDANO_NODE_SOCKET_PATH:  /node-ipc/node.socket
  POSTGRES_DB:               cexplorer
  POSTGRES_USER:             cardano
  POSTGRES_PORT:             5432
  POSTGRES_HOST:             psql-dbsync-cardano-01-primary
  POSTGRES_DB:               cexplorer
  POSTGRES_USER:             cardano
  POSTGRES_PASSWORD:         <set to the key 'password' in secret 'psql-dbsync-cardano-01-pguser-cardano'>  Optional: false
  POSTGRES_PORT:             5432

It's unclear what you're trying to show with the screenshot; it doesn't (at least at a quick glance) highlight any issues

The screenshot is only really important for the timestamps - the screenshot was taken at ~12:00, so it shows that db-sync has not progressed since then

Where is the log message you pasted from? It does not look like a typical dbsync-formatted message. What do the dbsync logs show before and after the hang?

As pointed out by @sgillespie, the log is from the midnight-node. We're running as a partner-chain, and that log message is from their code. It shows that the node can't find the referenced state in the Cardano preview network ("main chain"); the source of that data is db-sync.
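
To correlate the error with db-sync's progress, the message can be parsed and its timestamp converted to UTC. This is a minimal sketch, assuming the `timestamp` field is milliseconds since the Unix epoch and that the message keeps the exact wording quoted in this issue. The helper name and the regular expression are illustrative and are not taken from the midnight-node code.

```python
import re
from datetime import datetime, timezone

# Message text copied from the error quoted in this issue.
MESSAGE = (
    "Main chain state "
    "d23e68ee90dcc4677b2f67152daf8e08ebb3cf9507b9a587120882c280ed0c05 "
    "referenced in imported block at slot 290165258 "
    "with timestamp 1740991548000 not found"
)

# Hypothetical pattern matching the wording above.
PATTERN = re.compile(
    r"Main chain state (?P<state>[0-9a-f]{64}) referenced in imported block "
    r"at slot (?P<slot>\d+) with timestamp (?P<ts_ms>\d+) not found"
)


def parse_missing_state(message):
    """Extract the state hash, slot, and UTC time from the error, or None."""
    match = PATTERN.search(message)
    if match is None:
        return None
    # Assumption: the timestamp is in milliseconds since the Unix epoch.
    when = datetime.fromtimestamp(int(match["ts_ms"]) / 1000, tz=timezone.utc)
    return {"state": match["state"], "slot": int(match["slot"]), "time": when}


info = parse_missing_state(MESSAGE)
print(info["slot"])              # 290165258
print(info["time"].isoformat())  # 2025-03-03T08:45:48+00:00
```

Under that assumption, the block that failed verification is dated 08:45:48 UTC on March 3, 2025, which can be compared against the time of the newest block recorded by db-sync.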

What are the infra specs where the mentioned container/pod is running?

I'll double-check this and comment back here

If the pod was refreshed/restarted, it might point to infrastructure sizing

The pods are refreshed at regular intervals. I'll have to double-check the reasoning behind it, but it's partly a chaos engineering strategy.

@Fentonhaslam
Author

What are the infra specs where the mentioned container/pod is running?

db-sync-cardano-10-0 Service Deployment Information

Pod Details

  • Pod Name: db-sync-cardano-10-0
  • Namespace: testnet-02
  • Controller: StatefulSet/db-sync-cardano-10
  • Status: Running
  • Pod IP: 10.14.90.5
  • Node: ip-10-14-90-40.eu-west-1.compute.internal

Image Information

  • db-sync Container Image: ghcr.io/intersectmbo/cardano-db-sync:13.5.0.2

Node

  • CPU: 8 cores
  • Memory: 32.3 GiB total (31.3 GiB allocatable)
  • Ephemeral Storage: ~40 GiB total (~37.5 GiB allocatable)

Volume

  • Available Volume for Syncing Data: 80 GiB
