Db-sync Hang Causing Midnight Node Outages #1949
Comments
Not from the team, but just commenting on basics:

What do you mean it "refreshed"? Did Kubernetes restart it? If so, can you find out why it did? Also, are you saving off all the logs?

EDIT: Also worth noting, this is a preview node.
This looks like a midnight-node message.
I'm also on the Midnight node team - I can answer the questions here:
This pod was running the …

In terms of config, we have the following environment vars set: …
The screenshot is only really important for the timestamps: it was taken at ~12:00, so it shows that db-sync had not progressed since then.
As pointed out by @sgillespie, the log is from the midnight-node. We're running as a partner chain, and that log message is from their code. It shows that the node can't find the referenced state in the Cardano preview network ("main chain"); the source of that data is db-sync.
I'll double-check this and comment back here
The pods are refreshed at regular intervals. I'll have to double-check the reasoning behind it; it's partly a chaos engineering strategy.
What are the infra specs where the mentioned container/pod is running?

db-sync-cardano-10-0 Service Deployment Information (collapsed sections: Pod Details, Image Information, Node, Volume)
We have observed multiple incidents where db-sync hangs, causing Midnight node outages because the node fails to import blocks. The issue occurs intermittently and has been temporarily resolved with pod restarts, but a more robust solution is needed so manual intervention is not required.
Problem:
When db-sync hangs, the node fails to import blocks with the following error:
sync: :broken_heart: Verification failed for block 0xd0c306e29f09841635ae13ade6f9dce33e6ed2b3b565eac826154e23a89d475a received from (12D3KooWQF1x9ffPo73DRK8XKPw1Ev9BnJhNQc6QBke1tLnssumX): "Main chain state d23e68ee90dcc4677b2f67152daf8e08ebb3cf9507b9a587120882c280ed0c05 referenced in imported block at slot 290165258 with timestamp 1740991548000 not found"
db-sync was ~3 hours behind the Cardano tip when the issue was discovered. The Cardano node itself is synced and importing blocks, indicating that the issue is isolated to db-sync.
One db-sync pod entered this stuck state, requiring a manual restart to recover.
Another db-sync pod self-recovered without intervention, though logs suggest it may have undergone an automatic pod refresh.
We need a root cause analysis to determine why db-sync enters this state and a fix that eliminates the need for manual restarts.
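To put a number on "hours behind", the lag can be read straight out of db-sync's Postgres database. The sketch below is a hypothetical illustration, not part of this project's tooling: it assumes the standard cardano-db-sync schema (a `block` table with a `time` column) and placeholder connection details; the DSN and threshold would need adapting to the actual deployment.

```python
#!/usr/bin/env python3
"""Hypothetical lag check against db-sync's Postgres database.

Assumes the standard cardano-db-sync schema (block.time) and placeholder
connection settings; adapt the DSN and threshold to the real deployment.
"""
import sys
from datetime import datetime, timezone

import psycopg2

DSN = "dbname=cexplorer user=postgres host=localhost"  # placeholder
MAX_LAG_SECONDS = 600  # flag anything more than 10 minutes behind

def main() -> int:
    conn = psycopg2.connect(DSN)
    try:
        with conn.cursor() as cur:
            # Timestamp of the newest block db-sync has written so far.
            cur.execute("SELECT max(time) FROM block;")
            (tip_time,) = cur.fetchone()
    finally:
        conn.close()

    if tip_time is None:
        print("db-sync has not written any blocks yet")
        return 1

    # block.time is stored without a time zone (UTC by convention),
    # so attach UTC before comparing against the current time.
    lag = (datetime.now(timezone.utc) - tip_time.replace(tzinfo=timezone.utc)).total_seconds()
    print(f"db-sync tip lag: {lag:.0f}s")
    return 0 if lag <= MAX_LAG_SECONDS else 1

if __name__ == "__main__":
    sys.exit(main())
```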
Logs & Observations:
The last log message from db-sync showed a successful tip import; after that, db-sync paused indefinitely.
If db-sync stops receiving blocks, it does not necessarily throw an error, making detection and recovery more difficult.
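Because db-sync can stall silently, one possible mitigation is an external stall detector: sample the highest block_no twice and report unhealthy if it has not advanced, then wire that into a Kubernetes exec liveness probe or an alerting sidecar so the restarts become automatic rather than manual. The sketch below is only an illustration of that idea, not existing tooling in this repo; the schema, DSN, and the five-minute window are assumptions and would need tuning against the expected block rate on the target network.

```python
#!/usr/bin/env python3
"""Hypothetical stall detector for db-sync.

Samples block.block_no twice and exits non-zero if the tip did not advance,
so an orchestrator (e.g. a Kubernetes exec liveness probe) can restart the pod.
Schema and DSN are assumptions based on a standard cardano-db-sync setup.
"""
import sys
import time

import psycopg2

DSN = "dbname=cexplorer user=postgres host=localhost"  # placeholder
INTERVAL_SECONDS = 300  # how long db-sync may go without writing a new block

def latest_block_no(dsn: str) -> int:
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT coalesce(max(block_no), 0) FROM block;")
            (block_no,) = cur.fetchone()
            return block_no
    finally:
        conn.close()

def main() -> int:
    first = latest_block_no(DSN)
    time.sleep(INTERVAL_SECONDS)
    second = latest_block_no(DSN)
    if second > first:
        print(f"db-sync advancing: block {first} -> {second}")
        return 0
    print(f"db-sync stalled at block {first} for at least {INTERVAL_SECONDS}s")
    return 1  # non-zero exit lets the orchestrator recycle the pod

if __name__ == "__main__":
    sys.exit(main())
```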