Description
We have observed multiple incidents in which db-sync hangs, causing Midnight node outages because the node can no longer import blocks. The issue occurs intermittently and has so far been worked around with pod restarts, but a more robust solution is needed to remove the need for manual intervention.
Problem:
When db-sync hangs, the node fails to import blocks and reports the following error:
sync: :broken_heart: Verification failed for block 0xd0c306e29f09841635ae13ade6f9dce33e6ed2b3b565eac826154e23a89d475a received from (12D3KooWQF1x9ffPo73DRK8XKPw1Ev9BnJhNQc6QBke1tLnssumX): "Main chain state d23e68ee90dcc4677b2f67152daf8e08ebb3cf9507b9a587120882c280ed0c05 referenced in imported block at slot 290165258 with timestamp 1740991548000 not found"
When the issue was discovered, db-sync was 3 hours behind the Cardano tip. The Cardano node itself was synced and importing blocks, which indicates that the issue is isolated to db-sync.
One db-sync pod entered this stuck state, requiring a manual restart to recover.
Another db-sync pod self-recovered without intervention, though logs suggest it may have undergone an automatic pod refresh.
We need a root cause analysis to determine why db-sync enters this state and a fix that eliminates the need for manual restarts.
Logs & Observations:
The last log message from db-sync showed a successful tip import; after that, db-sync paused indefinitely.
If db-sync stops receiving blocks, it does not necessarily throw an error, which makes the hang harder to detect and recover from.
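Because a hang produces no error, one possible detection approach is to check how far the newest block in the db-sync database lags behind the current time, and to fail a health check once the lag passes a threshold. The following is a minimal sketch of that idea, not a proposed fix for the root cause. The names `MAX_TIP_LAG`, `tip_lag`, `is_stalled`, and `probe_exit_code` are hypothetical, and the 10-minute threshold is an arbitrary placeholder. It assumes the caller obtains the newest block time separately (for example with a query such as `SELECT max(time) FROM block` against the db-sync database) and that naive timestamps are in UTC.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical threshold: how far db-sync may lag before it is treated as stuck.
MAX_TIP_LAG = timedelta(minutes=10)


def tip_lag(last_block_time: datetime, now: Optional[datetime] = None) -> timedelta:
    """Return how far the newest block known to db-sync lags behind `now`."""
    if last_block_time.tzinfo is None:
        # Assumption: naive timestamps read from the database are in UTC.
        last_block_time = last_block_time.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_block_time


def is_stalled(
    last_block_time: datetime,
    now: Optional[datetime] = None,
    max_lag: timedelta = MAX_TIP_LAG,
) -> bool:
    """True when the newest block is older than the allowed lag."""
    return tip_lag(last_block_time, now) > max_lag


def probe_exit_code(last_block_time: datetime, now: Optional[datetime] = None) -> int:
    """Exit code for an exec-style health check: 0 is healthy, 1 is stalled."""
    return 1 if is_stalled(last_block_time, now) else 0
```

A check like this could back a pod liveness probe so that a stalled db-sync is restarted automatically rather than by hand. Note that the lag is also large during a legitimate initial sync, so the threshold and any startup grace period would need tuning.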