fix(eth): eth client timeout while watching #553

Draft · wants to merge 3 commits into base: synapse

Conversation

@rvagg (Member) commented Jul 1, 2025

Still failing in watch_eth.go, but I have a little bit more information with additional logging turned on.

Last output for watch_eth.go working on stuff is:

2025-06-30T13:11:00.927Z        INFO    curio/message   message/watch_eth.go:104        Processing pending transactions {"count": 12, "machineID": 1}
2025-06-30T13:11:00.927Z        DEBUG   curio/message   message/watch_eth.go:112        Checking transaction    {"txHash": "0x0ee2d4acfb636e314a0a544a84e60873aebaa39c607f500e5ac0b4d009953f6e"}
2025-06-30T13:11:00.928Z        DEBUG   curio/message   message/watch_eth.go:119        Transaction still pending       {"txHash": "0x0ee2d4acfb636e314a0a544a84e60873aebaa39c607f500e5ac0b4d009953f6e"}
2025-06-30T13:11:00.928Z        DEBUG   curio/message   message/watch_eth.go:112        Checking transaction    {"txHash": "0x6a04975f33972c2d7a0977ba88bae36c6123a292b5ba512dc086b09de2c0acc4"}

After that we just have this:

2025-07-01T01:54:30.072Z        DEBUG   curio/message   message/watch_eth.go:246        Head change received    {"newHeight": "2800283"}
2025-07-01T01:54:30.072Z        DEBUG   curio/message   message/watch_eth.go:251        Update channel full, skipping   {"height": "2800283"}
2025-07-01T01:55:00.080Z        DEBUG   curio/message   message/watch_eth.go:246        Head change received    {"newHeight": "2800284"}
2025-07-01T01:55:00.080Z        DEBUG   curio/message   message/watch_eth.go:251        Update channel full, skipping   {"height": "2800284"}

Which looks to me like a lock-up here since either return path should log something:

receipt, err := mw.api.TransactionReceipt(ctx, txHash)
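For reference, the check around that call is roughly shaped like the sketch below (illustrative only, not the actual watch_eth.go code; the ethereum.NotFound handling and the exact log wording are assumptions, and errors, the go-ethereum root package, and the go-log logger are assumed to be in scope):

// Sketch of the per-transaction receipt check: every branch after
// TransactionReceipt logs something, so silence here points at the call
// itself never returning rather than an unlogged early return.
receipt, err := mw.api.TransactionReceipt(ctx, txHash)
if errors.Is(err, ethereum.NotFound) {
	log.Debugw("Transaction still pending", "txHash", txHash) // pending path logs
	return nil
}
if err != nil {
	log.Errorw("Failed to get transaction receipt", "txHash", txHash, "error", err) // error path logs
	return err
}
log.Debugw("Transaction receipt found", "txHash", txHash, "status", receipt.Status) // success path logs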

At the same time, Lotus shows us:

{"level":"error","ts":"2025-06-30T13:11:00.929Z","logger":"node/eth","caller":"eth/transaction.go:414","msg":"failed to lookup transaction hash 0x6a04975f33972c2d7a0977ba88bae36c6123a292b5ba512dc086b09de2c0acc4 in chain indexer: failed to read data from index: context canceled"}
{"level":"warn","ts":"2025-06-30T13:11:00.929Z","logger":"rpc","caller":"[email protected]/handler.go:421","msg":"error in RPC call to 'eth_getTransactionReceipt': failed to lookup transaction hash 0x6a04975f33972c2d7a0977ba88bae36c6123a292b5ba512dc086b09de2c0acc4 in chain indexer:\n    github.com/filecoin-project/lotus/node/impl/eth.(*ethTransaction).getCidForTransaction\n        /root/repos/lotus/node/impl/eth/transaction.go:415\n  - failed to read data from index:\n    github.com/filecoin-project/lotus/chain/index.(*SqliteIndexer).readWithHeadIndexWait\n        /root/repos/lotus/chain/index/read.go:88\n  - context canceled"}
{"level":"error","ts":"2025-06-30T13:11:00.929Z","logger":"rpc","caller":"[email protected]/websocket.go:145","msg":"handle me:write tcp4 127.0.0.1:1234->127.0.0.1:37802: use of closed network connection"}

That path suggests we hit the moment in the Lotus ChainIndexer where it had a new chain head that wasn't quite indexed yet, so it had to wait for the indexing to catch up; that should be quick, but there's a 5 second timeout on the wait. That particular timeout isn't what's returned here, though: it's a timeout from somewhere further up the stack, I believe go-jsonrpc, maybe a deadline in the websocket handler, although I can't figure out exactly where. It might be the generic 30 second timeout that go-jsonrpc sets up while listening for responses, but that seems far too long to be relevant here: the timing says Curio asked Lotus for an eth_getTransactionReceipt at 13:11:00.928 and Lotus had already logged the error at 13:11:00.929!

The fact that we don't progress any further than this might suggest that go-ethereum/ethclient doesn't know how to deal with the error that comes out of Lotus for this event, or maybe one of the websocket handlers (on either side) doesn't. So my strategy here is to put a 1s timeout per transaction in the processing loop and move on if we don't get a response within that window; hopefully whatever part of the stack is stalling will respond to a context kick.
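A minimal sketch of that strategy (assumptions: a 1s budget per call, and placeholder names like pendingTxs for whatever the processing loop actually iterates over; context, time, and errors from the standard library are assumed to be imported):

// Give each receipt lookup its own short-lived context so one stalled RPC
// call can't wedge the whole processing loop; a timed-out tx is simply
// picked up again on the next pass.
for _, txHash := range pendingTxs {
	txCtx, cancel := context.WithTimeout(ctx, time.Second)
	receipt, err := mw.api.TransactionReceipt(txCtx, txHash)
	cancel()
	if errors.Is(err, context.DeadlineExceeded) || errors.Is(txCtx.Err(), context.DeadlineExceeded) {
		log.Debugw("Eth calls timed out - continuing with next tx", "txHash", txHash)
		continue
	}
	// ... handle receipt / still-pending / error as before ...
	_ = receipt
}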

@rvagg rvagg changed the title Rvagg/eth client timeout and log fix(eth): eth client timeout while watching Jul 1, 2025
@rvagg (Member, Author) commented Jul 2, 2025

Caught one with this PR:

2025-07-01T12:03:00.264Z        DEBUG   curio/message   message/watch_eth.go:119        Checking transaction    {"txHash": "0x4dbdfd2bada73c098fc1b46ecb435cf8dd1fb31270b0bd2f844111c08dc94a83"}
2025-07-01T12:03:01.264Z        DEBUG   curio/message   message/watch_eth.go:134        Eth calls timed out - continuing with next tx   {"txHash": "0x4dbdfd2bada73c098fc1b46ecb435cf8dd1fb31270b0bd2f844111c08dc94a83"}

And in Lotus:

{"level":"error","ts":"2025-07-01T12:03:00.264Z","logger":"node/eth","caller":"eth/transaction.go:414","msg":"failed to lookup transaction hash 0x4dbdfd2bada73c098fc1b46ecb435cf8dd1fb31270b0bd2f844111c08dc94a83 in chain indexer: failed to read data from index: context canceled"}
{"level":"warn","ts":"2025-07-01T12:03:00.264Z","logger":"rpc","caller":"[email protected]/handler.go:421","msg":"error in RPC call to 'eth_getTransactionReceipt': failed to lookup transaction hash 0x4dbdfd2bada73c098fc1b46ecb435cf8dd1fb31270b0bd2f844111c08dc94a83 in chain indexer:\n    github.com/filecoin-project/lotus/node/impl/eth.(*ethTransaction).getCidForTransaction\n        /root/repos/lotus/node/impl/eth/transaction.go:415\n  - failed to read data from index:\n    github.com/filecoin-project/lotus/chain/index.(*SqliteIndexer).readWithHeadIndexWait\n        /root/repos/lotus/chain/index/read.go:88\n  - context canceled"}

But that tx is now stuck in perpetual pending mode:

$ grep 0x4dbdfd2bada73c098fc1b46ecb435cf8dd1fb31270b0bd2f844111c08dc94a83 /var/log/curio/daemon.log | grep 'still pending' | wc -l
25742

But it's on chain: https://calibration.filfox.info/en/message/0x4dbdfd2bada73c098fc1b46ecb435cf8dd1fb31270b0bd2f844111c08dc94a83?t=1

Lotus just doesn't know about it:

curl http://localhost:1234/rpc/v1 -X POST -H 'Content-Type: application/json' -d '{"jsonrpc":"2.0","id":0,"method":"eth_getTransactionByHash","params":["0x4dbdfd2bada73c098fc1b46ecb435cf8dd1fb31270b0bd2f844111c08dc94a83"]}'
{"id":0,"jsonrpc":"2.0","result":null}

But on my Calibnet node:

$ curl http://localhost:1235/rpc/v1 -X POST -H 'Content-Type: application/json' -d '{"jsonrpc":"2.0","id":0,"method":"eth_getTransactionByHash","params":["0x3005f4a91b26648cf54d94933d147798885aa38cce2dc6cdebb66f53693cc75d"]}' | jq .result.blockHash
"0xae6133f78a2d8d47814e46e9139807f2dca1e77ec66e309bd83513bf916d9136"

So this looks kind of like a ChainIndexer bug as well.

@rvagg (Member, Author) commented Jul 2, 2025

Oh, actually, that tx is from 2025-06-21 17:16:00, which is why it's not in the ChainIndexer or in the blockstore: it's too old. So this is back to the original problem: hitting Lotus, getting that timeout error, and then locking up; our timeout in this PR is just kicking it back into action. This particular transaction is a re-processing of one that was missed previously on this Curio node while it was locked up, and splitstore removed it before it could be observed and processed, so now it's in an endless loop looking for a transaction it will never find, unfortunately. And I don't have a calibnet node I can point it to that has deeper history.

So, a couple of problems here:

  1. Something to do with the Eth transaction lookup; it could be ChainIndexer related, or go-jsonrpc, or something else, but go-ethereum/ethclient really doesn't like the resulting error and locks up, and we shouldn't be triggering that behaviour on Lotus in the first place.
  2. Dealing with the lock-up here. Ideally ethclient would be more robust and not lock up, but apparently it isn't, so we have to handle it on our side; that's what this PR does.

I'll observe this a little longer and then tidy it up for potential merge.
