fix(eth): eth client timeout while watching #553

Draft · wants to merge 3 commits into base: synapse

Conversation

@rvagg (Member) commented Jul 1, 2025

Still failing in watch_eth.go, but I have a little bit more information with additional logging turned on.

Last output for watch_eth.go working on stuff is:

2025-06-30T13:11:00.927Z        INFO    curio/message   message/watch_eth.go:104        Processing pending transactions {"count": 12, "machineID": 1}
2025-06-30T13:11:00.927Z        DEBUG   curio/message   message/watch_eth.go:112        Checking transaction    {"txHash": "0x0ee2d4acfb636e314a0a544a84e60873aebaa39c607f500e5ac0b4d009953f6e"}
2025-06-30T13:11:00.928Z        DEBUG   curio/message   message/watch_eth.go:119        Transaction still pending       {"txHash": "0x0ee2d4acfb636e314a0a544a84e60873aebaa39c607f500e5ac0b4d009953f6e"}
2025-06-30T13:11:00.928Z        DEBUG   curio/message   message/watch_eth.go:112        Checking transaction    {"txHash": "0x6a04975f33972c2d7a0977ba88bae36c6123a292b5ba512dc086b09de2c0acc4"}

After that we just have this:

2025-07-01T01:54:30.072Z        DEBUG   curio/message   message/watch_eth.go:246        Head change received    {"newHeight": "2800283"}
2025-07-01T01:54:30.072Z        DEBUG   curio/message   message/watch_eth.go:251        Update channel full, skipping   {"height": "2800283"}
2025-07-01T01:55:00.080Z        DEBUG   curio/message   message/watch_eth.go:246        Head change received    {"newHeight": "2800284"}
2025-07-01T01:55:00.080Z        DEBUG   curio/message   message/watch_eth.go:251        Update channel full, skipping   {"height": "2800284"}

Which looks to me like a lock-up here since either return path should log something:

receipt, err := mw.api.TransactionReceipt(ctx, txHash)
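For reference, the check around that call is roughly shaped like the sketch below (illustrative only, not the actual watch_eth.go code; the ethereum.NotFound handling and the exact log wording are assumptions, and errors, the go-ethereum root package, and the go-log logger are assumed to be in scope):

// Sketch of the per-transaction receipt check: every branch after
// TransactionReceipt logs something, so silence here points at the call
// itself never returning rather than an unlogged early return.
receipt, err := mw.api.TransactionReceipt(ctx, txHash)
if errors.Is(err, ethereum.NotFound) {
	log.Debugw("Transaction still pending", "txHash", txHash) // pending path logs
	return nil
}
if err != nil {
	log.Errorw("Failed to get transaction receipt", "txHash", txHash, "error", err) // error path logs
	return err
}
log.Debugw("Transaction receipt found", "txHash", txHash, "status", receipt.Status) // success path logs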

At the same time, Lotus shows us:

{"level":"error","ts":"2025-06-30T13:11:00.929Z","logger":"node/eth","caller":"eth/transaction.go:414","msg":"failed to lookup transaction hash 0x6a04975f33972c2d7a0977ba88bae36c6123a292b5ba512dc086b09de2c0acc4 in chain indexer: failed to read data from index: context canceled"}
{"level":"warn","ts":"2025-06-30T13:11:00.929Z","logger":"rpc","caller":"[email protected]/handler.go:421","msg":"error in RPC call to 'eth_getTransactionReceipt': failed to lookup transaction hash 0x6a04975f33972c2d7a0977ba88bae36c6123a292b5ba512dc086b09de2c0acc4 in chain indexer:\n    github.com/filecoin-project/lotus/node/impl/eth.(*ethTransaction).getCidForTransaction\n        /root/repos/lotus/node/impl/eth/transaction.go:415\n  - failed to read data from index:\n    github.com/filecoin-project/lotus/chain/index.(*SqliteIndexer).readWithHeadIndexWait\n        /root/repos/lotus/chain/index/read.go:88\n  - context canceled"}
{"level":"error","ts":"2025-06-30T13:11:00.929Z","logger":"rpc","caller":"[email protected]/websocket.go:145","msg":"handle me:write tcp4 127.0.0.1:1234->127.0.0.1:37802: use of closed network connection"}

That path suggests we hit the moment in the Lotus ChainIndexer where it had a new chain head that wasn't quite indexed yet, so it had to wait for the indexing to catch up; that should be quick, but there's a 5 second timeout on the wait. That particular timeout isn't what's returned here, though: it's a timeout from somewhere further up the stack, I believe go-jsonrpc, maybe a deadline in the websocket handler, although I can't figure out exactly where. It might be the generic 30 second timeout that go-jsonrpc sets up while listening for responses, but that seems far too long to be relevant here: the timing says Curio asked Lotus for an eth_getTransactionReceipt at 13:11:00.928 and Lotus had already logged the error at 13:11:00.929!

The fact that we don't progress any further than this might suggest that go-ethereum/ethclient doesn't know how to deal with the error that comes out of Lotus for this event, or maybe one of the websocket handlers (on either side) doesn't. So my strategy here is to put a 1s timeout per transaction in the processing loop and move on if we don't get a response within that window; hopefully whatever part of the stack is stalling will respond to a context kick.
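A minimal sketch of that strategy (assumptions: a 1s budget per call, and placeholder names like pendingTxs for whatever the processing loop actually iterates over; context, time, and errors from the standard library are assumed to be imported):

// Give each receipt lookup its own short-lived context so one stalled RPC
// call can't wedge the whole processing loop; a timed-out tx is simply
// picked up again on the next pass.
for _, txHash := range pendingTxs {
	txCtx, cancel := context.WithTimeout(ctx, time.Second)
	receipt, err := mw.api.TransactionReceipt(txCtx, txHash)
	cancel()
	if errors.Is(err, context.DeadlineExceeded) || errors.Is(txCtx.Err(), context.DeadlineExceeded) {
		log.Debugw("Eth calls timed out - continuing with next tx", "txHash", txHash)
		continue
	}
	// ... handle receipt / still-pending / error as before ...
	_ = receipt
}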

@rvagg rvagg changed the title Rvagg/eth client timeout and log fix(eth): eth client timeout while watching Jul 1, 2025
@rvagg (Member, Author) commented Jul 2, 2025

Caught one with this PR:

2025-07-01T12:03:00.264Z        DEBUG   curio/message   message/watch_eth.go:119        Checking transaction    {"txHash": "0x4dbdfd2bada73c098fc1b46ecb435cf8dd1fb31270b0bd2f844111c08dc94a83"}
2025-07-01T12:03:01.264Z        DEBUG   curio/message   message/watch_eth.go:134        Eth calls timed out - continuing with next tx   {"txHash": "0x4dbdfd2bada73c098fc1b46ecb435cf8dd1fb31270b0bd2f844111c08dc94a83"}

And in Lotus:

{"level":"error","ts":"2025-07-01T12:03:00.264Z","logger":"node/eth","caller":"eth/transaction.go:414","msg":"failed to lookup transaction hash 0x4dbdfd2bada73c098fc1b46ecb435cf8dd1fb31270b0bd2f844111c08dc94a83 in chain indexer: failed to read data from index: context canceled"}
{"level":"warn","ts":"2025-07-01T12:03:00.264Z","logger":"rpc","caller":"[email protected]/handler.go:421","msg":"error in RPC call to 'eth_getTransactionReceipt': failed to lookup transaction hash 0x4dbdfd2bada73c098fc1b46ecb435cf8dd1fb31270b0bd2f844111c08dc94a83 in chain indexer:\n    github.com/filecoin-project/lotus/node/impl/eth.(*ethTransaction).getCidForTransaction\n        /root/repos/lotus/node/impl/eth/transaction.go:415\n  - failed to read data from index:\n    github.com/filecoin-project/lotus/chain/index.(*SqliteIndexer).readWithHeadIndexWait\n        /root/repos/lotus/chain/index/read.go:88\n  - context canceled"}

But that tx is now stuck in perpetual pending mode:

$ grep 0x4dbdfd2bada73c098fc1b46ecb435cf8dd1fb31270b0bd2f844111c08dc94a83 /var/log/curio/daemon.log | grep 'still pending' | wc -l
25742

But it's on chain: https://calibration.filfox.info/en/message/0x4dbdfd2bada73c098fc1b46ecb435cf8dd1fb31270b0bd2f844111c08dc94a83?t=1

Lotus just doesn't know about it:

curl http://localhost:1234/rpc/v1 -X POST -H 'Content-Type: application/json' -d '{"jsonrpc":"2.0","id":0,"method":"eth_getTransactionByHash","params":["0x4dbdfd2bada73c098fc1b46ecb435cf8dd1fb31270b0bd2f844111c08dc94a83"]}'
{"id":0,"jsonrpc":"2.0","result":null}

But on my Calibnet node:

$ curl http://localhost:1235/rpc/v1 -X POST -H 'Content-Type: application/json' -d '{"jsonrpc":"2.0","id":0,"method":"eth_getTransactionByHash","params":["0x3005f4a91b26648cf54d94933d147798885aa38cce2dc6cdebb66f53693cc75d"]}' | jq .result.blockHash
"0xae6133f78a2d8d47814e46e9139807f2dca1e77ec66e309bd83513bf916d9136"

So this looks kind of like a ChainIndexer bug as well.

@rvagg (Member, Author) commented Jul 2, 2025

Oh, actually, that tx is from 2025-06-21 17:16:00, which is why it's not in the ChainIndexer or in the blockstore: it's too old. So this is back to the original problem: hitting Lotus, getting that timeout error, and then locking up; our timeout in this PR is just kicking it back into action. This particular transaction is a re-processing of one that was missed previously on this Curio node while it was locked up, and splitstore removed it before it could be observed and processed, so now it's in an endless loop looking for a transaction it will never find, unfortunately. And I don't have a calibnet node I can point it to that has deeper history.

So, a couple of problems here:

  1. Something to do with the Eth transaction lookup; it could be ChainIndexer related, or go-jsonrpc, or something else, but go-ethereum/ethclient really doesn't like the resulting error and locks up, and we shouldn't be triggering that behaviour on Lotus in the first place.
  2. Dealing with the lock-up here. Ideally ethclient would be more robust and not lock up, but apparently it isn't, so we have to handle it on our side; that's what this PR does.

I'll observe this a little longer and then tidy it up for potential merge.
