Safety script for RPC nodes #37
base: master
safety.sh
Outdated
| elif [ $DIFFSLOT -gt $FSNAPSHOT ]; then | ||
| cd ~ | ||
| ./stop | ||
| ./fetch-snapshot.sh bv1 |
What's your thinking behind the fetch snapshot case? The problem with doing this is that it will create holes in the local rocksdb ledger data. When fetching data over RPC, we only fall back to BigTable for slots that are older than the first local rocksdb slot, so requests for slots inside a hole in the local ledger will not fall back to BigTable.
I'm thinking the rm -rf ledger/ case is the only one we need
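As a rough illustration of the single-case logic suggested above, the decision could be reduced to one threshold check. This is a minimal sketch, not the PR's final code: the function name `decide_action` is hypothetical, and the 1500-slot value is the `RMLEDGER` threshold that comes up later in this thread.

```shell
#!/usr/bin/env bash
# Sketch only: decide_action is a hypothetical helper, not part of safety.sh.

RMLEDGER=1500   # slots behind the cluster before the ledger is wiped

# Prints "reset" when the node is more than RMLEDGER slots behind the
# cluster, otherwise "ok". There is no separate fetch-snapshot branch,
# so the local ledger is either kept whole or removed whole (no holes).
decide_action() {
  local cluster_slot=$1
  local local_slot=$2
  local diff=$((cluster_slot - local_slot))
  if [ "$diff" -gt "$RMLEDGER" ]; then
    echo reset
  else
    echo ok
  fi
}
```

In the script under review, a "reset" result would correspond to the `./stop`, `rm -rf ledger/`, `./restart` sequence.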
Co-authored-by: Michael Vines <[email protected]>
joedenis01
left a comment
How about now @mvines?
Or do you think 1,500 is too aggressive?
mvines
left a comment
1500 seems like a good place to start at least
safety.sh
Outdated
| logfile=`date +"%d-%b-%Y"`_log.log | ||
| echo "================= $currenttime" >> $logfile | ||
| RMLEDGER=1500 | ||
| CLUSTERSLOT=$(solana slot -u http://10.142.0.4:8899) |
Can you please turn 10.142.0.4 into a constant, so it's clearer which host this is?
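One way the request above might look, as a sketch: the variable name `CLUSTER_RPC_URL` and the wrapper function `cluster_slot` are hypothetical, while the address and the `solana slot -u` invocation are taken from the diff.

```shell
#!/usr/bin/env bash
# Hypothetical names: CLUSTER_RPC_URL and cluster_slot are illustrative only.

# Reference RPC node used to read the current cluster slot.
CLUSTER_RPC_URL="http://10.142.0.4:8899"

# Prints the slot reported by the reference node.
cluster_slot() {
  solana slot -u "$CLUSTER_RPC_URL"
}
```

The script would then use `CLUSTERSLOT=$(cluster_slot)`, and the host only needs to be changed in one place.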
| @@ -0,0 +1,18 @@ | |||
| #/bin/bash | |||
| currenttime=`date +"%d-%b-%Y %H:%M:%S"` | |||
I'm a fan of adding set -e to the top of these kinds of scripts so they bail on failures (usually, at least; there are lots of caveats with set -e) instead of stumbling along and causing splash damage.
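To make the `set -e` behaviour and one of its caveats concrete, here is a small sketch. The function names are hypothetical, and each demo runs in a child `bash` so that `set -e` cannot terminate the calling shell.

```shell
#!/usr/bin/env bash
# Hypothetical demo helpers; not part of safety.sh.

# With set -e, the failing `false` ends the child script before "after"
# is printed, so only "before" appears and the exit status is non-zero.
demo_errexit() {
  bash -c 'set -e; echo before; false; echo after'
}

# Caveat: set -e is ignored inside a function called on the left of ||,
# so the failure of `false` goes unnoticed, the function keeps running,
# and "still running" is printed.
demo_caveat() {
  bash -c 'set -e; f() { false; echo "still running"; }; f || echo handled'
}
```

For a script like this one, the practical upshot is that a failed `./stop` would abort the run before `rm -rf ledger/` is reached, provided the command is not wrapped in a context where `set -e` is ignored.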
safety.sh
Outdated
| ./stop | ||
| rm -rf ledger/ | ||
| ./restart | ||
| echo "Node was:" $DIFFSLOT " slots behind, Services has been stopped, ledger deleted and service restarted" >> $logfile |
Should we emit a metric here so we can get flagged in PagerDuty and/or on http://bit.ly/solana-cluster-sanity?
Emitting a metric is easy; start by sourcing this script:
Line 31 in 19a50f1
| source "$here"/configure-metrics.sh |
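A possible shape for the metric call, as a sketch under assumptions: it assumes that sourcing `configure-metrics.sh` provides a datapoint-writing function, here called `metricsWriteDatapoint`. That function name, the measurement name, and the helper `report_ledger_reset` are not confirmed by this thread and should be checked against the script itself. A stub is defined when the function is absent so the example runs on its own.

```shell
#!/usr/bin/env bash
# Assumption: configure-metrics.sh defines metricsWriteDatapoint once sourced.
# If it is not defined, fall back to a stub that just prints the datapoint.
if ! command -v metricsWriteDatapoint >/dev/null 2>&1; then
  metricsWriteDatapoint() { echo "datapoint: $1"; }
fi

# Hypothetical helper: reports that the ledger was reset and how far
# behind the node was. The measurement name is illustrative only.
report_ledger_reset() {
  local slots_behind=$1
  metricsWriteDatapoint "rpc-safety-ledger-reset slots_behind=${slots_behind}"
}
```

In the script, `report_ledger_reset "$DIFFSLOT"` could sit next to the existing log line so that the reset is visible on a dashboard as well as in the log file.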
Co-authored-by: Michael Vines <[email protected]>
No description provided.