Vespa version file in var still points to an older version after several successful upgrades, and suddenly causes a container recreation to fail with "cannot upgrade more than 30 versions"
#33214 · Open · nehajatav opened this issue Jan 29, 2025 · 1 comment
Describe the bug
We did a bunch of upgrades to get from 8.406.26 to 8.466.24 on Jan 23rd on a two-node cluster, with one node running the config server and container and the other node running the content node. Things have been running smoothly, with all logs only suggesting that the configserver was restarted successfully a couple of times with 8.466.24 after the weekend reboots on the 25th. Today we did a redeployment to recreate the podman container without nuking the data folders. However, the config server refused to come up with the error: Cannot upgrade from 8.406.26 to 8.466.24. ...... is too large (> 30 releases). Setting VESPA_SKIP_UPGRADE_CHECK=true will skip this check at your own risk, see https://vespa.ai/releases.html#versions\n\tat com.yahoo.vespa.config.server....
To Reproduce
Not sure if this can be reproduced
Expected behavior
8.466.24 should be detected as the previous version
Environment (please complete the following information):
RHEL8v8
Infrastructure: Podman
Version: 4.4.1
Vespa version
While upgrading from 8.406.26 to 8.466.24
Additional context
Upon checking /opt/vespa/var/db/vespa/config_server/server_db/vespa_version today, we noticed it contains 8.406.26. How is it possible that this happened today when at least the last two reboots were fine? On the 27th we saw the cluster controller restart a few times with OOM; apart from that, no other service has restarted since the last host reboot on the 26th.
The config server writes /opt/vespa/var/db/vespa/config_server/server_db/vespa_version and a node in ZooKeeper if it starts successfully (you can run vespa-zkcat /config/v2/vespa_version to see which version it stores in ZooKeeper). When starting, the config server compares the running version with what it finds in ZooKeeper, or with the version in the vespa_version file if there is no data in ZooKeeper, and stops if the difference is more than 30 versions.
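For illustration, here is a minimal sketch of the kind of check described above. This is not Vespa's actual implementation: the class and method names are made up, and counting "releases" from the minor version number (8.406 → 8.466 = 60 releases) is my assumption; only the ZooKeeper-then-file fallback and the 30-release limit come from the explanation and the error message.

```java
// Minimal sketch, NOT Vespa's actual code: illustrates the upgrade check described above.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

public class UpgradeCheckSketch {

    static final int MAX_RELEASES = 30; // limit mentioned in the error message

    // Fallback order per the comment above: ZooKeeper node first, then the on-disk file.
    static Optional<String> storedVersion(Optional<String> zkVersion, Path versionFile) throws IOException {
        if (zkVersion.isPresent()) return zkVersion;
        if (Files.exists(versionFile)) return Optional.of(Files.readString(versionFile).trim());
        return Optional.empty();
    }

    // Assumption: "releases" are counted as the difference in minor version numbers,
    // e.g. 8.406.26 -> 8.466.24 gives 60 releases.
    static int releasesBetween(String from, String to) {
        int fromMinor = Integer.parseInt(from.split("\\.")[1]);
        int toMinor = Integer.parseInt(to.split("\\.")[1]);
        return Math.abs(toMinor - fromMinor);
    }

    static void checkUpgrade(String stored, String running) {
        boolean skip = "true".equals(System.getenv("VESPA_SKIP_UPGRADE_CHECK"));
        int releases = releasesBetween(stored, running);
        if (releases > MAX_RELEASES && !skip) {
            throw new IllegalStateException("Cannot upgrade from " + stored + " to " + running
                    + ": " + releases + " releases is too large (> " + MAX_RELEASES + ")");
        }
    }

    public static void main(String[] args) throws IOException {
        // With a stale vespa_version file like the one reported in this issue, the check fails
        // even though the running binaries were already upgraded.
        Optional<String> stored = storedVersion(Optional.empty(),
                Path.of("/opt/vespa/var/db/vespa/config_server/server_db/vespa_version"));
        stored.ifPresent(v -> checkUpgrade(v, "8.466.24"));
    }
}
```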
It's hard to say what has happened here, but some guesses:
The config server was upgraded but did not start successfully.
You somehow switched the container and config server nodes, so there was old state on the node that was now used as the config server.
If you are able to reproduce this we can look into it; otherwise it's really hard to say what happened. We have never seen this before on our own servers, where we have upgraded hundreds of servers hundreds of times each without seeing this.