Vespa version file in var still points to an older version after a bunch of successful upgrades and suddenly causes a container recreation to fail with cannot upgrade more than 30 versions #33214

Open
nehajatav opened this issue Jan 29, 2025 · 1 comment

@nehajatav

Describe the bug
On Jan 23rd we ran a series of upgrades to get from 8.406.26 to 8.466.24 on a two-node cluster, with one node running the config server and container and the other running the content node. Things have been running smoothly since, and the logs only show that the config server restarted successfully a couple of times on 8.466.24 after the weekend reboots on the 25th. Today we redeployed to recreate the Podman container without deleting the data folders. However, the config server refused to come up, with this error:
Cannot upgrade from 8.406.26 to 8.466.24. ...... is too large (> 30 releases). Setting VESPA_SKIP_UPGRADE_CHECK=true will skip this check at your own risk, see https://vespa.ai/releases.html#versions\n\tat com.yahoo.vespa.config.server....

To Reproduce
Not sure if this can be reproduced

Expected behavior
8.466.24 should be detected as the previous version

Environment (please complete the following information):

  • RHEL8v8
  • Infrastructure: Podman
  • Versions 4.4.1

Vespa version
While upgrading from 8.406.26 to 8.466.24

Additional context
On checking /opt/vespa/var/db/vespa/config_server/server_db/vespa_version today, we noticed that it contains 8.406.26. How is it possible that this failed today when at least the last two reboots were fine? On the 27th the cluster controller restarted a few times with OOM; apart from that, no other service has restarted since the last host reboot on the 26th.

@hmusum
Member

hmusum commented Jan 30, 2025

If it starts successfully, the config server writes its version to /opt/vespa/var/db/vespa/config_server/server_db/vespa_version and to a node in ZooKeeper (you can run vespa-zkcat /config/v2/vespa_version to see which version is stored in ZooKeeper). On startup, the config server compares the version it is running with the one it finds in ZooKeeper, or with the one in the vespa_version file if there is no data in ZooKeeper, and stops if the difference is more than 30 versions.
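
As a rough illustration of the startup check described above, here is a minimal Python sketch. It is not the config server's actual code: the function names are hypothetical, and it assumes that the "30 versions" gap is measured as the difference between minor version numbers within the same major version, which may not match how Vespa counts releases.

```python
from typing import Optional, Tuple

# Limit taken from the error message ("> 30 releases").
MAX_VERSION_GAP = 30


def parse_version(text: str) -> Tuple[int, int, int]:
    """Parse a 'major.minor.micro' string such as '8.466.24'."""
    major, minor, micro = (int(part) for part in text.strip().split("."))
    return major, minor, micro


def stored_version(zk_value: Optional[str], file_value: Optional[str]) -> Optional[str]:
    """Prefer the ZooKeeper value; fall back to the vespa_version file."""
    for candidate in (zk_value, file_value):
        if candidate and candidate.strip():
            return candidate.strip()
    return None


def check_upgrade(running: str, zk_value: Optional[str], file_value: Optional[str]) -> None:
    """Raise if the stored version is too far behind the running version."""
    previous = stored_version(zk_value, file_value)
    if previous is None:
        return  # Assumption: no stored state means there is nothing to compare.
    prev_major, prev_minor, _ = parse_version(previous)
    cur_major, cur_minor, _ = parse_version(running)
    # Assumption: the gap is the minor-version difference within one major version.
    if cur_major == prev_major and cur_minor - prev_minor > MAX_VERSION_GAP:
        raise RuntimeError(
            f"upgrade gap too large: stored {previous}, running {running}"
        )
```

Under these assumptions, a stale file only matters when ZooKeeper has no data: `check_upgrade("8.466.24", "8.466.24", "8.406.26")` passes, while `check_upgrade("8.466.24", None, "8.406.26")` raises.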

It's hard to say what has happened here, but some guesses:

  • Config server was upgraded but not started successfully
  • You somehow switched container and config server nodes, so there was old state on the one that was now used as config server.
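
One way to gather evidence for guesses like these is to record, after each upgrade, what the file and ZooKeeper actually contain. The Python sketch below is a hypothetical helper, not part of Vespa. It uses the file path and the vespa-zkcat command mentioned in this thread, and it assumes that vespa-zkcat is on the PATH and prints only the stored value to standard output.

```python
import subprocess
from pathlib import Path
from typing import Optional

# Path and ZooKeeper node as given in this thread.
VERSION_FILE = Path("/opt/vespa/var/db/vespa/config_server/server_db/vespa_version")
ZK_NODE = "/config/v2/vespa_version"


def read_version_file(path: Path = VERSION_FILE) -> Optional[str]:
    """Return the version stored in the file, or None if missing or empty."""
    try:
        text = path.read_text().strip()
    except OSError:
        return None
    return text or None


def read_zookeeper_version() -> Optional[str]:
    """Return the version stored in ZooKeeper, or None if it cannot be read.

    Assumption: vespa-zkcat prints just the node's value on stdout.
    """
    try:
        result = subprocess.run(
            ["vespa-zkcat", ZK_NODE], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


def diagnose(running: str, file_version: Optional[str], zk_version: Optional[str]) -> str:
    """Summarize which stored version would be compared with the running one."""
    effective = zk_version or file_version
    if effective is None:
        return "no stored version found in ZooKeeper or file"
    source = "ZooKeeper" if zk_version else "file"
    if effective == running:
        return f"ok: {source} records {running}"
    return f"mismatch: {source} records {effective}, running {running}"
```

Calling `diagnose("8.466.24", read_version_file(), read_zookeeper_version())` on the config server node after each upgrade and keeping the output would show at which step the stored version stopped following the running one.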

If you are able to reproduce this we can look into it; otherwise it's really hard to say what happened. We have never seen this on our own servers, where we have upgraded hundreds of servers hundreds of times each.

@hmusum hmusum self-assigned this Jan 30, 2025
@hmusum hmusum added this to Support Jan 30, 2025