# [Bug] Lease stuck / lease_closed_invalid + GPU shown as unavailable (available.gpu=0) after workload cleanup #430

@Zluaclub

## Summary
My provider shows **Active Leases: 1** in Console, but GPU availability is stuck at 0 (`available.gpu=0`) even after cleaning up the Kubernetes namespace and workloads.
The on-chain lease query returns `state: active` but `reason: lease_closed_invalid`.

This blocks provider audit testing and consumes time and funds.

## Provider details

- Provider address: `akash1cvcfyadrm6gcghzspyhacw924c90lwnjkhcglm`
- host_uri: https://provider.proxyua.top:8443
- Region attribute: `location-region=eu-west`
- GPU: NVIDIA RTX4090 24Gi (pcie); node labels present
- Bidding mode during reproduction: both tested; currently `AKASH_BID_ENABLED=false` can be set for safety
- Provider attributes (from `provider-services query provider get`):

  ```
  host=akash
  tier=community
  organization=proxyua
  capabilities/gpu/vendor/nvidia/model/rtx4090=true
  capabilities/gpu/vendor/nvidia/model/rtx4090/ram/24Gi=true
  capabilities/gpu/vendor/nvidia/model/rtx4090/ram/24Gi/interface/pcie=true
  ```

- Persistent storage: `feat-persistent-storage=true` (hdd class), plus beta3 on NVMe
## Lease details (problematic lease)

Output (trimmed) from:

```shell
provider-services query market lease list --provider --state active -o json
```

```
owner: akash1pzh38kysyrmtfg00v36hxedvy4jla3tc3hah9v
dseq: 25500454
gseq: 1
oseq: 1
provider: akash1cvcfyadrm6gcghzspyhacw924c90lwnjkhcglm
state: active
reason: lease_closed_invalid
escrow_payment.state.state: open
escrow_payment.state.withdrawn.amount: 613081 uakt
```
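For reference, the state/reason pair above can be pulled from the lease list in one pass with `jq`; a minimal sketch, where the `.leases[].lease.*` paths are assumed from the fields shown above (verify against your actual JSON):

```shell
# Hedged sketch: print dseq, state, and close reason for each lease
# returned by the provider-side lease list query. Paths are assumptions
# inferred from the trimmed output above.
provider-services query market lease list \
  --provider akash1cvcfyadrm6gcghzspyhacw924c90lwnjkhcglm \
  --state active -o json \
  | jq -r '.leases[] | "\(.lease.lease_id.dseq) \(.lease.state) \(.lease.reason // "none")"'
```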
## Observed cluster inventory mismatch

The provider logs contain a cluster resources dump showing `allocatable.gpu: 1` but `available.gpu: 0`.
However, checking Kubernetes workloads shows no obvious GPU pods:

```shell
kubectl get pods -A -o wide | grep -i gpu    # => empty
kubectl get pods -A -o json | jq ...         # (gpu requests/limits > 0) => empty after cleanup
```

Yet the provider inventory still reports the GPU as unavailable.
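The elided `jq` filter can be spelled out; a sketch assuming GPUs are exposed under the standard `nvidia.com/gpu` device-plugin resource name (adjust if your cluster uses a different key):

```shell
# List every pod, in any namespace, whose containers request or limit
# nvidia.com/gpu. The resource name is the usual nvidia-device-plugin
# key; it is an assumption here, not taken from the logs above.
kubectl get pods -A -o json | jq -r '
  .items[]
  | select([.spec.containers[].resources
            | (.requests["nvidia.com/gpu"]? // .limits["nvidia.com/gpu"]?)
            | select(. != null)] | length > 0)
  | "\(.metadata.namespace)/\(.metadata.name)"'
```

An empty result here, combined with `available.gpu: 0` in the inventory dump, is the mismatch being reported.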

## What I tried

- Deleted the Akash-generated workload namespace with force:
  `kubectl delete ns --force --grace-period=0`
- Restarted the provider statefulset:
  `kubectl -n akash-services rollout restart sts akash-provider`
- Verified `AKASH_BID_ENABLED` toggling, and ran inventory/lease queries from inside the provider pod.
## Expected behavior

- If a lease is invalid (`lease_closed_invalid`), it should transition to a closed state or be recoverable via a documented procedure.
- Provider inventory should reflect actual cluster GPU availability after workloads are removed.
- Console should not show a lease as Active if it is invalid or not runnable.
## Questions to the core team

1. What are the exact semantics of `lease_closed_invalid` on akashnet-2? Which module emits it, and under which conditions?
2. Is there a timeout/auto-close mechanism for invalid leases? If yes, what are the parameters and the expected timeframe?
3. Is there a known inventory-service bug where `available.gpu` stays at 0 after cleanup? Is there a recommended remediation (cache reset, provider restart sequence, inventory operator restart, etc.)?
4. What is the correct way for a provider operator to "force release" stuck resources when the tenant does not close the deployment?
## Environment / versions

- Kubernetes: k3s (1.29.x)
- Provider pod: `ghcr.io/akash-network/provider:0.10.5`
- provider-services: v0.10.5 (available inside the provider container)
- Node labels show the RTX4090 present and nvidia-device-plugin installed.
