[BUG]: Attaching/detaching NVLink logical partitions to/from an instance takes too long and can get stuck #379

Description

@prazumovsky

Describe the bug

Both attaching and detaching NVLink interfaces on a Carbide instance take too long to complete; an attached interface takes too long to reach the "Ready" status:

I detached one partition from the instance at :22 (13:22 UTC):

  {
    "id": "2c149132-24cc-47df-82cc-78bd2d214d4b",
    "instanceId": "<id>",
    "nvLinklogicalPartitionId": "a24a6d68-ae9b-4cfc-83af-69739e28288f",
    "nvLinkDomainId": "<id>",
    "deviceInstance": 3,
    "gpuGuid": "<guid>",
    "status": "Deleting",
    "created": "2026-02-25T13:14:40.621245Z",
    "updated": "2026-02-25T13:22:10.044117Z"
  }

It was deleted about 4.5 minutes later (and, in my case, replaced by an attachment to a new partition):

  {
    "id": "536302b7-a600-4efc-bd21-395c75c04c7e",
    "instanceId": "<id>",
    "nvLinklogicalPartitionId": "41aa70d0-faf5-4166-81be-b929142eeb78",
    "nvLinkDomainId": null,
    "deviceInstance": 3,
    "gpuGuid": null,
    "status": "Pending",
    "created": "2026-02-25T13:26:42.843422Z",
    "updated": "2026-02-25T13:26:42.843422Z"
  }

The same applies to attaching an NVLink interface: the new partition attachment was created at :26. It is now :37 (11 minutes after creation) and it is still Pending:

  {
    "id": "536302b7-a600-4efc-bd21-395c75c04c7e",
    "instanceId": "<id>",
    "nvLinklogicalPartitionId": "41aa70d0-faf5-4166-81be-b929142eeb78",
    "nvLinkDomainId": "<id>",
    "deviceInstance": 3,
    "gpuGuid": "<guid>",
    "status": "Pending",
    "created": "2026-02-25T13:26:42.843422Z",
    "updated": "2026-02-25T13:32:13.550321Z"
  }

The Pending interfaces are updated after 4-5 minutes with the correct GPU GUIDs, but they remain stuck in Pending. Initially, Pending interfaces have a null gpuGuid value.
The partition with that ID is itself Ready:

  {
    "id": "41aa70d0-faf5-4166-81be-b929142eeb78",
    "name": "kubevirt-node-02-pt-0-1-2-3",
    "description": "",
...
...
    "nvLinkLogicalPartitionStats": null,
    "status": "Ready",
    "statusHistory": [
      {
        "status": "Ready",
        "message": "NVLink Logical Partition is ready for use",
        "created": "2026-02-25T13:22:09.137991Z",
        "updated": "2026-02-25T13:22:09.137991Z"
      },
      {
        "status": "Pending",
        "message": "received NVLink Logical Partition creation request, pending",
        "created": "2026-02-25T13:21:23.205157Z",
        "updated": "2026-02-25T13:21:23.205157Z"
      }
    ],
    "created": "2026-02-25T13:21:23.191649Z",
    "updated": "2026-02-25T13:22:09.132546Z"
  }
]

Some NVLink partitions attach to instances after 4+ minutes with no problems, while others get stuck in Pending. A simple example: I have the partition

    "id": "a24a6d68-ae9b-4cfc-83af-69739e28288f",
    "name": "kubevirt-node-02-default",

which was created 2 hours ago and attaches to and detaches from all of the instance's GPUs with no issues.

I then created another partition

    "id": "41aa70d0-faf5-4166-81be-b929142eeb78",
    "name": "kubevirt-node-02-pt-0-1-2-3",

which I attach to all GPUs of the same instance (after detaching the kubevirt-node-02-default partition from them), and the attachments are stuck in "Pending".

More than 30 minutes after the interface attachment was created, it is still Pending:

  {
    "id": "536302b7-a600-4efc-bd21-395c75c04c7e",
    "instanceId": "<id>",
    "nvLinklogicalPartitionId": "41aa70d0-faf5-4166-81be-b929142eeb78",
    "nvLinkDomainId": "<id>",
    "deviceInstance": 3,
    "gpuGuid": "<guid>",
    "status": "Pending",
    "created": "2026-02-25T13:26:42.843422Z",
    "updated": "2026-02-25T13:32:13.550321Z"
  }
]
$ date
Wed Feb 25 06:06:07 AM PST 2026 (i.e. 14:06:07 UTC)
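
For reference, the time this attachment has spent in Pending can be computed from the "created" timestamp and the `date` output above. This is only a minimal sketch: the helper names `parse_ts` and `pending_for` are made up for illustration and are not part of any Carbide tooling, and the dictionary is a trimmed copy of the interface shown above.

```python
from datetime import datetime, timezone


def parse_ts(ts: str) -> datetime:
    """Parse an ISO-8601 timestamp with a trailing 'Z' as a UTC datetime."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def pending_for(interface: dict, now: datetime) -> float:
    """Seconds since creation if the interface is still Pending, else 0."""
    if interface["status"] != "Pending":
        return 0.0
    return (now - parse_ts(interface["created"])).total_seconds()


# Trimmed copy of the interface JSON from this report.
interface = {
    "id": "536302b7-a600-4efc-bd21-395c75c04c7e",
    "status": "Pending",
    "created": "2026-02-25T13:26:42.843422Z",
    "updated": "2026-02-25T13:32:13.550321Z",
}

# 06:06:07 AM PST converted to UTC.
now = datetime(2026, 2, 25, 14, 6, 7, tzinfo=timezone.utc)

elapsed = pending_for(interface, now)
print(f"{elapsed / 60:.1f} minutes in Pending")  # prints "39.4 minutes in Pending"
```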

Interestingly, kubevirt-node-02-default attachments become Ready after 5 minutes, but kubevirt-node-02-pt-0-1-2-3 attachments for the same GPUs stay stuck in Pending.

Steps/Code to reproduce bug

I repeated the following scenario 5-6 times with the same result each time:

  1. Before the workload, all GPUs (deviceInstance 0-3) of the instance are attached to partition "kubevirt-node-02-default", i.e. I attach every deviceInstance to this partition. All attachments (NVLink interfaces) become Ready 5-6 minutes after the PATCH API call.
  2. I create a new partition, always named "kubevirt-node-02-pt-0-1-2-3". It becomes Ready 4-5 minutes after the POST API call.
  3. I delete all attachments of the "kubevirt-node-02-default" partition from the instance. The attachments are deleted and the instance's NVLink interface list is empty 5-6 minutes after the PATCH API call.
  4. I attach all GPUs (deviceInstance 0-3) of the instance to the new partition "kubevirt-node-02-pt-0-1-2-3". All attachments stay stuck in Pending indefinitely (the longest I waited was ~40 minutes).
  5. I delete all attachments of "kubevirt-node-02-pt-0-1-2-3" from the instance. After 5-6 minutes, all attachments are deleted and the instance's NVLink interface list is empty.
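
The waiting in the steps above could be timed with a small polling helper along these lines. This is a sketch under stated assumptions, not the actual client: `wait_for_status` and its parameters are hypothetical, and `fetch_status` stands in for whatever call returns the current interface status (this report does not show the API endpoints, so none are assumed here). The clock and sleep functions are injectable so the simulated run at the bottom finishes instantly.

```python
import time
from typing import Callable, Optional, Tuple


def wait_for_status(
    fetch_status: Callable[[], Optional[str]],
    target: Optional[str] = "Ready",
    timeout_s: float = 600.0,
    interval_s: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[bool, float]:
    """Poll fetch_status() until it returns target or timeout_s elapses.

    fetch_status is a caller-supplied (hypothetical) callable returning the
    current status string, or None once the interface no longer exists.
    Pass target=None to wait for a detach to complete.
    Returns (reached, elapsed_seconds).
    """
    start = clock()
    while True:
        status = fetch_status()
        elapsed = clock() - start
        if status == target:
            return True, elapsed
        if elapsed >= timeout_s:
            return False, elapsed
        sleep(interval_s)


# Simulated run: the status is Pending for two polls, then becomes Ready.
statuses = iter(["Pending", "Pending", "Ready"])
fake_time = [0.0]


def fake_sleep(seconds: float) -> None:
    """Advance the fake clock instead of really sleeping."""
    fake_time[0] += seconds


reached, elapsed = wait_for_status(
    fetch_status=lambda: next(statuses),
    clock=lambda: fake_time[0],
    sleep=fake_sleep,
)
print(reached, elapsed)  # prints "True 20.0"
```

With a real status lookup plugged in and `timeout_s` set to, say, 2400, the helper would return `(False, ...)` for the stuck attachments in step 4 and the elapsed time for the ones that do become Ready.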

Expected behavior

Both partitions can be attached and detached, and their attachments become "Ready". The transition to "Ready" or to the deleted state takes less than 4-5 minutes.

Metadata

Labels
bug, partitioning

Status
Verify