
Add kuryr-sriov spec proposal
Targets bp: kuryr-kubernetes-sriov-support

Change-Id: If7463cdb050db1ffa4a80e4b2e825db6ad578a9f
teferi committed Jun 8, 2017
1 parent 87039c6 commit 07f2029
Showing 2 changed files with 292 additions and 0 deletions.
1 change: 1 addition & 0 deletions doc/source/index.rst
@@ -32,6 +32,7 @@ Design Specs

specs/pike/contrail_support
specs/pike/fuxi_kubernetes
specs/pike/sriov

Indices and tables
==================
291 changes: 291 additions & 0 deletions doc/source/specs/pike/sriov.rst
@@ -0,0 +1,291 @@
..
Licensed under the Apache License, Version 2.0 (the "License"); you may
not use this file except in compliance with the License. You may obtain
a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations
under the License.

Convention for heading levels in Neutron devref:
======= Heading 0 (reserved for the title in a document)
------- Heading 1
~~~~~~~ Heading 2
+++++++ Heading 3
''''''' Heading 4
(Avoid deeper levels because they do not render well.)



Kuryr Kubernetes SR-IOV Integration
===================================

https://blueprints.launchpad.net/kuryr-kubernetes/+spec/kuryr-kubernetes-sriov-support

This spec proposes an approach to allow kuryr-kubernetes to manage pods that
require SR-IOV interfaces.

Problem Description
-------------------

SR-IOV (single-root input/output virtualization) is a technique that allows a
single physical PCIe device to be shared across several clients (VMs or
otherwise). Each such network card would have a single PF (physical function)
and multiple VFs (virtual functions), essentially appearing as multiple PCIe
devices. These VFs can then be passed through to VMs, bypassing the hypervisor
and virtual switch. This allows performance comparable to non-virtualized
environments. SR-IOV support is present in nova and neutron; see the docs [#]_.

It is possible to implement a similar approach within Kubernetes. Since
Kubernetes uses separate network namespaces for Pods, it is possible to
implement pass-through, simply by assigning a VF device to the desired Pod's
namespace.

There are several challenges that this task poses:

* SR-IOV interfaces are a limited resource and not every Pod will require
  them. A Pod should therefore be able to request 0 (zero) or more VFs, i.e.
  these interfaces have to be optional.
* For SR-IOV support to be practical, Pods should be able to request multiple
  VFs, possibly from multiple PFs. It is important to note that Kubernetes
  only stores information about a single IP address per Pod; however, it does
  not restrict configuring additional network interfaces and/or IP addresses
  for it.
* Different PFs may map to different neutron physical networks (physnets).
  Pods need to be able to request VFs from a specific physnet, and physnet
  information (the VLAN ID, specifically) should be passed to the CNI for
  configuration.
* Kubernetes has no knowledge of the SR-IOV interfaces on the Node it runs
  on. This can be mitigated by utilising the Opaque Integer Resources [#2d]_
  feature from the 1.5.x and later series.
* This feature would be limited to bare metal installations of Kubernetes,
  since it is currently impossible to manage VFs of a PF from inside a VM.
  (There is work to allow this in newer kernels, but the latest stable
  kernels do not support it yet.)


Proposed Change
---------------

The proposed solution consists of two major parts: adding SR-IOV capabilities
to the VIF handler of the kuryr-kubernetes controller and enhancing the CNI
to let it associate VFs with Pods.


Pod scheduling and resource management
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Since Kubernetes is the component that actually schedules Pods onto Nodes, we
need a way to tell it that a particular node is capable of handling
SR-IOV-enabled Pods. There are several techniques in Kubernetes that allow
limiting where a pod should be scheduled (e.g. Labels and NodeSelectors,
Taints and Tolerations), but only Opaque Integer Resources [#2d]_ (OIR) allow
exact bookkeeping of VFs. This spec proposes to use a predefined OIR pattern
to track VFs on a node::

pod.alpha.kubernetes.io/opaque-int-resource-sriov-vf-<PHYSNET_NAME>

For example, to request VFs from ``physnet2`` it would be::

pod.alpha.kubernetes.io/opaque-int-resource-sriov-vf-physnet2

It will be the deployer's duty to set these resources during node setup.
``kubectl`` does not support setting OIR as of yet, so it has to be done as a
PATCH request to the Kubernetes API. For example, to add 7 VFs from
``physnet2`` to ``k8s-node-1``, one would issue the following request::

curl --header "Content-Type: application/json-patch+json" \
--request PATCH \
--data '[{"op": "add", "path":
"/status/capacity/pod.alpha.kubernetes.io~1opaque-int-resource-sriov-vf-physnet2",
"value": "7"}]' \
http://k8s-master:8080/api/v1/nodes/k8s-node-1/status

For more information please refer to the OIR docs. [#2d]_ (Note that ``~1``
in the JSON-patch path is the JSON-Pointer escape for ``/``.) This process
may be automated using Node Feature Discovery [#]_ or a similar service, but
those details are out of the scope of this spec.

A Pod Spec requesting such resources might look like this:

.. code-block:: yaml

    spec:
      containers:
      - name: vf-container
        image: vf-image
        resources:
          requests:
            pod.alpha.kubernetes.io/opaque-int-resource-sriov-vf-physnet2: 1
      - name: vf-other-container
        image: vf-other-image
        resources:
          requests:
            pod.alpha.kubernetes.io/opaque-int-resource-sriov-vf-physnet2: 1
            pod.alpha.kubernetes.io/opaque-int-resource-sriov-vf-physnet3: 1

These requests are per-container, and the VF requests should be totalled for
the Pod, the same way Kubernetes totals other compute resources. The example
above would require 2 VFs from ``physnet2`` and 1 VF from ``physnet3``.

An important note should be made about Kubernetes Init Containers [#]_. If we
decide that it is important to support requests from Init Containers, they
would have to be treated differently. Init Containers are designed to run
sequentially, so we would need to scan them and take the maximum request
value per physnet across all of them, as sketched below.
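
To make the bookkeeping concrete, here is a minimal sketch of totalling the
requests; the helper name, the prefix constant and the treatment of Init
Containers are illustrative assumptions, not part of any existing handler:

.. code-block:: python

    import collections

    OIR_VF_PREFIX = "pod.alpha.kubernetes.io/opaque-int-resource-sriov-vf-"

    def total_vf_requests(pod_spec):
        """Return a {physnet: vf_count} dict for a pod spec dict."""
        totals = collections.Counter()
        for container in pod_spec.get('containers', []):
            requests = container.get('resources', {}).get('requests', {})
            for name, amount in requests.items():
                if name.startswith(OIR_VF_PREFIX):
                    # Regular containers run concurrently: sum them up.
                    totals[name[len(OIR_VF_PREFIX):]] += int(amount)
        for container in pod_spec.get('initContainers', []):
            requests = container.get('resources', {}).get('requests', {})
            for name, amount in requests.items():
                if name.startswith(OIR_VF_PREFIX):
                    # Init containers run sequentially: only the maximum
                    # single request per physnet matters.
                    physnet = name[len(OIR_VF_PREFIX):]
                    totals[physnet] = max(totals[physnet], int(amount))
        return dict(totals)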

Requesting SR-IOV ports
~~~~~~~~~~~~~~~~~~~~~~~

To implement SR-IOV capabilities the current VIF handler will be modified to
handle multiple VIFs. As a prerequisite, the following changes have to be
implemented:

Multi-VIF capabilities of generic handler
+++++++++++++++++++++++++++++++++++++++++

Instead of storing a single VIF in the annotation, the VIFHandler would store
a dict that maps a desired interface name to a VIF object. As an alternative
we could store VIFs in a list, but a dict gives finer control over interface
naming. Both the handler and the CNI would have to be modified to understand
this new annotation format. The CNI may also be kept backward-compatible,
i.e. continue to understand the old single-VIF format.
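
The exact annotation format is not settled by this spec. Purely as an
illustration (the ``eth0`` default name and the format-detection heuristic
are assumptions), a backward-compatible read of the annotation might look
like:

.. code-block:: python

    from oslo_serialization import jsonutils

    def read_vif_annotation(annotation):
        """Read the VIF annotation in either the new or the old format.

        New format: a dict mapping interface names to serialized VIFs.
        Old format: a single serialized VIF; oslo.versionedobjects
        primitives carry a 'versioned_object.name' key we can detect.
        """
        data = jsonutils.loads(annotation)
        if 'versioned_object.name' in data:
            # Legacy single-VIF annotation; give it a default name.
            return {'eth0': data}
        return data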

Even though this functionality is not a part of SR-IOV handling, it acts as a
prerequisite and would be implemented as part of this spec.

SR-IOV capabilities of generic handler
++++++++++++++++++++++++++++++++++++++

The handler would read the OIR requests of a scheduled Pod to see whether the
Pod has requested any SR-IOV VFs. (NOTE: at this point the Pod has already
been scheduled to a node, meaning there are enough available VFs on that
node.) The handler would then ask the SR-IOV driver for a sufficient number
of ``direct`` ports from neutron and pass them on to the CNI via annotations.
The network information should also include the network's VLAN info, so that
the CNI can set up the VF VLAN.
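
A rough sketch of the driver side follows; the helper name and its arguments
are illustrative, while ``binding:vnic_type='direct'`` and the
``provider:segmentation_id`` attribute are the standard neutron API pieces
involved:

.. code-block:: python

    def request_direct_ports(neutron, pod_name, network_id, subnet_id, count):
        """Allocate `count` SR-IOV ('direct') ports from neutron.

        `neutron` is a neutronclient instance; all naming is illustrative.
        """
        ports = []
        for i in range(count):
            port = neutron.create_port({'port': {
                'name': '%s-sriov-%d' % (pod_name, i),
                'network_id': network_id,
                'fixed_ips': [{'subnet_id': subnet_id}],
                # 'direct' marks the port as an SR-IOV VF port.
                'binding:vnic_type': 'direct',
            }})['port']
            ports.append(port)
        # The VLAN of the underlying provider network is what the CNI
        # needs to program on the VF.
        net = neutron.show_network(network_id)['network']
        return ports, net.get('provider:segmentation_id')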

SR-IOV functionality requires additional knowledge of neutron subnets: the
controller needs to know the subnet in which to allocate direct ports for a
certain physnet. This can be solved by adding a config setting that maps
physnets to a default neutron subnet. It might look like this:

.. code-block:: ini

    default_physnet_subnets = "physnet2:e603a1cc-57e5-40fe-9af1-9fbb30905b10,physnet3:0919e15a-b619-440c-a07e-bb5a28c11a75"

Alternatively, we could request this information from neutron. However, since
there can be multiple networks within a single physnet and multiple subnets
within a single network, there is a lot of room for ambiguity. Finally, we
could combine the two approaches: request the info from neutron only if it is
not set in the config.
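
A minimal sketch of parsing such a mapping is below; kuryr would normally
declare the option via oslo.config, so the bare-string parsing and the
fall-back usage shown in the comment are illustrative assumptions. The same
``key:value`` shape also fits the ``physical_device_mappings`` option
introduced in the next section.

.. code-block:: python

    def parse_physnet_mapping(option):
        """Parse 'physnet2:<value>,physnet3:<value>' into {physnet: value}."""
        mapping = {}
        for pair in option.split(','):
            physnet, _, value = pair.partition(':')
            mapping[physnet.strip()] = value.strip()
        return mapping

    # Hypothetical usage: fall back to asking neutron only when a physnet
    # is missing from the configured mapping.
    # subnet_id = (parse_physnet_mapping(conf_value).get('physnet2')
    #              or lookup_subnet_in_neutron('physnet2'))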

Kuryr-cni
~~~~~~~~~

On the CNI side we will implement a CNI binding driver for SR-IOV ports.
Since this work builds on top of multi-VIF support in both the CNI and the
controller, no additional format changes would be needed. The driver would
configure the VF and pass it to the Pod's namespace: it would scan the
``/sys/class/net/<PF>/device`` directory for available virtual functions and
move the acquired device into the Pod's namespace.

The driver would need to know which devices map to which physnets. Therefore
we would introduce a config setting ``physical_device_mappings``, identical
to neutron-sriov-nic-agent's setting. It might look like:

.. code-block:: ini

    physical_device_mappings = "physnet2:enp1s0f0,physnet3:enp1s0f1"

As an alternative to storing this setting in ``kuryr.conf``, we may store it
in the ``/etc/cni/net.d/kuryr.conf`` file or in a kubernetes node annotation.
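
Putting the pieces together, here is a condensed sketch of the binding
driver's job. The free-VF detection heuristic, shelling out to the ``ip``
utility and the helper's name are all assumptions of this illustration, not
settled implementation details:

.. code-block:: python

    import glob
    import os
    import subprocess

    def attach_vf(pf, container_pid, vlan_id):
        """Move a free VF of physical function `pf` into a pod's netns."""
        for virtfn in glob.glob('/sys/class/net/%s/device/virtfn*' % pf):
            net_dir = os.path.join(virtfn, 'net')
            names = os.listdir(net_dir) if os.path.isdir(net_dir) else []
            if not names:
                continue  # VF netdev not visible here; assume it is taken.
            vf_index = int(os.path.basename(virtfn)[len('virtfn'):])
            vf_dev = names[0]
            # Program the neutron network's VLAN on the VF via the PF ...
            subprocess.check_call(['ip', 'link', 'set', pf, 'vf',
                                   str(vf_index), 'vlan', str(vlan_id)])
            # ... then hand the VF device over to the pod's namespace.
            subprocess.check_call(['ip', 'link', 'set', vf_dev, 'netns',
                                   str(container_pid)])
            return vf_dev
        raise RuntimeError('No free VF available on %s' % pf)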


Caveats
~~~~~~~

* The current implementation does not concern itself with setting the active
  status of the Port on the neutron side. This is not required for the
  feature to function properly, but may be undesirable from an operator's
  standpoint. Doing so may require some additional integration with
  neutron-sriov-nic-agent and verification; there is a concern that
  neutron-sriov-nic-agent does not always detect port status correctly.

Optional 2-Phase Approach
~~~~~~~~~~~~~~~~~~~~~~~~~

The initial implementation followed an alternative path, where SR-IOV
functionality was implemented as a separate handler/CNI. This sparked several
design discussions, in which the community agreed that a multi-VIF handler is
preferable to a multi-handler approach. However, if implementing the
multi-VIF handler proves lengthy and difficult, we may go with a 2-phase
approach. First phase: polish and merge the initial implementation. Second
phase: implement the multi-VIF approach and convert the sriov-handler to use
it.

Alternatives
~~~~~~~~~~~~

* It is possible to implement SR-IOV functionality as a separate handler.
In this scenario both handlers would listen to Pod events and would handle
them separately. They would have to use different annotation keys inside the
Pod object. The CNI would have to be able to handle both annotation keys.
* Since this feature is only practical for bare metal, we could implement it
  entirely on the CNI side (i.e. the CNI would request ports from neutron).
  However, this would introduce an alternative control flow.
* It is also possible to implement a separate CNI that would use a static
  configuration compatible with neutron's, much like [#]_. This would
  eliminate the need to talk to neutron at all, but would put the burden of
  configuring each node's network information on the deployer. This may
  however be desirable for some installations and may be considered as an
  option. At the same time, in this scenario there would be little to no code
  shared between this CNI and regular kuryr-kubernetes, so the code would be
  better suited to a separate project than to kuryr-kubernetes.
* As an alternative, we may implement a separate kuryr-sriov-cni that would
  handle only SR-IOV requests. This would allow a more granular approach and
  would decouple SR-IOV functionality from the main code. Implementing a
  kuryr-sriov-cni would mean, however, that operators would need to pick one
  of the implementations (kuryr-cni vs kuryr-sriov-cni) or use something like
  multus-cni [#]_ or CNI-Genie [#]_ to make them work together.


Assignee(s)
~~~~~~~~~~~

Primary assignee:
Zaitsev Kirill


Work Items
~~~~~~~~~~

* Implement Multi-VIF handler/CNI
* Implement SR-IOV capabilities
* Implement CNI SR-IOV handler
* Active state monitoring for kuryr-sriov direct ports
* Document deployment procedure for kuryr-sriov support

Possible Further Work
~~~~~~~~~~~~~~~~~~~~~

* It may be desirable to be able to request specific ports from a neutron
  subnet in the Pod Spec. This functionality could be extended to normal
  VIFs, beyond the SR-IOV handler.
* It may be desirable to add an option to assign network info to VFs
  statically.

References
----------

.. [#] https://docs.openstack.org/ocata/networking-guide/config-sriov.html
.. [#2d] https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#opaque-integer-resources-alpha-feature
.. [#] https://github.com/kubernetes-incubator/node-feature-discovery
.. [#] https://kubernetes.io/docs/concepts/workloads/pods/init-containers/
.. [#] https://github.com/hustcat/sriov-cni
.. [#] https://github.com/Intel-Corp/multus-cni
.. [#] https://github.com/Huawei-PaaS/CNI-Genie
