Istio Fork for Fixing Multicluster Registry Bug #1

Draft · wants to merge 7 commits into base: release-1.0
32 changes: 32 additions & 0 deletions FAIR.md
@@ -0,0 +1,32 @@
# Why this Fork
In a multicluster Kubernetes setup with mutual TLS disabled, liveness and readiness probes don't work in remote clusters. However, this behavior could not be reproduced on the Kubernetes cluster running the Istio control plane.

We verified that the remote proxy sidecar containers had connectivity to Pilot Discovery. However, when comparing the sidecar logs from a remote cluster against the local cluster, we noticed that the local-cluster sidecars were properly registering listeners for the liveness and readiness probe ports, while the remote-cluster sidecars weren't.

```bash
# GOOD - On the "local" control plane - we expect to see a listener registered for 8080
[2019-03-06 21:44:39.033][15][info][upstream] external/envoy/source/server/lds_api.cc:80] lds: add/update listener '10.21.3.242_8080'

# BAD - On the "remote" cluster - where are 3333 and 9999 coming from???
[2019-03-06 22:00:16.701][10][info][upstream] external/envoy/source/common/upstream/cluster_manager_impl.cc:494] add/update cluster inbound|3333||mgmtCluster during init
[2019-03-06 22:00:16.702][10][info][upstream] external/envoy/source/common/upstream/cluster_manager_impl.cc:494] add/update cluster inbound|9999||mgmtCluster during init
```

After diving into the Istio source, the only place we found ports 3333 and 9999 was the [Envoy debug registry](https://github.com/istio/istio/blob/1.0.2/pilot/pkg/proxy/envoy/v2/debug.go#L324-L337).

In the Istio logs, we could see that the remote cluster registry was being picked up. However, based on the [server bootup order](https://github.com/istio/istio/blob/1.0.2/pilot/pkg/bootstrap/server.go#L211-L219) (note that the Discovery service is always initialized before Multicluster), Discovery registers several service registry providers, and an [aggregate controller iterates through them](https://github.com/istio/istio/blob/1.0.2/pilot/pkg/serviceregistry/aggregate/controller.go#L172-L192) to look up management and health check information, returning on the first hit. Since the debug registry is always loaded before the multicluster Kubernetes registries, and since the debug registry always returns entries, the aggregate controller gets a hit on the debug ports and exits the loop. As a result, the remote cluster ports are never returned properly, and liveness and readiness probes for the remote cluster never work.
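To make the failure mode concrete, here is a minimal, self-contained Go sketch of the first-hit lookup described above. The types are simplified stand-ins, not the actual Istio interfaces: because the debug registry is registered first and always answers, the remote Kubernetes registry is never consulted.

```go
package main

import "fmt"

// registry is a simplified stand-in for the service-discovery pieces of
// Istio's aggregate.Registry; it is not the real interface.
type registry interface {
	Name() string
	// ManagementPorts returns the management (probe) ports for a pod IP,
	// or nil if this registry does not know the pod.
	ManagementPorts(addr string) []int
}

// debugRegistry models the in-memory "v2-debug" registry: it always answers.
type debugRegistry struct{}

func (debugRegistry) Name() string                 { return "memAdapter (v2-debug)" }
func (debugRegistry) ManagementPorts(string) []int { return []int{3333, 9999} }

// remoteKubeRegistry models the remote cluster's Kubernetes registry, which
// knows the pod's real probe port.
type remoteKubeRegistry struct{}

func (remoteKubeRegistry) Name() string                 { return "Kubernetes (remote)" }
func (remoteKubeRegistry) ManagementPorts(string) []int { return []int{8080} }

// managementPorts mirrors the aggregate controller's loop: iterate the
// registries in registration order and return on the first non-nil hit.
func managementPorts(registries []registry, addr string) (string, []int) {
	for _, r := range registries {
		if ports := r.ManagementPorts(addr); ports != nil {
			return r.Name(), ports
		}
	}
	return "", nil
}

func main() {
	// Registration order matches Pilot's bootup order: debug registry first,
	// multicluster (remote) registries later.
	registries := []registry{debugRegistry{}, remoteKubeRegistry{}}
	name, ports := managementPorts(registries, "10.21.3.242")
	fmt.Printf("answered by %s with ports %v\n", name, ports)
}
```

Running this prints the debug ports 3333 and 9999; the remote registry's real probe port is never reached, which matches the Envoy logs shown above.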

We added more logging in this fork and deployed it to a test cluster to confirm our suspicions.

We then added a feature flag to disable registering the debug registry, to see if that would fix our liveness probes. Sure enough, it did.

## Building and Running

```bash
make build docker.pilot # Builds the Docker container at istio/pilot:<sha>
```
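To run the patched Pilot in a cluster, you will typically need to tag the built image and push it to a registry your clusters can pull from. The registry host and tag below are placeholders; substitute your own values (and the `<sha>` produced by the build):

```bash
# Hypothetical registry and tag - adjust to your environment.
docker tag istio/pilot:<sha> registry.example.com/istio/pilot:multicluster-fix
docker push registry.example.com/istio/pilot:multicluster-fix
```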

If enabling the debug registry is the desired behavior when deploying, set the following environment variable:
```bash
ENABLE_DEBUG_REGISTRY=true
```
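If the forked Pilot image is already running and you want to flip the flag, one way is to set it directly on the Pilot deployment. The namespace, deployment, and container names below assume a stock Istio 1.0 install (`istio-system`, `istio-pilot`, `discovery`) and may differ in your environment:

```bash
# Assumes the standard istio-pilot deployment with a "discovery" container.
kubectl -n istio-system set env deployment/istio-pilot -c discovery ENABLE_DEBUG_REGISTRY=true
```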
3 changes: 2 additions & 1 deletion Makefile
@@ -72,7 +72,8 @@ else
# export GOOS_LOCAL ?= windows
endif

export GOOS ?= $(GOOS_LOCAL)
# export GOOS ?= $(GOOS_LOCAL)
export GOOS ?= linux
#-----------------------------------------------------------------------------
# Output control
#-----------------------------------------------------------------------------
20 changes: 13 additions & 7 deletions pilot/pkg/proxy/envoy/v2/debug.go
@@ -21,6 +21,7 @@ import (
"io"
"io/ioutil"
"net/http"
"os"
"sync"

"github.com/gogo/protobuf/jsonpb"
@@ -48,13 +49,18 @@ func (s *DiscoveryServer) InitDebug(mux *http.ServeMux, sctl *aggregate.Controll
map[model.Hostname]*model.Service{ // mock.HelloService.Hostname: mock.HelloService,
}, 2)

sctl.AddRegistry(aggregate.Registry{
ClusterID: "v2-debug",
Name: serviceregistry.ServiceRegistry("memAdapter"),
ServiceDiscovery: s.MemRegistry,
ServiceAccounts: s.MemRegistry,
Controller: s.MemRegistry.controller,
})
// Feature flagging the debug registry because this is always going to hit before the remote registry.
// This is causing healthchecks to fail in multicluster.
debug, ok := os.LookupEnv("ENABLE_DEBUG_REGISTRY")
if ok && debug == "true" {
sctl.AddRegistry(aggregate.Registry{
ClusterID: "v2-debug",
Name: serviceregistry.ServiceRegistry("memAdapter"),
ServiceDiscovery: s.MemRegistry,
ServiceAccounts: s.MemRegistry,
Controller: s.MemRegistry.controller,
})
}

mux.HandleFunc("/ready", s.ready)

4 changes: 4 additions & 0 deletions pilot/pkg/serviceregistry/aggregate/controller.go
@@ -173,21 +173,25 @@ func (c *Controller) GetService(hostname model.Hostname) (*model.Service, error)
// Return on the first hit.
func (c *Controller) ManagementPorts(addr string) model.PortList {
for _, r := range c.GetRegistries() {
log.Infof("Aggregate - trying to get management ports for pod IP %s cluster: %s, registry type: %s", addr, r.ClusterID, string(r.Name))
if portList := r.ManagementPorts(addr); portList != nil {
return portList
}
}
log.Infof("Aggregate - didn't find any management port info for %s", addr)
return nil
}

// WorkloadHealthCheckInfo returne the health check information for IP addr
// Return on the first hit.
func (c *Controller) WorkloadHealthCheckInfo(addr string) model.ProbeList {
for _, r := range c.GetRegistries() {
log.Infof("Aggregate - trying to get workload health for pod IP %s cluster: %s, registry type: %s", addr, r.ClusterID, string(r.Name))
if probeList := r.WorkloadHealthCheckInfo(addr); probeList != nil {
return probeList
}
}
log.Infof("Aggregate - didn't find any workload health info for %s", addr)
return nil
}

11 changes: 11 additions & 0 deletions pilot/pkg/serviceregistry/kube/controller.go
@@ -292,11 +292,14 @@ func (c *Controller) GetPodAZ(pod *v1.Pod) (string, bool) {

// ManagementPorts implements a service catalog operation
func (c *Controller) ManagementPorts(addr string) model.PortList {
log.Infof("ManagementPorts: looking for pod IP: %s", addr)
pod, exists := c.pods.getPodByIP(addr)
if !exists {
log.Infof("ManagementPorts: pod IP %s does not exist", addr)
return nil
}

log.Infof("ManagementPorts: pod IP found for IP %s, pod: %+v", addr, pod)
managementPorts, err := convertProbesToPorts(&pod.Spec)

if err != nil {
@@ -305,16 +308,20 @@ func (c *Controller) ManagementPorts(addr string) model.PortList {

// We continue despite the error because healthCheckPorts could return a partial
// list of management ports
log.Infof("ManagementPorts - list of ports for pod IP %s, %+v", addr, managementPorts)
return managementPorts
}

// WorkloadHealthCheckInfo implements a service catalog operation
func (c *Controller) WorkloadHealthCheckInfo(addr string) model.ProbeList {
log.Infof("WorkloadHealthCheckInfo: looking for pod IP: %s", addr)
pod, exists := c.pods.getPodByIP(addr)
if !exists {
log.Infof("WorkloadHealthCheckInfo: pod IP %s does not exist", addr)
return nil
}

log.Infof("WorkloadHealthCheckInfo: pod found for IP %s, pod: %+v", addr, pod)
probes := make([]*model.Probe, 0)

// Obtain probes from the readiness and liveness probes
@@ -324,6 +331,8 @@ func (c *Controller) WorkloadHealthCheckInfo(addr string) model.ProbeList {
if err != nil {
log.Infof("Error while parsing readiness probe port =%v", err)
}
log.Infof("WorkloadHealthCheckInfo: readiness probe found for IP: %s, port: %+v", addr, p)

probes = append(probes, &model.Probe{
Port: p,
Path: container.ReadinessProbe.Handler.HTTPGet.Path,
Expand All @@ -334,6 +343,8 @@ func (c *Controller) WorkloadHealthCheckInfo(addr string) model.ProbeList {
if err != nil {
log.Infof("Error while parsing liveness probe port =%v", err)
}
log.Infof("WorkloadHealthCheckInfo: liveness probe found for IP: %s, port: %+v", addr, p)

probes = append(probes, &model.Probe{
Port: p,
Path: container.LivenessProbe.Handler.HTTPGet.Path,