
Clarification on Nested spire #5894

Open · vinod-ps opened this issue Feb 24, 2025 · 18 comments
Labels: triage/in-progress (Issue triage is in progress)

@vinod-ps

Hi All,

I am trying to deploy nested SPIRE for testing.
I have Cluster01, 02, and 03.
Cluster 01 needs to be the root, and the other two should be nested.
Could you help me understand the following?

  1. Referring to https://github.com/lftraining/lfs482-labs/blob/main/lab-10-nested-spire/README.md, I can see that the nested Kubernetes cluster has both root agents and nested SPIRE agents. Do I need both root and nested agents in the nested cluster?
  2. Are there any detailed steps we can follow for deploying nested SPIRE in Kubernetes clusters?
  3. I was trying to use the https://artifacthub.io/packages/helm/spiffe/spire-nested Helm chart for the deployment, but its configuration is not clear to me. Is there any official documentation for this?

Thanks in advance.

@kfox1111
Contributor

There is some work-in-progress documentation for the spire-nested chart here: https://deploy-preview-293--spiffe.netlify.app/docs/latest/spire-helm-charts-hardened-advanced/nested-spire/

See https://deploy-preview-293--spiffe.netlify.app/img/spire-helm-charts-hardened/multicluster-alternate3.png for a diagram that may answer your question about agents.

@MarcosDY added the triage/in-progress (Issue triage is in progress) label on Feb 25, 2025
@vinod-ps
Author

@kfox1111 - Thanks for the info. However, I am struggling to deploy the root and child SPIRE in AWS EKS.
When I run the command below, I get the following error; however, I am unable to find the option to set this value.
helm upgrade --install -n spire-mgmt spire spire-nested --repo https://spiffe.github.io/helm-charts-hardened/ -f root-value.yaml -f values.yaml --set trustdomain=example.com --dry-run
Release "spire" does not exist. Installing it now.
Error: execution error at (spire-nested/charts/upstream-spire-agent/templates/daemonset.yaml:1:19): trustDomain must be set

@kfox1111
Contributor

You need to actually set the trust domain to your own domain, not just example.com.
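
For reference, a minimal sketch of setting it, assuming the chart reads the value from global.spire.trustDomain (double-check the key name against the chart's values.yaml; note the casing in your --set flag also differs from the trustDomain the error mentions):

    # values sketch (key path assumed)
    global:
      spire:
        trustDomain: your-trust-domain.example.org

    # or equivalently on the command line
    helm upgrade --install -n spire-mgmt spire spire-nested \
      --repo https://spiffe.github.io/helm-charts-hardened/ \
      -f root-value.yaml -f values.yaml \
      --set global.spire.trustDomain=your-trust-domain.example.org --dry-run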

@vinod-ps
Author

vinod-ps commented Feb 28, 2025

@kfox1111 - Thanks a lot for your help.
I am using EKS clusters. In the root EKS cluster, I have deployed without injecting the kubeconfig of the child cluster. That is the next point I have a doubt about.


Does this SPIRE Helm chart create a role in the child cluster which needs to be used in the root?
If so, how can I get the kubeconfig of the child cluster, which is also EKS? I am connecting to the cluster using enterprise SSO. Will that be applicable for the spire-root account created by the Helm chart?

We are getting an error in the child cluster, as the SPIRE pods are not coming up; I suspect this is due to a trust or authentication issue between the root and the child. Below is the error I am getting.

Default container name "spire-agent" not found in pod spire-agent-upstream-nvbrs
Defaulted container "upstream-spire-agent" out of: upstream-spire-agent, ensure-alternate-names (init), fsgroupfix (init)
could not parse trust bundle: open /run/spire/bundle/bundle.crt: no such file or directory

Thanks in advance.

@kfox1111
Contributor

The kubeconfigs are used to upload the root trust bundle to the child clusters. Without them, the child clusters won't be able to bootstrap, and that's why they are not coming up.

As for how to do the auth, we've only ever tested the chart with kubeadm-generated user certs. It should be possible to use other auth plugins that kubectl supports, but exactly how to do that is untested.
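
For what it's worth, a rough sketch of how the child cluster's kubeconfig is wired into the root install, using the external-spire-server.kubeConfigs.<cluster>.kubeConfigBase64 value that also appears later in this thread (treat the exact key path as something to verify against the chart's values):

    # Encode the child cluster's kubeconfig as a single base64 line
    base64 -w 0 spire-cluster-2.kubeconfig > spire-cluster-2.kubeconfig.b64

    # root-values.yaml sketch
    external-spire-server:
      kubeConfigs:
        spire-cluster-2:
          kubeConfigBase64: "<paste contents of spire-cluster-2.kubeconfig.b64>"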

@vinod-ps
Author

vinod-ps commented Mar 1, 2025

@kfox1111 - Thanks for your help.

  1. We used the method mentioned below to create the kubeconfig file instead of using kubeadm. This was successful, and a test of
    the access from the local system was also successful.
    https://archive-docs.d2iq.com/dkp/2.4/create-a-kubeconfig-file-for-your-cluster

  2. After that, when we tried to apply the Helm chart in the root cluster, we got the below error:

    helm upgrade --install -n spire-mgmt spire spire-nested --repo https://spiffe.github.io/helm-charts-hardened/
    --set "external-spire-server.kubeConfigs..kubeConfigBase64=$(cat .kubeconfig)"
    -f your-values.yaml -f root-values.yaml

    Error: UPGRADE FAILED: failed to create resource: Secret in version "v1" cannot be handled as a Secret: illegal base64 data at input
    byte 10

    However, this was fixed by using the below command. I hope this is fine.
    helm upgrade --install -n spire-mgmt spire spire-nested
    --set "external-spire-server.kubeConfigs.spire-cluster-2.kubeConfigBase64=$(base64 -w 0 ../kubeconf/spire-cluster-
    2.kubeconfig)"
    -f spire-nested/values.yaml -f spire-nested/root-value.yaml --create-namespace

  3. In the SPIRE root cluster, the pod is crash-looping: [spire-external-server-0 1/3 CrashLoopBackOff].
    While checking the logs, we can see the below error:

level=error msg="Fatal run error" error="one or more notifiers returned an error: rpc error: code = Internal desc = notifier(k8sbundle): unable to update: unable to get list: Get "https://xxx.xxx.xxx.xxx.amazonaws.com/api/v1/namespaces/spire-system/configmaps/spire-bundle-upstream\": read tcp xx.xx.xx.xx:39122->xx.xx.xx.xx:443: read: connection reset by peer - error from a previous attempt: read tcp xx.xx.xx.xx:39112->xx.xx.xx.xx:443: read: connection reset by peer"
time="2025-03-01T14:25:26Z" level=error msg="Server crashed" error="one or more notifiers returned an error: rpc error: code = Internal desc = notifier(k8sbundle): unable to update: unable to get list: Get "https://xxxx.xxx.xxxx.xxx.amazonaws.com/api/v1/namespaces/spire-system/configmaps/spire-bundle-upstream\": read tcp xx.xx.xx.xx:39122->xx.xx.xx.xx:443: read: connection reset by peer - error from a previous attempt: read tcp xx.xx.xx.xxx:39112->xx.xx.xxx.xxx:443: read: connection reset by peer"

Thanks in advance.

@kfox1111
Contributor

kfox1111 commented Mar 1, 2025

#2 there looks correct. The kubeconfig must be base64-encoded to prevent mangling of special characters.

For number 3, it looks like a firewall of some sort may be blocking access from one cluster to the other.

@vinod-ps
Author

vinod-ps commented Mar 3, 2025

@kfox1111 - Thank you. It was a firewall issue, and it is now fixed. In the root cluster, all pods are running.

Now, in the child cluster, the pods are not starting:
spire-server spiffe-oidc-discovery-provider-687ff5b988-qcnmv 0/2 Init:CrashLoopBackOff 7 (4m4s ago) 15m
spire-server spire-internal-server-0 0/2 CrashLoopBackOff 7 (58s ago) 11m
spire-system spiffe-csi-driver-downstream-jmkfr 2/2 Running 0 13m
spire-system spiffe-csi-driver-downstream-l4bp2 2/2 Running 0 17m
spire-system spiffe-csi-driver-upstream-2wnwp 2/2 Running 0 17m
spire-system spiffe-csi-driver-upstream-rswg9 2/2 Running 0 13m
spire-system spire-agent-downstream-4xd4h 0/1 CrashLoopBackOff 7 (2m47s ago) 13m
spire-system spire-agent-downstream-gmtjf 0/1 CrashLoopBackOff 8 (68s ago) 17m
spire-system spire-agent-upstream-gc2lg 0/1 CrashLoopBackOff 8 (48s ago) 17m
spire-system spire-agent-upstream-m5m6q 0/1 CrashLoopBackOff 7 (2m57s ago) 13m

k -n spire-server logs spire-internal-server-0
Default container name "spire-server" not found in pod spire-internal-server-0
Defaulted container "internal-spire-server" out of: internal-spire-server, spire-controller-manager
time="2025-03-03T12:24:24Z" level=info msg="Using legacy downstream X509 CA TTL calculation by default; this default will change in a future release"
time="2025-03-03T12:24:24Z" level=warning msg="Current umask 0022 is too permissive; setting umask 0027"
time="2025-03-03T12:24:24Z" level=info msg=Configured admin_ids="[]" data_dir=/run/spire/data launch_log_level=info version=1.11.2
time="2025-03-03T12:24:24Z" level=warning msg="Agent is now configured to accept remote network connections for Prometheus stats collection. Please ensure access to this port is tightly controlled" subsystem_name=telemetry
time="2025-03-03T12:24:24Z" level=info msg="Starting prometheus exporter" host=0.0.0.0 port=9988 subsystem_name=telemetry
time="2025-03-03T12:24:24Z" level=info msg="Opening SQL database" db_type=sqlite3 subsystem_name=sql
time="2025-03-03T12:24:24Z" level=info msg="Connected to SQL database" read_only=false subsystem_name=sql type=sqlite3 version=3.46.1
time="2025-03-03T12:24:24Z" level=info msg="Configured DataStore" reconfigurable=false subsystem_name=catalog
time="2025-03-03T12:24:24Z" level=info msg="Configured plugin" external=false plugin_name=disk plugin_type=KeyManager reconfigurable=false subsystem_name=catalog
time="2025-03-03T12:24:24Z" level=info msg="Plugin loaded" external=false plugin_name=disk plugin_type=KeyManager subsystem_name=catalog
time="2025-03-03T12:24:24Z" level=info msg="Configured plugin" external=false plugin_name=k8s_psat plugin_type=NodeAttestor reconfigurable=false subsystem_name=catalog
time="2025-03-03T12:24:24Z" level=info msg="Plugin loaded" external=false plugin_name=k8s_psat plugin_type=NodeAttestor subsystem_name=catalog
time="2025-03-03T12:24:24Z" level=info msg="Configured plugin" external=false plugin_name=k8sbundle plugin_type=Notifier reconfigurable=false subsystem_name=catalog
time="2025-03-03T12:24:24Z" level=info msg="Plugin loaded" external=false plugin_name=k8sbundle plugin_type=Notifier subsystem_name=catalog
time="2025-03-03T12:24:24Z" level=info msg="Configured plugin" external=false plugin_name=spire plugin_type=UpstreamAuthority reconfigurable=false subsystem_name=catalog
time="2025-03-03T12:24:24Z" level=info msg="Plugin loaded" external=false plugin_name=spire plugin_type=UpstreamAuthority subsystem_name=catalog
time="2025-03-03T12:24:24Z" level=info msg="There is not a CA journal record that matches any of the local X509 authority IDs" subsystem_name=ca_manager
time="2025-03-03T12:24:24Z" level=info msg="Journal loaded" jwt_keys=0 subsystem_name=ca_manager x509_cas=0
time="2025-03-03T12:24:24Z" level=error msg="Failed to watch the Workload API: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/spire/upstream_agent/spire-agent.sock: connect: no such file or directory"" external=false plugin_name=spire plugin_type=UpstreamAuthority subsystem_name=catalog
time="2025-03-03T12:24:25Z" level=error msg="Failed to watch the Workload API: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/spire/upstream_agent/spire-agent.sock: connect: no such file or directory"" external=false plugin_name=spire plugin_type=UpstreamAuthority subsystem_name=catalog
time="2025-03-03T12:24:27Z" level=error msg="Failed to watch the Workload API: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/spire/upstream_agent/spire-agent.sock: connect: no such file or directory"" external=false plugin_name=spire plugin_type=UpstreamAuthority subsystem_name=catalog
time="2025-03-03T12:24:30Z" level=error msg="Failed to watch the Workload API: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/spire/upstream_agent/spire-agent.sock: connect: no such file or directory"" external=false plugin_name=spire plugin_type=UpstreamAuthority subsystem_name=catalog
time="2025-03-03T12:24:34Z" level=error msg="Failed to watch the Workload API: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/spire/upstream_agent/spire-agent.sock: connect: no such file or directory"" external=false plugin_name=spire plugin_type=UpstreamAuthority subsystem_name=catalog
time="2025-03-03T12:24:39Z" level=error msg="Failed to watch the Workload API: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/spire/upstream_agent/spire-agent.sock: connect: no such file or directory"" external=false plugin_name=spire plugin_type=UpstreamAuthority subsystem_name=catalog
time="2025-03-03T12:24:45Z" level=error msg="Failed to watch the Workload API: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/spire/upstream_agent/spire-agent.sock: connect: no such file or directory"" external=false plugin_name=spire plugin_type=UpstreamAuthority subsystem_name=catalog
time="2025-03-03T12:24:52Z" level=error msg="Failed to watch the Workload API: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/spire/upstream_agent/spire-agent.sock: connect: no such file or directory"" external=false plugin_name=spire plugin_type=UpstreamAuthority subsystem_name=catalog
time="2025-03-03T12:25:00Z" level=error msg="Failed to watch the Workload API: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/spire/upstream_agent/spire-agent.sock: connect: no such file or directory"" external=false plugin_name=spire plugin_type=UpstreamAuthority subsystem_name=catalog

k -n spire-system logs spire-agent-upstream-m5m6q
Default container name "spire-agent" not found in pod spire-agent-upstream-m5m6q
Defaulted container "upstream-spire-agent" out of: upstream-spire-agent, ensure-alternate-names (init), fsgroupfix (init)
could not parse trust bundle: open /run/spire/bundle/bundle.crt: no such file or directory

Do you have any clues?

Thanks in advance.

@kfox1111
Contributor

kfox1111 commented Mar 3, 2025

The upstream agent establishes trust with the root cluster, and the root cluster uploads a trust bundle to the child cluster. The agent is saying it cannot find a trust bundle, so the external spire-server in the root cluster is not functioning properly yet. Check its logs.
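
A couple of checks that may help narrow this down (pod, namespace, and configmap names are taken from this thread; adjust them to your install):

    # In the root cluster: inspect the external server that is supposed to push the bundle
    kubectl -n spire-mgmt logs spire-external-server-0

    # In the child cluster: confirm the bundle configmap was actually created and populated
    kubectl -n spire-system get configmap spire-bundle-upstream -o yaml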

@vinod-ps
Author

vinod-ps commented Mar 5, 2025

@kfox1111 - Thanks for your inputs.
I have checked the root external SPIRE pod and am getting the logs below. I checked the configmap spire-bundle-upstream, which wasn't created by the helm apply; however, in the helm dry-run output I can see the configmap, with no data in it.
Hence, I tried to manually apply the configmap, but I am still getting the same error.

k -n spire-mgmt logs spire-external-server-0 | grep -i error

Default container name "spire-server" not found in pod spire-external-server-0
Defaulted container "external-spire-server" out of: external-spire-server, spire-controller-manager, spire-controller-manager-spire-cluster-2, chown (init)
time="2025-03-05T05:53:01Z" level=warning msg="X509CA slot unusable" error="slot expired" issued_at="2025-03-05 01:47:17 +0000 UTC" local_authority_id=e3976412c6ecd735ffd0eb1b0f38946fceb9133e slot=B status=OLD subsystem_name=ca_manager upstream_authority_id=27f6b95a19bef6c4e9dcd8a1d2b90e437307f3f5
time="2025-03-05T05:53:01Z" level=warning msg="X509CA slot unusable" error="slot expired" issued_at="2025-03-04 23:44:22 +0000 UTC" local_authority_id=6887c1f44f05a6d8652bc4d632b779af52472ccb slot=A status=OLD subsystem_name=ca_manager upstream_authority_id=27f6b95a19bef6c4e9dcd8a1d2b90e437307f3f5
time="2025-03-05T05:53:01Z" level=warning msg="X509CA slot unusable" error="slot expired" issued_at="2025-03-04 21:41:26 +0000 UTC" local_authority_id=83a918141f2401ef593b1ed43b4606949a6e3fc5 slot=B status=OLD subsystem_name=ca_manager upstream_authority_id=27f6b95a19bef6c4e9dcd8a1d2b90e437307f3f5
time="2025-03-05T05:53:01Z" level=warning msg="X509CA slot unusable" error="slot expired" issued_at="2025-03-04 19:38:13 +0000 UTC" local_authority_id=73257888685e52be881aedf5b0b25725d291082e slot=A status=OLD subsystem_name=ca_manager upstream_authority_id=27f6b95a19bef6c4e9dcd8a1d2b90e437307f3f5
time="2025-03-05T05:53:01Z" level=warning msg="X509CA slot unusable" error="slot expired" issued_at="2025-03-04 17:35:27 +0000 UTC" local_authority_id=020f55045af7e05831d8c36f643fd27be85064c9 slot=B status=OLD subsystem_name=ca_manager upstream_authority_id=27f6b95a19bef6c4e9dcd8a1d2b90e437307f3f5
time="2025-03-05T05:53:01Z" level=warning msg="X509CA slot unusable" error="slot expired" issued_at="2025-03-04 15:32:27 +0000 UTC" local_authority_id=398264133ca74cad06e2451cba055b59b6ddd09c slot=A status=OLD subsystem_name=ca_manager upstream_authority_id=27f6b95a19bef6c4e9dcd8a1d2b90e437307f3f5
time="2025-03-05T05:53:01Z" level=warning msg="X509CA slot unusable" error="slot expired" issued_at="2025-03-04 13:29:09 +0000 UTC" local_authority_id=216a89ca4818274a210fa1ab9c6783cf61696070 slot=B status=OLD subsystem_name=ca_manager upstream_authority_id=27f6b95a19bef6c4e9dcd8a1d2b90e437307f3f5
time="2025-03-05T05:53:01Z" level=warning msg="X509CA slot unusable" error="slot expired" issued_at="2025-03-04 11:27:32 +0000 UTC" local_authority_id=89b05927f076b86e1e0ceed83f2d2fdb73f6dbf4 slot=A status=OLD subsystem_name=ca_manager upstream_authority_id=27f6b95a19bef6c4e9dcd8a1d2b90e437307f3f5
time="2025-03-05T05:53:01Z" level=warning msg="X509CA slot unusable" error="slot expired" issued_at="2025-03-04 08:09:35 +0000 UTC" local_authority_id=16ef845b576b3ae2c8e820de03a0522bebd2171d slot=B status=OLD subsystem_name=ca_manager upstream_authority_id=27f6b95a19bef6c4e9dcd8a1d2b90e437307f3f5
time="2025-03-05T05:53:02Z" level=error msg="Health check has failed" check=server error="subsystem is not live or ready" subsystem_name=health
time="2025-03-05T05:53:02Z" level=warning msg="Health check failed" check=server details="{false false {unable to fetch bundle: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix ///tmp/spire-server/private/api.sock: connect: no such file or directory"} {unable to fetch bundle: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix ///tmp/spire-server/private/api.sock: connect: no such file or directory"}}" error="subsystem is not live or ready" subsystem_name=health
time="2025-03-05T05:53:02Z" level=error msg="Notifier failed to handle event" error="rpc error: code = Internal desc = notifier(k8sbundle): unable to update: unable to get list: configmaps "spire-bundle-upstream" not found" event="bundle loaded" notifier=k8sbundle subsystem_name=ca_manager
time="2025-03-05T05:53:02Z" level=error msg="Fatal run error" error="one or more notifiers returned an error: rpc error: code = Internal desc = notifier(k8sbundle): unable to update: unable to get list: configmaps "spire-bundle-upstream" not found"
time="2025-03-05T05:53:02Z" level=error msg="Server crashed" error="one or more notifiers returned an error: rpc error: code = Internal desc = notifier(k8sbundle): unable to update: unable to get list: configmaps "spire-bundle-upstream" not found"

Thanks in advance.

@kfox1111
Contributor

kfox1111 commented Mar 5, 2025

Definitely something wrong with the external spire server. Did you change any of its config?

@vinod-ps
Author

vinod-ps commented Mar 11, 2025

@kfox1111 - There are no changes made to the config.

I have only created the values.yaml and root values yaml file, and similarly the values file and child values file, as mentioned in the doc.
The root SPIRE server (spire-external-server-0) is coming up, but only under one condition: the deployment must be in the spire-system namespace in the child cluster. I couldn't see any errors in the logs of the spire-external-server-0 pod.

Additionally, I noticed this issue: spiffe/helm-charts-hardened#528

  1. Should the child and root share a common persistent storage class?

In the child cluster, the status is as below.

k -n spire-system get pods
NAME READY STATUS RESTARTS AGE
spiffe-csi-driver-downstream-df8m9 2/2 Running 0 7m16s
spiffe-csi-driver-downstream-tr4b5 2/2 Running 0 7m16s
spiffe-csi-driver-upstream-4tjhv 2/2 Running 0 7m16s
spiffe-csi-driver-upstream-xbvzd 2/2 Running 0 7m16s
spiffe-oidc-discovery-provider-565c994c45-8htrz 0/2 Init:Error 1 (5s ago) 6s
spire-agent-downstream-8gc7d 0/1 Error 1 (2s ago) 5s
spire-agent-downstream-mbzqt 0/1 CrashLoopBackOff 1 (2s ago) 6s
spire-agent-upstream-t9kx5 0/1 Running 0 6s
spire-agent-upstream-xqpg4 0/1 Running 0 6s
spire-internal-server-0 0/2 Error 0 5s

k -n spire-system logs spire-internal-server-0

Default container name "spire-server" not found in pod spire-internal-server-0
Defaulted container "internal-spire-server" out of: internal-spire-server, spire-controller-manager, chown (init)
time="2025-03-11T19:06:11Z" level=info msg="Using legacy downstream X509 CA TTL calculation by default; this default will change in a future release"
time="2025-03-11T19:06:11Z" level=warning msg="Current umask 0022 is too permissive; setting umask 0027"
time="2025-03-11T19:06:11Z" level=info msg=Configured admin_ids="[]" data_dir=/run/spire/data launch_log_level=info version=1.11.2
time="2025-03-11T19:06:11Z" level=info msg="Opening SQL database" db_type=sqlite3 subsystem_name=sql
time="2025-03-11T19:06:11Z" level=info msg="Connected to SQL database" read_only=false subsystem_name=sql type=sqlite3 version=3.46.1
time="2025-03-11T19:06:11Z" level=info msg="Configured DataStore" reconfigurable=false subsystem_name=catalog
time="2025-03-11T19:06:11Z" level=info msg="Configured plugin" external=false plugin_name=disk plugin_type=KeyManager reconfigurable=false subsystem_name=catalog
time="2025-03-11T19:06:11Z" level=info msg="Plugin loaded" external=false plugin_name=disk plugin_type=KeyManager subsystem_name=catalog
time="2025-03-11T19:06:11Z" level=info msg="Configured plugin" external=false plugin_name=k8s_psat plugin_type=NodeAttestor reconfigurable=false subsystem_name=catalog
time="2025-03-11T19:06:11Z" level=info msg="Plugin loaded" external=false plugin_name=k8s_psat plugin_type=NodeAttestor subsystem_name=catalog
time="2025-03-11T19:06:11Z" level=info msg="Configured plugin" external=false plugin_name=k8sbundle plugin_type=Notifier reconfigurable=false subsystem_name=catalog
time="2025-03-11T19:06:11Z" level=info msg="Plugin loaded" external=false plugin_name=k8sbundle plugin_type=Notifier subsystem_name=catalog
time="2025-03-11T19:06:11Z" level=info msg="Configured plugin" external=false plugin_name=spire plugin_type=UpstreamAuthority reconfigurable=false subsystem_name=catalog
time="2025-03-11T19:06:11Z" level=info msg="Plugin loaded" external=false plugin_name=spire plugin_type=UpstreamAuthority subsystem_name=catalog
time="2025-03-11T19:06:11Z" level=info msg="There is not a CA journal record that matches any of the local X509 authority IDs" subsystem_name=ca_manager
time="2025-03-11T19:06:11Z" level=info msg="Journal loaded" jwt_keys=0 subsystem_name=ca_manager x509_cas=0
time="2025-03-11T19:06:11Z" level=error msg="Failed to watch the Workload API: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/spire/upstream_agent/spire-agent.sock: connect: no such file or directory"" external=false plugin_name=spire plugin_type=UpstreamAuthority subsystem_name=catalog
time="2025-03-11T19:06:12Z" level=error msg="Failed to watch the Workload API: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/spire/upstream_agent/spire-agent.sock: connect: no such file or directory"" external=false plugin_name=spire plugin_type=UpstreamAuthority subsystem_name=catalog
time="2025-03-11T19:06:14Z" level=error msg="Failed to watch the Workload API: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/spire/upstream_agent/spire-agent.sock: connect: no such file or directory"" external=false plugin_name=spire plugin_type=UpstreamAuthority subsystem_name=catalog
time="2025-03-11T19:06:17Z" level=error msg="Failed to watch the Workload API: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/spire/upstream_agent/spire-agent.sock: connect: no such file or directory"" external=false plugin_name=spire plugin_type=UpstreamAuthority subsystem_name=catalog
time="2025-03-11T19:06:21Z" level=error msg="Failed to watch the Workload API: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/spire/upstream_agent/spire-agent.sock: connect: no such file or directory"" external=false plugin_name=spire plugin_type=UpstreamAuthority subsystem_name=catalog

k -n spire-system logs spire-agent-upstream-t9kx5

Default container name "spire-agent" not found in pod spire-agent-upstream-t9kx5
Defaulted container "upstream-spire-agent" out of: upstream-spire-agent, ensure-alternate-names (init)
time="2025-03-11T19:16:58Z" level=warning msg="Current umask 0022 is too permissive; setting umask 0027"
time="2025-03-11T19:16:58Z" level=info msg="Starting agent" data_dir=/var/lib/spire version=1.11.2
time="2025-03-11T19:16:58Z" level=info msg="Plugin loaded" external=false plugin_name=memory plugin_type=KeyManager subsystem_name=catalog
time="2025-03-11T19:16:58Z" level=info msg="Configured plugin" external=false plugin_name=k8s_psat plugin_type=NodeAttestor reconfigurable=false subsystem_name=catalog
time="2025-03-11T19:16:58Z" level=info msg="Plugin loaded" external=false plugin_name=k8s_psat plugin_type=NodeAttestor subsystem_name=catalog
time="2025-03-11T19:16:58Z" level=info msg="Using the new container locator" external=false plugin_name=k8s plugin_type=WorkloadAttestor subsystem_name=catalog
time="2025-03-11T19:16:58Z" level=info msg="Configured plugin" external=false plugin_name=k8s plugin_type=WorkloadAttestor reconfigurable=false subsystem_name=catalog
time="2025-03-11T19:16:58Z" level=info msg="Plugin loaded" external=false plugin_name=k8s plugin_type=WorkloadAttestor subsystem_name=catalog
time="2025-03-11T19:16:58Z" level=info msg="Bundle loaded" subsystem_name=attestor trust_domain_id="spiffe://spire-xxxxxx.ec1.aws.xxxx.cloud.xxxx"
time="2025-03-11T19:16:58Z" level=info msg="SVID is not found. Starting node attestation" subsystem_name=attestor trust_domain_id="spiffe://spire-xxxxxx.ec1.aws.xxxx.cloud.xxxx"
time="2025-03-11T19:17:18Z" level=warning msg="Failed to retrieve attestation result" error="could not open attestation stream to SPIRE server: rpc error: code = Unavailable desc = last connection error: connection error: desc = "transport: Error while dialing: dial tcp .xx125.56.138:443: i/o timeout"" retry_interval=5.077932121s
time="2025-03-11T19:17:23Z" level=info msg="Bundle loaded" subsystem_name=attestor trust_domain_id="spiffe://spire-xxxxxx.ec1.aws.xxxx.cloud.xxxx"
time="2025-03-11T19:17:23Z" level=warning msg="Keys recovered, but no SVID found. Generating new keypair" subsystem_name=attestor
time="2025-03-11T19:17:23Z" level=info msg="SVID is not found. Starting node attestation" subsystem_name=attestor trust_domain_id="spiffe://spire-xxxxxx.ec1.aws.xxxx.cloud.xxxx"
time="2025-03-11T19:17:43Z" level=warning msg="Failed to retrieve attestation result" error="could not open attestation stream to SPIRE server: rpc error: code = Unavailable desc = last connection error: connection error: desc = "transport: Error while dialing: dial tcp xx.xx.xx.xx:443: i/o timeout"" retry_interval=8.211752039s
time="2025-03-11T19:17:51Z" level=info msg="Bundle loaded" subsystem_name=attestor trust_domain_id="spiffe://spire-xxxxxx.ec1.aws.xxxx.cloud.xxxx"
time="2025-03-11T19:17:51Z" level=warning msg="Keys recovered, but no SVID found. Generating new keypair" subsystem_name=attestor
time="2025-03-11T19:17:51Z" level=info msg="SVID is not found. Starting node attestation" subsystem_name=attestor trust_domain_id="spiffe://spire-xxxxxx.ec1.aws.xxxx.cloud.xxxx"
time="2025-03-11T19:18:11Z" level=info msg="Plugin unloaded" external=false plugin_name=k8s plugin_type=WorkloadAttestor subsystem_name=catalog
time="2025-03-11T19:18:11Z" level=info msg="Plugin unloaded" external=false plugin_name=k8s_psat plugin_type=NodeAttestor subsystem_name=catalog
time="2025-03-11T19:18:11Z" level=info msg="Plugin unloaded" external=false plugin_name=memory plugin_type=KeyManager subsystem_name=catalog
time="2025-03-11T19:18:11Z" level=info msg="Catalog closed" subsystem_name=catalog
time="2025-03-11T19:18:11Z" level=error msg="Agent crashed" error="could not open attestation stream to SPIRE server: rpc error: code = Unavailable desc = last connection error: connection error: desc = "transport: Error while dialing: dial tcp xx.xx.xx.xx:443: i/o timeout""

Thanks in advance.

@vinod-ps
Author

Hi,
Could anyone help here?
Thanks in advance.

@sorindumitru
Collaborator

sorindumitru commented Mar 17, 2025

@vinod-ps The latest error seems to be because the agent can't connect to spire-server:

time="2025-03-11T19:18:11Z" level=error msg="Agent crashed" error="could not open attestation stream to SPIRE server: rpc error: code = Unavailable desc = last connection error: connection error: desc = "transport: Error while dialing: dial tcp xx.xx.xx.xx:443: i/o timeout""

Is that IP:PORT combination reachable from where the agent is running?

It might be easier/faster to get help for this issue on Slack, if you can join it, since it's mostly about debugging a deployment.
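
One way to verify this from inside the child cluster is to run a throwaway debug pod and test basic TCP reachability to the root server endpoint (a generic sketch; the image and address are just examples, so substitute the host and port from your agent configuration, and any debug image with nc will do):

    kubectl -n spire-system run net-test --rm -it --restart=Never --image=nicolaka/netshoot -- \
      sh -c 'nc -zv -w 5 <root-spire-server-address> 443'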

@vinod-ps
Author

@sorindumitru - Yes, the IP:PORT combination is reachable from the nested agent to the root, as both clusters are in the same VPC and subnet.
What is the expected response from the external SPIRE server and the OIDC provider in the root cluster? When I tried kubectl port-forward to these pods, I was getting a 404.
How can I test the external server and OIDC service in the root cluster?

@sorindumitru
Collaborator

sorindumitru commented Mar 17, 2025

The logs show that the agent isn't able to talk to the upstream spire-server. I'd look into why that is happening.

Generally, for your 3-cluster example (root, A, B), you'd need to have:

  • In the root cluster:
    • spire-server running with its port directly exposed to callers from outside the cluster. The agents need to establish an mTLS connection with the server, so if you use some kind of proxy/ingress it needs to do TLS passthrough for spire-server.
    • the kubeconfig files for the A and B clusters, used from the spire-server configuration for the k8s_psat node attestor.
  • In the A and B clusters:
    • The root spire-agent, configured such that it can attest to the spire-server running in the root cluster (e.g. cluster name and namespace/service accounts should match between the agent and server configurations). The namespace/service account should likely be different from the ones used by the downstream servers and agents.
    • The downstream spire-server, configured to talk to the root spire-agent and root spire-server through the "spire" UpstreamAuthority plugin (see the sketch after this list).
    • The downstream agents, if you have any workloads in these clusters, configured to talk to the downstream spire-server instances in their clusters.
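
As a rough illustration of that last piece of wiring (not the exact output of the chart; the socket path matches the one in your logs, while the address and port are placeholders), the downstream server's UpstreamAuthority block typically looks something like:

    UpstreamAuthority "spire" {
        plugin_data {
            # Address and port where the root cluster's spire-server is exposed (placeholders)
            server_address = "root-spire-server.example.org"
            server_port    = "443"
            # Workload API socket of the root (upstream) spire-agent running in this cluster
            workload_api_socket = "/run/spire/upstream_agent/spire-agent.sock"
        }
    }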

I think this question is probably better suited for the helm chart repo, since it seems like you are using the helm chart. If that is the case, we can move it there; the maintainers there would be better placed to help you with the helm chart.

@kfox1111
Contributor

We run this configuration regularly in the helm chart gate tests: https://github.com/spiffe/helm-charts-hardened/tree/main/examples/nested-full

I'd go back and either:

  1. follow the instructions verbatim, rather than making any customization around doing your own authentication, the first time, to establish a working baseline, and then switch out the auth bits; or
  2. very closely consider anything that could be different between the established working pattern and the kind of auth you are trying to do when changing the auth bits.

@kfox1111
Contributor

Also, double-check that the kubeconfig works with a stock kubectl without any other software installed, or make sure you install any plugin/component needed for the auth plugin you have in mind into a set of custom images.
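
For example, something along these lines from a machine with nothing but kubectl on it (file name taken from the earlier command in this thread):

    # Should list nodes without prompting for SSO or relying on any exec-based auth plugin
    kubectl --kubeconfig ../kubeconf/spire-cluster-2.kubeconfig get nodes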
