Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Final container logs missing when disposing DistributedApplicationTestingBuilder #8206

Open
1 task done
afscrome opened this issue Mar 20, 2025 · 2 comments
Open
1 task done
Assignees
Milestone

Comments

@afscrome
Copy link
Contributor

afscrome commented Mar 20, 2025

Is there an existing issue for this?

  • I have searched the existing issues

Describe the bug

When troubleshooting test failures with DistributedApplicationTestingBuilder I've been facing problems where I'm missing logs - I'll get a container move to FailedToStart, but no output to go alongside it - which makes fixing the failure very tricky.

After a bit of investigation, I worked out that by adding a Task.Delay after an exception (but before disposing DistributedApplicationTestingBuilder), the logs would come through, so it seems like DistributedApplicationTestingBuilder needs to do some kind of log flushing before disposing itself.

Expected Behavior

Test output should include all stdout/stderr output from the container that failed to start - particularly the very last lines as this likely contains details of the error you need to fix.

Steps To Reproduce

The following example starts a container which writes "Some Error" to standard output, and then exists with exit 1234 to cause the container to fail to start.

  1. Run the below code
  2. Expect to see the text Some Error somewhere int he output (It is not there)
  3. Uncomment the Task.Delay() and re-run the test
  4. Now you do see Some Error in the logs
   [Test]  // NUnit
   [CancelAfter(30_000)]
   public async Task FlushTest(CancellationToken ct)
   {
      var builder = DistributedApplicationTestingBuilder.Create();
      builder.Services.AddLogging(logging =>
      {
         logging
            .SetMinimumLevel(LogLevel.Debug)
            .AddFilter(builder.Environment.ApplicationName, LogLevel.Debug)
            .AddFilter("Aspire.", LogLevel.Debug);
      });

      builder.AddContainer("failure", "mcr.microsoft.com/dotnet/runtime")
         .WithEntrypoint("sh")
         .WithArgs("-c", "echo Some Error; exit 1234")
         .WithHttpEndpoint(targetPort: 8080)
         .WithHttpHealthCheck();

      var app = builder.Build();
      await app.StartAsync(ct);

      try
      {
         await app.ResourceNotifications.WaitForResourceHealthyAsync("failure", ct);
      }
      catch
      {
         // Try giving logs time to flush
         //await Task.Delay(5_000);
         throw;
      }
   }

Exceptions (if any)

Without the Task.Delay, you get the following output

dbug: Microsoft.Extensions.Hosting.Internal.Host[1]
      Hosting starting
info: Aspire.Hosting.DistributedApplication[0]
      Aspire version: 9.2.0-preview.1.25163.7+543a7dada97c3ec6a7758e1ce876889fd005873a
info: Aspire.Hosting.DistributedApplication[0]
      Distributed application starting.
info: Aspire.Hosting.DistributedApplication[0]
      Application host directory is: S:\MYORG.Aspire\src\MYORG.Aspire.AppHost.Tests
dbug: Aspire.Hosting.Cli.CliBackchannel[0]
      Aspire CLI backchannel socket path was not specified.
info: Aspire.Hosting.Dcp.DcpHost[0]
      Starting DCP with arguments: start-apiserver --monitor 39540 --detach --kubeconfig "C:\Users\USERNAME\AppData\Local\Temp\aspire.rpstdru0.hpv\kubeconfig"
dbug: Aspire.Hosting.Dcp.DcpExecutor[0]
      Watching over DCP Executable resources.
dbug: Aspire.Hosting.Dcp.DcpExecutor[0]
      Watching over DCP Container resources.
dbug: Aspire.Hosting.Dcp.DcpExecutor[0]
      Watching over DCP Service resources.
dbug: Aspire.Hosting.Dcp.DcpExecutor[0]
      Watching over DCP Endpoint resources.
info: Aspire.Hosting.Dcp.dcp.start-apiserver.api-server[0]
      Starting API server...
info: Aspire.Hosting.Dcp.dcp.start-apiserver.api-server[0]
      API server started	{"Address": "127.0.0.1", "Port": 57516}
info: Aspire.Hosting.Dcp.dcp.start-apiserver.dcp-host[0]
      Starting DCP controller host
info: Aspire.Hosting.Dcp.dcp.start-apiserver.dcp-host[0]
      Started all services	{"count": 1}
info: Aspire.Hosting.Dcp.dcpctrl.ContainerOrchestrator[0]
      runtime status	{"ContainerRuntime": "", "runtime": "podman", "status": {"Installed":false,"Running":false,"Error":"exec: \"podman\": executable file not found in %PATH%\nfailed to start podman command 'Info'"}}
dbug: Aspire.Hosting.Dcp.KubernetesService[0]
      Successfully read Kubernetes configuration from 'C:\Users\USERNAME\AppData\Local\Temp\aspire.rpstdru0.hpv\kubeconfig' after 3172 milliseconds.
info: Aspire.Hosting.Dcp.dcpctrl.ContainerOrchestrator[0]
      runtime status	{"ContainerRuntime": "", "runtime": "docker", "status": {"Installed":true,"Running":true,"Error":""}}
info: Aspire.Hosting.Dcp.dcpctrl[0]
      starting controller manager
dbug: Aspire.Hosting.ApplicationModel.ResourceNotificationService[0]
      Resource failure/failure-uxasffqy changed state: Starting
info: Aspire.Hosting.DistributedApplication[0]
      Distributed application started. Press Ctrl+C to shut down.
dbug: Microsoft.Extensions.Hosting.Internal.Host[2]
      Hosting started
dbug: Aspire.Hosting.ApplicationModel.ResourceNotificationService[0]
      Waiting for resource 'failure' to enter the 'Healthy' state.
dbug: Aspire.Hosting.ApplicationModel.ResourceNotificationService[0]
      Resource failure/failure-uxasffqy changed state: Starting
dbug: Aspire.Hosting.Dcp.DcpExecutor[0]
      Starting log streaming for failure-uxasffqy.
info: Aspire.Hosting.Dcp.dcp.start-apiserver.api-server.container-logstreamer.ContainerOrchestrator[0]
      runtime status	{"ContainerRuntime": "", "runtime": "podman", "status": {"Installed":false,"Running":false,"Error":"exec: \"podman\": executable file not found in %PATH%\nfailed to start podman command 'Info'"}}
info: Aspire.Hosting.Dcp.dcpctrl.NetworkReconciler[0]
      network created	{"NetworkName": {"name":"default-aspire-network"}, "Reconciliation": 2, "Network": "default-aspire-network-csbs4hjo28"}
info: Aspire.Hosting.Dcp.dcp.start-apiserver.api-server.container-logstreamer.ContainerOrchestrator[0]
      runtime status	{"ContainerRuntime": "", "runtime": "docker", "status": {"Installed":true,"Running":true,"Error":""}}
info: MYORG.Aspire.AppHost.Tests.Resources.failure[0]
      1: 2025-03-20T08:52:41.9090000Z f81b96b2a584facf8668015e8b4259f110235b76bf4e19cb6de0f3141574bb0b
info: Aspire.Hosting.Dcp.dcpctrl.ContainerReconciler[0]
      Added new ContainerNetworkConnection	{"Container": {"name":"failure-uxasffqy"}, "Reconciliation": 5, "Container": "f81b96b2a584facf8668015e8b4259f110235b76bf4e19cb6de0f3141574bb0b", "ContainerNetworkConnection": "/failure-uxasffqy-default-aspire-network-2h4auef9oo"}
fail: Aspire.Hosting.Dcp.dcpctrl.ContainerReconciler[0]
      failed to start Container	{"Container": {"name":"failure-uxasffqy"}, "Reconciliation": 7, "ContainerID": "f81b96b2a584facf8668015e8b4259f110235b76bf4e19cb6de0f3141574bb0b", "error": "container f81b96b2a584facf8668015e8b4259f110235b76bf4e19cb6de0f3141574bb0b start failed (exit code 210)\nobject not found\ncontainer not found"}
dbug: Aspire.Hosting.ApplicationModel.ResourceNotificationService[0]
      Resource failure/failure-uxasffqy changed state: Starting -> FailedToStart
fail: Aspire.Hosting.ApplicationModel.ResourceNotificationService[0]
      Stopped waiting for resource 'failure' to become healthy because it failed to start.

If you uncomment the Task.Delay(), you will then get the following additional output

    fail: Aspire.Hosting.Dcp.dcpproc.monitor[0]
          error finding process	{"pid": 43524, "error": "no process found with PID 43524"}
    fail: Aspire.Hosting.Dcp.dcpproc.monitor[0]
          child service process could not be monitored	{"PID": 43524, "error": "no process found with PID 43524"}
    fail: Aspire.Hosting.Dcp.dcpproc[0]
          the program finished with an error	{"ExitCode": 1, "error": "no process found with PID 43524"}
    info: Trayport.Aspire.AppHost.Tests.Resources.failure[0]
          2: 2025-03-20T09:03:45.5613832Z Some Error

.NET Version info

9.2.0-preview.1.25163.7+543a7dada97c3ec6a7758e1ce876889fd005873a

Anything else?

No response

@github-actions github-actions bot added the area-app-model Issues pertaining to the APIs in Aspire.Hosting, e.g. DistributedApplication label Mar 20, 2025
@afscrome afscrome changed the title Final logs missing when disposing DistributedApplicationTestingBuilder Final container logs missing when disposing DistributedApplicationTestingBuilder Mar 20, 2025
afscrome added a commit to afscrome/aspire that referenced this issue Mar 30, 2025
Fixes dotnet#8206

Updated `DcpExecutor` to wait for log streams to fully flush during `StopAsync`, rather than immediately cancelling them.  Log streams should shut down once the associated resource is shut down and the logs have been processed.

Previous behaviour could cause logs from the very end of a process to be lost as the log streams could be shut down before the logs were forwarded from DCP to ILogger.  This was particularly problematic for DistributedApplicationBuilderTests failing whilst waiting for a resource to enter a particular state as the final lines of error are usually the ones you want to work out why a container / executable exited / failed to start.
afscrome added a commit to afscrome/aspire that referenced this issue Mar 30, 2025
Updated `DcpExecutor` to wait for log streams to fully flush during `StopAsync`, rather than immediately cancelling them.
 Log streams should shut down once the associated resource is shut down and the logs have been processed.

Previous behaviour could cause logs from the very end of a process to be lost as the log streams could be shut down before the logs were forwarded from DCP to ILogger.
This was particularly problematic for DistributedApplicationBuilderTests failing whilst waiting for a resource to enter a particular state as the final lines of error are usually the ones you want to work out why a container / executable exited / failed to start.

Fixes dotnet#8206
afscrome added a commit to afscrome/aspire that referenced this issue Mar 30, 2025
Updated `DcpExecutor` to wait for log streams to fully flush during `StopAsync`, rather than immediately cancelling them.
 Log streams should shut down once the associated resource is shut down and the logs have been processed.

Previous behaviour could cause logs from the very end of a process to be lost as the log streams could be shut down before the logs were forwarded from DCP to ILogger.
This was particularly problematic for DistributedApplicationBuilderTests failing whilst waiting for a resource to enter a particular state as the final lines of error are usually the ones you want to work out why a container / executable exited / failed to start.

Fixes dotnet#8206
@afscrome afscrome added area-orchestrator and removed area-app-model Issues pertaining to the APIs in Aspire.Hosting, e.g. DistributedApplication labels Mar 31, 2025
@afscrome
Copy link
Contributor Author

I think I need to have to abandon my PR as this will need changes in DCP - I couldn't find a way to trigger DCP to shutdown in a way that the app host could fully read the logs before DCP shut down.

The test cases in the PR are still good though.

@dbreshears dbreshears added this to the Backlog milestone Mar 31, 2025
@karolz-ms
Copy link
Member

Note to myself: see #8423 (comment) and following for thoughts on where the problem might be.

@dbreshears dbreshears modified the milestones: Backlog, 9.3 Apr 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants