Skip to content

Fix GrpcWorkerChannel.StartWorkerProcessAsync timeout #10937

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 18 commits into
base: dev
Choose a base branch
from

Conversation

jviau
Copy link
Contributor

@jviau jviau commented Mar 21, 2025

Issue describing the changes in this PR

resolves #issue_for_this_pr

Pull request checklist

IMPORTANT: Currently, changes must be backported to the in-proc branch to be included in Core Tools and non-Flex deployments.

  • Backporting to the in-proc branch is not required
    • Otherwise: Link to backporting PR
  • My changes do not require documentation changes
    • Otherwise: Documentation issue linked to PR
  • My changes should not be added to the release notes for the next release
    • Otherwise: I've added my notes to release_notes.md -- TODO
  • My changes do not need to be backported to a previous version
    • Otherwise: Backport tracked by issue/PR #issue_or_pr
  • My changes do not require diagnostic events changes
    • Otherwise: I have added/updated all related diagnostic events and their documentation (Documentation issue linked to PR)
  • I have added all required tests (Unit tests, E2E tests) -- TODO

Additional information

This PR improves the ScriptHost startup experience with a bad worker. Today, if a worker crashes or exits immediately after startup, then the GrpcWorkerChannel.StartWorkerProcessAsync will block on _workerInitTask.Task until it eventually times out. This tends to fault the entire host (at least during debugging).

To address this, a WorkerProcess.WaitForExitAsync is added and GrpcWorkerChannel.StartWorkerProcessAsync will also wait on that, improving the responsiveness to a worker exiting before connecting gRPC events.

@jviau jviau requested a review from a team as a code owner March 21, 2025 21:25
await _rpcWorkerProcess.StartProcessAsync();
_state = _state | RpcWorkerChannelState.Initializing;
await _workerInitTask.Task;
await _rpcWorkerProcess.StartProcessAsync(cancellationToken);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the important change - we will now wait on either worker fully initialized (gRPC connection established) or worker exits (in which case, we will re-throw any failures the worker experience).

Copy link
Member

@brettsam brettsam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good -- would just like a test added

@jviau
Copy link
Contributor Author

jviau commented May 14, 2025

/azp run host.integration-tests

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@jviau
Copy link
Contributor Author

jviau commented May 14, 2025

/azp run host.public

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@jviau jviau requested a review from brettsam May 19, 2025 20:12
@liliankasem
Copy link
Member

liliankasem commented Jul 18, 2025

@jviau @mattchenderson I was just investigating this yesterday! Turns out this is the same reason that we are stuck for 60s when trying to exit the CLI during startup (Azure/azure-functions-core-tools#4355). Glad to see there is a already a fix before I opened one :) I'm going to test your changes out with core tools and see if it addresses the issue like I expect.

Looks like it works?

2025-07-18T16:13:56.776Z] Reading functions metadata (Worker)
[2025-07-18T16:13:57.920Z] {
[2025-07-18T16:13:57.920Z]   "ProcessId": 26145,
[2025-07-18T16:14:00.842Z]   "RuntimeIdentifier": "osx-arm64",
[2025-07-18T16:14:00.842Z]   "WorkerVersion": "2.0.0.0",
[2025-07-18T16:14:00.842Z]   "ProductVersion": "2.0.0\u002Bd8b5fe998a8c92819b8ee41d2569d2525413e9c5",
[2025-07-18T16:14:00.842Z]   "FrameworkDescription": ".NET 9.0.7",
[2025-07-18T16:14:00.842Z]   "OSDescription": "Darwin 24.5.0 Darwin Kernel Version 24.5.0: Tue Apr 22 19:54:29 PDT 2025; root:xnu-11417.121.6~2/RELEASE_ARM64_T6030",
[2025-07-18T16:14:00.842Z]   "OSArchitecture": "Arm64",
[2025-07-18T16:14:00.842Z]   "CommandLine": "/Users/likasem/source/functions/azure-functions-core-tools/_dniapp/bin/output/_dniapp.dll --host 127.0.0.1 --port 50779 --workerId 3246a75a-129e-438f-9c2a-1602804bdd04 --requestId 08ba6582-1eea-4e25-b3ac-cc654b982ecd --grpcMaxMessageLength 2147483647 --functions-uri http://127.0.0.1:50779/ --functions-worker-id 3246a75a-129e-438f-9c2a-1602804bdd04 --functions-request-id 08ba6582-1eea-4e25-b3ac-cc654b982ecd --functions-grpc-max-message-length 2147483647"
[2025-07-18T16:14:00.842Z] }

^C  <I hit ctrl c here>

[2025-07-18T16:14:09.119Z] Failed to start language worker process for runtime: dotnet-isolated. workerId:3246a75a-129e-438f-9c2a-1602804bdd04
[2025-07-18T16:14:45.437Z] Removing errored webhost language worker channel for runtime: dotnet-isolated workerId:3246a75a-129e-438f-9c2a-1602804bdd04
[2025-07-18T16:14:45.437Z] System.AggregateException: One or more errors occurred. (Worker process exited before initializing.)
[2025-07-18T16:14:45.437Z]  ---> Microsoft.Azure.WebJobs.Script.Workers.WorkerProcessExitException: Worker process exited before initializing.
[2025-07-18T16:14:45.437Z]    at Microsoft.Azure.WebJobs.Script.Grpc.GrpcWorkerChannel.StartWorkerProcessAsync(CancellationToken cancellationToken) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script.Grpc/Channel/GrpcWorkerChannel.cs:line 397
[2025-07-18T16:14:45.437Z]    --- End of inner exception stack trace ---
[2025-07-18T16:14:45.438Z]    at Microsoft.Azure.WebJobs.Script.WorkerFunctionMetadataProvider.GetFunctionMetadataAsync(IEnumerable`1 workerConfigs, Boolean forceRefresh) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script/Host/WorkerFunctionMetadataProvider.cs:line 117
[2025-07-18T16:15:18.940Z] A host error has occurred during startup operation '393e57d4-33aa-48c8-8b45-6234887f2d7c'.
[2025-07-18T16:15:18.940Z] System.AggregateException: One or more errors occurred. (Worker process exited before initializing.)
[2025-07-18T16:15:18.940Z]  ---> Microsoft.Azure.WebJobs.Script.Workers.WorkerProcessExitException: Worker process exited before initializing.
[2025-07-18T16:15:18.940Z]    at Microsoft.Azure.WebJobs.Script.Grpc.GrpcWorkerChannel.StartWorkerProcessAsync(CancellationToken cancellationToken) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script.Grpc/Channel/GrpcWorkerChannel.cs:line 397
[2025-07-18T16:15:18.940Z]    --- End of inner exception stack trace ---
[2025-07-18T16:15:18.940Z]    at Microsoft.Azure.WebJobs.Script.WorkerFunctionMetadataProvider.GetFunctionMetadataAsync(IEnumerable`1 workerConfigs, Boolean forceRefresh) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script/Host/WorkerFunctionMetadataProvider.cs:line 117
[2025-07-18T16:15:18.940Z]    at System.ExceptionExtensions.ThrowIfErrorsPresent(IList`1 exceptions, String message) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script/Extensions/ExceptionExtensions.cs:line 47
[2025-07-18T16:15:18.940Z]    at Microsoft.Azure.WebJobs.Script.WorkerFunctionMetadataProvider.GetFunctionMetadataAsync(IEnumerable`1 workerConfigs, Boolean forceRefresh) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script/Host/WorkerFunctionMetadataProvider.cs:line 149
[2025-07-18T16:15:18.940Z]    at Microsoft.Azure.WebJobs.Script.WebHost.FunctionMetadataProvider.GetFunctionMetadataAsync(IEnumerable`1 workerConfigs, Boolean forceRefresh) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script/Host/FunctionMetadataProvider.cs:line 45
[2025-07-18T16:15:18.940Z]    at Microsoft.Azure.WebJobs.Script.FunctionMetadataManager.LoadFunctionMetadata(Boolean forceRefresh, Boolean includeCustomProviders) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script/Host/FunctionMetadataManager.cs:line 151
[2025-07-18T16:15:18.940Z]    at Microsoft.Azure.WebJobs.Script.FunctionMetadataManager.GetFunctionMetadata(Boolean forceRefresh, Boolean applyAllowList, Boolean includeCustomProviders) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script/Host/FunctionMetadataManager.cs:line 101
[2025-07-18T16:15:18.940Z]    at Microsoft.Azure.WebJobs.Script.DependencyInjection.ExternalConfigurationStartupValidator.Validate(IConfigurationRoot originalConfig) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script/DependencyInjection/ExternalConfigurationStartupValidator.cs:line 42
[2025-07-18T16:15:18.940Z]    at Microsoft.Azure.WebJobs.Script.DependencyInjection.ExternalConfigurationStartupValidatorService.StartAsync(CancellationToken cancellationToken) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script/DependencyInjection/ExternalConfigurationStartupValidatorService.cs:line 33
[2025-07-18T16:15:18.940Z]    at Microsoft.Extensions.Hosting.Internal.Host.<StartAsync>b__15_1(IHostedService service, CancellationToken token)
[2025-07-18T16:15:18.940Z]    at Microsoft.Extensions.Hosting.Internal.Host.ForeachService[T](IEnumerable`1 services, CancellationToken token, Boolean concurrent, Boolean abortOnFirstException, List`1 exceptions, Func`3 operation)
[2025-07-18T16:15:18.940Z]    at Microsoft.Extensions.Hosting.Internal.Host.StartAsync(CancellationToken cancellationToken)
[2025-07-18T16:15:18.940Z]    at Microsoft.Azure.WebJobs.Script.WebHost.WebJobsScriptHostService.UnsynchronizedStartHostAsync(ScriptHostStartupOperation activeOperation, Int32 attemptCount, JobHostStartupMode startupMode) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script.WebHost/WebJobsScriptHostService.cs:line 417
[2025-07-18T16:15:18.945Z] Stopping JobHost
[2025-07-18T16:15:18.965Z] Failed to stop host instance '668de38c-aac1-4cab-98de-2bd581e33a16'.
[2025-07-18T16:15:18.965Z] System.AggregateException: One or more hosted services failed to stop. (The host has not yet started.) (Object reference not set to an instance of an object.)
[2025-07-18T16:15:18.965Z]  ---> System.InvalidOperationException: The host has not yet started.
[2025-07-18T16:15:18.965Z]    at Microsoft.Azure.WebJobs.JobHost.StopAsync() in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\JobHost.cs:line 119
[2025-07-18T16:15:18.965Z]    at Microsoft.Azure.WebJobs.Hosting.JobHostService.StopAsync(CancellationToken cancellationToken) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Hosting\JobHostService.cs:line 32
[2025-07-18T16:15:18.965Z]    at Microsoft.Extensions.Hosting.Internal.Host.ForeachService[T](IEnumerable`1 services, CancellationToken token, Boolean concurrent, Boolean abortOnFirstException, List`1 exceptions, Func`3 operation)
[2025-07-18T16:15:18.965Z]    --- End of inner exception stack trace ---
[2025-07-18T16:15:18.965Z]    at Microsoft.Extensions.Hosting.Internal.Host.StopAsync(CancellationToken cancellationToken)
[2025-07-18T16:15:18.965Z]    at Microsoft.Azure.WebJobs.Script.WebHost.WebJobsScriptHostService.Orphan(IHost instance, CancellationToken cancellationToken) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script.WebHost/WebJobsScriptHostService.cs:line 897
[2025-07-18T16:15:18.965Z]  ---> (Inner Exception #1) System.NullReferenceException: Object reference not set to an instance of an object.
[2025-07-18T16:15:18.965Z]    at Microsoft.Azure.WebJobs.Hosting.OptionsLoggingService.StopAsync(CancellationToken cancellationToken) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Hosting\OptionsFormatter\OptionsLoggingService.cs:line 38
[2025-07-18T16:15:18.965Z]    at Microsoft.Extensions.Hosting.Internal.Host.ForeachService[T](IEnumerable`1 services, CancellationToken token, Boolean concurrent, Boolean abortOnFirstException, List`1 exceptions, Func`3 operation)<---
[2025-07-18T16:15:18.965Z] 
[2025-07-18T16:15:19.018Z] Host startup operation has been canceled
[2025-07-18T16:15:19.018Z] Initialization cancellation requested by runtime.
[2025-07-18T16:15:19.030Z] Stopping host...
[2025-07-18T16:15:19.031Z] Host shutdown completed.
[2025-07-18T16:15:19.032Z] Shutting down language worker channels for runtime:dotnet-isolated
[2025-07-18T16:15:19.063Z] Stopping the flush logs timer.

Before this change, it would hang for 60 seconds waiting on the timeout from PendingItem.

For my version of the fix, I was considering passing a CT to PendingItem and registering an OnCancelled event so that we do not have to wait the entire timeout period if there is a cancellation event. But I couldn't find the right CT propagate, and CT for lifetime.ApplicationStopping was not being canceled until after the timeout occured.

edit: more logs - this is outside of my debug session - it takes 5 seconds to shutdown after we cancel:

[2025-07-18T16:23:10.832Z] Reading functions metadata (Worker)
^CPress Ctrl+C again to force exit.
[2025-07-18T16:23:11.360Z] Failed to start language worker process for runtime: dotnet-isolated. workerId:15471ba1-bb7d-4a7f-967c-46edb899566c
[2025-07-18T16:23:11.360Z] Language Worker Process exited. Pid=31405.
[2025-07-18T16:23:11.360Z] Microsoft.Azure.WebJobs.Script: dotnet exited with code 134 (0x86). Unhandled exception. System.Threading.Tasks.TaskCanceledException: A task was canceled.
[2025-07-18T16:23:11.360Z] Removing errored webhost language worker channel for runtime: dotnet-isolated workerId:15471ba1-bb7d-4a7f-967c-46edb899566c
[2025-07-18T16:23:11.360Z] Microsoft.Azure.WebJobs.Script: dotnet exited with code 134 (0x86). Unhandled exception. System.Threading.Tasks.TaskCanceledException: A task was canceled.
[2025-07-18T16:23:11.652Z] A host error has occurred during startup operation '790aff1d-8f28-4cb7-8c1d-3b1000394db4'.
[2025-07-18T16:23:11.652Z] Microsoft.Azure.WebJobs.Script: dotnet exited with code 134 (0x86). Unhandled exception. System.Threading.Tasks.TaskCanceledException: A task was canceled.
[2025-07-18T16:23:11.657Z] Stopping JobHost
[2025-07-18T16:23:11.658Z] Failed to stop host instance '45b772e9-bc50-43b5-bd5b-00d6d2faff90'.
[2025-07-18T16:23:11.658Z] Microsoft.Azure.WebJobs.Host: The host has not yet started.
[2025-07-18T16:23:11.671Z] Host startup operation has been canceled
[2025-07-18T16:23:11.671Z] Initialization cancellation requested by runtime.
[2025-07-18T16:23:11.672Z] Stopping host...
[2025-07-18T16:23:11.672Z] Host shutdown completed.
[2025-07-18T16:23:11.673Z] Shutting down language worker channels for runtime:dotnet-isolated
[2025-07-18T16:23:16.576Z] Stopping the flush logs timer.
[2025-07-18T16:23:16.576Z] We couldn’t reach the Table service endpoint specified in the 'AzureWebJobsStorage' setting. We are unable to record diagnostic events, so the diagnostic logging service is being stopped. Please confirm network connectivity and endpoint accessibility.
[2025-07-18T16:23:16.576Z] Azure.Core: Connection refused (127.0.0.1:10002). System.Net.Http: Connection refused (127.0.0.1:10002). System.Net.Sockets: Connection refused.

For core tools at least, I would love to reduce this time but I can probably just inject options in the webhost to reduce the flush logs timer to a 1 second or something like that. Anywho not related to this PR

[2025-07-18T16:23:11.673Z] Shutting down language worker channels for runtime:dotnet-isolated
<this 5 second gap>
[2025-07-18T16:23:16.576Z] Stopping the flush logs timer.
[2025-07-18T16:23:16.576Z] We couldn’t reach the Table service endpoint specified in the 'AzureWebJobsStorage' setting. We are unable to record diagnostic events, so the diagnostic logging service is being stopped. Please confirm network connectivity and endpoint accessibility.
[2025-07-18T16:23:16.576Z] Azure.Core: Connection refused (127.0.0.1:10002). System.Net.Http: Connection refused (127.0.0.1:10002). System.Net.Sockets: Connection refused.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants