Fix GrpcWorkerChannel.StartWorkerProcessAsync timeout #10937

jviau · 2025-03-21T21:25:47Z

Issue describing the changes in this PR

resolves #issue_for_this_pr

Pull request checklist

IMPORTANT: Currently, changes must be backported to the in-proc branch to be included in Core Tools and non-Flex deployments.

Backporting to the in-proc branch is not required
- Otherwise: Link to backporting PR
My changes do not require documentation changes
- Otherwise: Documentation issue linked to PR
My changes should not be added to the release notes for the next release
- Otherwise: I've added my notes to release_notes.md -- TODO
My changes do not need to be backported to a previous version
- Otherwise: Backport tracked by issue/PR #issue_or_pr
My changes do not require diagnostic events changes
- Otherwise: I have added/updated all related diagnostic events and their documentation (Documentation issue linked to PR)
I have added all required tests (Unit tests, E2E tests) -- TODO

Additional information

This PR improves the ScriptHost startup experience with a bad worker. Today, if a worker crashes or exits immediately after startup, then the GrpcWorkerChannel.StartWorkerProcessAsync will block on _workerInitTask.Task until it eventually times out. This tends to fault the entire host (at least during debugging).

To address this, a WorkerProcess.WaitForExitAsync is added and GrpcWorkerChannel.StartWorkerProcessAsync will also wait on that, improving the responsiveness to a worker exiting before connecting gRPC events.

jviau · 2025-03-21T21:26:45Z

src/WebJobs.Script.Grpc/Channel/GrpcWorkerChannel.cs

-            await _rpcWorkerProcess.StartProcessAsync();
-            _state = _state | RpcWorkerChannelState.Initializing;
-            await _workerInitTask.Task;
+            await _rpcWorkerProcess.StartProcessAsync(cancellationToken);


This is the important change - we will now wait on either worker fully initialized (gRPC connection established) or worker exits (in which case, we will re-throw any failures the worker experience).

src/WebJobs.Script/Host/WorkerFunctionMetadataProvider.cs

src/WebJobs.Script.Grpc/Channel/GrpcWorkerChannel.cs

brettsam

looks good -- would just like a test added

jviau · 2025-05-14T22:23:27Z

/azp run host.integration-tests

azure-pipelines · 2025-05-14T22:23:39Z

Azure Pipelines successfully started running 1 pipeline(s).

jviau · 2025-05-14T22:23:54Z

/azp run host.public

azure-pipelines · 2025-05-14T22:24:05Z

Azure Pipelines successfully started running 1 pipeline(s).

…functions-host into jviau/abort-channel

liliankasem · 2025-07-18T16:01:08Z

@jviau @mattchenderson I was just investigating this yesterday! Turns out this is the same reason that we are stuck for 60s when trying to exit the CLI during startup (Azure/azure-functions-core-tools#4355). Glad to see there is a already a fix before I opened one :) I'm going to test your changes out with core tools and see if it addresses the issue like I expect.

Looks like it works?

2025-07-18T16:13:56.776Z] Reading functions metadata (Worker)
[2025-07-18T16:13:57.920Z] {
[2025-07-18T16:13:57.920Z]   "ProcessId": 26145,
[2025-07-18T16:14:00.842Z]   "RuntimeIdentifier": "osx-arm64",
[2025-07-18T16:14:00.842Z]   "WorkerVersion": "2.0.0.0",
[2025-07-18T16:14:00.842Z]   "ProductVersion": "2.0.0\u002Bd8b5fe998a8c92819b8ee41d2569d2525413e9c5",
[2025-07-18T16:14:00.842Z]   "FrameworkDescription": ".NET 9.0.7",
[2025-07-18T16:14:00.842Z]   "OSDescription": "Darwin 24.5.0 Darwin Kernel Version 24.5.0: Tue Apr 22 19:54:29 PDT 2025; root:xnu-11417.121.6~2/RELEASE_ARM64_T6030",
[2025-07-18T16:14:00.842Z]   "OSArchitecture": "Arm64",
[2025-07-18T16:14:00.842Z]   "CommandLine": "/Users/likasem/source/functions/azure-functions-core-tools/_dniapp/bin/output/_dniapp.dll --host 127.0.0.1 --port 50779 --workerId 3246a75a-129e-438f-9c2a-1602804bdd04 --requestId 08ba6582-1eea-4e25-b3ac-cc654b982ecd --grpcMaxMessageLength 2147483647 --functions-uri http://127.0.0.1:50779/ --functions-worker-id 3246a75a-129e-438f-9c2a-1602804bdd04 --functions-request-id 08ba6582-1eea-4e25-b3ac-cc654b982ecd --functions-grpc-max-message-length 2147483647"
[2025-07-18T16:14:00.842Z] }

^C  <I hit ctrl c here>

[2025-07-18T16:14:09.119Z] Failed to start language worker process for runtime: dotnet-isolated. workerId:3246a75a-129e-438f-9c2a-1602804bdd04
[2025-07-18T16:14:45.437Z] Removing errored webhost language worker channel for runtime: dotnet-isolated workerId:3246a75a-129e-438f-9c2a-1602804bdd04
[2025-07-18T16:14:45.437Z] System.AggregateException: One or more errors occurred. (Worker process exited before initializing.)
[2025-07-18T16:14:45.437Z]  ---> Microsoft.Azure.WebJobs.Script.Workers.WorkerProcessExitException: Worker process exited before initializing.
[2025-07-18T16:14:45.437Z]    at Microsoft.Azure.WebJobs.Script.Grpc.GrpcWorkerChannel.StartWorkerProcessAsync(CancellationToken cancellationToken) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script.Grpc/Channel/GrpcWorkerChannel.cs:line 397
[2025-07-18T16:14:45.437Z]    --- End of inner exception stack trace ---
[2025-07-18T16:14:45.438Z]    at Microsoft.Azure.WebJobs.Script.WorkerFunctionMetadataProvider.GetFunctionMetadataAsync(IEnumerable`1 workerConfigs, Boolean forceRefresh) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script/Host/WorkerFunctionMetadataProvider.cs:line 117
[2025-07-18T16:15:18.940Z] A host error has occurred during startup operation '393e57d4-33aa-48c8-8b45-6234887f2d7c'.
[2025-07-18T16:15:18.940Z] System.AggregateException: One or more errors occurred. (Worker process exited before initializing.)
[2025-07-18T16:15:18.940Z]  ---> Microsoft.Azure.WebJobs.Script.Workers.WorkerProcessExitException: Worker process exited before initializing.
[2025-07-18T16:15:18.940Z]    at Microsoft.Azure.WebJobs.Script.Grpc.GrpcWorkerChannel.StartWorkerProcessAsync(CancellationToken cancellationToken) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script.Grpc/Channel/GrpcWorkerChannel.cs:line 397
[2025-07-18T16:15:18.940Z]    --- End of inner exception stack trace ---
[2025-07-18T16:15:18.940Z]    at Microsoft.Azure.WebJobs.Script.WorkerFunctionMetadataProvider.GetFunctionMetadataAsync(IEnumerable`1 workerConfigs, Boolean forceRefresh) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script/Host/WorkerFunctionMetadataProvider.cs:line 117
[2025-07-18T16:15:18.940Z]    at System.ExceptionExtensions.ThrowIfErrorsPresent(IList`1 exceptions, String message) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script/Extensions/ExceptionExtensions.cs:line 47
[2025-07-18T16:15:18.940Z]    at Microsoft.Azure.WebJobs.Script.WorkerFunctionMetadataProvider.GetFunctionMetadataAsync(IEnumerable`1 workerConfigs, Boolean forceRefresh) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script/Host/WorkerFunctionMetadataProvider.cs:line 149
[2025-07-18T16:15:18.940Z]    at Microsoft.Azure.WebJobs.Script.WebHost.FunctionMetadataProvider.GetFunctionMetadataAsync(IEnumerable`1 workerConfigs, Boolean forceRefresh) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script/Host/FunctionMetadataProvider.cs:line 45
[2025-07-18T16:15:18.940Z]    at Microsoft.Azure.WebJobs.Script.FunctionMetadataManager.LoadFunctionMetadata(Boolean forceRefresh, Boolean includeCustomProviders) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script/Host/FunctionMetadataManager.cs:line 151
[2025-07-18T16:15:18.940Z]    at Microsoft.Azure.WebJobs.Script.FunctionMetadataManager.GetFunctionMetadata(Boolean forceRefresh, Boolean applyAllowList, Boolean includeCustomProviders) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script/Host/FunctionMetadataManager.cs:line 101
[2025-07-18T16:15:18.940Z]    at Microsoft.Azure.WebJobs.Script.DependencyInjection.ExternalConfigurationStartupValidator.Validate(IConfigurationRoot originalConfig) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script/DependencyInjection/ExternalConfigurationStartupValidator.cs:line 42
[2025-07-18T16:15:18.940Z]    at Microsoft.Azure.WebJobs.Script.DependencyInjection.ExternalConfigurationStartupValidatorService.StartAsync(CancellationToken cancellationToken) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script/DependencyInjection/ExternalConfigurationStartupValidatorService.cs:line 33
[2025-07-18T16:15:18.940Z]    at Microsoft.Extensions.Hosting.Internal.Host.<StartAsync>b__15_1(IHostedService service, CancellationToken token)
[2025-07-18T16:15:18.940Z]    at Microsoft.Extensions.Hosting.Internal.Host.ForeachService[T](IEnumerable`1 services, CancellationToken token, Boolean concurrent, Boolean abortOnFirstException, List`1 exceptions, Func`3 operation)
[2025-07-18T16:15:18.940Z]    at Microsoft.Extensions.Hosting.Internal.Host.StartAsync(CancellationToken cancellationToken)
[2025-07-18T16:15:18.940Z]    at Microsoft.Azure.WebJobs.Script.WebHost.WebJobsScriptHostService.UnsynchronizedStartHostAsync(ScriptHostStartupOperation activeOperation, Int32 attemptCount, JobHostStartupMode startupMode) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script.WebHost/WebJobsScriptHostService.cs:line 417
[2025-07-18T16:15:18.945Z] Stopping JobHost
[2025-07-18T16:15:18.965Z] Failed to stop host instance '668de38c-aac1-4cab-98de-2bd581e33a16'.
[2025-07-18T16:15:18.965Z] System.AggregateException: One or more hosted services failed to stop. (The host has not yet started.) (Object reference not set to an instance of an object.)
[2025-07-18T16:15:18.965Z]  ---> System.InvalidOperationException: The host has not yet started.
[2025-07-18T16:15:18.965Z]    at Microsoft.Azure.WebJobs.JobHost.StopAsync() in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\JobHost.cs:line 119
[2025-07-18T16:15:18.965Z]    at Microsoft.Azure.WebJobs.Hosting.JobHostService.StopAsync(CancellationToken cancellationToken) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Hosting\JobHostService.cs:line 32
[2025-07-18T16:15:18.965Z]    at Microsoft.Extensions.Hosting.Internal.Host.ForeachService[T](IEnumerable`1 services, CancellationToken token, Boolean concurrent, Boolean abortOnFirstException, List`1 exceptions, Func`3 operation)
[2025-07-18T16:15:18.965Z]    --- End of inner exception stack trace ---
[2025-07-18T16:15:18.965Z]    at Microsoft.Extensions.Hosting.Internal.Host.StopAsync(CancellationToken cancellationToken)
[2025-07-18T16:15:18.965Z]    at Microsoft.Azure.WebJobs.Script.WebHost.WebJobsScriptHostService.Orphan(IHost instance, CancellationToken cancellationToken) in /Users/likasem/source/functions/azure-functions-host/src/WebJobs.Script.WebHost/WebJobsScriptHostService.cs:line 897
[2025-07-18T16:15:18.965Z]  ---> (Inner Exception #1) System.NullReferenceException: Object reference not set to an instance of an object.
[2025-07-18T16:15:18.965Z]    at Microsoft.Azure.WebJobs.Hosting.OptionsLoggingService.StopAsync(CancellationToken cancellationToken) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Hosting\OptionsFormatter\OptionsLoggingService.cs:line 38
[2025-07-18T16:15:18.965Z]    at Microsoft.Extensions.Hosting.Internal.Host.ForeachService[T](IEnumerable`1 services, CancellationToken token, Boolean concurrent, Boolean abortOnFirstException, List`1 exceptions, Func`3 operation)<---
[2025-07-18T16:15:18.965Z] 
[2025-07-18T16:15:19.018Z] Host startup operation has been canceled
[2025-07-18T16:15:19.018Z] Initialization cancellation requested by runtime.
[2025-07-18T16:15:19.030Z] Stopping host...
[2025-07-18T16:15:19.031Z] Host shutdown completed.
[2025-07-18T16:15:19.032Z] Shutting down language worker channels for runtime:dotnet-isolated
[2025-07-18T16:15:19.063Z] Stopping the flush logs timer.

Before this change, it would hang for 60 seconds waiting on the timeout from PendingItem.

azure-functions-host/src/WebJobs.Script.Grpc/Channel/GrpcWorkerChannel.cs

Line 1748 in a20b307

private sealed class PendingItem

For my version of the fix, I was considering passing a CT to PendingItem and registering an OnCancelled event so that we do not have to wait the entire timeout period if there is a cancellation event. But I couldn't find the right CT propagate, and CT for lifetime.ApplicationStopping was not being canceled until after the timeout occured.

edit: more logs - this is outside of my debug session - it takes 5 seconds to shutdown after we cancel:

[2025-07-18T16:23:10.832Z] Reading functions metadata (Worker)
^CPress Ctrl+C again to force exit.
[2025-07-18T16:23:11.360Z] Failed to start language worker process for runtime: dotnet-isolated. workerId:15471ba1-bb7d-4a7f-967c-46edb899566c
[2025-07-18T16:23:11.360Z] Language Worker Process exited. Pid=31405.
[2025-07-18T16:23:11.360Z] Microsoft.Azure.WebJobs.Script: dotnet exited with code 134 (0x86). Unhandled exception. System.Threading.Tasks.TaskCanceledException: A task was canceled.
[2025-07-18T16:23:11.360Z] Removing errored webhost language worker channel for runtime: dotnet-isolated workerId:15471ba1-bb7d-4a7f-967c-46edb899566c
[2025-07-18T16:23:11.360Z] Microsoft.Azure.WebJobs.Script: dotnet exited with code 134 (0x86). Unhandled exception. System.Threading.Tasks.TaskCanceledException: A task was canceled.
[2025-07-18T16:23:11.652Z] A host error has occurred during startup operation '790aff1d-8f28-4cb7-8c1d-3b1000394db4'.
[2025-07-18T16:23:11.652Z] Microsoft.Azure.WebJobs.Script: dotnet exited with code 134 (0x86). Unhandled exception. System.Threading.Tasks.TaskCanceledException: A task was canceled.
[2025-07-18T16:23:11.657Z] Stopping JobHost
[2025-07-18T16:23:11.658Z] Failed to stop host instance '45b772e9-bc50-43b5-bd5b-00d6d2faff90'.
[2025-07-18T16:23:11.658Z] Microsoft.Azure.WebJobs.Host: The host has not yet started.
[2025-07-18T16:23:11.671Z] Host startup operation has been canceled
[2025-07-18T16:23:11.671Z] Initialization cancellation requested by runtime.
[2025-07-18T16:23:11.672Z] Stopping host...
[2025-07-18T16:23:11.672Z] Host shutdown completed.
[2025-07-18T16:23:11.673Z] Shutting down language worker channels for runtime:dotnet-isolated
[2025-07-18T16:23:16.576Z] Stopping the flush logs timer.
[2025-07-18T16:23:16.576Z] We couldn’t reach the Table service endpoint specified in the 'AzureWebJobsStorage' setting. We are unable to record diagnostic events, so the diagnostic logging service is being stopped. Please confirm network connectivity and endpoint accessibility.
[2025-07-18T16:23:16.576Z] Azure.Core: Connection refused (127.0.0.1:10002). System.Net.Http: Connection refused (127.0.0.1:10002). System.Net.Sockets: Connection refused.

For core tools at least, I would love to reduce this time but I can probably just inject options in the webhost to reduce the flush logs timer to a 1 second or something like that. Anywho not related to this PR

[2025-07-18T16:23:11.673Z] Shutting down language worker channels for runtime:dotnet-isolated
<this 5 second gap>
[2025-07-18T16:23:16.576Z] Stopping the flush logs timer.
[2025-07-18T16:23:16.576Z] We couldn’t reach the Table service endpoint specified in the 'AzureWebJobsStorage' setting. We are unable to record diagnostic events, so the diagnostic logging service is being stopped. Please confirm network connectivity and endpoint accessibility.
[2025-07-18T16:23:16.576Z] Azure.Core: Connection refused (127.0.0.1:10002). System.Net.Http: Connection refused (127.0.0.1:10002). System.Net.Sockets: Connection refused.

…functions-host into jviau/abort-channel

Fix GrpcWorkerChannel.StartWorkerProcessAsync timeout

52e222c

jviau requested a review from a team as a code owner March 21, 2025 21:25

jviau commented Mar 21, 2025

View reviewed changes

brettsam reviewed Mar 24, 2025

View reviewed changes

src/WebJobs.Script.Grpc/Channel/GrpcWorkerChannel.cs Outdated Show resolved Hide resolved

brettsam reviewed Mar 24, 2025

View reviewed changes

jviau added 8 commits April 14, 2025 09:12

Merge branch 'dev' into jviau/abort-channel

e4cb2f0

Add unit tests for worker exit

bf0f495

Fix expectations

16c2923

Skip artifact on reruns

35237df

Fix tests setting env vars

51dadc3

Refactor ScriptStartupTypeDiscovererTests for env isolation

ceb9763

Fix more unit tests

3cd320c

Merge remote-tracking branch 'upstream/dev' into jviau/abort-channel

da70f86

jviau added 5 commits May 14, 2025 15:36

Merge branch 'dev' into jviau/abort-channel

56fb132

Update new test method

b03c7bf

Merge branch 'dev' into jviau/abort-channel

eb92e9f

Update release notes

4f364c7

Merge branch 'jviau/abort-channel' of https://github.com/Azure/azure-…

68106cf

…functions-host into jviau/abort-channel

jviau requested a review from brettsam May 19, 2025 20:12

jviau added 2 commits July 15, 2025 08:30

Merge branch 'dev' into jviau/abort-channel

723264b

Merge branch 'dev' into jviau/abort-channel

56c5634

jviau added 2 commits July 18, 2025 10:06

Swallow Publish event exceptions

fa98493

Merge branch 'jviau/abort-channel' of https://github.com/Azure/azure-…

37b5f69

…functions-host into jviau/abort-channel

liliankasem mentioned this pull request Jul 18, 2025

Core tools not responding to ctrl-C Azure/azure-functions-core-tools#4355

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix GrpcWorkerChannel.StartWorkerProcessAsync timeout #10937

Fix GrpcWorkerChannel.StartWorkerProcessAsync timeout #10937

jviau commented Mar 21, 2025 •

edited

Loading

Uh oh!

jviau Mar 21, 2025

Uh oh!

Uh oh!

Uh oh!

brettsam left a comment

Uh oh!

jviau commented May 14, 2025

Uh oh!

azure-pipelines bot commented May 14, 2025

Uh oh!

jviau commented May 14, 2025

Uh oh!

azure-pipelines bot commented May 14, 2025

Uh oh!

liliankasem commented Jul 18, 2025 •

edited

Loading

Uh oh!

Uh oh!

Fix GrpcWorkerChannel.StartWorkerProcessAsync timeout #10937

Are you sure you want to change the base?

Fix GrpcWorkerChannel.StartWorkerProcessAsync timeout #10937

Conversation

jviau commented Mar 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Issue describing the changes in this PR

Pull request checklist

Additional information

Uh oh!

jviau Mar 21, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

brettsam left a comment

Choose a reason for hiding this comment

Uh oh!

jviau commented May 14, 2025

Uh oh!

azure-pipelines bot commented May 14, 2025

Uh oh!

jviau commented May 14, 2025

Uh oh!

azure-pipelines bot commented May 14, 2025

Uh oh!

liliankasem commented Jul 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

jviau commented Mar 21, 2025 •

edited

Loading

liliankasem commented Jul 18, 2025 •

edited

Loading