Copilot
Contributor

@Copilot Copilot AI commented Aug 19, 2025

Problem

When a red team scan hits a non-retryable error (such as an authentication failure, bad request, or configuration issue), the current implementation keeps retrying the operation and only fails once the retries are exhausted. Users wait longer for an error that was certain from the first attempt, and the extra attempts consume resources for no benefit.

Solution

This PR implements fail-fast behavior for non-retryable errors by:

  1. Adding NonRetryableError exception class - A specialized exception that signals immediate failure for errors that won't be resolved by retrying
  2. Enhanced error detection - Logic to identify non-retryable errors including HTTP 4xx status codes (except 429), authentication failures, and configuration errors
  3. Modified retry logic - Updated RetryManager.should_retry_exception() to immediately raise NonRetryableError for non-retryable cases
  4. Orchestrator integration - Enhanced the network retry decorator to properly handle and log non-retryable errors
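
The four steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the code in this PR: `NonRetryableError` and `should_retry_exception` are names taken from the description, but their signatures here are assumptions, and `HTTPStatusFailure` and `run_with_retry` are hypothetical stand-ins for the real HTTP exception type and the retry decorator.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class NonRetryableError(Exception):
    """Signals that retrying cannot fix the failure (name taken from the PR)."""


class HTTPStatusFailure(Exception):
    """Hypothetical stand-in for an HTTP client error carrying a status code."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def should_retry_exception(exc: Exception) -> bool:
    """Return True for transient errors; raise NonRetryableError otherwise.

    Assumed rule, following the description: 4xx except 429 is permanent.
    """
    if isinstance(exc, HTTPStatusFailure):
        code = exc.status_code
        if 400 <= code < 500 and code != 429:
            raise NonRetryableError(f"HTTP {code} will not succeed on retry") from exc
    return True


def run_with_retry(operation: Callable[[], T], max_attempts: int = 3) -> T:
    """Call operation, retrying transient failures up to max_attempts times."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 0
    while True:
        attempt += 1
        try:
            return operation()
        except Exception as exc:
            # Permanent failures escape here as NonRetryableError on the
            # first attempt; transient ones are retried until the limit.
            if not should_retry_exception(exc) or attempt >= max_attempts:
                raise
```

With this shape, an operation that always returns 401 is called once and surfaces a `NonRetryableError`, while one that always returns 500 is called `max_attempts` times before the original exception is re-raised.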

Error Categories

Non-retryable (fail fast):

  • HTTP 4xx errors: 400 Bad Request, 401 Unauthorized, 403 Forbidden, 404 Not Found
  • Authentication/permission errors: "authentication failed", "unauthorized", "access denied"
  • Configuration errors: "invalid configuration", "malformed request"
  • File system errors: FileNotFoundError, PermissionError

Retryable (existing behavior unchanged):

  • HTTP 5xx errors: 500 Internal Server Error
  • Network issues: Connection errors, timeouts
  • Rate limiting: 429 Too Many Requests
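
As an illustration of how these categories could be encoded, here is a small classifier sketch. `is_non_retryable_error` is the method name mentioned in the review summary, but the signature, the optional `status_code` parameter, and the constant names are assumptions made for this example; the real detection logic lives in `exception_utils.py` and may differ.

```python
from typing import Optional

# Categories copied from the lists above; constant names are hypothetical.
NON_RETRYABLE_STATUS_CODES = frozenset({400, 401, 403, 404})
NON_RETRYABLE_MESSAGES = (
    "authentication failed",
    "unauthorized",
    "access denied",
    "invalid configuration",
    "malformed request",
)
NON_RETRYABLE_TYPES = (FileNotFoundError, PermissionError)


def is_non_retryable_error(exc: Exception, status_code: Optional[int] = None) -> bool:
    """Return True if the error falls into a fail-fast category."""
    # File system errors are permanent regardless of message or status.
    if isinstance(exc, NON_RETRYABLE_TYPES):
        return True
    # When a status code is known, let it decide (assumption of this sketch),
    # so 429 and 5xx stay retryable even if the message looks permanent.
    if status_code is not None:
        return status_code in NON_RETRYABLE_STATUS_CODES
    # Otherwise fall back to case-insensitive message matching.
    message = str(exc).lower()
    return any(pattern in message for pattern in NON_RETRYABLE_MESSAGES)
```

Substring matching on messages is brittle, so a real implementation would likely prefer typed exceptions or status codes wherever they are available.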

Example Behavior Change

Before:

Red team scan encounters 401 Unauthorized
→ Retry attempt 1: fails with 401
→ Retry attempt 2: fails with 401  
→ Retry attempt 3: fails with 401
→ Eventually gives up after multiple failed attempts

After:

Red team scan encounters 401 Unauthorized
→ Detected as non-retryable error
→ Immediately raises NonRetryableError with clear message
→ Scan fails fast, saving time and resources

Benefits

  • Improved user experience - Immediate feedback on configuration/authentication issues
  • Resource efficiency - No wasted retry attempts on operations that will never succeed
  • Better error reporting - Clear distinction between temporary failures and permanent misconfigurations
  • Backward compatibility - Existing retry behavior for legitimate transient errors remains unchanged

Files Changed

  • azure/ai/evaluation/red_team/_utils/exception_utils.py - Added NonRetryableError class and detection logic
  • azure/ai/evaluation/red_team/_utils/retry_utils.py - Enhanced retry manager with fail-fast behavior
  • azure/ai/evaluation/red_team/_orchestrator_manager.py - Updated retry decorator to handle non-retryable errors
  • azure/ai/evaluation/red_team/_utils/__init__.py - Added exports for new exception type

Warning

Firewall rules blocked me from connecting to one or more addresses

I tried to connect to the following addresses, but was blocked by firewall rules:

  • pypi.org
    • Triggering command: /home/REDACTED/work/azure-sdk-for-python/azure-sdk-for-python/.venv/bin/python /home/REDACTED/work/azure-sdk-for-python/azure-sdk-for-python/.venv/lib/python3.9/site-packages/pip/__pip-REDACTED__.py install --ignore-installed --no-user --prefix /tmp/pip-build-env-79zs7crv/overlay --no-warn-script-location --no-binary :none: --only-binary :none: -i REDACTED -- setuptools>=40.8.0 (dns block)
    • Triggering command: /home/REDACTED/work/azure-sdk-for-python/azure-sdk-for-python/.venv/bin/python /home/REDACTED/work/azure-sdk-for-python/azure-sdk-for-python/.venv/bin/pip install httpx tenacity (dns block)

If you need me to access, download, or install something from one of these locations, you can either:



@Copilot Copilot AI changed the title from "[WIP] When the user's target for a red team scan returns a non-retryable error we should throw an error." to "Implement fail-fast behavior for non-retryable errors in red team scans" Aug 19, 2025
@Copilot Copilot AI requested a review from slister1001 August 19, 2025 16:48
Copilot finished work on behalf of slister1001 August 19, 2025 16:48
@slister1001
Member

do not fail fast on 400
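
One possible way to accommodate this request, sketched under assumptions: keep the "4xx is permanent" rule but carve out an explicit set of client errors that stay on the normal retry path. `RETRYABLE_CLIENT_ERRORS` and `is_non_retryable_status_code` are hypothetical names for this example, not code from the PR.

```python
# Client errors that should still go through the normal retry path.
# 429 comes from the PR description; 400 reflects the review request above.
RETRYABLE_CLIENT_ERRORS = frozenset({400, 429})


def is_non_retryable_status_code(status_code: int) -> bool:
    """Return True for 4xx codes that should fail fast."""
    return 400 <= status_code < 500 and status_code not in RETRYABLE_CLIENT_ERRORS
```

Keeping the exceptions in a single named set means a later change of policy touches one line rather than every status check.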

@slister1001 slister1001 marked this pull request as ready for review August 20, 2025 14:01
@Copilot Copilot AI review requested due to automatic review settings August 20, 2025 14:01
@slister1001 slister1001 requested a review from a team as a code owner August 20, 2025 14:01
Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR implements fail-fast behavior for non-retryable errors in red team scans to improve efficiency and user experience. When the system encounters errors that won't be resolved by retrying (like authentication failures or bad requests), it now fails immediately instead of wasting time on futile retry attempts.

Key changes:

  • Added NonRetryableError exception class for immediate failure scenarios
  • Enhanced error detection logic to identify non-retryable HTTP 4xx errors and configuration issues
  • Modified retry logic to raise NonRetryableError for non-retryable cases instead of continuing retry attempts

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File                       Description
exception_utils.py         Added NonRetryableError class and is_non_retryable_error() method with detection logic
retry_utils.py             Enhanced RetryManager with fail-fast behavior and non-retryable error checking
__init__.py                Exported new NonRetryableError exception class
_orchestrator_manager.py   Updated retry decorator to handle and log non-retryable errors appropriately

return True

return False

Copilot AI Aug 20, 2025

This condition is duplicated from line 111 in the same method. Consider extracting this logic to avoid code duplication.

Suggested change
            if self._is_non_retryable_status_code(status_code):
                return True
        # Specific HTTP status errors that are non-retryable
        if isinstance(exception, httpx.HTTPStatusError):
            status_code = exception.response.status_code
            if self._is_non_retryable_status_code(status_code):
                return True
        return False

    def _is_non_retryable_status_code(self, status_code: int) -> bool:
        """Return True if status code is a non-retryable client error (4xx except 429)."""
        return 400 <= status_code < 500 and status_code != 429

