
Conversation

ebembi-crdb
Contributor

Summary

Implements robust retry logic with exponential backoff for Netlify documentation builds to handle transient network
failures and remote include errors.

Changes

  • Retry Logic: Added exponential backoff retry mechanism (3 attempts by default)
  • Backoff Strategy: 30s base delay with exponential scaling (30s → 60s → 120s); see the sketch after this list
  • Build Script: Enhanced build script with comprehensive error handling and monitoring
  • Configuration: Configurable retry parameters via environment variables
  • Logging: Detailed attempt tracking and failure analysis
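
For reference, the retry shape described above boils down to a small backoff loop. This is only a minimal sketch of the idea, not the actual contents of build.sh; run_build is a hypothetical placeholder for the real Jekyll build step:

# Sketch only: run_build stands in for the real build step; the actual build.sh may differ.
MAX_RETRIES="${MAX_RETRIES:-3}"
BASE_RETRY_DELAY="${BASE_RETRY_DELAY:-30}"

attempt=0
until run_build; do
    attempt=$((attempt + 1))
    if [[ $attempt -gt $MAX_RETRIES ]]; then
        echo "Build failed after $MAX_RETRIES retries"
        exit 1
    fi
    delay=$((BASE_RETRY_DELAY * 2 ** (attempt - 1)))   # 30s, 60s, 120s with the defaults
    echo "Attempt $attempt failed; retrying in ${delay}s..."
    sleep "$delay"
done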

Benefits

  • Reliability: Handles transient network failures automatically
  • Reduced Manual Intervention: Failed builds due to temporary issues retry automatically
  • Smart Backoff: Exponential delays reduce load on failing services
  • Monitoring: Clear logging of retry attempts and failure patterns
  • Configurable: Easily adjustable retry count and timing via netlify.toml

Configuration

  • MAX_RETRIES: Number of retry attempts (default: 3)
  • BASE_RETRY_DELAY: Base delay in seconds for exponential backoff (default: 30); an example netlify.toml setting is sketched below
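
If these values are driven from netlify.toml, they would typically be set as build environment variables. The keys below are an illustrative assumption, not a copy of this PR's netlify.toml:

[build.environment]
  MAX_RETRIES = "3"
  BASE_RETRY_DELAY = "30"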

Testing

Validated with intentional failure scenarios including:

  • DNS resolution failures
  • HTTP 404 errors
  • Network timeouts
  • Remote include failures

The retry mechanism successfully recovers from transient failures while failing fast on persistent issues.

ebembi-crdb added 3 commits September 18, 2025 21:09
- Add test-remote-failure.md with multiple failure scenarios
- Configure netlify.toml for retry testing environment
- Keep cache plugin for performance during testing
- Ready for manual PR creation to trigger deploy preview
- Create comprehensive build test script with retry capabilities
- Add detailed logging and build statistics
- Support both retry testing and stress testing modes
- Include build timing and attempt tracking
- Make script executable for Netlify deployment
  - Remove test-remote-failure.md with intentional failures
  - Update netlify.toml for production retry configuration
  - Rename build-test.sh to build.sh and remove test-specific code
  - Configure 3 retries with 30s base exponential backoff delay
  - Simplify logging for production use while keeping retry functionality
ebembi-crdb requested a review from a team as a code owner on September 22, 2025 13:19

netlify bot commented Sep 22, 2025

Deploy Preview for cockroachdb-interactivetutorials-docs canceled.

Name Link
🔨 Latest commit d263130
🔍 Latest deploy log https://app.netlify.com/projects/cockroachdb-interactivetutorials-docs/deploys/68d14ccf506e170008846fc1


netlify bot commented Sep 22, 2025

Deploy Preview for cockroachdb-api-docs canceled.

Name Link
🔨 Latest commit d263130
🔍 Latest deploy log https://app.netlify.com/projects/cockroachdb-api-docs/deploys/68d14ccf170e820008f54125


Files changed:

  • src/current/netlify.toml
  • src/current/netlify/build.sh
  • src/current/package.json


netlify bot commented Sep 22, 2025

Netlify Preview

Name Link
🔨 Latest commit d263130
🔍 Latest deploy log https://app.netlify.com/projects/cockroachdb-docs/deploys/68d14ccf12e8ee00085bb498
😎 Deploy Preview https://deploy-preview-20409--cockroachdb-docs.netlify.app

Contributor

mikeCRL left a comment


@ebembi-crdb I like the logging and exponential backoff. I tested the PR with some common error types (via a few commits on a branch off your branch) and found the following:

  • Good behavior: htmltest failures (bad anchor links) do not retry - this is correct since these are permanent errors (commit/log)
  • Problematic behavior:
    • Liquid syntax errors retry 3 times (taking an extra ~4-5 minutes) (commit/log)
    • Bad Liquid {% link %} references to non-existent files retry 3 times (taking an extra ~4-5 minutes) (commit/log)

Both of these are permanent errors that won't resolve on retry.

I would suggest adding error classification so that we only retry transient network errors (and in the future, we could always add additional criteria).

Specifically, per Claude, I believe we could modify the build_with_monitoring function to capture and analyze Jekyll's output:

function build_with_monitoring {
    local config=$1
    local build_log="build_${ATTEMPT_COUNT}.log"

    # Capture Jekyll output for analysis; check PIPESTATUS so the tee pipe
    # doesn't mask Jekyll's exit status
    bundle exec jekyll build --trace --config "_config_base.yml,$config" 2>&1 | tee "$build_log"
    local exit_code=${PIPESTATUS[0]}

    if [[ $exit_code -eq 0 ]]; then
        return 0
    fi

    # Only retry on transient network errors
    if grep -qE "Temporary failure in name resolution|SocketError|Connection refused|Connection reset|Failed to open TCP connection" "$build_log"; then
        return 2  # Retryable error
    else
        # Permanent errors: Liquid syntax, missing files, ArgumentError
        echo "❌ Permanent build error detected - not retrying"
        return 1  # Non-retryable
    fi
}

Then update build_with_retries to check the return code:

if build_with_monitoring "$config"; then
    success=true
    break
else
    local result=$?
    if [[ $result -eq 1 ]]; then
        echo "Permanent error - failing immediately"
        break  # Don't retry
    fi
    # Only retry if result == 2 (transient error)
fi

I have not tested this, but I wanted to share it in case it is helpful.
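
For illustration only, here is one way the fragment above could sit inside a complete build_with_retries loop. This checks the return code directly rather than via if/else, assumes MAX_RETRIES and BASE_RETRY_DELAY are in scope, and may differ from the loop in the actual build.sh:

function build_with_retries {
    local config=$1

    for ATTEMPT_COUNT in $(seq 1 "$MAX_RETRIES"); do
        # ATTEMPT_COUNT is also used by build_with_monitoring for its log filename
        build_with_monitoring "$config"
        local result=$?

        if [[ $result -eq 0 ]]; then
            return 0                      # Build succeeded
        elif [[ $result -eq 1 ]]; then
            echo "Permanent error - failing immediately"
            return 1                      # Don't retry
        fi

        # result == 2: transient error - back off exponentially before the next attempt
        if [[ $ATTEMPT_COUNT -lt $MAX_RETRIES ]]; then
            local delay=$((BASE_RETRY_DELAY * 2 ** (ATTEMPT_COUNT - 1)))
            echo "Transient error - retrying in ${delay}s..."
            sleep "$delay"
        fi
    done

    echo "Build failed after $MAX_RETRIES attempts"
    return 1
}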

One way or another, I do think it is worth retrying only on DNS/network failures, since I can't think of any other failure where retrying would have benefited us, and I know that retrying in the other cases I mentioned would waste time for writers when they aren't closely watching their build log.
