Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
68 changes: 66 additions & 2 deletions plugins/newspack-plugin/includes/class-alert-manager.php
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,15 @@ class Alert_Manager {
*/
const FAILURE_LOG_OPTION = 'newspack_alert_failure_log';

/**
* Window during which a repeating health-check failure for the same
* integration + error-code signature emits at most one Slack alert.
*
* Private because no external caller needs to read it; keeping the
* surface minimal lets the value evolve without breaking consumers.
*/
private const HEALTH_CHECK_DEDUP_INTERVAL = DAY_IN_SECONDS;

/**
* Default pattern rules.
* Each rule defines a grouping dimension, threshold, and time interval.
Expand Down Expand Up @@ -454,14 +463,52 @@ private static function format_interval( $seconds ) {
/**
* Handle integration health check failure.
*
* Deduplicates by integration + error-code + error-message signature
* for HEALTH_CHECK_DEDUP_INTERVAL so an hourly cron does not repeat
* the same Slack alert all day. A new error code OR a changed message
* on the same integration (e.g. "list missing" escalating to "auth
* fully revoked") falls outside the key and alerts immediately.
*
* Known boundaries of the dedup contract:
* - Message text is part of the key, so locale shifts between cron
* passes (e.g. `switch_to_locale()` in a multilingual context) can
* produce a different key for the same underlying error and
* re-alert. Newspack ESP error messages are static per code today,
* so this is theoretical; revisit if dynamic content lands in
* error strings.
* - The dedup key is stored as a transient, so on hosts backed by a
* persistent object cache (memcached) the entry can be evicted
* under LRU pressure before HEALTH_CHECK_DEDUP_INTERVAL elapses.
* The failure mode is re-alerting on the next hourly cron — the
* alternative (writing to the options table on every cron tick)
* has its own cost; transient + accepted re-alert risk is the
* intentional trade-off here.
*
* @param array $payload Health check failure data.
*/
public static function handle_health_check_failed( $payload ) {
$error = $payload['error'] ?? null;
$error = $payload['error'] ?? null;
$integration_id = $payload['integration_id'] ?? 'unknown';
$error_codes = is_wp_error( $error ) ? $error->get_error_codes() : [];

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Dedup keys on error codes only, not messages

An escalating failure that keeps the same code set but carries a worse message (e.g. "list missing" → "auth fully revoked") is suppressed for the full 24h window, so the richer signal never reaches Slack. Worth confirming that's acceptable, or fold a short message hash into the key.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 0654b80 — the dedup key now folds WP_Error::get_error_messages() into the md5 alongside the codes. An escalating same-code failure with a worse message now bypasses the bucket and re-alerts. Added test_health_check_failed_alerts_on_new_error_messages. Trade-off noted in the docblock for any future caller passing dynamic content (timestamps/IDs) in the message — for the current Newspack ESP errors, messages are static per code so dedup remains stable.

if ( empty( $error_codes ) ) {
$error_codes = [ 'unknown' ];
}
$error_messages = is_wp_error( $error ) ? $error->get_error_messages() : [];

$dedup_key = self::get_health_check_dedup_key( $integration_id, $error_codes, $error_messages );
Comment on lines +466 to +498

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right — PR description updated. The 'Changes proposed' section now states the dedup key is integration_id + sorted error codes + sorted error messages and explicitly calls out the trade-off (escalating same-code failures re-alert; dynamic content in messages would inflate alert volume — current Newspack ESP messages are static per code so it lands on the right side). The change came out of @chickenn00dle's feedback on the prior codes-only signature; test_health_check_failed_alerts_on_new_error_messages pins the new behavior.

if ( get_transient( $dedup_key ) ) {
return;
}

// Set the dedup transient BEFORE dispatch so a `newspack_alert`
// handler that throws (e.g. transient Slack POST failure) cannot
// defeat dedup by leaving the key unset for the next hourly cron.
set_transient( $dedup_key, time(), self::HEALTH_CHECK_DEDUP_INTERVAL );

$message = sprintf(
'Integration "%s" health check failed: %s',
$payload['integration_name'] ?? 'unknown',
is_wp_error( $error ) ? implode( '; ', $error->get_error_messages() ) : 'unknown error'
is_wp_error( $error ) ? implode( '; ', $error_messages ) : 'unknown error'
);

/** This action is documented in includes/class-alert-manager.php */
Expand All @@ -476,5 +523,22 @@ public static function handle_health_check_failed( $payload ) {
]
);
}

/**
* Get the deduplication transient key for a health-check failure.
*
* @param string $integration_id The integration identifier.
* @param string[] $error_codes The WP_Error codes from the failure.
* @param string[] $error_messages The WP_Error messages from the failure.
*
* @return string Transient key.
*/
private static function get_health_check_dedup_key( $integration_id, $error_codes, $error_messages = [] ) {
$codes = array_map( 'strval', $error_codes );
sort( $codes );
$messages = array_map( 'strval', $error_messages );
sort( $messages );
return 'newspack_alert_hc_' . md5( $integration_id . ':' . implode( ',', $codes ) . ':' . implode( '|', $messages ) );
}
}
Alert_Manager::init();
Original file line number Diff line number Diff line change
Expand Up @@ -364,6 +364,29 @@ public static function get_active_integrations() {
return $active;
}

/**
* Get active integrations whose external prerequisites are also configured.
*
* Iteration sites that perform real I/O on each integration (health checks,
* contact pushes, pulls, retries) MUST use this instead of
* `get_active_integrations()` to avoid alerting on, or scheduling AS
* retries for, integrations the admin has not yet finished setting up.
*
* `is_set_up()` is a stored-state check by contract — see ESP's override.
* Treating it as a runtime probe here would silently drop traffic on
* transient provider failures.
*
* @return Integration[] Integrations that are both enabled AND set up.
*/
public static function get_active_configured_integrations() {
return array_filter(
self::get_active_integrations(),
function ( $integration ) {
return $integration->is_set_up();
}
);
}

/**
* Get a specific integration by ID.
*
Expand Down Expand Up @@ -742,13 +765,17 @@ public static function deactivate_health_check() {
}

/**
* Run health checks on all active integrations.
* Run health checks on all active and configured integrations.
*
* Routed through `get_active_configured_integrations()` so a missing
* provider or unconfigured master list — a setup-incomplete state, not a
* runtime incident — surfaces in the integrations UI rather than the
* alerts channel.
*
* Logs failures and fires an action for the Alert Manager.
*/
public static function run_health_checks() {
$active = self::get_active_integrations();
foreach ( $active as $integration ) {
foreach ( self::get_active_configured_integrations() as $integration ) {
$result = $integration->health_check();
if ( is_wp_error( $result ) ) {
Logger::error(
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -121,7 +121,7 @@ public static function is_stale( $timestamp ) {
*/
public static function pull_sync( $integrations = [] ) {
if ( empty( $integrations ) ) {
$integrations = Integrations::get_active_integrations();
$integrations = Integrations::get_active_configured_integrations();
}

Logger::log( 'Synchronous pull started for user "' . get_current_user_id() . '".', self::LOGGER_HEADER );
Expand Down Expand Up @@ -158,7 +158,7 @@ public static function pull_sync( $integrations = [] ) {
* @return true|\WP_Error True if all succeeded, WP_Error with combined messages.
*/
public static function pull_all( $user_id ) {
$active_integrations = Integrations::get_active_integrations();
$active_integrations = Integrations::get_active_configured_integrations();
$errors = [];
Comment on lines 160 to 162

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — Fixed in a8cea54. Contact_Pull::pull_sync() now defaults to Integrations::get_active_configured_integrations(), and handle_ajax_pull() has a defense-in-depth check that silently succeeds for an unconfigured integration (so a direct AJAX call can't bypass the gate). Added test_pull_sync_skips_unconfigured_integrations which registers a configured and an unconfigured mock and asserts only the configured one receives a loopback. The other remaining get_active_integrations() callers (register_group_labels, register_my_account_endpoints, promoted-fields, Sync::has_one_syncable_integration) are intentionally kept — they're registration/UI/has-any checks, not integration-walking I/O, so the configured filter doesn't apply.


foreach ( $active_integrations as $integration ) {
Expand Down Expand Up @@ -225,6 +225,14 @@ public static function handle_ajax_pull() {
wp_send_json_error( 'Integration not found or not enabled.', 404 );
}

// Defense-in-depth: pull_sync already filters to set-up integrations,
// but a direct AJAX call could still arrive here for an unconfigured
// integration. Skip silently with success — "not set up" is a no-op,
// not an error worth surfacing to the loopback caller.
if ( ! $integration->is_set_up() ) {
wp_send_json_success();
}

$user_id = get_current_user_id();
if ( ! $user_id ) {
wp_send_json_error( 'No user context.', 403 );
Expand Down Expand Up @@ -418,6 +426,11 @@ public static function execute_integration_retry( $retry_data ) {
return;
}

if ( ! $integration->is_set_up() ) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Same retry-chain abort on the pull side

Mirrors the sync retry guard: a transient is_set_up() probe failure permanently aborts the queued pull retry with no reschedule.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same root cause. Fixed in dae5bae — the stored-state is_set_up() means a transient probe failure can't abort an in-flight pull retry chain.

Logger::log( sprintf( 'Integration "%s" no longer set up on pull retry %d; aborting retry chain.', $integration_id, $retry_count ), self::LOGGER_HEADER );
return;
}

Logger::log( sprintf( 'Executing pull retry %d/%d for integration "%s" of user %d.', $retry_count, self::MAX_RETRIES, $integration_id, $user_id ), self::LOGGER_HEADER );

$result = self::pull_single_integration( $user_id, $integration );
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -37,14 +37,28 @@ public function __construct() {
/**
* Whether the ESP integration is ready to sync.
*
* Mirrors the readiness gate used by get_settings_config() so the configure
* UI never advertises a card as set up while the underlying settings call
* short-circuits to an empty config.
* Checks STORED configuration only — provider option set + master list
* option set. Does NOT call the live provider API. A live `get_lists()`
* call here would mean: every gate that consults `is_set_up()`
* (`Integrations::get_active_configured_integrations()`, the retry-time
* guards in `Contact_Sync` / `Contact_Pull`) would silently skip and
* lose data on any transient provider failure — exactly the failure
* mode the AS retry system was built to survive. The setup question
* "did the admin finish configuring this?" must be answered from local
* state; "is the provider reachable right now?" is `health_check()`'s
* job.
*
* @return bool True if an ESP provider is selected and at least one list is active.
* @return bool True if a provider is selected and a master list ID is stored.
*/
public function is_set_up() {
return Reader_Activation::is_esp_configured();
$newsletters_configuration_manager = Configuration_Managers::configuration_manager_class_for_plugin_slug( 'newspack-newsletters' );
if ( ! $newsletters_configuration_manager || is_wp_error( $newsletters_configuration_manager ) ) {
return false;
}
if ( ! $newsletters_configuration_manager->is_esp_set_up() ) {
return false;
}
return (bool) $this->get_master_list_id();
}

/**
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -162,7 +162,7 @@ private static function push_to_integrations( $contact, $context, $existing_cont
* @param string $context The context of the sync.
*/
$contact = \apply_filters( 'newspack_esp_sync_contact', $contact, $context );
$integrations = Integrations::get_active_integrations();
$integrations = Integrations::get_active_configured_integrations();
$errors = [];

// Resolve user ID for retry scheduling.
Expand Down Expand Up @@ -372,6 +372,11 @@ public static function execute_integration_retry( $retry_data ) {
return;
}

if ( ! $integration->is_set_up() ) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Aborts an in-flight retry chain on a transient probe failure

Same root cause as the push-path comment above. A retry that was queued because of a real prior failure hits this guard at execution time; if the live is_set_up() ESP call momentarily errors, the chain aborts with no reschedule and no exhaustion record — the queued contact update is dropped. A configured-but-temporarily-unreachable ESP is exactly the case the backoff/retry was built to handle.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same root cause as the push-path comment. Fixed in dae5baeis_set_up() is now a stored-state check, so this retry guard can't abort a chain because of a transient probe failure.

static::log( sprintf( 'Integration "%s" no longer set up on retry %d; aborting retry chain.', $integration_id, $retry_count ) );
return;
}

static::log( sprintf( 'Executing retry %d/%d for integration "%s" sync of user %d (%s).', $retry_count, self::MAX_RETRIES, $integration_id, $user_id, $contact['email'] ?? 'unknown' ) );

/** This filter is documented in includes/reader-activation/sync/class-contact-sync.php */
Expand Down
15 changes: 14 additions & 1 deletion plugins/newspack-plugin/tests/mocks/newsletters-mocks.php
Original file line number Diff line number Diff line change
Expand Up @@ -67,6 +67,19 @@ public static function get_fields( $list_id = null ) {
class Newspack_Newsletters {
const EMAIL_HTML_META = 'newspack_email_html';

/**
* Default return value of is_service_provider_configured(). Tests that
* exercise the "provider not selected" branch can flip this to false
* via reset_calls() / direct assignment, then restore it.
*
* @var bool
*/
public static $is_service_provider_configured = true;

public static function reset_calls() {
self::$is_service_provider_configured = true;
}

public static function service_provider() {
return get_option( 'newspack_newsletters_service_provider', false );
}
Expand All @@ -76,7 +89,7 @@ public static function get_service_provider() {
}

public static function is_service_provider_configured() {
return true;
return self::$is_service_provider_configured;
}
}
}
Expand Down
Loading
Loading