
channels/telegram: propagate send failures and cache terminal reaction errors#1100

Open
Streamweaver wants to merge 2 commits into RightNow-AI:main from Streamweaver:fix/telegram-silent-failures

Conversation

@Streamweaver

Summary

The Telegram adapter currently masks two classes of real failures as success.

  1. Send path: `api_send_message` and its five sibling helpers (`api_send_photo`, `api_send_document`, `api_send_document_upload`, `api_send_voice`, `api_send_location`) log a `warn!` on HTTP non-success and return `Ok(())`. Callers (`kernel::send_channel_message`, `tool_channel_send`) interpret that as successful delivery and report "Message sent to <id> via telegram" back to the agent. The user receives nothing; the agent records phantom success in its session history and acts on it in future turns.

  2. Reaction path: `fire_reaction` retries `setMessageReaction` with the same emoji on every turn even when Telegram returns terminal errors like `REACTION_INVALID` (emoji not in the free-reaction allowlist), `REACTION_NOT_AVAILABLE` (chat-admin restriction), or `REACTION_TOO_MANY` (per-message cap). The result is pure log spam plus wasted API calls on every agent turn.

This PR fixes both, adds regression tests, and brings the adapter into line with the crate's error-handling convention documented in CONTRIBUTING.md (lines 116, 239–247).

Evidence

Real silent-failure log line from production:

WARN openfang_channels::telegram: Telegram sendMessage failed (400 Bad Request): {"ok":false,"error_code":400,"description":"Bad Request: can't parse entities: Unmatched end tag at byte offset 988, expected \"</i>\", found \"</code>\""}

The agent loop that produced this warning reported success to its calling tool. The message was never delivered. The agent's session history recorded "Message sent," which then corrupted future behavior.

Reaction spam from the same session, every turn forever:

DEBUG Telegram setMessageReaction failed: {"ok":false,"error_code":400,"description":"Bad Request: REACTION_INVALID"}
DEBUG Telegram setMessageReaction failed: {"ok":false,"error_code":400,"description":"Bad Request: REACTION_INVALID"}

Changes

Commit 1 — propagate send failures from api_send_*

crates/openfang-channels/src/telegram.rs:

  • api_send_photo, api_send_document, api_send_document_upload, api_send_voice, api_send_location now return Err(format!(...).into()) on HTTP non-success in addition to the existing warn!.
  • api_send_message splits long text via split_message(4096). Naively returning Err on any chunk failure would introduce a "partial-delivery-then-error" regression — worse than the original silent success. The function now tracks delivered_any across chunks and:
    • Returns Err on first-chunk failure (nothing delivered yet; surface the failure — this is where the motivating HTML-parse-error bug lives).
    • Logs warn! and continues on subsequent-chunk failure (user already received preceding chunks; best-effort completes the send).
    • This matches the existing convention used by every other adapter in the crate that calls split_message (Discord, Gitter, Mattermost, Nextcloud, Twitch, Pumble).
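The chunk-loop logic above can be sketched as follows. This is a minimal, self-contained illustration, not the adapter's source: `split_message` and `post_send` are simplified stand-ins for the real helpers (the actual `split_message` respects char boundaries and the actual send is an async HTTP call), and the fail-injection parameter exists only to make the sketch testable.

```rust
// Naive byte-based stand-in for the real split_message(4096) helper.
fn split_message(text: &str, limit: usize) -> Vec<String> {
    text.as_bytes()
        .chunks(limit)
        .map(|c| String::from_utf8_lossy(c).into_owned())
        .collect()
}

// Stand-in for the HTTP call: Ok(()) on 2xx, Err(body) otherwise.
// `fail_on` injects a failure for a specific chunk, for demonstration.
fn post_send(chunk: &str, fail_on: Option<&str>) -> Result<(), String> {
    if fail_on == Some(chunk) {
        Err(format!("400 Bad Request while sending chunk of {} bytes", chunk.len()))
    } else {
        Ok(())
    }
}

fn api_send_message(text: &str, fail_on: Option<&str>) -> Result<(), String> {
    let mut delivered_any = false;
    for chunk in split_message(text, 4096) {
        match post_send(&chunk, fail_on) {
            Ok(()) => delivered_any = true,
            // Nothing delivered yet: surface the failure to the caller.
            Err(e) if !delivered_any => return Err(e),
            // User already received earlier chunks: warn and finish best-effort.
            Err(e) => eprintln!("warn: subsequent chunk failed, continuing: {e}"),
        }
    }
    Ok(())
}
```

The asymmetry is the point: a first-chunk failure (the motivating HTML-parse case) becomes a hard `Err`, while a mid-stream failure degrades to the old warn-and-continue behavior rather than reporting an error for a message the user partially received.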

Commit 2 — cache terminal setMessageReaction errors per (chat, emoji)

  • New TelegramAdapter field: rejected_reactions: Arc<Mutex<HashSet<(i64, String)>>>. Uses std::sync::Mutex — critical section is two HashSet ops (contains + insert), never held across .await. Endorsed by the Tokio shared-state tutorial for this exact shape.
  • New private helper is_terminal_reaction_error(body_text: &str) -> bool matches REACTION_INVALID | REACTION_NOT_AVAILABLE | REACTION_TOO_MANY. Transient errors (429, 5xx, MESSAGE_NOT_MODIFIED, unrelated 400s) are NOT cached.
  • fire_reaction short-circuits early if (chat_id, emoji) is cached; inserts on terminal-error response. Key-by-chat honors Chat.available_reactions varying per chat and being admin-mutable (API ref). Cache is per-process — rebuilds naturally on restart, which handles any runtime allowlist change without needing persistence.
  • A brief race is possible where two concurrent fire_reaction calls for the same (chat, emoji) both pass the cache check before either rejection lands, producing up to N duplicate API calls on the first rejection. The duplicate insert is idempotent so this is benign and self-limits on the second turn; documented in-code.
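A rough sketch of the matcher and cache shape described above, assuming the field and helper names from this PR description (the real code lives inside `TelegramAdapter` and shares the set via `Arc`; this standalone version keeps only the lock discipline and key structure):

```rust
use std::collections::HashSet;
use std::sync::Mutex;

// Substring-match only the three permanent rejection codes; transient
// errors (429, 5xx, MESSAGE_NOT_MODIFIED, unrelated 400s) fall through.
fn is_terminal_reaction_error(body: &str) -> bool {
    body.contains("REACTION_INVALID")
        || body.contains("REACTION_NOT_AVAILABLE")
        || body.contains("REACTION_TOO_MANY")
}

// Per-process cache keyed by (chat_id, emoji). std::sync::Mutex is fine:
// the critical section is a single HashSet op, never held across .await.
struct ReactionCache {
    rejected: Mutex<HashSet<(i64, String)>>,
}

impl ReactionCache {
    fn new() -> Self {
        Self { rejected: Mutex::new(HashSet::new()) }
    }

    // Short-circuit check at the top of fire_reaction.
    fn is_rejected(&self, chat_id: i64, emoji: &str) -> bool {
        self.rejected.lock().unwrap().contains(&(chat_id, emoji.to_string()))
    }

    // Insert on a terminal-error response; a racing duplicate insert
    // is idempotent, so the brief check-then-insert race is benign.
    fn mark_rejected(&self, chat_id: i64, emoji: &str) {
        self.rejected.lock().unwrap().insert((chat_id, emoji.to_string()));
    }
}
```

Keying by `(chat_id, emoji)` rather than emoji alone is what lets the same emoji stay live in chats whose `available_reactions` still permit it.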

Tests

10 new tests in the same file using a small in-crate stub server (axum on an ephemeral port, reached via the existing api_url constructor seam — zero new dependencies):

Send path:

  • test_api_send_message_single_chunk_400_returns_err
  • test_api_send_message_single_chunk_200_returns_ok
  • test_api_send_message_first_chunk_fail_returns_err
  • test_api_send_message_partial_delivery_returns_ok (multi-chunk B1 regression guard)

Reaction cache:

  • test_is_terminal_reaction_error_matches
  • test_is_terminal_reaction_error_rejects_transient
  • test_fire_reaction_caches_on_reaction_invalid
  • test_fire_reaction_cache_is_per_chat (proves the per-chat key invariant)
  • test_fire_reaction_does_not_cache_non_terminal
  • test_fire_reaction_does_not_cache_on_success
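The `api_url` constructor seam these tests rely on might look roughly like the following. Type and method names here are illustrative, not the crate's actual API; the idea is simply that the base URL is injected rather than hard-coded, so a test can point the adapter at a local stub:

```rust
// Hypothetical seam: the adapter stores a base URL instead of hard-coding
// https://api.telegram.org, so tests can substitute an ephemeral stub server.
struct TelegramApi {
    base_url: String,
    token: String,
}

impl TelegramApi {
    fn with_base_url(base_url: impl Into<String>, token: impl Into<String>) -> Self {
        Self { base_url: base_url.into(), token: token.into() }
    }

    // Builds the per-method endpoint in the Bot API's /bot<token>/<method> shape.
    fn method_url(&self, method: &str) -> String {
        format!("{}/bot{}/{}", self.base_url, self.token, method)
    }
}
```

In a test, `with_base_url("http://127.0.0.1:<port>", …)` points every helper at the stub, which is why no new dependencies are needed beyond what the crate already uses.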

All 47 telegram tests pass (37 pre-existing + 10 new). No new clippy warnings on the changed file.

Compatibility

  • Public API: unchanged. ChannelAdapter trait signatures unchanged.
  • Behavior change for legitimate callers: previously-silent send failures now surface as Err. Every call site in the repo was audited and handles Err correctly:
    • kernel::send_channel_message (kernel/src/kernel.rs:6988-6996) — already maps via .map_err(|e| format!("Channel send failed: {e}"))?.
    • tool_channel_send (runtime/src/tool_runner.rs:2430) — propagates to the agent as a tool-result error string.
    • bridge::send_response_to_channel — logs and swallows (no behavior change; log line is now truthful).
    • cron_delivery::deliver_to_target — per-target best-effort (no behavior change).
    • bridge::spawn_typing_loop — discards with let _ (no behavior change).
  • No config, schema, or dependency changes.
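The call-site pattern audited above reduces to a simple shape. This is a sketch of the `kernel::send_channel_message` mapping described in the first bullet, with an illustrative signature (the real function takes a channel id and message, not a closure):

```rust
// Sketch of the audited caller: any Err surfaced by the adapter is mapped
// into a string and propagated, so the newly-surfaced send failures reach
// the agent instead of silently disappearing.
fn send_channel_message(
    send: impl FnOnce() -> Result<(), Box<dyn std::error::Error>>,
) -> Result<(), String> {
    send().map_err(|e| format!("Channel send failed: {e}"))
}
```

Callers that deliberately swallow the result (`bridge::send_response_to_channel`, `spawn_typing_loop`) keep their old behavior; only the truthfulness of the outcome changes.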

Follow-ups (out of scope)

  • Five sibling adapters share the same log-and-Ok silent-failure pattern: discord.rs:131-135, slack.rs:127-132, feishu.rs:409-427, line.rs:494-498, revolt.rs:187-194. Happy to file individual issues if welcome.
  • OpenFang's default lifecycle emojis (`ALLOWED_REACTION_EMOJI` / `default_phase_emoji` in `types.rs`) include several outside Telegram's free-reaction allowlist (⏳ ⚙️ ✅ ❌ 🔄). A per-channel emoji default set would eliminate the reaction-cache churn entirely for Telegram while keeping the current semantic set for other adapters. Behavior-defining rather than bug-fixing — belongs in a separate discussion.

Test plan

cargo test --release -p openfang-channels -- telegram    # 47/47 pass
cargo clippy --release -p openfang-channels --lib --tests

Live verification on a running instance: user sent real Telegram messages, the reply landed, and the log showed first-rejection-then-cached behavior for the disallowed emojis; subsequent turns short-circuited with zero REACTION_INVALID lines and no send-path regressions.

The six outbound helpers in the Telegram adapter (`api_send_message`,
`api_send_photo`, `api_send_document`, `api_send_document_upload`,
`api_send_voice`, `api_send_location`) previously
logged a `warn!` on HTTP non-success and still returned `Ok(())`. Callers
interpreted that as successful delivery and told the agent "Message sent"
even when Telegram had rejected the request (e.g. 400 Bad Request from
malformed HTML entities with parse_mode=HTML). The agent recorded phantom
success in its session history, corrupting subsequent behavior.

The fix returns `Err(format!(...).into())` on HTTP non-success in all six
helpers, matching the error-handling convention documented in
CONTRIBUTING.md.

`api_send_message` is slightly different because it splits long messages
into chunks via `split_message(4096)`. Naively returning `Err` on any
chunk failure would create a partial-delivery-then-error regression —
worse than the original silent success. The function now tracks
`delivered_any` across chunks:

- First-chunk failure (nothing delivered yet) → return `Err` to surface
  the failure. This is where the motivating HTML-parse-error bug lives,
  so the fix is fully effective.
- Subsequent-chunk failure (user already received preceding chunks) →
  log `warn!` and continue with best-effort delivery, matching the
  convention used by every other adapter in the crate that calls
  `split_message` (Discord, Gitter, Mattermost, Nextcloud, Twitch,
  Pumble, etc.).

Tests: 4 new tests using a small in-crate stub server (axum on an
ephemeral port, reached via the existing `api_url` constructor seam —
zero new dependencies). 41 telegram tests pass (37 existing + 4 new).
cache terminal setMessageReaction errors per (chat, emoji)

`fire_reaction` calls `setMessageReaction` fire-and-forget on every
agent lifecycle event. When Telegram returns a terminal error like
`REACTION_INVALID` (emoji not in the bot's free-reaction allowlist),
`REACTION_NOT_AVAILABLE` (chat admin restricted this emoji), or
`REACTION_TOO_MANY` (per-message cap), retrying on every subsequent
turn is pointless log spam and wasted API quota.

This adds a per-bot-instance `HashSet<(i64, String)>` keyed by
`(chat_id, emoji)` that records terminal rejections and short-circuits
future calls for the same pair. Keyed by chat, not just emoji, because
`Chat.available_reactions` varies across chats and is admin-mutable
(https://core.telegram.org/bots/api#setmessagereaction) — an emoji
rejected in chat A may still be valid in chat B. Cache is
per-process; on restart it rebuilds naturally, which handles any
runtime allowlist change without needing persistence.

The terminal-error match uses a small private helper
`is_terminal_reaction_error` that substring-matches the three
permanent errors. Transient errors (429, 5xx, `MESSAGE_NOT_MODIFIED`,
unrelated 400s) are deliberately NOT cached.

Concurrency: the cache uses `std::sync::Mutex` — critical section is
two `HashSet` ops (contains + insert), never held across `.await`.
Endorsed by the Tokio shared-state tutorial
(https://tokio.rs/tokio/tutorial/shared-state) for exactly this shape.
Two concurrent `fire_reaction` calls for the same (chat, emoji) can
both pass the cache check before either rejection lands, producing up
to N duplicate API calls on the first rejection; the duplicate
`insert` is idempotent so this is benign and self-limits on the
second turn. Documented in-code.

Tests: 6 new tests covering terminal-error matching, cache insertion,
per-chat key isolation, and non-caching of transient and successful
responses. Total 47 telegram tests pass (41 existing + 6 new). No new
clippy warnings.
