
Conversation

@AlanPonnachan

@AlanPonnachan AlanPonnachan commented Sep 19, 2025

  • I have added tests that cover my changes.
  • If adding a new instrumentation or changing an existing one, I've added screenshots from some observability platform showing the change.
  • PR name follows conventional commits format: feat(instrumentation): ... or fix(instrumentation): ....
  • (If applicable) I have updated the documentation accordingly.

Description

This PR introduces prompt caching telemetry for the AWS Bedrock Converse and Converse Stream APIs, bringing feature parity with the existing invoke_model instrumentation.

The Converse API reports caching information in the usage field of the response body, rather than through HTTP headers. This implementation adds the necessary logic to parse this information and record it as metrics and span attributes.

Changes include:

  1. New function prompt_caching_converse_handling in prompt_caching.py to extract cache_read_input_tokens and cache_creation_input_tokens from the response body.
  2. Integration into __init__.py: The new function is now called from _handle_converse and _handle_converse_stream to process caching data for both standard and streaming calls.
  3. New Test File: Added test_bedrock_converse_prompt_caching_metrics.py to validate that the gen_ai.prompt.caching metric is correctly emitted for the Converse API.

Fixes #3337
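The extraction described in change (1) can be sketched as follows. This is a minimal illustration, not the PR's actual code: the usage key names are taken from the description above, and the dict-based access is an assumption about the response shape.

```python
# Hypothetical sketch of extracting cache token counts from a Converse
# response body. Key names are assumptions for illustration; the real
# field names come from the Bedrock Converse usage payload.
def extract_cache_tokens(response: dict) -> tuple[int, int]:
    usage = response.get("usage", {})
    read_tokens = usage.get("cache_read_input_tokens", 0)
    write_tokens = usage.get("cache_creation_input_tokens", 0)
    return read_tokens, write_tokens


example = {"usage": {"cache_read_input_tokens": 128, "cache_creation_input_tokens": 0}}
print(extract_cache_tokens(example))  # (128, 0)
```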


Important

Adds prompt caching telemetry for AWS Bedrock Converse APIs, including new function for caching data extraction and corresponding tests.

  • Behavior:
    • Adds prompt_caching_converse_handling in prompt_caching.py to extract caching data from Converse API response body.
    • Integrates prompt_caching_converse_handling into _handle_converse and _handle_converse_stream in __init__.py.
  • Testing:
    • Adds test_bedrock_converse_prompt_caching_metrics.py to validate gen_ai.prompt.caching metric emission for Converse API.

This description was created by Ellipsis for 4fa3792. You can customize this summary. It will automatically update as commits are pushed.

Summary by CodeRabbit

  • New Features

    • Prompt-caching detection for Converse (streaming and non‑streaming), recording cache read/write state and token-level usage into telemetry and metrics.
  • Tests

    • Added tests validating cache write and cache read scenarios, span attributes, and aggregated token metrics.
    • Added a test cassette exercising a Converse error response path.

@CLAassistant

CLAassistant commented Sep 19, 2025

CLA assistant check
All committers have signed the CLA.

Contributor

@ellipsis-dev ellipsis-dev bot left a comment


Caution

Changes requested ❌

Reviewed everything up to 4fa3792 in 2 minutes and 2 seconds.
  • Reviewed 156 lines of code in 3 files
  • Skipped 0 files when reviewing.
  • Skipped posting 2 draft comments. View those below.
1. packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/__init__.py:359
  • Draft comment:
    Good integration of prompt_caching_converse_handling in _handle_converse. In the streaming handler (lines ~400), note that if both read and write tokens are present, the span attribute may be overwritten. Ensure this is the intended behavior.
  • Reason this comment was not posted:
    Comment was on unchanged code.
2. packages/opentelemetry-instrumentation-bedrock/tests/metrics/test_bedrock_converse_prompt_caching_metrics.py:56
  • Draft comment:
    The test correctly validates prompt caching metrics for Converse API. The cumulative workaround for metric values indicates the underlying counter is cumulative. Consider resetting metrics between tests to avoid cross-test interference if possible.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50% The comment has two parts: 1) An observation about the cumulative nature of the metrics which is already documented in the code comments, and 2) A speculative suggestion about resetting metrics that isn't clearly necessary since the current approach works. The comment doesn't identify any actual problems or required changes. The suggestion about resetting metrics could be valid if there's evidence of cross-test interference, but we don't see any such evidence. The current workaround seems intentional and functional. Since the current approach is working and documented, and there's no evidence of actual problems, the suggestion is more speculative than necessary. Delete the comment as it's primarily informative/observational and makes a speculative suggestion without clear evidence of need for change.
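The "cumulative workaround" referenced in the draft comment can be illustrated generically: a cumulative counter reports a running total, so a test captures a baseline after the first call and asserts on the delta after the second. This is a sketch under that assumption, not the actual test code.

```python
# Generic sketch of the cumulative-counter workaround: the in-memory
# metric reader reports a running total across reads, so a test isolates
# the second call's contribution by subtracting a recorded baseline.
class CumulativeReader:
    def __init__(self):
        self.total = 0

    def record(self, value):
        self.total += value

    def read(self):
        return self.total


reader = CumulativeReader()
reader.record(100)            # first call: cache-write tokens
baseline = reader.read()
reader.record(40)             # second call: cache-read tokens
delta = reader.read() - baseline
print(delta)  # 40, the second call's contribution in isolation
```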

Workflow ID: wflow_bNUeXv3pUdPPxbhz


…nstrumentation/bedrock/prompt_caching.py

Co-authored-by: ellipsis-dev[bot] <65095814+ellipsis-dev[bot]@users.noreply.github.com>
@nirga changed the title from "feat(bedrock): Add prompt caching support for Converse API" to "fix(bedrock): Add prompt caching support for Converse API" on Sep 19, 2025
Member

@nirga nirga left a comment


hey @AlanPonnachan - looks like tests are failing, can you take a look?

@AlanPonnachan
Author

Hi @nirga

I’ve resolved the lint test failures. The remaining failing test, test_prompt_cache_converse, is expected since it requires a VCR cassette to be recorded.

As I don’t have access to an active AWS account, I’m unable to generate the test_prompt_cache_converse.yaml cassette file myself. Would you be able to check out this branch, run the test and push the generated cassette file to this PR?

Thanks for your help!

Member

@nirga nirga left a comment


Sure @AlanPonnachan, will do it - can you fix the small comment I wrote? I'll then run it locally and record a test. BTW - if you can rely on existing converse tests it might be easier

@AlanPonnachan
Author

Thanks for the great suggestion and for your willingness to help record the test!

I agree that relying on an existing test is a cleaner approach. Before I push the changes, I just want to confirm my plan sounds good to you.

Here is what I am planning to do:

  1. Modify the Existing Test: I will update the test_titan_converse function in tests/traces/test_titan.py.
  2. Enable Caching: I'll add the additionalModelRequestFields with cacheControl to the existing brt.converse API call.
  3. Test Both Scenarios: I will add a second brt.converse call within that same test to ensure we cover both the initial "cache-write" and the subsequent "cache-read".
  4. Add Assertions: I will add the metric assertions I wrote to validate that the prompt caching counters are working correctly.
  5. Clean Up: Finally, I will delete the new test file I originally created (test_bedrock_converse_prompt_caching_metrics.py).

This will result in the cassette for test_titan_converse.yaml needing to be re-recorded, as you mentioned.

Does this plan look good? If so, I'll go ahead and make the changes.

@AlanPonnachan
Author

@nirga any update on this?

@coderabbitai

coderabbitai bot commented Oct 29, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds response/metadata-based prompt caching handling for Bedrock Converse and invokes it from both synchronous and streaming converse flows; introduces a new handler that reads cache read/write token counts from responses/metadata, sets span attributes, and emits cache metrics when enabled.

Changes

Cohort / File(s) Summary
Instrumentation entrypoints
packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/__init__.py
Imported prompt_caching_converse_handling (and prompt_caching_handling) and added calls to prompt_caching_converse_handling(...) in _handle_converse() and in the final-message path of _handle_converse_stream() after guardrail handling.
Prompt caching logic
packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/prompt_caching.py
Added prompt_caching_converse_handling(response, vendor, model, metric_params) which extracts cache read/write token counts from a response or metadata object, verifies current span presence and recording, sets CACHED span attribute and gen_ai.usage.* token attributes, and emits cache read/write metrics when configured.
Tests & cassettes
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py, packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py, packages/opentelemetry-instrumentation-bedrock/tests/traces/cassettes/test_titan/test_titan_converse_with_caching.yaml
Added metric helper functions and test_anthropic_converse_with_caching to validate caching read/write behaviour (note: duplicate test definition present); minor formatting tweak in test_titan.py; added a Titan Converse cassette capturing a 403 auth-failure response for converse-with-caching scenario.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant BedrockInstrumentor
    participant BedrockAPI
    participant PromptCacheHandler
    participant Span
    participant Metrics

    Client->>BedrockInstrumentor: invoke_converse(...)
    BedrockInstrumentor->>BedrockAPI: call Converse (sync or stream)
    BedrockAPI-->>BedrockInstrumentor: response / streamed events (include usage_metadata or metadata)

    alt sync final response
      BedrockInstrumentor->>PromptCacheHandler: prompt_caching_converse_handling(response, vendor, model, metric_params)
    else streaming final message
      BedrockInstrumentor->>PromptCacheHandler: prompt_caching_converse_handling(metadata, vendor, model, metric_params)
    end

    rect #EEF8FF
      Note over PromptCacheHandler: extract cache_read/cache_creation\nset span attrs and gen_ai.usage.* token attrs
      PromptCacheHandler->>Span: set CACHED ("read"/"write") and token attributes
    end

    alt cache_read > 0
      PromptCacheHandler->>Metrics: emit cache read metric
    end
    alt cache_creation > 0
      PromptCacheHandler->>Metrics: emit cache write metric
    end

    BedrockInstrumentor-->>Client: return response / stream end
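The conditional emission at the end of the diagram can be sketched as follows: a cache metric is recorded only when the corresponding token count is positive. `Counter` here is a stand-in for an OpenTelemetry counter instrument, and all names are illustrative rather than the instrumentation's actual API.

```python
# Minimal sketch of the branch logic in the diagram above. `Counter`
# imitates a metrics counter; real code would use an OTel instrument.
class Counter:
    def __init__(self):
        self.points = []

    def add(self, value, attributes):
        self.points.append((value, attributes))


def emit_cache_metrics(counter, read_tokens, write_tokens, base_attrs):
    # Emit a "read" metric only when cache-read tokens were reported.
    if read_tokens > 0:
        counter.add(read_tokens, {**base_attrs, "type": "read"})
    # Emit a "write" metric only when cache-creation tokens were reported.
    if write_tokens > 0:
        counter.add(write_tokens, {**base_attrs, "type": "write"})


c = Counter()
emit_cache_metrics(c, read_tokens=128, write_tokens=0, base_attrs={"gen_ai.system": "bedrock"})
print(c.points)  # [(128, {'gen_ai.system': 'bedrock', 'type': 'read'})]
```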

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Verify response/metadata key paths (e.g., usage_metadata, input_token_details) match Bedrock Converse shapes.
  • Confirm span existence and is_recording checks are correct for streaming final-message handling.
  • Ensure handler is only invoked for final events (not intermediate stream events).
  • Review the duplicated test definition in test_anthropic.py and cassette expectations.

Suggested reviewers

  • galkleinman

Poem

🐰 I hopped through responses, counted tokens by light,

read and wrote some markers, and set spans just right.
Metrics took note as the cache did its dance,
traces wore small badges from a fortuitous chance.
Hooray — Converse caching tracked in a joyful prance!

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings, 1 inconclusive)
  • Out of Scope Changes Check (⚠️ Warning): The changes to test_anthropic.py appear to be out of scope for a PR focused on "Add prompt caching support for Converse API" (Bedrock); while they introduce caching test utilities and a new caching test for Anthropic models, they do not directly align with the Bedrock Converse objectives stated in issue #3337. More critically, the summary indicates that test_anthropic_converse_with_caching is defined twice (a duplicate function definition), a clear code error that should not be present in a merged changeset regardless of scope. Resolution: remove the duplicate test_anthropic_converse_with_caching definition from test_anthropic.py, and clarify whether the Anthropic test changes are intentional and necessary for this PR; if they represent a separate feature effort, consider moving them to a separate pull request focused on Anthropic caching.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Linked Issues Check (❓ Inconclusive): The implementation addresses the core requirements of issue #3337 by creating the prompt_caching_converse_handling function that extracts cache metrics from the Converse response body, integrating it into both _handle_converse and _handle_converse_stream, recording cache-related metrics and span attributes, and emitting telemetry to the gen_ai usage attributes. However, the requirement to add and validate tests covering both cache-write and cache-read flows is not fully satisfied: the cassette for test_titan_converse_with_caching.yaml contains a 403 authentication error rather than a successful caching scenario, and reviewers have reported botocore ValidationExceptions indicating the test input contains an extraneous cacheControl key not permitted by the API. Resolution: run the tests with valid AWS credentials to record the VCR cassettes with successful caching scenarios, and correct the test input to remove or properly format the cacheControl parameter.

✅ Passed checks (2 passed)

  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The pull request title "fix(bedrock): Add prompt caching support for Converse API" accurately and concisely describes the main objective of the changeset. The primary changes (adding prompt_caching_converse_handling, integrating it into _handle_converse and _handle_converse_stream, and introducing associated tests) all directly align with this stated goal. The title is specific and clear, making it immediately obvious which aspect of the Bedrock instrumentation is being enhanced.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 771b5d5 and ea2b0da.

📒 Files selected for processing (1)
  • packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py
🧬 Code graph analysis (1)
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py (5)
packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/__init__.py (1)
  • PromptCaching (470-472)
packages/opentelemetry-instrumentation-milvus/tests/conftest.py (1)
  • reader (37-41)
packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/prompt_caching.py (1)
  • CacheSpanAttrs (9-11)
packages/opentelemetry-instrumentation-bedrock/tests/conftest.py (2)
  • instrument_legacy (90-102)
  • brt (36-42)
packages/traceloop-sdk/traceloop/sdk/utils/in_memory_span_exporter.py (1)
  • get_finished_spans (40-43)
🪛 Ruff (0.14.2)
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py

22-22: Create your own exception

(TRY002)


22-22: Avoid specifying long messages outside the exception class

(TRY003)


1113-1113: Unused function argument: instrument_legacy

(ARG001)

🔇 Additional comments (2)
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py (2)

16-44: LGTM! Clear helper functions for metric validation.

The get_metric and assert_metric helper functions are well-structured for validating prompt caching metrics in tests. The logic correctly distinguishes between cache read and write operations and verifies the token counts.


1112-1171: Test implementation looks correct; VCR cassette recording pending.

The test logic is sound and all critical issues from previous reviews have been addressed:

  • ✓ Uses cachePoint blocks in message content (correct API usage)
  • ✓ Uses GenAIAttributes.GEN_AI_REQUEST_MODEL (correct attribute)
  • ✓ Proper indentation and formatting
  • ✓ Correct cache point type "default"

The test validates both cache-write and cache-read flows with appropriate assertions on usage tokens, span attributes, and metrics. According to the PR objectives, the VCR cassette (test_anthropic_converse_with_caching.yaml) still needs to be recorded by a reviewer with AWS credentials.

Note: The static analysis warning about unused instrument_legacy is a false positive—it's a pytest fixture that sets up instrumentation.
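The cachePoint usage the review confirms above can be sketched as a payload builder. The client call itself is omitted, and build_cached_messages is a hypothetical helper for illustration, not part of the PR.

```python
# Sketch of the message shape described in the review: a cachePoint
# content block with type "default" appended to the user message.
# This only builds the payload; sending it via a client is omitted.
def build_cached_messages(prompt_text: str) -> list[dict]:
    return [
        {
            "role": "user",
            "content": [
                {"text": prompt_text},
                {"cachePoint": {"type": "default"}},
            ],
        }
    ]


msgs = build_cached_messages("Summarize the document.")
print(msgs[0]["content"][1])  # {'cachePoint': {'type': 'default'}}
```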



Member

@nirga nirga left a comment


Sorry about the delay @AlanPonnachan! I was OOO and this fell between the cracks (team missed it for some reason). I'm back now! Left a small comment reg. the test - lmk if you need any assistance


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/prompt_caching.py (1)

59-85: Span attribute overwrite when both cache operations occur.

If a response contains both cache_read_input_tokens > 0 and cache_creation_input_tokens > 0, the second block (line 82) will overwrite the CACHED attribute set by the first block (line 68). This means only "write" will be recorded on the span, losing the "read" information.

Consider one of these approaches:

  • Accumulate both values in a single attribute (e.g., "read,write")
  • Use separate span attributes for read and write operations
  • Prioritize one operation type if simultaneous occurrence is expected to be rare

Apply this diff to record both operations:

     if read_cached_tokens > 0:
         if metric_params.prompt_caching:
             metric_params.prompt_caching.add(
                 read_cached_tokens,
                 attributes={
                     **base_attrs,
                     CacheSpanAttrs.TYPE: "read",
                 },
             )
-        span.set_attribute(CacheSpanAttrs.CACHED, "read")
+        span.set_attribute(f"{CacheSpanAttrs.CACHED}.read", True)
         span.set_attribute(
             "gen_ai.usage.cache_read_input_tokens", read_cached_tokens
         )
 
     if write_cached_tokens > 0:
         if metric_params.prompt_caching:
             metric_params.prompt_caching.add(
                 write_cached_tokens,
                 attributes={
                     **base_attrs,
                     CacheSpanAttrs.TYPE: "write",
                 },
             )
-        span.set_attribute(CacheSpanAttrs.CACHED, "write")
+        span.set_attribute(f"{CacheSpanAttrs.CACHED}.write", True)
         span.set_attribute(
             "gen_ai.usage.cache_creation_input_tokens", write_cached_tokens
         )
🧹 Nitpick comments (1)
packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/prompt_caching.py (1)

14-43: Consider adding is_recording check for consistency.

The new prompt_caching_converse_handling function includes an is_recording() check (line 52 in the new function), but this existing function doesn't. For consistency and to avoid unnecessary span operations when not recording, consider adding the same check here.

Apply this diff:

 def prompt_caching_handling(headers, vendor, model, metric_params):
     base_attrs = {
         "gen_ai.system": vendor,
         "gen_ai.response.model": model,
     }
     span = trace.get_current_span()
-    if not isinstance(span, trace.Span):
+    if not isinstance(span, trace.Span) or not span.is_recording():
         return
📜 Review details

Configuration used: CodeRabbit UI


📥 Commits

Reviewing files that changed from the base of the PR and between b01af56 and 277499e.

📒 Files selected for processing (3)
  • packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/__init__.py (3 hunks)
  • packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/prompt_caching.py (1 hunks)
  • packages/opentelemetry-instrumentation-bedrock/tests/metrics/test_bedrock_converse_prompt_caching_metrics.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/opentelemetry-instrumentation-bedrock/tests/metrics/test_bedrock_converse_prompt_caching_metrics.py
  • packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/prompt_caching.py
  • packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/__init__.py
🧬 Code graph analysis (2)
packages/opentelemetry-instrumentation-bedrock/tests/metrics/test_bedrock_converse_prompt_caching_metrics.py (3)
packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/__init__.py (1)
  • PromptCaching (470-472)
packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/prompt_caching.py (1)
  • CacheSpanAttrs (9-11)
packages/opentelemetry-instrumentation-bedrock/tests/metrics/conftest.py (1)
  • test_context (35-47)
packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/__init__.py (2)
packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/prompt_caching.py (2)
  • prompt_caching_converse_handling (46-85)
  • prompt_caching_handling (14-43)
packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/guardrail.py (1)
  • guardrail_converse (153-176)
🪛 Ruff (0.14.2)
packages/opentelemetry-instrumentation-bedrock/tests/metrics/test_bedrock_converse_prompt_caching_metrics.py

30-30: Create your own exception

(TRY002)


30-30: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (6)
packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/prompt_caching.py (1)

46-57: Good defensive programming with is_recording check.

The addition of span.is_recording() check and the conditional metric recording (if metric_params.prompt_caching:) are good defensive practices that prevent unnecessary work when telemetry is disabled.

packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/__init__.py (3)

25-28: LGTM: Clean import of prompt caching handlers.

The import statements are correctly structured and bring in both the new prompt_caching_converse_handling and the existing prompt_caching_handling functions.


357-361: LGTM: Proper integration in synchronous Converse handler.

The call to prompt_caching_converse_handling is correctly placed after guardrail_converse and receives the appropriate parameters (response, provider, model, metric_params). This mirrors the pattern used in _handle_call (line 332) for the invoke_model API.


399-406: LGTM: Correct metadata extraction for streaming events.

The metadata is properly extracted from the event before being passed to prompt_caching_converse_handling. This ensures that cache metrics are recorded when the final metadata event arrives in the stream.

packages/opentelemetry-instrumentation-bedrock/tests/metrics/test_bedrock_converse_prompt_caching_metrics.py (2)

33-47: Verify metric assertion logic handles all scenarios.

The current assertion logic iterates through all data points and validates each one individually. However, it doesn't verify that the expected number of data points exist (e.g., exactly one for "read" and one for "write"). This could mask issues where duplicate or missing data points occur.

Consider adding explicit checks:

def assert_metric(reader, usage):
    metrics_data = reader.get_metrics_data()
    resource_metrics = metrics_data.resource_metrics
    assert len(resource_metrics) > 0

    m = get_metric(resource_metrics, PromptCaching.LLM_BEDROCK_PROMPT_CACHING)
    
    read_points = [dp for dp in m.data.data_points if dp.attributes[CacheSpanAttrs.TYPE] == "read"]
    write_points = [dp for dp in m.data.data_points if dp.attributes[CacheSpanAttrs.TYPE] == "write"]
    
    if usage["cache_read_input_tokens"] > 0:
        assert len(read_points) == 1
        assert read_points[0].value == usage["cache_read_input_tokens"]
    
    if usage["cache_creation_input_tokens"] > 0:
        assert len(write_points) == 1
        assert write_points[0].value == usage["cache_creation_input_tokens"]

50-69: Test structure is sound but VCR cassette needs recording.

The test correctly validates both cache creation (first call) and cache reading (second call) scenarios. The cumulative metric workaround at line 68 is necessary due to how the InMemoryMetricReader accumulates counter values across reads.

Note: As mentioned in the PR description, this test requires a VCR cassette to be recorded with actual AWS credentials.

Ensure the VCR cassette is recorded before merging, as the test will fail without it.

Comment on lines 24 to 30
def get_metric(resource_metrics, name):
    for rm in resource_metrics:
        for sm in rm.scope_metrics:
            for metric in sm.metrics:
                if metric.name == name:
                    return metric
    raise Exception(f"No metric found with name {name}")

⚠️ Potential issue | 🟡 Minor

Use AssertionError for test helper.

The generic Exception should be replaced with AssertionError since this is a test assertion helper and test failures should propagate as assertion failures.

As per static analysis hints.

Apply this diff:

 def get_metric(resource_metrics, name):
     for rm in resource_metrics:
         for sm in rm.scope_metrics:
             for metric in sm.metrics:
                 if metric.name == name:
                     return metric
-    raise Exception(f"No metric found with name {name}")
+    raise AssertionError(f"No metric found with name {name}")


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py (1)

16-22: Consider using a more specific exception type.

While the function logic is correct, using a generic Exception is flagged by Ruff. For test utilities, this is acceptable, but you could optionally create a custom exception class or use AssertionError for better clarity in test failures.

Apply this diff if you'd like to improve the exception handling:

-    raise Exception(f"No metric found with name {name}")
+    raise AssertionError(f"No metric found with name {name}")
📜 Review details

Configuration used: CodeRabbit UI


📥 Commits

Reviewing files that changed from the base of the PR and between 277499e and 1965083.

📒 Files selected for processing (1)
  • packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py
🧬 Code graph analysis (1)
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py (5)
packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/__init__.py (1)
  • PromptCaching (470-472)
packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/prompt_caching.py (1)
  • CacheSpanAttrs (9-11)
packages/opentelemetry-instrumentation-bedrock/tests/metrics/conftest.py (1)
  • test_context (35-47)
packages/traceloop-sdk/traceloop/sdk/utils/in_memory_span_exporter.py (1)
  • get_finished_spans (40-43)
packages/opentelemetry-semantic-conventions-ai/opentelemetry/semconv_ai/__init__.py (1)
  • SpanAttributes (64-261)
🪛 Ruff (0.14.2)
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py

22-22: Create your own exception

(TRY002)


22-22: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (4)
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py (4)

12-13: LGTM! Imports are appropriate for caching support.

The new imports are necessary for testing prompt caching functionality and are properly scoped.


25-46: LGTM! Metric assertion logic is sound.

The function correctly handles cumulative metrics by tracking both read and write data points separately and asserting the appropriate one based on the is_read parameter.


1076-1090: Verify VCR cassette recording.

This test requires a VCR cassette to be recorded with actual AWS credentials. According to the PR comments, you've requested that a maintainer with AWS access run and record this cassette. Ensure this is completed before merging.


1077-1130: Comprehensive caching test with good coverage.

The test thoroughly validates both cache write and read scenarios, including:

  • Usage token assertions from API responses
  • Span attribute validation for cache state
  • Metric emission verification

The test structure is clear and follows best practices.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py (1)

1113-1114: Resolved: replaced hardcoded attr strings with semconv constants.

Good use of SpanAttributes for cache token attributes; previous review concern addressed.

Also applies to: 1121-1122

🧹 Nitpick comments (4)
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py (4)

12-14: Avoid depending on instrumentation internals in tests.

Importing PromptCaching and CacheSpanAttrs from instrumentation modules couples tests to internal paths. Prefer stable semconv constants or add local fallbacks for attribute keys to reduce breakage risk on refactors.


16-22: Use pytest.fail and drop bare Exception; simplify lookup.

Replace manual loops + bare Exception with next(...) and pytest.fail for clearer failures and to satisfy TRY002/TRY003.

 def get_metric(resource_metrics, name):
-    for rm in resource_metrics:
-        for sm in rm.scope_metrics:
-            for metric in sm.metrics:
-                if metric.name == name:
-                    return metric
-    raise Exception(f"No metric found with name {name}")
+    try:
+        return next(
+            m
+            for rm in resource_metrics
+            for sm in rm.scope_metrics
+            for m in sm.metrics
+            if m.name == name
+        )
+    except StopIteration:
+        pytest.fail(f"Metric '{name}' not found in collected metrics")

25-46: Make metric assertion robust to cumulative/export cycles.

Sum data points by type to tolerate multiple collections and exporter behavior; add clearer empty-metrics assertion.

 def assert_metric(reader, usage, is_read=False):
-    metrics_data = reader.get_metrics_data()
-    resource_metrics = metrics_data.resource_metrics
-    assert len(resource_metrics) > 0
-
-    m = get_metric(resource_metrics, PromptCaching.LLM_BEDROCK_PROMPT_CACHING)
-    # This check is now more specific to handle cumulative metrics
-    found_read = False
-    found_write = False
-    for data_point in m.data.data_points:
-        if data_point.attributes[CacheSpanAttrs.TYPE] == "read":
-            found_read = True
-            assert data_point.value == usage["cache_read_input_tokens"]
-        elif data_point.attributes[CacheSpanAttrs.TYPE] == "write":
-            found_write = True
-            assert data_point.value == usage["cache_creation_input_tokens"]
-    
-    if is_read:
-        assert found_read
-    else:
-        assert found_write
+    metrics_data = reader.get_metrics_data()
+    resource_metrics = metrics_data.resource_metrics
+    assert resource_metrics, "No resource metrics collected"
+
+    m = get_metric(resource_metrics, PromptCaching.LLM_BEDROCK_PROMPT_CACHING)
+    read = sum(
+        dp.value for dp in m.data.data_points
+        if dp.attributes.get(CacheSpanAttrs.TYPE) == "read"
+    )
+    write = sum(
+        dp.value for dp in m.data.data_points
+        if dp.attributes.get(CacheSpanAttrs.TYPE) == "write"
+    )
+    if is_read:
+        assert read == usage["cache_read_input_tokens"]
+    else:
+        assert write == usage["cache_creation_input_tokens"]

1106-1123: Don’t rely on span ordering; select by attribute.

Use CacheSpanAttrs.CACHED to locate write/read spans to avoid flakiness.

-    # Assertions for the first span (cache write)
-    span_write = spans[0]
-    assert span_write.name == "bedrock.converse"
-    attributes_write = span_write.attributes
+    # Assertions for the first span (cache write)
+    span_write = next(s for s in spans if s.attributes.get(CacheSpanAttrs.CACHED) == "write")
+    assert span_write.name == "bedrock.converse"
+    attributes_write = span_write.attributes
@@
-    # Assertions for the second span (cache read)
-    span_read = spans[1]
-    assert span_read.name == "bedrock.converse"
-    attributes_read = span_read.attributes
+    # Assertions for the second span (cache read)
+    span_read = next(s for s in spans if s.attributes.get(CacheSpanAttrs.CACHED) == "read")
+    assert span_read.name == "bedrock.converse"
+    attributes_read = span_read.attributes

📥 Commits

Reviewing files that changed from the base of the PR and between 1965083 and 750a59e.

📒 Files selected for processing (1)
  • packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py
🧬 Code graph analysis (1)
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py (5)
packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/__init__.py (1)
  • PromptCaching (470-472)
packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/prompt_caching.py (1)
  • CacheSpanAttrs (9-11)
packages/opentelemetry-instrumentation-bedrock/tests/metrics/conftest.py (1)
  • test_context (35-47)
packages/traceloop-sdk/traceloop/sdk/utils/in_memory_span_exporter.py (1)
  • get_finished_spans (40-43)
packages/opentelemetry-semantic-conventions-ai/opentelemetry/semconv_ai/__init__.py (1)
  • SpanAttributes (64-261)
🪛 Ruff (0.14.2)
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py

22-22: Create your own exception

(TRY002)


22-22: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (1)
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py (1)

1124-1130: Metrics assertions look good after helper hardening.

Cumulative usage check for read/write is aligned with counter semantics; ok once assert_metric is updated as suggested.

Optionally, assert there are exactly two data points (one read, one write) for the metric in this test context once the fixture is moved for test isolation.

@AlanPonnachan
Author

Hi @nirga , I've updated the PR with the changes we discussed. I moved the logic into a new test in test_titan.py and deleted the old file.

It should be ready for the cassette to be recorded now. Thanks again for your help!

@nirga
Member

nirga commented Oct 30, 2025

Thanks @AlanPonnachan! Can you record the test? poetry run pytest --record-mode=once

@AlanPonnachan
Author

Hi @nirga, thank you again for all your guidance.

I tried to run the recording command locally as you requested. As expected, since I don't have AWS credentials, the test run fails with an UnrecognizedClientException (invalid security token). This confirms that the test is now correctly set up and is just waiting for a real recording to be generated.

I believe the PR is now ready from a code perspective. Would you be able to run the recording on your end when you have a moment?

Thank you so much for your help


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1


📥 Commits

Reviewing files that changed from the base of the PR and between 750a59e and 419ae0f.

📒 Files selected for processing (3)
  • packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/__init__.py (3 hunks)
  • packages/opentelemetry-instrumentation-bedrock/tests/traces/cassettes/test_titan/test_titan_converse_with_caching.yaml (1 hunks)
  • packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/__init__.py
🧰 Additional context used
📓 Path-based instructions (2)
**/cassettes/**/*.{yaml,yml,json}

📄 CodeRabbit inference engine (CLAUDE.md)

Never commit secrets or PII in VCR cassettes; scrub sensitive data

Files:

  • packages/opentelemetry-instrumentation-bedrock/tests/traces/cassettes/test_titan/test_titan_converse_with_caching.yaml
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py
🧠 Learnings (1)
📚 Learning: 2025-08-17T15:06:48.109Z
Learnt from: CR
PR: traceloop/openllmetry#0
File: CLAUDE.md:0-0
Timestamp: 2025-08-17T15:06:48.109Z
Learning: Applies to tests/**/*.py : Tests that make API calls must utilize VCR cassettes

Applied to files:

  • packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py
🧬 Code graph analysis (1)
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py (5)
packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/__init__.py (1)
  • PromptCaching (470-472)
packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/prompt_caching.py (1)
  • CacheSpanAttrs (9-11)
packages/opentelemetry-instrumentation-milvus/tests/conftest.py (1)
  • reader (37-41)
packages/traceloop-sdk/traceloop/sdk/utils/in_memory_span_exporter.py (1)
  • get_finished_spans (40-43)
packages/opentelemetry-semantic-conventions-ai/opentelemetry/semconv_ai/__init__.py (1)
  • SpanAttributes (64-245)
🪛 Ruff (0.14.2)
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py

22-22: Create your own exception

(TRY002)


22-22: Avoid specifying long messages outside the exception class

(TRY003)


1077-1077: Unused function argument: instrument_legacy

(ARG001)

🔇 Additional comments (4)
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py (4)

12-13: LGTM!

The imports for PromptCaching and CacheSpanAttrs are properly added to support the caching test scenario.


16-22: Helper function is clear and appropriate for test utilities.

The get_metric helper correctly searches through resource metrics. While static analysis suggests using a custom exception class, a generic Exception is acceptable for test utilities where the primary goal is clear failure messages during test debugging.


25-45: LGTM!

The assert_metric helper correctly validates prompt caching metrics for both read and write scenarios. The logic properly handles cumulative metrics by iterating through all data points and checking that the appropriate cache type was found.
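As a standalone illustration of that read/write check, here is a minimal sketch using stand-in data points; the attribute key string is a placeholder for CacheSpanAttrs.TYPE, and summing per type is one way to stay tolerant of multiple export cycles:

```python
from types import SimpleNamespace

# Placeholder key standing in for CacheSpanAttrs.TYPE.
CACHE_TYPE_ATTR = "gen_ai.cache.type"

def sum_cache_tokens(data_points, cache_type):
    # Sum values across all data points of the given cache type so the
    # check tolerates several collection cycles emitting extra points.
    return sum(
        dp.value
        for dp in data_points
        if dp.attributes.get(CACHE_TYPE_ATTR) == cache_type
    )

points = [
    SimpleNamespace(attributes={CACHE_TYPE_ATTR: "write"}, value=120),
    SimpleNamespace(attributes={CACHE_TYPE_ATTR: "read"}, value=120),
    SimpleNamespace(attributes={CACHE_TYPE_ATTR: "read"}, value=120),
]
```

A test would then compare `sum_cache_tokens(points, "read")` or `sum_cache_tokens(points, "write")` against the usage reported by the API response.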


1076-1129: Test logic is well-structured and correctly validates caching behavior.

The test properly exercises both cache-write and cache-read paths:

  1. First call writes to cache (cache_creation_input_tokens > 0)
  2. Second call reads from cache (cache_read_input_tokens > 0)
  3. Span attributes and metrics are validated for both scenarios

Good use of semantic convention constants (SpanAttributes.*) instead of hardcoded strings, addressing previous review feedback.

Note: The instrument_legacy fixture is flagged as unused by static analysis, but it's required for side effects to set up the Bedrock instrumentation before the test runs.

The test will work correctly once the VCR cassette is properly recorded with valid AWS credentials (see cassette file review comment).

Comment on lines +1 to +47
interactions:
- request:
    body: '{"messages": [{"role": "user", "content": [{"text": "Hello, this is a test
      prompt for caching."}]}], "inferenceConfig": {"maxTokens": 50}, "additionalModelRequestFields":
      {"cacheControl": {"type": "ephemeral"}}}'
    headers:
      Content-Length:
      - '211'
      Content-Type:
      - !!binary |
        YXBwbGljYXRpb24vanNvbg==
      User-Agent:
      - !!binary |
        Qm90bzMvMS4zNC4xNDUgbWQvQm90b2NvcmUjMS4zNC4xNDUgdWEvMi4wIG9zL3dpbmRvd3MjMTAg
        bWQvYXJjaCNhbWQ2NCBsYW5nL3B5dGhvbiMzLjExLjggbWQvcHlpbXBsI0NQeXRob24gY2ZnL3Jl
        dHJ5LW1vZGUjbGVnYWN5IEJvdG9jb3JlLzEuMzQuMTQ1
      X-Amz-Date:
      - !!binary |
        MjAyNTEwMzBUMTY1MTI4Wg==
      amz-sdk-invocation-id:
      - !!binary |
        ZTFmM2VjZTMtM2VlMC00ZGZiLWE2MWEtYjBiYmU4M2U2NDFm
      amz-sdk-request:
      - !!binary |
        YXR0ZW1wdD0x
    method: POST
    uri: https://bedrock-runtime.us-east-1.amazonaws.com/model/amazon.titan-text-express-v1/converse
  response:
    body:
      string: '{"message":"The security token included in the request is invalid."}'
    headers:
      Connection:
      - keep-alive
      Content-Length:
      - '68'
      Content-Type:
      - application/json
      Date:
      - Thu, 30 Oct 2025 16:51:30 GMT
      x-amzn-ErrorType:
      - UnrecognizedClientException:http://internal.amazon.com/coral/com.amazon.coral.service/
      x-amzn-RequestId:
      - f0fdf86b-d85a-4f16-8713-1b9b5fccbe25
    status:
      code: 403
      message: Forbidden
version: 1

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Cassette contains authentication failure instead of valid caching response.

The cassette records a 403 Forbidden response with "The security token included in the request is invalid" rather than a successful API response with caching metadata. This means the test test_titan_converse_with_caching will fail because:

  • Line 1086-1088: Expects usage_write["cache_creation_input_tokens"] > 0, but the response has no usage field
  • Line 1091-1099: Expects a second successful call with cache read data

Per the PR objectives, you need AWS credentials to record a valid cassette. Ask the maintainer to run the recording command (shown in the script output below) and commit the generated cassette.


🏁 Script executed:

poetry run pytest packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py::test_titan_converse_with_caching --record-mode=once

Length of output: 44



Once I locate the test file, I'll examine the specific test to verify the assertions mentioned in the review comment.

---

🏁 Script executed:

```shell
#!/bin/bash
# Find and inspect the test_titan.py file and locate the test_titan_converse_with_caching test
fd "test_titan\.py" --type f
```

Length of output: 77



🏁 Script executed:

#!/bin/bash
# Inspect the test function test_titan_converse_with_caching
ast-grep --pattern 'def test_titan_converse_with_caching($_) {
  $$$
}'

Length of output: 0



🏁 Script executed:

#!/bin/bash
# Search for the test function and display context
rg -n "def test_titan_converse_with_caching" --type py -A 50

Length of output: 6541


Cassette is invalid and will cause test failure; AWS credentials required to regenerate.

The cassette indeed contains a 403 Forbidden response instead of a valid caching response. The test will fail at line 1086 when attempting to access response_write["usage"] from the error response, which lacks this field. The cassette must be regenerated with valid AWS credentials to record a successful response with cache metadata.

🤖 Prompt for AI Agents
packages/opentelemetry-instrumentation-bedrock/tests/traces/cassettes/test_titan/test_titan_converse_with_caching.yaml
lines 1-47: the cassette records a 403 UnrecognizedClientException (invalid AWS
credentials) instead of a successful converse response, causing tests to crash
when they expect response_write["usage"]; regenerate or replace this cassette by
re-running the test recording with valid AWS credentials so the cassette
contains a successful response that includes cache/usage metadata, and then
commit the new cassette (or remove the failing cassette and update the test to
use a properly recorded fixture).

@nirga
Member

nirga commented Oct 30, 2025

Thanks @AlanPonnachan, I tried running it locally and the test failed:

FAILED tests/traces/test_titan.py::test_titan_converse_with_caching - botocore.errorfactory.ValidationException: An error occurred (ValidationException) when calling the Converse operation: The model returned the following errors: Malformed input request: extraneous key [cacheControl] is not permitted, please reformat your input and try again.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py (2)

16-22: Consider using a custom exception or a standard assertion.

For test helpers, raising AssertionError or a custom exception would be more idiomatic than a generic Exception.

Apply this diff:

 def get_metric(resource_metrics, name):
     for rm in resource_metrics:
         for sm in rm.scope_metrics:
             for metric in sm.metrics:
                 if metric.name == name:
                     return metric
-    raise Exception(f"No metric found with name {name}")
+    raise AssertionError(f"No metric found with name {name}")

1142-1142: Use constants for cache token attribute names.

Lines 1142 and 1149 use string literals for cache token attributes. The SpanAttributes class defines constants GEN_AI_USAGE_CACHE_CREATION_INPUT_TOKENS and GEN_AI_USAGE_CACHE_READ_INPUT_TOKENS (see relevant code snippets). Use these constants for consistency and maintainability.

Apply this diff:

-    assert attributes_write["gen_ai.usage.cache_creation_input_tokens"] == usage_write["cache_creation_input_tokens"]
+    assert attributes_write[SpanAttributes.GEN_AI_USAGE_CACHE_CREATION_INPUT_TOKENS] == usage_write["cache_creation_input_tokens"]
-    assert attributes_read["gen_ai.usage.cache_read_input_tokens"] == usage_read["cache_read_input_tokens"]
+    assert attributes_read[SpanAttributes.GEN_AI_USAGE_CACHE_READ_INPUT_TOKENS] == usage_read["cache_read_input_tokens"]

Also applies to: 1149-1149


📥 Commits

Reviewing files that changed from the base of the PR and between 419ae0f and 440295c.

📒 Files selected for processing (2)
  • packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py (2 hunks)
  • packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • packages/opentelemetry-instrumentation-bedrock/tests/traces/test_titan.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py
🧬 Code graph analysis (1)
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py (5)
packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/__init__.py (1)
  • PromptCaching (470-472)
packages/opentelemetry-instrumentation-milvus/tests/conftest.py (1)
  • reader (37-41)
packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/prompt_caching.py (1)
  • CacheSpanAttrs (9-11)
packages/opentelemetry-instrumentation-bedrock/tests/conftest.py (1)
  • instrument_legacy (90-102)
packages/opentelemetry-semantic-conventions-ai/opentelemetry/semconv_ai/__init__.py (1)
  • SpanAttributes (64-245)
🪛 Ruff (0.14.2)
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py

22-22: Create your own exception

(TRY002)


22-22: Avoid specifying long messages outside the exception class

(TRY003)


1113-1113: Unused function argument: instrument_legacy

(ARG001)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py (1)

1131-1136: Remove invalid additionalModelRequestFields and use consistent cache point structure.

Line 1135 uses additionalModelRequestFields={"cacheControl": {"type": "ephemeral"}}, which causes a ValidationException as confirmed in the past review comments and PR discussion. The Bedrock Converse API does not accept cacheControl as a parameter; cache points must be specified within the message content structure.

For the second call to read from the cache created by the first call, it must use the exact same message structure up to the cache point. The current implementation will fail at runtime.

Apply this diff to fix the second call:

     response_read = brt.converse(
         modelId="anthropic.claude-3-haiku-20240307-v1:0",
-        messages=[{"role": "user", "content": [{"text": "Hello, this is a test prompt for caching."}]}],
+        messages=[
+            {
+                "role": "user",
+                "content": [
+                    {"text": "Hello, this is a test prompt for caching."},
+                    {"cachePoint": {"type": "default"}},
+                ],
+            }
+        ],
         inferenceConfig={"maxTokens": 50},
-        additionalModelRequestFields={"cacheControl": {"type": "ephemeral"}},
     )
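To make the shared-structure requirement concrete, here is a hedged sketch that only builds identical request payloads for the cache-write and cache-read calls; the model ID and prompt are illustrative, and no AWS call is made:

```python
def build_converse_kwargs(prompt, model_id="anthropic.claude-3-haiku-20240307-v1:0"):
    # The cachePoint marker lives inside the message content; the Converse
    # API rejects cacheControl passed via additionalModelRequestFields.
    return {
        "modelId": model_id,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"text": prompt},
                    {"cachePoint": {"type": "default"}},
                ],
            }
        ],
        "inferenceConfig": {"maxTokens": 50},
    }

# Both calls must share the exact same structure up to the cache point,
# so the second call can read the cache the first one created.
write_kwargs = build_converse_kwargs("Hello, this is a test prompt for caching.")
read_kwargs = build_converse_kwargs("Hello, this is a test prompt for caching.")
```

In the test, each kwargs dict would be splatted into `brt.converse(**kwargs)`.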
🧹 Nitpick comments (1)
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py (1)

16-22: Consider a custom exception or using a standard exception type.

The function raises a generic Exception with a formatted message. For better error handling and consistency with Python best practices, consider using a more specific exception type such as ValueError or KeyError, or define a custom exception class.

Apply this diff to use a more specific exception:

-    raise Exception(f"No metric found with name {name}")
+    raise ValueError(f"No metric found with name {name}")

📥 Commits

Reviewing files that changed from the base of the PR and between 440295c and a5563d5.

📒 Files selected for processing (1)
  • packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py
🧬 Code graph analysis (1)
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py (5)
packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/__init__.py (1)
  • PromptCaching (470-472)
packages/opentelemetry-instrumentation-milvus/tests/conftest.py (1)
  • reader (37-41)
packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/prompt_caching.py (1)
  • CacheSpanAttrs (9-11)
packages/opentelemetry-instrumentation-bedrock/tests/conftest.py (2)
  • instrument_legacy (90-102)
  • brt (36-42)
packages/traceloop-sdk/traceloop/sdk/utils/in_memory_span_exporter.py (1)
  • get_finished_spans (40-43)
🪛 Flake8 (7.3.0)
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py

[error] 1115-1115: continuation line missing indentation or outdented

(E122)


[error] 1116-1116: continuation line missing indentation or outdented

(E122)


[error] 1125-1125: continuation line missing indentation or outdented

(E122)


[error] 1126-1126: continuation line missing indentation or outdented

(E122)

🪛 Ruff (0.14.2)
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py

22-22: Create your own exception

(TRY002)


22-22: Avoid specifying long messages outside the exception class

(TRY003)


1113-1113: Unused function argument: instrument_legacy

(ARG001)

🔇 Additional comments (4)
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py (4)

13-13: LGTM!

The import is correctly used in the assert_metric helper function.


25-44: LGTM!

The helper correctly validates cache metrics by checking data point attributes and values for both read and write cache types.


1113-1113: Note: The instrument_legacy parameter is not unused.

Static analysis flags instrument_legacy as unused, but this is a false positive. The parameter is a pytest fixture that performs instrumentation setup as a side effect, following the standard pytest fixture pattern. The fixture's presence in the parameter list is necessary for the test to function correctly.


1141-1163: No action required—cache implementation and assertions are already correct.

The cache implementation has already been fixed correctly in the code:

  • Line 1121 uses the correct cachePoint method with the "Correct method" comment
  • Line 1135 properly uses additionalModelRequestFields={"cacheControl": {"type": "ephemeral"}}
  • The prompt_caching_converse_handling() function in prompt_caching.py correctly extracts cache tokens and sets span attributes
  • The assertions at lines 1141-1163 are properly aligned with the implementation, checking for the expected cache attributes and token values
  • The test is decorated with @pytest.mark.vcr and VCR cassettes exist and are properly configured

The test structure, assertions, and implementation are all consistent and correct. No modifications are needed.
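For context, a minimal sketch of what extracting Converse cache usage might look like. The camelCase field names follow the Bedrock Converse usage shape (cacheReadInputTokens / cacheWriteInputTokens), while the function name and the snake_case output keys are assumptions modeled on the PR description, not the actual implementation:

```python
def extract_converse_cache_usage(response_body):
    # The Converse API reports caching data in the response "usage" field,
    # unlike invoke_model, which reports it via HTTP headers.
    usage = response_body.get("usage", {})
    return {
        "cache_read_input_tokens": usage.get("cacheReadInputTokens", 0),
        "cache_creation_input_tokens": usage.get("cacheWriteInputTokens", 0),
    }

# Example body shaped like a cache-write response.
body = {
    "usage": {
        "inputTokens": 10,
        "outputTokens": 5,
        "cacheReadInputTokens": 0,
        "cacheWriteInputTokens": 120,
    }
}
tokens = extract_converse_cache_usage(body)
```

The instrumentation would then record these two values as the metric data points and span attributes that the test asserts on.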


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1


📥 Commits

Reviewing files that changed from the base of the PR and between a5563d5 and 771b5d5.

📒 Files selected for processing (1)
  • packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py
🧬 Code graph analysis (1)
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py (5)
packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/__init__.py (1)
  • PromptCaching (470-472)
packages/opentelemetry-instrumentation-milvus/tests/conftest.py (1)
  • reader (37-41)
packages/opentelemetry-instrumentation-bedrock/opentelemetry/instrumentation/bedrock/prompt_caching.py (1)
  • CacheSpanAttrs (9-11)
packages/opentelemetry-instrumentation-bedrock/tests/conftest.py (2)
  • instrument_legacy (90-102)
  • brt (36-42)
packages/traceloop-sdk/traceloop/sdk/utils/in_memory_span_exporter.py (1)
  • get_finished_spans (40-43)
🪛 Ruff (0.14.2)
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py

22-22: Create your own exception

(TRY002)


22-22: Avoid specifying long messages outside the exception class

(TRY003)


1113-1113: Unused function argument: instrument_legacy

(ARG001)

🔇 Additional comments (5)
packages/opentelemetry-instrumentation-bedrock/tests/traces/test_anthropic.py (5)

13-14: LGTM! Import is properly added.

The PromptCaching import is correctly used in the assert_metric helper to validate cache metrics.


16-22: LGTM! Helper function is correct.

The function correctly iterates through resource metrics to find a metric by name. The use of a generic Exception is acceptable for test utility code, though a custom exception type could improve clarity.


25-44: LGTM! Metric assertion logic is sound.

The function correctly validates cache read/write data points by iterating through metrics, checking the cache type attribute, and asserting the expected value matches the usage data.


1114-1130: LGTM! Cache write flow is correctly implemented.

The first call properly uses {"cachePoint": {"type": "default"}} in the message content to mark a cache checkpoint, which aligns with AWS Bedrock Converse API documentation. The assertions correctly validate that cache creation occurred without any cache reads.


1142-1164: LGTM! Span and metric assertions are well-structured.

The test correctly validates:

  • Both spans use the expected model ID via GenAIAttributes.GEN_AI_REQUEST_MODEL (properly fixed from past review comments)
  • Cache attributes (CacheSpanAttrs.CACHED) distinguish write vs. read operations
  • Cache token counts in span attributes match the response usage data
  • Metrics are emitted for both cache read and write flows

The assertions will work correctly once the cache read call (lines 1132-1140) is fixed.

@AlanPonnachan
Author

Hi @nirga,

Thank you so much for running the test and providing that error!

I did some research and found that, in general, newer Amazon Titan models do support prompt caching. However, you are right that the specific model in the original test, amazon.titan-text-express-v1, does not support it, which perfectly explains the error.

To ensure the test is stable and reliable, I have implemented the fix we discussed:

  1. I reverted all changes to test_titan.py to ensure it remains focused on Guardrails.
  2. I moved the caching test logic into a new function, test_anthropic_converse_with_caching, within the correct test_anthropic.py file.
  3. The new test uses a supported Anthropic Claude 3 model and the correct API structure.

Would you be able to run the recording on your end when you have a moment?


Development

Successfully merging this pull request may close these issues.

🚀 Feature: Add prompt caching for Bedrock Converse

3 participants