diff --git a/docs/assets/whale_1.png b/docs/assets/whale_1.png new file mode 100644 index 0000000..56b65ad Binary files /dev/null and b/docs/assets/whale_1.png differ diff --git a/docs/assets/whale_2.png b/docs/assets/whale_2.png new file mode 100644 index 0000000..57c1ecb Binary files /dev/null and b/docs/assets/whale_2.png differ diff --git a/docs/assets/whale_2_large.png b/docs/assets/whale_2_large.png new file mode 100644 index 0000000..cea88de Binary files /dev/null and b/docs/assets/whale_2_large.png differ diff --git a/docs/assets/whale_3.png b/docs/assets/whale_3.png new file mode 100644 index 0000000..682de63 Binary files /dev/null and b/docs/assets/whale_3.png differ diff --git a/docs/examples/python/multimodal.md b/docs/examples/python/multimodal.md new file mode 100644 index 0000000..d97e89f --- /dev/null +++ b/docs/examples/python/multimodal.md @@ -0,0 +1,155 @@ +# Multi-modal - Strands Agents for Image Generation and Evaluation + +This [example](https://github.com/strands-agents/docs/blob/main/docs/examples/python/multimodal.py) demonstrates how to create a multi-agent system for generating and evaluating images. It shows how Strands agents can work with multimodal content thorugh a workflow between specialized agents. + +## Overview + +| Feature | Description | +| ------------------ | -------------------------------------- | +| **Tools Used** | generate_image, image_reader | +| **Complexity** | Intermediate | +| **Agent Type** | Multi-Agent System (2 Agents) | +| **Interaction** | Command Line Interface | +| **Key Focus** | Multimodal Content Processing | + +## Tool Overview + +The multimodal example utilizes two tools to work with image content. + +1. The [`generate_image`](https://github.com/strands-agents/tools/blob/main/src/strands_tools/generate_image.py) tool enables the creation of images based on text prompts, allowing the agent to generate visual content from textual descriptions. +2. The [`image_reader`](https://github.com/strands-agents/tools/blob/main/src/strands_tools/image_reader.py) tool provides the capability to analyze and interpret image content, enabling the agent to "see" and describe what's in the images. + +Together, these tools create a complete pipeline for both generating and evaluating visual content through natural language interactions. + +## Code Structure and Implementation + +### Agent Initialization + +The example creates two specialized agents, each with a specific role in the image generation and evaluation process. + +```python +from strands import Agent, tool +from strands_tools import generate_image, image_reader + +# Artist agent that generates images based on prompts +artist = Agent(tools=[generate_image],system_prompt=( + "You will be instructed to generate a number of images of a given subject. Vary the prompt for each generated image to create a variety of options." + "Your final output must contain ONLY a comma-separated list of the filesystem paths of generated images." +)) + +# Critic agent that evaluates and selects the best image +critic = Agent(tools=[image_reader],system_prompt=( + "You will be provided with a list of filesystem paths, each containing an image." + "Describe each image, and then choose which one is best." + "Your final line of output must be as follows:" + "FINAL DECISION: " +)) +``` + +### Using the Multimodal Agents + +The example demonstrates a simple workflow where the agents collaborate to generate and select images: + +```python +# Generate multiple images using the artist agent +result = artist("Generate 3 images of a dog") + +# Pass the image paths to the critic agent for evaluation +critic(str(result)) +``` + +This workflow shows how agents can be chained together, with the output of one agent becoming the input for another, creating a pipeline for multimodal content processing. + +## Key Features and Capabilities + +### 1. Image Generation + +The artist agent can generate multiple variations of images based on a text prompt: + +- **Basic Generation**: + ``` + Generate 3 images of a dog + ``` + +- **Styled Generation**: + ``` + Generate 2 images of a mountain landscape in watercolor style + ``` + +- **Conceptual Generation**: + ``` + Generate 4 images representing the concept of freedom + ``` + +### 2. Image Evaluation + +The critic agent can analyze images and provide detailed descriptions and evaluations of the images, ultimately outputting a final choice. + +### Example Command and Results + +**Command**: + +``` +Generate 3 images of a whale +``` + +**Response**: +``` +I'll generate 3 different images of whales for you with varied prompts. +Tool #1: generate_image + +Tool #2: generate_image + +Tool #3: generate_image +output/a_majestic_blue_whale_swimming.png, output/humpback_whale_breaching_from_the.png, output/orca_whale_pod_swimming_together.png + +I'll describe each of these whale images and determine which is best. Let me examine them one by one. +Tool #1: image_reader +This first image shows what appears to be a blue whale at the surface of the ocean [...] +Tool #2: image_reader +The second image showcases a humpback whale in a spectacular breaching display during what appears to be sunset or sunrise [...] +Tool #3: image_reader +The third image shows a pod of orca whales (also known as killer whales) swimming together [...] + +Comparison and final decision: + +All three images are remarkable whale photographs, each with unique strengths: + +1. The blue whale image offers an intimate, close-up perspective with beautiful atmospheric lighting and a serene mood. + +2. The humpback whale breaching image captures a dynamic, powerful moment with spectacular sunset colors creating an extraordinary visual impact. + +3. The orca pod image showcases multiple whales in their social grouping against a dramatic arctic backdrop, emphasizing their habitat and community. + +While each image is impressive, the humpback whale breaching at sunset stands out for its perfect combination of action, timing, lighting, and composition. The contrast between the dark whale and the golden sky, the dynamic motion captured at precisely the right moment, and the breathtaking sunset setting make this image particularly remarkable. + +FINAL DECISION: output/humpback_whale_breaching_from_the.png +``` + +During its execution, the `artist` agent used the following prompts (which can be seen in [traces](../../user-guide/observability-evaluation/traces.md) or [logs](../../user-guide/observability-evaluation/logs.md)) to generate each image: + +"A majestic blue whale swimming in deep ocean waters, sunlight filtering through the surface, photorealistic" + +![output/a_majestic_blue_whale_swimming.png](../../assets/whale_1.png) + +"Humpback whale breaching from the water, dramatic splash, against sunset sky, wildlife photography" + +![output/humpback_whale_breaching_from_the.png](../../assets/whale_2.png) + +"Orca whale pod swimming together in arctic waters, aerial view, detailed, pristine environment" + +![output/orca_whale_pod_swimming_together.png](../../assets/whale_3.png) + +And the `critic` agent selected the humpback whale as the best image: + +![output/humpback_whale_breaching_from_the.png](../../assets/whale_2_large.png) + + +## Extending the Example + +Here are some ways you could extend this example: + +1. **Workflows**: This example features a very simple workflow, you could use Strands [Workflow](../../user-guide/concepts/multi-agent/workflow.md) capabilities for more elaborate media production pipelines. +2. **Image Editing**: Extend the `generate_image` tool to accept and modify input images. +3. **User Feedback Loop**: Allow users to provide feedback on the selection to improve future generations +4. **Integration with Other Media**: Extend the system to work with other media types, such as video with Amazon Nova models. \ No newline at end of file diff --git a/docs/examples/python/multimodal.py b/docs/examples/python/multimodal.py new file mode 100644 index 0000000..99af7a3 --- /dev/null +++ b/docs/examples/python/multimodal.py @@ -0,0 +1,20 @@ +from strands import Agent, tool +from strands_tools import generate_image, image_reader + + +artist = Agent(tools=[generate_image],system_prompt=( + "You will be instructed to generate a number of images of a given subject. Vary the prompt for each generated image to create a variety of options." + "Your final output must contain ONLY a comma-separated list of the filesystem paths of generated images." +)) + + + +critic = Agent(tools=[image_reader],system_prompt=( + "You will be provided with a list of filesystem paths, each containing an image." + "Describe each image, and then choose which one is best." + "Your final line of output must be as follows:" + "FINAL DECISION: " +)) + +result = artist("Generate 3 images of a dog") +critic(str(result)) \ No newline at end of file diff --git a/mkdocs.yml b/mkdocs.yml index cbd8c8c..789ab61 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -123,6 +123,7 @@ nav: - Multi Agents: examples/python/multi_agent_example/multi_agent_example.md - Meta Tooling: examples/python/meta_tooling.md - MCP: examples/python/mcp_calculator.md + - Multi-modal: examples/python/multimodal.md - Contribute ❤️: https://github.com/strands-agents/sdk-python/blob/main/CONTRIBUTING.md - API Reference: - Agent: api-reference/agent.md