# How to call tools with multi-modal data
Here we demonstrate how to call tools with multi-modal data, such as images.

Some multi-modal models, such as those that can reason over images or audio, also support tool calling. To call tools using such models, bind the tools to them in the usual way and invoke the model with content blocks of the desired type (e.g., blocks containing image data).

Below, we demonstrate examples using OpenAI and Anthropic. We will use the same image and tool in all cases. Let's first select an image, and build a placeholder tool that expects as input one of the strings "sunny", "cloudy", or "rainy". We will ask the models to describe the weather in the image.
```python
from typing import Literal

from langchain_core.tools import tool

image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"


@tool
def weather_tool(weather: Literal["sunny", "cloudy", "rainy"]) -> None:
    """Describe the weather"""
    pass
```
## OpenAI
For OpenAI, we can feed the image URL directly in a content block of type "image_url":
```python
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o").bind_tools([weather_tool])

message = HumanMessage(
    content=[
        {"type": "text", "text": "describe the weather in this image"},
        {"type": "image_url", "image_url": {"url": image_url}},
    ],
)
response = model.invoke([message])
print(response.tool_calls)
```
```text
[{'name': 'weather_tool', 'args': {'weather': 'sunny'}, 'id': 'call_mRYL50MtHdeNuNIjSCm5UPmB'}]
```
Note that we recover tool calls with parsed arguments in LangChain's standard format in the model response.
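Because the arguments arrive already parsed, dispatching a tool call to a matching function is straightforward. A minimal plain-Python sketch, assuming the `{'name': ..., 'args': ..., 'id': ...}` shape shown above (the `tool_registry` dict and the hard-coded `tool_call` here are hypothetical illustrations, not LangChain APIs):

```python
# Hypothetical parsed tool call, in the standard shape shown above.
tool_call = {"name": "weather_tool", "args": {"weather": "sunny"}, "id": "call_123"}

# A plain-Python stand-in for the bound tool.
def weather_tool(weather: str) -> str:
    return f"The weather is {weather}."

# Dispatch by name and unpack the parsed arguments as keyword arguments.
tool_registry = {"weather_tool": weather_tool}
result = tool_registry[tool_call["name"]](**tool_call["args"])
print(result)  # The weather is sunny.
```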
## Anthropic
For Anthropic, we can format a base64-encoded image into a content block of type "image", as below:
```python
import base64

import httpx
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage

image_data = base64.b64encode(httpx.get(image_url).content).decode("utf-8")

model = ChatAnthropic(model="claude-3-sonnet-20240229").bind_tools([weather_tool])

message = HumanMessage(
    content=[
        {"type": "text", "text": "describe the weather in this image"},
        {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/jpeg",
                "data": image_data,
            },
        },
    ],
)
response = model.invoke([message])
print(response.tool_calls)
```
```text
[{'name': 'weather_tool', 'args': {'weather': 'sunny'}, 'id': 'toolu_016m9KfknJqx5fVRYk4tkF6s'}]
```
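The base64 step used above can be illustrated with the standard library alone. A minimal sketch, with a stand-in byte string in place of a downloaded image:

```python
import base64

# Stand-in bytes in place of a downloaded image.
raw_bytes = b"\xff\xd8\xff fake jpeg bytes"

# Encode to a base64 ASCII string, as required by the "base64" source type.
image_data = base64.b64encode(raw_bytes).decode("utf-8")

# The encoding is lossless: decoding restores the original bytes.
assert base64.b64decode(image_data) == raw_bytes
```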