
# Multimodal

AI-Lib supports multimodal inputs and outputs — text combined with images, audio, and video — through the same unified API. The V2 protocol provides comprehensive multimodal capabilities with format validation and provider-aware modality checking.

| Capability | Direction | Providers |
| --- | --- | --- |
| Vision (images) | Input | OpenAI, Anthropic, Gemini, Qwen, DeepSeek |
| Image generation | Output | OpenAI (DALL-E), select providers |
| Audio input | Input | Gemini, Qwen (omni_mode) |
| Audio output | Output | Qwen (omni_mode), select providers |
| Video input | Input | Gemini |
| Omni mode | Input + Output | Qwen (simultaneous text + audio) |
```rust
use ai_lib_rust::{AiClient, Message, ContentBlock};

let client = AiClient::new("openai/gpt-4o").await?;

let message = Message::user_with_content(vec![
    ContentBlock::Text("What's in this image?".into()),
    ContentBlock::ImageUrl {
        url: "https://example.com/photo.jpg".into(),
    },
]);

let response = client.chat()
    .messages(vec![message])
    .execute()
    .await?;

println!("{}", response.content);
```

```python
from ai_lib_python import AiClient, Message, ContentBlock

client = await AiClient.create("openai/gpt-4o")

message = Message.user_with_content([
    ContentBlock.text("What's in this image?"),
    ContentBlock.image_url("https://example.com/photo.jpg"),
])

response = await client.chat() \
    .messages([message]) \
    .execute()

print(response.content)
```

```typescript
import { AiClient, Message, ContentBlock } from '@hiddenpath/ai-lib-ts';

const client = await AiClient.new('openai/gpt-4o');

const message = Message.userWithContent([
  ContentBlock.text("What's in this image?"),
  ContentBlock.imageUrl('https://example.com/photo.jpg'),
]);

const response = await client
  .chat()
  .messages([message])
  .execute();

console.log(response.content);
```

For local images, use base64 encoding:

```rust
use base64::Engine; // brings the encode() method into scope

let image_data = std::fs::read("photo.jpg")?;
let base64 = base64::engine::general_purpose::STANDARD.encode(&image_data);

let message = Message::user_with_content(vec![
    ContentBlock::Text("Describe this".into()),
    ContentBlock::ImageBase64 {
        data: base64,
        media_type: "image/jpeg".into(),
    },
]);
```

```python
import base64

with open("photo.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

message = Message.user_with_content([
    ContentBlock.text("Describe this"),
    ContentBlock.image_base64(image_data, "image/jpeg"),
])
```

```typescript
import { readFileSync } from 'fs';

const imageBuffer = readFileSync('photo.jpg');
const imageData = imageBuffer.toString('base64');

const message = Message.userWithContent([
  ContentBlock.text('Describe this'),
  ContentBlock.imageBase64(imageData, 'image/jpeg'),
]);
```

The V2 protocol provides a MultimodalCapabilities module that validates content against provider declarations before sending requests.

The runtime automatically detects modalities in your content blocks:

```rust
use ai_lib_rust::multimodal::{detect_modalities, Modality};

let modalities = detect_modalities(&content_blocks);
// Returns: {Text, Image} or {Text, Audio, Video} etc.
```

```python
from ai_lib_python.multimodal import detect_modalities, Modality

modalities = detect_modalities(content_blocks)
# Returns: {Modality.TEXT, Modality.IMAGE}
```

```typescript
import { detectModalities, Modality } from '@hiddenpath/ai-lib-ts/multimodal';

const modalities = detectModalities(contentBlocks);
// Returns: Set { Modality.TEXT, Modality.IMAGE }
```

The runtime validates formats against what the provider supports:

```rust
use ai_lib_rust::multimodal::MultimodalCapabilities;

let caps = MultimodalCapabilities::from_config(&manifest.multimodal);
assert!(caps.validate_image_format("png"));
assert!(caps.validate_audio_format("wav"));
```

```python
from ai_lib_python.multimodal import MultimodalCapabilities

caps = MultimodalCapabilities.from_config(manifest_multimodal)
assert caps.validate_image_format("png")
assert caps.validate_audio_format("wav")
```

```typescript
import { MultimodalCapabilities } from '@hiddenpath/ai-lib-ts/multimodal';

const caps = MultimodalCapabilities.fromConfig(manifestMultimodal);
console.assert(caps.validateImageFormat('png'));
console.assert(caps.validateAudioFormat('wav'));
```

Before sending a request, validate that the provider supports all modalities in the content:

```rust
use ai_lib_rust::multimodal::validate_content_modalities;

match validate_content_modalities(&blocks, &caps) {
    Ok(()) => { /* all modalities supported */ }
    Err(unsupported) => {
        eprintln!("Provider doesn't support: {:?}", unsupported);
    }
}
```

```python
from ai_lib_python.multimodal import validate_content_modalities

# Validate content blocks against provider capabilities
try:
    validate_content_modalities(blocks, caps)
    # all modalities supported
except Exception as unsupported:  # exact exception type is library-specific
    print(f"Provider doesn't support: {unsupported}")
```

```typescript
import { validateContentModalities } from '@hiddenpath/ai-lib-ts/multimodal';

try {
  validateContentModalities(blocks, caps);
  // all modalities supported
} catch (unsupported) {
  console.error(`Provider doesn't support: ${unsupported}`);
}
```
  1. The runtime constructs a multimodal message with mixed content blocks
  2. V2 validation: MultimodalCapabilities checks that all content modalities are supported by the provider
  3. The protocol manifest maps content blocks to the provider’s format
  4. Different providers use different structures:
    • OpenAI: content array with type: "image_url" objects
    • Anthropic: content array with type: "image" objects
    • Gemini: parts array with inline_data objects (supports video parts)
  5. The protocol handles all format differences automatically
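To make step 4 concrete, here is a hedged sketch of the three payload shapes for a single image block. The helper functions are hypothetical illustrations (not ai-lib internals); the field names follow each provider's public chat API:

```python
# Illustrative only: how one image content block maps to each provider's
# wire format. The actual translation is done by the protocol manifest.

def to_openai(url: str) -> dict:
    # OpenAI: entry in the "content" array with type "image_url"
    return {"type": "image_url", "image_url": {"url": url}}

def to_anthropic(media_type: str, b64: str) -> dict:
    # Anthropic: entry in the "content" array with type "image"
    # and a base64-encoded source
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": media_type, "data": b64},
    }

def to_gemini(mime_type: str, b64: str) -> dict:
    # Gemini: entry in the "parts" array carrying inline_data
    return {"inline_data": {"mime_type": mime_type, "data": b64}}
```

Because the protocol performs this mapping for you, the same `ContentBlock` code from the examples above works unchanged across providers.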

The V2 manifest declares each provider’s multimodal capabilities explicitly:

| Provider | Image In | Audio In | Video In | Image Out | Audio Out | Omni |
| --- | --- | --- | --- | --- | --- | --- |
| OpenAI | ✅ png, jpg, gif, webp | | | | | |
| Anthropic | ✅ png, jpg, gif, webp | | | | | |
| Gemini | ✅ png, jpg, gif, webp | ✅ wav, mp3, flac | ✅ mp4, avi | | | |
| Qwen | ✅ png, jpg | ✅ wav, mp3 | | | | |
| DeepSeek | ✅ png, jpg | | | | | |

Check the `multimodal.input` and `multimodal.output` sections of the V2 provider manifest for the complete declaration.
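For orientation, a provider's declaration might look roughly like the following. This is a sketch, not the canonical schema — the exact key names in the V2 manifest may differ, so treat an actual provider manifest as the source of truth:

```json
{
  "multimodal": {
    "input": {
      "image": { "formats": ["png", "jpg", "gif", "webp"] },
      "audio": { "formats": ["wav", "mp3", "flac"] },
      "video": { "formats": ["mp4", "avi"] }
    },
    "output": {
      "audio": { "formats": ["wav"] }
    }
  }
}
```

`MultimodalCapabilities::from_config` consumes this section, which is what powers the format and modality checks shown earlier.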