Multimodal
AI-Lib supports multimodal inputs and outputs (text combined with images, audio, and video) through the same unified API. The V2 protocol adds format validation and provider-aware modality checking, so unsupported content is caught before a request is sent.
Supported Capabilities
| Capability | Direction | Providers |
|---|---|---|
| Vision (images) | Input | OpenAI, Anthropic, Gemini, Qwen, DeepSeek |
| Image generation | Output | OpenAI (DALL-E), select providers |
| Audio input | Input | Gemini, Qwen (omni_mode) |
| Audio output | Output | Qwen (omni_mode), select providers |
| Video input | Input | Gemini |
| Omni mode | Input + Output | Qwen (simultaneous text + audio) |
Sending Images
Rust

    use ai_lib_rust::{AiClient, Message, ContentBlock};

    let client = AiClient::new("openai/gpt-4o").await?;

    let message = Message::user_with_content(vec![
        ContentBlock::Text("What's in this image?".into()),
        ContentBlock::ImageUrl {
            url: "https://example.com/photo.jpg".into(),
        },
    ]);

    let response = client.chat()
        .messages(vec![message])
        .execute()
        .await?;

    println!("{}", response.content);

Python

    from ai_lib_python import AiClient, Message, ContentBlock

    client = await AiClient.create("openai/gpt-4o")

    message = Message.user_with_content([
        ContentBlock.text("What's in this image?"),
        ContentBlock.image_url("https://example.com/photo.jpg"),
    ])

    response = await client.chat() \
        .messages([message]) \
        .execute()

    print(response.content)

TypeScript

    import { AiClient, Message, ContentBlock } from '@hiddenpath/ai-lib-ts';

    const client = await AiClient.new('openai/gpt-4o');

    const message = Message.userWithContent([
      ContentBlock.text("What's in this image?"),
      ContentBlock.imageUrl('https://example.com/photo.jpg'),
    ]);

    const response = await client
      .chat()
      .messages([message])
      .execute();

    console.log(response.content);

Base64 Images
For local images, use base64 encoding:

Rust

    // The Engine trait must be in scope for `.encode()` to resolve.
    use base64::Engine as _;

    let image_data = std::fs::read("photo.jpg")?;
    let encoded = base64::engine::general_purpose::STANDARD.encode(&image_data);

    let message = Message::user_with_content(vec![
        ContentBlock::Text("Describe this".into()),
        ContentBlock::ImageBase64 {
            data: encoded,
            media_type: "image/jpeg".into(),
        },
    ]);

Python

    import base64

    with open("photo.jpg", "rb") as f:
        image_data = base64.b64encode(f.read()).decode()

    message = Message.user_with_content([
        ContentBlock.text("Describe this"),
        ContentBlock.image_base64(image_data, "image/jpeg"),
    ])

TypeScript

    import { readFileSync } from 'fs';

    const imageBuffer = readFileSync('photo.jpg');
    const imageData = imageBuffer.toString('base64');

    const message = Message.userWithContent([
      ContentBlock.text('Describe this'),
      ContentBlock.imageBase64(imageData, 'image/jpeg'),
    ]);

V2 Multimodal Capabilities
The V2 protocol provides a `MultimodalCapabilities` module that validates content against provider declarations before sending requests.
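Conceptually, this kind of pre-flight check is a set comparison: collect the modalities present in the content, then subtract what the provider declares. The following is a minimal standalone sketch of that idea in plain Python. It does not use or reproduce the ai_lib_python API; the block dictionaries, the `BLOCK_TYPE_TO_MODALITY` table, and both helper functions are hypothetical names chosen for illustration.

```python
from enum import Enum


class Modality(Enum):
    TEXT = "text"
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"


# Hypothetical mapping from a content block's "type" field to its modality.
BLOCK_TYPE_TO_MODALITY = {
    "text": Modality.TEXT,
    "image_url": Modality.IMAGE,
    "image_base64": Modality.IMAGE,
    "audio": Modality.AUDIO,
    "video": Modality.VIDEO,
}


def find_modalities(blocks):
    """Collect the set of modalities present in a list of content blocks."""
    return {BLOCK_TYPE_TO_MODALITY[block["type"]] for block in blocks}


def unsupported_modalities(blocks, supported):
    """Return the modalities the blocks use that the provider does not declare."""
    return find_modalities(blocks) - set(supported)


blocks = [
    {"type": "text", "text": "What's in this image?"},
    {"type": "image_url", "url": "https://example.com/photo.jpg"},
]

# A provider declaring text and image input: nothing is unsupported.
vision_provider = {Modality.TEXT, Modality.IMAGE}
print(unsupported_modalities(blocks, vision_provider))

# A text-only provider: the image block is flagged before any request is made.
text_only_provider = {Modality.TEXT}
print(unsupported_modalities(blocks, text_only_provider))
```

An empty result means the request can proceed; a non-empty result names exactly which modalities to drop or route to a different provider.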
Modality Detection
The runtime automatically detects modalities in your content blocks:

Rust

    use ai_lib_rust::multimodal::{detect_modalities, Modality};

    let modalities = detect_modalities(&content_blocks);
    // Returns: {Text, Image} or {Text, Audio, Video} etc.

Python

    from ai_lib_python.multimodal import detect_modalities, Modality

    modalities = detect_modalities(content_blocks)
    # Returns: {Modality.TEXT, Modality.IMAGE}

TypeScript

    import { detectModalities, Modality } from '@hiddenpath/ai-lib-ts/multimodal';

    const modalities = detectModalities(contentBlocks);
    // Returns: Set { Modality.TEXT, Modality.IMAGE }

Format Validation
The runtime validates formats against what the provider supports:

Rust

    use ai_lib_rust::multimodal::MultimodalCapabilities;

    let caps = MultimodalCapabilities::from_config(&manifest.multimodal);
    assert!(caps.validate_image_format("png"));
    assert!(caps.validate_audio_format("wav"));

Python

    from ai_lib_python.multimodal import MultimodalCapabilities

    caps = MultimodalCapabilities.from_config(manifest_multimodal)
    assert caps.validate_image_format("png")
    assert caps.validate_audio_format("wav")

TypeScript

    import { MultimodalCapabilities } from '@hiddenpath/ai-lib-ts/multimodal';

    const caps = MultimodalCapabilities.fromConfig(manifestMultimodal);
    console.assert(caps.validateImageFormat('png'));
    console.assert(caps.validateAudioFormat('wav'));

Content Validation
Before sending a request, validate that the provider supports all modalities in the content:

Rust

    use ai_lib_rust::multimodal::validate_content_modalities;

    match validate_content_modalities(&blocks, &caps) {
        Ok(()) => { /* all modalities supported */ }
        Err(unsupported) => {
            eprintln!("Provider doesn't support: {:?}", unsupported);
        }
    }

Python

    from ai_lib_python.multimodal import validate_content_modalities

    # Validate content blocks against provider capabilities

TypeScript

    import { validateContentModalities } from '@hiddenpath/ai-lib-ts/multimodal';

    try {
      validateContentModalities(blocks, caps);
      // all modalities supported
    } catch (unsupported) {
      console.error(`Provider doesn't support: ${unsupported}`);
    }

How It Works
- The runtime constructs a multimodal message with mixed content blocks
- V2 validation: `MultimodalCapabilities` checks that all content modalities are supported by the provider
- The protocol manifest maps content blocks to the provider’s format
- Different providers use different structures:
  - OpenAI: `content` array with `type: "image_url"` objects
  - Anthropic: `content` array with `type: "image"` objects
  - Gemini: `parts` array with `inline_data` objects (supports video `parts`)
- The protocol handles all format differences automatically
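To make the structural differences listed above concrete, here is a rough sketch of wrapping one base64 image in each provider's shape. The function `to_provider_block` is hypothetical and is not how AI-Lib performs the mapping (that is driven by the protocol manifest). The field layouts are assumptions based on each provider's public API documentation and should be checked against the current references.

```python
import base64
import json


def to_provider_block(provider, media_type, b64_data):
    """Wrap a base64-encoded image in the structure a given provider expects."""
    if provider == "openai":
        # Entry in the content array; inline data travels as a data URL.
        return {
            "type": "image_url",
            "image_url": {"url": f"data:{media_type};base64,{b64_data}"},
        }
    if provider == "anthropic":
        # Entry in the content array with an explicit source object.
        return {
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": media_type,
                "data": b64_data,
            },
        }
    if provider == "gemini":
        # Entry in the parts array.
        return {"inline_data": {"mime_type": media_type, "data": b64_data}}
    raise ValueError(f"no mapping for provider: {provider}")


# Placeholder bytes stand in for a real image file.
encoded = base64.b64encode(b"fake image bytes").decode()

for name in ("openai", "anthropic", "gemini"):
    print(name, json.dumps(to_provider_block(name, "image/jpeg", encoded)))
```

The same input produces three different request fragments, which is the translation step the protocol layer takes off the caller's hands.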
Provider Multimodal Matrix
The V2 manifest declares each provider’s multimodal capabilities explicitly:
| Provider | Image In | Audio In | Video In | Image Out | Audio Out | Omni |
|---|---|---|---|---|---|---|
| OpenAI | ✅ png, jpg, gif, webp | — | — | ✅ | — | — |
| Anthropic | ✅ png, jpg, gif, webp | — | — | — | — | — |
| Gemini | ✅ png, jpg, gif, webp | ✅ wav, mp3, flac | ✅ mp4, avi | — | — | — |
| Qwen | ✅ png, jpg | ✅ wav, mp3 | — | — | ✅ | ✅ |
| DeepSeek | ✅ png, jpg | — | — | — | — | — |
Check the `multimodal.input` and `multimodal.output` sections of the V2 provider manifest for the complete declaration.
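As an illustration of how such a declaration can be queried, the sketch below transcribes the input columns of the matrix above into a plain dictionary and checks a format against it. The dictionary layout and the helpers `supports_input` and `providers_for` are hypothetical; they are not the manifest schema or the library's API, and the real manifest remains the source of truth.

```python
# Input formats transcribed from the provider matrix (input columns only).
INPUT_FORMATS = {
    "openai": {"image": {"png", "jpg", "gif", "webp"}},
    "anthropic": {"image": {"png", "jpg", "gif", "webp"}},
    "gemini": {
        "image": {"png", "jpg", "gif", "webp"},
        "audio": {"wav", "mp3", "flac"},
        "video": {"mp4", "avi"},
    },
    "qwen": {"image": {"png", "jpg"}, "audio": {"wav", "mp3"}},
    "deepseek": {"image": {"png", "jpg"}},
}


def supports_input(provider, modality, fmt):
    """True if the provider declares this format for this input modality."""
    formats = INPUT_FORMATS.get(provider, {}).get(modality, set())
    return fmt.lower() in formats


def providers_for(modality, fmt):
    """List the providers that accept the given input format, sorted by name."""
    return sorted(p for p in INPUT_FORMATS if supports_input(p, modality, fmt))


print(supports_input("deepseek", "audio", "wav"))
print(providers_for("audio", "flac"))
print(providers_for("image", "webp"))
```

A lookup like this is useful for routing: given a file's modality and format, it narrows the candidate providers before any request is built.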