Vision Models
Image-understanding models on ORGN Gateway: image and text input, TEE and ZDR catalogs, through the AI SDK chatModel() method.
Vision models are language models that also accept image input. You send images alongside text in the same request, and the model reasons over both.
When to Use
Use a vision model when your prompt includes images:
- Describing, captioning, or classifying images
- Extracting text or data from screenshots and documents
- Visual question answering
- Comparing or reasoning across multiple images
Vision models still return text, not images. To generate images, see Image & Video.
AI SDK Method
Vision models use the same chatModel() method as language models. Pass image bytes or a URL as a file content part with an image/* media type:
import { readFile } from 'node:fs/promises';
import { createOLLM } from '@orgn/gateway';
import { generateText } from 'ai';

const ollm = createOLLM({ apiKey: process.env.OLLM_API_KEY });

// Read the image from disk as raw bytes.
const image = await readFile('photo.jpg');

const { text } = await generateText({
  // Same chatModel() method used for text-only language models.
  model: ollm.chatModel('vercel_claude_sonnet_4_6'),
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: 'Describe this image.' },
      // The image travels as a file content part with an image/* media type.
      { type: 'file', data: image, mediaType: 'image/jpeg' },
    ],
  }],
});

Supported media types include image/jpeg, image/png, image/webp, image/gif, and any other image/* value the underlying model accepts. See the Vercel AI SDK integration for PDF and document input.
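When a prompt carries several images, or a mix of local files and URLs, it can help to build the content array with a small helper. The following is a minimal sketch, not part of the gateway or the AI SDK: `ImageSource`, `mediaTypeFor`, `imagePart`, and `visionContent` are hypothetical names, and the extension map covers only the four media types named above.

```typescript
// Hypothetical helpers for building a multi-image user message.
// The part shapes mirror the text and file content parts used on this page.
type TextPart = { type: 'text'; text: string };
type FilePart = { type: 'file'; data: Uint8Array | URL; mediaType: string };

// An image is either a remote URL or named bytes read from disk (assumed shape).
type ImageSource = URL | { name: string; bytes: Uint8Array };

// Extension-to-media-type map for the four types listed above.
const MEDIA_TYPES = new Map<string, string>([
  ['jpg', 'image/jpeg'],
  ['jpeg', 'image/jpeg'],
  ['png', 'image/png'],
  ['webp', 'image/webp'],
  ['gif', 'image/gif'],
]);

// Infer the media type from a file name or URL path; throw on unknown extensions.
function mediaTypeFor(name: string): string {
  const ext = (name.split('.').pop() ?? '').toLowerCase();
  const mediaType = MEDIA_TYPES.get(ext);
  if (mediaType === undefined) {
    throw new Error(`Unrecognized image extension: ${name}`);
  }
  return mediaType;
}

// Turn one image source into a file content part.
function imagePart(source: ImageSource): FilePart {
  if (source instanceof URL) {
    return { type: 'file', data: source, mediaType: mediaTypeFor(source.pathname) };
  }
  return { type: 'file', data: source.bytes, mediaType: mediaTypeFor(source.name) };
}

// Build the content array: the text prompt first, then one part per image.
function visionContent(prompt: string, sources: ImageSource[]): Array<TextPart | FilePart> {
  return [{ type: 'text', text: prompt }, ...sources.map(imagePart)];
}
```

The result of `visionContent('Compare these two screenshots.', [first, second])` would be passed as the `content` of a user message in `generateText`. Inferring the media type from the extension is a convenience only; if the extension is missing or misleading, set `mediaType` explicitly.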
To confirm that a model accepts image input, call ollm.listModels({ inputModality: 'image' }) and check that the model is in the result; each model returned lists 'image' in its input_modalities.
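A guard along these lines could run before sending an image to a model chosen at runtime. This is a sketch under assumptions: it supposes that listModels() resolves to an array of objects with an `id` field alongside `input_modalities`. The `id` field, the `ModelInfo` type, and the `acceptsImageInput` helper are hypothetical, so check the actual response shape before relying on it.

```typescript
// Assumed response shape: only these two fields are used, and `id` is a guess.
type ModelInfo = { id: string; input_modalities: string[] };

// Return true only if the model is present and lists 'image' as an input modality.
function acceptsImageInput(models: ModelInfo[], modelId: string): boolean {
  const model = models.find((m) => m.id === modelId);
  return model !== undefined && model.input_modalities.includes('image');
}

// Possible usage (assumes listModels() resolves to an array of ModelInfo):
//   const models = await ollm.listModels({ inputModality: 'image' });
//   if (!acceptsImageInput(models, modelId)) {
//     throw new Error(`${modelId} does not accept image input`);
//   }
```

Checking `input_modalities` explicitly, rather than only checking that the model is present in the filtered list, keeps the guard correct if the same helper is later fed an unfiltered model list.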
TEE Catalog
Vision models running in Trusted Execution Environments, on NEAR and Phala infrastructure with Intel TDX + NVIDIA H100 confidential compute.
| Model | Provider | Infrastructure | Context |
|---|---|---|---|
| Qwen3 VL 30B | Alibaba | near | 256K |
| Qwen3 VL 30B | Alibaba | phala | 262K |
| Qwen3 VL 30B A3B Instruct | Alibaba | phala | 128K |
| Qwen2.5 VL 72B | Alibaba | phala | 128K |
ZDR Catalog
Vision-capable models running on Vercel's AI infrastructure under zero-data-retention (ZDR) provider agreements.
| Model | Provider | Context |
|---|---|---|
| Llama 3.2 11B Vision Instruct | Meta | 128K |
| Llama 3.2 90B Vision Instruct | Meta | 128K |
| Pixtral 12B | Mistral | 128K |
| Pixtral Large | Mistral | 128K |
| Qwen3 VL Instruct | Alibaba | 262K |
| Nemotron Nano 12B v2 VL | NVIDIA | 131K |
Many frontier ZDR language models, including Claude 4.x, the Gemini 2.5 and 3 families, and GPT-4.1+ and GPT-5, also accept image input. Use ollm.listModels({ inputModality: 'image' }) for the authoritative list.