
GetStream/Vision Agents

Built by GetStream · 7,569 stars

What is GetStream/Vision Agents?

Open Vision Agents by Stream. Build Vision Agents quickly with any model or video provider. Uses Stream's edge network for ultra-low latency.

How to use GetStream/Vision Agents?

1. Install a compatible MCP client (like Claude Desktop).
2. Open your configuration settings.
3. Add GetStream/Vision Agents using the following command: npx @modelcontextprotocol/getstream-vision-agents
4. Restart the client and verify the new tools are active.
🛡️ Scoped (Restricted)
npx @modelcontextprotocol/getstream-vision-agents --scope restricted
🔓 Unrestricted Access
npx @modelcontextprotocol/getstream-vision-agents

Key Features

Native MCP Protocol Support
Real-time Tool Activation & Execution
Verified High-performance Implementation
Secure Resource & Context Handling

Optimized Use Cases

Extending AI models with custom local capabilities
Automating system workflows via natural language
Connecting external data sources to LLM context windows

GetStream/Vision Agents FAQ

Q

Is GetStream/Vision Agents safe?

Yes, GetStream/Vision Agents follows the standardized Model Context Protocol security patterns and only executes tools with explicit user-granted permissions.

Q

Is GetStream/Vision Agents up to date?

GetStream/Vision Agents is currently active in the registry and has 7,569 stars on GitHub; check the repository for the latest release and commit activity.

Q

Are there any limits for GetStream/Vision Agents?

Usage limits depend on the specific implementation of the MCP server and your system resources. Refer to the official documentation below for technical details.

Official Documentation

View on GitHub

Open Vision Agents by Stream


https://github.com/user-attachments/assets/d9778ab9-938d-4101-8605-ff879c29b0e4

Multi-modal AI agents that watch, listen, and understand video.

Vision Agents give you the building blocks to create intelligent, low-latency video experiences powered by your models, your infrastructure, and your use cases.

Key Highlights

  • Video AI: Built for real-time video AI. Combine YOLO, Roboflow, and others with Gemini/OpenAI in real-time.
  • Low Latency: Calls join in about 500 ms, and Stream's edge network keeps audio/video latency under 30 ms.
  • Open: Built by Stream, but works with any video edge network.
  • Native APIs: Native SDK methods from OpenAI (create response), Gemini (generate), and Claude (create message), so you always have access to the latest LLM capabilities.
  • SDKs: SDKs for React, Android, iOS, Flutter, React Native, and Unity, powered by Stream's ultra-low-latency network.

Getting Started

Step 1: Install via uv

uv add vision-agents

Step 2: (Optional) Install with extra integrations

uv add "vision-agents[getstream, openai, elevenlabs, deepgram]"

Step 3: Obtain your Stream API credentials

Get a free API key from Stream. Developers receive 333,000 participant minutes per month, plus extra credits via the Maker Program.

Follow the quickstart guide to build your first agent.
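Once the package and credentials are in place, a typical run looks like the sketch below. Note the environment variable names (STREAM_API_KEY, STREAM_API_SECRET) and the script name my_agent.py are assumptions, not confirmed by this page; check the quickstart guide for the exact names your version expects.

```shell
# Assumed setup sketch -- variable names and script path are placeholders;
# consult the quickstart guide for the exact values.
export STREAM_API_KEY="your-api-key"
export STREAM_API_SECRET="your-api-secret"

# Run your agent script inside the uv-managed environment.
uv run python my_agent.py
```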

See It In Action

https://github.com/user-attachments/assets/d1258ac2-ca98-4019-80e4-41ec5530117e

This example shows how to build a golf-coaching AI with YOLO and Gemini Live. Combining a fast object-detection model (like YOLO) with a full realtime LLM is useful across many video AI use cases, for example: drone fire detection, sports or video-game coaching, physical therapy, workout coaching, and Just Dance-style games.

# partial example, full example: examples/02_golf_coach_example/golf_coach_example.py
# (Agent, getstream, gemini, and ultralytics are imported from the
# vision-agents package and its plugins; see the full example.)
agent = Agent(
    edge=getstream.Edge(),
    agent_user=agent_user,
    instructions="Read @golf_coach.md",
    llm=gemini.Realtime(fps=10),
    processors=[ultralytics.YOLOPoseProcessor(model_path="yolo11n-pose.pt", device="cuda")],
)
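The fps=10 argument above caps how many frames per second reach the realtime model. As a rough illustration, such a cap can be sketched as follows; select_frames is a hypothetical helper, not the vision-agents implementation:

```python
# Illustrative sketch only: select_frames is a hypothetical helper showing how
# an fps cap like fps=10 might throttle frames sent to the realtime LLM.
def select_frames(timestamps_ms, fps):
    """Keep at most `fps` frames per second from sorted timestamps (in ms)."""
    min_gap_ms = 1000 // fps
    kept, last = [], None
    for t in timestamps_ms:
        # Forward a frame only if enough time has passed since the last one.
        if last is None or t - last >= min_gap_ms:
            kept.append(t)
            last = t
    return kept

# One second of a 30 fps source (one frame every ~33 ms), sampled down to
# 10 fps, keeps roughly every third frame.
source = [i * 1000 // 30 for i in range(30)]
kept = select_frames(source, fps=10)
print(len(kept))  # 10
```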

Features

  • Real-time WebRTC: Stream video directly to model providers for instant visual understanding.
  • Video Processing: Pluggable processor pipeline for YOLO, Roboflow, or custom PyTorch/ONNX models before/after LLM calls.
  • Turn Detection: Natural conversation flow with VAD, diarization, and smart turn-taking.
  • Tool Calling & MCP: Execute code and APIs mid-conversation, from Linear issues, weather, and telephony to any MCP server.
  • Phone Integration: Inbound and outbound voice calls via Twilio with bidirectional audio streaming.
  • RAG: Retrieval-augmented generation with TurboPuffer vector search or Gemini FileSearch.
  • Memory: Agents recall context across turns and sessions via Stream Chat.
  • Text Back-channel: Message the agent silently during a call for coaching overlays, silent instructions, and more.
  • Production Ready: Built-in HTTP server, Prometheus metrics, horizontal scaling, and Kubernetes deployment.
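The pluggable processor pipeline can be pictured with a minimal pure-Python sketch. Frame, pose_detector, and run_pipeline below are hypothetical stand-ins for illustration, not the vision-agents API:

```python
# Illustrative sketch only: hypothetical names, NOT the vision-agents API.
# Shows the general shape of a pluggable before-LLM frame-processor pipeline.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Frame:
    data: bytes
    annotations: dict = field(default_factory=dict)

# A processor is any callable that takes a Frame and returns it, possibly annotated.
FrameProcessor = Callable[[Frame], Frame]

def pose_detector(frame: Frame) -> Frame:
    # Stand-in for a fast model such as YOLO pose estimation.
    frame.annotations["pose"] = "keypoints-placeholder"
    return frame

def run_pipeline(frame: Frame, processors: List[FrameProcessor]) -> Frame:
    # Each processor runs in order before the annotated frame reaches the LLM.
    for process in processors:
        frame = process(frame)
    return frame

annotated = run_pipeline(Frame(data=b"\x00"), [pose_detector])
print(annotated.annotations)  # {'pose': 'keypoints-placeholder'}
```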

Out-of-the-Box Integrations

LLMs: OpenAI · Gemini · xAI · OpenRouter · Hugging Face · Kimi AI

Realtime: OpenAI Realtime · Gemini Live · AWS Nova Sonic · Qwen

STT: Deepgram · AssemblyAI · Fast-Whisper · Fish Audio · Wizper · Mistral Voxtral

TTS: ElevenLabs · Cartesia · Deepgram · AWS Polly · Pocket · Kokoro · Inworld · Fish Audio

Vision: Ultralytics · Roboflow · Moondream · NVIDIA Cosmos · Decart

Avatars: LemonSlice

Turn Detection: Vogent · Smart Turn

Other: Twilio · TurboPuffer

Documentation

Check out the full docs at VisionAgents.ai.

Quickstart: Voice AI · Video AI

Guides: MCP & Function Calling · Video Processors · Phone Calling · RAG · Testing

Production: HTTP Server · Deployment · Kubernetes · Horizontal Scaling · Prometheus Metrics

Examples

🔮 Demo Applications
Voice Agents (Low Latency + RAG + File Search)
Build fast voice agents that can reason over knowledge, search files, and respond in real time.
  • Low-latency voice interactions
  • Retrieval-augmented responses
  • File and knowledge search
Source code and tutorial available in the repository.

Realtime Coaching and Video Understanding
Power interactive coaching flows with live pose tracking and processor pipelines for frame-by-frame understanding.
  • Real-time pose tracking
  • Actionable coaching feedback
  • Video processor pipeline support
Source code and tutorial available in the repository.

Video Restyling and Avatars
Use models like Decart Lucy to build virtual try-ons, stylized scenes, or give your agents a visual identity.
  • Real-time video restyling
  • Virtual try-on experiences
  • Avatar-like visual presence
Source code and tutorial available in the repository.

Custom Video Models (Roboflow, YOLO, and More)
Train and run custom computer vision models for security monitoring, moderation, and other domain-specific workflows.
  • Bring your own CV models
  • Real-time moderation pipelines
  • Security and detection use cases
Source code and tutorial available in the repository.

Tools, MCP, and Phone Calling
Connect external APIs and services so agents can validate data and take real-world actions during live conversations.
  • MCP and function calling support
  • Twilio-based phone workflows
  • Real-time fraud response automation
Phone + RAG example and fraud workflow example available in the repository.

Development

See DEVELOPMENT.md

Want to add your platform or provider? See Create Your Own Plugin or reach out to nash@getstream.io.

Current Limitations

  • Video AI struggles with small text — models may hallucinate scores, signs, etc.
  • Context degrades on longer sessions (~30s+) for continuous video understanding
  • Most use cases need a mix of specialized models (YOLO, Roboflow) with larger LLMs
  • Real-time models require audio/text to trigger responses — video alone won't prompt output
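The last two limitations suggest a common pattern: run a cheap specialized detector on every frame and invoke the larger model only when it fires. A minimal illustrative sketch, with stand-in functions rather than library code:

```python
# Illustrative pattern only: gate an expensive realtime LLM call behind a
# cheap per-frame detector. Both functions are stand-ins, not library code.
def cheap_detector(frame: str) -> float:
    # Stand-in for a fast model like YOLO/Roboflow: returns a confidence score.
    return 0.9 if "person" in frame else 0.1

def describe_with_llm(frame: str) -> str:
    # Stand-in for a Gemini/OpenAI realtime call; only invoked when gated in.
    return f"LLM analysis of: {frame}"

def process_stream(frames, threshold=0.5):
    # Forward only high-confidence frames to the expensive model.
    results = []
    for frame in frames:
        if cheap_detector(frame) >= threshold:
            results.append(describe_with_llm(frame))
    return results

out = process_stream(["empty scene", "person swinging club", "empty scene"])
print(out)  # ['LLM analysis of: person swinging club']
```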

Star History

Star History Chart

Global Ranking

Trust Score (MCPHub Index): 8.5

Based on codebase health & activity.

Manual Config

{
  "mcpServers": {
    "getstream-vision-agents": {
      "command": "npx",
      "args": ["getstream-vision-agents"]
    }
  }
}