openclaw-video-vision

AI-powered video understanding

Crawl any video platform, extract key frames, get structured summaries powered by vision AI.

Overview

openclaw-video-vision is an OpenClaw skill that:

  1. Accepts a video URL (YouTube, Bilibili, or any web page with a <video> element)
  2. Extracts key frames via yt-dlp + FFmpeg or browser screenshots
  3. Sends frames to a vision AI model for structured summarization
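As a rough illustration of step 2 on the yt-dlp + FFmpeg path, the sketch below only builds the two command lines; it does not run them. The helper names, the 30-second sampling interval, and the frame cap are assumptions made for this example, and uniform sampling stands in for whatever key-frame selection the skill actually uses.

```python
from typing import List


def build_stream_url_cmd(video_url: str) -> List[str]:
    """yt-dlp command that prints the direct media URL instead of downloading.

    "-f b" asks for the best single file that contains both video and audio.
    """
    return ["yt-dlp", "--get-url", "-f", "b", video_url]


def build_frame_cmd(stream_url: str, out_pattern: str,
                    every_seconds: int = 30, max_frames: int = 8) -> List[str]:
    """FFmpeg command that writes one frame every `every_seconds` seconds.

    `out_pattern` is an image sequence pattern such as "frame_%03d.jpg".
    The defaults are hypothetical, not values taken from the skill.
    """
    if every_seconds <= 0 or max_frames <= 0:
        raise ValueError("every_seconds and max_frames must be positive")
    return [
        "ffmpeg",
        "-i", stream_url,
        "-vf", f"fps=1/{every_seconds}",   # sample one frame per interval
        "-frames:v", str(max_frames),      # stop after this many frames
        out_pattern,
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)` once yt-dlp and FFmpeg are installed; the first command's output is the `stream_url` for the second.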

Quick Navigation

Page              Description
Installation      Prerequisites, setup, and first run
Configuration     All environment variables
Extraction Modes  auto / ytdlp / browser: how to choose
Cloud Browsers    Browserless, Browserbase, Steel setup
Cookies           Authenticated & age-restricted content
Troubleshooting   Common errors and fixes
Architecture      Code structure and data flow

Supported Platforms

Platform               yt-dlp path  Browser path
YouTube                Yes          Yes
Bilibili               Yes          Yes
Generic <video> pages  Partial      Yes

Two Extraction Paths

Video URL
    |
    v
[Phase 1] yt-dlp + FFmpeg ---- success ----> Vision AI -> Summary
    |
    | fail
    v
[Phase 2] Browser (Playwright) ---- success ----> Vision AI -> Summary

Phase 1 requires only yt-dlp and FFmpeg; no browser or Chromium is needed. Phase 2 requires playwright-core (an optional dependency) plus either Chromium or a cloud browser.
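The fallback in the diagram can be sketched as a loop over phases that returns the first successful result. This is a minimal illustration, not the skill's own code: `extract_with_fallback`, `ExtractionError`, and the two stub extractors are hypothetical names, and Python is used here only for readability.

```python
from typing import Callable, List, Sequence, Tuple

# An extractor takes a video URL and returns a list of frame file paths.
Extractor = Callable[[str], List[str]]


class ExtractionError(Exception):
    """Raised when every extraction phase fails."""


def extract_with_fallback(
    url: str, phases: Sequence[Tuple[str, Extractor]]
) -> Tuple[str, List[str]]:
    """Try each phase in order; return (phase_name, frames) from the first success."""
    errors = []
    for name, extractor in phases:
        try:
            frames = extractor(url)
        except Exception as exc:  # any failure moves on to the next phase
            errors.append(f"{name}: {exc}")
            continue
        if frames:
            return name, frames
        errors.append(f"{name}: no frames produced")
    raise ExtractionError("; ".join(errors))


# Stub extractors standing in for the real Phase 1 and Phase 2 (assumptions).
def fake_ytdlp(url: str) -> List[str]:
    raise RuntimeError("unsupported site")


def fake_browser(url: str) -> List[str]:
    return ["frame_001.jpg"]
```

Calling `extract_with_fallback(url, [("ytdlp", fake_ytdlp), ("browser", fake_browser)])` with these stubs falls through to the browser phase. Collecting the per-phase errors keeps the final failure message useful when both paths fail.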

You can lock the extraction path via VIDEO_VISION_MODE. See Extraction Modes.
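One way to picture mode locking is as a mapping from the variable's value to an ordered list of phases. The sketch below assumes the three values listed under Extraction Modes (auto, ytdlp, browser), assumes that an unset variable behaves like auto, and assumes that ytdlp and browser each pin a single phase; the function name `resolve_phases` is hypothetical.

```python
import os
from typing import List, Mapping, Optional

VALID_MODES = ("auto", "ytdlp", "browser")


def resolve_phases(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Map VIDEO_VISION_MODE to the ordered list of phases to attempt."""
    env = os.environ if env is None else env
    # Assumption: a missing or empty value falls back to "auto".
    mode = env.get("VIDEO_VISION_MODE", "auto").strip().lower() or "auto"
    if mode not in VALID_MODES:
        raise ValueError(
            f"VIDEO_VISION_MODE must be one of {VALID_MODES}, got {mode!r}"
        )
    if mode == "auto":
        return ["ytdlp", "browser"]  # Phase 1 first, then Phase 2 on failure
    return [mode]                    # locked to a single path, no fallback
```

Rejecting unknown values early avoids silently running the wrong path when the variable is mistyped.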