An automated pipeline that transforms curated RSS news feeds into short-form poetry videos for social media, powered by LLMs, neural TTS, and programmatic video composition.
Every day, this system reads curated news articles from an Inoreader RSS feed, distills the key themes into a structured poem, generates narrated audio, sources content-matched imagery, and composites a polished vertical video ready for YouTube Shorts, TikTok, Instagram Reels, and X.
The project replaces an earlier version that relied on GPT to hallucinate "today's news" and produced basic slideshow-style videos. This iteration connects to real curated journalism from 120+ feeds across 14 topic folders, and builds videos with cinematic Ken Burns motion, crossfade transitions, and theme-matched visuals.
Ingests curated articles from BBC, Reuters, Al Jazeera, The Atlantic, Wired, and 100+ other feeds via the Inoreader API.
Uses Claude or GPT-4 to transform news summaries into multi-stanza poems with configurable tone and style.
OpenAI's HD text-to-speech voices narrate each poem with natural cadence and pacing tuned for spoken poetry.
Pillow-rendered text overlays composited with FFmpeg, plus Ken Burns motion, crossfade transitions, and content-matched stock imagery.
The system is a six-stage pipeline. Each stage produces a serializable checkpoint, so a failed run can resume from the last completed step without re-calling paid APIs.
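A minimal sketch of that checkpoint pattern (the stage names and the `handlers` mapping are illustrative, not the project's actual module layout):

```python
import json
from pathlib import Path
from typing import Callable

# Illustrative stage names corresponding to the six pipeline stages.
STAGES = ["feed", "summary", "poem", "images", "audio", "video"]

def load_checkpoint(run_dir: Path) -> dict:
    """Return this run's checkpoint, or a fresh one if none exists."""
    cp = run_dir / "checkpoint.json"
    return json.loads(cp.read_text()) if cp.exists() else {"completed": []}

def run_pipeline(run_dir: Path, handlers: dict[str, Callable[[Path], None]]) -> None:
    """Run each stage once; a resumed run skips stages already checkpointed."""
    checkpoint = load_checkpoint(run_dir)
    for stage in STAGES:
        if stage in checkpoint["completed"]:
            continue  # output already on disk; no repeat API spend
        handlers[stage](run_dir)  # each handler writes its output to run_dir
        checkpoint["completed"].append(stage)
        (run_dir / "checkpoint.json").write_text(json.dumps(checkpoint, indent=2))
```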
The pipeline ingests from a separate companion project (inoreader-rag) that handles RSS authentication, feed export, and keyword filtering. The two systems are loosely coupled via markdown files on disk. This separation means either project can evolve independently.
Parses structured markdown files exported by the Inoreader pipeline. Each file represents a curated topic folder (World Stage, Tech & Society, Science & Ideas, etc.) containing up to 100 articles with title, source, author, date, URL, and body text. The reader filters by folder, deduplicates by URL, sorts by recency, and returns a configurable number of the most recent articles across selected topics.
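A sketch of that selection logic, assuming a simple `Article` dataclass with the fields listed above:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Article:
    title: str
    source: str
    author: str
    date: datetime
    url: str
    body: str

def select_articles(parsed: list[Article], limit: int = 10) -> list[Article]:
    """Deduplicate by URL, sort newest-first, and return the top `limit`."""
    seen: set[str] = set()
    unique: list[Article] = []
    for article in parsed:
        if article.url not in seen:
            seen.add(article.url)
            unique.append(article)
    unique.sort(key=lambda a: a.date, reverse=True)
    return unique[:limit]
```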
An abstract provider interface supporting both Anthropic (Claude) and OpenAI, swappable via a single config flag. Two specialized modules sit on top: the Summarizer distills the article set into a thematic summary at low temperature (0.3) for factual accuracy, while the Poet transforms that summary into structured stanzas at higher temperature (0.8) for creativity. Both use JSON mode for structured output, eliminating the fragile regex parsing from the previous version. The poet also generates image search keywords grouped by emotional arc, so visuals track the poem's progression.
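A sketch of how the provider abstraction and the two temperature profiles could fit together (class and method names here are assumptions, not the project's actual API):

```python
from abc import ABC, abstractmethod

class LLMProvider(ABC):
    """Shared interface; concrete subclasses wrap the Anthropic and OpenAI SDKs."""

    @abstractmethod
    def complete_json(self, prompt: str, temperature: float) -> dict:
        """Return the model's JSON-mode response as a parsed dict."""

class Summarizer:
    def __init__(self, provider: LLMProvider):
        self.provider = provider

    def summarize(self, articles_text: str) -> dict:
        # Low temperature keeps the summary close to the source articles.
        return self.provider.complete_json(articles_text, temperature=0.3)

class Poet:
    def __init__(self, provider: LLMProvider):
        self.provider = provider

    def compose(self, summary: str) -> dict:
        # Higher temperature gives the poem room for creative phrasing.
        return self.provider.complete_json(summary, temperature=0.8)
```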
Fetches content-matched stock photography using keywords generated by the poet. Searches Unsplash (primary, 50 req/hr free tier) with Pexels as fallback (200 req/hr). Requests portrait-oriented images at 1080px width to match the 9:16 video format. Downloads are cached locally by URL hash to avoid redundant API calls across runs. The system typically sources 3 images per video, each covering 2-3 stanzas, creating visual coherence rather than jarring per-stanza switches. Designed with a clean interface so AI image generation (DALL-E 3, Flux) can replace stock search without touching the rest of the pipeline.
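The caching scheme is straightforward to sketch; the cache directory and function name are illustrative:

```python
import hashlib
from pathlib import Path

import requests

CACHE_DIR = Path("cache/images")  # assumed location

def fetch_image(url: str) -> Path:
    """Download an image once; later runs reuse the file cached by URL hash."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cached = CACHE_DIR / (hashlib.sha256(url.encode()).hexdigest() + ".jpg")
    if not cached.exists():
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        cached.write_bytes(response.content)
    return cached
```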
Generates narration using OpenAI's tts-1-hd model. Each stanza is synthesized individually, then assembled with configurable silence padding between stanzas (800ms default) and between lines within a stanza (250ms) to give the poetry room to breathe. The voice (onyx default) runs at 0.9x speed for a more deliberate, contemplative delivery. The engine produces both per-stanza audio files and a combined master track used for final video muxing.
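A minimal sketch of the per-stanza synthesis and assembly, using the OpenAI Python SDK plus, purely for illustration, pydub for the silence padding (the project's actual assembly code may differ):

```python
from pathlib import Path

from openai import OpenAI
from pydub import AudioSegment  # illustration only; any audio library works here

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def synthesize_stanza(text: str, out_path: Path) -> AudioSegment:
    """Render one stanza with the project's default voice and pacing."""
    response = client.audio.speech.create(
        model="tts-1-hd", voice="onyx", speed=0.9, input=text
    )
    response.write_to_file(out_path)
    return AudioSegment.from_file(str(out_path), format="mp3")

def assemble_master(stanzas: list[AudioSegment]) -> AudioSegment:
    """Join per-stanza audio with 800 ms gaps (line-level 250 ms padding omitted)."""
    gap = AudioSegment.silent(duration=800)
    master = stanzas[0]
    for segment in stanzas[1:]:
        master += gap + segment
    return master
```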
The most complex component, split across three submodules. The Text Renderer uses Pillow to generate transparent PNG overlays with bundled fonts, white text, black stroke outlines, and opacity-stepped frames for fade-in/fade-out animations. The Effects module applies Ken Burns slow-zoom on each background image and manages crossfade timing between stanzas. The Composer orchestrates the full build: it creates per-stanza clips (background + text overlay + audio), concatenates them with crossfade transitions via FFmpeg's xfade filter, prepends a title card, appends an end card, and encodes the final output as H.264 at CRF 20 with yuv420p pixel format for universal platform compatibility.
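A sketch of a single crossfade step using the ffmpeg-python bindings listed in the stack below; the real composer chains this across every stanza clip and muxes the master audio track separately (the 0.5 s fade length is an assumption):

```python
import ffmpeg

def crossfade_pair(a_path: str, b_path: str, a_duration: float, out_path: str) -> None:
    """Crossfade clip B over the tail of clip A via FFmpeg's xfade filter."""
    fade = 0.5  # assumed transition length in seconds
    clip_a = ffmpeg.input(a_path)
    clip_b = ffmpeg.input(b_path)
    joined = ffmpeg.filter(
        [clip_a, clip_b], "xfade",
        transition="fade", duration=fade, offset=a_duration - fade,
    )
    ffmpeg.output(joined, out_path).run(overwrite_output=True)
```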
| Decision | Choice | Rationale |
|---|---|---|
| Video engine | Pillow + FFmpeg | The previous version used moviepy, which loads entire videos into RAM. FFmpeg streams frames and handles encoding natively. Pillow gives pixel-perfect control over text rendering (shadows, glow, custom fonts) without the limitations of FFmpeg's ASS subtitle engine. |
| Text rendering | Pre-rendered PNG overlays | Evaluated three approaches: ASS subtitles (limited animation, arcane syntax), Pillow overlays (full control, moderate performance cost), and Remotion/React (unlimited quality, wrong tech stack). Pillow is the best balance of quality ceiling and implementation complexity for a Python pipeline. |
| LLM output format | JSON mode | The previous version parsed stanzas from free-text LLM output with regex (split on blank lines, strip "Stanza N:" labels). This broke whenever the model varied its formatting. JSON mode returns a structured array, eliminating the parsing fragility. |
| Image sourcing | Semantic stock search | The LLM generates image keywords as part of the poem output, informed by the emotional arc. Stock APIs (Unsplash, Pexels) are free and produce high-quality editorial photography. The interface is designed for a future swap to AI generation (DALL-E 3, Flux) without pipeline changes. |
| Feed integration | Markdown files on disk | Loose coupling between the Inoreader pipeline and the poem generator. No shared libraries, no API layer, no database dependency. Either project can evolve or be replaced independently. The contract is simply: structured markdown files in a known directory. |
| Checkpoint/resume | JSON state per run | Each pipeline stage writes its output to disk and updates a checkpoint file. A failed run (e.g., FFmpeg crash) can resume from the last completed stage without re-calling paid LLM and TTS APIs. |
| Encoding | H.264, CRF 20, yuv420p | H.264 is universally accepted by all target platforms. CRF 20 provides high-quality source material that survives platform re-encoding. yuv420p is mandatory; other pixel formats fail on mobile devices. H.265 offers better compression but has spotty platform support. |
The LLM returns a single structured JSON object containing the poem and visual direction. This eliminates the fragile text parsing of earlier versions and lets the image sourcer work from the same creative context as the poet.
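A plausible shape for that object; the field names here are assumptions based on the description above, not the project's exact schema:

```python
# Illustrative shape only; field names are assumptions, not the actual schema.
poem_response = {
    "title": "...",
    "stanzas": [
        {"lines": ["...", "...", "..."]},
        {"lines": ["...", "...", "..."]},
    ],
    "image_keywords": [
        {"arc": "tension", "query": "storm clouds over city skyline"},
        {"arc": "resolution", "query": "sunrise over calm water"},
    ],
}
```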
The composer builds vertical video (1080 x 1920) through a multi-layer process designed for short-form social platforms.
| Parameter | Value | Notes |
|---|---|---|
| Resolution | 1080 x 1920 | 9:16 vertical, native for all target platforms |
| Frame Rate | 30 fps | Sufficient for text + photo content; 60 fps not needed |
| Codec | H.264 (libx264) | Universal platform acceptance; H.265 has spotty support |
| CRF | 20 | High quality source; platforms re-encode regardless |
| Pixel Format | yuv420p | Mandatory for mobile playback compatibility |
| Audio | AAC @ 192kbps | Preserves TTS quality for spoken word content |
| Preset | medium | Balanced encoding speed and compression efficiency |
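Those parameters translate directly to an ffmpeg-python output call; a sketch with an illustrative function name:

```python
import ffmpeg

def encode_final(video_path: str, audio_path: str, out_path: str) -> None:
    """Mux the narration track and encode with the parameters from the table above."""
    video = ffmpeg.input(video_path)
    audio = ffmpeg.input(audio_path)
    ffmpeg.output(
        video, audio, out_path,
        vcodec="libx264", crf=20, preset="medium",
        pix_fmt="yuv420p", r=30, s="1080x1920",
        acodec="aac", audio_bitrate="192k",
    ).run(overwrite_output=True)
```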
All image sourcing and video encoding happens locally at zero cost. The only paid services are the LLM and TTS APIs. Generating one video per day, total cost stays well under $5/month, leaving headroom to upgrade models or voices.
Core runtime. Type hints throughout, dataclasses for structured data, pathlib for file operations.
LLM providers for summarization and poetry generation. Swappable via config with a shared abstract interface.
Neural text-to-speech with HD voice models. Six voice options with configurable speed and output format.
Video encoding, crossfade transitions, audio muxing. Called via ffmpeg-python bindings for clean filter graph construction.
Text overlay rendering as transparent PNGs. Font loading, word wrapping, stroke outlines, opacity control for fade animations.
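A minimal sketch of that overlay rendering (the font path and sizes are placeholders):

```python
from PIL import Image, ImageDraw, ImageFont

def render_overlay(text: str, opacity: int, size=(1080, 1920)) -> Image.Image:
    """Render poem text onto a transparent canvas; step `opacity` to build fade frames."""
    overlay = Image.new("RGBA", size, (0, 0, 0, 0))  # fully transparent background
    draw = ImageDraw.Draw(overlay)
    font = ImageFont.truetype("fonts/Lora-Regular.ttf", 64)  # placeholder bundled font
    draw.text(
        (size[0] // 2, size[1] // 2), text,
        font=font, anchor="mm", align="center",
        fill=(255, 255, 255, opacity),                    # white text
        stroke_width=3, stroke_fill=(0, 0, 0, opacity),   # black outline
    )
    return overlay
```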
Free stock photography search with portrait orientation filtering. Local caching prevents redundant downloads.
Get an end-to-end video out the door. Feed reader, LLM abstraction, summarizer, poet, image sourcer, TTS, basic video composition, and CLI.
Add the effects that make a video feel produced rather than generated. Transitions, text animations, Ken Burns motion, audio pacing, and platform-aware duration limits.
Checkpoint/resume, config validation, dry-run mode, image caching, structured logging, and daily automation via cron.
Swap stock photos for AI-generated imagery (DALL-E 3 / Flux), add a web interface, thumbnail generation, and direct platform uploads.