Systems Specification

Poem Short Generator v3

An automated pipeline that transforms curated RSS news feeds into short-form poetry videos for social media, powered by LLMs, neural TTS, and programmatic video composition.

What It Does

Every day, this system reads curated news articles from an Inoreader RSS feed, distills the key themes into a structured poem, generates narrated audio, sources content-matched imagery, and composites a polished vertical video ready for YouTube Shorts, TikTok, Instagram Reels, and X.

The project replaces an earlier version that relied on GPT to hallucinate "today's news" and produced basic slideshow-style videos. This iteration connects to real curated journalism from 120+ feeds across 14 topic folders, and builds videos with cinematic Ken Burns motion, crossfade transitions, and theme-matched visuals.

📰

Real News Sources

Ingests curated articles from BBC, Reuters, Al Jazeera, The Atlantic, Wired, and 100+ other feeds via the Inoreader API.

✍️

AI Poetry

Uses Claude or GPT-4o to transform news summaries into multi-stanza poems with configurable tone and style.

🎤

Neural TTS

OpenAI's HD text-to-speech voices narrate each poem with natural cadence and pacing tuned for spoken poetry.

🎬

Cinematic Video

Pillow-rendered text overlays composited with FFmpeg. Ken Burns motion, crossfades, and content-matched stock imagery.

Pipeline Architecture

The system is a six-stage pipeline. Each stage produces a serializable checkpoint, so a failed run can resume from the last completed step without re-calling paid APIs.

1. Feed Reader: parses Inoreader markdown exports
2. Summarizer: articles to thematic summary
3. Poet: summary to stanzas + image keywords
4. Image Sourcer: Unsplash / Pexels semantic search
5. TTS: OpenAI HD voices with pacing
6. Composer: Pillow + FFmpeg video assembly
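The checkpoint/resume contract can be sketched as below. Stage names, the `checkpoint.json` filename, and the state layout are illustrative, not the project's actual identifiers; the point is that each stage persists its output before the next one runs, so a rerun skips completed (paid) steps.

```python
import json
from pathlib import Path

def run_pipeline(stages, run_dir: Path) -> dict:
    """Run named stages in order, resuming from the last completed one.

    `stages` is an ordered list of (name, fn) pairs; each fn receives the
    accumulated outputs dict and returns a JSON-serializable result.
    """
    ckpt_file = run_dir / "checkpoint.json"
    state = (json.loads(ckpt_file.read_text()) if ckpt_file.exists()
             else {"done": [], "outputs": {}})
    for name, fn in stages:
        if name in state["done"]:
            continue  # completed in a previous run; skip paid API calls
        state["outputs"][name] = fn(state["outputs"])
        state["done"].append(name)
        ckpt_file.write_text(json.dumps(state))  # persist after every stage
    return state["outputs"]
```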

The pipeline ingests from a separate companion project (inoreader-rag) that handles RSS authentication, feed export, and keyword filtering. The two systems are loosely coupled via markdown files on disk. This separation means either project can evolve independently.

Component Details

Feed Reader

src/feed/reader.py

Parses structured markdown files exported by the Inoreader pipeline. Each file represents a curated topic folder (World Stage, Tech & Society, Science & Ideas, etc.) containing up to 100 articles with title, source, author, date, URL, and body text. The reader filters by folder, deduplicates by URL, sorts by recency, and returns a configurable number of the most recent articles across selected topics.

Markdown parsing 14 topic folders 120+ feeds Deduplication
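The reader's selection step (deduplicate by URL, sort by recency, cap the count) reduces to something like this sketch; the `Article` fields are illustrative stand-ins for the parsed markdown fields:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Article:
    title: str
    url: str
    published: date
    body: str = ""

def select_articles(articles, limit: int):
    """Deduplicate by URL (first occurrence wins), newest first, cap at `limit`."""
    seen, unique = set(), []
    for a in articles:
        if a.url not in seen:
            seen.add(a.url)
            unique.append(a)
    unique.sort(key=lambda a: a.published, reverse=True)
    return unique[:limit]
```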

LLM Layer

src/llm/

An abstract provider interface supporting both Anthropic (Claude) and OpenAI, swappable via a single config flag. Two specialized modules sit on top: the Summarizer distills the article set into a thematic summary at low temperature (0.3) for factual accuracy, while the Poet transforms that summary into structured stanzas at higher temperature (0.8) for creativity. Both use JSON mode for structured output, eliminating the fragile regex parsing from the previous version. The poet also generates image search keywords grouped by emotional arc, so visuals track the poem's progression.

Claude GPT-4o JSON mode Structured output
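The provider abstraction might look like the following sketch. `EchoProvider` is a test stand-in; the real Anthropic and OpenAI implementations would wrap their vendor SDKs behind the same interface, and the registry key is the config flag that swaps them.

```python
from abc import ABC, abstractmethod

class LLMProvider(ABC):
    """Abstract interface shared by the Anthropic and OpenAI providers."""

    @abstractmethod
    def complete(self, prompt: str, temperature: float) -> str:
        """Return the model's completion; JSON mode is handled per provider."""

class EchoProvider(LLMProvider):
    """Stand-in provider so the factory is testable without API keys."""

    def complete(self, prompt: str, temperature: float) -> str:
        return f"[t={temperature}] {prompt}"

def make_provider(name: str) -> LLMProvider:
    # The config flag selects the implementation; a real registry would map
    # "anthropic" and "openai" to the SDK-backed classes.
    registry = {"echo": EchoProvider}
    return registry[name]()
```

The summarizer would call `complete(..., temperature=0.3)` and the poet `complete(..., temperature=0.8)`, matching the settings above.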

Image Sourcer

src/images/

Fetches content-matched stock photography using keywords generated by the poet. Searches Unsplash (primary, 50 req/hr free tier) with Pexels as fallback (200 req/hr). Requests portrait-oriented images at 1080px width to match the 9:16 video format. Downloads are cached locally by URL hash to avoid redundant API calls across runs. The system typically sources 3 images per video, each covering 2-3 stanzas, creating visual coherence rather than jarring per-stanza switches. Designed with a clean interface so AI image generation (DALL-E 3, Flux) can replace stock search without touching the rest of the pipeline.

Unsplash API Pexels API Local cache Portrait 9:16
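The cache-by-URL-hash behavior can be sketched as below; the `download` callable is injected (e.g. an `httpx` GET returning bytes) so the cache logic is shown in isolation, and the `.jpg` suffix is an assumption:

```python
import hashlib
from pathlib import Path

def cached_fetch(url: str, cache_dir: Path, download) -> Path:
    """Return a local path for `url`, downloading only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / (hashlib.sha256(url.encode()).hexdigest() + ".jpg")
    if not path.exists():  # cache miss: hit the API exactly once
        path.write_bytes(download(url))
    return path
```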

Audio Engine

src/audio/tts.py

Generates narration using OpenAI's tts-1-hd model. Each stanza is synthesized individually, then assembled with configurable silence padding between stanzas (800ms default) and between lines within a stanza (250ms) to give the poetry room to breathe. The voice (onyx default) runs at 0.9x speed for a more deliberate, contemplative delivery. The engine produces both per-stanza audio files and a combined master track used for final video muxing.

OpenAI TTS HD Pacing control 6 voice options Per-stanza + combined
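The padding rules imply a simple duration arithmetic for the master track. This sketch computes only the timing (the actual assembly uses pydub segments), assuming n lines produce n-1 intra-stanza pauses and stanzas are separated by one inter-stanza silence:

```python
def narration_length_ms(stanza_ms, lines_per_stanza,
                        line_pause=250, stanza_pause=800):
    """Master-track length implied by the padding rules.

    stanza_ms: synthesized speech duration per stanza, excluding pauses.
    lines_per_stanza: line count per stanza.
    """
    total = 0
    for i, (ms, lines) in enumerate(zip(stanza_ms, lines_per_stanza)):
        total += ms + max(lines - 1, 0) * line_pause   # 250ms between lines
        if i < len(stanza_ms) - 1:
            total += stanza_pause                       # 800ms between stanzas
    return total
```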

Video Composer

src/video/

The most complex component, split across three submodules. The Text Renderer uses Pillow to generate transparent PNG overlays with bundled fonts, white text, black stroke outlines, and opacity-stepped frames for fade-in/fade-out animations. The Effects module applies Ken Burns slow-zoom on each background image and manages crossfade timing between stanzas. The Composer orchestrates the full build: it creates per-stanza clips (background + text overlay + audio), concatenates them with crossfade transitions via FFmpeg's xfade filter, prepends a title card, appends an end card, and encodes the final output as H.264 at CRF 20 with yuv420p pixel format for universal platform compatibility.

Pillow rendering FFmpeg encoding Ken Burns Crossfade transitions
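The Text Renderer's approach can be sketched with Pillow as below. The default bitmap font and fixed position are stand-ins for the bundled fonts and layout logic, and the four offset draws approximate the black stroke outline; the opacity-stepped copies are the fade-in frames.

```python
from PIL import Image, ImageDraw, ImageFont

def render_overlay(text, size=(1080, 1920), opacity_steps=8):
    """Transparent RGBA text overlay plus opacity-stepped fade-in frames."""
    base = Image.new("RGBA", size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(base)
    font = ImageFont.load_default()          # stand-in for a bundled TTF
    x, y = 80, size[1] // 2
    for dx, dy in ((-2, 0), (2, 0), (0, -2), (0, 2)):   # black outline
        draw.text((x + dx, y + dy), text, font=font, fill="black")
    draw.text((x, y), text, font=font, fill="white")    # white fill on top
    frames = []
    for i in range(1, opacity_steps + 1):               # fade-in steps
        frame = base.copy()
        alpha = frame.getchannel("A").point(
            lambda a, s=i / opacity_steps: int(a * s))
        frame.putalpha(alpha)
        frames.append(frame)
    return base, frames
```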

Key Technical Decisions

| Decision | Choice | Rationale |
|---|---|---|
| Video engine | Pillow + FFmpeg | The previous version used moviepy, which loads entire videos into RAM. FFmpeg streams frames and handles encoding natively. Pillow gives pixel-perfect control over text rendering (shadows, glow, custom fonts) without the limitations of FFmpeg's ASS subtitle engine. |
| Text rendering | Pre-rendered PNG overlays | Evaluated three approaches: ASS subtitles (limited animation, arcane syntax), Pillow overlays (full control, moderate perf cost), and Remotion/React (unlimited quality, wrong tech stack). Pillow is the best balance of quality ceiling and implementation complexity for a Python pipeline. |
| LLM output format | JSON mode | v2 parsed stanzas from free-text LLM output using regex (split on blank lines, strip "Stanza N:" labels). This broke when the model varied its formatting. JSON mode returns a structured array, eliminating all parsing fragility. |
| Image sourcing | Semantic stock search | The LLM generates image keywords as part of the poem output, informed by the emotional arc. Stock APIs (Unsplash, Pexels) are free and produce high-quality editorial photography. The interface is designed for a future swap to AI generation (DALL-E 3, Flux) without pipeline changes. |
| Feed integration | Markdown files on disk | Loose coupling between the Inoreader pipeline and the poem generator. No shared libraries, no API layer, no database dependency. Either project can evolve or be replaced independently. The contract is simply: structured markdown files in a known directory. |
| Checkpoint/resume | JSON state per run | Each pipeline stage writes its output to disk and updates a checkpoint file. A failed run (e.g., FFmpeg crash) can resume from the last completed stage without re-calling paid LLM and TTS APIs. |
| Encoding | H.264, CRF 20, yuv420p | H.264 is universally accepted by all target platforms. CRF 20 provides high-quality source material that survives platform re-encoding. yuv420p is mandatory; other pixel formats fail on mobile devices. H.265 offers better compression but has spotty platform support. |

Project Structure

poem-short-generator-v3/
    pyproject.toml                # Dependencies, project metadata
    .env.example                  # API keys template
    config.yaml                   # All tunables
    src/
        cli.py                    # CLI entry point
        config.py                 # Config loader + validation
        pipeline.py               # Orchestrator with checkpoint/resume
        feed/
            reader.py             # Parse Inoreader markdown exports
        llm/
            base.py               # Abstract LLM interface
            openai_provider.py    # OpenAI implementation
            anthropic_provider.py # Claude implementation
            summarizer.py         # Articles to summary
            poet.py               # Summary to stanzas + image keywords
        images/
            sourcer.py            # Image selection orchestrator
            unsplash.py           # Unsplash API client
            pexels.py             # Pexels API client (fallback)
        audio/
            tts.py                # OpenAI TTS with pacing
        video/
            composer.py           # FFmpeg video assembly
            text_renderer.py      # Pillow text overlay generation
            effects.py            # Ken Burns, crossfades, timing
            assets.py             # Image downloading + caching
    assets/
        fonts/                    # Bundled fonts
    tests/
    output/                       # Generated videos + intermediates

Poet Output Schema

The LLM returns a single structured JSON object containing the poem and visual direction. This eliminates the fragile text parsing of earlier versions and lets the image sourcer work from the same creative context as the poet.

{
  "title": "The Weight of Warming Seas",
  "stanzas": [
    "First stanza text\nwith line breaks preserved",
    "Second stanza text...",
    // ... up to 7 stanzas
  ],
  "image_groups": [
    {
      "keywords": ["stormy ocean", "dark waves", "dramatic sky"],
      "stanza_indices": [0, 1, 2],
      "mood": "ominous"
    },
    {
      "keywords": ["cracked earth", "drought", "desolation"],
      "stanza_indices": [3, 4],
      "mood": "stark"
    },
    {
      "keywords": ["green seedling", "golden light", "hope"],
      "stanza_indices": [5, 6],
      "mood": "hopeful"
    }
  ]
}
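Parsing that schema into typed objects is straightforward with dataclasses. This sketch also checks the one invariant the schema implies, that every stanza index is covered by exactly one image group; the class names are illustrative:

```python
import json
from dataclasses import dataclass

@dataclass
class ImageGroup:
    keywords: list
    stanza_indices: list
    mood: str

@dataclass
class Poem:
    title: str
    stanzas: list
    image_groups: list

def parse_poem(raw: str) -> Poem:
    """Parse the poet's JSON-mode output and validate stanza coverage."""
    data = json.loads(raw)
    groups = [ImageGroup(**g) for g in data["image_groups"]]
    covered = sorted(i for g in groups for i in g.stanza_indices)
    if covered != list(range(len(data["stanzas"]))):
        raise ValueError("every stanza must belong to exactly one image group")
    return Poem(data["title"], data["stanzas"], groups)
```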

Video Composition

The composer builds vertical video (1080 x 1920) through a multi-layer process designed for short-form social platforms.

Video Timeline

Title Card | Stanza 1 | xfade | Stanza 2 | xfade | ... | End Card
|—— 2s ——|—— audio duration ——|— 0.5s —|—— audio duration ——|— 0.5s —| ... |—— 2s ——|

Per stanza:
- Background: Ken Burns slow zoom (1.0x → 1.08x)
- Text: fade in 0.33s → hold → fade out 0.33s before transition
- Audio: stanza narration with 250ms line pauses, 800ms inter-stanza silence
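The timeline implies a total-duration calculation along these lines. This sketch assumes each crossfade overlaps the tail of one stanza clip and the head of the next (FFmpeg's xfade consumes that 0.5s rather than adding it) and that the title and end cards are hard cuts; defaults match the numbers above:

```python
def total_duration_s(stanza_audio_s, title_s=2.0, end_s=2.0, xfade_s=0.5):
    """Final video length: cards + stanza audio, minus crossfade overlap."""
    overlap = xfade_s * max(len(stanza_audio_s) - 1, 0)
    return title_s + sum(stanza_audio_s) + end_s - overlap
```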

Encoding Specification

| Parameter | Value | Notes |
|---|---|---|
| Resolution | 1080 x 1920 | 9:16 vertical, native for all target platforms |
| Frame rate | 30 fps | Sufficient for text + photo content; 60 fps not needed |
| Codec | H.264 (libx264) | Universal platform acceptance; H.265 has spotty support |
| CRF | 20 | High-quality source; platforms re-encode regardless |
| Pixel format | yuv420p | Mandatory for mobile playback compatibility |
| Audio | AAC @ 192 kbps | Preserves TTS quality for spoken-word content |
| Preset | medium | Balanced encoding speed and compression efficiency |
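As a raw command line, the table's parameters translate to the ffmpeg flags below; this sketch builds the argv directly, whereas the project drives FFmpeg through ffmpeg-python filter graphs:

```python
def encode_args(inp: str, out: str) -> list:
    """ffmpeg argv implied by the encoding table."""
    return [
        "ffmpeg", "-i", inp,
        "-c:v", "libx264", "-crf", "20", "-preset", "medium",
        "-pix_fmt", "yuv420p", "-r", "30",
        "-c:a", "aac", "-b:a", "192k",
        out,
    ]
```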

Operating Cost

All image sourcing and video encoding happen locally at zero cost; the only paid services are the LLM and TTS APIs. At one video per day, total cost stays well under $5/month, leaving headroom to upgrade models.

Budget
$0.40
per month
Haiku + GPT-4o-mini for LLM, tts-1 standard voices, stock images
Balanced
$1.20
per month
Claude Sonnet for LLM, tts-1-hd voices, stock images
Max Quality
$4.80
per month
Claude Opus for poetry, tts-1-hd, DALL-E 3 generated images

Technology Stack

Python 3.11+

Core runtime. Type hints throughout, dataclasses for structured data, pathlib for file operations.

Anthropic SDK / OpenAI SDK

LLM providers for summarization and poetry generation. Swappable via config with a shared abstract interface.

OpenAI TTS

Neural text-to-speech with HD voice models. Six voice options with configurable speed and output format.

FFmpeg

Video encoding, crossfade transitions, audio muxing. Called via ffmpeg-python bindings for clean filter graph construction.

Pillow

Text overlay rendering as transparent PNGs. Font loading, word wrapping, stroke outlines, opacity control for fade animations.

Unsplash / Pexels APIs

Free stock photography search with portrait orientation filtering. Local caching prevents redundant downloads.

# pyproject.toml dependencies
anthropic >= 0.40.0
openai >= 1.0.0
httpx >= 0.27.0          # Unsplash / Pexels API calls
pyyaml >= 6.0
click >= 8.0             # CLI framework
ffmpeg-python >= 0.2.0
pydub >= 0.25.0          # Audio silence padding
Pillow >= 10.0           # Text rendering + image processing
python-dotenv >= 1.0.0

# System dependency: FFmpeg (brew install ffmpeg)

Development Roadmap

Phase 1 — Core Pipeline

Get an end-to-end video out the door. Feed reader, LLM abstraction, summarizer, poet, image sourcer, TTS, basic video composition, and CLI.

Feed reader LLM providers Summarizer Poet (JSON) Image sourcer OpenAI TTS FFmpeg composer CLI entry point

Phase 2 — Visual Polish

Add the effects that make a video feel produced rather than generated. Transitions, text animations, Ken Burns motion, audio pacing, and platform-aware duration limits.

Ken Burns effect Crossfade transitions Text fade animations Title / end cards Line-break pauses Platform presets Bundled fonts

Phase 3 — Reliability

Checkpoint/resume, config validation, dry-run mode, image caching, structured logging, and daily automation via cron.

Checkpoint/resume Config validation Dry-run mode Image cache Cron automation Render command

Phase 4 — Upgrades

Swap stock photos for AI-generated imagery (DALL-E 3 / Flux), add a web interface, thumbnail generation, and direct platform uploads.

AI image generation Multi-variant runs YouTube thumbnails Streamlit UI Platform upload APIs