Ad & Marketing

ComfyUI & Runway
Commercial Video Pipelines

NYC commercial production costs — locations, actors, union crews — are astronomical. Generative video pipelines deliver comparable output in hours: 50 product photo variants, motion ads, multilingual voiceover, and lip-synced talent footage.

50 photo variants in 20 minutes
Days → Hours production time
3+ languages from one shoot
90% cost reduction vs. NYC production

ComfyUI · Runway Gen-3 · Google Veo · ElevenLabs · n8n · Frame.io

Typical build: 4–6 week sprint · Fixed price · Zero delivery risk

To be built — runs on every creative brief
Pipeline: Creative Brief (input) → ComfyUI (product photos) → Runway Gen-3 (video) → ElevenLabs (voiceover + lip sync, multilingual) → Frame.io (deliver)

Photo variants: 50 in 20 min
Languages: unlimited dub
Formats: 16:9, 9:16, 1:1

The problem

Why traditional commercial production is broken for most brands

NYC/LA commercial production costs $50–250k per spot

Location fees, union crew rates, talent buyouts, post-production, and agency markups stack up fast. Most brands cannot justify the spend for the testing volume required by modern paid media — you need 20 creative variants, not 1 polished spot.

Production timelines are weeks — paid media needs assets in hours

A traditional shoot takes 2–4 weeks from brief to final delivery. By the time assets are ready, the media window has shifted. Generative pipelines produce test-ready assets in hours, enabling real-time creative testing at scale.

International markets are neglected because dubbing is expensive

Re-shooting talent for multilingual markets is cost-prohibitive. Dubbing with human voice actors and lip-sync work adds $5–20k per language. ElevenLabs plus automated lip-sync cuts this cost to near zero, with results that are hard to distinguish from human dubbing.

How it works

Every step, explained

This is the actual workflow Kovil AI engineers can build and deploy — not a diagram. Here is what runs inside every node.

1
Creative Brief

Creative brief submitted with product images, mood board, and target audience

The workflow begins with a structured creative brief: product images (minimum 3 clean shots on neutral background), a mood board (3–5 reference images showing desired aesthetic and lighting style), target audience description, brand color palette, and intended placement formats (16:9 for YouTube/TV, 9:16 for Reels/TikTok, 1:1 for Meta feed). The brief is submitted via a Notion form or uploaded directly to the n8n workflow trigger. GPT-4o validates completeness before proceeding.

Creative Brief · Notion Form · n8n Trigger · Product Images · Mood Board · Format Specifications
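As a sketch, the completeness check in step 1 can be expressed as a simple rule set evaluated before any GPT-4o call. Field names here are illustrative, not the production schema:

```python
# Minimal brief validator: each required field maps to a predicate that
# encodes the rule from the brief spec (>=3 product shots, 3-5 mood images, etc.).
REQUIRED_FIELDS = {
    "product_images": lambda v: isinstance(v, list) and len(v) >= 3,
    "mood_board": lambda v: isinstance(v, list) and 3 <= len(v) <= 5,
    "target_audience": lambda v: isinstance(v, str) and v.strip() != "",
    "brand_palette": lambda v: isinstance(v, list) and len(v) >= 1,
    "formats": lambda v: len(v) >= 1 and set(v) <= {"16:9", "9:16", "1:1"},
}

def validate_brief(brief: dict) -> list[str]:
    """Return human-readable problems; an empty list means the brief passes."""
    problems = []
    for field, ok in REQUIRED_FIELDS.items():
        if field not in brief:
            problems.append(f"missing field: {field}")
        elif not ok(brief[field]):
            problems.append(f"invalid value for: {field}")
    return problems
```

In the live pipeline this runs inside an n8n function node; briefs with a non-empty problem list are bounced back to the Notion form instead of entering generation.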
2
ComfyUI

ComfyUI generates 50 product photography variants in 20 minutes

ComfyUI runs a custom workflow designed for commercial product photography. It takes the product images and mood board as inputs and generates 50 lighting and staging variants: studio white, lifestyle environment, dramatic shadow, hero shot with gradient background, and seasonal/contextual scenes. Each variant maintains consistent product representation — no distortion or hallucinated details. The workflow runs on GPU infrastructure and completes 50 variants in approximately 20 minutes. The best 10 are automatically scored by a CLIP similarity model against the mood board.

ComfyUI · Product Photography · 50 Variants · Lighting Variants · CLIP Scoring · 20 Minutes
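The "best 10 by CLIP similarity" selection in step 2 reduces to cosine similarity over embeddings. In production the vectors come from a CLIP model (e.g. via open_clip); the scoring and top-k logic itself is just this:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(variant_embeds: dict, mood_embeds: list, k: int = 10) -> list:
    """Rank variants by mean similarity to the mood-board embeddings,
    return the names of the k best."""
    scored = {
        name: sum(cosine(v, m) for m in mood_embeds) / len(mood_embeds)
        for name, v in variant_embeds.items()
    }
    return sorted(scored, key=scored.get, reverse=True)[:k]
```

Averaging against all mood-board images (rather than scoring against one) rewards variants that match the overall aesthetic instead of a single reference.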
3
Runway Gen-3 Alpha

Best hero shots sent to Runway Gen-3 for motion: product reveal, lifestyle b-roll, 15-second cuts

The top 3 scored product hero shots are sent to Runway Gen-3 Alpha via API. For each image, three motion prompts are generated by GPT-4o based on the mood board and placement format: a product reveal with subtle camera pull, a lifestyle b-roll sequence, and a 15-second ad cut with motion graphics pacing. Runway generates 4-second clips per prompt. n8n stitches the best clips into a preliminary cut and stores them in Frame.io for the creative team's review.

Runway Gen-3 Alpha · Motion Generation · Product Reveal · Lifestyle B-Roll · 15-Second Cuts · Frame.io Storage
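The three motion prompts per hero shot follow a fixed pattern, which a small template builder captures. In the real pipeline GPT-4o writes these from the mood board; the templates below are an illustrative stand-in for the three prompt categories:

```python
# One template per motion category named in step 3; {product} and {mood}
# are filled from the creative brief.
MOTION_TEMPLATES = {
    "reveal": "Product reveal of {product}, slow camera pull-back, {mood} lighting",
    "broll": "Lifestyle b-roll of {product} in use, handheld feel, {mood} palette",
    "ad_cut": "15-second ad pacing, {product} hero framing, quick cuts, {mood} tone",
}

def motion_prompts(product: str, mood: str) -> dict:
    """Build the three motion prompts sent to Runway for one hero shot."""
    return {name: t.format(product=product, mood=mood)
            for name, t in MOTION_TEMPLATES.items()}
```

With 3 scored hero shots this yields 9 prompts per brief, each dispatched as a separate Runway generation request by n8n.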
4
Google Veo

For 60-second commercials, Google Veo generates scene-by-scene video from storyboard prompts

For longer-form commercial production (30–60 seconds), Google Veo is used instead of Runway. GPT-4o converts the creative brief into a storyboard with 8–12 scene descriptions, each specifying camera angle, action, and emotional tone. Veo generates each scene sequentially. Scenes are assembled by n8n into the complete commercial cut, with transition timing matched to the audio track. Google Veo produces significantly longer clips with stronger cinematographic consistency than Runway — better suited for TV and pre-roll formats.

Google Veo · Storyboard Generation · Scene-by-Scene Production · 60-Second Commercials · TV Format · Pre-Roll
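The transition-timing step can be sketched as slotting 8–12 scenes into the commercial's runtime with a small overlap between adjacent scenes for cross-dissolves. This is an illustrative model of what the n8n assembly node computes, not its actual implementation:

```python
def scene_timeline(n_scenes: int, total_s: float = 60.0, transition_s: float = 0.5):
    """Return (start, end) times in seconds for each scene. Adjacent scenes
    overlap by transition_s so a cross-dissolve can be applied at assembly."""
    if not 8 <= n_scenes <= 12:
        raise ValueError("storyboard should have 8-12 scenes")
    # Each overlap hands time back, so scenes are slightly longer than total/n.
    scene_len = (total_s + transition_s * (n_scenes - 1)) / n_scenes
    timeline, start = [], 0.0
    for _ in range(n_scenes):
        timeline.append((round(start, 3), round(start + scene_len, 3)))
        start += scene_len - transition_s
    return timeline
```

The last scene's end time lands exactly on the total duration, so the cut always fits the audio track it is matched against.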
5
ElevenLabs

ElevenLabs dubs voiceover in target language; lip-sync layer applied

Voiceover script is generated by GPT-4o based on the brief and ad format. ElevenLabs generates the voiceover audio in the target language using a voice profile matching the brand's tone (warm and approachable, authoritative, energetic, etc.). For ads featuring on-camera talent, a lip-sync layer is applied using Runway's lip-sync capability or a dedicated lip-sync model, adjusting mouth movements to match the dubbed audio. This enables one production to become multilingual without re-shooting talent.

ElevenLabs · Voiceover Generation · Multi-Language Dub · Lip Sync · Voice Profile Matching · Multilingual Production
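Voice-profile matching amounts to mapping brand tone to voice settings and building the text-to-speech request. The endpoint and setting names below reflect the public ElevenLabs API at the time of writing; the tone values are illustrative, and the `voice_id` is whatever brand voice was set up during the sprint:

```python
# Illustrative tone -> ElevenLabs voice_settings mapping; lower stability
# sounds more expressive, higher sounds steadier.
TONE_SETTINGS = {
    "warm":          {"stability": 0.6,  "similarity_boost": 0.8},
    "authoritative": {"stability": 0.8,  "similarity_boost": 0.7},
    "energetic":     {"stability": 0.35, "similarity_boost": 0.75},
}

def tts_request(script: str, voice_id: str, tone: str,
                model_id: str = "eleven_multilingual_v2"):
    """Build URL and JSON body for an ElevenLabs text-to-speech call;
    the n8n HTTP node sends it with an xi-api-key header."""
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"
    body = {
        "text": script,
        "model_id": model_id,
        "voice_settings": TONE_SETTINGS[tone],
    }
    return url, body
```

Because the multilingual model reads the script's language from the text itself, the same request shape serves every target language in the dub pipeline.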
6
Frame.io Delivery

Final assets delivered to Frame.io organized by format and language

n8n assembles the final asset package and uploads to Frame.io with a structured folder hierarchy: by placement format (16:9, 9:16, 1:1), then by language version (EN, ES, FR, etc.), then by variant (hero shot, lifestyle, product focus). Each asset is tagged with metadata: format, language, duration, and the Runway/Veo generation parameters used — enabling reproduction of any specific output. A Frame.io review link is generated and posted to the client's Slack channel with a structured handoff message including revision notes and format checklist.

Frame.io · Structured Delivery · Format Organization · Language Versions · Asset Metadata · Slack Handoff
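The format → language → variant hierarchy and the reproducibility metadata can be sketched as a single record builder. Function and field names are illustrative; the actual upload goes through the Frame.io API from n8n:

```python
def asset_record(fmt: str, lang: str, variant: str,
                 duration_s: float, gen_params: dict):
    """Folder path and metadata for one delivered asset, following the
    format -> language -> variant hierarchy described above."""
    if fmt not in {"16:9", "9:16", "1:1"}:
        raise ValueError(f"unknown placement format: {fmt}")
    # ':' is unsafe in folder names, so 16:9 becomes 16x9 on disk.
    path = f"{fmt.replace(':', 'x')}/{lang.upper()}/{variant}"
    meta = {
        "format": fmt,
        "language": lang.upper(),
        "duration_s": duration_s,
        "generation": gen_params,  # Runway/Veo parameters, for reproduction
    }
    return path, meta
```

Storing the generation parameters alongside each file is what lets any delivered variant be regenerated later without guessing seeds or prompts.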
Tech stack

Every tool in the workflow

ComfyUI

Product photography

Generates 50 product photography variants per brief. Runs custom workflows for consistent lighting and brand palette. CLIP scoring selects the best outputs.

Runway Gen-3

Short-form video

Converts hero product shots into motion: product reveals, lifestyle b-roll, and 15-second ad cuts optimized for social placements.

Google Veo

Long-form video

Produces 30–60 second commercials from storyboard prompts. Better suited for TV and pre-roll formats requiring cinematographic consistency.

ElevenLabs

Voice + dub

Generates brand-matched voiceover in any language. Enables multilingual ad production without re-shooting on-camera talent.

n8n

Pipeline orchestration

Connects brief intake to ComfyUI, Runway/Veo, ElevenLabs, and Frame.io delivery in a single automated production pipeline.

Frame.io

Asset delivery

Receives final assets organized by format and language. Client-accessible review links generated automatically after upload.

What we build

A 4–6 week sprint. Production ready.

Kovil AI engineers scope, build, test, and deploy this generative video pipeline end-to-end. You submit a brief and receive production-ready assets in hours.

  • ComfyUI custom workflow for product photography with CLIP scoring
  • Runway Gen-3 Alpha integration for motion and short-form ad cuts
  • Google Veo integration for 30–60 second commercial production
  • ElevenLabs voice profile setup and multilingual dub pipeline
  • Automated lip-sync layer for on-camera talent footage
  • Frame.io delivery with structured folder hierarchy and metadata tagging
  • n8n pipeline connecting brief intake to final asset delivery
Sprint timeline: 4–6 weeks
Week 1–2: ComfyUI + Runway
  • ComfyUI product photography workflow + Runway Gen-3 motion pipeline
Week 3–4: Veo + ElevenLabs
  • Google Veo 60-sec commercial pipeline + ElevenLabs voice + lip-sync
Week 5–6: Delivery + deploy
  • Frame.io delivery system + n8n pipeline integration + full deployment
FAQ

Common Questions

How realistic is ComfyUI-generated product photography?

ComfyUI with the right model stack (FLUX or SDXL with product-specific LoRA fine-tuning) produces product photography that is hard to distinguish from a professional studio shoot in social media and digital ad contexts. The workflow generates 50 variants in approximately 20 minutes — a task that would take a professional photographer and post-production team 2–3 days and several thousand dollars.

Can Runway and Google Veo produce broadcast-quality video?

Runway Gen-3 Alpha and Google Veo 2 produce video at sufficient quality for digital advertising, social media, and streaming platforms. For broadcast television, the output typically requires compositing with traditional footage. Most New York agency use cases — Meta Ads, YouTube pre-roll, LinkedIn video, and OTT advertising — are fully served by generative video without traditional production.

How does the multilingual lip-sync work?

ElevenLabs generates the voiceover in the target language. A lip-sync model (currently Sync.so or similar) processes the original on-camera footage and adjusts the talent's lip movements to match the new audio track. The result is a version of the video where the talent appears to be speaking the target language natively — without a separate shoot.

What is the cost saving compared to traditional commercial production in NYC?

A 60-second commercial in New York City typically costs $50,000–$250,000 for location, crew, talent, and post-production. The generative pipeline produces comparable digital advertising output for a fraction of this — primarily the Kovil AI build cost and per-use API fees. The economic disruption is most pronounced for product advertising, lifestyle imagery, and multilingual campaign variants.

Production-quality video. Hours, not weeks.

Book a 30-minute discovery call. Kovil AI engineers will scope the ComfyUI product photography setup, Runway/Veo video pipeline, and ElevenLabs multilingual dub workflow for your specific brand — fixed price, zero delivery risk.

Browse other workflows

Typical sprint: 4–6 weeks · Fixed-price · Fully managed delivery · Post-launch support included