
Replicate

Pay-per-second GPU compute for running AI models via API—no infrastructure management

Serverless GPU platform enabling developers to run open-source and fine-tuned AI models with simple API calls; pricing by hardware and runtime.

  • 5 news items tracked
  • First launched in 2021
  • Easy learning curve
The Feature

About Replicate

Replicate is a managed GPU compute platform that abstracts away infrastructure complexity, letting developers run machine learning models with single API calls. Rather than managing AWS EC2, Kubernetes clusters, or NVIDIA driver configurations, users select a model, configure parameters, and Replicate handles GPU allocation, warm-start optimization, and resource cleanup.

The platform bills per second of compute time, from $0.000025/sec for CPU (micro tasks) to $0.0122/sec for 8x H100 GPUs (cutting-edge LLMs). Common GPUs include the Nvidia T4 ($0.81/hour), L40S ($3.51/hour), A100 ($5.04/hour), and H100 ($5.49/hour). Some models bill per token (LLM outputs) or per generated item (image generation) instead of compute time.

Replicate hosts 10,000+ models, including Stable Diffusion, SDXL, Flux, Llama, Mistral, Whisper, and community submissions. The API is language-agnostic (Python, JavaScript, Go, etc.), with built-in async execution, webhooks, and batch processing; every prediction includes logs and performance metrics. For production use, Replicate offers dedicated GPU instances with predictable pricing, no cold-start overhead, and fine-tuning support, while a ComfyUI integration enables complex node-based workflows. It is a strong fit for SaaS applications, content platforms, and automation pipelines that want GPU inference without upfront capital expenditure.
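
For a feel of the developer experience, here is a minimal sketch using the official Python client; the model ref is illustrative and a REPLICATE_API_TOKEN is assumed to be set in the environment:

    # pip install replicate; expects REPLICATE_API_TOKEN to be set
    import replicate

    # One call: Replicate pulls the model container, allocates a GPU, runs
    # inference, and returns the output. The model ref is illustrative;
    # check replicate.com for current names and version hashes.
    output = replicate.run(
        "stability-ai/sdxl",
        input={"prompt": "a cinematic wide shot of a lighthouse at dusk"},
    )
    print(output)  # typically a list of URLs to the generated images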

Key Features

  • 10,000+ pre-built models (Stable Diffusion, SDXL, Flux, Llama, Whisper, etc.)
  • Simple REST API for model inference
  • Async execution with webhook notifications (see the sketch after this list)
  • Batch processing for thousands of predictions in parallel
  • Pay-per-second billing: you are only charged during active compute
  • Built-in metrics and performance logging
  • ComfyUI integration for node-based workflows
  • Fine-tuning support with custom training data
  • Dedicated GPU instances for deterministic performance
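
As a concrete illustration of the async flow, a minimal sketch with the Python client; the version hash and webhook URL below are placeholders, not real values:

    # pip install replicate; REPLICATE_API_TOKEN must be set
    import replicate

    # Fire-and-forget: create() returns immediately, and Replicate POSTs
    # the finished prediction to your webhook when it completes.
    prediction = replicate.predictions.create(
        version="MODEL_VERSION_HASH",                   # placeholder
        input={"audio": open("interview.mp3", "rb")},
        webhook="https://example.com/hooks/replicate",  # placeholder
        webhook_events_filter=["completed"],
    )
    print(prediction.id, prediction.status)  # e.g. 'starting'
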
The Verdict

When to reach for it — and when to skip

Reach for it when…

  • No GPU infrastructure setup: no NVIDIA drivers, CUDA, or PyTorch wrangling in your stack
  • Pay-as-you-go: scale from 1 to 1,000,000 predictions/month without a contract
  • Single API call for complex models: Stable Diffusion takes one line of code
  • Fast cold-start: most models boot in <5 seconds (sub-second on dedicated instances)
  • Built-in async + webhooks: perfect for background jobs and batch processing
  • Auto-scaling: handle traffic spikes without provisioning overhead
  • 10,000+ pre-built models: no custom container management
  • Cost-effective for variable workloads vs. reserved EC2 instances (worked example below)
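
To make the cost point concrete, a back-of-envelope sketch using the A100 rate from the Pricing section below; the reserved-instance price and the utilization figure are assumptions, not quotes:

    # A100 on Replicate vs. a hypothetical reserved GPU instance.
    A100_PER_SEC = 0.0014      # $/sec, from the Pricing section below
    RESERVED_MONTHLY = 1200.0  # assumed reserved-instance price, $/month

    hours_per_month = 80       # variable-workload assumption
    replicate_cost = A100_PER_SEC * 3600 * hours_per_month
    print(f"Replicate: ${replicate_cost:.2f}/mo vs reserved: ${RESERVED_MONTHLY:.0f}/mo")
    # ~$403/mo at 80 GPU-hours; break-even sits near 1200 / 5.04 ≈ 238 hours,
    # so the reserved box only wins above roughly 33% utilization.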

Skip it when…

  • Per-second billing adds up quickly on long-running tasks (e.g., video generation)
  • Outbound data transfer costs (images, videos) can exceed compute cost
  • Cold-start latency varies (typically 1-10 seconds for popular models)
  • Limited customization: no control over GPU driver versions or CUDA setup
  • Fine-tuning not available for all models
  • Potential vendor lock-in (migrating custom models to self-hosted requires work)

Best For

✓ Ideal for

  • SaaS applications adding AI features (generate images, transcribe audio, summarize text)
  • Content platforms running inference on user uploads
  • Automation pipelines: batch process 10,000 images overnight (see the sketch after this list)
  • Real-time web/mobile apps with variable traffic (no idle GPU costs)
  • Proof-of-concept AI projects before building in-house infrastructure
  • One-off batch jobs (e.g., dataset creation, model evaluation)
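
For the overnight-batch pattern, a minimal fan-out sketch with the Python client; the version hash and file names are placeholders:

    # pip install replicate; REPLICATE_API_TOKEN must be set
    import replicate, time

    # Fan out: each create() returns immediately, so a night's worth of
    # frames can run in parallel on Replicate's fleet.
    ids = []
    for path in ["shot_001.png", "shot_002.png"]:  # imagine 10,000 of these
        p = replicate.predictions.create(
            version="UPSCALER_VERSION_HASH",       # placeholder
            input={"image": open(path, "rb")},
        )
        ids.append(p.id)

    # Poll until everything lands (a webhook endpoint scales better).
    done = ("succeeded", "failed", "canceled")
    while ids:
        ids = [i for i in ids if replicate.predictions.get(i).status not in done]
        time.sleep(10)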

✗ Not built for

  • High-volume production workloads (long-term reserved instances are cheaper)
  • Latency-critical applications requiring <100ms inference
  • Proprietary models requiring on-premise or air-gapped deployment
  • Continuous training pipelines (training not optimized)
Field Notes

Working Tips from Filmmakers Using Replicate

  1. Use upscaling models on Replicate (ESRGAN, Real-ESRGAN) for overnight batch processing: queue 1,000 proxy clips at 2K, execute on an H100 GPU ($0.001525/sec ≈ $0.09/min ≈ $5.49/hour), and save 4K masters to S3 without local VRAM constraints
  2. Combine Stable Diffusion + ComfyUI workflows via the Replicate API: design the node graph locally in ComfyUI, export the JSON, and send it to Replicate with LoRA + ControlNet parameters for batch concept-art generation
  3. Leverage Whisper ASR for automated subtitle generation: transcribe rushes and interviews via Replicate's async API, let a webhook notify you on completion, and parse the JSON for SRT export (see the sketch after this list)
  4. Run video interpolation models (RIFE, FILM) for slow-motion creation: input 24fps footage, output 60fps via Replicate, and execute 100+ clips in parallel for fast batch turnarounds
  5. Build an AI color grading pipeline: fine-tune a StyleGAN or diffusion-based color model on your DCI-P3 graded reference, then run it on Replicate for one-click color matching across multi-camera shoots
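
A sketch of the subtitle workflow in tip 3, assuming the Whisper model returns whisper-style segments with start, end, and text fields (check the model's output schema before relying on this):

    import replicate

    # Transcribe, then write whisper-style segments out as an .srt file.
    # The output shape ({"segments": [{start, end, text}, ...]}) is assumed.
    out = replicate.run("openai/whisper", input={"audio": open("rushes.wav", "rb")})

    def ts(sec: float) -> str:
        # 3723.456 -> "01:02:03,456" (SRT timestamp format)
        h, rem = divmod(int(sec), 3600)
        m, s = divmod(rem, 60)
        return f"{h:02}:{m:02}:{s:02},{int(sec % 1 * 1000):03}"

    with open("rushes.srt", "w") as f:
        for i, seg in enumerate(out["segments"], 1):
            f.write(f"{i}\n{ts(seg['start'])} --> {ts(seg['end'])}\n"
                    f"{seg['text'].strip()}\n\n")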

Pricing

Public Models (Pay-Per-Second)
$0.000025-$0.0122/sec
Per-second compute + outbound data
  • Thousands of open-source models
  • CPU: $0.000025/sec ($0.09/hour)
  • Nvidia T4 GPU: $0.000225/sec ($0.81/hour)
  • Nvidia L40S GPU: $0.000975/sec ($3.51/hour)
  • Nvidia A100 GPU: $0.001400/sec ($5.04/hour)
  • Nvidia H100 GPU: $0.001525/sec ($5.49/hour)
  • Async execution with webhooks
  • Batch processing via API
  • ComfyUI node-based workflows
Token-Based Billing (LLMs)
$3.00/M input, $15.00/M output tokens
Per-token consumption (Claude, Llama, etc.)
  • Claude 3.7 Sonnet: $3.00/M input, $15.00/M output tokens
  • Llama 2 70B: $0.00075/K input, $0.001/K output tokens
  • Mistral: variable pricing by model
  • Full model parameter access
Private/Dedicated Models
Setup + idle + active billing
Monthly or hourly
  • Fast-booting fine-tunes (no idle charges)
  • Guaranteed GPU availability
  • Custom model container support
  • Zero cold-start latency
  • Higher throughput (batch predictions)
Enterprise Program
Custom volume pricing
Annual contract
  • Volume discounts (50%+ savings possible)
  • Dedicated account manager
  • Priority GPU access
  • Custom SLA guarantees
  • On-premise or hybrid deployment options

The True Cost

  • Credits: N/A (pay-as-you-go)
  • Export: API-based, unlimited exports
  • Refunds: Unused balance refunds available
  • Commercial use: Allowed
  • Watermark: No

Use Cases

  • Image generation API: add SDXL to a web app for user-customized graphics
  • Audio transcription: batch transcribe thousands of podcast episodes via Whisper
  • Content moderation: run a safety classifier on user-uploaded images
  • Video upscaling: process 4K upscaling overnight, upload to S3
  • LLM-powered chatbots: run Llama or Mistral for context-aware responses
  • Data augmentation: generate synthetic training data for ML models

Integrations

  • REST API (language-agnostic)
  • Python SDK (pip install replicate)
  • JavaScript/Node.js SDK
  • Go, Ruby, Java community SDKs
  • Slack bot integration (push predictions to Slack)
  • GitHub Actions for CI/CD workflows
  • ComfyUI for node-based workflows
  • AWS Lambda, Google Cloud Functions for serverless execution (see the sketch below)
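
On the serverless side, the receiving end of a Replicate webhook can be as small as this Flask sketch; the route is illustrative, and the field names follow the prediction object that Replicate POSTs as JSON:

    # pip install flask -- a minimal receiver for Replicate webhook POSTs
    from flask import Flask, request

    app = Flask(__name__)

    @app.route("/hooks/replicate", methods=["POST"])
    def replicate_hook():
        pred = request.get_json(force=True)  # the prediction object
        if pred.get("status") == "succeeded":
            print(pred["id"], "->", pred.get("output"))  # e.g. output URLs
        return "", 204

    if __name__ == "__main__":
        app.run(port=8000)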

Tags

#pay-as-you-go #gpu-inference #api-first #serverless #no-infrastructure
