Flash runs your Python functions on remote GPU/CPU workers while you maintain local control flow. This page explains what happens when you call an @Endpoint function.
## What runs where
The @Endpoint decorator marks functions for remote execution. Everything else runs locally.
```python
import asyncio

from runpod_flash import Endpoint, GpuType

@Endpoint(name="demo", gpu=GpuType.NVIDIA_GEFORCE_RTX_4090)
def process_on_gpu(data):
    # This runs on a Runpod worker
    import torch
    return {"result": "processed"}

async def main():
    # This runs on your machine
    result = await process_on_gpu({"input": "data"})
    print(result)  # This runs on your machine

if __name__ == "__main__":
    asyncio.run(main())  # This runs on your machine
```
| Code | Location |
|---|---|
| `@Endpoint` decorator | Your machine (marks function) |
| Inside `process_on_gpu` | Runpod worker |
| Everything else | Your machine |
## Flash apps
When you build a Flash app:
Development (flash run):
- FastAPI server runs locally.
- `@Endpoint` functions run on Runpod workers.
Production (flash deploy):
- Each endpoint configuration becomes a separate Serverless endpoint.
- All endpoints run on Runpod.
## Execution flow
Here’s what happens when you call an @Endpoint function:
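At a high level, the local stub serializes your arguments, submits a job to the Serverless endpoint, waits for the worker to finish, and deserializes the return value. The runnable sketch below mocks that lifecycle locally so you can trace each step; `submit_job` and `wait_for_result` are illustrative stand-ins, not part of the Flash API.

```python
import asyncio
import json

async def call_endpoint(payload: dict) -> dict:
    # 1. Serialize the arguments on your machine.
    body = json.dumps(payload)

    # 2. Submit the job to the Serverless endpoint (mocked below).
    job_id = await submit_job(body)

    # 3. Wait for the worker to finish (a cold start can add 10-60 seconds).
    raw = await wait_for_result(job_id)

    # 4. Deserialize the worker's return value locally.
    return json.loads(raw)

# --- Mocked transport so the sketch runs standalone ---
_jobs: dict[str, str] = {}

async def submit_job(body: str) -> str:
    job_id = f"job-{len(_jobs)}"
    _jobs[job_id] = json.dumps({"result": "processed", "echo": json.loads(body)})
    return job_id

async def wait_for_result(job_id: str) -> str:
    await asyncio.sleep(0)  # stand-in for polling the job status
    return _jobs[job_id]

result = asyncio.run(call_endpoint({"input": "data"}))
```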
## Endpoint naming
Flash identifies endpoints by their name parameter:
```python
@Endpoint(
    name="inference",  # This identifies the endpoint
    gpu=GpuType.NVIDIA_A100_80GB_PCIe,
    workers=3
)
def run_inference(data): ...
```
- Same name, same config: Reuses the existing endpoint.
- Same name, different config: Updates the endpoint automatically.
- New name: Creates a new endpoint.
This means you can change parameters like `workers` without creating a new endpoint; Flash detects the change and updates the existing one.
## Worker lifecycle
Workers scale up and down based on demand and your configuration.
### Worker states
| State | Description | Billing |
|---|---|---|
| Initializing | Downloading image, loading code | Yes |
| Idle | Scaled down, waiting for requests | No |
| Running | Processing requests | Yes |
| Throttled | Temporarily unable to run due to host resource constraints | No |
| Outdated | Marked for replacement after update | Yes (while processing) |
| Unhealthy | Crashed; auto-retries for up to 7 days | No |
### Scaling behavior
```python
@Endpoint(
    name="demo",
    gpu=GpuGroup.ANY,
    workers=(0, 5),   # (min, max): scale to zero when idle, up to 5 workers
    idle_timeout=60   # Seconds before running workers scale down
)
def process(data): ...
```
Example:

- First job arrives → Scale to 1 worker (cold start).
- More jobs arrive while the worker is busy → Scale up to max workers.
- Jobs complete → Workers stay running for `idle_timeout` seconds before scaling down to idle.
- No new jobs → Scale down to min workers.
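The steps above can be sketched as a toy autoscaler. The one-second tick, the scale-up rule, and the idle counter are illustrative assumptions, not Flash internals:

```python
def simulate_scaling(events, min_workers=0, max_workers=5, idle_timeout=60):
    """events: queued-job counts, one per one-second tick.
    Returns the worker count after each tick."""
    workers = min_workers
    idle_for = 0
    history = []
    for queued in events:
        if queued > workers:
            # Jobs waiting: scale up toward max (new workers cold start).
            workers = min(max_workers, queued)
            idle_for = 0
        elif queued == 0 and workers > min_workers:
            # No work: count idle seconds, scale down after idle_timeout.
            idle_for += 1
            if idle_for >= idle_timeout:
                workers = min_workers
                idle_for = 0
        else:
            idle_for = 0
        history.append(workers)
    return history

# 1 job arrives, then 3 concurrent jobs, then quiet for over a minute.
trace = simulate_scaling([1, 3, 3, 0] + [0] * 60, idle_timeout=60)
```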
## Cold starts and warm starts
Understanding cold and warm starts helps you predict latency and set expectations.
### Cold start
A cold start occurs when no workers are available to handle your job, because:
- You’re calling an endpoint for the first time.
- All workers have been scaled down after not processing requests for `idle_timeout` seconds.
- All running workers are busy processing requests.
What happens during a cold start:
- Runpod provisions a new worker with your configured GPU/CPU.
- The worker image starts (dependencies are pre-installed during build).
- Your function executes.
Typical timing: 10-60 seconds total, depending on GPU availability and image size.
When using flash build or flash deploy, dependencies are pre-installed in the worker image, eliminating pip installation at request time. When running standalone scripts with @Endpoint functions outside of a Flash app, dependencies may be installed on the worker at request time.
### Warm start
A warm start occurs when a worker is already running and idle:
- Worker completed a previous job and is waiting for more work.
- Worker is within its `idle_timeout` period.
What happens during a warm start:
- Job is routed immediately to the idle worker.
- Your function executes.
Typical timing: ~1 second + your function’s execution time.
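One way to see the difference is to time two consecutive calls. The sketch below mocks the endpoint with a one-time startup delay standing in for provisioning; against a real `@Endpoint` function you would wrap `await process_on_gpu(...)` with the same timers.

```python
import asyncio
import time

_worker_ready = False

async def mock_endpoint(data):
    # Stand-in for an @Endpoint function: the first call pays a
    # startup delay (a real cold start is 10-60 seconds).
    global _worker_ready
    if not _worker_ready:
        await asyncio.sleep(0.2)
        _worker_ready = True
    return {"result": "processed"}

async def main():
    t0 = time.perf_counter()
    await mock_endpoint({"input": "data"})
    cold = time.perf_counter() - t0   # includes the startup delay

    t0 = time.perf_counter()
    await mock_endpoint({"input": "data"})
    warm = time.perf_counter() - t0   # worker already running
    return cold, warm

cold, warm = asyncio.run(main())
```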
### The relationship between configuration and starts
Your `workers` and `idle_timeout` settings directly affect cold start frequency:

- `workers=(0, n)`: Workers scale to zero when not processing. Every request after the `idle_timeout` period triggers a cold start.
- `workers=(1, n)`: At least one worker stays ready. The first concurrent request is warm; additional requests may cold start.
- Higher `idle_timeout`: Workers stay running longer before scaling down, reducing cold starts for sporadic traffic.
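As an illustration of the trade-off (the values here are examples, not recommendations), a latency-sensitive endpoint with sporadic traffic might keep one warm worker and a generous idle timeout:

```python
from runpod_flash import Endpoint, GpuType

@Endpoint(
    name="inference",
    gpu=GpuType.NVIDIA_A100_80GB_PCIe,
    workers=(1, 5),    # min=1: one worker stays ready, so the first request is warm
    idle_timeout=300,  # extra workers linger 5 minutes before scaling down
)
def run_inference(data): ...
```

A batch workload that tolerates latency would instead use `workers=(0, n)` and a short `idle_timeout` to avoid paying for idle-but-running workers.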
See configuration best practices for specific recommendations based on your workload.