This page provides a complete reference for all parameters available on the Endpoint class.
## Parameter overview

| Parameter | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Endpoint name (required unless `id` is used) | - |
| `id` | `str` | Connect to an existing endpoint by ID | `None` |
| `gpu` | `GpuGroup`, `GpuType`, or list | GPU type(s) for the endpoint | `GpuGroup.ANY` |
| `cpu` | `str` or `CpuInstanceType` | CPU instance type (mutually exclusive with `gpu`) | `None` |
| `workers` | `int` or `(min, max)` | Worker scaling configuration | `(0, 1)` |
| `idle_timeout` | `int` | Seconds before scaling down idle workers | `60` |
| `dependencies` | `list[str]` | Python packages to install | `None` |
| `system_dependencies` | `list[str]` | System packages to install (apt) | `None` |
| `accelerate_downloads` | `bool` | Enable download acceleration | `True` |
| `volume` | `NetworkVolume` | Network volume for persistent storage | `None` |
| `datacenter` | `DataCenter` | Preferred datacenter | `EU_RO_1` |
| `env` | `dict[str, str]` | Environment variables | `None` |
| `gpu_count` | `int` | GPUs per worker | `1` |
| `execution_timeout_ms` | `int` | Max execution time in milliseconds | `0` (no limit) |
| `flashboot` | `bool` | Enable Flashboot fast startup | `True` |
| `image` | `str` | Custom Docker image to deploy | `None` |
| `scaler_type` | `ServerlessScalerType` | Scaling strategy | auto |
| `scaler_value` | `int` | Scaling threshold | `4` |
| `template` | `PodTemplate` | Pod template overrides | `None` |
## Parameter details

### name

Type: `str`
Required: Yes (unless `id` is specified)

The endpoint name visible in the Runpod console. Use descriptive names to easily identify endpoints.

```python
@Endpoint(name="ml-inference-prod", gpu=GpuGroup.ANY)
async def infer(data): ...
```

Use naming conventions like `image-generation-prod` or `batch-processor-dev` to organize your endpoints.
### id

Type: `str`
Default: `None`

Connect to an existing deployed endpoint by its ID. When `id` is specified, `name` is not required.

```python
# Connect to an existing endpoint
ep = Endpoint(id="abc123xyz")

# Make requests
job = await ep.run({"prompt": "hello"})
result = await ep.post("/inference", {"data": "..."})
```
### gpu

Type: `GpuGroup`, `GpuType`, or `list[GpuGroup | GpuType]`
Default: `GpuGroup.ANY` (if neither `gpu` nor `cpu` is specified)

Specifies GPU hardware for the endpoint. Accepts a single GPU type/group, or a list to define a fallback order.

```python
from runpod_flash import Endpoint, GpuType, GpuGroup

# Specific GPU type
@Endpoint(name="inference", gpu=GpuType.NVIDIA_A100_80GB_PCIe)
async def infer(data): ...

# Another specific GPU type
@Endpoint(name="rtx-worker", gpu=GpuType.NVIDIA_GEFORCE_RTX_4090)
async def process(data): ...

# Multiple types for fallback
@Endpoint(
    name="flexible",
    gpu=[
        GpuType.NVIDIA_A100_80GB_PCIe,
        GpuType.NVIDIA_RTX_A6000,
        GpuType.NVIDIA_GEFORCE_RTX_4090,
    ],
)
async def flexible_infer(data): ...
```

See GPU types for all available options.
### cpu

Type: `str` or `CpuInstanceType`
Default: `None`

Specifies a CPU instance type. Mutually exclusive with `gpu`.

```python
from runpod_flash import Endpoint, CpuInstanceType

# String shorthand
@Endpoint(name="data-processor", cpu="cpu5c-4-8")
async def process(data): ...

# Using the enum
@Endpoint(name="data-processor", cpu=CpuInstanceType.CPU5C_4_8)
async def process(data): ...
```

See CPU types for all available options.
### workers

Type: `int` or `tuple[int, int]`
Default: `(0, 1)`

Controls worker scaling. Accepts either a single integer (max workers, with min=0) or a `(min, max)` tuple.

```python
# Just max: scales from 0 to 5
@Endpoint(name="elastic", gpu=GpuGroup.ANY, workers=5)

# Min and max: always keep 2 warm, scale up to 10
@Endpoint(name="always-on", gpu=GpuGroup.ANY, workers=(2, 10))

# Default: (0, 1)
@Endpoint(name="default", gpu=GpuGroup.ANY)
```

Recommendations:

- `workers=N` or `workers=(0, N)`: Cost-optimized, allows scale to zero
- `workers=(1, N)`: Avoids cold starts by keeping at least one worker warm
- `workers=(N, N)`: Fixed worker count for consistent performance
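The int-vs-tuple semantics can be sketched as a small normalizer. This is illustrative only (`normalize_workers` is not part of the SDK); it just shows how a bare integer collapses to a `(min, max)` pair:

```python
def normalize_workers(workers):
    """Illustrative sketch: a bare int N is shorthand for (0, N)."""
    if isinstance(workers, int):
        return (0, workers)
    min_workers, max_workers = workers
    return (min_workers, max_workers)

print(normalize_workers(5))        # (0, 5)
print(normalize_workers((2, 10)))  # (2, 10)
```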
### idle_timeout

Type: `int`
Default: `60`

Number of seconds a worker stays active after completing a request, waiting for additional requests before scaling down to the minimum worker count.

```python
# Quick scale-down for cost savings
@Endpoint(name="batch", gpu=GpuGroup.ANY, idle_timeout=30)

# Keep workers longer for variable traffic
@Endpoint(name="api", gpu=GpuGroup.ANY, idle_timeout=120)
```

Recommendations:

- 30-60 seconds: Cost-optimized, infrequent traffic
- 60-120 seconds: Balanced, variable traffic patterns
- 120-300 seconds: Latency-optimized, consistent traffic
### dependencies

Type: `list[str]`
Default: `None`

Python packages to install on the remote worker before executing your function. Supports standard pip syntax.

```python
@Endpoint(
    name="ml-worker",
    gpu=GpuGroup.ANY,
    dependencies=["torch>=2.0.0", "transformers==4.36.0", "pillow"]
)
async def process(data): ...
```

Packages must be imported inside the function body, not at the top of your file.
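A minimal, runnable sketch of this lazy-import pattern, with the standard-library `json` standing in for a heavy dependency like `torch`:

```python
import asyncio

# The package is imported inside the handler body, so on a real worker the
# import runs after `dependencies` are installed (json is only a stand-in).
async def process(data):
    import json  # imported here, not at the top of the file
    return json.dumps({"count": len(data["items"])})

print(asyncio.run(process({"items": [1, 2, 3]})))  # {"count": 3}
```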
### system_dependencies

Type: `list[str]`
Default: `None`

System-level packages to install via apt before your function runs.

```python
@Endpoint(
    name="video-processor",
    gpu=GpuGroup.ANY,
    dependencies=["opencv-python"],
    system_dependencies=["libgl1-mesa-glx", "libglib2.0-0"]
)
async def process_video(data): ...
```
### accelerate_downloads

Type: `bool`
Default: `True`

Enables faster downloads for dependencies, models, and large files. Disable this if you encounter compatibility issues.

```python
@Endpoint(
    name="standard-downloads",
    gpu=GpuGroup.ANY,
    accelerate_downloads=False
)
async def process(data): ...
```
### volume

Type: `NetworkVolume`
Default: `None`

Attaches a network volume for persistent storage. Volumes are mounted at `/runpod-volume/`. Flash uses the volume name to find an existing volume or create a new one.

```python
from runpod_flash import Endpoint, GpuGroup, NetworkVolume

vol = NetworkVolume(name="model-cache")  # Finds existing or creates new

@Endpoint(
    name="model-server",
    gpu=GpuGroup.ANY,
    volume=vol
)
async def serve(data):
    # Access files at /runpod-volume/
    model = load_model("/runpod-volume/models/bert")
    ...
```

Use cases:

- Share large models across workers
- Persist data between runs
- Share datasets across endpoints

See Storage for setup instructions.
### datacenter

Type: `DataCenter`
Default: `DataCenter.EU_RO_1`

Preferred datacenter for worker deployment.

```python
from runpod_flash import Endpoint, DataCenter

@Endpoint(
    name="eu-workers",
    gpu=GpuGroup.ANY,
    datacenter=DataCenter.EU_RO_1
)
async def process(data): ...
```

Flash Serverless deployments are currently restricted to `EU-RO-1`.
### env

Type: `dict[str, str]`
Default: `None`

Environment variables passed to all workers. Useful for API keys, configuration, and feature flags.

```python
@Endpoint(
    name="ml-worker",
    gpu=GpuGroup.ANY,
    env={
        "HF_TOKEN": "your_huggingface_token",
        "MODEL_ID": "gpt2",
        "LOG_LEVEL": "INFO"
    }
)
async def load_model():
    import os
    token = os.getenv("HF_TOKEN")
    model_id = os.getenv("MODEL_ID")
    ...
```

Values in your project's `.env` file are only available locally for CLI commands and development. They are not passed to deployed endpoints; you must declare environment variables explicitly using the `env` parameter.

To pass a local environment variable to your deployed endpoint, read it from `os.environ`:

```python
import os

@Endpoint(
    name="ml-worker",
    gpu=GpuGroup.ANY,
    env={"HF_TOKEN": os.environ["HF_TOKEN"]}  # Read from local env, pass to workers
)
async def load_model():
    ...
```

Environment variables are excluded from configuration hashing, so changing environment values won't trigger endpoint recreation. This makes it easy to rotate API keys.
### gpu_count

Type: `int`
Default: `1`

Number of GPUs per worker. Use this for multi-GPU workloads.

```python
@Endpoint(
    name="multi-gpu-training",
    gpu=GpuType.NVIDIA_A100_80GB_PCIe,
    gpu_count=4,  # Each worker gets 4 GPUs
    workers=2     # Maximum 2 workers = 8 GPUs total
)
async def train(data): ...
```
### execution_timeout_ms

Type: `int`
Default: `0` (no limit)

Maximum execution time for a single job, in milliseconds. Jobs exceeding this timeout are terminated.

```python
# 5 minute timeout
@Endpoint(
    name="training",
    gpu=GpuGroup.ANY,
    execution_timeout_ms=300000  # 5 * 60 * 1000
)
async def train(data): ...

# 30 second timeout for quick inference
@Endpoint(
    name="quick-inference",
    gpu=GpuGroup.ANY,
    execution_timeout_ms=30000
)
async def infer(data): ...
```
### flashboot

Type: `bool`
Default: `True`

Enables Flashboot for faster cold starts by pre-loading container images.

```python
@Endpoint(
    name="fast-startup",
    gpu=GpuGroup.ANY,
    flashboot=True  # Default
)
async def process(data): ...
```

Set to `False` for debugging or compatibility reasons.
### image

Type: `str`
Default: `None`

Custom Docker image to deploy. When specified, the endpoint runs your Docker image instead of Flash's managed workers.

```python
from runpod_flash import Endpoint, GpuType

vllm = Endpoint(
    name="vllm-server",
    image="runpod/worker-vllm:stable-cuda12.1.0",
    gpu=GpuType.NVIDIA_A100_80GB_PCIe,
    env={"MODEL_NAME": "meta-llama/Llama-3.2-3B-Instruct"}
)

# Make HTTP calls to the deployed image
result = await vllm.post("/v1/completions", {"prompt": "Hello"})
```

See Custom Docker images for complete documentation.
### scaler_type

Type: `ServerlessScalerType`
Default: Auto-selected based on endpoint type

Scaling algorithm strategy. Defaults are set automatically:

- Queue-based endpoints: `QUEUE_DELAY` (scales based on queue depth)
- Load-balanced endpoints: `REQUEST_COUNT` (scales based on active requests)

```python
from runpod_flash import Endpoint, ServerlessScalerType

@Endpoint(
    name="custom-scaler",
    gpu=GpuGroup.ANY,
    scaler_type=ServerlessScalerType.QUEUE_DELAY
)
async def process(data): ...
```
### scaler_value

Type: `int`
Default: `4`

Parameter value for the scaling algorithm. With `QUEUE_DELAY`, this is the target number of jobs per worker before scaling up.

```python
# Scale up when > 2 jobs per worker (more aggressive)
@Endpoint(
    name="responsive",
    gpu=GpuGroup.ANY,
    scaler_value=2
)
async def process(data): ...
```
### template

Type: `PodTemplate`
Default: `None`

Advanced pod configuration overrides.

```python
from runpod_flash import Endpoint, GpuGroup, PodTemplate

@Endpoint(
    name="custom-pod",
    gpu=GpuGroup.ANY,
    template=PodTemplate(
        containerDiskInGb=100,
        env=[{"key": "PYTHONPATH", "value": "/workspace"}]
    )
)
async def process(data): ...
```
## PodTemplate

PodTemplate provides advanced pod configuration options:

| Parameter | Type | Description | Default |
|---|---|---|---|
| `containerDiskInGb` | `int` | Container disk size in GB | `64` |
| `env` | `list[dict]` | Environment variables as a list of `{"key": "...", "value": "..."}` | `None` |

```python
from runpod_flash import PodTemplate

template = PodTemplate(
    containerDiskInGb=100,
    env=[
        {"key": "PYTHONPATH", "value": "/workspace"},
        {"key": "CUDA_VISIBLE_DEVICES", "value": "0"}
    ]
)
```

For simple environment variables, use the `env` parameter on Endpoint instead of `PodTemplate.env`.
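If you need to move values between the two formats, the conversion is mechanical. A hypothetical helper (`to_template_env` is not part of the SDK) that turns an Endpoint-style dict into the PodTemplate-style list:

```python
def to_template_env(env: dict) -> list:
    """Hypothetical helper: Endpoint-style env dict -> PodTemplate-style list."""
    return [{"key": k, "value": v} for k, v in env.items()]

print(to_template_env({"PYTHONPATH": "/workspace"}))
# [{'key': 'PYTHONPATH', 'value': '/workspace'}]
```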
## EndpointJob

When using `Endpoint(id=...)` or `Endpoint(image=...)`, the `.run()` method returns an `EndpointJob` object for async operations:

```python
ep = Endpoint(id="abc123")

# Submit a job
job = await ep.run({"prompt": "hello"})

# Check status
status = await job.status()  # "IN_PROGRESS", "COMPLETED", etc.

# Wait for completion
await job.wait(timeout=60)  # Optional timeout in seconds

# Access results
print(job.id)      # Job ID
print(job.output)  # Result payload
print(job.error)   # Error message if failed
print(job.done)    # True if completed/failed

# Cancel a job
await job.cancel()
```
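For progress reporting, you can poll `status()` in a loop instead of blocking on `wait()`. A runnable sketch of the pattern, with a stub class standing in for a real `EndpointJob` so it runs without a deployed endpoint:

```python
import asyncio

class StubJob:
    """Stand-in for EndpointJob so the polling pattern runs locally."""
    def __init__(self, polls_until_done=3):
        self._remaining = polls_until_done
        self.output = {"result": "ok"}

    async def status(self):
        self._remaining -= 1
        return "COMPLETED" if self._remaining <= 0 else "IN_PROGRESS"

async def poll_until_done(job, interval=0.01):
    # Re-check the job until it reports a terminal state
    while await job.status() not in ("COMPLETED", "FAILED"):
        await asyncio.sleep(interval)
    return job.output

print(asyncio.run(poll_until_done(StubJob())))  # {'result': 'ok'}
```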
## Configuration change behavior

When you change configuration and redeploy, Flash automatically updates your endpoint.

### Changes that recreate workers

These changes restart all workers:

- GPU configuration (`gpu`, `gpu_count`)
- CPU instance type (`cpu`)
- Docker image (`image`)
- Storage (`volume`)
- Datacenter (`datacenter`)
- Flashboot setting (`flashboot`)

Workers are temporarily unavailable during recreation (typically 30-90 seconds).

### Changes that update settings only

These changes apply immediately with no downtime:

- Worker scaling (`workers`)
- Timeouts (`idle_timeout`, `execution_timeout_ms`)
- Scaler settings (`scaler_type`, `scaler_value`)
- Environment variables (`env`)
- Endpoint name (`name`)
```python
# First deployment
@Endpoint(
    name="inference-api",
    gpu=GpuType.NVIDIA_A100_80GB_PCIe,
    workers=5,
    env={"MODEL": "v1"}
)
async def infer(data): ...

# Update scaling - no worker recreation
@Endpoint(
    name="inference-api",
    gpu=GpuType.NVIDIA_A100_80GB_PCIe,  # Same GPU
    workers=10,                          # Changed - updates settings only
    env={"MODEL": "v2"}                  # Changed - updates settings only
)
async def infer(data): ...

# Change GPU type - workers recreated
@Endpoint(
    name="inference-api",
    gpu=GpuType.NVIDIA_GEFORCE_RTX_4090,  # Changed - triggers recreation
    workers=10,
    env={"MODEL": "v2"}
)
async def infer(data): ...
```