Run `rayai serve` to deploy agents as HTTP endpoints via Ray Serve. The command:

- Discovers agents in the `agents/` directory
- Initializes Ray and Ray Serve
- Creates FastAPI endpoints for each agent
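Run it from the project root (the directory that contains `agents/`):

```bash
rayai serve
```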
## Endpoints

Each agent gets a `/chat` endpoint:
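For example, a non-streaming call might look like the following. This is a sketch: the agent name `my_agent` and the request/response JSON shapes are assumptions, while the URL pattern comes from the Monitoring section below.

```python
import requests

# Hypothetical example: the agent name and the JSON body shape are
# assumptions; the endpoint path follows /agents/{name}/chat.
resp = requests.post(
    "http://localhost:8000/agents/my_agent/chat",
    json={"message": "Hello!"},
)
print(resp.json())
```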
## Streaming

Agents can support streaming responses by implementing additional methods:

| Stream Mode | Method Required | Response Format |
|---|---|---|
| `"text"` | `run_stream()` | Server-sent events with text chunks |
| `"events"` | `run_stream_events()` | Server-sent events with JSON objects |
Choose a mode with the `stream` parameter in your request.

Text streaming request:
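A sketch using Python's `requests`, assuming the body carries the `stream` mode alongside a `message` field (the field names are assumptions; the mode value `"text"` comes from the table above):

```python
import requests

# Hypothetical example: body field names are assumptions. In "text" mode
# the server responds with server-sent events carrying text chunks.
with requests.post(
    "http://localhost:8000/agents/my_agent/chat",
    json={"message": "Hello!", "stream": "text"},
    stream=True,
) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        if line:
            print(line)  # one SSE line per text chunk
```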
## CLI Options

| Option | Default | Description |
|---|---|---|
| `--port` | `8000` | HTTP server port |
| `--agents` | All | Comma-separated list of agents to deploy |
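For example, to serve only two specific agents on a different port (the agent names here are placeholders):

```bash
rayai serve --port 8080 --agents researcher,writer
```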
## Advanced Usage

### Scaling with Replicas

Ray Serve automatically load-balances requests across replicas. Configure replicas in the `@agent` decorator:
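A minimal sketch, assuming `agent` is importable from the `rayai` package and that a plain `run()` method handles non-streaming chat (both are assumptions; only the decorator parameters appear elsewhere on this page):

```python
from rayai import agent  # assumed import path

@agent(num_replicas=3, num_cpus=2)  # 3 replicas x 2 CPUs = 6 CPUs total
class SummarizerAgent:
    def run(self, message: str) -> str:
        # Assumed handler name; Ray Serve load-balances incoming /chat
        # requests across the three replicas of this agent.
        return f"Summary: {message[:100]}"
```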
### GPU Agents

For ML inference workloads, allocate GPUs per replica:
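A sketch under the same assumptions as above (the `rayai` import path and handler name are assumed):

```python
from rayai import agent  # assumed import path

@agent(num_replicas=2, num_gpus=1)  # 2 replicas x 1 GPU = 2 GPUs total
class InferenceAgent:
    def __init__(self) -> None:
        # Heavy setup, such as loading model weights onto the GPU, belongs
        # here so it runs once per replica rather than once per request.
        self.device = "cuda"

    def run(self, message: str) -> str:  # assumed handler name
        return f"ran on {self.device}: {message}"
```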
### Resource Planning

When planning cluster resources, consider:

| Agent Config | Cluster Requirement |
|---|---|
| `num_replicas=3, num_cpus=2` | 6 CPUs total |
| `num_replicas=2, num_gpus=1` | 2 GPUs total |
| `num_replicas=4, memory="4GB"` | 16 GB RAM total |
## Monitoring

While the server is running, you can access:

- Agent endpoints: `http://localhost:8000/agents/{name}/chat`
- Ray Dashboard: `http://localhost:8265`, which shows:
  - Cluster resources and utilization
  - Task and actor status
  - Logs and metrics