Use rayai serve to deploy agents as HTTP endpoints via Ray Serve:
rayai serve
RayAI automatically:
  1. Discovers agents in the agents/ directory
  2. Initializes Ray and Ray Serve
  3. Creates FastAPI endpoints for each agent
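For example, a minimal agent that rayai serve would discover might look like this (a sketch; the rayai import path is an assumption, while the @agent decorator and run() contract match the examples later on this page):
# agents/echo_agent.py
from rayai import agent  # import path is an assumption for illustration

@agent()
class EchoAgent:
    def run(self, data: dict) -> dict:
        # Echo the last user message back to the caller.
        messages = data.get("messages", [])
        last = messages[-1]["content"] if messages else ""
        return {"response": f"You said: {last}"}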

Endpoints

Each agent gets a /chat endpoint:
POST /agents/{agent_name}/chat
Request body:
{
  "data": {
    "messages": [{"role": "user", "content": "Hello!"}]
  },
  "session_id": "optional-session-id",
  "stream": false
}
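For example, with the default port you could call an agent named echo_agent (the agent name is illustrative) using the requests library:
import requests

resp = requests.post(
    "http://localhost:8000/agents/echo_agent/chat",
    json={
        "data": {"messages": [{"role": "user", "content": "Hello!"}]},
        "session_id": "demo-session",
        "stream": False,
    },
)
print(resp.json())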

Streaming

Agents can support streaming responses by implementing additional methods:
Stream Mode   Method Required        Response Format
"text"        run_stream()           Server-sent events with text chunks
"events"      run_stream_events()    Server-sent events with JSON objects
@agent()
class StreamingAgent:
    def run(self, data: dict) -> dict:
        return {"response": "Hello!"}

    async def run_stream(self, data: dict):
        """Yield text chunks for streaming."""
        yield "Hello"
        yield " world!"

    async def run_stream_events(self, data: dict):
        """Yield event dicts for streaming."""
        yield {"type": "start"}
        yield {"type": "token", "content": "Hello"}
        yield {"type": "end"}
To enable streaming, set the stream parameter in your request.
Text streaming request:
{
  "data": {"messages": [{"role": "user", "content": "Hello!"}]},
  "stream": "text"
}
Event streaming request:
{
  "data": {"messages": [{"role": "user", "content": "Hello!"}]},
  "stream": "events"
}
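A client can consume either stream as server-sent events. Here is a minimal sketch for the "text" mode using the requests library, assuming the endpoint emits standard SSE data: lines:
import requests

with requests.post(
    "http://localhost:8000/agents/echo_agent/chat",
    json={
        "data": {"messages": [{"role": "user", "content": "Hello!"}]},
        "stream": "text",
    },
    stream=True,  # keep the connection open and read chunks as they arrive
) as resp:
    for line in resp.iter_lines(decode_unicode=True):
        # Standard SSE frames text as "data: <payload>" lines (an assumption
        # about the exact framing; adjust if the server differs).
        if line and line.startswith("data:"):
            print(line[len("data:"):].strip())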

CLI Options

rayai serve [PROJECT_PATH] [OPTIONS]
Option     Default   Description
--port     8000      HTTP server port
--agents   All       Comma-separated list of agents to deploy
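For example, to serve only two specific agents on a non-default port (the agent names are illustrative):
rayai serve --port 8080 --agents echo_agent,ml_agent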

Advanced Usage

Scaling with Replicas

Configure the number of replicas in the @agent decorator; Ray Serve automatically load balances requests across them:
@agent(num_cpus=1, memory="2GB", num_replicas=4)
class HighThroughputAgent:
    def run(self, data: dict) -> dict:
        return {"response": "Handled by one of 4 replicas"}

GPU Agents

For ML inference workloads, allocate GPUs per replica:
@agent(num_gpus=1, memory="8GB", num_replicas=2)
class MLAgent:
    def __init__(self):
        # Load model once per replica
        self.model = load_model()

    def run(self, data: dict) -> dict:
        return {"prediction": self.model.predict(data)}

Resource Planning

When planning cluster resources, remember that each replica reserves its full per-replica allocation, so the cluster must provide num_replicas times the per-replica resources:
Agent Config                    Cluster Requirement
num_replicas=3, num_cpus=2      6 CPUs total
num_replicas=2, num_gpus=1      2 GPUs total
num_replicas=4, memory="4GB"    16 GB RAM total

Monitoring

While the server is running, you can access:
  • Agent endpoints: http://localhost:8000/agents/{name}/chat
  • Ray Dashboard: http://localhost:8265
The Ray Dashboard provides visibility into:
  • Cluster resources and utilization
  • Task and actor status
  • Logs and metrics