Quickstart

This page shows how to serve a model and send a first request. If you haven’t installed mstar yet, see Installation.

1. Start a server

The mstar CLI launches a server for a model with a sensible default config:

mstar serve bagel

It listens on http://localhost:8000 by default. Other models:

mstar serve bagel_cfg_parallel   # BAGEL with CFG branches split across GPUs (faster image gen)
mstar serve qwen3_omni           # omni: text/image/audio/video in, text/speech out
mstar serve orpheus              # text-to-speech
mstar serve pi05                 # vision-language-action (robotics)
mstar serve vjepa2               # video world model

Defaults vary by model: most fit on a single GPU, but some ship with multi-GPU layouts — qwen3_omni uses 2 GPUs and bagel_cfg_parallel uses 3 (the main branch plus the two classifier-free-guidance branches on their own GPUs). Choose GPUs and a port with --gpus / --port:

mstar serve qwen3_omni --gpus 0,1 --port 9000
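If you launch servers from a script rather than a shell, the same command line can be assembled programmatically. The sketch below is only an illustration: `build_serve_command` is a hypothetical helper, not part of mstar, and it uses only the `--gpus` and `--port` flags shown above.

```python
import shlex


def build_serve_command(model, gpus=None, port=None):
    """Build the argv list for `mstar serve` (hypothetical helper, not an mstar API)."""
    argv = ["mstar", "serve", model]
    if gpus is not None:
        # --gpus takes a comma-separated list of device indices, e.g. "0,1"
        argv += ["--gpus", ",".join(str(g) for g in gpus)]
    if port is not None:
        argv += ["--port", str(port)]
    return argv


argv = build_serve_command("qwen3_omni", gpus=[0, 1], port=9000)
print(shlex.join(argv))  # mstar serve qwen3_omni --gpus 0,1 --port 9000
```

The resulting list can be passed to `subprocess.Popen(argv)` to start the server as a child process.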

For custom layouts, disaggregation, and tensor parallelism, see Serving.

Note

The first request(s) on a fresh environment can be slow, often tens of seconds to a few minutes. mstar compiles the model with torch.compile, and that compilation happens lazily on the first request that exercises each code path. Subsequent requests run at full speed, and the compiled artifacts are cached on disk, so later runs and restarts warm up much faster. To avoid paying the compilation cost on a real request, send a throwaway warmup request right after the server reports ready.
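One way to see the warmup effect is to time the same request more than once. The following is a minimal sketch, not an mstar feature: `timed_warmup` is a hypothetical helper that takes any zero-argument callable, and the commented usage assumes a bagel server on the default port and the SDK's `chat` call from section 2.

```python
import time


def timed_warmup(send_request, repeats=2):
    """Call send_request() `repeats` times and return each call's latency in seconds."""
    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        send_request()
        latencies.append(time.perf_counter() - start)
    return latencies


# Assumed usage against a running server (not executed here):
#
#   from mstar import MStarClient
#   client = MStarClient("http://localhost:8000")
#   first, second = timed_warmup(lambda: client.chat("warmup"))
#   print(f"first: {first:.1f}s, second: {second:.1f}s")
```

Because compilation is triggered per path, a text warmup would not be expected to warm up image generation; send one throwaway request for each kind of request you plan to serve.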

2. Send a request

Python SDK — works for every model and modality. Each line below targets the model that supports it, so run the matching server first:

from pathlib import Path

from mstar import MStarClient

client = MStarClient("http://localhost:8000")

print(client.chat("What is the capital of France?").text)              # text   (BAGEL / Qwen3-Omni)
Path("cat.png").write_bytes(client.generate_image("a cat in a hat"))   # image  (BAGEL)
client.tts("Hello there", voice="tara").to_wav("out.wav")              # speech (Orpheus)

Streaming yields typed chunks (TextChunk / ImageChunk / AudioChunk):

from mstar.client import TextChunk

for event in client.chat("Tell me a short story.", stream=True):
    if isinstance(event, TextChunk):
        print(event.text, end="", flush=True)
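If you want the full reply as one string rather than printing chunks as they arrive, the same `isinstance` filter can feed a join. The sketch below uses a stand-in `FakeTextChunk` class so it runs without a server; it assumes only that text chunks expose a `.text` attribute, as in the loop above. `collect_text` is a hypothetical helper, not part of the SDK.

```python
from dataclasses import dataclass


@dataclass
class FakeTextChunk:
    """Stand-in for mstar.client.TextChunk (only the .text attribute is assumed)."""
    text: str


def collect_text(events, text_type):
    """Join the .text of every event that is an instance of text_type; skip the rest."""
    return "".join(e.text for e in events if isinstance(e, text_type))


# A mixed stream: two text chunks and one unrelated event that is ignored.
events = [FakeTextChunk("Once "), object(), FakeTextChunk("upon a time.")]
story = collect_text(events, FakeTextChunk)
print(story)  # Once upon a time.
```

Against a real server, the equivalent call would presumably be `collect_text(client.chat("Tell me a short story.", stream=True), TextChunk)`.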

curl — the native /generate endpoint works for every model:

curl -s http://localhost:8000/generate -F 'text=Hello, how are you?'
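The `-F` flag makes curl send the field as multipart/form-data. For environments without curl, the request can be approximated with the Python standard library. This is a sketch under assumptions: `encode_multipart` and `post_generate` are hypothetical helpers, only the `text` field from the command above is used, and the response is returned as raw bytes because its format is not described on this page.

```python
import urllib.request
import uuid


def encode_multipart(fields, boundary=None):
    """Encode plain-text fields as a multipart/form-data body, similar to `curl -F`."""
    boundary = boundary or uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines.append(f"--{boundary}")
        lines.append(f'Content-Disposition: form-data; name="{name}"')
        lines.append("")  # blank line separates part headers from the value
        lines.append(value)
    lines.append(f"--{boundary}--")  # closing boundary
    lines.append("")
    body = "\r\n".join(lines).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def post_generate(base_url, text, timeout_s=300.0):
    """POST a text field to {base_url}/generate and return the raw response bytes."""
    body, content_type = encode_multipart({"text": text})
    request = urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.read()
```

With a server running, `post_generate("http://localhost:8000", "Hello, how are you?")` would send the same field as the curl command. The generous default timeout is a guess meant to leave room for first-request compilation.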

OpenAI-compatible API — a drop-in client for bagel, qwen3_omni, and orpheus:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
resp = client.chat.completions.create(
    model="bagel",
    messages=[{"role": "user", "content": "Give me one fun fact."}],
)
print(resp.choices[0].message.content)

Runnable versions of all of these live in the repo’s examples/ directory (sdk_chat.py, sdk_image.py, sdk_tts.py, openai_chat.py, openai_tts.py, curl.sh). For the full client surface, see Using a Server.

3. Check health

curl http://localhost:8000/health        # -> {"status": "healthy"}
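In scripts, it is often convenient to block until the server answers the health check before sending traffic. The sketch below polls `/health` with the standard library and treats `{"status": "healthy"}` as the success signal, as in the curl output above. The helper names, timeouts, and the idea that a healthy response means the server is ready for requests are assumptions, not documented mstar behavior.

```python
import json
import time
import urllib.request


def is_healthy(base_url, timeout_s=2.0):
    """Return True if GET {base_url}/health answers with {"status": "healthy"}."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout_s) as resp:
            body = json.load(resp)
    except (OSError, ValueError):
        # OSError covers connection errors and timeouts; ValueError covers bad JSON.
        return False
    return isinstance(body, dict) and body.get("status") == "healthy"


def wait_until_healthy(base_url, deadline_s=600.0, interval_s=2.0,
                       probe=is_healthy, sleep=time.sleep):
    """Poll probe(base_url) until it succeeds or deadline_s elapses; return the outcome."""
    start = time.monotonic()
    while True:
        if probe(base_url):
            return True
        if time.monotonic() - start >= deadline_s:
            return False
        sleep(interval_s)
```

A typical call would be `wait_until_healthy("http://localhost:8000")` right after starting `mstar serve`. The `probe` and `sleep` parameters exist only so the loop can be exercised without a live server.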