Quickstart#
This page serves a model and sends a first request. If you haven’t installed mstar
yet, see Installation.
1. Start a server#
The mstar CLI launches a server for a model with a sensible default config:
mstar serve bagel
It listens on http://localhost:8000 by default. Other models:
mstar serve bagel_cfg_parallel # BAGEL with CFG branches split across GPUs (faster image gen)
mstar serve qwen3_omni # omni: text/image/audio/video in, text/speech out
mstar serve orpheus # text-to-speech
mstar serve pi05 # vision-language-action (robotics)
mstar serve vjepa2 # video world model
Defaults vary by model: most fit on a single GPU, but some ship with multi-GPU layouts —
qwen3_omni uses 2 GPUs and bagel_cfg_parallel uses 3 (the main branch plus the two
classifier-free-guidance branches on their own GPUs). Choose GPUs and a port with
--gpus / --port:
mstar serve qwen3_omni --gpus 0,1 --port 9000
For custom layouts, disaggregation, and tensor parallelism, see Serving.
Note
The first request(s) on a fresh environment can be slow — often tens of seconds to a
few minutes. mstar torch.compiles the model on first use, and that compilation
happens lazily on the first request that exercises each path. Subsequent requests run at
full speed, and the compiled artifacts are cached on disk, so later runs and restarts
warm up much faster. To avoid paying it on a real request, send a throwaway warmup
request right after the server reports ready.
2. Send a request#
Python SDK — works for every model and modality. Each line below targets the model that supports it, so run the matching server first:
from mstar import MStarClient
client = MStarClient("http://localhost:8000")
print(client.chat("What is the capital of France?").text) # text (BAGEL / Qwen3-Omni)
open("cat.png", "wb").write(client.generate_image("a cat in a hat")) # image (BAGEL)
client.tts("Hello there", voice="tara").to_wav("out.wav") # speech (Orpheus)
Streaming yields typed chunks (TextChunk / ImageChunk / AudioChunk):
from mstar.client import TextChunk
for event in client.chat("Tell me a short story.", stream=True):
if isinstance(event, TextChunk):
print(event.text, end="", flush=True)
curl — the native /generate endpoint works for every model:
curl -s http://localhost:8000/generate -F 'text=Hello, how are you?'
OpenAI-compatible API — a drop-in client for bagel, qwen3_omni, and
orpheus:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
resp = client.chat.completions.create(
model="bagel",
messages=[{"role": "user", "content": "Give me one fun fact."}],
)
print(resp.choices[0].message.content)
Runnable versions of all of these live in the repo’s examples/ directory
(sdk_chat.py, sdk_image.py, sdk_tts.py, openai_chat.py, openai_tts.py,
curl.sh). For the full client surface, see Using a Server.
3. Check health#
curl http://localhost:8000/health # -> {"status": "healthy"}