Architecture
High-level components
mstar is organized as a set of cooperating processes:
- API server (mstar/api_server/): FastAPI layer that accepts POST /generate, tokenizes/loads media, dispatches the request, and streams results back to the client. Entry point: mstar.api_server.entrypoint:main (the mstar-serve console script).
- Conductor (mstar/conductor/): central coordinator. It manages the request lifecycle, handles graph-walk transitions, selects workers, routes inputs, and detects completion.
- Workers (mstar/worker/): one process per GPU. Each runs an engine manager, a micro-scheduler (continuous batching), and a KV cache manager, and routes tensors directly to downstream workers.
- Engines (mstar/engine/): execution backends that actually run submodules on the GPU: KVCacheEngine (nodes with a persistent paged KV cache, e.g. autoregressive LLMs and LLM-as-denoiser flow loops) and StatelessEngine (everything else: ViT/VAE encoders and decoders, codec decoders, projection/combine stages).
- Models (mstar/model/): each model declares its computation graph, tokenization, engine types, and submodules. Registered via mstar/model/registry.py.
- Graph (mstar/graph/): computation-graph primitives: GraphNode, Sequential, Parallel, Loop, GraphEdge.
- Communication (mstar/communication/): ZMQ-based IPC/TCP messaging; tensor transport over RDMA or TCP.
- Streaming (mstar/streaming/): streaming output with configurable chunking policies and async partition topology.
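To make the Graph and Engines entries more concrete, the following is a minimal sketch of how sequential and parallel composition of nodes might look. The classes below are toy stand-ins that only borrow the names GraphNode, Sequential, and Parallel; their fields, the engine labels "kv_cache" and "stateless", and the node names are assumptions made for illustration and are not taken from mstar/graph/.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class GraphNode:
    """Toy stand-in: a named submodule tagged with the engine kind that runs it."""
    name: str
    engine: str  # "kv_cache" or "stateless" (labels invented for this sketch)


@dataclass
class Sequential:
    """Children run one after another."""
    children: List["Graph"] = field(default_factory=list)


@dataclass
class Parallel:
    """Children are independent of each other and may run concurrently."""
    children: List["Graph"] = field(default_factory=list)


Graph = Union[GraphNode, Sequential, Parallel]


def iter_nodes(graph: Graph) -> List[GraphNode]:
    """Collect leaf nodes in depth-first, left-to-right order."""
    if isinstance(graph, GraphNode):
        return [graph]
    nodes: List[GraphNode] = []
    for child in graph.children:
        nodes.extend(iter_nodes(child))
    return nodes


# Hypothetical multimodal prefill: two encoders feed a combine stage, then the LLM.
prefill = Sequential([
    Parallel([
        GraphNode("vit_encoder", "stateless"),
        GraphNode("audio_encoder", "stateless"),
    ]),
    GraphNode("combine", "stateless"),
    GraphNode("llm", "kv_cache"),
])

node_names = [n.name for n in iter_nodes(prefill)]
kv_cache_nodes = [n.name for n in iter_nodes(prefill) if n.engine == "kv_cache"]
```

In this toy graph only the llm node carries persistent state, which mirrors the split described above: stateful autoregressive nodes go to a KV-cache engine and everything else to a stateless one.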
Core design principles
- Models define execution plans. Each model provides its own graph walks (e.g. prefill, decode, image_gen) via get_graph_walk_graphs().
- Disaggregated. Logical computation nodes map to physical workers via the YAML config's node_groups (node names → GPU ranks).
- Graph-driven scheduling. The conductor schedules graph walks and their transitions to coordinate multi-engine pipelines, including async producer/consumer partitions.
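As a rough illustration of the node-name-to-GPU-rank mapping, the sketch below models node_groups as a plain Python dict and picks a rank for each node. The dict shape, the node names, and the round-robin policy are all assumptions made for this example; the actual YAML schema and the conductor's worker-selection logic may differ.

```python
from itertools import cycle
from typing import Dict, Iterator, List

# Assumed shape only: node name -> list of GPU ranks that can run that node.
node_groups: Dict[str, List[int]] = {
    "vit_encoder": [0],
    "llm": [1, 2],
    "vae_decoder": [3],
}


def make_selector(groups: Dict[str, List[int]]) -> Dict[str, Iterator[int]]:
    """Build one round-robin iterator of ranks per node name."""
    for name, ranks in groups.items():
        if not ranks:
            raise ValueError(f"node group {name!r} has no GPU ranks")
    return {name: cycle(ranks) for name, ranks in groups.items()}


def select_rank(selector: Dict[str, Iterator[int]], node_name: str) -> int:
    """Return the next rank for a node (round-robin is this sketch's choice)."""
    if node_name not in selector:
        raise KeyError(f"unknown node {node_name!r}")
    return next(selector[node_name])


selector = make_selector(node_groups)
llm_picks = [select_rank(selector, "llm") for _ in range(3)]
vit_picks = [select_rank(selector, "vit_encoder") for _ in range(2)]
```

Keeping the mapping in configuration rather than in model code is what lets the same logical graph be placed on different GPU layouts without touching the model definition.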
Execution flow (simplified)
1. The API server receives a request, loads media, and calls the model's process_prompt to produce the initial tensors.
2. The conductor seeds the initial graph walk (e.g. prefill) and asks the model for the next forward-pass arguments after each graph walk completes.
3. Workers execute the ready graph nodes on GPU through the appropriate engine and route outputs (tensors) to downstream nodes/workers.
4. Outputs marked for the client are post-processed (postprocess) and streamed back.
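The four steps above can be sketched as a single-process loop. This is a toy simulation under stated assumptions, not mstar's implementation: ToyModel, next_walk, run_walk, and serve are hypothetical names, and the signatures given to process_prompt and postprocess here are invented for the example. Real execution is spread across the API server, the conductor, and GPU workers.

```python
from typing import Dict, List, Optional, Tuple

State = Dict[str, List[int]]


class ToyModel:
    """Hypothetical model: one prefill walk, then decode walks up to a token limit."""

    def __init__(self, max_new_tokens: int) -> None:
        self.max_new_tokens = max_new_tokens

    def process_prompt(self, prompt: str) -> State:
        # Stand-in for tokenization: one fake token id per whitespace-separated word.
        return {"tokens": [len(word) for word in prompt.split()]}

    def next_walk(self, finished: str, state: State) -> Optional[str]:
        # Invented hook: name the next graph walk, or None when the request is done.
        generated = len(state.get("generated", []))
        return "decode" if generated < self.max_new_tokens else None

    def postprocess(self, state: State) -> str:
        return " ".join(str(token) for token in state["generated"])


def run_walk(walk: str, state: State) -> None:
    """Stand-in for workers executing the ready nodes of one graph walk."""
    state.setdefault("generated", [])
    if walk == "decode":
        # Each decode walk emits one fake token.
        state["generated"].append(len(state["generated"]))


def serve(model: ToyModel, prompt: str) -> Tuple[List[str], str]:
    state = model.process_prompt(prompt)        # step 1: initial tensors
    walk: Optional[str] = "prefill"             # step 2: seed the first walk
    trace: List[str] = []
    while walk is not None:
        run_walk(walk, state)                   # step 3: execute the walk
        trace.append(walk)
        walk = model.next_walk(walk, state)     # step 2 again: ask for the next walk
    return trace, model.postprocess(state)      # step 4: post-process for the client


trace, output = serve(ToyModel(max_new_tokens=2), "hello world")
```

The loop makes the division of responsibility visible: the model decides which walk comes next, while the surrounding driver only seeds, executes, and detects completion.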