mstar.model.pi05.config

Configuration for the Pi0.5 vision-language-action model.

Functions

load_pi05_config([hf_config])

Build a Pi05Config, optionally overlaying values from an HF config dict.

Classes

Pi05Config([vit_hidden_size, ...])

Pi0.5 model configuration.

class mstar.model.pi05.config.Pi05Config(vit_hidden_size=1152, vit_num_layers=27, vit_num_heads=16, vit_intermediate_size=4304, vit_patch_size=14, vit_image_size=224, tokens_per_image=256, num_cameras=3, num_layers=18, num_qo_heads=8, num_kv_heads=1, head_dim=256, rms_norm_eps=1e-06, rope_theta=10000.0, hidden_size=2048, pali_intermediate_size=16384, vocab_size=257152, pad_token_id=0, action_hidden_size=1024, action_intermediate_size=4096, num_flow_steps=10, action_horizon=50, action_dim=32, state_dim=32, state_token_bins=256, state_token_offset=0, max_lang_tokens=200, max_position_embeddings=2048, timestep_min_period=0.004, timestep_max_period=4.0, default_action_dtype='float32', extra=<factory>)

Bases: object

Pi0.5 model configuration.

Pi0.5 combines a SigLIP vision encoder with two Gemma transformer experts. PaliGemma (Gemma-2B backbone) processes the prefix (image, text, and state tokens) and writes a KV cache. An action expert (Gemma) reads that frozen cache and runs a 10-step Euler flow-matching loop (num_flow_steps) with adaRMS timestep conditioning to produce a 50-step robot action trajectory (action_horizon).
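
The fixed-step Euler loop mentioned above can be sketched in a few lines. This is a minimal illustration and not the library's implementation: `euler_flow_sample` and `toy_velocity` are hypothetical names, and the time convention (integrating from t=1, pure noise, down to t=0, actions) is an assumption. In the real model the velocity would come from the action expert attending to the cached prefix; here a toy field with a known answer stands in for it.

```python
import random


def euler_flow_sample(velocity_fn, noise, num_flow_steps=10):
    """Integrate dx/dt = v(x, t) from t=1 down to t=0 with fixed Euler steps.

    The direction of time is an assumption made for this sketch.
    """
    dt = -1.0 / num_flow_steps
    x = list(noise)
    t = 1.0
    for _ in range(num_flow_steps):
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        t += dt
    return x


# Flattened trajectory of action_horizon * action_dim values (50 * 32 by default).
action_horizon, action_dim = 50, 32
rng = random.Random(0)
target = [rng.uniform(-1.0, 1.0) for _ in range(action_horizon * action_dim)]
noise = [rng.gauss(0.0, 1.0) for _ in range(action_horizon * action_dim)]


def toy_velocity(x, t):
    # The straight-line path x_t = t * noise + (1 - t) * target has the
    # constant velocity (noise - target), so Euler steps recover the target.
    return [n - a for n, a in zip(noise, target)]


actions = euler_flow_sample(toy_velocity, noise, num_flow_steps=10)
max_err = max(abs(a - b) for a, b in zip(actions, target))
```

Because the toy velocity is constant, ten Euler steps land on `target` up to floating-point error. A learned velocity field is not constant, which is why the number of steps matters in practice.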

Both experts share KV-cache dimensions (num_kv_heads, head_dim) so that the action expert can attend to the cache written by PaliGemma.
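
To make the shared cache dimensions concrete, the sketch below counts how many elements a cache of this shape would hold with the default values. It assumes the common layout of one key and one value vector per token, per layer, per KV head. `kv_cache_elements` is a hypothetical helper, and the prefix length used here (image tokens plus max_lang_tokens) is only an assumed upper bound, since the page does not say whether state tokens count toward max_lang_tokens.

```python
def kv_cache_elements(num_tokens, num_layers=18, num_kv_heads=1, head_dim=256):
    """Elements needed to store keys and values for num_tokens across all layers."""
    per_token = 2 * num_layers * num_kv_heads * head_dim  # 2 = one key + one value
    return num_tokens * per_token


image_tokens = 3 * 256               # num_cameras * tokens_per_image
prefix_tokens = image_tokens + 200   # plus max_lang_tokens (assumed upper bound)

per_token = kv_cache_elements(1)
prefix_elements = kv_cache_elements(prefix_tokens)
prefix_bytes_2byte = prefix_elements * 2  # assumes a 2-byte dtype such as bfloat16

print(per_token, prefix_tokens, prefix_elements, prefix_bytes_2byte)
```

With num_kv_heads=1 the per-token cost is small, and because the action expert uses the same (num_kv_heads, head_dim) it can read these entries without any reshaping or projection of the cache.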

Parameters:
  • vit_hidden_size (int)

  • vit_num_layers (int)

  • vit_num_heads (int)

  • vit_intermediate_size (int)

  • vit_patch_size (int)

  • vit_image_size (int)

  • tokens_per_image (int)

  • num_cameras (int)

  • num_layers (int)

  • num_qo_heads (int)

  • num_kv_heads (int)

  • head_dim (int)

  • rms_norm_eps (float)

  • rope_theta (float)

  • hidden_size (int)

  • pali_intermediate_size (int)

  • vocab_size (int)

  • pad_token_id (int)

  • action_hidden_size (int)

  • action_intermediate_size (int)

  • num_flow_steps (int)

  • action_horizon (int)

  • action_dim (int)

  • state_dim (int)

  • state_token_bins (int)

  • state_token_offset (int)

  • max_lang_tokens (int)

  • max_position_embeddings (int)

  • timestep_min_period (float)

  • timestep_max_period (float)

  • default_action_dtype (str)

  • extra (dict)

action_dim: int = 32
action_hidden_size: int = 1024
action_horizon: int = 50
action_intermediate_size: int = 4096
default_action_dtype: str = 'float32'
extra: dict
head_dim: int = 256
hidden_size: int = 2048
max_lang_tokens: int = 200
max_position_embeddings: int = 2048
num_cameras: int = 3
num_flow_steps: int = 10
num_kv_heads: int = 1
num_layers: int = 18
num_qo_heads: int = 8
pad_token_id: int = 0
pali_intermediate_size: int = 16384
rms_norm_eps: float = 1e-06
rope_theta: float = 10000.0
state_dim: int = 32
state_token_bins: int = 256
state_token_offset: int = 0
timestep_max_period: float = 4.0
timestep_min_period: float = 0.004
tokens_per_image: int = 256
vit_hidden_size: int = 1152
vit_image_size: int = 224
vit_intermediate_size: int = 4304
vit_num_heads: int = 16
vit_num_layers: int = 27
vit_patch_size: int = 14
vocab_size: int = 257152
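
Several of the defaults above are arithmetically related, which is worth keeping in mind when overriding one of them. The sketch below checks a few such relations on a plain dict of the documented defaults. Whether the model itself enforces these relations is not stated on this page, so treat them as sanity checks rather than documented constraints.

```python
# A subset of the documented defaults, copied into a plain dict for illustration.
defaults = {
    "vit_image_size": 224,
    "vit_patch_size": 14,
    "tokens_per_image": 256,
    "vit_hidden_size": 1152,
    "vit_num_heads": 16,
    "num_qo_heads": 8,
    "num_kv_heads": 1,
    "head_dim": 256,
    "hidden_size": 2048,
}

# A 224x224 image cut into 14x14 patches gives a 16x16 grid, i.e. 256 tokens.
patches_per_side = defaults["vit_image_size"] // defaults["vit_patch_size"]
patch_tokens = patches_per_side ** 2

# The ViT hidden size divides evenly across its attention heads.
vit_head_dim, vit_remainder = divmod(defaults["vit_hidden_size"], defaults["vit_num_heads"])

# Total query width of the backbone: heads times per-head dimension.
query_width = defaults["num_qo_heads"] * defaults["head_dim"]

# Number of query heads that share each key/value head.
queries_per_kv_head = defaults["num_qo_heads"] // defaults["num_kv_heads"]

checks = {
    "tokens_per_image matches patch grid": patch_tokens == defaults["tokens_per_image"],
    "vit_hidden_size divisible by vit_num_heads": vit_remainder == 0,
    "query width equals hidden_size": query_width == defaults["hidden_size"],
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
```

All three checks pass for the defaults. Changing vit_image_size or vit_patch_size without adjusting tokens_per_image would make the first one fail.
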
mstar.model.pi05.config.load_pi05_config(hf_config=None)

Build a Pi05Config, optionally overlaying values from an HF config dict.

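Any HF key that matches a Pi05Config field name is mapped to that field automatically. For the few HF keys whose names differ (e.g. num_hidden_layers -> num_layers), an explicit rename dict is used. Unrecognised keys are silently ignored, so a misspelled key leaves the corresponding default in place without raising an error.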

Parameters:

hf_config (dict | None)

Return type:

Pi05Config
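
The overlay behaviour described above can be illustrated with a small stand-in. This is a sketch of the described pattern, not the actual source: `MiniPi05Config`, `HF_RENAMES`, and `load_mini_config` are hypothetical names. The stand-in carries only four of the real fields, and the rename table holds only the one mapping named in the text (the real table may contain more).

```python
from dataclasses import dataclass, fields


@dataclass
class MiniPi05Config:
    """Hypothetical stand-in carrying a handful of Pi05Config fields."""
    num_layers: int = 18
    hidden_size: int = 2048
    vocab_size: int = 257152
    rms_norm_eps: float = 1e-06


# Only this rename is named in the documentation; others are not assumed here.
HF_RENAMES = {"num_hidden_layers": "num_layers"}


def load_mini_config(hf_config=None):
    """Build a config from defaults, overlaying recognised keys from hf_config."""
    valid = {f.name for f in fields(MiniPi05Config)}
    overrides = {}
    for key, value in (hf_config or {}).items():
        name = HF_RENAMES.get(key, key)  # translate HF names that differ
        if name in valid:
            overrides[name] = value
        # Any other key is dropped without warning.
    return MiniPi05Config(**overrides)


cfg = load_mini_config(
    {"num_hidden_layers": 12, "vocab_size": 32000, "model_type": "gemma"}
)
print(cfg)
```

Here num_hidden_layers is renamed to num_layers, vocab_size matches a field directly, and model_type is dropped. Calling `load_mini_config()` with no argument returns the plain defaults. How the real function resolves an input that contains both an HF name and its renamed counterpart is not documented; in this sketch the later key wins.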