mstar.model.pi05.kernels.image_normalize
GPU-side image range normalization without CPU–GPU sync.
Replaces the two blocking transfers in _prepare_one():
    img_min = float(images.min())  # GPU → CPU (sync)
    img_max = float(images.max())  # GPU → CPU (sync)
    if img_min >= -1e-4 and img_max <= 1.0 + 1e-4:
        images = images * 2.0 - 1.0
with three GPU kernel launches and zero CPU transfers:
1. torch.min → GPU scalar tensor (no .item())
2. torch.max → GPU scalar tensor (no .item())
3. A Triton kernel reads both scalars from device memory and applies x*2-1 if the range lies within [0,1] (to a tolerance of 1e-4); otherwise it leaves the values unchanged.
The Triton kernel avoids materialising a CPU-visible boolean by keeping the “needs_rescale” predicate entirely in registers and using tl.where to select between the two outcomes per element.
Falls back to the original sync-based path on CPU tensors or when Triton is not installed.
Functions
| normalize_float_images(images) | Detect and rescale float32 images from [0,1] to [-1,1], sync-free. |
- mstar.model.pi05.kernels.image_normalize.normalize_float_images(images)
Detect and rescale float32 images from [0,1] to [-1,1], sync-free.
Intended to replace the inline range check in _prepare_one(), which performs two CPU–GPU synchronisations via float(images.min()) and float(images.max()). This function computes those reductions on the GPU and feeds the results directly to a Triton kernel, so the values never surface on the CPU.
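One way to reason about the intended behaviour is to compare the inline range check with a branch-free equivalent written in plain PyTorch. The sketch below is an illustration only: `rescale_sync` and `rescale_branch_free` are hypothetical names, and the pure-PyTorch version launches more kernels than the Triton-based function described here. It shows the same idea, though: the predicate stays a 0-dim tensor and is never converted to a Python bool.

```python
import torch

EPS = 1e-4  # tolerance used by the inline range check


def rescale_sync(images: torch.Tensor) -> torch.Tensor:
    """Inline check as in _prepare_one(): float() blocks on the device."""
    img_min = float(images.min())
    img_max = float(images.max())
    if img_min >= -EPS and img_max <= 1.0 + EPS:
        images = images * 2.0 - 1.0
    return images


def rescale_branch_free(images: torch.Tensor) -> torch.Tensor:
    """Same result with no .item()/float() call on the reductions."""
    lo = torch.min(images)  # 0-dim tensor on the input's device
    hi = torch.max(images)  # 0-dim tensor on the input's device
    # 0-dim bool tensor; it broadcasts against `images` in torch.where.
    needs_rescale = (lo >= -EPS) & (hi <= 1.0 + EPS)
    return torch.where(needs_rescale, images * 2.0 - 1.0, images)


if __name__ == "__main__":
    in_unit_range = torch.linspace(0.0, 1.0, 5)
    already_signed = torch.tensor([-1.0, 0.0, 1.0])
    for batch in (in_unit_range, already_signed):
        assert torch.allclose(rescale_branch_free(batch), rescale_sync(batch))
```

The branch-free form always evaluates `images * 2.0 - 1.0`, even when the result is discarded. That is the price of not branching on the host.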