PR #2038

open

Non-record 10min/16MB: GQA Macro Meta-Preconditioned (val_bpb 1.19995)

by FF-GardenFnView on GitHub
val_bpb
1.2000
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,559,229 bytes

Training Techniques

Architecture
GQA
Grouped-query attention transformer backbone.
parameters: null
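GQA shares each key/value head across a group of query heads, shrinking the KV projections and cache relative to full multi-head attention. A minimal numpy sketch of the attention step (head counts and shapes here are illustrative, not the submission's actual config):

```python
import numpy as np

def gqa_attention(q, k, v, n_kv_heads):
    """Causal grouped-query attention.
    q: (n_q_heads, T, d); k, v: (n_kv_heads, T, d); n_kv_heads divides n_q_heads."""
    n_q_heads, T, d = q.shape
    group = n_q_heads // n_kv_heads
    # Each K/V head serves `group` consecutive query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    # Causal mask: token t attends only to positions <= t.
    scores = np.where(np.triu(np.ones((T, T), dtype=bool), k=1), -np.inf, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

With 8 query heads and 2 KV heads this stores a quarter of the KV state of standard multi-head attention.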
SmearGate
Uses a smear gate in the model.
parameters: null
depth recurrence
Adaptive depth/router controls that vary computation depth dynamically.
parameters: null
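The PR does not spell out the router mechanism. As one plausible shape of the idea, here is an ACT-style halting sketch in numpy, where tokens stop recurring through a block once a cumulative halting score crosses a threshold (the `block`, router weights, and threshold are hypothetical stand-ins, not the submission's design):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_depth(h, block, w_router, max_depth=8, threshold=0.99):
    """Run `block` repeatedly over h: (T, d). A router accumulates a
    per-token halting probability; tokens whose total crosses `threshold`
    stop updating (early exit), others recur up to max_depth passes."""
    T, _ = h.shape
    halted = np.zeros(T, dtype=bool)
    cum = np.zeros(T)
    for _ in range(max_depth):
        if halted.all():
            break
        new_h = block(h)
        h = np.where(halted[:, None], h, new_h)  # halted tokens keep their state
        cum = cum + np.where(halted, 0.0, sigmoid(h @ w_router))
        halted |= cum >= threshold
    return h
```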
XSA
Macro side-channel cross-attention with a detached distillation teacher.
parameters: null
fp32 logit head
Computes the output logits with an fp32 head for full numerical precision.
parameters: null
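Keeping the final projection in fp32 avoids precision loss in the logits when the rest of the model runs at lower precision. A one-function numpy illustration (float16 stands in for whatever low-precision format the model uses; the names are illustrative):

```python
import numpy as np

def fp32_logit_head(hidden, w_head):
    """Upcast hidden states and head weights to fp32 before the final
    matmul, so the logits are computed and returned in full precision."""
    return hidden.astype(np.float32) @ w_head.astype(np.float32)
```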
meta-preconditioned local transforms
Applies meta-preconditioned local transforms.
parameters: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"banked_parallel":true,"manual_all_reduce_non_bank_parameters":true}
Quantization
int4
bits: 4
scope: model weights
Compression
zlib
level: null
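To fit under the 16MB artifact cap, weights are quantized to int4 and the result is zlib-compressed. A minimal sketch of such an export path (per-tensor symmetric scaling and the nibble-packing layout are assumptions, not necessarily the submission's exact scheme):

```python
import zlib
import numpy as np

def export_int4_zlib(tensors, level=9):
    """Quantize each fp32 tensor to int4 (per-tensor symmetric scale),
    pack two 4-bit values per byte, and zlib-compress the result."""
    blobs = []
    for w in tensors:
        scale = float(np.abs(w).max()) / 7.0 or 1.0  # guard all-zero tensors
        q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
        nib = (q + 8).astype(np.uint8).ravel()       # shift to 0..15
        if nib.size % 2:
            nib = np.append(nib, np.uint8(8))        # pad with quantized zero
        packed = (nib[0::2] << 4) | nib[1::2]        # two nibbles per byte
        blobs.append(packed.tobytes())
    return zlib.compress(b"".join(blobs), level)
```

Packing alone gives an 8x reduction over fp32; zlib then squeezes whatever statistical redundancy remains in the nibble stream (a real export would also need to store the per-tensor scales).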

Novel Contributions

  • Adaptive GQA transformer with macro side-channel cross-attention
  • Detached macro distillation teacher
  • Adaptive depth/router controls
  • Meta-preconditioned local transforms
  • Smear gate
  • Banked parallel Muon backbone with manual all-reduce on non-bank parameters
  • Int4+zlib export to fit under the 16MB artifact cap