PR #2038

open

Non-record 10min/16MB: GQA Macro Meta-Preconditioned (val_bpb 1.19995)

by FF-GardenFnView on GitHub
val_bpb
1.2000
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,559,229 bytes

Training Techniques

Architecture
GQA
Grouped-query attention transformer backbone.
parameters: null
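GQA shares each key/value head across a group of query heads, shrinking the KV projections and cache relative to full multi-head attention. A minimal numpy sketch of the attention step (head counts and shapes here are illustrative, not the submission's actual config):

```python
import numpy as np

def gqa_attention(q, k, v, n_kv_heads):
    """Causal grouped-query attention.
    q: (n_q_heads, T, d); k, v: (n_kv_heads, T, d); n_kv_heads divides n_q_heads."""
    n_q_heads, T, d = q.shape
    group = n_q_heads // n_kv_heads
    # Each K/V head serves `group` consecutive query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    # Causal mask: token t attends only to positions <= t.
    scores = np.where(np.triu(np.ones((T, T), dtype=bool), k=1), -np.inf, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

With 8 query heads and 2 KV heads this stores a quarter of the KV state of standard multi-head attention.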
SmearGate
Uses a smear gate in the model.
parameters: null
depth recurrence
Adaptive depth/router controls that vary computation depth dynamically.
parameters: null
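The PR does not spell out the router mechanism. As one plausible shape of the idea, here is an ACT-style halting sketch in numpy, where tokens stop recurring through a block once a cumulative halting score crosses a threshold (the `block`, router weights, and threshold are hypothetical stand-ins, not the submission's design):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_depth(h, block, w_router, max_depth=8, threshold=0.99):
    """Run `block` repeatedly over h: (T, d). A router accumulates a
    per-token halting probability; tokens whose total crosses `threshold`
    stop updating (early exit), others recur up to max_depth passes."""
    T, _ = h.shape
    halted = np.zeros(T, dtype=bool)
    cum = np.zeros(T)
    for _ in range(max_depth):
        if halted.all():
            break
        new_h = block(h)
        h = np.where(halted[:, None], h, new_h)  # halted tokens keep their state
        cum = cum + np.where(halted, 0.0, sigmoid(h @ w_router))
        halted |= cum >= threshold
    return h
```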
XSA
Macro side-channel cross-attention with a detached distillation teacher.
parameters: null
fp32 logit head
Computes the output logits with an fp32 head for full numerical precision.
parameters: null
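Keeping the final projection in fp32 avoids precision loss in the logits when the rest of the model runs at lower precision. A one-function numpy illustration (float16 stands in for whatever low-precision format the model uses; the names are illustrative):

```python
import numpy as np

def fp32_logit_head(hidden, w_head):
    """Upcast hidden states and head weights to fp32 before the final
    matmul, so the logits are computed and returned in full precision."""
    return hidden.astype(np.float32) @ w_head.astype(np.float32)
```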
meta-preconditioned local transforms
Applies meta-preconditioned local transforms.
parameters: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"banked_parallel":true,"manual_all_reduce_non_bank_parameters":true}
Quantization
int4
bits: 4
scope: model weights
Compression
zlib
level: null
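To fit under the 16MB artifact cap, weights are quantized to int4 and the result is zlib-compressed. A minimal sketch of such an export path (per-tensor symmetric scaling and the nibble-packing layout are assumptions, not necessarily the submission's exact scheme):

```python
import zlib
import numpy as np

def export_int4_zlib(tensors, level=9):
    """Quantize each fp32 tensor to int4 (per-tensor symmetric scale),
    pack two 4-bit values per byte, and zlib-compress the result."""
    blobs = []
    for w in tensors:
        scale = float(np.abs(w).max()) / 7.0 or 1.0  # guard all-zero tensors
        q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
        nib = (q + 8).astype(np.uint8).ravel()       # shift to 0..15
        if nib.size % 2:
            nib = np.append(nib, np.uint8(8))        # pad with quantized zero
        packed = (nib[0::2] << 4) | nib[1::2]        # two nibbles per byte
        blobs.append(packed.tobytes())
    return zlib.compress(b"".join(blobs), level)
```

Packing alone gives an 8x reduction over fp32; zlib then squeezes whatever statistical redundancy remains in the nibble stream (a real export would also need to store the per-tensor scales).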

Novel Contributions

  • Adaptive GQA transformer with macro side-channel cross-attention
  • Detached macro distillation teacher
  • Adaptive depth/router controls
  • Meta-preconditioned local transforms
  • Smear gate
  • Banked parallel Muon backbone with manual all-reduce on non-bank parameters
  • Int4+zlib export to fit under the 16MB artifact cap