PR #2038 (open)
Non-record 10min/16MB: GQA Macro Meta-Preconditioned (val_bpb 1.19995)
by FF-GardenFnView on GitHub
val_bpb: 1.2000
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,559,229 bytes
Training Techniques
Architecture
- GQA: grouped-query attention transformer backbone.
- SmearGate: smear gate applied in the model.
- Depth recurrence: adaptive depth/router controls for dynamic computation depth.
- XSA: macro side-channel cross-attention with a detached distillation teacher.
- fp32 logit head: output logits computed in an fp32 head.
- Meta-preconditioned local transforms: meta-preconditioned local transforms applied in the network.

(No per-technique parameters are recorded.)
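As a reference for the GQA backbone listed above, here is a minimal single-layer sketch of grouped-query attention, where a small number of KV heads is shared across the query heads (causal masking, projections, and the PR's actual head counts are omitted; the shapes below are illustrative only):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gqa(q, k, v):
    """q: (Hq, T, d); k, v: (Hkv, T, d), with Hq a multiple of Hkv.
    Each KV head is shared by Hq // Hkv consecutive query heads."""
    Hq, T, d = q.shape
    group = Hq // k.shape[0]
    k = np.repeat(k, group, axis=0)  # broadcast KV heads to (Hq, T, d)
    v = np.repeat(v, group, axis=0)
    att = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))  # (Hq, T, T)
    return att @ v                                        # (Hq, T, d)

rng = np.random.default_rng(0)
out = gqa(rng.normal(size=(8, 4, 16)),   # 8 query heads
          rng.normal(size=(2, 4, 16)),   # 2 shared KV heads
          rng.normal(size=(2, 4, 16)))
print(out.shape)  # (8, 4, 16)
```

Relative to standard multi-head attention, this shrinks the KV projections (and KV cache) by the group factor while keeping the full query head count.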
Optimizer
- Muon: banked_parallel: true; manual_all_reduce_non_bank_parameters: true (weight_decay and momentum not recorded)
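Muon's core update orthogonalizes the momentum matrix with a Newton-Schulz iteration before applying it. A minimal sketch follows; the quintic coefficients are taken from the public Muon implementation (an assumption about this PR's variant), the hyperparameters are illustrative, and the banked-parallel layout and manual all-reduce on non-bank parameters are distributed-training details not shown:

```python
import numpy as np

def newton_schulz_orth(G, steps=5):
    """Approximately orthogonalize G with a quintic Newton-Schulz iteration.
    Coefficients follow the public Muon implementation (assumption)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so singular values <= 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T                         # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(w, grad, buf, lr=0.02, momentum=0.95):
    """One Muon update: momentum accumulation, then an orthogonalized step.
    (lr and momentum here are illustrative, not the PR's settings.)"""
    buf = momentum * buf + grad
    return w - lr * newton_schulz_orth(buf), buf

rng = np.random.default_rng(0)
G = rng.normal(size=(32, 64))
X = newton_schulz_orth(G)  # singular values pushed toward 1
w, buf = muon_step(np.zeros((32, 64)), G, np.zeros((32, 64)))
```

The iteration only needs matmuls, which is why it parallelizes well across banks of same-shaped parameter matrices.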
Quantization
- int4: bits: 4; scope: model weights
Compression
- zlib (level not recorded)
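A minimal sketch of an int4-plus-zlib export path like the one described above (symmetric per-tensor quantization and nibble packing are assumptions here; the PR's actual scheme, e.g. per-channel scales or a different packing, is not specified):

```python
import zlib
import numpy as np

def quantize_int4(w):
    """Symmetric per-tensor int4 quantization to the range [-7, 7]."""
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def pack_and_compress(q):
    """Pack two int4 values per byte, then zlib-compress the artifact."""
    u = (q + 8).astype(np.uint8).ravel()  # shift to [0, 15]
    if u.size % 2:
        u = np.append(u, 0)               # pad to an even nibble count
    packed = (u[0::2] << 4) | u[1::2]     # two nibbles per byte
    return zlib.compress(packed.tobytes(), 9)

rng = np.random.default_rng(0)
w = rng.normal(size=(1024, 1024)).astype(np.float32)
q, scale = quantize_int4(w)
blob = pack_and_compress(q)
print(len(blob) <= 16 * 2**20)  # True: this tensor fits under a 16 MB cap
```

Packing gives the 8x size reduction over fp32 up front; zlib then recovers whatever redundancy remains in the nibble stream, which is what lets a ~15.6 MB artifact sit under the 16 MB cap.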
Novel Contributions
- Adaptive GQA transformer with macro side-channel cross-attention
- Detached macro distillation teacher
- Adaptive depth/router controls
- Meta-preconditioned local transforms
- Smear gate
- Banked parallel Muon backbone with manual all-reduce on non-bank parameters
- Int4+zlib export to fit under the 16MB artifact cap