val_bpb: 1.1574
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 16.41 MB
Training Techniques
Architecture
XSA
Cross-layer shared attention (XSA): attention weights are shared between layers; the reported experiment uses the all-layers variant.
parameters: {"layers":"all"}
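A minimal sketch of the idea, assuming XSA means one set of attention projections reused by every layer (`"layers": "all"`); the single-head setup and dimensions are illustrative, not the experiment's:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_layers, seq = 64, 4, 8

# One set of attention projections, shared by every layer.
Wq = rng.normal(0, 0.02, (d_model, d_model))
Wk = rng.normal(0, 0.02, (d_model, d_model))
Wv = rng.normal(0, 0.02, (d_model, d_model))

def shared_attention(x):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d_model)
    # causal mask: each position attends only to itself and earlier positions
    scores = np.where(np.tril(np.ones((seq, seq), bool)), scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

x = rng.normal(size=(seq, d_model))
for _ in range(n_layers):  # every layer reuses the same projections
    x = x + shared_attention(x)
```

Sharing the projections removes per-layer attention parameters, which is what makes the technique attractive for small artifacts.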
Value Embeddings
Shared value embeddings are added into the V stream at later attention layers to reinject token identity.
parameters: {"ve_dim":128,"layers":[10,11]}
MLP3x
Three-times wider MLP with LeakyReLU activation.
parameters: {"multiplier":3,"activation":"LeakyReLU"}
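The MLP block then looks like the following sketch (model width is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, multiplier = 64, 3
d_hidden = multiplier * d_model  # hidden layer 3x wider than d_model

W1 = rng.normal(0, 0.02, (d_model, d_hidden))
W2 = rng.normal(0, 0.02, (d_hidden, d_model))

def leaky_relu(x, slope=0.01):
    return np.where(x >= 0, x, slope * x)

def mlp3x(x):
    return leaky_relu(x @ W1) @ W2

y = mlp3x(rng.normal(size=(8, d_model)))
```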
Partial RoPE
Rotary position embeddings applied to only part of the head dimensions.
parameters: {"dimensions":"16/64"}
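A sketch of partial RoPE under the common rotate-the-first-dims convention: 16 of the 64 head dimensions are rotated, the remaining 48 pass through untouched.

```python
import numpy as np

head_dim, rot_dim = 64, 16  # "16/64": rotate only the first 16 dims

def partial_rope(x, positions, base=10000.0):
    """Rotate the first rot_dim dims of each head; pass the rest through."""
    half = rot_dim // 2
    freqs = base ** (-np.arange(half) / half)    # (half,)
    angles = positions[:, None] * freqs[None, :] # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dim]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dim:]], axis=-1)

x = np.ones((8, head_dim))
out = partial_rope(x, np.arange(8))
```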
U-Net skip connections
Skip connections added in a U-Net style across the network.
parameters: null
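A sketch of the U-Net pattern over transformer blocks: outputs of the first half of the layers are stacked and added back into the mirrored layers of the second half. The stand-in block is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, d = 8, 32
x = rng.normal(size=(4, d))

def block(h):
    return h + 0.1 * np.tanh(h)  # stand-in for a transformer block

stack, h = [], x
for _ in range(n_layers // 2):   # encoder half: save activations
    h = block(h)
    stack.append(h)
for _ in range(n_layers // 2, n_layers):  # decoder half: add them back
    h = h + stack.pop()          # skip from the mirrored layer
    h = block(h)
```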
BigramHash
Hashed bigram embedding table used as an input component.
parameters: {"buckets":10240}
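A sketch of a hashed bigram embedding with the reported 10240 buckets; the hash multiplier and embedding width are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
BUCKETS, d_model = 10240, 64
bigram_table = rng.normal(0, 0.02, (BUCKETS, d_model))

def bigram_hash(prev_tok, tok):
    """Hash a (previous, current) token pair into one of BUCKETS rows."""
    return (prev_tok * 1_000_003 + tok) % BUCKETS  # illustrative hash

def bigram_embed(tokens):
    # position 0 has no previous token; pair it with a sentinel 0
    prev = np.concatenate([[0], tokens[:-1]])
    return bigram_table[bigram_hash(prev, tokens)]

emb = bigram_embed(np.array([5, 17, 17, 9]))
```

The hashed table adds local bigram information to the input without a full vocab-squared table; collisions are accepted as noise.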
Weight Averaging
EMA (exponential moving average of weights)
parameters: {"decay":0.997,"qat_reset":true}
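A sketch of the EMA with the reported decay and the QAT reset: when quantization-aware training switches on, the pre-quantization average is discarded so the EMA tracks the quantized regime.

```python
class EMA:
    """Exponential moving average of weights (decay = 0.997)."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = dict(params)

    def update(self, params):
        d = self.decay
        for k, v in params.items():
            self.shadow[k] = d * self.shadow[k] + (1 - d) * v

    def qat_reset(self, params):
        # "qat_reset": true — restart the average at QAT activation
        self.shadow = dict(params)

ema = EMA({"w": 0.0})
for step in range(100):
    ema.update({"w": 1.0})
pre_reset = ema.shadow["w"]  # still dragged toward the old regime
ema.qat_reset({"w": 1.0})    # snap to the current (quantized) weights
```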
Quantization
mixed int6/int4
bits: 6
scope: attn/MLP/bigram
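A sketch of symmetric per-tensor integer quantization at the two reported bit widths; which tensors get 4 vs 6 bits is not specified here, so the assignment below is illustrative:

```python
import numpy as np

def quantize(w, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
q6, s6 = quantize(w, bits=6)  # e.g. attention / MLP weights
q4, s4 = quantize(w, bits=4)  # e.g. lower-precision tensors
err6 = np.abs(dequantize(q6, s6) - w).max()
err4 = np.abs(dequantize(q4, s4) - w).max()
```

The quantized integer tensors are then byte-packed and zstd-compressed to produce the small artifact.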
Compression
zstd
level: null
Optimizer
Parallel Muon
weight_decay: null
momentum: null
other_params: {"async_reduce_scatter":true,"no_ddp":true}
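Muon's core step orthogonalizes the momentum/gradient matrix with a Newton–Schulz iteration; the "parallel" part (async reduce-scatter over banked gradients, no DDP) is distributed plumbing omitted from this sketch. The quintic coefficients below are the ones used by common Muon implementations; this is an assumption, not this run's exact code.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize G, as in Muon's update rule."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so singular values <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X

rng = np.random.default_rng(0)
G = rng.normal(size=(16, 32))
O = newton_schulz_orthogonalize(G)
s = np.linalg.svd(O, compute_uv=False)  # singular values driven toward 1
```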
Test-Time Training
full TTT
parameters: {"learning_rate":0.002,"epochs":3,"legal":true,"score_first":true}
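A toy sketch of full test-time training with the reported learning rate and epoch count: the model is fine-tuned on the test sequence itself, and scoring before adapting loosely mirrors `score_first`. The linear model and MSE loss are stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
d, lr, epochs = 16, 0.002, 3            # lr/epochs from the reported config

W = rng.normal(0, 0.1, (d, d))          # "pretrained" weight (stand-in)
X = rng.normal(size=(32, d))            # the test sequence itself
Y = X @ rng.normal(0, 0.1, (d, d))      # next-token targets (stand-in)

def loss(w):
    return np.mean((X @ w - Y) ** 2)

before = loss(W)                        # score first, then adapt
for _ in range(epochs):                 # full TTT: update all weights
    grad = 2 * X.T @ (X @ W - Y) / len(X)
    W = W - lr * grad
after = loss(W)
```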
Regularization
LN scale
parameters: {"formula":"1/sqrt(layer+1)"}
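The formula fixes each layer's LayerNorm output scale by depth rather than learning it:

```python
import math

def ln_scale(layer):
    """Per-layer LayerNorm output scale: 1 / sqrt(layer + 1)."""
    return 1.0 / math.sqrt(layer + 1)

scales = [ln_scale(l) for l in range(4)]  # decays with depth
```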
Other
other
Late QAT trigger based on wallclock fraction of the training budget.
parameters: {"fraction":0.65}
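The trigger can be sketched as a wallclock check against the training budget; function and argument names are illustrative:

```python
import time

QAT_FRACTION = 0.65  # from the reported config

def qat_active(start, budget_seconds, now=None):
    """Enable QAT once 65% of the wallclock budget has elapsed."""
    now = time.time() if now is None else now
    return (now - start) / budget_seconds >= QAT_FRACTION

start = 1000.0
a = qat_active(start, 100.0, now=1050.0)  # 50% of budget elapsed
b = qat_active(start, 100.0, now=1070.0)  # 70% of budget elapsed
```

Triggering on wallclock rather than step count keeps the QAT phase a fixed share of the time budget even if throughput varies.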
Novel Contributions
- Value embeddings injected into later attention layers
- Parallel Muon with async reduce-scatter on banked gradients
- Banked model tensors for qo/kv/mlp_up/mlp_down
- EMA with QAT-reset at quantization activation
- Mixed INT4/INT6 quantization with zstd compression
- XSA-all experiment achieving a new best validation quality (val_bpb)