PR #315
Status: closed
Record: 11L Partial RoPE + LN Scale + EMA + XSA4 (val_bpb: 1.1248)
by jfprincz
val_bpb: 1.1248
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.6 MB
Training Techniques
Architecture
Partial RoPE
Apply rotary position embeddings to only part of the head dimensions, leaving the rest position-free.
parameters: {"dimensions":16,"total_dimensions":64}
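A minimal NumPy sketch of the idea: rotate only the first 16 of 64 head dimensions and pass the rest through position-free. The frequency base of 10000 is the usual RoPE default, not something stated in the PR.

```python
import numpy as np

def partial_rope(x, positions, rope_dims=16, base=10000.0):
    # Rotate only the first `rope_dims` of each head dimension;
    # the remaining dims carry no positional signal (PR: 16 of 64).
    half = rope_dims // 2
    freqs = 1.0 / (base ** (np.arange(half) / half))      # (half,)
    angles = positions[:, None] * freqs[None, :]          # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rope_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rope_dims:]], axis=-1)
```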
XSA
Exclusive Self Attention applied to the last 4 layers.
parameters: {"layers":4}
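The PR does not spell out the XSA definition. One plausible reading of "Exclusive Self Attention" is a causal mask where each token is excluded from attending to itself; a sketch of that mask, under that assumption:

```python
import numpy as np

def xsa_mask(seq_len):
    # Assumed reading of "Exclusive Self Attention": causal attention
    # with the diagonal masked out, so a token never attends to itself.
    # Token 0 keeps itself so its softmax row is not fully masked.
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool), k=-1)
    mask[0, 0] = True
    return mask
```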
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer_idx+1)"}
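The scale rule is simple to state in code: the norm output of layer `layer_idx` is damped by `1/sqrt(layer_idx + 1)`, so layer 0 is unscaled and layer 3 is halved. A sketch using RMSNorm:

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def scaled_rmsnorm(x, layer_idx):
    # Damp the norm output by 1/sqrt(layer_idx + 1), keeping
    # residual-stream growth roughly flat with depth.
    return rmsnorm(x) / np.sqrt(layer_idx + 1)
```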
Weight Averaging
EMA
parameters: {"decay":0.997}
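Standard EMA weight averaging, with the PR's decay of 0.997; the shadow copy is what gets evaluated:

```python
import numpy as np

class WeightEMA:
    # Shadow copy of the weights, updated each step as
    #   shadow <- decay * shadow + (1 - decay) * current
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = {name: w.copy() for name, w in params.items()}

    def update(self, params):
        d = self.decay
        for name, w in params.items():
            self.shadow[name] = d * self.shadow[name] + (1.0 - d) * w
```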
Quantization
mixed int6/int8
bits: 6
scope: MLP and attention int6; embeddings int8
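A sketch of the quantization step, assuming symmetric per-tensor scales (the PR does not state its grouping, and per-channel scales are a common refinement). Per the scope above, MLP/attention weights would use `bits=6` and embeddings `bits=8`:

```python
import numpy as np

def quantize_symmetric(w, bits):
    # Symmetric round-to-nearest quantization with a single scale.
    qmax = 2 ** (bits - 1) - 1          # 31 for int6, 127 for int8
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

int6 values still travel in int8 storage here; packing them to 6 bits before zstd is a separate (unspecified) step.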
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
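A sketch of the usual sliding-window bpb protocol this implies: slide a full-context window by `stride` tokens and score only the tokens not already covered, so every token is scored exactly once with near-maximal left context. The PR's values would be `context=2048, stride=64`.

```python
def sliding_window_spans(n_tokens, context=2048, stride=64):
    # Returns (ctx_start, score_from, end) triples: each window sees up
    # to `context` tokens but is scored only on the new ones.
    spans, scored, start = [], 0, 0
    while scored < n_tokens:
        end = min(start + context, n_tokens)
        spans.append((start, max(scored, start), end))
        scored = end
        start += stride
    return spans
```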
Initialization
OrthoInit
Orthogonal initialization with muP scaling on large matrices.
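A sketch of orthogonal init via QR of a Gaussian, with a sign fix so the result is Haar-uniform. The `1/sqrt(fan_in)` factor is one common muP-flavored choice and is an assumption here; the PR only says "muP scaling on large matrices".

```python
import numpy as np

def ortho_init(fan_out, fan_in, rng, mup_scale=True):
    # Assumes fan_out >= fan_in so Q has orthonormal columns.
    a = rng.standard_normal((fan_out, fan_in))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))   # sign fix -> Haar-uniform distribution
    # Assumed muP-style scaling; exact rule is not stated in the PR.
    return q / np.sqrt(fan_in) if mup_scale else q
```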
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"warmup_steps":1500}
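These parameters describe a trapezoidal schedule: linear warmup over 1500 steps, a constant plateau, then a linear warmdown to zero over the final 3000 steps. A sketch of the multiplier (`total_steps` is whatever the run uses; it is not stated above):

```python
def lr_scale(step, total_steps, warmup_steps=1500, warmdown_iters=3000):
    # Trapezoid: ramp up, hold at 1.0, ramp down to 0 at the end.
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    if step >= total_steps - warmdown_iters:
        return (total_steps - step) / warmdown_iters
    return 1.0
```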
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500,"adam_weight_decay":0.04}
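The `muon_momentum_warmup_*` entries suggest a linear momentum ramp from 0.92 to the final 0.99 over the first 1500 steps; a sketch of that schedule:

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    # Linearly ramp Muon's momentum over the warmup, then hold.
    t = min(step / warmup_steps, 1.0)
    return start + t * (end - start)
```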
Novel Contributions
- Partial RoPE applied to only 16 of 64 head dimensions
- LayerNorm/RMSNorm output scaling by 1/sqrt(layer_idx+1)
- 11-layer Transformer with XSA on the last 4 layers
- EMA weight averaging with decay 0.997
- Mixed int6/int8 quantization with zstd compression
- Caveat (not a contribution): a late-QAT flag was present but had no effect, because torch.compile constant-folded it away