PR #564
openRecord: 11L Tight SWA + Partial RoPE + LN Scale + XSA4 (val_bpb: 1.1270)
by sadeghja1070
val_bpb
1.1270
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.5 MB
Training Techniques
Weight Averaging
SWA
parameters: {"name":"Tight SWA","scale_threshold":0.2,"frequency_steps":50,"checkpoint_count":12,"description":"SWA checkpoint collection restricted to scale<0.2 (last ~600 steps), every 50 steps, eliminating SWA quality penalty while maintaining quantization-friendly weight averaging."}
Architecture
XSA
Exclusive Self Attention applied on the last 4 layers
parameters: {"layers":4}
Partial RoPE
Rotary Positional Embeddings applied to 16 of 64 head dimensions, with NTK-aware scaling
parameters: {"dimensions":16,"total_dimensions":64}
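A minimal sketch of partial RoPE as described above: only the first 16 of 64 head dimensions are rotated, the rest pass through unchanged. The NTK scaling factor `NTK_ALPHA` and the base-stretching formula are assumptions (the PR gives only the 16/64 split):

```python
import numpy as np

ROT_DIMS = 16      # rotated dims per head (from PR params)
HEAD_DIM = 64      # total dims per head (from PR params)
BASE = 10000.0
NTK_ALPHA = 2.0    # assumed NTK-aware scaling factor, not given in the PR

def partial_rope(x, positions):
    """Rotate the first ROT_DIMS of HEAD_DIM dims; pass the rest through.
    x: (seq, HEAD_DIM), positions: (seq,)."""
    half = ROT_DIMS // 2
    # NTK-aware scaling: stretch the base so low frequencies extrapolate
    # beyond the training context (common formulation; an assumption here).
    base = BASE * NTK_ALPHA ** (ROT_DIMS / (ROT_DIMS - 2))
    inv_freq = base ** (-np.arange(half) * 2.0 / ROT_DIMS)
    theta = positions[:, None] * inv_freq[None, :]        # (seq, half)
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, :half], x[:, half:ROT_DIMS]
    rot = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rot, x[:, ROT_DIMS:]], axis=-1)
```

Rotation preserves the norm of each rotated pair, so the untouched 48 dimensions and the overall scale of the rotated block are unchanged.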
LN Scale
LayerNorm scale factor set to 1/sqrt(layer_idx+1)
parameters: null
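The LN Scale rule above (gain fixed to 1/sqrt(layer_idx+1)) might look like this sketch; whether a learned gain sits on top of the fixed factor is not stated in the PR, so this version uses the fixed factor alone:

```python
import numpy as np

def scaled_layernorm(x, layer_idx, eps=1e-5):
    """LayerNorm whose output is scaled by 1/sqrt(layer_idx + 1),
    damping the contribution of deeper layers (sketch)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    normed = (x - mu) / np.sqrt(var + eps)
    return normed / np.sqrt(layer_idx + 1)
```

Layer 0 keeps the standard unit-variance output; layer 3, for instance, is scaled by exactly 1/2.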
SmearGate
SmearGate applied as an architectural component
parameters: null
BigramHash
Bigram hashing with 2048 buckets and dimension 128
parameters: {"buckets":2048,"dimension":128}
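The BigramHash component above (2048 buckets, dimension 128) can be sketched as a hashed embedding over (previous, current) token pairs. The hash multiplier, the init scale, and the handling of position 0 are assumptions:

```python
import numpy as np

BUCKETS = 2048   # from PR params
DIM = 128        # from PR params

rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((BUCKETS, DIM)) * 0.02  # assumed init scale

def bigram_hash(prev_tok, cur_tok):
    """Hash a (prev, cur) token pair into one of BUCKETS buckets.
    The odd multiplier 1000003 is an arbitrary choice (assumption)."""
    return (prev_tok * 1000003 + cur_tok) % BUCKETS

def bigram_features(tokens):
    """Per-position bigram embedding, one vector per token position.
    Position 0 has no previous token; bucket 0 is used as a stand-in."""
    ids = [bigram_hash(p, c) for p, c in zip(tokens[:-1], tokens[1:])]
    return bigram_table[[0] + ids]
```

These features would typically be added to the token-embedding stream before the first transformer block.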
MLP3x
MLP expansion factor 3 with relu-squared activation
parameters: {"expansion_factor":3,"activation":"relu-squared"}
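The MLP3x block above, with a 3x hidden expansion and the relu-squared activation, reduces to a few lines (bias-free, as is common in this family of models; that detail is an assumption):

```python
import numpy as np

EXPANSION = 3  # from PR params: hidden width = 3 * d_model

def relu2(x):
    """relu-squared activation: max(x, 0) ** 2."""
    return np.maximum(x, 0.0) ** 2

def mlp3x(x, w_in, w_out):
    """Two-layer MLP with 3x expansion and relu-squared (sketch)."""
    return relu2(x @ w_in) @ w_out

d = 8  # toy model width for illustration
rng = np.random.default_rng(1)
w_in = rng.standard_normal((d, EXPANSION * d)) / np.sqrt(d)
w_out = rng.standard_normal((EXPANSION * d, d)) / np.sqrt(EXPANSION * d)
```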
weight tying
Tied input and output embedding matrices
parameters: null
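Weight tying means the output head reuses the input embedding table rather than keeping a separate unembedding matrix, halving the embedding parameter count. A toy-sized sketch:

```python
import numpy as np

VOCAB, D = 32, 8  # toy sizes for illustration
rng = np.random.default_rng(2)
E = rng.standard_normal((VOCAB, D)) * 0.02  # the single shared embedding table

def embed(tokens):
    """Input side: look up token embeddings."""
    return E[tokens]

def logits(hidden):
    """Output side: project against the SAME table, transposed (tied head)."""
    return hidden @ E.T
```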
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"momentum_warmup_start":0.92,"momentum_warmup_steps":1500,"learning_rate_matrix":0.025,"learning_rate_scalar":0.025,"learning_rate_embedding":0.035,"gradient_clip":0.3,"adamw_for_embeddings":true}
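The momentum warmup in other_params (0.92 to 0.99 over 1500 steps) can be sketched as a small schedule function; the linear shape is an assumption, since the PR gives only the endpoints and the warmup length:

```python
def muon_momentum(step, start=0.92, final=0.99, warmup_steps=1500):
    """Momentum warmup for Muon: ramp from `start` to `final` over
    `warmup_steps`, then hold. Endpoints and length are from the PR;
    the linear interpolation is an assumption."""
    if step >= warmup_steps:
        return final
    frac = step / warmup_steps
    return start + frac * (final - start)
```

Per the same block, embeddings (and scalars) are handled by a separate AdamW optimizer rather than Muon.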
Compression
zstd
level: 22
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Evaluation
sliding window eval
parameters: {"stride":64}
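Sliding-window evaluation with stride 64 means each 2048-token window scores only its last 64 tokens (so every scored token sees close to the full context), except the first window, which scores everything. A sketch of the span bookkeeping, with the scoring convention as an assumption:

```python
def sliding_windows(n_tokens, window=2048, stride=64):
    """Yield (start, end, score_from) spans for sliding-window eval.
    Tokens in [score_from, end) are scored; [start, score_from) is context.
    The first window scores all of its tokens (assumed convention)."""
    spans = [(0, min(window, n_tokens), 0)]
    pos = spans[0][1]
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = end - window if end >= window else 0
        spans.append((start, end, pos))
        pos = end
    return spans
```

Every token is scored exactly once, and no window exceeds the 2048-token eval length.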
Initialization
OrthoInit
Orthogonal initialization with projection scaling by 1/sqrt(2*num_layers)
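The OrthoInit scheme above can be sketched via a QR decomposition; the PR applies the 1/sqrt(2*num_layers) factor to projection layers, and `NUM_LAYERS = 11` is taken from the "11L" in the title:

```python
import numpy as np

NUM_LAYERS = 11  # from the PR title ("11L")

def ortho_init(rows, cols, num_layers=NUM_LAYERS, seed=0):
    """Orthogonal init scaled by 1/sqrt(2 * num_layers) (sketch).
    QR of a Gaussian matrix gives orthonormal columns; the sign fix
    makes the distribution uniform over orthogonal matrices."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((rows, cols))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))  # fix QR sign ambiguity
    return q / np.sqrt(2 * num_layers)
```

With 11 layers, the Gram matrix of the result is the identity divided by 22, keeping residual-branch variance roughly constant with depth.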
Regularization
weight decay
parameters: {"value":0.04}
Other
other
U-Net skip connections with 5 encoder and 6 decoder layers
parameters: {"encoder_layers":5,"decoder_layers":6}
Novel Contributions
- Tight SWA: restricting SWA checkpoint collection to LR scale < 0.2 (the last ~600 steps), sampling every 50 steps, eliminating the SWA quality penalty while keeping weight averaging quantization-friendly.
- Use of Partial RoPE on 16/64 dimensions with NTK-aware scaling.
- Applying Exclusive Self Attention (XSA) on last 4 layers.
- LayerNorm scale factor set to 1/sqrt(layer_idx+1).
- Combination of SmearGate and BigramHash (2048 buckets, dim=128) in architecture.
- Int6 per-row quantization for MLP and attention weights combined with Int8 per-row for embeddings.
- Orthogonal initialization with projection scaling by 1/sqrt(2*num_layers).
- Use of Muon optimizer with momentum warmup and separate AdamW for embeddings and scalars.
- U-Net style skip connections with 5 encoder and 6 decoder layers.
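The Int6/Int8 per-row quantization listed among the contributions can be sketched as symmetric per-row scaling, where each row's max-magnitude entry maps to the integer extreme. Int6 would apply to MLP/attention weights and Int8 to embeddings; the symmetric round-to-nearest scheme itself is an assumption:

```python
import numpy as np

def quantize_per_row(w, bits):
    """Symmetric per-row quantization: one scale per row so the row's
    max-|value| entry maps to qmax (31 for int6, 127 for int8). Sketch."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate float weights from codes and per-row scales."""
    return q.astype(np.float64) * scale
```

Round-to-nearest bounds the per-entry reconstruction error by half a quantization step, i.e. scale/2 per row.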