val_bpb: 1.1476
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,497,769 bytes
Training Techniques
Quantization
mixed Int5/Int6 QAT (scope: MLP weights Int5, attention weights Int6)
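The mixed-precision scheme can be sketched with a symmetric per-tensor fake-quantizer. This is a minimal illustration, not the submission's QAT code; per-channel scales and the straight-through estimator used during training are omitted.

```python
import numpy as np

def fake_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric per-tensor fake quantization: snap weights onto a
    signed `bits`-bit grid, then map back to float. During QAT the
    rounding step is bypassed with a straight-through estimator."""
    qmax = 2 ** (bits - 1) - 1            # 15 for Int5, 31 for Int6
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
w_mlp = rng.normal(size=(256, 64)).astype(np.float32)
w_attn = rng.normal(size=(64, 64)).astype(np.float32)

w_mlp_q = fake_quantize(w_mlp, bits=5)    # MLP weights at Int5
w_attn_q = fake_quantize(w_attn, bits=6)  # attention weights at Int6
```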
Architecture
BigramHash: expanded from 2048 to 10240 buckets; XOR hash of consecutive token pairs into learned 128-dim embeddings to reduce collisions and improve the bigram-level signal
parameters: {"buckets":10240,"embedding_dim":128}
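The bucket lookup can be sketched as follows. The multiplicative mixing constant and the table initialization are illustrative assumptions; the entry only specifies an XOR hash, 10240 buckets, and 128-dim embeddings.

```python
import numpy as np

NUM_BUCKETS = 10240
EMBED_DIM = 128

def bigram_bucket(prev_tok: int, cur_tok: int) -> int:
    """XOR-hash a consecutive token pair into one of NUM_BUCKETS.
    The golden-ratio multiplier decorrelates the two ids before the
    XOR; the exact mixing used by the submission is not specified."""
    return ((prev_tok * 0x9E3779B1) ^ cur_tok) % NUM_BUCKETS

# One learned 128-dim embedding per bucket (random init here).
bigram_table = np.random.default_rng(0).normal(
    scale=0.02, size=(NUM_BUCKETS, EMBED_DIM))

tokens = [17, 4032, 991, 17, 4032]
# Look up one embedding per (t-1, t) pair; in the model these are
# added to the token-embedding stream.
feats = np.stack([bigram_table[bigram_bucket(a, b)]
                  for a, b in zip(tokens, tokens[1:])])
```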
XSA: cross-layer self-attention applied to the last 4 layers (layers 8-11)
parameters: {"layers":4,"layer_indices":[8,9,10,11]}
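The entry doesn't define XSA beyond "cross-layer self-attention on layers 8-11". One speculative single-head reading is that queries from the current layer attend over hidden states pooled from several layers; a sketch under that assumption, with projections and causal masking omitted:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_layer_attention(q_states, layer_states):
    """Speculative XSA sketch: queries attend to keys/values drawn
    from the hidden states of several layers at once (here: simple
    concatenation along the sequence axis)."""
    kv = np.concatenate(layer_states, axis=0)        # (n_layers*seq, dim)
    scores = q_states @ kv.T / np.sqrt(q_states.shape[-1])
    return softmax(scores) @ kv                      # (seq, dim)

rng = np.random.default_rng(0)
seq, dim = 8, 32
layers_8_to_11 = [rng.normal(size=(seq, dim)) for _ in range(4)]
out = cross_layer_attention(layers_8_to_11[-1], layers_8_to_11)
```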
SmearGate: gating mechanism
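SmearGate is not defined in the entry. One plausible reading, matching the "smear" modules seen in nanogpt-speedrun derivatives, is a learned sigmoid gate that blends each token's representation with its predecessor's. A speculative sketch under that assumption:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def smear_gate(x: np.ndarray, gate_w: np.ndarray) -> np.ndarray:
    """Speculative: blend each position's vector with the previous
    position's, weighted by a learned data-dependent gate in (0, 1)."""
    g = sigmoid(x @ gate_w)[:, None]   # (seq, 1): one gate per position
    prev = np.roll(x, 1, axis=0)
    prev[0] = x[0]                     # the first token has no predecessor
    return (1.0 - g) * x + g * prev

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 16))
gate_w = rng.normal(scale=0.1, size=16)
y = smear_gate(x, gate_w)
```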
Partial RoPE: rotary positional embeddings applied to 16 dimensions
parameters: {"dimensions":16}
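A minimal sketch of rotating only 16 of the head dimensions. Which 16 dimensions are rotated, the pairing convention, and the frequency base are all assumptions here; the entry only gives the count.

```python
import numpy as np

def partial_rope(x: np.ndarray, rot_dims: int = 16, base: float = 10000.0):
    """Apply rotary position embeddings to the first `rot_dims`
    dimensions of each position's head vector; the remaining
    dimensions pass through unrotated."""
    seq, _ = x.shape
    half = rot_dims // 2
    freqs = base ** (-np.arange(half) / half)           # (half,)
    ang = np.arange(seq)[:, None] * freqs[None, :]      # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]           # paired dims
    rot = np.concatenate([x1 * cos - x2 * sin,
                          x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rot, x[:, rot_dims:]], axis=-1)

rng = np.random.default_rng(0)
q = rng.normal(size=(10, 64))           # (seq, head_dim)
q_rot = partial_rope(q)
```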
OrthoInit: orthogonal initialization of weights
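Orthogonal initialization is a standard construction via QR decomposition of a Gaussian matrix; a sketch:

```python
import numpy as np

def orthogonal_init(shape, gain=1.0, rng=None):
    """Return a `shape` matrix with orthonormal columns (if tall) or
    rows (if wide), built from the QR decomposition of a random
    Gaussian matrix; the sign fix makes the distribution uniform
    over orthogonal matrices."""
    rng = rng or np.random.default_rng()
    rows, cols = shape
    a = rng.normal(size=(max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))
    w = q if rows >= cols else q.T
    return gain * w

w = orthogonal_init((8, 4), rng=np.random.default_rng(0))
```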
Value Embed: value embeddings added on layers 10 and 11 with dimension 128
parameters: {"layers":[10,11],"dim":128}
MLP expansion: expansion factor 3x (hidden=1536)
parameters: {"expansion_factor":3,"hidden_dim":1536}
Embedding: tied FP16 embeddings
Optimizer
Muon: weight_decay 0.04, momentum 0.99, warmup_steps 1500
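Muon's core step orthogonalizes the momentum-averaged gradient with a Newton-Schulz iteration; a sketch using the quintic coefficients from the reference implementation, with the surrounding momentum, weight-decay, and learning-rate logic omitted:

```python
import numpy as np

def newton_schulz(g: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize a 2-D update: push all singular
    values of g toward 1 via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)      # spectral norm now <= 1
    transposed = g.shape[0] > g.shape[1]
    if transposed:                          # iterate on the short side
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * (A @ A)) @ x
    return x.T if transposed else x

rng = np.random.default_rng(0)
g = rng.normal(size=(16, 32))
u = newton_schulz(g)
```

After the iteration, all singular values of the update sit near 1, which is what lets Muon take uniformly scaled steps across directions.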
Weight Averaging
EMA: decay 0.997
SWA: starts in the last 20% of warmdown, averaging every 50 steps
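Both averages are simple to state (minimal sketches; parameter layout as flat lists is illustrative):

```python
def ema_update(avg, params, decay=0.997):
    """One EMA step per training step: avg <- decay*avg + (1-decay)*params."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]

def swa_average(checkpoints):
    """SWA: a plain mean over checkpoints collected every 50 steps
    during the last 20% of the warmdown phase."""
    n = len(checkpoints)
    return [sum(ws) / n for ws in zip(*checkpoints)]
```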
Compression
zstd, level 22
Evaluation
sliding window eval, stride 64
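Stride-64 sliding-window evaluation re-reads overlapping context windows but scores each token exactly once; a sketch of the window planning (the context length of 512 below is an assumption, the entry only gives stride=64):

```python
def sliding_eval_spans(n_tokens: int, context: int, stride: int = 64):
    """Plan evaluation windows: each window covers up to `context`
    tokens and advances by `stride`; only tokens not already scored
    by an earlier window are counted, so every token is scored once
    with near-maximal left context."""
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context, n_tokens)
        spans.append((begin, end, prev_end))  # score tokens [prev_end, end)
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = sliding_eval_spans(n_tokens=1000, context=512, stride=64)
```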
LR Schedule
warmup and warmdown (warmup_steps 1500, late_QAT_start_scale 0.15)
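A trapezoidal schedule matching the listed parameters might look like this. The warmdown fraction and total step count are assumptions; `late_QAT_start_scale: 0.15` plausibly means QAT switches on once the decaying LR scale falls below 0.15, but the entry does not say.

```python
def lr_scale(step: int, total_steps: int, warmup: int = 1500,
             warmdown_frac: float = 0.3) -> float:
    """Trapezoid: linear warmup over `warmup` steps, flat at 1.0,
    then linear warmdown over the final `warmdown_frac` of training.
    warmdown_frac is an assumption; only warmup_steps=1500 is given."""
    warmdown = int(total_steps * warmdown_frac)
    if step < warmup:
        return step / warmup
    if step > total_steps - warmdown:
        return max(0.0, (total_steps - step) / warmdown)
    return 1.0

LATE_QAT_START_SCALE = 0.15   # assumed trigger: enable QAT once
                              # lr_scale(step, ...) drops below this
```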
Novel Contributions
- Mixed Int5/Int6 quantization, with MLP weights at Int5 and attention weights at Int6, to reduce artifact size
- Addition of a 12th transformer layer, funded by the bytes saved from Int5 MLP compression
- Expansion of BigramHash embedding buckets from 2048 to 10240 to reduce hash collisions and improve the bigram-level signal
- Use of the SmearGate gating mechanism and OrthoInit initialization
- Application of cross-layer self-attention (XSA) to the last 4 layers
- Partial RoPE applied to 16 dimensions
- Late quantization-aware training (QAT) combined with a GPTQ-lite clip search
- Use of EMA and SWA weight averaging