PR #592 (open)

Submission: 12L Int5-MLP BigramHash10K EMA (1.1476 BPB)

by Skytuhua
val_bpb
1.1476
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,497,769 bytes

Training Techniques

Quantization
mixed Int5/Int6 QAT
bits: null
scope: MLP weights Int5, Attention weights Int6
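The mixed-precision scheme amounts to quantize-dequantize at two different bit widths. A minimal sketch, assuming symmetric round-to-nearest fake quantization (the submission does not spell out the exact scheme, and in real QAT gradients would flow through via a straight-through estimator):

```python
def fake_quantize(w, bits):
    """Quantize-dequantize a weight list at the given bit width
    (symmetric round-to-nearest; assumed scheme, not confirmed)."""
    qmax = 2 ** (bits - 1) - 1                      # 15 for Int5, 31 for Int6
    scale = (max(abs(x) for x in w) / qmax) or 1.0  # guard the all-zero case

    def qdq(x):
        q = max(-qmax - 1, min(qmax, round(x / scale)))  # clamp to the int range
        return q * scale                                 # dequantize back to float

    return [qdq(x) for x in w]

# Toy weights; per the submission, MLP weights go to Int5, attention to Int6.
mlp_w  = [0.8, -1.3, 0.05, 2.0]
attn_w = [0.8, -1.3, 0.05, 2.0]
mlp_q  = fake_quantize(mlp_w, bits=5)
attn_q = fake_quantize(attn_w, bits=6)
```

Each extra bit doubles the number of representable levels, so the Int6 attention weights keep roughly half the rounding error of the Int5 MLP weights at one extra bit per weight.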
Architecture
BigramHash
Expanded from 2048 to 10240 buckets; consecutive token pairs are XOR-hashed into learned 128-dim embeddings, reducing collisions and improving the bigram-level signal
parameters: {"buckets":10240,"embedding_dim":128}
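The bucket lookup can be sketched as follows; the multiplicative mixing constant is an illustrative assumption, since the submission only states an XOR hash of consecutive token pairs into 10240 buckets of 128-dim embeddings:

```python
N_BUCKETS, EMB_DIM = 10240, 128   # per the submission's parameters

def bigram_bucket(prev_tok, cur_tok, n_buckets=N_BUCKETS):
    """Map a consecutive token pair to a bucket index.
    The Knuth-style multiplier is an assumed mixer that makes the
    XOR order-sensitive; the submission does not specify one."""
    return ((prev_tok * 0x9E3779B1) ^ cur_tok) % n_buckets

def bigram_embeddings(tokens, table):
    """One learned embedding row per adjacent token pair."""
    return [table[bigram_bucket(a, b)] for a, b in zip(tokens, tokens[1:])]

# Toy usage with a zero-initialized (in practice learned) table.
table = [[0.0] * EMB_DIM for _ in range(N_BUCKETS)]
embs = bigram_embeddings([5, 17, 17, 42], table)
```

With 10240 buckets instead of 2048, the expected collision rate over a fixed set of frequent bigrams drops roughly five-fold.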
XSA
Cross-layer self-attention applied to the last 4 layers (layers 8-11)
parameters: {"layers":4,"layer_indices":[8,9,10,11]}
SmearGate
Applied SmearGate gating mechanism
parameters: null
Partial RoPE
Rotary positional embeddings applied to only 16 dimensions
parameters: {"dimensions":16}
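Partial RoPE rotates only a prefix of each feature vector and leaves the rest position-agnostic. A sketch for a single position and vector, assuming the 16 rotated dimensions are the leading ones and a standard frequency base of 10000 (neither detail is stated in the submission):

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Rotate the first rot_dims features of x in pairs by
    position-dependent angles; remaining dims pass through unchanged."""
    out = list(x)
    for i in range(0, rot_dims, 2):
        theta = pos * base ** (-i / rot_dims)   # per-pair rotation frequency
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out[i], out[i + 1] = a * c - b * s, a * s + b * c
    return out
```

At position 0 the rotation is the identity, and the rotation preserves the norm of each rotated pair, so the untouched dimensions carry purely content-based information.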
OrthoInit
Orthogonal initialization of weights
parameters: null
Value Embed
Value embeddings added on layers 10 and 11 with dimension 128
parameters: {"layers":[10,11],"dim":128}
MLP expansion
MLP expansion factor 3x (hidden=1536)
parameters: {"expansion_factor":3,"hidden_dim":1536}
Embedding
Tied FP16 embeddings
parameters: null
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"warmup_steps":1500}
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"start":"last 20% of warmdown","frequency_steps":50}
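Both weight averages are cheap to maintain alongside training. A sketch with parameters flattened to plain lists; the EMA decay of 0.997 and the 50-step SWA snapshot interval match the reported settings, but the bookkeeping details are assumptions:

```python
def ema_update(avg, params, decay=0.997):
    """In-place exponential moving average: avg <- decay*avg + (1-decay)*params."""
    for i, p in enumerate(params):
        avg[i] = decay * avg[i] + (1.0 - decay) * p

class SWA:
    """Equal-weight running average of snapshots taken every `every` steps."""
    def __init__(self, every=50):
        self.every, self.n, self.avg = every, 0, None

    def maybe_update(self, step, params):
        if step % self.every:          # only snapshot on multiples of `every`
            return
        self.n += 1
        if self.avg is None:
            self.avg = list(params)
        else:
            for i, p in enumerate(params):
                self.avg[i] += (p - self.avg[i]) / self.n  # running mean
```

The EMA tracks a smoothed recent trajectory, while SWA over the tail of the warmdown averages equally over the final, low-LR region of training.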
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
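Sliding-window evaluation scores each token with as much left context as the model's window allows, advancing by the stride between windows. A sketch of the window plan; window=512 is an illustrative context length, since the submission only reports stride=64:

```python
def sliding_windows(n_tokens, window=512, stride=64):
    """Plan (ctx_start, ctx_end, score_from) spans: each window covers up to
    `window` tokens of context but only its final `stride` tokens are scored,
    so every scored token sees near-full left context."""
    spans = []
    score_from = 0
    while score_from < n_tokens:
        ctx_start = max(0, score_from + stride - window)
        ctx_end = min(score_from + stride, n_tokens)
        spans.append((ctx_start, ctx_end, score_from))
        score_from += stride
    return spans
```

Every token is scored exactly once; a smaller stride gives each token closer-to-full context at proportionally higher evaluation cost.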
LR Schedule
warmup and warmdown
parameters: {"warmup_steps":1500,"late_QAT_start_scale":0.15}
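The schedule is trapezoidal: linear warmup over 1500 steps, a constant plateau, then a linear warmdown. A sketch, where peak_lr and the warmdown fraction are illustrative choices (the submission additionally scales the LR by 0.15 once late QAT begins, which is omitted here):

```python
def lr_at(step, total_steps, peak_lr=1.0, warmup_steps=1500, warmdown_frac=0.2):
    """Warmup-hold-warmdown (trapezoidal) learning-rate schedule."""
    warmdown_steps = int(total_steps * warmdown_frac)
    if step < warmup_steps:                        # linear warmup
        return peak_lr * (step + 1) / warmup_steps
    if step >= total_steps - warmdown_steps:       # linear warmdown to ~0
        return peak_lr * (total_steps - step) / warmdown_steps
    return peak_lr                                 # constant plateau
```

The SWA snapshots above start in the last 20% of the warmdown, i.e. deep into the final linear ramp where the LR is already small.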

Novel Contributions

  • Mixed Int5/Int6 quantization, with MLP weights at Int5 and attention weights at Int6, to reduce artifact size
  • Addition of a 12th transformer layer, paid for by the size savings from Int5 MLP compression
  • Expansion of BigramHash embedding buckets from 2048 to 10240 to reduce hash collisions and improve the bigram-level signal
  • Use of the SmearGate gating mechanism and OrthoInit initialization
  • Application of cross-layer self-attention (XSA) to the last 4 layers
  • Partial RoPE applied to 16 dimensions
  • Late Quantization-Aware Training (QAT) combined with GPTQ-lite clip search
  • Use of EMA and SWA weight averaging techniques