PR #695 (open)

Record: 11L XSA6 + Warmdown3000 + QAT@0.30 (val_bpb=1.1352, 2-seed mean)

by 0xNoramiya
val_bpb: 1.1360
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.88 MB

Training Techniques

Architecture
XSA
Extends efficient partial XSA to the last 6 layers instead of the last 4.
parameters: {"layers":6}
BigramHash
Uses BigramHash for token/context representation.
parameters: {"buckets":2048,"dim":128}
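The BigramHash representation can be sketched as a hashed embedding lookup over (previous token, current token) pairs; only buckets=2048 and dim=128 come from this record, so the hash mix and the sentinel for the first position are illustrative assumptions:

```python
import numpy as np

BUCKETS, DIM = 2048, 128  # from the record's parameters

rng = np.random.default_rng(0)
bigram_table = rng.normal(0, 0.02, size=(BUCKETS, DIM))

def bigram_hash(prev_tok: int, tok: int) -> int:
    # Cheap multiplicative mix of the ordered pair; any fast hash works.
    return (prev_tok * 1000003 + tok) % BUCKETS

def bigram_features(tokens: list[int]) -> np.ndarray:
    # The first token has no predecessor; use id 0 as a sentinel (assumption).
    prev = [0] + tokens[:-1]
    idx = [bigram_hash(p, t) for p, t in zip(prev, tokens)]
    # One DIM-sized vector per position, typically added to the embeddings.
    return bigram_table[idx]

feats = bigram_features([5, 17, 17, 9])
print(feats.shape)  # (4, 128)
```

Collisions between bigrams sharing a bucket are accepted by design; the table stays small (2048 × 128) regardless of vocabulary size.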
SmearGate
Includes SmearGate in the architecture.
parameters: null
Partial RoPE
Applies partial rotary positional embeddings with NTK-aware scaling.
parameters: {"dimensions":"16/64"}
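Partial RoPE with "16/64" means only the first 16 of 64 head dimensions are rotated; the rest pass through unchanged. A minimal sketch, where the base frequency and NTK scale factor are assumed values (the record gives only the 16/64 split):

```python
import numpy as np

HEAD_DIM, ROT_DIM = 64, 16          # rotate 16 of 64 dims, per the record
BASE, NTK_SCALE = 10000.0, 2.0      # illustrative assumptions

# NTK-aware scaling: stretch the base so low frequencies interpolate
# smoothly when the context is extended.
base = BASE * NTK_SCALE ** (ROT_DIM / (ROT_DIM - 2))
inv_freq = 1.0 / base ** (np.arange(0, ROT_DIM, 2) / ROT_DIM)

def apply_partial_rope(x: np.ndarray, positions: np.ndarray) -> np.ndarray:
    # x: (seq, HEAD_DIM). Rotate consecutive pairs in the first ROT_DIM dims.
    rot, rest = x[:, :ROT_DIM], x[:, ROT_DIM:]
    ang = positions[:, None] * inv_freq[None, :]      # (seq, ROT_DIM // 2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = rot[:, 0::2], rot[:, 1::2]
    rotated = np.empty_like(rot)
    rotated[:, 0::2] = x1 * cos - x2 * sin
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return np.concatenate([rotated, rest], axis=1)

x = np.random.default_rng(1).normal(size=(8, HEAD_DIM))
y = apply_partial_rope(x, np.arange(8))
# Rotation preserves per-row norms; the untouched 48 dims are identical.
```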
tied embeddings
Input and output embeddings are tied.
parameters: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
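The warmdown schedule can be sketched as a constant learning rate followed by a linear decay to zero over the final 3000 steps (the trapezoidal shape is an assumption; only warmdown_steps=3000 is given):

```python
def lr_at(step: int, total_steps: int, base_lr: float,
          warmdown_steps: int = 3000, warmup_steps: int = 0) -> float:
    """Constant LR, then a linear warmdown to 0 over the last
    `warmdown_steps` iterations (sketch)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    steps_left = total_steps - step
    if steps_left < warmdown_steps:
        return base_lr * steps_left / warmdown_steps
    return base_lr
```

With the record's Muon lr of 0.025 and, say, 10000 total steps, the LR stays at 0.025 until step 7000 and then falls linearly to 0.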
Quantization
STE QAT
bits: 6
scope: MLP and attention weights
int8
bits: 8
scope: embeddings
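STE QAT fake-quantizes weights in the forward pass while letting gradients flow through the rounding as if it were the identity (the straight-through estimator), so the full-precision master weights keep training. A minimal int6 fake-quant sketch, assuming symmetric per-tensor scaling (the record does not specify the granularity):

```python
import numpy as np

def fake_quant(w: np.ndarray, bits: int = 6) -> np.ndarray:
    # Symmetric fake quantization: round to a signed int grid, then
    # dequantize. Under STE, backward treats round() as identity.
    qmax = 2 ** (bits - 1) - 1                    # 31 for int6
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                              # value used in the forward pass

w = np.array([0.31, -0.02, 0.005, -0.31])
wq = fake_quant(w)
# Error per weight is bounded by half a quantization step (scale / 2).
```

The same routine with bits=8 covers the int8 embedding path listed above.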
Weight Averaging
EMA
parameters: {"decay":0.997}
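The EMA entry (decay 0.997) amounts to keeping a shadow copy of each parameter, updated every step and used at evaluation time; a minimal sketch:

```python
# Exponential moving average of weights with decay 0.997, as listed above.
# Real trainers keep one shadow tensor per parameter; floats suffice here.
def ema_update(shadow: dict, params: dict, decay: float = 0.997) -> dict:
    for name, p in params.items():
        shadow[name] = decay * shadow[name] + (1.0 - decay) * p
    return shadow

shadow = {"w": 0.0}
for _ in range(1000):
    ema_update(shadow, {"w": 1.0})
# After 1000 steps the shadow reaches 1 - 0.997**1000, roughly 0.95.
```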
Evaluation
sliding window eval
parameters: {"stride":64}
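Sliding-window evaluation slides a fixed context window over the validation stream with stride 64, scoring only the newest tokens of each window so that every token (after the first window) is predicted with near-full context. A sketch, where `score_fn` is a hypothetical callable returning one loss per position:

```python
def sliding_eval(tokens: list[int], window: int, stride: int, score_fn) -> float:
    losses = []
    for begin in range(0, max(1, len(tokens) - window + 1), stride):
        chunk = tokens[begin : begin + window]
        per_tok = score_fn(chunk)                 # one loss per position
        # First window: keep every loss; later windows: only the new tail.
        keep = len(per_tok) if begin == 0 else stride
        losses.extend(per_tok[-keep:])
    return sum(losses) / len(losses)
```

A smaller stride costs proportionally more forward passes but gives each scored token more preceding context, which is why stride=64 is an evaluation-time knob rather than a training change.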
Initialization
OrthoInit
Orthogonal initialization used with muP-scaled output projections.
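Orthogonal initialization can be sketched via a QR decomposition of a Gaussian matrix; the muP-style gain on the output projection shown here (1/width) is an assumed convention, since the record does not state the exact factor:

```python
import numpy as np

def ortho_init(rows: int, cols: int, gain: float = 1.0, seed: int = 0) -> np.ndarray:
    # QR of a Gaussian matrix yields an orthonormal factor; fixing the signs
    # of R's diagonal makes the result uniformly distributed.
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(rows, cols))
    q, r = np.linalg.qr(a if rows >= cols else a.T)
    q = q * np.sign(np.diag(r))
    return gain * (q if rows >= cols else q.T)

width = 64
w_out = ortho_init(width, width, gain=1.0 / width)  # muP-scaled output proj (assumed 1/width)
```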
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer_idx+1)"}
Compression
zstd
level: 22
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"warmup":"0.92->0.99 over 1500 steps","lr":0.025}
AdamW
weight_decay: 0.04
momentum: null
other_params: {"lr_embeddings":0.035,"lr_scalars":0.025}
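Muon's core step orthogonalizes the momentum buffer with a Newton-Schulz iteration before applying it; the quintic coefficients below follow the commonly published version of the optimizer, and the whole snippet is an illustration rather than this PR's exact implementation:

```python
import numpy as np

def newton_schulz(g: np.ndarray, steps: int = 5) -> np.ndarray:
    # Quintic Newton-Schulz iteration: pushes the singular values of the
    # (Frobenius-normalized) input toward 1, approximating orthogonalization.
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)
    if g.shape[0] > g.shape[1]:
        x = x.T                        # iterate on the wide orientation
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    if g.shape[0] > g.shape[1]:
        x = x.T
    return x

m = np.random.default_rng(2).normal(size=(16, 32))
o = newton_schulz(m)
# Singular values of `o` cluster near 1 after a handful of iterations.
```

AdamW then handles the parameters Muon is not suited to, matching the split above (embeddings and scalars on separate learning rates).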

Novel Contributions

  • Extended XSA from the last 4 layers to the last 6 layers
  • Shortened warmdown from 3500 to 3000 iterations
  • Raised late QAT threshold from 0.15 to 0.30
  • Selected hyperparameters via 37 local ablation experiments on an RTX 4060 Ti
  • Used STE int6 QAT for MLP and attention weights with int8 embeddings
  • Trained 2 seeds, reported the 2-seed mean, and submitted the best seed's checkpoint