PR #1386

open

Non-record submission: 11L XSA4 + EMA + BigramHash3072 + LZMA (1.1452 BPB)

val_bpb: 1.1452
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,616,435 bytes

Training Techniques

Architecture
XSA
XSA enabled on the last 4 transformer layers
parameters: {"layers":4}
BigramHash
Bigram hash embedding used in the model
parameters: {"vocab_size":3072,"dim":112}
Partial RoPE
Rotary positional embeddings applied to only a 16-dimensional subset of each query/key; the remaining dimensions are left unrotated
parameters: {"dimensions":16}
VE128
Value embeddings (value residual) of dimension 128, enabled on layers 9 and 10
parameters: {"dim":128,"layers":[9,10]}
LeakyReLU
MLP uses lrelu2 activation with slope 0.5
parameters: {"slope":0.5,"mult":3}
Weight Averaging
EMA + SWA
parameters: {"decay":0.997,"start_step":0,"swa_every":50}
Quantization
late QAT
bits: 6
scope: export
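A sketch of the fake-quantization step implied by 6-bit QAT with export scope: symmetric per-tensor quantization to int6 levels (-32..31). "Late QAT" presumably means this round-trip is enabled only near the end of training so the weights adapt to it; the per-tensor scaling scheme below is an assumption (per-channel scales are also common).

```python
def quantize_int6(w, bits=6):
    """Symmetric per-tensor quantization to `bits` bits (int6: levels -32..31)."""
    qmax = 2 ** (bits - 1) - 1                      # 31 for 6 bits
    scale = max(abs(x) for x in w) / qmax or 1.0    # avoid zero scale on all-zero w
    q = [max(-qmax - 1, min(qmax, round(x / scale))) for x in w]
    return q, scale

def dequantize(q, scale):
    """Map integer levels back to floats; QAT trains through this round-trip."""
    return [x * scale for x in q]
```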
Compression
lzma
level: null
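The artifact compression step can be reproduced with Python's standard-library `lzma` module; `level: null` suggests the library's default preset is used (the exact tool and preset in the submission are not stated).

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    """Compress the exported weight blob with LZMA at the default preset."""
    return lzma.compress(raw)

def decompress_artifact(blob: bytes) -> bytes:
    """Recover the original blob for loading at eval time."""
    return lzma.decompress(blob)
```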
Evaluation
sliding window eval
parameters: {"stride":64}
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_iters":4000}
Regularization
LN scale
parameters: null
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"matrix_lr":0.025,"scalar_lr":0.025,"head_lr":0.008,"tied_embed_lr":0.035,"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500}
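The momentum warmup in `other_params` (0.92 at the start, warmed over 1500 steps toward the 0.99 momentum above) can be sketched as a simple schedule; linear interpolation is an assumption, since the submission only gives the endpoints and step count.

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Warm Muon's momentum from `start` to `end` over `warmup_steps`
    (assumed linear interpolation), then hold it at `end`."""
    frac = min(1.0, step / warmup_steps)
    return start + frac * (end - start)
```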

Novel Contributions

  • 11-layer XSA-based Transformer run packaged as a compliant non-record submission
  • BigramHash(3072, 112) configuration
  • EMA + SWA training setup
  • late QAT with int6 export
  • lzma artifact compression to fit under the 16MB cap
  • sliding-window evaluation with stride 64