PR #160 (open)

Record: MLP3x + Int8 Tok Emb + Grouped LZMA + Sliding Window (val_bpb=1.1623)

by ChaseWNorton
val_bpb: 1.1623
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15,910,904 bytes (~15.2 MiB)

Training Techniques

Architecture
MLP3x
Increased feedforward capacity from 2x to 3x while keeping the baseline Transformer backbone.
parameters: {"mlp_mult":3}
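The widened feedforward block amounts to scaling the hidden dimension by `mlp_mult`. A minimal sketch, assuming a hypothetical `d_model` of 768 and a plain ReLU activation (the record does not specify the baseline's activation):

```python
import numpy as np

def mlp_block(x, mlp_mult=3, rng=np.random.default_rng(0)):
    """Feedforward block with hidden width mlp_mult * d_model.

    mlp_mult=3 is the record's change (up from the baseline's 2);
    d_model, the ReLU, and the random init are illustrative.
    """
    d_model = x.shape[-1]
    d_hidden = mlp_mult * d_model
    w_in = rng.standard_normal((d_model, d_hidden)) * d_model ** -0.5
    w_out = rng.standard_normal((d_hidden, d_model)) * d_hidden ** -0.5
    h = np.maximum(x @ w_in, 0.0)   # ReLU; actual baseline activation unspecified
    return h @ w_out

x = np.ones((4, 768))               # (tokens, d_model)
y = mlp_block(x)
print(y.shape)                      # (4, 768)
```

The extra capacity costs roughly 1.5x the baseline MLP parameters, which the int6 quantization below helps absorb in the artifact-size budget.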
tied embeddings
Uses tied input/output embeddings.
parameters: {"tie_embeddings":1}
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"num_heads":8,"num_kv_heads":4}
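With 8 query heads over 4 KV heads, each KV head is shared by a group of 2 query heads, halving KV parameters and cache. A minimal numpy sketch under assumed dimensions (d_model=64, so head_dim=8; weights are random placeholders):

```python
import numpy as np

def grouped_query_attention(x, num_heads=8, num_kv_heads=4):
    """Causal grouped-query attention: num_heads query heads share
    num_kv_heads KV heads (here a 2:1 grouping, as in the record)."""
    T, d_model = x.shape
    head_dim = d_model // num_heads
    rng = np.random.default_rng(0)
    wq = rng.standard_normal((d_model, num_heads * head_dim)) * d_model ** -0.5
    wkv = rng.standard_normal((d_model, 2 * num_kv_heads * head_dim)) * d_model ** -0.5
    q = (x @ wq).reshape(T, num_heads, head_dim)
    kv = (x @ wkv).reshape(T, 2, num_kv_heads, head_dim)
    k, v = kv[:, 0], kv[:, 1]
    group = num_heads // num_kv_heads
    # Broadcast each KV head to the query heads in its group.
    k = np.repeat(k, group, axis=1)          # (T, num_heads, head_dim)
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('qhd,khd->hqk', q, k) / head_dim ** 0.5
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # causal mask
    scores = np.where(mask, -1e9, scores)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return np.einsum('hqk,khd->qhd', w, v).reshape(T, -1)

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 64))             # (tokens, d_model), illustrative
out = grouped_query_attention(x)
print(out.shape)                             # (8, 64)
```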
RoPE
Uses rotary positional embeddings with RMSNorm and a U-Net-style skip structure inherited from the baseline.
parameters: null
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"tied_embed_lr":0.03,"matrix_lr":0.02,"scalar_lr":0.02,"warmup_steps":20,"warmdown_iters":3000}
LR Schedule
warmup + warmdown
parameters: {"warmup_steps":20,"warmdown_iters":3000}
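The warmup + warmdown schedule is trapezoidal: a linear ramp up over `warmup_steps`, a flat plateau, then a linear decay to zero over the final `warmdown_iters`. A sketch using the record's parameters (the total step count is not stated, so it is an input here):

```python
def lr_scale(step, total_steps, warmup_steps=20, warmdown_iters=3000):
    """Trapezoidal LR multiplier: linear warmup, flat, linear warmdown to 0.

    warmup_steps and warmdown_iters match the record's optimizer params;
    total_steps is hypothetical since the record omits it.
    """
    if step < warmup_steps:
        return (step + 1) / warmup_steps        # linear ramp up
    steps_left = total_steps - step
    if steps_left < warmdown_iters:
        return steps_left / warmdown_iters      # linear decay to 0
    return 1.0                                  # flat plateau

# e.g. with a hypothetical 10,000-step run:
print(lr_scale(0, 10_000), lr_scale(5_000, 10_000), lr_scale(9_999, 10_000))
```

Each parameter group (tied embeddings, matrices, scalars) applies this multiplier to its own base LR (0.03, 0.02, 0.02 respectively).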
Quantization
mixed int6/int8
bits: 6
scope: most tensors, with int8 token embedding
QAT
bits: null
scope: supported for the timed run / submission artifact, but not activated before the run stopped
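The mixed int6/int8 scheme above can be sketched as symmetric absmax quantization, with a wider 8-bit grid reserved for the token embedding. The exact scaling and grouping in QGv3 is not specified in the record, so per-tensor absmax here is an assumption (and the int6 values are merely clipped to [-32, 31]; true 6-bit packing would happen at serialization):

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor absmax quantization to signed `bits`-bit ints.

    Sketch only: the record's QGv3 grouping/scaling details are unknown.
    """
    qmax = 2 ** (bits - 1) - 1                  # 31 for int6, 127 for int8
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
emb = rng.standard_normal((4096, 64))           # stand-in tensor
q8, s8 = quantize_symmetric(emb, bits=8)        # token embedding stays int8
q6, s6 = quantize_symmetric(emb, bits=6)        # most other tensors: int6
```

Keeping the embedding at int8 trades a small amount of artifact size for lower quantization error on the table that both input and output layers share (via tying).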
Compression
lzma
level: null
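After serialization, the artifact is LZMA-compressed. A sketch with Python's standard `lzma` module; the record's compression level is unspecified, so `preset=9` is an assumption, and the clipped-Gaussian int8 buffer stands in for the real quantized weights:

```python
import lzma

import numpy as np

# Stand-in for the serialized int-quantized weight buffer (assumption:
# clipped-Gaussian int8 data, which has well under 8 bits of entropy).
rng = np.random.default_rng(0)
fake_weights = np.clip(np.round(rng.standard_normal(100_000) * 4),
                       -32, 31).astype(np.int8)
raw = fake_weights.tobytes()

packed = lzma.compress(raw, preset=9)   # preset=9 is an assumption
restored = lzma.decompress(packed)      # lossless roundtrip
print(len(raw), '->', len(packed))
```

Because low-bit quantization concentrates the byte distribution, LZMA recovers a meaningful fraction of the entropy gap, which is what makes int6 storage pay off under the size cap.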
Evaluation
sliding window eval
parameters: {"seq_len":2048,"stride":256}
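Sliding-window evaluation with seq_len=2048 and stride=256 scores each window's final 256 tokens only, so every scored token after the first window sees at least 1792 tokens of context. A sketch of the window bookkeeping (the function name is illustrative):

```python
def sliding_window_spans(n_tokens, seq_len=2048, stride=256):
    """Return (ctx_start, end, score_from) per window: the window covers
    [ctx_start, end) but only tokens in [score_from, end) add to the loss,
    so no token is scored twice and late tokens get long context."""
    spans = []
    end = min(seq_len, n_tokens)
    spans.append((0, end, 0))               # first window scores everything
    while end < n_tokens:
        prev_end = end
        end = min(end + stride, n_tokens)
        spans.append((end - seq_len, end, prev_end))
    return spans

spans = sliding_window_spans(3000)
print(spans[:2])   # [(0, 2048, 0), (256, 2304, 2048)]
```

The extra context per scored token is what lowers val_bpb relative to chunked non-overlapping evaluation, at the cost of ~seq_len/stride times more forward passes.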
Other
Grouped QGv3 serialization was used to reduce artifact overhead before compression.
parameters: null

Novel Contributions

  • Increased feedforward capacity from 2x to 3x
  • Trained and evaluated at sequence length 2048
  • Used grouped QGv3 serialization to reduce artifact overhead
  • Kept token embeddings at int8 while quantizing most other tensors to int6
  • Applied sliding-window evaluation to improve the final under-cap score
  • Repacked the timed checkpoint into a submission-valid LZMA-compressed artifact