PR #160
openRecord: MLP3x + Int8 Tok Emb + Grouped LZMA + Sliding Window (val_bpb=1.1623)
by ChaseWNorton
val_bpb
1.1623
Architecture
Transformer
Optimizer
Muon
Artifact Size
15910904 bytes (≈15.2 MiB)
Training Techniques
Architecture
MLP3x
Increased feedforward capacity from 2x to 3x while keeping the baseline Transformer backbone.
parameters: {"mlp_mult":3}
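Schematically, the change only widens the hidden dimension of the feedforward block. A minimal NumPy sketch (the actual activation and initialization are not given in the record; ReLU and small random weights are assumed for illustration):

```python
import numpy as np

def mlp_forward(x, w_in, w_out):
    """Position-wise feedforward: expand to d_hidden, nonlinearity, project back."""
    return np.maximum(x @ w_in, 0.0) @ w_out  # ReLU assumed for illustration

d_model, mlp_mult = 64, 3          # mlp_mult raised from 2 to 3 in this record
d_hidden = mlp_mult * d_model      # 192 instead of 128

rng = np.random.default_rng(0)
x = rng.standard_normal((8, d_model))              # 8 token positions
w_in = rng.standard_normal((d_model, d_hidden)) * 0.02
w_out = rng.standard_normal((d_hidden, d_model)) * 0.02
y = mlp_forward(x, w_in, w_out)
assert y.shape == x.shape
```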
tied embeddings
Uses tied input/output embeddings.
parameters: {"tie_embeddings":1}
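Tying shares one matrix between the input embedding and the output head, which halves that parameter count in the size-capped artifact. A minimal sketch assuming the standard scheme (output logits computed against the embedding matrix transposed; shapes are illustrative):

```python
import numpy as np

vocab, d_model = 1000, 64
rng = np.random.default_rng(0)
W_emb = rng.standard_normal((vocab, d_model)) * 0.02  # single shared matrix

tokens = np.array([3, 17, 42])
h = W_emb[tokens]            # input embedding: row lookup
logits = h @ W_emb.T         # output head: reuse the same matrix transposed
assert logits.shape == (3, vocab)
```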
KV head count
Uses grouped-query attention with fewer KV heads than attention heads.
parameters: {"num_heads":8,"num_kv_heads":4}
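With num_heads=8 and num_kv_heads=4, each KV head serves two query heads, shrinking the K/V projection weights. A NumPy sketch of the grouped lookup (causal masking omitted for brevity; shapes are illustrative):

```python
import numpy as np

num_heads, num_kv_heads, head_dim, T = 8, 4, 16, 32
group = num_heads // num_kv_heads  # 2 query heads share each KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((num_heads, T, head_dim))
k = rng.standard_normal((num_kv_heads, T, head_dim))
v = rng.standard_normal((num_kv_heads, T, head_dim))

# Broadcast each KV head to its group of query heads
k_full = np.repeat(k, group, axis=0)
v_full = np.repeat(v, group, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ v_full
assert out.shape == (num_heads, T, head_dim)
```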
RoPE
Uses rotary positional embeddings with RMSNorm and a U-Net-style skip structure inherited from the baseline.
parameters: null
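A minimal NumPy sketch of rotary embedding applied to one attention head, assuming the split-halves pairing convention (implementations also differ, e.g. interleaved pairs):

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embedding to x of shape (T, head_dim)."""
    T, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)          # per-pair rotation rates
    angles = np.arange(T)[:, None] * freqs[None, :]    # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

x = np.random.default_rng(0).standard_normal((32, 16))
y = rope(x)
assert y.shape == x.shape
```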
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"tied_embed_lr":0.03,"matrix_lr":0.02,"scalar_lr":0.02,"warmup_steps":20,"warmdown_iters":3000}
LR Schedule
warmup + warmdown
parameters: {"warmup_steps":20,"warmdown_iters":3000}
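The record gives warmup_steps=20 and warmdown_iters=3000. Assuming the trapezoidal shape common to such schedules (linear warmup, constant plateau, linear warmdown to zero; `total_steps` below is hypothetical), the multiplier could look like:

```python
def lr_scale(step, total_steps, warmup_steps=20, warmdown_iters=3000):
    """Trapezoidal schedule: linear warmup, flat plateau, linear warmdown to 0."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    if step > total_steps - warmdown_iters:
        return max(0.0, (total_steps - step) / warmdown_iters)
    return 1.0

assert lr_scale(0, 10000) == 1 / 20      # first warmup step
assert lr_scale(5000, 10000) == 1.0      # plateau
assert lr_scale(10000, 10000) == 0.0     # fully warmed down
```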
Quantization
mixed int6/int8
bits: 6
scope: most tensors, with int8 token embedding
QAT
bits: null
scope: QAT support was included for the timed run / submission artifact, but it was not activated before the run stopped
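A sketch of symmetric per-tensor quantization at the two bit widths used here (int6 for most tensors, int8 for the token embedding). Bit-packing the 6-bit values into bytes is a separate step, omitted here; the record does not specify the actual scheme:

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1           # 31 for int6, 127 for int8
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
weight = rng.standard_normal((256, 64)).astype(np.float32)
q6, s6 = quantize_symmetric(weight, bits=6)   # most tensors
q8, s8 = quantize_symmetric(weight, bits=8)   # token embedding kept at int8
err6 = np.abs(dequantize(q6, s6) - weight).max()
err8 = np.abs(dequantize(q8, s8) - weight).max()
assert err8 < err6   # int8 reconstructs more accurately
```

Keeping the embedding at int8 trades a few extra kilobytes of artifact size for lower reconstruction error on the most reuse-sensitive tensor.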
Compression
lzma
level: null
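Python's standard `lzma` module suffices for this step; the preset below is illustrative, since the record leaves the level unspecified, and the payload is a hypothetical stand-in for the packed quantized tensors:

```python
import lzma
import pickle
import numpy as np

# Hypothetical payload standing in for the packed quantized tensors
payload = pickle.dumps({"emb": np.zeros((1000, 64), dtype=np.int8)})

compressed = lzma.compress(payload, preset=9 | lzma.PRESET_EXTREME)
assert lzma.decompress(compressed) == payload
assert len(compressed) < len(payload)
```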
Evaluation
sliding window eval
parameters: {"seq_len":2048,"stride":256}
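One common way to realize a sliding-window evaluation with seq_len=2048 and stride=256 is to advance the window by the stride and score only the tokens not yet scored, so every scored token (after the first window) keeps at least seq_len − stride = 1792 tokens of left context. A sketch of the window bookkeeping (the exact implementation is not in the record):

```python
def window_spans(n_tokens, seq_len=2048, stride=256):
    """Return (window_start, score_from, window_end) triples: each window
    conditions on [window_start, window_end) but is scored only on
    [score_from, window_end), the tokens not covered by earlier windows."""
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + seq_len, n_tokens)
        spans.append((begin, prev_end, end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

spans = window_spans(5000)
assert spans[0] == (0, 0, 2048)       # first window scores all its tokens
assert spans[1] == (256, 2048, 2304)  # later windows score only `stride` new tokens
assert spans[-1][2] == 5000           # evaluation covers the full stream
```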
Other
other
Grouped QGv3 serialization was used to reduce artifact overhead before compression.
parameters: null
Novel Contributions
- Increased feedforward capacity from 2x to 3x
- Trained and evaluated at sequence length 2048
- Used grouped QGv3 serialization to reduce artifact overhead
- Kept token embeddings at int8 while quantizing most other tensors to int6
- Applied sliding-window evaluation to improve the final under-cap score
- Repacked the timed checkpoint into a submission-valid LZMA-compressed artifact