PR #81

Status: open

Record: SwiGLU + MLP 3x + Int6 + LoRA TTT, val_bpb=1.1670 (8xH100)

by polarizedfortnite-cpu
val_bpb: 1.1670
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.83MB

Training Techniques

Architecture
MLP3x
Increased MLP expansion from 2x to 3x to add nonlinear capacity.
parameters: {"mlp_mult":3}
SwiGLU
Replaced relu^2 with SwiGLU activation.
parameters: {"mlp_hidden_dim":1024}
KV head count
Used grouped-query attention with fewer key/value heads than query heads.
parameters: {"heads":8,"kv_heads":4}
depth
Added one extra transformer layer over the baseline.
parameters: {"layers":10}
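The MLP3x + SwiGLU combination above can be sketched in a few lines of numpy. This is an illustrative sketch, not the record's actual code: the tensor names and the toy d_model=8 are invented here; the record itself uses mlp_mult=3 with mlp_hidden_dim=1024.

```python
import numpy as np

def silu(z):
    """SiLU / swish: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def swiglu_mlp(x, w_gate, w_up, w_down):
    """SwiGLU MLP: the gate branch passes through SiLU and multiplies
    the linear 'up' branch before the down-projection."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

d_model = 8                      # toy size for illustration only
hidden = 3 * d_model             # mlp_mult=3: expansion widened from 2x to 3x
rng = np.random.default_rng(0)
x = rng.normal(size=(2, d_model))
w_gate = rng.normal(size=(d_model, hidden))
w_up = rng.normal(size=(d_model, hidden))
w_down = rng.normal(size=(hidden, d_model))
out = swiglu_mlp(x, w_gate, w_up, w_down)   # shape (2, d_model)
```

Compared with relu^2, SwiGLU adds a second projection (the gate), so the per-block parameter count grows with both the activation swap and the 3x expansion.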
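The grouped-query attention entry (heads=8, kv_heads=4) amounts to broadcasting each KV head across a group of query heads. A minimal sketch, with toy sequence length and head dimension invented for illustration:

```python
import numpy as np

heads, kv_heads, T, d_head = 8, 4, 5, 16   # heads/kv_heads as in this record
group = heads // kv_heads                  # 2 query heads share each KV head
rng = np.random.default_rng(0)
q = rng.normal(size=(heads, T, d_head))
k = rng.normal(size=(kv_heads, T, d_head))
v = rng.normal(size=(kv_heads, T, d_head))

# Broadcast each KV head to its group of query heads
k_full = np.repeat(k, group, axis=0)       # (heads, T, d_head)
v_full = np.repeat(v, group, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(d_head)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v_full                     # (heads, T, d_head)
```

Halving the KV heads halves the K/V projection weights and the KV cache while leaving the query side at full width, which is what lets the extra layer fit under the artifact cap.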
Quantization
STE QAT int6
bits: 6
scope: all weights except tied embeddings
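The int6 fake-quantization step can be sketched as symmetric per-tensor rounding; in actual QAT the rounding would be wrapped in a straight-through estimator (forward pass sees quantized weights, backward pass treats rounding as identity). The function below is an assumption-laden sketch, not the record's implementation:

```python
import numpy as np

def fake_quant_int6(w, qmax=31):
    """Symmetric per-tensor int6 fake-quantization: round onto a uniform
    grid of 2*qmax+1 levels, then rescale back to floats. With STE, the
    gradient would flow through this op unchanged."""
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale, scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
w_q, scale = fake_quant_int6(w)

# Every dequantized value sits on the int6 grid, and the worst-case
# rounding error is half a quantization step.
assert np.allclose(w_q / scale, np.round(w_q / scale))
assert np.abs(w - w_q).max() <= scale / 2 + 1e-9
```

Applying this only in the final quarter of training (as the contributions note) lets the earlier phase train at full precision before the weights adapt to the int6 grid.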
Compression
zstd
level: 22
Test-Time Training
LoRA TTT
parameters: {"rank":8}
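The LoRA TTT mechanism can be sketched as a frozen weight plus a trainable rank-8 update, adapted at evaluation time. Dimensions and the "gradient step" below are stand-ins for illustration; only rank=8 comes from the record:

```python
import numpy as np

d, rank = 64, 8                        # rank: 8, as in this record
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))            # frozen base weight
A = rng.normal(size=(d, rank)) * 0.01  # trainable down-projection
B = np.zeros((rank, d))                # trainable up-projection, zero-init

def adapted(x):
    """Adapted layer: frozen path plus low-rank update x @ A @ B."""
    return x @ W + (x @ A) @ B

x = rng.normal(size=(3, d))
before = adapted(x)                    # zero-init B: adapter starts as a no-op
B += 0.1 * rng.normal(size=B.shape)    # stand-in for a few TTT gradient steps
after = adapted(x)
```

Because only A and B (d*rank + rank*d parameters per adapted matrix) are updated at test time, the adapter cost is small relative to the frozen model, and the zero-initialized B guarantees evaluation starts from the trained model's exact behavior.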
LR Schedule
warmdown
parameters: {"warmdown_iters":1200}
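A warmdown schedule holds the base learning rate flat and then decays it linearly to zero over the final warmdown_iters steps. A minimal sketch; the 5000-iteration total below is hypothetical, while warmdown_iters=1200 and the 0.04 matrix LR come from this record:

```python
def warmdown_lr(step, total_iters, base_lr, warmdown_iters=1200):
    """Hold base_lr, then decay linearly to zero over the last warmdown_iters steps."""
    remaining = total_iters - step
    if remaining >= warmdown_iters:
        return base_lr
    return base_lr * (max(remaining, 0) / warmdown_iters)

# e.g. with a hypothetical 5000-iteration run at the record's matrix_lr of 0.04:
assert warmdown_lr(0, 5000, 0.04) == 0.04      # flat phase
assert warmdown_lr(4400, 5000, 0.04) == 0.02   # halfway through warmdown
assert warmdown_lr(5000, 5000, 0.04) == 0.0    # final step
```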
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"matrices":"Muon","embeddings_scalars":"Adam","matrix_lr":0.04,"embed_lr":0.05}
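Per the split above, matrices are updated with Muon (lr 0.04) while embeddings and scalars use Adam (lr 0.05). Muon's core step approximately orthogonalizes each matrix update via a quintic Newton-Schulz iteration; the sketch below uses the coefficients from the public Muon implementation and is illustrative, not this run's code:

```python
import numpy as np

def newton_schulz_orth(g, steps=5):
    """Approximately orthogonalize an update matrix (the core Muon step).
    Quintic Newton-Schulz; coefficients from the public Muon code."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)   # normalize so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T                          # iterate on the wide orientation
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x

rng = np.random.default_rng(0)
grad = rng.normal(size=(8, 16))          # stand-in for a momentum buffer
update = newton_schulz_orth(grad)
# Singular values are pushed toward 1 (approximate orthogonality)
sv = np.linalg.svd(update, compute_uv=False)
```

Embeddings and scalars are excluded because orthogonalization is only meaningful for 2D weight matrices, hence the Adam group in other_params.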

Novel Contributions

  • Combined MLP 3x expansion with SwiGLU activation in a compact Transformer.
  • Applied int6 quantization with zstd compression to fit a larger model under the artifact cap.
  • Used quantization-aware training with STE during the final quarter of training.
  • Introduced LoRA-based test-time training during evaluation to improve validation bpb.
  • Added an extra transformer layer and used grouped-query attention with 4 KV heads.