PR #81
openRecord: SwiGLU + MLP 3x + Int6 + LoRA TTT, val_bpb=1.1670 (8xH100)
by polarizedfortnite-cpu
val_bpb
1.1670
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.83MB
Training Techniques
Architecture
MLP3x
Increased MLP expansion from 2x to 3x to add nonlinear capacity.
parameters: {"mlp_mult":3}
SwiGLU
Replaced the baseline relu^2 activation with SwiGLU.
parameters: {"mlp_hidden_dim":1024}
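A minimal sketch of the SwiGLU MLP with 3x expansion. `mlp_mult=3` is from the record; the model width (256) and the resulting hidden size here are illustrative stand-ins, not the run's actual dimensions (the record reports `mlp_hidden_dim=1024`).

```python
import numpy as np

# Hypothetical dims: mlp_mult=3 is from the record; d_model=256 is illustrative.
d_model, d_hidden = 256, 3 * 256

rng = np.random.default_rng(0)
w_gate = rng.standard_normal((d_model, d_hidden)) * 0.02
w_up   = rng.standard_normal((d_model, d_hidden)) * 0.02
w_down = rng.standard_normal((d_hidden, d_model)) * 0.02

def silu(z):
    # SiLU(z) = z * sigmoid(z) = z / (1 + exp(-z))
    return z / (1.0 + np.exp(-z))

def swiglu_mlp(x):
    # SwiGLU: a SiLU-activated gate multiplies a parallel "up" projection,
    # replacing the baseline relu^2 MLP.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

x = rng.standard_normal((4, d_model))
y = swiglu_mlp(x)
```

Note that SwiGLU adds a third weight matrix per MLP, so the effective parameter count at a given expansion factor is higher than for the relu^2 baseline.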
KV head count
Used grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
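The grouped-query attention layout above can be sketched as follows; `heads=8` and `kv_heads=4` are from the record, while the sequence length and head dimension are illustrative.

```python
import numpy as np

heads, kv_heads, head_dim, T = 8, 4, 32, 16  # heads/kv_heads from the record
group = heads // kv_heads                    # 2 query heads share each KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((heads, T, head_dim))
k = rng.standard_normal((kv_heads, T, head_dim))
v = rng.standard_normal((kv_heads, T, head_dim))

# GQA: expand the smaller KV head set to match the query heads by repetition,
# halving KV cache and KV projection parameters relative to full MHA.
k_full = np.repeat(k, group, axis=0)
v_full = np.repeat(v, group, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ v_full
```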
depth
Added one extra transformer layer over the baseline.
parameters: {"layers":10}
Quantization
STE QAT int6
bits: 6
scope: all weights except tied embeddings
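A forward-only sketch of symmetric int6 fake quantization with a per-tensor scale. In QAT the backward pass would apply the straight-through estimator (gradients flow through the rounding as if it were identity); this NumPy version shows only the forward rounding, and the weight tensor here is random illustrative data.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)  # illustrative weights

qmax = 2 ** 5 - 1                 # int6 symmetric range: [-32, 31]
scale = np.abs(w).max() / qmax    # per-tensor scale

# Forward: round to the nearest int6 level, then dequantize.
# STE backward (not shown) would use grad(deq) ~= grad(w).
q = np.clip(np.round(w / scale), -qmax - 1, qmax)
deq = (q * scale).astype(np.float32)
```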
Compression
zstd
level: 22
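The compression step can be reproduced with the zstd CLI; the filename is hypothetical. Levels above 19 require the `--ultra` flag.

```shell
# Hypothetical checkpoint filename; level 22 (from the record) needs --ultra.
zstd --ultra -22 checkpoint_int6.bin -o checkpoint_int6.bin.zst
```

Int6-packed weights compress better than raw fp16 because the quantized values occupy a narrow, repetitive range.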
Test-Time Training
LoRA TTT
parameters: {"rank":8}
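A minimal sketch of LoRA test-time training: the base weight stays frozen and only a rank-8 adapter is updated at eval time. `rank=8` is from the record; the layer dimensions, loss, data, and learning rate are illustrative assumptions.

```python
import numpy as np

d, rank, lr = 256, 8, 0.1   # rank from the record; d and lr are illustrative
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d)) * 0.02   # frozen base weight
A = rng.standard_normal((d, rank)) * 0.01
B = np.zeros((rank, d))                  # zero init: adapter starts as a no-op

def forward(x):
    return x @ (W + A @ B)               # base path plus rank-8 update

x = rng.standard_normal((16, d))
target = rng.standard_normal((16, d))    # stand-in for the eval-time objective

base_loss = ((forward(x) - target) ** 2).mean()

# One gradient step on A and B only (W stays frozen), as in eval-time TTT.
err = 2.0 * (forward(x) - target) / target.size
grad_B = (x @ A).T @ err
grad_A = x.T @ (err @ B.T)
A -= lr * grad_A
B -= lr * grad_B

ttt_loss = ((forward(x) - target) ** 2).mean()
```

The adapter adds only `rank * (d_in + d_out)` trainable parameters per layer, so the per-example adaptation cost stays small.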
LR Schedule
warmdown
parameters: {"warmdown_iters":1200}
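The warmdown schedule above amounts to a constant LR followed by a linear decay to zero over the final iterations. `warmdown_iters=1200` is from the record; the total iteration count is an assumed value.

```python
def lr_scale(step, total_iters=5000, warmdown_iters=1200):
    """Constant LR, then linear warmdown to zero over the last warmdown_iters.

    warmdown_iters=1200 is from the record; total_iters=5000 is illustrative.
    """
    if step < total_iters - warmdown_iters:
        return 1.0
    return (total_iters - step) / warmdown_iters
```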
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"matrices":"Muon","embeddings_scalars":"Adam","matrix_lr":0.04,"embed_lr":0.05}
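The hybrid optimizer config above routes 2-D weight matrices to Muon and embeddings/scalars to Adam. A sketch of that parameter split, with the learning rates from the record; the parameter names and shapes are illustrative, not the actual model's.

```python
# Routing sketch: matrices -> Muon (lr=0.04), embeddings/scalars -> Adam (lr=0.05).
# Names and shapes below are hypothetical.
params = {
    "wte.weight": (50304, 256),       # tied embedding -> Adam
    "blocks.0.attn.qkv": (256, 768),  # weight matrix -> Muon
    "blocks.0.mlp.w_gate": (256, 768),
    "ln_f.scale": (256,),             # 1-D scalar/vector -> Adam
}

muon_group = {"lr": 0.04, "params": []}
adam_group = {"lr": 0.05, "params": []}
for name, shape in params.items():
    is_matrix = len(shape) == 2 and "wte" not in name
    (muon_group if is_matrix else adam_group)["params"].append(name)
```

Muon's orthogonalized updates only make sense for 2-D matrices, which is why embeddings and scalar parameters fall back to Adam.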
Novel Contributions
- Combined MLP 3x expansion with SwiGLU activation in a compact Transformer.
- Applied int6 quantization with zstd compression to fit a larger model under the artifact cap.
- Used quantization-aware training with STE during the final quarter of training.
- Introduced LoRA-based test-time training during evaluation to improve validation bpb.
- Added an extra transformer layer and used grouped-query attention with 4 KV heads.