PR #1864

Status: open

Hardik SOTA submission

by hardik-bhalekar
val_bpb: 1.0805
Architecture: Transformer
Optimizer: Muon
Artifact Size: 16 MB

Training Techniques

Architecture
Depth Recurrence
Layers 3 through 5 are executed twice per forward pass to increase effective depth without increasing parameter count.
parameters: {"layers":[3,4,5],"repeats":2}
Parallel Residuals
Attention and MLP are processed in parallel to widen the model within the same latency budget.
parameters: null
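
A sketch of one common way to realize this (a GPT-J-style parallel block); the single shared pre-norm is an assumption, since the PR lists no parameters:

```python
import torch.nn as nn

class ParallelResidualBlock(nn.Module):
    """Attention and MLP read the same normalized input and their
    outputs are summed, instead of running one after the other."""

    def __init__(self, d_model: int, attn: nn.Module, mlp: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = attn
        self.mlp = mlp

    def forward(self, x):
        h = self.norm(x)
        # Both branches depend only on h, so they can execute concurrently.
        return x + self.attn(h) + self.mlp(h)
```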
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
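
All Muon hyperparameters are listed as null, so the values below are placeholders. For orientation, a hedged sketch of the publicly described Muon update for 2D weight matrices: momentum, then approximate orthogonalization of the update direction via a Newton-Schulz iteration:

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5, eps: float = 1e-7):
    """Approximately orthogonalize G with a quintic Newton-Schulz
    iteration (coefficients from the public Muon reference code)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    flip = X.shape[0] > X.shape[1]
    if flip:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if flip else X

@torch.no_grad()
def muon_step(param, grad, buf, lr=0.02, momentum=0.95):
    """One Muon-style step for a 2D hidden weight. lr/momentum are
    placeholders, and weight decay is omitted since the PR lists it
    as null."""
    buf.mul_(momentum).add_(grad)
    param.add_(newton_schulz(buf), alpha=-lr)
```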
Test-Time Training
Score-first TTT
parameters: null
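
The PR gives no TTT parameters, so the loop below rests on one reading of "score-first": each evaluation chunk is scored under the current weights before the model adapts on it, so no chunk is ever scored by weights that have already seen it. A sketch under that assumption (the optimizer choice and learning rate are placeholders):

```python
import torch
import torch.nn.functional as F

def score_first_ttt(model, chunks, lr=1e-4):
    """Prequential loop: score each (inputs, targets) chunk, then train
    on it. Returns mean loss in nats per token (divide by ln 2 for bits)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    total_loss, total_tokens = 0.0, 0
    for inputs, targets in chunks:
        model.eval()
        with torch.no_grad():  # score first, with unadapted weights
            loss = F.cross_entropy(model(inputs).flatten(0, -2),
                                   targets.flatten())
        total_loss += loss.item() * targets.numel()
        total_tokens += targets.numel()
        model.train()          # then adapt on the chunk just scored
        opt.zero_grad()
        F.cross_entropy(model(inputs).flatten(0, -2),
                        targets.flatten()).backward()
        opt.step()
    return total_loss / total_tokens
```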
Quantization
GPTQ
bits: null
scope: all
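
The bit-width is unspecified and "SDClip" is not documented in the card, so full GPTQ is not reproducible from this entry alone. For orientation, the per-channel round-to-nearest quantizer that GPTQ builds on; full GPTQ additionally propagates each column's quantization error to not-yet-quantized columns using Hessian information, which is omitted here:

```python
import torch

def rtn_quantize(W: torch.Tensor, bits: int = 4):
    """Symmetric per-output-channel round-to-nearest quantization.

    bits=4 is a placeholder; the submission leaves the bit-width null.
    Returns dequantized weights of the same shape for drop-in use.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = (W.abs().amax(dim=1, keepdim=True) / qmax).clamp(min=1e-8)
    q = (W / scale).round().clamp(-qmax - 1, qmax)
    return q * scale
```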
Compression
Brotli
level: null
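
A sketch of Brotli-compressing a serialized state dict, assuming the `brotli` Python bindings; the quality level is a placeholder since the PR lists it as null:

```python
import io
import brotli  # pip install brotli
import torch

def save_compressed(model, path: str, quality: int = 11):
    """torch.save the state dict to memory, then Brotli-compress it.
    quality=11 (maximum) is a placeholder; the PR's level is null."""
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    with open(path, "wb") as f:
        f.write(brotli.compress(buf.getvalue(), quality=quality))

def load_compressed(model, path: str):
    """Decompress and restore the state dict."""
    with open(path, "rb") as f:
        state = torch.load(io.BytesIO(brotli.decompress(f.read())))
    model.load_state_dict(state)
```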

Novel Contributions

  • SP8192 tokenizer for improved compression on FineWeb (see the training sketch after this list)
  • Depth recurrence on layers 3-5
  • Parallel residual attention/MLP processing
  • Muon optimizer
  • Score-first test-time training
  • GPTQ with SDClip
  • Brotli-compressed state dictionary
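
On the tokenizer: the PR does not include a training command, but SP8192 plausibly denotes a SentencePiece model with an 8192-entry vocabulary. A hedged sketch under that assumption (`fineweb_sample.txt` is a hypothetical local dump of FineWeb text, and `model_type="bpe"` is a guess the card does not confirm):

```python
import sentencepiece as spm

# Train a hypothetical 8192-entry tokenizer on a FineWeb text dump.
spm.SentencePieceTrainer.train(
    input="fineweb_sample.txt",   # hypothetical corpus file
    model_prefix="sp8192",
    vocab_size=8192,
    model_type="bpe",             # assumption; the PR does not say
)

# Quick round-trip check with the trained model.
sp = spm.SentencePieceProcessor(model_file="sp8192.model")
print(sp.encode("Hello FineWeb", out_type=str))
```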