PR #822

open

Add baseline and depth recurrence submissions (1xH100 20min runs)

by henrycashe26
val_bpb
1.2604
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.8MB

Training Techniques

Quantization
mixed int5/int6 QAT
bits: null
scope: MLP in int5, attention in int6
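The scope line means MLP weights are trained against an int5 grid and attention weights against an int6 grid. A minimal sketch of the forward-pass fake quantization (the submission's exact scheme is not given; symmetric per-tensor scaling is an assumption):

```python
import numpy as np

def fake_quant(w, bits):
    # Symmetric per-tensor fake quantization: weights are rounded onto the
    # signed int grid in the forward pass (real QAT uses a straight-through
    # estimator so gradients flow through the rounding).
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

w = np.array([0.5, -1.2, 0.03, 0.9])
w_mlp  = fake_quant(w, 5)  # MLP weights on the int5 grid
w_attn = fake_quant(w, 6)  # attention weights on the int6 grid
```

Per-channel scales or asymmetric zero points would also fit the "mixed int5/int6" description; only the bit widths are stated.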
Architecture
BigramHash
BigramHash embeddings used instead of standard token embeddings.
parameters: {"buckets":10240,"dim":128}
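With 10240 buckets and dimension 128, the embedding table is indexed by a hash of the (previous, current) token pair rather than the token id alone. A sketch under assumptions (the hash multiplier and the sentinel for position 0 are illustrative, not from the submission):

```python
import numpy as np

BUCKETS, DIM = 10240, 128  # buckets/dim taken from the submission's parameters

rng = np.random.default_rng(0)
table = rng.normal(scale=0.02, size=(BUCKETS, DIM)).astype(np.float32)

def bigram_embed(tokens):
    # Hash each (previous, current) token pair into a bucket and look it up.
    # The multiplier is an arbitrary odd constant; the real hash may differ.
    # Position 0 pairs the first token with a sentinel id of 0.
    prev = np.concatenate([[0], tokens[:-1]])
    idx = (prev * 1000003 + tokens) % BUCKETS
    return table[idx]  # (seq_len, DIM)
```

Hash collisions are accepted by design: 10240 buckets cannot represent all bigrams distinctly, but the table is far smaller than a full bigram table.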
SmearGate
Token blending mechanism.
parameters: null
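The submission gives no parameters for SmearGate, so the following is only a guess at the mechanism's shape: each position blends in a gated fraction of the previous position's embedding, with the gate produced by a learned scalar (here a fixed logit stands in for the learned one):

```python
import numpy as np

def smear_gate(x, gate_logit=0.0):
    # x: (seq_len, dim). Each position mixes in a sigmoid-gated fraction of
    # the previous position's embedding; position 0 is left unchanged.
    # The gate is a single learned logit in this sketch; the real gate may
    # be per-channel or input-dependent.
    g = 1.0 / (1.0 + np.exp(-gate_logit))
    out = x.copy()
    out[1:] = x[1:] + g * x[:-1]
    return out
```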
tied embeddings
Input and output embeddings are tied.
parameters: null
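Tying means a single matrix serves as both the token-embedding lookup and the output projection, halving embedding parameters (relevant to the 15.8MB artifact budget). A minimal sketch with illustrative sizes:

```python
import numpy as np

VOCAB, DIM = 32768, 128  # illustrative sizes; the real vocab is not stated

rng = np.random.default_rng(0)
E = rng.normal(scale=0.02, size=(VOCAB, DIM))  # the single shared matrix

def embed(tokens):
    return E[tokens]      # input side: row lookup

def logits(hidden):
    return hidden @ E.T   # output side: the same matrix, transposed
```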
depth recurrence
4 unique transformer layers shared across 3 loop iterations.
parameters: {"layers":4,"loops":3}
learned level signals
Learned level signals used with depth recurrence.
parameters: null
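Together, these two entries describe 4 unique layers executed 3 times (12 effective layers), with a learned per-loop signal so the shared weights know which pass they are on. A sketch where a residual matrix stands in for each transformer block and the level signal is an additive bias (the injection mechanism is an assumption):

```python
import numpy as np

N_LAYERS, N_LOOPS, DIM = 4, 3, 64  # layers/loops from the submission; DIM assumed

rng = np.random.default_rng(0)
# One residual matrix per unique layer stands in for a full transformer block.
blocks = [rng.normal(scale=0.02, size=(DIM, DIM)) for _ in range(N_LAYERS)]
# One learned "level signal" per loop tells the shared weights which pass it is.
level_signals = rng.normal(scale=0.02, size=(N_LOOPS, DIM))

def forward(x):
    # x: (seq_len, DIM). The same 4 blocks run 3 times -> 12 effective layers,
    # while only 4 layers' worth of weights count toward the artifact size.
    for loop in range(N_LOOPS):
        x = x + level_signals[loop]      # inject the loop index as a bias
        for w in blocks:
            x = x + np.tanh(x @ w)       # residual stand-in for a block
    return x
```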
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"newton_schulz_orthogonalization":true}
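The `newton_schulz_orthogonalization` flag refers to the iteration Muon applies to the momentum-averaged gradient before the update. A sketch of that step, using the quintic coefficients from the widely circulated Muon reference implementation (only the orthogonalization is shown, not the full optimizer):

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    # Quintic Newton-Schulz iteration that approximately orthogonalizes a
    # matrix, i.e. drives its singular values toward 1 without computing an
    # explicit SVD. Coefficients follow the common Muon implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)   # normalize so singular values <= 1
    tall = X.shape[0] > X.shape[1]
    if tall:
        X = X.T                         # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if tall else X
```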
Weight Averaging
SWA
parameters: {"start_frac":0.4,"every_steps":50}
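`start_frac: 0.4` and `every_steps: 50` mean snapshots are folded into a running average every 50 steps once 40% of training has elapsed. A scalar-weight sketch (the total step count and `train_step` are stand-ins):

```python
TOTAL_STEPS = 1000        # illustrative; start_frac/every_steps are from the submission
START_FRAC, EVERY = 0.4, 50

def train_step(step):
    # Stand-in for one optimizer step; returns the current scalar "weights".
    return float(step)

swa_avg, n_avg = 0.0, 0
for step in range(TOTAL_STEPS):
    weights = train_step(step)
    # From 40% of the way through training, fold a snapshot into the
    # running average every 50 steps (incremental-mean update).
    if step >= START_FRAC * TOTAL_STEPS and step % EVERY == 0:
        n_avg += 1
        swa_avg += (weights - swa_avg) / n_avg
```

In a real run the same incremental-mean update is applied elementwise to every parameter tensor, and the averaged weights (not the final ones) are evaluated and shipped.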
Compression
zstd
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
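With stride 64, the eval window advances 64 tokens at a time and scores only the newly revealed tokens, so each token gets long left context but is counted exactly once. A sketch (`context_len` and the `nll_fn` interface are assumptions; only the stride is from the submission):

```python
def sliding_window_eval(tokens, context_len, stride, nll_fn):
    # Slide a window over the sequence in steps of `stride`. Each window
    # scores only its last n_new tokens, with up to `context_len` tokens of
    # left context, so every token is scored exactly once.
    total, count = 0.0, 0
    prev_end = 0
    for end in range(stride, len(tokens) + stride, stride):
        end = min(end, len(tokens))
        window = tokens[max(0, end - context_len):end]
        n_new = end - prev_end
        total += nll_fn(window, n_new)  # summed NLL of the last n_new tokens
        count += n_new
        prev_end = end
    return total / count  # mean NLL per token; divide by ln 2 for bits
```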
Other
other
Training on a single H100 GPU for 20 minutes instead of the standard 8xH100 for 10 minutes.
parameters: {"gpu_count":1,"duration_minutes":20}
Test-Time Training
LoRA TTT
parameters: {"rank":32}
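For test-time training, LoRA freezes the base weights and learns only a rank-32 update. A sketch of the adapter's forward path (rank 32 is from the submission; the dimension and init scales are illustrative):

```python
import numpy as np

DIM, RANK = 64, 32  # rank 32 from the submission; DIM is illustrative

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(DIM, DIM))   # frozen pretrained weight
A = rng.normal(scale=0.01, size=(DIM, RANK))  # trainable low-rank factor
B = np.zeros((RANK, DIM))                     # zero-init -> adapter starts as a no-op

def lora_forward(x):
    # Base path plus low-rank update; at test time only A and B are updated,
    # adding 2 * DIM * RANK trainable parameters per adapted matrix.
    return x @ W + (x @ A) @ B
```

Because B starts at zero, test-time adaptation begins from exactly the pretrained model and only departs from it as the adapter trains.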

Novel Contributions

  • Reproduction of the #1 leaderboard baseline on a single H100 with reduced compute
  • Mixed int5/int6 quantization-aware training with BigramHash embeddings, SmearGate, and SWA
  • Depth recurrence model with 4 shared transformer layers across 3 loops
  • Per-loop LoRA adapters (rank 32) and learned level signals for depth recurrence
  • Reported artifact sizes and compute-constrained training results for two submissions