PR #822 (open)
Add baseline and depth recurrence submissions (1xH100 20min runs)
by henrycashe26
val_bpb
1.2604
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.8MB
Training Techniques
Quantization
mixed int5/int6 QAT
bits: null
scope: MLP in int5, attention in int6
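A minimal sketch of mixed-precision quantization-aware training under the listed scope (MLP in int5, attention in int6), using symmetric per-tensor fake quantization with a straight-through estimator; the submission's actual QAT scheme may differ.

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int) -> torch.Tensor:
    # Symmetric per-tensor fake quantization with a straight-through estimator.
    qmax = 2 ** (bits - 1) - 1                        # 15 for int5, 31 for int6
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    q = (w / scale).round().clamp(-qmax - 1, qmax)    # snap to the integer grid
    w_q = q * scale                                   # dequantized weights
    # Forward pass sees w_q; gradients flow straight through to w.
    return w + (w_q - w).detach()

# Per the submission's scope: MLP weights in int5, attention weights in int6.
mlp_w = torch.randn(64, 64, requires_grad=True)
attn_w = torch.randn(64, 64, requires_grad=True)
mlp_q = fake_quantize(mlp_w, bits=5)
attn_q = fake_quantize(attn_w, bits=6)
```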
Architecture
BigramHash
BigramHash embeddings used instead of standard token embeddings.
parameters: {"buckets":10240,"dim":128}
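One plausible form of bigram-hash embeddings with the listed parameters (buckets=10240, dim=128): hash each (previous token, current token) pair into a shared bucket table. The multiplicative hash below is an assumption, not the PR's actual scheme.

```python
import torch

class BigramHashEmbedding(torch.nn.Module):
    def __init__(self, buckets: int = 10240, dim: int = 128):
        super().__init__()
        self.buckets = buckets
        self.table = torch.nn.Embedding(buckets, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Pair each token with its predecessor (position 0 pairs with itself).
        prev = torch.roll(tokens, shifts=1, dims=-1)
        prev[..., 0] = tokens[..., 0]
        # Cheap multiplicative hash of the (prev, cur) bigram into a bucket
        # (hash constant is illustrative, not from the submission).
        h = (prev * 1000003 + tokens) % self.buckets
        return self.table(h)

emb = BigramHashEmbedding()
out = emb(torch.randint(0, 50257, (2, 16)))   # (batch, seq) -> (batch, seq, 128)
```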
SmearGate
Token blending mechanism.
parameters: null
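The card does not specify how SmearGate blends tokens; one common "smear" formulation is a learned sigmoid gate that mixes each position with its predecessor, sketched below under that assumption.

```python
import torch

class SmearGate(torch.nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.gate = torch.nn.Linear(dim, 1)   # per-position blend weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        prev = torch.roll(x, shifts=1, dims=1)
        prev[:, 0] = x[:, 0]                  # no predecessor at position 0
        g = torch.sigmoid(self.gate(x))       # how much of the previous token to mix in
        return (1 - g) * x + g * prev

x = torch.randn(2, 16, 128)
y = SmearGate()(x)
```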
tied embeddings
Input and output embeddings are tied.
parameters: null
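Tied embeddings in PyTorch amount to sharing one parameter between the input table and the output projection, which also halves the embedding parameter count in the artifact:

```python
import torch

vocab, dim = 50257, 128                    # illustrative sizes, not the PR's
embed = torch.nn.Embedding(vocab, dim)
lm_head = torch.nn.Linear(dim, vocab, bias=False)
lm_head.weight = embed.weight              # output projection reuses the input table
```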
depth recurrence
4 unique transformer layers shared across 3 loop iterations.
parameters: {"layers":4,"loops":3}
learned level signals
Learned level signals used with depth recurrence.
parameters: null
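The depth-recurrence entries above (4 shared layers, 3 loops, learned level signals) can be sketched as: run the same 4-layer stack 3 times, adding a learned per-loop "level" vector so the shared layers can tell which iteration they are in. The injection point for the level signal is an assumption.

```python
import torch

class RecurrentDepthModel(torch.nn.Module):
    def __init__(self, dim: int = 128, n_layers: int = 4, n_loops: int = 3):
        super().__init__()
        self.layers = torch.nn.ModuleList(
            torch.nn.TransformerEncoderLayer(dim, 4, batch_first=True)
            for _ in range(n_layers)
        )
        # Learned level signals: one vector per loop iteration.
        self.level = torch.nn.Parameter(torch.zeros(n_loops, dim))
        self.n_loops = n_loops

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for loop in range(self.n_loops):       # 3 loops over the same 4 layers
            x = x + self.level[loop]           # mark the current depth level
            for layer in self.layers:          # = 12 effective layers
                x = layer(x)
        return x

model = RecurrentDepthModel()
out = model(torch.randn(2, 16, 128))
```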
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"newton_schulz_orthogonalization":true}
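Muon's Newton-Schulz step approximately orthogonalizes each update matrix before applying momentum. The quintic coefficients below are those of the public Muon reference implementation, assumed here; the PR's exact variant is not shown on the card.

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration driving the singular values of G
    # toward 1, i.e. approximately orthogonalizing the update.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)          # normalize so the iteration is stable
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

# Square example; the reference transposes tall matrices first.
O = newton_schulz(torch.randn(32, 32))
```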
Weight Averaging
SWA
parameters: {"start_frac":0.4,"every_steps":50}
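With the listed parameters, stochastic weight averaging starts 40% of the way into training and updates a running average every 50 steps; a minimal sketch:

```python
import torch

def swa_steps(total_steps: int, start_frac: float = 0.4, every_steps: int = 50):
    # Steps at which the running average is updated, per the listed parameters.
    start = int(total_steps * start_frac)
    return list(range(start, total_steps, every_steps))

# Running average over checkpoints: avg += (w - avg) / n.
avg, n = None, 0
for step in swa_steps(1000):                       # illustrative total_steps
    w = torch.full((4,), float(step))              # stand-in for model weights
    n += 1
    avg = w.clone() if avg is None else avg + (w - avg) / n
```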
Compression
zstd
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
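Sliding-window evaluation with stride 64 typically advances the context window 64 tokens at a time and scores only the newly exposed tokens, so every token is evaluated once with substantial left context. The window length of 256 below is an assumed value; only the stride comes from the submission.

```python
def sliding_windows(n_tokens: int, window: int = 256, stride: int = 64):
    # Yield (start, end, n_scored): feed tokens[start:end] to the model and
    # count loss only on the last n_scored tokens of that window.
    spans = []
    end = min(window, n_tokens)
    spans.append((0, end, end))        # first window scores everything it sees
    while end < n_tokens:
        nxt = min(end + stride, n_tokens)
        spans.append((max(0, nxt - window), nxt, nxt - end))
        end = nxt
    return spans

spans = sliding_windows(400)
```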
Other
other
Training on a single H100 GPU for 20 minutes instead of the standard 8xH100 for 10 minutes.
parameters: {"gpu_count":1,"duration_minutes":20}
Test-Time Training
LoRA TTT
parameters: {"rank":32}
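A rank-32 LoRA adapter of the usual form for test-time training: the base weight is frozen and only a low-rank update B @ A trains. How the submission attaches one adapter per recurrence loop is not detailed on the card; this is the generic building block.

```python
import torch

class LoRALinear(torch.nn.Module):
    def __init__(self, base: torch.nn.Linear, rank: int = 32, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # only the adapter trains at test time
        self.A = torch.nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank              # B starts at zero, so the adapted
                                               # layer initially matches the base

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T) @ self.B.T * self.scale

layer = LoRALinear(torch.nn.Linear(128, 128), rank=32)
y = layer(torch.randn(4, 128))
```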
Novel Contributions
- Reproduction of the #1 leaderboard baseline on a single H100 with reduced compute
- Mixed int5/int6 quantization-aware training with BigramHash embeddings, SmearGate, and SWA
- Depth recurrence model with 4 shared transformer layers across 3 loops
- LoRA rank-32 per loop and learned level signals for depth recurrence
- Reported artifact sizes and compute-constrained training results for two submissions