PR #822

open

Add baseline and depth recurrence submissions (1xH100 20min runs)

by henrycashe26
val_bpb
1.2604
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.8MB

Training Techniques

Quantization
mixed int5/int6 QAT
bits: null
scope: MLP in int5, attention in int6
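The scope line means MLP weights are trained against an int5 grid and attention weights against an int6 grid. A minimal sketch of the forward-pass fake quantization (the submission's exact scheme is not given; symmetric per-tensor scaling is an assumption):

```python
import numpy as np

def fake_quant(w, bits):
    # Symmetric per-tensor fake quantization: weights are rounded onto the
    # signed int grid in the forward pass (real QAT uses a straight-through
    # estimator so gradients flow through the rounding).
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

w = np.array([0.5, -1.2, 0.03, 0.9])
w_mlp  = fake_quant(w, 5)  # MLP weights on the int5 grid
w_attn = fake_quant(w, 6)  # attention weights on the int6 grid
```

Per-channel scales or asymmetric zero points would also fit the "mixed int5/int6" description; only the bit widths are stated.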
Architecture
BigramHash
BigramHash embeddings used instead of standard token embeddings.
parameters: {"buckets":10240,"dim":128}
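With 10240 buckets and dimension 128, the embedding table is indexed by a hash of the (previous, current) token pair rather than the token id alone. A sketch under assumptions (the hash multiplier and the sentinel for position 0 are illustrative, not from the submission):

```python
import numpy as np

BUCKETS, DIM = 10240, 128  # buckets/dim taken from the submission's parameters

rng = np.random.default_rng(0)
table = rng.normal(scale=0.02, size=(BUCKETS, DIM)).astype(np.float32)

def bigram_embed(tokens):
    # Hash each (previous, current) token pair into a bucket and look it up.
    # The multiplier is an arbitrary odd constant; the real hash may differ.
    # Position 0 pairs the first token with a sentinel id of 0.
    prev = np.concatenate([[0], tokens[:-1]])
    idx = (prev * 1000003 + tokens) % BUCKETS
    return table[idx]  # (seq_len, DIM)
```

Hash collisions are accepted by design: 10240 buckets cannot represent all bigrams distinctly, but the table is far smaller than a full bigram table.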
SmearGate
Token blending mechanism.
parameters: null
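The submission gives no parameters for SmearGate, so the following is only a guess at the mechanism's shape: each position blends in a gated fraction of the previous position's embedding, with the gate produced by a learned scalar (here a fixed logit stands in for the learned one):

```python
import numpy as np

def smear_gate(x, gate_logit=0.0):
    # x: (seq_len, dim). Each position mixes in a sigmoid-gated fraction of
    # the previous position's embedding; position 0 is left unchanged.
    # The gate is a single learned logit in this sketch; the real gate may
    # be per-channel or input-dependent.
    g = 1.0 / (1.0 + np.exp(-gate_logit))
    out = x.copy()
    out[1:] = x[1:] + g * x[:-1]
    return out
```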
tied embeddings
Input and output embeddings are tied.
parameters: null
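Tying means a single matrix serves as both the token-embedding lookup and the output projection, halving embedding parameters (relevant to the 15.8MB artifact budget). A minimal sketch with illustrative sizes:

```python
import numpy as np

VOCAB, DIM = 32768, 128  # illustrative sizes; the real vocab is not stated

rng = np.random.default_rng(0)
E = rng.normal(scale=0.02, size=(VOCAB, DIM))  # the single shared matrix

def embed(tokens):
    return E[tokens]      # input side: row lookup

def logits(hidden):
    return hidden @ E.T   # output side: the same matrix, transposed
```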
depth recurrence
4 unique transformer layers shared across 3 loop iterations.
parameters: {"layers":4,"loops":3}
learned level signals
Learned level signals used with depth recurrence.
parameters: null
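Together, these two entries describe 4 unique layers executed 3 times (12 effective layers), with a learned per-loop signal so the shared weights know which pass they are on. A sketch where a residual matrix stands in for each transformer block and the level signal is an additive bias (the injection mechanism is an assumption):

```python
import numpy as np

N_LAYERS, N_LOOPS, DIM = 4, 3, 64  # layers/loops from the submission; DIM assumed

rng = np.random.default_rng(0)
# One residual matrix per unique layer stands in for a full transformer block.
blocks = [rng.normal(scale=0.02, size=(DIM, DIM)) for _ in range(N_LAYERS)]
# One learned "level signal" per loop tells the shared weights which pass it is.
level_signals = rng.normal(scale=0.02, size=(N_LOOPS, DIM))

def forward(x):
    # x: (seq_len, DIM). The same 4 blocks run 3 times -> 12 effective layers,
    # while only 4 layers' worth of weights count toward the artifact size.
    for loop in range(N_LOOPS):
        x = x + level_signals[loop]      # inject the loop index as a bias
        for w in blocks:
            x = x + np.tanh(x @ w)       # residual stand-in for a block
    return x
```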
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: {"newton_schulz_orthogonalization":true}
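The `newton_schulz_orthogonalization` flag refers to the iteration Muon applies to the momentum-averaged gradient before the update. A sketch of that step, using the quintic coefficients from the widely circulated Muon reference implementation (only the orthogonalization is shown, not the full optimizer):

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    # Quintic Newton-Schulz iteration that approximately orthogonalizes a
    # matrix, i.e. drives its singular values toward 1 without computing an
    # explicit SVD. Coefficients follow the common Muon implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)   # normalize so singular values <= 1
    tall = X.shape[0] > X.shape[1]
    if tall:
        X = X.T                         # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if tall else X
```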
Weight Averaging
SWA
parameters: {"start_frac":0.4,"every_steps":50}
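`start_frac: 0.4` and `every_steps: 50` mean snapshots are folded into a running average every 50 steps once 40% of training has elapsed. A scalar-weight sketch (the total step count and `train_step` are stand-ins):

```python
TOTAL_STEPS = 1000        # illustrative; start_frac/every_steps are from the submission
START_FRAC, EVERY = 0.4, 50

def train_step(step):
    # Stand-in for one optimizer step; returns the current scalar "weights".
    return float(step)

swa_avg, n_avg = 0.0, 0
for step in range(TOTAL_STEPS):
    weights = train_step(step)
    # From 40% of the way through training, fold a snapshot into the
    # running average every 50 steps (incremental-mean update).
    if step >= START_FRAC * TOTAL_STEPS and step % EVERY == 0:
        n_avg += 1
        swa_avg += (weights - swa_avg) / n_avg
```

In a real run the same incremental-mean update is applied elementwise to every parameter tensor, and the averaged weights (not the final ones) are evaluated and shipped.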
Compression
zstd
level: null
Evaluation
sliding window eval
parameters: {"stride":64}
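With stride 64, the eval window advances 64 tokens at a time and scores only the newly revealed tokens, so each token gets long left context but is counted exactly once. A sketch (`context_len` and the `nll_fn` interface are assumptions; only the stride is from the submission):

```python
def sliding_window_eval(tokens, context_len, stride, nll_fn):
    # Slide a window over the sequence in steps of `stride`. Each window
    # scores only its last n_new tokens, with up to `context_len` tokens of
    # left context, so every token is scored exactly once.
    total, count = 0.0, 0
    prev_end = 0
    for end in range(stride, len(tokens) + stride, stride):
        end = min(end, len(tokens))
        window = tokens[max(0, end - context_len):end]
        n_new = end - prev_end
        total += nll_fn(window, n_new)  # summed NLL of the last n_new tokens
        count += n_new
        prev_end = end
    return total / count  # mean NLL per token; divide by ln 2 for bits
```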
Other
other
Training on a single H100 GPU for 20 minutes instead of the standard 8xH100 for 10 minutes.
parameters: {"gpu_count":1,"duration_minutes":20}
Test-Time Training
LoRA TTT
parameters: {"rank":32}
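For test-time training, LoRA freezes the base weights and learns only a rank-32 update. A sketch of the adapter's forward path (rank 32 is from the submission; the dimension and init scales are illustrative):

```python
import numpy as np

DIM, RANK = 64, 32  # rank 32 from the submission; DIM is illustrative

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(DIM, DIM))   # frozen pretrained weight
A = rng.normal(scale=0.01, size=(DIM, RANK))  # trainable low-rank factor
B = np.zeros((RANK, DIM))                     # zero-init -> adapter starts as a no-op

def lora_forward(x):
    # Base path plus low-rank update; at test time only A and B are updated,
    # adding 2 * DIM * RANK trainable parameters per adapted matrix.
    return x @ W + (x @ A) @ B
```

Because B starts at zero, test-time adaptation begins from exactly the pretrained model and only departs from it as the adapter trains.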

Novel Contributions

  • Reproduction of the #1 leaderboard baseline on a single H100 with reduced compute
  • Mixed int5/int6 quantization-aware training with BigramHash embeddings, SmearGate, and SWA
  • Depth recurrence model with 4 shared transformer layers across 3 loops
  • Per-loop LoRA adapters (rank 32) and learned level signals for depth recurrence
  • Reported artifact sizes and compute-constrained training results for two submissions