PR #1465
Non-record: v6.2 Phase 5a SOTA-trivial stack (3-seed re-run @66% = 1.138112; TTT 1.204 not competitive)
Status: open
by sisegod
val_bpb: 1.1381
Architecture: Hybrid
Optimizer: Muon
Artifact Size: 15 MB
Training Techniques

Quantization
- mixed int6 (bits: 6, scope: embeddings)
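A minimal sketch of what int6 quantization of the embedding path could look like. The symmetric per-row scaling and the signed range [-31, 31] are assumptions; the PR does not specify the exact scheme.

```python
import numpy as np

def quantize_int6(w):
    """Symmetric per-row quantization into the signed 6-bit range [-31, 31]."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 31.0
    scale = np.where(scale == 0, 1.0, scale)        # guard all-zero rows
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)  # 6-bit values stored in int8
    return q, scale

def dequantize_int6(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
emb = rng.standard_normal((8, 16)).astype(np.float32)  # toy embedding table
q, s = quantize_int6(emb)
recon = dequantize_int6(q, s)
err = np.abs(recon - emb).max()  # bounded by half a quantization step per row
```

The round-to-nearest error is at most half a scale step per element, which is why the reconstruction stays close to the original table.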
Architecture
- weight tying: tied token embeddings with int6 quantization for the embedding path (parameters: null)
- depth recurrence: tested depth-recurrence variants as an alternative architecture, though they were abandoned (parameters: {"unique_layers": 9, "recur": 2, "effective_layers": 18})
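The abandoned depth-recurrence variant reuses one stack of unique blocks several times, so 9 unique layers applied twice give 18 effective layers. A sketch of the layer-reuse loop; the toy affine blocks are illustrative, not the real architecture:

```python
import numpy as np

def depth_recurrent_forward(x, blocks, recur=2):
    """Apply the same stack of unique blocks `recur` times.
    Effective depth = len(blocks) * recur, with no extra parameters."""
    calls = 0
    for _ in range(recur):
        for block in blocks:
            x = block(x)
            calls += 1
    return x, calls

# 9 unique "layers" (toy affine maps) reused twice -> 18 effective layers
rng = np.random.default_rng(0)
blocks = [(lambda x, w=rng.standard_normal(): x + 0.01 * w) for _ in range(9)]
y, calls = depth_recurrent_forward(np.zeros(4), blocks, recur=2)
```

The appeal is parameter reuse: depth doubles while the weight count (and hence artifact size) stays that of the 9 unique layers.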
Optimizer
- Muon (weight_decay: null, momentum: null, other_params: {"ttt_muon": true, "newton_schulz": 5})
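Muon orthogonalizes each 2-D update matrix with a few Newton-Schulz iterations. A sketch using the odd-polynomial coefficients popularized by the Muon optimizer and the 5 steps named in `newton_schulz: 5`; the exact variant used in this PR is an assumption:

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately map a matrix to its orthogonal polar factor via an
    odd-polynomial Newton-Schulz iteration (Muon-style coefficients)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)  # Frobenius-normalize so singular values <= 1
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x  # acts on each singular value independently
    return x

# Toy gradient with a known, moderately spread singular spectrum
rng = np.random.default_rng(0)
u, _ = np.linalg.qr(rng.standard_normal((8, 8)))
v, _ = np.linalg.qr(rng.standard_normal((8, 8)))
g = u @ np.diag(np.linspace(0.2, 1.0, 8)) @ v.T
y = newton_schulz_orthogonalize(g, steps=5)
sv = np.linalg.svd(y, compute_uv=False)  # pushed toward 1 after 5 steps
```

Because the iteration is an odd polynomial in the matrix, it rescales each singular value toward 1 without changing the singular vectors, which is the effect Muon wants from its update.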
Weight Averaging
- EMA (parameters: {"decay": 0.9965})
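The EMA entry is the standard exponential moving average of weights; a minimal sketch with the decay from this PR (the scalar form stands in for a per-tensor update over the full parameter set):

```python
def ema_update(avg, new, decay=0.9965):
    """Exponential moving average of weights: avg <- decay*avg + (1-decay)*new."""
    return decay * avg + (1.0 - decay) * new

# With decay 0.9965 the effective averaging window is roughly
# 1 / (1 - 0.9965) ~ 286 steps; a constant stream converges to that constant.
avg = 0.0
for _ in range(5000):
    avg = ema_update(avg, 1.0, decay=0.9965)
```

Evaluation then uses the averaged copy rather than the raw training weights.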
Evaluation
- sliding window eval (parameters: {"stride": 64})
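Sliding-window evaluation scores a long sequence in overlapping windows so each token is scored once with fresh left context. A sketch of the span planner only (the window size and the score-last-`stride`-tokens policy are assumptions; the PR specifies just `stride: 64`):

```python
def sliding_window_spans(seq_len, window=128, stride=64):
    """Plan eval spans: each tuple is (ctx_start, ctx_end, score_start).
    Tokens in [score_start, ctx_end) are scored using context from ctx_start,
    so every token is scored exactly once."""
    spans = []
    score_start = 0
    while score_start < seq_len:
        if score_start == 0:
            ctx_start, ctx_end = 0, min(window, seq_len)
        else:
            ctx_end = min(score_start + stride, seq_len)
            ctx_start = max(0, ctx_end - window)  # keep window-stride tokens of context
        spans.append((ctx_start, ctx_end, score_start))
        score_start = ctx_end
    return spans

spans = sliding_window_spans(seq_len=200, window=128, stride=64)
scored = sum(e - s0 for (_, e, s0) in spans)  # total tokens scored
```

A smaller stride gives later tokens more context per window at the cost of more forward passes.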
Test-Time Training
- score-first TTT (parameters: {"muon": true, "stride": 64})
Regularization
- weight decay (parameters: null)
Novel Contributions
- Phase 5a trivial-wins composition combining prior improvements from QK gain initialization, Muon row normalization, EMA tuning, hidden multiplier re-investment, and int6 tied embeddings.
- A 3-seed SLOT-100 re-run showing an improved mid-training eval (@66%) and a re-run validation bpb of about 1.138112.
- Legal score-first Muon TTT was evaluated and found not competitive versus aggressive SLOT.
- Use of custom rANS entropy coding to pack the model into a sub-16MB artifact.
- Hidden multiplier increased from 4x to 5x as a byte re-investment that improved performance.
- Extensive negative ablations documenting unsuccessful compression and architecture ideas.
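The sub-16MB artifact relies on rANS entropy coding of the weights. As a toy illustration of the rANS mechanism only (a Python bigint stands in for the renormalized 32-bit state and byte stream of a production coder, and the frequency table here is made up):

```python
def _cumfreq(freqs):
    cum = [0]
    for f in freqs:
        cum.append(cum[-1] + f)
    return cum

def rans_encode(symbols, freqs):
    """Encode a symbol list into one integer state (toy, unbounded-state rANS)."""
    cum, total = _cumfreq(freqs), sum(freqs)
    x = 1
    for s in reversed(symbols):  # rANS decodes in reverse encode order
        f = freqs[s]
        x = (x // f) * total + (x % f) + cum[s]
    return x

def rans_decode(x, freqs, n):
    """Recover n symbols; frequent symbols cost fewer bits of state growth."""
    cum, total = _cumfreq(freqs), sum(freqs)
    out = []
    for _ in range(n):
        slot = x % total
        s = next(i for i in range(len(freqs)) if cum[i] <= slot < cum[i + 1])
        out.append(s)
        x = freqs[s] * (x // total) + slot - cum[s]
    return out

freqs = [5, 2, 1]               # skewed toy alphabet {0, 1, 2}
msg = [0, 0, 1, 0, 2, 0, 1, 0]
code = rans_encode(msg, freqs)  # compact integer; decode reverses it exactly
```

Skewed symbol distributions (as in quantized weight tensors) are exactly where an entropy coder beats fixed-width packing.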