PR #1864

Status: open

Hardik SOTA submission

by hardik-bhalekar
val_bpb: 1.0805
Architecture: Transformer
Optimizer: Muon
Artifact Size: 16 MB

Training Techniques

Architecture
Depth Recurrence
Layers 3 through 5 are executed twice per forward pass to increase effective depth without increasing parameter count.
parameters: {"layers":[3,4,5],"repeats":2}
Parallel Residuals
Attention and MLP are processed in parallel to widen the model within the same latency budget.
parameters: null
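
A sketch of one common way to realize this (a GPT-J-style parallel block); the single shared pre-norm is an assumption, since the PR lists no parameters:

```python
import torch.nn as nn

class ParallelResidualBlock(nn.Module):
    """Attention and MLP read the same normalized input and their
    outputs are summed, instead of running one after the other."""

    def __init__(self, d_model: int, attn: nn.Module, mlp: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = attn
        self.mlp = mlp

    def forward(self, x):
        h = self.norm(x)
        # Both branches depend only on h, so they can execute concurrently.
        return x + self.attn(h) + self.mlp(h)
```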
Optimizer
Muon
weight_decay: null
momentum: null
other_params: null
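
All Muon hyperparameters are listed as null, so the values below are placeholders. For orientation, a hedged sketch of the publicly described Muon update for 2D weight matrices: momentum, then approximate orthogonalization of the update direction via a Newton-Schulz iteration:

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5, eps: float = 1e-7):
    """Approximately orthogonalize G with a quintic Newton-Schulz
    iteration (coefficients from the public Muon reference code)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    flip = X.shape[0] > X.shape[1]
    if flip:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if flip else X

@torch.no_grad()
def muon_step(param, grad, buf, lr=0.02, momentum=0.95):
    """One Muon-style step for a 2D hidden weight. lr/momentum are
    placeholders, and weight decay is omitted since the PR lists it
    as null."""
    buf.mul_(momentum).add_(grad)
    param.add_(newton_schulz(buf), alpha=-lr)
```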
Test-Time Training
Score-first TTT
parameters: null
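
The PR gives no TTT parameters, so the loop below rests on one reading of "score-first": each evaluation chunk is scored under the current weights before the model adapts on it, so no chunk is ever scored by weights that have already seen it. A sketch under that assumption (the optimizer choice and learning rate are placeholders):

```python
import torch
import torch.nn.functional as F

def score_first_ttt(model, chunks, lr=1e-4):
    """Prequential loop: score each (inputs, targets) chunk, then train
    on it. Returns mean loss in nats per token (divide by ln 2 for bits)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    total_loss, total_tokens = 0.0, 0
    for inputs, targets in chunks:
        model.eval()
        with torch.no_grad():  # score first, with unadapted weights
            loss = F.cross_entropy(model(inputs).flatten(0, -2),
                                   targets.flatten())
        total_loss += loss.item() * targets.numel()
        total_tokens += targets.numel()
        model.train()          # then adapt on the chunk just scored
        opt.zero_grad()
        F.cross_entropy(model(inputs).flatten(0, -2),
                        targets.flatten()).backward()
        opt.step()
    return total_loss / total_tokens
```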
Quantization
GPTQ
bits: null
scope: all
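
The bit-width is unspecified and "SDClip" is not documented in the card, so full GPTQ is not reproducible from this entry alone. For orientation, the per-channel round-to-nearest quantizer that GPTQ builds on; full GPTQ additionally propagates each column's quantization error to not-yet-quantized columns using Hessian information, which is omitted here:

```python
import torch

def rtn_quantize(W: torch.Tensor, bits: int = 4):
    """Symmetric per-output-channel round-to-nearest quantization.

    bits=4 is a placeholder; the submission leaves the bit-width null.
    Returns dequantized weights of the same shape for drop-in use.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = (W.abs().amax(dim=1, keepdim=True) / qmax).clamp(min=1e-8)
    q = (W / scale).round().clamp(-qmax - 1, qmax)
    return q * scale
```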
Compression
Brotli
level: null
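
A sketch of Brotli-compressing a serialized state dict, assuming the `brotli` Python bindings; the quality level is a placeholder since the PR lists it as null:

```python
import io
import brotli  # pip install brotli
import torch

def save_compressed(model, path: str, quality: int = 11):
    """torch.save the state dict to memory, then Brotli-compress it.
    quality=11 (maximum) is a placeholder; the PR's level is null."""
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    with open(path, "wb") as f:
        f.write(brotli.compress(buf.getvalue(), quality=quality))

def load_compressed(model, path: str):
    """Decompress and restore the state dict."""
    with open(path, "rb") as f:
        state = torch.load(io.BytesIO(brotli.decompress(f.read())))
    model.load_state_dict(state)
```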

Novel Contributions

  • SP8192 tokenizer for improved compression on FineWeb (see the training sketch after this list)
  • Depth recurrence on layers 3-5
  • Parallel residual attention/MLP processing
  • Muon optimizer
  • Score-first test-time training
  • GPTQ with SDClip
  • Brotli-compressed state dictionary
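
On the tokenizer: the PR does not include a training command, but SP8192 plausibly denotes a SentencePiece model with an 8192-entry vocabulary. A hedged sketch under that assumption (`fineweb_sample.txt` is a hypothetical local dump of FineWeb text, and `model_type="bpe"` is a guess the card does not confirm):

```python
import sentencepiece as spm

# Train a hypothetical 8192-entry tokenizer on a FineWeb text dump.
spm.SentencePieceTrainer.train(
    input="fineweb_sample.txt",   # hypothetical corpus file
    model_prefix="sp8192",
    vocab_size=8192,
    model_type="bpe",             # assumption; the PR does not say
)

# Quick round-trip check with the trained model.
sp = spm.SentencePieceProcessor(model_file="sp8192.model")
print(sp.encode("Hello FineWeb", out_type=str))
```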