PR #1731

closed

Record: SP8192 + Hadamard Rotation + AWQ + Layer-wise Precision + Hessian-Aware Calibration + Legal TTT — val_bpb 1.0785 (3-seed mean)

by Victory963
val_bpb: 1.0785
Architecture: Transformer
Optimizer: SGD
Artifact Size: ~15.98 MB

Training Techniques

Quantization
mixed int4/int6/int8
bits: null
scope: embeddings, attention, MLP, residuals
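The record lists the quantized scopes but not the exact bit split. A minimal sketch of symmetric per-tensor quantization with an illustrative (not the record's) per-scope bit plan:

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1          # 7 for int4, 31 for int6, 127 for int8
    scale = np.abs(w).max() / qmax
    q = np.round(w / scale).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    return q * scale

# Hypothetical per-scope bit plan; the actual allocation is not given above
bit_plan = {"embeddings": 8, "attention": 6, "mlp": 4, "residuals": 8}

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, s = quantize_symmetric(w, bit_plan["mlp"])
w_hat = dequantize(q, s)
```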
Architecture
depth recurrence
3-layer depth recurrence creating virtual layers from physical layers
parameters: {"layers":3,"virtual_layers":17,"physical_layers":11}
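The 17-from-11 arithmetic is consistent with running one contiguous 3-layer block three times (11 + 2×3 = 17). A hypothetical schedule (the block's position, `start=8`, is assumed, not stated):

```python
def virtual_schedule(physical_layers=11, block=3, repeats=3, start=8):
    """Expand a physical stack into a virtual execution order by repeating
    one contiguous `block` of layers `repeats` times. `start` is assumed."""
    order = list(range(start))                            # layers before the block
    order += list(range(start, start + block)) * repeats  # recurrent block
    order += list(range(start + block, physical_layers))  # layers after it
    return order

sched = virtual_schedule()  # 17 virtual layers over 11 physical ones
```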
parallel residuals
GPT-J style parallel residual pathway where attention and MLP read from the same input
parameters: {"start_layer":7}
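The GPT-J-style wiring can be sketched as one residual update fed by both sublayers reading the same normalized input (the sublayers below are toy stand-ins, not the model's):

```python
import numpy as np

def parallel_residual_block(x, attn, mlp, norm):
    """GPT-J style parallel block: attention and MLP both read the same
    normalized input; their outputs are summed into one residual update."""
    h = norm(x)
    return x + attn(h) + mlp(h)

# Toy stand-ins, just to exercise the wiring
norm = lambda x: (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)
attn = lambda h: 0.5 * h
mlp = lambda h: 0.25 * h

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 8))
y = parallel_residual_block(x, attn, mlp, norm)
```

This saves one normalization and one sequential dependency per layer versus the usual attention-then-MLP ordering.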
Partial RoPE
Uses partial rotary positional embeddings
parameters: {"dimensions":16}
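Partial RoPE rotates only the first 16 dimensions of each head and passes the rest through unchanged. A self-contained sketch:

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first `rot_dims` dimensions
    only; remaining dimensions are left unrotated."""
    seq, _ = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    angles = np.arange(seq)[:, None] * inv_freq[None, :]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)

x = np.ones((8, 64))
y = partial_rope(x)
```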
LeakyReLU
Uses LeakyReLU activation in the MLP
parameters: {"slope":0.5}
weight tying
Tied input and output embeddings
parameters: null
QK-Gain
Learnable per-head query scaling
parameters: {"gain":5.25}
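Per-head query scaling amounts to multiplying queries by a learnable per-head gain before the dot product. A sketch using the record's 5.25 initialization (the exact placement in the real attention code is assumed):

```python
import numpy as np

def qk_scores(q, k, gain):
    """Attention logits with a learnable per-head query gain applied
    before the dot product; `gain` has shape (heads,)."""
    d = q.shape[-1]
    q = q * gain[:, None, None]                 # broadcast over (seq, d)
    return q @ k.transpose(0, 2, 1) / np.sqrt(d)

heads, seq, d = 4, 8, 16
rng = np.random.default_rng(0)
q, k = rng.standard_normal((2, heads, seq, d))
scores = qk_scores(q, k, np.full(heads, 5.25))  # gain value from the record
base = qk_scores(q, k, np.ones(heads))          # unscaled, for comparison
```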
U-Net skip connections
Skip-gated U-Net style connections
parameters: null
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.005,"epochs_per_chunk":3}
Weight Averaging
EMA
parameters: {"decay":0.9965}
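The EMA update with decay 0.9965 is standard; a minimal sketch over a list of weights:

```python
def ema_update(avg, params, decay=0.9965):
    """One EMA step over the weight list: avg <- decay*avg + (1-decay)*params."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]

# Tracking a constant weight of 1.0 from an average initialized at 0.0;
# after n steps the average is 1 - decay**n
avg = [0.0]
for _ in range(3):
    avg = ema_update(avg, [1.0])
```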
Compression
lzma
level: null
Evaluation
sliding window eval
parameters: null
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"momentum":0.9,"epochs":3}
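"Score-first" presumably means each chunk is scored with the current weights before the model adapts on it, so no chunk's loss reflects training on that chunk. A minimal sketch with hypothetical `score`/`train_step` stand-ins (the record's real routines use SGD with lr=0.005, momentum=0.9, 3 epochs):

```python
def score_first_ttt(model, chunks, score, train_step, epochs=3):
    """Score each chunk before updating on it, then adapt for `epochs` passes."""
    losses = []
    for chunk in chunks:
        losses.append(score(model, chunk))    # evaluate BEFORE updating
        for _ in range(epochs):
            model = train_step(model, chunk)  # then adapt on the same chunk
    return model, losses

# Toy stand-ins: "model" is a scalar, loss is squared error to the chunk mean
score = lambda m, c: (m - sum(c) / len(c)) ** 2
train_step = lambda m, c: m + 0.5 * (sum(c) / len(c) - m)

model, losses = score_first_ttt(0.0, [[1.0, 1.0], [2.0, 2.0]], score, train_step)
```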
Regularization
logit softcap
parameters: {"value":30}
LR Schedule
cosine decay
parameters: {"applied_to":"TTT"}
warmdown
parameters: {"warmdown_steps_fraction":0.72}
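Only the warmdown fraction (0.72) is given; a common reading is a constant learning rate followed by a linear decay to zero over the final 72% of steps. A sketch under that assumption:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_frac=0.72):
    """Constant LR, then linear decay to 0 over the final `warmdown_frac`
    of training (interpretation assumed; only the fraction is given)."""
    warmdown_steps = int(total_steps * warmdown_frac)
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps

lrs = [warmdown_lr(s, 100, 0.005) for s in range(100)]
```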

Novel Contributions

  • Hadamard rotation applied before quantization to reduce outlier effects
  • AWQ with Hessian-aware calibration for per-layer quantization ranges
  • Layer-wise mixed precision allocation across embeddings, attention, MLP, and residuals
  • 3-layer depth recurrence producing virtual layers from a smaller physical stack
  • Parallel residuals from layer 7 onward
  • Score-first test-time training kept within the competition rules
  • QK-Gain tuned to 5.25
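The Hadamard trick above can be sketched as: rotate weights by an orthonormal Hadamard matrix, quantize in the rotated basis (where outliers are spread across dimensions, shrinking the quantization step), then rotate back. This is an illustrative int4 per-tensor version; the record's AWQ and Hessian-aware calibration are omitted:

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an orthonormal n x n Hadamard matrix
    (n must be a power of 2); orthonormality makes the rotation lossless."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def rotate_then_quantize(w, bits=4):
    """Quantize in the Hadamard-rotated basis, then rotate back."""
    H = hadamard(w.shape[1])
    w_rot = w @ H                          # spread outliers across dims
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w_rot).max() / qmax
    q = np.round(w_rot / scale)
    return (q * scale) @ H.T               # H is orthogonal: H^-1 = H^T

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
w[0, 0] = 50.0                             # inject an outlier
w_hat = rotate_then_quantize(w)

# Direct int4 quantization for comparison: the outlier inflates the step
scale_d = np.abs(w).max() / 7
w_direct = np.round(w / scale_d) * scale_d
err_rot = np.linalg.norm(w_hat - w)
err_direct = np.linalg.norm(w_direct - w)
```

With the outlier present, the rotated quantizer's reconstruction error is far below the direct quantizer's, which is the point of applying the rotation first.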