PR #1731

closed

Record: SP8192 + Hadamard Rotation + AWQ + Layer-wise Precision + Hessian-Aware Calibration + Legal TTT — val_bpb 1.0785 (3-seed mean)

by Victory963
val_bpb: 1.0785
Architecture: Transformer
Optimizer: SGD
Artifact Size: ~15.98 MB

Training Techniques

Quantization
mixed int4/int6/int8
bits: null
scope: embeddings, attention, MLP, residuals
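The record lists the quantized scopes but not the exact bit split. A minimal sketch of symmetric per-tensor quantization with an illustrative (not the record's) per-scope bit plan:

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1          # 7 for int4, 31 for int6, 127 for int8
    scale = np.abs(w).max() / qmax
    q = np.round(w / scale).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    return q * scale

# Hypothetical per-scope bit plan; the actual allocation is not given above
bit_plan = {"embeddings": 8, "attention": 6, "mlp": 4, "residuals": 8}

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, s = quantize_symmetric(w, bit_plan["mlp"])
w_hat = dequantize(q, s)
```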
Architecture
depth recurrence
3-layer depth recurrence creating virtual layers from physical layers
parameters: {"layers":3,"virtual_layers":17,"physical_layers":11}
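The 17-from-11 arithmetic is consistent with running one contiguous 3-layer block three times (11 + 2×3 = 17). A hypothetical schedule (the block's position, `start=8`, is assumed, not stated):

```python
def virtual_schedule(physical_layers=11, block=3, repeats=3, start=8):
    """Expand a physical stack into a virtual execution order by repeating
    one contiguous `block` of layers `repeats` times. `start` is assumed."""
    order = list(range(start))                            # layers before the block
    order += list(range(start, start + block)) * repeats  # recurrent block
    order += list(range(start + block, physical_layers))  # layers after it
    return order

sched = virtual_schedule()  # 17 virtual layers over 11 physical ones
```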
parallel residuals
GPT-J style parallel residual pathway where attention and MLP read from the same input
parameters: {"start_layer":7}
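The GPT-J-style wiring can be sketched as one residual update fed by both sublayers reading the same normalized input (the sublayers below are toy stand-ins, not the model's):

```python
import numpy as np

def parallel_residual_block(x, attn, mlp, norm):
    """GPT-J style parallel block: attention and MLP both read the same
    normalized input; their outputs are summed into one residual update."""
    h = norm(x)
    return x + attn(h) + mlp(h)

# Toy stand-ins, just to exercise the wiring
norm = lambda x: (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)
attn = lambda h: 0.5 * h
mlp = lambda h: 0.25 * h

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 8))
y = parallel_residual_block(x, attn, mlp, norm)
```

This saves one normalization and one sequential dependency per layer versus the usual attention-then-MLP ordering.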
Partial RoPE
Uses partial rotary positional embeddings
parameters: {"dimensions":16}
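Partial RoPE rotates only the first 16 dimensions of each head and passes the rest through unchanged. A self-contained sketch:

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first `rot_dims` dimensions
    only; remaining dimensions are left unrotated."""
    seq, _ = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    angles = np.arange(seq)[:, None] * inv_freq[None, :]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)

x = np.ones((8, 64))
y = partial_rope(x)
```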
LeakyReLU
Uses LeakyReLU activation in the MLP
parameters: {"slope":0.5}
weight tying
Tied input and output embeddings
parameters: null
QK-Gain
Learnable per-head query scaling
parameters: {"gain":5.25}
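Per-head query scaling amounts to multiplying queries by a learnable per-head gain before the dot product. A sketch using the record's 5.25 initialization (the exact placement in the real attention code is assumed):

```python
import numpy as np

def qk_scores(q, k, gain):
    """Attention logits with a learnable per-head query gain applied
    before the dot product; `gain` has shape (heads,)."""
    d = q.shape[-1]
    q = q * gain[:, None, None]                 # broadcast over (seq, d)
    return q @ k.transpose(0, 2, 1) / np.sqrt(d)

heads, seq, d = 4, 8, 16
rng = np.random.default_rng(0)
q, k = rng.standard_normal((2, heads, seq, d))
scores = qk_scores(q, k, np.full(heads, 5.25))  # gain value from the record
base = qk_scores(q, k, np.ones(heads))          # unscaled, for comparison
```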
U-Net skip connections
Skip-gated U-Net style connections
parameters: null
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.005,"epochs_per_chunk":3}
Weight Averaging
EMA
parameters: {"decay":0.9965}
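The EMA update with decay 0.9965 is standard; a minimal sketch over a list of weights:

```python
def ema_update(avg, params, decay=0.9965):
    """One EMA step over the weight list: avg <- decay*avg + (1-decay)*params."""
    return [decay * a + (1.0 - decay) * p for a, p in zip(avg, params)]

# Tracking a constant weight of 1.0 from an average initialized at 0.0;
# after n steps the average is 1 - decay**n
avg = [0.0]
for _ in range(3):
    avg = ema_update(avg, [1.0])
```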
Compression
lzma
level: null
Evaluation
sliding window eval
parameters: null
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"momentum":0.9,"epochs":3}
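"Score-first" presumably means each chunk is scored with the current weights before the model adapts on it, so no chunk's loss reflects training on that chunk. A minimal sketch with hypothetical `score`/`train_step` stand-ins (the record's real routines use SGD with lr=0.005, momentum=0.9, 3 epochs):

```python
def score_first_ttt(model, chunks, score, train_step, epochs=3):
    """Score each chunk before updating on it, then adapt for `epochs` passes."""
    losses = []
    for chunk in chunks:
        losses.append(score(model, chunk))    # evaluate BEFORE updating
        for _ in range(epochs):
            model = train_step(model, chunk)  # then adapt on the same chunk
    return model, losses

# Toy stand-ins: "model" is a scalar, loss is squared error to the chunk mean
score = lambda m, c: (m - sum(c) / len(c)) ** 2
train_step = lambda m, c: m + 0.5 * (sum(c) / len(c) - m)

model, losses = score_first_ttt(0.0, [[1.0, 1.0], [2.0, 2.0]], score, train_step)
```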
Regularization
logit softcap
parameters: {"value":30}
LR Schedule
cosine decay
parameters: {"applied_to":"TTT"}
warmdown
parameters: {"warmdown_steps_fraction":0.72}
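Only the warmdown fraction (0.72) is given; a common reading is a constant learning rate followed by a linear decay to zero over the final 72% of steps. A sketch under that assumption:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_frac=0.72):
    """Constant LR, then linear decay to 0 over the final `warmdown_frac`
    of training (interpretation assumed; only the fraction is given)."""
    warmdown_steps = int(total_steps * warmdown_frac)
    start = total_steps - warmdown_steps
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps

lrs = [warmdown_lr(s, 100, 0.005) for s in range(100)]
```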

Novel Contributions

  • Hadamard rotation applied before quantization to reduce outlier effects
  • AWQ with Hessian-aware calibration for per-layer quantization ranges
  • Layer-wise mixed precision allocation across embeddings, attention, MLP, and residuals
  • 3-layer depth recurrence producing virtual layers from a smaller physical stack
  • Parallel residuals from layer 7 onward
  • Score-first test-time training kept within the competition rules
  • QK-Gain tuned to 5.25
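The Hadamard trick above can be sketched as: rotate weights by an orthonormal Hadamard matrix, quantize in the rotated basis (where outliers are spread across dimensions, shrinking the quantization step), then rotate back. This is an illustrative int4 per-tensor version; the record's AWQ and Hessian-aware calibration are omitted:

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an orthonormal n x n Hadamard matrix
    (n must be a power of 2); orthonormality makes the rotation lossless."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def rotate_then_quantize(w, bits=4):
    """Quantize in the Hadamard-rotated basis, then rotate back."""
    H = hadamard(w.shape[1])
    w_rot = w @ H                          # spread outliers across dims
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w_rot).max() / qmax
    q = np.round(w_rot / scale)
    return (q * scale) @ H.T               # H is orthogonal: H^-1 = H^T

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
w[0, 0] = 50.0                             # inject an outlier
w_hat = rotate_then_quantize(w)

# Direct int4 quantization for comparison: the outlier inflates the step
scale_d = np.abs(w).max() / 7
w_direct = np.round(w / scale_d) * scale_d
err_rot = np.linalg.norm(w_hat - w)
err_direct = np.linalg.norm(w_direct - w)
```

With the outlier present, the rotated quantizer's reconstruction error is far below the direct quantizer's, which is the point of applying the rotation first.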