PR #1732
openRecord: SP8192 + Hadamard Rotation + AWQ + Layer-wise Precision + Hessian-Aware Calibration + Legal TTT — val_bpb 1.0785
by Victory963
val_bpb: 1.0785
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.98 MB
Training Techniques
Architecture
depth recurrence
A 3-layer recurrent block expands 11 physical layers into 17 virtual layers, with encoder-decoder skip connections linking the passes.
parameters: {"physical_layers":11,"virtual_layers":17,"recurrence_layers":3}
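The virtual-layer count follows from looping the 3-layer block three times: 11 + 2×3 = 17. A minimal sketch of the virtual execution schedule; the placement of the recurrent block (`recur_start=4`) is a hypothetical choice, not stated in the PR:

```python
def virtual_schedule(n_physical=11, recur_start=4, recur_len=3, n_loops=3):
    """Expand physical layer indices into the virtual execution order.

    The recurrent block [recur_start, recur_start + recur_len) runs n_loops
    times, so 11 physical layers yield 11 + (n_loops - 1) * recur_len = 17
    virtual layer applications.
    """
    pre = list(range(recur_start))
    block = list(range(recur_start, recur_start + recur_len))
    post = list(range(recur_start + recur_len, n_physical))
    return pre + block * n_loops + post
```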
GQA
Grouped query attention with reduced KV heads.
parameters: {"num_heads":8,"num_kv_heads":4}
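With 8 query heads sharing 4 KV heads, each KV head serves a group of 2 query heads. A numpy sketch of the forward pass (causal masking omitted for brevity; head dimension inferred from the query width):

```python
import numpy as np

def gqa_attention(q, k, v, num_heads=8, num_kv_heads=4):
    """Grouped-query attention: each of the 4 KV heads is shared by
    num_heads // num_kv_heads = 2 query heads."""
    B, T, D = q.shape
    hd = D // num_heads
    group = num_heads // num_kv_heads
    q = q.reshape(B, T, num_heads, hd).transpose(0, 2, 1, 3)     # (B, H, T, hd)
    k = k.reshape(B, T, num_kv_heads, hd).transpose(0, 2, 1, 3)  # (B, Hkv, T, hd)
    v = v.reshape(B, T, num_kv_heads, hd).transpose(0, 2, 1, 3)
    k = np.repeat(k, group, axis=1)  # broadcast each KV head to its query group
    v = np.repeat(v, group, axis=1)
    att = q @ k.transpose(0, 1, 3, 2) / np.sqrt(hd)
    att = np.exp(att - att.max(-1, keepdims=True))
    att = att / att.sum(-1, keepdims=True)                       # softmax
    out = att @ v                                                # (B, H, T, hd)
    return out.transpose(0, 2, 1, 3).reshape(B, T, D)
```

The KV projections are half the width of the query projection (4 of 8 heads), which is where the KV-cache savings come from.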
RoPE
Partial rotary positional embeddings.
parameters: {"rotary_dims":16,"total_dims":64}
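Partial RoPE rotates only the first 16 of the 64 head dimensions and passes the rest through unchanged. A sketch using the standard rotate-half formulation (the PR does not state the frequency base; 10000 is the common default):

```python
import numpy as np

def partial_rope(x, positions, rotary_dims=16, base=10000.0):
    """Apply rotary embedding to the first rotary_dims of each head vector.

    x: (T, head_dim) per-head queries or keys; positions: (T,) token indices.
    The remaining head_dim - rotary_dims channels are left untouched.
    """
    rot, rest = x[..., :rotary_dims], x[..., rotary_dims:]
    half = rotary_dims // 2
    freqs = base ** (-np.arange(half) / half)        # (half,)
    angles = positions[:, None] * freqs[None, :]     # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = rot[..., :half], rot[..., half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, rest], axis=-1)
```

Rotation preserves the norm of the rotary slice, so only relative position enters the QK dot product.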
U-Net skip connections
Encoder-decoder style skip connections used in the recurrent architecture.
parameters: null
Parallel residuals
GPT-J style parallel attention and MLP residual pathway.
parameters: null
QK gain
Learnable query scaling per head.
parameters: {"gain":5.25}
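A sketch of the per-head query gain, assuming it is a trainable scalar multiplier applied to each query head (the PR reports the value 5.25 but not where in the attention pipeline it sits):

```python
import numpy as np

class QKGain:
    """Learnable per-head query scale, initialized at 5.25 per the PR."""
    def __init__(self, num_heads=8, init_gain=5.25):
        # One trainable scalar per head.
        self.gain = np.full(num_heads, init_gain)

    def __call__(self, q):
        # q: (batch, heads, seq, head_dim) -> scale each head's queries.
        return q * self.gain[None, :, None, None]
```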
Quantization
mixed int8/int6/int4
bits: null
scope: layer-wise
AWQ
bits: null
scope: weights
int8
bits: 8
scope: embeddings and attention
int6
bits: 6
scope: MLP FC1
int4
bits: 4
scope: MLP FC2 and residuals
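The layer-wise allocation above can be sketched with plain symmetric integer quantization; the PR additionally applies AWQ activation-aware scaling on top of this, which is omitted here:

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor int quantization: scale so max |w| maps to qmax."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    return q * scale

# Layer-wise bit allocation from the PR: 8-bit embeddings and attention,
# 6-bit MLP FC1, 4-bit MLP FC2 and residual projections.
BITS = {"embed": 8, "attn": 8, "mlp_fc1": 6, "mlp_fc2": 4, "resid": 4}
```

Spending bits where layers are most sensitive (embeddings, attention) while pushing the large MLP matrices to 6 and 4 bits is what keeps the artifact near 16 MB.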
Other
other
Hadamard rotation applied before quantization to reduce activation outliers and quantization noise.
parameters: {"matrix_size":"512x512"}
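A sketch of the rotation trick: build a normalized 512×512 Hadamard matrix via the Sylvester construction, fold H into the weights and Hᵀ into the activations. Because H is orthogonal, the layer output is unchanged while outlier mass is spread evenly across channels, shrinking quantization ranges:

```python
import numpy as np

def hadamard(n):
    """Sylvester construction; n must be a power of two.
    Normalized by 1/sqrt(n) so that H @ H.T = I."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

H = hadamard(512)
# Computation-invariance: (W H)(H^T x) == W x, so the rotation can be
# absorbed into the weights offline before quantization.
W = np.random.default_rng(0).normal(size=(8, 512))
x = np.random.default_rng(1).normal(size=512)
assert np.allclose(W @ x, (W @ H) @ (H.T @ x))
```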
other
Hessian-aware calibration using Fisher information diagonal to set quantization ranges.
parameters: {"calibration_batches":50}
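One plausible reading of this step, sketched below: estimate the diagonal Fisher information as the mean squared gradient over the 50 calibration batches, then choose the clipping threshold minimizing Fisher-weighted quantization error (the PR does not spell out the exact objective):

```python
import numpy as np

def fisher_weighted_clip(w, grads, bits=8, n_grid=100):
    """Pick a clipping threshold minimizing sum_i F_ii * (w_i - Q(w_i))^2,
    with F_ii approximated by the mean squared gradient over calibration
    batches (50 in the PR). grads: (n_batches, n_params)."""
    fisher = np.mean(grads ** 2, axis=0)  # diagonal Fisher estimate
    qmax = 2 ** (bits - 1) - 1
    best_clip, best_err = None, np.inf
    for frac in np.linspace(0.5, 1.0, n_grid):
        clip = frac * np.abs(w).max()
        scale = clip / qmax
        deq = np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
        err = np.sum(fisher * (w - deq) ** 2)
        if err < best_err:
            best_clip, best_err = clip, err
    return best_clip
```

Weighting by the Fisher diagonal lets the range-search sacrifice weights the loss is flat in, rather than treating all quantization error equally.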
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"variant":"MuonEq-R","row_normalized":true}
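"MuonEq-R" with row normalization appears to be a custom variant; the sketch below shows only the standard Muon core (momentum followed by Newton-Schulz orthogonalization of the update, coefficients from the public Muon reference implementation). The `lr`/`mu` values are illustrative defaults, not taken from the PR:

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Quintic Newton-Schulz iteration mapping G toward the nearest
    semi-orthogonal matrix, as in the reference Muon implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, buf, lr=0.02, mu=0.95):
    """One Muon update: momentum accumulation, then orthogonalize
    the update direction before applying it."""
    buf = mu * buf + grad
    W = W - lr * newton_schulz(buf)
    return W, buf
```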
Weight Averaging
EMA
parameters: {"decay":0.9965}
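The EMA with decay 0.9965 is the standard shadow-weights recipe, sketched here:

```python
import numpy as np

class EMA:
    """Exponential moving average of model weights, decay 0.9965 per the PR."""
    def __init__(self, params, decay=0.9965):
        self.decay = decay
        self.shadow = {k: v.copy() for k, v in params.items()}

    def update(self, params):
        d = self.decay
        for k, v in params.items():
            # shadow <- d * shadow + (1 - d) * current
            self.shadow[k] = d * self.shadow[k] + (1 - d) * v
```

At decay 0.9965 the effective averaging window is roughly 1/(1-0.9965) ≈ 286 steps.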
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"momentum":0.9,"epochs":3}
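A hypothetical sketch of the adaptation loop, assuming "score-first" means each evaluation chunk is scored with the current weights before the model trains on it (so later chunks benefit without leaking labels into their own scores); the PR's hyperparameters are lr=0.005, momentum=0.9, 3 epochs:

```python
import numpy as np

def ttt_sgd(params, grad_fn, data, lr=0.005, momentum=0.9, epochs=3):
    """Test-time training via SGD with momentum, applied in place.

    grad_fn(params, batch) -> dict of gradients matching params' keys.
    In a score-first protocol this runs only AFTER the chunk has been scored.
    """
    vel = {k: np.zeros_like(v) for k, v in params.items()}
    for _ in range(epochs):
        for batch in data:
            grads = grad_fn(params, batch)
            for k in params:
                vel[k] = momentum * vel[k] - lr * grads[k]
                params[k] = params[k] + vel[k]
    return params
```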
Evaluation
sliding window eval
parameters: {"chunk_size":32000}
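A sketch of the evaluation aggregation, assuming non-overlapping 32k-token chunks (the PR does not state whether its sliding windows overlap or how context carries across chunk boundaries). `nll_fn` is a hypothetical scoring hook returning the summed negative log-likelihood of a chunk in nats:

```python
import numpy as np

def chunked_bpb(token_ids, n_bytes, nll_fn, chunk_size=32000):
    """Score the sequence chunk by chunk and aggregate into bits per byte."""
    total_nll = 0.0
    for start in range(0, len(token_ids), chunk_size):
        total_nll += nll_fn(token_ids[start:start + chunk_size])
    return total_nll / np.log(2) / n_bytes  # nats -> bits, then per byte
```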
LR Schedule
warmdown
parameters: {"final_fraction":0.72}
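A sketch of the schedule, assuming `final_fraction` denotes the fraction of total steps spent in a linear warmdown to zero after a constant phase (the PR does not define the parameter precisely):

```python
def warmdown_lr(step, total_steps, peak_lr, final_fraction=0.72):
    """Constant LR, then linear decay to zero over the final 72% of training."""
    decay_start = int((1 - final_fraction) * total_steps)
    if step < decay_start:
        return peak_lr
    return peak_lr * (total_steps - step) / (total_steps - decay_start)
```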
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Novel Contributions
- Hadamard rotation for outlier removal before quantization
- AWQ-based mixed-precision quantization with layer-wise int8/int6/int4 allocation
- Hessian-aware calibration using Fisher information for quantization ranges
- 3-layer recurrence expanding 11 physical layers into 17 virtual layers
- Legal score-first test-time training under evaluation constraints
- GPT-J style parallel residual architecture with QK gain