PR #1732
openRecord: SP8192 + Hadamard Rotation + AWQ + Layer-wise Precision + Hessian-Aware Calibration + Legal TTT — val_bpb 1.0785
by Victory963
val_bpb: 1.0785
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.98 MB
Training Techniques
Architecture
depth recurrence
A 3-layer recurrent block expands 11 physical layers into 17 virtual layers, with encoder-decoder skip connections linking the passes.
parameters: {"physical_layers":11,"virtual_layers":17,"recurrence_layers":3}
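The virtual-layer count follows from looping the 3-layer block three times: 11 + 2×3 = 17. A minimal sketch of the virtual execution schedule; the placement of the recurrent block (`recur_start=4`) is a hypothetical choice, not stated in the PR:

```python
def virtual_schedule(n_physical=11, recur_start=4, recur_len=3, n_loops=3):
    """Expand physical layer indices into the virtual execution order.

    The recurrent block [recur_start, recur_start + recur_len) runs n_loops
    times, so 11 physical layers yield 11 + (n_loops - 1) * recur_len = 17
    virtual layer applications.
    """
    pre = list(range(recur_start))
    block = list(range(recur_start, recur_start + recur_len))
    post = list(range(recur_start + recur_len, n_physical))
    return pre + block * n_loops + post
```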
GQA
Grouped query attention with reduced KV heads.
parameters: {"num_heads":8,"num_kv_heads":4}
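With 8 query heads sharing 4 KV heads, each KV head serves a group of 2 query heads. A numpy sketch of the forward pass (causal masking omitted for brevity; head dimension inferred from the query width):

```python
import numpy as np

def gqa_attention(q, k, v, num_heads=8, num_kv_heads=4):
    """Grouped-query attention: each of the 4 KV heads is shared by
    num_heads // num_kv_heads = 2 query heads."""
    B, T, D = q.shape
    hd = D // num_heads
    group = num_heads // num_kv_heads
    q = q.reshape(B, T, num_heads, hd).transpose(0, 2, 1, 3)     # (B, H, T, hd)
    k = k.reshape(B, T, num_kv_heads, hd).transpose(0, 2, 1, 3)  # (B, Hkv, T, hd)
    v = v.reshape(B, T, num_kv_heads, hd).transpose(0, 2, 1, 3)
    k = np.repeat(k, group, axis=1)  # broadcast each KV head to its query group
    v = np.repeat(v, group, axis=1)
    att = q @ k.transpose(0, 1, 3, 2) / np.sqrt(hd)
    att = np.exp(att - att.max(-1, keepdims=True))
    att = att / att.sum(-1, keepdims=True)                       # softmax
    out = att @ v                                                # (B, H, T, hd)
    return out.transpose(0, 2, 1, 3).reshape(B, T, D)
```

The KV projections are half the width of the query projection (4 of 8 heads), which is where the KV-cache savings come from.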
RoPE
Partial rotary positional embeddings.
parameters: {"rotary_dims":16,"total_dims":64}
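Partial RoPE rotates only the first 16 of the 64 head dimensions and passes the rest through unchanged. A sketch using the standard rotate-half formulation (the PR does not state the frequency base; 10000 is the common default):

```python
import numpy as np

def partial_rope(x, positions, rotary_dims=16, base=10000.0):
    """Apply rotary embedding to the first rotary_dims of each head vector.

    x: (T, head_dim) per-head queries or keys; positions: (T,) token indices.
    The remaining head_dim - rotary_dims channels are left untouched.
    """
    rot, rest = x[..., :rotary_dims], x[..., rotary_dims:]
    half = rotary_dims // 2
    freqs = base ** (-np.arange(half) / half)        # (half,)
    angles = positions[:, None] * freqs[None, :]     # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = rot[..., :half], rot[..., half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, rest], axis=-1)
```

Rotation preserves the norm of the rotary slice, so only relative position enters the QK dot product.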
U-Net skip connections
Encoder-decoder style skip connections used in the recurrent architecture.
parameters: null
Parallel residuals
GPT-J style parallel attention and MLP residual pathway.
parameters: null
QK gain
Learnable query scaling per head.
parameters: {"gain":5.25}
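A sketch of the per-head query gain, assuming it is a trainable scalar multiplier applied to each query head (the PR reports the value 5.25 but not where in the attention pipeline it sits):

```python
import numpy as np

class QKGain:
    """Learnable per-head query scale, initialized at 5.25 per the PR."""
    def __init__(self, num_heads=8, init_gain=5.25):
        # One trainable scalar per head.
        self.gain = np.full(num_heads, init_gain)

    def __call__(self, q):
        # q: (batch, heads, seq, head_dim) -> scale each head's queries.
        return q * self.gain[None, :, None, None]
```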
Quantization
mixed int8/int6/int4
bits: null
scope: layer-wise
AWQ
bits: null
scope: weights
int8
bits: 8
scope: embeddings and attention
int6
bits: 6
scope: MLP FC1
int4
bits: 4
scope: MLP FC2 and residuals
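The layer-wise allocation above can be sketched with plain symmetric integer quantization; the PR additionally applies AWQ activation-aware scaling on top of this, which is omitted here:

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric per-tensor int quantization: scale so max |w| maps to qmax."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    return q * scale

# Layer-wise bit allocation from the PR: 8-bit embeddings and attention,
# 6-bit MLP FC1, 4-bit MLP FC2 and residual projections.
BITS = {"embed": 8, "attn": 8, "mlp_fc1": 6, "mlp_fc2": 4, "resid": 4}
```

Spending bits where layers are most sensitive (embeddings, attention) while pushing the large MLP matrices to 6 and 4 bits is what keeps the artifact near 16 MB.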
Other
other
Hadamard rotation applied before quantization to reduce activation outliers and quantization noise.
parameters: {"matrix_size":"512x512"}
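A sketch of the rotation trick: build a normalized 512×512 Hadamard matrix via the Sylvester construction, fold H into the weights and Hᵀ into the activations. Because H is orthogonal, the layer output is unchanged while outlier mass is spread evenly across channels, shrinking quantization ranges:

```python
import numpy as np

def hadamard(n):
    """Sylvester construction; n must be a power of two.
    Normalized by 1/sqrt(n) so that H @ H.T = I."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

H = hadamard(512)
# Computation-invariance: (W H)(H^T x) == W x, so the rotation can be
# absorbed into the weights offline before quantization.
W = np.random.default_rng(0).normal(size=(8, 512))
x = np.random.default_rng(1).normal(size=512)
assert np.allclose(W @ x, (W @ H) @ (H.T @ x))
```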
other
Hessian-aware calibration using Fisher information diagonal to set quantization ranges.
parameters: {"calibration_batches":50}
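One plausible reading of this step, sketched below: estimate the diagonal Fisher information as the mean squared gradient over the 50 calibration batches, then choose the clipping threshold minimizing Fisher-weighted quantization error (the PR does not spell out the exact objective):

```python
import numpy as np

def fisher_weighted_clip(w, grads, bits=8, n_grid=100):
    """Pick a clipping threshold minimizing sum_i F_ii * (w_i - Q(w_i))^2,
    with F_ii approximated by the mean squared gradient over calibration
    batches (50 in the PR). grads: (n_batches, n_params)."""
    fisher = np.mean(grads ** 2, axis=0)  # diagonal Fisher estimate
    qmax = 2 ** (bits - 1) - 1
    best_clip, best_err = None, np.inf
    for frac in np.linspace(0.5, 1.0, n_grid):
        clip = frac * np.abs(w).max()
        scale = clip / qmax
        deq = np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
        err = np.sum(fisher * (w - deq) ** 2)
        if err < best_err:
            best_clip, best_err = clip, err
    return best_clip
```

Weighting by the Fisher diagonal lets the range-search sacrifice weights the loss is flat in, rather than treating all quantization error equally.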
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"variant":"MuonEq-R","row_normalized":true}
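"MuonEq-R" with row normalization appears to be a custom variant; the sketch below shows only the standard Muon core (momentum followed by Newton-Schulz orthogonalization of the update, coefficients from the public Muon reference implementation). The `lr`/`mu` values are illustrative defaults, not taken from the PR:

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Quintic Newton-Schulz iteration mapping G toward the nearest
    semi-orthogonal matrix, as in the reference Muon implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(W, grad, buf, lr=0.02, mu=0.95):
    """One Muon update: momentum accumulation, then orthogonalize
    the update direction before applying it."""
    buf = mu * buf + grad
    W = W - lr * newton_schulz(buf)
    return W, buf
```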
Weight Averaging
EMA
parameters: {"decay":0.9965}
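The EMA with decay 0.9965 is the standard shadow-weights recipe, sketched here:

```python
import numpy as np

class EMA:
    """Exponential moving average of model weights, decay 0.9965 per the PR."""
    def __init__(self, params, decay=0.9965):
        self.decay = decay
        self.shadow = {k: v.copy() for k, v in params.items()}

    def update(self, params):
        d = self.decay
        for k, v in params.items():
            # shadow <- d * shadow + (1 - d) * current
            self.shadow[k] = d * self.shadow[k] + (1 - d) * v
```

At decay 0.9965 the effective averaging window is roughly 1/(1-0.9965) ≈ 286 steps.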
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.005,"momentum":0.9,"epochs":3}
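A hypothetical sketch of the adaptation loop, assuming "score-first" means each evaluation chunk is scored with the current weights before the model trains on it (so later chunks benefit without leaking labels into their own scores); the PR's hyperparameters are lr=0.005, momentum=0.9, 3 epochs:

```python
import numpy as np

def ttt_sgd(params, grad_fn, data, lr=0.005, momentum=0.9, epochs=3):
    """Test-time training via SGD with momentum, applied in place.

    grad_fn(params, batch) -> dict of gradients matching params' keys.
    In a score-first protocol this runs only AFTER the chunk has been scored.
    """
    vel = {k: np.zeros_like(v) for k, v in params.items()}
    for _ in range(epochs):
        for batch in data:
            grads = grad_fn(params, batch)
            for k in params:
                vel[k] = momentum * vel[k] - lr * grads[k]
                params[k] = params[k] + vel[k]
    return params
```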
Evaluation
sliding window eval
parameters: {"chunk_size":32000}
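A sketch of the evaluation aggregation, assuming non-overlapping 32k-token chunks (the PR does not state whether its sliding windows overlap or how context carries across chunk boundaries). `nll_fn` is a hypothetical scoring hook returning the summed negative log-likelihood of a chunk in nats:

```python
import numpy as np

def chunked_bpb(token_ids, n_bytes, nll_fn, chunk_size=32000):
    """Score the sequence chunk by chunk and aggregate into bits per byte."""
    total_nll = 0.0
    for start in range(0, len(token_ids), chunk_size):
        total_nll += nll_fn(token_ids[start:start + chunk_size])
    return total_nll / np.log(2) / n_bytes  # nats -> bits, then per byte
```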
LR Schedule
warmdown
parameters: {"final_fraction":0.72}
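A sketch of the schedule, assuming `final_fraction` denotes the fraction of total steps spent in a linear warmdown to zero after a constant phase (the PR does not define the parameter precisely):

```python
def warmdown_lr(step, total_steps, peak_lr, final_fraction=0.72):
    """Constant LR, then linear decay to zero over the final 72% of training."""
    decay_start = int((1 - final_fraction) * total_steps)
    if step < decay_start:
        return peak_lr
    return peak_lr * (total_steps - step) / (total_steps - decay_start)
```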
Sequence Length
sequence_length
train_length: 1024
eval_length: null
Novel Contributions
- Hadamard rotation for outlier removal before quantization
- AWQ-based mixed-precision quantization with layer-wise int8/int6/int4 allocation
- Hessian-aware calibration using Fisher information for quantization ranges
- 3-layer recurrence expanding 11 physical layers into 17 virtual layers
- Legal score-first test-time training under evaluation constraints
- GPT-J style parallel residual architecture with QK gain