PR #1123

open

Non-Record Submission: 1.1986 BPB — HybridQuantGPT v6.1 rANS + Legal TTT

by sisegod
val_bpb: 1.1986
Architecture: Hybrid
Optimizer: Muon
Artifact Size: 15,132,719 bytes

Training Techniques

Quantization
mixed int6/int5
bits: null
scope: Q/K, V/O, MLP, embeddings
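To illustrate what int6 versus int5 storage means for a single weight row, here is a minimal sketch of symmetric quantization. The function names, the per-row scale, and the symmetric clipping range are assumptions; the card does not say which components get which bit width or how scales are chosen.

```python
def quantize_symmetric(weights, bits):
    """Quantize floats to signed integer codes of the given bit width (assumed symmetric scheme)."""
    qmax = 2 ** (bits - 1) - 1  # 31 for int6, 15 for int5
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


row = [0.31, -0.02, 0.155, -0.31]  # illustrative values, not from the submission
codes6, scale6 = quantize_symmetric(row, bits=6)
codes5, scale5 = quantize_symmetric(row, bits=5)
```

The int5 codes use a coarser grid than the int6 codes, which is the trade-off a mixed scheme makes per component.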
Architecture
U-Net skip connections
Encoder-decoder style skip connections with learned skip weights.
parameters: {"layers":11}
XSA
Cross-Self Attention variant that removes self-value projection from attention output.
parameters: {"layers":11}
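One possible reading of "removes self-value projection" is that each token's attention output has its component along that token's own value vector subtracted. The sketch below assumes that reading; the function name and the per-token vector projection are illustrative, not taken from the submission.

```python
def remove_self_value(attn_out, own_value):
    """Subtract from attn_out its projection onto the token's own value vector."""
    norm_sq = sum(v * v for v in own_value)
    if norm_sq == 0.0:
        return list(attn_out)  # nothing to project onto
    coeff = sum(a * v for a, v in zip(attn_out, own_value)) / norm_sq
    return [a - coeff * v for a, v in zip(attn_out, own_value)]


result = remove_self_value([3.0, 4.0], [1.0, 0.0])
```

After the subtraction, the output is orthogonal to the token's own value, so only information gathered from other positions remains.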
Value Residual
Propagates first-layer value information to later layers via a learned lambda.
parameters: null
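A minimal sketch of the value-residual idea, assuming the mix is a convex combination controlled by a single learned scalar. The exact mixing form and the name `lam` are assumptions.

```python
def value_residual(layer_value, first_layer_value, lam):
    """Blend a later layer's value vector with the first layer's value vector."""
    return [(1.0 - lam) * v + lam * v1 for v, v1 in zip(layer_value, first_layer_value)]


mixed_v = value_residual([2.0, 0.0], [0.0, 4.0], lam=0.25)
```

With `lam=0` the layer uses only its own values; larger `lam` passes more first-layer information through.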
SmearGate
Blends each token with the previous token using a learned gate.
parameters: null
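The blend described for SmearGate can be sketched as below, assuming a per-dimension sigmoid gate and a first token that passes through unchanged. Both choices are assumptions, since the card lists no parameters.

```python
import math


def smear_gate(tokens, gate_logits):
    """tokens: list of vectors; gate_logits: one learned logit per dimension."""
    gates = [1.0 / (1.0 + math.exp(-g)) for g in gate_logits]
    out = [list(tokens[0])]  # the first token has no predecessor, so it is kept as-is
    for prev, cur in zip(tokens, tokens[1:]):
        out.append([(1.0 - g) * c + g * p for g, c, p in zip(gates, cur, prev)])
    return out


mixed = smear_gate([[1.0, 0.0], [0.0, 2.0]], gate_logits=[0.0, 0.0])
```

A logit of 0 gives a gate of 0.5, so the second token becomes an equal mix of itself and its predecessor.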
BigramHash
Hash-based bigram embedding.
parameters: {"vocab":2048,"dim":128}
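A hash-based bigram embedding maps each (previous, current) token pair to one of a fixed number of rows in a small table. The sketch below uses the 2048 buckets and 128 dimensions listed above; the hash formula and the placeholder bucket for the first position are illustrative assumptions.

```python
BIGRAM_VOCAB = 2048  # number of hash buckets, from the parameters above
BIGRAM_DIM = 128     # width of each bucket's embedding, from the parameters above


def bigram_bucket(prev_token, cur_token, num_buckets=BIGRAM_VOCAB):
    """Hash a token pair to a bucket; the multiplier is an arbitrary odd constant."""
    return (prev_token * 1000003 + cur_token) % num_buckets


def bigram_ids(token_ids, num_buckets=BIGRAM_VOCAB):
    """Return one bucket index per position, ready to index a table of shape (2048, 128)."""
    ids = [0]  # assumption: the first position has no predecessor and uses bucket 0
    for prev, cur in zip(token_ids, token_ids[1:]):
        ids.append(bigram_bucket(prev, cur, num_buckets))
    return ids


ids = bigram_ids([17, 4052, 9, 17, 4052])
```

Distinct bigrams can collide in the same bucket; that is the price of keeping the table at 2048 rows rather than one row per token pair.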
VE128
Token identity re-injection at later layers.
parameters: {"layers":[9,10]}
Partial RoPE
Applies rotary position embeddings to only part of the head dimensions.
parameters: {"dimensions":16}
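A minimal sketch of partial RoPE with 16 rotated dimensions. The head size of 64, the base of 10000, and the adjacent-pair layout are assumptions; only the count of rotated dimensions comes from the card.

```python
import math


def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries by position-dependent angles; pass the rest through."""
    out = list(vec)
    for i in range(rotary_dims // 2):
        theta = position * base ** (-2.0 * i / rotary_dims)
        a, b = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = a * math.cos(theta) - b * math.sin(theta)
        out[2 * i + 1] = a * math.sin(theta) + b * math.cos(theta)
    return out


head = [1.0] * 64  # assumed head dimension
rotated = partial_rope(head, position=3)
```

The untouched dimensions carry position-independent content, while the rotated ones encode relative position.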
LeakyReLU
Uses LeakyReLU squared as the MLP activation.
parameters: {"negative_slope":0.5}
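As a scalar sketch, LeakyReLU squared with a negative slope of 0.5 looks like this, assuming a plain square with no sign preservation (the card does not say which variant is used).

```python
def leaky_relu_squared(x, negative_slope=0.5):
    """Apply LeakyReLU, then square the result."""
    y = x if x >= 0 else negative_slope * x
    return y * y


positive = leaky_relu_squared(2.0)
negative = leaky_relu_squared(-2.0)
```

Under this assumption, negative inputs still produce a positive activation, scaled by the square of the slope.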
weight tying
Tied embeddings.
parameters: null
Regularization
LN scale
parameters: {"formula":"1/sqrt(layer+1)"}
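Evaluating the formula for an 11-layer model gives the per-layer scales below, assuming layers are indexed from 0.

```python
import math

NUM_LAYERS = 11  # layer count taken from the architecture entries above

# Scale applied at each layer, assuming zero-based layer indices.
ln_scales = [1.0 / math.sqrt(layer + 1) for layer in range(NUM_LAYERS)]
```

The scale falls from 1.0 at the first layer to about 0.30 at the last, so deeper layers are damped more.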
logit softcap
parameters: {"value":15}
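A common way to implement a logit softcap is a scaled tanh, which is assumed in this sketch; the card gives only the cap value of 15.

```python
import math


def softcap(logit, cap=15.0):
    """Squash a logit smoothly into the interval [-cap, cap] (assumed tanh form)."""
    return cap * math.tanh(logit / cap)


small = softcap(1.0)
large = softcap(1000.0)
```

Small logits pass through almost unchanged, while extreme logits saturate at the cap.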
Optimizer
Muon
weight_decay: null
momentum: 0.95
other_params: {"adamw_for":["embeddings","scalars"],"muon_warmup_momentum":0.85}
Weight Averaging
SWA
parameters: {"snapshots":7,"start_step":9700,"end_step":10000}
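Seven snapshots between steps 9700 and 10000 are consistent with one snapshot every 50 steps, endpoints included. The sketch below assumes that even spacing and a plain arithmetic mean; neither is stated in the card.

```python
def swa_snapshot_steps(start_step=9700, end_step=10000, snapshots=7):
    """Return evenly spaced snapshot steps, including both endpoints (assumed spacing)."""
    interval = (end_step - start_step) // (snapshots - 1)
    return [start_step + i * interval for i in range(snapshots)]


def average_snapshots(snapshot_weights):
    """Average corresponding weights across snapshots."""
    n = len(snapshot_weights)
    return [sum(ws) / n for ws in zip(*snapshot_weights)]


steps = swa_snapshot_steps()
averaged = average_snapshots([[1.0, 4.0], [3.0, 8.0]])  # toy weights for illustration
```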
EMA
parameters: {"type":"HMA","decay":0.997}
Evaluation
sliding window eval
parameters: {"stride":64}
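With a stride of 64, each window after the first scores only its newest 64 tokens and uses the earlier tokens as context. The sketch below assumes a window of 1024 tokens (the training length, since no eval length is given) and a hypothetical `sliding_windows` helper.

```python
def sliding_windows(num_tokens, window=1024, stride=64):
    """Return (window_start, window_end, score_start) triples.

    Tokens in [score_start, window_end) are scored; earlier tokens are context only.
    """
    spans = []
    scored_upto = 0
    end = min(window, num_tokens)
    while True:
        start = max(0, end - window)
        spans.append((start, end, scored_upto))
        scored_upto = end
        if end >= num_tokens:
            break
        end = min(end + stride, num_tokens)
    return spans


spans = sliding_windows(num_tokens=1200)
```

Every token is scored exactly once, and all but the earliest tokens are scored with close to a full window of context.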
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"freeze_blocks":2}
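The score-first ordering can be sketched as a loop that evaluates each chunk before adapting on it. `score_fn` and `train_fn` are hypothetical callables standing in for the real model; the learning rate and the frozen blocks are not modeled here.

```python
def score_first_ttt(tokens, score_fn, train_fn, chunk_tokens=32768, epochs=3):
    """Score each chunk with weights that have not seen it, then adapt on that chunk."""
    total_loss = 0.0
    for start in range(0, len(tokens), chunk_tokens):
        chunk = tokens[start:start + chunk_tokens]
        total_loss += score_fn(chunk)  # 1) the score is recorded first
        for _ in range(epochs):        # 2) only then train on the scored chunk
            train_fn(chunk)
    return total_loss


events = []


def fake_score(chunk):
    events.append(("score", chunk[0]))
    return float(len(chunk))


def fake_train(chunk):
    events.append(("train", chunk[0]))


loss = score_first_ttt(list(range(10)), fake_score, fake_train, chunk_tokens=4, epochs=3)
```

Because the update for a chunk happens after its score is recorded, no token's reported loss benefits from training on that token.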
Sequence Length
sequence_length
train_length: 1024
eval_length: null
LR Schedule
warmdown
parameters: {"ratio":0.175}
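A warmdown ratio of 0.175 can be read as holding the learning rate constant and then decaying it over the final 17.5% of training. The sketch assumes 10,000 total steps (the last SWA step listed above) and a linear decay to zero; both are assumptions.

```python
def warmdown_multiplier(step, total_steps=10000, warmdown_ratio=0.175):
    """Return the learning-rate multiplier at a given step (assumed linear warmdown)."""
    warmdown_steps = round(total_steps * warmdown_ratio)
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)


midpoint = warmdown_multiplier(9125)
```

Under these assumptions the decay starts at step 8250 and reaches half the peak rate at step 9125.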
Compression
custom
level: null

Novel Contributions

  • Mixed-precision quantization across different model components
  • rANS entropy coding to fit the model under the 16MB artifact limit
  • Legal score-first test-time training on already-evaluated tokens
  • HybridQuantGPT v6.1 architecture with U-Net skips, XSA, Value Residual, SmearGate, and BigramHash
  • Muon optimization with SWA/HMA weight averaging on a single RTX 3090
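
To show how rANS entropy coding shrinks a stream of quantized weights, here is a minimal sketch that keeps the whole coder state in one Python integer. It omits the renormalization step that practical rANS coders use to stream bytes from a fixed-width state, and the frequency table and symbols are illustrative, not taken from the submission.

```python
def build_tables(freqs):
    """freqs: dict of symbol -> positive integer count. Returns cumulative starts and the total."""
    starts, total = {}, 0
    for sym in sorted(freqs):
        starts[sym] = total
        total += freqs[sym]
    return starts, total


def rans_encode(symbols, freqs):
    """Fold the symbols into a single integer state."""
    starts, total = build_tables(freqs)
    state = 1
    for sym in reversed(symbols):  # encode in reverse so that decoding runs forward
        f, c = freqs[sym], starts[sym]
        state = (state // f) * total + c + (state % f)
    return state


def rans_decode(state, count, freqs):
    """Recover `count` symbols from the integer state."""
    starts, total = build_tables(freqs)
    out = []
    for _ in range(count):
        slot = state % total
        sym = next(s for s in freqs if starts[s] <= slot < starts[s] + freqs[s])
        state = freqs[sym] * (state // total) + slot - starts[sym]
        out.append(sym)
    return out


freqs = {0: 12, 1: 3, -1: 1}  # skewed toward zero (illustrative counts)
message = [0, 0, 1, 0, -1, 0, 0, 0]
state = rans_encode(message, freqs)
decoded = rans_decode(state, len(message), freqs)
```

Frequent symbols grow the state by less than rare ones, so these eight symbols fit in a 10-bit state instead of the 16 bits a fixed 2-bit code would need.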