PR #1123

open

Non-Record Submission: 1.1986 BPB — HybridQuantGPT v6.1 rANS + Legal TTT

by sisegod

val_bpb

1.1986

Architecture

Hybrid

Optimizer

Muon

Artifact Size

15,132,719 bytes

Training Techniques

Quantization

mixed int6/int5

bits: null

scope: Q/K, V/O, MLP, embeddings
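
For readers unfamiliar with mixed-width quantization, here is a minimal sketch of symmetric per-row quantization at 6 and 5 bits. The per-row scale, the function names, and the sample weights are assumptions for illustration; this listing does not say which component receives which bit width or how scales are stored.

```python
def quantize_row(row, bits):
    """Symmetric per-row quantization to signed `bits`-bit codes (illustrative)."""
    qmax = 2 ** (bits - 1) - 1              # 31 for int6, 15 for int5
    max_abs = max(abs(w) for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return codes, scale


def dequantize_row(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


row = [0.31, -0.12, 0.05, -0.31]            # hypothetical weights
codes6, scale6 = quantize_row(row, bits=6)
codes5, scale5 = quantize_row(row, bits=5)
```

The narrower width halves the number of levels, so its reconstruction error bound (half a step) is roughly twice as large for the same row.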

Architecture

U-Net skip connections

Encoder-decoder style skip connections with learned skip weights.

parameters: {"layers":11}
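
One possible wiring for these skips, sketched under assumptions: the first half of the 11 layers acts as the encoder, the rest as the decoder, and each decoder layer adds a weighted copy of the mirrored encoder output before running. The split point, the pairing, and the names `unet_skip_pairs` and `forward` are illustrative, not taken from the submission.

```python
def unet_skip_pairs(num_layers):
    """Pair encoder layers with decoder layers, innermost first (assumed layout)."""
    num_encoder = num_layers // 2
    num_decoder = num_layers - num_encoder
    pairs = []
    for i in range(min(num_encoder, num_decoder)):
        pairs.append((num_encoder - 1 - i, num_encoder + i))
    return pairs


def forward(x, layers, skip_weights):
    """Run layers in order, adding learned-weight skips into decoder inputs."""
    source_of = {dec: enc for enc, dec in unet_skip_pairs(len(layers))}
    saved = {}
    for i, layer in enumerate(layers):
        if i in source_of:
            x = x + skip_weights[i] * saved[source_of[i]]
        x = layer(x)
        saved[i] = x
    return x


pairs = unet_skip_pairs(11)
```

With 11 layers this layout yields five skips, and the last decoder layer receives none.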

XSA

Cross-Self Attention variant that removes self-value projection from attention output.

parameters: {"layers":11}
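
One reading of "removes self-value projection" is that each token's attention output has its component along that token's own value vector subtracted. The sketch below implements that reading for a single token; both the interpretation and the name `remove_self_value` are assumptions.

```python
def remove_self_value(y, v, eps=1e-8):
    """Subtract from attention output y its projection onto the token's own value v."""
    dot_yv = sum(a * b for a, b in zip(y, v))
    dot_vv = sum(b * b for b in v)
    coeff = dot_yv / (dot_vv + eps)
    return [a - coeff * b for a, b in zip(y, v)]


y = [1.0, 2.0, 3.0]     # hypothetical attention output for one token
v = [0.0, 1.0, 0.0]     # that token's own value vector
out = remove_self_value(y, v)
```

After the subtraction the output is (numerically) orthogonal to the token's own value, so it carries only information gathered from other positions.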

Value Residual

Propagates first-layer value information to later layers via a learned lambda.

parameters: null
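
A minimal sketch of the mixing step, assuming a convex blend between the current layer's values and the first layer's values. The blend form and the name `mix_values` are assumptions; the listing only states that a learned lambda is involved.

```python
def mix_values(v_layer, v_first, lam):
    """Blend this layer's value vector with the first layer's value vector."""
    return [(1.0 - lam) * a + lam * b for a, b in zip(v_layer, v_first)]


mixed = mix_values([4.0, 0.0], [0.0, 8.0], lam=0.25)
```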

SmearGate

Blends each token with the previous token using a learned gate.

parameters: null
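
The blend can be sketched as follows, assuming a single scalar gate passed through a sigmoid. The actual gate may be per-dimension, and `smear_gate` is a hypothetical name.

```python
import math


def smear_gate(tokens, gate_logit):
    """Mix each token embedding with its predecessor using gate g = sigmoid(gate_logit)."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    out = [list(tokens[0])]                 # first token has no predecessor
    for prev, cur in zip(tokens, tokens[1:]):
        out.append([(1.0 - g) * c + g * p for c, p in zip(cur, prev)])
    return out


smeared = smear_gate([[1.0], [3.0], [5.0]], gate_logit=0.0)
```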

BigramHash

Hash-based bigram embedding.

parameters: {"vocab":2048,"dim":128}
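
A sketch of a hashed bigram lookup with the listed sizes (2048 buckets, 128 dimensions). The hash function, the padding id for the first position, and the random table are assumptions for illustration only.

```python
import random

NUM_BUCKETS = 2048
EMBED_DIM = 128


def bigram_bucket(prev_id, cur_id):
    """Hash a (previous, current) token pair into a bucket (hypothetical hash)."""
    return (prev_id * 1000003 + cur_id) % NUM_BUCKETS


rng = random.Random(0)
table = [[rng.gauss(0.0, 0.02) for _ in range(EMBED_DIM)]
         for _ in range(NUM_BUCKETS)]


def bigram_embeddings(token_ids, pad_id=0):
    """Return one hashed bigram embedding per position."""
    out, prev = [], pad_id
    for cur in token_ids:
        out.append(table[bigram_bucket(prev, cur)])
        prev = cur
    return out


embs = bigram_embeddings([17, 42, 17, 42])
```

Distinct bigrams can collide in the same bucket; that is the trade made for a table far smaller than vocabulary squared.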

VE128

Token identity re-injection at later layers.

parameters: {"layers":[9,10]}

Partial RoPE

Applies rotary position embeddings to only part of the head dimensions.

parameters: {"dimensions":16}
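
A sketch of rotating only the first 16 dimensions of one head vector and passing the rest through unchanged. The head size of 64, the adjacent-pair layout, and the base of 10000 are assumptions; only the count of 16 rotary dimensions comes from the listing.

```python
import math


def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Apply rotary embedding to vec[:rotary_dims]; leave vec[rotary_dims:] as is."""
    out = list(vec)
    for i in range(0, rotary_dims, 2):
        theta = position * base ** (-i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out[i] = x0 * c - x1 * s
        out[i + 1] = x0 * s + x1 * c
    return out


head = [float(i + 1) for i in range(64)]    # hypothetical head vector
rotated = partial_rope(head, position=3)
```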

LeakyReLU

Uses LeakyReLU squared as the MLP activation.

parameters: {"negative_slope":0.5}
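
Read literally, the activation is LeakyReLU with negative slope 0.5 followed by squaring. The sketch assumes a plain square (so negative inputs give positive outputs); whether the submission preserves sign is not stated here.

```python
def leaky_relu_squared(x, negative_slope=0.5):
    """LeakyReLU followed by squaring, applied to one scalar."""
    y = x if x >= 0 else negative_slope * x
    return y * y


values = [leaky_relu_squared(x) for x in (-2.0, 0.0, 2.0)]
```

Under this reading the negative side is scaled by 0.25 relative to the positive side, since (0.5x)^2 = 0.25x^2.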

weight tying

Tied embeddings.

parameters: null

Regularization

LN scale

parameters: {"formula":"1/sqrt(layer+1)"}
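
As a worked example of the formula, assuming `layer` is zero-indexed, the scale shrinks from 1.0 at the first layer to about 0.30 at the eleventh:

```python
import math


def ln_scale(layer):
    """Per-layer scale 1/sqrt(layer+1), with layer counted from 0 (assumed)."""
    return 1.0 / math.sqrt(layer + 1)


scales = [ln_scale(i) for i in range(11)]
```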

logit softcap

parameters: {"value":15}
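
Logit softcapping is commonly written as cap * tanh(logit / cap); the sketch assumes that form with the listed cap of 15.

```python
import math


def softcap(logit, cap=15.0):
    """Squash a logit smoothly into the range [-cap, cap]."""
    return cap * math.tanh(logit / cap)


capped = [softcap(x) for x in (-1000.0, 0.0, 1.0, 1000.0)]
```

Small logits pass through nearly unchanged, while large ones saturate at the cap.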

Optimizer

Muon

weight_decay: null

momentum: 0.95

other_params: {"adamw_for":["embeddings","scalars"],"muon_warmup_momentum":0.85}

Weight Averaging

SWA

parameters: {"snapshots":7,"start_step":9700,"end_step":10000}
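
If the 7 snapshots are evenly spaced with both endpoints included (an assumption), they fall every 50 steps. The sketch also shows a uniform average over flat parameter lists; the function names are hypothetical.

```python
def swa_steps(start_step=9700, end_step=10000, snapshots=7):
    """Evenly spaced snapshot steps, endpoints included (assumed spacing)."""
    interval = (end_step - start_step) // (snapshots - 1)
    return [start_step + i * interval for i in range(snapshots)]


def average_snapshots(snaps):
    """Uniformly average several flat parameter lists element by element."""
    n = len(snaps)
    return [sum(vals) / n for vals in zip(*snaps)]


steps = swa_steps()
avg = average_snapshots([[1.0, 2.0], [3.0, 4.0]])
```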

EMA

parameters: {"type":"HMA","decay":0.997}

Evaluation

sliding window eval

parameters: {"stride":64}
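
A sketch of how stride-64 windows can cover a token stream so that each token is scored once with as much left context as the window allows. The window length of 1024 is an assumption (the listing gives only the training length), and `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(num_tokens, window=1024, stride=64):
    """Return (start, end, score_from) spans; tokens in [score_from, end) are scored."""
    spans = []
    scored_up_to = 0
    start = 0
    while scored_up_to < num_tokens:
        end = min(start + window, num_tokens)
        spans.append((start, end, scored_up_to))
        scored_up_to = end
        start += stride
    return spans


spans = sliding_windows(2048)
```

The first window scores all of its tokens; every later window scores only its newest 64, each with up to 960 tokens of preceding context.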

Test-Time Training

score-first TTT

parameters: {"learning_rate":0.002,"epochs":3,"chunk_tokens":32768,"freeze_blocks":2}
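
The ordering that makes score-first TTT "legal" can be sketched as a loop: each chunk is scored with weights that have not yet trained on it, and only afterwards used for adaptation. The callables `score_fn` and `train_fn` are placeholders, and block freezing and the learning rate are not modeled here.

```python
def score_first_ttt(tokens, score_fn, train_fn, chunk_tokens=32768, epochs=3):
    """Score each chunk before training on it; return mean loss per token."""
    total_loss, total_count = 0.0, 0
    for start in range(0, len(tokens), chunk_tokens):
        chunk = tokens[start:start + chunk_tokens]
        total_loss += score_fn(chunk)       # 1. score with not-yet-adapted weights
        total_count += len(chunk)
        for _ in range(epochs):             # 2. then adapt on the scored chunk
            train_fn(chunk)
    return total_loss / total_count


events = []


def toy_score(chunk):
    events.append(("score", chunk[0]))
    return float(len(chunk))                # pretend each token costs 1.0


def toy_train(chunk):
    events.append(("train", chunk[0]))


avg_loss = score_first_ttt(list(range(10)), toy_score, toy_train,
                           chunk_tokens=4, epochs=3)
```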

Sequence Length

sequence_length

train_length: 1024

eval_length: null

LR Schedule

warmdown

parameters: {"ratio":0.175}
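
A sketch of a warmdown schedule with ratio 0.175, assuming a constant learning rate followed by a linear decay to zero over the final 17.5% of steps. The total of 10000 steps is inferred from the SWA end step and is an assumption, as is the linear shape.

```python
def lr_multiplier(step, total_steps=10000, warmdown_ratio=0.175):
    """Constant 1.0, then linear decay to 0 over the last warmdown_ratio of training."""
    warmdown_steps = round(total_steps * warmdown_ratio)
    warmdown_start = total_steps - warmdown_steps
    if step < warmdown_start:
        return 1.0
    return (total_steps - step) / warmdown_steps


schedule = [lr_multiplier(s) for s in (0, 8250, 9125, 10000)]
```

Under these assumptions the decay starts at step 8250.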

Compression

custom

level: null

Novel Contributions

Mixed-precision quantization across different model components
rANS entropy coding to fit the model under the 16MB artifact limit
Legal score-first test-time training on already-evaluated tokens
HybridQuantGPT v6.1 architecture with U-Net skips, XSA, Value Residual, SmearGate, and BigramHash
Muon optimization with SWA/HMA weight averaging on a single RTX 3090
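
To illustrate the rANS idea named in the list above, here is a minimal round-trip sketch over quantized integer codes. It keeps the whole state in one Python big integer and skips renormalization and byte streaming, so it shows the coding rule only and is not the submission's coder. The function names and sample codes are hypothetical.

```python
from collections import Counter


def build_tables(freqs):
    """Cumulative start of each symbol's slot range, plus the total frequency."""
    cum, total = {}, 0
    for sym in sorted(freqs):
        cum[sym] = total
        total += freqs[sym]
    return cum, total


def rans_encode(symbols, freqs):
    """Fold symbols into one integer state; frequent symbols grow it less."""
    cum, total = build_tables(freqs)
    state = 1
    for sym in reversed(symbols):           # encode in reverse so decoding runs forward
        f = freqs[sym]
        state = (state // f) * total + cum[sym] + (state % f)
    return state


def rans_decode(state, count, freqs):
    """Recover `count` symbols from the state produced by rans_encode."""
    cum, total = build_tables(freqs)
    out = []
    for _ in range(count):
        slot = state % total
        sym = next(s for s in freqs if cum[s] <= slot < cum[s] + freqs[s])
        state = freqs[sym] * (state // total) + slot - cum[sym]
        out.append(sym)
    return out


codes = [0, 0, 1, 0, -1, 0, 0, 2]           # hypothetical quantized weights
freqs = dict(Counter(codes))
state = rans_encode(codes, freqs)
decoded = rans_decode(state, len(codes), freqs)
```

Because low-bit quantized weights cluster around zero, an entropy coder of this kind can store them in fewer bits than a fixed-width packing would.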