val_bpb: 1.0996
Architecture: Hybrid
Optimizer: Muon
Artifact Size: 14.03 MB
Training Techniques
Architecture
Gated DeltaNet
Replaces most attention layers with recurrent GDN layers for long-range associative memory.
parameters: {"layers":10}
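The GDN recurrence maintains a fast-weight state matrix updated by a gated delta rule. Below is a minimal one-step sketch with a scalar gate `alpha` and write strength `beta`; the model's actual GDN layers use learned, per-head parameterizations, so the exact form here is an assumption.

```python
def gated_deltanet_step(S, q, k, v, alpha, beta):
    """One gated delta-rule step over a d x d fast-weight state S.

    S_t = alpha * (S_{t-1} - beta * (S_{t-1} k) k^T) + beta * v k^T
    o_t = S_t q
    (scalar gate/write strength; a sketch, not the model's exact layer)
    """
    d = len(q)
    Sk = [sum(S[i][j] * k[j] for j in range(d)) for i in range(d)]  # S k
    S_new = [[alpha * (S[i][j] - beta * Sk[i] * k[j]) + beta * v[i] * k[j]
              for j in range(d)] for i in range(d)]
    o = [sum(S_new[i][j] * q[j] for j in range(d)) for i in range(d)]
    return S_new, o
```

Writing value `v` at a unit key `k` and then querying with `q = k` retrieves `v`, which is the associative-memory behavior the description refers to.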
SWA
Uses two sliding-window attention layers that share a single set of weights.
parameters: {"layers":2,"shared_weights":true,"window":512}
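Sliding-window attention restricts each position to the last `window` keys (512 here). A minimal reference implementation of the masking-and-softmax pattern, with weight sharing meaning both layers would call this with the same projection weights:

```python
import math

def sliding_window_attention(q, k, v, window):
    """Causal attention where position i attends only to keys in
    [i - window + 1, i]. q, k, v: lists of per-position vectors."""
    d = len(q[0])
    out = []
    for i in range(len(q)):
        lo = max(0, i - window + 1)
        scores = [sum(q[i][t] * k[j][t] for t in range(d)) / math.sqrt(d)
                  for j in range(lo, i + 1)]
        m = max(scores)                      # subtract max for stability
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(w[j - lo] * v[j][t] for j in range(lo, i + 1)) / z
                    for t in range(d)])
    return out
```

With `window=1` each position can only attend to itself, so the output equals `v`, a quick sanity check on the masking.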
weight tying
The input embedding and the LM head share one weight matrix.
parameters: null
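Weight tying uses one `vocab x d` matrix both as the embedding table (row lookup) and as the LM head (dot the hidden state against every row), halving those parameters. A minimal sketch:

```python
def embed(W, token_id):
    """Token embedding: row lookup in the shared matrix W (vocab x d)."""
    return W[token_id]

def lm_head_logits(W, h):
    """Output logits: dot hidden state h against every row of the SAME W."""
    return [sum(wi * hi for wi, hi in zip(row, h)) for row in W]
```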
BigramHash
Hash-based bigram embedding for local n-gram statistics.
parameters: {"buckets":3072}
TrigramHash
Hash-based trigram embedding for additional local n-gram features.
parameters: null
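Both hash embeddings follow the same pattern: hash the n-gram ending at the current position into a fixed number of buckets and look up a learned row there. The hash function below (CRC32 over a joined key) is an assumption; the card specifies only 3072 buckets for the bigram table and no bucket count for the trigram table.

```python
import zlib

def ngram_bucket(tokens, i, n, buckets):
    """Map the n-gram ending at position i to a bucket index in [0, buckets).

    The embedding added to the token representation would be a learned
    row table[bucket]; hash choice here is illustrative, not the model's.
    """
    gram = tokens[max(0, i - n + 1): i + 1]
    key = ",".join(map(str, gram)).encode()
    return zlib.crc32(key) % buckets
```

The same function serves bigrams (`n=2`) and trigrams (`n=3`); collisions are accepted in exchange for a small, fixed-size table.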
SmearGate
Learned smoothing gate applied over embeddings before recurrent layers.
parameters: null
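One plausible form of the smear gate blends each embedding with its predecessor through a learned sigmoid gate. The exact smearing formula is an assumption; the card says only "learned smoothing gate applied over embeddings".

```python
import math

def smear(xs, gate_logit):
    """x'_t = x_t + sigmoid(g) * x_{t-1}, applied over a list of embedding
    vectors. gate_logit g is a learned scalar in this sketch (it could be
    per-channel in the real model)."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    out = [list(xs[0])]
    for t in range(1, len(xs)):
        out.append([xs[t][c] + g * xs[t - 1][c] for c in range(len(xs[t]))])
    return out
```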
GQA
Grouped query attention used in the sliding window attention blocks.
parameters: {"heads":8,"kv_heads":4}
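With 8 query heads and 4 KV heads, consecutive pairs of query heads share one KV head, halving the KV projection and cache size. The grouping rule is just integer division:

```python
def kv_head_for(query_head, n_heads=8, n_kv_heads=4):
    """Map a query head to its shared KV head: groups of n_heads // n_kv_heads
    consecutive query heads attend against the same K/V projections."""
    return query_head // (n_heads // n_kv_heads)
```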
logit softcap
Caps logits with tanh-based soft clipping to stabilize training.
parameters: {"value":30}
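The softcap squashes logits smoothly into (-30, 30) while staying near-identity for small values:

```python
import math

def softcap(logit, cap=30.0):
    """Soft-clip a logit into (-cap, cap): cap * tanh(logit / cap).
    Near-identity for |logit| << cap, saturating at +/-cap."""
    return cap * math.tanh(logit / cap)
```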
Optimizer
Muon
weight_decay: null
momentum: 0.97
other_params: {"warmup_momentum_start":0.92}
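The momentum warmup ramps from 0.92 at step 0 to the final 0.97. The card gives only the endpoints; the linear schedule shape and warmup length below are assumptions.

```python
def muon_momentum(step, warmup_steps, start=0.92, end=0.97):
    """Ramp Muon's momentum linearly from `start` to `end` over
    `warmup_steps`, then hold at `end`. Schedule shape is an assumption."""
    frac = min(1.0, step / warmup_steps)
    return start + frac * (end - start)
```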
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"scalar and embedding parameters"}
Weight Averaging
EMA
parameters: null
Evaluation
sliding window eval
parameters: {"stride":32}
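With stride 32, each forward pass scores only the last 32 tokens of its window, so every token is scored exactly once with as much left context as fits. A sketch of the span scheduling (the context length here is illustrative; the card specifies only the stride):

```python
def eval_windows(seq_len, context, stride):
    """Return (lo, hi, score_from) spans for strided sliding-window eval.

    Each window covers tokens [lo, hi); only tokens from score_from onward
    are newly scored, with up to `context` tokens of left context."""
    spans = []
    pos = 0
    while pos < seq_len:
        hi = min(pos + stride, seq_len)
        lo = max(0, hi - context)
        spans.append((lo, hi, pos))
        pos = hi
    return spans
```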
Test-Time Training
score-first TTT
parameters: {"chunk_size":32768,"optimizer":"AdamW"}
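The defining property of score-first TTT is the loop order: each 32768-token chunk is scored with the weights as they stood before the model saw it, and only then are the weights adapted on that chunk (with AdamW, per the card). A minimal sketch of that control flow, with `score` and `adapt` as stand-ins for the real loss and optimizer step:

```python
def score_first_ttt(chunks, score, adapt):
    """Score each chunk BEFORE adapting on it, so no evaluated token ever
    benefits from having been trained on ('legal' test-time training)."""
    state = {"updates": 0}   # stand-in for model weights
    scores = []
    for chunk in chunks:
        scores.append(score(state, chunk))  # evaluate first...
        adapt(state, chunk)                 # ...then adapt on the same chunk
    return scores
```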
Quantization
GPTQ
bits: 6
scope: all linear layers
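Int6 gives signed integers in [-32, 31]. GPTQ proper orders columns and compensates rounding error using the (Cholesky-factored) Hessian; the sketch below shows only the underlying 6-bit symmetric grid, not the error compensation.

```python
def quantize_int6(weights):
    """Round-to-nearest symmetric int6 quantization of one weight row.

    Sketch only: real GPTQ adds Hessian-based error compensation.
    Returns (int codes in [-32, 31], dequantized floats)."""
    scale = max(abs(w) for w in weights) / 31.0
    q = [max(-32, min(31, round(w / scale))) for w in weights]
    deq = [scale * qi for qi in q]
    return q, deq
```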
Compression
brotli
level: 11
Sequence Length
sequence_length
train_length: 2048
eval_length: 32768
Regularization
logit softcap
parameters: {"value":30}
Novel Contributions
- GDN-Hybrid architecture combining recurrent Gated DeltaNet layers with shared sliding-window attention
- Score-first test-time training that stays legal by adapting weights only on tokens that have already been scored
- Full-Hessian GPTQ Int6 quantization with Cholesky error compensation
- Shared SWA weights to reduce parameter count
- Eval-time hash embeddings and n-gram posterior tilt integrated into TTT
- Hessian-aware quantization of recurrent layers with minimal BPB degradation