val_bpb: 1.0996
Architecture: Hybrid
Optimizer: Muon
Artifact Size: 14.03 MB
Training Techniques
Architecture
Gated DeltaNet
Replaces most attention layers with recurrent GDN layers for long-range associative memory.
parameters: {"layers":10}
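The GDN recurrence maintains a fast-weight state matrix updated by a gated delta rule. Below is a minimal one-step sketch with a scalar gate `alpha` and write strength `beta`; the model's actual GDN layers use learned, per-head parameterizations, so the exact form here is an assumption.

```python
def gated_deltanet_step(S, q, k, v, alpha, beta):
    """One gated delta-rule step over a d x d fast-weight state S.

    S_t = alpha * (S_{t-1} - beta * (S_{t-1} k) k^T) + beta * v k^T
    o_t = S_t q
    (scalar gate/write strength; a sketch, not the model's exact layer)
    """
    d = len(q)
    Sk = [sum(S[i][j] * k[j] for j in range(d)) for i in range(d)]  # S k
    S_new = [[alpha * (S[i][j] - beta * Sk[i] * k[j]) + beta * v[i] * k[j]
              for j in range(d)] for i in range(d)]
    o = [sum(S_new[i][j] * q[j] for j in range(d)) for i in range(d)]
    return S_new, o
```

Writing value `v` at a unit key `k` and then querying with `q = k` retrieves `v`, which is the associative-memory behavior the description refers to.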
SWA
Uses two sliding-window attention layers that share a single set of weights.
parameters: {"layers":2,"shared_weights":true,"window":512}
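Sliding-window attention restricts each position to the last `window` keys (512 here). A minimal reference implementation of the masking-and-softmax pattern, with weight sharing meaning both layers would call this with the same projection weights:

```python
import math

def sliding_window_attention(q, k, v, window):
    """Causal attention where position i attends only to keys in
    [i - window + 1, i]. q, k, v: lists of per-position vectors."""
    d = len(q[0])
    out = []
    for i in range(len(q)):
        lo = max(0, i - window + 1)
        scores = [sum(q[i][t] * k[j][t] for t in range(d)) / math.sqrt(d)
                  for j in range(lo, i + 1)]
        m = max(scores)                      # subtract max for stability
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(w[j - lo] * v[j][t] for j in range(lo, i + 1)) / z
                    for t in range(d)])
    return out
```

With `window=1` each position can only attend to itself, so the output equals `v`, a quick sanity check on the masking.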
weight tying
The input embedding and the LM head share one weight matrix.
parameters: null
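Weight tying uses one `vocab x d` matrix both as the embedding table (row lookup) and as the LM head (dot the hidden state against every row), halving those parameters. A minimal sketch:

```python
def embed(W, token_id):
    """Token embedding: row lookup in the shared matrix W (vocab x d)."""
    return W[token_id]

def lm_head_logits(W, h):
    """Output logits: dot hidden state h against every row of the SAME W."""
    return [sum(wi * hi for wi, hi in zip(row, h)) for row in W]
```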
BigramHash
Hash-based bigram embedding for local n-gram statistics.
parameters: {"buckets":3072}
TrigramHash
Hash-based trigram embedding for additional local n-gram features.
parameters: null
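Both hash embeddings follow the same pattern: hash the n-gram ending at the current position into a fixed number of buckets and look up a learned row there. The hash function below (CRC32 over a joined key) is an assumption; the card specifies only 3072 buckets for the bigram table and no bucket count for the trigram table.

```python
import zlib

def ngram_bucket(tokens, i, n, buckets):
    """Map the n-gram ending at position i to a bucket index in [0, buckets).

    The embedding added to the token representation would be a learned
    row table[bucket]; hash choice here is illustrative, not the model's.
    """
    gram = tokens[max(0, i - n + 1): i + 1]
    key = ",".join(map(str, gram)).encode()
    return zlib.crc32(key) % buckets
```

The same function serves bigrams (`n=2`) and trigrams (`n=3`); collisions are accepted in exchange for a small, fixed-size table.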
SmearGate
Learned smoothing gate applied over embeddings before recurrent layers.
parameters: null
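One plausible form of the smear gate blends each embedding with its predecessor through a learned sigmoid gate. The exact smearing formula is an assumption; the card says only "learned smoothing gate applied over embeddings".

```python
import math

def smear(xs, gate_logit):
    """x'_t = x_t + sigmoid(g) * x_{t-1}, applied over a list of embedding
    vectors. gate_logit g is a learned scalar in this sketch (it could be
    per-channel in the real model)."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    out = [list(xs[0])]
    for t in range(1, len(xs)):
        out.append([xs[t][c] + g * xs[t - 1][c] for c in range(len(xs[t]))])
    return out
```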
GQA
Grouped query attention used in the sliding window attention blocks.
parameters: {"heads":8,"kv_heads":4}
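With 8 query heads and 4 KV heads, consecutive pairs of query heads share one KV head, halving the KV projection and cache size. The grouping rule is just integer division:

```python
def kv_head_for(query_head, n_heads=8, n_kv_heads=4):
    """Map a query head to its shared KV head: groups of n_heads // n_kv_heads
    consecutive query heads attend against the same K/V projections."""
    return query_head // (n_heads // n_kv_heads)
```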
logit softcap
Caps logits with tanh-based soft clipping to stabilize training.
parameters: {"value":30}
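The softcap squashes logits smoothly into (-30, 30) while staying near-identity for small values:

```python
import math

def softcap(logit, cap=30.0):
    """Soft-clip a logit into (-cap, cap): cap * tanh(logit / cap).
    Near-identity for |logit| << cap, saturating at +/-cap."""
    return cap * math.tanh(logit / cap)
```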
Optimizer
Muon
weight_decay: null
momentum: 0.97
other_params: {"warmup_momentum_start":0.92}
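The momentum warmup ramps from 0.92 at step 0 to the final 0.97. The card gives only the endpoints; the linear schedule shape and warmup length below are assumptions.

```python
def muon_momentum(step, warmup_steps, start=0.92, end=0.97):
    """Ramp Muon's momentum linearly from `start` to `end` over
    `warmup_steps`, then hold at `end`. Schedule shape is an assumption."""
    frac = min(1.0, step / warmup_steps)
    return start + frac * (end - start)
```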
AdamW
weight_decay: null
momentum: null
other_params: {"used_for":"scalar and embedding parameters"}
Weight Averaging
EMA
parameters: null
Evaluation
sliding window eval
parameters: {"stride":32}
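With stride 32, each forward pass scores only the last 32 tokens of its window, so every token is scored exactly once with as much left context as fits. A sketch of the span scheduling (the context length here is illustrative; the card specifies only the stride):

```python
def eval_windows(seq_len, context, stride):
    """Return (lo, hi, score_from) spans for strided sliding-window eval.

    Each window covers tokens [lo, hi); only tokens from score_from onward
    are newly scored, with up to `context` tokens of left context."""
    spans = []
    pos = 0
    while pos < seq_len:
        hi = min(pos + stride, seq_len)
        lo = max(0, hi - context)
        spans.append((lo, hi, pos))
        pos = hi
    return spans
```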
Test-Time Training
score-first TTT
parameters: {"chunk_size":32768,"optimizer":"AdamW"}
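The defining property of score-first TTT is the loop order: each 32768-token chunk is scored with the weights as they stood before the model saw it, and only then are the weights adapted on that chunk (with AdamW, per the card). A minimal sketch of that control flow, with `score` and `adapt` as stand-ins for the real loss and optimizer step:

```python
def score_first_ttt(chunks, score, adapt):
    """Score each chunk BEFORE adapting on it, so no evaluated token ever
    benefits from having been trained on ('legal' test-time training)."""
    state = {"updates": 0}   # stand-in for model weights
    scores = []
    for chunk in chunks:
        scores.append(score(state, chunk))  # evaluate first...
        adapt(state, chunk)                 # ...then adapt on the same chunk
    return scores
```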
Quantization
GPTQ
bits: 6
scope: all linear layers
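Int6 gives signed integers in [-32, 31]. GPTQ proper orders columns and compensates rounding error using the (Cholesky-factored) Hessian; the sketch below shows only the underlying 6-bit symmetric grid, not the error compensation.

```python
def quantize_int6(weights):
    """Round-to-nearest symmetric int6 quantization of one weight row.

    Sketch only: real GPTQ adds Hessian-based error compensation.
    Returns (int codes in [-32, 31], dequantized floats)."""
    scale = max(abs(w) for w in weights) / 31.0
    q = [max(-32, min(31, round(w / scale))) for w in weights]
    deq = [scale * qi for qi in q]
    return q, deq
```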
Compression
brotli
level: 11
Sequence Length
sequence_length
train_length: 2048
eval_length: 32768
Regularization
logit softcap
parameters: {"value":30}
Novel Contributions
- GDN-Hybrid architecture combining recurrent Gated DeltaNet layers with shared sliding-window attention
- Score-first test-time training that stays legal by adapting weights only on tokens that have already been scored
- Full-Hessian GPTQ Int6 quantization with Cholesky error compensation
- Shared SWA weights to reduce parameter count
- Eval-time hash embeddings and n-gram posterior tilt integrated into TTT
- Hessian-aware quantization of recurrent layers with minimal BPB degradation