PR #1834 (open)

Record: NgramRes + Sliding-Window Attention + Legal Score-First TTT. …

val_bpb: 1.0803
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.99 MB

Training Techniques

Architecture

  • weight tying: the neural n-gram head shares the input embeddings and ties its output projection to the token embedding matrix (head sketch after this subsection).
    parameters: {"n_gram":3,"layers":2,"d_hidden":64,"d_embed":64}
  • sliding window eval: sliding-window attention on early layers with full causal attention on later layers (mask sketch after this subsection).
    parameters: {"layers":4,"window_size":512}
Test-Time Training

  • score-first TTT (adaptation loop sketched below)
    parameters: {"learning_rate":0.005,"momentum":0.9,"epochs":3,"chunk_tokens":32768}
Quantization

  • GPTQ, bits: 6, scope: attention/MLP/NgramRes matrices
  • GPTQ, bits: 8, scope: embeddings
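GPTQ proper quantizes a weight matrix column by column, compensating each column's rounding error using second-order (Hessian) statistics of the layer inputs. The sketch below deliberately omits that machinery and shows only the recorded int6/int8 bit allocation with plain round-to-nearest, so it is a stand-in rather than GPTQ itself; the name-based scoping test is an assumption.

```python
import torch

def quantize_rtn(w: torch.Tensor, bits: int):
    """Per-output-channel symmetric round-to-nearest quantization."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 31 for 6-bit
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    q = (w / scale).round().clamp(-qmax - 1, qmax)
    return q.to(torch.int8), scale                  # int8 container

def quantize_model(model):
    for name, p in model.named_parameters():
        if p.ndim < 2:
            continue                                # biases/gains stay float
        bits = 8 if "emb" in name else 6            # per-record scoping (assumed test)
        q, scale = quantize_rtn(p.data, bits)
        p.data = q.float() * scale                  # fake-quant for evaluation
```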
Compression

  • lzma, level: null
  • brotli, level: 11
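Both codecs are callable from Python directly: lzma from the standard library and brotli via the brotli package. Since the lzma level is recorded as null, the sketch uses the library default; the artifact filename is hypothetical.

```python
import lzma
import brotli  # pip install brotli

with open("model_int6.bin", "rb") as f:   # hypothetical artifact name
    raw = f.read()

xz = lzma.compress(raw)                   # level unrecorded: library default
br = brotli.compress(raw, quality=11)     # quality 11 (maximum), as recorded

best = min((xz, br), key=len)
print(f"raw={len(raw)}  lzma={len(xz)}  brotli={len(br)}  best={len(best)}")
```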
Optimizer

  • SGD: weight_decay: null, momentum: 0.9, other_params: {"learning_rate":0.005,"gradient_clip":1}
  • AdamW: weight_decay: null, momentum: null, other_params: {"used_for":"embeddings/scalars/NgramRes-head bias and gain terms"}
LR Schedule

  • warmdown, parameters: {"final_fraction":0.72}
Weight Averaging

  • EMA, parameters: {"decay":0.9965}
Sequence Length

  • train_length: 32768
  • eval_length: 32768

Novel Contributions

  • Neural n-gram residual mixing (NgramRes) with a tied-output n-gram head
  • Sliding-window attention on early layers to reduce compute while preserving long-context modeling
  • Legal score-first test-time training with chunked SGD adaptation
  • Mixed int6/int8 GPTQ quantization that fits the model under the 16 MB limit