PR #1834 (open)

Record: NgramRes + Sliding-Window Attention + Legal Score-First TTT. …

val_bpb: 1.0803
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.99 MB

Training Techniques

Architecture

  • weight tying: the neural n-gram head shares the input embeddings and ties its output projection to the token embedding matrix (head sketch after this subsection).
    parameters: {"n_gram":3,"layers":2,"d_hidden":64,"d_embed":64}
  • sliding window eval: sliding-window attention on early layers with full causal attention on later layers (mask sketch after this subsection).
    parameters: {"layers":4,"window_size":512}
Test-Time Training

  • score-first TTT (adaptation loop sketched below)
    parameters: {"learning_rate":0.005,"momentum":0.9,"epochs":3,"chunk_tokens":32768}
Quantization

  • GPTQ, bits: 6, scope: attention/MLP/NgramRes matrices
  • GPTQ, bits: 8, scope: embeddings
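GPTQ proper quantizes a weight matrix column by column, compensating each column's rounding error using second-order (Hessian) statistics of the layer inputs. The sketch below deliberately omits that machinery and shows only the recorded int6/int8 bit allocation with plain round-to-nearest, so it is a stand-in rather than GPTQ itself; the name-based scoping test is an assumption.

```python
import torch

def quantize_rtn(w: torch.Tensor, bits: int):
    """Per-output-channel symmetric round-to-nearest quantization."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 31 for 6-bit
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    q = (w / scale).round().clamp(-qmax - 1, qmax)
    return q.to(torch.int8), scale                  # int8 container

def quantize_model(model):
    for name, p in model.named_parameters():
        if p.ndim < 2:
            continue                                # biases/gains stay float
        bits = 8 if "emb" in name else 6            # per-record scoping (assumed test)
        q, scale = quantize_rtn(p.data, bits)
        p.data = q.float() * scale                  # fake-quant for evaluation
```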
Compression

  • lzma, level: null
  • brotli, level: 11
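Both codecs are callable from Python directly: lzma from the standard library and brotli via the brotli package. Since the lzma level is recorded as null, the sketch uses the library default; the artifact filename is hypothetical.

```python
import lzma
import brotli  # pip install brotli

with open("model_int6.bin", "rb") as f:   # hypothetical artifact name
    raw = f.read()

xz = lzma.compress(raw)                   # level unrecorded: library default
br = brotli.compress(raw, quality=11)     # quality 11 (maximum), as recorded

best = min((xz, br), key=len)
print(f"raw={len(raw)}  lzma={len(xz)}  brotli={len(br)}  best={len(best)}")
```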
Optimizer

  • SGD: weight_decay: null, momentum: 0.9, other_params: {"learning_rate":0.005,"gradient_clip":1}
  • AdamW: weight_decay: null, momentum: null, other_params: {"used_for":"embeddings/scalars/NgramRes-head bias and gain terms"}
LR Schedule

  • warmdown, parameters: {"final_fraction":0.72}
Weight Averaging

  • EMA, parameters: {"decay":0.9965}
Sequence Length

  • train_length: 32768
  • eval_length: 32768

Novel Contributions

  • Neural n-gram residual mixing (NgramRes) with a tied-output n-gram head
  • Sliding-window attention on early layers to reduce compute while preserving long-context modeling
  • Legal score-first test-time training with chunked SGD adaptation
  • Mixed int6/int8 GPTQ quantization that fits the model under the 16 MB limit