PR #1895

open

Record: SP8192 + 3GDN/8FA 4096 + Attention Sink

val_bpb: 1.0785
Architecture: Transformer
Optimizer: SGD
Artifact Size: ~15.934 MB

Training Techniques

Sequence Length
train_length: 4096
eval_length: 4096
Architecture
weight tying
Tied input and output embeddings.
parameters: null
Partial RoPE
Uses partial rotary position embeddings over a subset of head dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
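A minimal NumPy sketch of the partial-RoPE idea, rotating 16 of the 64 head dimensions and passing the rest through untouched. The split-half pairing convention and base frequency are assumptions, not taken from the record:

```python
import numpy as np

def partial_rope(x, positions, rot_dims=16, base=10000.0):
    """Apply rotary embeddings to the first `rot_dims` of each head dim,
    leaving the remaining dimensions untouched (partial RoPE).
    x: (seq, head_dim) array; positions: (seq,) integer positions."""
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)      # (half,) frequencies
    angles = positions[:, None] * inv_freq[None, :]   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]         # rotated pair (split-half)
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```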
Gated DeltaNet (GDN)
Replaces attention with GDN mixer layers (based on chunk_gated_delta_rule) in the listed blocks.
parameters: {"layers":[0,1,10]}
XSA
Enables XSA on all full-attention layers.
parameters: {"layers":[2,3,4,5,6,7,8,9]}
attention sink
Adds a learned sink key per KV head, with a zero value vector, in full-attention layers.
parameters: {"sink_keys_per_kv_head":1}
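A single-head NumPy sketch of the sink mechanism: a learned key is prepended to the keys with a zero value vector, so it can absorb attention mass without contributing to the output. The 1/sqrt(d) scaling and layout here are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend_with_sink(q, k, v, sink_key):
    """Single-head attention with one learned sink key whose value is zero.
    q: (tq, d); k, v: (tk, d); sink_key: (d,). Probability assigned to the
    sink contributes nothing to the output (its value vector is zero)."""
    d = q.shape[-1]
    k_ext = np.concatenate([sink_key[None, :], k], axis=0)
    v_ext = np.concatenate([np.zeros((1, v.shape[-1])), v], axis=0)
    scores = q @ k_ext.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ v_ext
```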
depth recurrence
Runs a segment of layers in a loop (depth recurrence), activated partway through training.
parameters: {"activated_at_wallclock_fraction":0.4}
LeakyReLU
Uses LeakyReLU(0.5)^2 MLP activations.
parameters: {"slope":0.5}
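A sketch of the activation, assuming the square is applied directly to the LeakyReLU output (the record does not say whether the negative branch keeps its sign after squaring):

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU with the listed slope, followed by an elementwise square.
    Assumption: the plain square is used, so the negative branch becomes
    0.25 * x**2 (non-negative)."""
    y = np.where(x > 0, x, slope * x)
    return y * y
```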
Regularization
logit softcap
parameters: {"value":20}
layerwise LN scale
parameters: null
Quantization
GPTQ
bits: 6
scope: attention/MLP matrices
mixed int6/int8
bits: null
scope: embeddings and selected attention projections
late QAT
bits: null
scope: attn.c_k.weight and attn.proj.weight
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"gradient_clip":1,"learning_rate":0.017}
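A sketch of one update with the listed hyperparameters, assuming per-tensor norm clipping and standard (non-Nesterov) momentum; weight decay is unused since it is listed as null:

```python
import numpy as np

def sgd_step(param, grad, velocity, lr=0.017, momentum=0.9, clip=1.0):
    """One SGD-with-momentum step with gradient-norm clipping.
    Assumption: clipping is applied per tensor before the momentum update."""
    norm = np.linalg.norm(grad)
    if norm > clip:
        grad = grad * (clip / norm)
    velocity = momentum * velocity + grad
    return param - lr * velocity, velocity
```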
LR Schedule
cosine decay
parameters: {"chunk_lr":true}
warmdown
parameters: {"value":0.85}
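One plausible reading of the schedule: hold the base LR, then after 85% of training decay it to zero along a cosine curve ("warmdown"). How this composes with chunk_lr is not specified by the record:

```python
import math

def lr_at(step, total_steps, base_lr=0.017, warmdown_start=0.85):
    """Hypothetical schedule: constant LR until `warmdown_start` of
    training, then cosine decay to zero over the remaining steps."""
    t = step / total_steps
    if t < warmdown_start:
        return base_lr
    frac = (t - warmdown_start) / (1.0 - warmdown_start)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * frac))
```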
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD","trainable_set":"mlp only","learning_rate":0.017,"epochs":5,"chunk_size":65536}
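A schematic of the score-first loop: each chunk is scored with the current weights before the model trains on it, so no chunk is ever evaluated with weights that saw its own text. `model` here is a hypothetical object with `score()` / `train_mlp_only()` methods, not the record's actual API:

```python
def score_first_ttt(model, chunks, lr=0.017, epochs=5):
    """Score-first test-time training: score each chunk, then adapt the
    MLP parameters only (via SGD) on that chunk before moving on.
    Returns the token-weighted average loss over all chunks."""
    total_loss, total_tokens = 0.0, 0
    for chunk in chunks:
        loss, n = model.score(chunk)      # score first, with current weights
        total_loss += loss * n
        total_tokens += n
        for _ in range(epochs):           # then adapt MLP params only
            model.train_mlp_only(chunk, lr=lr)
    return total_loss / total_tokens
```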
Compression
Brotli
level: 11

Novel Contributions

  • 3 GDN mixer layers combined with 8 full-attention layers under the SP8192 base shape
  • Learned attention sink keys per KV head for full-attention layers
  • Score-first legal TTT with MLP-only SGD on already-scored chunks
  • Late QAT noise applied only near the end of training to improve quantization robustness
  • Mixed-bit quantization with Hadamard-rotated attention V/O pairs and frequency-split embeddings
  • 4096-token training/evaluation context with a separately compiled full-attention segment