PR #457
open11L + XSA + VRL + SWA + seq4096 + cross-doc TTT - val_bpb 1.1839
by carlesonielfa
val_bpb
1.1839
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.35 MB
Training Techniques
Sequence Length
sequence_length
train_length: 4096
eval_length: null
Architecture
XSA
In the deepest layers, Exclusive Self-Attention subtracts the component of the attention output that is aligned with the token's own value vector.
parameters: {"layers":4}
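A minimal per-token sketch of the subtraction described above, assuming the "aligned component" is the orthogonal projection of the attention output onto the value vector (the PR's exact per-head formulation is not given here):

```python
def exclusive_attention_output(o, v):
    # Subtract from the attention output o its component along the
    # token's own value vector v. This is one reading of "exclusive"
    # self-attention; the per-head wiring in the PR is assumed.
    vv = sum(a * a for a in v)
    if vv == 0.0:
        return list(o)
    scale = sum(a * b for a, b in zip(o, v)) / vv
    return [a - scale * b for a, b in zip(o, v)]
```

After the subtraction, the result is orthogonal to `v`, so the residual stream keeps only the part of the attention output not already expressed by the token's own value.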
VRL
Value Residual Learning adds a learnable residual from layer-0 value vectors into each layer's value vectors.
parameters: {"layers":[1,10]}
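The residual described above can be sketched as follows, assuming a single learned scalar mixing coefficient (per-head or per-channel parameterizations are equally plausible):

```python
def value_residual(v_layer, v0, lam):
    # Add a learnable residual from the layer-0 value vector v0 into
    # the current layer's value vector v_layer. lam is a learned
    # scalar here; the PR's exact parameterization is assumed.
    return [vl + lam * v for vl, v in zip(v_layer, v0)]
```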
SmearGate
Learned token-blending gate at the embedding layer that mixes each token with the previous token.
parameters: null
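A sketch of the embedding-layer blending gate, assuming a single learned sigmoid gate applied as a convex mix with the previous token (the PR may instead use per-channel gates or an additive form):

```python
import math

def smear_gate(embeddings, gate_logit):
    # Blend each token embedding with its predecessor via a learned
    # sigmoid gate g in (0, 1). Scalar gate assumed for illustration.
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    out = [list(embeddings[0])]  # first token has no predecessor
    for prev, cur in zip(embeddings, embeddings[1:]):
        out.append([(1 - g) * c + g * p for c, p in zip(cur, prev)])
    return out
```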
weight tying
The input embedding matrix is shared with (tied to) the output projection.
parameters: null
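Weight tying reduces to one matrix serving two roles, which a small sketch makes concrete (plain lists stand in for tensors):

```python
class TiedHead:
    # One matrix W (vocab x dim) serves both as the input embedding
    # table and as the output projection: logit for token t is h . W[t].
    def __init__(self, W):
        self.W = W  # shared parameter, saved and trained once

    def embed(self, token_id):
        return self.W[token_id]

    def logits(self, h):
        return [sum(a * b for a, b in zip(h, row)) for row in self.W]
```

Because the matrix is stored once, tying also shrinks the artifact, which matters for the 15.35 MB size reported above.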
Weight Averaging
SWA (Stochastic Weight Averaging)
parameters: {"checkpoints":24,"fraction_last_warmdown":0.4}
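Per the parameters above, 24 checkpoints drawn from the last 40% of the warmdown phase are averaged. The averaging step itself is just a uniform mean over parameters:

```python
def average_checkpoints(checkpoints):
    # Uniform average of parameter vectors from the selected
    # checkpoints. Real checkpoints would be state dicts; flat
    # lists of parameters suffice for the sketch.
    n = len(checkpoints)
    return [sum(vals) / n for vals in zip(*checkpoints)]
```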
Test-Time Training
LoRA TTT
parameters: {"rank":8}
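A LoRA forward pass at rank r keeps the base weight frozen and trains only the low-rank factors. A minimal sketch, with the scaling factor and initialization details assumed:

```python
def lora_forward(x, W, A, B, alpha=1.0):
    # y = x @ (W + alpha * A @ B): frozen weight W (d_in x d_out)
    # plus a rank-r update A (d_in x r) @ B (r x d_out). Only A and B
    # are trained at test time. With B initialized to zeros, the
    # output starts identical to the base model's.
    d_out, r = len(W[0]), len(B)
    base = [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(d_out)]
    h = [sum(x[i] * A[i][k] for i in range(len(x))) for k in range(r)]
    delta = [sum(h[k] * B[k][j] for k in range(r)) for j in range(d_out)]
    return [b + alpha * d for b, d in zip(base, delta)]
```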
Quantization
QAT
bits: 8
scope: all
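The core of 8-bit QAT is fake quantization in the forward pass, so training sees the rounding error the deployed model will incur. A sketch assuming symmetric per-tensor quantization (the PR's exact scheme is not stated):

```python
def fake_quantize(x, bits=8):
    # Map values onto 2^(bits-1) - 1 signed levels and back, so the
    # forward pass sees quantization error; in training, gradients
    # would flow straight through this rounding.
    qmax = 2 ** (bits - 1) - 1
    m = max(abs(v) for v in x)
    if m == 0.0:
        return list(x)
    scale = m / qmax
    return [round(v / scale) * scale for v in x]
```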
Compression
zlib
level: null
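Since the level is left null above, the artifact presumably uses zlib's library default. A round-trip sketch:

```python
import zlib

def compress_artifact(raw_bytes):
    # Level is null in the PR metadata, so zlib's default
    # (Z_DEFAULT_COMPRESSION) is used here.
    return zlib.compress(raw_bytes)

def decompress_artifact(blob):
    return zlib.decompress(blob)
```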
Evaluation
sliding window eval
parameters: {"stride":64}
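With stride 64, evaluation windows overlap heavily and each window newly scores only the tokens not covered by the previous window, the rest serving as context. A sketch of the span bookkeeping (the window size is an assumption; it likely matches the 4096-token training length):

```python
def sliding_window_spans(n_tokens, window, stride):
    # Each span is (begin, end, score_from): the window covers
    # [begin, end), and only tokens in [score_from, end) are newly
    # scored; earlier tokens in the window are context.
    spans, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```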
LR Schedule
warmdown
parameters: {"warmdown_iters":1200}
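A warmdown schedule holds the learning rate constant and then decays it linearly to zero over the final steps. A sketch using the PR's `warmdown_iters` (the base LR and total step count are illustrative):

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_iters=1200):
    # Constant LR until the warmdown phase, then linear decay to
    # zero over the final warmdown_iters steps.
    decay_start = total_steps - warmdown_iters
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_iters
```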
Initialization
OvertoneInit
Used with phase-transition resid_mix.
Other
other
Cross-document test-time training with per-document rank-8 LoRA adapters trained on already-evaluated tokens and reset between documents.
parameters: {"reset_between_documents":true}
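The evaluation loop above has two invariants worth making explicit: each chunk is scored before the adapter trains on it (so the model never sees a token before predicting it), and the adapter is reset at every document boundary. A sketch with the LoRA machinery abstracted behind placeholder callbacks:

```python
def evaluate_with_ttt(documents, score_chunk, train_adapter, reset_adapter):
    # Per-document test-time training: reset the adapter at each
    # document boundary, then for each chunk score first and train
    # second, so only already-evaluated tokens influence later
    # predictions. The callbacks stand in for the PR's rank-8 LoRA.
    losses = []
    for doc in documents:
        reset_adapter()
        for chunk in doc:
            losses.append(score_chunk(chunk))
            train_adapter(chunk)
    return losses
```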
Novel Contributions
- Long-context training with sequence length 4096
- Exclusive Self-Attention (XSA) on the deepest 4 layers
- Value Residual Learning (VRL) using layer-0 value vectors
- SmearGate token-blending gate at the embedding layer
- Stochastic Weight Averaging over 24 checkpoints
- Cross-document test-time training with rank-8 LoRA adapters
- Warmdown-phase QAT to minimize the 8-bit quantization penalty