PR #421
Non-record: 11L mixed int5/int6 + working QAT + TTT (val_bpb=1.1466)
by vytautas-bunevicius
val_bpb: 1.1466
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.7 MB
Training Techniques
Quantization
mixed int5/int6 QAT
bits: mixed (see scope)
scope: MLP int5, attention int6, embeddings int8
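A minimal sketch of what the mixed-bit QAT forward can look like, using symmetric per-tensor fake quantization with a straight-through estimator; the helper name and scaling scheme are assumptions, not the PR's code:

```python
import torch

def fake_quant(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric per-tensor fake quantization with a straight-through estimator."""
    qmax = 2 ** (bits - 1) - 1                    # 15 for int5, 31 for int6, 127 for int8
    scale = w.abs().max().clamp(min=1e-8) / qmax  # per-tensor scale (an assumption)
    q = (w / scale).round().clamp(-qmax, qmax) * scale
    return w + (q - w).detach()                   # STE: quantized forward, identity backward

# Bit widths per the PR's scope: MLP weights int5, attention int6, embeddings int8.
```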
Architecture
BigramHash
Increased bigram hash table size from 2048 to 10240 for token/context representation.
parameters: {"size":10240}
memory tokens
Added learnable global context tokens prepended during evaluation and masked during training.
parameters: {"tokens":64}
backout connection
Learned scalar connection subtracting encoder/decoder boundary state from final output.
parameters: {"parameters":1}
per-head temperature
Learned temperature parameter per attention head.
parameters: {"parameters":88}
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
lr: 0.025
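For reference, the core of a Muon step orthogonalizes the momentum-averaged gradient of each 2D weight with a quintic Newton-Schulz iteration; this follows the public reference implementation, not necessarily this PR's variant:

```python
import torch

def newton_schulz(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize G (flatten its singular values toward 1)
    using the quintic Newton-Schulz iteration from the reference Muon code."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    X = X / (X.norm() + 1e-7)
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)

# One step for a 2D weight W with this PR's settings (sketch):
#   buf = 0.99 * buf + W.grad                        # momentum = 0.99
#   W -= 0.025 * (newton_schulz(buf) + 0.04 * W)     # lr = 0.025, weight_decay = 0.04
```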
Weight Averaging
EMA
parameters: {"decay":0.997}
Evaluation
sliding window eval
parameters: {"stride":32}
Test-Time Training
full TTT
parameters: {"epochs":3,"optimizer":"SGD","time":"83s"}
Initialization
ortho+muP init
Orthogonal plus muP initialization.
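One common way to combine these two pieces, orthogonal init followed by muP-style width-dependent rescaling, shown only as a sketch of the idea rather than the PR's exact recipe:

```python
import torch.nn as nn

def ortho_mup_init_(linear: nn.Linear, width_mult: float = 1.0):
    """Orthogonal init, then scale hidden-layer weights by 1/sqrt(width
    multiplier) in the muP spirit; width_mult is relative to a base width."""
    nn.init.orthogonal_(linear.weight)
    linear.weight.data.mul_(width_mult ** -0.5)
    if linear.bias is not None:
        nn.init.zeros_(linear.bias)
```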
Regularization
layerwise LN scale
parameters: null
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Novel Contributions
- Working QAT fix by swapping per-instance forward methods to avoid torch.compile constant folding (see the sketch after this list)
- Mixed int5 MLP / int6 attention quantization with 3% magnitude pruning
- Test-time training with post-quantization SGD on validation tokens
- Expanded BigramHash from 2048 to 10240
- Added 64 learnable memory tokens
- Added a learned backout connection
- Added per-head temperature parameters
- Reduced evaluation stride to 32
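
Per the first bullet above, a sketch of the per-instance forward swap: binding a quantized forward to each Linear instance, rather than branching on a flag inside a shared class method, keeps torch.compile from constant-folding the quantization away. `fake_quant` is the STE helper sketched in the Quantization section, and the module paths in the usage note are hypothetical:

```python
import types
import torch.nn as nn
import torch.nn.functional as F

def enable_qat(linear: nn.Linear, bits: int):
    """Bind a fake-quantized forward to this specific instance. Because the
    swap happens per instance, the compiled graph traces the quantized path
    instead of folding a constant 'quantize?' flag."""
    def qat_forward(self, x):
        return F.linear(x, fake_quant(self.weight, bits), self.bias)
    linear.forward = types.MethodType(qat_forward, linear)

# Usage sketch: int5 for MLP projections, int6 for attention projections.
#   enable_qat(block.mlp.fc, bits=5)
#   enable_qat(block.attn.qkv, bits=6)
```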