PR #338 (open)
Record: 11L XSA+EMA+TTT, sliding val_bpb=1.1254 (3-seed mean 1.1256)
by alertcat
val_bpb: 1.1254
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.55 MB
Training Techniques
Architecture
XSA
Exclusive Self Attention applied to the last 4 layers.
parameters: {"layers":4}
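The PR does not define XSA here, so the sketch below encodes one plausible reading: a strictly causal attention mask that excludes each token's own position (attention is "exclusive" of self). The function names and the interpretation itself are assumptions, not the submission's implementation.

```python
import math

def xsa_mask(n):
    """One plausible reading of Exclusive Self Attention (assumption):
    position i may attend only to positions j < i, never to itself."""
    return [[j < i for j in range(n)] for i in range(n)]

def masked_attention(scores, values, mask):
    """Row-wise masked softmax over scores, then a weighted sum of values."""
    out = []
    for i, row in enumerate(scores):
        exps = [math.exp(s) if m else 0.0 for s, m in zip(row, mask[i])]
        z = sum(exps)
        if z == 0.0:                   # position 0 has nothing to attend to
            out.append([0.0] * len(values[0]))
            continue
        out.append([sum(e / z * v[k] for e, v in zip(exps, values))
                    for k in range(len(values[0]))])
    return out
```

With uniform scores, each position simply averages the values of all earlier positions.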
EMA
Exponential moving average component with decay 0.997.
parameters: {"decay":0.997}
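The EMA update itself is standard; a minimal sketch with the decay 0.997 from this entry, written over a dict of scalars for clarity (real models apply it elementwise to tensors):

```python
def ema_update(ema, params, decay=0.997):
    """One EMA step: ema <- decay * ema + (1 - decay) * param."""
    for k, p in params.items():
        ema[k] = decay * ema[k] + (1.0 - decay) * p
    return ema
```

Evaluation then uses the EMA copy of the weights rather than the raw training weights.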
MLP3x
Transformer MLP expanded to 3x hidden size.
parameters: {"expansion":3}
SmearGate
Learned token blending gate.
parameters: null
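No parameters are listed for SmearGate, so the sketch below is a guess at the mechanism: each token's embedding is blended ("smeared") with its predecessor's, weighted by a sigmoid of a learned gate logit. The single scalar gate is an illustrative simplification.

```python
import math

def smear(embeddings, gate_logit):
    """Blend each token embedding with the previous token's embedding,
    weighted by sigmoid(gate_logit); gate_logit would be learned."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    out = [list(embeddings[0])]          # first token has no predecessor
    for prev, cur in zip(embeddings, embeddings[1:]):
        out.append([c + g * p for p, c in zip(prev, cur)])
    return out
```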
BigramHash
Bigram hashing module with 2048 buckets.
parameters: {"buckets":2048}
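A bucketed bigram feature can be computed by hashing each adjacent token-id pair into one of the 2048 buckets listed here; the bucket index would then look up an auxiliary embedding. The specific mixing constants below are illustrative, not the submission's hash.

```python
def bigram_bucket(prev_id, cur_id, n_buckets=2048):
    """Hash a (previous, current) token-id pair into one of n_buckets.
    The multiplicative mix is illustrative; any stable pair hash works."""
    h = (prev_id * 1000003 + cur_id) * 2654435761 % (1 << 32)
    return h % n_buckets

def bigram_buckets(ids, n_buckets=2048):
    """Bucket index for every adjacent token pair in a sequence."""
    return [bigram_bucket(p, c, n_buckets) for p, c in zip(ids, ids[1:])]
```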
OrthoInit
Orthogonal initialization strategy.
parameters: null
Quantization
int6 QAT
bits: 6
scope: block weights
mixed int5/int6
bits: 5 (MLP), 6 (attention)
scope: MLP and attention weights
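The usual QAT building block is symmetric fake quantization: weights are snapped to a signed b-bit grid in the forward pass (with a straight-through gradient in training). A minimal per-tensor sketch, where bits would be 5 for MLP weights and 6 for attention weights as this entry describes:

```python
def fake_quant(weights, bits):
    """Snap weights to a signed b-bit symmetric grid and return the
    dequantized values (qmax = 15 for int5, 31 for int6)."""
    qmax = 2 ** (bits - 1) - 1
    scale = (max(abs(w) for w in weights) / qmax) or 1.0
    return [max(-qmax - 1, min(qmax, round(w / scale))) * scale
            for w in weights]
```

Per-channel scales and a learned clipping range are common refinements; this keeps only the core rounding step.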
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: null
SGD
weight_decay: null
momentum: 0.9
other_params: {"used_for":"TTT fine-tuning"}
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"interval":200,"checkpoint_avg":7}
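SWA here is a uniform average over periodically saved checkpoints (7 checkpoints taken every 200 steps, per the parameters). A minimal sketch over checkpoints stored as dicts of weight lists:

```python
def swa_average(checkpoints):
    """Uniform elementwise average of a list of checkpoints
    (each a dict mapping parameter name -> list of weights)."""
    n = len(checkpoints)
    return {name: [sum(c[name][i] for c in checkpoints) / n
                   for i in range(len(weights))]
            for name, weights in checkpoints[0].items()}
```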
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
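With stride 64, each evaluation step scores only the next 64 tokens while reusing up to a full window of left context, so every token is scored exactly once but with much more context than disjoint chunks would give. A sketch of the loop; the window size of 512 and the `nll` callback signature are assumptions (the entry only fixes stride=64):

```python
import math

def sliding_window_bits(nll, ids, window=512, stride=64):
    """Strided sliding-window evaluation. `nll(ctx, n)` must return the
    summed negative log-likelihood (nats) of the last n tokens of ctx
    under the model. Returns total bits for the sequence."""
    total, pos = 0.0, 0
    while pos < len(ids):
        n = min(stride, len(ids) - pos)
        ctx = ids[max(0, pos + n - window):pos + n]
        total += nll(ctx, n)
        pos += n
    return total / math.log(2)   # nats -> bits; divide by byte count for bpb
```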
Test-Time Training
full TTT
parameters: {"epochs":3,"learning_rate":0.002,"momentum":0.9,"frozen_blocks":2}
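A generic loop matching these settings: SGD with momentum 0.9 and learning rate 0.002 over the eval stream for 3 epochs, with the frozen blocks excluded from updates. The dict-of-scalars model and `grad_fn` callback are illustrative stand-ins for real tensors and a backward pass:

```python
def ttt_finetune(params, grad_fn, data, epochs=3, lr=0.002,
                 momentum=0.9, frozen=()):
    """Test-time training: SGD + momentum on the eval stream, skipping
    parameters whose names are in `frozen` (the frozen blocks).
    `grad_fn(params, batch)` returns a dict name -> gradient."""
    buf = {k: 0.0 for k in params}
    for _ in range(epochs):
        for batch in data:
            for k, g in grad_fn(params, batch).items():
                if k in frozen:
                    continue
                buf[k] = momentum * buf[k] + g
                params[k] -= lr * buf[k]
    return params
```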
Initialization
OrthoInit
Orthogonal initialization.
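Orthogonal initialization draws a random Gaussian matrix and orthonormalizes it; library implementations typically use a QR decomposition, but Gram-Schmidt shows the idea in a few lines:

```python
import random

def orthogonal_init(n, seed=0):
    """n x n orthogonal matrix via Gram-Schmidt on random Gaussian rows."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for q in rows:                       # remove components along earlier rows
            dot = sum(a * b for a, b in zip(q, v))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-8:                      # redraw on (rare) near-dependence
            rows.append([a / norm for a in v])
    return rows
```

Non-square weight matrices are handled the same way, orthonormalizing along the smaller dimension.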
LR Schedule
warmdown
parameters: {"warmdown_steps":3000}
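A warmdown schedule holds the learning rate constant and then decays it over the final steps of training. The entry only gives warmdown_steps=3000; the linear decay shape below is an assumption (it is the common choice in speedrun-style training):

```python
def warmdown_scale(step, total_steps, warmdown_steps=3000):
    """LR multiplier: 1.0 until the last warmdown_steps, then linear to 0."""
    if step < total_steps - warmdown_steps:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)
```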
Regularization
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0.04}
Novel Contributions
- First submission combining XSA (Exclusive Self Attention), EMA, and Test-Time Training.
- TTT adaptation on the validation token stream with 3 epochs of SGD fine-tuning.
- Mixed-precision quantization: int5 for MLP weights, int6 for attention weights.
- An 11-layer model enabled by the compression savings from int5 MLP quantization.
- Sliding-window evaluation with stride 64 to report val_bpb.