PR #445

closed

Late Training Replay + EMA + GPTQ-lite (val_bpb=1.1236, 2-seed, no TTT on eval)

by newjordan
val_bpb: 1.1236
Architecture: 11L Transformer
Optimizer: Muon
Artifact Size: 15.59 MB

Training Techniques

Quantization
GPTQ-lite
bits: 6
scope: all
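"GPTQ-lite" is not a standard library name, so as a rough illustration of what 6-bit weight quantization over all layers involves, here is a minimal symmetric round-to-nearest sketch; the PR's actual method (e.g. any GPTQ-style error compensation) may differ:

```python
def quantize_6bit(row):
    """Symmetric 6-bit quantization of one weight row (a sketch; the
    [-31, 31] signed grid is an illustrative choice, not the PR's spec)."""
    qmax = 31
    scale = max(abs(w) for w in row) / qmax or 1.0  # avoid zero scale
    q = [max(-qmax, min(qmax, round(w / scale))) for w in row]
    return q, scale

def dequantize(q, scale):
    # map the integer grid back to floats
    return [v * scale for v in q]
```

With per-row absmax scaling, the round-trip error per weight is bounded by half a quantization step (`scale / 2`).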
Architecture
MLP3x
3x MLP with relu^2 activation
parameters: null
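Reading "3x MLP" as an MLP block with a 3x hidden expansion, the relu^2 activation and forward pass can be sketched as follows; the dimensions and weight layout are illustrative, not the PR's actual implementation:

```python
def relu2(x):
    # relu^2 activation: max(x, 0) squared
    return max(x, 0.0) ** 2

def mlp3x_forward(x, w_in, w_out):
    """One MLP block with relu^2, reading '3x' as hidden = 3 * d_model.
    w_in: 3*d columns of length d; w_out: d columns of length 3*d."""
    hidden = [relu2(sum(xi * w for xi, w in zip(x, col))) for col in w_in]
    return [sum(hi * w for hi, w in zip(hidden, col)) for col in w_out]
```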
XSA
Uses XSA4 attention/sequence component
parameters: {"variant":"XSA4"}
SmearGate
Includes SmearGate gating mechanism
parameters: null
BigramHash
Uses BigramHash feature with hashed vocabulary
parameters: {"size":2048}
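A hashed-vocabulary bigram feature with 2048 buckets can be sketched as below; the mixing constants and the pairing of the first position with token id 0 are assumptions for illustration, not the PR's actual hash:

```python
BIGRAM_BUCKETS = 2048  # matches the listed size parameter

def bigram_bucket(prev_tok, tok, buckets=BIGRAM_BUCKETS):
    """Hash a (previous, current) token-id pair into a fixed bucket
    (illustrative multiplicative hash, not the PR's actual function)."""
    h = (prev_tok * 1000003 + tok) * 2654435761 % (2 ** 32)
    return h % buckets

def bigram_bucket_seq(ids):
    # one hashed-bigram feature per position; position 0 pairs with token 0
    return [bigram_bucket(p, t) for p, t in zip([0] + ids[:-1], ids)]
```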
Partial RoPE
Applies rotary position embeddings only partially
parameters: {"16/64":true}
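Reading the "16/64" parameter as rotary embeddings applied to the first 16 of 64 head dimensions, a minimal sketch:

```python
import math

def partial_rope(x, pos, rotary_dims=16, base=10000.0):
    """Rotate only the first `rotary_dims` entries of a head vector by
    position-dependent angles; the remaining dims pass through unchanged.
    (The 16-of-64 reading of '16/64' is an assumption.)"""
    out = list(x)
    for i in range(0, rotary_dims, 2):
        theta = pos / base ** (i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + 1] * s
        out[i + 1] = x[i] * s + x[i + 1] * c
    return out
```

At position 0 the rotation is the identity, and each rotated pair keeps its norm, which is the usual RoPE invariant.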
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"description":"Tight SWA"}
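The EMA decay of 0.997 and a "tight" SWA window both reduce to simple weight-averaging updates; a sketch over flat parameter lists (the SWA window size is not given in the PR):

```python
def ema_update(ema, params, decay=0.997):
    """One EMA step with the listed decay of 0.997."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema, params)]

def swa_average(checkpoints):
    """Plain average over a small window of late checkpoints ('tight' SWA;
    the exact window size is an assumption left open here)."""
    n = len(checkpoints)
    return [sum(c[i] for c in checkpoints) / n for i in range(len(checkpoints[0]))]
```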
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
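Sliding-window evaluation with stride 64 scores each token exactly once while giving it the longest context the window allows; a sketch of the window spans (the window length of 1024 is an assumption, only the stride is listed):

```python
def sliding_windows(n_tokens, window=1024, stride=64):
    """Yield (start, end, n_scored) spans: windows advance by `stride`,
    and each window scores only the tokens not covered by the previous
    window, so every token is scored exactly once."""
    spans, prev_end = [], 0
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        spans.append((start, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```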
Test-Time Training
test_time_training
parameters: null
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
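A warmdown schedule holds the learning rate constant and then decays it over the final 3500 steps; the linear-to-zero shape below is an assumption, only the step count is listed:

```python
def warmdown_lr(step, total_steps, base_lr, warmdown_steps=3500):
    """Constant LR until the last `warmdown_steps`, then linear decay
    to zero (the linear shape is an illustrative assumption)."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```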
Regularization
layerwise LN scale
parameters: null
Other
other
Late-training replay: the last 100 training batches are replayed for 2 epochs at 10% of the base learning rate before EMA finalization
parameters: {"epochs":2,"batches":100,"lr_fraction":0.1}
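The replay step above can be sketched as a plain loop over the retained batches; `train_step` stands in for the caller's update function and is not a name from the PR:

```python
def late_replay(train_step, replay_batches, base_lr,
                epochs=2, lr_fraction=0.1):
    """Replay the retained final batches (100 in the PR) for 2 epochs at
    10% of the base LR, before the EMA weights are finalized."""
    replay_lr = base_lr * lr_fraction
    for _ in range(epochs):
        for batch in replay_batches:
            train_step(batch, lr=replay_lr)
```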

Novel Contributions

  • Late training replay of the last 100 training batches before EMA finalization
  • No test-time training on validation data
  • EMA combined with GPTQ-lite and late-stage replay
  • Sliding-window evaluation with stride 64
  • 2-seed mean reporting for validation BPB