PR #302
Status: open
Non-record: 11L int5/int6 + XSA + online TTT w/ decay prior (single-run val_bpb=1.1520)
by JackYoung27
val_bpb: 1.1520
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.1 MB
Training Techniques
Quantization
mixed int5/int6
bits: null
scope: MLP int5, attention int6
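A minimal sketch of the mixed scheme, assuming symmetric per-tensor fake quantization (the PR does not specify the quantizer); only the int5/int6 split between MLP and attention comes from the entry:

```python
def fake_quant(weights, bits):
    # Symmetric per-tensor fake quantization to signed `bits`-bit integers:
    # int5 -> levels in [-15, 15], int6 -> levels in [-31, 31].
    qmax = 2 ** (bits - 1) - 1
    scale = (max(abs(w) for w in weights) / qmax) or 1.0
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]

def quantize_params(params):
    # Scope from the PR entry: MLP weights get int5, attention weights int6.
    return {name: fake_quant(w, 5 if "mlp" in name else 6)
            for name, w in params.items()}
```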
Architecture
XSA
Uses XSA in the last 3 layers.
parameters: {"layers":3}
BigramHash
BigramHash feature with vocabulary size 10240.
parameters: {"dimensions":10240}
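One way to realize a hashed bigram feature of this size; the hash constant and layout are assumptions, only the 10240-bucket vocabulary comes from the entry:

```python
N_BUCKETS = 10240  # "dimensions" from the PR entry

def bigram_bucket(prev_id, cur_id, n_buckets=N_BUCKETS):
    # Multiplicative hash of the (previous, current) token pair;
    # the prime constant is illustrative, not from the PR.
    return (prev_id * 1000003 + cur_id) % n_buckets

def bigram_features(token_ids):
    # One bucket id per position (from the second token on); each would
    # index an embedding table added to the model's input features.
    return [bigram_bucket(p, c) for p, c in zip(token_ids, token_ids[1:])]
```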
MLP3x
Three-layer MLP blocks.
parameters: {"layers":3}
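A sketch of a three-layer MLP block (versus the usual two-layer transformer MLP); widths and the ReLU nonlinearity are illustrative:

```python
def linear(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def mlp3x(x, W1, W2, W3):
    # Three stacked linear layers with nonlinearities between them;
    # the final projection is left linear, as in a standard MLP block.
    relu = lambda h: [max(0.0, v) for v in h]
    return linear(W3, relu(linear(W2, relu(linear(W1, x)))))
```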
weight tying
Tied fp16 embeddings.
parameters: null
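Weight tying reuses one embedding matrix for both token lookup and the output head (the PR stores it in fp16); a minimal sketch:

```python
def embed(E, token_id):
    # Input side: row lookup in the shared embedding matrix E.
    return E[token_id]

def output_logits(E, hidden):
    # Output side: the same E serves as the unembedding, logits = E @ h,
    # so no separate lm_head matrix is stored (smaller artifact).
    return [sum(e * h for e, h in zip(row, hidden)) for row in E]
```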
Initialization
OrthoInit
Orthogonal initialization with muP.
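A sketch of orthogonal initialization via Gram–Schmidt on a Gaussian matrix; how the PR combines this with muP is not specified, so the 1/sqrt(fan_in) gain below is an assumption:

```python
import math, random

def ortho_init(n, fan_in=None, seed=0):
    # Orthonormalize Gaussian rows with Gram-Schmidt (the Q factor of a QR
    # decomposition), then apply a muP-style 1/sqrt(fan_in) gain (assumed).
    rng = random.Random(seed)
    basis = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for b in basis:
            d = sum(x * y for x, y in zip(v, b))
            v = [x - d * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        basis.append([x / norm for x in v])
    gain = 1.0 / math.sqrt(fan_in) if fan_in else 1.0
    return [[gain * x for x in row] for row in basis]
```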
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
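Muon applies momentum SGD with the update matrix orthogonalized by a Newton–Schulz iteration. The sketch below uses the simpler cubic iteration (Muon's reference implementation uses a tuned quintic); only weight_decay=0.04 comes from the PR, the other hyperparameters are illustrative:

```python
import math

def matmul(A, B):
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def orthogonalize(G, steps=12):
    # Cubic Newton-Schulz: X <- 1.5 X - 0.5 (X X^T) X converges to the
    # nearest orthogonal factor after Frobenius normalization.
    norm = math.sqrt(sum(x * x for row in G for x in row))
    X = [[x / norm for x in row] for row in G]
    for _ in range(steps):
        Y = matmul(matmul(X, [list(c) for c in zip(*X)]), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    return X

def muon_step(W, G, M, lr=0.02, momentum=0.95, weight_decay=0.04):
    # Momentum buffer, orthogonalized update, decoupled weight decay.
    M = [[momentum * m + g for m, g in zip(rm, rg)] for rm, rg in zip(M, G)]
    U = orthogonalize(M)
    W = [[(1.0 - lr * weight_decay) * w - lr * u for w, u in zip(rw, ru)]
         for rw, ru in zip(W, U)]
    return W, M
```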
Weight Averaging
SWA
parameters: {"start":200}
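SWA keeps a running equal-weight average of checkpoints once training passes the start step (200 here, from the PR's parameters); a minimal sketch:

```python
def swa_average(checkpoints, start=200):
    # Equal-weight running average of all parameter snapshots taken at or
    # after `start` (the PR's parameter); earlier checkpoints are ignored.
    avg, n = None, 0
    for step, w in checkpoints:
        if step < start:
            continue
        if avg is None:
            avg, n = list(w), 1
        else:
            n += 1
            avg = [a + (wi - a) / n for a, wi in zip(avg, w)]
    return avg
```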
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
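Sliding-window evaluation scores each stride of new tokens conditioned on up to a full window of left context; the window size below is an assumption (only stride=64 is given in the entry):

```python
def sliding_window_spans(n_tokens, window=512, stride=64):
    # Each tuple is (ctx_start, ctx_end, n_scored): the model sees
    # tokens[ctx_start:ctx_end] but only the last n_scored positions
    # contribute to val_bpb, so every scored token has long left context.
    spans, scored_to = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, end - scored_to))
        scored_to = end
        if end == n_tokens:
            break
    return spans
```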
Test-Time Training
full TTT
parameters: {"scope":"MLP weights in last 3 blocks","learning_rate":null,"decay_prior":true}
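The online TTT update can be sketched as a gradient step on the just-seen context plus a decay pull back toward the pretrained weights, which bounds drift. In the PR this applies only to MLP weights in the last 3 blocks; the learning rate and decay strength below are illustrative (the entry leaves learning_rate null):

```python
def ttt_step(w, grad, w_prior, lr=1e-3, decay=0.01):
    # Gradient step on the current context, then shrink the deviation
    # from the pretrained prior by (1 - decay) -- the "decay prior".
    stepped = [wi - lr * g for wi, g in zip(w, grad)]
    return [pi + (1.0 - decay) * (si - pi) for si, pi in zip(stepped, w_prior)]
```

With zero gradient the weights relax geometrically back to the prior, so adaptation on one document cannot accumulate into unbounded drift across a long evaluation stream.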
Other
other
Pre-Q/K RMSNorm applied to attention input before Q and K projections only.
parameters: null
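A sketch of the Pre-Q/K RMSNorm described above: the attention input is RMS-normalized before the Q and K projections only, while V projects the raw input (matrix shapes illustrative):

```python
import math

def rms_norm(x, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def qkv_proj(x, Wq, Wk, Wv):
    # Normalize only the Q/K path (stabilizing the RoPE-facing attention
    # logits under int5/int6, per the PR); V sees the raw input.
    proj = lambda W, h: [sum(w * v for w, v in zip(row, h)) for row in W]
    xn = rms_norm(x)
    return proj(Wq, xn), proj(Wk, xn), proj(Wv, x)
```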
other
Reptile meta-learning with K=1 inner SGD step and interpolation during the last 10% of training.
parameters: {"k":1,"train_fraction":0.1}
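Reptile with K=1 reduces to: take one inner SGD step on a sampled context, then interpolate the outer weights toward the adapted weights; the interpolation coefficient and inner learning rate below are illustrative:

```python
def reptile_outer_step(w, inner_grad, inner_lr=0.1, epsilon=0.5):
    # K=1 inner loop: a single SGD step on a sampled context, then the
    # outer weights move a fraction epsilon toward the adapted weights,
    # pre-conditioning the model for eval-time TTT adaptation.
    adapted = [wi - inner_lr * g for wi, g in zip(w, inner_grad)]
    return [wi + epsilon * (ai - wi) for wi, ai in zip(w, adapted)]
```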
Novel Contributions
- Pre-Q/K RMSNorm to stabilize the RoPE-facing path under int5/int6
- Online causal TTT with Krause-style decay prior to prevent drift
- Reptile meta-learning in the last 10% of training to improve eval-time TTT adaptation
- Evaluation-time adaptation of MLP weights in the last 3 blocks only