PR #562
Non-record: 1.1354 BPB — 10L TTT 22ep AdamW Cosine + LeakyReLU(0.5)² + TrigramHash
by bigbag
val_bpb: 1.1354
Architecture: Transformer
Optimizer: Muon (matrices) + AdamW (embeddings/scalars)
Artifact Size: 15.35 MB
Training Techniques
Architecture
- Value Residual: ResFormer-style layer-0 V mixing
- Gated Attention: per-head sigmoid gates
- XSA: cross self-attention on the last 4 layers
- LeakyReLU(0.5)²: squared LeakyReLU with negative_slope=0.5; preserves gradient flow for negative inputs and improves BPB by 0.003 over ReLU²
- TrigramHash: extends BigramHash to a 3-token context via XOR hashing into the shared embedding table
- SmearGate: additional gating mechanism
- LN Scale: depth-scaled residuals
- U-Net skip connections: skip connections inspired by the U-Net architecture
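The LeakyReLU(0.5)² activation above can be sketched as a plain elementwise function (the elementwise application is the standard convention; the negative slope and squaring are as listed):

```python
def leaky_relu_05_squared(x: float) -> float:
    """LeakyReLU with negative_slope=0.5, then squared.

    Unlike ReLU², which is identically zero (with zero gradient) for
    negative inputs, the 0.5 slope keeps gradient flowing on the
    negative side before squaring.
    """
    y = x if x >= 0 else 0.5 * x
    return y * y
```

For x = -2 this gives (0.5 · -2)² = 1, so negative inputs still produce a nonzero output and gradient.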
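A minimal sketch of the TrigramHash idea: the PR states only that a 3-token context is XOR-hashed into the shared embedding table with zero extra parameters; the mixing constants and table size below are assumptions for illustration.

```python
TABLE_SIZE = 50304  # assumed size of the shared embedding table; not given in the PR

def trigram_hash(t0: int, t1: int, t2: int, table_size: int = TABLE_SIZE) -> int:
    """XOR-mix a 3-token context into one index of the shared embedding table.

    Multiplying each token id by a distinct odd constant before XORing
    (so that permuted contexts hash differently) is an assumption; the
    PR only specifies XOR hashing into a shared table.
    """
    h = (t0 * 0x9E3779B1) ^ (t1 * 0x85EBCA77) ^ (t2 * 0xC2B2AE3D)
    return (h & 0xFFFFFFFF) % table_size
```

Because the index lands in the existing table, the 3-token context costs no additional parameters.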
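The U-Net-style skips can be sketched as long connections pairing early layers with late ones; the exact pairing and merge (plain addition below) are assumptions, since the PR only cites U-Net-inspired skip connections.

```python
def forward_with_unet_skips(x, blocks):
    """Run a stack of blocks with U-Net style long skip connections.

    The first half of the stack pushes its outputs onto a stack; the
    second half pops and adds them, pairing layer i with layer n-1-i.
    Additive merging is an assumption for illustration.
    """
    n = len(blocks)
    saved = []
    for i, block in enumerate(blocks):
        if i >= n - n // 2 and saved:   # second half: consume a saved activation
            x = x + saved.pop()
        x = block(x)
        if i < n // 2:                  # first half: store for the long skip
            saved.append(x)
    return x
```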
Optimizer
- Muon: used for matrices, with Newton-Schulz iteration (weight decay and momentum not reported)
- AdamW: used for embeddings/scalars, weight_decay=0, TTT learning rate 0.0005
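The matrices-vs-embeddings/scalars split above amounts to a routing rule over parameters. A sketch of one common convention (the name patterns and the ndim heuristic are assumptions, not taken from the PR):

```python
def split_param_groups(named_params):
    """Route parameters to Muon vs. AdamW by shape and role.

    Heuristic sketch: 2-D hidden-layer matrices go to Muon (which
    orthogonalizes their updates via Newton-Schulz iteration);
    embeddings, the output head, and scalar/vector parameters go to
    AdamW with weight_decay=0, matching the settings listed above.
    Each entry is (name, {"ndim": ...}) to keep the sketch framework-free.
    """
    muon, adamw = [], []
    for name, p in named_params:
        if p["ndim"] >= 2 and "embed" not in name and "head" not in name:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw
```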
Weight Averaging
- SWA: 27 checkpoints averaged
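SWA over the 27 checkpoints is a uniform per-parameter mean. A minimal sketch using a running mean so all checkpoints need not be held in memory at once:

```python
def swa_average(checkpoints):
    """Uniformly average parameter dicts from several checkpoints (SWA).

    Incremental running mean: after the k-th checkpoint, avg holds the
    arithmetic mean of the first k values for every parameter name.
    """
    avg = {}
    for k, ckpt in enumerate(checkpoints, start=1):
        for name, value in ckpt.items():
            if name not in avg:
                avg[name] = value
            else:
                avg[name] += (value - avg[name]) / k
    return avg
```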
Quantization
- mixed int5 (MLP) / int6 (attention) + GPTQ-lite per-row clip search + 3% magnitude pruning + FP16 passthrough for embeddings + zstd-22 compression
- scope: MLP, attention, embeddings
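The per-row clip search can be sketched as symmetric quantization over a small grid of clipping fractions, keeping the clip with the lowest reconstruction error. The grid and squared-error metric are assumptions; the PR states only a GPTQ-lite per-row clip search with int5 (MLP) / int6 (attention).

```python
def quantize_row(row, bits, clip_grid=(1.0, 0.9, 0.8, 0.7)):
    """Symmetric per-row quantization with a small clip search.

    For each candidate clip (a fraction of the row's max |w|), quantize
    to signed `bits`-bit integers and keep the clip minimizing the
    squared reconstruction error for that row.
    """
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(w) for w in row) or 1.0
    best = None
    for frac in clip_grid:
        scale = (frac * amax) / qmax
        q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in row]
        err = sum((w - qi * scale) ** 2 for w, qi in zip(row, q))
        if best is None or err < best[0]:
            best = (err, q, scale)
    _, q, scale = best
    return q, scale
```

Clipping trades a little error on the largest weights for finer resolution on the rest; the search picks that trade-off per row.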
Compression
- zstd, level 22
Test-Time Training
- full TTT: 22 epochs, AdamW, learning rate 0.0005, weight_decay 0, per-step cosine decay to 0
- per-layer LR groups: output projections ×3, input projections ×0.5
- batch size 32 per GPU, all_reduce gradient sync per step, gradient clipping at 1.0
- TTT time: 406 s, eval time: 197 s
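The per-layer LR groups can be expressed as a multiplier rule over parameter names; the matching strings below are assumptions about how the layers are identified, while the base LR and the ×3 / ×0.5 factors are the values listed above.

```python
def ttt_lr_for(name: str, base_lr: float = 5e-4) -> float:
    """Per-layer learning rate used during test-time training.

    Output projections get 3x the base LR and input projections 0.5x,
    compensating for uneven quantization damage across layer types.
    """
    if "out_proj" in name:   # assumed identifier for output projections
        return 3.0 * base_lr
    if "in_proj" in name:    # assumed identifier for input projections
        return 0.5 * base_lr
    return base_lr
```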
LR Schedule
- warmdown: 3500 steps
- cosine decay: per-step, decaying to 0
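The per-step cosine decay to 0 is the standard half-cosine schedule, evaluated every optimizer step rather than per epoch:

```python
import math

def cosine_lr(step: int, total_steps: int, base_lr: float) -> float:
    """Cosine decay from base_lr at step 0 to exactly 0 at the final step."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
```

Reaching exactly 0 by the last TTT step is what lets the 22-epoch run avoid overfitting late in adaptation.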
Evaluation
- sliding window eval + Test-Time Training (TTT): 22 TTT epochs, batch size 32, all_reduce gradient sync per step
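Sliding-window evaluation scores each token exactly once while giving it as much left context as the window allows. A sketch of the indexing scheme (the window and stride values are not given in the PR; this only shows the span bookkeeping):

```python
def sliding_windows(seq_len: int, window: int, stride: int):
    """Return (start, end, score_start) spans for sliding-window eval.

    The model sees context [start, end) but the loss is computed only on
    tokens [score_start, end), so consecutive spans tile the sequence
    without double-counting any token.
    """
    spans = [(0, min(window, seq_len), 0)]
    end = spans[0][1]
    while end < seq_len:
        new_end = min(end + stride, seq_len)
        spans.append((max(0, new_end - window), new_end, end))
        end = new_end
    return spans
```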
Novel Contributions
- Batched TTT with 32 sequences per GPU is ~500x faster than chunk-based TTT
- Per-step cosine learning rate decay prevents overfitting at high epoch counts during TTT
- Gradient synchronization per step (all_reduce on gradients) is critical for stable multi-GPU TTT
- Per-layer learning rate groups compensate for uneven quantization damage, especially on output projections
- LeakyReLU(0.5)² activation improves BPB by 0.003 compared to ReLU²
- TrigramHash extends BigramHash context from 2 to 3 tokens using a shared embedding table with zero extra parameters