PR #518
Status: closed
Record: 11L XSA4 + LeakyReLU(0.5)² + Cosine TTT 50ep (val_bpb=1.0622)
by sofiabod
val_bpb: 1.0622
Architecture: Transformer
Optimizer: AdamW
Artifact Size: —
Training Techniques
Architecture
XSA
Cross/self-attention variant applied to the last 4 layers
parameters: {"layers":4}
Partial RoPE
Rotary positional embeddings applied to a subset of dimensions
parameters: {"dimensions":16,"total_dimensions":64}
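A minimal NumPy sketch of partial RoPE as recorded above: rotation is applied to the first 16 of the 64 head dimensions and the rest pass through unchanged. The pairing convention (first-half/second-half pairs) and the base frequency of 10000 are assumptions, not taken from the PR.

```python
import numpy as np

def partial_rope(x, positions, rot_dims=16, base=10000.0):
    """Apply rotary embeddings to the first `rot_dims` of the head dim.

    x: (seq, head_dim) array; positions: (seq,) integer positions.
    Dimensions beyond `rot_dims` are returned untouched.
    """
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)        # (half,)
    angles = positions[:, None] * inv_freq[None, :]     # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rot_dims]           # paired halves
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=-1)
```

At position 0 the rotation is the identity, and the upper 48 dimensions are always passed through, which is easy to check directly.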
MLP3x
Transformer MLP widened to 3x
parameters: null
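A sketch of the 3x-wide MLP. The model width (`d_model = 64`, consistent with the RoPE entry's `total_dimensions`) and the init scale are illustrative assumptions; a plain ReLU stands in for the PR's activation.

```python
import numpy as np

d_model = 64
hidden = 3 * d_model          # widened to 3x d_model instead of the usual 4x
rng = np.random.default_rng(0)
W_fc = rng.normal(0, 0.02, size=(d_model, hidden))
W_proj = rng.normal(0, 0.02, size=(hidden, d_model))

def mlp(x):
    h = np.maximum(x @ W_fc, 0.0)   # placeholder activation for the sketch
    return h @ W_proj
```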
tied embeddings
Input and output embeddings are tied
parameters: {"vocab_size":1024}
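Tied embeddings mean one shared matrix serves as both the input lookup table and the output projection. A sketch with the recorded `vocab_size` of 1024 (the embedding width of 64 is an assumption):

```python
import numpy as np

vocab_size, d_model = 1024, 64
rng = np.random.default_rng(0)
W = rng.normal(0, 0.02, size=(vocab_size, d_model))  # single shared matrix

def embed(token_ids):
    return W[token_ids]          # input embedding: row lookup

def logits(hidden):
    return hidden @ W.T          # output head reuses the same weights
```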
BigramHash
Bigram hashing feature/module used in the model
parameters: {"hash_size":2048,"dimension":128}
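A sketch of how a bigram-hash feature with the recorded `hash_size` of 2048 and `dimension` of 128 could work: each (previous token, current token) pair is hashed into a bucket, and the bucket indexes a learned embedding table. The hash function and the sentinel for position 0 are assumptions.

```python
import numpy as np

HASH_SIZE, DIM = 2048, 128
rng = np.random.default_rng(0)
bigram_table = rng.normal(0, 0.02, size=(HASH_SIZE, DIM))

def bigram_bucket(prev_tok, tok):
    # Simple multiplicative hash of the (prev, current) token pair.
    return (prev_tok * 0x9E3779B1 + tok) % HASH_SIZE

def bigram_features(tokens):
    # Position 0 has no predecessor; pair it with a sentinel id 0.
    prev = [0] + list(tokens[:-1])
    buckets = [bigram_bucket(p, t) for p, t in zip(prev, tokens)]
    return bigram_table[buckets]   # (seq, 128) features for the model
```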
SmearGate
Gating mechanism used in the architecture
parameters: null
OrthoInit
Orthogonal initialization used for some layers
parameters: null
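Orthogonal initialization is typically done by QR-decomposing a Gaussian matrix; a self-contained NumPy sketch (which layers the PR applies it to is not specified here):

```python
import numpy as np

def orthogonal_init(shape, gain=1.0, rng=None):
    """Orthogonal init via QR decomposition of a Gaussian matrix."""
    if rng is None:
        rng = np.random.default_rng()
    rows, cols = shape
    a = rng.normal(size=(max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))      # sign-fix makes the factorization unique
    if rows < cols:
        q = q.T
    return gain * q[:rows, :cols]
```

The rows (or columns, whichever is shorter) of the result are orthonormal by construction.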
VE128
VE128 module applied to layers 9 and 10
parameters: {"layers":[9,10]}
U-Net skip connections
Skip connections added in a U-Net style
parameters: null
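One common reading of U-Net-style skips in a transformer stack: activations entering the first half of the layers are saved and added back at mirrored positions in the second half. A scalar toy sketch of that wiring (the exact pairing in the PR is not specified):

```python
def unet_forward(x, layers):
    """Run a layer stack with U-Net-style skips: inputs to the first
    half are stashed and added back at mirrored decoder positions."""
    half = len(layers) // 2
    saved = []
    for i, layer in enumerate(layers):
        if i < half:
            saved.append(x)        # encoder half: stash this layer's input
        elif saved:
            x = x + saved.pop()    # decoder half: add the mirrored activation
        x = layer(x)
    return x
```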
LeakyReLU(0.5)²
Squared LeakyReLU activation replacing ReLU², so gradients still flow for negative pre-activations
parameters: {"negative_slope":0.5}
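Taking the name at face value, the activation is LeakyReLU with `negative_slope=0.5` followed by squaring, by analogy with ReLU² = relu(x)². A scalar sketch of that reading:

```python
def leaky_relu_sq(x, negative_slope=0.5):
    """Squared LeakyReLU: ReLU²'s hard zero for negative inputs is
    replaced by a scaled linear branch, so the gradient is nonzero
    when x < 0."""
    y = x if x >= 0 else negative_slope * x
    return y * y
```

For x = -2 this gives (0.5 · -2)² = 1, with a nonzero derivative, whereas ReLU² would output 0 with a zero gradient.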
Optimizer
AdamW
weight_decay: 0
momentum: null
other_params: {"learning_rate":0.0005}
LR Schedule
cosine decay
parameters: {"epochs":50,"formula":"lr *= 0.5 * (1 + cos(pi * progress))"}
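Reading the recorded formula as a schedule over training progress (i.e. lr = base_lr · 0.5 · (1 + cos(π · progress)), rather than a compounding per-step update), with the optimizer's learning rate of 5e-4 as the base:

```python
import math

def cosine_lr(step, total_steps, base_lr=5e-4):
    """Cosine decay from base_lr at step 0 to 0 at the final step:
    lr = base_lr * 0.5 * (1 + cos(pi * progress))."""
    progress = step / total_steps
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))
```

The schedule starts at 5e-4, passes through half the base LR at the midpoint, and reaches 0 at the end of the 50 epochs.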
Test-Time Training
full TTT
parameters: {"epochs":50,"learning_rate":0.0005,"weight_decay":0,"all_parameters_unfrozen":true,"per_layer_lr":{"mlp.proj":3,"mlp.fc":0.5},"grad_clip":1,"ddp_gradient_sync":true}
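A sketch of the per-layer learning-rate groups from the TTT config, reading the `per_layer_lr` values as multipliers on the base LR (an assumption; they could also be absolute rates). Parameters whose names match a configured substring get the scaled LR; everything else keeps the base.

```python
BASE_LR = 5e-4
PER_LAYER_LR = {"mlp.proj": 3, "mlp.fc": 0.5}   # multipliers from the TTT config

def lr_for(param_name, base_lr=BASE_LR, multipliers=PER_LAYER_LR):
    """Resolve the LR for one parameter by substring match on its name."""
    for key, mult in multipliers.items():
        if key in param_name:
            return base_lr * mult
    return base_lr

def param_groups(param_names):
    # Group parameter names by resolved LR (optimizer-style param groups).
    groups = {}
    for name in param_names:
        groups.setdefault(lr_for(name), []).append(name)
    return groups
```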
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"type":"tight"}
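The EMA half of the weight averaging is the standard exponential moving average of parameters with the recorded decay of 0.997; a minimal sketch over a parameter dict (what "tight" SWA means here is not specified, so it is not sketched):

```python
def ema_update(ema_params, params, decay=0.997):
    """One EMA step over a dict of weights: ema = decay*ema + (1-decay)*new."""
    return {k: decay * ema_params[k] + (1 - decay) * params[k]
            for k in ema_params}
```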
Quantization
GPTQ-lite
bits: 6
scope: all
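"GPTQ-lite" is not specified here; as a stand-in, a plain round-to-nearest symmetric 6-bit quantizer shows what 6-bit weight storage looks like (GPTQ proper additionally corrects rounding error layer by layer, which this sketch does not do):

```python
import numpy as np

def quantize_6bit(w):
    """Symmetric round-to-nearest 6-bit quantization of a weight tensor."""
    qmax = 2 ** (6 - 1) - 1                  # 31
    qmin = -(2 ** (6 - 1))                   # -32
    scale = float(np.abs(w).max()) / qmax if np.any(w) else 1.0
    q = np.clip(np.round(w / scale), qmin, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale
```

Round-to-nearest bounds the reconstruction error of each weight by half the scale.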
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: null
Initialization
OrthoInit
Orthogonal initialization
Novel Contributions
- LeakyReLU(0.5)² activation replacing ReLU²
- 50-epoch cosine test-time training with per-layer learning-rate groups
- Improved validation BPB to 1.0622, beating the prior best validated score
- Combination of full #414 frontier stack with the new activation and TTT recipe