PR #672 (open)

Record: 30ep Cosine TTT on LeakyReLU² stack (3-seed mean val_bpb=1.0781)

by andrewbaggio1
val_bpb: 1.0781
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.62 MB

Training Techniques

Architecture
LeakyReLU² stack
11-layer Transformer stack using LeakyReLU(0.5) squared MLPs with several custom architectural components.
parameters: {"layers":11,"d_model":512,"gqa_heads":"8/4","mlp_multiplier":3,"bigram_hash":2048,"partial_rope_dims":16}
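A minimal sketch of the squared-LeakyReLU MLP named above, with widths taken from the listed parameters (d_model=512, mlp_multiplier=3); the weight scales and batch size are illustrative, not from the PR:

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    # LeakyReLU(0.5) followed by squaring, per the architecture description.
    a = np.where(x >= 0, x, slope * x)
    return a * a

def mlp(x, W_fc, W_proj):
    # d_model=512, mlp_multiplier=3 -> hidden width 1536 (from the listed parameters)
    return leaky_relu_sq(x @ W_fc) @ W_proj

rng = np.random.default_rng(0)
d, h = 512, 512 * 3
x = rng.standard_normal((4, d))
W_fc = rng.standard_normal((d, h)) * 0.02    # illustrative init scale
W_proj = rng.standard_normal((h, d)) * 0.02
y = mlp(x, W_fc, W_proj)
print(y.shape)  # (4, 512)
```

Note that squaring makes the activation nonnegative; negative inputs still carry gradient signal through the 0.5 slope.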
BigramHash
Bigram hashing component used in the model.
parameters: {"size":2048}
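One plausible reading of a size-2048 bigram hash: map each adjacent token pair into a 2048-entry learned embedding table. The hash function, mixing constant, and how the lookup is combined with token embeddings are all assumptions; the PR only specifies the table size:

```python
import numpy as np

def bigram_hash(prev_ids, cur_ids, table_size=2048):
    # Hash each (prev, cur) token pair into one of `table_size` buckets.
    # The mixing constant 1000003 is illustrative; the PR does not give the hash.
    return (prev_ids * 1000003 + cur_ids) % table_size

rng = np.random.default_rng(0)
bigram_table = rng.standard_normal((2048, 512)) * 0.02  # learned embeddings (sketch)

tokens = np.array([5, 17, 17, 99])
prev = np.concatenate([[0], tokens[:-1]])        # shift right; bucket 0 stands in for BOS
extra = bigram_table[bigram_hash(prev, tokens)]  # (seq, d_model), e.g. added to token embeddings
print(extra.shape)
```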
SmearGate
Custom gating mechanism included in the architecture.
parameters: null
XSA4
Custom attention-like architectural component.
parameters: null
Partial RoPE
Rotary positional embeddings applied only to part of the representation.
parameters: {"dimensions":16}
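A sketch of partial RoPE as described: rotate only the first 16 dimensions of each head and leave the rest position-independent. The head dimension (64) and frequency base are assumptions:

```python
import numpy as np

def partial_rope(x, rot_dims=16, base=10000.0):
    # x: (seq, head_dim). Apply rotary embeddings to the first `rot_dims`
    # dimensions only; the remaining dimensions pass through unchanged.
    seq, _ = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.arange(seq)[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)

x = np.ones((8, 64))
y = partial_rope(x)
print(y.shape)  # (8, 64)
```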
KV GQA
Grouped-query attention with reduced KV heads.
parameters: {"heads":"8/4"}
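With 8 query heads and 4 KV heads, each KV head serves two query heads, halving the KV cache. A minimal score computation (sequence length and head dimension are illustrative):

```python
import numpy as np

def gqa_scores(q, k, q_heads=8, kv_heads=4):
    # q: (q_heads, seq, hd); k: (kv_heads, seq, hd).
    # Each KV head is shared by q_heads // kv_heads query heads (here 2).
    k_rep = np.repeat(k, q_heads // kv_heads, axis=0)  # (q_heads, seq, hd)
    hd = q.shape[-1]
    return q @ k_rep.transpose(0, 2, 1) / np.sqrt(hd)  # (q_heads, seq, seq)

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 16, 64))
k = rng.standard_normal((4, 16, 64))
print(gqa_scores(q, k).shape)  # (8, 16, 16)
```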
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: null
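The EMA update with the listed decay of 0.997 is straightforward; a sketch over a flat parameter dict (the evaluated/quantized artifact would use the EMA copy, which is an assumption about this pipeline):

```python
def ema_update(ema_params, params, decay=0.997):
    # Exponential moving average of model weights (decay from the PR).
    return {k: decay * ema_params[k] + (1 - decay) * params[k] for k in params}

ema = {"w": 1.0}
for step in range(3):
    ema = ema_update(ema, {"w": 0.0})
print(ema["w"])  # 0.997 ** 3
```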
Quantization
int6
bits: 6
scope: model weights
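A sketch of symmetric 6-bit weight quantization. Per-tensor scaling is an assumption; the PR specifies only the bit width and scope:

```python
import numpy as np

def quant_int6(w):
    # Symmetric per-tensor 6-bit quantization: integer levels in [-31, 31].
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)  # 6-bit values in int8 storage
    return q, scale

def dequant(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s = quant_int6(w)
err = np.abs(dequant(q, s) - w).max()
print(q.min(), q.max(), err <= s / 2 + 1e-6)
```

Packed at 6 bits per weight plus zstd on top, this is consistent with the sub-16 MB artifact size.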
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
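Strided sliding-window evaluation scores each token with near-full left context by sliding the window 64 tokens at a time and scoring only the newly covered tokens. The context length of 512 is an assumption; the PR specifies only the stride:

```python
def sliding_windows(n_tokens, ctx=512, stride=64):
    # Returns (begin, end, n_scored) triples: each window covers [begin, end)
    # and scores only the tokens not covered by the previous window.
    windows, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + ctx, n_tokens)
        windows.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return windows

ws = sliding_windows(1000)
print(ws[0], ws[-1])  # every token is scored exactly once
```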
Test-Time Training
full TTT
parameters: {"epochs":30,"optimizer":"AdamW","learning_rate":0.0005,"lr_schedule":"cosine decay","per_layer_lr_groups":{"mlp.proj":3,"mlp.fc":0.5}}
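The listed TTT schedule combines cosine decay from lr=5e-4 over 30 epochs with per-layer multipliers (3x for `mlp.proj`, 0.5x for `mlp.fc`). A sketch of the schedule logic, independent of any training framework:

```python
import math

def ttt_lr(step, total_steps, base_lr=5e-4):
    # Cosine decay from base_lr to 0 over the TTT run (30 epochs in the PR).
    return base_lr * 0.5 * (1 + math.cos(math.pi * step / total_steps))

def group_lr(param_name, lr, groups={"mlp.proj": 3.0, "mlp.fc": 0.5}):
    # Per-layer multipliers from the PR's per_layer_lr_groups; matching by
    # name substring is an assumption about how groups are assigned.
    for prefix, mult in groups.items():
        if prefix in param_name:
            return lr * mult
    return lr

lr = ttt_lr(step=0, total_steps=30)
print(group_lr("block3.mlp.proj.weight", lr))  # 3x the base lr at step 0
```

In a real run these values would feed AdamW parameter groups, one group per multiplier.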
Initialization
OrthoInit
Orthogonal initialization.
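Orthogonal initialization is commonly done via QR decomposition of a Gaussian matrix; a sketch for a square weight (gain and seed are illustrative, and the PR gives no parameters for OrthoInit):

```python
import numpy as np

def ortho_init(shape, gain=1.0, seed=0):
    # Orthogonal init: QR of a Gaussian matrix; the sign fix on R's diagonal
    # makes the distribution uniform over orthogonal matrices.
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(shape)
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))  # column-wise sign correction
    return gain * q

W = ortho_init((512, 512))
print(W.shape)  # columns are orthonormal
```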
LR Schedule
cosine decay
parameters: {"phase":"TTT","epochs":30}
Regularization
layerwise LN scale
parameters: null

Novel Contributions

  • Increased TTT epochs to 30 while keeping the architecture identical to PR #518
  • Achieved a 3-seed mean validation BPB of 1.0781
  • Used cosine-decayed test-time training with per-layer learning-rate groups
  • Maintained artifact size under 16 MB