val_bpb: 1.1036
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.70 MB
Training Techniques
Architecture
weight tying
Tied input/output embeddings.
parameters: null
depth recurrence
12 physical layers with a 2-layer recurrence loop, yielding 16 effective layers.
parameters: {"physical_layers":12,"effective_layers":16,"recurrent_layers":2,"repeats":3}
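The depth-recurrence scheme above can be sketched as an execution schedule: a 2-layer block inside a 12-layer stack is looped 3 times, so 12 physical layers run as 16 effective layers. The loop's position in the stack (`loop_start`) is an illustrative assumption, not stated in the record.

```python
# Hedged sketch of depth recurrence: 12 physical layers, a 2-layer block
# repeated 3 times, 16 effective layers. `loop_start` is an assumption.

def build_schedule(physical_layers=12, recurrent_layers=2, repeats=3, loop_start=5):
    """Return the sequence of physical-layer indices actually executed."""
    schedule = list(range(loop_start))                        # layers before the loop
    block = list(range(loop_start, loop_start + recurrent_layers))
    schedule += block * repeats                               # recurrent block, repeated
    schedule += list(range(loop_start + recurrent_layers, physical_layers))
    return schedule

schedule = build_schedule()   # 16 entries drawn from 12 distinct layers
```

Weight reuse is what keeps the artifact small: extra depth costs compute at inference, not parameters in the checkpoint.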
XSA
Applied XSA in all layers.
parameters: null
U-Net skip connections
Used U-Net style encoder-decoder skip connections with learned gates.
parameters: null
GQA
Grouped query attention with 8 query heads and 4 KV heads.
parameters: {"query_heads":8,"kv_heads":4}
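The GQA record above (8 query heads, 4 KV heads) means each KV head is shared by 2 query heads. A minimal numpy sketch, with sequence length and head dimension chosen only for illustration:

```python
import numpy as np

# Hedged sketch of grouped-query attention: KV heads are repeated so that
# each serves a group of query heads. Shapes here are illustrative.

def gqa(q, k, v):
    # q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d)
    n_q, n_kv = q.shape[0], k.shape[0]
    group = n_q // n_kv                          # query heads per KV head (2 here)
    k = np.repeat(k, group, axis=0)              # broadcast KV heads to query heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))   # stable softmax
    weights /= weights.sum(-1, keepdims=True)
    return weights @ v

q = np.random.randn(8, 16, 32)   # 8 query heads
k = np.random.randn(4, 16, 32)   # 4 KV heads
v = np.random.randn(4, 16, 32)
out = gqa(q, k, v)               # shape (8, 16, 32)
```

Halving the KV heads halves the K/V projection parameters and KV-cache size relative to full multi-head attention.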
LeakyReLU
Used LeakyReLU squared activation.
parameters: {"squared":true,"negative_slope":0.5}
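One plain reading of "LeakyReLU squared" with `negative_slope` 0.5 is LeakyReLU followed by squaring; whether the negative branch keeps its sign is not stated in the record, so the non-negative convention below is an assumption:

```python
# Hedged sketch of LeakyReLU squared (negative_slope=0.5): apply LeakyReLU,
# then square. The output is non-negative on both branches under this
# assumed convention.

def leaky_relu_squared(x, negative_slope=0.5):
    y = x if x >= 0 else negative_slope * x
    return y * y
```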
BigramHash
Added a zero-initialized bigram hash embedding trained during TTT.
parameters: {"dimensions":[16384,512]}
Optimizer
Muon
weight_decay: 0.095
momentum: 0.99
other_params: {"backend_steps":5,"warmup":"0.92->0.99 over 1500 steps"}
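The Muon momentum warmup "0.92->0.99 over 1500 steps" can be sketched as a ramp that then holds at the final value; the linear shape is an assumption, the endpoints come from the config above.

```python
# Hedged sketch of the Muon momentum warmup: ramp 0.92 -> 0.99 over the
# first 1500 steps (linear shape assumed), then hold at 0.99.

def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    t = min(step / warmup_steps, 1.0)
    return start + t * (end - start)
```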
AdamW
weight_decay: 0.095
momentum: null
other_params: {"beta1":0.9,"beta2":0.95,"fused":true}
Weight Averaging
EMA
parameters: {"decay":0.9965}
SWA
parameters: {"start":"last 33%","frequency":5,"blend":"50/50 with EMA"}
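The two averaging records above combine as: an EMA with decay 0.9965 runs throughout training, an SWA average collects every 5th checkpoint over the last 33% of steps, and the two are blended 50/50 at the end. A scalar stands in for the parameter tensors in this sketch:

```python
# Hedged sketch of EMA (decay 0.9965) + SWA (last 33%, every 5 steps)
# with a final 50/50 blend. A scalar weight stands in for each tensor.

def average_weights(weights, total_steps, ema_decay=0.9965, swa_every=5, swa_frac=0.33):
    ema = weights[0]
    swa_sum, swa_n = 0.0, 0
    swa_start = int(total_steps * (1 - swa_frac))
    for step, w in enumerate(weights):
        ema = ema_decay * ema + (1 - ema_decay) * w      # running EMA
        if step >= swa_start and step % swa_every == 0:  # SWA window
            swa_sum += w
            swa_n += 1
    swa = swa_sum / swa_n
    return 0.5 * ema + 0.5 * swa                         # 50/50 blend
```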
Evaluation
sliding window eval
parameters: {"stride":64}
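Sliding-window evaluation with stride 64 scores only the last 64 tokens of each window, so every token is evaluated exactly once with near-maximal left context. The index bookkeeping can be sketched as below; the window size here is illustrative.

```python
# Hedged sketch of sliding-window evaluation with stride 64: each token is
# scored once, inside a window giving it near-maximal left context.

def sliding_windows(n_tokens, window=2048, stride=64):
    """Yield (window_start, score_start, score_end) index triples."""
    spans = []
    for score_start in range(0, n_tokens, stride):
        score_end = min(score_start + stride, n_tokens)
        window_start = max(0, score_end - window)   # context preceding scored span
        spans.append((window_start, score_start, score_end))
    return spans
```

A small stride trades extra forward passes for better per-token context, which directly lowers val_bpb.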
Test-Time Training
score-first TTT
parameters: {"chunk_size":32768,"optimizer":"SGD","momentum":0.9,"learning_rate":0.01,"epochs":3,"gradient_clip":1}
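"Score-first" TTT means each chunk is scored with the current weights before the model adapts to it, so the reported bits never benefit from having seen the chunk. A toy scalar sketch with the listed hyperparameters (SGD, momentum 0.9, lr 0.01, 3 epochs, clip 1); the real method operates on tensors and a language-model loss, which is assumed away here:

```python
# Hedged sketch of score-first TTT: score each chunk BEFORE training on it,
# then adapt with clipped SGD+momentum. Toy scalar model for illustration.

def ttt(chunks, score_fn, grad_fn, w0, lr=0.01, momentum=0.9, epochs=3, clip=1.0):
    w, v, scores = w0, 0.0, []
    for chunk in chunks:
        scores.append(score_fn(w, chunk))                    # score first
        for _ in range(epochs):                              # then adapt
            g = max(-clip, min(clip, grad_fn(w, chunk)))     # gradient clip at 1
            v = momentum * v + g
            w = w - lr * v
    return w, scores
```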
Quantization
GPTQ
bits: 6
scope: all
Compression
brotli
level: 11
Sequence Length
sequence_length
train_length: 2048
eval_length: 32768
LR Schedule
warmdown
parameters: {"frac":0.72}
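One common reading of a warmdown schedule with frac 0.72 is: hold the base learning rate, then decay to zero over the final 72% of training. The linear decay shape is an assumption.

```python
# Hedged sketch of the warmdown LR schedule (frac=0.72): constant base LR,
# then an assumed-linear decay to zero over the last 72% of steps.

def warmdown_lr(step, total_steps, base_lr=1.0, frac=0.72):
    start = total_steps * (1 - frac)     # warmdown begins 28% of the way in
    if step < start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - start)
```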
Regularization
logit softcap
parameters: {"value":30}
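Logit softcapping with value 30 squashes logits through a scaled tanh so they stay within ±30, bounding the loss and stabilizing training while remaining nearly identity for small logits:

```python
import math

# Hedged sketch of logit softcapping (cap=30): cap * tanh(logit / cap)
# bounds logits to (-30, 30) and is approximately identity near zero.

def softcap(logit, cap=30.0):
    return cap * math.tanh(logit / cap)
```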
weight decay
parameters: {"value":0.095}
Novel Contributions
- Custom sp9000 SentencePiece BPE tokenizer trained on competition data
- 12-layer Transformer with depth recurrence for 16 effective layers
- Code-level step-time optimization using foreach operations and layout/precomputation improvements
- Improved score-first TTT with tuned hyperparameters
- Zero-initialized bigram hash embedding trained during TTT