PR #1244 (open)

Non-record RTX4060Ti 11L LeakyTTT 24h local (1.1443 BPB)

by monkeyKingProgrammer
val_bpb: 1.1443
Architecture: Transformer
Optimizer: Parallel Muon
Artifact Size: 15.70 MB

Training Techniques

Architecture
LeakyReLU
Uses LeakyReLU^2 MLP activation in an 11-layer Transformer.
parameters: {"layers":11,"d_model":512,"heads":8,"kv_heads":4}
XSA
Applies XSA to the last 4 layers.
parameters: {"last_n_layers":4}
Partial RoPE
Uses partial rotary positional embeddings.
parameters: {"rope_dims":16,"total_dims":64}
BigramHash
Adds a bigram hash embedding component.
parameters: {"vocab_size":1536}
VE128
Uses VE128 on selected layers.
parameters: {"layers":[9,10],"dimension":128}
Optimizer
Parallel Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500}
Regularization
LN scale
parameters: null
weight decay
parameters: {"muon_wd":0.04,"adam_wd":0.04}
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"every":50}
Quantization
int6
bits: 6
scope: all
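
A sketch of the int6 export. Symmetric, per-tensor scaling is an assumption (the entry only fixes 6 bits applied to all tensors), and the values are kept in int8 containers here rather than bit-packed.

```python
import torch

def quantize_int6(w: torch.Tensor):
    """Symmetric per-tensor quantization to the int6 range [-31, 31]."""
    qmax = 2 ** (6 - 1) - 1                        # 31
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax, qmax).to(torch.int8)
    return q, scale

def dequantize_int6(q: torch.Tensor, scale: torch.Tensor):
    return q.float() * scale
```
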
Compression
lzma
level: null
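
A sketch of the lzma step. Since the level is left unset above, the library default preset is used, and the file names are placeholders.

```python
import lzma

def compress_artifact(path_in="model_int6.bin", path_out="model_int6.bin.xz"):
    """LZMA-compress the quantized weight blob for the size-limited artifact."""
    with open(path_in, "rb") as src, lzma.open(path_out, "wb") as dst:
        dst.write(src.read())
```
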
Evaluation
sliding window eval
parameters: {"stride":64}
Test-Time Training
score-first TTT
parameters: {"learning_rate":0.002,"momentum":0.9,"epochs":3,"chunk_tokens":32768}
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}

Novel Contributions

  • 11-layer local unlimited-compute Transformer submission tuned for 24 hours on a single RTX 4060 Ti 16GB
  • LeakyReLU^2 + Parallel Muon + XSA + Partial RoPE + LN scale + EMA stack
  • BigramHash and VE128 architectural additions
  • int6 export with lzma compression to fit the 16MB artifact limit
  • Sliding-window evaluation followed by rules-legal, score-first test-time training (each chunk is scored before being trained on)