PR #489
openRecord: 7L MLP3x + BigramHash + SmearGate + TTT 5ep (mean val_bpb=1.1327)
by sofiabod
val_bpb: 1.1327
Architecture: Transformer
Optimizer: Muon
Artifact Size: —
Training Techniques
Architecture
MLP3x
Transformer MLP widened to 3x with ReLU² activations.
parameters: null
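
A minimal sketch of this block, assuming a hypothetical `d_model` and bias-free projections (neither is specified above):

```python
import torch
import torch.nn as nn

class MLP3x(nn.Module):
    # Transformer MLP widened to 3x the model dim, with ReLU^2 activation.
    def __init__(self, d_model: int):
        super().__init__()
        self.up = nn.Linear(d_model, 3 * d_model, bias=False)
        self.down = nn.Linear(3 * d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.up(x)) ** 2  # ReLU^2: squared ReLU
        return self.down(h)
```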
BigramHash
Hashes consecutive token pairs into learned embeddings added before RMSNorm.
parameters: {"hash_size":2048,"dim":128}
SmearGate
Per-dimension learned gate blending each token with the previous token.
parameters: null
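
A minimal sketch; only the per-dimension learned gate is stated above, so the convex-blend form and the gate initialization are assumptions:

```python
import torch
import torch.nn as nn

class SmearGate(nn.Module):
    # Per-dimension learned gate blending each token with the previous one.
    def __init__(self, dim: int):
        super().__init__()
        # Init near sigmoid(-2) ~= 0.12 so smearing starts small (assumption).
        self.gate = nn.Parameter(torch.full((dim,), -2.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim)
        prev = torch.roll(x, shifts=1, dims=1)
        prev[:, 0] = x[:, 0]  # first position has no predecessor
        g = torch.sigmoid(self.gate)
        return (1 - g) * x + g * prev
```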
Partial RoPE
Applies rotary embeddings to only part of the head dimensions.
parameters: {"rotary_dims":16,"total_dims":64}
tied embeddings
Input and output embeddings are tied.
parameters: {"vocab_size":1024}
Regularization
LN scale depth damping
LayerNorm scale parameters are initialized per layer as 1/sqrt(layer_idx+1), damping deeper layers at initialization.
parameters: {"init_scale_rule":"1/sqrt(layer_idx+1)"}
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"tied_embedding_lr":0.01,"matrix_lr":0.03,"logit_softcap":15}
Compression
zlib
level: null
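
A minimal sketch of zlib-compressing a checkpoint at the default level (consistent with `level: null`); the serialization path is an assumption:

```python
import io
import zlib
import torch

def compress_state_dict(model: torch.nn.Module) -> bytes:
    # Serialize the state dict, then compress with zlib's default level.
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    return zlib.compress(buf.getvalue())
```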
Test-Time Training
AdamW TTT
parameters: {"learning_rate":0.0005,"weight_decay":0,"epochs":5}
Evaluation
sliding window eval
parameters: {"stride":64}
Sequence Length
sequence_length
train_length: 4096
eval_length: 4096
LR Schedule
warmdown
parameters: {"warmdown_steps":6000}
Novel Contributions
- BigramHash(2048) token-pair hashing with learned embeddings
- SmearGate token blending mechanism
- Partial RoPE applied to 25% of head dimensions
- Layer-wise depth damping of LN scales
- AdamW test-time training for 5 epochs
- Sliding window evaluation with stride 64
- 7-layer transformer with MLP3x ReLU²