PR #1440
[Submission] EngramLite + Mousse + Progressive Depth Recurrence + TTT — val_bpb 1.1026 | 15.95MB | 8×H100
Status: open · by Mertyandimata
val_bpb: 1.1026
Architecture: Transformer
Optimizer: Mousse
Artifact Size: 15.95MB
Training Techniques
Architecture
BigramHash
Replaced legacy BigramHash with EngramLite multi-head gated bigram+trigram hashing.
parameters: {"buckets":3072,"heads":2}
TrigramHash
Added trigram hashing as part of the EngramLite multi-order n-gram hash.
parameters: {"buckets":3072,"heads":2}
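A minimal NumPy sketch of how a hashed, gated bigram+trigram lookup of this shape could work. Bucket and head counts come from the PR; the embedding dimension, hash function, and gating form are assumptions:

```python
import numpy as np

BUCKETS, HEADS, DIM = 3072, 2, 64  # buckets/heads per the PR; DIM is an assumption

rng = np.random.default_rng(0)
# one embedding table per n-gram order (bigram, trigram) per head
tables = rng.normal(0, 0.02, size=(2, HEADS, BUCKETS, DIM))
gate_w = rng.normal(0, 0.02, size=(2, HEADS, DIM))  # per-head gate vectors (assumed form)
MASK = (1 << 64) - 1

def ngram_bucket(ids, head):
    # cheap multiplicative hash into BUCKETS, salted per head (hash choice is an assumption)
    h = head + 1
    for t in ids:
        h = (h * 1000003 + t) & MASK
    return h % BUCKETS

def engram_lite(tokens, t):
    """Gated sum of hashed bigram and trigram embeddings at position t."""
    out = np.zeros(DIM)
    for order, n in enumerate((2, 3)):  # n-gram orders: bigram, trigram
        if t + 1 < n:
            continue
        ids = tokens[t + 1 - n : t + 1]
        for head in range(HEADS):
            e = tables[order, head, ngram_bucket(ids, head)]
            g = 1.0 / (1.0 + np.exp(-gate_w[order, head] @ e))  # sigmoid gate
            out += g * e
    return out

toks = [5, 17, 17, 92]
v = engram_lite(toks, 3)  # in the real model this would be added to the residual stream
```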
depth recurrence
Repeated selected layers to increase effective depth via recurrence.
parameters: {"layers":[4,5],"effective_layers":13}
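The resulting layer schedule can be illustrated directly. An 11-layer base stack is assumed (consistent with 13 effective layers when layers 4 and 5 are each run twice); immediate re-application with shared weights is also an assumption:

```python
BASE_LAYERS = 11          # assumed base depth: 11 + 2 repeats = 13 effective layers
REPEAT = {4, 5}           # layers repeated, per the PR

order = []
for i in range(BASE_LAYERS):
    order.append(i)
    if i in REPEAT:
        order.append(i)   # run the same (weight-shared) layer again
```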
U-Net skip connections
Used U-Net-style skip connections with learned gates.
parameters: null
XSA
Applied value-orthogonal projection across all layers.
parameters: {"layers":11}
Partial RoPE
Used rotary position embeddings on only part of the head dimensions.
parameters: {"dimensions":16}
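A NumPy sketch of partial RoPE, rotating only 16 dimensions per head as listed; the head dimension of 64 and rotating the *first* dims are assumptions:

```python
import numpy as np

HEAD_DIM, ROPE_DIMS = 64, 16  # 16 rotary dims per the PR; HEAD_DIM is assumed

def partial_rope(x, pos, base=10000.0):
    """Rotate only the first ROPE_DIMS of the head dimension; pass the rest through."""
    rot, keep = x[..., :ROPE_DIMS], x[..., ROPE_DIMS:]
    half = ROPE_DIMS // 2
    freqs = base ** (-np.arange(half) / half)
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = rot[..., :half], rot[..., half:]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, keep], axis=-1)

q = np.random.default_rng(0).normal(size=HEAD_DIM)
q0 = partial_rope(q, pos=0)   # position 0: rotation is the identity
```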
LeakyReLU
Used LeakyReLU squared activation in the MLP.
parameters: {"slope":0.5}
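A plausible reading of "LeakyReLU squared" with slope 0.5 — LeakyReLU followed by squaring; the PR does not state how signs are handled, so plain squaring is an assumption:

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    """Squared LeakyReLU MLP activation (exact sign convention is not stated in the PR)."""
    y = np.where(np.asarray(x) >= 0, x, slope * np.asarray(x))
    return y * y
```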
weight tying
Tied the input embedding and output head weights.
parameters: null
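Weight tying in a nutshell — the output head is the same array as the token embedding, so one set of parameters serves both (sizes below are placeholders):

```python
import numpy as np

VOCAB, DIM = 50257, 64      # placeholder sizes, not from the PR
embed = np.zeros((VOCAB, DIM))
lm_head = embed             # tied: logits = hidden @ lm_head.T shares embedding storage
embed[0, 0] = 1.0           # an update to the embedding is visible through the head
```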
KV head count
Used grouped key/value heads in the transformer.
parameters: {"attention_heads":8,"kv_heads":4}
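A NumPy sketch of grouped-query attention with the listed 8 query / 4 KV heads: each pair of query heads shares one K/V head (head dim and sequence length below are placeholders):

```python
import numpy as np

N_HEADS, KV_HEADS, HEAD_DIM, T = 8, 4, 16, 5  # 8 query / 4 kv heads per the PR

def grouped_attention(q, k, v):
    """Each group of N_HEADS // KV_HEADS query heads shares one K/V head (GQA)."""
    group = N_HEADS // KV_HEADS
    k = np.repeat(k, group, axis=0)  # (KV_HEADS, T, D) -> (N_HEADS, T, D)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(HEAD_DIM)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)    # softmax over keys
    return w @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(N_HEADS, T, HEAD_DIM))
k = rng.normal(size=(KV_HEADS, T, HEAD_DIM))
v = rng.normal(size=(KV_HEADS, T, HEAD_DIM))
out = grouped_attention(q, k, v)
```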
Optimizer
Mousse
weight_decay: 0.09
momentum: 0.99
other_params: {"matrix_lr":0.025}
AdamW
weight_decay: 0.02
momentum: null
other_params: {"scalar_lr":0.025}
Weight Averaging
EMA
parameters: {"decay":0.997}
EMA
parameters: {"decay":0.995,"start_step":892}
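The EMA update itself is the standard recurrence; a dict-of-weights sketch that also handles the listed `start_step` (before which the average simply tracks the weights — an assumed convention):

```python
def ema_update(ema, weights, decay=0.997, step=0, start_step=0):
    """One EMA step over a dict of weights; before start_step, track weights exactly."""
    if step < start_step:
        return dict(weights)
    return {k: decay * ema[k] + (1.0 - decay) * w for k, w in weights.items()}

ema = ema_update({"w": 0.0}, {"w": 1.0}, decay=0.9, step=100)
```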
Quantization
late QAT
bits: 6
scope: all
GPTQ
bits: 6
scope: all
Evaluation
sliding window eval
parameters: {"stride":64}
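With stride 64 and the 1024-token eval length listed below, sliding-window eval typically scores only the last `stride` tokens of each window so every token gets near-full left context; the exact windowing scheme is an assumption:

```python
def sliding_windows(n_tokens, window=1024, stride=64):
    """Plan sliding-window eval: the first window scores all its tokens, each later
    window scores only its last `stride` tokens, so every token is scored once."""
    out = []
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        score_from = start if start == 0 else min(start + window - stride, end)
        out.append((start, end, score_from))
        if end == n_tokens:
            break
    return out
```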
Test-Time Training
score-first TTT
parameters: {"epochs":3,"learning_rate":0.01,"reset_per_chunk":0}
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
LR Schedule
warmdown
parameters: {"warmdown_steps":3500}
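A warmdown schedule is usually a constant LR followed by a linear decay to zero over the final `warmdown_steps`; the constant phase and the total step count below are assumptions:

```python
def lr_scale(step, total_steps, warmdown_steps=3500):
    """Multiplier on the base LR: 1.0 until the final warmdown_steps, then linear to 0."""
    if step < total_steps - warmdown_steps:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)
```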
Regularization
weight decay
parameters: {"muon_embed":0.09,"adam":0.02}
logit softcap
parameters: {"value":30}
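Logit softcapping with value 30 is conventionally a tanh squash that bounds logits to (-30, 30) while leaving small logits almost unchanged:

```python
import numpy as np

def softcap(logits, cap=30.0):
    """Smoothly bound logits to (-cap, cap); near-identity for |logits| << cap."""
    return cap * np.tanh(np.asarray(logits) / cap)
```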
Novel Contributions
- EngramLite multi-head gated bigram+trigram hash
- Mousse optimizer with diagonal curvature-aware Muon preconditioning
- Progressive Depth Recurrence with phased activation
- Score-first full-weight TTT outperforming LoRA TTT on this architecture
- Auto-QMax artifact packing
- Adaptive Markov curriculum from the previous Raki v5 approach