PR #397
open
Record: Dynamic Eval + TTT on SOTA Pipeline (val_bpb=1.1364)
by translatingthename
val_bpb
1.1364
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.65 MB
Training Techniques
Architecture
XSA
Exclusive Self Attention applied to the last 4 layers.
parameters: {"layers":4}
Partial RoPE
Rotary positional embeddings applied partially.
parameters: {"dimensions":16}
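As an illustration of the partial-RoPE entry above, here is a minimal sketch that rotates only the first 16 entries of a single attention-head vector and passes the rest through unchanged. It assumes that "dimensions" means rotated dimensions per head, and it uses an interleaved pairing and a frequency base of 10000; none of these details is stated in the PR summary, and the function name `partial_rope` is hypothetical.

```python
import math

def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of one head vector.

    Assumptions (not from the PR): interleaved (even, odd) pairing and
    the common base of 10000. Entries past `rotary_dims` are untouched.
    """
    assert rotary_dims % 2 == 0 and rotary_dims <= len(vec)
    out = list(vec)
    for i in range(rotary_dims // 2):
        # Each pair gets its own rotation frequency.
        theta = position * base ** (-2.0 * i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * c - y * s
        out[2 * i + 1] = x * s + y * c
    return out
```

Because each pair is rotated rather than scaled, the vector norm is preserved, and the unrotated tail carries no positional signal.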
SmearGate
Custom gating mechanism used in the model.
parameters: null
BigramHash
Bigram-based hashing component used in the model.
parameters: null
Weight Averaging
EMA
parameters: {"decay":0.997}
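The EMA entry above can be sketched as a single update rule applied after each optimizer step. This is an illustrative sketch over flat lists of floats; the name `ema_update` and the flat-list representation are assumptions, and the summary does not say when the averaged weights are swapped in.

```python
def ema_update(shadow, params, decay=0.997):
    """Return the new averaged weights after one training step.

    shadow: current averaged weights; params: current model weights.
    With decay=0.997 each step moves the average 0.3% toward `params`.
    """
    return [decay * s + (1.0 - decay) * p for s, p in zip(shadow, params)]
```

With a decay of 0.997 the average has an effective horizon of roughly 1 / (1 - 0.997), or about 333 steps.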
Regularization
LN scale
parameters: null
Quantization
QAT
bits: 6
scope: all
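To make the 6-bit QAT entry concrete, the sketch below shows the forward-pass "fake quantization" step usually used in quantization-aware training: weights are snapped to a 6-bit grid and mapped back to floats. It assumes symmetric per-tensor scaling with integer levels in [-31, 31]; the PR summary does not specify the scheme, and `fake_quantize` is a hypothetical name. In training, this is usually paired with a straight-through estimator for the backward pass, which is not shown here.

```python
def fake_quantize(weights, bits=6):
    """Snap weights to a symmetric `bits`-bit grid and return floats.

    Assumption (not from the PR): one scale per tensor, chosen so the
    largest magnitude maps to the top integer level.
    """
    qmax = 2 ** (bits - 1) - 1            # 31 for 6 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)              # nothing to scale
    scale = max_abs / qmax
    quantized = []
    for w in weights:
        level = max(-qmax, min(qmax, round(w / scale)))
        quantized.append(level * scale)
    return quantized
```

The rounding error per weight is bounded by half a grid step, `scale / 2`.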
Compression
zstd
level: null
Initialization
OrthoInit
Orthogonal initialization strategy.
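One common way to obtain an orthogonal initial weight matrix is to orthonormalize random Gaussian rows. The sketch below does this with modified Gram-Schmidt in pure Python for a square matrix; it is only an illustration of the general idea, since the summary does not say how OrthoInit is implemented, and `orthogonal_matrix` is a hypothetical helper.

```python
import random

def orthogonal_matrix(n, seed=0):
    """Return an n x n matrix (list of rows) with orthonormal rows."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Remove the components along the rows already accepted.
        for r in rows:
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        norm = sum(a * a for a in v) ** 0.5
        rows.append([a / norm for a in v])
    return rows
```

A matrix built this way preserves the norm of any vector it multiplies, which is the property orthogonal initialization is normally chosen for.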
Test-Time Training
full TTT
parameters: {"learning_rate":0.002,"epochs":3,"freeze_blocks":2,"momentum":0.9}
Evaluation
sliding window eval
parameters: {"stride":64,"batch_size":32,"adapt_every_batches":4}
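The sliding-window evaluation above can be sketched as a function that plans which token positions each window sees and which it scores. This is a sketch under assumptions: the context length `seq_len` is a hypothetical parameter not given in the summary, and the rule that each window scores only its newest `stride` positions is the usual convention rather than something the PR states.

```python
def sliding_windows(num_tokens, seq_len, stride=64):
    """Plan evaluation windows as (start, end, score_from) triples.

    The model reads tokens[start:end]; only positions in
    [score_from, end) count toward the loss, so every position is
    scored exactly once, with up to `seq_len` tokens of context.
    """
    windows = []
    scored_up_to = 0
    while scored_up_to < num_tokens:
        if scored_up_to == 0:
            end = min(seq_len, num_tokens)        # first window scores everything
        else:
            end = min(scored_up_to + stride, num_tokens)
        start = max(0, end - seq_len)
        windows.append((start, end, scored_up_to))
        scored_up_to = end
    return windows
```

For example, `sliding_windows(300, 128)` yields four windows; windows like these would then be grouped into batches (32 per batch in the parameters above) for scoring.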
Optimizer
SGD
weight_decay: null
momentum: 0
other_params: {"learning_rate":0.001,"rank_local":true}
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"warmup_steps":1500}
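A schedule with these two parameters might look like the sketch below: a ramp over the first 1500 steps, a flat plateau, then a decay to zero over the final 3000 steps. The linear shapes, the `total_steps` argument, and the reading of `warmup_steps` as a learning-rate warmup are all assumptions; the summary gives only the two numbers.

```python
def lr_scale(step, total_steps, warmup_steps=1500, warmdown_iters=3000):
    """Return a multiplier in [0, 1] for the base learning rate.

    Assumes total_steps >= warmup_steps + warmdown_iters and linear
    ramps on both ends (neither is stated in the PR summary).
    """
    if step < warmup_steps:
        return (step + 1) / warmup_steps          # linear warmup
    warmdown_start = total_steps - warmdown_iters
    if step >= warmdown_start:
        # Linear decay to zero over the last `warmdown_iters` steps.
        return max(0.0, (total_steps - step) / warmdown_iters)
    return 1.0                                    # plateau
```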
Novel Contributions
- Dynamic evaluation during validation scoring using periodic SGD steps on sliding windows.
- Combining dynamic evaluation with TTT on the SOTA pipeline without changing training.
- Zero additional artifact cost while improving validation bpb.
- Rank-local adaptation during evaluation with batched window scoring.
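The dynamic-evaluation idea in the list above can be sketched with a toy model. The loop below scores each batch with the current weights, accumulates the gradient, and takes one plain SGD step (momentum 0) every `adapt_every_batches` batches. The score-then-adapt ordering is the usual convention for dynamic evaluation and is assumed here, as is the toy "model" (one logit per token); the real pipeline adapts a Transformer, and "rank-local" is read as each process adapting its own copy of the weights without synchronizing gradients.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dynamic_eval(batches, vocab_size, lr=0.001, adapt_every_batches=4):
    """Return mean NLL in nats per token under score-then-adapt.

    batches: list of lists of token ids. The toy model is a single
    logit vector shared by all positions (an assumption for brevity).
    """
    logits = [0.0] * vocab_size
    grad = [0.0] * vocab_size
    pending = 0
    total_nll, total_tokens = 0.0, 0
    for i, batch in enumerate(batches, start=1):
        probs = softmax(logits)
        # 1) Score the batch before the model has learned from it.
        for tok in batch:
            total_nll -= math.log(probs[tok])
        total_tokens += len(batch)
        # 2) Accumulate d(NLL)/d(logits) = probs - one_hot(tok).
        for v in range(vocab_size):
            grad[v] += len(batch) * probs[v]
        for tok in batch:
            grad[tok] -= 1.0
        pending += len(batch)
        # 3) Periodic SGD step on the averaged gradient.
        if i % adapt_every_batches == 0:
            logits = [w - lr * g / pending for w, g in zip(logits, grad)]
            grad = [0.0] * vocab_size
            pending = 0
    return total_nll / total_tokens
```

Because adaptation changes only the in-memory weights during scoring, nothing is added to the stored artifact, which is consistent with the "zero additional artifact cost" claim.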