PR #1451
open [10 min / 16 MB] David Ghazaryan — MoE + BigramHash4096 (mean BPB: 1.11799)
by davie2009kh
val_bpb: 1.1180
Architecture: Transformer
Optimizer: Adam
Artifact Size: 15,908,116 bytes
Training Techniques
Architecture
BigramHash
Expanded bigram vocabulary from 3072 to 4096.
parameters: {"vocab_size":4096}
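A hashed bigram embedding like this one maps each (previous, current) token pair into a fixed-size auxiliary vocabulary; the resulting ids index an extra embedding table added to the usual token embeddings. A minimal sketch with the PR's vocab_size of 4096 (the hash constant and function name are illustrative, not taken from the PR):

```python
import numpy as np

def bigram_hash_ids(tokens, bigram_vocab_size=4096):
    """Map each (previous, current) token pair to an id in a fixed bigram vocab."""
    tokens = np.asarray(tokens)
    prev = np.concatenate([[0], tokens[:-1]])  # no predecessor at position 0
    # cheap multiplicative hash of the pair, folded into the bigram vocab
    return (prev * 1000003 + tokens) % bigram_vocab_size
```

Because the pair space is far larger than 4096, distinct bigrams collide and share an embedding row; expanding the vocabulary from 3072 to 4096 simply reduces the collision rate.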
MoE
Mixture of Experts in the MLP layers.
parameters: null
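Since the PR reports no parameters for the MoE layer, here is a generic top-k routed MLP sketch of the technique (expert count, ReLU experts, and softmax gating are all assumptions): a gate scores each token, and only the highest-scoring expert MLPs run for that token, weighted by the gate probability.

```python
import numpy as np

def moe_mlp(x, w_gate, experts, top_k=1):
    """Mixture-of-experts MLP: route each token to its top_k experts.
    x: (tokens, d); w_gate: (d, n_experts); experts: list of (w_in, w_out)."""
    logits = x @ w_gate                                    # (tokens, n_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)                  # softmax gate
    top = np.argsort(-probs, axis=-1)[:, :top_k]           # chosen experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in top[t]:
            w_in, w_out = experts[e]
            h = np.maximum(x[t] @ w_in, 0.0)               # ReLU expert MLP
            out[t] += probs[t, e] * (h @ w_out)
    return out
```

With top_k=1 each token activates one expert, so parameter count grows with the number of experts while per-token compute stays close to a dense MLP.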
Test-Time Training
LoRA TTT
parameters: {"rank":8,"learning_rate":0.01}
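Per-document LoRA test-time training adapts a small low-rank update to each document before scoring it, leaving the base weights frozen. A sketch of one update step using the reported rank 8 and learning rate 0.01; the single linear layer and squared-error loss are stand-ins for the real model and next-token objective:

```python
import numpy as np

def lora_ttt_step(W, A, B, x, y, lr=0.01):
    """One test-time training step on one document's own data.
    W: frozen base weight (d_out, d_in); A: (r, d_in), B: (d_out, r) LoRA factors.
    Only A and B are updated; W never changes."""
    pred = (W + B @ A) @ x                 # adapted forward pass
    err = pred - y
    # gradients of 0.5 * ||err||^2 w.r.t. the LoRA factors only
    grad_B = np.outer(err, A @ x)          # (d_out, r)
    grad_A = np.outer(B.T @ err, x)        # (r, d_in)
    return A - lr * grad_A, B - lr * grad_B
```

Initializing B to zeros (the usual LoRA convention) makes the adapted model start out identical to the base model, so each document begins from the same weights and the adapters are discarded afterward.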
Evaluation
stride-based eval
parameters: {"chunk_size":256,"eval_seq_len":1024,"batch_size":64}
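One way to realize a strided, document-isolated evaluation with these parameters: slide through each document scoring chunk_size=256 new tokens per step, each inside a window of at most eval_seq_len=1024 tokens drawn only from the same document. The exact window layout is an assumption; the PR specifies only the parameters above.

```python
def strided_eval_positions(doc_len, eval_seq_len=1024, chunk_size=256):
    """Yield (context_start, score_start, end) spans for one document.
    Each token is scored exactly once, with up to eval_seq_len - chunk_size
    tokens of preceding in-document context; documents never mix."""
    pos = 0
    while pos < doc_len:
        end = min(pos + chunk_size, doc_len)
        context_start = max(0, end - eval_seq_len)
        yield context_start, pos, end
        pos = end
```

Scoring only the trailing chunk of each window keeps per-token context long (up to 768 prior tokens once past the document start) while each token still contributes to the loss exactly once, which is what makes the per-document BPB well defined.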
Sequence Length
train_length: null
eval_length: 1024
Novel Contributions
- BigramHash4096 — expanded bigram vocabulary from 3072 to 4096
- MoE MLP — Mixture of Experts in the MLP layers
- Per-document LoRA test-time training
- Strided, document-isolated evaluation