PR #1451

open

[10min/16mb] David Ghazaryan — MoE + BigramHash4096 (mean BPB: 1.11799)

by davie2009kh
val_bpb: 1.1180
Architecture: Transformer
Optimizer: Adam
Artifact Size: 15,908,116 bytes

Training Techniques

Architecture
BigramHash
Expanded bigram vocabulary from 3072 to 4096.
parameters: {"vocab_size":4096}
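The bigram-hash idea can be sketched as follows: each (previous token, current token) pair is hashed into one of 4096 buckets, which then index an auxiliary embedding table. The function name and the multiplicative mix constant are illustrative; the PR does not specify its exact hash.

```python
import numpy as np

def bigram_hash_ids(ids, vocab_size=4096):
    """Map each (prev, cur) token-id pair to a bucket in [0, vocab_size).
    The mix constant 1000003 is illustrative, not from the PR."""
    ids = np.asarray(ids)
    prev = np.roll(ids, 1, axis=-1)
    prev[..., 0] = 0  # no left context at position 0
    return (prev * 1000003 + ids) % vocab_size
```

The resulting bucket ids would index a (4096, d) embedding table whose vectors are added to the regular token embeddings; expanding from 3072 to 4096 buckets reduces hash collisions at a modest parameter cost.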
MoE
Mixture of Experts in the MLP layers.
parameters: null
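Since the PR reports no MoE parameters, here is a minimal sketch of what a Mixture-of-Experts MLP layer typically looks like, assuming top-1 routing with a softmax gate (the routing scheme, expert count, and shapes are assumptions, not taken from the PR):

```python
import numpy as np

def moe_mlp(x, w_router, experts):
    """Top-1 mixture-of-experts MLP: each token goes to the expert with
    the highest router score, scaled by its softmax gate value.
    x: (T, d); w_router: (d, E); experts: list of (W1, W2) weight pairs."""
    scores = x @ w_router                              # (T, E) router logits
    choice = scores.argmax(axis=-1)                    # (T,) chosen expert
    gates = np.exp(scores - scores.max(-1, keepdims=True))
    gates = gates / gates.sum(-1, keepdims=True)       # softmax gate values
    out = np.zeros_like(x)
    for e, (w1, w2) in enumerate(experts):
        mask = choice == e
        if mask.any():
            hid = np.maximum(x[mask] @ w1, 0.0)        # ReLU MLP expert
            out[mask] = (hid @ w2) * gates[mask, e:e+1]
    return out
```

With top-1 routing, only one expert's MLP runs per token, so parameter count grows with the number of experts while per-token compute stays roughly that of a single dense MLP.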
Test-Time Training
LoRA TTT
parameters: {"rank":8,"learning_rate":0.01}
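LoRA test-time training at rank 8 and learning rate 0.01 can be sketched on a single linear layer: the frozen weight W is augmented with a low-rank product B @ A, and only A and B are updated per document. The squared loss and shapes below are illustrative (the PR presumably adapts on the document's own next-token loss):

```python
import numpy as np

def lora_forward(x, W, A, B):
    """LoRA: effective weight is W + B @ A, with A: (r, d_in), B: (d_out, r).
    Only A and B are trained at test time; W stays frozen."""
    return x @ (W + B @ A).T

def ttt_step(x, y, W, A, B, lr=0.01):
    """One SGD step of per-document test-time training on a squared
    loss (illustrative stand-in for the model's actual LM loss)."""
    pred = lora_forward(x, W, A, B)   # (T, d_out)
    err = pred - y                    # dL/dpred for 0.5 * ||pred - y||^2
    g_eff = err.T @ x                 # gradient wrt the effective weight
    gA = B.T @ g_eff                  # chain rule through B @ A
    gB = g_eff @ A.T
    A = A - lr * gA
    B = B - lr * gB
    return A, B
```

Because A and B are reset for each document, adaptation from one document cannot leak into the next, which matches the "per-document" framing in the contributions list.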
Evaluation
stride-based eval
parameters: {"chunk_size":256,"eval_seq_len":1024,"batch_size":64}
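One plausible reading of these eval parameters: the model scores each document in fresh 256-token chunks, conditioning each chunk on up to 1024 tokens of left context drawn only from the same document. The span layout below mirrors the listed chunk_size and eval_seq_len; the exact scheme is a guess.

```python
def strided_eval_spans(doc_len, eval_seq_len=1024, chunk_size=256):
    """Return (context_start, score_start, end) spans for one document.
    Each span scores a fresh chunk_size-token chunk given at most
    eval_seq_len tokens of context from the SAME document only."""
    spans = []
    pos = 0
    while pos < doc_len:
        end = min(pos + chunk_size, doc_len)
        ctx_start = max(0, end - eval_seq_len)  # never crosses doc start
        spans.append((ctx_start, pos, end))
        pos = end
    return spans
```

Striding scores every token exactly once (the scored chunks tile the document without overlap), while clamping the context to the document start enforces the isolation claimed in the contributions list.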
Sequence Length
sequence_length
train_length: null
eval_length: 1024

Novel Contributions

  • BigramHash4096 — expanded bigram vocabulary from 3072 to 4096
  • MoE MLP — Mixture of Experts in the MLP layers
  • Per-document LoRA test-time training
  • Strided, document-isolated evaluation