PR #164
Submission: OrthoInit + Int6 MLP3x + SmearGate + BigramHash (val_bpb: 1.1524)
by jfprincz
val_bpb
1.1524
Architecture
Transformer
Optimizer
Muon
Artifact Size
15.4 MB
Training Techniques
Initialization
OrthoInit
Orthogonal initialization for large matrices with muP-style scaling of output projections by 1/sqrt(2 * layers) to improve early convergence.
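A minimal sketch of this initialization, assuming a PyTorch model whose attention/MLP weights are 2-D nn.Linear matrices and whose residual output projections are named c_proj; the module naming and the n_layer argument are assumptions, not taken from this PR.

```python
import math
import torch
import torch.nn as nn

def ortho_init(model: nn.Module, n_layer: int) -> None:
    """Orthogonal init for large matrices, muP-style down-scaling of output projections."""
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            # Orthogonal initialization for every 2-D weight matrix.
            nn.init.orthogonal_(module.weight)
            # Scale residual-stream output projections by 1/sqrt(2 * layers) so the
            # residual variance stays roughly constant with depth.
            if "c_proj" in name:  # assumed naming convention for output projections
                with torch.no_grad():
                    module.weight.mul_(1.0 / math.sqrt(2 * n_layer))
            if module.bias is not None:
                nn.init.zeros_(module.bias)
```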
Architecture
MLP3x
Expanded MLP hidden size to 1536, increasing model capacity.
parameters: {"hidden_size":1536}
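For context, a minimal sketch of the widened block; the 512 model dimension (which would make 1536 a 3x hidden size) and the GELU activation are assumptions.

```python
import torch.nn as nn

class MLP3x(nn.Module):
    """Feed-forward block with a 3x-wide hidden layer."""
    def __init__(self, d_model: int = 512, hidden_size: int = 1536):
        super().__init__()
        self.c_fc = nn.Linear(d_model, hidden_size)    # expand to the 1536-wide hidden state
        self.act = nn.GELU()                           # activation choice is an assumption
        self.c_proj = nn.Linear(hidden_size, d_model)  # project back to the residual stream

    def forward(self, x):
        return self.c_proj(self.act(self.c_fc(x)))
```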
SmearGate
Learned sigmoid gate blending each token embedding with the previous token embedding before the first transformer layer.
parameters: null
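A minimal sketch, assuming a per-channel gate and a convex blend of the current and previous embeddings; neither detail is stated in the PR.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmearGate(nn.Module):
    """Blend each token embedding with the previous token's embedding via a learned sigmoid gate."""
    def __init__(self, d_model: int):
        super().__init__()
        # One gate logit per channel; the sigmoid keeps the blend weight in (0, 1).
        self.gate_logit = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model). Shift right by one position; position 0 gets zeros.
        prev = F.pad(x, (0, 0, 1, 0))[:, :-1, :]
        g = torch.sigmoid(self.gate_logit)
        # Convex blend of current and previous embeddings (blend form is an assumption).
        return (1.0 - g) * x + g * prev
```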
BigramHash
Hash-based bigram embedding table injecting token-pair features.
parameters: {"buckets":4096,"input_dim":128,"output_dim":512}
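A minimal sketch using the listed parameters; the hash function and the way the feature is injected (a learned projection added to the token embeddings) are assumptions.

```python
import torch
import torch.nn as nn

class BigramHash(nn.Module):
    """Hash each (prev_token, token) pair into a small table and emit a token-pair feature."""
    def __init__(self, buckets: int = 4096, input_dim: int = 128, output_dim: int = 512):
        super().__init__()
        self.buckets = buckets
        self.table = nn.Embedding(buckets, input_dim)  # per-bucket bigram feature
        self.proj = nn.Linear(input_dim, output_dim)   # project up to the model dimension

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq) int64 ids.
        prev = torch.roll(tokens, shifts=1, dims=1)
        prev[:, 0] = 0  # no previous token at position 0
        # Simple multiplicative hash of the pair into [0, buckets); the hash is an assumption.
        idx = (prev * 1000003 + tokens) % self.buckets
        return self.proj(self.table(idx))  # (batch, seq, output_dim), added to the embeddings
```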
Quantization
mixed int6/int8
bits: 6
scope: MLP and attention weights in int6; embedding and bigram tables in int8; control parameters in fp32
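A minimal sketch of symmetric round-to-nearest quantization with a single fp32 scale per tensor; the PR's actual scaling/grouping scheme is not specified.

```python
import torch

def quantize_symmetric(w: torch.Tensor, bits: int):
    """Symmetric round-to-nearest quantization; returns int values (stored in int8) plus an fp32 scale."""
    qmax = 2 ** (bits - 1) - 1                        # 31 for int6, 127 for int8
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale.float()

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

# Per the scope above: MLP/attention matrices -> 6-bit, embedding and bigram tables -> 8-bit,
# control parameters stay fp32 and are stored unquantized.
```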
Optimizer
Muon
weight_decay: null
momentum: 0.99
other_params: {"matrix_lr":0.02,"scalar_lr":0.02,"tied_embed_lr":0.03,"warmup_start":0.92,"warmup_steps":1500,"warmdown_iters":3000,"grad_clip":0.3}
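A hedged sketch of how parameters could be split into groups for these per-group learning rates; the grouping heuristic (by tensor rank and name) is an assumption, and the resulting groups would then be handed to Muon and whatever optimizer covers the non-matrix parameters.

```python
import torch.nn as nn

def build_param_groups(model: nn.Module, matrix_lr=0.02, scalar_lr=0.02, tied_embed_lr=0.03):
    """Split parameters into matrix / scalar / tied-embedding groups (heuristic is an assumption)."""
    matrix, scalar, embed = [], [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if "embed" in name or "lm_head" in name:  # tied embedding / output head
            embed.append(p)
        elif p.ndim >= 2:                         # weight matrices, the usual Muon targets
            matrix.append(p)
        else:                                     # gains, biases, gates, other scalars
            scalar.append(p)
    return [
        {"params": matrix, "lr": matrix_lr},
        {"params": scalar, "lr": scalar_lr},
        {"params": embed, "lr": tied_embed_lr},
    ]
```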
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
Evaluation
sliding window eval
parameters: {"stride":256}
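A minimal sketch of sliding-window evaluation with stride 256: each 2048-token window is scored, but only positions not already covered by a previous window contribute, so tokens are predicted with near-full left context. The model call signature and the bytes-per-token conversion are assumptions.

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, tokens, window=2048, stride=256, bytes_per_token=1.0):
    """Sliding-window bits-per-byte over a 1-D token stream."""
    nll, n_pred, prev_end = 0.0, 0, 0
    for begin in range(0, tokens.size(0), stride):
        end = min(begin + window, tokens.size(0))
        new = end - prev_end                      # tokens first covered by this window
        ids = tokens[begin:end].unsqueeze(0)      # (1, <=window)
        logits = model(ids)                       # assumed: returns (1, T, vocab) next-token logits
        loss = F.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="none")
        keep = loss[-new:]                        # score only the newly covered positions
        nll += keep.sum().item()
        n_pred += keep.numel()
        prev_end = end
        if end == tokens.size(0):
            break
    # Mean nats per token -> bits per byte (bytes_per_token is the eval set's byte/token ratio).
    return (nll / n_pred) / math.log(2) / bytes_per_token
```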
LR Schedule
warmup + warmdown
parameters: {"warmup_steps":1500,"warmdown_steps":3000}
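A minimal sketch of the schedule multiplier, assuming linear warmup over 1500 steps and a linear warmdown to zero over the final 3000 steps; the exact ramp shapes are not stated here.

```python
def lr_multiplier(step: int, total_steps: int,
                  warmup_steps: int = 1500, warmdown_steps: int = 3000) -> float:
    """Linear warmup, flat plateau, then linear warmdown to zero at the end of training."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    if step > total_steps - warmdown_steps:
        return max(0.0, (total_steps - step) / warmdown_steps)
    return 1.0
```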
Compression
zstd
level: 22
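A minimal sketch of packing the artifact with zstd at level 22 via the zstandard Python bindings; the checkpoint layout and file name are assumptions.

```python
import io
import torch
import zstandard as zstd

def save_artifact(state_dict, path="artifact.pt.zst"):
    """Serialize the (quantized) state dict and compress it with zstd level 22."""
    buf = io.BytesIO()
    torch.save(state_dict, buf)
    cctx = zstd.ZstdCompressor(level=22)
    with open(path, "wb") as f:
        f.write(cctx.compress(buf.getvalue()))

def load_artifact(path="artifact.pt.zst"):
    dctx = zstd.ZstdDecompressor()
    with open(path, "rb") as f:
        return torch.load(io.BytesIO(dctx.decompress(f.read())))
```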
Novel Contributions
- Orthogonal + muP-scaled initialization for faster early convergence
- 3x wider MLP to increase capacity within the artifact budget
- Mixed int6/int8 quantization to reduce artifact size
- SmearGate token embedding blending with previous-token context
- BigramHash embedding for token-pair feature injection
- Tuned Muon optimizer settings with warmup and warmdown
- Training and evaluation at 2048-token sequence length with NTK-aware RoPE (see the sketch after this list)
- FlashAttention 3 integration for faster training steps
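A hedged sketch of the NTK-aware RoPE frequency scaling referenced above; the rotary base of 10000 and the context-scale factor are assumptions, as the PR only states that NTK-aware RoPE was used at the 2048-token length.

```python
import torch

def ntk_rope_inv_freq(head_dim: int, base: float = 10000.0, scale: float = 1.0) -> torch.Tensor:
    """NTK-aware RoPE: rescale the rotary base so low frequencies stretch with the context scale."""
    ntk_base = base * scale ** (head_dim / (head_dim - 2))
    return 1.0 / (ntk_base ** (torch.arange(0, head_dim, 2).float() / head_dim))
```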