PR #64
openRecord: DominationV3 + GPTQ-lite + TTT25 (mean val_bpb=1.1250, 3 seeds)
by yesbhautik
val_bpb
1.1250
Architecture
Transformer
Optimizer
Muon
Artifact Size
under 16MB
Training Techniques
Architecture
Partial RoPE
Applies rotary position embeddings to only part of the head dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
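Per the parameters, rotary embeddings are applied to only the first 16 of the 64 head dimensions; the remaining dimensions pass through unrotated. A minimal pure-Python sketch of one head vector (the pairing and frequency schedule follow the standard RoPE convention; the real model applies this per head inside attention):

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Rotate the first rot_dims entries of head vector x at position pos;
    the remaining dimensions pass through unchanged."""
    out = list(x)
    for i in range(0, rot_dims, 2):
        # frequency decays with dimension index, as in standard RoPE
        theta = pos * base ** (-i / rot_dims)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + 1]
        out[i] = a * c - b * s
        out[i + 1] = a * s + b * c
    return out
```

Only a quarter of each head carries positional phase, which is often enough for short-range order while leaving the rest of the head free for content.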
BigramHash
Adds a BigramHash local-context component.
parameters: {"vocab_size":4096,"embedding_dim":128}
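BigramHash gives the model a cheap local-context signal: each (previous, current) token pair is hashed into a 4096-entry table and the looked-up 128-dim embedding is added to the token's representation. A sketch under stated assumptions (the multiplicative hash constant and the additive combination are illustrative; the record specifies only the table and embedding sizes):

```python
def bigram_bucket(prev_tok, cur_tok, table_size=4096):
    """Map a (prev, cur) token pair to one of table_size buckets.
    The multiplier 1000003 is an illustrative choice, not from the record."""
    return (prev_tok * 1000003 + cur_tok) % table_size

def add_bigram_embeddings(tokens, hidden, table):
    """Add the bucket embedding for each bigram to the token's hidden vector.
    hidden: one vector per token; table: table_size x embedding_dim."""
    out = []
    for t, h in enumerate(hidden):
        prev = tokens[t - 1] if t > 0 else 0  # assumed padding for position 0
        e = table[bigram_bucket(prev, tokens[t])]
        out.append([a + b for a, b in zip(h, e)])
    return out
```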
SmearGate
Uses per-dimension SmearGate.
parameters: null
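The record doesn't spell out SmearGate's form; a common reading is a learned per-dimension gate that blends each position's activation with the previous position's. A sketch under that assumption, with the gate supplied directly as values in [0, 1] (in practice it would be a learned parameter passed through a sigmoid):

```python
def smear_gate(seq, gate):
    """seq: list of per-token vectors; gate: per-dimension blend in [0, 1].
    Each position mixes in the previous position, dimension by dimension."""
    out = [list(seq[0])]  # nothing to smear into the first position
    for t in range(1, len(seq)):
        out.append([(1 - g) * seq[t][d] + g * seq[t - 1][d]
                    for d, g in enumerate(gate)])
    return out
```

"Per-dimension" means each channel learns its own smear strength rather than one scalar for the whole vector.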
XSA
XSA is removed, freeing compute budget for additional training steps.
parameters: null
Regularization
LN Scale
Scales each layer's contribution by a depth-dependent factor.
parameters: {"scale_rule":"1/sqrt(layer+1)"}
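The 1/sqrt(layer+1) rule damps deeper layers progressively, which acts as a mild regularizer on the residual stream. A sketch of the factor and one way it could be applied (attaching it to the residual branch output is an assumption; the record gives only the rule itself):

```python
import math

def ln_scale(layer_idx):
    """Depth-dependent damping factor, layer_idx counted from 0."""
    return 1.0 / math.sqrt(layer_idx + 1)

def scaled_residual(x, branch_out, layer_idx):
    """Residual add with the branch output damped by depth (assumed placement)."""
    s = ln_scale(layer_idx)
    return [a + s * b for a, b in zip(x, branch_out)]
```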
Weight Averaging
EMA
parameters: {"decay":0.997}
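EMA with decay 0.997 keeps a shadow copy of the weights that is nudged toward the live weights after every optimizer step; the averaged copy is what gets quantized and shipped. The update is one line:

```python
def ema_update(shadow, params, decay=0.997):
    """Move the shadow (averaged) weights a small step toward the live weights."""
    return [decay * s + (1 - decay) * p for s, p in zip(shadow, params)]
```

With decay 0.997 the shadow averages over roughly the last 1/(1-0.997) ≈ 333 steps, smoothing out late-training noise.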
Quantization
GPTQ-lite
bits: 6
scope: mlp, attn, tok_emb
int6
bits: 6
scope: mlp, attn, tok_emb
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"orthoinit":true}
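Muon takes a momentum step, orthogonalizes the momentum matrix with a few Newton-Schulz iterations, and applies that as the update. A small pure-Python sketch using the standard quintic Newton-Schulz coefficients, with the record's momentum 0.99 and weight decay 0.04 (the learning rate here is a placeholder, and the real implementation runs per 2-D weight matrix on the accelerator):

```python
import math

def _matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def ns_orthogonalize(G, steps=5):
    """Push the singular values of G toward 1 via quintic Newton-Schulz."""
    a, b, c = 3.4445, -4.7750, 2.0315
    fro = math.sqrt(sum(x * x for row in G for x in row)) + 1e-7
    X = [[x / fro for x in row] for row in G]
    for _ in range(steps):
        A = _matmul(X, [list(col) for col in zip(*X)])  # A = X X^T
        AX = _matmul(A, X)
        AAX = _matmul(A, AX)
        X = [[a * X[i][j] + b * AX[i][j] + c * AAX[i][j]
              for j in range(len(X[0]))] for i in range(len(X))]
    return X

def muon_step(W, G, M, lr=0.02, momentum=0.99, weight_decay=0.04):
    """One Muon update: momentum accumulation, orthogonalize, decayed apply.
    lr=0.02 is an illustrative value, not from the record."""
    M = [[momentum * m + g for m, g in zip(mr, gr)] for mr, gr in zip(M, G)]
    O = ns_orthogonalize(M)
    W = [[(1 - lr * weight_decay) * w - lr * o for w, o in zip(wr, orow)]
         for wr, orow in zip(W, O)]
    return W, M
```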
Initialization
OrthoInit
Orthogonal initialization.
Test-Time Training
full TTT
parameters: {"epochs":25,"learning_rate":0.012,"momentum":0.9,"freeze_blocks":0}
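Test-time training here runs 25 epochs of SGD with momentum over tokens that have already been scored, so the adaptation costs compute but leaks no future information. The update schedule matches the listed parameters (lr 0.012, momentum 0.9, no frozen blocks); the model and loss are abstracted behind a caller-supplied gradient function:

```python
def ttt_fit(params, grad_fn, epochs=25, lr=0.012, momentum=0.9):
    """SGD with heavy-ball momentum, the schedule used for test-time training.
    grad_fn(params) returns one gradient per parameter."""
    velocity = [0.0] * len(params)
    for _ in range(epochs):
        grads = grad_fn(params)
        for i, g in enumerate(grads):
            velocity[i] = momentum * velocity[i] + g
            params[i] -= lr * velocity[i]
    return params
```

With momentum 0.9 the effective step size compounds to roughly lr/(1-momentum) ≈ 0.12, which is why 25 epochs counts as "aggressive".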
Evaluation
sliding window eval
parameters: {"stride":64}
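Sliding-window evaluation with stride 64 re-scores the sequence in overlapping windows: each window supplies a long context, but only its final 64 tokens contribute fresh scores, so every token is graded exactly once with near-maximal context. A sketch of the span bookkeeping (the window size of 512 is an assumed context length, not given in the record):

```python
def sliding_eval_spans(n_tokens, window=512, stride=64):
    """Return (ctx_start, score_start, score_end) triples covering all tokens."""
    spans = []
    end = min(window, n_tokens)
    spans.append((0, 0, end))  # the first window scores everything it sees
    while end < n_tokens:
        new_end = min(end + stride, n_tokens)
        # later windows slide forward but only score their last `stride` tokens
        spans.append((max(0, new_end - window), end, new_end))
        end = new_end
    return spans
```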
Compression
zstd
level: 22
Other
other
Uses 11 layers, 512 model dimension, 8 heads, 4 KV heads, and 3x MLP expansion.
parameters: {"layers":11,"model_dim":512,"heads":8,"kv_heads":4,"mlp_hidden":1536}
Novel Contributions
- GPTQ-lite optimal clip percentile search during int6 quantization
- 25-epoch aggressive SGD test-time training on already-graded tokens
- Partial RoPE combined with LN Scale, with XSA removed to enable more training steps
- Per-dimension SmearGate combined with BigramHash local context
- Mixed int6 quantization of MLP, attention, and token embeddings with zstd-22 compression
- Muon optimizer with OrthoInit and U-Net skip connections