PR #135
openRecord: OrthoInit + Int6 MLP3x + BigramHash + SmearGate (val_bpb: 1.1539)
by unnirView on GitHub
val_bpb
1.1539
Architecture
GPT
Optimizer
Muon
Artifact Size
15,162,375 bytes
Training Techniques
Initialization
OrthoInit
Orthogonal initialization with gain=1.0, plus muP-scaled output projections.
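A minimal pure-Python sketch of what this combination could look like: an orthogonal matrix built by Gram–Schmidt on a Gaussian matrix (the same construction `torch.nn.init.orthogonal_` performs via QR), plus a muP-style output-projection multiplier. The `base_width` reference and the exact muP scaling rule are assumptions; the PR only names the technique.

```python
import math
import random

def orthogonal_matrix(n, gain=1.0, seed=0):
    """Build an n x n orthogonal matrix (times `gain`) by Gram-Schmidt on a
    random Gaussian matrix -- a pure-Python stand-in for QR-based init."""
    rng = random.Random(seed)
    rows = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
    basis = []
    for v in rows:
        # Remove the components along previously accepted basis vectors.
        for b in basis:
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        basis.append([x / norm for x in v])
    return [[gain * x for x in row] for row in basis]

def mup_output_scale(width, base_width=256):
    """muP-style output-projection multiplier, shrinking as 1/width relative
    to an assumed reference width (base_width is not from the PR)."""
    return base_width / width
```

The rows of the result are orthonormal, so activations keep their scale at initialization regardless of depth.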
Quantization
mixed int6
bits: 6
scope: MLP and attention weight matrices; FP16 passthrough for tied embeddings and last 2 layers' Key projections
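A sketch of symmetric per-tensor 6-bit quantization, the simplest scheme consistent with the description above; per-channel scales or a different rounding rule are possible and unspecified in the PR. The FP16-passthrough tensors (tied embeddings, last two layers' K projections) would simply skip this step.

```python
def quantize_int6(weights):
    """Symmetric per-tensor 6-bit quantization sketch: map floats onto the
    integer grid -31..31 with a single scale factor."""
    qmax = 2 ** 5 - 1  # 31: largest magnitude in a symmetric 6-bit range
    amax = max(abs(w) for w in weights)
    scale = amax / qmax if amax > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from 6-bit codes."""
    return [x * scale for x in q]
```

Round-trip error per weight is bounded by half the scale, which is what makes the 3x MLP expansion below roughly free in artifact bytes.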
Architecture
MLP3x
Expanded MLP hidden dimension from 1024 to 1536 (3x model_dim).
parameters: {"hidden_dimension":1536}
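The parameter cost of the expansion is easy to check. Assuming a model_dim of 512 (implied by "3x model_dim") and a standard two-matrix MLP without biases:

```python
def mlp_params(model_dim, hidden_dim):
    """Parameter count of a standard two-matrix MLP block (no biases):
    up-projection (model_dim x hidden_dim) + down-projection (hidden_dim x model_dim)."""
    return 2 * model_dim * hidden_dim

baseline = mlp_params(512, 1024)  # previous 2x expansion
expanded = mlp_params(512, 1536)  # new 3x expansion
```

The 3x block carries 1.5x the MLP parameters of the 2x baseline, paid for (per the contributions list) by the int6 quantization savings.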
SmearGate
Learned gate blending each token embedding with the previous token embedding.
parameters: {"parameters":512}
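A sketch of one plausible reading: a per-channel sigmoid gate (one learned logit per channel, matching the 512 parameters listed) that mixes each token embedding with its predecessor's. Whether the gate is per-channel, scalar, or input-dependent is not specified in the PR.

```python
import math

def smear_gate(embeddings, gate_logits):
    """SmearGate sketch: a per-channel sigmoid gate g blends each token
    embedding with the previous token's embedding:
        out[t] = g * emb[t] + (1 - g) * emb[t-1]
    The first token has no predecessor and passes through unchanged."""
    g = [1.0 / (1.0 + math.exp(-z)) for z in gate_logits]
    out = [list(embeddings[0])]
    for t in range(1, len(embeddings)):
        out.append([gi * cur + (1.0 - gi) * prev
                    for gi, cur, prev in zip(g, embeddings[t], embeddings[t - 1])])
    return out
```

Initializing the logits high (gate near 1) would make the block start as an identity and learn how much previous-token signal to smear in.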
BigramHash
4096-bucket hash table injecting token-pair information, projected to model dimension.
parameters: {"buckets":4096,"dimension":128,"projection_dimension":512}
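A sketch of the lookup path: hash each (previous, current) token pair into one of 4096 buckets, fetch that bucket's embedding (128-dim in the PR), and project it up to the model dimension (512 in the PR). The multiply-and-mod hash mixing below is an assumption; the PR does not specify the hash function.

```python
def bigram_bucket(prev_tok, cur_tok, n_buckets=4096):
    """Hash a (previous, current) token pair into one of n_buckets.
    The prime multiplier is an assumed mixing choice, not from the PR."""
    return (prev_tok * 1000003 + cur_tok) % n_buckets

def bigram_features(tokens, table, projection):
    """For each position t >= 1, look up the bucket embedding for the pair
    (tokens[t-1], tokens[t]) and project it via `projection`, given as a
    list of output columns, each the length of a bucket embedding."""
    feats = []
    for t in range(1, len(tokens)):
        emb = table[bigram_bucket(tokens[t - 1], tokens[t])]
        feats.append([sum(e * w for e, w in zip(emb, col)) for col in projection])
    return feats
```

Hash collisions are tolerated by design: with 4096 buckets the table stays tiny while still giving the model direct access to local pair statistics.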
Optimizer
Muon
weight_decay: 0.01
momentum: 0.99
other_params: {"matrix_lr":0.02,"scalar_lr":0.02,"tied_embed_lr":0.03,"muon_momentum_warmup_start":0.92,"muon_momentum_warmup_steps":1500,"warmdown_iters":3000,"grad_clip_norm":0.3,"beta1":0.9,"beta2":0.95,"adamw_for_embedding_and_scalar_params":true}
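The momentum warmup implied by `muon_momentum_warmup_start` and `muon_momentum_warmup_steps` could be sketched as below; linear interpolation between the endpoints is an assumption, since the hyperparameters only give the start value (0.92), the target (`momentum: 0.99`), and the step count (1500).

```python
def muon_momentum(step, start=0.92, end=0.99, warmup_steps=1500):
    """Warm Muon's momentum from `start` to `end` over `warmup_steps`
    optimizer steps, then hold it at `end`. Linear interpolation is an
    assumed schedule shape."""
    if step >= warmup_steps:
        return end
    return start + (end - start) * (step / warmup_steps)
```

Starting with lower momentum keeps early updates from being dominated by noisy initial gradients before the running average stabilizes.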
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
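A sketch of the window bookkeeping this evaluation implies: each window spans `context_length` tokens but only newly covered positions are scored, so after the first window every scored token sees close to the full 2048-token context. The exact scoring policy for the first window and for a non-aligned tail is an assumption.

```python
def sliding_windows(n_tokens, context=2048, stride=64):
    """Yield (window_start, score_from, score_to) spans for sliding-window
    evaluation: each window covers up to `context` tokens, and only the
    positions not scored by an earlier window are scored, so scored tokens
    partition the sequence exactly once."""
    spans = []
    start, scored_to = 0, 0
    while scored_to < n_tokens:
        end = min(start + context, n_tokens)
        spans.append((start, scored_to, end))  # score tokens [scored_to, end)
        scored_to = end
        start += stride
    return spans
```

The small stride makes evaluation roughly `context / stride` (32x) more expensive than non-overlapping chunks, in exchange for a tighter bpb estimate.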
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
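A sketch of a warmdown schedule consistent with the parameters above: hold the learning rate constant, then decay to zero over the final 3000 iterations. Linear decay and the use of `matrix_lr` (0.02) as the base rate are assumptions; the PR only names the schedule and its length.

```python
def warmdown_lr(step, total_steps, base_lr=0.02, warmdown_iters=3000):
    """Hold the learning rate at base_lr, then decay it linearly to 0 over
    the final warmdown_iters steps. Linear decay is an assumed shape."""
    steps_left = total_steps - step
    if steps_left >= warmdown_iters:
        return base_lr
    return base_lr * steps_left / warmdown_iters
```

Each parameter group (matrix, scalar, tied embedding) would apply the same shape with its own base rate.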
Regularization
weight decay
parameters: {"weight_decay":0.01}
Novel Contributions
- Orthogonal initialization with muP-scaled output projections
- Mixed int6 quantization with FP16 passthrough for sensitive tensors
- 3x MLP expansion enabled by quantization savings
- Tuned Muon/AdamW optimizer hyperparameters
- SmearGate token blending mechanism
- BigramHash token-pair embedding
- Sliding-window evaluation with stride 64