PR #206
Record: Int6 STE + SmearGate + Seq2048 + OrthoInit + RoPE50K + SWA/100 (mean val_bpb=1.1507)
by dexhunter
val_bpb
1.1507
Architecture
Transformer
Optimizer
NorMuon
Artifact Size
14.79MB
Training Techniques
Quantization
int6 STE QAT
bits: 6
scope: all weights except fp16 tied embedding
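A minimal sketch of int6 straight-through-estimator (STE) fake quantization, assuming symmetric per-tensor scaling; the PR's actual granularity (per-tensor vs. per-channel) is not stated here. During training the forward pass uses the quantize-dequantize values while gradients flow through as if the rounding were the identity (the STE trick, e.g. `w + (wq - w).detach()` in PyTorch).

```python
import numpy as np

def fake_quant_int6(w, eps=1e-8):
    """Symmetric per-tensor int6 fake quantization (illustrative sketch).

    Forward: snap weights onto a 6-bit signed grid (levels -31..31).
    Backward (STE, not shown): gradients pass through the rounding unchanged.
    """
    qmax = 2 ** (6 - 1) - 1                     # 31 representable magnitudes
    scale = np.abs(w).max() / qmax + eps        # map the largest weight to +/-31
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale                            # dequantized values used in forward
```

With 6 bits per weight (plus the fp16 tied embedding left unquantized), the 14.79MB artifact size follows from the compressed low-bit representation.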
Architecture
SmearGate
Learned gate blends token embeddings with predecessor representations.
parameters: {"params":512}
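A sketch of how a SmearGate blend could work, assuming the 512 parameters are one learned gate logit per model dimension and a sigmoid blend with the previous token's embedding; the exact formulation in the PR may differ.

```python
import numpy as np

def smear_gate(emb, gate_logits):
    """Blend each token embedding with its predecessor (hypothetical sketch).

    emb: (seq, dim) token embeddings.
    gate_logits: (dim,) learned per-dimension logits (~512 params, matching
    the PR's parameter count). Position 0 has no predecessor and is left as-is.
    """
    g = 1.0 / (1.0 + np.exp(-gate_logits))          # sigmoid gate in (0, 1)
    out = emb.copy()
    out[1:] = g * emb[1:] + (1.0 - g) * emb[:-1]    # smear in the previous token
    return out
```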
MLP3x
Wider MLP layers enabled by int6 compression savings.
parameters: {"hidden_size":1536}
RoPE
Rotary position embeddings with adjusted base frequency for longer context.
parameters: {"base":50000}
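The base-50000 rotary embedding can be sketched as below: raising the base from the common 10000 slows the rotation of low-frequency channels, a standard way to make longer contexts usable. The pairing of even/odd channels here is one common convention; implementations vary.

```python
import numpy as np

def apply_rope(x, base=50000.0):
    """Rotate channel pairs of x (seq, dim) by position-dependent angles (sketch)."""
    seq, dim = x.shape
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,) per-pair frequency
    ang = np.outer(np.arange(seq), inv_freq)           # (seq, dim/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                 # 2D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```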
tied embeddings
Input and output embeddings are tied; embedding tensor is kept in fp16 and not quantized.
parameters: null
KV head count
Grouped-query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
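With 8 attention heads over 4 KV heads, each KV head is shared by a group of 2 query heads, halving KV-cache size. A minimal sketch of the head expansion step:

```python
import numpy as np

def expand_kv(kv, n_heads, n_kv_heads):
    """Broadcast each KV head to its group of query heads (GQA sketch).

    kv: (n_kv_heads, seq, head_dim). Returns (n_heads, seq, head_dim) with
    each KV head repeated n_heads // n_kv_heads times, so standard
    multi-head attention can be applied downstream.
    """
    group = n_heads // n_kv_heads
    return np.repeat(kv, group, axis=0)
```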
Optimizer
NorMuon
weight_decay: 0.02
momentum: 0.99
other_params: {"beta2":0.95,"warmup_start":0.92,"matrix_lr":0.021,"scalar_lr":0.02,"tied_embed_lr":0.03}
Weight Averaging
SWA
parameters: {"every_steps":100,"start_fraction_of_warmdown":0.5}
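A sketch of the SWA bookkeeping, assuming an equal-weight running average: every 100 steps, once training is past 50% of the warmdown phase, the current weights are folded into the average. The incremental-update form avoids storing all checkpoints.

```python
def swa_update(avg, weights, n_averaged):
    """Fold one checkpoint into a running equal-weight average (sketch).

    avg: current averaged parameters (list of floats), or None before the
    first update. Returns (new_avg, new_count). Incremental form of the
    mean: avg += (w - avg) / (n + 1).
    """
    if avg is None:
        return list(weights), 1
    new_avg = [a + (w - a) / (n_averaged + 1) for a, w in zip(avg, weights)]
    return new_avg, n_averaged + 1
```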
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":1984}
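The stride-64 sliding-window evaluation can be sketched as follows: each window sees up to 1984 tokens of history, but only its final `stride` tokens are scored, so every token is evaluated exactly once with near-maximal context. The exact handling of the first window in the PR is an assumption here.

```python
def sliding_eval_spans(n_tokens, context_length=1984, stride=64):
    """Enumerate (window_start, score_start, score_end) spans (sketch).

    Each span's window is at most context_length tokens long and only the
    trailing tokens in [score_start, score_end) contribute to val_bpb.
    """
    spans = []
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - context_length)   # trailing context for this chunk
        spans.append((start, pos, end))
        pos = end
    return spans
```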
Initialization
OrthoInit
Orthogonal initialization applied to all non-zero-init linear layers.
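A standard way to realize orthogonal initialization, shown as a sketch: QR-decompose a Gaussian matrix and sign-correct with the diagonal of R so the result is uniformly distributed over orthogonal matrices. Whether the PR scales the result by a gain factor is not stated.

```python
import numpy as np

def orthogonal_init(shape, rng):
    """Orthogonal weight matrix for a linear layer (sketch).

    Produces W of the given (rows, cols) shape whose rows (if rows <= cols)
    or columns (if rows >= cols) are orthonormal.
    """
    rows, cols = shape
    a = rng.standard_normal((max(rows, cols), min(rows, cols)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))        # sign correction for a unique, uniform Q
    w = q if rows >= cols else q.T
    return w[:rows, :cols]
```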
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_iters":3000,"warmup_steps":20}
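The schedule above can be sketched as a trapezoid: 20 warmup steps, a flat plateau, then a 3000-iteration warmdown. The linear-to-zero shape of the warmdown is an assumption; the PR only gives the two step counts.

```python
def lr_multiplier(step, total_steps, warmup_steps=20, warmdown_iters=3000):
    """Trapezoidal LR multiplier: linear warmup, plateau, linear warmdown (sketch)."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps             # ramp up over 20 steps
    if step >= total_steps - warmdown_iters:
        remaining = total_steps - step
        return remaining / warmdown_iters            # ramp down over 3000 iters
    return 1.0                                       # flat plateau in between
```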
Regularization
weight decay
parameters: {"value":0.02}
Other
other
U-Net style skip connections with learnable per-layer per-dimension skip weights.
parameters: null
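One plausible shape for the skip scheme described above, shown as a sketch: each decoder-half block receives the matching encoder-half output scaled by a learnable per-dimension weight vector. The pairing order (first encoder block with last decoder block) and additive combination are assumptions.

```python
import numpy as np

def unet_skip_decoder(x, encoder_outs, skip_weights, decoder_blocks):
    """U-Net-style skips with per-layer per-dimension weights (hypothetical sketch).

    encoder_outs: outputs of the encoder-half blocks, in order.
    skip_weights[i]: (dim,) learnable scale for decoder block i's skip.
    Decoder block i adds its weighted skip before running.
    """
    for i, block in enumerate(decoder_blocks):
        skip = encoder_outs[-(i + 1)]              # pair first-with-last, etc.
        x = block(x + skip_weights[i] * skip)
    return x
```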
Novel Contributions
- Int6 straight-through estimator quantization during training
- SmearGate token-to-predecessor embedding blending
- Wider 3x MLP enabled by quantization savings
- Orthogonal initialization across non-zero-init linear layers
- Longer 2048-token training context with RoPE base 50K
- Frequent SWA checkpoint averaging every 100 steps
- Sliding-window evaluation with stride 64
- U-Net skip connections in the model