PR #453
openExploratory: PR315-derived candidate and looped-depth gate
by Divyesh-Thirukonda · View on GitHub
val_bpb: 1.1248
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.6 MB
Training Techniques
Architecture
Partial RoPE
Rotary position embeddings applied to only part of the head dimensions, leaving the rest without positional bias.
parameters: {"dimensions":16,"total_dimensions":64}
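A minimal sketch of partial RoPE under the PR's config (16 rotated dimensions out of a 64-dim head). The half-split pairing convention (first 8 dims paired with the next 8, rather than interleaved) is an assumption; the PR does not specify it.

```python
import numpy as np

def partial_rope(x, rope_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first `rope_dims` of each
    head vector; the remaining dimensions pass through with no
    positional bias. x: (seq_len, head_dim)."""
    seq_len, head_dim = x.shape
    half = rope_dims // 2
    inv_freq = 1.0 / (base ** (np.arange(half) / half))       # (half,)
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rope_dims]                # rotated pairs
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    # Dimensions beyond rope_dims are left untouched.
    return np.concatenate([rotated, x[:, rope_dims:]], axis=-1)
```

At position 0 the rotation angle is zero, so the output equals the input there; dimensions 16..63 are never rotated at any position.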
XSA
Exclusive Self Attention (XSA) applied in the last layers.
parameters: {"last_layers":4}
SmearGate
Learned token blending gate.
parameters: {"parameters":512}
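One plausible reading of SmearGate, sketched below: each position blends in its predecessor's representation, weighted by a learned per-token sigmoid gate. The blend-with-previous-token form and the single weight vector `w` are assumptions (a `(512,)` vector matches the stated 512 parameters for a 512-dim model); the PR only names the technique.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, w):
    """Hypothetical smear gate: blend each token with its predecessor,
    weighted by a learned gate g = sigmoid(x @ w).
    x: (seq_len, dim); w: (dim,) learned gate weights."""
    g = sigmoid(x @ w)[:, None]      # (seq_len, 1) blend coefficient
    prev = np.roll(x, 1, axis=0)
    prev[0] = x[0]                   # first token has no predecessor
    return (1.0 - g) * x + g * prev
```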
BigramHash
Bigram hash embedding with projection to the model dimension.
parameters: {"buckets":2048,"embedding_dim":128,"projection_dim":512}
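A sketch of the bigram hash embedding with the PR's sizes (2048 buckets, 128-dim embedding, projection to 512). The particular hash function is an illustrative choice, not the PR's.

```python
import numpy as np

def bigram_hash_embed(token_ids, table, proj, buckets=2048):
    """Hash each (prev, cur) token bigram into one of `buckets` slots,
    look up a small embedding, and project to the model dimension.
    table: (buckets, 128); proj: (128, 512)."""
    ids = np.asarray(token_ids)
    prev = np.concatenate([[0], ids[:-1]])   # pad the first bigram
    # Simple multiplicative hash of the bigram (illustrative choice).
    h = (prev * 1000003 + ids) % buckets
    return table[h] @ proj                   # (seq_len, 512)
```

Identical bigrams land in the same bucket, so repeated two-token patterns share an embedding; distinct bigrams can collide, which hashing accepts by design.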
MLP3x
Expanded MLP width to 3x standard size with relu² activation.
parameters: {"hidden_size":1536}
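The wide MLP with relu² reduces to a few lines; the shapes below assume a 512-dim model (hidden 1536, per the config), and bias-free layers are an assumption.

```python
import numpy as np

def mlp3x(x, w_in, w_out):
    """MLP with a 3x-wide hidden layer and relu^2 activation:
    the square of ReLU, so the hidden activations are nonnegative."""
    h = x @ w_in                   # (..., 1536)
    h = np.maximum(h, 0.0) ** 2    # relu^2
    return h @ w_out               # back to the model dimension
```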
Regularization
layer-wise RMSNorm scale
Per-layer norm output scaled by 1/sqrt(layer_idx+1).
parameters: {"scale":"1/sqrt(layer_idx+1)"}
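The layer-wise scale composes directly with the norm. A minimal sketch, omitting any learned gain (whether one is present is unspecified in the PR):

```python
import numpy as np

def scaled_rmsnorm(x, layer_idx, eps=1e-6):
    """RMS-normalize x, then scale by 1/sqrt(layer_idx + 1), so deeper
    layers contribute progressively less to the residual stream."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) / np.sqrt(layer_idx + 1)
```

Layer 0 is plain RMSNorm; layer 3 is scaled by 1/2, layer 8 by 1/3, and so on.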
Weight Averaging
EMA
parameters: {"decay":0.997}
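EMA weight averaging with the PR's decay of 0.997 is the standard shadow-weight update, `ema = decay * ema + (1 - decay) * current`, applied each step; a dict of scalar parameters stands in for real tensors here.

```python
class EMA:
    """Exponential moving average of model weights (decay 0.997)."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = {k: float(v) for k, v in params.items()}

    def update(self, params):
        """Blend current weights into the shadow copy, one step."""
        d = self.decay
        for k, v in params.items():
            self.shadow[k] = d * self.shadow[k] + (1 - d) * v
```

At evaluation time the shadow weights are used in place of the raw ones, which typically smooths late-training noise.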
Quantization
mixed int6/int8
bits: 6
scope: MLP and attention int6; embeddings int8
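A sketch of the quantizer implied by the config: symmetric quantization at 6 bits (range [-32, 31]) for MLP/attention weights and 8 bits for embeddings. Per-tensor absmax scaling is an assumed detail.

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric quantization to `bits` bits with an absmax scale.
    bits=6 -> integer range [-32, 31]; bits=8 -> [-128, 127]."""
    qmax = 2 ** (bits - 1) - 1
    absmax = float(np.abs(w).max())
    scale = absmax / qmax if absmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from integers and scale."""
    return q.astype(np.float32) * scale
```

Round-to-nearest bounds the reconstruction error by half the scale per element; the int6 tensors then compress well under zstd since only 64 distinct values occur.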
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
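Sliding-window evaluation with stride 64 means each window advances by 64 tokens and only the newly uncovered tokens are scored, so every token after the first window is predicted with near-full left context. A small helper, with the span bookkeeping as a sketch:

```python
def sliding_eval_spans(n_tokens, context=2048, stride=64):
    """Yield (window_start, window_end, n_scored) spans: each window is
    up to `context` tokens, advanced by `stride`, and only tokens not
    covered by a previous window count toward the loss."""
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

Every token is scored exactly once, so the per-token losses still average into a valid bpb figure; the small stride just makes the eval slower and the contexts longer.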
Sequence Length
train_length: 2048
eval_length: 2048
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"warmup_momentum_start":0.92,"warmup_steps":1500,"warmdown_iters":3000,"adamw_weight_decay":0.04,"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035,"grad_clip":0.3}
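The config's momentum warmup (0.92 to 0.99 over 1500 steps) can be sketched as a schedule function; a linear ramp is an assumption, since the PR only gives the endpoints and duration.

```python
def muon_momentum(step, start=0.92, final=0.99, warmup_steps=1500):
    """Momentum warmup for Muon: ramp linearly from `start` to `final`
    over the first `warmup_steps` optimizer steps, then hold."""
    if step >= warmup_steps:
        return final
    frac = step / warmup_steps
    return start + frac * (final - start)
```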
Initialization
Orthogonal + muP-scaled init
Orthogonal initialization with muP scaling applied to large matrices.
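A sketch of the init: draw a Gaussian matrix, orthogonalize its columns via QR, and apply a muP-style width-dependent scale. The exact scale rule (1/sqrt(fan_in) here) is an assumption, and the construction assumes fan_in >= fan_out.

```python
import numpy as np

def orth_mup_init(fan_out, fan_in, rng=None):
    """Semi-orthogonal weight matrix with muP-style 1/sqrt(fan_in)
    scaling, so activation magnitudes stay O(1) as width grows."""
    rng = rng if rng is not None else np.random.default_rng(0)
    a = rng.standard_normal((fan_in, fan_out))
    q, _ = np.linalg.qr(a)        # (fan_in, fan_out), orthonormal columns
    return q.T / np.sqrt(fan_in)  # (fan_out, fan_in)
```

The rows of the result are mutually orthogonal with squared norm 1/fan_in, which is what makes the check below exact.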
Other
Late QAT flag enabling STE int6 fake-quantization for the final 4% of training; post-analysis found it was constant-folded and had no effect.
parameters: {"enabled":true,"final_training_fraction":0.04}
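The forward side of such a fake-quantization step would look like the sketch below; under a straight-through estimator (STE) the backward pass treats `round()` as identity. Absmax scaling is an assumed detail, and per the PR's post-analysis this step was constant-folded away, so it changed nothing.

```python
import numpy as np

def ste_fake_quant_int6(w):
    """Int6 fake quantization for the forward pass: snap weights to the
    64-level grid [-32, 31] * scale and return the dequantized values."""
    qmax = 31                                # int6 range [-32, 31]
    absmax = float(np.abs(w).max())
    scale = absmax / qmax if absmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -32, qmax)
    return q * scale                         # dequantized forward weights
```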
Novel Contributions
- Partial RoPE applied to 16 of 64 head dimensions
- Layer-wise RMSNorm scaling by 1/sqrt(layer_idx+1)
- EMA weight averaging during training
- Mixed int6/int8 quantization with zstd compression
- XSA on the last 4 layers
- SmearGate token blending gate
- Bigram hash embedding with projection
- Orthogonal + muP-scaled initialization
- Late QAT flag was included but had no effect due to constant folding