PR #453
openExploratory: PR315-derived candidate and looped-depth gate
by Divyesh-Thirukonda · View on GitHub
val_bpb: 1.1248
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.6 MB
Training Techniques
Architecture
Partial RoPE
Rotary position embeddings applied to only part of the head dimensions, leaving the rest without positional bias.
parameters: {"dimensions":16,"total_dimensions":64}
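A minimal sketch of partial RoPE under the PR's config (16 rotated dimensions out of a 64-dim head). The half-split pairing convention (first 8 dims paired with the next 8, rather than interleaved) is an assumption; the PR does not specify it.

```python
import numpy as np

def partial_rope(x, rope_dims=16, base=10000.0):
    """Apply rotary position embeddings to the first `rope_dims` of each
    head vector; the remaining dimensions pass through with no
    positional bias. x: (seq_len, head_dim)."""
    seq_len, head_dim = x.shape
    half = rope_dims // 2
    inv_freq = 1.0 / (base ** (np.arange(half) / half))       # (half,)
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:rope_dims]                # rotated pairs
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    # Dimensions beyond rope_dims are left untouched.
    return np.concatenate([rotated, x[:, rope_dims:]], axis=-1)
```

At position 0 the rotation angle is zero, so the output equals the input there; dimensions 16..63 are never rotated at any position.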
XSA
Exclusive Self Attention (XSA) applied in the last layers.
parameters: {"last_layers":4}
SmearGate
Learned token blending gate.
parameters: {"parameters":512}
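One plausible reading of SmearGate, sketched below: each position blends in its predecessor's representation, weighted by a learned per-token sigmoid gate. The blend-with-previous-token form and the single weight vector `w` are assumptions (a `(512,)` vector matches the stated 512 parameters for a 512-dim model); the PR only names the technique.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def smear_gate(x, w):
    """Hypothetical smear gate: blend each token with its predecessor,
    weighted by a learned gate g = sigmoid(x @ w).
    x: (seq_len, dim); w: (dim,) learned gate weights."""
    g = sigmoid(x @ w)[:, None]      # (seq_len, 1) blend coefficient
    prev = np.roll(x, 1, axis=0)
    prev[0] = x[0]                   # first token has no predecessor
    return (1.0 - g) * x + g * prev
```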
BigramHash
Bigram hash embedding with projection to the model dimension.
parameters: {"buckets":2048,"embedding_dim":128,"projection_dim":512}
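A sketch of the bigram hash embedding with the PR's sizes (2048 buckets, 128-dim embedding, projection to 512). The particular hash function is an illustrative choice, not the PR's.

```python
import numpy as np

def bigram_hash_embed(token_ids, table, proj, buckets=2048):
    """Hash each (prev, cur) token bigram into one of `buckets` slots,
    look up a small embedding, and project to the model dimension.
    table: (buckets, 128); proj: (128, 512)."""
    ids = np.asarray(token_ids)
    prev = np.concatenate([[0], ids[:-1]])   # pad the first bigram
    # Simple multiplicative hash of the bigram (illustrative choice).
    h = (prev * 1000003 + ids) % buckets
    return table[h] @ proj                   # (seq_len, 512)
```

Identical bigrams land in the same bucket, so repeated two-token patterns share an embedding; distinct bigrams can collide, which hashing accepts by design.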
MLP3x
Expanded MLP width to 3x standard size with relu² activation.
parameters: {"hidden_size":1536}
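The wide MLP with relu² reduces to a few lines; the shapes below assume a 512-dim model (hidden 1536, per the config), and bias-free layers are an assumption.

```python
import numpy as np

def mlp3x(x, w_in, w_out):
    """MLP with a 3x-wide hidden layer and relu^2 activation:
    the square of ReLU, so the hidden activations are nonnegative."""
    h = x @ w_in                   # (..., 1536)
    h = np.maximum(h, 0.0) ** 2    # relu^2
    return h @ w_out               # back to the model dimension
```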
Regularization
layer-wise RMSNorm scale
Per-layer norm output scaled by 1/sqrt(layer_idx+1).
parameters: {"scale":"1/sqrt(layer_idx+1)"}
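The layer-wise scale composes directly with the norm. A minimal sketch, omitting any learned gain (whether one is present is unspecified in the PR):

```python
import numpy as np

def scaled_rmsnorm(x, layer_idx, eps=1e-6):
    """RMS-normalize x, then scale by 1/sqrt(layer_idx + 1), so deeper
    layers contribute progressively less to the residual stream."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) / np.sqrt(layer_idx + 1)
```

Layer 0 is plain RMSNorm; layer 3 is scaled by 1/2, layer 8 by 1/3, and so on.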
Weight Averaging
EMA
parameters: {"decay":0.997}
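EMA weight averaging with the PR's decay of 0.997 is the standard shadow-weight update, `ema = decay * ema + (1 - decay) * current`, applied each step; a dict of scalar parameters stands in for real tensors here.

```python
class EMA:
    """Exponential moving average of model weights (decay 0.997)."""
    def __init__(self, params, decay=0.997):
        self.decay = decay
        self.shadow = {k: float(v) for k, v in params.items()}

    def update(self, params):
        """Blend current weights into the shadow copy, one step."""
        d = self.decay
        for k, v in params.items():
            self.shadow[k] = d * self.shadow[k] + (1 - d) * v
```

At evaluation time the shadow weights are used in place of the raw ones, which typically smooths late-training noise.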
Quantization
mixed int6/int8
bits: 6
scope: MLP and attention int6; embeddings int8
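A sketch of the quantizer implied by the config: symmetric quantization at 6 bits (range [-32, 31]) for MLP/attention weights and 8 bits for embeddings. Per-tensor absmax scaling is an assumed detail.

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric quantization to `bits` bits with an absmax scale.
    bits=6 -> integer range [-32, 31]; bits=8 -> [-128, 127]."""
    qmax = 2 ** (bits - 1) - 1
    absmax = float(np.abs(w).max())
    scale = absmax / qmax if absmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from integers and scale."""
    return q.astype(np.float32) * scale
```

Round-to-nearest bounds the reconstruction error by half the scale per element; the int6 tensors then compress well under zstd since only 64 distinct values occur.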
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":64}
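Sliding-window evaluation with stride 64 means each window advances by 64 tokens and only the newly uncovered tokens are scored, so every token after the first window is predicted with near-full left context. A small helper, with the span bookkeeping as a sketch:

```python
def sliding_eval_spans(n_tokens, context=2048, stride=64):
    """Yield (window_start, window_end, n_scored) spans: each window is
    up to `context` tokens, advanced by `stride`, and only tokens not
    covered by a previous window count toward the loss."""
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + context, n_tokens)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```

Every token is scored exactly once, so the per-token losses still average into a valid bpb figure; the small stride just makes the eval slower and the contexts longer.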
Sequence Length
train_length: 2048
eval_length: 2048
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"warmup_momentum_start":0.92,"warmup_steps":1500,"warmdown_iters":3000,"adamw_weight_decay":0.04,"matrix_lr":0.025,"scalar_lr":0.025,"tied_embed_lr":0.035,"grad_clip":0.3}
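The config's momentum warmup (0.92 to 0.99 over 1500 steps) can be sketched as a schedule function; a linear ramp is an assumption, since the PR only gives the endpoints and duration.

```python
def muon_momentum(step, start=0.92, final=0.99, warmup_steps=1500):
    """Momentum warmup for Muon: ramp linearly from `start` to `final`
    over the first `warmup_steps` optimizer steps, then hold."""
    if step >= warmup_steps:
        return final
    frac = step / warmup_steps
    return start + frac * (final - start)
```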
Initialization
Orthogonal + muP-scaled init
Orthogonal initialization with muP scaling applied to large matrices.
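A sketch of the init: draw a Gaussian matrix, orthogonalize its columns via QR, and apply a muP-style width-dependent scale. The exact scale rule (1/sqrt(fan_in) here) is an assumption, and the construction assumes fan_in >= fan_out.

```python
import numpy as np

def orth_mup_init(fan_out, fan_in, rng=None):
    """Semi-orthogonal weight matrix with muP-style 1/sqrt(fan_in)
    scaling, so activation magnitudes stay O(1) as width grows."""
    rng = rng if rng is not None else np.random.default_rng(0)
    a = rng.standard_normal((fan_in, fan_out))
    q, _ = np.linalg.qr(a)        # (fan_in, fan_out), orthonormal columns
    return q.T / np.sqrt(fan_in)  # (fan_out, fan_in)
```

The rows of the result are mutually orthogonal with squared norm 1/fan_in, which is what makes the check below exact.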
Other
Late QAT flag enabling STE int6 fake-quantization for the final 4% of training; post-analysis found it was constant-folded and had no effect.
parameters: {"enabled":true,"final_training_fraction":0.04}
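The forward side of such a fake-quantization step would look like the sketch below; under a straight-through estimator (STE) the backward pass treats `round()` as identity. Absmax scaling is an assumed detail, and per the PR's post-analysis this step was constant-folded away, so it changed nothing.

```python
import numpy as np

def ste_fake_quant_int6(w):
    """Int6 fake quantization for the forward pass: snap weights to the
    64-level grid [-32, 31] * scale and return the dequantized values."""
    qmax = 31                                # int6 range [-32, 31]
    absmax = float(np.abs(w).max())
    scale = absmax / qmax if absmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -32, qmax)
    return q * scale                         # dequantized forward weights
```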
Novel Contributions
- Partial RoPE applied to 16 of 64 head dimensions
- Layer-wise RMSNorm scaling by 1/sqrt(layer_idx+1)
- EMA weight averaging during training
- Mixed int6/int8 quantization with zstd compression
- XSA on the last 4 layers
- SmearGate token blending gate
- Bigram hash embedding with projection
- Orthogonal + muP-scaled initialization
- Late QAT flag was included but had no effect due to constant folding