PR #1658

open

Submission: SP8192 + DepthRecur + MuonEq-R + SGD-TTT + SDClip GPTQ + Brotli-11

by AVINASH0052
val_bpb: 1.0810
Architecture: Transformer
Optimizer: Muon
Artifact Size

Training Techniques

Architecture
GQA
8 query heads / 4 KV heads grouped query attention
parameters: {"query_heads":8,"kv_heads":4}
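As a minimal sketch of what the 8/4 split implies, the snippet below maps each query head to the KV head it would share. The contiguous grouping and the helper name `kv_head_for` are assumptions for illustration; the submission does not state how heads are grouped.

```python
# Sketch: which KV head each query head reads from under grouped query attention.
# Assumption: query heads are grouped contiguously (heads 0-1 share KV head 0, and so on).
QUERY_HEADS = 8
KV_HEADS = 4


def kv_head_for(query_head: int, query_heads: int = QUERY_HEADS, kv_heads: int = KV_HEADS) -> int:
    """Return the index of the KV head shared by `query_head`."""
    if query_heads % kv_heads != 0:
        raise ValueError("query_heads must be a multiple of kv_heads")
    group_size = query_heads // kv_heads  # 2 query heads per KV head here
    return query_head // group_size


mapping = [kv_head_for(q) for q in range(QUERY_HEADS)]
print(mapping)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

With this split, the key and value projections are half the size they would be with one KV head per query head.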
MLP4x
Expanded MLP width to 4x model dimension
parameters: {"multiplier":4,"hidden_dim":2048}
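The listed numbers can be cross-checked with a little arithmetic. The sketch below assumes that `total_dimensions` in the Partial RoPE entry is the per-head width, which the page does not state explicitly.

```python
# Sketch: derive the model dimension from the MLP settings and compare it
# with the attention settings listed on this page.
MLP_MULTIPLIER = 4
MLP_HIDDEN_DIM = 2048
QUERY_HEADS = 8
HEAD_DIM = 64  # assumption: "total_dimensions" is the width of one attention head

model_dim = MLP_HIDDEN_DIM // MLP_MULTIPLIER
heads_match = model_dim == QUERY_HEADS * HEAD_DIM

print(model_dim)    # 512
print(heads_match)  # True
```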
depth recurrence
Layers 3-5 are repeated (2 passes), with loop warmup
parameters: {"layers":[3,4,5],"passes":2,"enabled_frac":0.35}
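One way to read these parameters is sketched below. Several points are assumptions rather than facts from the submission: layer indices are treated as 0-based, the stack is taken to have 11 layers (from the XSA entry), `passes: 2` is read as the looped block running twice in total, and `enabled_frac` is read as the fraction of training after which recurrence switches on. The function names are hypothetical.

```python
# Sketch: order in which layers would execute with and without depth recurrence.
def layer_schedule(num_layers, loop_layers, passes, recurrence_on):
    """Return the list of layer indices in execution order."""
    if not recurrence_on:
        return list(range(num_layers))
    first, last = loop_layers[0], loop_layers[-1]
    order = list(range(first))                # layers before the loop
    for _ in range(passes):                   # looped block, `passes` times in total
        order.extend(loop_layers)
    order.extend(range(last + 1, num_layers)) # layers after the loop
    return order


def recurrence_enabled(step, total_steps, enabled_frac=0.35):
    """Assumed warmup rule: recurrence is off for the first `enabled_frac` of training."""
    return step >= enabled_frac * total_steps


print(layer_schedule(11, [3, 4, 5], 2, True))
# [0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7, 8, 9, 10]
```

Under this reading the model has 11 sets of layer weights but an effective depth of 14 once recurrence is enabled.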
U-Net skip connections
U-Net-style skip connections between the encoder and decoder halves of the stack
parameters: null
XSA
XSA applied to all 11 layers
parameters: {"layers":11}
SmearGate
SmearGate gating mechanism
parameters: null
Partial RoPE
Rotary position embeddings applied to a subset of dimensions
parameters: {"dimensions":16,"total_dimensions":64}
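The following is a minimal sketch of partial rotary embeddings for a single head vector: the first 16 of 64 dimensions are rotated by a position-dependent angle and the remaining 48 pass through unchanged. The interleaved pairing of dimensions and the base of 10000 are assumptions; the submission may pair dimensions differently.

```python
import math


def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of `vec`; leave the rest untouched."""
    out = list(vec)
    for i in range(rotary_dims // 2):
        # Assumed frequency schedule: base ** (-2i / rotary_dims)
        theta = position * base ** (-2.0 * i / rotary_dims)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        a, b = vec[2 * i], vec[2 * i + 1]  # assumed interleaved pairing
        out[2 * i] = a * cos_t - b * sin_t
        out[2 * i + 1] = a * sin_t + b * cos_t
    return out


head = [1.0] * 64
rotated = partial_rope(head, position=5)
print(rotated[16:] == head[16:])  # True: the last 48 dimensions carry no position signal
```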
weight tying
Tied input and output embeddings
parameters: null
Regularization
logit softcap
parameters: {"value":30}
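A common form of logit softcapping squashes logits through a scaled tanh so they stay within the cap. The sketch below assumes that form; the page only gives the cap value of 30.

```python
import math


def softcap(logit, cap=30.0):
    """Smoothly bound a logit to [-cap, cap] (assumed tanh form)."""
    return cap * math.tanh(logit / cap)


for raw in (1.0, 30.0, 1000.0):
    print(raw, softcap(raw))
```

Small logits pass through nearly unchanged, while very large ones saturate at the cap.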
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
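As a quick illustration of the `1/sqrt(layer+1)` rule, the sketch below tabulates the scale per layer. It assumes 0-based layer indices and an 11-layer stack (taken from the XSA entry).

```python
import math

NUM_LAYERS = 11  # assumption: taken from "XSA applied to all 11 layers"

# Scale applied to the normalized activations of each layer (0-based index assumed).
ln_scales = [1.0 / math.sqrt(layer + 1) for layer in range(NUM_LAYERS)]

for layer, scale in enumerate(ln_scales):
    print(layer, round(scale, 4))
```

Under these assumptions the first layer is unscaled (1.0) and the fourth layer is scaled by 0.5.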
Optimizer
Muon
weight_decay: 0.095
momentum: 0.99
other_params: {"variant":"MuonEq-R","row_norm":true,"matrix_lr":0.022,"scalar_lr":0.02,"adamw_for":["embeddings","scalars"]}
Weight Averaging
EMA
parameters: {"decay":0.9965}
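The standard exponential moving average update with this decay can be sketched as follows. The function name and the flat-list representation of weights are illustrative only.

```python
def ema_update(ema, weights, decay=0.9965):
    """Blend the current weights into the running average."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema, weights)]


ema = [0.0, 0.0]
ema = ema_update(ema, [1.0, 2.0])
print(ema)
```

A decay of 0.9965 corresponds to an averaging horizon of roughly 1 / (1 - 0.9965), or about 286 steps.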
Compression
brotli
level: 11
Quantization
GPTQ
bits: 6
scope: attention/MLP matrices
int8
bits: 8
scope: token embeddings
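To make the bit widths concrete, here is a sketch of plain symmetric round-to-nearest quantization at 8 and 6 bits. This is deliberately not GPTQ, which additionally compensates rounding error using second-order information, and it does not include the SDClip clipping rule; it only shows the integer grid each bit width provides.

```python
def quantize_symmetric(values, bits):
    """Round-to-nearest symmetric quantization; returns (integer codes, scale)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 31 for 6 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


row = [-1.0, -0.3, 0.0, 0.42, 1.0]
for bits in (8, 6):
    codes, scale = quantize_symmetric(row, bits)
    error = max(abs(a - b) for a, b in zip(row, dequantize(codes, scale)))
    print(bits, codes, error)
```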
Evaluation
sliding window eval
parameters: {"stride":64}
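A sliding-window evaluation with stride 64 can be sketched as below. It assumes a window of 2048 tokens (the listed eval length) and that, after the first window, only the newly covered tokens are scored so each token is counted once. The generator name and tuple layout are hypothetical.

```python
def sliding_windows(num_tokens, window=2048, stride=64):
    """Yield (ctx_start, end, score_start).

    The model sees tokens [ctx_start, end) and only tokens [score_start, end)
    contribute to the reported loss.
    """
    if num_tokens <= 0:
        return
    score_start = 0
    end = min(window, num_tokens)
    while True:
        yield (max(0, end - window), end, score_start)
        if end >= num_tokens:
            break
        score_start = end
        end = min(end + stride, num_tokens)


for ctx_start, end, score_start in sliding_windows(2200):
    print(ctx_start, end, score_start)
```

Every scored token after the first window has close to a full window of preceding context, at the cost of one forward pass per 64 tokens.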
Test-Time Training
score-first SGD
parameters: {"epochs_per_chunk":3,"chunk_size":32000,"learning_rate":0.005,"momentum":0.9,"grad_clip":1}
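The ordering implied by "score-first" can be sketched as follows: each chunk is scored with the current weights before any update uses it, and only then is the model trained on that chunk. The toy one-parameter model, the function names, and the PyTorch-style momentum rule are assumptions for illustration; only the hyperparameter values come from the listing.

```python
import math


def clip_by_norm(grad, max_norm=1.0):
    """Rescale the gradient so its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm or norm == 0.0:
        return list(grad)
    return [g * max_norm / norm for g in grad]


def score_first_ttt(params, chunks, loss_fn, grad_fn,
                    epochs=3, lr=0.005, momentum=0.9, grad_clip=1.0):
    """Score each chunk first, then adapt on it with SGD plus momentum."""
    velocity = [0.0] * len(params)
    losses = []
    for chunk in chunks:
        losses.append(loss_fn(params, chunk))  # scored before any update on this chunk
        for _ in range(epochs):
            grad = clip_by_norm(grad_fn(params, chunk), grad_clip)
            velocity = [momentum * v + g for v, g in zip(velocity, grad)]
            params = [p - lr * v for p, v in zip(params, velocity)]
    return params, losses


# Toy model (hypothetical): one weight w predicting every value in the chunk.
def toy_loss(params, chunk):
    return sum((params[0] - x) ** 2 for x in chunk) / len(chunk)


def toy_grad(params, chunk):
    return [sum(2.0 * (params[0] - x) for x in chunk) / len(chunk)]


final_params, chunk_losses = score_first_ttt([0.0], [[1.0] * 4, [1.0] * 4], toy_loss, toy_grad)
print(chunk_losses)
```

Because the reported loss for a chunk is recorded before training on it, adaptation can only help later chunks, never the chunk being scored.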
Sequence Length
sequence_length
train_length: 2048
eval_length: 2048
LR Schedule
warmdown
parameters: {"warmdown_frac":0.72}
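A warmdown schedule with `warmdown_frac` of 0.72 can be sketched as a constant learning rate followed by a decay over the last 72% of training. The linear decay to zero and the absence of a warmup phase are assumptions; the listing only gives the fraction.

```python
def lr_multiplier(step, total_steps, warmdown_frac=0.72):
    """Multiplier on the base learning rate at `step` (assumed linear warmdown to 0)."""
    if warmdown_frac <= 0:
        return 1.0
    warmdown_start = (1.0 - warmdown_frac) * total_steps
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / (warmdown_frac * total_steps))


for step in (0, 280, 640, 1000):
    print(step, lr_multiplier(step, 1000))
```

With 1000 total steps, the rate would stay flat for roughly the first 280 steps and reach half its base value near step 640.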

Novel Contributions

  • SP8192 vocabulary
  • Depth recurrence with loop warmup
  • MuonEq-R row-normalized optimizer variant
  • Score-first SGD chunk test-time training
  • SDClip sigma-based GPTQ clipping
  • int8 embedding quantization
  • Brotli-11 artifact compression
  • U-Net skip connections with GPT-J style parallel residuals
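For the last item, a GPT-J style parallel residual block feeds the same normalized input to both the attention and MLP branches and adds both outputs to the residual stream in one step. The sketch below uses toy stand-in functions; how the submission combines this with the U-Net skips is not described on this page.

```python
def parallel_block(x, norm, attn, mlp):
    """x + attn(norm(x)) + mlp(norm(x)), with both branches reading the same input."""
    h = norm(x)
    attn_out = attn(h)
    mlp_out = mlp(h)
    return [xi + ai + mi for xi, ai, mi in zip(x, attn_out, mlp_out)]


# Toy stand-ins (hypothetical) so the sketch runs without a real model.
identity_norm = lambda v: list(v)
toy_attn = lambda h: [2.0 * t for t in h]
toy_mlp = lambda h: [t + 1.0 for t in h]

block_out = parallel_block([1.0, 2.0], identity_norm, toy_attn, toy_mlp)
print(block_out)  # [5.0, 9.0]
```

In a sequential block the MLP would instead read the output of the attention sublayer, so the two branches could not be computed concurrently.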