PR #1126 (open)

review: Rerun of PR #1089

by AnirudhRahul

val_bpb: 1.1091
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.3 MB

Training Techniques

Architecture
Turbo-Muon
Muon variant with AOL preconditioning, Polar Express coefficients, and post-NS row/col normalization to reduce Newton-Schulz iterations.
parameters: {"newton_schulz_iterations":4}
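A minimal sketch of the quintic Newton-Schulz orthogonalization at the core of Muon. The coefficients below are the standard Muon constants, used here as a stand-in; the PR's Turbo-Muon variant substitutes per-iteration Polar Express coefficients plus AOL preconditioning and a post-NS row/col normalization, which is what lets it run only 4 iterations.

```python
import numpy as np

# Quintic Newton-Schulz iteration as used in Muon. Coefficients are
# the standard Muon constants (3.4445, -4.7750, 2.0315); Turbo-Muon
# swaps in per-step Polar Express coefficients and adds AOL
# preconditioning plus post-iteration row/col normalization.
def newton_schulz(G, steps=4, a=3.4445, b=-4.7750, c=2.0315):
    X = G / (np.linalg.norm(G) + 1e-7)  # spectral norm <= Frobenius norm <= 1
    transpose = G.shape[0] > G.shape[1]
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))
O = newton_schulz(W, steps=5)
print(np.linalg.svd(O, compute_uv=False)[:3])  # singular values near 1
```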
EngramLite
Multi-head prime-based hash embeddings capturing bigram and trigram statistics.
parameters: {"heads":2,"orders":2,"buckets":8192}
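A hedged sketch of the prime-based hash-embedding idea with the listed settings (2 heads, 2 n-gram orders, 8192 buckets). The specific primes, mixing function, and embedding dimension here are illustrative assumptions, not taken from the PR.

```python
import numpy as np

# Hypothetical sketch of multi-head prime-hashed n-gram embeddings.
# The primes and the mixing scheme are illustrative assumptions.
HEADS, BUCKETS, DIM = 2, 8192, 32
PRIMES = [(1000003, 998244353), (15485863, 32452843)]  # one pair per head

def ngram_buckets(tokens, order, head):
    """Hash the (order+1)-gram ending at each position into a bucket."""
    p1, p2 = PRIMES[head]
    h = np.zeros(len(tokens), dtype=np.int64)
    for k in range(order + 1):  # mix current token with k predecessors
        shifted = np.concatenate([np.zeros(k, dtype=np.int64),
                                  tokens[:len(tokens) - k]])
        h = (h * p1 + shifted * p2) % BUCKETS
    return h

rng = np.random.default_rng(0)
tables = rng.standard_normal((HEADS, BUCKETS, DIM)) * 0.02
tokens = np.array([17, 4, 4, 99, 17], dtype=np.int64)

# Sum bigram (order=1) and trigram (order=2) embeddings over both heads.
emb = sum(tables[h][ngram_buckets(tokens, order, h)]
          for h in range(HEADS) for order in (1, 2))
print(emb.shape)  # (5, 32)
```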
Parameter Banking
Stores per-layer linear weights in contiguous banks to enable batched orthogonalization and reduce optimizer overhead.
parameters: null
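The banking idea can be sketched as follows, with layer modules holding views into one contiguous buffer so the optimizer can update every layer in a single batched operation (shapes here are illustrative):

```python
import numpy as np

# Hypothetical sketch: same-shape per-layer weights live in one
# contiguous bank; model layers hold views into it, so the optimizer
# can orthogonalize all layers with one batched call instead of a
# Python loop over layers.
n_layers, d = 12, 64
bank = np.zeros((n_layers, d, d), dtype=np.float32)  # contiguous buffer
layer_views = [bank[i] for i in range(n_layers)]     # layers see views

layer_views[3][:] = np.eye(d)  # writing through a view updates the bank

# One batched step over every layer at once, e.g. the X @ X.T product
# inside a Newton-Schulz-style orthogonalization:
A = bank @ bank.transpose(0, 2, 1)
print(A.shape)  # (12, 64, 64)
```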
U-Net skip connections
Encoder/decoder skip connections with learned sigmoid gates.
parameters: null
ValueEmbedding
Reinjects token identity into attention values at deep layers.
parameters: {"layers":[9,10]}
SmearGate
Causal shift blending each token with its predecessor using padding-based mixing.
parameters: null
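The padding-based causal mix can be sketched as below. The PR presumably learns the blend weight through a sigmoid gate; a constant gate stands in here.

```python
import numpy as np

# Sketch of a causal "smear": each token is blended with its
# predecessor, with a zero pad in front so no information flows
# backward. The real gate is presumably learned (sigmoid); a
# constant stands in here.
def smear(x, gate=0.25):
    prev = np.concatenate([np.zeros((1, x.shape[1])), x[:-1]], axis=0)
    return (1.0 - gate) * x + gate * prev  # token i mixes with token i-1

x = np.arange(8, dtype=np.float64).reshape(4, 2)
y = smear(x)
```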
XSA
Cross-sequence attention applied to all layers, subtracting self-value projection from attention output.
parameters: {"layers":11}
Mimetic V-O initialization
Output projections initialized as a small negative multiple of value projections per head.
parameters: {"alpha":0.05}
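One plausible reading of this initialization, sketched with the listed alpha=0.05: each head's output projection starts as the negated, scaled transpose of its value projection (the transpose maps head dim back to model dim). The exact per-head layout in the PR may differ.

```python
import numpy as np

# Hedged sketch of mimetic V-O init: per head, the output projection
# starts as a small negative multiple of the value projection's
# transpose. The per-head layout is an assumption, not from the PR.
heads, d_model, d_head = 8, 512, 64
alpha = 0.05
rng = np.random.default_rng(0)

W_v = rng.standard_normal((heads, d_model, d_head)) * d_model ** -0.5
W_o = -alpha * W_v.transpose(0, 2, 1)  # (heads, d_head, d_model)
```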
Partial RoPE
Applies rotary position embeddings to only part of the head dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
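With the listed settings, only 16 of the 64 head dimensions get rotary embeddings; the remaining 48 stay position-independent. A minimal sketch (rotate-by-halves layout assumed):

```python
import numpy as np

# Partial RoPE: rotate only the first rot_dims of d head dimensions,
# pass the rest through unchanged. Rotate-by-halves layout assumed.
def partial_rope(x, rot_dims=16, base=10000.0):
    seq, d = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.outer(np.arange(seq), inv_freq)  # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)

x = np.random.default_rng(0).standard_normal((10, 64))
y = partial_rope(x)
```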
GQA
Grouped Query Attention with fewer KV heads than query heads.
parameters: {"query_heads":8,"kv_heads":4}
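With 8 query heads over 4 KV heads, each KV head serves 2 query heads, halving KV parameters and cache. The sharing can be sketched as:

```python
import numpy as np

# GQA sketch: 8 query heads share 4 KV heads; each KV head is
# repeated so consecutive pairs of query heads attend to it.
q_heads, kv_heads, seq, d_head = 8, 4, 10, 64
rng = np.random.default_rng(0)
q = rng.standard_normal((q_heads, seq, d_head))
k = rng.standard_normal((kv_heads, seq, d_head))

k_expanded = np.repeat(k, q_heads // kv_heads, axis=0)  # (8, seq, d_head)
scores = q @ k_expanded.transpose(0, 2, 1) / np.sqrt(d_head)
```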
weight tying
Input and output embeddings share weights.
parameters: null
LeakyReLU
Uses LeakyReLU squared in the MLP.
parameters: {"negative_slope":0.3}
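One plausible reading of "LeakyReLU squared" with the listed slope, by analogy to the ReLU-squared MLP activation common in speedrun models; the PR's exact formulation may differ (e.g. it might preserve the sign on the negative branch).

```python
import numpy as np

# One plausible reading of "LeakyReLU squared" (analogous to ReLU^2);
# the PR's exact formulation may differ, e.g. sign-preserving.
def leaky_relu_sq(x, negative_slope=0.3):
    leaky = np.where(x >= 0, x, negative_slope * x)
    return leaky * leaky

out = leaky_relu_sq(np.array([2.0, -2.0]))  # 2 -> 4, -2 -> (-0.6)^2
```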
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer_idx + 1)"}
logit softcap
parameters: {"softcap":30}
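Softcapping with cap 30 is the standard tanh squash: logits pass through nearly unchanged when small and saturate at ±30.

```python
import numpy as np

# Logit softcapping: bounded to (-cap, cap), near-identity for
# logits much smaller than the cap.
def softcap(logits, cap=30.0):
    return cap * np.tanh(logits / cap)

print(softcap(np.array([0.0, 15.0, 1000.0])))  # last entry saturates near 30
```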
magnitude pruning
parameters: {"threshold":"|q| <= 2"}
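Reading the threshold as acting on integer quantized codes, pruning zeroes every code with |q| <= 2, which both drops small weights and lengthens the zero runs the entropy coder exploits:

```python
import numpy as np

# Magnitude pruning on quantized codes: zero everything with
# |q| <= threshold; surviving large-magnitude codes are untouched.
def prune(q, threshold=2):
    return np.where(np.abs(q) <= threshold, 0, q)

q = np.array([-5, -2, -1, 0, 1, 2, 3], dtype=np.int8)
pruned = prune(q)  # only -5 and 3 survive
```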
Quantization
GPTQ
bits: null
scope: weights
late QAT
bits: null
scope: weights
mixed int5/int6/int7
bits: null
scope: weights
Compression
brotli
level: 11
lzma
level: null
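The byte-shuffle + entropy-coding idea mentioned in the contributions can be sketched as below; stdlib lzma stands in for brotli level 11 so the example needs no third-party package.

```python
import lzma
import numpy as np

# Sketch of byte-shuffle + entropy coding. Grouping the k-th byte of
# every element together makes smooth numeric data far more
# compressible. The PR pairs this with brotli at level 11; stdlib
# lzma stands in here.
def byte_shuffle(arr):
    b = arr.view(np.uint8).reshape(arr.size, arr.itemsize)
    return b.T.copy().tobytes()

def byte_unshuffle(data, dtype, size):
    itemsize = np.dtype(dtype).itemsize
    b = np.frombuffer(data, dtype=np.uint8).reshape(itemsize, size)
    return b.T.copy().view(dtype).reshape(size)

w = np.random.default_rng(0).standard_normal(4096).astype(np.float16)
packed = lzma.compress(byte_shuffle(w), preset=9)
restored = byte_unshuffle(lzma.decompress(packed), np.float16, w.size)
```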
Weight Averaging
SWA
parameters: {"interval":50,"start_fraction":0.2}
EMA
parameters: {"decay":0.997}
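The two averaging schemes with the listed settings can be sketched together; plain floats stand in for parameter tensors, and the combination rule (whether SWA and EMA are chained or kept separate) is not specified by the PR.

```python
# Sketch of the two weight-averaging schemes with the listed settings.
# Scalars stand in for parameter tensors; how the PR combines the two
# averages is an assumption left open here.
def run_averaging(weights_per_step, total_steps,
                  swa_interval=50, swa_start_fraction=0.2, ema_decay=0.997):
    swa_sum, swa_count = 0.0, 0
    ema = weights_per_step[0]
    for step, w in enumerate(weights_per_step):
        ema = ema_decay * ema + (1 - ema_decay) * w  # EMA: every step
        if step >= swa_start_fraction * total_steps and step % swa_interval == 0:
            swa_sum, swa_count = swa_sum + w, swa_count + 1  # SWA snapshot
    return swa_sum / max(swa_count, 1), ema

steps = 1000
swa, ema = run_averaging(list(range(steps)), steps)
```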
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.025}
Adam
weight_decay: 0.04
momentum: null
other_params: {"lr":0.6,"betas":[0.7,0.95]}
Adam
weight_decay: null
momentum: null
other_params: {"lr":0.035,"betas":[0.7,0.95]}
Adam
weight_decay: null
momentum: null
other_params: {"lr":0.025,"betas":[0.9,0.95]}

Novel Contributions

  • Rerun and comparison of the latest fetched PR #1089 head using the executable submission wrapper
  • Documentation of the rerun environment and exact seed-42 outputs in-repo
  • Evidence that the executable code path still reserves 14000ms despite README-only mention of 9000ms
  • Mixed-precision GPTQ pipeline with dynamic bit allocation across tensor groups
  • Turbo-Muon optimizer with AOL preconditioning, Polar Express coefficients, and post-normalization
  • EngramLite hash embeddings for bigram and trigram context
  • Parameter banking for batched orthogonalization and reduced optimizer overhead
  • Selective pruning plus brotli/byte-shuffle compression to fit the artifact budget