PR #1126 (open)

review: Rerun of PR #1089

by AnirudhRahul

val_bpb: 1.1091
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.3 MB

Training Techniques

Architecture
Turbo-Muon
Muon variant with AOL preconditioning, Polar Express coefficients, and post-NS row/col normalization to reduce Newton-Schulz iterations.
parameters: {"newton_schulz_iterations":4}
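A minimal sketch of the quintic Newton-Schulz orthogonalization at the core of Muon. The coefficients below are the standard Muon constants, used here as a stand-in; the PR's Turbo-Muon variant substitutes per-iteration Polar Express coefficients plus AOL preconditioning and a post-NS row/col normalization, which is what lets it run only 4 iterations.

```python
import numpy as np

# Quintic Newton-Schulz iteration as used in Muon. Coefficients are
# the standard Muon constants (3.4445, -4.7750, 2.0315); Turbo-Muon
# swaps in per-step Polar Express coefficients and adds AOL
# preconditioning plus post-iteration row/col normalization.
def newton_schulz(G, steps=4, a=3.4445, b=-4.7750, c=2.0315):
    X = G / (np.linalg.norm(G) + 1e-7)  # spectral norm <= Frobenius norm <= 1
    transpose = G.shape[0] > G.shape[1]
    if transpose:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))
O = newton_schulz(W, steps=5)
print(np.linalg.svd(O, compute_uv=False)[:3])  # singular values near 1
```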
EngramLite
Multi-head prime-based hash embeddings capturing bigram and trigram statistics.
parameters: {"heads":2,"orders":2,"buckets":8192}
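A hedged sketch of the prime-based hash-embedding idea with the listed settings (2 heads, 2 n-gram orders, 8192 buckets). The specific primes, mixing function, and embedding dimension here are illustrative assumptions, not taken from the PR.

```python
import numpy as np

# Hypothetical sketch of multi-head prime-hashed n-gram embeddings.
# The primes and the mixing scheme are illustrative assumptions.
HEADS, BUCKETS, DIM = 2, 8192, 32
PRIMES = [(1000003, 998244353), (15485863, 32452843)]  # one pair per head

def ngram_buckets(tokens, order, head):
    """Hash the (order+1)-gram ending at each position into a bucket."""
    p1, p2 = PRIMES[head]
    h = np.zeros(len(tokens), dtype=np.int64)
    for k in range(order + 1):  # mix current token with k predecessors
        shifted = np.concatenate([np.zeros(k, dtype=np.int64),
                                  tokens[:len(tokens) - k]])
        h = (h * p1 + shifted * p2) % BUCKETS
    return h

rng = np.random.default_rng(0)
tables = rng.standard_normal((HEADS, BUCKETS, DIM)) * 0.02
tokens = np.array([17, 4, 4, 99, 17], dtype=np.int64)

# Sum bigram (order=1) and trigram (order=2) embeddings over both heads.
emb = sum(tables[h][ngram_buckets(tokens, order, h)]
          for h in range(HEADS) for order in (1, 2))
print(emb.shape)  # (5, 32)
```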
Parameter Banking
Stores per-layer linear weights in contiguous banks to enable batched orthogonalization and reduce optimizer overhead.
parameters: null
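The banking idea can be sketched as follows, with layer modules holding views into one contiguous buffer so the optimizer can update every layer in a single batched operation (shapes here are illustrative):

```python
import numpy as np

# Hypothetical sketch: same-shape per-layer weights live in one
# contiguous bank; model layers hold views into it, so the optimizer
# can orthogonalize all layers with one batched call instead of a
# Python loop over layers.
n_layers, d = 12, 64
bank = np.zeros((n_layers, d, d), dtype=np.float32)  # contiguous buffer
layer_views = [bank[i] for i in range(n_layers)]     # layers see views

layer_views[3][:] = np.eye(d)  # writing through a view updates the bank

# One batched step over every layer at once, e.g. the X @ X.T product
# inside a Newton-Schulz-style orthogonalization:
A = bank @ bank.transpose(0, 2, 1)
print(A.shape)  # (12, 64, 64)
```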
U-Net skip connections
Encoder/decoder skip connections with learned sigmoid gates.
parameters: null
ValueEmbedding
Reinjects token identity into attention values at deep layers.
parameters: {"layers":[9,10]}
SmearGate
Causal shift blending each token with its predecessor using padding-based mixing.
parameters: null
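The padding-based causal mix can be sketched as below. The PR presumably learns the blend weight through a sigmoid gate; a constant gate stands in here.

```python
import numpy as np

# Sketch of a causal "smear": each token is blended with its
# predecessor, with a zero pad in front so no information flows
# backward. The real gate is presumably learned (sigmoid); a
# constant stands in here.
def smear(x, gate=0.25):
    prev = np.concatenate([np.zeros((1, x.shape[1])), x[:-1]], axis=0)
    return (1.0 - gate) * x + gate * prev  # token i mixes with token i-1

x = np.arange(8, dtype=np.float64).reshape(4, 2)
y = smear(x)
```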
XSA
Cross-sequence attention applied to all layers, subtracting self-value projection from attention output.
parameters: {"layers":11}
Mimetic V-O initialization
Output projections initialized as a small negative multiple of value projections per head.
parameters: {"alpha":0.05}
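One plausible reading of this initialization, sketched with the listed alpha=0.05: each head's output projection starts as the negated, scaled transpose of its value projection (the transpose maps head dim back to model dim). The exact per-head layout in the PR may differ.

```python
import numpy as np

# Hedged sketch of mimetic V-O init: per head, the output projection
# starts as a small negative multiple of the value projection's
# transpose. The per-head layout is an assumption, not from the PR.
heads, d_model, d_head = 8, 512, 64
alpha = 0.05
rng = np.random.default_rng(0)

W_v = rng.standard_normal((heads, d_model, d_head)) * d_model ** -0.5
W_o = -alpha * W_v.transpose(0, 2, 1)  # (heads, d_head, d_model)
```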
Partial RoPE
Applies rotary position embeddings to only part of the head dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
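With the listed settings, only 16 of the 64 head dimensions get rotary embeddings; the remaining 48 stay position-independent. A minimal sketch (rotate-by-halves layout assumed):

```python
import numpy as np

# Partial RoPE: rotate only the first rot_dims of d head dimensions,
# pass the rest through unchanged. Rotate-by-halves layout assumed.
def partial_rope(x, rot_dims=16, base=10000.0):
    seq, d = x.shape
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)
    ang = np.outer(np.arange(seq), inv_freq)  # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=1)
    return np.concatenate([rotated, x[:, rot_dims:]], axis=1)

x = np.random.default_rng(0).standard_normal((10, 64))
y = partial_rope(x)
```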
GQA
Grouped Query Attention with fewer KV heads than query heads.
parameters: {"query_heads":8,"kv_heads":4}
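With 8 query heads over 4 KV heads, each KV head serves 2 query heads, halving KV parameters and cache. The sharing can be sketched as:

```python
import numpy as np

# GQA sketch: 8 query heads share 4 KV heads; each KV head is
# repeated so consecutive pairs of query heads attend to it.
q_heads, kv_heads, seq, d_head = 8, 4, 10, 64
rng = np.random.default_rng(0)
q = rng.standard_normal((q_heads, seq, d_head))
k = rng.standard_normal((kv_heads, seq, d_head))

k_expanded = np.repeat(k, q_heads // kv_heads, axis=0)  # (8, seq, d_head)
scores = q @ k_expanded.transpose(0, 2, 1) / np.sqrt(d_head)
```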
weight tying
Input and output embeddings share weights.
parameters: null
LeakyReLU
Uses LeakyReLU squared in the MLP.
parameters: {"negative_slope":0.3}
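One plausible reading of "LeakyReLU squared" with the listed slope, by analogy to the ReLU-squared MLP activation common in speedrun models; the PR's exact formulation may differ (e.g. it might preserve the sign on the negative branch).

```python
import numpy as np

# One plausible reading of "LeakyReLU squared" (analogous to ReLU^2);
# the PR's exact formulation may differ, e.g. sign-preserving.
def leaky_relu_sq(x, negative_slope=0.3):
    leaky = np.where(x >= 0, x, negative_slope * x)
    return leaky * leaky

out = leaky_relu_sq(np.array([2.0, -2.0]))  # 2 -> 4, -2 -> (-0.6)^2
```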
Regularization
LN scale
parameters: {"scale":"1/sqrt(layer_idx + 1)"}
logit softcap
parameters: {"softcap":30}
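Softcapping with cap 30 is the standard tanh squash: logits pass through nearly unchanged when small and saturate at ±30.

```python
import numpy as np

# Logit softcapping: bounded to (-cap, cap), near-identity for
# logits much smaller than the cap.
def softcap(logits, cap=30.0):
    return cap * np.tanh(logits / cap)

print(softcap(np.array([0.0, 15.0, 1000.0])))  # last entry saturates near 30
```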
magnitude pruning
parameters: {"threshold":"|q| <= 2"}
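Reading the threshold as acting on integer quantized codes, pruning zeroes every code with |q| <= 2, which both drops small weights and lengthens the zero runs the entropy coder exploits:

```python
import numpy as np

# Magnitude pruning on quantized codes: zero everything with
# |q| <= threshold; surviving large-magnitude codes are untouched.
def prune(q, threshold=2):
    return np.where(np.abs(q) <= threshold, 0, q)

q = np.array([-5, -2, -1, 0, 1, 2, 3], dtype=np.int8)
pruned = prune(q)  # only -5 and 3 survive
```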
Quantization
GPTQ
bits: null
scope: weights
late QAT
bits: null
scope: weights
mixed int5/int6/int7
bits: null
scope: weights
Compression
brotli
level: 11
lzma
level: null
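The byte-shuffle + entropy-coding idea mentioned in the contributions can be sketched as below; stdlib lzma stands in for brotli level 11 so the example needs no third-party package.

```python
import lzma
import numpy as np

# Sketch of byte-shuffle + entropy coding. Grouping the k-th byte of
# every element together makes smooth numeric data far more
# compressible. The PR pairs this with brotli at level 11; stdlib
# lzma stands in here.
def byte_shuffle(arr):
    b = arr.view(np.uint8).reshape(arr.size, arr.itemsize)
    return b.T.copy().tobytes()

def byte_unshuffle(data, dtype, size):
    itemsize = np.dtype(dtype).itemsize
    b = np.frombuffer(data, dtype=np.uint8).reshape(itemsize, size)
    return b.T.copy().view(dtype).reshape(size)

w = np.random.default_rng(0).standard_normal(4096).astype(np.float16)
packed = lzma.compress(byte_shuffle(w), preset=9)
restored = byte_unshuffle(lzma.decompress(packed), np.float16, w.size)
```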
Weight Averaging
SWA
parameters: {"interval":50,"start_fraction":0.2}
EMA
parameters: {"decay":0.997}
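The two averaging schemes with the listed settings can be sketched together; plain floats stand in for parameter tensors, and the combination rule (whether SWA and EMA are chained or kept separate) is not specified by the PR.

```python
# Sketch of the two weight-averaging schemes with the listed settings.
# Scalars stand in for parameter tensors; how the PR combines the two
# averages is an assumption left open here.
def run_averaging(weights_per_step, total_steps,
                  swa_interval=50, swa_start_fraction=0.2, ema_decay=0.997):
    swa_sum, swa_count = 0.0, 0
    ema = weights_per_step[0]
    for step, w in enumerate(weights_per_step):
        ema = ema_decay * ema + (1 - ema_decay) * w  # EMA: every step
        if step >= swa_start_fraction * total_steps and step % swa_interval == 0:
            swa_sum, swa_count = swa_sum + w, swa_count + 1  # SWA snapshot
    return swa_sum / max(swa_count, 1), ema

steps = 1000
swa, ema = run_averaging(list(range(steps)), steps)
```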
Optimizer
Muon
weight_decay: 0.04
momentum: 0.99
other_params: {"lr":0.025}
Adam
weight_decay: 0.04
momentum: null
other_params: {"lr":0.6,"betas":[0.7,0.95]}
Adam
weight_decay: null
momentum: null
other_params: {"lr":0.035,"betas":[0.7,0.95]}
Adam
weight_decay: null
momentum: null
other_params: {"lr":0.025,"betas":[0.9,0.95]}

Novel Contributions

  • Rerun and comparison of the latest fetched PR #1089 head using the executable submission wrapper
  • Documentation of the rerun environment and exact seed-42 outputs in-repo
  • Evidence that the executable code path still reserves 14000ms despite README-only mention of 9000ms
  • Mixed-precision GPTQ pipeline with dynamic bit allocation across tensor groups
  • Turbo-Muon optimizer with AOL preconditioning, Polar Express coefficients, and post-normalization
  • EngramLite hash embeddings for bigram and trigram context
  • Parameter banking for batched orthogonalization and reduced optimizer overhead
  • Selective pruning plus brotli/byte-shuffle compression to fit the artifact budget