PR #1401 (open)

Submission/epsilon flashonly 2026 04 06

by teerthsharma
val_bpb: 1.1100
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.9 MB

Training Techniques

Architecture
GQA
Grouped query attention with 8 attention heads and 4 KV heads in a GPT-style 11-layer model.
parameters: {"layers":11,"dimensions":512,"heads":8,"kv_heads":4}
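The head layout above (8 query heads sharing 4 KV heads) can be sketched as a simple index mapping; names and group assignment are illustrative, not taken from the submission's code.

```python
# Hypothetical sketch of the grouped-query attention layout described above:
# 8 query heads share 4 KV heads, so each KV head serves a group of 2 queries.

def kv_head_for(query_head: int, n_heads: int = 8, n_kv_heads: int = 4) -> int:
    """Map a query head index to the KV head whose K/V it reuses."""
    group_size = n_heads // n_kv_heads  # 2 query heads per KV head
    return query_head // group_size

# Query heads 0-7 map onto KV heads 0-3, two queries per KV head.
assignment = [kv_head_for(h) for h in range(8)]
print(assignment)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

GQA keeps the query resolution of multi-head attention while halving the KV cache here (4 KV heads instead of 8).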
XSA
XSA applied on the last 11 layers.
parameters: {"layers":11}
BigramHash
Bigram hash feature used in the model.
parameters: null
SmearGate
SmearGate gating mechanism used in the model.
parameters: null
VE
Value enhancement / VE component used in the architecture.
parameters: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"split":"Muon/AdamW"}
AdamW
weight_decay: null
momentum: null
other_params: {"split":"Muon/AdamW"}
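The "Muon/AdamW" split usually means matrix-shaped weights go to Muon while embeddings, norms, biases, and the output head stay on AdamW. The exact assignment in this submission is not specified; the sketch below shows one common partition with made-up parameter names.

```python
# Hedged sketch of a Muon/AdamW parameter split (assumption: the common
# convention of Muon for hidden 2-D weights, AdamW for everything else).

def split_params(named_shapes):
    muon, adamw = [], []
    for name, shape in named_shapes:
        # Muon's orthogonalized update targets matrix-shaped hidden weights;
        # embeddings and the LM head typically stay on AdamW even though
        # they are 2-D.
        if len(shape) == 2 and "embed" not in name and "lm_head" not in name:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw

params = [
    ("embed.weight", (50304, 512)),
    ("blocks.0.attn.qkv.weight", (1024, 512)),
    ("blocks.0.norm.weight", (512,)),
    ("lm_head.weight", (50304, 512)),
]
muon, adamw = split_params(params)
print(muon)   # ['blocks.0.attn.qkv.weight']
print(adamw)  # ['embed.weight', 'blocks.0.norm.weight', 'lm_head.weight']
```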
Weight Averaging
EMA + SWA
parameters: null
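The two averaging schemes named above differ in how they weight checkpoints; a minimal sketch over scalar "weights", with an illustrative decay value (the submission's schedule is not given):

```python
# EMA weights recent checkpoints more heavily; SWA is a uniform mean.
# decay=0.5 is for illustration only.

def ema_update(avg, w, decay=0.99):
    """Exponential moving average of a weight."""
    return decay * avg + (1.0 - decay) * w

def swa_update(avg, w, n):
    """Running uniform mean over n+1 checkpoints."""
    return (avg * n + w) / (n + 1)

ema, swa = 0.0, 0.0
for n, w in enumerate([1.0, 2.0, 3.0, 4.0]):
    ema = ema_update(ema, w, decay=0.5)
    swa = swa_update(swa, w, n)
print(ema)  # 3.0625: biased toward the latest checkpoint
print(swa)  # 2.5: the plain mean
```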
Quantization
GPTQ
bits: 6
scope: artifact
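GPTQ proper quantizes weights column by column with Hessian-based error compensation; the sketch below shows only the 6-bit rounding grid it targets (64 signed levels), to make "bits: 6" concrete. Values and scaling scheme are illustrative.

```python
# NOT GPTQ itself: a plain symmetric round-to-nearest onto a 6-bit signed
# grid, to illustrate what "bits: 6" means for the stored weights.

def quantize_6bit(ws):
    levels = 2 ** (6 - 1) - 1                 # 31 positive levels
    scale = max(abs(w) for w in ws) / levels  # per-tensor scale
    q = [max(-levels - 1, min(levels, round(w / scale))) for w in ws]
    dq = [qi * scale for qi in q]             # dequantized weights
    return q, dq, scale

q, dq, scale = quantize_6bit([0.5, -0.25, 0.1, -0.5])
print(q)  # integer codes in [-32, 31]
```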
Regularization
Magnitude pruning
parameters: {"magnitude_prune":"4%"}
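The 4% magnitude pruning listed above amounts to zeroing the 4% of weights with smallest absolute value; a sketch over a flat list (real implementations work per-tensor or globally, which the submission does not specify):

```python
# Hedged sketch of 4% magnitude pruning: zero the smallest-magnitude weights.

def magnitude_prune(ws, frac=0.04):
    k = int(len(ws) * frac)  # number of weights to zero
    if k == 0:
        return list(ws)
    threshold = sorted(abs(w) for w in ws)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in ws]

ws = [((-1) ** i) * (i + 1) / 25 for i in range(25)]  # 25 distinct magnitudes
pruned = magnitude_prune(ws)
print(sum(1 for w in pruned if w == 0.0))  # 1: exactly 4% of 25 weights
```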
Compression
lzma
level: null
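The artifact is lzma-compressed; a minimal round-trip with Python's stdlib (preset and filters here are the defaults, not necessarily what the submission used):

```python
import lzma

raw = bytes(1000)                  # stand-in for serialized weight bytes
packed = lzma.compress(raw)        # default preset
restored = lzma.decompress(packed)
assert restored == raw             # lossless round trip
print(len(raw), len(packed))       # highly redundant input -> much smaller
```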
Evaluation
sliding window eval
parameters: null
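A sliding-window evaluation scores each token with a context window that slides over the sequence, so every position past warm-up sees the same amount of context. The submission's window and stride are not given; this sketch uses a dummy per-token score in their place.

```python
# Hedged sketch of sliding-window evaluation; `score` stands in for the
# model's per-token loss. Window/stride values are illustrative.

def sliding_window_eval(tokens, window, stride, score):
    total, count = 0.0, 0
    for start in range(0, max(1, len(tokens) - window + 1), stride):
        chunk = tokens[start:start + window]
        # After the first window, only the last `stride` tokens are new,
        # so each token is scored exactly once.
        new = chunk if start == 0 else chunk[-stride:]
        total += sum(score(t) for t in new)
        count += len(new)
    return total / count

loss = sliding_window_eval(list(range(10)), window=4, stride=2,
                           score=lambda t: 1.0)
print(loss)  # 1.0: a constant per-token loss averages to itself
```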

Novel Contributions

  • AETHER sparse block-pruning with Lyapunov-stable governor
  • Cauchy-Schwarz block pruning with Lean4 proof of soundness
  • Lyapunov-stable geometric governor for adaptive sparsity
  • Chebyshev GC guard with Lean4 proof of bounded false reclamation
  • 12-layer GPT architecture within the same wall-clock budget via block-sparse attention