PR #1401 (open)

Submission/epsilon flashonly 2026 04 06

by teerthsharma
val_bpb: 1.1100
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.9 MB

Training Techniques

Architecture
GQA
Grouped query attention with 8 attention heads and 4 KV heads in a GPT-style 11-layer model.
parameters: {"layers":11,"dimensions":512,"heads":8,"kv_heads":4}
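The head layout above (8 query heads sharing 4 KV heads) can be sketched as a simple index mapping; names and group assignment are illustrative, not taken from the submission's code.

```python
# Hypothetical sketch of the grouped-query attention layout described above:
# 8 query heads share 4 KV heads, so each KV head serves a group of 2 queries.

def kv_head_for(query_head: int, n_heads: int = 8, n_kv_heads: int = 4) -> int:
    """Map a query head index to the KV head whose K/V it reuses."""
    group_size = n_heads // n_kv_heads  # 2 query heads per KV head
    return query_head // group_size

# Query heads 0-7 map onto KV heads 0-3, two queries per KV head.
assignment = [kv_head_for(h) for h in range(8)]
print(assignment)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

GQA keeps the query resolution of multi-head attention while halving the KV cache here (4 KV heads instead of 8).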
XSA
XSA applied on the last 11 layers.
parameters: {"layers":11}
BigramHash
Bigram hash feature used in the model.
parameters: null
SmearGate
SmearGate gating mechanism used in the model.
parameters: null
VE
Value enhancement / VE component used in the architecture.
parameters: null
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"split":"Muon/AdamW"}
AdamW
weight_decay: null
momentum: null
other_params: {"split":"Muon/AdamW"}
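The "Muon/AdamW" split usually means matrix-shaped weights go to Muon while embeddings, norms, biases, and the output head stay on AdamW. The exact assignment in this submission is not specified; the sketch below shows one common partition with made-up parameter names.

```python
# Hedged sketch of a Muon/AdamW parameter split (assumption: the common
# convention of Muon for hidden 2-D weights, AdamW for everything else).

def split_params(named_shapes):
    muon, adamw = [], []
    for name, shape in named_shapes:
        # Muon's orthogonalized update targets matrix-shaped hidden weights;
        # embeddings and the LM head typically stay on AdamW even though
        # they are 2-D.
        if len(shape) == 2 and "embed" not in name and "lm_head" not in name:
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw

params = [
    ("embed.weight", (50304, 512)),
    ("blocks.0.attn.qkv.weight", (1024, 512)),
    ("blocks.0.norm.weight", (512,)),
    ("lm_head.weight", (50304, 512)),
]
muon, adamw = split_params(params)
print(muon)   # ['blocks.0.attn.qkv.weight']
print(adamw)  # ['embed.weight', 'blocks.0.norm.weight', 'lm_head.weight']
```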
Weight Averaging
EMA + SWA
parameters: null
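The two averaging schemes named above differ in how they weight checkpoints; a minimal sketch over scalar "weights", with an illustrative decay value (the submission's schedule is not given):

```python
# EMA weights recent checkpoints more heavily; SWA is a uniform mean.
# decay=0.5 is for illustration only.

def ema_update(avg, w, decay=0.99):
    """Exponential moving average of a weight."""
    return decay * avg + (1.0 - decay) * w

def swa_update(avg, w, n):
    """Running uniform mean over n+1 checkpoints."""
    return (avg * n + w) / (n + 1)

ema, swa = 0.0, 0.0
for n, w in enumerate([1.0, 2.0, 3.0, 4.0]):
    ema = ema_update(ema, w, decay=0.5)
    swa = swa_update(swa, w, n)
print(ema)  # 3.0625: biased toward the latest checkpoint
print(swa)  # 2.5: the plain mean
```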
Quantization
GPTQ
bits: 6
scope: artifact
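GPTQ proper quantizes weights column by column with Hessian-based error compensation; the sketch below shows only the 6-bit rounding grid it targets (64 signed levels), to make "bits: 6" concrete. Values and scaling scheme are illustrative.

```python
# NOT GPTQ itself: a plain symmetric round-to-nearest onto a 6-bit signed
# grid, to illustrate what "bits: 6" means for the stored weights.

def quantize_6bit(ws):
    levels = 2 ** (6 - 1) - 1                 # 31 positive levels
    scale = max(abs(w) for w in ws) / levels  # per-tensor scale
    q = [max(-levels - 1, min(levels, round(w / scale))) for w in ws]
    dq = [qi * scale for qi in q]             # dequantized weights
    return q, dq, scale

q, dq, scale = quantize_6bit([0.5, -0.25, 0.1, -0.5])
print(q)  # integer codes in [-32, 31]
```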
Regularization
Magnitude pruning
parameters: {"magnitude_prune":"4%"}
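The 4% magnitude pruning listed above amounts to zeroing the 4% of weights with smallest absolute value; a sketch over a flat list (real implementations work per-tensor or globally, which the submission does not specify):

```python
# Hedged sketch of 4% magnitude pruning: zero the smallest-magnitude weights.

def magnitude_prune(ws, frac=0.04):
    k = int(len(ws) * frac)  # number of weights to zero
    if k == 0:
        return list(ws)
    threshold = sorted(abs(w) for w in ws)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in ws]

ws = [((-1) ** i) * (i + 1) / 25 for i in range(25)]  # 25 distinct magnitudes
pruned = magnitude_prune(ws)
print(sum(1 for w in pruned if w == 0.0))  # 1: exactly 4% of 25 weights
```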
Compression
lzma
level: null
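The artifact is lzma-compressed; a minimal round-trip with Python's stdlib (preset and filters here are the defaults, not necessarily what the submission used):

```python
import lzma

raw = bytes(1000)                  # stand-in for serialized weight bytes
packed = lzma.compress(raw)        # default preset
restored = lzma.decompress(packed)
assert restored == raw             # lossless round trip
print(len(raw), len(packed))       # highly redundant input -> much smaller
```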
Evaluation
sliding window eval
parameters: null
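A sliding-window evaluation scores each token with a context window that slides over the sequence, so every position past warm-up sees the same amount of context. The submission's window and stride are not given; this sketch uses a dummy per-token score in their place.

```python
# Hedged sketch of sliding-window evaluation; `score` stands in for the
# model's per-token loss. Window/stride values are illustrative.

def sliding_window_eval(tokens, window, stride, score):
    total, count = 0.0, 0
    for start in range(0, max(1, len(tokens) - window + 1), stride):
        chunk = tokens[start:start + window]
        # After the first window, only the last `stride` tokens are new,
        # so each token is scored exactly once.
        new = chunk if start == 0 else chunk[-stride:]
        total += sum(score(t) for t in new)
        count += len(new)
    return total / count

loss = sliding_window_eval(list(range(10)), window=4, stride=2,
                           score=lambda t: 1.0)
print(loss)  # 1.0: a constant per-token loss averages to itself
```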

Novel Contributions

  • AETHER sparse block-pruning with Lyapunov-stable governor
  • Cauchy-Schwarz block pruning with Lean4 proof of soundness
  • Lyapunov-stable geometric governor for adaptive sparsity
  • Chebyshev GC guard with Lean4 proof of bounded false reclamation
  • 12-layer GPT architecture within the same wall-clock budget via block-sparse attention