val_bpb: 1.1711
Architecture: Transformer
Optimizer: Muon
Artifact Size: 14.84 MB
Training Techniques
Architecture
- MLP3x: 11-layer model with 3x MLP width (1536 hidden).
  parameters: {"layers": 11, "mlp_multiplier": 3, "hidden_size": 1536}
- LeakyReLU: squared LeakyReLU activation.
  parameters: {"variant": "squared"}
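A minimal NumPy sketch of the squared-LeakyReLU activation. The slope value and the sign-preserving form (sign(y)·y²) are assumptions; the card only says "squared".

```python
import numpy as np

def leaky_relu_squared(x, slope=0.01):
    # LeakyReLU followed by squaring. sign(y) * y**2 keeps the sign of the
    # negative branch; whether the sign is preserved is an assumption.
    y = np.where(x >= 0.0, x, slope * x)
    return np.sign(y) * y * y
```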
- Partial RoPE: rotary positional embeddings applied to a subset of head dimensions (16 of 64).
  parameters: {"dimensions": 16, "total_dimensions": 64}
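A sketch of partial RoPE in NumPy: only the first 16 of 64 head channels are rotated, the rest pass through unchanged. The pairing scheme and frequency base are standard-RoPE assumptions, not taken from the card.

```python
import numpy as np

def partial_rope(x, positions, rot_dims=16, base=10000.0):
    """Apply rotary embeddings to the first rot_dims channels of x.

    x: (seq, head_dim); channels beyond rot_dims are left untouched.
    """
    half = rot_dims // 2
    inv_freq = base ** (-np.arange(half) / half)   # (half,)
    ang = positions[:, None] * inv_freq[None, :]   # (seq, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:rot_dims]      # the rotated pair
    rot = np.concatenate([x1 * cos - x2 * sin,
                          x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rot, x[:, rot_dims:]], axis=-1)
```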
- XSA: applied to all 11 layers.
  parameters: {"layers": 11}
- SmearGate: SmearGate mechanism added to the model.
  parameters: null
- BigramHash: hashed bigram embedding with a 3072-entry table of dimension 112.
  parameters: {"vocab_size": 3072, "dimension": 112}
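A sketch of a hashed bigram embedding: each (previous, current) token pair is hashed into a 3072-entry table of 112-dim vectors. The mixing constant and the leading BOS-style 0 for the first position are illustrative assumptions; only the table shape comes from the card.

```python
import numpy as np

def bigram_hash_embed(tokens, table, vocab_size=3072, mult=1000003):
    """Look up a hashed embedding for each (prev, cur) token bigram.

    table: (vocab_size, dim) embedding matrix, e.g. dim=112 as in the card.
    """
    prev = np.concatenate([[0], tokens[:-1]])      # shift right; 0 = assumed BOS
    idx = (prev * mult + tokens) % vocab_size      # cheap bigram hash (assumption)
    return table[idx]                              # (len(tokens), dim)
```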
Regularization
- LN scale
  parameters: {"scale": "1/sqrt(layer+1)"}
- Weight decay
  parameters: {"value": 0.04}
- Gradient clipping
  parameters: {"clip_norm": 0.3}
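A sketch of gradient clipping by global L2 norm. Whether the card's clip_norm=0.3 is a global-norm clip (as here) or per-parameter is an assumption.

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm=0.3):
    """Scale all gradients so their joint L2 norm is at most clip_norm."""
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = min(1.0, clip_norm / (total + 1e-12))
    return [g * scale for g in grads], total
```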
Weight Averaging
- EMA (exponential moving average of weights)
  parameters: {"decay": 0.997}
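A minimal sketch of the EMA weight average with the card's decay of 0.997: shadow weights are updated each step and would be the copy used for evaluation. Omitting bias correction is an assumption.

```python
import numpy as np

class EmaWeights:
    """ema <- decay * ema + (1 - decay) * w after each optimizer step."""

    def __init__(self, weights, decay=0.997):
        self.decay = decay
        self.shadow = [w.copy() for w in weights]

    def update(self, weights):
        d = self.decay
        for s, w in zip(self.shadow, weights):
            s *= d
            s += (1.0 - d) * w
```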
Quantization
- Late QAT
  bits: null, scope: all
- GPTQ
  bits: 7, scope: all
Compression
- LZMA
  level: 9
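A sketch of the artifact step using Python's standard-library lzma at preset 9. The use of pickle for serialization is an illustrative assumption; only LZMA level 9 comes from the card.

```python
import lzma
import pickle

def pack_artifact(arrays, preset=9):
    """Serialize quantized weight arrays and compress with LZMA preset 9."""
    return lzma.compress(pickle.dumps(arrays), preset=preset)

def unpack_artifact(blob):
    """Invert pack_artifact: decompress, then deserialize."""
    return pickle.loads(lzma.decompress(blob))
```

Quantized int weights compress well under LZMA because low-bit values repeat heavily; preset 9 trades compression time for the smallest artifact.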
Evaluation
- Sliding window eval
  parameters: {"stride": 64}
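A sketch of how sliding-window evaluation spans could be generated: the window advances by the card's stride of 64, each token is scored exactly once, and earlier tokens in the window serve only as context. The window size of 1024 is an assumption.

```python
def sliding_windows(n_tokens, window=1024, stride=64):
    """Return (start, end, score_from) spans for sliding-window eval.

    Each window covers tokens [start, end); only tokens in
    [score_from, end) contribute to the loss.
    """
    spans = []
    scored = 0   # number of tokens already scored
    start = 0
    while scored < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end, scored))
        scored = end
        start += stride
    return spans
```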
Optimizer
- Muon
  weight_decay: 0.04, momentum: null, other_params: null
LR Schedule
- Warmdown
  parameters: {"warmdown_steps": 3500}
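A sketch of the schedule, reading "warmdown" as a trailing linear decay (a common convention, but an assumption here); the base LR and total step count are illustrative, while warmdown_steps=3500 comes from the card.

```python
def lr_at(step, total_steps, base_lr=1.0, warmdown_steps=3500):
    """Constant LR, then linear decay to 0 over the final warmdown_steps."""
    decay_start = total_steps - warmdown_steps
    if step < decay_start:
        return base_lr
    return base_lr * (total_steps - step) / warmdown_steps
```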
Other
- Full Hessian-based GPTQ with Cholesky error feedback, collected via forward hooks on CastedLinear layers.
  parameters: null
- Optional depth recurrence that reruns the 11 physical layers multiple times with fresh U-Net skip connections.
  parameters: {"recurrence": 2}
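A sketch of the depth recurrence: the same physical layer stack is run `recurrence` times, and each pass starts a fresh U-Net skip stack (first half of the layers push activations, second half pop and add them). The exact skip pairing is an assumption.

```python
def recurrent_forward(x, layers, recurrence=2):
    """Rerun the layer stack `recurrence` times with fresh skips per pass."""
    n = len(layers)
    for _ in range(recurrence):
        skips = []                      # reset ("fresh") each pass
        for i, layer in enumerate(layers):
            if i < n // 2:
                skips.append(x)         # encoder half: push
            elif skips:
                x = x + skips.pop()     # decoder half: pop and add
            x = layer(x)
    return x
```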
Novel Contributions
- Full Hessian-based GPTQ post-training quantization
- Int7 quantization with LZMA-9 artifact compression
- BigramHash with 3072 vocabulary and 112 dimensions
- XSA applied across all layers
- SmearGate integration
- Partial RoPE and LN scale modifications
- EMA weight averaging with decay 0.997
- Late QAT and sliding window evaluation