PR #1471 (open)

[Record] SP8192 + SDClip + 3-Layer Depth Recurrence + EMA 0.9965 — val_bpb 1.0866

by X-Abhishek-X

val_bpb: 1.0866
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.98 MB

Training Techniques

Quantization
  • GPTQ (bits: 6, scope: MLP + attention)
  • GPTQ (bits: 8, scope: embeddings)
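GPTQ itself applies Hessian-aware error compensation while quantizing; as a hedged illustration of what the 6-bit and 8-bit widths mean here, a plain round-to-nearest symmetric quantizer (a simplification, not the PR's actual pipeline):

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Round-to-nearest symmetric quantization; a simplified stand-in for
    GPTQ, which additionally does Hessian-aware error correction."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 31 for signed 6-bit
    absmax = float(np.abs(w).max())
    scale = absmax / qmax if absmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
q6, s6 = quantize_symmetric(w, 6)              # MLP + attention weights
q8, s8 = quantize_symmetric(w, 8)              # embeddings
err6 = float(np.abs(dequantize(q6, s6) - w).max())
err8 = float(np.abs(dequantize(q8, s8) - w).max())
```

The extra two bits for embeddings give a roughly 4x finer grid, hence the smaller reconstruction error.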
Architecture
  • depth recurrence: 3-layer depth recurrence with repeated virtual layers (parameters: {"layers":[3,4,5],"virtual_layers":14})
  • weight tying: Tied embeddings
  • GQA: 8 attention heads with 4 KV heads (parameters: {"heads":8,"kv_heads":4})
  • XSA: XSA applied to all layers (parameters: {"layers":11})
  • LeakyReLU: LeakyReLU squared activation (parameters: {"slope":0.5})
  • VE128: Shared Value Embedding in later layers (parameters: {"dimension":128,"layers":[9,10]})
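A minimal sketch of the depth-recurrence schedule, assuming (as the parameters suggest) that the three physical layers 3, 4, 5 are cycled with shared weights to realize 14 virtual layer applications; the toy "layers" here just add a constant so the traversal is checkable:

```python
PHYSICAL = [3, 4, 5]       # from parameters: {"layers":[3,4,5], ...}
VIRTUAL_LAYERS = 14        # from parameters: {"virtual_layers":14}

def recurrence_schedule(physical, virtual_layers):
    # Cycle through the shared physical layers until `virtual_layers`
    # applications have been laid out.
    return [physical[i % len(physical)] for i in range(virtual_layers)]

def forward_recurrent(x, layer_fns, schedule):
    # layer_fns maps layer index -> block; weights are reused on repeats.
    for idx in schedule:
        x = layer_fns[idx](x)
    return x

schedule = recurrence_schedule(PHYSICAL, VIRTUAL_LAYERS)
# Toy blocks: layer i adds i, so the output records the full traversal.
layer_fns = {i: (lambda x, i=i: x + i) for i in PHYSICAL}
out = forward_recurrent(0, layer_fns, schedule)
```

Reusing three parameter sets for 14 applications is what lets the depth grow without growing the (size-budgeted) artifact.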
Weight Averaging
  • EMA (parameters: {"decay":0.9965})
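A minimal sketch of EMA weight averaging with the reported decay; the EMA shadow of the parameters, not the raw training weights, is what gets evaluated and exported:

```python
DECAY = 0.9965   # from parameters: {"decay":0.9965}

def ema_update(shadow, params, decay=DECAY):
    # shadow <- decay * shadow + (1 - decay) * params, per tensor
    return {k: decay * shadow[k] + (1.0 - decay) * params[k] for k in shadow}

shadow = {"w": 0.0}
for _ in range(1000):
    shadow = ema_update(shadow, {"w": 1.0})   # pretend weights sit at 1.0
```

With decay 0.9965 the effective averaging horizon is roughly 1/(1-0.9965) ≈ 286 steps.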
Regularization
  • weight decay (parameters: {"weight_decay":0.095})
  • logit softcap (parameters: {"value":30})
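Logit softcapping with value 30 usually denotes the Gemma-2-style bounded tanh (assumed here, since the PR only lists the value); a one-liner sketch:

```python
import math

def softcap(logit, cap=30.0):
    # Smoothly bounds logits to (-cap, cap); near-identity for |logit| << cap.
    return cap * math.tanh(logit / cap)
```

Small logits pass through almost unchanged while extreme logits saturate at ±30, keeping the loss well-conditioned.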
LR Schedule
  • warmdown (parameters: {"fraction":0.72})
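A warmdown fraction of 0.72 is commonly read as: hold the base LR, then decay linearly to zero over the final 72% of steps. A hedged sketch under that assumption:

```python
def lr_at(step, total_steps, base_lr, warmdown_frac=0.72):
    # Constant LR for the first (1 - warmdown_frac) of training, then
    # linear decay to zero over the final warmdown_frac of steps.
    warmdown_start = total_steps - int(round(total_steps * warmdown_frac))
    if step < warmdown_start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - warmdown_start)
```

For a 1000-step run the decay would begin at step 280.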
Evaluation
  • sliding window eval (parameters: {"stride":64})
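Sliding-window evaluation with stride 64 typically scores each token exactly once while giving it as much left context as the window allows. A sketch of the span bookkeeping (the window and token counts below are illustrative; only the stride comes from the PR):

```python
def sliding_eval_spans(n_tokens, window, stride=64):
    """Return (ctx_start, score_start, score_end) triples: every token is
    scored exactly once, with up to `window` tokens of left context."""
    spans = []
    score_start = 0
    while score_start < n_tokens:
        score_end = min(score_start + stride, n_tokens)
        ctx_start = max(0, score_end - window)
        spans.append((ctx_start, score_start, score_end))
        score_start = score_end
    return spans

spans = sliding_eval_spans(n_tokens=300, window=128, stride=64)
```

A small stride costs more forward passes but tightens the bpb estimate, since few tokens are scored with truncated context.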
Sequence Length
  • sequence_length (train_length: 2048, eval_length: null)
Compression
  • Brotli (level: null)
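The final artifact is Brotli-compressed to fit the size budget; zlib stands in below so the sketch needs no third-party package, and the 16 MB figure comes from the contributions list, not from a stated budget constant:

```python
import zlib

BUDGET_BYTES = 16 * 1024 * 1024               # the "under 16MB" budget

payload = bytes(range(256)) * 4096            # 1 MiB of stand-in weight bytes
compressed = zlib.compress(payload, level=9)  # Brotli in the actual record
fits_budget = len(compressed) <= BUDGET_BYTES
```

The reported ~15.98 MB artifact sits just under this budget, which is why outlier-aware clipping before quantization matters.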

Novel Contributions

  • SP8192 tokenizer integration
  • SDClip standard-deviation-based clipping for quantization
  • Fits under the 16 MB budget with zero selective pruning
  • 3-layer depth recurrence with EMA 0.9965
  • Improved rate-distortion via direct artifact-size-aware clipping
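SDClip is described only as standard-deviation-based clipping for quantization. A hedged sketch, under the assumption that weights are clipped to k·σ around their mean before the quantization scale is computed (the k used in the PR is not stated; k=4 here is an assumption):

```python
import numpy as np

def sd_clip(w, k=4.0):
    # Clip outliers to [mu - k*sigma, mu + k*sigma] so a single extreme
    # weight cannot inflate the quantization scale for the whole tensor.
    mu, sigma = float(w.mean()), float(w.std())
    return np.clip(w, mu - k * sigma, mu + k * sigma)

rng = np.random.default_rng(0)
w = rng.normal(size=10_000).astype(np.float32)
w[0] = 100.0                                  # inject one outlier
clipped = sd_clip(w)
```

Bounding the range this way shrinks the quantization step for the bulk of the weights, which is the rate-distortion effect the last bullet refers to.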