PR #1471 (open)

[Record] SP8192 + SDClip + 3-Layer Depth Recurrence + EMA 0.9965 — val_bpb 1.0866

by X-Abhishek-X

val_bpb: 1.0866
Architecture: Transformer
Optimizer: Muon
Artifact Size: ~15.98 MB

Training Techniques

Quantization
  • GPTQ (bits: 6, scope: MLP + attention)
  • GPTQ (bits: 8, scope: embeddings)
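GPTQ itself applies Hessian-aware error compensation while quantizing; as a hedged illustration of what the 6-bit and 8-bit widths mean here, a plain round-to-nearest symmetric quantizer (a simplification, not the PR's actual pipeline):

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Round-to-nearest symmetric quantization; a simplified stand-in for
    GPTQ, which additionally does Hessian-aware error correction."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 31 for signed 6-bit
    absmax = float(np.abs(w).max())
    scale = absmax / qmax if absmax > 0 else 1.0
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
q6, s6 = quantize_symmetric(w, 6)              # MLP + attention weights
q8, s8 = quantize_symmetric(w, 8)              # embeddings
err6 = float(np.abs(dequantize(q6, s6) - w).max())
err8 = float(np.abs(dequantize(q8, s8) - w).max())
```

The extra two bits for embeddings give a roughly 4x finer grid, hence the smaller reconstruction error.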
Architecture
  • depth recurrence: 3-layer depth recurrence with repeated virtual layers (parameters: {"layers":[3,4,5],"virtual_layers":14})
  • weight tying: Tied embeddings
  • GQA: 8 attention heads with 4 KV heads (parameters: {"heads":8,"kv_heads":4})
  • XSA: XSA applied to all layers (parameters: {"layers":11})
  • LeakyReLU: LeakyReLU squared activation (parameters: {"slope":0.5})
  • VE128: Shared Value Embedding in later layers (parameters: {"dimension":128,"layers":[9,10]})
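A minimal sketch of the depth-recurrence schedule, assuming (as the parameters suggest) that the three physical layers 3, 4, 5 are cycled with shared weights to realize 14 virtual layer applications; the toy "layers" here just add a constant so the traversal is checkable:

```python
PHYSICAL = [3, 4, 5]       # from parameters: {"layers":[3,4,5], ...}
VIRTUAL_LAYERS = 14        # from parameters: {"virtual_layers":14}

def recurrence_schedule(physical, virtual_layers):
    # Cycle through the shared physical layers until `virtual_layers`
    # applications have been laid out.
    return [physical[i % len(physical)] for i in range(virtual_layers)]

def forward_recurrent(x, layer_fns, schedule):
    # layer_fns maps layer index -> block; weights are reused on repeats.
    for idx in schedule:
        x = layer_fns[idx](x)
    return x

schedule = recurrence_schedule(PHYSICAL, VIRTUAL_LAYERS)
# Toy blocks: layer i adds i, so the output records the full traversal.
layer_fns = {i: (lambda x, i=i: x + i) for i in PHYSICAL}
out = forward_recurrent(0, layer_fns, schedule)
```

Reusing three parameter sets for 14 applications is what lets the depth grow without growing the (size-budgeted) artifact.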
Weight Averaging
  • EMA (parameters: {"decay":0.9965})
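A minimal sketch of EMA weight averaging with the reported decay; the EMA shadow of the parameters, not the raw training weights, is what gets evaluated and exported:

```python
DECAY = 0.9965   # from parameters: {"decay":0.9965}

def ema_update(shadow, params, decay=DECAY):
    # shadow <- decay * shadow + (1 - decay) * params, per tensor
    return {k: decay * shadow[k] + (1.0 - decay) * params[k] for k in shadow}

shadow = {"w": 0.0}
for _ in range(1000):
    shadow = ema_update(shadow, {"w": 1.0})   # pretend weights sit at 1.0
```

With decay 0.9965 the effective averaging horizon is roughly 1/(1-0.9965) ≈ 286 steps.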
Regularization
  • weight decay (parameters: {"weight_decay":0.095})
  • logit softcap (parameters: {"value":30})
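Logit softcapping with value 30 usually denotes the Gemma-2-style bounded tanh (assumed here, since the PR only lists the value); a one-liner sketch:

```python
import math

def softcap(logit, cap=30.0):
    # Smoothly bounds logits to (-cap, cap); near-identity for |logit| << cap.
    return cap * math.tanh(logit / cap)
```

Small logits pass through almost unchanged while extreme logits saturate at ±30, keeping the loss well-conditioned.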
LR Schedule
  • warmdown (parameters: {"fraction":0.72})
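A warmdown fraction of 0.72 is commonly read as: hold the base LR, then decay linearly to zero over the final 72% of steps. A hedged sketch under that assumption:

```python
def lr_at(step, total_steps, base_lr, warmdown_frac=0.72):
    # Constant LR for the first (1 - warmdown_frac) of training, then
    # linear decay to zero over the final warmdown_frac of steps.
    warmdown_start = total_steps - int(round(total_steps * warmdown_frac))
    if step < warmdown_start:
        return base_lr
    return base_lr * (total_steps - step) / (total_steps - warmdown_start)
```

For a 1000-step run the decay would begin at step 280.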
Evaluation
  • sliding window eval (parameters: {"stride":64})
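Sliding-window evaluation with stride 64 typically scores each token exactly once while giving it as much left context as the window allows. A sketch of the span bookkeeping (the window and token counts below are illustrative; only the stride comes from the PR):

```python
def sliding_eval_spans(n_tokens, window, stride=64):
    """Return (ctx_start, score_start, score_end) triples: every token is
    scored exactly once, with up to `window` tokens of left context."""
    spans = []
    score_start = 0
    while score_start < n_tokens:
        score_end = min(score_start + stride, n_tokens)
        ctx_start = max(0, score_end - window)
        spans.append((ctx_start, score_start, score_end))
        score_start = score_end
    return spans

spans = sliding_eval_spans(n_tokens=300, window=128, stride=64)
```

A small stride costs more forward passes but tightens the bpb estimate, since few tokens are scored with truncated context.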
Sequence Length
  • sequence_length (train_length: 2048, eval_length: null)
Compression
  • Brotli (level: null)
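The final artifact is Brotli-compressed to fit the size budget; zlib stands in below so the sketch needs no third-party package, and the 16 MB figure comes from the contributions list, not from a stated budget constant:

```python
import zlib

BUDGET_BYTES = 16 * 1024 * 1024               # the "under 16MB" budget

payload = bytes(range(256)) * 4096            # 1 MiB of stand-in weight bytes
compressed = zlib.compress(payload, level=9)  # Brotli in the actual record
fits_budget = len(compressed) <= BUDGET_BYTES
```

The reported ~15.98 MB artifact sits just under this budget, which is why outlier-aware clipping before quantization matters.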

Novel Contributions

  • SP8192 tokenizer integration
  • SDClip standard-deviation-based clipping for quantization
  • Fits under the 16 MB budget with zero selective pruning
  • 3-layer depth recurrence with EMA 0.9965
  • Improved rate-distortion via direct artifact-size-aware clipping
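SDClip is described only as standard-deviation-based clipping for quantization. A hedged sketch, under the assumption that weights are clipped to k·σ around their mean before the quantization scale is computed (the k used in the PR is not stated; k=4 here is an assumption):

```python
import numpy as np

def sd_clip(w, k=4.0):
    # Clip outliers to [mu - k*sigma, mu + k*sigma] so a single extreme
    # weight cannot inflate the quantization scale for the whole tensor.
    mu, sigma = float(w.mean()), float(w.std())
    return np.clip(w, mu - k * sigma, mu + k * sigma)

rng = np.random.default_rng(0)
w = rng.normal(size=10_000).astype(np.float32)
w[0] = 100.0                                  # inject one outlier
clipped = sd_clip(w)
```

Bounding the range this way shrinks the quantization step for the bulk of the weights, which is the rate-distortion effect the last bullet refers to.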