val_bpb: 1.1130
Architecture: Transformer
Optimizer: —
Artifact Size: 15,998,200 bytes
Training Techniques
Architecture
U-Net skip connections
Symmetric skip connections between encoder and decoder blocks in an 11-layer U-Net Transformer.
parameters: {"layers":11,"skip_pairs":["0->5","1->6","2->7","3->8","4->9"]}
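The skip wiring above can be sketched as a forward pass that caches each encoder block's output and adds it to the input of its paired decoder block. The blocks here are stand-in identity callables (an assumption for illustration; the real blocks are attention + MLP layers):

```python
import numpy as np

def unet_transformer_forward(x, blocks, skip_pairs):
    """Run the block stack, adding each encoder block's output to the
    input of its paired decoder block (symmetric U-Net skips)."""
    skip_src = {dst: src for src, dst in skip_pairs}
    saved = {}
    for i, block in enumerate(blocks):
        if i in skip_src:
            x = x + saved[skip_src[i]]  # symmetric skip connection
        x = block(x)
        if any(src == i for src, _ in skip_pairs):
            saved[i] = x  # cache encoder output for its decoder pair
    return x

# Toy usage: 11 identity blocks, skip pairs from the parameters above.
pairs = [(0, 5), (1, 6), (2, 7), (3, 8), (4, 9)]
blocks = [lambda t: t for _ in range(11)]
out = unet_transformer_forward(np.ones((2, 4)), blocks, pairs)
```

With identity blocks the skips compound additively, which makes the routing easy to verify by hand.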
LeakyReLU
Uses LeakyReLU(0.5)^2 instead of standard ReLU^2 to avoid dead neurons and improve gradient flow.
parameters: {"slope":0.5}
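Read literally, the activation squares a LeakyReLU with slope 0.5, so negative inputs produce a scaled squared response rather than a dead zero. A minimal sketch:

```python
import numpy as np

def leaky_relu_sq(x, slope=0.5):
    """LeakyReLU(slope)^2: like ReLU^2, but negative inputs keep a
    (slope * x)^2 response instead of a dead zero, so gradients flow."""
    y = np.where(x >= 0, x, slope * x)
    return y * y
```

Note that squaring makes the negative branch non-negative; this matches the stated form, not a sign-preserving variant.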
XSA
Exclusive Self Attention applied in the last 4 layers to subtract attention components aligned with token embeddings.
parameters: {"layers":4}
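The card does not spell out the XSA math, but "subtract attention components aligned with token embeddings" suggests projecting each position's attention output onto its token embedding and removing that component. A hypothetical sketch of that reading:

```python
import numpy as np

def exclusive_attention_output(attn_out, tok_emb, eps=1e-8):
    """Hypothetical XSA sketch: remove from each position's attention
    output the component parallel to that position's token embedding,
    keeping only the 'exclusive' (orthogonal) part."""
    coef = (attn_out * tok_emb).sum(-1, keepdims=True)
    norm = (tok_emb * tok_emb).sum(-1, keepdims=True) + eps
    return attn_out - (coef / norm) * tok_emb
```

The output is orthogonal to the token embedding at every position, so the residual stream stops re-amplifying the token's own direction.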
Partial RoPE
Applies RoPE only to the first 16 dimensions of query/key heads, leaving the remaining dimensions position-free.
parameters: {"rope_dims":16,"total_dims":64}
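Partial RoPE rotates only the first 16 of the 64 head dimensions; the rest pass through untouched. A sketch using the usual half-split rotation and the standard base-10000 frequency schedule (assumed, not stated in the card):

```python
import numpy as np

def partial_rope(q, pos, rope_dims=16, base=10000.0):
    """Apply rotary position embedding to the first rope_dims of each
    head vector; remaining dims stay position-free."""
    half = rope_dims // 2
    freqs = base ** (-np.arange(half) / half)   # (half,)
    ang = pos[:, None] * freqs[None, :]         # (T, half)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = q[..., :half], q[..., half:rope_dims]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, q[..., rope_dims:]], axis=-1)
```

The rotation is norm-preserving on the rotated slice, and dimensions 16–63 are returned unchanged.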
VE128
Injects shared 128-dimensional value embeddings into the final blocks to stabilize logit projections.
parameters: {"dimensions":128,"blocks":[9,10]}
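One plausible reading of VE128 is a shared 128-dim embedding table indexed by token id, projected up to the model width and added to the output of blocks 9 and 10. The projection matrix here is an assumption for illustration:

```python
import numpy as np

def inject_value_embeddings(block_out, token_ids, value_emb, proj):
    """Hypothetical VE128 sketch: add a projection of a shared 128-dim
    per-token value embedding to a late block's output."""
    ve = value_emb[token_ids]      # (T, 128) shared table lookup
    return block_out + ve @ proj   # proj: (128, d_model)
```

Because the table is shared across the final blocks, it adds few parameters while giving the logit projection a stable per-token signal.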
Regularization
layerwise LN scale
Scales each layer's LayerNorm output by 1/sqrt(layer+1) to damp activations in deeper layers.
parameters: {"scale":"1/sqrt(layer+1)"}
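A minimal sketch of the scaled LayerNorm, with the 1/sqrt(layer+1) factor applied after normalization (in practice it could equivalently be folded into the LN gain):

```python
import numpy as np

def layer_norm_scaled(x, layer, eps=1e-5):
    """LayerNorm whose output is scaled by 1/sqrt(layer+1),
    damping activations in deeper layers."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) / np.sqrt(layer + 1)
```

Layer 0 is unscaled; layer 3 is exactly half the layer-0 magnitude.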
magnitude pruning
Zeroes the 3% of weights with the smallest absolute values before compression.
parameters: {"prune_fraction":0.03}
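Magnitude pruning at this fraction can be sketched as a threshold at the 3rd percentile of absolute weight values:

```python
import numpy as np

def magnitude_prune(w, prune_fraction=0.03):
    """Zero the smallest prune_fraction of weights by |value|."""
    k = int(round(prune_fraction * w.size))
    if k == 0:
        return w.copy()
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(w) <= thresh] = 0.0
    return out
```

The resulting zeros compress well, which is presumably why pruning precedes the quantization steps below.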
Weight Averaging
EMA + SWA
Maintains an exponential moving average of the weights (decay 0.997) and, over the second half of training, collects a stochastic-weight-averaging snapshot every 50 steps.
parameters: {"ema_decay":0.997,"swa_interval":50,"swa_start_fraction":0.5}
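The two averages can be tracked side by side; a sketch over a scalar "weight" (real weights would be tensors, updated elementwise the same way):

```python
class WeightAverager:
    """EMA + SWA sketch: EMA updates every step; SWA snapshots every
    swa_interval steps once swa_start_fraction of training has passed."""
    def __init__(self, w0, total_steps, ema_decay=0.997,
                 swa_interval=50, swa_start_fraction=0.5):
        self.ema = w0
        self.decay = ema_decay
        self.swa_sum, self.swa_n = 0.0, 0
        self.interval = swa_interval
        self.start = int(total_steps * swa_start_fraction)

    def update(self, step, w):
        self.ema = self.decay * self.ema + (1 - self.decay) * w
        if step >= self.start and step % self.interval == 0:
            self.swa_sum += w
            self.swa_n += 1

    @property
    def swa(self):
        return self.swa_sum / max(self.swa_n, 1)

wa = WeightAverager(1.0, total_steps=200)
for step in range(200):
    wa.update(step, 1.0)
```

Whether the final artifact uses the EMA, the SWA mean, or a blend of the two is not stated in the card.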
Quantization
STE QAT
bits: 6
scope: mixed; MLP int5, attention int6
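The forward half of STE QAT is a fake-quantize: round weights to an int grid, dequantize, and let the backward pass (not shown) copy gradients straight through the rounding. A symmetric per-tensor sketch:

```python
import numpy as np

def fake_quant(w, bits):
    """Symmetric fake quantization for STE QAT: forward rounds to a
    2^bits-level grid; backward would be the identity (straight-through)."""
    qmax = 2 ** (bits - 1) - 1
    amax = np.abs(w).max()
    scale = amax / qmax if amax > 0 else 1.0
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

# Per the card's mixed scope: bits=5 for MLP weights, bits=6 for attention.
```

Training through the fake-quantized forward pass lets the network adapt to the int5/int6 grid before the weights are frozen.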
GPTQ-lite
bits: 6
scope: per-row
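Per-row quantization gives each weight-matrix row its own scale from that row's absolute maximum. The sketch below shows only this per-row rounding; full GPTQ would also compensate rounding error with Hessian information, which the "lite" variant here is assumed to omit:

```python
import numpy as np

def quantize_per_row(W, bits=6):
    """Per-row symmetric quantization: one scale per row from its
    absolute maximum; returns int codes and the per-row scales."""
    qmax = 2 ** (bits - 1) - 1
    scales = np.abs(W).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0
    q = np.clip(np.round(W / scales), -qmax - 1, qmax)
    return q.astype(np.int8), scales

def dequantize(q, scales):
    return q * scales
```

Per-row scales keep the quantization error of each row bounded by half its own scale, rather than by the worst row in the tensor.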
Test-Time Training
full TTT
Updates all model parameters with SGD over a sliding 32,768-token window at inference time.
parameters: {"window_size":32768,"optimizer":"SGD"}
Novel Contributions
- 11-layer U-Net Transformer with symmetric skip connections
- LeakyReLU(0.5)^2 activation
- Exclusive Self Attention in the final 4 layers
- Partial RoPE applied to only the first 16 dimensions
- Layerwise LN scaling by 1/sqrt(layer+1)
- VE128 value embeddings in the last blocks
- Mixed int5/int6 quantization with late STE QAT
- EMA combined with SWA
- Test-time training over 32K-token windows
- GPTQ-lite per-row quantization
- Magnitude pruning before compression