PR #1324 (open)

Record: val_bpb 0.8275 (3-seed mean) — SLOT-28 + VRL + QK-Gain 4.0 + XSA-11

by yahya010
val_bpb: 0.8275
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.5-15.8 MB

Training Techniques

Architecture
Value Residual
Adds value residual learning with sigmoid-gated interpolation.
parameters: {"init":-1.5}
XSA
Cross-head subtraction attention applied to all 11 layers.
parameters: {"layers":11}
BigramHash
Uses bigram hash embeddings.
parameters: {"dimensions":[1024,128]}
Partial RoPE
Applies partial rotary positional embeddings.
parameters: {"train":16,"eval":64}
LeakyReLU
Uses LeakyReLU squared MLP activation.
parameters: {"squared":true,"negative_slope":0.5}
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
VE128
Uses VE128 on layers 9 and 10.
parameters: {"layers":[9,10]}
U-Net skip connections
Adds U-Net style skip connections.
parameters: null
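
A generic sketch of the U-Net skip pattern: blocks in the first half stash their inputs, and the mirrored blocks in the second half add them back through learnable gains. The gain parameterization is an assumption; the PR lists no parameters for this item:

```python
import torch
import torch.nn as nn

class UNetSkipStack(nn.Module):
    """Wraps a stack of transformer blocks with U-Net style skip connections."""

    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        self.blocks = blocks
        self.n_skip = len(blocks) // 2
        # One learnable gain per skip, zero-initialized so skips are a no-op at init.
        self.skip_gains = nn.Parameter(torch.zeros(self.n_skip))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        saved = []
        for i, block in enumerate(self.blocks):
            if i < self.n_skip:
                saved.append(x)                       # encoder half: remember input
            elif i >= len(self.blocks) - self.n_skip:
                j = len(self.blocks) - 1 - i          # mirror index into saved stack
                x = x + self.skip_gains[j] * saved[j]
            x = block(x)
        return x
```
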
Weight Averaging
EMA
Exponential moving average of weights.
parameters: {"decay":0.997}
SWA
Stochastic weight averaging.
parameters: null
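
A minimal sketch of the EMA update with decay 0.997; how the SWA checkpoints are combined with it is not specified here, so only the EMA side is shown:

```python
import torch

@torch.no_grad()
def update_ema(ema_model: torch.nn.Module, model: torch.nn.Module, decay: float = 0.997) -> None:
    # Exponential moving average of weights: ema <- decay * ema + (1 - decay) * current.
    # Call after every optimizer step; evaluate with ema_model's weights.
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.lerp_(p, 1.0 - decay)
```
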
Quantization
late QAT
bits: null
scope: all
GPTQ
bits: 6
scope: all
Optimizer
AdamW
weight_decay: 0.04
momentum: null
other_params: {"slot_steps":28}
LR Schedule
cosine decay
parameters: {"start":0.012,"end":0.001}
Evaluation
sliding window eval
parameters: {"stride":96}
Regularization
LN scale
parameters: null
Compression
lzma
level: 9
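
Final artifact compression at level 9 corresponds to something like the following, using Python's standard lzma module; the exact container format is an assumption:

```python
import lzma

def compress_artifact(raw: bytes) -> bytes:
    # LZMA at the highest standard preset (level 9); PRESET_EXTREME would be
    # a further option but is not indicated by the listed settings.
    return lzma.compress(raw, preset=9)
```
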

Novel Contributions

  • SLOT-28 eval-time optimization with 28 AdamW steps
  • Value Residual Learning with sigmoid-gated interpolation
  • Per-head learnable query scaling (QK-Gain 4.0)
  • Cross-head subtraction attention across all 11 layers
  • Frozen-model SLOT evaluation with a per-window throwaway delta and logit bias (see the sketch after this list)
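
A hypothetical sketch of the frozen-model SLOT evaluation loop: for each eval window the base weights stay frozen, a throwaway logit-bias delta is fitted with 28 AdamW steps, used to score the window, and then discarded. The objective split, learning rate, and parameterizing the delta purely as a logit bias are assumptions; only the step count and the frozen-model, throwaway-delta structure come from the PR description:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def frozen_logits(model, ids: torch.Tensor) -> torch.Tensor:
    return model(ids)  # (1, T, vocab); base weights stay frozen throughout

def slot_eval_window(model, ids: torch.Tensor, n_ctx: int, vocab: int,
                     steps: int = 28, lr: float = 1e-2) -> float:
    """Fit a per-window logit bias on the first n_ctx tokens, score the rest,
    then discard the bias (hypothetical sketch)."""
    logits = frozen_logits(model, ids)[0]                       # (T, vocab), constant
    targets = ids[0, 1:]
    bias = torch.zeros(vocab, device=logits.device, requires_grad=True)
    opt = torch.optim.AdamW([bias], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Only the bias receives gradients; the model itself is never updated.
        loss = F.cross_entropy(logits[:n_ctx - 1] + bias, targets[:n_ctx - 1])
        loss.backward()
        opt.step()
    with torch.no_grad():
        nll = F.cross_entropy(logits[n_ctx - 1:-1] + bias, targets[n_ctx - 1:],
                              reduction="sum")
    return nll.item()                                           # bias is discarded here
```
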