val_bpb: 0.7406
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.75-15.82 MB
Training Techniques
Evaluation
Sliding-window evaluation with a 96-token stride.
parameters: {"stride":96}
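The stride-96 evaluation can be sketched as the usual sliding-window bookkeeping: each window re-reads some overlapping context, but only positions not yet scored by an earlier window contribute to val_bpb. The window length and the "score only new positions" scheme are assumptions for illustration; the record only specifies the stride.

```python
def sliding_windows(n_tokens, window, stride):
    """Yield (start, end, score_from) spans covering n_tokens.

    Each window covers [start, end); only positions [score_from, end)
    are scored, so every token is scored exactly once while later
    windows still see up to (window - stride) tokens of context.
    """
    spans = []
    scored_to = 0  # first position not yet scored
    start = 0
    while scored_to < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end, scored_to))
        scored_to = end
        start += stride
    return spans

# Example: 300 tokens, hypothetical window of 128, stride 96.
spans = sliding_windows(300, window=128, stride=96)
```

With `window=128` and `stride=96`, every window after the first carries 32 tokens of overlapping context that are masked out of the score.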
Architecture
SLOT
Per-sample test-time optimization of a hidden-state delta and a logit bias while the model weights stay frozen.
parameters: {"hidden_delta_shape":"[bsz, 1, 512]","logit_bias_shape":"[bsz, 1, 1024]"}
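A minimal sketch of the logit-bias half of SLOT: the frozen model's logits for one sample are fixed, and a per-sample bias vector is optimized for a few steps to reduce that sample's loss. Plain gradient descent stands in for the AdamW steps used in the submission, and the tiny shapes are illustrative only.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def slot_logit_bias(logits_per_pos, targets, steps=48, lr=0.1):
    """Fit one throwaway logit-bias vector for a single sample.

    logits_per_pos: frozen-model logit vectors, one per position.
    The gradient of cross-entropy w.r.t. the bias at each position is
    softmax(logits + bias) - onehot(target), averaged over positions.
    """
    vocab = len(logits_per_pos[0])
    bias = [0.0] * vocab
    for _ in range(steps):
        grad = [0.0] * vocab
        for logits, t in zip(logits_per_pos, targets):
            p = softmax([l + b for l, b in zip(logits, bias)])
            for v in range(vocab):
                grad[v] += p[v] - (1.0 if v == t else 0.0)
        bias = [b - lr * g / len(targets) for b, g in zip(bias, grad)]
    return bias
```

The hidden-state delta is optimized the same way, but its gradient flows through the frozen transformer rather than just the softmax.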
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
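The 8-head / 4-KV-head split means each KV head is shared by a group of query heads. A small sketch, assuming the standard consecutive-grouping layout (query heads 0-1 share KV head 0, and so on):

```python
import math

def kv_head_index(q_head, n_heads=8, n_kv_heads=4):
    """Map a query head to its shared KV head (consecutive grouping)."""
    group = n_heads // n_kv_heads  # 2 query heads per KV head here
    return q_head // group

def gqa_single_token(queries, keys, values):
    """Attention for one query token.

    queries: n_heads query vectors; keys/values: per-KV-head lists of
    key/value vectors for the context. Each query head attends using
    the keys and values of its assigned KV head.
    """
    n_heads, n_kv = len(queries), len(keys)
    outs = []
    for h, q in enumerate(queries):
        kvh = kv_head_index(h, n_heads, n_kv)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))
                  for k in keys[kvh]]
        m = max(scores)
        ws = [math.exp(s - m) for s in scores]
        z = sum(ws)
        ws = [w / z for w in ws]
        dim = len(values[kvh][0])
        outs.append([sum(w * v[d] for w, v in zip(ws, values[kvh]))
                     for d in range(dim)])
    return outs
```

Halving the KV heads halves the KV-cache and the K/V projection parameters, which matters for a ~16 MB artifact budget.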
LeakyReLU
Leaky ReLU activation used in the MLP.
parameters: {"slope":0.5}
MLP3x
Three-layer MLP.
parameters: {"layers":3}
VE128
Value-residual / value-embedding enhancement using 128 dimensions.
parameters: {"dimensions":128}
BigramHash
Bigram hash embedding with 1024 buckets.
parameters: {"buckets":1024}
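A sketch of the bigram-hash lookup: the (previous token, current token) pair is hashed into one of 1024 buckets, whose embedding can then augment the ordinary token embedding. The specific mixing constant and BOS handling below are assumptions; only the bucket count comes from the record.

```python
def bigram_bucket(prev_id, cur_id, n_buckets=1024):
    # Simple multiplicative mix of the id pair; the real hash function
    # used in the submission is not specified, so this is illustrative.
    h = (prev_id * 0x9E3779B1 + cur_id) & 0xFFFFFFFF
    return h % n_buckets

def bigram_buckets(token_ids, n_buckets=1024, bos_id=0):
    """Bucket index for every position of a sequence (assumed BOS pad)."""
    prev = [bos_id] + token_ids[:-1]
    return [bigram_bucket(p, c, n_buckets) for p, c in zip(prev, token_ids)]
```

Hashing keeps the table at 1024 rows regardless of vocabulary size, trading occasional bucket collisions for a fixed parameter cost.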
XSA
XSA applied across all layers.
parameters: {"layers":11}
Partial RoPE
Partial rotary positional embedding, applied to 16 of the 64 head dimensions.
parameters: {"partial":"16/64"}
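Partial RoPE rotates only a prefix of each head vector and passes the rest through unchanged. A sketch for the 16-of-64 setting, assuming the standard paired-dimension rotation and base 10000:

```python
import math

def partial_rope(x, pos, rot_dims=16, base=10000.0):
    """Rotate the first rot_dims dims of head vector x at position pos.

    Dimensions are rotated in (even, odd) pairs at frequencies that
    decay geometrically; dims beyond rot_dims are left untouched.
    """
    out = list(x)
    half = rot_dims // 2
    for i in range(half):
        theta = pos / (base ** (2 * i / rot_dims))
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```

Leaving 48 of 64 dimensions unrotated gives the head position-free channels to work with while still encoding relative position in the rotated prefix.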
SmearGate
SmearGate gating mechanism.
parameters: null
U-Net skip connections
U-Net style skip connections.
parameters: null
Optimizer
AdamW
weight_decay: 1e-8
momentum: null
other_params: {"steps":48}
Compression
lzma
level: null
Quantization
late QAT
bits: 6
scope: all
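Late QAT typically means switching to fake-quantized forward passes near the end of training: weights are rounded to the 6-bit grid in the forward pass while the float master weights keep receiving updates. A sketch of the quantize-dequantize step, assuming symmetric per-tensor scaling (the record specifies only the bit width and scope):

```python
def fake_quant(weights, bits=6):
    """Round weights to a signed bits-bit grid and map back to float.

    Returns (dequantized_weights, scale). With 6 bits the integer grid
    is [-32, 31]; symmetric per-tensor scaling is an assumption.
    """
    levels = 2 ** (bits - 1) - 1  # 31 positive levels for 6 bits
    scale = max(abs(w) for w in weights) / levels or 1.0
    q = [max(-levels - 1, min(levels, round(w / scale))) for w in weights]
    return [v * scale for v in q], scale
```

Because the grid is what gets serialized, the 6-bit rounding is also what keeps the compressed artifact inside the size budget.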
Weight Averaging
EMA
parameters: {"decay":0.997}
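The EMA update with decay 0.997 is a one-liner: after each optimizer step, a shadow copy of the weights moves a fraction (1 - decay) toward the live weights, and the shadow copy is what gets evaluated or exported.

```python
def ema_update(shadow, live, decay=0.997):
    """One EMA step: shadow <- decay * shadow + (1 - decay) * live."""
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, live)]

# Usage: call once per training step.
shadow = [0.0]
for _ in range(1000):
    shadow = ema_update(shadow, live=[1.0], decay=0.997)
```

With decay 0.997 the effective averaging window is on the order of 1 / (1 - 0.997) ≈ 333 steps.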
Regularization
LN scale
parameters: null
weight decay
parameters: {"value":0.04}
Sequence Length
sequence_length
train_length: null
eval_length: null
LR Schedule
cosine decay
parameters: {"start_lr":0.012,"end_lr":0.001}
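The cosine schedule above can be sketched directly from its two endpoints; the total step count is not given in the record, so it is a free parameter here.

```python
import math

def cosine_lr(step, total_steps, start_lr=0.012, end_lr=0.001):
    """Cosine decay from start_lr at step 0 to end_lr at total_steps."""
    t = min(step, total_steps) / total_steps
    return end_lr + 0.5 * (start_lr - end_lr) * (1.0 + math.cos(math.pi * t))
```

The schedule starts at 0.012, passes through the midpoint of the two rates halfway through, and flattens out at 0.001.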
Novel Contributions
- SLOT-48 test-time optimization with 48 AdamW steps
- Improved val_bpb to 0.7406 using the same model and training as PR #1313
- Scaling SLOT from 24 to 48 steps produced a large BPB gain
- Frozen-model evaluation with per-window throwaway hidden delta and logit bias
- Scored-position masking during SLOT evaluation