PR #1313

open

Record: SLOT-24 Aggressive — val_bpb 0.8637 (3-seed mean)

by anthony-maio
val_bpb: 0.8637
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.7-15.8 MB

Training Techniques

Quantization
late QAT
bits: 6
scope: all
Architecture
GQA
Grouped query attention with 8 attention heads and 4 KV heads.
parameters: {"heads":8,"kv_heads":4}
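For reference, grouped-query attention with these head counts (8 query heads sharing 4 KV heads) can be sketched as below. This is a minimal NumPy illustration; the projection shapes and causal mask are assumptions, not the record's actual code.

```python
import numpy as np

def gqa(x, wq, wk, wv, n_heads=8, n_kv_heads=4):
    """Grouped-query attention: 8 query heads share 4 KV heads (2 per group).

    x: (T, d_model); wq: (d_model, n_heads*hd); wk, wv: (d_model, n_kv_heads*hd).
    """
    T = x.shape[0]
    hd = wq.shape[1] // n_heads
    q = (x @ wq).reshape(T, n_heads, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd)
    v = (x @ wv).reshape(T, n_kv_heads, hd)
    group = n_heads // n_kv_heads           # 2 query heads per KV head
    k = np.repeat(k, group, axis=1)         # broadcast 4 KV heads to 8
    v = np.repeat(v, group, axis=1)
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(hd)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # causal mask
    scores = np.where(mask[None], -1e9, scores)
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)
    out = np.einsum('hqk,khd->qhd', attn, v)
    return out.reshape(T, n_heads * hd)
```

The KV cache for such a layer is half the size of full multi-head attention with 8 KV heads, which matters for the artifact-size budget above.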
LeakyReLU
LeakyReLU activation used in the MLP.
parameters: {"slope":0.5}
MLP3x
Three-layer MLP.
parameters: {"layers":3}
VE128
Value residual with 128-dimensional value embeddings (VE128).
parameters: {"dimensions":128}
BigramHash
Bigram hash embedding with 1024 buckets.
parameters: {"buckets":1024}
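A hashed bigram embedding of this kind maps each (previous, current) token pair into one of 1024 buckets of a small embedding table. The sketch below is an assumption about the mechanism; the mixing constants are illustrative and not the record's actual hash.

```python
N_BUCKETS = 1024  # from the record's parameters

def bigram_bucket(prev_id: int, cur_id: int, n_buckets: int = N_BUCKETS) -> int:
    """Hash the (previous, current) token pair into one of 1024 buckets."""
    h = (prev_id * 1000003 + cur_id) & 0xFFFFFFFF  # illustrative mixing
    h ^= h >> 16
    return h % n_buckets

def bigram_embed(tokens, table):
    """Look up a hashed-bigram embedding per position (position 0 pairs with id 0)."""
    prev = [0] + list(tokens[:-1])
    return [table[bigram_bucket(p, t)] for p, t in zip(prev, tokens)]
```

With 1024 buckets the table stays tiny regardless of vocabulary size, at the cost of hash collisions between rare bigrams.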
XSA
XSA applied across all 11 layers.
parameters: {"layers":11}
SmearGate
SmearGate component used in the architecture.
parameters: null
U-Net skip connections
U-Net style skip connections.
parameters: null
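One common wiring for U-Net skips in a block stack is to save first-half activations and add each back to the mirrored second-half block. The pairing scheme below is an illustrative assumption, not the record's exact wiring.

```python
def unet_stack(x, blocks):
    """Run a block stack with U-Net style skips: outputs of the first half
    are pushed on a stack and popped into the mirrored second-half inputs.
    `blocks` is a list of callables (the transformer blocks)."""
    half = len(blocks) // 2
    saved = []
    for i, block in enumerate(blocks):
        if i < half:
            x = block(x)
            saved.append(x)            # push encoder-side activation
        else:
            if saved:
                x = x + saved.pop()    # pop mirrored skip connection
            x = block(x)
    return x
```

With an odd depth like the 11 layers listed above, the middle block simply runs without a skip.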
Partial RoPE
Partial rotary positional embeddings.
parameters: {"train_eval_ratio":"16/64"}
LN Scale
Layer norm scale modification.
parameters: null
Weight Averaging
EMA
parameters: {"decay":0.997}
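The EMA update itself is one line per parameter; the decay value is from the record, the dict-of-arrays layout is an illustrative assumption.

```python
def ema_update(ema_params, params, decay=0.997):
    """One EMA step: ema <- decay * ema + (1 - decay) * current weights.
    Evaluation then uses ema_params instead of the raw training weights."""
    for k in params:
        ema_params[k] = decay * ema_params[k] + (1.0 - decay) * params[k]
    return ema_params
```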
Compression
lzma
level: null
Evaluation
sliding window eval
parameters: {"stride":96}
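Sliding-window evaluation with stride 96 scores each token exactly once while giving it as much left context as the window allows. The span computation might look like the sketch below; the context length of 512 is an assumed placeholder, only the stride comes from the record.

```python
def eval_spans(n_tokens, context=512, stride=96):
    """Yield (win_start, win_end, score_from): each window scores only the
    positions in [score_from, win_end), at most `stride` of them, so every
    token is scored exactly once with up to `context` tokens of history."""
    spans = []
    scored_to = 0
    while scored_to < n_tokens:
        win_end = min(scored_to + stride, n_tokens)
        win_start = max(0, win_end - context)
        spans.append((win_start, win_end, scored_to))
        scored_to = win_end
    return spans
```

A smaller stride gives later-in-window tokens more context per scored position but costs proportionally more forward passes.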
Test-Time Training
score-first TTT
parameters: {"steps":24,"learning_rate":0.012}
Regularization
weight decay
parameters: {"weight_decay":1e-8}
LR Schedule
cosine decay
parameters: {"start_lr":0.012,"end_lr":0.001}
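With these endpoints the schedule is fully determined by the standard cosine-decay formula:

```python
import math

def cosine_lr(step, total_steps, start_lr=0.012, end_lr=0.001):
    """Cosine decay from start_lr to end_lr over total_steps
    (endpoint values taken from the record's LR-schedule parameters)."""
    t = min(step, total_steps) / total_steps
    return end_lr + 0.5 * (start_lr - end_lr) * (1.0 + math.cos(math.pi * t))
```

Note the start LR matches the TTT learning rate listed above (0.012), though the record does not say whether that is deliberate.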

Novel Contributions

  • SLOT-24 eval-time hyperparameter tuning
  • 24-step score-first SLOT adaptation
  • Per-sample hidden delta plus logit bias optimization
  • Scored-position masking with stride 96
  • 3-seed mean validation improvement to 0.8637 bpb