PR #1368

open

non-record submission: mean-delta warm start + depth recurrence for SLOT (0.8503 BPB)

by JKSNSView on GitHub

val_bpb

0.8503

Architecture

Transformer

Optimizer

AdamW

Artifact Size

13.3 MB

Training Techniques

Architecture

depth recurrence

Layers 4 and 5 are executed twice per forward pass, so the 11 physical layers yield 13 virtual layers (11 plus 2 repeated passes). The repeated passes receive learned per-iteration conditioning.

parameters: {"layers":[4,5],"virtual_layers":13,"physical_layers":11}
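The layer count above can be checked with a short sketch. It assumes 0-indexed layers and that the repeated block runs back to back as 4, 5, 4, 5; the submission only states that layers 4 and 5 are executed twice, so the exact ordering and the helper name virtual_schedule are illustrative.

```python
# Sketch: expand 11 physical layers into a 13-step virtual schedule.
# Assumption: layers are 0-indexed and the repeated block runs as 4, 5, 4, 5.
PHYSICAL_LAYERS = 11
REPEATED = (4, 5)


def virtual_schedule(n_layers=PHYSICAL_LAYERS, repeated=REPEATED, passes=2):
    """Return (layer_index, iteration) pairs in execution order."""
    schedule = []
    first, last = repeated[0], repeated[-1]
    for layer in range(n_layers):
        if layer == first:
            # Run the whole repeated block `passes` times.
            for iteration in range(passes):
                for rep in repeated:
                    schedule.append((rep, iteration))
        elif first < layer <= last:
            # Already emitted as part of the repeated block.
            continue
        else:
            schedule.append((layer, 0))
    return schedule


schedule = virtual_schedule()
print(len(schedule), schedule)
```

The iteration index in each pair is what a per-iteration conditioning signal would be keyed on.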

iter_embed

Learned per-iteration conditioning signal used for repeated layer passes.

parameters: null

iter_gate

Learned gate controlling repeated layer passes, initialized to -2.0.

parameters: {"init":-2}
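A rough illustration of what an initial value of -2.0 implies, assuming the gate logit is passed through a sigmoid and used to scale a residual update. Both the sigmoid and the mixing rule in gated_update are assumptions; the submission does not spell out the gate's functional form.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


GATE_INIT = -2.0  # from the submission's parameters
gate_at_init = sigmoid(GATE_INIT)


def gated_update(hidden, block_out, gate_logit=GATE_INIT):
    """Hypothetical mixing rule: add the block output scaled by the gate."""
    g = sigmoid(gate_logit)
    return [h + g * o for h, o in zip(hidden, block_out)]


print(round(gate_at_init, 4))
```

Under this reading the repeated pass starts out contributing only about 12% of its output, so the model begins close to the non-recurrent network.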

GQA

Grouped query attention with 8 query heads and 4 KV heads.

parameters: {"query_heads":8,"kv_heads":4}
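With 8 query heads and 4 KV heads, each KV head serves 2 query heads. The sketch below shows the sharing pattern, assuming the common contiguous grouping (query heads 0 and 1 share KV head 0, and so on); the grouping order is not stated in the submission.

```python
QUERY_HEADS = 8
KV_HEADS = 4
GROUP_SIZE = QUERY_HEADS // KV_HEADS  # query heads per KV head


def kv_head_for(query_head):
    """Map a query head to the KV head it reads from (contiguous groups assumed)."""
    return query_head // GROUP_SIZE


mapping = {q: kv_head_for(q) for q in range(QUERY_HEADS)}
print(mapping)
```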

XSA

XSA used in all layers.

parameters: null

BigramHash

Hashed bigram embeddings: token pairs are hashed into a 1024-entry table of 128-dimensional vectors.

parameters: {"vocab":1024,"dim":128}
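A minimal sketch of a hashed bigram lookup with the table shape given above. The hash function (a multiply-and-add modulo the table size) and the name bigram_bucket are hypothetical; the submission does not say which hash is used.

```python
BUCKETS = 1024  # "vocab" in the parameters
DIM = 128       # embedding width

# Zero-initialised table standing in for a learned embedding matrix.
table = [[0.0] * DIM for _ in range(BUCKETS)]


def bigram_bucket(prev_token, cur_token, buckets=BUCKETS):
    """Hypothetical hash: combine the two token ids, then fold into the table."""
    return (prev_token * 1000003 + cur_token) % buckets


bucket = bigram_bucket(17, 42)
embedding = table[bucket]
print(bucket, len(embedding))
```

Distinct bigrams can collide in the same bucket; that is the trade-off for keeping the table at 1024 rows.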

Partial RoPE

RoPE applied to 16 of the 64 dimensions; the remaining 48 are left unrotated.

parameters: {"dimensions":16,"total_dimensions":64}
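The following sketch rotates only the first 16 of 64 dimensions and passes the rest through. Which 16 dimensions are rotated, the half-split pairing, and the base of 10000 are assumptions chosen for illustration.

```python
import math

HEAD_DIM = 64
ROPE_DIMS = 16


def partial_rope(vec, position, rope_dims=ROPE_DIMS, base=10000.0):
    """Rotate the first `rope_dims` entries of `vec`; leave the rest unchanged."""
    out = list(vec)
    half = rope_dims // 2
    for i in range(half):
        theta = position * base ** (-2.0 * i / rope_dims)
        a, b = vec[i], vec[i + half]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out


vec = [1.0] * HEAD_DIM
rotated = partial_rope(vec, position=3)
print(rotated[:4], rotated[16:20])
```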

U-Net skip connections

U-Net style skip connections with learned gates.

parameters: null

MLP3x

MLP with 3x expansion, giving 1536 hidden units.

parameters: {"hidden":1536}

LeakyReLU

LeakyReLU activation (negative slope 0.5) with the output squared.

parameters: {"slope":0.5,"squared":true}
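As a scalar sketch, assuming "squared" means a plain elementwise square of the LeakyReLU output (rather than a sign-preserving variant):

```python
def leaky_relu_squared(x, slope=0.5):
    """LeakyReLU with the given negative slope, followed by squaring."""
    y = x if x >= 0 else slope * x
    return y * y


print(leaky_relu_squared(2.0), leaky_relu_squared(-2.0))
```

Under this reading the activation is non-negative everywhere, and a negative input contributes a quarter of what the same-magnitude positive input does.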

Regularization

label smoothing

parameters: {"value":0.1}
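For reference, a minimal sketch of cross-entropy with label smoothing 0.1, using the common convention where the target distribution is (1 - 0.1) on the true class plus 0.1 spread uniformly over all classes. Whether the submission uses this exact convention is an assumption.

```python
import math


def smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Cross-entropy against a label-smoothed target distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    log_probs = [v - log_z for v in logits]
    nll = -log_probs[target]                      # loss on the true class
    uniform = -sum(log_probs) / len(log_probs)    # loss against a uniform target
    return (1.0 - smoothing) * nll + smoothing * uniform


confident = [5.0, 0.0, 0.0]
print(smoothed_cross_entropy(confident, 0, smoothing=0.0))
print(smoothed_cross_entropy(confident, 0, smoothing=0.1))
```

Smoothing raises the loss on confident correct predictions, so the reported training loss is not directly comparable to unsmoothed evaluation loss.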

Quantization

GPTQ

bits: 6

scope: all

Compression

lzma

level: null
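Since no compression level is recorded, a sketch of the round trip with Python's standard lzma module at its default preset may be useful. The payload here is placeholder bytes, not the actual quantized weights.

```python
import lzma

# Placeholder payload standing in for serialized quantized weights.
payload = bytes(range(64)) * 1024

compressed = lzma.compress(payload)      # default preset, since level is null
restored = lzma.decompress(compressed)

print(len(payload), len(compressed))
```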

Evaluation

sliding window eval

parameters: {"stride":96}
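A sketch of how a stride of 96 could partition scoring across overlapping windows. The context length of 256 in the usage line is hypothetical (the submission does not list it), and the rule that each window scores only its newest tokens is the usual sliding-window convention, assumed here.

```python
STRIDE = 96


def sliding_windows(n_tokens, context, stride=STRIDE):
    """Return (start, end, score_from) triples.

    Each window feeds tokens [start, end) to the model and scores only
    tokens [score_from, end), so every token is scored exactly once.
    """
    windows = []
    scored_upto = 0
    while scored_upto < n_tokens:
        if scored_upto == 0:
            end = min(context, n_tokens)          # first window scores everything
        else:
            end = min(scored_upto + stride, n_tokens)
        start = max(0, end - context)
        windows.append((start, end, scored_upto))
        scored_upto = end
    return windows


windows = sliding_windows(500, context=256)
for w in windows:
    print(w)
```

A smaller stride gives each scored token more left context at the cost of more forward passes.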

Other

mean-delta SLOT warm start

Mean-delta SLOT warm start that carries the decayed mean of previous batch deltas forward to initialize the next batch's SLOT optimization.

parameters: {"alpha":0.9,"steps":32}
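The warm start can be sketched on a toy problem. This assumes alpha is the decay applied to the mean of the previous batch's per-sample deltas and steps is the number of SLOT optimization steps; the quadratic objective, the learning rate, and the names optimize_delta and warm_start are stand-ins for illustration, not the submission's code.

```python
ALPHA = 0.9
STEPS = 32


def optimize_delta(init, target, steps=STEPS, lr=0.1):
    """Toy stand-in for SLOT: gradient descent on 0.5 * (delta - target)^2."""
    delta = list(init)
    for _ in range(steps):
        delta = [d - lr * (d - t) for d, t in zip(delta, target)]
    return delta


def warm_start(batch_deltas, alpha=ALPHA):
    """Decayed mean of the previous batch's per-sample deltas."""
    n = len(batch_deltas)
    dim = len(batch_deltas[0])
    return [alpha * sum(d[j] for d in batch_deltas) / n for j in range(dim)]


prev_deltas = [[1.0, 2.0], [3.0, 4.0]]   # deltas found on the previous batch
target = [2.0, 3.0]                      # toy optimum for the next batch

warm = optimize_delta(warm_start(prev_deltas), target)
cold = optimize_delta([0.0, 0.0], target)
print(warm, cold)
```

When consecutive batches have similar optima, the warm-started run ends closer to the optimum within the same step budget than a zero-initialized one.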

Novel Contributions

Mean-delta warm start for SLOT using the decayed mean of previous batch deltas
Depth recurrence by repeating layers 4 and 5 to create 13 virtual layers from 11 physical layers
Learned per-iteration conditioning with iter_embed and iter_gate
Identification of a label smoothing configuration error that degraded short-horizon training