PR #1368
open
Non-record submission: mean-delta warm start + depth recurrence for SLOT (0.8503 BPB)
by JKSNSView on GitHub
val_bpb
0.8503
Architecture
Transformer
Optimizer
AdamW
Artifact Size
13.3 MB
Training Techniques
Architecture
depth recurrence
Layers 4 and 5 each run twice per forward pass, giving 13 virtual layers from 11 physical layers. Learned per-iteration conditioning distinguishes the repeated passes.
parameters: {"layers":[4,5],"virtual_layers":13,"physical_layers":11}
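The 11-to-13 layer count can be read as an execution schedule. The sketch below assumes the repeated pair runs back to back as a block (4, 5, 4, 5); the submission may order the repeats differently. build_schedule is a hypothetical helper, not a name from the PR.

```python
def build_schedule(physical_layers, repeated, repeats=2):
    """Return (layer_index, iteration) pairs in execution order.

    Assumption: the repeated layers are contiguous and run as a block
    `repeats` times before the remaining layers continue.
    """
    schedule = []
    first, last = min(repeated), max(repeated)
    for layer in range(physical_layers):
        if layer == first:
            for it in range(repeats):
                for rep_layer in range(first, last + 1):
                    schedule.append((rep_layer, it))
        elif first < layer <= last:
            continue  # already emitted as part of the repeated block
        else:
            schedule.append((layer, 0))
    return schedule


schedule = build_schedule(physical_layers=11, repeated=[4, 5])
print(len(schedule))                     # 13
print([layer for layer, _ in schedule])  # [0, 1, 2, 3, 4, 5, 4, 5, 6, 7, 8, 9, 10]
```

The iteration index in each pair is what a per-iteration conditioning signal would be keyed on.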
iter_embed
Learned per-iteration conditioning signal used for repeated layer passes.
parameters: null
iter_gate
Learned gate controlling repeated layer passes, initialized to -2.0.
parameters: {"init":-2}
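To give a feel for the -2.0 initialization, here is a minimal sketch that assumes the gate value passes through a sigmoid and interpolates between skipping and applying the repeated pass. Neither the sigmoid nor the blending form is stated in the PR summary, and gated_repeat is a hypothetical name.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_repeat(x, layer_out, gate_logit):
    """Blend the input of a repeated pass with the layer output.

    Assumption: the gate is sigmoid(gate_logit) and interpolates
    between skipping the pass (gate = 0) and applying it (gate = 1).
    """
    g = sigmoid(gate_logit)
    return [(1.0 - g) * xi + g * yi for xi, yi in zip(x, layer_out)]


gate_at_init = sigmoid(-2.0)
print(round(gate_at_init, 4))  # 0.1192
```

Under that assumption, the repeated pass starts out contributing roughly 12% of its output, so training begins close to the non-recurrent network.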
GQA
Grouped query attention with 8 query heads and 4 KV heads.
parameters: {"query_heads":8,"kv_heads":4}
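With 8 query heads and 4 KV heads, each KV head is shared by two query heads. A short sketch of the mapping, assuming the common contiguous grouping (the PR summary does not say how heads are grouped):

```python
def kv_head_for_query(query_head, query_heads=8, kv_heads=4):
    """Map a query head to the KV head it shares (contiguous grouping)."""
    if query_heads % kv_heads != 0:
        raise ValueError("query_heads must be a multiple of kv_heads")
    group_size = query_heads // kv_heads
    return query_head // group_size


mapping = [kv_head_for_query(h) for h in range(8)]
print(mapping)  # [0, 0, 1, 1, 2, 2, 3, 3]
```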
XSA
XSA used in all layers.
parameters: null
BigramHash
Bigram hash embeddings.
parameters: {"vocab":1024,"dim":128}
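A bigram hash embedding looks up a row in a small table (here 1024 rows of width 128) using a hash of the previous and current token ids. The sketch below shows only the bucket computation; the multiplicative hash is an illustrative assumption, since the PR summary does not name the hash function.

```python
def bigram_bucket(prev_token, token, num_buckets=1024):
    """Hash a (previous, current) token pair into one of num_buckets rows.

    Assumption: a simple multiplicative hash; the PR does not state
    which hash function is used.
    """
    return (prev_token * 1_000_003 + token) % num_buckets


tokens = [17, 942, 5, 17, 942]
buckets = [bigram_bucket(p, t) for p, t in zip(tokens, tokens[1:])]
print(buckets)  # the first and last entries match: both are the pair (17, 942)
```

Distinct bigrams can collide in the same bucket; that is the trade-off for keeping the table at 1024 rows.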
Partial RoPE
RoPE applied to a subset of dimensions.
parameters: {"dimensions":16,"total_dimensions":64}
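As an illustration of rotating only 16 of 64 dimensions, the following sketch applies a rotary embedding to the leading entries of one head vector and leaves the rest untouched. The adjacent-pair layout, the choice of the first 16 dimensions, and the base of 10000 are assumptions, not details from the PR.

```python
import math


def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first rotary_dims entries of vec; leave the rest unchanged.

    Assumptions: adjacent-pair rotation layout and the common base of
    10000; the PR specifies only 16 rotated dimensions out of 64.
    """
    out = list(vec)
    for i in range(0, rotary_dims, 2):
        theta = position / (base ** (i / rotary_dims))
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out[i] = x0 * c - x1 * s
        out[i + 1] = x0 * s + x1 * c
    return out


head = [1.0] * 64
rotated = partial_rope(head, position=3)
print(rotated[16:] == head[16:])  # True
```

The remaining 48 dimensions carry no positional signal, so they can attend on content alone.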
U-Net skip connections
U-Net style skip connections with learned gates.
parameters: null
MLP3x
MLP with 3x hidden expansion (1536 hidden units).
parameters: {"hidden":1536}
LeakyReLU
LeakyReLU activation (negative slope 0.5) with the output squared.
parameters: {"slope":0.5,"squared":true}
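A scalar sketch of the activation, assuming the square is applied to the LeakyReLU output (the order the parameters suggest):

```python
def leaky_relu_squared(x, slope=0.5):
    """LeakyReLU with the given negative slope, then squared."""
    y = x if x >= 0 else slope * x
    return y * y


print(leaky_relu_squared(2.0))   # 4.0
print(leaky_relu_squared(-2.0))  # 1.0
```

Squaring makes the output non-negative, so negative inputs still contribute but at a quarter of the magnitude (0.5 squared) of a positive input of the same size.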
Regularization
label smoothing
parameters: {"value":0.1}
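For reference, a minimal sketch of cross-entropy with label smoothing 0.1, assuming the smoothing mass is spread uniformly over all classes. This is the general technique, not the submission's training code.

```python
import math


def smoothed_cross_entropy(logits, target, smoothing=0.1):
    """Cross-entropy against a label-smoothed target distribution.

    Assumption: the smoothing mass is spread uniformly over all classes
    (the convention used by torch.nn.CrossEntropyLoss).
    """
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    log_probs = [l - log_z for l in logits]
    n = len(logits)
    loss = 0.0
    for i, lp in enumerate(log_probs):
        q = smoothing / n + (1.0 - smoothing if i == target else 0.0)
        loss -= q * lp
    return loss


logits = [4.0, 0.0, 0.0, 0.0]
plain = smoothed_cross_entropy(logits, target=0, smoothing=0.0)
smoothed = smoothed_cross_entropy(logits, target=0, smoothing=0.1)
print(plain, smoothed)
```

With smoothing enabled, a confident correct prediction is penalized, so the training loss is no longer the plain log-likelihood that BPB measures.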
Quantization
GPTQ
bits: 6
scope: all
Compression
lzma
level: null
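Since no compression level is recorded, a sketch of the compression step would use the library default. The payload below is a stand-in, not the real quantized weights.

```python
import lzma

payload = bytes(range(256)) * 64     # stand-in for serialized quantized weights
compressed = lzma.compress(payload)  # default preset; the PR gives no level
restored = lzma.decompress(compressed)
print(len(payload), len(compressed), restored == payload)
```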
Evaluation
sliding window eval
parameters: {"stride":96}
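Sliding window evaluation with stride 96 can be sketched as a generator of window boundaries. The context length is a free parameter here because the PR summary gives only the stride; sliding_windows is a hypothetical helper.

```python
def sliding_windows(num_tokens, context_len, stride=96):
    """Yield (window_start, window_end, score_start) index triples.

    Each window spans tokens [window_start, window_end); only tokens in
    [score_start, window_end) are scored, so every token is scored once.
    Assumption: context_len is a free parameter, since the PR summary
    gives only the stride.
    """
    score_start = 0
    while score_start < num_tokens:
        new_tokens = context_len if score_start == 0 else stride
        window_end = min(score_start + new_tokens, num_tokens)
        window_start = max(0, window_end - context_len)
        yield window_start, window_end, score_start
        score_start = window_end


windows = list(sliding_windows(num_tokens=1000, context_len=512))
print(windows[:3])  # [(0, 512, 0), (96, 608, 512), (192, 704, 608)]
```

After the first window, each scored token sees at least context_len - stride tokens of preceding context, at the cost of one forward pass per 96 scored tokens.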
Other
other
Mean-delta warm start for SLOT: the decayed mean of previous batch deltas is carried forward to initialize the next batch's SLOT optimization.
parameters: {"alpha":0.9,"steps":32}
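The warm start can be sketched on a toy problem. The sketch assumes "decayed mean" means the mean of the previous batch's per-sequence deltas scaled by alpha, and it replaces the real SLOT objective with a simple quadratic. The function names and the learning rate are hypothetical.

```python
def mean_delta_warm_start(prev_deltas, alpha=0.9):
    """Initial delta for the next batch: alpha times the mean of the
    previous batch's per-sequence deltas.

    Assumption: "decayed mean" means scaling the batch mean by alpha.
    """
    dim = len(prev_deltas[0])
    mean = [sum(d[j] for d in prev_deltas) / len(prev_deltas) for j in range(dim)]
    return [alpha * m for m in mean]


def optimize_deltas(targets, init, steps=32, lr=0.1):
    """Toy stand-in for the SLOT inner loop: gradient descent on
    0.5 * ||delta - target||^2 for each sequence, starting from init."""
    deltas = [list(init) for _ in targets]
    for _ in range(steps):
        for delta, target in zip(deltas, targets):
            for j in range(len(delta)):
                delta[j] -= lr * (delta[j] - target[j])
    return deltas


batches = [[[1.0, 2.0], [1.2, 1.8]], [[1.1, 2.1], [0.9, 1.9]]]  # toy targets
init = [0.0, 0.0]  # cold start for the first batch
for targets in batches:
    deltas = optimize_deltas(targets, init)
    init = mean_delta_warm_start(deltas)  # carried into the next batch
print(init)
```

When consecutive batches have similar optimal deltas, the warm start begins each 32-step optimization closer to its solution than a zero initialization would.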
Novel Contributions
- Mean-delta warm start for SLOT using the decayed mean of previous batch deltas
- Depth recurrence by repeating layers 4 and 5 to create 13 virtual layers from 11 physical layers
- Learned per-iteration conditioning with iter_embed and iter_gate
- Identification of a label smoothing configuration error that degraded short-horizon training