PR #784 (open)

Non-record: Depth Recurrence + XSA + LeakyReLU² (val_bpb 1.2065)

by iverbovoy
val_bpb: 1.2065
Architecture: Transformer
Optimizer: Muon
Artifact Size: 15.87 MB

Training Techniques

Architecture
depth recurrence
Replaces unique per-layer blocks with a small set of shared blocks repeated across depth; effective depth is blocks × repeats.
parameters: {"blocks":3,"repeats":4,"effective_layers":12,"dim":832}
Cross-Repeat Skip
Adds a learned weighted residual from the previous repeat to make depth recurrence stateful.
parameters: {"repeat_scales":"learned per-repeat"}
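A minimal sketch of depth recurrence with Cross-Repeat Skip, assuming the skip is a per-repeat learned residual around each full pass through the shared blocks; the exact wiring is not pinned down by this summary, and all names and toy blocks below are illustrative:

```python
import numpy as np

def recurrent_forward(x, blocks, repeats, repeat_scales):
    """Depth recurrence: the same `blocks` are applied `repeats` times,
    giving len(blocks) * repeats effective layers. Cross-Repeat Skip adds
    a learned per-repeat weighted residual from the previous repeat's
    output, making the weight-shared recurrence stateful."""
    for r in range(repeats):
        y = x
        for block in blocks:          # weights shared across all repeats
            y = block(y)
        x = y + repeat_scales[r] * x  # cross-repeat skip (assumed wiring)
    return x

# Toy usage matching the PR's shape: 3 shared blocks x 4 repeats = 12 layers.
rng = np.random.default_rng(0)
dim = 8
Ws = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(3)]
blocks = [lambda h, W=W: h + np.tanh(h @ W) for W in Ws]  # toy residual blocks
x = rng.standard_normal((4, dim))
y = recurrent_forward(x, blocks, repeats=4, repeat_scales=np.full(4, 0.1))
```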
XSA
Exclusive Self-Attention applied to the last 4 effective layers.
parameters: {"layers":4}
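The summary does not define XSA beyond its name, so the sketch below takes one plausible reading: causal attention in which each token is additionally *excluded* from attending to itself (the diagonal is masked). Treat this as a hypothetical interpretation, not the PR's actual kernel:

```python
import numpy as np

def exclusive_self_attention(q, k, v):
    """Single-head attention where position t attends only to strictly
    earlier positions (self excluded on top of the causal mask). The
    first token has nothing left to attend to, so its output is zero."""
    T, d = q.shape
    scores = (q @ k.T) / np.sqrt(d)
    out = np.zeros_like(v)
    for t in range(1, T):
        s = scores[t, :t]             # keys 0..t-1 only: self is excluded
        w = np.exp(s - s.max())
        w /= w.sum()
        out[t] = w @ v[:t]
    return out

rng = np.random.default_rng(1)
q, k, v = (rng.standard_normal((5, 4)) for _ in range(3))
att = exclusive_self_attention(q, k, v)
```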
Value Embeddings
Adds extra embedding tables mixed into the residual stream at each effective layer.
parameters: {"tables":2}
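A sketch of value embeddings as extra token-indexed tables added to the residual stream. The table-to-layer assignment and the gating below are assumptions (the PR only states that 2 tables are mixed in at each effective layer); shapes are toy-sized:

```python
import numpy as np

vocab, dim, n_tables = 256, 16, 2     # illustrative sizes; tables=2 per the PR
rng = np.random.default_rng(2)
value_embs = [rng.standard_normal((vocab, dim)) * 0.02 for _ in range(n_tables)]
gates = np.full(n_tables, 0.5)        # learned mixing weights (assumed scalar)

def mix_value_embeddings(h, tokens, layer):
    """Add a token-indexed extra embedding into the residual stream `h`.
    A simple assumed scheme alternates the two tables across layers."""
    i = layer % n_tables
    return h + gates[i] * value_embs[i][tokens]

tokens = rng.integers(0, vocab, size=8)
h = np.zeros((8, dim))
h = mix_value_embeddings(h, tokens, layer=0)
```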
Loop Embedding
Adds a learned per-layer vector before each block as depth-wise positional encoding.
parameters: null
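A loop embedding can be sketched as one learned vector per effective layer, added to the hidden state before the shared block runs, so the block knows which depth it is currently computing. Shapes below are illustrative:

```python
import numpy as np

effective_layers, dim = 12, 16        # 12 effective layers as in this PR
rng = np.random.default_rng(3)
loop_emb = rng.standard_normal((effective_layers, dim)) * 0.02  # learned

def with_loop_embedding(h, layer_idx):
    """Depth-wise positional encoding: since block weights are reused
    across repeats, this tells the block which effective layer it is."""
    return h + loop_emb[layer_idx]

h = np.zeros((4, dim))
h0 = with_loop_embedding(h, 0)
```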
LeakyReLU^2
Uses LeakyReLU(0.5)^2 in place of ReLU^2.
parameters: {"negative_slope":0.5}
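Taking the literal reading of the activation (LeakyReLU with slope 0.5, then squared; whether the sign is restored after squaring is not stated in this summary):

```python
import numpy as np

def leaky_relu_sq(x, negative_slope=0.5):
    """LeakyReLU(0.5) followed by squaring, replacing relu(x)**2.
    Negative inputs now contribute 0.25 * x**2 instead of zero."""
    y = np.where(x >= 0, x, negative_slope * x)
    return y * y

out = leaky_relu_sq(np.array([-2.0, -1.0, 0.0, 1.0, 2.0]))
# → [1.0, 0.25, 0.0, 1.0, 4.0]
```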
Quantization
GPTQ-lite
bits: 8
scope: all
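The exact GPTQ-lite procedure is not spelled out here; as a stand-in, the sketch below does per-row symmetric 8-bit round-to-nearest quantization, trying five clip percentiles and keeping the one with the lowest reconstruction error (echoing the "best-of-5 clip percentiles" mentioned under Novel Contributions). All details are assumptions:

```python
import numpy as np

def quantize_best_clip(w, bits=8, percentiles=(99.0, 99.5, 99.9, 99.99, 100.0)):
    """Symmetric round-to-nearest quantization per output row, searching
    over clip percentiles for the lowest mean-squared reconstruction error."""
    qmax = 2 ** (bits - 1) - 1
    best, best_err = None, np.inf
    for p in percentiles:
        clip = np.percentile(np.abs(w), p, axis=1, keepdims=True)
        clip = np.maximum(clip, 1e-8)
        q = np.clip(np.round(w / clip * qmax), -qmax, qmax)
        deq = q * clip / qmax                 # dequantize to measure error
        err = np.square(w - deq).mean()
        if err < best_err:
            best, best_err = (q.astype(np.int8), clip), err
    return best, best_err

rng = np.random.default_rng(4)
w = rng.standard_normal((16, 32))
(q, scale), err = quantize_best_clip(w)
```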
Optimizer
Muon
weight_decay: 0.04
momentum: null
other_params: null
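For orientation, a compact sketch of a Muon-style step: momentum on the raw gradient, then orthogonalize the buffered update with a Newton-Schulz iteration before applying it with decoupled weight decay (0.04 as listed above). The cubic iteration, learning rate, and momentum value below are assumptions; Muon's reference implementation uses a tuned quintic iteration, which this does not reproduce:

```python
import numpy as np

def newton_schulz_orth(g, steps=5):
    """Approximately orthogonalize a matrix via the cubic Newton-Schulz
    iteration, after Frobenius normalization so singular values are <= 1."""
    x = g / (np.linalg.norm(g) + 1e-7)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x.T if transposed else x

def muon_step(w, g, buf, lr=0.02, momentum=0.95, weight_decay=0.04):
    """One Muon-style update (lr and momentum are illustrative values)."""
    buf = momentum * buf + g
    update = newton_schulz_orth(buf)
    w = w * (1 - lr * weight_decay) - lr * update
    return w, buf

rng = np.random.default_rng(5)
g = rng.standard_normal((8, 8))
o = newton_schulz_orth(g)
w, buf = muon_step(np.zeros((8, 8)), g, np.zeros((8, 8)))
```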
Weight Averaging
SWA
parameters: null
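SWA with unspecified parameters can be sketched as a running mean over checkpoints collected late in training (when and how often checkpoints are sampled here is not stated):

```python
import numpy as np

class SWA:
    """Stochastic Weight Averaging: maintain an incremental mean of
    parameter snapshots; the averaged weights are used for evaluation."""
    def __init__(self):
        self.avg, self.n = None, 0

    def update(self, params):
        self.n += 1
        if self.avg is None:
            self.avg = [p.copy() for p in params]
        else:
            for a, p in zip(self.avg, params):
                a += (p - a) / self.n   # incremental running mean

swa = SWA()
for step in range(3):
    swa.update([np.full((2, 2), float(step))])
# swa.avg[0] is now the mean of 0, 1, 2 everywhere, i.e. 1.0
```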
Compression
zstd
level: 22
Evaluation
sliding window eval
parameters: {"stride":256,"window":1024}
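With window 1024 and stride 256, sliding-window evaluation scores the sequence in overlapping windows but counts loss only on the tokens not yet scored, so every token is evaluated exactly once with up to a full window of context. A sketch of the span bookkeeping (the scoring itself is model-specific and omitted):

```python
def sliding_window_spans(n_tokens, window=1024, stride=256):
    """Return (context_start, score_from, score_to) triples: each window
    sees context [context_start, score_to) and contributes loss only on
    [score_from, score_to), tiling the sequence without overlap."""
    spans, covered, start = [], 0, 0
    while covered < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, covered, end))
        covered = end
        start += stride
    return spans

spans = sliding_window_spans(2500)   # e.g. a 2500-token eval document
```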
Sequence Length
sequence_length
train_length: 1024
eval_length: 1024
LR Schedule
warmdown
parameters: {"warmdown_iters":3000}
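A warmdown schedule holds the learning rate flat and then decays it linearly to zero over the final warmdown_iters steps (3000 here). Any warmup phase is omitted since the summary lists only the warmdown; total_steps and base_lr below are illustrative:

```python
def warmdown_lr(step, total_steps, warmdown_iters=3000, base_lr=1.0):
    """Constant learning rate until the last `warmdown_iters` steps,
    then a linear ramp down to zero at `total_steps`."""
    steps_left = total_steps - step
    if steps_left >= warmdown_iters:
        return base_lr
    return base_lr * steps_left / warmdown_iters

lrs = [warmdown_lr(s, total_steps=10000) for s in (0, 7000, 8500, 10000)]
# → [1.0, 1.0, 0.5, 0.0]
```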
Regularization
weight decay
parameters: {"weight_decay":0.04}

Novel Contributions

  • Depth recurrence with Cross-Repeat Skip to turn stateless weight sharing into stateful recurrence
  • Exclusive Self-Attention on the last 4 effective layers
  • LeakyReLU(0.5)^2 activation replacing ReLU^2
  • Value Embeddings mixed into the residual stream
  • Loop Embedding as depth-wise positional encoding
  • GPTQ-lite post-training quantization with best-of-5 clip percentiles
  • zstd-22 compression and SWA for artifact optimization