PR #1415 (open)

Record: SP4096 + 3-Layer Recurrence + GPTQ Embeddings + SDClip + ETLB — val_bpb 1.0913 (3-seed mean)

  • val_bpb: 1.0913
  • Architecture: Transformer
  • Optimizer: Muon
  • Artifact Size: ~14.75 MB
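For reference, the headline metric follows the usual bits-per-byte convention: summed token-level negative log-likelihood converted from nats to bits, divided by the byte length of the evaluated text. This helper is an illustration of that standard definition, not code from the PR:

```python
import math

def bits_per_byte(total_nll_nats, total_bytes):
    """Convert a summed NLL (in nats) over a corpus into bits per byte."""
    return total_nll_nats / math.log(2) / total_bytes

# Sanity check: an NLL of ln(2) nats per byte is exactly 1.0 bpb.
bpb = bits_per_byte(math.log(2) * 800, 800)
```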

Training Techniques

Quantization
  • GPTQ (bits: 8, scope: embeddings)
  • GPTQ (bits: 6, scope: all)
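The quantization entries above can be illustrated with a simplified uniform quantizer. This is a hedged sketch only: real GPTQ additionally runs a Hessian-weighted error-correction pass not shown here, and "SDClip" is assumed to mean clipping the quantization range at a multiple of the weights' standard deviation before rounding:

```python
import math

def sdclip_quantize_row(row, bits=8, sigma_clip=3.0):
    """Uniform symmetric quantize/dequantize of one weight row, with the
    range clipped at sigma_clip standard deviations (assumed SDClip-style
    behavior; a stand-in for the PR's GPTQ + SDClip pipeline)."""
    qmax = 2 ** (bits - 1) - 1
    mean = sum(row) / len(row)
    std = math.sqrt(sum((x - mean) ** 2 for x in row) / len(row))
    scale = (sigma_clip * std + 1e-12) / qmax
    # Round to the integer grid, clip outliers, then dequantize.
    return [max(-qmax, min(qmax, round(x / scale))) * scale for x in row]

row = [0.5, -1.2, 0.03, 2.4, -0.7, 0.0, 1.1, -2.0]
deq = sdclip_quantize_row(row, bits=8)
```

At 8 bits the round-trip error stays within half a quantization step for any weight inside the clip range.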
Compression
  • lzma (level: null)
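A minimal example of the LZMA wrapper idea, using Python's standard-library lzma module. The PR's level: null leaves the compression preset unspecified, so preset=9 below is an assumption:

```python
import lzma

payload = b"model weights placeholder " * 100  # stand-in for the artifact bytes
packed = lzma.compress(payload, preset=9)      # preset=9 is an assumed setting
restored = lzma.decompress(packed)
ratio = len(packed) / len(payload)             # artifact savings from the wrapper
```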
Architecture
  • 3-layer depth recurrence applied to layers 3, 4, and 5 (parameters: {"layers":[3,4,5]})
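The depth-recurrence idea can be sketched as a forward pass in which layers 3–5 form a weight-shared block applied repeatedly. The PR does not state how many recurrence iterations are used, so steps=3 below is purely illustrative, and the toy "layers" stand in for real transformer blocks:

```python
def run_with_depth_recurrence(x, layers, recur_ids=(3, 4, 5), steps=3):
    """Forward pass where the layers in recur_ids are applied `steps` times
    in sequence, reusing the same weights each iteration (a sketch of the
    PR's 3-layer depth recurrence; iteration count is an assumption)."""
    i = 0
    while i < len(layers):
        if i == recur_ids[0]:
            for _ in range(steps):
                for j in recur_ids:
                    x = layers[j](x)
            i = recur_ids[-1] + 1  # skip past the recurrent block
        else:
            x = layers[i](x)
            i += 1
    return x

# Toy "layers": each adds its own index, so the trace is easy to check.
layers = [lambda x, k=k: x + k for k in range(6)]
out = run_with_depth_recurrence(0, layers)
```

With these toy layers the output is 0+1+2 plus three passes of 3+4+5, i.e. 39, confirming the block ran three times.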
Evaluation
  • sliding window eval (parameters: {"stride":64})
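Sliding-window evaluation with a stride scores a long sequence in overlapping windows, counting each token's loss exactly once while giving later tokens more context than disjoint chunking would. The window size is not stated in the PR, so window=256 below is an assumption; only stride=64 comes from the parameters:

```python
def sliding_window_spans(seq_len, window=256, stride=64):
    """Return (begin, end, n_scored) spans for sliding-window eval: each
    window advances by `stride` and scores only the tokens not already
    covered by the previous window (window=256 is an assumed size)."""
    spans, prev_end = [], 0
    for begin in range(0, seq_len, stride):
        end = min(begin + window, seq_len)
        spans.append((begin, end, end - prev_end))
        prev_end = end
        if end == seq_len:
            break
    return spans

spans = sliding_window_spans(400)
```

Every token is scored exactly once, and no window exceeds the context length.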
Other
  • Eval-time logit bias (ETLB) optimized on context tokens during sliding-window evaluation (parameters: {"method":"ETLB","steps":5,"learning_rate":0.05,"clip":3,"warm_start":true})
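The ETLB entry can be sketched as fitting a single per-vocabulary bias vector, added to the model's logits, by a few gradient steps of cross-entropy on the tokens already seen as context. The hyperparameters below mirror the parameters dict (steps=5, learning_rate=0.05, clip=3, warm start via the `bias` argument); everything else, including the exact loss and averaging, is an assumption about how ETLB works:

```python
import math

def etlb_bias(logit_rows, targets, vocab, steps=5, lr=0.05, clip=3.0, bias=None):
    """Fit a per-vocab logit bias b by gradient descent on softmax
    cross-entropy over context tokens (a sketch of the PR's ETLB; the
    gradient of CE w.r.t. a logit is softmax(z) minus the one-hot target).
    Passing a previous window's bias in via `bias` is the warm start."""
    b = list(bias) if bias is not None else [0.0] * vocab
    for _ in range(steps):
        grad = [0.0] * vocab
        for row, t in zip(logit_rows, targets):
            z = [row[v] + b[v] for v in range(vocab)]
            m = max(z)  # subtract max for numerical stability
            exps = [math.exp(v - m) for v in z]
            s = sum(exps)
            for v in range(vocab):
                grad[v] += exps[v] / s  # softmax probability
            grad[t] -= 1.0              # minus one-hot target
        n = len(targets)
        b = [bv - lr * gv / n for bv, gv in zip(b, grad)]
        b = [max(-clip, min(clip, bv)) for bv in b]  # clip: 3
    return b

# Toy context: uniform logits, target token 0 every time -> b[0] should rise.
rows = [[0.0, 0.0, 0.0, 0.0]] * 8
bias = etlb_bias(rows, [0] * 8, vocab=4)
```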
Regularization
  • weight decay (parameters: {"weight_decay":0.095})
LR Schedule
  • higher LR compensation (parameters: {"matrix_lr":0.022})

Novel Contributions

  • SP4096 vocabulary
  • GPTQ quantization on embeddings
  • SDClip quantization clipping
  • 3-layer depth recurrence
  • Eval-time logit bias (ETLB)
  • QK-Gain 5.0
  • LZMA code wrapper for artifact savings