PR #2028

open

[codex] Non-record: SP8192 Past-only Delta Geo Ruler - val_bpb 1.08979

by Arnie016
val_bpb
1.0898
Architecture
Transformer
Optimizer
Artifact Size
15,995,193 bytes

Training Techniques

Architecture
depth recurrence
3-layer depth recurrence over middle layers of the model.
parameters: {"layers":[3,4,5]}
weight tying
Tied input and output embeddings.
parameters: null
Partial RoPE
Uses partial rotary positional embeddings.
parameters: null
GQA
Uses grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
LeakyReLU
Uses LeakyReLU squared MLP activation.
parameters: {"slope":0.5}
parallel residuals
Adds parallel residual connections from later layers.
parameters: {"layer":7}
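
As a rough illustration of how the architecture parameters above might be read, here is a minimal pure-Python sketch. It assumes that "LeakyReLU squared" means squaring the LeakyReLU output, that query heads are grouped contiguously onto KV heads, and that the recurrent layers [3,4,5] are simply replayed in order. The function names, the `num_layers` argument, and the `passes` count are hypothetical; the summary does not state the total depth or how many times the block is repeated.

```python
def leaky_relu_squared(x: float, slope: float = 0.5) -> float:
    """LeakyReLU with the listed slope, followed by squaring (assumed order)."""
    y = x if x >= 0.0 else slope * x
    return y * y


def kv_head_for(query_head: int, heads: int = 8, kv_heads: int = 4) -> int:
    """Map a query head to the KV head it shares (contiguous grouping assumed)."""
    group_size = heads // kv_heads  # 8 // 4 = 2 query heads per KV head
    return query_head // group_size


def layer_schedule(num_layers: int, recurrent=(3, 4, 5), passes: int = 2) -> list:
    """Order in which layers run when the `recurrent` block is replayed.

    `num_layers` and `passes` are illustrative values, not taken from the PR.
    """
    first, last = recurrent[0], recurrent[-1]
    schedule = list(range(first))               # layers before the block
    for _ in range(passes):
        schedule.extend(recurrent)              # replay the shared block
    schedule.extend(range(last + 1, num_layers))  # layers after the block
    return schedule
```

With these assumptions, query heads 0 and 1 share KV head 0, heads 2 and 3 share KV head 1, and so on, while the recurrence adds effective depth without adding parameters.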
Quantization
GPTQ
bits: 6
scope: matrices
int8
bits: 8
scope: embeddings
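
To make the bit widths above concrete, the following is a sketch of plain symmetric round-to-nearest quantization of one weight row. It is not GPTQ itself: GPTQ additionally compensates rounding error using second-order statistics from calibration data, which is omitted here. The per-row scale and the function names are assumptions for illustration only.

```python
def quantize_row(row, bits):
    """Symmetric round-to-nearest quantization of one row to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1          # 31 for 6 bits, 127 for 8 bits
    max_abs = max(abs(v) for v in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return codes, scale


def dequantize_row(codes, scale):
    """Recover approximate weights from integer codes and the row scale."""
    return [c * scale for c in codes]
```

Under this scheme a 6-bit matrix row uses integer codes in [-31, 31] and an 8-bit embedding row uses codes in [-127, 127], with a reconstruction error of at most half a quantization step per weight.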
Compression
Brotli
level: null
Evaluation
sliding window eval
parameters: null
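
Since no parameters are listed for the sliding window evaluation, the sketch below only illustrates the general technique: overlapping windows are evaluated so that each token is scored exactly once while still seeing earlier context. The `context` and `stride` values and both function names are hypothetical. The bits-per-byte helper assumes the summed loss is in nats, which may not match the submission's own bookkeeping.

```python
import math


def sliding_windows(num_tokens: int, context: int, stride: int) -> list:
    """Return (begin, end, score_from) triples.

    Each window feeds tokens [begin, end) to the model, but only tokens
    [score_from, end) contribute to the loss, so no token is counted twice.
    """
    assert 0 < stride <= context
    windows = []
    prev_end = 0
    for begin in range(0, num_tokens, stride):
        end = min(begin + context, num_tokens)
        windows.append((begin, end, prev_end))
        prev_end = end
        if end == num_tokens:
            break
    return windows


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood in nats to bits per byte."""
    return total_nll_nats / (math.log(2) * total_bytes)
```

A smaller stride gives each scored token more left context at the cost of more forward passes.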
Test-Time Training
TTT
parameters: {"enabled":false}
Regularization
logit softcap
parameters: null
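
Logit softcapping is commonly implemented as a scaled tanh that smoothly bounds logits. The sketch below shows that common form. The cap value of 30.0 is a placeholder, since the summary lists no parameters, and the function name is hypothetical.

```python
import math


def softcap(logits, cap: float = 30.0):
    """Smoothly squash logits into (-cap, cap) via cap * tanh(x / cap)."""
    return [cap * math.tanh(x / cap) for x in logits]
```

Small logits pass through almost unchanged, while very large ones saturate near the cap, which limits how confident any single prediction can become.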

Novel Contributions

  • Strictly past-only delta geo ruler using geometric offsets from earlier hidden states
  • Avoids future anchors, same-block averaging, eval-built caches, and score-after-update adaptation
  • Single-seed non-record submission that improves on the author's prior local sliding-window result
  • Uses a tiny late-layer ruler injected into layers 9 and 10
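
The summary does not specify how the ruler is computed, so the following is only one possible reading of "past-only" and "geometric offsets", not the author's implementation. It assumes the offsets are powers of two and that each feature is the difference between the current hidden state and an earlier one. All names are hypothetical.

```python
def geometric_offsets(max_offset: int) -> list:
    """Offsets 1, 2, 4, ... up to max_offset (power-of-two spacing assumed)."""
    offsets = []
    k = 1
    while k <= max_offset:
        offsets.append(k)
        k *= 2
    return offsets


def past_only_deltas(hidden, t: int, offsets) -> list:
    """Differences h[t] - h[t - k] for each offset k, using only positions <= t.

    `hidden` is a list of per-position vectors (lists of floats). Offsets that
    would reach before the start of the sequence yield a zero vector.
    """
    current = hidden[t]
    deltas = []
    for k in offsets:
        if t - k >= 0:
            past = hidden[t - k]
            deltas.append([c - p for c, p in zip(current, past)])
        else:
            deltas.append([0.0] * len(current))
    return deltas
```

Because the function never indexes beyond position `t`, changing any later hidden state cannot affect the result, which is the property the "no future anchors" bullet describes.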