PR #2028
open
[codex] Non-record: SP8192 Past-only Delta Geo Ruler - val_bpb 1.08979
by Arnie016
val_bpb
1.0898
Architecture
Transformer
Optimizer
—
Artifact Size
15,995,193 bytes
Training Techniques
Architecture
depth recurrence
3-layer depth recurrence over middle layers of the model.
parameters: {"layers":[3,4,5]}
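A minimal sketch of how depth recurrence over layers 3, 4, 5 could change the order in which layers run. The total layer count (11) and the number of passes through the loop (2) are assumptions for illustration; the summary states only which layers recur.

```python
def layer_schedule(num_layers, recurrent_layers, repeats):
    """Return the order in which layer indices are applied in one forward pass."""
    first, last = recurrent_layers[0], recurrent_layers[-1]
    schedule = list(range(first))                 # layers before the loop run once
    for _ in range(repeats):                      # looped layers reuse the same weights
        schedule.extend(recurrent_layers)
    schedule.extend(range(last + 1, num_layers))  # layers after the loop run once
    return schedule


# num_layers=11 and repeats=2 are hypothetical values, not taken from the PR.
schedule = layer_schedule(num_layers=11, recurrent_layers=[3, 4, 5], repeats=2)
print(schedule)  # [0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7, 8, 9, 10]
```

The effective depth grows (14 layer applications here) while the parameter count stays that of 11 layers, which is the usual motivation when artifact size is constrained.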
weight tying
Tied input and output embeddings.
parameters: null
Partial RoPE
Uses partial rotary positional embeddings.
parameters: null
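As an illustration of partial RoPE, the sketch below rotates only the first `rotary_dim` components of a head vector and passes the rest through unchanged. The rotated fraction, the base of 10000, and the interleaved pairing of dimensions are assumptions; the summary gives no parameters.

```python
import math


def partial_rope(vec, position, rotary_dim, base=10000.0):
    """Rotate the first rotary_dim entries of vec pairwise; leave the rest as-is."""
    if rotary_dim % 2 != 0 or rotary_dim > len(vec):
        raise ValueError("rotary_dim must be even and at most len(vec)")
    out = list(vec)
    for i in range(0, rotary_dim, 2):
        theta = position * base ** (-i / rotary_dim)  # per-pair rotation angle
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out


# Hypothetical 8-dimensional head with half of the dimensions rotated.
v = [1.0, 0.0, 0.0, 1.0, 5.0, 6.0, 7.0, 8.0]
rotated = partial_rope(v, position=3, rotary_dim=4)
print(rotated[4:])  # [5.0, 6.0, 7.0, 8.0]
```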
GQA
Uses grouped query attention with fewer KV heads than attention heads.
parameters: {"heads":8,"kv_heads":4}
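With 8 query heads and 4 KV heads, each KV head is shared by a group of 2 query heads. The sketch below shows one common mapping (contiguous groups); whether this PR groups heads this way is an assumption.

```python
def kv_head_for_query_head(q_head, heads=8, kv_heads=4):
    """Map a query head index to the KV head it shares (contiguous grouping)."""
    if heads % kv_heads != 0:
        raise ValueError("heads must be a multiple of kv_heads")
    group_size = heads // kv_heads  # query heads per KV head
    return q_head // group_size


mapping = [kv_head_for_query_head(h) for h in range(8)]
print(mapping)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

Halving the KV heads halves the key and value projection parameters, and the KV cache, relative to full multi-head attention.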
LeakyReLU
Uses a squared LeakyReLU activation in the MLP.
parameters: {"slope":0.5}
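One plausible reading of this activation, sketched on scalars: apply LeakyReLU with slope 0.5, then square the result. Whether the PR squares the output directly, as here, or preserves the sign of negative inputs is not stated, so treat the form below as an assumption.

```python
def leaky_relu_squared(x, slope=0.5):
    """LeakyReLU followed by squaring (assumed form; sign is not preserved)."""
    y = x if x >= 0 else slope * x
    return y * y


print(leaky_relu_squared(2.0))   # 4.0
print(leaky_relu_squared(-2.0))  # 1.0
```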
parallel residuals
Adds parallel residual connections from later layers.
parameters: {"layer":7}
Quantization
GPTQ
bits: 6
scope: matrices
int8
bits: 8
scope: embeddings
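To make the two bit widths concrete, the sketch below quantizes a small list of weights onto symmetric 6-bit and 8-bit integer grids with plain round-to-nearest. This is not GPTQ, which additionally compensates rounding error using second-order statistics; it only illustrates the grids that 6-bit and 8-bit storage imply. The per-list scale is an assumption.

```python
def quantize_symmetric(values, bits):
    """Round-to-nearest symmetric quantization; returns (integer codes, scale)."""
    qmax = 2 ** (bits - 1) - 1            # 31 for 6 bits, 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


weights = [-1.0, 0.0, 0.25, 1.0]          # hypothetical weights
codes6, scale6 = quantize_symmetric(weights, bits=6)
codes8, scale8 = quantize_symmetric(weights, bits=8)
print(codes6)  # [-31, 0, 8, 31]
print(codes8)  # [-127, 0, 32, 127]
```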
Compression
Brotli
level: null
Evaluation
sliding window eval
parameters: null
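A sketch of how sliding-window evaluation is commonly scheduled: windows overlap, and each window scores only the tokens not already scored, so every token is counted once with as much left context as the window allows. The window length and stride below are hypothetical; the summary gives no parameters.

```python
def sliding_windows(num_tokens, window, stride):
    """Return (start, end, score_from) spans; tokens in [score_from, end) are scored."""
    if window <= 0 or not 1 <= stride <= window:
        raise ValueError("need window > 0 and 1 <= stride <= window")
    spans = []
    scored_up_to = 0
    start = 0
    while scored_up_to < num_tokens:
        end = min(start + window, num_tokens)
        spans.append((start, end, scored_up_to))  # context starts at `start`
        scored_up_to = end
        start += stride
    return spans


# Hypothetical sizes chosen to keep the output readable.
for span in sliding_windows(num_tokens=10, window=4, stride=2):
    print(span)
# (0, 4, 0)
# (2, 6, 4)
# (4, 8, 6)
# (6, 10, 8)
```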
Test-Time Training
TTT
parameters: {"enabled":false}
Regularization
logit softcap
parameters: null
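Logit softcapping is often implemented as `cap * tanh(logit / cap)`, which is close to the identity for small logits and saturates at plus or minus `cap`. The sketch below assumes that form and a hypothetical cap of 30.0; the summary specifies neither.

```python
import math


def softcap(logit, cap=30.0):
    """Smoothly bound a logit to the open interval (-cap, cap)."""
    return cap * math.tanh(logit / cap)


print(softcap(0.0))               # 0.0
print(abs(softcap(1e6)) <= 30.0)  # True
```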
Novel Contributions
- Strictly past-only delta geo ruler using geometric offsets from earlier hidden states
- Avoids future anchors, same-block averaging, eval-built caches, and score-after-update adaptation
- Single-seed non-record submission that improves the author's prior local sliding-window result
- Uses a tiny late-layer ruler injected into layers 9 and 10
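The past-only constraint in the bullets above can be illustrated with a small sketch. It assumes that "geometric offsets" means offsets of 1, 2, 4, 8, ... positions back and that each delta is the current hidden state minus an earlier one; both are readings of the summary, not confirmed details of the PR, and the function names are hypothetical. Scalars stand in for hidden vectors.

```python
def past_geometric_anchors(t, max_offset):
    """Indices t-1, t-2, t-4, ... that exist and lie strictly before position t."""
    anchors = []
    offset = 1
    while offset <= max_offset and t - offset >= 0:
        anchors.append(t - offset)
        offset *= 2
    return anchors


def past_deltas(hidden, t, max_offset):
    """Differences between the state at t and each earlier anchor state."""
    return [hidden[t] - hidden[a] for a in past_geometric_anchors(t, max_offset)]


print(past_geometric_anchors(10, max_offset=8))  # [9, 8, 6, 2]
print(past_deltas([0.0, 1.0, 4.0, 9.0], t=3, max_offset=8))  # [5.0, 8.0]
```

Because every anchor index is strictly less than `t`, no future token or same-position average can leak into the feature, which is the property the contribution list emphasizes.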