PR #1755
open
Record: SP8192 + 3-Layer Recurrence + Parallel Residuals + QK-Gain 5.25 + Legal TTT + CaseOps Tokenizer — val_bpb 1.07462 (3-seed mean)
by OE-GODView on GitHub
val_bpb
1.0746
Architecture
Transformer
Optimizer
Muon
Artifact Size
15,991,629 bytes
Training Techniques
Architecture
weight tying
Tied input and output embeddings.
parameters: null
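A minimal sketch of weight tying, assuming a toy 3-token vocabulary: one table is used both for the input lookup and for the output projection. The names `EMBED`, `embed`, and `output_logits` are hypothetical and do not come from the PR.

```python
# Toy tied embedding: one table serves as both input lookup and output projection.
EMBED = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]  # shape: vocab_size x model_dim (hypothetical toy values)

def embed(token_id):
    """Input side: look up the row for this token."""
    return EMBED[token_id]

def output_logits(hidden):
    """Output side: logits[v] = <EMBED[v], hidden>, with no separate output matrix."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in EMBED]

logits = output_logits(embed(1))
```

Because both directions share the same parameters, the table is stored once in the artifact.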
depth recurrence
Reuses a group of layers recurrently, so the same layers are traversed more than once along the encoder and decoder paths.
parameters: {"layers":3}
U-Net skip connections
Uses gated U-Net-style skip connections.
parameters: null
Partial RoPE
Applies rotary position embeddings to only 16 of the 64 dimensions; the remaining dimensions carry no positional rotation.
parameters: {"dimensions":16,"total_dimensions":64}
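The partial rotation can be sketched as follows. The 16-of-64 split comes from the parameters above; the adjacent-pair layout and the frequency base of 10000 are assumptions, and `partial_rope` is a hypothetical name.

```python
import math

def partial_rope(vec, position, rotary_dims=16, base=10000.0):
    """Rotate the first `rotary_dims` entries of `vec` in adjacent pairs; pass the rest through."""
    out = list(vec)
    for i in range(0, rotary_dims, 2):
        # Per-pair angle, following the usual RoPE frequency schedule (assumed here).
        theta = position * base ** (-i / rotary_dims)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out

head = [1.0] * 64
rotated = partial_rope(head, position=3)
```

Entries 16 through 63 are returned unchanged, and each rotated pair keeps its norm.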
LeakyReLU
Uses a squared LeakyReLU activation in the MLP.
parameters: {"negative_slope":0.5}
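One plausible reading of this activation, assuming the LeakyReLU output is squared elementwise (so negative inputs also yield a positive value); the PR may define it differently, and `leaky_relu_squared` is a hypothetical name.

```python
def leaky_relu_squared(x, negative_slope=0.5):
    """Apply LeakyReLU with the given slope, then square the result."""
    y = x if x >= 0 else negative_slope * x
    return y * y

positive = leaky_relu_squared(2.0)   # 2.0 passes through, then is squared
negative = leaky_relu_squared(-2.0)  # -2.0 is scaled to -1.0, then squared
```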
parallel residuals
Attention and MLP branches operate on the same pre-residual input in later layers.
parameters: {"start_layer":7}
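A sketch of the difference between sequential and parallel residual blocks, with normalization omitted and scalar stand-ins for the branches. Treating `start_layer` as inclusive is an assumption, and `block` is a hypothetical name.

```python
def block(x, attn, mlp, layer_idx, start_layer=7):
    """Toy residual block; layers at or after `start_layer` use the parallel form."""
    if layer_idx >= start_layer:
        # Parallel: both branches read the same pre-residual input x.
        return x + attn(x) + mlp(x)
    # Sequential: the MLP sees the attention-updated stream.
    h = x + attn(x)
    return h + mlp(h)

toy_attn = lambda v: 2.0 * v   # stand-in for the attention branch
toy_mlp = lambda v: v + 1.0    # stand-in for the MLP branch

sequential_out = block(1.0, toy_attn, toy_mlp, layer_idx=0)
parallel_out = block(1.0, toy_attn, toy_mlp, layer_idx=7)
```

In the parallel form the two branches have no data dependency on each other, so they can be computed from the same input.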
Regularization
logit softcap
parameters: {"cap":30}
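Logit softcapping is commonly implemented as a scaled tanh; a minimal sketch under that assumption, using the cap of 30 listed above (`softcap` is a hypothetical name):

```python
import math

def softcap(logit, cap=30.0):
    """Squash a logit smoothly into [-cap, cap] while staying near-identity around zero."""
    return cap * math.tanh(logit / cap)

small = softcap(1.0)     # almost unchanged
large = softcap(1000.0)  # saturates at the cap
```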
Weight Averaging
EMA
parameters: {"decay":0.9965}
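The EMA update with the listed decay can be sketched as below; `ema_update` is a hypothetical name and the weights are toy values.

```python
def ema_update(ema_weights, weights, decay=0.9965):
    """Blend the running average with the current weights after a training step."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]

ema = [0.0, 1.0]
ema = ema_update(ema, [1.0, 1.0])
```

With a decay of 0.9965 each step moves the average 0.35% of the way toward the current weights.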
Optimizer
Muon
weight_decay: 0.095
momentum: null
other_params: {"variant":"MuonEq-R","newton_schulz_steps":5}
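For orientation, a pure-Python sketch of a 5-step Newton-Schulz orthogonalization in the style of the public Muon reference implementation. The quintic coefficients are taken from that reference and are an assumption here; whatever the MuonEq-R variant changes is not covered by this sketch.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def newton_schulz(g, steps=5, eps=1e-7):
    """Approximately orthogonalize g by pushing its singular values toward 1."""
    a, b, c = 3.4445, -4.7750, 2.0315  # assumed: coefficients from reference Muon
    flip = len(g) > len(g[0])          # work on the wide orientation
    x = transpose(g) if flip else [row[:] for row in g]
    norm = math.sqrt(sum(v * v for row in x for v in row)) + eps
    x = [[v / norm for v in row] for row in x]  # Frobenius normalization
    for _ in range(steps):
        gram = matmul(x, transpose(x))           # A = X X^T
        gram2 = matmul(gram, gram)               # A^2
        poly = [[b * p + c * q for p, q in zip(r1, r2)]
                for r1, r2 in zip(gram, gram2)]  # bA + cA^2
        update = matmul(poly, x)                 # (bA + cA^2) X
        x = [[a * xv + uv for xv, uv in zip(r1, r2)]
             for r1, r2 in zip(x, update)]       # X <- aX + (bA + cA^2) X
    return transpose(x) if flip else x

ortho = newton_schulz([[3.0, 0.0], [0.0, 1.0]])
```

The iteration does not converge to exact orthogonality; it only brings the singular values into a band around 1, which is the behaviour the reference implementation relies on.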
AdamW
weight_decay: 0.02
momentum: null
other_params: {"scope":"embeddings/scalars"}
Quantization
GPTQ
bits: 6
scope: block weights
GPTQ
bits: 8
scope: embeddings
Evaluation
sliding window eval
parameters: null
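A sketch of how sliding-window evaluation can partition a token stream so that every token is scored exactly once with as much left context as the window allows. The stride is an assumption (the entry lists no parameters), and `sliding_windows` is a hypothetical name.

```python
def sliding_windows(num_tokens, window=8192, stride=4096):
    """Return (context_start, score_start, end) spans; tokens in [score_start, end) are scored."""
    spans = []
    score_start = 0
    while score_start < num_tokens:
        # The first window scores everything it sees; later windows score only new tokens.
        step = window if score_start == 0 else stride
        end = min(score_start + step, num_tokens)
        context_start = max(0, end - window)
        spans.append((context_start, score_start, end))
        score_start = end
    return spans

spans = sliding_windows(20, window=8, stride=4)
```

Each span feeds tokens from `context_start` to `end` into the model but only counts the loss on the tokens from `score_start` onward.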
Test-Time Training
score-first TTT
parameters: {"optimizer":"SGD"}
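The score-first ordering can be illustrated with a one-parameter toy model: each chunk is scored with weights that have not yet seen it, and only afterwards is an SGD step taken on that chunk. The Bernoulli model, learning rate, and function names are hypothetical and stand in for the real network.

```python
import math

def chunk_nll(theta, chunk):
    """Summed negative log-likelihood (nats) of binary symbols under p(1) = sigmoid(theta)."""
    p = 1.0 / (1.0 + math.exp(-theta))
    return -sum(math.log(p if s == 1 else 1.0 - p) for s in chunk)

def score_first_ttt(chunks, theta=0.0, lr=0.5):
    total, per_chunk = 0.0, []
    for chunk in chunks:
        # 1) Score first, using weights that have never been updated on this chunk.
        loss = chunk_nll(theta, chunk)
        per_chunk.append(loss)
        total += loss
        # 2) Only then take one SGD step on the already-scored chunk.
        p = 1.0 / (1.0 + math.exp(-theta))
        grad = sum(p - s for s in chunk) / len(chunk)  # d(mean NLL)/d(theta)
        theta -= lr * grad
    return total, per_chunk, theta

chunks = [[1, 1, 1, 0]] * 4
total, per_chunk, theta = score_first_ttt(chunks)
```

The first chunk's score is identical to what a static model would report, which is the property that keeps the reported loss free of leakage from the update.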
Sequence Length
sequence_length
train_length: 8192
eval_length: 8192
Compression
lzma
level: null
brotli
level: 11
Other
other
Uses a lossless CaseOps tokenizer with a byte sidecar for original UTF-8 byte accounting during BPB computation.
parameters: null
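A sketch of bits-per-byte accounting with a byte sidecar, assuming the sidecar supplies the number of original UTF-8 bytes attributed to each token (the real sidecar format is not described here, and `bits_per_byte` is a hypothetical name). Tokens that only encode a case operation would contribute loss but zero original bytes.

```python
import math

def bits_per_byte(token_nll_nats, token_byte_counts):
    """Total loss in bits divided by the original byte count taken from the sidecar."""
    if len(token_nll_nats) != len(token_byte_counts):
        raise ValueError("one sidecar entry is expected per scored token")
    total_bits = sum(token_nll_nats) / math.log(2)  # nats -> bits
    total_bytes = sum(token_byte_counts)
    return total_bits / total_bytes

# Toy values: three tokens costing 2, 4, and 2 bits; the middle one is a
# case-op marker that maps to no bytes of the original text.
nll = [2 * math.log(2), 4 * math.log(2), 2 * math.log(2)]
sidecar_bytes = [3, 0, 5]
bpb = bits_per_byte(nll, sidecar_bytes)
```

Dividing by the original byte count rather than by the byte length of the transformed token text keeps the metric comparable with tokenizers that do not rewrite case.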
Novel Contributions
- Integrates a lossless CaseOps tokenizer and byte-sidecar BPB accounting into the merged legal-TTT stack.
- Adds byte-sidecar handling to validation evaluation functions for accurate original-byte BPB computation.
- Excludes pre-quant TTT to preserve score-before-update compliance.
- Fixes validation token loading to ignore byte-sidecar files and avoid double-counting.