PR #1906
Status: open
Record: PR #1797 reproduction — val_bpb 1.06136 (3-seed mean)
by AayushBaniya2006
val_bpb
1.0614
Architecture
Transformer
Optimizer
—
Artifact Size
15,950,662 bytes
Training Techniques
Quantization
GPTQ
bits: 6
scope: model weights
mixed int2/int4
bits: null
scope: LQER factors
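The LQER factors listed above refer to low-rank quantization-error reconstruction: the quantized weight is augmented with a low-rank term that absorbs the quantization error, W ≈ Q(W) + A·B. A minimal sketch of that idea, assuming uniform symmetric quantization and an SVD of the error (the PR's actual quantizer is GPTQ with asymmetric factors, which this does not reproduce):

```python
import numpy as np

def quantize_symmetric(w, bits=6):
    """Uniform symmetric quantization to the given bit width (illustrative)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale).clip(-qmax - 1, qmax) * scale

def lqer_factors(w, bits=6, rank=8):
    """Approximate the quantization error E = W - Q(W) with a rank-r SVD,
    so the deployed weight is Q(W) + A @ B (LQER-style sketch)."""
    q = quantize_symmetric(w, bits)
    err = w - q
    u, s, vt = np.linalg.svd(err, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # (m, r) left factor, scaled by singular values
    b = vt[:rank, :]             # (r, n) right factor
    return q, a, b

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, a, b = lqer_factors(w, bits=6, rank=8)
plain = np.linalg.norm(w - q)
corrected = np.linalg.norm(w - (q + a @ b))
```

The low-rank correction keeps the top singular directions of the error, so `corrected` is strictly smaller than `plain`.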
Architecture
SmearGate
Smear gate applied via a windowed gating mechanism.
parameters: {"window":12}
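Smear gates blend each position's representation with nearby previous positions through a learned gate; the card only specifies a window of 12. A minimal sketch under that assumption, using a scalar gate and a mean over the window (the PR's exact gating and weighting are not specified here):

```python
import numpy as np

def smear_gate(x, gate, window=12):
    """Windowed smear sketch: blend each position with the mean of up to
    `window` previous positions, weighted by a gate in (0, 1).
    The first position has no context and passes through unchanged."""
    t, d = x.shape
    out = x.copy()
    for i in range(t):
        lo = max(0, i - window)
        if i > lo:
            out[i] = (1 - gate) * x[i] + gate * x[lo:i].mean(axis=0)
    return out

x = np.random.default_rng(1).standard_normal((32, 16))
y = smear_gate(x, gate=0.25, window=12)
```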
Gated Attention
Quantized gated attention used in the stack.
parameters: null
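One common form of gated attention multiplies the attention output elementwise by a learned sigmoid gate. A single-head causal sketch of that pattern, without the quantization the PR applies (the PR's exact variant is unspecified in this card):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_attention(x, wq, wk, wv, wg):
    """Output-gated causal attention sketch: standard single-head
    attention whose output is modulated elementwise by sigmoid(x @ wg)."""
    t, d = x.shape
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(d)
    scores[np.triu(np.ones((t, t), dtype=bool), k=1)] = -np.inf  # causal mask
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return sigmoid(x @ wg) * (attn @ v)

rng = np.random.default_rng(3)
x = rng.standard_normal((8, 16))
ws = [rng.standard_normal((16, 16)) * 0.1 for _ in range(4)]
out = gated_attention(x, *ws)
```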
depth recurrence
Three-layer depth recurrence / looped residual structure.
parameters: {"layers":3}
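Depth recurrence here means re-applying one shared block several times in a loop instead of stacking distinct layers, so parameters are reused across depth. A minimal sketch with a stand-in residual block and the card's 3 loops:

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((16, 16)) * 0.1  # one shared block's weights

def block(h):
    """Stand-in for a transformer block: residual + nonlinearity."""
    return h + np.tanh(h @ W)

def depth_recurrent(h, layers=3):
    """Looped residual: re-apply the same shared block `layers` times,
    reusing one set of weights instead of stacking distinct layers."""
    for _ in range(layers):
        h = block(h)
    return h

h0 = rng.standard_normal((4, 16))
h3 = depth_recurrent(h0, layers=3)
```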
Test-Time Training
score-first TTT
parameters: {"phases":3}
Evaluation
sliding window eval
parameters: {"stride":64}
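Sliding-window evaluation scores a long sequence with a fixed context window advanced by a stride, counting each token's loss exactly once. A sketch with the card's stride of 64 and an assumed window of 512; `token_nll` is a hypothetical model interface, not the PR's actual API:

```python
import math

def sliding_window_bpb(token_nll, n_tokens, n_bytes, window=512, stride=64):
    """Sliding-window eval sketch: advance a fixed `window` by `stride`;
    each step scores only the newly-entered tokens, so every token is
    counted once. `token_nll(ctx_start, ctx_end, score_start)` returns
    summed nats for tokens in [score_start, ctx_end) given context
    [ctx_start, ctx_end) -- a hypothetical interface."""
    total_nats = 0.0
    pos = 0
    while pos < n_tokens:
        end = min(pos + (window if pos == 0 else stride), n_tokens)
        ctx_start = max(0, end - window)
        total_nats += token_nll(ctx_start, end, pos)
        pos = end
    # bits per byte = total bits over the raw byte length
    return total_nats / math.log(2) / n_bytes

# A uniform model over 256 byte values scores ~8.0 bits per byte
bpb = sliding_window_bpb(lambda a, b, s: (b - s) * math.log(256),
                         n_tokens=1000, n_bytes=1000)
```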
Sequence Length
sequence_length
train_length: null
eval_length: null
Regularization
logit softcap
parameters: null
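Logit softcapping squashes logits smoothly into (-cap, cap) via cap · tanh(logits / cap), staying near-identity for small values. The card leaves the cap value null, so the value below is an assumption:

```python
import numpy as np

def softcap(logits, cap=15.0):
    """Logit soft-capping: cap * tanh(logits / cap) bounds logits in
    (-cap, cap) while barely changing small values. cap=15.0 is an
    assumed value; this record lists no parameter."""
    return cap * np.tanh(logits / cap)

x = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
y = softcap(x)
# small logits pass through almost unchanged; extremes saturate near +/-cap
```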
Compression
lzma
level: null
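The artifact size above is measured after LZMA compression of the serialized model. A minimal round-trip sketch with Python's standard `lzma` module; the card's compression level is null, so `preset=9` is an assumption:

```python
import lzma

# Stand-in for serialized model bytes (highly regular, so it compresses well)
payload = bytes(range(256)) * 4096

compressed = lzma.compress(payload, preset=9)  # preset=9 is assumed
restored = lzma.decompress(compressed)
```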
Novel Contributions
- Independent 3-seed reproduction of PR #1797's stack
- Achieved 1.06136 val_bpb mean with 0.00059 std
- Reported >5σ improvement over PR #1797 baseline
- Included Smear Gate and LQER Asymmetric on top of PR #1787 base
- Documented a Gram Newton-Schulz ablation and dropped it after it caused a regression
- Used score-first phased TTT with sliding-window evaluation and a CaseOps byte sidecar