PR #1906

open

Record: PR #1797 reproduction — val_bpb 1.06136 (3-seed mean)

by AayushBaniya2006
val_bpb
1.0614
Architecture
Transformer
Optimizer
Artifact Size
15,950,662 bytes

Training Techniques

Quantization
GPTQ
bits: 6
scope: model weights
mixed int2/int4
bits: null
scope: LQER factors
Architecture
SmearGate
Smear gate applied with a windowed gate mechanism.
parameters: {"window":12}
Gated Attention
Quantized gated attention used in the stack.
parameters: null
depth recurrence
Three-layer depth recurrence / looped residual structure.
parameters: {"layers":3}
Test-Time Training
score-first TTT
parameters: {"phases":3}
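As a rough illustration of the score-first ordering, the sketch below scores each chunk with the current model state and only then adapts on that chunk, so no token contributes to its own prediction. The adaptive unigram model, the function name score_first_ttt, and the chunking are all hypothetical stand-ins. The PR's actual TTT update and the meaning of its three phases are not described on this page and are not modeled here.

```python
import math

def score_first_ttt(chunks, vocab_size=256):
    """Toy score-first loop: score a chunk, then adapt on it (hypothetical model)."""
    counts = [1.0] * vocab_size      # Laplace-smoothed unigram counts (toy stand-in)
    total = float(vocab_size)
    total_bits = 0.0
    n_tokens = 0
    for chunk in chunks:
        # 1) Score first: probabilities depend only on earlier chunks.
        for tok in chunk:
            total_bits += -math.log2(counts[tok] / total)
        n_tokens += len(chunk)
        # 2) Adapt afterwards on the chunk that was just scored.
        for tok in chunk:
            counts[tok] += 1.0
            total += 1.0
    return total_bits / n_tokens     # average bits per token
```

With a repeated chunk, the second pass is cheaper than the first because the model has already adapted on identical data, while the first chunk is still scored with the untouched prior.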
Evaluation
sliding window eval
parameters: {"stride":64}
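A minimal sketch of how a stride-64 sliding-window evaluation can be laid out, assuming the common scheme in which each window re-reads earlier tokens as context but scores only the tokens not yet scored. The window length is an assumption (the page lists eval_length as null), and sliding_windows is a hypothetical helper, not a function from the PR.

```python
def sliding_windows(n_tokens, window, stride=64):
    """Return (context_start, score_start, end) spans covering each token once."""
    assert 0 < stride <= window, "stride must not exceed the window length"
    spans = []
    prev_end = 0   # first token that has not been scored yet
    begin = 0      # start of the current context window
    while prev_end < n_tokens:
        end = min(begin + window, n_tokens)
        # The model sees tokens [begin, end) but is scored only on [prev_end, end).
        spans.append((begin, prev_end, end))
        prev_end = end
        begin += stride
    return spans
```

After the first window, every scored token has at least window minus stride tokens of preceding context, which is the usual reason to prefer a small stride despite the extra compute.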
Sequence Length
sequence_length
train_length: null
eval_length: null
Regularization
logit softcap
parameters: null
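For reference, logit softcapping is commonly written as cap * tanh(logits / cap), which keeps every logit strictly within (-cap, cap) while leaving small values nearly unchanged. The sketch below assumes that form. The cap value of 30.0 is a placeholder, since the page lists no parameters for this entry.

```python
import math

def softcap(logits, cap=30.0):
    """Smoothly bound logits to (-cap, cap); cap=30.0 is a placeholder value."""
    return [cap * math.tanh(x / cap) for x in logits]
```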
Compression
lzma
level: null
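A small sketch of measuring an LZMA-compressed payload with Python's standard lzma module. The preset is left at the library default because the page lists the level as null, and compressed_size is a hypothetical helper. How the PR actually serializes and packs its artifact is not stated here.

```python
import lzma

def compressed_size(payload: bytes, preset=None) -> int:
    """Return the LZMA-compressed size in bytes, checking a lossless round trip."""
    blob = lzma.compress(payload, preset=preset)  # preset=None uses the default level
    assert lzma.decompress(blob) == payload       # compression must be lossless
    return len(blob)
```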

Novel Contributions

  • Independent 3-seed reproduction of PR #1797's stack
  • Achieved 1.06136 val_bpb mean with 0.00059 std
  • Reported >5σ improvement over PR #1797 baseline
  • Included Smear Gate and LQER Asymmetric on top of PR #1787 base
  • Documented Gram Newton-Schulz ablation and dropped it after regression
  • Used score-first phased TTT with sliding-window evaluation and CaseOps byte sidecar
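The >5σ claim can be sanity-checked with simple arithmetic, assuming σ refers to the reported 3-seed standard deviation of 0.00059 (the page does not say which σ was used). Under that assumption, a 5σ gap is 5 × 0.00059 = 0.00295 bpb, so the comparison value would need to exceed roughly 1.06431. The helper names below are hypothetical, and no baseline value is taken from the PR.

```python
def sigma_gap(baseline_bpb, mean_bpb=1.06136, std_bpb=0.00059):
    """Gap between a baseline and the reported mean, in units of the seed std."""
    return (baseline_bpb - mean_bpb) / std_bpb   # lower bpb is better

def min_baseline_for(n_sigma=5.0, mean_bpb=1.06136, std_bpb=0.00059):
    """Smallest baseline val_bpb that would sit n_sigma above the reported mean."""
    return mean_bpb + n_sigma * std_bpb
```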