PR #1906

open

Record: PR #1797 reproduction — val_bpb 1.06136 (3-seed mean)

by AayushBaniya2006

val_bpb

1.0614

Architecture

Transformer

Optimizer

—

Artifact Size

15,950,662 bytes

Training Techniques

Quantization

GPTQ

bits: 6

scope: model weights

mixed int2/int4

bits: null

scope: LQER factors
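
For orientation, here is a minimal sketch of what storing weights at 6 bits involves. It uses plain symmetric round-to-nearest quantization, not GPTQ itself: GPTQ additionally compensates rounding error using second-order statistics, which this sketch omits. The function names and the per-tensor scale are assumptions for illustration, not taken from the PR.

```python
def quantize_rtn(weights, bits=6):
    """Symmetric round-to-nearest quantization (illustrative, NOT GPTQ)."""
    qmax = 2 ** (bits - 1) - 1              # 31 for 6-bit signed codes
    # One scale for the whole tensor; fall back to 1.0 for all-zero input.
    scale = max(abs(w) for w in weights) / qmax or 1.0
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


weights = [-1.0, -0.37, 0.0, 0.42, 1.0]
codes, scale = quantize_rtn(weights, bits=6)
restored = dequantize(codes, scale)
```

Each restored weight differs from the original by at most half a quantization step (`scale / 2`).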

Architecture

SmearGate

SmearGate applied with a windowed gating mechanism (window of 12).

parameters: {"window":12}

Gated Attention

Quantized gated attention used in the stack.

parameters: null

depth recurrence

Three-layer depth recurrence / looped residual structure.

parameters: {"layers":3}
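
The general idea of depth recurrence can be sketched as reusing the same group of layers for several passes, so effective depth grows without adding parameters. This is a toy illustration only: the loop count, the residual form, and the name `run_with_recurrence` are assumptions, and the PR's actual looping scheme may differ.

```python
def run_with_recurrence(x, shared_layers, n_loops=2):
    """Apply the same layers n_loops times, each with a residual connection."""
    for _ in range(n_loops):
        for layer in shared_layers:
            x = x + layer(x)    # residual update; weights are shared across loops
    return x


# Toy stand-in for three layers: each one just scales its input.
layers = [lambda v: 0.5 * v] * 3
out = run_with_recurrence(1.0, layers, n_loops=2)
```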

Test-Time Training

score-first TTT

parameters: {"phases":3}
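
As a rough sketch of the score-first ordering: each chunk is scored with the weights as they stand before the model adapts on that chunk, so no token is predicted by a model that has already trained on it. The callables `score_fn` and `update_fn` and the chunking are hypothetical placeholders, and the three phases listed above are not modeled here.

```python
def score_first_ttt(chunks, score_fn, update_fn, state):
    """Score each chunk first, then adapt on it (never the other way round)."""
    total_loss = 0.0
    for chunk in chunks:
        total_loss += score_fn(state, chunk)   # uses weights that have not seen chunk
        state = update_fn(state, chunk)        # adaptation happens only afterwards
    return total_loss, state


# Toy stand-ins: the "model state" is just a count of tokens seen so far.
loss, final_state = score_first_ttt(
    chunks=[[1, 2], [3], [4, 5, 6]],
    score_fn=lambda state, chunk: float(state),
    update_fn=lambda state, chunk: state + len(chunk),
    state=0,
)
```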

Evaluation

sliding window eval

parameters: {"stride":64}
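
A minimal sketch of how a stride-64 sliding-window evaluation is commonly laid out: windows overlap, and each window only scores the tokens that no earlier window has scored, so every token is counted once with as much left context as fits. The window length of 128 below is an assumed placeholder (the eval length is not given here), and `bits_per_byte` shows the usual conversion from summed loss in nats to bits per byte.

```python
import math


def sliding_windows(n_tokens, seq_len, stride=64):
    """Yield (begin, end, score_from): the model reads tokens[begin:end],
    and only positions score_from..end-1 contribute to the loss."""
    assert 0 < stride <= seq_len
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + seq_len, n_tokens)
        yield begin, end, prev_end
        prev_end = end
        if end == n_tokens:
            break


def bits_per_byte(total_nll_nats, n_bytes):
    """Convert a summed negative log-likelihood (nats) to bits per byte."""
    return total_nll_nats / (math.log(2) * n_bytes)


windows = list(sliding_windows(n_tokens=200, seq_len=128, stride=64))
```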

Sequence Length

sequence_length

train_length: null

eval_length: null

Regularization

logit softcap

parameters: null
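
Logit softcapping is usually written as `cap * tanh(logit / cap)`, which leaves small logits nearly unchanged and smoothly bounds large ones. The sketch below assumes that form; the cap of 30.0 is an illustrative value, since no parameter is listed here.

```python
import math


def softcap(logits, cap=30.0):
    """Smoothly bound each logit to [-cap, cap] via cap * tanh(x / cap)."""
    return [cap * math.tanh(x / cap) for x in logits]


capped = softcap([-1000.0, 0.1, 1000.0])
```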

Compression

lzma

level: null
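
To illustrate how an artifact's compressed size might be measured, here is a small sketch using Python's standard `lzma` module. The preset of 9 and the helper names are assumptions, since the compression level is not listed here.

```python
import lzma


def pack(payload: bytes, preset: int = 9) -> bytes:
    """Compress raw bytes with LZMA (xz container) at the given preset."""
    return lzma.compress(payload, preset=preset)


def unpack(blob: bytes) -> bytes:
    """Recover the original bytes."""
    return lzma.decompress(blob)


payload = b"abc" * 1000          # stand-in for serialized model weights
blob = pack(payload)
size_in_bytes = len(blob)        # the figure that counts toward artifact size
```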

Independent 3-seed reproduction of PR #1797's stack
Achieved a 3-seed mean val_bpb of 1.06136 with a standard deviation of 0.00059
Reported >5σ improvement over PR #1797 baseline
Included SmearGate and LQER Asymmetric on top of the PR #1787 base
Documented Gram Newton-Schulz ablation and dropped it after regression
Used score-first phased TTT with sliding-window evaluation and CaseOps byte sidecar