PR #1914
Non-record: PR #1797 reproduction + EMBED_CLIP relax + 5 ablation studies
Status: open
by Fija
val_bpb
1.0612
Architecture
Transformer
Optimizer
SGD
Artifact Size
15,861,545 bytes
Training Techniques
Quantization
GPTQ (int6)
bits: 6
scope: MLP
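For concreteness, a minimal sketch of symmetric 6-bit weight quantization restricted to MLP matrices. GPTQ's Hessian-aware column rounding is omitted, and the per-row scale layout is an assumption; only the 6-bit grid and the MLP scope come from the record.

```python
import numpy as np

def quantize_int6(w: np.ndarray):
    """Symmetric per-row 6-bit quantization: integer levels in [-32, 31].
    GPTQ would refine the rounding with second-order statistics; this
    sketch shows only the int6 grid itself."""
    qmax = 2 ** (6 - 1) - 1                                          # 31
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True) / qmax,  # one scale
                       1e-12)                                        # per output row
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize_int6(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

# Hypothetical usage on one MLP projection matrix ("scope: MLP"):
w = np.random.randn(256, 1024).astype(np.float32)
q, s = quantize_int6(w)
print("max abs error:", np.abs(w - dequantize_int6(q, s)).max())
```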
Architecture
Gated Attention
Attention applies a learned scalar output gate per head, with the quant-gate option enabled.
parameters: {"num_heads":8,"num_kv_heads":4}
depth recurrence
Recurrent depth structure that loops layers 4-5, with a parallel-residual start layer.
parameters: {"loop_start":3,"loop_end":5,"parallel_start_layer":8}
weight tying
Not explicitly stated in the submission text.
parameters: null
Test-Time Training
score-first TTT
parameters: {"phases":3,"prefix_docs":2000}
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
Sequence Length
sequence_length
train_length: null
eval_length: 2048
Regularization
logit softcap
parameters: {"value":30}
Compression
brotli
level: 11
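A sketch of the codec comparison this PR relies on, using the brotli package and the stdlib lzma module; the artifact filename is hypothetical.

```python
import lzma
import brotli  # pip install brotli

def compare_codecs(blob: bytes) -> None:
    """Compare brotli q=11 against lzma on a serialized weight blob,
    mirroring the PR's observation that brotli wins on quantized weights."""
    b = brotli.compress(blob, quality=11)
    x = lzma.compress(blob, preset=9 | lzma.PRESET_EXTREME)
    print(f"raw {len(blob):>12,} B  brotli {len(b):>12,} B  lzma {len(x):>12,} B")

# Hypothetical usage on the quantized-weights artifact:
# compare_codecs(open("weights_int6.bin", "rb").read())
```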
Novel Contributions
- 3-seed reproduction of a PR #1797-style stack under the 16 MB cap with statistically equivalent performance (see the sketch after this list)
- EMBED_CLIP_SIGMAS relaxation from 14 to 20 to reduce artifact size and fit under cap without lrzip
- Five orthogonal ablations with first-principles explanations for why they do not transfer to this regime
- Empirical comparison showing brotli q=11 compresses PR #1797-style quantized weight blobs better than lzma
- Documentation of legality proofs against Issue #1017 conditions for the TTT pipeline
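One way to back the "statistically equivalent" claim is a Welch t-test over per-seed val_bpb; the seed values below are placeholders, not the PR's actual numbers.

```python
from scipy import stats

# Placeholder per-seed val_bpb values; the PR's real seed results and its
# exact equivalence criterion are not reproduced here.
baseline_bpb = [1.0609, 1.0615, 1.0611]      # hypothetical PR #1797-style runs
reproduction_bpb = [1.0612, 1.0610, 1.0616]  # hypothetical PR #1914 runs

t, p = stats.ttest_ind(baseline_bpb, reproduction_bpb, equal_var=False)
print(f"Welch t = {t:.3f}, p = {p:.3f}")  # large p: no detectable difference
```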