PR #1914
Non-record: PR #1797 reproduction + EMBED_CLIP relax + 5 ablation studies
Status: open
by Fija
val_bpb
1.0612
Architecture
Transformer
Optimizer
SGD
Artifact Size
15,861,545 bytes
Training Techniques
Quantization
GPTQ (int6)
bits: 6
scope: MLP
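For concreteness, a minimal sketch of symmetric 6-bit weight quantization restricted to MLP matrices. GPTQ's Hessian-aware column rounding is omitted, and the per-row scale layout is an assumption; only the 6-bit grid and the MLP scope come from the record.

```python
import numpy as np

def quantize_int6(w: np.ndarray):
    """Symmetric per-row 6-bit quantization: integer levels in [-32, 31].
    GPTQ would refine the rounding with second-order statistics; this
    sketch shows only the int6 grid itself."""
    qmax = 2 ** (6 - 1) - 1                                          # 31
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True) / qmax,  # one scale
                       1e-12)                                        # per output row
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize_int6(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

# Hypothetical usage on one MLP projection matrix ("scope: MLP"):
w = np.random.randn(256, 1024).astype(np.float32)
q, s = quantize_int6(w)
print("max abs error:", np.abs(w - dequantize_int6(q, s)).max())
```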
Architecture
Gated Attention
Attention applies a learned scalar output gate per head, with the quant-gate option enabled.
parameters: {"num_heads":8,"num_kv_heads":4}
depth recurrence
Recurrent depth structure that loops layers 4-5, with a parallel-residual start layer.
parameters: {"loop_start":3,"loop_end":5,"parallel_start_layer":8}
weight tying
Not explicitly stated in the submission text.
parameters: null
Test-Time Training
score-first TTT
parameters: {"phases":3,"prefix_docs":2000}
Evaluation
sliding window eval
parameters: {"stride":64,"context_length":2048}
Sequence Length
sequence_length
train_length: null
eval_length: 2048
Regularization
logit softcap
parameters: {"value":30}
Compression
brotli
level: 11
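A sketch of the codec comparison this PR relies on, using the brotli package and the stdlib lzma module; the artifact filename is hypothetical.

```python
import lzma
import brotli  # pip install brotli

def compare_codecs(blob: bytes) -> None:
    """Compare brotli q=11 against lzma on a serialized weight blob,
    mirroring the PR's observation that brotli wins on quantized weights."""
    b = brotli.compress(blob, quality=11)
    x = lzma.compress(blob, preset=9 | lzma.PRESET_EXTREME)
    print(f"raw {len(blob):>12,} B  brotli {len(b):>12,} B  lzma {len(x):>12,} B")

# Hypothetical usage on the quantized-weights artifact:
# compare_codecs(open("weights_int6.bin", "rb").read())
```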
Novel Contributions
- 3-seed reproduction of a PR #1797-style stack under the 16 MB cap with statistically equivalent performance (see the sketch after this list)
- EMBED_CLIP_SIGMAS relaxation from 14 to 20 to reduce artifact size and fit under cap without lrzip
- Five orthogonal ablations with first-principles explanations for why they do not transfer to this regime
- Empirical comparison showing brotli q=11 compresses PR #1797-style quantized weight blobs better than lzma
- Documentation of legality proofs against Issue #1017 conditions for the TTT pipeline
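One way to back the "statistically equivalent" claim is a Welch t-test over per-seed val_bpb; the seed values below are placeholders, not the PR's actual numbers.

```python
from scipy import stats

# Placeholder per-seed val_bpb values; the PR's real seed results and its
# exact equivalence criterion are not reproduced here.
baseline_bpb = [1.0609, 1.0615, 1.0611]      # hypothetical PR #1797-style runs
reproduction_bpb = [1.0612, 1.0610, 1.0616]  # hypothetical PR #1914 runs

t, p = stats.ttest_ind(baseline_bpb, reproduction_bpb, equal_var=False)
print(f"Welch t = {t:.3f}, p = {p:.3f}")  # large p: no detectable difference
```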