PR #1958 (open)

Record: PreQuantTTT + Sliding Window on PR #1855 stack, val_bpb=1.01355 (3-seed)

val_bpb: 1.01355
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15,912,740 bytes

Training Techniques

Architecture
SmearGate
Causal context mixer with a BOS-fixed mask on the embedding stream.
parameters: null
weight tying
Tied embedding/output parameter sharing, carried over from the base stack.
parameters: null
Quantization
GPTQ
bits: 6
scope: Q/K/V/O and MLP
GPTQ
bits: 7
scope: embeddings
mixed int6/int7
bits: null
scope: model weights
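The int6/int7 split above can be illustrated with a minimal round-to-nearest sketch. GPTQ itself additionally redistributes rounding error per group using second-order statistics; only the symmetric k-bit value grids are shown here, and the helper names are illustrative, not taken from the PR.

```python
# Illustrative symmetric k-bit quantize/dequantize (round-to-nearest).
# GPTQ proper also compensates rounding error with second-order
# information; this shows only the int6/int7 grids the record refers to.

def quantize(weights, bits):
    """Map floats onto a symmetric integer grid with 2**(bits-1)-1 levels per side."""
    qmax = 2 ** (bits - 1) - 1                      # 31 for int6, 63 for int7
    scale = (max(abs(w) for w in weights) / qmax) or 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer grid."""
    return [v * scale for v in q]
```

At 6 bits the grid has 63 signed levels per group, so the worst-case rounding error per weight is half the group scale; the 7-bit grid used for embeddings halves that again.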
Optimizer
AdamW
weight_decay: 0
momentum: null
other_params: {"lr":0.0005,"lr_min":0.00005,"federated_avg":true,"freeze_blocks":2,"freeze_embeddings":true,"grad_clip":1,"epochs":21}
Evaluation
sliding window eval
parameters: {"stride":64}
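The stride-64 setting presumably follows the usual sliding-window recipe: advance a fixed context window 64 tokens at a time and score only the tokens not yet scored, so each scored token sees close to a full window of left context. A sketch of the window arithmetic, using the record's eval_length of 1024 (the PR's exact implementation is assumed):

```python
def sliding_windows(n_tokens, window=1024, stride=64):
    """Return (begin, end, score_from) triples: score tokens [score_from, end)
    using context [begin, end); each token is scored exactly once."""
    out, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        out.append((begin, end, prev_end))   # only newly exposed tokens are scored
        prev_end = end
        if end == n_tokens:
            break
    return out
```

With window=1024 and stride=64, every token after the first window is scored with at least 960 tokens of left context, at the cost of roughly 16x more forward passes than a non-overlapping evaluation.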
Test-Time Training
full TTT
parameters: {"learning_rate":0.0005,"epochs":21,"freeze_blocks":2,"freeze_embeddings":true}
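A hypothetical sketch of how the freeze settings could select which parameters the TTT pass updates. The naming scheme (`embed.*`, `blocks.<i>.*`) and the choice to freeze the *first* N blocks are assumptions for illustration, not taken from the PR:

```python
def trainable_params(names, freeze_blocks=2, freeze_embeddings=True):
    """Return the parameter names left trainable under the TTT freeze settings."""
    keep = []
    for name in names:
        if freeze_embeddings and name.startswith("embed."):
            continue                          # embeddings stay frozen
        if name.startswith("blocks."):
            block_idx = int(name.split(".")[1])
            if block_idx < freeze_blocks:     # assumed: the first N blocks freeze
                continue
        keep.append(name)
    return keep
```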
Compression
lrzip
level: 9
Sequence Length
sequence_length
train_length: null
eval_length: 1024
LR Schedule
cosine decay
parameters: {"start_lr":0.0005,"end_lr":0.00005,"epochs":21}
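These parameters correspond to standard cosine annealing from 5e-4 down to 5e-5 over 21 epochs. A sketch of the schedule (whether the PR decays per epoch or per step is assumed here):

```python
import math

# Standard cosine annealing between the quoted start and end learning rates.
def cosine_lr(epoch, start_lr=5e-4, end_lr=5e-5, epochs=21):
    t = epoch / (epochs - 1)              # 0 at the first epoch, 1 at the last
    return end_lr + 0.5 * (start_lr - end_lr) * (1 + math.cos(math.pi * t))
```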
Regularization
logit softcap
parameters: {"value":15}
weight decay
parameters: {"value":0.5}
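The logit softcap above smoothly bounds logits to (-cap, cap) via cap * tanh(logit / cap), here with cap = 15. A minimal sketch, assuming it is applied element-wise to the final logits as in Gemma-2-style models (where in the stack the PR applies it is not stated):

```python
import math

def softcap(logit, cap=15.0):
    """Smoothly bound a logit to (-cap, cap); near-identity for small values."""
    return cap * math.tanh(logit / cap)
```

Small logits pass through almost unchanged, while large outliers saturate just below the cap, bounding the influence of extreme logits on the loss.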

Novel Contributions

  • Stacks pre-quantization test-time training on top of the PR #1855 SOTA base.
  • Uses sliding-window evaluation with stride 64 to improve the final quantized score.
  • Applies per-group lrzip compression to fit the artifact under the 16 MB cap.
  • Reports a new 3-seed record val_bpb of 1.01355 on track_10min_16mb.
  • Demonstrates legality-compliant pre-quant TTT by training on validation tokens only after they have already been graded.