PR #1958 (open)

Record: PreQuantTTT + Sliding Window on PR #1855 stack, val_bpb=1.01355 (3-seed)

val_bpb: 1.01355
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15,912,740 bytes

Training Techniques

Architecture
SmearGate
Causal context mixer with a BOS-fixed mask on the embedding stream.
parameters: null
weight tying
Tied embedding/output parameter sharing, carried over from the base stack.
parameters: null
Quantization
GPTQ
bits: 6
scope: Q/K/V/O and MLP
GPTQ
bits: 7
scope: embeddings
mixed int6/int7
bits: null
scope: model weights
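The int6/int7 split above can be illustrated with a minimal round-to-nearest sketch. GPTQ itself additionally redistributes rounding error per group using second-order statistics; only the symmetric k-bit value grids are shown here, and the helper names are illustrative, not taken from the PR.

```python
# Illustrative symmetric k-bit quantize/dequantize (round-to-nearest).
# GPTQ proper also compensates rounding error with second-order
# information; this shows only the int6/int7 grids the record refers to.

def quantize(weights, bits):
    """Map floats onto a symmetric integer grid with 2**(bits-1)-1 levels per side."""
    qmax = 2 ** (bits - 1) - 1                      # 31 for int6, 63 for int7
    scale = (max(abs(w) for w in weights) / qmax) or 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer grid."""
    return [v * scale for v in q]
```

At 6 bits the grid has 63 signed levels per group, so the worst-case rounding error per weight is half the group scale; the 7-bit grid used for embeddings halves that again.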
Optimizer
AdamW
weight_decay: 0
momentum: null
other_params: {"lr":0.0005,"lr_min":0.00005,"federated_avg":true,"freeze_blocks":2,"freeze_embeddings":true,"grad_clip":1,"epochs":21}
Evaluation
sliding window eval
parameters: {"stride":64}
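The stride-64 setting presumably follows the usual sliding-window recipe: advance a fixed context window 64 tokens at a time and score only the tokens not yet scored, so each scored token sees close to a full window of left context. A sketch of the window arithmetic, using the record's eval_length of 1024 (the PR's exact implementation is assumed):

```python
def sliding_windows(n_tokens, window=1024, stride=64):
    """Return (begin, end, score_from) triples: score tokens [score_from, end)
    using context [begin, end); each token is scored exactly once."""
    out, prev_end = [], 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        out.append((begin, end, prev_end))   # only newly exposed tokens are scored
        prev_end = end
        if end == n_tokens:
            break
    return out
```

With window=1024 and stride=64, every token after the first window is scored with at least 960 tokens of left context, at the cost of roughly 16x more forward passes than a non-overlapping evaluation.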
Test-Time Training
full TTT
parameters: {"learning_rate":0.0005,"epochs":21,"freeze_blocks":2,"freeze_embeddings":true}
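A hypothetical sketch of how the freeze settings could select which parameters the TTT pass updates. The naming scheme (`embed.*`, `blocks.<i>.*`) and the choice to freeze the *first* N blocks are assumptions for illustration, not taken from the PR:

```python
def trainable_params(names, freeze_blocks=2, freeze_embeddings=True):
    """Return the parameter names left trainable under the TTT freeze settings."""
    keep = []
    for name in names:
        if freeze_embeddings and name.startswith("embed."):
            continue                          # embeddings stay frozen
        if name.startswith("blocks."):
            block_idx = int(name.split(".")[1])
            if block_idx < freeze_blocks:     # assumed: the first N blocks freeze
                continue
        keep.append(name)
    return keep
```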
Compression
lrzip
level: 9
Sequence Length
sequence_length
train_length: null
eval_length: 1024
LR Schedule
cosine decay
parameters: {"start_lr":0.0005,"end_lr":0.00005,"epochs":21}
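These parameters correspond to standard cosine annealing from 5e-4 down to 5e-5 over 21 epochs. A sketch of the schedule (whether the PR decays per epoch or per step is assumed here):

```python
import math

# Standard cosine annealing between the quoted start and end learning rates.
def cosine_lr(epoch, start_lr=5e-4, end_lr=5e-5, epochs=21):
    t = epoch / (epochs - 1)              # 0 at the first epoch, 1 at the last
    return end_lr + 0.5 * (start_lr - end_lr) * (1 + math.cos(math.pi * t))
```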
Regularization
logit softcap
parameters: {"value":15}
weight decay
parameters: {"value":0.5}
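The logit softcap above smoothly bounds logits to (-cap, cap) via cap * tanh(logit / cap), here with cap = 15. A minimal sketch, assuming it is applied element-wise to the final logits as in Gemma-2-style models (where in the stack the PR applies it is not stated):

```python
import math

def softcap(logit, cap=15.0):
    """Smoothly bound a logit to (-cap, cap); near-identity for small values."""
    return cap * math.tanh(logit / cap)
```

Small logits pass through almost unchanged, while large outliers saturate just below the cap, bounding the influence of extreme logits on the loss.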

Novel Contributions

  • Stacks pre-quantization test-time training on top of the PR #1855 SOTA base.
  • Uses sliding-window evaluation with stride 64 to improve the final quantized score.
  • Applies per-group lrzip compression to fit the artifact under the 16 MB cap.
  • Reports a new 3-seed record val_bpb of 1.01355 on track_10min_16mb.
  • Demonstrates legality-compliant pre-quant TTT by training on validation tokens only after they have already been graded.