PR #589 (closed)
Record: Late Soft-Round QAT + Score-First Backward-Looking TTT — val_bpb 1.1178
by RoyiRaView on GitHub
val_bpb: 1.1178
Architecture: Transformer
Optimizer: SGD
Artifact Size: ~15.75 MB
Training Techniques
Quantization
QAT
bits: 6
scope: all
Architecture
MLP3x
Three-layer MLP stack with a squared LeakyReLU activation, (LeakyReLU(0.5)(x))^2, i.e. negative slope 0.5 followed by squaring.
parameters: {"layers":3}
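Read literally, the activation is LeakyReLU with negative slope 0.5 followed by an elementwise square. A minimal stdlib sketch of that reading (function names are mine, not from the PR):

```python
def leaky_relu(x, slope=0.5):
    # LeakyReLU with negative slope 0.5
    return x if x >= 0.0 else slope * x

def mlp3x_act(x, slope=0.5):
    # squared LeakyReLU: (LeakyReLU_0.5(x))^2
    y = leaky_relu(x, slope)
    return y * y
```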
BigramHash
BigramHash component used in the model stack.
parameters: {"size":3072}
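The PR does not describe BigramHash's internals; a plausible sketch, assuming it hashes the (previous, current) token pair into a 3072-entry embedding table (the hash constant is arbitrary and purely illustrative):

```python
TABLE_SIZE = 3072

def bigram_bucket(prev_tok, cur_tok, size=TABLE_SIZE):
    # Multiplicative hash of the token bigram into [0, size).
    # The actual hash used in the PR is not specified.
    return (prev_tok * 1000003 + cur_tok) % size
```

The bucket index would select a learned embedding that is added to the model's input stream.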
XSA
XSA applied to the last 4 layers.
parameters: {"layers":4}
RoPE
Partial rotary positional embeddings.
parameters: {"dimensions":16,"total_dimensions":64}
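With dimensions=16 of total_dimensions=64, the rotation touches only the first 16 dims of each head vector and passes the remaining 48 through unchanged. A stdlib-only sketch (the pairing convention of adjacent dims is an assumption):

```python
import math

def partial_rope(vec, pos, rot_dims=16, base=10000.0):
    # Rotate the first `rot_dims` dims in adjacent pairs by
    # position-dependent angles; pass the remaining dims through.
    out = list(vec)
    for i in range(rot_dims // 2):
        theta = pos / (base ** (2 * i / rot_dims))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * c - y * s
        out[2 * i + 1] = x * s + y * c
    return out
```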
Regularization
layerwise LN scale
parameters: {"scale":"1/sqrt(layer+1)"}
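The 1/sqrt(layer+1) rule shrinks each block's LayerNorm output as depth grows, e.g. layer 0 gets scale 1.0 and layer 3 gets 0.5 (0-indexing assumed):

```python
import math

def ln_scale(layer_idx):
    # layerwise LayerNorm output scale: 1/sqrt(layer+1)
    return 1.0 / math.sqrt(layer_idx + 1)
```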
Weight Averaging
EMA
parameters: {"decay":0.997}
SWA
parameters: {"frequency":50,"description":"tight SWA every 50 steps"}
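Both averages can be kept as shadow copies of the weights: the EMA updated every step with decay 0.997, and SWA taking an equal-weight snapshot every 50 steps. A minimal sketch over a flat parameter dict (the real version would operate on tensors):

```python
def ema_update(shadow, params, decay=0.997):
    # exponential moving average, updated every step
    for k, p in params.items():
        shadow[k] = decay * shadow[k] + (1.0 - decay) * p

class SWA:
    # equal-weight running average, snapshotted every `frequency` steps
    def __init__(self, frequency=50):
        self.frequency = frequency
        self.count = 0
        self.avg = {}

    def maybe_snapshot(self, step, params):
        if step == 0 or step % self.frequency != 0:
            return
        self.count += 1
        for k, p in params.items():
            prev = self.avg.get(k, 0.0)
            self.avg[k] = prev + (p - prev) / self.count
```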
Compression
zstd
level: null
Evaluation
sliding window eval
parameters: {"stride":64,"seq_len":2048}
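With seq_len 2048 and stride 64, the usual strided scheme scores all positions of the first window and only the 64 new positions of each later window, the rest serving as context. A sketch (assuming that standard scheme) that computes the (window_start, score_start, score_end) spans:

```python
def sliding_eval_spans(n_tokens, seq_len=2048, stride=64):
    # Tile [0, n_tokens) so every token's loss is counted exactly once:
    # the first window scores its full length, later windows only the
    # tokens not yet scored.
    spans = []  # (window_start, score_start, score_end)
    prev_end = 0
    for ws in range(0, max(1, n_tokens - seq_len + stride), stride):
        we = min(ws + seq_len, n_tokens)
        spans.append((ws, prev_end, we))
        prev_end = we
        if we == n_tokens:
            break
    return spans
```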
Test-Time Training
score-first TTT
parameters: {"chunk_size":32768,"optimizer":"SGD","learning_rate":0.002,"momentum":0.9,"epochs":3,"grad_clip":1,"frozen_blocks":null}
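"Score-first" means each chunk contributes to the eval score with the weights as they stand, and only afterwards is trained on (3 epochs of SGD, lr 0.002, momentum 0.9, grad clip 1), so no chunk is ever scored by a model that has already seen it. A toy scalar illustration of the loop (the real version adapts transformer weights on 32768-token chunks):

```python
def score_first_ttt(chunks, w=0.0, lr=0.002, momentum=0.9, epochs=3, clip=1.0):
    # toy model: scalar w predicting each value; loss = (w - x)^2
    scores, v = [], 0.0
    for chunk in chunks:
        # 1) score the chunk BEFORE training on it
        scores.append(sum((w - x) ** 2 for x in chunk) / len(chunk))
        # 2) then adapt on that same chunk
        for _ in range(epochs):
            g = sum(2.0 * (w - x) for x in chunk) / len(chunk)
            g = max(-clip, min(clip, g))   # grad clipping
            v = momentum * v + g           # SGD momentum buffer
            w -= lr * v
    return scores, w

scores, w = score_first_ttt([[1.0] * 4, [1.0] * 4])
```

Because scoring precedes adaptation, the second chunk's score already benefits from training on the first.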
Optimizer
SGD
weight_decay: null
momentum: 0.9
other_params: {"learning_rate":0.002,"cosine_decay":true,"grad_clip":1}
LR Schedule
cosine decay
parameters: {"learning_rate":0.002,"applied_to":"TTT across chunks"}
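Applied across TTT chunks, the learning rate starts at 0.002 for the first chunk and follows a cosine down over the chunk sequence (decay to zero is an assumption; the PR does not state the floor):

```python
import math

def cosine_lr(chunk_idx, n_chunks, base_lr=0.002):
    # cosine decay from base_lr (first chunk) to 0 (last chunk)
    t = chunk_idx / max(1, n_chunks - 1)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))
```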
Other
other
Late soft-round QAT: the forward pass keeps hard quantization, while the backward pass uses a temperature-controlled, sigmoid-interpolated soft-round surrogate to supply gradients.
parameters: {"tau":0.1,"warmdown_scale_threshold":0.02}
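A sketch of one plausible reading: the forward pass rounds hard to 6-bit levels, while the backward pass differentiates a sigmoid-interpolated soft round at temperature tau=0.1. The exact parameterization is my assumption, as is the reading that warmdown_scale_threshold=0.02 gates when the "late" phase turns on:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_round(x, tau=0.1):
    # sigmoid-interpolated surrogate for round(); as tau -> 0 it
    # approaches hard rounding (one assumed parameterization)
    f = math.floor(x)
    return f + sigmoid((x - f - 0.5) / tau)

def soft_round_grad(x, tau=0.1):
    # derivative of soft_round: peaks at bin boundaries (frac = 0.5),
    # giving bin-aware gradients near quantization edges
    f = math.floor(x)
    s = sigmoid((x - f - 0.5) / tau)
    return s * (1.0 - s) / tau

def qat_forward_backward(w, scale, tau=0.1, bits=6):
    # forward: hard 6-bit quantization of w/scale
    qmax = 2 ** (bits - 1) - 1
    q = max(-qmax - 1, min(qmax, round(w / scale)))
    w_hat = q * scale
    # backward: surrogate gradient d(w_hat)/d(w) from the soft round
    grad = soft_round_grad(w / scale, tau)
    return w_hat, grad
```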
Novel Contributions
- Late Soft-Round QAT
- Score-First Backward-Looking TTT
- Temperature-controlled soft-round surrogate for bin-aware gradients near quantization boundaries
- Backward-looking chunk-wise test-time training where each chunk is scored before being trained on