PR #1929

Status: open

Record: SP8192 + SLOT scored-position + cross-batch EMA warmup: val_bpb=0.94569

by davie2009kh
val_bpb: 0.9457
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.87 MB

Training Techniques

Quantization
  • GPTQ (bits: 6, scope: model weights)
Architecture
  • Depth recurrence: 3-layer depth recurrence stack ({"layers": 3})
  • Parallel residuals: parallel residual connections in upper layers
  • QK-Gain: scaling applied to attention ({"gain": 5.25})
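The PR does not show how QK-Gain is wired in. One plausible reading, sketched below with assumed shapes and placement: unit-normalize queries and keys per position, then scale their dot products by the fixed gain of 5.25 so attention-logit magnitude is set by the gain rather than by q/k norms. This is an illustration, not the submission's code.

```python
import numpy as np

def qk_gain_attention(q, k, v, gain=5.25):
    """Single-head causal attention with QK-Gain scaling (assumed form):
    queries and keys are unit-normalized, and their dot products are
    multiplied by a fixed scalar gain before the softmax.
    q, k, v: (T, d) arrays for one head."""
    qn = q / np.linalg.norm(q, axis=-1, keepdims=True)
    kn = k / np.linalg.norm(k, axis=-1, keepdims=True)
    logits = gain * (qn @ kn.T)
    # Causal mask: position t may attend only to positions <= t.
    T = q.shape[0]
    mask = np.triu(np.ones((T, T), dtype=bool), 1)
    logits = np.where(mask, -np.inf, logits)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v
```

Because cosine similarity is bounded in [-1, 1], the gain directly caps the logit range at ±5.25 regardless of hidden-state scale.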
Weight Averaging
  • EMA ({"decay": 0.9965})
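With decay 0.9965, the averaged weights track the live weights slowly (roughly a 1/(1-0.9965) ≈ 286-step horizon). A minimal sketch of the per-step update, using lists of floats as a stand-in for parameter tensors:

```python
def ema_update(shadow, params, decay=0.9965):
    """One EMA step over parameter values (elementwise):
    shadow <- decay * shadow + (1 - decay) * params.
    The shadow (averaged) weights are what gets evaluated."""
    return [decay * s + (1 - decay) * p for s, p in zip(shadow, params)]
```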
Optimizer
  • Muon ({"Huber weight decay": true})
  • AdamW ({"steps": 24, "learning_rate_start": 0.008, "learning_rate_end": 0.0008})
Test-Time Training
  • score-first TTT ({"per_sample_delta": true, "per_sample_logit_bias": true, "cross_batch_ema_warmup": true, "warmup_decay": 0.5, "scored_positions_only": true})
Evaluation
  • sliding window eval ({"stride": 64})
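With a window of 8192 and stride 64, each evaluation window advances 64 tokens and only the newly covered tokens are scored, so every token is scored exactly once with near-maximal context. A sketch of the span bookkeeping (function name and tuple layout are illustrative):

```python
def sliding_window_spans(n_tokens, window=8192, stride=64):
    """Spans for sliding-window eval. Each tuple is
    (begin, end, score_from): the model sees tokens [begin, end),
    and only tokens in [score_from, end) contribute to val_bpb."""
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```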
LR Schedule
  • cosine decay ({"start": 0.008, "end": 0.0008})
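The cosine decay from 0.008 to 0.0008 over the 24 SLOT steps can be sketched as below; whether the schedule normalizes progress by `steps` or `steps - 1` is an assumption of this sketch.

```python
import math

def cosine_lr(step, total_steps=24, lr_start=0.008, lr_end=0.0008):
    """Cosine decay from lr_start (step 0) to lr_end (last step)."""
    t = step / max(1, total_steps - 1)  # progress in [0, 1]
    return lr_end + 0.5 * (lr_start - lr_end) * (1 + math.cos(math.pi * t))
```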
Sequence Length
  • train_length: 8192, eval_length: 8192
Compression
  • brotli

Novel Contributions

  • Scored-position SLOT applied only to scored past tokens during evaluation
  • Per-sample delta and logit_bias optimization in fp32
  • Cross-batch EMA warmup that carries converged delta/logit_bias means to the next batch
  • AdamW-based SLOT optimization with a 24-step cosine schedule
  • SLOT restricted to eval_val_sliding on the quantized model without changing training