PR #1929
openRecord: SP8192 + SLOT scored-position + cross-batch EMA warmup: val_bpb=0.94569
by davie2009kh
val_bpb: 0.9457
Architecture: Transformer
Optimizer: AdamW
Artifact Size: 15.87 MB
Training Techniques
Quantization
GPTQ
bits: 6
scope: model weights
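A minimal sketch of what a 6-bit weight grid looks like, assuming symmetric per-group round-to-nearest; GPTQ itself additionally redistributes rounding error across columns using second-order (Hessian) statistics, which is omitted here, and `group_size` is an illustrative choice, not from this PR:

```python
import torch

def quantize_6bit_rtn(w: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    """Symmetric per-group 6-bit round-to-nearest quantization (sketch).
    GPTQ proper also compensates rounding error with Hessian information;
    only the 6-bit grid itself is shown here."""
    qmax = 2 ** (6 - 1) - 1                     # signed 6-bit grid: [-32, 31]
    orig_shape = w.shape
    g = w.reshape(-1, group_size)               # assumes numel % group_size == 0
    scale = g.abs().amax(dim=1, keepdim=True) / qmax
    q = (g / scale).round().clamp_(-qmax - 1, qmax)
    return (q * scale).reshape(orig_shape)      # dequantized fake-quant weights
```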
Architecture
depth recurrence
3-layer depth recurrence stack
parameters: {"layers":3}
parallel residuals
Parallel residual connections in upper layers
parameters: null
QK-Gain
QK-Gain scaling applied to attention
parameters: {"gain":5.25}
Weight Averaging
EMA
parameters: {"decay":0.9965}
Optimizer
Muon
weight_decay: null
momentum: null
other_params: {"Huber weight decay":true}
AdamW
weight_decay: null
momentum: null
other_params: {"steps":24,"learning_rate_start":0.008,"learning_rate_end":0.0008}
Test-Time Training
score-first TTT
parameters: {"per_sample_delta":true,"per_sample_logit_bias":true,"cross_batch_ema_warmup":true,"warmup_decay":0.5,"scored_positions_only":true}
Evaluation
sliding window eval
parameters: {"stride":64}
LR Schedule
cosine decay
parameters: {"start":0.008,"end":0.0008}
Sequence Length
sequence_length
train_length: 8192
eval_length: 8192
Compression
brotli
level: null
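A sketch of the artifact packaging step using the Python `brotli` package; the card leaves the quality level unspecified (null), so `quality=11` below is an assumption, and the paths are placeholders:

```python
import brotli

def compress_artifact(path_in: str, path_out: str) -> float:
    """Brotli-compress the serialized weight artifact; returns the ratio."""
    with open(path_in, "rb") as f:
        data = f.read()
    blob = brotli.compress(data, quality=11)   # quality is an assumption
    with open(path_out, "wb") as f:
        f.write(blob)
    return len(blob) / len(data)
```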
Novel Contributions
- Scored-position SLOT applied only to scored past tokens during evaluation
- Per-sample delta and logit_bias optimization in fp32
- Cross-batch EMA warmup that carries converged delta/logit_bias means to the next batch (sketched after this list)
- AdamW-based SLOT optimization with a 24-step cosine schedule
- SLOT restricted to eval_val_sliding on the quantized model without changing training
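One plausible reading of the cross-batch EMA warmup, building on the `slot_eval_step` sketch above: after each batch converges, fold the per-sample means of delta and logit_bias into running EMAs (warmup_decay=0.5 from the card) and warm-start the next batch from them. The loader and variable names are hypothetical:

```python
ema_delta, ema_bias, warmup_decay = None, None, 0.5
for hidden, targets, scored_mask in eval_batches:        # hypothetical loader
    delta, bias = slot_eval_step(lm_head, hidden, targets, scored_mask,
                                 warm_delta=ema_delta, warm_bias=ema_bias)
    d_mean = delta.mean(dim=0, keepdim=True)             # (1, 1, D)
    b_mean = bias.mean(dim=0, keepdim=True)              # (1, 1, V)
    ema_delta = (d_mean if ema_delta is None
                 else warmup_decay * ema_delta + (1 - warmup_decay) * d_mean)
    ema_bias = (b_mean if ema_bias is None
                else warmup_decay * ema_bias + (1 - warmup_decay) * b_mean)
```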